Cisco Systems - Authorea

Cryptanalysis of ring-LWE based key exchange with key share reuse

Scott Fluhrer

January 27, 2016

ABSTRACT

This paper shows how several ring-LWE based key exchange protocols can be broken, under the assumption that the same key share is used for multiple exchanges. This indicates that, if these key exchange protocols are used, it will be necessary for a fresh key share to be generated for each exchange, and that these key exchange protocols cannot be used as a drop-in replacement for designs which use Diffie-Hellman static key shares.

INTRODUCTION

Key agreement protocols are one of the oldest public key primitives known, dating back to the Diffie-Hellman protocol. In a key agreement protocol, each side selects a private value and exchanges a public value (which is often called a key share); in most such protocols, each side sends a single message. Then, both sides perform computations based on their private value and the other side’s public key share, and derive the same secret value. The security goal is that someone listening in on the exchanged public key shares would find it infeasible to derive that secret value.

The Diffie-Hellman protocol (and the similar Elliptic Curve Diffie-Hellman protocol) would be vulnerable to a quantum computer; someone with a large quantum computer could rederive the private values and thus obtain the common secret value. One research topic is to find alternatives that would be secure in that environment. Several such proposed alternatives are based on the ring-LWE problem. From a protocol standpoint, these proposals work largely like Diffie-Hellman: each side selects private values, one side sends its public value, the other side replies, and then they both compute a shared secret.

With Diffie-Hellman, it is perfectly safe to reuse the same public key share for multiple exchanges. One such use is the “ephemeral-static” mode; in this case, Alice might select a private value, and publish the corresponding key share.
Then, when Bob, Carol, Dave and Eve want to communicate with Alice, they can take Alice’s key share, select their own private values, and then send their key shares to Alice, thus creating a secure connection. As long as Alice takes some well-known precautions, the connections are independent; Eve gets no advantage in deriving the secret used in the Alice to Bob connection.

This paper will show that ring-LWE based key agreement protocols are not safe in this mode; if Alice uses the same private value repeatedly, then Eve (by sending a series of messages, and seeing how Alice responds) is able to recover Alice’s private value (and if Alice used that same static value to generate the keys used to communicate with Bob, Eve can then listen in on that conversation).

RING-LWE KEY EXCHANGE

The ring-LWE problem works in the ring ℤ[x]/(x^N + 1, p), for integers N and p. A member of this ring can be viewed as consisting of N integers (coefficients), each in [0, p). It has been proposed as the basis for a number of cryptographic systems. These include a number of key agreement protocols. While these protocols differ in the details, they all follow the same basic paradigm.

- Alice and Bob agree on an element A; it may be a global parameter, or it may be based on a seed provided by Alice.
- Alice selects “small” elements S and E; the value S is Alice’s private secret (Alice doesn’t actually need to remember the value of E).
- Alice computes the value B = AS + E (where the multiplication and addition are done using the ring primitives); this value is Alice’s public key share, which she sends to Bob.
- Bob also selects small elements S′ and E′; he computes the value U = AS′ + E′ and the value V′ = BS′.
- Bob then uses V′ to compute an error-reconciliation vector C; he sends U, C to Alice.
- Alice computes the value V = US.
- Both sides then use the error-reconciliation vector C to convert their V, V′ into a shared secret, converting each coefficient of V, V′ into one bit.
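
The message flow in the list above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the parameters `N = 16` and `P = 12289`, the uniform sampling of small coefficients from [-2, 2], and all function names are hypothetical choices made for this sketch and are not taken from any of the protocols discussed. It omits the reconciliation step and simply shows that Alice's V and Bob's V′ end up close to each other.

```python
import random

N, P = 16, 12289  # toy parameters (assumed for illustration only)

def ring_mul(f, g):
    """Multiply in Z_P[x]/(x^N + 1): a term reaching x^N wraps around as -1."""
    out = [0] * N
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            k = i + j
            if k < N:
                out[k] = (out[k] + fi * gj) % P
            else:
                out[k - N] = (out[k - N] - fi * gj) % P
    return out

def ring_add(f, g):
    return [(a + b) % P for a, b in zip(f, g)]

def small(rng):
    """A 'small' element: every coefficient in [-2, 2] (assumed distribution)."""
    return [rng.randint(-2, 2) % P for _ in range(N)]

def centered(c):
    """Map a coefficient to its representative in (-P/2, P/2]."""
    c %= P
    return c - P if c > P // 2 else c

def exchange(seed=1):
    rng = random.Random(seed)
    A = [rng.randrange(P) for _ in range(N)]   # agreed public element
    S, E = small(rng), small(rng)              # Alice's secret and error
    B = ring_add(ring_mul(A, S), E)            # Alice's key share
    S2, E2 = small(rng), small(rng)            # Bob's S' and E'
    U = ring_add(ring_mul(A, S2), E2)          # Bob's key share
    V_bob = ring_mul(B, S2)                    # V' = B S'
    V_alice = ring_mul(U, S)                   # V  = U S
    return V_alice, V_bob

V_alice, V_bob = exchange()
# Largest coefficient-wise distance between V and V' (equals |S E' - S' E|).
gap = max(abs(centered(a - b)) for a, b in zip(V_alice, V_bob))
```

With these toy sizes every coefficient of SE′ − S′E is bounded by 2·16·4 = 128, well below p/8, which is the margin the reconciliation step relies on.
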
Some versions of the key agreement add additional error vectors at some places; as the attack can be modified to account for this, we will ignore it for now.

The idea behind this protocol is that Alice computes V = SS′A + SE′, while Bob computes V′ = SS′A + S′E. They differ by SE′ − S′E; as S, S′, E, E′ are small elements, this difference is (with high probability) also small (in that each coefficient is close to 0), and so each element of V is “close” to the corresponding element of V′.

Of course, while they are close, they aren’t identical. If we used any fixed mapping of the coefficients of V, V′ to bits of the shared secret, there would be some probability that an element of V would fall on one side of a boundary while the same element of V′ fell on the other. That’s where the error-reconciliation vector comes in. We logically split up the possible coefficient values into four quadrants: quadrant I ([0, p/4)), quadrant II ([p/4, p/2)), quadrant III ([p/2, 3p/4)), and quadrant IV ([3p/4, p − 1]). There is a bit within the error-reconciliation vector that corresponds to each coefficient; it determines whether quadrants I and II are considered the same (e.g. values there may map to a 0 bit, while values in quadrants III and IV would map to a 1 bit), or whether quadrants I and IV are considered the same (e.g. values there may map to a 0 bit, while values in quadrants II and III would map to a 1 bit). The intention is that Bob would, for each coefficient value of V′, determine the value of the error-reconciliation bit that gives the largest possible leeway for errors. As long as the absolute value of every coefficient of SE′ − S′E is no more than p/8, it will always be possible for Alice and Bob to agree on the shared secret. By choosing the size of the small values for S, S′, E, E′, we can make this happen with extremely high probability.
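
The quadrant rule just described can be sketched as follows. The function names and the encoding of the reconciliation bit (0 for "quadrants I and II together", 1 for "quadrants I and IV together") are assumptions made for this sketch; the individual protocols encode the hint in their own ways.

```python
def choose_hint(v, p):
    """Bob's side: pick the reconciliation bit that leaves the most leeway.

    hint 0 -> boundaries at 0 and p/2   (quadrants I+II vs III+IV)
    hint 1 -> boundaries at p/4 and 3p/4 (quadrants I+IV vs II+III)
    """
    v %= p
    eighth = (8 * v) // p          # which eighth of [0, p) holds v: 0..7
    # In eighths 1, 2, 5, 6 the value is at least p/8 away from 0 and p/2.
    return 0 if eighth in (1, 2, 5, 6) else 1

def extract_bit(v, hint, p):
    """Either side: turn a coefficient into a key bit using the hint."""
    v %= p
    if hint == 0:
        return 0 if 2 * v < p else 1
    return 0 if (4 * v < p or 4 * v >= 3 * p) else 1

# Example: Bob holds v_bob, Alice holds a value that is off by a small error.
p = 12289                          # assumed toy modulus
v_bob = 3000
hint = choose_hint(v_bob, p)
bit_bob = extract_bit(v_bob, hint, p)
bit_alice = extract_bit(v_bob + 1500, hint, p)   # error below p/8
```

Because the hint is chosen from Bob's value alone, an error of up to roughly p/8 on Alice's side cannot move her coefficient across one of the two boundaries selected by the hint.
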
THE ATTACK

Here is the scenario that we will assume for the attack:

- Alice uses a ring-LWE key exchange protocol to establish secure connections.
- Alice uses the same key share to communicate with both Bob and the attacker Eve.
- Eve’s goal is to recover the value S that corresponds to Alice’s public key share (and thus be able to decrypt Alice’s traffic).
- Eve can perform the ring-LWE exchange protocol with Alice multiple times (with Eve providing a fresh key share each time).
- Each time after Alice and Eve have performed the key exchange protocol, Alice will derive her shared secret; Eve will then be able to generate one guess at Alice’s shared secret, and Alice will indicate whether that guess matches what she has or not.

This last step can be implemented in practice if the ring-LWE key exchange is used to generate symmetric keys that Alice and Eve would use to communicate. What Eve can do is generate her symmetric keys based on her guess; if Alice is able to decrypt (and respond) based on those keys, then (with high probability) Eve’s guess of the shared secret was correct; if Alice rejects the exchange, then Eve’s guess was not correct.

Basic Oracle Query

To make a query, Eve mostly follows the protocol; she selects small S′, E′ values (albeit not randomly), computes V′, and generates the error-reconciliation vector C honestly, except for one location. She deliberately selects S′, E′ so that coefficient 0 of Alice’s computation of US is near 0 (that is, near the boundary of quadrants I and IV; actually any of the quadrant boundaries could be used). For coefficient 0, Eve sets the error-reconciliation bit to indicate that values in the range [0, p/2) are mapped to one bit value, while values in [p/2, p − 1] are mapped to the other. As Eve is able to compute all the other bits of the shared secret correctly (as she is performing the rest of the protocol honestly), this gives her a way to test the sign of the value of that intermediate coefficient.
In particular, if we select an S′, and call δ = (SS′A)[0] (where the notation F[i] denotes the i-th coefficient of the ring element F), and we use E′ = X^(−i) (for some integer i, where X stands for the ring element with a 1 in coefficient 1, and 0 elsewhere), then a key share of (jS′A + kE′) (for small integers j, k) would allow us to determine the sign of (jSS′A + kS ⋅ X^(−i))[0] = δj + k ⋅ S[i].

The actual attack

The first step in the attack is for Eve to find a lightweight value S′ where (SS′A)[0] = ±1. She can do this by searching for values S′ in which at most three coefficients are 1 or −1 and the rest are 0, and for which (S′B)[0] is a small value; as B is Alice’s key share (which Eve can learn by running one exchange), this computation can be done off-line. As S′B = S′(AS + E) = SS′A + S′E where E is known to be small, such an S′ has a nontrivial probability of meeting the criteria.

We can further refine potential S′ values by probing how an S′ value works with several coefficients i: for a fixed i, we try several j values and, for each one, do a binary search for the k value such that jδ + kS[i] has one sign and jδ + (k + 1)S[i] has the other; this gives us the approximate value of S[i]/δ, and we can use this to deduce whether δ = ±1. We can also deduce the sign of δ.

Once we have such an S′ value, the attack is easy. We can set k = δ, so that the tested quantity is δj + δS[i] = δ(j + S[i]); since the sign of δ is known, for any i we can query the sign of j + S[i], or in other words, whether j ≥ −S[i]. Binary search on j will quickly (with a handful of probes) give us the exact value of S[i]. We query S[i] for each i; that gives us the entire value of S, and the attack has succeeded.

The above can be done with perhaps 4,000 queries (assuming a ring size of N = 1024 and assuming that the value of S was generated using a discrete Gaussian distribution with a standard deviation of circa 3).
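
The final binary-search stage can be sketched against an idealised oracle. Everything here is a simplification made for illustration: `make_oracle` is a hypothetical stand-in that directly reports the sign of δj + k·S[i] (in the real attack this bit comes from whether Alice accepts Eve's guess), no extra noise is modelled, δ = ±1 is assumed to be already known, and the coefficient bound of 16 is an assumed value. The sketch scales j by δ with k = 1, which tests the same quantity j + S[i] as setting k = δ.

```python
import random

def make_oracle(S, delta):
    """Idealised Alice: answers whether delta*j + k*S[i] >= 0, and counts queries."""
    state = {"queries": 0}
    def oracle(j, k, i):
        state["queries"] += 1
        return delta * j + k * S[i] >= 0
    return oracle, state

def recover_coefficient(oracle, delta, i, bound):
    """Recover S[i], assumed to lie in [-bound, bound], by binary search on j."""
    lo, hi = -bound, bound          # the smallest j with j + S[i] >= 0 is -S[i]
    while lo < hi:
        mid = (lo + hi) // 2
        # delta * (delta * mid) + 1 * S[i] == mid + S[i] because delta is +1 or -1
        if oracle(delta * mid, 1, i):
            hi = mid
        else:
            lo = mid + 1
    return -lo

def recover_secret(oracle, delta, n, bound):
    return [recover_coefficient(oracle, delta, i, bound) for i in range(n)]

# Toy run: a secret with roughly Gaussian coefficients (sigma = 3), clamped to the bound.
rng = random.Random(7)
BOUND = 16
secret = [max(-BOUND, min(BOUND, round(rng.gauss(0, 3)))) for _ in range(64)]
oracle, stats = make_oracle(secret, delta=-1)
recovered = recover_secret(oracle, -1, len(secret), BOUND)
```

With a search interval of 33 candidate values each coefficient costs at most six oracle calls in this idealised setting; a real attacker would also pay for the search for S′ and for any repeated probes needed to average out noise.
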
ATTACK VARIANTS

One issue is that one variant of the protocol has Alice add a second error vector to the computed V vector before doing the reconciliation; this would add in an error that Eve cannot control. Eve can compensate for this either by running multiple probes (and averaging out the error), or by increasing the j, k values (to attempt to magnify the signal over the fixed noise level).

In addition, the test (as written) assumes that Eve gets only a single bit per probe (that is, she can test whether her guess of the shared secret was accurate or not). If Alice sends the first encrypted message, then it is possible that Eve might probe several bits per attempt. That is, she might arrange to have several coefficients be near a quadrant border; she would then be able to compute the shared secret for each setting of the bits under test, and see which version matches the encrypted data she sees from Alice.

CONCLUSIONS AND RECOMMENDATIONS

The above shows how ring-LWE based key exchange can be broken practically if the same key share is reused. Ring-LWE is still believed to be safe when a fresh key share is used every time; however, one needs to be careful that this is in fact the situation.

One place where this can potentially come up is in the current TLS 1.3 draft. The draft allows a server to declare a “static keyshare”. A client who wants to reestablish a connection with the server is able to send a message which includes both the client’s key share and a message encrypted by a key derived from the server’s static keyshare and the client’s key share; this is called 0-RTT. The TLS 1.3 draft uses either DH or ECDH, which are both safe when used in this manner. However, if one were to replace the DH or ECDH with a ring-LWE based key exchange, this would become insecure.

These results have been specific to ring-LWE; however, it appears likely that they would also extend to similar LWE-based protocols.

Scalar Blinding on Elliptic Curves with Special Structure

Scott Fluhrer

June 13, 2015

ABSTRACT

This paper shows how scalar blinding can provide protection against side channel attacks when performing elliptic curve operations with modest cost, even if the characteristic of the field has a sparse representation. This may indicate that, for hardware implementations, random primes might not have as large an advantage over special primes as previously claimed.

BACKGROUND

Elliptic curves are a useful tool within cryptography. An Elliptic Curve is a mathematical group, and some Elliptic Curves have this useful property: given a group member (point) G and an integer k, the point H = kG can be computed in time proportional to log k; however, given two points G, H, computing the integer k such that H = kG takes time proportional to $\sqrt{k}$ (using the best known algorithm). By selecting k (and the Elliptic Curve) to be an appropriate size, we can make finding H given k and G (known as point multiplication) relatively quick, while making finding k given H and G (known as the discrete logarithm problem) infeasible.

The most common Elliptic Curves used in practice are defined over a prime field GF(p), for a large (perhaps 256 bit) prime p (the characteristic) that we pick when we generate the curve. One thing this means in practice is that when we compute a point multiplication kG, we spend the majority of the time computing the modular multiplication $a \times b \pmod p$ for two values a, b ∈ GF(p).

To accelerate this operation, one approach is to select a prime of the form p = 2^e − c, where c has a simple representation in binary, and is considerably smaller than 2^e. This allows us to accelerate the computation of the modular reduction by taking advantage of the identity:

$$a \cdot 2^e + b \equiv a \cdot c + b \pmod p$$

If the binary representation of c is simple enough, we can compute a ⋅ c without doing a full multiply, and hence we can compute this modular reduction significantly faster than we could for an arbitrary prime.
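
The reduction identity can be illustrated with a short Python sketch. It uses the Curve25519 prime 2^255 − 19 as the example; the function name `reduce_special` is hypothetical, and Python's arbitrary-precision integers hide the limb-level work a real implementation would do, so this only demonstrates the arithmetic, not the performance.

```python
E, C = 255, 19
P = (1 << E) - C                     # the special-form prime 2^255 - 19

def reduce_special(x):
    """Reduce a non-negative x modulo P using a*2^E + b == a*C + b (mod P)."""
    mask = (1 << E) - 1
    while x >> E:                    # fold the part above bit E back in
        a, b = x >> E, x & mask      # x = a * 2^E + b
        x = a * C + b                # multiplying by 19 = 16 + 2 + 1 needs only shifts and adds
    # Now x < 2^E, so x exceeds P by less than C: one conditional subtraction suffices.
    return x - P if x >= P else x

def mul_mod(a, b):
    """Modular multiplication: one full multiply followed by the cheap folding."""
    return reduce_special(a * b)

example = mul_mod(P - 1, P - 1)      # (-1) * (-1) mod P
```

Each pass through the loop strictly shrinks x, and for a product of two field elements two passes are enough before the final conditional subtraction.
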
This allows us to compute the modular multiplication of two numbers in not much more time than it takes to perform a bignum multiplication of those two numbers. Examples of Elliptic Curves that allow this optimization include the so-called NIST curves, Curve25519, and the Microsoft NUMS curves.

If we instead select a random prime without such a special structure (as was done when defining the Brainpool curves), there are still some optimizations we can do beyond the obvious “perform a multiply, and then perform a generic modulo reduction”. We can implement Montgomery Multiplication, which replaces the modulo operation with some multiplies and shifts; the net result is that a multiplication followed by a modular reduction can be done in the time of approximately two bignum multiplications. In other words, modular multiplication for a special form prime can be done approximately twice as fast as for an arbitrary prime.

If this were the only consideration, the choice of whether we should use a prime with special structure would be an obvious one. However, there is another issue. Sometimes, Elliptic Curves are implemented by hardware that needs to operate in hostile environments, and can be expected to be subject to side channel attacks, such as Differential Power Analysis. In these types of attacks, the cryptanalyst runs the system, performing the same operation repeatedly, and takes careful measurements of power consumed (or EM radiation emitted) on a cycle-by-cycle basis; by statistically combining these measurements, the attacker hopes to recover the internal states (which include the private key).
To combat these sorts of attacks, one of the strategies that we need to employ is blinding: we include random data in our computations, and while the end result is independent of the random value, the intermediate values are strongly dependent on it, and thus the correlation between the intermediate states and anything that the attacker wants (such as the private key) is much weaker.

One such method of blinding Elliptic Curve calculations (first published by Coron) takes advantage of a property of Elliptic Curve groups: we know an integer n such that nG = 0 (this value n is known as the order of the point G). Coron’s method to compute kG is to select a random value t, compute first nt + k, and then (nt + k)G. Every time we perform a point multiplication, we select a fresh random t, and hence the bits of the integer we’re giving to the point multiplication logic are independent of the integer k we’re actually multiplying by. Because the time taken by the point multiplication is proportional to the log of the integer, this blinding method increases the time by an amount proportional to the size of t; if we select (for example) a 64 bit t, this increase is relatively small compared to the time we would have taken computing kG anyway.

However, as has been observed previously, this straightforward approach turns out not to work as well for primes with special structure. The order of the curve is always within the Hasse Interval; that is, we have:

$$p + 1 - 2\sqrt{p} \le hn \le p + 1 + 2\sqrt{p}$$

where h is the cofactor of the curve, and is usually a small power of 2. What this implies is that n ≈ p/h, and if the upper bits of p have a sparse structure, then the upper bits of n will as well. In other words, if p is a special structure prime, and if $t < \sqrt{p}$, then some of the bits of nt + k will be strongly correlated to some bits of k, and hence this supposed blinding operation does leak some information about k.
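
The leak can be made concrete with a small experiment. The sketch below uses the order of the Curve25519 base point, n = 2^252 + 27742317777372353535851937790883648493, and an assumed small blinding value; the helper names `blind` and `shared_bits` are hypothetical. It counts in how many bit positions of a chosen window the blinded scalar k + tn still agrees with k.

```python
# Order of the Curve25519 base point: 2^252 plus a constant of about 2^124.4.
N_25519 = (1 << 252) + 27742317777372353535851937790883648493

def blind(k, t, n=N_25519):
    """Coron-style scalar blinding: k + t*n gives the same point as k."""
    return k + t * n

def shared_bits(k, t, lo, hi, n=N_25519):
    """Number of bit positions in [lo, hi) where k + t*n has the same bit as k."""
    width = hi - lo
    diff = ((blind(k, t, n) ^ k) >> lo) & ((1 << width) - 1)
    return width - bin(diff).count("1")

# Deterministic toy example: a scalar with its top bit set, and a tiny t.
k = (1 << 251) + 5
t = 3
unchanged = shared_bits(k, t, 128, 252)   # compare bits 128..251
```

In this example t·n only touches bits at position 252 and above and bits below position 127, so every bit of k in the window survives the "blinding" unchanged. With a random 64-bit t the untouched window is narrower (roughly bits 190 to 251), but the effect is the same in kind.
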
This would appear to imply that primes with special structure would require significantly larger t values than random primes. And because the time taken to do a point multiplication is proportional to the length of the integer being multiplied, this would appear to imply that primes with special structure can be slower than random primes when implemented on hardware.

SCALAR RANDOMIZATION WITH FIELDS WITH SPECIAL STRUCTURE

One common way to compute the point multiplication kG is to express k in base b, as:

$$k = d_i b^i + d_{i-1} b^{i-1} + d_{i-2} b^{i-2} + \cdots + d_2 b^2 + d_1 b^1 + d_0 b^0$$

and then perform the computation:

$$kG = d_0 G + b \cdot ( d_1 G + b \cdot (d_2 G + \cdots + b \cdot (d_{i-1} G + b \cdot (d_i G)) \cdots ))$$

In the straightforward way, this takes b − 2 additions to evaluate the values (0G, 1G, 2G, ..., (b − 1)G), and then i cycles of multiplying an intermediate point by the small integer b and adding the point that corresponds to the next digit. There are a number of variants of this approach, both to try to achieve constant time, and to reduce the number of additions required (for example, by using digits in the range (−b/2, b/2), taking advantage of the fact that we can compute the inverse −G cheaply within an Elliptic Curve group).

The obvious choice is to make b a power of two (so b = 2^m); this yields two immediate advantages:

- If k is already expressed in binary, the decomposition into the form (d_i, d_{i−1}, ..., d_0) is just a matter of extracting bits.
- The operation of multiplying a point by b can be efficiently done by doing m point doublings.

However, if the prime has special structure and we look at the value of n expressed in such a base b, we see a regular pattern.
For example, the value of n for the Elliptic Curve Curve25519 (which has the special form prime 2^255 − 19) expressed in base b = 32 is:

04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 20 27 27 28 29 29 08 23 23 19 19 11 05 16 04 19 03 03 09 14 15 11 20 31 13

Note the long string of zeros at the beginning; these are what make scalar randomization less effective. As one might expect, tn ≈ t ⋅ 2^252 + t ⋅ 2^124.4, and if t < 2^128, then bits 251 and below of k + nt will be strongly correlated to the corresponding bits of k (because the bits of nt with nontrivial contributions to those bits of the sum will be zero).

Other special form primes don’t have quite as striking a form (I chose Curve25519 because the form of its n makes the effect quite obvious), but the other special form primes also have long strings of 0 or b − 1 digits at the beginning, which yield the corresponding weakness.

However, let us consider what happens if we use a b which is not a power of 2. For example, if we were to take the same n expressed in base b = 48, we get:

01 28 34 41 23 00 42 31 16 04 20 44 19 13 01 16 37 17 42 41 16 36 22 39 40 06 14 29 09 44 17 13 24 04 31 28 46 05 19 16 46 07 18 12 42 13

Here, we don’t get any regular pattern, and this value would, at first glance, appear random. And, in fact, if we look at the digits of k + nt (for modest random t) expressed in base 48, we don’t detect any correlation between those digits and the digits of k expressed in base 48. This implies that blinding with this value (in base 48) is likely as effective as blinding based on a random prime in base 32.
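
The two digit expansions above can be reproduced with a few lines of Python. The helper names `to_digits` and `longest_zero_run` are chosen for this sketch, and the constant is the order of the Curve25519 base point.

```python
# Order of the Curve25519 base point.
N_25519 = (1 << 252) + 27742317777372353535851937790883648493

def to_digits(x, base):
    """Digits of a non-negative integer in the given base, most significant first."""
    digits = []
    while x:
        x, d = divmod(x, base)
        digits.append(d)
    return digits[::-1] or [0]

def longest_zero_run(digits):
    """Length of the longest run of consecutive zero digits."""
    best = run = 0
    for d in digits:
        run = run + 1 if d == 0 else 0
        best = max(best, run)
    return best

d32 = to_digits(N_25519, 32)
d48 = to_digits(N_25519, 48)

# Formatted in the same two-digit style used in the text.
line32 = " ".join(f"{d:02d}" for d in d32)
line48 = " ".join(f"{d:02d}" for d in d48)
```

Comparing `longest_zero_run(d32)` with `longest_zero_run(d48)` gives a quick numeric view of the structure that is visible by eye in base 32 and absent in base 48.
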
This is not only true of Curve25519; the same thing happens for the special prime curve P256, which has n in base b = 32 as:

01 31 31 31 31 31 31 16 00 00 00 00 00 03 31 31 31 31 31 31 31 31 31 31 31 31 29 28 28 27 29 10 27 09 24 23 19 26 02 15 07 14 14 10 24 11 30 06 06 09 10 17

(where the streaks of 31’s and 00’s cause a similar effect to the long string of 00’s in the Curve25519 n). In contrast, in base b = 48, it is:

25 27 29 39 32 12 33 29 24 47 03 05 12 17 33 24 45 02 26 34 11 37 13 27 43 25 45 05 37 02 04 08 39 09 42 06 04 27 15 01 47 30 12 25 23 01

Of course, once we consider non-power-of-2 bases, we lose the two advantages that we formerly had; let us examine how we can accommodate the loss of these two advantages.

COSTS OF POINT MULTIPLICATION USING A NON-POWER-OF-2 BASE

The first thing we want to look at is how much more expensive it is to use a base which is not a power of 2. After all, a base which is a power of 2 allows us to implement the fixed multiplication by b using a handful of doublings, which is the most efficient method possible. However, it turns out there are other bases that are almost as cheap.

To make a concrete comparison, we’ll outline the costs with both a power-of-2 base and a non-power-of-2 base. In both cases we’ll assume that we’re dealing with an Elliptic Curve with a 256 bit subgroup, and that we’ll use a 64 bit blinding value t, yielding an exponent which is 320 bits long. We’ll use the radix method, using a balanced representation of the digits (that is, the digits are values in the range [−b/2, b/2]), and we’ll assume that the addition-by-0 is masked somehow (whether because the low-level addition routines handle it without a special case, or because in that case we’ll add an arbitrary point and discard the result).

To implement this using radix b = 32 (which is optimal over all powers of 2 in this scenario), we’ll first compute the digits (−16G, −15G, ..., 15G, 16G) with 7 doublings, 7 additions[1] and 15 negations[2].
Then, we implement the actual addition chain; in this case, 320 bits is 64 digits[3]. This is 63 rounds of multiplying the current point by 32 (which is 5 doublings) and adding in the next digit (which is a single addition); this step takes us 315 doublings and 63 additions, for a total of 322 doublings, 70 additions, and 15 negations.

Now, let us look at the radix b = 48 case; computing the digits (−24G, −23G, ..., 24G) requires 11 doublings, 11 additions and 23 negations. Then, we implement the actual addition chain; in this case, 320 bits can be expressed in 58 base-48 digits. This is 57 rounds of multiplying the current point by 48 (which can be implemented by 5 doublings and an addition) and adding in the next digit (which is a single addition); this step requires 285 doublings and 114 additions, giving us a grand total of 296 doublings, 125 additions and 23 negations.

If our Elliptic Curve representation makes addition as cheap as doubling (which some do), and we ignore the negations (which are comparatively cheap), then the base-32 method turns out to be 7.3% faster than the base-48 method. If we instead assume that a doubling is 80% of the cost of an addition (another common assumption), then the base-32 method turns out to be 10.4% faster than the base-48 method.

In other words, from this perspective, we can implement the blinding on a special format prime, and be within 7-10% of the performance of a random prime. These results are fairly stable if we tweak our assumptions (e.g. change the size of the group order or the size of t).

When we multiply by a fixed point (for example, the curve generator G), one common optimization is to precompute various multiples k_i G for various values k_i, and use those to accelerate the point multiplication process.
While this works even better with bases that aren’t powers of 2 (as we no longer need to perform multiplications by the fixed value b, and not restricting ourselves to power-of-2 bases often allows us to fine-tune the base better), this technique does require us to store some precomputed tables, and hence is less likely to be considered useful for a hardware implementation. Hence, other than this quick note, we will ignore the possibility.

WORKING WITH EXPONENTS IN NON-POWER-OF-2 BASES

The other advantage that we discard if we work in a non-power-of-2 base is that we now have to do something to convert our multiplier (which is in binary) into that base. The obvious approach would be to compute k + tn in binary, and then do a base conversion into our desired format. The problem with that is that the digits of k + tn will be held as a temporary value, and thus will be subject to the same side channel attacks that we are trying to avoid.

However, there are ways to avoid this issue. To demonstrate this, we will review two different representative protocols, and give a possibility of how this can be addressed in both of them. These are certainly not the only algorithms we would like to do point multiplication with; however, these two should demonstrate the range of options that are possible.

One note: the above point multiplication analysis assumed a balanced base-48 notation, while the discussion below will assume a standard base-48 notation. This is because standard base-48 notation is easier to do arithmetic in, while it is not difficult to convert to a balanced notation, if that would be helpful to the point multiplication logic.

The case of ECDH/ECIES

The easiest case to handle is the case of ECDH and ECIES. In these cases, the integer that we multiply by is just a random number that we pick, and has no correlation with any other value (with the exception that we multiply two different points by the same integer).
In this case, we can avoid the initial problem (how do we convert the binary integer into base-48 without opening up a side channel attack) simply by selecting the initial random number in base-48. That is, we never explicitly express the multiplier in binary; instead, we pick a series of random values between 0 and 47, and use those as the base-48 digits. As for how to select such a random value between 0 and 47, it can be noted that a rejection method (where you repeatedly generate 6 random bits as a value between 0 and 63 until the selected value is in range) is safe; it is not constant time, but the time taken is uncorrelated with the value eventually selected, and hence the timing doesn’t leak any data we care about.

The other step is to apply the blinding factor, that is, to compute k + nt in a way that has minimal correlation to k; this can be done by computing nt in binary, and then converting that to base-48 (as nt has no correlation to k, we have less concern about leaking data during the conversion process); once that is done, we can perform a constant time addition of nt to k in base-48.

The case of ECDSA Signature Generation

Another case is where we are attempting to implement ECDSA, and in particular, the signature generation process. Here, we pick a random value k, and compute both r, the x-coordinate of kG (for the generator point G), and s = k^(−1)(z + rd) (where z, r and d are integers). Because we need to do computations on k beyond using it to do point multiplication, the strategy of generating it in base-48 is less attractive. The obvious idea of computing k + tn in binary (for a random blinding factor t), then converting that to base-48, and then using our base-48 style point multiplication also doesn’t work, because we initially express k + tn in binary, and the intermediate bits of that will be correlated to the bits of k, which is what we’re trying to avoid.

However, it is still possible by adding a few extra blinding factors.
Consider this randomized procedure:

- Select random values a, b from the range [0, n), u from the range [1, n) and t from the range (0, 2^64) (t will be the Coron blinding factor).
- Compute t₁ = a + tn.
- Convert both t₁ and b into base-48, giving t₃ and t₄.
- Add t₃ and t₄ together as base-48 numbers, giving t₅.
- Compute t₅G (using the base-48 point multiplication algorithm outlined earlier), with r being the x-coordinate of the resulting point.
- Compute $u_1 = au \bmod n$ and $u_2 = bu \bmod n$.
- Compute $u_3 = u_1 + u_2 \bmod n$, and then compute $u_4 = u_3^{-1} \bmod n$.
- Compute s = u₄u(z + rd) (where z, r and d have the normal meanings for ECDSA; z is the hash, r is the x-coordinate computed previously, and d is the ECDSA private key).

If you go through this procedure, it should be clear that this is the ECDSA signature algorithm (with $k = a+b \bmod n$). It should also be clear that the value of k is selected without a bias. In addition, the internal bits of all the intermediate values are uncorrelated to the bits of k (in fact, except for t₅, the values of all intermediates are distributed independently of k), hence we have achieved blinding against first order side channel attacks. In addition, the operations that we have added over the straightforward ECDSA signature generation with Coron blinding (generating 2 log n additional random bits, three additional multiplications, one additional binary addition, one addition in base-48, and two base conversions) are relatively cheap (say, compared to computing the multiplicative inverse), and so we haven’t increased the expense significantly.

SUMMARY

In Requirements for Standard Elliptic Curves, the designers of the Brainpool curves give two justifications for selecting a random prime: first, that a special prime does not give any special performance advantage in their environment, and secondly, that special primes make blinding operations more difficult.
This paper has shown that the effort required to perform blinding when using a special prime has been overestimated; there appear to be ways to perform the required blinding at modest additional expense.

[1] In some elliptic curve representations, the operation of adding a point to itself (doubling) is cheaper than adding two distinct points (addition), hence we track those two operations separately.

[2] Negation is such a cheap operation within Elliptic Curves that we typically don’t count it.

[3] Normally, it would be 65, because of the signed representation; however, we could assume that t is a signed value as well, and that would bring us to the 64 digit level.

Quantum Cryptanalysis of NTRU

Scott Fluhrer

May 22, 2015

ABSTRACT

This paper explores some attacks that someone with a Quantum Computer may be able to perform against NTRUEncrypt, and in particular NTRUEncrypt as implemented by the publicly available library from Security Innovation. We show four attacks that an attacker with a Quantum Computer might be able to perform against encryption performed by this library. Two of these attacks recover the private key from the public key with less effort than expected; one takes advantage of how the published library is implemented, and the other is an academic attack that works against four of the parameter sets defined for NTRUEncrypt. In addition, we also show two attacks that are able to recover plaintext from the ciphertext and public key with less than expected effort. This has potential implications for the use of NTRU within TOR, as suggested by Whyte and Schanck.

INTRODUCTION

NTRUEncrypt is a public key encryption system designed by Jeffrey Hoffstein, Jill Pipher and Joseph Silverman. It has several attractive features, one of which is that it is immune to attacks by Shor’s algorithm (as it does not rely on a factorization or discrete log hard problem). Hence, it looks to be a logical component of a Quantum-Resistant cryptosystem.

NTRU does appear to be immune to Shor’s algorithm (which allows the attacker to quickly factor large integers and compute discrete logarithms). However, a Quantum Computer also allows an attacker to run Grover’s algorithm, which is able to find an n bit solution to a problem in 2^(n/2) time. The question we would like to look at is “how can Grover’s algorithm be used to advantage in attacking NTRU?”

There has been previous analysis of the Quantum Resistance of NTRU, such as by Wang, Ma and Ma; however, those works studied previously defined parameter sets. This work focuses on the parameter sets distributed with the current NTRU library.

NTRU BASICS

NTRU works in the ring of polynomials Z[x]/(x^N − 1), where N is a prime.
Computations in this ring are actually done modulo a prime power; NTRUEncrypt evaluates additions and multiplications modulo a prime power (q) and a small polynomial (p) during the course of its operation. However, all the operations that we'll examine are done modulo q (where q = 2048 is a typical choice), hence for the purposes of this paper, we can consider the ring to be $\mathbb{Z}[x]/(x^N - 1, q)$. In addition to the base NTRU operation, NTRUEncrypt uses a padding mechanism called NAEP to prevent the attacker from deducing information from decryption failures. There are a number of parameter sets defined for NTRU; each parameter set includes the value of N that is used during the NTRU operations, the values of p and q, as well as the expected security level for this parameter set (that is, the value k for which we expect any attack against this parameter set to take at least $O(2^k)$ operations). NTRUEncrypt is available as a free-for-noncommercial-use library from Security Innovation; we will be analyzing the parameter sets and the key generation and padding methods as implemented by that library.

When we select a private key for NTRUEncrypt, we select two polynomials F and G with specific sets of coefficients; we also need to make sure that F is invertible. Once we have that, we can compute the public polynomial $H = F^{-1}G$. The decryption process uses the secret polynomial F.

When we encrypt a message m with an NTRUEncrypt public key, the library performs the following steps:
- It first examines the public key to get the security level k of the parameter set that the public key belongs to. Currently defined parameter sets have k ∈ {112, 128, 192, 256}.
- It then selects a random k-bit value b.
- It encodes the value b, the message m and a portion of the public key into a string, and hashes that string to form a value ρ. It uses SHA-1 as the hash function if k ≤ 160 and SHA-256 if k > 160.
- It uses ρ to seed a random number generator, and uses the output of that random number generator to select a polynomial R.
- It encodes the message m and the random value b as a polynomial M which consists solely of coefficients in the set {0, 1, −1}; it runs a check on M to make sure that it has sufficient coefficient diversity (that is, that each of the possible values occurs sufficiently many times); if not, it goes back and selects a different value for b.
- It extracts the polynomial H from the public key, and generates the ciphertext HR + M (where the computation is done in the ring, calculating everything modulo q).

All but the last step are actually the NAEP padding procedure, used to generate the polynomials R and M for the actual NTRU operation. During decryption, the decryptor recovers the values m and b; it then uses the above logic to recompute R (and checks that this was the R used to generate the ciphertext; this prevents invalidly generated ciphertexts from becoming an issue). What this means is that the encryptor must use this formula to generate R from b, m and the public key.

KEY RECOVERY ATTACK 1 The NTRU public key H is the polynomial $F^{-1}G$ (computed modulo q), where F and G are sparse polynomials; the polynomial F is the private key. One way to try to recover the private key is to search for an F such that the product polynomial FH is sparse (which, in this case, means that it consists of $d_G + 1$ coefficients of the value p, $d_G$ coefficients of the value −p, and the rest 0, where $d_G$ is a constant from the parameter set). Now, the defined NTRU parameter sets work in one of two ways; in the straightforward method, the key generation process selects a random polynomial whose coefficients consist of $d_F$ 1's, $d_F$ −1's, and the rest 0 (where $d_F$ is a parameter from the parameter set).
The other method, known as the product form, has the key generation process select three polynomials F₁, F₂, F₃, and effectively sets F = F₁F₂ + F₃. Each of these polynomials F₁, F₂, F₃ is sparser than the target F polynomial, and computing F₁(F₂H) + F₃H is faster than computing FH directly.

To use Grover's algorithm, we could search for the polynomial F for which FH is sparse. However, doing this in the straightforward manner doesn't work (as for all defined parameter sets, there are more than $2^{2k}$ possible values of F, so even a Grover search would take more than $2^k$ steps). If we are working with a parameter set that uses the product form, one way to rewrite the equation FH = G is in the form F₁F₂H = G − F₃H, where we know that each coefficient of G is either 0, p or −p. The next obvious improvement is to check for this equality $\bmod p$; as all coefficients of G are $0 \bmod p$, this simplifies the equation to $F_1F_2H \equiv -F_3H \pmod p$.

Now, there is an obvious objection to this; both F₁F₂H and F₃H are computed modulo q, which is relatively prime to p; what happens if an addition within G − F₃H "wraps", that is, −F₃H has one sign, and G − F₃H has the other? Well, the obvious answer is "the search for the private key fails in that case". It is worth noting that the probability of no such wrap happening is reasonable; the parameter sets in question have p = 3, q = 2048 and G having between 267 and 495 nonzero entries; assuming that the coefficients of F₃H are random, this gives us a success probability between 0.67 and 0.48. So, the obvious way to address this failure mode is, if the initial search fails, to add a random constant to both sides of the equation and rerun the search; this has the effect of doubling the expected search time.

So, the obvious search algorithm is to store the coefficients of $-F_3H \pmod p$ (for all possible F₃ values), and then perform the Quantum Computer search over all possible F₁, F₂ values for one where the coefficients of $F_1F_2H \pmod p$ are a match for one of the precomputed values.
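The success probabilities quoted above can be reproduced with a simple model. The sketch below assumes (this model is our reading, not something stated explicitly in the text) that each nonzero coefficient of G independently causes a wrap with probability p/q, so that the search succeeds only when none of them does. The function name `no_wrap_probability` is illustrative.

```python
# Parameters quoted in the text for the parameter sets in question.
P, Q = 3, 2048

def no_wrap_probability(nonzero_coeffs: int, p: int = P, q: int = Q) -> float:
    """Probability that no nonzero coefficient of G causes a wrap.

    Assumption: each of the `nonzero_coeffs` positions wraps independently
    with probability p / q (the coefficients of F3*H are treated as uniform
    modulo q).
    """
    return (1 - p / q) ** nonzero_coeffs

# G has between 267 and 495 nonzero entries in the parameter sets discussed.
low = no_wrap_probability(267)
high = no_wrap_probability(495)
print(f"267 nonzero entries: {low:.3f}")
print(f"495 nonzero entries: {high:.3f}")
```

Under this model the two values come out at roughly 0.676 and 0.484, in line with the range given in the text.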
With reasonable probability the correct private key will be the only one that gives a plausible setting; in fact, we don't need to search over all the coordinates; it is sufficient to search over enough of them to make any false hit unlikely. If we find a match, this will allow us to rederive F in (strictly speaking) fewer than $2^k$ operations.

In addition, we can reduce this work effort by taking advantage of the rotational symmetry of the multiply operation within $\mathbb{Z}[x]/(x^N - 1)$; if we consider the product AB, and then consider the product A′B (where A′ has the same coefficients as A, only rotated by m positions), then the product A′B will have the same coefficients as AB, only rotated by m positions. This can easily be seen, as the operation of rotating by m positions is the operation of multiplying by the polynomial $x^m$, and $(x^mA)B = x^m(AB)$. This allows us to reduce the number of F₃ polynomials we consider by a factor of N (because if a specific value of F₃ gives a solution with small coefficients, so will any rotation of F₃ and the corresponding F₁F₂). There is also a second symmetry that we can take advantage of; if we rotate the elements of F₁ left by m positions, and the elements of F₂ right by m positions, the resulting product F′₁F′₂ will be unchanged. That is, $(x^mF_1)(x^{-m}F_2) = F_1F_2$. This second symmetry reduces the number of products F₁F₂ we need to consider by a factor of N. Together, these two improvements reduce the number of $-F_3H \bmod p$ values we need to precompute by a factor of N, and the number of F₁F₂ products that we need the Quantum Computer to search over by a factor of N.

When we consider the parameter set EES593EP1 (which has a design strength of 192 bits), we find it has N = 593, and $d_{F_1} = 10$, $d_{F_2} = 10$ and $d_{F_3} = 8$. This implies that the total number of F₃ polynomials is $\binom{N}{16}\binom{16}{8}$ (because the coefficients of each F₃ polynomial consist of 8 p's, 8 −p's, and 577 0's), and the number of F₁F₂ polynomials is $\binom{N}{20}\binom{20}{10}\binom{N}{20}\binom{20}{10}$.
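Both symmetries, and the size of the two search spaces for EES593EP1, can be checked numerically. The following is a minimal sketch: the helper names `cyclic_mul` and `rotate` and the toy ring size are assumptions made for illustration, and the binomial counts are our reading of the formulas above (choose the nonzero positions, then choose which of them are positive).

```python
import math
import random

def cyclic_mul(a, b):
    """Multiply two polynomials in Z[x]/(x^N - 1), given as coefficient lists."""
    n = len(a)
    out = [0] * n
    for i, ai in enumerate(a):
        if ai:
            for j, bj in enumerate(b):
                out[(i + j) % n] += ai * bj
    return out

def rotate(a, m):
    """Multiply by x^m: the coefficient at index i moves to index (i + m) mod N."""
    n = len(a)
    return [a[(i - m) % n] for i in range(n)]

# Toy check of the two symmetries (toy_n and the shift m are arbitrary choices).
rng = random.Random(1)
toy_n, m = 11, 4
A = [rng.choice((-1, 0, 1)) for _ in range(toy_n)]
B = [rng.choice((-1, 0, 1)) for _ in range(toy_n)]

# First symmetry: (x^m A) B = x^m (A B).
sym1_ok = cyclic_mul(rotate(A, m), B) == rotate(cyclic_mul(A, B), m)
# Second symmetry: (x^m A)(x^-m B) = A B.
sym2_ok = cyclic_mul(rotate(A, m), rotate(B, -m)) == cyclic_mul(A, B)
print(sym1_ok, sym2_ok)

# Search-space sizes for EES593EP1 (N = 593, d_F1 = d_F2 = 10, d_F3 = 8),
# expressed as log2 after dividing out the factor of N from each symmetry.
N = 593
f3_bits = math.log2(math.comb(N, 16) * math.comb(16, 8)) - math.log2(N)
pair = math.comb(N, 20) * math.comb(20, 10)
f1f2_bits = 2 * math.log2(pair) - math.log2(N)
grover_bits = f1f2_bits / 2  # Grover search takes about the square root

print(f"log2(|F3|/N)        = {f3_bits:.3f}")
print(f"log2(|F1||F2|/N)    = {f1f2_bits:.3f}")
print(f"log2(Grover search) = {grover_bits:.1f}")
```

The exponents printed by the last three lines can be compared directly against the values the text derives for this parameter set.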
When we take into account the factor of N decrease (because of the rotational symmetry), this gives us $|F_3|/N \approx 2^{107.285}$. When we consider the number of F₁F₂ polynomials, and take account of the factor of N decrease (because of the second rotational symmetry), this gives us $|F_1||F_2|/N \approx 2^{271.165}$. A Quantum Computer is able to search over a set of this size in approximately 2¹³⁶ time, and this second step would dominate. When we account for the failure probability (and the possibility that we may need to rerun this procedure), this gives us an overall time of O(2¹³⁷), which is considerably smaller than the original design goal of 192 bits.

When we consider all four product form parameter sets, we find that the 112 bit product form set (EES401EP2) can be attacked with an expected O(2¹⁰⁴) work, the 128 bit set (EES439EP1) can be attacked with an expected O(2¹¹²) work, and the 256 bit set (EES743EP1) can be attacked with an expected O(2¹⁹⁷) work. In this last case, it's actually the derivation of the possible values of −F₃H which is the dominating factor (because $d_{F_3} = 15$ for this parameter set, which is comparatively large).

Now, this approach has managed to recover the private key using fewer NTRU multiplications than expected. On the other hand, this approach is quite impractical (even beyond the number of operations involved); it assumes that we can practically check for the existence of an entry in a table with more than 2⁶⁵ entries in constant time.

An obvious alternative approach would be to search for equalities between F₁F₂H and G − F₃H modulo 2; this would allow us to ignore the sign differences in the F₁, F₂, F₃ coefficients (and avoids the possibility of the attack failing because of a wrap, as all parameter sets in question have q being a power of 2).
However, this approach turns out not to work, because G happens to be relatively dense when evaluated modulo 2 in the parameter sets in question, and so there's no obvious way to determine when we've detected the correct F₁, F₂, F₃ set.

KEY RECOVERY ATTACK 2 When the NTRU library generates a key, it goes through this procedure:
- It selects a random bitstring whose length is the design strength k of the parameter set, plus 64 bits (for example, for a 128 bit parameter set, it obtains 192 bits from the internal random number generator).
- It hashes this random number (with SHA-1 if the parameter strength k ≤ 160 and SHA-256 if k > 160), giving us a hash value h.
- It uses that hash value h (and nothing else) to seed a random number generator, and the output of that random number generator selects a polynomial F.
- It uses a similar process to select a polynomial G.

This process is similar to the process used to select the random polynomial R during encryption (largely because it reuses the same code). This key generation procedure immediately gives us a potential key recovery attack; given a public key H, we use Grover's algorithm to guess the hash value h; our verification step would seed the random number generator, select a polynomial F′, and then compute the product $F'H$; if that consists solely of elements in the set {−p, 0, p}, then we accept. This works because if F = F′, then we have F′H = G (which has the special form listed).

PLAINTEXT RECOVERY ATTACK 1 A similar approach to recover the plaintext, given a ciphertext C and a public key H, is to run Grover's algorithm, with the guess this time being ρ, the output of the hashed string.
That is, the attacker would use a function that takes a value for ρ, runs it through the random number generator to generate a guess of R, and checks whether the polynomial C − HR consists solely of the coefficients {0, 1, −1}; the correct guess of the hash will do that (as C − HR is the encoded message M, which is a polynomial with those coefficients); an incorrect guess is quite unlikely to have coefficients limited to that range. For security levels 112 and 128, we use a 160 bit hash (giving 2¹⁶⁰ possible values for ρ), hence this approach will take an expected O(2⁸⁰) operations. For security levels 192 and 256, we use a 256 bit hash, hence this approach will take an expected O(2¹²⁸) operations; in both cases, the work required is significantly smaller than the target security levels.

This is quite similar to one of the key recovery attacks we have presented above (because both attacks work against the common logic used to generate both F and R). However, there is a distinction in that the key generation procedure could be modified to use a stronger method to select F (and G) without any interoperability issue. In contrast, we cannot modify how we generate R without modifying how decryption is done (as the decryptor will expect to be able to regenerate R as part of the post-decryption validation process). Covering this plaintext recovery attack would require modifying the NTRU padding method.

PLAINTEXT RECOVERY ATTACK 2 Another way an attacker can attempt to recover the plaintext, if the plaintext is low entropy, is to notice that the ciphertext is a deterministic function of the plaintext, the public key, and the k-bit random value b.
If the entropy of the plaintext is low (it has only $2^n$ possible values, for $n \ll k$), then what an attacker with a Quantum Computer could do is model the system as one with $2^{n+k}$ inputs (which consist of the $2^n$ possible inputs for the plaintext, paired with the $2^k$ possible values for b), and then apply Grover's algorithm to find a solution (that is, m and b) that generates the known ciphertext in $O(2^{(n+k)/2})$ time, which is less than the security level $2^k$ (and if n is sufficiently small, this value may be even smaller than the previous attack). Now, in practice, this attack would not generally appear to be a significant threat; in most cases, NTRUEncrypt will be used to pass symmetric keying data, and symmetric keying data has sufficiently high entropy to make this attack infeasible. However, in those cases where NTRUEncrypt is used to transmit low-entropy plaintexts, or where the attacker might have a plausible guess for the plaintext (and it is important to make such verification infeasible), this attack is of concern.

CONCLUSIONS AND RECOMMENDATIONS We have presented four attacks that an adversary with a Quantum Computer could attempt against NTRUEncrypt (as implemented by the current NTRU library). One of the key recovery attacks actually attacks the key generation process that the NTRU library uses; it would be easy to modify the library to foil this approach (for example, both by using stronger hash functions when generating the hashed value h, and by extending the entropy extracted from the random number generator from k + 64 bits to at least 2k bits, to prevent someone using Grover's algorithm to guess the preimage value). Because of the ease of this modification, and because the modified library would continue to interoperate with existing NTRU implementations, we recommend that such a change be made (even if it is not clear if this attack is actually practical).
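To see why both parts of that suggestion matter, the Grover exponents can be tabulated with a rough model. The sketch below is a simplification under stated assumptions: it ignores all constant factors, it treats the attacker as searching whichever of the seed space or the hash output space is smaller, and the function names are illustrative rather than taken from the library.

```python
def hash_bits(k: int) -> int:
    """Hash width described in the text: SHA-1 for k <= 160, else SHA-256."""
    return 160 if k <= 160 else 256

def seed_guess_exponent(k: int, seed_bits=None, digest_bits=None) -> float:
    """log2 of the Grover work to guess the value that seeds the generator.

    Defaults model the current behaviour described in the text: a seed of
    k + 64 random bits hashed down to hash_bits(k) bits.
    """
    seed_bits = k + 64 if seed_bits is None else seed_bits
    digest_bits = hash_bits(k) if digest_bits is None else digest_bits
    # Grover searches the smaller space in about its square root.
    return min(seed_bits, digest_bits) / 2

def low_entropy_exponent(n: int, k: int) -> float:
    """log2 of the Grover work to search 2^n plaintexts times 2^k values of b."""
    return (n + k) / 2

for k in (112, 128, 192, 256):
    current = seed_guess_exponent(k)
    # Hypothetical fix: at least 2k bits of seed and a 2k-bit hash output.
    fixed = seed_guess_exponent(k, seed_bits=2 * k, digest_bits=2 * k)
    print(f"k={k}: current 2^{current:g}, with 2k-bit seed and hash 2^{fixed:g}")
```

With the defaults, the model gives 2⁸⁰ for the 112 and 128 bit sets and 2¹²⁸ for the 192 and 256 bit sets; widening only the seed or only the hash leaves the smaller of the two as the bottleneck, which is why the text suggests changing both.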
We have also presented another key recovery attack that uses fewer operations than expected to recover the private key (assuming one of the four standardized parameter sets); however, this attack is thoroughly impractical. We don't recommend any change to cover this attack.

Neither of the two plaintext recovery attacks we have presented actually attacks the NTRU primitive itself; instead, they attack the NAEP padding method, and take advantage of the fact that the internal primitives selected for the parameter sets are scaled to withstand attacks by a classical computer, and are not sufficient if the attacker has a Quantum Computer. These attacks may have some practical impact. For example, in "A quantum-safe circuit-extension handshake for Tor", the authors suggest using the NTRUEncrypt parameter set EES439EP1 (a 128 bit parameter set) to protect Tor traffic; do these results mean that someone with a Quantum Computer could reconstruct the original sender of a message with O(2⁸⁰) work?

There are three obvious ways to address the issues here. The first strategy would be to use a stronger parameter set; if your target is 128 bits of security, use an NTRU parameter set targeted towards 256 bit security; such a parameter set would provide at least 128 bits of security against all known attacks, even if the attacker has a Quantum Computer. The costs here are that the ciphertext and public key sizes increase, as well as a small increase in the encryption and decryption time. The second strategy would be to define alternative parameter sets that are designed to be Quantum-Resistant. These alternative parameter sets might use the same polynomials as the current sets; however, we would modify the primitives used by the NAEP padding method; they would use wider hash functions, and larger b values.
The costs here would be that the new parameter sets would be incompatible with libraries that only understood the old sets, plus a minor decrease in the amount of plaintext that NTRU could encrypt in a single message (because b is now larger). The third way of addressing this is to question whether this is actually a threat after all. O(2⁶⁴) work (as the worst case plaintext recovery attack) may sound like a conceivable amount of work; however, that notation conceals a constant of proportionality, and that constant might be of significant size. In both of these attacks, the amount of work involved is actually O(2⁶⁴) encryption operations; in addition, each operation is done on a Quantum Computer (using entangled states, and using some Quantum Error Correction logic). It would appear plausible that such an operation may be considerably more expensive than the corresponding operation on a classical computer. However, we don't have a working Quantum Computer, and so we can't be certain how much more expensive these operations would be. Because of this uncertainty, while this line of argument may sound promising, our opinion is that we shouldn't rely only on that; we recommend one of the previous two strategies (especially given their relatively low cost).