

Constant-time modular multiplication library in Java?

+2
−1

Has anyone written a constant-time BigInteger modular multiplication function in Java? I need to use modular multiplication for a cryptographic protocol, and I don't want to try to write it myself or rely on blinding.

For context, the reason I want to do this is to implement a cryptographic protocol. The following arithmetic is in a finite field. Suppose a number of peers, each of whom has a number $ a $ and a number $ b $, want to decide whether the sum of each peer's $ a $ value equals the sum of each peer's $ b $ value. My plan is to have each peer generate a random value $ r $ and compute $ g^r $, then publish $ g^rg^a $ and $ g^rg^b $. The product of every peer's $ g^rg^a $ value will then equal the product of every peer's $ g^rg^b $ value exactly when the sum of every peer's $ a $ equals the sum of every peer's $ b $. Since I assume $ a $ and $ b $ are high-entropy, this should not reveal any peer's value of $ a $ or $ b $.
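
For concreteness, here is roughly what one peer would publish, written with Java's built-in BigInteger (which, as far as I can tell, makes no constant-time promises, which is exactly my problem); `p` and `g` are placeholders for the public group parameters:

```java
import java.math.BigInteger;
import java.security.SecureRandom;

class PeerCommitment {
    // p, g are placeholder public group parameters; a, b are this peer's secrets.
    static BigInteger[] publish(BigInteger p, BigInteger g, BigInteger a, BigInteger b) {
        BigInteger q = p.subtract(BigInteger.ONE);                    // exponents taken mod the group order
        BigInteger r = new BigInteger(q.bitLength(), new SecureRandom()).mod(q);
        BigInteger gr = g.modPow(r, p);
        // These two multiplications are the operations I need to be constant-time.
        BigInteger first  = gr.multiply(g.modPow(a, p)).mod(p);
        BigInteger second = gr.multiply(g.modPow(b, p)).mod(p);
        return new BigInteger[] { first, second };
    }
}
```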

In this protocol, I need a constant-time guarantee for multiplying $ g^r $ by $ g^a $ or $ g^b $, since I assume $ a $ and $ b $ are long-lived. The modular exponentiation doesn't need to be constant-time, since I only ever need to compute $ g^r $, $ g^a $, or $ g^b $ once.

I've also considered using ECC for this, but then I need to find a constant-time ECC implementation, and I'm more familiar with finite field arithmetic anyway.



1 answer

+4
−0

Standard Disclaimer: As always for this kind of thing, for production use, you should use an existing library written by experts to do as much of this as possible.

Meta Note: I also highly encourage looking into ECC (and particularly Curve25519), as it has a lot of usability and performance benefits. Curve25519 keys are only 32 bytes long, which makes using the public key directly as an identifier much more practical, and they can be generated by simply taking 256 bits of random data (though you'd usually normalize with some trivial bit masking). This is much simpler and less error-prone than finding the large primes RSA requires. Being able to pick a convenient modulus, like $2^{255}-19$, allows significant simplifications and speed-ups in modular operations. Bernstein's Curve25519 paper discusses his rationale for this choice.

To your explicit question, I don't know whether someone has implemented a constant-time multiplication using Java's BigInteger library. I suspect not. Really, the question is whether Java's BigInteger library even provides enough timing guarantees for that to be possible, either by specification or in practice. Obviously, two BigIntegers can take up different amounts of memory depending on their magnitudes. This, by itself, will almost certainly lead to timing differences. You could attempt to finesse this by storing $x$ as $2^N + x$ for some fixed, suitably large $N$ such that $2^N$ is larger than any intermediate value arising in computations with the $x$s. This (along with reductions mod $2^N$ when necessary) would probably guarantee a consistent memory size but, I suspect, would still not guarantee consistent timing.

In practice, for a production system, you wouldn't want to use BigInteger anyway. For any particular algorithm, the number of bits you need to represent the numbers is fixed, and it makes more sense to use a representation that stores them in a fixed amount of memory independent of magnitude.

Here's a route you could take that would be fairly easy to implement. First, I'm going to assume the modulus is public, so we don't need to worry about timing for operations that only involve the modulus (and other public information). Let $W$ be the number of bits in the primitive words you use, e.g. 32 or 64, and $N = BW$ be the number of bits you'll use to represent numbers, e.g. 2048 or 4096 for RSA, or 256 for something like Curve25519. Here I'm assuming it's a multiple, $B$, of the number of bits in the primitive words. In other words, we're representing numbers as degree < $B$ polynomials with coefficients in $\{0,\dots, 2^W-1\}$, i.e. $x = \sum_{k=0}^{B-1} a_k 2^{kW}$. This polynomial is simply represented by an array of length $B$ of $W$-bit words corresponding to the $a_k$. (This polynomial-oriented view is very powerful, and I highly recommend Multidigit Multiplication for Mathematicians by Dan Bernstein for more discussion of it.)
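
For instance, a minimal sketch of that array representation and the conversion to and from BigInteger, assuming $W = 32$ with each limb stored in a long (anticipating the delayed-carry suggestion at the end of this answer); the helper names are made up:

```java
import java.math.BigInteger;

final class Limbs {
    static final int W = 32;                      // bits per limb
    static final long MASK = 0xFFFFFFFFL;         // 2^W - 1

    // Split x (0 <= x < 2^(B*W)) into B limbs, little-endian: x = sum a[k] * 2^(k*W).
    // Each limb lives in a long so later sketches can delay carry propagation.
    static long[] fromBigInteger(BigInteger x, int B) {
        long[] a = new long[B];
        for (int k = 0; k < B; k++) {
            a[k] = x.shiftRight(k * W).and(BigInteger.valueOf(MASK)).longValue();
        }
        return a;
    }

    // Works even if limbs are unnormalized (larger than 32 bits) or signed,
    // since each limb is added back in at its weight 2^(k*W).
    static BigInteger toBigInteger(long[] a) {
        BigInteger x = BigInteger.ZERO;
        for (int k = 0; k < a.length; k++) {
            x = x.add(BigInteger.valueOf(a[k]).shiftLeft(k * W));
        }
        return x;
    }
}
```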

Let $n$ be our public modulus, which we'll assume is coprime to $2^N$; that won't matter for most of the following, but it is generally true in these contexts. Let $r_j \equiv 2^{j} \pmod n$, i.e. $2^{j}$ reduced mod $n$, except that we'll choose the representative with the smallest absolute value. That is, if the (Euclidean) remainder of $2^{j}$ mod $n$ is greater than $n/2$, then let $r_j$ be that remainder minus $n$. This implies that $|r_j| \leq (n-1)/2 < 2^{N-1}$. You can pre-compute these values using the BigInteger library. In particular, pre-compute $r_{N+iW}$ for $i \in \{0,\dots,B-1\}$. We'll also want $r_j$ for various $j < N$ such that $2^j > n$. See the discussion of full reduction later to get an idea of which of these we need (they could also be computed on the fly, but they will be used often).
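
Since the $r_j$ involve only public data, precomputing them with BigInteger is fine. A minimal sketch (signedResidue and foldConstants are just illustrative names):

```java
import java.math.BigInteger;

final class Residues {
    // r_j = 2^j mod n, chosen as the representative with the smallest absolute value,
    // i.e. in (-n/2, n/2]. Only public data (n, j) is involved, so timing doesn't matter here.
    static BigInteger signedResidue(int j, BigInteger n) {
        BigInteger r = BigInteger.ONE.shiftLeft(j).mod(n);   // Euclidean remainder in [0, n)
        if (r.compareTo(n.shiftRight(1)) > 0) {              // if r > n/2, use r - n instead
            r = r.subtract(n);
        }
        return r;
    }

    // Pre-compute r_{N + i*W} for i = 0..B-1, as used by the folding step below.
    static BigInteger[] foldConstants(int N, int W, int B, BigInteger n) {
        BigInteger[] rs = new BigInteger[B];
        for (int i = 0; i < B; i++) rs[i] = signedResidue(N + i * W, n);
        return rs;
    }
}
```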

To multiply $x$ and $y$ mod $n$, first just use the grade-school algorithm, which is essentially the naive convolution algorithm with carrying, to multiply $x$ and $y$ as (multi-digit) integers. The result will require a $2B$-word array. We can view this result as $a_0 + 2^N \sum_{k=0}^{B-1}b_{0,k} 2^{kW}$. We then compute $a_0 + \sum_{k=0}^{B-1} b_{0,k} r_{N+kW}$. Since $a_0 < 2^N$, $b_{0,k} < 2^W$, and $|r_j| \leq (n-1)/2 < 2^{N-1}$, the magnitude of this sum is less than $(B+1)2^{N+2W-1}$. Assuming $B+1 < 2^{W+1}$, which will be the case for almost all realistic values of $B$ and $W$, we now have $a_1 + 2^N \sum_{k=0}^{2} b_{1,k} 2^{kW}$, which we again map to $a_1 + \sum_{k=0}^{2} b_{1,k}r_{N+kW}$, whose magnitude is less than $2^{N+2W+1}$. We can iterate this; in particular, writing the result of an iteration as $a_i + 2^N b_i$, we see that $b_{i+1}$ requires at least one fewer bit to represent than $b_i$.
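
Here is a sketch of the grade-school multiplication on the word-array representation, using 32-bit limbs held in longs (as suggested at the end of this answer) so the multiply-accumulate never overflows; the loop structure depends only on $B$, not on the data:

```java
final class MulSketch {
    static final long MASK = 0xFFFFFFFFL;

    // Schoolbook multiplication of two B-limb numbers (32-bit limbs, little-endian),
    // producing a 2B-limb result. The loop pattern is fixed by B alone.
    static long[] mul(long[] x, long[] y) {
        int B = x.length;
        long[] z = new long[2 * B];
        for (int i = 0; i < B; i++) {
            long carry = 0;
            for (int j = 0; j < B; j++) {
                // Two's-complement 64-bit arithmetic matches unsigned arithmetic bit-for-bit,
                // and (2^32-1)^2 + (2^32-1) + (2^32-1) still fits in 64 bits.
                long t = (x[i] & MASK) * (y[j] & MASK) + (z[i + j] & MASK) + carry;
                z[i + j] = t & MASK;
                carry = t >>> 32;
            }
            z[i + B] = carry;
        }
        return z;
    }
}
```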

This is overly pessimistic, and we can pre-compute a schedule based on the actual bit sizes of the (public) $r_j$. Nevertheless, the value of a good choice of modulus becomes clear here. With $n=2^{255}-19$, $r_{256} = 38$, leading to hundreds of bits of reduction per iteration. (If we didn't care about data-independence, we would just use the actual magnitude of the result to determine when to stop iterating and how many coefficients we need to deal with each iteration. That would be simpler and faster, but data-dependent.)
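
To make the folding step concrete, here is one iteration written with BigInteger purely to illustrate the arithmetic; a real constant-time version would run the same fixed sequence of multiply-adds on the limb arrays (foldOnce is a made-up name, and this sketch assumes a non-negative input such as the initial double-length product):

```java
import java.math.BigInteger;

final class FoldSketch {
    // One fold of z = a + 2^N * sum_k b_k 2^(kW): replace each b_k 2^(N+kW) by b_k * r_{N+kW}.
    // rFold holds the pre-computed signed residues r_{N+kW}. Assumes z >= 0.
    static BigInteger foldOnce(BigInteger z, BigInteger[] rFold, int N, int W) {
        BigInteger maskN = BigInteger.ONE.shiftLeft(N).subtract(BigInteger.ONE);
        BigInteger a = z.and(maskN);              // low N bits
        BigInteger high = z.shiftRight(N);        // the b_k chunks
        BigInteger acc = a;
        for (int k = 0; k < rFold.length; k++) {
            BigInteger bk = high.shiftRight(k * W).and(BigInteger.valueOf(0xFFFFFFFFL));
            acc = acc.add(bk.multiply(rFold[k]));
        }
        return acc;                               // congruent to z mod n, much smaller in magnitude
    }
}
```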

NOTE: the result of this is a number (represented in vector form) less than $2^N$ but NOT necessarily less than $n$. The result is correct mod $n$, but isn't necessarily fully reduced. Generally, this is fine for intermediate calculations. To perform a full reduction, write $x = a + b2^{j_0} + c2^{j_1}$ where $2^{j_0} < n < 2^{j_0 + 1}$ and $j_0 < j_1 < N$. $a$ will be a $j_0$-bit number, $b$ a $(j_1 - j_0)$-bit number, and $c$ an $(N-j_1)$-bit number. We have $x \equiv a + b2^{j_0} + cr_{j_1} \pmod n$, and since $r_{j_1} < n < 2^{j_0+1}$, this new representative is less than $a + (b+2c)2^{j_0}$. $b+2c$ has at most $\max(j_1 - j_0, N-j_1+1)+1$ bits. If $j_1$ was chosen roughly halfway between $N$ and $j_0$, this means it takes roughly half as many bits to represent the part beyond bit $j_0$. We can iterate this a fixed number of times in a fixed pattern (dependent on $N$ and $j_0$), roughly halving the number of bits required beyond $j_0$ each time. Eventually we can finish with a fixed few rounds of subtracting $n$ with borrow, where we add back an $n$ if the borrow flag is set. We do this in a data-independent way by always adding back $n$ multiplied by the borrow flag. You can see why doing a full reduction is not desirable for every intermediate operation.
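
The finishing rounds might look something like this on normalized 32-bit limbs, with the borrow folded in arithmetically instead of branching on it (subtractModulusOnce is an illustrative name):

```java
final class FinalReduce {
    static final long MASK = 0xFFFFFFFFL;

    // One round of x -= n with borrow, then add n back multiplied by the final borrow,
    // so the same instructions run whether or not the subtraction underflowed.
    // x and n are B normalized 32-bit limbs; a single round fully reduces when 0 <= x < 2n.
    static void subtractModulusOnce(long[] x, long[] n) {
        int B = x.length;
        long borrow = 0;
        for (int k = 0; k < B; k++) {
            long t = (x[k] & MASK) - (n[k] & MASK) - borrow;
            x[k] = t & MASK;
            borrow = t >>> 63;        // 1 if this limb's subtraction went negative, else 0
        }
        long carry = 0;
        for (int k = 0; k < B; k++) {
            // borrow is 0 or 1; multiplying by it adds n back only when we underflowed,
            // without branching on secret data. The final carry out cancels the underflow.
            long t = (x[k] & MASK) + (n[k] & MASK) * borrow + carry;
            x[k] = t & MASK;
            carry = t >>> 32;
        }
    }
}
```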

To implement this, you need to implement conversion to/from BigInteger, addition, subtraction, and scaling by a "small" scalar, all with carries, but as just normal multi-digit integer arithmetic. For convenience, you could use an array of 64-bit numbers (say) but use them to represent 32-bit chunks. That is, you have $x = \sum_{k=0}^{B-1} a_k 2^{32k}$ but now $a_k$ is (inefficiently) represented as a 64-bit number. The benefit of doing this is that you don't need to perform carries in your addition, subtraction, and multiplication algorithms; you can instead have a separate propagate_carries (or reduce) function that normalizes all the coefficients to $\{0,\dots,2^{32}-1\}$ and performs the carries then. Worst case, you do an explicit propagate_carries after every operation anyway, and it still lets you write slightly simpler code. In many cases, it's easy to see that you can delay a carry until later, saving some computation. Of course, this comes at the cost of storing and performing 64-bit operations everywhere.
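
A sketch of that normalization pass, under the same assumptions (64-bit longs holding 32-bit chunks; propagateCarries is just an illustrative name):

```java
final class Carries {
    static final long MASK = 0xFFFFFFFFL;

    // Limbs are 32-bit chunks stored in longs; after a few additions or scalings they may
    // exceed 32 bits (or go negative). This pass restores every limb to [0, 2^32),
    // pushing the excess up. The final carry is returned rather than silently dropped.
    static long propagateCarries(long[] a) {
        long carry = 0;
        for (int k = 0; k < a.length; k++) {
            long t = a[k] + carry;
            a[k] = t & MASK;
            carry = t >> 32;    // arithmetic shift keeps this correct for negative limbs
        }
        return carry;
    }
}
```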

As a final warning, and another reason you shouldn't do this yourself for production code without a lot of care: the above algorithms use a fixed pattern of constant-time operations (given $N$, $W$, and $n$), so they are not data-dependent. However, this doesn't stop the compiler from doing some "optimization" that does lead to data-dependent timing. For example, maybe the compiler realizes that the carry will only ever be $0$ or $1$ and decides to "optimize" the add-back into `if (carry == 0) return result; else return add(x, r_N);`. Short of writing assembly code, it's hard to be certain something like this isn't happening. Even at the level of assembly, speculative execution can break this property, hence things like Intel's Data Operand Independent Timing mode.


1 comment thread

Thank you for including all this detail. I think this will probably push me to using Rust for this pr... (1 comment)
