Algorithms for Multiplication and Exponentiation
CS 463/480 Lecture, Dr. Lawlor
There are quite a few different ways to do multiplication. The
simplest is to use the compiler's "*" operator, but there are a
number of situations where you might want to build something
yourself. For example, you might want finite field
multiplication (which has a different "add" operation), big integer
arithmetic (calling an "add big integers" function), or modular
arithmetic (reducing by the modulus each time).
Exponential-Time Multiply
The simplest approach is just to loop
over one of the operands and add:
var mul_slow(var A,var B) {
    var product=0;
    for (var i=0;i<B;i++) product+=A;
    return product;
}
This has terrible performance, since it
uses O(2^B) additions for a number with B bits--it's as
slow as brute-forcing a key!
If you replace the "+=" with "*=", and initialize the product at
1, you can transform this slow multiply into a slow
exponentiation:
var exp_slow(var A,var B) {
    var product=1;
    for (var i=0;i<B;i++) product*=A;
    return product;
}
It's still exponentially slow, though. (Nonetheless, a surprising number
of people use this algorithm.)
Binary / Peasant Multiply
You can do exponentially better with the
famous "fast multiplication algorithm", also known as the "Peasant Algorithm" (it was evidently used by Russian
peasants, using even/odd to identify the value of the low
bit).
var mul_binary(var A,var B) {
    var product=0;
    for (var bit=0;B>=(1<<bit);bit++)
    {
        if (B&(1<<bit))
            product+=A; // include A in product
        A+=A; // left shift A by 1 bit
    }
    return product;
}
This is now O(B) additions for a number
with B bits, which is exponentially better than before.
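The pseudocode above drops straight into real code. Here is a minimal C++ sketch; the uint64_t type is my choice, and I shift B down each iteration instead of shifting a bit mask up, which also avoids overflowing the 1<<bit shift count:

```cpp
#include <cstdint>

// Binary (peasant) multiply: sum up shifted copies of A,
// selected by the bits of B.
uint64_t mul_binary(uint64_t A, uint64_t B) {
    uint64_t product = 0;
    while (B != 0) {
        if (B & 1) product += A; // low bit set: include this copy of A
        A <<= 1; // left shift A by 1 bit
        B >>= 1; // move to the next bit of B
    }
    return product;
}
```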
Once again, replacing "+" with "*" gives the fast exponentiation by squaring trick:
var exp_binary(var A,var B) {
    var product=1;
    for (var bit=0;B>=(1<<bit);bit++)
    {
        if (B&(1<<bit))
            product*=A; // include A in product
        A*=A; // double exponent on A
    }
    return product;
}
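The crypto payoff of this loop is modular exponentiation: reduce by the modulus after every squaring and every multiply, so the intermediate values stay small. A sketch, where keeping the operands below 2^32 (so 64-bit products can't overflow) is my assumption:

```cpp
#include <cstdint>

// Square-and-multiply modular exponentiation: computes A^B mod m.
// Assumes A and m fit in 32 bits, so 64-bit products cannot overflow.
uint64_t exp_mod(uint64_t A, uint64_t B, uint64_t m) {
    uint64_t product = 1 % m; // 1 % m handles the degenerate m==1 case
    A %= m;
    while (B != 0) {
        if (B & 1) product = (product * A) % m; // include this power of A
        A = (A * A) % m; // square: next power-of-two exponent
        B >>= 1;         // move to the next bit of B
    }
    return product;
}
```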
Masked Binary Multiply
On most modern computers, the "if" above
gives fairly poor performance (the branch prediction is...
unpredictable). We can do about 30% better using a bit mask
to zero out the contribution to the product. This will also
eliminate a side channel attack based on timing the
branches.
Here, we build the mask using a clever trick with signed numbers:
negation converts 0,1 to 0,0xffffffff...
var mul_AND(var A,var B) {
    var product=0;
    var Bbit=B; // == B>>bit
    for (var bit=0; Bbit!=0; bit++)
    {
        var mask=-(Bbit&1); // -0 == 0; -1 == all ones
        product+=A&mask; // optionally include in product
        Bbit=Bbit>>1; // move to next bit of B
        A+=A; // move to next bit of A
    }
    return product;
}
There is one difficulty in converting to exponentiation: if the mask
is zero, the above will eliminate the entire product. We can
use the famous(?) bitwise selection technique to multiply by 1 if
the mask is zero:
var exp_AND(var A,var B) {
    var product=1;
    var Bbit=B; // == B>>bit
    for (var bit=0; Bbit!=0; bit++)
    {
        var mask=-(Bbit&1); // -0 == 0; -1 == all ones
        product*=(A&mask)|(1&~mask); // A if mask, 1 if not mask
        Bbit=Bbit>>1; // move to next bit of B
        A*=A; // move to next bit of A
    }
    return product;
}
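The same selection expression works for any two values, not just A and 1, which makes it a handy branch-free building block on its own. A small C++ sketch (the name select_ct is mine):

```cpp
#include <cstdint>

// Branch-free select: returns a if bit==1, b if bit==0,
// with no data-dependent branch for timing attacks to observe.
uint64_t select_ct(uint64_t bit, uint64_t a, uint64_t b) {
    uint64_t mask = (uint64_t)-(int64_t)bit; // 1 -> all ones, 0 -> all zeros
    return (a & mask) | (b & ~mask);
}
```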
Hardware Multiply
Any reasonably big modern CPU has a
hardware multiply instruction, of course. It's typically
very fast, only a few clock cycles latency, and is often the same
speed regardless of the values being multiplied.
var multiply_hardware(var A,var B) {
return A*B;
}
The only real downside is that the hardware multiply circuit can only
handle 32- or 64-bit numbers, and we often need bigger numbers than
this.
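One standard first step past the hardware limit is to build a widening multiply out of smaller hardware multiplies: split each 64-bit operand into 32-bit halves and combine the four partial products. A sketch (the function name and high/low return convention are my choices):

```cpp
#include <cstdint>

// Schoolbook 64x64 -> 128-bit multiply built from four 32x32 -> 64-bit
// hardware multiplies. Returns the high 64 bits; *lo gets the low 64 bits.
uint64_t mul_64x64_128(uint64_t x, uint64_t y, uint64_t *lo) {
    uint64_t x1 = x >> 32, x0 = x & 0xffffffff; // high and low halves
    uint64_t y1 = y >> 32, y0 = y & 0xffffffff;
    uint64_t ll = x0 * y0, lh = x0 * y1;
    uint64_t hl = x1 * y0, hh = x1 * y1;
    // accumulate the middle terms, tracking carries into the high half
    uint64_t mid = (ll >> 32) + (lh & 0xffffffff) + (hl & 0xffffffff);
    *lo = (mid << 32) | (ll & 0xffffffff);
    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
```

Repeating this digit-by-digit scheme over arrays of words is exactly how multiprecision libraries do their schoolbook multiply.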
Karatsuba Multiplication
For multiprecision arithmetic, where we loop over smaller digits to
perform arithmetic operations, addition takes O(N) time for an N
digit number. Multiplication via the binary methods above takes
O(N) additions, or O(N^2) total time for N digit numbers.
Karatsuba multiplication takes O(N^(lg 3)) = O(N^1.585)
time, using a divide-and-conquer algorithm. Here, x=x1*b + x0
and y=y1*b + y0, which is to say x1 and x0 are the high and low
halves of x. To multiply these, you might expand terms to do a
total of four half-size multiplications:
x*y = b^2*x1*y1 + b*(x1*y0 + x0*y1) + x0*y0
By contrast, Karatsuba calculates:
x*y = (b^2+b)*x1*y1 - b*(x1-x0)*(y1-y0) + (b+1)*x0*y0
This is algebraically equivalent, but the surprising fact is that it
breaks one N-digit multiplication down into just three N/2-digit
multiplications, one fewer than you might expect, at the cost of
several additional additions and subtractions. For numbers
longer than 300-600 bits, Karatsuba is faster than either the
binary approach above or a straightforward recursive approach.
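To make the three-multiply recursion concrete, here is a toy C++ sketch that splits machine words into 16-bit "digits"; real implementations recurse over big-integer digit arrays instead, and the 16-bit base and operand limit of 2^32 (so products fit in 64 bits) are my assumptions for illustration:

```cpp
#include <cstdint>

// Karatsuba sketch: three half-size multiplies instead of four.
// Valid for operands below 2^32, so the product fits in 64 bits.
uint64_t karatsuba(uint64_t x, uint64_t y) {
    const uint64_t b = 1ull << 16; // digit base: the split point
    if (x < b || y < b) return x * y; // single-digit base case
    uint64_t x1 = x >> 16, x0 = x & (b - 1); // high and low halves of x
    uint64_t y1 = y >> 16, y0 = y & (b - 1);
    uint64_t hi = karatsuba(x1, y1); // x1*y1
    uint64_t lo = karatsuba(x0, y0); // x0*y0
    // recurse on the magnitudes of (x1-x0), (y1-y0); track the sign
    uint64_t dx = x1 > x0 ? x1 - x0 : x0 - x1;
    uint64_t dy = y1 > y0 ? y1 - y0 : y0 - y1;
    bool neg = (x1 > x0) != (y1 > y0); // sign of (x1-x0)*(y1-y0)
    uint64_t m = karatsuba(dx, dy);
    uint64_t mid = neg ? hi + lo + m : hi + lo - m; // = x1*y0 + x0*y1
    return (hi << 32) + (mid << 16) + lo;
}
```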
Really long numbers, on the order of 64K bits, can benefit from the
Schönhage–Strassen
algorithm, which uses a Fast Fourier transform to reach O(N
log N log log N) time. However, numbers this big are rarely
used in crypto.
Timing Comparison
For multiplying 64-bit numbers, here are the timings for a 64-bit
Sandy Bridge desktop:
Inputs       mul slow      mul binary   mul AND      mul HW
2,1          3.83 ns       4.13 ns      4.13 ns      3.54 ns
2,10         9.15 ns       7.65 ns      5.93 ns      3.54 ns
2,100        48.37 ns      12.01 ns     8.58 ns      3.55 ns
2,1000       314.58 ns     15.58 ns     10.71 ns     3.54 ns
2,10000      2968.24 ns    20.93 ns     13.98 ns     3.55 ns
2,100000     29560.18 ns   25.01 ns     16.52 ns     3.54 ns
(Try this in NetRun now!)
Note how the slow multiply is bad and getting worse, the fast
algorithms grow only logarithmically with the second operand, and the
hardware multiply runs in constant time.