Algorithms for Multiplication and Exponentiation

CS 463/480 Lecture, Dr. Lawlor

There are quite a few different ways to do multiplication.  The simplest is to use the compiler's "*" operator, but there are a number of situations where you might want to build something yourself.  For example, you might want finite field multiplication (which has a different "add" operation), big integer arithmetic (calling an "add big integers" function), or modular arithmetic (reducing by the modulus each time). 
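For example, reducing by the modulus after every addition might look like the following C++ sketch (the name mul_mod_slow and the uint64_t types are my own illustrative choices, not part of this lecture's code; like the slow loop below, it is for explanation, not speed):

```cpp
#include <cstdint>

// Sketch: A*B mod m, reducing by the modulus after every add
// so the running sum never exceeds m (assumes m < 2^63).
uint64_t mul_mod_slow(uint64_t A, uint64_t B, uint64_t m) {
	uint64_t product=0;
	A%=m;
	for (uint64_t i=0;i<B;i++) product=(product+A)%m; // reduce each time
	return product;
}
```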

Exponential-Time Multiply

The simplest approach is just to loop over one of the operands and add:
var mul_slow(var A,var B) {
	var product=0;
	for (var i=0;i<B;i++) product+=A;
	return product;
}
This has terrible performance, since it uses O(2^B) additions for a number with B bits--it's as slow as brute forcing a key!

If you replace the "+=" with "*=" and initialize the product to 1, you can transform this slow multiply into an equally slow exponentiation. 

var exp_slow(var A,var B) {
	var product=1;
	for (var i=0;i<B;i++) product*=A;
	return product;
}
It's still exponentially slow, though.  (Nonetheless, a surprising number of people use this algorithm).

Binary / Peasant Multiply

You can do exponentially better with the famous "fast multiplication algorithm", also known as the "Peasant Algorithm" (it was evidently used by Russian peasants, using even/odd to identify the value of the low bit). 
var mul_binary(var A,var B) {
	var product=0;
	for (var bit=0;B>=(1<<bit);bit++)
	{
		if (B&(1<<bit))
			product+=A; // include A in product
		A+=A; // left shift A by 1 bit
	}
	return product;
}
This is now O(B) additions for a number with B bits, which is exponentially better than before. 

Once again, replacing "+" with "*" gives the fast exponentiation-by-squaring trick:
var exp_binary(var A,var B) {
	var product=1;
	for (var bit=0;B>=(1<<bit);bit++)
	{
		if (B&(1<<bit))
			product*=A; // include A in product
		A*=A; // double exponent on A
	}
	return product;
}
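In crypto, this square-and-multiply loop is nearly always combined with modular reduction, as in RSA or Diffie-Hellman. A hedged C++ sketch (the name exp_mod and the types are mine, not from the lecture code; it assumes m < 2^32 so the products cannot overflow 64 bits):

```cpp
#include <cstdint>

// Sketch: A^B mod m by squaring, reducing after every multiply.
// Assumes m < 2^32, so product*A and A*A fit in 64 bits.
uint64_t exp_mod(uint64_t A, uint64_t B, uint64_t m) {
	uint64_t product=1%m;
	A%=m;
	while (B!=0) {
		if (B&1) product=(product*A)%m; // include A in product
		A=(A*A)%m; // square A, mod m
		B>>=1; // move to next bit of B
	}
	return product;
}
```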

Masked Binary Multiply

On most modern computers, the "if" above gives fairly poor performance (the branch prediction is... unpredictable).  We can do about 30% better using a bit mask to zero out the contribution to the product.  This will also eliminate a side channel attack based on timing the branches. 

Here, we build the mask using a clever trick with signed numbers: negation converts 0,1 to 0,0xffffffff...

var mul_AND(var A,var B) {
	var product=0;
	var Bbit=B; // == B>>bit
	for (var bit=0; Bbit!=0; bit++)
	{
		var mask=-(Bbit&1); // -0 == 0; -1 == all ones
		product+=A&mask; // optionally include in product
		
		Bbit=Bbit>>1; // move to next bit of B
		A+=A; // move to next bit of A
	}
	return product;
}
There is one difficulty in converting to exponentiation: if the mask is zero, the above will eliminate the entire product.  We can use the famous(?) bitwise selection technique to multiply by 1 if the mask is zero:
var exp_AND(var A,var B) {
	var product=1;
	var Bbit=B; // == B>>bit
	for (var bit=0; Bbit!=0; bit++)
	{
		var mask=-(Bbit&1); // -0 == 0; -1 == all ones
		product*=(A&mask)|(1&~mask); // A if mask, 1 if not mask
		Bbit=Bbit>>1; // move to next bit of B
		A*=A; // move to next bit of A
	}
	return product;
}

Hardware Multiply

Any reasonably big modern CPU has a hardware multiply instruction, of course.  It's typically very fast, only a few clock cycles latency, and is often the same speed regardless of the values being multiplied.
var multiply_hardware(var A,var B) { 
	return A*B;
}
The only real downside is that the hardware multiply circuit can only handle 32- or 64-bit numbers, and in cryptography we often need much bigger numbers than this.
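One common workaround for the first step past 64 bits is the unsigned __int128 type, a GCC/Clang compiler extension (not standard C++); the name mul_full below is my own illustrative choice:

```cpp
#include <cstdint>

// Sketch: full 64x64 -> 128 bit multiply via a GCC/Clang
// extension, returning the high and low 64-bit halves.
void mul_full(uint64_t A, uint64_t B, uint64_t *hi, uint64_t *lo) {
	unsigned __int128 p=(unsigned __int128)A*B;
	*lo=(uint64_t)p; // low 64 bits of the product
	*hi=(uint64_t)(p>>64); // high 64 bits of the product
}
```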

Karatsuba Multiplication

For multiprecision arithmetic, where we loop over smaller digits to perform arithmetic operations, addition takes O(N) time for an N-digit number.  Multiplication via the binary methods above takes O(N) additions, or O(N^2) total time for N-digit numbers.

Karatsuba multiplication takes O(N^(lg 3)) = O(N^1.585) time, using a divide-and-conquer algorithm.  Here, x=x1*b + x0 and y=y1*b + y0, which is to say x1 and x0 are the high and low halves of x.  To multiply these, you might expand terms to do a total of four half-size multiplications:
  x*y = b^2*x1*y1 + b*(x1*y0 + x0*y1) + x0*y0

By contrast, Karatsuba calculates:
  x*y = (b^2+b)*x1*y1 - b*(x1-x0)*(y1-y0) + (b+1)*x0*y0

This is algebraically equivalent, but the surprising fact is that it breaks one N-digit multiplication down into just three N/2-digit multiplications, one fewer than you might expect.  The price is several additional additions and subtractions.  For numbers longer than about 300-600 bits, Karatsuba is faster than either the binary approach above or a straightforward four-multiply recursion.
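As a concrete (hedged) illustration, here is one level of the recurrence in C++ with b = 2^32, splitting a 64-bit multiply into three 32x32 multiplies; a real multiprecision implementation would recurse on arrays of digits, and the function name is my own:

```cpp
#include <cstdint>

// Sketch: one level of Karatsuba with b = 2^32.
// Computes the low 64 bits of x*y using three multiplies.
// All arithmetic is unsigned, so wraparound is harmless mod 2^64.
uint64_t karatsuba64(uint64_t x, uint64_t y) {
	uint64_t x1=x>>32, x0=x&0xffffffffu; // high and low halves of x
	uint64_t y1=y>>32, y0=y&0xffffffffu; // high and low halves of y
	uint64_t hh=x1*y1; // high*high
	uint64_t ll=x0*y0; // low*low
	uint64_t mid=hh+ll-(x1-x0)*(y1-y0); // == x1*y0 + x0*y1 (mod 2^64)
	return (mid<<32)+ll; // the b^2*hh term overflows away mod 2^64
}
```

The result matches the low 64 bits of the hardware product, so it can be checked directly against the "*" operator.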

Really long numbers, on the order of 64K bits, can benefit from the Schönhage–Strassen algorithm, which uses a Fast Fourier transform to reach O(N log N log log N) time.  However, numbers this big are rarely used in crypto.

Timing Comparison

For multiplying 64-bit numbers, here are the timings for a 64-bit Sandy Bridge desktop:
Method mul slow:
3.83 ns for 2,1
9.15 ns for 2,10
48.37 ns for 2,100
314.58 ns for 2,1000
2968.24 ns for 2,10000
29560.18 ns for 2,100000

Method mul binary:
4.13 ns for 2,1
7.65 ns for 2,10
12.01 ns for 2,100
15.58 ns for 2,1000
20.93 ns for 2,10000
25.01 ns for 2,100000

Method mul AND:
4.13 ns for 2,1
5.93 ns for 2,10
8.58 ns for 2,100
10.71 ns for 2,1000
13.98 ns for 2,10000
16.52 ns for 2,100000

Method mul HW:
3.54 ns for 2,1
3.54 ns for 2,10
3.55 ns for 2,100
3.54 ns for 2,1000
3.55 ns for 2,10000
3.54 ns for 2,100000
(Try this in NetRun now!)

Note how the slow multiply is bad and getting worse, the fast algorithms grow only logarithmically with the value, and the hardware multiply is constant time.