Fast Multiplication/Exponentiation Algorithm

CS 463 Lecture, Dr. Lawlor

There are quite a few different ways to do multiplication if the hardware doesn't support what you want.  For example, you might want finite field multiplication (which has a different "add" operation), big integer arithmetic (calling an "add big integers" function), or modular arithmetic (reducing by the modulus each time). 

The simplest approach is just to loop over one of the operands and add:
var multiply_slow(var A,var B) { // "var" is an integer type, e.g. typedef long var;
	var product=0;
	for (var i=0;i<B;i++) product+=A;
	return product; // (no int cast: that would truncate a 64-bit var)
}
This has terrible performance, since it's O(B). You can do better with the famous "fast multiplication algorithm", also known as the "Peasant Algorithm" (it was evidently used by Russian peasants). 
var multiply_fast_IF(var A,var B) {
	var product=0;
	for (unsigned int bit=0;B>=(1ul<<bit);bit++)
		if (B&(1ul<<bit)) /* 1ul, so the test still works past bit 31 */
			product+=A<<bit;
	return product;
}
This is now O(ln B), exponentially better than before.  If you really care about performance, you can swap A and B if A is smaller.  Replacing "+" with "*" gives the fast exponentiation by squaring trick, which turns out to be critical to getting good performance from RSA.

On most modern computers, the "if" above gives fairly poor performance (the branch prediction is... unpredictable).  We can do about 30% better using a bit mask to zero out the A<<bit part.  Here, we use the old "shift up to the sign bit, then shift down across all bits" trick to splat the mask from a single bit across all the bits.
var multiply_fast_AND(var A,var B) {
	var product=0;
	var Bbit=B; // == B>>bit
	for (unsigned int bit=0; Bbit!=0; bit++)
	{
		long mask=Bbit<<(8*sizeof(var)-1); // move low bit to high (sign) bit
		mask=mask>>(8*sizeof(var)-1); // signed shift: splat across all bits
		product+=(A<<bit)&mask; // optionally include in product
		Bbit=Bbit>>1; // move to next bit of B
	}
	return product;
}
Any reasonably big modern CPU has a hardware multiply instruction, of course.  It's typically very fast, only a few clock cycles latency, and the same speed regardless of the values being multiplied.
var multiply_hardware(var A,var B) { 
	return A*B;
}
Here are the timings (ns/call) for a 64-bit Sandy Bridge desktop:

                        B=1    B=10    B=100   B=1000   B=10000
slow multiply:         2.36   10.32   168.24  1906.62  19272.32
fast multiply (if):    3.54    8.13    13.26    19.61     25.16
fast multiply (and):   2.36    5.51     9.15    12.82     17.68
hardware multiply:     1.77    1.77     1.77     1.77      1.77
Note how the slow multiply grows linearly in B, the fast algorithms grow only logarithmically, and the hardware multiply is constant time.

In the official description of AES, we do Galois field multiplication instead of normal multiplication in two places: filling the S-box, and in the affine part of MixColumns.  So which multiplication algorithm does a typical high-performance implementation, like PolarSSL's AES, use?

None!  The S-box affine transformation is just baked into the S-box table's values, and the multiply portion of MixColumns' affine transformation is baked into separate per-row copies of the substitution table, then Galois added (XORed) at runtime.  The "aes_gen_tables" function (for the !POLARSSL_AES_ROM_TABLES case) barely does multiplication either, instead computing a table of exponentials and logs, then computing exp[log[v1]+log[v2]].

There are definite tradeoffs involved in using tables instead of doing arithmetic.