# Algorithms for Multiplication and Exponentiation

CS 463/480 Lecture, Dr. Lawlor

There are quite a few different ways to do multiplication.  The simplest is to use the compiler's "*" operator, but there are a number of situations where you might want to build something yourself.  For example, you might want finite field multiplication (which has a different "add" operation), big integer arithmetic (calling an "add big integers" function), or modular arithmetic (reducing by the modulus each time).

## Exponential-Time Multiply

The simplest approach is just to loop over one of the operands and add:
```
var mul_slow(var A,var B) {
    var product=0;
    for (var i=0;i<B;i++) product+=A;
    return product;
}
```
This has terrible performance, since it uses O(2^B) additions for a number with B bits--it's as slow as brute forcing a key!

If you replace the "+=" with "*=" and initialize the product to 1, you can transform this slow multiply into an equally slow exponentiation:

```
var exp_slow(var A,var B) {
    var product=1;
    for (var i=0;i<B;i++) product*=A;
    return product;
}
```
It's still exponentially slow, though.  (Nonetheless, a surprising number of people use this algorithm.)

## Binary / Peasant Multiply

You can do exponentially better with the famous "fast multiplication algorithm", also known as the "Peasant Algorithm" (it was evidently used by Russian peasants, using even/odd to identify the value of the low bit).
```
var mul_binary(var A,var B) {
    var product=0;
    for (var bit=0;B>=(1<<bit);bit++)
    {
        if (B&(1<<bit))
            product+=A; // include A in product
        A+=A; // left shift A by 1 bit
    }
    return product;
}
```
This is now O(B) additions for a number with B bits, which is exponentially better than before.

Once again, replacing "+" with "*" gives the
fast exponentiation by squaring trick:
```
var exp_binary(var A,var B) {
    var product=1;
    for (var bit=0;B>=(1<<bit);bit++)
    {
        if (B&(1<<bit))
            product*=A; // include A in product
        A*=A; // double exponent on A
    }
    return product;
}
```

On most modern computers, the "if" above gives fairly poor performance (the branch prediction is... unpredictable).  We can do about 30% better using a bit mask to zero out the contribution to the product.  This will also eliminate a side channel attack based on timing the branches.

Here, we build the mask using a clever trick with signed numbers: negation converts 0,1 to 0,0xffffffff...

```
var mul_AND(var A,var B) {
    var product=0;
    var Bbit=B; // == B>>bit
    for (var bit=0; Bbit!=0; bit++)
    {
        var mask=-(Bbit&1); // -0 == 0; -1 == all ones
        product+=A&mask; // optionally include in product

        Bbit=Bbit>>1; // move to next bit of B
        A+=A; // move to next bit of A
    }
    return product;
}
```
There is one difficulty in converting to exponentiation: if the mask is zero, the above will eliminate the entire product.  We can use the famous(?) bitwise selection technique to multiply by 1 if the mask is zero:
```
var exp_AND(var A,var B) {
    var product=1;
    var Bbit=B; // == B>>bit
    for (var bit=0; Bbit!=0; bit++)
    {
        var mask=-(Bbit&1); // -0 == 0; -1 == all ones
        product*=(A&mask)|(1&~mask); // multiply by A, or by 1 if mask is zero

        Bbit=Bbit>>1; // move to next bit of B
        A*=A; // move to next bit of A
    }
    return product;
}
```

## Hardware Multiply

Any reasonably big modern CPU has a hardware multiply instruction, of course.  It's typically very fast, only a few clock cycles latency, and is often the same speed regardless of the values being multiplied.
```
var multiply_hardware(var A,var B) {
    return A*B;
}
```
The only real downside is that the hardware multiply circuit can only handle 32- or 64-bit numbers, and we often need bigger numbers than this.

## Karatsuba Multiplication

For multiprecision arithmetic, where we loop over smaller digits to perform arithmetic operations, addition takes O(N) time for an N digit number.  Multiplication via the binary methods above takes O(N) additions, or O(N^2) total time for N digit numbers.

Karatsuba multiplication takes O(N^(lg 3)) = O(N^1.585) time, using a divide-and-conquer algorithm.  Here x=x1*b + x0 and y=y1*b + y0, which is to say x1 and x0 are the high and low halves of x.  To multiply these, you might expand terms to do a total of four half-size multiplications:
x*y = b^2*x1*y1 + b*(x1*y0 + x0*y1) + x0*y0

By contrast, Karatsuba calculates:
x*y = (b^2+b)*x1*y1 - b*(x1-x0)*(y1-y0) + (b+1)*x0*y0

This is algebraically equivalent, but the surprising fact is that it breaks one N-digit multiplication into just three N/2-digit multiplications (x1*y1, (x1-x0)*(y1-y0), and x0*y0), one fewer than you might expect.  The price is several additional additions and subtractions.  For numbers longer than roughly 300-600 bits, Karatsuba is faster than either the binary approach above or a straightforward recursive expansion.

Really long numbers, on the order of 64K bits, can benefit from the Schönhage–Strassen algorithm, which uses a Fast Fourier transform to reach O(N log N log log N) time.  However, numbers this big are rarely used in crypto.

## Timing Comparison

For multiplying 64-bit numbers, here are the timings for a 64-bit Sandy Bridge desktop:
```
Method mul slow:
3.83 ns for 2,1
9.15 ns for 2,10
48.37 ns for 2,100
314.58 ns for 2,1000
2968.24 ns for 2,10000
29560.18 ns for 2,100000

Method mul binary:
4.13 ns for 2,1
7.65 ns for 2,10
12.01 ns for 2,100
15.58 ns for 2,1000
20.93 ns for 2,10000
25.01 ns for 2,100000

Method mul AND:
4.13 ns for 2,1
5.93 ns for 2,10
8.58 ns for 2,100
10.71 ns for 2,1000
13.98 ns for 2,10000
16.52 ns for 2,100000

Method mul HW:
3.54 ns for 2,1
3.54 ns for 2,10
3.55 ns for 2,100
3.54 ns for 2,1000
3.55 ns for 2,10000
3.54 ns for 2,100000
```
(Try this in NetRun now!)

Note how the slow multiply is bad and getting worse, the fast algorithms grow logarithmically slowly, and the hardware is constant time.