Optimization: Loop Unrolling & Pipelined Execution

CS 301 Lecture, Dr. Lawlor, 2005/11/7

Think about what's happening inside a little loop:
	for (i=0;i<max;i++) sum=sum+(val++);
This takes 1.1ns/iteration, and assembles into:
  1a:  add    %ecx,%eax         // start: sum+=val
1c: inc %ecx // val++
1d: inc %edx // i++
1e: cmp %ebx,%edx // if i<max...
20: jb 1a <do_it()+0x1a> // ...goto start
Each time through the loop, we do the work (first two instructions), then increment the loop variable, compare it against the loop limit, and do a conditional jump.  All the looping overhead might take longer than doing the work!

Loop Unrolling

We can speed this loop up by doing more real work per iteration of the loop, like this:
	for (i=0;i<max;i+=2) {
sum=sum+(val++);
sum=sum+(val++);
}
This only takes 0.7ns per addition, for about a 35% speedup--not bad!  This is called "loop unrolling". 

If we unroll the loop yet again, like:
	for (i=0;i<max;i+=4) {
sum=sum+(val++);
sum=sum+(val++);
sum=sum+(val++);
sum=sum+(val++);
}
We now take only 0.6ns per addition.  Clearly, we're approaching the point of diminishing returns here.

Loop Unrolling--Cleanup Loop

What happens if "max" isn't a multiple of 4?  The above loop actually gives the *wrong* answer, because sum is always executed a multiple of 4 times.  The standard fix for this is to split out a "cleanup" loop to handle the leftover iterations one-by-one:
	enum {nUnroll=4};
unsigned int nLoop=max/nUnroll; /* round down! */
unsigned int nLeft=max%nUnroll;
for (i=0;i<nLoop;i++) { /* Unrolled main loop */
sum=sum+(val++);
sum=sum+(val++);
sum=sum+(val++);
sum=sum+(val++);
}
for (i=0;i<nLeft;i++) { /* Cleanup loop to handle leftovers */
sum=sum+(val++);
}
The cleanup phase allows us to work for any number of iterations--not just multiples of 4.  If this is called with many iterations, the cleanup loop won't be a performance bottleneck, because most of the time will be in the (faster) main loop.

Duff's Device--Merged Unroll & Cleanup

We can actually improve the cleanup process using the unholy switch/for hybrid described last class.  The idea is to use the unrolled loop *as* the cleanup loop, by just starting the unrolled loop in the middle, so the remainder of the main loop takes care of the cleanup.  For example:
(Executable NetRun Link)
	enum {nUnroll=4};
unsigned int nLoop=max/nUnroll; /* must round down! */
unsigned int nLeft=max%nUnroll;
i=0; /* main loop initialization */
switch (nLeft)
for (;i<=nLoop;i++) { /* Unrolled main loop */
sum=sum+(val++);
case 3: sum=sum+(val++);
case 2: sum=sum+(val++);
case 1: sum=sum+(val++);
case 0: ;
}
For example, when called with "max==6", we need to do 1 unrolled-by-4 iteration, and 2 cleanup iterations.  So "nLoop==6/4==1" and "nLeft==6%4==2".  We start by jumping into the loop at "case 2", where we fall through to execute the two cleanup iterations.  We then increment i, and because of the <= test, we go around one more time for the main loop.

Compare this to the above, and you'll see it should execute exactly the same code in each case.  If you unroll a power of two times, the compiler will use fast bit operations to compute nLoop and nLeft (instead of slow divide and mod operations).

This switch/loop hybridization is due to Tom Duff, and is called "Duff's Device".  The first page of Google hits give you the excellent Jargon File Entry, Duff's 1984 newsgroup writeup, Duff's updated 1988 response to various rather silly criticisms, and an example of how to hide Duff's Device inside a macro.  Unlike Duff's original version, the above even works when the number of desired iterations is zero.  Duff's version compiles without warnings, though...

Dependency-limited Loops

Some loops are dependency-limited.  For example,
	for (i=0;i<max;i++) sum=sum*(val++);
and the unrolled-by-4-version:
	for (i=0;i<max;i+=4) {
sum=sum*(val++);
sum=sum*(val++);
sum=sum*(val++);
sum=sum*(val++);
}
both have exactly the same performance (3.6ns/iteration), because they both have to wait for the previous multiply (a long-latency operation) to complete before they can start work on the next multiply.

However, we can remove the multiply bottleneck and dramatically speed this up by splitting the "sum" into separate variables like this:
(Executable NetRun Link)
	unsigned int sumA=1, sumB=1, sumC=1, sumD=1;
for (i=0;i<max;i+=4) {
sumD=sumD*(val++);
sumC=sumC*(val++);
sumB=sumB*(val++);
sumA=sumA*(val++);
}
return (int)sumA*sumB*sumC*sumD;
This should give exactly the same result as before, since we're doing the exact same multiplications in a different order.  But the performance of this version is just 0.9ns/iteration, which is 4x faster than the unsplit version above, because now all 4 multiplications can (and do) proceed simultaneously!

Why / Why Not Unroll Loops?

There are several drawbacks to loop unrolling generally:
The advantages are: