Optimization: Loop Unrolling & Pipelined Execution
CS 301 Lecture, Dr. Lawlor, 2005/11/7
Think about what's happening inside a little loop:
for (i=0;i<max;i++) sum=sum+(val++);
This takes 1.1ns/iteration, and assembles into:
1a: add %ecx,%eax // start: sum+=val
1c: inc %ecx // val++
1d: inc %edx // i++
1e: cmp %ebx,%edx // if i<max...
20: jb 1a <do_it()+0x1a> // ...goto start
Each time through the loop, we do the work (first two instructions),
then increment the loop variable, compare it against the loop limit,
and do a conditional jump. All the looping overhead might take
longer than doing the work!
Loop Unrolling
We can speed this loop up by doing more real work per iteration of the loop, like this:
for (i=0;i<max;i+=2) {
sum=sum+(val++);
sum=sum+(val++);
}
This only takes 0.7ns per addition, for about a 35% speedup--not bad! This is called "loop unrolling".
If we unroll the loop yet again, like:
for (i=0;i<max;i+=4) {
sum=sum+(val++);
sum=sum+(val++);
sum=sum+(val++);
sum=sum+(val++);
}
We now take only 0.6ns per addition. Clearly, we're approaching the point of diminishing returns here.
Loop Unrolling--Cleanup Loop
What happens if "max" isn't a multiple of 4? The above loop
actually gives the *wrong* answer, because sum is always executed a
multiple of 4 times. The standard fix for this is to split out a
"cleanup" loop to handle the leftover iterations one-by-one:
enum {nUnroll=4};
unsigned int nLoop=max/nUnroll; /* round down! */
unsigned int nLeft=max%nUnroll;
for (i=0;i<nLoop;i++) { /* Unrolled main loop */
sum=sum+(val++);
sum=sum+(val++);
sum=sum+(val++);
sum=sum+(val++);
}
for (i=0;i<nLeft;i++) { /* Cleanup loop to handle leftovers */
sum=sum+(val++);
}
The cleanup phase allows us to work for any number of iterations--not
just multiples of 4. If this is called with many iterations, the
cleanup loop won't be a performance bottleneck, because most of the
time will be in the (faster) main loop.
Duff's Device--Merged Unroll & Cleanup
We can actually improve the cleanup process using the unholy switch/for
hybrid described last class. The idea is to use the unrolled loop
*as* the cleanup loop, by just starting the unrolled loop in the
middle, so the remainder of the main loop takes care of the
cleanup. For example:
(Executable NetRun Link)
enum {nUnroll=4};
unsigned int nLoop=max/nUnroll; /* must round down! */
unsigned int nLeft=max%nUnroll;
i=0; /* main loop initialization */
switch (nLeft)
for (;i<=nLoop;i++) { /* Unrolled main loop */
sum=sum+(val++);
case 3: sum=sum+(val++);
case 2: sum=sum+(val++);
case 1: sum=sum+(val++);
case 0: ;
}
For example, when called with "max==6", we need to do 1 unrolled-by-4
iteration, and 2 cleanup iterations. So "nLoop==6/4==1" and
"nLeft==6%4==2". We start by jumping into the loop at "case 2",
where we fall through to execute the two cleanup iterations. We
then increment i, and because of the <= test, we go around one more
time for the main loop.
Compare this to the above, and you'll see it should execute exactly the
same code in each case. If you unroll a power of two times, the
compiler will use fast bit operations to compute nLoop and nLeft
(instead of slow divide and mod operations).
This switch/loop hybridization is due to Tom Duff, and is called
"Duff's Device". The first page of Google hits give you the
excellent Jargon File Entry, Duff's 1984 newsgroup writeup, Duff's updated 1988 response to various rather silly criticisms, and an example of how to hide Duff's Device inside a macro.
Unlike Duff's original version, the above even works when the number of
desired iterations is zero. Duff's version compiles without
warnings, though...
Dependency-limited Loops
Some loops are dependency-limited. For example,
for (i=0;i<max;i++) sum=sum*(val++);
and the unrolled-by-4-version:
for (i=0;i<max;i+=4) {
sum=sum*(val++);
sum=sum*(val++);
sum=sum*(val++);
sum=sum*(val++);
}
both have exactly the same performance (3.6ns/iteration), because they
both have to wait for the previous multiply (a long-latency operation)
to complete before they can start work on the next multiply.
However, we can remove the multiply bottleneck and dramatically speed
this up by splitting the "sum" into separate variables like this:
(Executable NetRun Link)
unsigned int sumA=1, sumB=1, sumC=1, sumD=1;
for (i=0;i<max;i+=4) {
sumD=sumD*(val++);
sumC=sumC*(val++);
sumB=sumB*(val++);
sumA=sumA*(val++);
}
return (int)sumA*sumB*sumC*sumD;
This should give exactly the same result as before, since we're doing
the exact same multiplications in a different order. But the
performance of this
version is just 0.9ns/iteration, which is 4x faster than the unsplit
version above, because now all 4 multiplications can (and do) proceed
simultaneously!
Why / Why Not Unroll Loops?
There are several drawbacks to loop unrolling generally:
- You end up with multiple copies of the same code. This is
always ugly to read and painful to write, but especially if you have to
transform your accesses to the loop variable. The
"-funroll-loops" flag (so the compiler unrolls automatically)may work better than manual unrolling.
- If you do any reasonable amount of work inside the loop, you'll
normally get at best a tiny improvement from unrolling, because the
hardware overlaps the loop overhead with other computations.
- Duff's device is even more obscene than normal if you use the
loop index anywhere inside the loop, because a given line of the loop
body can be executed normally or during the special cleanup pass.
The advantages are:
- Less time wasted updating loop index, especially with small loops.
- Fewer potentially-costly branch instructions.
- More opportunities for the compiler & CPU to optimize the loop body, because branches frighten both the CPU and compiler.
- More opportunities to "untangle" dependencies.