Think about what's happening inside a little loop:

for (i=0;i<max;i++) sum=sum+(val++);

This takes 1.1ns/iteration, and assembles into:

1a: add %ecx,%eax          // start: sum+=val
1c: inc %ecx               // val++
1d: inc %edx               // i++
1e: cmp %ebx,%edx          // if i<max...
20: jb 1a <do_it()+0x1a>   // ...goto start

Each time through the loop, we do the work (the first two instructions), then increment the loop variable, compare it against the loop limit, and do a conditional jump. All that looping overhead might take longer than doing the work itself!
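Wrapped up as a complete function, the simple loop looks like this (a sketch; the function name, parameter types, and starting values are my own additions, since the text leaves the surrounding declarations implicit):

```c
#include <assert.h>

/* The simple summation loop, as a self-contained function.
   (The name sum_loop and the unsigned parameter types are assumptions.) */
unsigned int sum_loop(unsigned int val, unsigned int max) {
    unsigned int sum = 0, i;
    for (i = 0; i < max; i++) sum = sum + (val++);
    return sum;
}
```

Having it as a function makes it easy to compare the unrolled variants below against this straightforward version.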

If we instead do two additions per trip through the loop:

for (i=0;i<max;i+=2) {
	sum=sum+(val++);
	sum=sum+(val++);
}

this only takes 0.7ns per addition, for about a 35% speedup--not bad! This is called "loop unrolling".

If we unroll the loop yet again, like:

for (i=0;i<max;i+=4) {
	sum=sum+(val++);
	sum=sum+(val++);
	sum=sum+(val++);
	sum=sum+(val++);
}

we now take only 0.6ns per addition. Clearly, we're approaching the point of diminishing returns here.

One problem: the unrolled loops above only work when max is a multiple of the unroll factor. We can fix this with a separate "cleanup" loop:

enum {nUnroll=4};
unsigned int nLoop=max/nUnroll; /* round down! */
unsigned int nLeft=max%nUnroll;
for (i=0;i<nLoop;i++) { /* Unrolled main loop */
	sum=sum+(val++);
	sum=sum+(val++);
	sum=sum+(val++);
	sum=sum+(val++);
}
for (i=0;i<nLeft;i++) { /* Cleanup loop to handle leftovers */
	sum=sum+(val++);
}

The cleanup phase allows us to work for any number of iterations--not just multiples of 4. If this is called with many iterations, the cleanup loop won't be a performance bottleneck, because most of the time will be spent in the (faster) main loop.
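As a quick sanity check, the unrolled-plus-cleanup version can be packaged as a function and compared against the plain loop for iteration counts that aren't multiples of 4 (a sketch; the function names are mine, not from the original):

```c
#include <assert.h>

/* Plain reference loop. */
static unsigned int sum_simple(unsigned int val, unsigned int max) {
    unsigned int sum = 0, i;
    for (i = 0; i < max; i++) sum = sum + (val++);
    return sum;
}

/* Unrolled-by-4 main loop plus a cleanup loop for the leftovers. */
static unsigned int sum_unrolled(unsigned int val, unsigned int max) {
    enum { nUnroll = 4 };
    unsigned int nLoop = max / nUnroll; /* round down! */
    unsigned int nLeft = max % nUnroll;
    unsigned int sum = 0, i;
    for (i = 0; i < nLoop; i++) { /* unrolled main loop */
        sum = sum + (val++);
        sum = sum + (val++);
        sum = sum + (val++);
        sum = sum + (val++);
    }
    for (i = 0; i < nLeft; i++) /* cleanup loop */
        sum = sum + (val++);
    return sum;
}
```

Looping max from 0 upward and asserting `sum_unrolled(v, max) == sum_simple(v, max)` exercises every possible leftover count (0 through 3).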


A sneakier way to handle the leftover iterations is to jump straight into the middle of the unrolled loop body:

enum {nUnroll=4};
unsigned int nLoop=max/nUnroll; /* must round down! */
unsigned int nLeft=max%nUnroll;
i=0; /* main loop initialization */
switch (nLeft)
for (;i<=nLoop;i++) { /* Unrolled main loop */
	sum=sum+(val++);
case 3:	sum=sum+(val++);
case 2:	sum=sum+(val++);
case 1:	sum=sum+(val++);
case 0: ;
}

For example, when called with "max==6", we need to do 1 unrolled-by-4 iteration, and 2 cleanup iterations. So "nLoop==6/4==1" and "nLeft==6%4==2". We start by jumping into the loop at "case 2", where we fall through to execute the two cleanup iterations. We then increment i, and because of the <= test, we go around one more time for the main loop.

Compare this to the above, and you'll see it should execute exactly the same code in each case. If you unroll a power of two times, the compiler will use fast bit operations to compute nLoop and nLeft (instead of slow divide and mod operations).

This switch/loop hybridization is due to Tom Duff, and is called "Duff's Device". The first page of Google hits gives you the excellent Jargon File entry, Duff's 1984 newsgroup writeup, Duff's updated 1988 response to various rather silly criticisms, and an example of how to hide Duff's Device inside a macro. Unlike Duff's original version, the above even works when the number of desired iterations is zero. Duff's version compiles without warnings, though...
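Here is the Duff's-Device-style cleanup packaged as a compilable function (a sketch; the function wrapper and the name sum_duff are my own additions):

```c
#include <assert.h>

/* Duff's Device: the switch jumps into the middle of the unrolled
   loop body, so the nLeft leftover additions happen on the first
   (partial) trip, and every later trip does all four.
   (Function wrapper and names are assumptions, not from the text.) */
unsigned int sum_duff(unsigned int val, unsigned int max) {
    enum { nUnroll = 4 };
    unsigned int nLoop = max / nUnroll; /* must round down! */
    unsigned int nLeft = max % nUnroll;
    unsigned int sum = 0, i = 0;
    switch (nLeft)
    for (; i <= nLoop; i++) { /* unrolled main loop */
            sum = sum + (val++);
    case 3: sum = sum + (val++);
    case 2: sum = sum + (val++);
    case 1: sum = sum + (val++);
    case 0: ;
    }
    return sum;
}
```

Tracing "max==6" as in the text: nLoop==1, nLeft==2, so the switch lands on "case 2", does two additions, then the for loop runs one full four-addition pass. And "max==0" lands on "case 0", increments i to 1, fails the i<=nLoop test, and exits with zero additions done.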

Unrolling doesn't always help, though. This multiply loop:

for (i=0;i<max;i++) sum=sum*(val++);

and the unrolled-by-4 version:

for (i=0;i<max;i+=4) {
	sum=sum*(val++);
	sum=sum*(val++);
	sum=sum*(val++);
	sum=sum*(val++);
}

both have exactly the same performance (3.6ns/iteration), because they both have to wait for the previous multiply (a long-latency operation) to complete before they can start work on the next multiply.

However, we can remove the multiply bottleneck and dramatically speed this up by splitting the "sum" into separate variables like this:


unsigned int sumA=1, sumB=1, sumC=1, sumD=1;
for (i=0;i<max;i+=4) {
	sumD=sumD*(val++);
	sumC=sumC*(val++);
	sumB=sumB*(val++);
	sumA=sumA*(val++);
}
return (int)sumA*sumB*sumC*sumD;

This should give exactly the same result as before, since we're doing the exact same multiplications, just in a different order. But the performance of this version is just 0.9ns/iteration, which is 4x faster than the unsplit version above, because now all 4 multiplications can (and do) proceed simultaneously!
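We can check that the split-accumulator trick really preserves the result: unsigned arithmetic wraps around on overflow, so the product is the same no matter how the factors are grouped. A sketch, with made-up function names:

```c
#include <assert.h>

/* Single-accumulator product: every multiply waits on the previous one. */
unsigned int prod_simple(unsigned int val, unsigned int max) {
    unsigned int sum = 1, i;
    for (i = 0; i < max; i++) sum = sum * (val++);
    return sum;
}

/* Four independent accumulators: the four multiplies per trip have no
   dependence on each other, so the hardware can overlap them.
   (Assumes max is a multiple of 4, as in the text's version.) */
unsigned int prod_split(unsigned int val, unsigned int max) {
    unsigned int sumA = 1, sumB = 1, sumC = 1, sumD = 1, i;
    for (i = 0; i < max; i += 4) {
        sumD = sumD * (val++);
        sumC = sumC * (val++);
        sumB = sumB * (val++);
        sumA = sumA * (val++);
    }
    return sumA * sumB * sumC * sumD;
}
```

The key design point is that sumA through sumD form four independent dependency chains, so each multiply only waits on the multiply four iterations back, not the one immediately before it.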

Loop unrolling has some drawbacks:

- You end up with multiple copies of the same code. This is always ugly to read and painful to write, especially if you have to transform accesses to the loop variable in each copy. The compiler flag "-funroll-loops" (which makes the compiler unroll automatically) may work better than manual unrolling.

- If you do any reasonable amount of work inside the loop, you'll
normally get at best a tiny improvement from unrolling, because the
hardware overlaps the loop overhead with other computations.

- Duff's device is even more obscene than normal if you use the loop index anywhere inside the loop, because a given line of the loop body can be executed normally or during the special cleanup pass.

On the other hand, unrolling has real benefits:

- Less time wasted updating the loop index, especially with small loops.
- Fewer potentially-costly branch instructions.

- More opportunities for the compiler & CPU to optimize the loop body, because branches frighten both the CPU and compiler.
- More opportunities to "untangle" dependencies.