So the 1980s roll on. Machines get deeper pipelines and better forwarding, until they can usually initiate a new instruction every clock cycle.

Eventually, even one cycle per instruction is not fast enough.

So as the 1990s arrived, designers began adding "wide" superscalar execution. They added multiple different "execution units", so more instructions can execute at the same time. For example, 1993's PowerPC 601 could issue one integer, one floating-point, and one load/store instruction every clock cycle.

But there's a problem with superscalar processors. Say you've got a loop body that consists of four integer instructions followed by four load/store instructions. Since there's only one integer unit, the first four clock cycles are spent executing the integer instructions. The first load/store can start at the same time as the last integer instruction, but then the load/store unit becomes the bottleneck, and the remaining loads and stores take an additional three clock cycles, for a total of seven clocks.

But if the four integer and four load/store instructions were interleaved, the loop could take just four clock cycles to execute, because both units would always be busy. So for a few years in the early 1990s, the *order* you wrote your code in could make a huge difference in performance.
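To see where the seven-versus-four cycle counts come from, here's a toy model (entirely my own sketch, not how any real issue logic works) of an in-order machine with one integer unit and one load/store unit: each clock cycle it issues instructions from the front of the stream until the next instruction needs a unit that's already busy that cycle.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: count clock cycles to issue a stream of
   instructions, 'I' = integer, 'M' = load/store (memory).
   One unit of each kind; in-order issue; an instruction stalls
   if its unit was already used this cycle. */
int cycles(const char *stream) {
    int n = (int)strlen(stream), i = 0, clock = 0;
    while (i < n) {
        int used_int = 0, used_mem = 0;   /* units consumed this cycle */
        clock++;
        while (i < n) {
            if (stream[i] == 'I' && !used_int)      { used_int = 1; i++; }
            else if (stream[i] == 'M' && !used_mem) { used_mem = 1; i++; }
            else break;   /* next instruction's unit is busy: next cycle */
        }
    }
    return clock;
}
```

Under this model, the grouped loop body `"IIIIMMMM"` takes 7 cycles, while the interleaved `"IMIMIMIM"` takes 4, matching the counts above.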

But then hardware designers realized that they *don't* have to execute the instructions in the exact order specified by software; any equivalent order that computes the same result will work just as well. Hence by the mid-1990s, "out-of-order execution" CPUs became dominant; today the CPU might look several dozen instructions ahead to determine which instructions it can safely begin.

Most of the "fake" dependencies--write-after-read (WAR) and write-after-write (WAW)--can be eliminated in hardware using register renaming (e.g., Tomasulo's Algorithm), leaving only the true read-after-write dependencies to limit performance.
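As a concrete illustration of what renaming buys you (this C sketch and its names are my own, not from any particular hardware), reusing a single temporary creates WAR and WAW hazards; giving each value its own name removes them without changing the results, which is what renaming does with physical registers:

```c
#include <assert.h>

/* Reusing one temporary `t` serializes the two products:
   the second write to t must wait for the first read (WAR)
   and overwrites the first write (WAW). */
void reused(int a, int b, int c, int d, int e, int f, int *x, int *y) {
    int t;
    t = a * b;    /* write t */
    *x = t + c;   /* read t  */
    t = d * e;    /* WAR with the read above, WAW with the first write */
    *y = t + f;
}

/* Same computation, but each value gets its own name, so the
   two products are independent and can run in parallel. */
void renamed(int a, int b, int c, int d, int e, int f, int *x, int *y) {
    int t1 = a * b;
    int t2 = d * e;
    *x = t1 + c;
    *y = t2 + f;
}
```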

There are several interesting transformations you can use to expose and reduce dependencies, such as software pipelining. A naive application of these is unlikely to provide much benefit, since the compiler and CPU can already do the naive transformations; but often you can exploit features of your algorithm to improve performance.
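For instance, a hand-software-pipelined loop can start the load for the next element while working on the current one. This sketch is my own example, under the assumption that the loads are the long-latency step; note the classic prologue / steady-state / epilogue shape:

```c
#include <assert.h>

/* Software-pipelined array sum: iteration i performs the add for
   element i and the load for element i+1, so the two can overlap. */
int sum_pipelined(const int *a, int n) {
    if (n <= 0) return 0;
    int sum = 0;
    int next = a[0];                  /* prologue: first load */
    for (int i = 0; i + 1 < n; i++) {
        int cur = next;
        sum += cur;                   /* compute for THIS iteration */
        next = a[i + 1];              /* load for the NEXT iteration */
    }
    sum += next;                      /* epilogue: last element */
    return sum;
}
```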

```c
unsigned int prod=1, n=12;
for (unsigned int i=1;i<=n;i++) {
	prod*=i;
}
return prod;
```

This takes 25ns per call on the NetRun 2.8GHz Pentium 4. Naive loop unrolling produces zero performance gain; basically the CPU is already doing this much:

```c
unsigned int prod=1, n=12;
for (unsigned int i=1;i<=n;i+=2) {
	prod*=i;
	prod*=(i+1);
}
return prod;
```

But if I now use my knowledge that integer multiplication is associative to fully separate the even and odd terms:

```c
unsigned int prodA=1, prodB=1, n=12;
for (unsigned int i=1;i<=n;i+=2) {
	prodA*=i;
	prodB*=(i+1);
}
return prodA*prodB;
```

This version has fewer true dependencies, and hence executes dramatically faster--in 18ns instead of 25ns!
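The two-accumulator trick leans entirely on that associativity claim, so it's worth a sanity check. In this sketch the function names `fact_naive` and `fact_split` are mine, and `fact_split` assumes n is even (as with n=12); both orderings compute 12! = 479001600.

```c
#include <assert.h>

/* Straight-line product 1*2*...*n, as in the original loop. */
unsigned int fact_naive(unsigned int n) {
    unsigned int prod = 1;
    for (unsigned int i = 1; i <= n; i++) prod *= i;
    return prod;
}

/* Odd terms into prodA, even terms into prodB; requires n even. */
unsigned int fact_split(unsigned int n) {
    unsigned int prodA = 1, prodB = 1;
    for (unsigned int i = 1; i <= n; i += 2) {
        prodA *= i;
        prodB *= (i + 1);
    }
    return prodA * prodB;
}
```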