Superscalar, Out-of-Order, and Hazards
CS 441 Lecture, Dr. Lawlor
So the 1980's roll on. Machines get deeper pipelines and better
forwarding, until they can usually initiate a new instruction every
clock cycle.
Eventually, even one cycle per instruction is not fast enough.
So as the 1990's arrived, designers began adding "wide" superscalar
execution. They added different "execution units", so more
instructions can execute at the same time. For
example, 1994's PowerPC 601 could issue one integer, one floating-point, and one load/store instruction every clock cycle.
But there's a problem with superscalar processors. Say you've got
a loop that consists of four integer instructions, then four load/store
instructions. Since there's only one integer unit, the first 4
clock cycles are spent executing the integer instructions. The
first load/store can start at the same time as the last integer
instruction, but then the load/store unit becomes the bottleneck and
the loop takes an additional 3 clock cycles, for a total of 7 clocks.
But if your four integer and load/store instructions were interleaved,
the loop could take just 4 clock cycles to execute, because both units
would always be busy. So for a few years in the early 1990's,
the *order* you wrote your code could make a huge
difference in performance.
But then hardware designers realized that they *don't* have to execute
the instructions in the exact order specified by software; any
equivalent order that computes the same result will work just as
well. Hence by the mid-1990's "out-of-order execution"
CPUs became dominant; even today, the CPU might look several dozen
instructions ahead to determine which instructions it can safely
begin.
Data Dependencies
The biggest limit to superscalar execution, though, is data dependencies: read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW).
But you can eliminate most of the "fake" dependencies (WAR and WAW) using register renaming (i.e., Tomasulo's Algorithm).
There are several interesting transformations you can use to expose and reduce dependencies, such as Software Pipelining.
A naive application of these is unlikely to provide much benefit, since
the compiler and CPU already perform the simple transformations; but often
you can exploit properties of the algorithm that they cannot see to improve performance.
An Example
For example, here's a loop that computes 12 factorial:
unsigned int prod=1, n=12;
for (unsigned int i=1;i<=n;i++) {
    prod*=i;
}
return prod;
This takes 25ns per call on the NetRun 2.8GHz Pentium 4. Naive loop unrolling produces zero performance gain; basically the CPU is already doing this much:
unsigned int prod=1, n=12;
for (unsigned int i=1;i<=n;i+=2) {
    prod*=i;
    prod*=(i+1);
}
return prod;
But if I now use my knowledge that integer multiplication is associative (and commutative) to fully separate the even and odd terms:
unsigned int prodA=1, prodB=1, n=12;
for (unsigned int i=1;i<=n;i+=2) {
    prodA*=i;
    prodB*=(i+1);
}
return prodA*prodB;
This version has fewer true dependencies, and hence executes dramatically faster--in 18ns instead of 25ns!