So the 1980's roll on. Machines get deeper pipelines and better forwarding, until the machines usually initiate a new instruction every clock cycle. We saw in the last lecture how MIPS machines really do execute one instruction per clock--you could see this in the instruction timings, where most instruction cost only one additional clock cycle, unless something went wrong and it took two.

Eventually, even one cycle per instruction (CPI) is not fast enough. Sadly, one cycle per instruction is the theoretical limit of pipelining--it's the upper bound everybody's trying to reach by avoiding pipeline stalls, hazards, and flushes.

So as the 1990's arrived, designers began adding "wide" superscalar execution, where multiple instructions start every clock cycle. Nowadays, the metric is "instructions per cycle" (IPC), which can be substantially more than one.

To get multiple instructions running at once, you need different "execution units", so more arithmetic can execute at the same time. For example, 1994's PowerPC 601 could issue one integer, one floating-point, and one load/store instruction every clock cycle.

You can see superscalar effects in the instruction timings of any modern CPU--many instructions simply cost no time at all. For example, each these assembly programs take the exact same number of nanoseconds on my Pentium 4:

ret

(Try this in NetRun now!) -> Takes 4.4ns

mov eax,8

ret

(Try this in NetRun now!) -> Takes 4.4ns

mov ecx,7

mov eax,8

ret

(Try this in NetRun now!) -> Takes 4.4ns

mov ecx,7

mov eax,ecx

ret

(Try this in NetRun now!) -> Takes 4.4ns

mov ecx,7

add eax,ecx

ret

(Try this in NetRun now!) -> Takes 4.4ns

mov ecx,7

imul eax,ecx

ret

(Try this in NetRun now!) -> Takes 4.4ns

add ecx,7

imul eax,ecx

ret

(Try this in NetRun now!) -> Takes 4.4ns

mov ecx,2The limiting factor for this particular instruction sequence seems to just be the number of instructions--the Pentium 4's control unit can't handle three writes per clock cycle.

mov eax,3

mov edx,4

ret

You can avoid the substantial call/return overhead and see a little more timing detail using a tight loop like this one:

mov ecx,1000This takes about two clocks to go around each time. You can even add one more "add" instructions without changing the loop time (ah, superscalar!). Because the Pentium 4 uses both rising and trailing edges of the clock, a 2.8GHz Pentium 4 (like the main NetRun machine) can take half-integer multiples of the clock period, 0.357ns. So your timings are 0.357ns for one clock, 0.536ns for 1.5 clocks (like the loop above without the "add"), 0.714ns for 2.0 clocks (like the loop above), 0.893ns for 2.5 clocks, etc.

start:

add eax,1

dec ecx

jne start

ret

But you can reduce most of the "fake" dependencies (WAR and WAW) using register renaming (i.e., Tomasulo's Algorithm).

There are several interesting transformations you can use to expose and reduce dependencies, such as Software Pipelining. A naive application of these is unlikely to provide much benefit, since the compiler and CPU can do naive transformations already; but often you can exploit features of the algorithm to improve performance. Plus, CPU designers has recently been shedding their most complex features in favor of higher parallelism, so the machines of the future may look like the machines of the past!

But if your four integer and load/store instructions were interleaved, the loop could take just 4 clock cycles to execute, because both units would always be busy. So for a few years in the early 1990's, the *order* you wrote your code could make a huge difference in performance.

But then hardware designers realized that they *don't* have to execute the instructions in the exact order specified by software; any equivalent order that computes the same result will work just as well. Hence by the mid-1990's "out-of-order execution" CPUs became dominant; until today the CPU might look a hundred instructions ahead to determine which instructions it can safely begin.

unsigned int prod=1, n=12;This takes 25ns per call on the NetRun 2.8GHz Pentium 4. Naive loop unrolling produces zero performance gain; basically the CPU is already doing this much:

for (unsigned int i=1;i<=n;i++) {

prod*=i;

}

return prod;

unsigned int prod=1, n=12;But if I now use my knowledge that integer multiplication is associative to fully separate even and odd terms:

for (unsigned int i=1;i<=n;i+=2) {

prod*=i;

prod*=(i+1);

}

return prod;

unsigned int prodA=1, prodB=1, n=12;This version has fewer true dependencies, and hence executes substantially faster--in 18ns instead of 25ns! You can keep unrolling the loop, adding prodC and prodD, exposing more and more parallelism, until you eventually hit resource limits: not enough registers, not enough multiply units, not enough addition units, and your performance improvement eventually drops off.

for (unsigned int i=1;i<=n;i+=2) {

prodA*=i;

prodB*=(i+1);

}

return prodA*prodB;