Pipelined and Superscalar Computation, with Dependencies
CS 641 Lecture, Dr. Lawlor
Pipelining
Back in 1985, around the era of the Intel 80386, each CPU instruction
(like "mov") had a "cycle count", which was just the total number of
CPU clock cycles that the instruction took from start to finish.
You could look this up right in the Intel documentation. "mov" took one
clock cycle, "add" took one clock cycle, "mul" took 30 clock cycles,
etc.
This whole cycle-count scheme broke down around 1989 with the 486
processor. The problem is that the 486 was actually able to start
working on the next instruction
before the first instruction was finished. This idea
of working on several things at once ("parallelism") is amazingly
important for performance today, but it started very small--the 486
could add *and* jump
simultaneously.
The idea is for the CPU to treat its various arithmetic circuits the
same way you'd treat a washer and dryer. For the sake of
argument, assume you've got:
- Five loads of laundry A B C D E
- They all need to be washed, dried, and folded
- Load E's already clean, and just needs to be dried and folded.
- Each load takes an hour to wash, dry, or fold
Your (microcode) control program reads:
Wash A
Dry A
Fold A
Wash B
Dry B
...
Fold D
// don't need to wash E
Dry E
Fold E
Here's how a 386 would execute these stages:
Hour     1  2  3  4  5  6  7  8  9 10 11 12 13 14
Washer   A        B        C        D
Dryer       A        B        C        D     E
Fold           A        B        C        D     E
It's an all-day job (14 hours), which is clearly stupid, because only one thing is happening at a time.
The sensible thing is to make a "pipeline", like an assembly line,
where you start washing load B as soon as you move load A over to the
dryer. So here's how a 486 would do the same work:
Hour     1  2  3  4  5  6  7
Washer   A  B  C  D
Dryer       A  B  C  D  E
Fold           A  B  C  D  E
That is, you start washing load D at the same time as you start folding
load B. This way, everything's busy, which speeds up the overall
program substantially.
Notice how long it takes to dry C--it really takes an hour, but imagine
you leave that instruction out. You can't start drying D, since
it's still in the washer. You could start folding C immediately,
but you're busy folding B. So if you leave out "dry C", the
overall program is the *same speed*. This means "dry C" is basically
free--zero clock cycles!
Back in the 1990's, it seemed like zero-clock instructions were really profound, in a zen kind of way.
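If you want to check where the 14 hours and the 7 hours come from, here's a minimal C++ sketch (mine, not part of the original notes) that simulates both schedules: a load can't enter a stage until the previous stage is done with it, and each stage handles only one load per hour.
#include <algorithm>
#include <cstdio>

int main(void) {
	const int nLoads = 5;                    // loads A..E
	bool needsWash[nLoads] = {true, true, true, true, false}; // E is already clean
	int stageFree[3] = {0, 0, 0};            // hour when washer/dryer/folding table free up

	int serialHours = 0, pipelinedFinish = 0;
	for (int load = 0; load < nLoads; load++) {
		int ready = 0;                       // hour when this load can enter its next stage
		for (int stage = (needsWash[load] ? 0 : 1); stage < 3; stage++) {
			serialHours++;                   // 386 style: only one stage happening at a time, ever
			int start = std::max(ready, stageFree[stage]);  // 486 style: overlap the stages
			ready = stageFree[stage] = start + 1;           // each stage is busy for one hour
		}
		pipelinedFinish = std::max(pipelinedFinish, ready);
	}
	printf("Serial (386-style): %d hours\n", serialHours);         // prints 14
	printf("Pipelined (486-style): %d hours\n", pipelinedFinish);  // prints 7
}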
Pipeline Stalls and Flushes
In a real CPU, the pipeline stages aren't wash, dry, and fold; they might be "fetch
instruction from memory", "decode instruction", "read registers needed
for computation", "compute values on registers" (the execute phase), and finally "write
the result back to the register file" (instruction completion, or "retirement").
F1 D1 R1 E1 W1 (instruction 1)
   F2 D2 R2 E2 W2 (instruction 2)
      F3 D3 R3 E3 W3 (instruction 3)
         F4 D4 R4 E4 W4 (instruction 4)
            F5 D5 R5 E5 W5 (instruction 5)
-> time ->
But consider what happens when you execute a slow instruction like
divide--subsequent instructions might use the divided
value. The CPU hence can't execute these instructions until their
data is ready, and so the CPU must detect these dependencies (also
called "pipeline interlocks") and insert waste "stall cycles" or
("pipeline bubbles") to wait for the needed data. Pipeline stalls
can still cost
performance today--see below for software tricks on minimising
dependencies!
F1 D1 R1 E1 W1 (normal instruction)
   F2 D2 R2 E2 E2 E2 E2 E2 E2 W2 (divide: 6 clock cycles)
      F3 D3 R3 stall ........ E3 W3 (three stalled-out instructions)
         F4 D4 stall ........ R4 E4 W4
            F5 stall ........ D5 R5 E5 W5
                              F6 D6 R6 E6 W6 (continuing normally)
-> time ->
Smarter CPUs will only stall instructions that are waiting on that
data, rather than stopping everything in progress like I've shown above.
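In C terms, the dependency the CPU has to detect looks something like this (a made-up fragment, just to show which instruction must stall and which one a smarter CPU can slide past the stall):
int divide_then_use(int a, int b, int c, int d) {
	int q = a / b;   // divide: takes many execute cycles
	int r = q + 1;   // needs q, so it must stall until the divide finishes
	int s = c + d;   // independent of q: a smarter CPU can run this during the stall
	return r + s;
}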
It's even worse with branch instructions--rather than just being delayed,
the subsequent instructions, even though they're already fetched and ready to go,
might turn out never to have happened at all (the program took the other path).
So these instructions need to be "flushed" from the pipeline; and
again, it's the CPU's control unit that needs to detect the situation
and do the flushing (anybody starting to smell some complexity here?).
F1 D1 R1 E1 W1 (normal instructions)
   F2 D2 R2 E2 Branch!
      F3 D3 R3 flush (flushed instructions-that-never-were)
         F4 D4 flush
            F5 flush
               F9 ... (starting over at the branch target)
-> time ->
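The software-visible consequence is that a branch the CPU can't guess well is expensive even when the work it guards is trivial. Here's a hedged sketch (mine, not from the notes): the first version branches on the data, so unpredictable data means lots of flushed work; the second computes both possibilities and selects, which the compiler can often turn into a branch-free conditional move.
// Sum only the values above a threshold.
int sum_above(const int *v, int n, int threshold) {
	int sum = 0;
	for (int i = 0; i < n; i++)
		if (v[i] > threshold)   // unpredictable data: frequent pipeline flushes
			sum += v[i];
	return sum;
}

// Same result, but always do the compare and the add, then select:
// there is no data-dependent branch left to flush.
int sum_above_no_branch(const int *v, int n, int threshold) {
	int sum = 0;
	for (int i = 0; i < n; i++)
		sum += (v[i] > threshold) ? v[i] : 0;
	return sum;
}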
Pipelining with Operand Forwarding
Because it might take a few clock cycles to write and then read from
the register file, and it's very common that an instruction needs the
value computed by the immediately preceding instruction, often real
pipelined CPUs bypass the register file, and pass values directly from the execute stage of one instruction into the execute stage of the next:
F1 D1 R1 E1 W1 (instruction 1)
   F2 D2 R2 E2 W2 (instruction 2)
-> time ->
Note that if you don't do forwarding, you would need to insert two
stall cycles to wait for the write and read from the register file:
F1 D1 R1 E1 W1 (instruction 1)
   F2 D2 stall stall R2 E2 W2 (instruction 2)
-> time ->
As usual, the CPU control unit must detect the dependency, decide to
use operand forwarding, and light up the appropriate CPU hardware to
make it happen.
Data Dependencies
The biggest limit to pipelined execution is data dependencies: read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW). (Read it!)
But you can eliminate most of the "fake" dependencies (WAR and WAW) using register renaming (e.g., Tomasulo's Algorithm). A modern x86 CPU, with 8 architectural registers, might have 50 physical (renamed) registers, accessible only via the register renaming hardware!
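To make the three dependency types concrete, here's a tiny made-up fragment; the same hazards apply when these are CPU registers instead of C variables:
int dependency_demo(int a, int b, int c, int d, int e) {
	int x = a + b;   // (1)
	int y = x + c;   // (2) RAW: truly needs the x computed by (1)
	x = d + e;       // (3) WAW with (1), WAR with (2): "fake" dependencies;
	                 //     renaming (3)'s x to a fresh physical register removes them
	return x + y;
}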
There are several interesting transformations you can use to expose and reduce dependencies, such as Software Pipelining.
A naive application of these is unlikely to provide much benefit, since
the compiler and CPU can do naive transformations already; but often
you can exploit features of the algorithm to improve performance.
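For example, here's a hedged sketch of software pipelining (the scale-by-a-constant loop and the function names are made up for illustration). The rewritten loop issues the next iteration's load while the current iteration's multiply is still in flight; a good compiler or an out-of-order CPU will often do this on its own, but it shows the shape of the transformation.
// Straightforward loop: each iteration's multiply waits on its own load.
void scale(float *a, int n, float s) {
	for (int i = 0; i < n; i++)
		a[i] = a[i] * s;
}

// Software-pipelined version: overlap the load for iteration i+1
// with the multiply for iteration i.
void scale_software_pipelined(float *a, int n, float s) {
	if (n <= 0) return;
	float cur = a[0];                 // prologue: first load
	for (int i = 0; i + 1 < n; i++) {
		float next = a[i + 1];        // load for the NEXT iteration
		a[i] = cur * s;               // multiply + store for THIS iteration
		cur = next;
	}
	a[n - 1] = cur * s;               // epilogue: last multiply + store
}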
Superscalar and Out of Order
So the 1980's roll on. Machines get deeper pipelines and better
forwarding, until the machines usually initiate a new instruction every
clock cycle.
Eventually, even one cycle per instruction is not fast enough.
So as the 1990's arrived, designers began adding wider
execution. They added different "execution units", so more
instructions can execute at the same time. For
example, 1994's PowerPC 601 could issue one integer, one floating-point, and one load/store instruction every clock cycle. Machines that merely overlap in-flight instructions, starting at most one per clock cycle, are just pipelined; CPUs that can start (and work on) more than one ('scalar') instruction per clock cycle are called superscalar.
But there's a problem with superscalar processors. Say you've got
a loop that consists of four integer instructions, then four load/store
instructions. Since there's only one integer unit, the first 4
clock cycles are spent executing the integer instructions. The
first load/store can start at the same time as the last integer
instruction, but then the load/store unit becomes the bottleneck and
the loop takes an additional 3 clock cycles, for a total of 7 clocks:
Program Order:
add1 add2 add3 add4 load1 load2 load3 load4
Execution Order (superscalar, not out of order):
add1 add2 add3 add4
               load1 load2 load3 load4
-> time ->
But if you changed the program so your four integer and load/store instructions were interleaved,
then the loop could take just 4 clock cycles to execute, because both units
would always be busy. So for a few years in the early 1990's,
the *order* you wrote your code could make a huge
difference in performance.
But then hardware designers realized that they *don't* have to execute
the instructions in the exact order specified by software; any
equivalent order that computes the same result will work just as
well. Hence by the mid-1990's, superscalar "out-of-order execution"
CPUs had become dominant; and today the CPU might look several dozen
instructions ahead to determine which instructions it can safely
begin.
Execution Order (superscalar, out of order):
add1 add2 add3 add4
load1 load2 load3 load4
-> time ->
Here's an Ars Technica article on the design of Intel's Core CPU.
There's a good summary of the "Architecture des Ordinateurs" (Computer Architecture) material at this Chinese copy of the French course notes.
Fewer Dependencies, Faster Code
Yes, you can use all this information even if you're not building a CPU.
For example, here's a loop that computes 12 factorial:
unsigned int prod=1, n=12;
for (unsigned int i=1;i<=n;i++) {
	prod*=i;
}
return prod;
(executable NetRun link)
This takes 25ns per call on the NetRun 2.8GHz Pentium 4. Naive loop unrolling produces zero performance gain; basically the CPU is already doing this much:
unsigned int prod=1, n=12;
for (unsigned int i=1;i<=n;i+=2) {
	prod*=i;
	prod*=(i+1);
}
return prod;
(executable NetRun link)
But if I now use my knowledge that integer multiplication is associative (and commutative) to fully separate the even and odd terms:
unsigned int prodA=1, prodB=1, n=12;
for (unsigned int i=1;i<=n;i+=2) {
	prodA*=i;
	prodB*=(i+1);
}
return prodA*prodB;
(executable NetRun link)
This version has fewer true dependencies, and hence executes
dramatically faster--18ns instead of 25ns! You can keep
unrolling the loop, adding prodC and prodD to expose more and more
parallelism, until you eventually hit resource limits (not enough
registers, not enough multiply units, not enough addition units) and
the performance improvement drops off.
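Here's the next step described above, with four independent accumulators, in the same NetRun snippet style (a sketch: it relies on n=12 being a multiple of 4, so a general version would need cleanup code for the leftover iterations):
unsigned int prodA=1, prodB=1, prodC=1, prodD=1, n=12;
for (unsigned int i=1;i<=n;i+=4) {
	prodA*=i;        /* four independent dependency chains */
	prodB*=(i+1);
	prodC*=(i+2);
	prodD*=(i+3);
}
return (prodA*prodB)*(prodC*prodD);  /* combined only at the very end */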