Pipelined and Superscalar Computation, with Dependencies
CS 641 Lecture, Dr. Lawlor
Pipelining
Back in 1985, around the era of the Intel 80386, each CPU instruction
(like "mov") had a "cycle count", which was just the total number of
CPU clock cycles that the instruction took from start to finish.
You could look this up right in the Intel documentation. "mov" took one
clock cycle, "add" took one clock cycle, "mul" took 30 clock cycles,
etc.
This whole cycle-count scheme broke down around 1989 with the 486
processor. The problem is that the 486 was actually able to start
working on the next instruction
before the first instruction was finished. This idea
of working on several things at once ("parallelism") is amazingly
important for performance today, but it started very small--the 486
could add *and* jump
simultaneously.
The idea is for the CPU to treat its various arithmetic circuits the
same way you'd treat a washer and dryer. For the sake of
argument, assume you've got:
- Five loads of laundry A B C D E
- They all need to be washed, dried, and folded
- Load E's already clean, and just needs to be dried and folded.
- Each load takes an hour to wash, dry, or fold
Your (microcode) control program reads:
Wash A
Dry A
Fold A
Wash B
Dry B
...
Fold D
// don't need to wash E
Dry E
Fold E
Here's how a 386 would execute these stages:
Hour     1  2  3  4  5  6  7  8  9 10 11 12 13 14
Washer   A        B        C        D
Dryer       A        B        C        D     E
Fold           A        B        C        D     E
It's an all-day job (14 hours), which is clearly stupid, because only one thing is happening at a time.
The sensible thing is to make a "pipeline", like an assembly line,
where you start washing load B as soon as you move load A over to the
dryer. So here's how a 486 would do the same work:
Hour     1  2  3  4  5  6  7
Washer   A  B  C  D
Dryer       A  B  C  D  E
Fold           A  B  C  D  E
That is, you start washing load D at the same time as you start folding
load B. This way, everything's busy, which speeds up the overall
program substantially.
Notice how long it takes to dry C--it really takes an hour, but imagine
you leave that instruction out. You can't start drying D, since
it's still in the washer. You could start folding C immediately,
but you're busy folding B. So if you leave out "dry C", the
overall program is the *same speed*. This means "dry C" is basically
free--zero clock cycles!
Back in the 1990's, it seemed like zero-clock instructions were really profound, in a zen kind of way.
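If you want to check where the 14 hours and the 7 hours come from, here's a minimal C++ sketch (mine, not part of the original notes) that simulates both schedules: a load can't enter a stage until the previous stage is done with it, and each stage handles only one load per hour.
#include <algorithm>
#include <cstdio>

int main(void) {
	const int nLoads = 5;                    // loads A..E
	bool needsWash[nLoads] = {true, true, true, true, false}; // E is already clean
	int stageFree[3] = {0, 0, 0};            // hour when washer/dryer/folding table free up

	int serialHours = 0, pipelinedFinish = 0;
	for (int load = 0; load < nLoads; load++) {
		int ready = 0;                       // hour when this load can enter its next stage
		for (int stage = (needsWash[load] ? 0 : 1); stage < 3; stage++) {
			serialHours++;                   // 386 style: only one stage happening at a time, ever
			int start = std::max(ready, stageFree[stage]);  // 486 style: overlap the stages
			ready = stageFree[stage] = start + 1;           // each stage is busy for one hour
		}
		pipelinedFinish = std::max(pipelinedFinish, ready);
	}
	printf("Serial (386-style): %d hours\n", serialHours);         // prints 14
	printf("Pipelined (486-style): %d hours\n", pipelinedFinish);  // prints 7
}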
Pipeline Stalls and Flushes
In a real CPU, the pipeline stages aren't wash, dry, and fold; they might be "fetch
instruction from memory", "decode instruction", "read registers needed
for computation", "compute values on registers" (the execute phase), and finally "write
the result back to the register file" (instruction completion, or "retirement").
F1 D1 R1 E1 W1 (instruction 1)
   F2 D2 R2 E2 W2 (instruction 2)
      F3 D3 R3 E3 W3 (instruction 3)
         F4 D4 R4 E4 W4 (instruction 4)
            F5 D5 R5 E5 W5 (instruction 5)
-> time ->
But consider what happens when you execute a slow instruction like
divide--subsequent instructions might use the divided
value. The CPU hence can't execute these instructions until their
data is ready, and so the CPU must detect these dependencies (also
called "pipeline interlocks") and insert waste "stall cycles" or
("pipeline bubbles") to wait for the needed data. Pipeline stalls
can still cost
performance today--see below for software tricks on minimising
dependencies!
F1 D1 R1 E1 W1 (normal instruction)
   F2 D2 R2 E2 E2 E2 E2 E2 E2 W2 (divide: 6 clock cycles)
      F3 D3 R3 stall ........ E3 W3 (three stalled-out instructions)
         F4 D4 stall ........ R4 E4 W4
            F5 stall ........ D5 R5 E5 W5
                              F6 D6 R6 E6 W6 (continuing normally)
-> time ->
Smarter CPUs will only stall instructions that are waiting on that
data, rather than stopping everything in progress like I've shown above.
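In C terms, the dependency the CPU has to detect looks something like this (a made-up fragment, just to show which instruction must stall and which one a smarter CPU can slide past the stall):
int divide_then_use(int a, int b, int c, int d) {
	int q = a / b;   // divide: takes many execute cycles
	int r = q + 1;   // needs q, so it must stall until the divide finishes
	int s = c + d;   // independent of q: a smarter CPU can run this during the stall
	return r + s;
}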
It's even worse with branch instructions--rather than just being delayed,
the subsequent instructions, even though they're already fetched and ready to go,
might turn out never to have happened at all (the program took the other path).
So these instructions need to be "flushed" from the pipeline; and
again, it's the CPU's control unit that needs to detect the situation
and do the flushing (anybody starting to smell some complexity here?).
F1 D1 R1 E1 W1 (normal instructions)
   F2 D2 R2 E2 Branch!
      F3 D3 R3 flush (flushed instructions-that-never-were)
         F4 D4 flush
            F5 flush
               F9 ... (starting over at the branch target)
-> time ->
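The software-visible consequence is that a branch the CPU can't guess well is expensive even when the work it guards is trivial. Here's a hedged sketch (mine, not from the notes): the first version branches on the data, so unpredictable data means lots of flushed work; the second computes both possibilities and selects, which the compiler can often turn into a branch-free conditional move.
// Sum only the values above a threshold.
int sum_above(const int *v, int n, int threshold) {
	int sum = 0;
	for (int i = 0; i < n; i++)
		if (v[i] > threshold)   // unpredictable data: frequent pipeline flushes
			sum += v[i];
	return sum;
}

// Same result, but always do the compare and the add, then select:
// there is no data-dependent branch left to flush.
int sum_above_no_branch(const int *v, int n, int threshold) {
	int sum = 0;
	for (int i = 0; i < n; i++)
		sum += (v[i] > threshold) ? v[i] : 0;
	return sum;
}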
Pipelining with Operand Forwarding
Because it might take a few clock cycles to write and then read from
the register file, and it's very common that an instruction needs the
value computed by the immediately preceding instruction, often real
pipelined CPUs bypass the register file, and pass values directly from the execute stage of one instruction into the execute stage of the next:
F1 D1 R1 E1 W1 (instruction 1)
   F2 D2 R2 E2 W2 (instruction 2)
-> time ->
Note that if you don't do forwarding, you would need to insert two
stall cycles to wait for the write and read from the register file:
F1 D1 R1 E1 W1 (instruction 1)
   F2 D2 stall stall R2 E2 W2 (instruction 2)
-> time ->
As usual, the CPU control unit must detect the dependency, decide to
use operand forwarding, and light up the appropriate CPU hardware to
make it happen.
Data Dependencies
The biggest limit to pipelined execution is data dependencies: read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW). (Read it!)
But you can eliminate most of the "fake" dependencies (WAR and WAW) using register renaming (e.g., Tomasulo's Algorithm). A modern x86 CPU, with 8 architectural registers, might have 50 physical (renamed) registers, accessible only via the register renaming hardware!
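To make the three dependency types concrete, here's a tiny made-up fragment; the same hazards apply when these are CPU registers instead of C variables:
int dependency_demo(int a, int b, int c, int d, int e) {
	int x = a + b;   // (1)
	int y = x + c;   // (2) RAW: truly needs the x computed by (1)
	x = d + e;       // (3) WAW with (1), WAR with (2): "fake" dependencies;
	                 //     renaming (3)'s x to a fresh physical register removes them
	return x + y;
}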
There are several interesting transformations you can use to expose and reduce dependencies, such as Software Pipelining.
A naive application of these is unlikely to provide much benefit, since
the compiler and CPU can do naive transformations already; but often
you can exploit features of the algorithm to improve performance.
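For example, here's a hedged sketch of software pipelining (the scale-by-a-constant loop and the function names are made up for illustration). The rewritten loop issues the next iteration's load while the current iteration's multiply is still in flight; a good compiler or an out-of-order CPU will often do this on its own, but it shows the shape of the transformation.
// Straightforward loop: each iteration's multiply waits on its own load.
void scale(float *a, int n, float s) {
	for (int i = 0; i < n; i++)
		a[i] = a[i] * s;
}

// Software-pipelined version: overlap the load for iteration i+1
// with the multiply for iteration i.
void scale_software_pipelined(float *a, int n, float s) {
	if (n <= 0) return;
	float cur = a[0];                 // prologue: first load
	for (int i = 0; i + 1 < n; i++) {
		float next = a[i + 1];        // load for the NEXT iteration
		a[i] = cur * s;               // multiply + store for THIS iteration
		cur = next;
	}
	a[n - 1] = cur * s;               // epilogue: last multiply + store
}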
Superscalar and Out of Order
So the 1980's roll on. Machines get deeper pipelines and better
forwarding, until the machines usually initiate a new instruction every
clock cycle.
Eventually, even one cycle per instruction is not fast enough.
So as the 1990's arrived, designers began adding wider
execution. They added different "execution units", so more
instructions can execute at the same time. For
example, 1994's PowerPC 601 could issue one integer, one floating-point, and one load/store instruction every clock cycle. Machines that merely overlap in-flight instructions, starting at most one per clock cycle, are just pipelined; CPUs that can start (and work on) more than one ('scalar') instruction per clock cycle are called superscalar.
But there's a problem with superscalar processors. Say you've got
a loop that consists of four integer instructions, then four load/store
instructions. Since there's only one integer unit, the first 4
clock cycles are spent executing the integer instructions. The
first load/store can start at the same time as the last integer
instruction, but then the load/store unit becomes the bottleneck and
the loop takes an additional 3 clock cycles, for a total of 7 clocks:
Program Order:
add1 add2 add3 add4 load1 load2 load3 load4
Execution Order (superscalar, not out of order):
add1 add2 add3 add4
               load1 load2 load3 load4
-> time ->
But if you changed the program so your four integer and load/store instructions were interleaved,
then the loop could take just 4 clock cycles to execute, because both units
would always be busy. So for a few years in the early 1990's,
the *order* you wrote your code could make a huge
difference in performance.
But then hardware designers realized that they *don't* have to execute
the instructions in the exact order specified by software; any
equivalent order that computes the same result will work just as
well. Hence by the mid-1990's, superscalar "out-of-order execution"
CPUs had become dominant; and today the CPU might look several dozen
instructions ahead to determine which instructions it can safely
begin.
Execution Order (superscalar, out of order):
add1 add2 add3 add4
load1 load2 load3 load4
-> time ->
Here's an Ars Technica article on the design of Intel's Core CPU.
There's a good summary of the "Architecture des Ordinateurs" (Computer Architecture) material at this Chinese copy of the French course notes.
Fewer Dependencies, Faster Code
Yes, you can use all this information even if you're not building a CPU.
For example, here's a loop that computes 12 factorial:
unsigned int prod=1, n=12;
for (unsigned int i=1;i<=n;i++) {
	prod*=i;
}
return prod;
(executable NetRun link)
This takes 25ns per call on the NetRun 2.8GHz Pentium 4. Naive loop unrolling produces zero performance gain; basically the CPU is already doing this much:
unsigned int prod=1, n=12;
for (unsigned int i=1;i<=n;i+=2) {
	prod*=i;
	prod*=(i+1);
}
return prod;
(executable NetRun link)
But if I now use my knowledge that integer multiplication is associative (and commutative) to fully separate the even and odd terms:
unsigned int prodA=1, prodB=1, n=12;
for (unsigned int i=1;i<=n;i+=2) {
	prodA*=i;
	prodB*=(i+1);
}
return prodA*prodB;
(executable NetRun link)
This version has fewer true dependencies, and hence executes
dramatically faster--18ns instead of 25ns! You can keep
unrolling the loop, adding prodC and prodD to expose more and more
parallelism, until you eventually hit resource limits (not enough
registers, not enough multiply units, not enough addition units) and
the performance improvement drops off.
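Here's the next step described above, with four independent accumulators, in the same NetRun snippet style (a sketch: it relies on n=12 being a multiple of 4, so a general version would need cleanup code for the leftover iterations):
unsigned int prodA=1, prodB=1, prodC=1, prodD=1, n=12;
for (unsigned int i=1;i<=n;i+=4) {
	prodA*=i;        /* four independent dependency chains */
	prodB*=(i+1);
	prodC*=(i+2);
	prodD*=(i+3);
}
return (prodA*prodB)*(prodC*prodD);  /* combined only at the very end */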