Pipelining, Dependencies, and Superscalar Execution
CS 441 Lecture, Dr. Lawlor
If you're trying to run a single sequential program faster, you need to do as much work in parallel as you can.
- pipelining keeps
all execution units busy, by executing different stages of several
instructions simultaniously. For example, the fetch unit will be
fetching instruction 4, while instruction 3 is being decoded, and
instruction 2 is being executed, and instruction 1 is just finishing
up. Done properly, pipelining can result in one instruction being
finished every clock cycle. Pipelining showed up in mainframes
starting in the late 1950's with the IBM Stretch, and then again in the mid-1980's on desktops with the Intel 486.
- Branch
instructions can seriously limit the performance of a pipelined
machine. For example, if instruction 2 is a branch, instructions
3 & 4 should never have been executed, so you need to either stall
the whole pipeline until the branch is resolved, or speculatively
execute pipelined work and then either commit or quash the results.
- One solution is branch prediction, like a "branch history table". This has developed to a fine art in modern CPUs.
- superscalar
execution is when several instructions are initiated every clock
cycle. You need several copies of each execution unit (fetch,
decode, execute, writeback), multiple register read/write ports, and a
complex control unit to keep everything working together. The CDC 6600 mainframe was (arguably) superscalar back in 1964, and superscalar hit desktops in the mid-1990's with the Intel Pentium and PowerPC 604.
- Dependencies between instructions are the limiting factor in superscalar machines.
- Several interesting tricks, such as register renaming, can reduce the impact of dependencies.
I'll describe each of these in more detail below.
Pipelining
Back in the 1980's, microprocessors took many clock cycles per
instruction, usually at least four clock cycles. They even
printed up tables of clock cycle counts
for various instructions and modes, to help you optimize your
code. This ancient situation still exists with many embedded
processors, like the PIC microcontroller. But this began to
change with the Intel 486, which used "pipelining" to cut the number of clock cycles per instruction (CPI).
The idea behind pipelining is for the CPU to schedule its various components the
same way a sane human would use a washer and dryer. For the sake of
argument, assume you've got:
- Five loads of laundry A B C D E
- They all need to be washed, dried, and folded
- Load E's already clean, and just needs to be dried and folded.
- Each load takes an hour to wash, dry, or fold
So the steps needed are as follows:
Wash A
Dry A
Fold A
Wash B
Dry B
...
Fold D
// don't need to wash E
Dry E
Fold E
Here's how a 386 would execute this program:
Hour
|
1
|
2
|
3
|
4
|
5
|
6
|
7
|
8
|
9
|
10
|
11
|
12
|
13
|
14
|
Washer
|
A
|
|
|
B
|
|
|
C
|
|
|
D
|
|
|
|
|
Dryer
|
|
A
|
|
|
B
|
|
|
C
|
|
|
D
|
|
E
|
|
Fold
|
|
|
A
|
|
|
B
|
|
|
C
|
|
|
D
|
|
E
|
It's an all-day job, which is clearly stupid, because only one thing is happening at once.
The sensible thing is to make a "pipeline", like an assembly line,
where you start washing load B as soon as you move load A over to the
dryer. So here's how a 486 would do the same work:
Hour
|
1
|
2
|
3
|
4
|
5
|
6
|
7
|
Washer
|
A
|
B
|
C
|
D
|
|
|
|
Dryer
|
|
A
|
B
|
C
|
D
|
E
|
|
Fold
|
|
|
A
|
B
|
C
|
D
|
E
|
That is, you start washing load D at the same time as you start folding
load B. This way, everything's busy, which speeds up the overall
program substantially--about a factor of two in this case.
Notice how long it takes to dry C--it really takes an hour, but imagine
you didn't need to do that (e.g., C is your rubber sheets). You can't start drying D, since
it's still in the washer. You could start folding C immediately,
but you're busy folding B. So if you leave out "dry C", the
overall process is the *same speed*. This means "dry C" is basically
free--zero clock cycles!
Back in the 1990's, it seemed like zero-clock instructions were really profound, in a zen kind of way.
Pipeline Stalls and Flushes
In a real CPU, the stages "A", "B", "C", "D", and "E" might be "fetch
instruction from memory", "decode instruction", "read registers needed
for computation", "compute values on registers" (execute phase), and finally "write
result back to register file" (instruction completion or "retirement").
F1 D1 R1 E1 W1 (instruction 1)
F2 D2 R2 E2 W2 (instruction 2)
F3 D3 R3 E3 W3 (instruction 3)
F4 D4 R4 E4 W4 (instruction 4)
F5 D5 R5 E5 W5 (instruction 5)
-> time ->
But consider what happens when you execute a slow instruction like
divide--subsequent instructions might use the divided
value. The CPU hence can't execute these instructions until their
data is ready, and so the CPU must detect these dependencies (also
called "pipeline interlocks") and insert waste "stall cycles" or
("pipeline bubbles") to wait for the needed data. Pipeline stalls
can still cost
performance today--see below for software tricks on minimising
dependencies!
F1 D1 R1 E1 W1 (normal instruction)
F2 D2 R2 E2 E2 E2 E2 E2 E2 W2 (divide: 6 clock cycles)
F3 D3 R3 stall ........ E3 W3 (three stalled-out instructions)
F4 D4 stall ........ R4 E4 W4
F5 stall ........ D5 R5 E5 W5
F6 D6 R6 E6 W6 (continuing normally)
-> time ->
Smarter CPUs will only stall instructions that are waiting on that
data, rather than stopping everything in progress like I've shown above.
Here's a MIPS example that takes an extra clock cycle due to the load
into register 4 immediately being used by the next instruction, the add:
li $3, 5
lw $4, ($25)
add $2, $4, $3
jr $31
nop
(Try this in NetRun now!)
It's even worse with branch instructions--rather than just waiting,
subsequent instructions, even though they're fetched and ready to go,
might never have happened.
So these instructions need to be "flushed" from the pipeline; and
again, it's the CPU's control unit that needs to detect the situation
and do the flushing (anybody starting to smell some complexity here?).
F1 D1 R1 E1 W1 (normal instructions)
F2 D2 R2 E2 Branch!
F3 D3 R3 flush (flushed instructions-that-never-were)
F4 D4 flush
F5 flush
F9 ... (starting over at the branch target)
-> time ->
MIPS branches are pretty good, in that they only take one additional
clock cycle, and you can even specify the instruction to insert during
that clock cycle, called the "branch delay slot" instruction.
li $4, 6
li $3, 5
beq $3,$3,end_of_world
add $2, $3, $4 # Always gets executed (branch delay slot)
# Stuff in here gets jumped over
nop
nop
nop
end_of_world:
jr $31
nop
(Try this in NetRun now!)
Pipelining with Operand Forwarding
Because it might take a few clock cycles to write and then read from
the register file, and it's very common that an instruction needs the
value computed by the immediately preceeding instruction, often real
pipelined CPUs bypass the register file, and pass values directly from the execute stage of one instruction into the execute stage of the next:
F1 D1 R1 E1 W1 (instruction 1)
F2 D2 R2 E2 W2 (instruction 2)
-> time ->
Here E2 needs the result coming out of E1, so you can put in a special
"forwarding" circuit to take it there. Note that if you don't do
forwarding, you would need to insert two
stall cycles to wait for the write and read from the register file:
F1 D1 R1 E1 W1 (instruction 1)
F2 D2 stall R2 E2 W2 (instruction 2)
-> time ->
As usual, the CPU control unit must detect the dependency, decide to
use operand forwarding, and light up the appropriate CPU hardware to
make it happen.
MIPS is a
backronym for "Machine (without) Interlocking Pipeline Stages", because
the designers were really happy with themselves for building in operand
forwarding. It does work, since we can measure that there are no
stalls in this code, despite the back-to-back dependent "add"
instructions:
li $4, 6
add $3, $4, $4
add $2, $3, $4
jr $31
nop
(Try this in NetRun now!)
Superscalar Execution
So the 1980's roll on. Machines get deeper pipelines and better
forwarding, until the machines usually initiate a new instruction every
clock cycle. We saw in the last lecture how MIPS machines really
do execute one instruction per clock--you could see this in the
instruction timings, where most instruction cost only one additional
clock cycle, unless something went wrong and it took two.
Eventually, even one cycle per instruction (CPI) is not fast
enough. Sadly, one cycle per instruction is the theoretical limit
of pipelining--it's the upper bound everybody's trying to reach by
avoiding pipeline stalls, hazards, and flushes.
So as the 1990's arrived, designers began adding "wide" superscalar
execution, where multiple instructions start every clock
cycle. Nowadays, the metric is "instructions per cycle"
(IPC), which can be substantially more than one.
To get multiple instructions running at once, you need different
"execution units", so more arithmetic can execute at the same
time. For
example, 1994's PowerPC 601 could issue one integer, one floating-point, and one load/store instruction every clock cycle.
You can see superscalar effects in the instruction timings of any
modern CPU--many instructions simply cost no time at all. For
example, each these assembly programs take the exact same number of
nanoseconds on my Pentium 4:
ret
(Try this in NetRun now!) -> Takes 4.4ns
mov eax,8
ret
(Try this in NetRun now!) -> Takes 4.4ns
mov ecx,7
mov eax,8
ret
(Try this in NetRun now!) -> Takes 4.4ns
mov ecx,7
mov eax,ecx
ret
(Try this in NetRun now!) -> Takes 4.4ns
mov ecx,7
add eax,ecx
ret
(Try this in NetRun now!) -> Takes 4.4ns
mov ecx,7
imul eax,ecx
ret
(Try this in NetRun now!) -> Takes 4.4ns
add ecx,7
imul eax,ecx
ret
(Try this in NetRun now!) -> Takes 4.4ns
Yet this one is actually slower by two clock cycles:
mov ecx,2
mov eax,3
mov edx,4
ret
(Try this in NetRun now!)
The limiting factor for this particular instruction sequence seems to
just be the number of instructions--the Pentium 4's control unit can't
handle three writes per clock cycle.
You can avoid the substantial call/return overhead and see a little more timing detail using a tight loop like this one:
mov ecx,1000
start:
add eax,1
dec ecx
jne start
ret
(Try this in NetRun now!)
This takes about two clocks to go around each time. You can
even add one more "add" instructions without changing the loop time
(ah, superscalar!). Because the Pentium 4 uses both rising and
trailing edges of the clock, a 2.8GHz Pentium 4 (like the main NetRun
machine) can take half-integer multiples of the clock period,
0.357ns. So your timings are 0.357ns for one clock, 0.536ns for
1.5 clocks (like the loop above without the "add"), 0.714ns for 2.0
clocks (like the loop above), 0.893ns for 2.5 clocks, etc.
Data Dependencies & Register Renaming
The biggest limit to superscalar execution, overall, is data dependencies (RAW, WAR, and WAW). (Read it!)
But you can reduce most of the "fake" dependencies (WAR and WAW) using register renaming (i.e., Tomasulo's Algorithm).
There are several interesting transformations you can use to expose and reduce dependencies, such as Software Pipelining.
A naive application of these is unlikely to provide much benefit, since
the compiler and CPU can do naive transformations already; but often
you can exploit features of the algorithm to improve performance.
Plus, CPU designers has recently been shedding their most complex
features in favor of higher parallelism, so the machines of the future
may look like the machines of the past!
Out of Order: Superscalar in Practice
There's one big problem with superscalar processors. Say you've got
a loop that consists of four integer instructions, then four load/store
instructions. Since there's only one integer unit, the first 4
clock cycles are spent executing the integer instructions. The
first load/store can start at the same time as the last integer
instruction, but then the load/store unit becomes the bottleneck and
the loop takes an additional 3 clock cycles, for a total of 7 clocks.
But if your four integer and load/store instructions were interleaved,
the loop could take just 4 clock cycles to execute, because both units
would always be busy. So for a few years in the early 1990's,
the *order* you wrote your code could make a huge
difference in performance.
But then hardware designers realized that they *don't* have to execute
the instructions in the exact order specified by software; any
equivalent order that computes the same result will work just as
well. Hence by the mid-1990's "out-of-order execution"
CPUs became dominant; until today the CPU might look a hundred
instructions ahead to determine which instructions it can safely
begin.
Why you care: Superscalar-Friendly Code
For example, here's a loop that computes 12 factorial:
unsigned int prod=1, n=12;
for (unsigned int i=1;i<=n;i++) {
prod*=i;
}
return prod;
(executable NetRun link)
This takes 25ns per call on the NetRun 2.8GHz Pentium 4. Naive loop unrolling produces zero performance gain; basically the CPU is already doing this much:
unsigned int prod=1, n=12;
for (unsigned int i=1;i<=n;i+=2) {
prod*=i;
prod*=(i+1);
}
return prod;
(executable NetRun link)
But if I now use my knowledge that integer multiplication is associative to fully separate even and odd terms:
unsigned int prodA=1, prodB=1, n=12;
for (unsigned int i=1;i<=n;i+=2) {
prodA*=i;
prodB*=(i+1);
}
return prodA*prodB;
(executable NetRun link)
This version has fewer true dependencies, and hence executes substantially faster--in 18ns instead of 25ns! You can keep
unrolling the loop, adding prodC and prodD, exposing more and more
parallelism, until you eventually hit resource limits: not enough
registers, not enough multiply units, not enough addition units, and
your performance improvement eventually drops off.