Out of Order Execution and Register Renaming
CS 441 Lecture, Dr. Lawlor
It's really interesting to watch the evolution of modern microarchitectures:
- The Intel 8008, from 1972, has a single shared internal bus used for everything: instruction fetch, execution, and memory access. This means it takes many clock cycles per instruction, since all of these uses must take turns on the bus.
- The Intel 386, from the 1980s, added pipelining. The big difference is you now have many separate busses across the machine, and need to fetch instructions in big blocks (otherwise instruction fetch interferes with the memory accesses made by executing instructions).
- The Intel Pentium, from the early 1990s, added superscalar execution, so now there are multiple arithmetic units and a dependency-checking control unit.
- A recent CPU like the Intel Core 2, from the 2000s, has renamed registers and a reorder buffer, as described below. This is about as far as you can take performance without changing the software.
- Since about 2005 we've had several separate processor cores per chip, which requires multicore software.
- Future CPU/GPU hybrids, like AMD's 2011 "APU A8", might combine several CPU cores with a serious graphics processor, to tackle both sequential and parallel programs.
Superscalar History
So the 1980s roll on. Machines get deeper pipelines and better
forwarding, until they can usually initiate a new instruction every
clock cycle. Eventually, even one cycle per instruction (CPI) is not fast
enough. Sadly, one cycle per instruction is the theoretical limit
of pipelining--it's the best case everybody is trying to reach by
avoiding pipeline stalls, hazards, and flushes.
So as the 1990s arrived, designers began adding "wide" superscalar
execution, where multiple instructions start every clock
cycle. Nowadays, the metric is "instructions per cycle"
(IPC), which can be substantially more than one. (Similarly, at
some point travel went from gallons per mile to miles per gallon, so
bigger numbers are now better!)
To get multiple instructions running at once, you need different
"execution units", so more arithmetic can execute at the same
time. For example, 1993's PowerPC 601 could issue one integer, one floating-point, and one load/store instruction every clock cycle.
You can see superscalar effects in the instruction timings of any
modern CPU--many instructions simply cost no time at all. For
example, each of these assembly programs takes the exact same number of
nanoseconds on my Pentium 4:
ret
(Try this in NetRun now!) -> Takes 4.4ns
mov eax,8
ret
(Try this in NetRun now!) -> Takes 4.4ns
mov ecx,7
mov eax,8
ret
(Try this in NetRun now!) -> Takes 4.4ns
mov ecx,7
mov eax,ecx
ret
(Try this in NetRun now!) -> Takes 4.4ns
mov ecx,7
add eax,ecx
ret
(Try this in NetRun now!) -> Takes 4.4ns
mov ecx,7
imul eax,ecx
ret
(Try this in NetRun now!) -> Takes 4.4ns
add ecx,7
imul eax,ecx
ret
(Try this in NetRun now!) -> Takes 4.4ns
Yet this one is actually slower by two clock cycles:
mov ecx,2
mov eax,3
mov edx,4
ret
(Try this in NetRun now!)
The limiting factor for this particular instruction sequence seems to
just be the number of instructions--the Pentium 4's control unit can't
handle three writes per clock cycle.
You can avoid the substantial call/return overhead and see a little more timing detail using a tight loop like this one:
mov ecx,1000
start:
add eax,1
dec ecx
jne start
ret
(Try this in NetRun now!)
This takes about two clocks to go around each time. You can
even add one more "add" instruction without changing the loop time
(ah, superscalar!). Because the Pentium 4 uses both rising and
falling edges of the clock, a 2.8GHz Pentium 4 (like the main NetRun
machine) can take half-integer multiples of the clock period of
0.357ns. So your timings are 0.357ns for one clock, 0.536ns for
1.5 clocks (like the loop above without the "add"), 0.714ns for 2.0
clocks (like the loop above), 0.893ns for 2.5 clocks, etc.
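For example, here's that loop with one more independent add thrown in, using edx as a second counter (a sketch; exact timings will vary by machine):
mov ecx,1000
mov edx,0        ; a second, independent counter (its value is never used)
start:
add eax,1
add edx,1        ; the extra add: no dependence on the add above, so it runs alongside it
dec ecx
jne start
ret
Because the two adds are independent, the superscalar hardware can run them during the same cycles, so the loop should still take about two clocks per trip.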
Data Dependencies & Register Renaming
The biggest limit to superscalar execution, overall, is data dependencies: RAW, WAR, and WAW, also called "hazards." (Read it!) The short example after this list shows all three.
- RAR (read after read) is not a dependency, since reads can happen in any order.
- WAW (write after write) is where I write a value after you write
it. The machine just has to make sure the correct value ends up
in the register, which is my value.
- WAR (write after read) is where I write a value after you read
it. The machine just has to make sure you get your value before I
destroy it.
- RAW (read after write) is the hard one: I read a value after you write it. This means I need your value, so I really have to wait for you.
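To make these concrete, here's a short made-up sequence with the dependencies marked in the comments:
mov eax,7       ; writes eax
add eax,1       ; RAW: reads the eax written just above, so it must wait for it
mov ecx,eax     ; RAW: also needs that eax
mov eax,3       ; WAR with the two reads above, and WAW with the first write of eax
add eax,ecx     ; RAW: needs both the new eax and ecx
ret
Only the RAW dependencies truly force waiting; as described next, the hardware can make the WAR and WAW dependencies disappear.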
You can eliminate the "fake" dependencies WAW and WAR using register renaming (the central trick in Tomasulo's Algorithm):
a typical implementation hands out a new physical register for
every write operation.
Modern CPUs rename the architectural registers (like eax through edi)
out to a huge set of *hundreds* of possible physical registers.
Also, because most values are passed directly arithmetic-to-arithmetic
(with register bypass), today a "register" is basically just an
identifier that the arithmetic units use to communicate.
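For instance, in the sketch below (the physical register names P1 through P3 are made up for illustration), renaming lets the second write to eax start immediately, even though the older eax is still being read:
mov eax,2        ; this write of eax gets a fresh physical register, say P1
imul eax,eax     ; the result goes to another fresh register, P2
mov ecx,eax      ; RAW: ecx's new register copies P2
mov eax,10       ; WAW/WAR on eax in name only: renaming hands this write P3,
                 ;   so it need not wait for the imul or the mov above
add eax,ecx      ; true RAW dependencies remain: this still waits for P3 and ecx
ret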
There are several interesting transformations you can use to expose and reduce dependencies, such as Software Pipelining.
A naive application of these is unlikely to provide much benefit, since
the compiler and CPU can do naive transformations already; but often
you can exploit features of the algorithm to improve performance.
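For example, here's a sketch of a software-pipelined array sum, where each trip around the loop starts the load for the next element before adding the current one (the 1000-entry integer array "arr" is hypothetical):
push ebx              ; ebx is callee-saved, so preserve it
mov ecx,0             ; index
mov eax,0             ; running sum
mov edx,[arr]         ; prologue: load element 0
sum_loop:
mov ebx,edx           ; hold the value loaded on the previous trip
mov edx,[arr+4*ecx+4] ; start loading the *next* element early
add eax,ebx           ; add the current value while that load is still in flight
add ecx,1
cmp ecx,999
jl sum_loop
add eax,edx           ; epilogue: add the final element
pop ebx
ret
The add no longer waits on a load from the same iteration; the RAW dependence now crosses an iteration boundary, which gives each load time to complete.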
More broadly, CPU designers have recently been shedding their most complex
features in favor of higher parallelism, so the machines of the future
may look like the machines of the past!
Out of Order: Superscalar in Practice
There's one big problem with superscalar processors. Say you've got
a loop that consists of four integer instructions, then four load/store
instructions. Since there's only one integer unit, the first 4
clock cycles are spent executing the integer instructions. The
first load/store can start at the same time as the last integer
instruction, but then the load/store unit becomes the bottleneck and
the loop takes an additional 3 clock cycles, for a total of 7 clocks.
But if your four integer and load/store instructions were interleaved,
the loop could take just 4 clock cycles to execute, because both units
would always be busy. So for a few years in the early 1990s,
the *order* in which you wrote your code could make a huge
difference in performance.
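Here's a sketch of the two layouts, using repeated reads of the word at the top of the stack as harmless stand-ins for the load/store work. On a strictly in-order machine with one integer unit and one load/store unit, this grouped version keeps only one unit busy at a time:
add eax,1        ; four integer adds...
add edx,1
add eax,2
add edx,2
mov ecx,[esp]    ; ...then four loads (each just re-reads the top of the stack)
mov ecx,[esp]
mov ecx,[esp]
mov ecx,[esp]
ret
Interleaving the same instructions lets both units do useful work every cycle:
add eax,1
mov ecx,[esp]
add edx,1
mov ecx,[esp]
add eax,2
mov ecx,[esp]
add edx,2
mov ecx,[esp]
ret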
But then hardware designers realized that they *don't* have to execute
the instructions in the exact order specified by software; any
equivalent order that computes the same result will work just as
well. Hence by the mid-1990s, "out-of-order execution" CPUs
became dominant; today the CPU might look a hundred
instructions ahead to determine which instructions it can safely
begin. This "reorder window" stretches across loop iterations,
function calls, and branches to pull as much parallelism as possible
from the code.
Finally, instructions are "retired" in the program's original order by a
special cleanup circuit. In the event of a debugger breakpoint
(or a crash or hardware interrupt), the debugger will see the "retired"
version of each register: the last fully calculated
copy.
The Modern World
Consider a modern CPU microarchitecture, like Intel's Core CPUs. These chips embrace so-called "instruction-level parallelism" at a number of levels:
- Instructions are pipelined into 14 stages.
- Branches are predicted hundreds of cycles in advance.
- Execution is superscalar, with up to 4 instructions starting per clock.
- There are as many as 8 separate arithmetic units (counting both
integer and the various floating-point units) which can all be busy at
once.
- Program registers are renamed across a pool of maybe a hundred
physical registers (Intel hasn't released the actual number).
This eliminates WAR and WAW dependencies, and helps keep the pipelines
full.
- Execution is out of order, with up to 96 instructions in the
reorder buffer. This allows the CPU to bypass slow instructions.
The bottom line is that all of this circuitry and effort exists *solely* to extract parallelism from sequential assembly code.