The History of Pipelining

CS 441 Lecture, Dr. Lawlor

Today we looked at the performance of this program, which just adds 1 to each int or float in an array.
/* Add 1 to each float in dest[0..n-1]. */
void add_to_floats(float *dest,int n) {
    int i;
    for (i=0;i<n;i++) dest[i]+=1.0;
}
/* Add 1 to each int in dest[0..n-1]. */
void add_to_ints(int *dest,int n) {
    int i;
    for (i=0;i<n;i++) dest[i]+=1;
}
#define n 1000
float some_floats[n]; int some_ints[n];
/* Zero-argument wrappers, so each loop can be handed to time_function. */
int add_floats(void) { add_to_floats(some_floats,n); return 0;}
int add_ints(void) { add_to_ints(some_ints,n); return 0;}

/* time_function (supplied by NetRun) returns the time for one call, in seconds. */
int foo(void) {
    int i; double per_call;
    for (i=0;i<n;i++) some_floats[i]=0.0f;
    per_call=time_function(add_ints);
    printf("%.2f ns/int\n",1.0e9*per_call/n);
    per_call=time_function(add_floats);
    printf("%.2f ns/float\n",1.0e9*per_call/n);
    return 0;
}

(Try this in NetRun now!)
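
Outside NetRun, time_function is not available. A minimal stand-in might look like the sketch below. It assumes only what the program above implies: the function takes a zero-argument function returning int, and returns seconds per call. The typedef name timeable_fn, the doubling loop, the 0.01 second threshold, and demo_work are illustrative choices, not NetRun's actual implementation.

```c
#include <stdio.h>
#include <time.h>

/* Assumed signature: any zero-argument function returning int can be timed. */
typedef int (*timeable_fn)(void);

/* Stand-in for NetRun's time_function (an assumption, not the real one).
   Doubles the number of calls until the run takes at least 0.01 seconds
   of CPU time, then returns the average seconds per call. */
double time_function(timeable_fn fn) {
    long calls = 1;
    for (;;) {
        long i;
        double elapsed;
        clock_t start = clock();
        for (i = 0; i < calls; i++) fn();
        elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (elapsed >= 0.01 || calls >= (1L << 30)) return elapsed / calls;
        calls *= 2;
    }
}

/* Hypothetical workload, used only to show the call pattern. */
static volatile int demo_counter = 0;
int demo_work(void) { demo_counter++; return 0; }

/* Example use: report the per-call time of demo_work in nanoseconds. */
void demo_timing(void) {
    printf("%.2f ns/call\n", 1.0e9 * time_function(demo_work));
}
```

Because clock() measures CPU time at a fairly coarse resolution, repeating the call many times and averaging is what makes nanosecond-scale results meaningful here.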

Here's the performance of this code on various hardware (all on NetRun):

| Hardware | ns/int | ns/float | ns/clock | Instructions per loop | Clocks per instruction | Discussion |
|---|---|---|---|---|---|---|
| Intel 486, 50MHz, 1991 | 181 | 763 | 20 | 4 (int), 8 (float) | 3 (int), 5 (float) | Classic non-pipelined CPU: many clocks/instruction. |
| MIPS R5000, 180MHz, 1996 | 51 | 114 | 5.5 | 9 (int), 11 (float) | 1 (int), 2 (float) | Classic fully pipelined CPU: one instruction/clock.  Note how there are more instructions than CISC, but each instruction runs faster! |
| PowerPC G4, 768MHz, 2001 | 8.3 | 10.2 | 1.3 | 8 (int), 9 (float) | 0.8 (int), 0.87 (float) | Superscalar RISC CPU: multiple instructions per clock cycle.  The PowerPC wasn't amazingly good at doing superscalar work yet, but it was superscalar. |
| Intel Pentium III, 1133MHz, 2002 | 3.58 | 2.87 | 0.88 | 4 (int), 7 (float) | 1 (int), 0.5 (float) | Superscalar x86 CPU: the integer unit is fully pipelined, so we get one instruction per clock cycle.  But floating point runs *simultaneously* with the integer stuff! |
| Intel Pentium 4, 2.8GHz, 2005 | 1.22 | 1.57 | 0.36 | 4 (int), 6 (float) | 0.84 (int), 0.73 (float) | Modern CPUs are able to run even integer code superscalar.  Yet some code actually takes more clock cycles per instruction on the Pentium 4, due to its deep pipeline. |
| Intel Q6600, 2.4GHz, 2008 | 0.84 | 0.85 | 0.42 | 4 (int), 6 (float) | 0.5 (int), 0.33 (float) | We're even more superscalar, executing 2 or 3 instructions per clock cycle. |
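
The clocks-per-instruction column follows from the other columns: dividing ns per item by ns per clock gives clocks per loop iteration, and dividing that by instructions per loop gives clocks per instruction. For the Q6600 integer loop, 0.84 / 0.42 = 2 clocks per loop, and 2 / 4 = 0.5 clocks per instruction. The sketch below redoes that arithmetic for the integer column; the function and struct names are made up for illustration, and the numbers are copied from the table.

```c
#include <stdio.h>

/* (ns per item / ns per clock) = clocks per loop iteration;
   dividing by instructions per loop gives clocks per instruction. */
double clocks_per_instruction(double ns_per_item, double ns_per_clock,
                              int instructions_per_loop) {
    double clocks_per_loop = ns_per_item / ns_per_clock;
    return clocks_per_loop / instructions_per_loop;
}

/* Hypothetical helper: recompute the integer-loop figure for each CPU,
   using the measurements from the table. */
void print_int_cpi(void) {
    struct row { const char *cpu; double ns_per_int, ns_per_clock; int instructions; };
    static const struct row rows[] = {
        {"Intel 486",   181.0,  20.0,  4},
        {"MIPS R5000",   51.0,   5.5,  9},
        {"PowerPC G4",    8.3,   1.3,  8},
        {"Pentium III",   3.58,  0.88, 4},
        {"Pentium 4",     1.22,  0.36, 4},
        {"Intel Q6600",   0.84,  0.42, 4},
    };
    size_t i;
    for (i = 0; i < sizeof(rows) / sizeof(rows[0]); i++) {
        printf("%-12s %.2f clocks/instruction (int)\n", rows[i].cpu,
               clocks_per_instruction(rows[i].ns_per_int, rows[i].ns_per_clock,
                                      rows[i].instructions));
    }
}
```

Calling print_int_cpi() from main prints one line per CPU, which can be compared against the last numeric column of the table.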

It's also interesting to look at the pipeline depths listed in Wikipedia's pipelining page: the first pipelines were only 3-5 stages deep, but they reached 31 stages by the Pentium 4 era.  This turned out to be too many stages: pipeline fill and drain latencies, and the interlocks between pipeline stages, began to dominate runtime.  Thus Intel's later CPUs, such as the Core 2, used much shorter pipelines of around 14 stages.
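
To see why fill latency hurts deep pipelines, here is a toy cycle-count model, not a measurement of any real CPU. It assumes one instruction issues per clock, that filling an empty pipeline costs (depth - 1) extra clocks, and that the pipeline is emptied and refilled some number of times during the run (for example, on mispredicted branches). The function name and the flush counts are invented for illustration.

```c
#include <stdio.h>

/* Toy model: n instructions at one per clock, plus (depth - 1) clocks
   for the initial fill and for each of `flushes` later refills. */
long pipeline_cycles(long n, int depth, long flushes) {
    return n + (long)(depth - 1) * (1 + flushes);
}

/* Hypothetical helper: compare three pipeline depths for a given flush count. */
void compare_depths(long n, long flushes) {
    static const int depths[] = {5, 14, 31};
    size_t i;
    for (i = 0; i < sizeof(depths) / sizeof(depths[0]); i++) {
        printf("%2d stages, %ld flushes: %ld cycles for %ld instructions\n",
               depths[i], flushes, pipeline_cycles(n, depths[i], flushes), n);
    }
}
```

In this model, 1000 instructions with no flushes take 1004 cycles at 5 stages and 1030 at 31 stages, so depth barely matters. With 100 flushes the same work takes 2313 cycles at 14 stages but 4030 at 31 stages. A deeper pipeline normally buys a faster clock, so cycle counts alone do not settle the comparison, but the model shows how refill cost grows with depth.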

Bonus: Errata

Almost every CPU produced has a few hardware bugs, called "errata".  Typically these are minor problems, easily found by the manufacturer and fixed via a small adjustment to the silicon, or a boot-time software fix ("microcode patch").  Sometimes, an unknown bug will be found by customers.

Intel received a lot of bad press back in 1994 for a fairly minor problem with their floating-point divide on early Pentium chips.  After early denials that this was even an important bug, Intel ended up running a free replacement program.  The recalled chips were made into keychains for employees to carry, bearing Andy Grove's quote in response:
Bad companies are destroyed by crises;
good companies survive them;
great companies are improved by them.

In other words, Intel would redouble their efforts to avoid customer-visible bugs.