Super-Scalar Execution: Multiple Instructions/Clock

CS 441 Lecture, Dr. Lawlor

Here's the obvious way to compute the factorial of 12:

int i, fact=1;
for (i=1;i<=12;i++) {
	fact*=i;
}
return fact;
(Try this in NetRun now!)

Here's a modified version where we separately compute even and odd factorials, then multiply them at the end:

int i, factO=1, factE=1;
for (i=1;i<=12;i+=2) {
	factO*=i;
	factE*=i+1;
}
return factO*factE;
(Try this in NetRun now!)

This modification makes the code "superscalar friendly", so it's possible to execute the loop's multiply instructions simultaniously. Note that this isn't simply a loop unrolling, which gives a net loss of performance, it's a higher-level transformation to expose parallelism in the problem.

Hardware	Obvious	Superscalar	Savings	Discussion
Intel 486, 50MHz, 1991	5000ns	5400ns	-10%	Classic non-pipelined CPU: many clocks/instruction. The superscalar transform just makes the code slower, because the hardware isn't superscalar.
Intel Pentium III, 1133MHz, 2002	59.6ns	50.1ns	+16%	Pipelined CPU: the integer unit is fully pipelined, so we get one instruction per clock cycle. The P3 is also weakly superscalar, but the benefit is small.
Intel Pentium 4, 2.8Ghz, 2005	22.6ns	15.0ns	+33%	Virtually all of the improvement here is due to the P4's much higher clock rate.
Intel Q6600, 2.4GHz, 2008	16.7ns	9.4ns	+43%	Lower clock rate, but fewer pipeline stages leads to better overall performance.
Intel Sandy Bridge i5-2400 3.1Ghz 2011	11.8ns	5.3ns	+55%	Higher clock rate and better tuned superscalar execution. Superscalar transform gives a substantial benefit--with everything else getting faster, the remaining dependencies become more and more important.