# Super-Scalar Execution: Multiple Instructions/Clock

CS 441 Lecture, Dr. Lawlor

Here's the obvious way to compute the factorial of 12:
`int i, fact=1; for (i=1; i<=12; i++) { fact*=i; } return fact;`

Here's a modified version where we separately compute even and odd factorials, then multiply them at the end:

`int i, factO=1, factE=1; for (i=1; i<=12; i+=2) { factO*=i; factE*=i+1; } return factO*factE;`

This modification makes the code "superscalar friendly": factO and factE never depend on each other, so it's possible to execute the loop's two multiply instructions simultaneously.  Note that this isn't simply loop unrolling, which gives a net loss of performance; it's a higher-level transformation that exposes parallelism in the problem.
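To make the comparison easier to try outside NetRun, here is a minimal sketch of both loops wrapped as plain C functions. The function names `fact_obvious` and `fact_split` and the parameter `n` are assumptions added for illustration; the lecture's versions hard-code 12. The split version assumes `n` is even, as it is for 12.

```c
/* Obvious version: a single accumulator, so each multiply
   must wait for the result of the previous multiply. */
int fact_obvious(int n) {
    int fact = 1;
    for (int i = 1; i <= n; i++) {
        fact *= i;
    }
    return fact;
}

/* Split version: two accumulators that never read each other,
   so the two multiplies in one iteration are independent.
   Assumes n is even (hypothetical generalization of the 12 above). */
int fact_split(int n) {
    int factO = 1, factE = 1;
    for (int i = 1; i <= n; i += 2) {
        factO *= i;      /* odd factors:  1 * 3 * 5 * ... */
        factE *= i + 1;  /* even factors: 2 * 4 * 6 * ... */
    }
    return factO * factE;
}
```

Both functions should return the same value for any even `n` up to 12, the largest factorial that fits in a 32-bit `int`.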

| Hardware | Obvious | Superscalar | Savings | Discussion |
|---|---|---|---|---|
| Intel 486, 50MHz, 1991 | 5000ns | 5400ns | -10% | Classic non-pipelined CPU: many clocks/instruction.  The superscalar transform just makes the code slower, because the hardware isn't superscalar. |
| Intel Pentium III, 1133MHz, 2002 | 59.6ns | 50.1ns | +16% | Pipelined CPU: the integer unit is fully pipelined, so we get one instruction per clock cycle.  The P3 is also weakly superscalar, but the benefit is small. |
| Intel Pentium 4, 2.8GHz, 2005 | 22.6ns | 15.0ns | +33% | Virtually all of the improvement here is due to the P4's much higher clock rate. |
| Intel Q6600, 2.4GHz, 2008 | 16.7ns | 9.4ns | +43% | Lower clock rate, but fewer pipeline stages leads to better overall performance. |
| Intel Sandy Bridge i5-2400, 3.1GHz, 2011 | 11.8ns | 5.3ns | +55% | Higher clock rate and better tuned superscalar execution.  Superscalar transform gives a substantial benefit: with everything else getting faster, the remaining dependencies become more and more important. |
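The timings above come from the lecture's own measurements. As a rough way to collect a similar pair of numbers on another machine, the sketch below times both loops with the standard C `clock()` function. The helper names `ns_per_call` and `report`, the repetition count, and the use of `volatile` to keep the compiler from folding the loop into a constant are all assumptions of this sketch, not the method used for the table, so absolute numbers will differ.

```c
#include <stdio.h>
#include <time.h>

/* Single dependency chain. */
static int fact_obvious(int n) {
    int fact = 1;
    for (int i = 1; i <= n; i++) fact *= i;
    return fact;
}

/* Two independent chains; assumes n is even. */
static int fact_split(int n) {
    int factO = 1, factE = 1;
    for (int i = 1; i <= n; i += 2) {
        factO *= i;
        factE *= i + 1;
    }
    return factO * factE;
}

/* Average CPU time per call, in nanoseconds (hypothetical helper). */
double ns_per_call(int (*f)(int), int n, long reps) {
    volatile int vn = n;        /* re-read each call: blocks constant folding */
    volatile unsigned sink = 0; /* keeps the results "used" */
    clock_t start = clock();
    for (long r = 0; r < reps; r++) {
        sink ^= (unsigned)f(vn);
    }
    clock_t end = clock();
    (void)sink;
    return 1e9 * (double)(end - start) / CLOCKS_PER_SEC / (double)reps;
}

/* Print both timings and the savings, computed the same way as the
   table's column: (obvious - split) / obvious. */
void report(int n, long reps) {
    double t_obvious = ns_per_call(fact_obvious, n, reps);
    double t_split = ns_per_call(fact_split, n, reps);
    printf("obvious: %.2f ns  split: %.2f ns\n", t_obvious, t_split);
    if (t_obvious > 0.0) {
        printf("savings: %+.0f%%\n",
               100.0 * (t_obvious - t_split) / t_obvious);
    }
}
```

Calling `report(12, 10000000L)` from `main` prints one line of timings and one line of savings. The function-pointer call adds overhead to both versions equally, so the savings percentage is likely to look smaller than in the table.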