Here's a modified version where we separately compute the products of the even and odd factors, then multiply them at the end:
This modification makes the code "superscalar friendly", so it's possible to execute the loop's multiply instructions simultaneously. Note that this isn't simply loop unrolling, which gives a net loss of performance; it's a higher-level transformation that exposes the parallelism in the problem.
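As a rough sketch of the transformation described above (not necessarily the exact code that was benchmarked), the single running product can be split into two independent accumulators. The function names `factorial_serial` and `factorial_split`, and the choice of `unsigned long long`, are assumptions made for illustration.

```c
/* Baseline: one accumulator. Every multiply depends on the result of
   the previous one, so the multiplies must execute one after another. */
unsigned long long factorial_serial(unsigned n) {
    unsigned long long prod = 1;
    for (unsigned i = 1; i <= n; i++) {
        prod *= i;
    }
    return prod;
}

/* Superscalar-friendly version (hypothetical sketch): two accumulators
   with no dependency between them, so the hardware is free to overlap
   the two multiplies in each iteration. */
unsigned long long factorial_split(unsigned n) {
    unsigned long long odd = 1, even = 1;
    unsigned i;
    for (i = 1; i + 1 <= n; i += 2) {
        odd *= i;        /* 1, 3, 5, ... */
        even *= i + 1;   /* 2, 4, 6, ... */
    }
    if (i <= n) {
        odd *= i;        /* leftover odd factor when n is odd */
    }
    return odd * even;   /* combine the two partial products once */
}
```

Both functions should return the same value for any `n` whose factorial fits in 64 bits (up to `n = 20`). The only difference is the shape of the dependency chain: the split version has two chains of about `n/2` multiplies each instead of one chain of `n`.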
|Intel 486, 50MHz,
||Classic non-pipelined CPU: many clocks/instruction. The superscalar transform just makes the code slower, because the hardware isn't superscalar.
|Intel Pentium III,
||Pipelined CPU: the integer unit is fully pipelined, so we get one instruction per clock cycle. The P3 is also weakly superscalar, but the benefit is small.
||Virtually all of the improvement here is due to the P4's much higher clock rate.
|Intel Q6600, 2.4GHz,
||Lower clock rate, but fewer pipeline stages lead to better overall performance.
|Intel Sandy Bridge i5-2400, 3.1GHz
||Higher clock rate and better tuned superscalar execution.
Superscalar transform gives a substantial benefit--with everything else
getting faster, the remaining dependencies become more and more