Optimization: Cache Performance

CS 301 Lecture, Dr. Lawlor, 2005/11/9

Here's an inner loop that does something funny--it jumps around (using "loc") inside an array called "buf", but only within the bounds established by "mask".  Like any loop, each iteration takes some amount of time; but what's suprising is that there's a very strong dependence of the speed on the value of "mask", which establishes the size of the array we're jumping around in.

(Executable NetRun Link)
	for (i=0;i<max;i++) { /* jump around in buffer, incrementing as we go */
sum+=buf[loc&mask]++;
loc+=1234567;
}
Here's the performance of this loop, in nanoseconds per iteration, as a function of the array size (as determined by "mask").
Size (KB) 2.8GHz P4 1.6GHz P-M 2.2Ghz Athlon64 2.0GHz PPC G5 900MHz P3 300MHz PPC
1.02 4.92 4.88 2.27 5.53 17.47 16.03
2.05 4.6 4.87 2.28 5.53 17.47 16.03
4.1 4.72 4.9 2.28 5.18 17.48 16.03
8.19 4.83 4.91 2.28 3.6 17.49 16.03
16.38 4.75 4.91 2.28 3.6 17.52 16.03
32.77 6.59 4.84 2.28 3.63 21.57 16.03
65.54 6.84 10.1 2.29 5.31 21.58 16.64
131.07 6.92 10.11 5.26 5.31 21.97 40.31
262.14 7.13 10.11 6.92 5.31 98.28 40.34
524.29 8.48 10.07 10.13 23.98 144.04 52.33
1048.58 19.33 10.43 38.95 44.59 153.2 49.86
2097.15 54.33 28.87 76.15 99.11 156.86 144.76
4194.3 76.31 85.3 78.05 112.55 157.32 256.09
8388.61 75.33 111.43 78.81 210.04 159.43 342.73
16777.22 77.49 120.39 81.77 214.19 166.52 166.52
33554.43 77.93 126.73 81.56 208.21 168.58 168.58

I claim each performance plateau corresponds to a chunk of hardware.  Note that there are three jumps in the timings:
Note that machines have been getting faster and faster (see the slow machines on the right), but RAM isn't much faster nowadays!

DRAM Hardware

There's more detail than you'd ever want to know about DRAM, RAS/CAS, EDO, FPM, DDR, etc at Ars Technica.

DRAM cells are slightly light sensitive.  This means you can actually build a camera using a RAM chip as a sensor (this was much easier back in the day, when chips were often packaged in a little metal "can" instead of being cast into plastic or ceramic).  The downside is the sensors have a native 1-bit resolution (real black-and-white), so even to get grayscale you've got to take several exposures and average.  The upside is that the chips in a $50 512MB (4gigabit) DIMM could be used to build a 4 *gigapixel* camera, and the thing could be read out at several frames per second.  If you were willing to restrict the focal area and resolution, you could easily get hundreds of thousands of frames per second, although you'd probably need a very bright lightsource.