Optimization: Cache Performance
CS 301 Lecture, Dr. Lawlor, 2005/11/9
Here's an inner loop that does something funny--it jumps around (using
"loc") inside an array called "buf", but only within the bounds
established by "mask". Like any loop, each iteration takes some
amount of time; but what's surprising is that the speed depends very
strongly on the value of "mask", which sets the
size of the array we're jumping around in.
(Executable NetRun Link)
for (i=0;i<max;i++) { /* jump around in buffer, incrementing as we go */
	sum+=buf[loc&mask]++;
	loc+=1234567;
}
Here's the performance of this loop, in nanoseconds per iteration, as a function of the array size (as determined by "mask").
Size (KB)  | 2.8GHz P4 | 1.6GHz P-M | 2.2GHz Athlon64 | 2.0GHz PPC G5 | 900MHz P3 | 300MHz PPC
      1.02 |      4.92 |       4.88 |            2.27 |          5.53 |     17.47 |      16.03
      2.05 |      4.6  |       4.87 |            2.28 |          5.53 |     17.47 |      16.03
      4.1  |      4.72 |       4.9  |            2.28 |          5.18 |     17.48 |      16.03
      8.19 |      4.83 |       4.91 |            2.28 |          3.6  |     17.49 |      16.03
     16.38 |      4.75 |       4.91 |            2.28 |          3.6  |     17.52 |      16.03
     32.77 |      6.59 |       4.84 |            2.28 |          3.63 |     21.57 |      16.03
     65.54 |      6.84 |      10.1  |            2.29 |          5.31 |     21.58 |      16.64
    131.07 |      6.92 |      10.11 |            5.26 |          5.31 |     21.97 |      40.31
    262.14 |      7.13 |      10.11 |            6.92 |          5.31 |     98.28 |      40.34
    524.29 |      8.48 |      10.07 |           10.13 |         23.98 |    144.04 |      52.33
   1048.58 |     19.33 |      10.43 |           38.95 |         44.59 |    153.2  |      49.86
   2097.15 |     54.33 |      28.87 |           76.15 |         99.11 |    156.86 |     144.76
   4194.3  |     76.31 |      85.3  |           78.05 |        112.55 |    157.32 |     256.09
   8388.61 |     75.33 |     111.43 |           78.81 |        210.04 |    159.43 |     342.73
  16777.22 |     77.49 |     120.39 |           81.77 |        214.19 |    166.52 |     166.52
  33554.43 |     77.93 |     126.73 |           81.56 |        208.21 |    168.58 |     168.58
I claim each performance plateau corresponds to a chunk of hardware. Note that there are three plateaus in the timings:
- "L1" cache is the fastest (around 5ns per access) but tiny (100KB or less).
- "L2" cache is slower (7-10ns) but much bigger (up to a megabyte or so).
- RAM is painfully slow (100ns or more) but huge (gigabytes).
Note that machines have been getting faster and faster (compare the slower machines in the right-hand columns), but RAM isn't much faster nowadays!
DRAM Hardware
There's more detail than you'd ever want to know about DRAM, RAS/CAS, EDO, FPM, DDR, etc at Ars Technica.
DRAM cells are slightly light sensitive. This means you can actually build a camera using a RAM chip as a sensor
(this was much easier back in the day, when chips were often packaged
in a little metal "can" instead of being cast into plastic or
ceramic). The downside is that the sensors have a native 1-bit
resolution (true black-and-white), so even to get grayscale you've got
to take several exposures and average. The upside is that the
chips in a $50 512MB (4-gigabit) DIMM could be used to build a 4-
*gigapixel* camera, and the thing could be read out at several frames
per second. If you were willing to restrict the focal area and
resolution, you could easily get hundreds of thousands of frames per
second, although you'd probably need a very bright light source.