Memory Access Performance and Cache

Here's an inner loop that does something funny--it jumps around (using "loc") inside an array called "buf", but only within the bounds established by "mask".  Like any loop, each iteration takes some amount of time; but what's surprising is that the speed depends very strongly on the value of "mask", which sets the size of the array we're jumping around in.

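 for (i=0;i<max;i++) { /* jump around in buffer, incrementing as we go */
	sum+=buf[loc&mask]++; /* read and bump one entry; "&mask" keeps the index in bounds */
	loc+=del;             /* jump ahead by the current stride */
	del=del+sum;          /* change the stride, so the jumps are irregular */
 }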
(Try this in NetRun now!)
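If you'd rather time this outside NetRun, here's a minimal sketch of a standalone harness. The function names (jump_loop, time_jumps), the unsigned types, and the starting values of sum, loc, and del are my assumptions, since the loop above doesn't show its declarations. It uses the coarse standard clock() timer, so use a large "max" to get stable numbers.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* The loop from above, wrapped in a function (hypothetical name).
   mask must be a power of two minus one, and buf must hold mask+1 entries.
   Unsigned arithmetic is an assumption: it makes wraparound well-defined. */
unsigned int jump_loop(unsigned int *buf, unsigned int mask, unsigned int max)
{
    unsigned int i, sum = 1, loc = 0, del = 1; /* assumed starting values */
    for (i = 0; i < max; i++) { /* jump around in buffer, incrementing as we go */
        sum += buf[loc & mask]++;
        loc += del;
        del = del + sum;
    }
    return sum;
}

/* Time the loop on a zeroed buffer of mask+1 entries.
   Returns nanoseconds per iteration, or -1.0 on bad input or failed allocation. */
double time_jumps(unsigned int mask, unsigned int max)
{
    unsigned int *buf;
    volatile unsigned int sink; /* keeps the compiler from discarding the loop */
    clock_t start, stop;

    if (max == 0) return -1.0;
    buf = calloc((size_t)mask + 1, sizeof *buf);
    if (buf == NULL) return -1.0;

    start = clock();
    sink = jump_loop(buf, mask, max);
    stop = clock();
    (void)sink;

    free(buf);
    return 1e9 * (double)(stop - start) / CLOCKS_PER_SEC / max;
}

/* Print one line per buffer size, doubling from 1KB up to max_kb. */
void print_sweep(unsigned int max_kb, unsigned int max)
{
    unsigned int kb;
    for (kb = 1; kb <= max_kb; kb *= 2) {
        unsigned int entries = kb * 1024u / (unsigned int)sizeof(unsigned int);
        printf("%u KB: %.2f ns/iteration\n", kb, time_jumps(entries - 1, max));
    }
}
```

Calling something like print_sweep(32768, 10000000) from main should print one line per buffer size. The exact numbers will depend on your machine, your compiler flags, and how the operating system hands out freshly allocated memory, so treat them as rough.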

Here's the performance of this loop, in nanoseconds per iteration, as a function of the array size (as determined by "mask").

Size (KB)  4GHz Skylake  2.4GHz Q6600  2.8GHz P4  2.2GHz Athlon64  2.0GHz PPC G5  900MHz P3  900MHz ARM  300MHz PPC
1                  1.29          2.31       4.05             2.27            5.5       17.5        15.5        16.0
2                  1.29          2.29       4.39             2.28            5.5       17.5        13.4        16.0
4                  1.29          2.29       4.63             2.28            5.2       17.5        13.4        16.0
8                  1.29          2.29       4.71             2.28            3.6       17.5        13.4        16.0
16                 1.29          2.29       4.76             2.28            3.6       17.5        13.4        16.0
33                 1.29          2.29       7.74             2.28            3.6       21.6        14.7        16.0
66                  1.8          3.91       8.67             2.29            5.3       21.6        20.4        16.6
131                2.34          4.66       9.07             5.26            5.3       22.0        24.5        40.3
262                3.05          4.91       9.54             6.92            5.3       98.3        26.8        40.3
524                4.74          4.98      12.57            10.13           24.0      144.0        43.5        52.3
1049                5.7          5.02       33.5            38.95           44.6      153.2       117.0        49.9
2097               6.12          6.52      61.49            76.15           99.1      156.9       160.9       144.8
4194               5.91         15.72      76.95            78.05          112.6      157.3       186.3       256.1
8389               8.82          40.3      85.36            78.81          210.0      159.4       201.7       342.7
16777             25.22         50.19      88.55            81.77          214.2      166.5       215.2       166.5
33554             31.04         53.35      90.81            81.56          208.2      168.6       226.9       168.6

I claim each performance plateau corresponds to a chunk of hardware.  Note that there are three jumps in the timings:

Memory Speed

In general, memory accesses have performance that's:

So if you're getting bad performance, you can either:

These are actually two aspects of "locality": how close together your memory accesses are, either in time or in address.  The cache lets you reuse stuff you've used recently in time (temporal locality); streaming access is about touching stuff nearby in space (spatial locality).
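To make spatial locality concrete, here's a small sketch that adds up the same square array two ways. The function names and the flat n-by-n layout are my own choices for illustration, not anything from the benchmark above. Both functions return the same total; only the order of the memory accesses differs.

```c
#include <stddef.h>

/* Sum an n-by-n array stored row by row in one flat block.
   The inner loop walks consecutive addresses: good spatial locality. */
long sum_by_rows(const int *a, size_t n)
{
    long total = 0;
    size_t r, c;
    for (r = 0; r < n; r++)
        for (c = 0; c < n; c++)
            total += a[r * n + c];
    return total;
}

/* Same sum, but the inner loop strides n entries at a time,
   so consecutive accesses land far apart in memory: poor spatial locality. */
long sum_by_cols(const int *a, size_t n)
{
    long total = 0;
    size_t r, c;
    for (c = 0; c < n; c++)
        for (r = 0; r < n; r++)
            total += a[r * n + c];
    return total;
}
```

When the array is much bigger than the cache, I'd expect sum_by_rows to be noticeably faster on most machines, but how much faster depends on the hardware and on what the compiler does with the loops, so time it yourself before relying on it.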

There are lots of different ways to improve locality: