Memory Access Performance and Cache

Here's an inner loop that does something funny--it jumps around (using "loc") inside an array called "buf", but only within the bounds established by "mask".  Like any loop, each iteration takes some amount of time; but what's surprising is that there's a very strong dependence of the speed on the value of "mask", which determines the size of the array we're jumping around in.

 for (i=0;i<max;i++) { /* jump around in buffer, incrementing as we go */
	sum+=buf[loc&mask]++;
	loc+=del;
	del=del+sum;
 }
(Try this in NetRun now!)
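
If you want to reproduce these timings outside NetRun, here's a minimal standalone sketch of the same experiment.  The 32MB int buffer, the progression of mask sizes, and the std::chrono timing are my own choices for illustration; NetRun's timing harness (and the exact setup used for the table below) may differ.

#include <chrono>
#include <cstdio>
#include <vector>

int main(void) {
	enum {max=1000*1000}; /* iterations per array size */
	std::vector<int> buf(32*1024*1024/sizeof(int),0); /* 32MB buffer */
	for (unsigned int mask=1024-1; mask<buf.size(); mask=mask*2+1) {
		unsigned int sum=0, loc=0, del=1;
		auto start=std::chrono::high_resolution_clock::now();
		for (int i=0;i<max;i++) { /* same jumping loop as above */
			sum+=buf[loc&mask]++;
			loc+=del;
			del=del+sum;
		}
		double ns=std::chrono::duration<double,std::nano>(
			std::chrono::high_resolution_clock::now()-start).count();
		printf("%6lu KB: %.2f ns/iteration  (sum=%u)\n",
			(unsigned long)((mask+1)*sizeof(int)/1024), ns/max, sum);
	}
	return 0;
}

Printing sum keeps the compiler from optimizing the loop away, and the loc/del updates make the access pattern hop around unpredictably inside the masked region.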

Here's the performance of this loop, in nanoseconds per iteration, as a function of the array size (as determined by "mask").

Size (KB)  4GHz Skylake  2.4GHz Q6600  2.8GHz P4  2.2GHz Athlon64  2.0GHz PPC G5  900MHz P3  900MHz ARM  300MHz PPC
        1          1.29          2.31       4.05             2.27            5.5       17.5        15.5        16.0
        2          1.29          2.29       4.39             2.28            5.5       17.5        13.4        16.0
        4          1.29          2.29       4.63             2.28            5.2       17.5        13.4        16.0
        8          1.29          2.29       4.71             2.28            3.6       17.5        13.4        16.0
       16          1.29          2.29       4.76             2.28            3.6       17.5        13.4        16.0
       33          1.29          2.29       7.74             2.28            3.6       21.6        14.7        16.0
       66           1.8          3.91       8.67             2.29            5.3       21.6        20.4        16.6
      131          2.34          4.66       9.07             5.26            5.3       22.0        24.5        40.3
      262          3.05          4.91       9.54             6.92            5.3       98.3        26.8        40.3
      524          4.74          4.98      12.57            10.13           24.0      144.0        43.5        52.3
     1049           5.7          5.02       33.5            38.95           44.6      153.2       117.0        49.9
     2097          6.12          6.52      61.49            76.15           99.1      156.9       160.9       144.8
     4194          5.91         15.72      76.95            78.05          112.6      157.3       186.3       256.1
     8389          8.82          40.3      85.36            78.81          210.0      159.4       201.7       342.7
    16777         25.22         50.19      88.55            81.77          214.2      166.5       215.2       166.5
    33554         31.04         53.35      90.81            81.56          208.2      168.6       226.9       168.6

I claim each performance plateau corresponds to a chunk of hardware.  Note that there are three jumps in the timings:
- The first jump happens when the array no longer fits in the L1 data cache (around 32-64KB on most of these machines).
- The second jump happens when the array no longer fits in the L2 (or last-level) cache, anywhere from a few hundred KB to a few MB depending on the machine.
- The last jump happens when the array doesn't fit in any cache at all, so nearly every access has to go all the way out to main RAM.

Memory Speed

In general, memory accesses have performance that's:
- Fast, if the data you're working on is small enough to fit in cache.
- Fast, if you access memory sequentially (streaming), so the hardware can fetch whole cache lines at a time and prefetch ahead of you.
- Slow, if you jump around randomly in a buffer much bigger than the cache, like the loop above.

So if you're getting bad performance, you can either:
- Shrink the data you're working on (or work on it one cache-sized piece at a time) so it fits in cache, or
- Rearrange your accesses so you march through memory sequentially instead of jumping around.

These are actually two aspects of "locality", the similarity between the memory locations you access.  The cache lets you reuse stuff you've used recently in time (temporal locality); streaming access is about touching stuff nearby in space (spatial locality).

There are lots of different ways to improve locality:
- Shrink your data, for example by using smaller types (float instead of double, chars or shorts instead of ints), so more of it fits in cache.
- Reorder your loops so the innermost loop walks through memory contiguously (see the sketch below).
- Break big problems into cache-sized blocks ("tiling" or "blocking"), and finish each block before moving on to the next.
- Store data you use together next to each other in memory, instead of scattering it across separate allocations.
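
As a concrete example of the loop-reordering idea, here's a sketch; the array name and sizes are made up for illustration, but it shows why matching the innermost loop to C's row-major memory layout matters.

#include <cstdio>

enum {rows=1024, cols=1024};
static float img[rows][cols]; /* hypothetical ~4MB image */

float sum_rowmajor(void) { /* good spatial locality: consecutive addresses */
	float sum=0;
	for (int r=0;r<rows;r++)
		for (int c=0;c<cols;c++)
			sum+=img[r][c];
	return sum;
}

float sum_colmajor(void) { /* poor locality: each access jumps cols*sizeof(float) bytes */
	float sum=0;
	for (int c=0;c<cols;c++)
		for (int r=0;r<rows;r++)
			sum+=img[r][c];
	return sum;
}

int main(void) {
	printf("row-major sum=%f  column-major sum=%f\n",
		sum_rowmajor(), sum_colmajor());
	return 0;
}

Both functions compute exactly the same answer; only the order of the memory accesses, and hence the cache behavior, differs.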

Set-associative Cache Mapping

Anytime you have a cache, of any kind, you need to figure out what to do when the cache gets full.  Generally, you face this problem when you've got a new element X to load into the cache--which cache slot do you place X into?

The simplest approach in hardware is a "direct-mapped cache", where element X goes into cache slot X%N (where N is the size of the cache), so elements simply wrap around the available slots.  Direct mapping means elements 1 and 2 land in adjacent slots, and you can store up to N elements before addresses start wrapping around and colliding.
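
To make the X%N mapping concrete, here's a tiny sketch of how a direct-mapped cache might pick a slot for a byte address.  The 64-byte line size and 64KB total size are assumptions for illustration, not any particular CPU's real numbers.

enum {LINE_SIZE=64, CACHE_SIZE=64*1024, N_SLOTS=CACHE_SIZE/LINE_SIZE};

unsigned int cache_slot(unsigned long address) {
	unsigned long line = address / LINE_SIZE; /* which cache line holds this byte */
	return line % N_SLOTS; /* direct mapping: lines just wrap around the slots */
}

With these made-up numbers, cache_slot(0x0ABCD) and cache_slot(0x1ABCD) come out identical, because the two addresses are exactly CACHE_SIZE apart--exactly the collision the next example runs into.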

For example, the Pentium 4's L1 cache is 64KB in size and direct-mapped.  This means address 0x0ABCD and address 0x1ABCD (which are 64KB apart) both get mapped to the same place in the cache.  So this program is fast (5.2ns/call):

enum {n=1024*1024};
char arr[n]; 

int foo(void) {
	arr[0]++;
	arr[12345]++;
	return 0;
}

(Try this in NetRun now!)

By contrast, this very similar-looking program is very slow (20+ns/call), because array elements exactly 64KB apart map to the same line of the cache, so the CPU keeps overwriting one with the other.  This is called "cache thrashing", and it makes the cache totally useless:

enum {n=1024*1024};
char arr[n]; 

int foo(void) {
	arr[0]++;
	arr[65536]++;
	return 0;
}

(Try this in NetRun now!)

In general, power-of-two jumps through memory can be very slow on direct-mapped machines.  This is one of the few cases on computers where powers of two are not ideal!  The solution: pad out your data structures so the values you access aren't a large power-of-two distance apart.
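
For instance, here's a sketch of that padding idea applied to the slow program above; I haven't timed this variant, and the 64-byte offset is just an assumed cache-line size, but shifting the second access off the 64KB boundary means the two addresses no longer fight over the same cache line.

enum {n=1024*1024};
char arr[n];

int foo(void) {
	arr[0]++;
	arr[65536+64]++; /* 64KB plus one (assumed) 64-byte line: no longer collides with arr[0] */
	return 0;
}

The same trick scales up: for example, padding a 1024x1024 float matrix out to 1024x1025 keeps successive rows of a column from landing in the same cache slots.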


CS 301 Lecture Note, 2016, Dr. Orion Lawlor, UAF Computer Science Department.