Memory Access Performance and Cache

Here's an inner loop that does something funny--it jumps around (using "loc") inside an array called "buf", but only within the bounds established by "mask".  Like any loop, each iteration takes some amount of time; but what's surprising is that there's a very strong dependence of the speed on the value of "mask", which determines the size of the array we're jumping around in.

 for (i=0;i<max;i++) { /* jump around in buffer, incrementing as we go */
	sum+=buf[loc&mask]++;
	loc+=del;
	del=del+sum;
 }
(Try this in NetRun now!)
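
If you want to reproduce these timings outside NetRun, here's a minimal standalone sketch of the same experiment.  The 32MB int buffer, the progression of mask sizes, and the std::chrono timing are my own choices for illustration; NetRun's timing harness (and the exact setup used for the table below) may differ.

#include <chrono>
#include <cstdio>
#include <vector>

int main(void) {
	enum {max=1000*1000}; /* iterations per array size */
	std::vector<int> buf(32*1024*1024/sizeof(int),0); /* 32MB buffer */
	for (unsigned int mask=1024-1; mask<buf.size(); mask=mask*2+1) {
		unsigned int sum=0, loc=0, del=1;
		auto start=std::chrono::high_resolution_clock::now();
		for (int i=0;i<max;i++) { /* same jumping loop as above */
			sum+=buf[loc&mask]++;
			loc+=del;
			del=del+sum;
		}
		double ns=std::chrono::duration<double,std::nano>(
			std::chrono::high_resolution_clock::now()-start).count();
		printf("%6lu KB: %.2f ns/iteration  (sum=%u)\n",
			(unsigned long)((mask+1)*sizeof(int)/1024), ns/max, sum);
	}
	return 0;
}

Printing sum keeps the compiler from optimizing the loop away, and the loc/del updates make the access pattern hop around unpredictably inside the masked region.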

Here's the performance of this loop, in nanoseconds per iteration, as a function of the array size (as determined by "mask").

Size (KB)  4GHz Skylake  2.4GHz Q6600  2.8GHz P4  2.2GHz Athlon64  2.0GHz PPC G5  900MHz P3  900MHz ARM  300MHz PPC
        1          1.29          2.31       4.05             2.27            5.5       17.5        15.5        16.0
        2          1.29          2.29       4.39             2.28            5.5       17.5        13.4        16.0
        4          1.29          2.29       4.63             2.28            5.2       17.5        13.4        16.0
        8          1.29          2.29       4.71             2.28            3.6       17.5        13.4        16.0
       16          1.29          2.29       4.76             2.28            3.6       17.5        13.4        16.0
       33          1.29          2.29       7.74             2.28            3.6       21.6        14.7        16.0
       66           1.8          3.91       8.67             2.29            5.3       21.6        20.4        16.6
      131          2.34          4.66       9.07             5.26            5.3       22.0        24.5        40.3
      262          3.05          4.91       9.54             6.92            5.3       98.3        26.8        40.3
      524          4.74          4.98      12.57            10.13           24.0      144.0        43.5        52.3
     1049           5.7          5.02       33.5            38.95           44.6      153.2       117.0        49.9
     2097          6.12          6.52      61.49            76.15           99.1      156.9       160.9       144.8
     4194          5.91         15.72      76.95            78.05          112.6      157.3       186.3       256.1
     8389          8.82          40.3      85.36            78.81          210.0      159.4       201.7       342.7
    16777         25.22         50.19      88.55            81.77          214.2      166.5       215.2       166.5
    33554         31.04         53.35      90.81            81.56          208.2      168.6       226.9       168.6

I claim each performance plateau corresponds to a chunk of hardware.  Note that there are three jumps in the timings:
- The first jump happens when the array no longer fits in the L1 data cache (around 32-64KB on most of these machines).
- The second jump happens when the array no longer fits in the L2 (or last-level) cache, anywhere from a few hundred KB to a few MB depending on the machine.
- The last jump happens when the array doesn't fit in any cache at all, so nearly every access has to go all the way out to main RAM.

Memory Speed

In general, memory accesses have performance that's:
- Fast, if the data you're working on is small enough to fit in cache.
- Fast, if you access memory sequentially (streaming), so the hardware can fetch whole cache lines at a time and prefetch ahead of you.
- Slow, if you jump around randomly in a buffer much bigger than the cache, like the loop above.

So if you're getting bad performance, you can either:
- Shrink the data you're working on (or work on it one cache-sized piece at a time) so it fits in cache, or
- Rearrange your accesses so you march through memory sequentially instead of jumping around.

These are actually two aspects of "locality", the similarity between the memory locations you access.  The cache lets you reuse stuff you've used recently in time (temporal locality); streaming access is about touching stuff nearby in space (spatial locality).

There are lots of different ways to improve locality:
- Shrink your data, for example by using smaller types (float instead of double, chars or shorts instead of ints), so more of it fits in cache.
- Reorder your loops so the innermost loop walks through memory contiguously (see the sketch below).
- Break big problems into cache-sized blocks ("tiling" or "blocking"), and finish each block before moving on to the next.
- Store data you use together next to each other in memory, instead of scattering it across separate allocations.
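
As a concrete example of the loop-reordering idea, here's a sketch; the array name and sizes are made up for illustration, but it shows why matching the innermost loop to C's row-major memory layout matters.

#include <cstdio>

enum {rows=1024, cols=1024};
static float img[rows][cols]; /* hypothetical ~4MB image */

float sum_rowmajor(void) { /* good spatial locality: consecutive addresses */
	float sum=0;
	for (int r=0;r<rows;r++)
		for (int c=0;c<cols;c++)
			sum+=img[r][c];
	return sum;
}

float sum_colmajor(void) { /* poor locality: each access jumps cols*sizeof(float) bytes */
	float sum=0;
	for (int c=0;c<cols;c++)
		for (int r=0;r<rows;r++)
			sum+=img[r][c];
	return sum;
}

int main(void) {
	printf("row-major sum=%f  column-major sum=%f\n",
		sum_rowmajor(), sum_colmajor());
	return 0;
}

Both functions compute exactly the same answer; only the order of the memory accesses, and hence the cache behavior, differs.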

Set-associative Cache Mapping

Anytime you have a cache, of any kind, you need to figure out what to do when the cache gets full.  Generally, you face this problem when you've got a new element X to load into the cache--which cache slot do you place X into?

The simplest approach in hardware is a "direct-mapped cache", where element X goes into cache slot X%N (where N is the size of the cache), so elements simply wrap around the available slots.  Direct mapping means elements 1 and 2 land in adjacent slots, and you can store up to N elements before addresses start wrapping around and colliding.
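
To make the X%N mapping concrete, here's a tiny sketch of how a direct-mapped cache might pick a slot for a byte address.  The 64-byte line size and 64KB total size are assumptions for illustration, not any particular CPU's real numbers.

enum {LINE_SIZE=64, CACHE_SIZE=64*1024, N_SLOTS=CACHE_SIZE/LINE_SIZE};

unsigned int cache_slot(unsigned long address) {
	unsigned long line = address / LINE_SIZE; /* which cache line holds this byte */
	return line % N_SLOTS; /* direct mapping: lines just wrap around the slots */
}

With these made-up numbers, cache_slot(0x0ABCD) and cache_slot(0x1ABCD) come out identical, because the two addresses are exactly CACHE_SIZE apart--exactly the collision the next example runs into.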

For example, the Pentium 4's L1 cache is 64KB in size and direct-mapped.  This means address 0x0ABCD and address 0x1ABCD (which are 64KB apart) both get mapped to the same place in the cache.  So this program is fast (5.2ns/call):

enum {n=1024*1024};
char arr[n]; 

int foo(void) {
	arr[0]++;
	arr[12345]++;
	return 0;
}

(Try this in NetRun now!)

By contrast, this very similar-looking program is very slow (20+ns/call), because array elements exactly 64KB apart map to the same line of the cache, so the CPU keeps overwriting one with the other.  This is called "cache thrashing", and it makes the cache totally useless:

enum {n=1024*1024};
char arr[n]; 

int foo(void) {
	arr[0]++;
	arr[65536]++;
	return 0;
}

(Try this in NetRun now!)

In general, power-of-two jumps through memory can be very slow on direct-mapped machines.  This is one of the few cases on computers where powers of two are not ideal!  The solution: pad out your data structures so the values you access aren't a large power-of-two distance apart.
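
For instance, here's a sketch of that padding idea applied to the slow program above; I haven't timed this variant, and the 64-byte offset is just an assumed cache-line size, but shifting the second access off the 64KB boundary means the two addresses no longer fight over the same cache line.

enum {n=1024*1024};
char arr[n];

int foo(void) {
	arr[0]++;
	arr[65536+64]++; /* 64KB plus one (assumed) 64-byte line: no longer collides with arr[0] */
	return 0;
}

The same trick scales up: for example, padding a 1024x1024 float matrix out to 1024x1025 keeps successive rows of a column from landing in the same cache slots.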


CS 301 Lecture Note, 2016, Dr. Orion Lawlor, UAF Computer Science Department.