Cache

CS 441 Lecture, Dr. Lawlor
Here's an inner loop that does something funny--it jumps around (using "loc") inside an array called "buf", but only within the bounds established by "mask".  Like any loop, each iteration takes some amount of time; but what's suprising is that there's a very strong dependence of the speed on the value of "mask", which establishes the size of the array we're jumping around in.

(Executable NetRun Link)
	for (i=0;i<max;i++) { /* jump around in buffer, incrementing as we go */
sum+=buf[loc&mask]++;
loc+=1234567;
}
Here's the performance of this loop, in nanoseconds per iteration, as a function of the array size (as determined by "mask").
Size (KB) 2.8GHz P4 1.6GHz P-M 2.2Ghz Athlon64 2.0GHz PPC G5 900MHz P3 300MHz PPC
1.02 4.92 4.88 2.27 5.53 17.47 16.03
2.05 4.6 4.87 2.28 5.53 17.47 16.03
4.1 4.72 4.9 2.28 5.18 17.48 16.03
8.19 4.83 4.91 2.28 3.6 17.49 16.03
16.38 4.75 4.91 2.28 3.6 17.52 16.03
32.77 6.59 4.84 2.28 3.63 21.57 16.03
65.54 6.84 10.1 2.29 5.31 21.58 16.64
131.07 6.92 10.11 5.26 5.31 21.97 40.31
262.14 7.13 10.11 6.92 5.31 98.28 40.34
524.29 8.48 10.07 10.13 23.98 144.04 52.33
1048.58 19.33 10.43 38.95 44.59 153.2 49.86
2097.15 54.33 28.87 76.15 99.11 156.86 144.76
4194.3 76.31 85.3 78.05 112.55 157.32 256.09
8388.61 75.33 111.43 78.81 210.04 159.43 342.73
16777.22 77.49 120.39 81.77 214.19 166.52 166.52
33554.43 77.93 126.73 81.56 208.21 168.58 168.58

I claim each performance plateau corresponds to a chunk of hardware.  Note that there are three jumps in the timings:
Note that machines have been getting faster and faster (see the slow machines on the right), but RAM isn't much faster nowadays!

DRAM Hardware

DRAM cells are basically little capacitors, and their charge does slowly bleed off over a timescale of milliseconds; internally they need to be "refreshed" (read off and re-written), which special circuitry performs.

There's more detail than you'd ever want to know about DRAM chip interfacing: RAS/CAS, EDO, FPM, DDR, etc at Ars Technica

Charged DRAM cells are slightly light sensitive.  This means you can actually build a camera using a RAM chip as a sensor (this was much easier back in the day, when chips were often packaged in a little metal "can" instead of being semi-permanently cast into plastic or ceramic).  The downside is the sensors have a native 1-bit resolution (real black-and-white), so even to get grayscale you've got to take several exposures and average.  The upside is that the chips in a $50 1GB (8 gigabit) DIMM could be used to build a 8 *gigapixel* camera (in literal on-or-off black and white, however), and the thing could be read out at several frames per second.  If you were willing to restrict the focal area and resolution, you could easily get hundreds of thousands of frames per second, although you'd probably need a very bright lightsource.

Memory Speed

In general, memory accesses have performance that's:
So if you're getting bad performance, you can either:
These are actually two aspects of "locality": the values you access should be similar.  The cache lets you reuse stuff you've used recently in time (temporal locality); streaming access is about touching stuff nearby in space (spatial locality).

There are lots of different ways to improve locality:
Here's some actual size-vs-stride memory performance I measured on a Core2 Duo CPU using this NetRun program.

Size   \   Stride
2 11 43 155 546
1.0 KB 2.96ns 2.97ns 2.98ns 2.96ns 2.96ns
4.0 KB 3.03ns 3.03ns 3.10ns 3.10ns 3.06ns
16.0 KB 3.04ns 3.05ns 3.10ns 3.10ns 3.06ns
64.0 KB 3.03ns 3.10ns 3.84ns 3.47ns 3.24ns
256.0 KB 3.03ns 3.10ns 3.82ns 3.96ns 3.96ns
1024.0 KB 3.03ns 3.18ns 4.57ns 4.01ns 4.53ns
4096.0 KB 3.05ns 3.21ns 22.19ns 31.05ns 35.02ns
16384.0 KB 3.03ns 3.27ns 22.90ns 42.74ns 43.54ns

Note that up-to-256KB access is always fast (good temporal locality; those 256 Kbytes just get hit over and over).  Similarly, stride-2 access is always fast (good spatial locality; the next access is just 2 bytes from the previous access).  The only time memory access is slow is when jumping around in a big buffer--and it's 10x slower when you do that!

2D Arrays implemented as 1D Arrays

C and C++ don't actually support "real" 2D arrays.

For example, here's some 2D array code--check the disassembly
int arr[4][3];
int foo(void) {
arr[0][0]=0xA0B0;
arr[0][1]=0xA0B1;
arr[0][2]=0xA0B2;
arr[1][0]=0xA1B0;
arr[2][0]=0xA2B0;
arr[3][0]=0xA3B0;
return 0;
}
(executable NetRun link)

Here, arr[i][j] is at a 1D address like arr[i*3+j].  Here's a picture of the array, and the 1D addresses of each element:

i==
0
1
2
3
j==0
[0]
[3]
[6]
[9]
j==1
[1]
[4]
[7]
[10]
j==2
[2]
[5]
[8]
[11]

Note that adjacent j indices are actually adjacent in memory; adjacent i indices are separated by 3 ints.  There are lots of ways to say this same thing:
In general, when you write
    int arr[sx][sy];
    arr[x][y]++;

The compiler turns this into a 1D array like so:
    int arr[sx*sy];
    arr[x*sy+y]++;

Here y is the fast index. Note that the x coordinate is multiplied by the y size.  This is because between x and x+1, there's one whole set of y's.  Between y and y+1, there's nothing!  So you want your *innermost* loop to be y, since then that loop will access contiguous array elements.

We could also have written
    int arr[sy][sx];
    arr[y][x]++;

Which in 1D notation is:  
    int arr[sy*sx];
    arr[x+y*sx]++;

Now x is the fast index, and the y coordinate is multiplied by the x size.  You now want your innermost loop to be x, since adjacent x values are contiguous in memory.

For example, this program has terrible performance-- 75ns/pixel:
(Executable NetRun Link)
enum {sx=512};
enum {sy=512};
double img[sx*sy];

int inc_img(void) {
for (int x=0;x<sx;x++)
for (int y=0;y<sy;y++)
img[x+y*sx]++;
return 0;
}

int foo(void) {
double t=time_function(inc_img);
printf("%.3f ns/pixel\n",t/(sx*sy)*1.0e9);
return 0;
}
We can speed the program up by either: