Hardware Implementation of Shared Memory

CS 441 Lecture, Dr. Lawlor

False Sharing and Cache Thrashing

If two threads try to write to the *same* variable at the same time, you can get the wrong answer.

But surprisingly, if two threads try to write to *different* but nearby variables at the same time, you still get the right answer, but you take a big performance hit!
enum {n=2}; /* number of threads */
enum {m=1000*1000}; /* number of times to access the int */
int offset=1; /* distance, in ints, between threads' accesses */
volatile int arr[1025];

void hammer_on_int(volatile int *ptr) { /* increment one int m times */
    for (int i=0;i<m;i++)
        (*ptr)++;
}
int multi_hammer(void) { /* run n threads, each hammering on its own int */
#pragma omp parallel for
    for (int thread=0;thread<n;thread++) {
        hammer_on_int(&arr[thread*offset]);
    }
    return 0;
}
int foo(void) {
    for (offset=1;offset<=1024;offset*=2) /* time_function is NetRun's timing helper (seconds per call) */
        printf("%d-byte offset took %.3f ns/++\n",
            (int)(sizeof(int)*offset),time_function(multi_hammer)*1.0e9/m);
    return 0;
}

(Try this in NetRun now!)

This program prints out:
4-byte offset took 19.437 ns/++
8-byte offset took 19.304 ns/++
16-byte offset took 20.442 ns/++
32-byte offset took 22.939 ns/++
64-byte offset took 2.615 ns/++
128-byte offset took 2.601 ns/++
256-byte offset took 2.598 ns/++
512-byte offset took 2.572 ns/++
1024-byte offset took 2.857 ns/++
2048-byte offset took 2.664 ns/++
4096-byte offset took 2.571 ns/++
Program complete. Return 0 (0x0)
When two threads are working on data within the same 64-byte cache line, there's a substantial slowdown, nearly an order of magnitude in the run above.

Snoopy Cache Coherence Protocol

The problem is the cache coherence protocol the two CPUs use to ensure that writes to different locations will combine properly.

In particular, most modern machines use the MESI variant of MSI (go read 'em!).  These protocols have line granularity, so if two CPUs are accessing data near each other (within one cache line), you'll get "cache thrashing" as the CPUs have to hand the line back and forth.
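To see why, here's a minimal sketch of the MESI line states and the transitions that matter for false sharing (an illustrative model, not any particular CPU's implementation):

/* Each cache line is Modified, Exclusive, Shared, or Invalid. */
enum mesi { MODIFIED, EXCLUSIVE, SHARED, INVALID };

enum mesi local_read (enum mesi s,int other_caches_have_it) {
    if (s==INVALID)                      /* miss: fetch the line over the bus */
        return other_caches_have_it ? SHARED : EXCLUSIVE;
    return s;                            /* M, E, S all satisfy the read locally */
}
enum mesi local_write(enum mesi s) {
    /* From S or I we must first broadcast an invalidate so no other
       cache keeps a stale copy; either way we end up Modified. */
    return MODIFIED;
}
enum mesi snoop_read (enum mesi s) {     /* another CPU reads this line */
    if (s==MODIFIED || s==EXCLUSIVE)     /* M also writes its data back first */
        return SHARED;
    return s;
}
enum mesi snoop_write(enum mesi s) {     /* another CPU writes this line */
    return INVALID;                      /* our copy (if any) is now stale */
}

With two threads incrementing ints in the same line, every write forces the other cache's copy from Modified to Invalid, so the line bounces back and forth across the bus on nearly every increment.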

The solution to this "false sharing" problem is to just separate data accessed by multiple threads--they've got to be at least one cache line apart for good performance.
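One common fix is to pad each thread's data out to a full cache line.  Here's a sketch, assuming the usual 64-byte line size (query the hardware if you need to be sure):

enum {CACHE_LINE=64}; /* assumed line size; 64 bytes is typical on current x86 */
struct padded_counter {
    volatile int value;
    char pad[CACHE_LINE-sizeof(int)]; /* fill out the rest of the line */
};
struct padded_counter counters[2]; /* one per thread; elements are 64 bytes apart, so they never share a line */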

Non-Uniform Memory Access (NUMA): Scalable Cache Coherence

Larger machines, where there is no shared bus, usually have non-uniform memory access: local memory is faster to access than remote memory.  To keep track of the caches, NUMA machines usually use some form of directory protocol, where there's a tiny stub associated with each potential cache line in memory, telling you the CPU responsible for keeping track of that cache line, or -1 if it's uncached.
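As a rough sketch (the field layout here is hypothetical, not any real machine's format), each directory entry just records who has the line:

typedef struct {
    int owner;             /* CPU holding the line exclusively (dirty), or -1 if uncached */
    unsigned long sharers; /* bitmask of CPUs holding read-only copies */
} directory_entry;

On a write miss, the home node looks up the entry and sends invalidations only to the owner, or to the CPUs whose bits are set in sharers, instead of broadcasting on a shared bus.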

Software Distributed Shared Memory (SDSM)

You can even write your own distributed shared memory cache coherence protocol, using local RAM as a cache for remote RAM.  Typically the "cache line" size is the same as the size of a hardware page, say 4KB for x86.  This is pretty big, so to avoid false sharing on SDSM systems, you need to make sure different threads' data is many kilobytes separated!
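The usual trick on UNIX-like systems is mprotect plus a SIGSEGV handler: keep "remote" pages unreadable, and on the first touch fetch the data and unprotect the page.  Below is a single-process sketch of just that mechanism; the fetch_remote_page stub simply fills the page with a constant, where a real SDSM system would pull the bytes over the network.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *shared_page; /* one page standing in for "remote" memory */
static long page_size;

static void fetch_remote_page(char *page) {
    /* A real SDSM system would pull the page's bytes over the network here. */
    memset(page,42,page_size);
}

static void fault_handler(int sig,siginfo_t *si,void *ctx) {
    char *addr=(char *)si->si_addr;
    if (addr>=shared_page && addr<shared_page+page_size) {
        fetch_remote_page(shared_page); /* "cache miss": fetch the whole page */
        /* mprotect in a signal handler isn't strictly async-signal-safe,
           but this is the standard SDSM implementation trick in practice. */
        mprotect(shared_page,page_size,PROT_READ|PROT_WRITE);
    } else {
        _exit(1); /* a genuine crash, not one of our page faults */
    }
}

int main(void) {
    page_size=sysconf(_SC_PAGESIZE);
    shared_page=(char *)mmap(0,page_size,PROT_NONE, /* start out "uncached" */
        MAP_PRIVATE|MAP_ANONYMOUS,-1,0);

    struct sigaction sa;
    memset(&sa,0,sizeof(sa));
    sa.sa_flags=SA_SIGINFO;
    sa.sa_sigaction=fault_handler;
    sigaction(SIGSEGV,&sa,0);

    /* This read faults, the handler "fetches" the page, and the read retries. */
    printf("first byte of shared page = %d\n",shared_page[0]);
    return 0;
}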