Shared Memory Performance Pitfalls

CS 441 Lecture, Dr. Lawlor

False Sharing

If you've got more than one thread accessing the same variable, you have "true sharing".  For example, here all the threads are incrementing the same float, values[0]:
volatile float values[100000]; /* values to increment */
void inc_float(volatile float *f) { for (int rep=0;rep<250000;rep++) *f+=1.0f; }
int foo(void)
{
	values[0]=0.0;
#pragma omp parallel for num_threads(2)
	for (int i=0;i<4;i++) {
		inc_float(&values[0]);
	}
	return values[0];
}

(Try this in NetRun now!)

One core: 9.9 ns/iteration.  Two cores: 12.2 ns/iteration (slower!), plus you get the wrong answer, due to the race for values[0].
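
If you really do need every thread to update the same shared variable, you can at least make the updates race-free with OpenMP's atomic construct.  Here's a minimal sketch of that idea (my example, not one of the timing runs above; the hardware still bounces the cache line between cores on every update, so expect correctness, not speed):
float shared_total=0.0f; /* one value, truly shared by every thread */
int foo(void)
{
	shared_total=0.0f;
#pragma omp parallel for num_threads(2)
	for (int i=0;i<4;i++) {
		for (int rep=0;rep<250000;rep++) {
#pragma omp atomic
			shared_total+=1.0f; /* atomic read-modify-write: correct, but still slow */
		}
	}
	return (int)shared_total;
}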

If we give each thread its own value, we get the right answer, because they're no longer racing on the same variable.  But if different threads' values are close enough to share a cache line (typically 64 bytes, or 16 floats), then our performance is still terrible, because the CPUs still have to exchange cache lines repeatedly.  This is called "false sharing".
#pragma omp parallel for num_threads(2)
for (int i=0;i<4;i++) {
	inc_float(&values[i]);
}

(Try this in NetRun now!)

One core: 9.9 ns/iteration.  Two cores: 12.2 ns/iteration, but at least the answer is now correct.

The way to avoid false sharing is to give each thread its own value, placed far from the other threads' values.  Here a stride of 100 floats (400 bytes) guarantees each thread's value sits on its own cache line.
#pragma omp parallel for num_threads(2)
for (int i=0;i<4;i++) {
	inc_float(&values[i*100]);
}

(Try this in NetRun now!)

One core: 9.9 ns/iteration (still).  Two cores: 5.4 ns/iteration.  Now we get at least a little speedup!
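
A stride of 100 is a bit magic; a more explicit way to get the same separation is to pad each thread's value out to a full cache line.  Here's a sketch using C++11 alignas (the 64-byte line size is an assumption about the machine, not something from the runs above):
enum {CACHE_LINE=64}; /* assumed cache line size, in bytes */
struct alignas(CACHE_LINE) padded_float {
	float value; /* each counter now lives alone on its own cache line */
};
padded_float counters[4];
int foo(void)
{
#pragma omp parallel for num_threads(2)
	for (int i=0;i<4;i++)
		for (int rep=0;rep<250000;rep++)
			counters[i].value+=1.0f;
	return (int)(counters[0].value+counters[1].value
	            +counters[2].value+counters[3].value);
}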

Threading Overhead

For tiny problems, you have to worry about the overhead of creating the OpenMP threads:
enum {n=100};
volatile float values[n];
int foo(void)
{
//#pragma omp parallel for num_threads(2)  /* toggle this, and vary num_threads, to measure the overhead */
	for (int i=0;i<n;i++) values[i]=0.0f;
	return 0;
}

(Try this in NetRun now!)

No OpenMP: 97ns.  Basic OpenMP, num_threads==1: 1,000ns.  num_threads==2: 10,000ns.  num_threads==4: 20,000ns.

Clearly, you've got to do at least tens of microseconds of work (tens of thousands of instructions) for threading to be worthwhile!
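
One way to handle this tradeoff at runtime is OpenMP's if clause, which runs the loop in parallel only when there's enough work to pay back the thread startup cost.  A sketch (the 10000-element threshold is made up; tune it by timing your own machine):
enum {n=100};
volatile float values[n];
int foo(void)
{
	/* Run in parallel only if the loop is big enough to be worth it. */
#pragma omp parallel for if(n>10000) num_threads(4)
	for (int i=0;i<n;i++) values[i]=0.0f;
	return 0;
}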

Memory Bandwidth Saturation

For medium-sized problems that fit in cache, speedup is good:
enum {n=1000000};
volatile float values[n];
int foo(void)
{
//#pragma omp parallel for num_threads(4)  /* uncomment for the parallel timings below */
	for (int i=0;i<n;i++) values[i]=0.0f;
	return 0;
}

(Try this in NetRun now!)

Serial: 1.03ms.  Two cores: 0.48ms.  Four cores: 0.28ms.  Nice!

For large problems that don't fit in cache, you're mostly waiting for RAM:
enum {n=10000000};
volatile float values[n];
int foo(void)
{
#pragma omp parallel for num_threads(4)
	for (int i=0;i<n;i++) values[i]=0.0f;
	return 0;
}

(Try this in NetRun now!)

Serial: 21.0ms.  Two cores: 15.8ms.  Four cores: 15.3ms.

Curiously enough, the machine isn't much faster when four cores are waiting for RAM than when two cores are waiting for RAM!  Do the arithmetic: writing 10 million 4-byte floats in 15.3ms is about 2.6GB/s, so two cores already come close to saturating the memory bus.  In general, a "bottleneck" other than the CPU can limit performance: memory, the disk, the network, even the mouse or keyboard.  In those situations, just adding cores isn't going to help!
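
You can check for a memory-bandwidth bottleneck yourself by timing the loop and converting to bytes per second.  A sketch using omp_get_wtime (printing from foo works fine in NetRun; adapt as needed elsewhere):
#include <omp.h>
#include <stdio.h>
enum {n=10000000};
volatile float values[n];
int foo(void)
{
	double start=omp_get_wtime();
#pragma omp parallel for num_threads(4)
	for (int i=0;i<n;i++) values[i]=0.0f;
	double elapsed=omp_get_wtime()-start;
	/* bytes moved / seconds = achieved bandwidth */
	printf("%.1f ms -> %.2f GB/s\n",
		elapsed*1.0e3, n*sizeof(float)/(elapsed*1.0e9));
	return 0;
}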

Universal Fix: Distributed Memory

A curiously universal fix for bottlenecks is just to replicate everything: not just cores, but RAM, disks, and so on.  Such a machine is called a cluster, and its big drawback is that it's hard to run shared-memory programs (like OpenMP) on a bunch of separate machines.  Folks have tried, building software distributed shared memory systems like Cluster OpenMP, but getting good performance for real applications is tough.

Instead, distributed-memory machines are often programmed using explicit message passing, like send and receive calls.
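
For example, here's a minimal message-passing sketch using MPI, the standard library for this style of programming (this example is mine, not from the lecture): rank 0 sends one float to rank 1, and each rank keeps its own separate memory.
#include <mpi.h>
#include <stdio.h>
int main(int argc,char *argv[])
{
	MPI_Init(&argc,&argv);
	int rank=0;
	MPI_Comm_rank(MPI_COMM_WORLD,&rank);
	float v=0.0f; /* every rank has its own copy: no sharing */
	if (rank==0) {
		v=1.0f;
		MPI_Send(&v,1,MPI_FLOAT,1,0,MPI_COMM_WORLD); /* to rank 1, tag 0 */
	} else if (rank==1) {
		MPI_Recv(&v,1,MPI_FLOAT,0,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
		printf("Rank 1 received %.1f\n",v);
	}
	MPI_Finalize();
	return 0;
}

Run it with at least two processes, e.g. "mpirun -np 2 ./a.out"; nothing is shared, so every byte of communication is an explicit send and receive.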