High Performance Programming for Multicore

The whole point of having multiple cores is to improve performance.  But for some problems it's surprisingly hard to get any performance benefit from multicore.  One big reason is this:

      If the program is CPU bound, adding CPUs can help.  If not, it won't.

Many other things can limit the performance of real programs: disk and network I/O, memory latency when chasing pointers through linked data structures, and the overhead of creating and synchronizing threads.

Generally, if you want to avoid all these things and get good performance, you need a big array (or vector) of data sitting in memory in one contiguous chunk, that you can walk right through and compute.  Files and network connections tend to be limited by the OS or hardware, and fancy trees or linked data structures tend to be memory-limited.

The other big factor limiting performance is the difference between bandwidth (wholesale) and latency (retail).

System          Bandwidth                                                 Latency
CPU             Billions of instructions/sec                              a few nanoseconds per branch
Multicore CPU   CPU bandwidth times core count (assuming perfect speedup) 1000 ns or more to create threads
GPU             Trillions of instructions/sec                             4000 ns or more to start a kernel
Memory          10-20 GB/sec (streaming access)                           20 ns (random access)
Disk            Spinning disk: 50-100 MB/s;  SSD: 50-500 MB/s             Spinning disk: 5 ms seek;  SSD: 0.1 ms seek (WAY better!)
Network         Wifi: 54 Mbit/sec;  Gig E: 1000 Mbit/sec                  Wifi: 2 ms;  Gig E: 0.2 ms

For a program that mostly reads memory (such as a search for a winning checkers move, or a faster solution to a scheduling problem), the cores don't interfere with each other and multicore speedup comes fairly easily.  For a program with both reads and writes to memory, it can be hard to beat a single threaded solution, but it is possible!
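A quick sanity check on the memory row above: streaming 4-byte integers at 10-20 GB/sec works out to 2.5-5 billion integers per second, or roughly 0.2-0.4 ns per integer, which is about the best per-item time any of the histogram loops below can hope for.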

Serial Histogram

This is the serial base case, and takes about 0.7 nanoseconds per data item on my Sandy Bridge quad core machine.

const int ndata=1000*1000;
int data[ndata]; /* input values, each in the range [0,nhist) */

const int ncores=4;
const int nhist=1000;
volatile int hist[nhist]; /* bin counts; volatile keeps the compiler from optimizing away the updates */

int build_histo(void) {
	for (int d=0;d<ndata;d++) {
		hist[data[d]]++;
	}
	return 0;
}

(Try this in NetRun now!)
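The per-item times quoted here come from NetRun's built-in timing.  To reproduce them outside NetRun, one rough approach (a sketch, assuming the globals and build_histo above) is to repeat the loop many times and divide by the item count:

#include <stdio.h>
#include <omp.h>

/* assumes ndata, data, hist, and build_histo are defined as above */
int main(void) {
	int repeats=100;
	double start=omp_get_wtime();
	for (int r=0;r<repeats;r++) build_histo();
	double elapsed=omp_get_wtime()-start;
	printf("%.3f ns per item\n",elapsed*1.0e9/(repeats*(double)ndata));
	return 0;
}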

Naive Parallelization Fail

Just slapping in an OpenMP pragma has two effects: the program gets way slower, 4.6ns/item, and it also gets the wrong answer due to the threads overwriting each others' work.

int build_histo(void) {
	#pragma omp parallel for
	for (int d=0;d<ndata;d++) {
		hist[data[d]]++; /* data race: threads overwrite each other's increments */
	}
	return 0;
}

(Try this in NetRun now!)

Adding Critical Section (Lock)

You can at least get the right answer by adding a critical section to the increment.  The only problem is we've actually destroyed all the parallelism, and the cores now have to fight for the critical section lock, so this takes 96 nanoseconds per element--over a 100x slowdown!

OpenMP's critical pragma only supports a fixed set of compile-time named sections, but the OpenMP runtime library (omp_lock_t), like the 'lock' or 'mutex' (mutual exclusion) data structure in most other thread libraries, lets you make an array of locks to reduce contention on the one big lock.  This 'lock splitting' technique can help reduce lock contention, but doesn't change the lock overhead, which is typically dozens of nanoseconds; a sketch of the idea follows the code below.

int build_histo(void) {
	#pragma omp parallel for
	for (int d=0;d<ndata;d++) {
		#pragma omp critical /* only one thread at a time here */
		hist[data[d]]++;
	}
	return 0;
}

(Try this in NetRun now!)
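Here is a minimal sketch of the lock-splitting idea using the OpenMP runtime's omp_lock_t.  The lock count of 64 is an arbitrary choice for illustration; with more locks there is less contention, but every increment still pays the lock overhead:

#include <omp.h>
const int ndata=1000*1000;
int data[ndata];

const int nhist=1000;
volatile int hist[nhist];

const int nlocks=64;      /* arbitrary; more locks means less contention per lock */
omp_lock_t locks[nlocks]; /* lock b%nlocks guards histogram bin b */

int build_histo(void) {
	for (int i=0;i<nlocks;i++) omp_init_lock(&locks[i]);
	#pragma omp parallel for
	for (int d=0;d<ndata;d++) {
		int b=data[d];
		omp_set_lock(&locks[b%nlocks]);   /* exclude only threads touching nearby bins */
		hist[b]++;
		omp_unset_lock(&locks[b%nlocks]);
	}
	for (int i=0;i<nlocks;i++) omp_destroy_lock(&locks[i]);
	return 0;
}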

Adding Atomic Access

The hardware actually supports 'atomic' multithread-safe versions of a few instructions, like integer addition.  This does some magic at the cache line level to guarantee exclusive access to just that one memory location.  It's much finer grained than a critical section, which excludes all other threads from an entire block of code, so it's quite a bit faster, down to 6.1ns per element.

Confusingly, 'atomic' operations are implemented using the x86 'lock' prefix, but today a 'lock' has come to mean a much slower library-supported critical section.  Atomics have been getting much faster on recent hardware, and GPUs recently added a 'zero penalty atomic' that somehow runs at the same speed as normal arithmetic.  It still uses more bus traffic, or else I think they'd just make all operations atomic, and eliminate a huge class of multithreaded problems!

int build_histo(void) {
	#pragma omp parallel for
	for (int d=0;d<ndata;d++) {
		#pragma omp atomic /* perform this operation 'atomically' */
		hist[data[d]]++;
	}
	return 0;
}

(Try this in NetRun now!)
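The same hardware support is exposed outside OpenMP too.  Here is a rough equivalent, a sketch assuming a C++11 compiler: on x86, a relaxed std::atomic fetch_add compiles down to the same lock-prefixed add:

#include <atomic>
#include <omp.h>
const int ndata=1000*1000;
int data[ndata];

const int nhist=1000;
std::atomic<int> hist[nhist]; /* each bin update is an atomic read-modify-write */

int build_histo(void) {
	#pragma omp parallel for
	for (int d=0;d<ndata;d++) {
		hist[data[d]].fetch_add(1,std::memory_order_relaxed); /* atomicity only, no ordering needed */
	}
	return 0;
}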

 

Privatize Data

Sharing is bad.  Shared data needs to be accessed in some funky way, or you run the risk of another thread overwriting your work.  

Separate data is good.  Separate data is always fast (no cache thrashing), always correct (no multicore timing issues), and just generally works like the good old days when there was only one thread.

So we can give each core a separate area in memory to build its own histogram.  Done poorly, indexed like hist[nhist][ncores], the different cores fight for cache lines and you get poor performance due to false sharing (cache coherence thrashing).  Done properly, indexed like hist[ncores][nhist] so each core's data is contiguous, this is very fast, about 0.3ns/item, even including the time to merge the copies afterwards.

#include <omp.h>
const int ndata=1000*1000;
int data[ndata];

const int ncores=4; // FIXME: use omp_get_max_threads()
const int nhist=1000;
volatile int hist[ncores][nhist]; // per-core histograms

int build_histo(void) {
	// Sum into local arrays
	#pragma omp parallel for
	for (int d=0;d<ndata;d++) {
		int c=omp_get_thread_num();
		hist[c][data[d]]++;
	}
	// Sum across cores, to core 0
	for (int i=0;i<nhist;i++)
		for (int c=1;c<ncores;c++)
			hist[0][i]+=hist[c][i];
	return 0;
}

(Try this in NetRun now!)

I hardcoded "ncores" to 4 above, for clarity.  In a real program, you need to dynamically allocate the "hist" array based on how many cores actually exist, like this:

#include <omp.h>
int ncores=omp_get_max_threads();
const int nhist=1000;
typedef volatile int histarray[nhist]; /* volatile element type, as in the static version */
histarray *hist=new histarray[ncores]; /* one nhist-bin histogram per thread */

 

Also, to prevent cache thrashing at the boundaries between threads, you should pad histarray by at least 16 integers (64 bytes, a full cache line), like "typedef volatile int histarray[nhist+32];", which yields about a 10% performance boost.
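Putting the dynamic allocation and the padding together, here is one way the whole function might look (a sketch under the same assumptions as above, not a NetRun-timed version):

#include <omp.h>
const int ndata=1000*1000;
int data[ndata];

const int nhist=1000;
typedef volatile int histarray[nhist+32]; /* pad past a 64-byte cache line to avoid false sharing */

int build_histo(void) {
	int ncores=omp_get_max_threads();
	histarray *hist=new histarray[ncores];
	for (int c=0;c<ncores;c++)      /* heap memory isn't zeroed for us like the globals were */
		for (int i=0;i<nhist;i++) hist[c][i]=0;
	#pragma omp parallel for
	for (int d=0;d<ndata;d++)
		hist[omp_get_thread_num()][data[d]]++;
	for (int i=0;i<nhist;i++)       /* merge per-core counts into core 0's copy */
		for (int c=1;c<ncores;c++)
			hist[0][i]+=hist[c][i];
	delete[] hist;                  /* in a real program, use the merged counts before freeing */
	return 0;
}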

Generally speaking, the highest performance solution will always be to make separate copies.  However, the problem is then merging the copies.  For example, in a database program, keeping track of 4-32 separate copies of the database would be a coherence nightmare--essentially the underlying multicore cache coherence problem pushed up one more level!
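For the histogram itself, note that newer OpenMP versions (4.5 and later, so after this note was written) can express this privatize-and-merge pattern directly with an array reduction, so the runtime creates and merges the per-thread copies for you.  A sketch, assuming a compiler with OpenMP 4.5 support:

#include <omp.h>
const int ndata=1000*1000;
int data[ndata];

const int nhist=1000;
int hist[nhist]; /* plain int: the runtime privatizes it, so no volatile or padding tricks needed */

int build_histo(void) {
	/* Each thread increments a private zero-initialized copy of hist;
	   the copies are summed back into the shared array when the loop ends. */
	#pragma omp parallel for reduction(+:hist[0:nhist])
	for (int d=0;d<ndata;d++) {
		hist[data[d]]++;
	}
	return 0;
}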

 


CS 441 Lecture Note, 2014, Dr. Orion Lawlor, UAF Computer Science Department.