Parallelism in Practice

CS 321 2007 Lecture, Dr. Lawlor

I promise this is the last lecture you'll hear on parallelism--I think it's a really interesting and important topic, but there is a lot more material to cover this semester.

So you've got this application.  The application is sequential--it's a normal top-to-bottom C++ program.  You want to make it work on multiple CPUs. 

Step one is to figure out what parts of the program are taking the time--there's no point in parallelizing the idle loop.  If the program is slow because it's waiting for the disk, or network card, or graphics card, STOP, because using multiple CPUs is probably not going to help.
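
If the time really is going into computation, one quick way to confirm which part is to time the suspect region yourself.  Here's a sketch, assuming a POSIX machine (compute_stuff is a hypothetical stand-in for the code you suspect; a real profiler like gprof works too, and older Linux systems may need -lrt for clock_gettime):

#include <stdio.h>
#include <time.h> /* for clock_gettime (POSIX) */

// Return the current wall-clock time, in seconds.
double time_in_seconds(void) {
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC,&ts);
	return ts.tv_sec+1.0e-9*ts.tv_nsec;
}
...
double start=time_in_seconds();
compute_stuff(); /* hypothetical stand-in for the suspect code */
printf("compute_stuff took %.3f seconds\n",time_in_seconds()-start);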

If the program spends most of its time computing (doing arithmetic or memory operations), then using multiple CPUs should be able to speed up the program substantially.  The first and hardest part is splitting up the big glob-o-code into little pieces that can run independently.  Once you've done that, you just have to run each piece.  Below are several ways to run pieces of code simultaneously.

Splitting up Programs into Pieces

So you've got a big ball of C code.  You want to run pieces of that code on multiple CPUs.  This is almost always possible, but it's almost never easy.
There's often a substantial amount of rethinking required to split a program into pieces.  It's easy, for example, to mess up the division of work so some things get done twice, and others never get done at all!
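
For example, here's the standard way to divide n items into p pieces so every item gets handled exactly once no matter how the rounding falls out--a sketch, with process standing in for the real per-item work:

// Piece number i (of p total pieces) handles items [start,end).
// Adjacent pieces meet exactly, so nothing is skipped or done twice.
int start=(int)((long)i*n/p);
int end=(int)((long)(i+1)*n/p);
for (int j=start;j<end;j++)
	process(j); /* stand-in for the real per-item work */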

Stuff that's really hard to do in parallel includes sorting, compiling, and talking to existing sequential code.  It's also usually a bad idea to try to split up file I/O operations into pieces--if you can eventually make the file locking and seeking work reliably, you'll often still get bad performance.  I always try to switch to in-memory operations first, or have only one thread do most of the I/O.

One well-known problem with dividing a problem into pieces is "load balance"--the pieces are almost never all the same size.  For example, if you divide a problem into two pieces for two processors, but the left piece is way bigger, the right processor will spend most of its time waiting for the left one to finish.  If you have 100 processors (the 2017 Intel Centium), it's easy to screw up your work division so that 99 of the processors are waiting for the last one to finish.  This can destroy the performance improvement you'd get from parallelism, so load imbalance is a bad thing.

Luckily, there's a cool trick for fixing load balance problems--just make way more pieces than you think you'll need ("overdecomposition").  For example, if you've got two processors, make like ten threads.  Then if one thread finishes early, there are still nine more for that processor to choose from.  In general, overdecomposition gives good performance until each piece is doing less than a few dozen milliseconds of work (below that, thread creation and switching overhead starts to dominate), or until the OS can't create you any more threads!
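
The easiest way to hand out all those extra pieces is a shared "next piece" counter--each thread grabs the next unclaimed piece, works on it, and comes back for more.  Here's a sketch, with n_pieces and do_piece standing in for your real decomposition:

#include <pthread.h>

int next_piece=0; /* index of the next unclaimed piece */
pthread_mutex_t lock=PTHREAD_MUTEX_INITIALIZER;

void *work(void *arg) {
	while (1) {
		pthread_mutex_lock(&lock);
		int piece=next_piece++; /* claim the next piece */
		pthread_mutex_unlock(&lock);
		if (piece>=n_pieces) return 0; /* all pieces claimed--we're done */
		do_piece(piece); /* stand-in for the real work */
	}
}

Fast threads automatically end up doing more pieces than slow ones, which is exactly what fixes the load imbalance.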

Load balance is usually impossible to fix if you've divided the problem into inherently unequal pieces--for example, if you divide a compiler into three threads (preprocessor, syntax parse, and code generation).  The problem with this "heterogeneous" each-thread-does-a-different-thing approach is that you've only got three threads, so you can't overdecompose.  You're also limited in how many CPUs you can use--three threads can keep at most three CPUs busy.  A threaded compiler with one thread per *input file* would be a better approach, since almost anytime you're waiting for the compiler, there are lots of different files to compile.

Parallelism Using Threads (UNIX)

#include <pthread.h> /* for pthread_create and pthread_join */

// The work routine the threads run must look like: void *work(void *arg);
// ... prepare program to be run in pieces ...
// Run the pieces as separate threads:
pthread_t *thr=new pthread_t[n];
for (int i=0;i<n;i++) {
	worker *w=new ... i'th piece of work ...
	pthread_create(&thr[i],0,work,w);
}
// Wait until all pieces are finished (until each work routine returns)
for (int i=0;i<n;i++) {
	void *ret;
	pthread_join(thr[i],&ret);
}
You do have to be very careful with threads that the workers don't accidentally overwrite each other's work, since all memory is shared between threads.  Sometimes this requires locking, but usually it just requires making a separate copy of each shared variable for each thread.  Keeping track of all those separate copies across a big program can be a real pain.
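
For example, here's the usual fix for a shared accumulator: give each worker its own private copy, and combine the copies only after pthread_join.  This is just a sketch (value stands in for the real computation):

struct worker {
	int start,end; /* my range of the input */
	double partial; /* my private copy of the accumulator */
};
void *work(void *arg) {
	worker *w=(worker *)arg;
	w->partial=0.0;
	for (int i=w->start;i<w->end;i++)
		w->partial+=value(i); /* stand-in for the real computation */
	return 0;
}
// After all the pthread_joins, the main thread adds up the w->partial values.

No locking is needed here, because each thread writes only into its own worker struct.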

Parallelism Using Processes

Unlike threads, processes don't share memory by default.  This substantially reduces the number of places you have to worry about race conditions and other parallel problems.  However, there are times when you want to share some memory between processes--for example, to collect the pieces of the final result.  You must allocate these shared regions in a special way--on UNIX, you can make a shared memory region really easily using mmap, or with a bit more code using shmget/shmat (see Beej's shm example).  On Windows, you can create a shared-memory region with CreateFileMapping and then MapViewOfFile (see the codeproject example).

Here's my UNIX example (works on Linux and Mac OS X, but not Windows):
#include <stdio.h> /* for printf */
#include <stdlib.h> /* for exit */
#include <unistd.h> /* for fork */
#include <sys/mman.h> /* for mmap */
#include <sys/wait.h> /* for waitpid */

// Return len bytes of RAM that will automagically be shared after a fork().
void *shared_malloc(int len) {
	void *p=mmap(0,len, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANON, -1,0);
	if (p==MAP_FAILED) printf("mmap failed!\n");
	return p;
}
void shared_free(void *p,int len) {
	munmap(p,len);
}

...
// ... prepare program to be run in pieces ...
// ... call shared_malloc for memory shared between pieces ...
// Run the pieces as separate processes:
int *pid=new int[n];
for (int i=0;i<n;i++) {
	worker *w=new ... i'th piece of work ...
	pid[i]=fork();
	if (pid[i]==0) { /* I'm the child laborer */
		work(w);
		exit(0);
	}
}
// Wait until all pieces are finished (until children exit)
for (int i=0;i<n;i++) {
	int ret;
	waitpid(pid[i],&ret,0);
}
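
Putting shared_malloc and fork together, here's a sketch of collecting one result per child--do_piece again stands in for the real work:

int *result=(int *)shared_malloc(n*sizeof(int));
for (int i=0;i<n;i++) {
	pid[i]=fork();
	if (pid[i]==0) { /* child: do my piece, deposit the answer, and quit */
		result[i]=do_piece(i); /* each child writes only its own slot */
		exit(0);
	}
}
for (int i=0;i<n;i++) waitpid(pid[i],0,0);
// result[0..n-1] now holds every child's answer.
shared_free(result,n*sizeof(int));

Because each child writes a different slot of the shared region, no locking is needed.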

Parallelism Using the Network

There are also several stranger ways to communicate between different processes, including MPI calls and TCP sockets--both work even when the processes run on entirely different machines.  We will talk about the network later this semester.
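
For a taste, though, here's about the smallest useful MPI program--a sketch, assuming you have an MPI implementation installed, with do_piece once again standing in for the real work.  Each copy of the program runs as a separate process, possibly on a separate machine:

#include <mpi.h>

int main(int argc,char *argv[]) {
	MPI_Init(&argc,&argv); /* start up MPI */
	int rank,size;
	MPI_Comm_rank(MPI_COMM_WORLD,&rank); /* which process am I? */
	MPI_Comm_size(MPI_COMM_WORLD,&size); /* how many processes total? */
	do_piece(rank,size); /* do the rank'th of size pieces of work */
	MPI_Finalize(); /* shut down MPI */
	return 0;
}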