Parallelizing and Load Balancing Parallel Programs

CS 641 Lecture, Dr. Lawlor

So you've got this application.  The application is sequential--it's a normal top-to-bottom C++ program.  You want to make it work on multiple CPUs. 

Step one is to figure out what parts of the program are taking the time--there's no point in parallelizing the idle loop.  If the program is slow because it's waiting for the disk, or network card, or graphics card, STOP, because using multiple CPUs is probably not going to help.
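A quick way to check is to wall-clock time the program's big phases yourself (a real profiler like gprof will give you finer detail).  Here's a minimal sketch using C++11 <chrono>; do_compute is a hypothetical stand-in for one of your program's phases.

#include <chrono>
#include <cmath>
#include <cstdio>

// Hypothetical stand-in for one compute-heavy phase of your program:
void do_compute(void) {
    volatile double sum=0.0;
    for (long i=0;i<100000000;i++) sum+=std::sqrt((double)i);
}

// Return the wall-clock seconds taken by one call to fn:
double time_phase(void (*fn)(void)) {
    auto start=std::chrono::steady_clock::now();
    fn();
    std::chrono::duration<double> elapsed=std::chrono::steady_clock::now()-start;
    return elapsed.count();
}

int main(void) {
    std::printf("compute phase: %.3f seconds\n",time_phase(do_compute));
    return 0;
}

If the compute phases dominate, parallelism can help; if the I/O phases dominate, it probably can't.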

If the program spends most of its time computing (doing arithmetic or memory operations), then using multiple CPUs should be able to speed up the program substantially.  The first and hardest part is splitting up the big glob-o-code into little pieces that can run independently.  Once you've split up the program into pieces, you then just have to run each piece.  Below are several ways to run pieces of code simultaneously.

Splitting up Programs into Pieces

So you've got a big ball of C code.  You want to run pieces of that code on multiple CPUs.  This is almost always possible, but it's almost never trivially easy.
There's often a substantial amount of rethinking required to split a program into pieces.  It's easy, for example, to mess up the division of work so some things get done twice, and others never get done at all!
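For instance, here's one safe way to cut a loop of n iterations into p contiguous pieces.  The rounding in start and end guarantees the pieces tile [0,n) exactly--no iteration gets done twice, and none gets skipped.  The classic bug is giving every piece exactly n/p iterations, which silently drops the leftover n%p.  (std::thread is just one way to run the pieces; the division itself is the point.)

#include <cstdio>
#include <thread>
#include <vector>

const int n=1000; // total loop iterations
const int p=4;    // number of pieces/threads

void do_piece(int t) {
    // Half-open range [start,end) for piece t:
    int start=t*n/p;
    int end=(t+1)*n/p;
    for (int i=start;i<end;i++) {
        /* ... work on iteration i ... */
    }
    std::printf("piece %d did iterations [%d,%d)\n",t,start,end);
}

int main(void) {
    std::vector<std::thread> threads;
    for (int t=0;t<p;t++) threads.emplace_back(do_piece,t);
    for (auto &th:threads) th.join(); // wait for every piece to finish
    return 0;
}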

Stuff that's really hard to do in parallel includes sorting, complicated nonlinear transformations like compiling, and talking to existing sequential code.  It's also usually harmful to try to split up one file's I/O operations into pieces--if you can eventually make the file locking and seeking work reliably, you'll often still get bad performance.  I always try to switch to in-memory operations first, write out to multiple files (which works nicely), or have only one thread do most of the I/O.
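The multiple-files version really is trivial to get right: give each thread its own output file, and merge them afterward if you need just one.  A hypothetical sketch (the filenames are made up):

#include <fstream>
#include <string>
#include <thread>
#include <vector>

void write_results(int t) {
    // Each thread owns its own file--no locking, no seek contention:
    std::ofstream out("results_"+std::to_string(t)+".txt");
    out<<"output from thread "<<t<<"\n";
}

int main(void) {
    std::vector<std::thread> threads;
    for (int t=0;t<4;t++) threads.emplace_back(write_results,t);
    for (auto &th:threads) th.join();
    return 0;
}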

One well-known problem with dividing a problem into pieces is "load balance"--the pieces are almost never all the same size.  For example, if you divide a problem into two pieces for two processors, but the left piece is way bigger, the right processor will spend most of its time waiting for the left one to finish.  If you have 100 processors (the 2017 Intel Centium), it's easy to screw up your work division so that 99 of the processors are waiting for the last one to finish.  If 99 pieces take one second each but the last piece takes ten, the whole run takes ten seconds--a speedup of about 11x on 100 processors, not 100x.  This can destroy the performance improvement you'd get from parallelism, so load imbalance is a bad thing.

Luckily, there's a cool trick for fixing load balance problems--just make way more pieces than you think you'll need ("overdecomposition").  For example, if you've got two processors, make like ten threads.  Then if one thread finishes early, there are still nine more for your processor to choose from.  In general, you'll keep getting good performance until each thread is doing less than a few dozen milliseconds of work (or until the OS can't create you any more threads!).  (See my friend Milind Bhandarkar's thesis for some performance graphs of this.)
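Here's one way to cash in on overdecomposition, sketched with a shared atomic counter as a trivial work queue rather than extra OS threads: each worker keeps claiming the next undone piece, so a fast worker just ends up doing more of them.  (do_piece is a hypothetical stand-in for a few dozen milliseconds of real work.)

#include <atomic>
#include <thread>
#include <vector>

const int n_pieces=100;          // way more pieces than processors
std::atomic<int> next_piece(0);  // index of the next unclaimed piece

void do_piece(int piece) { /* ... a few dozen ms of real work ... */ }

void worker(void) {
    while (true) {
        int piece=next_piece++;     // atomically claim one piece
        if (piece>=n_pieces) break; // no pieces left
        do_piece(piece);
    }
}

int main(void) {
    int nt=std::thread::hardware_concurrency();
    if (nt==0) nt=2; // hardware_concurrency can return 0 ("unknown")
    std::vector<std::thread> threads;
    for (int t=0;t<nt;t++) threads.emplace_back(worker);
    for (auto &th:threads) th.join();
    return 0;
}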

Finally, load balance is usually impossible to fix if you've divided the problem into inherently unequal pieces--for example, if you divide a compiler into three threads (preprocessor, parser, and code generator).  The problem with this "heterogeneous" each-thread-does-a-different-thing approach is that you've only got three threads, so you can't overdecompose.  It also limits you to three CPUs, no matter how many you actually have.  A threaded compiler with one thread per *input file* would be a better approach, since almost any time you're waiting for the compiler, there are lots of different files to compile.
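A sketch of that one-thread-per-input-file structure, with a hypothetical compile_file() standing in for the real compiler guts (a real build tool would cap the thread count, but the point is that files give you plenty of pieces):

#include <cstdio>
#include <string>
#include <thread>
#include <vector>

void compile_file(std::string name) { // hypothetical compiler guts
    std::printf("compiling %s\n",name.c_str());
}

int main(int argc,char *argv[]) {
    std::vector<std::thread> threads;
    for (int i=1;i<argc;i++) // one thread per input file
        threads.emplace_back(compile_file,std::string(argv[i]));
    for (auto &th:threads) th.join();
    return 0;
}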