Parallelizing and Load Balancing Parallel Programs
CS 641 Lecture, Dr. Lawlor
So you've got this application. The application is
sequential--it's a normal top-to-bottom C++ program. You want to
make it work on multiple CPUs.
Step one is to figure out what parts of the program are taking the
time--there's no point in parallelizing the idle loop. If the
program is slow because it's waiting for the disk, or network card, or
graphics card, STOP, because using multiple CPUs is probably not going to help.
If the program spends most of its time computing (doing arithmetic or
memory operations), then using multiple CPUs should be able to speed up
the program substantially. The first and hardest part is
splitting up the big glob-o-code into little pieces that can run
independently. Once you've split up the program into pieces, you then
just have to run each piece. Below are several ways to run pieces
of code simultaniously.
Splitting up Programs into Pieces
So you've got a big ball of C code. You want to run pieces of
that code on multiple CPUs. This is almost always possible, but
it's almost never trivially easy.
There's often a substantial amount of rethinking required to split a
program into pieces. It's easy, for example, to mess up the
division of work so some things get done twice, and others never get
done at all!
- For programs that are computing a big array of stuff, you can
compute pieces of the array separately. For example, most graphics programs
can divide up the image (or 3D model) into pieces.
- For programs that search over a big area for stuff, you can search subareas separately.
- For programs that total up a bunch of stuff, you can usually
total up the pieces independently, and then come up with a grand total
at the end.
Stuff that's really hard to do in parallel includes sorting, complicated nonlinear transformations like compiling,
and talking to existing sequential code. It's also usually harmful to try to split up one file's I/O operations
into pieces--if you can eventually make the file locking and seeking
you'll often still get bad performance. I always try to switch to
in-memory operations first, write out to multiple files (which works nicely), or have only one thread do most of the I/O.
One well-known problem with dividing a problem into pieces is "load
balance"--the pieces are almost never all the same size. For
example, if you divide a problem into two pieces for two processors,
but the left piece is way bigger, the right processor will spend most
of its time waiting for the left one to finish. If you have 100
processors (the 2017 Intel Centium), it's easy to screw up your work
division so that 99 of the processors are waiting for the last one to
finish. This can destroy the performance improvement you'd get
from parallelism, so load imbalance is a bad thing.
Luckily, there's a cool trick for fixing load balance problems--just
make way more pieces than you think you'll need
("overdecomposition"). For example, if you've got two processors,
make like ten threads. Then if one thread finishes early, there
still are nine more for your processor to choose from. In
general, you'll get good performance from threads until the threads are
doing less than a few dozen milliseconds of work (or until the OS can't
create you any more threads!). (See my friend Milind Bhandarkar's thesis for some performance graphs of this.)
Finally, load balance is usually impossible to fix if you've divided the problem into inherently
unequal pieces--for example, if you divide a compiler into three
threads (preprocessor, syntax parse, and code generation). The
problem with this "heterogenous" each-thread-does-a-different-thing
approach is that you've only got three threads, so you can't
overdecompose. You're also limited to how many CPUs you can
support. Doing a threaded compiler that has one thread per *input
file* would be a better approach, since almost anytime you're waiting
for the compiler, there are lots of different files to compile.