Distributed-Memory Parallel Programming with MPI
CS 641 Lecture, Dr. Lawlor
First, read this basic MPI tutorial, which has a good survey of all the MPI routines. The gory details are in the MPI 1.1 standard,
which includes some examples and is fairly readable for a standard (that's not saying much: most standards are
hideously unreadable).
Pay particular attention to the "big eight" MPI
functions:
- MPI_Init and MPI_Finalize:
these set up and tear down MPI. MPI_Init sets up your
environment--lots of stuff, from command-line arguments to file I/O,
doesn't work right until you call MPI_Init, so call it before you do
anything. Also, you absolutely MUST call MPI_Finalize under every
possible exit path, or many MPI implementations won't kill the other
processes correctly, leaving zombies to prey on living processes.
Because of these dangers, every MPI main program should start with
*exactly* this code:
#include <mpi.h>
#include <stdlib.h> /* for atexit */
void my_exit_fn(void) {MPI_Finalize();}
int main(int argc,char *argv[])
{
    MPI_Init(&argc,&argv);
    atexit(my_exit_fn);
    ... now start actual work ...
}
- MPI_Comm_size and MPI_Comm_rank:
these return your process number (your "rank"), and the number of
processes (the "size") in a "communicator", which is almost always just
the whole machine, called MPI_COMM_WORLD. Here's how you get your
rank and size:
int rank,size;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
- MPI_Send and MPI_Recv:
these "point to point" functions just send bytes from one place to
another. They're the meat and potatoes of MPI. The
arguments give a contiguous list of data in memory, of a fixed length,
of a given data type,
being sent to a given destination rank, with any integer "tag" you like
(a tag of zero works fine), on a communicator (almost always
MPI_COMM_WORLD). MPI_Recv takes an "MPI_Status" pointer, where
MPI can stash the actual received message length and source
process. I remember the first six arguments for the send/receive
calls, MPI_Send(void* buffer, int count, MPI_Datatype datatype, int processor_dest, int tag, MPI_Comm comm); using the mnemonic "Bob Can't Do Peanuts with That Crap" (no offence to Bob).
- MPI_Bcast and MPI_Reduce:
these "collective" functions broadcast data from one processor out to
every processor, or reduce data from all processors to one "master"
processor. (There's a small sketch of both after the send/receive example below.)
Those are really the only functions you need to learn in MPI 1.1;
the rest are just small variations on those themes.
For example, here's an idiomatic MPI program where the first process sends one integer to the last process:
#include <mpi.h> /* for MPI_ functions */
#include <stdio.h> /* for printf */
#include <stdlib.h> /* for atexit */
void call_finalize(void) {MPI_Finalize();}
int main(int argc,char *argv[]) {
    MPI_Init(&argc,&argv);
    atexit(call_finalize); /*<- important to avoid weird errors! */
    int rank=0,size=1;
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    MPI_Comm_size(MPI_COMM_WORLD,&size);
    int tag=17; /*<- random integer ID for this message exchange */
    if (rank==0) {
        int val=1234;
        MPI_Send(&val,1,MPI_INT, size-1,tag,MPI_COMM_WORLD);
        printf("Rank %d sent value %d\n",rank,val);
    }
    if (rank==size-1) {
        MPI_Status sts;
        int val=0;
        MPI_Recv(&val,1,MPI_INT, MPI_ANY_SOURCE,tag,MPI_COMM_WORLD,&sts);
        printf("Rank %d received value %d\n",rank,val);
    }
    return 0;
}
You can copy this same program on the UAF powerwall from ~olawlor/641_demo/simplempi/.
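The collective calls follow the same basic pattern. Here's a minimal sketch (not from the demo directory; the problem, the variable names, and the use of MPI_SUM are my own choices) where rank 0 broadcasts a problem size to everyone, each rank totals its share of the work, and MPI_Reduce adds up the per-rank totals on rank 0:
#include <mpi.h>    /* for MPI_ functions */
#include <stdio.h>  /* for printf */
#include <stdlib.h> /* for atexit */
void call_finalize(void) {MPI_Finalize();}
int main(int argc,char *argv[]) {
    MPI_Init(&argc,&argv);
    atexit(call_finalize);
    int rank=0,size=1;
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    MPI_Comm_size(MPI_COMM_WORLD,&size);
    int n=1000; /* only rank 0's value matters; Bcast overwrites everyone else's */
    MPI_Bcast(&n,1,MPI_INT, 0,MPI_COMM_WORLD); /* root 0 sends n to every rank */
    int my_total=0; /* each rank totals its share of 0..n-1 */
    for (int i=rank;i<n;i+=size) my_total+=i;
    int grand_total=0;
    MPI_Reduce(&my_total,&grand_total,1,MPI_INT, MPI_SUM,0,MPI_COMM_WORLD);
    if (rank==0) printf("Total of 0..%d is %d\n",n-1,grand_total);
    return 0;
}
Note the collective calls don't take a destination rank or a tag--every process in the communicator has to make the same call, and the "root" argument (0 here) says which rank the data starts from or ends up on.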
MPI Performance on the Powerwall
The UAF Bioinformatics Powerwall
is a fairly typical modern cluster, with gigabit ethernet connecting
ten dual-core nodes. We recently upgraded this cluster to run OpenMPI 1.3, instead of the more venerable MPICH. Here's how it performs:
Test send/recv_int took 47.113us per iteration (10000 iterations)
Sending even a small, one-int message takes a *long* time (high alpha
cost; about 50 us). This is mostly a function of your network
hardware.
Test sandbag (100) took 49.522us per iteration (1000 iterations)
The "sandbag" is my function that does CPU work.
Test send/recv_int + sandbag(100) took 96.148us per iteration (10000 iterations)
If we both compute and communicate, the times should add nearly
linearly when the CPU and network are not running at the same time--and
indeed 47us + 50us is about 96us, so blocking send/recv gets no overlap.
Test isend/irecv_int + sandbag(100) took 53.740us per iteration (100 iterations)
For better communication performance, use Isend and Irecv, which are "nonblocking": the CPU can keep working while the network talks.
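Here's roughly what that looks like (a fragment, not the benchmark's actual code; "dest", "tag", and "do_cpu_work" stand in for your own setup): post the receive and the send up front, do your CPU work, then wait for both to finish before touching the buffers again.
MPI_Request sendreq,recvreq;
MPI_Status sts;
int outval=1234, inval=0;
MPI_Irecv(&inval,1,MPI_INT, MPI_ANY_SOURCE,tag,MPI_COMM_WORLD,&recvreq); /* returns immediately */
MPI_Isend(&outval,1,MPI_INT, dest,tag,MPI_COMM_WORLD,&sendreq); /* returns immediately */
do_cpu_work(); /* the network can make progress while we compute */
MPI_Wait(&sendreq,&sts); /* must complete before reusing outval */
MPI_Wait(&recvreq,&sts); /* must complete before reading inval */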
Test send/recv_zero length took 46.548us per iteration (10000 iterations)
Test send/recv_1k took 72.320us per iteration (100 iterations)
Test send/recv_1meg took 9169.580us per iteration (100 iterations)
Short messages, even zero length, are expensive due to alpha
cost. A 1-kilobyte message costs only about 50% more than a one-int
message! For long messages the startup overhead amortizes away:
1 megabyte in about 9170us is under 10ns/byte, so OpenMPI really can
deliver over 100MB/sec on gigabit ethernet.
Test send/recv_int_overlap took 2.792us per iteration (1000 iterations)
Here's another curiosity--if I do repeated one-int sends to the same
destination, MPI is smart enough to start bundling the ints together
into longer messages. Of course, it's still far more expensive
than just bundling them yourself.
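Bundling them yourself just means sending one array instead of many one-int messages. A sketch (the array "vals", its length "n", and the 100-entry receive buffer are assumptions for illustration):
/* sender: one message carrying n ints, instead of n messages of one int each */
MPI_Send(vals,n,MPI_INT, dest,tag,MPI_COMM_WORLD);
/* receiver: ask for up to 100 ints, then check how many actually arrived */
int got[100], ngot;
MPI_Status sts;
MPI_Recv(got,100,MPI_INT, MPI_ANY_SOURCE,tag,MPI_COMM_WORLD,&sts);
MPI_Get_count(&sts,MPI_INT,&ngot); /* the status records the real received length */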
Test send/recv_1meg + sandbag(10000) took 14111.791us per iteration (100 iterations)
Test isend/irecv_1meg + sandbag(10000) took 14018.261us per iteration (100 iterations)
However, MPI doesn't seem to overlap long messages very well--this is
clearly first compute, then communicate. (I might need to add an
"MPI_Test" call or something inside my sandbag function.) Isend doesn't help much here either.
(2 cpus) Test barrier took 54.131us per iteration (1000 iterations)
(10 cpus)Test barrier took 242.433us per iteration (1000 iterations)
A barrier requires only one synchronizing message on two CPUs.
(2 cpus) Test bcast_int + barrier took 100.409us per iteration (1000 iterations)
(10 cpus)Test bcast_int + barrier took 390.508us per iteration (1000 iterations)
A broadcast costs about the same as one network message on two
processors. On more processors it's more expensive, but not
tenfold more like the naive "process zero sends once to everybody"
algorithm, because the implementation forwards the data along a tree,
so the depth grows roughly logarithmically with the processor count.
On modern hardware, broadcast is very fast.
(2 cpus) Test reduce_int + barrier took 100.939us per iteration (1000 iterations)
(10 cpus)Test reduce_int + barrier took 328.141us per iteration (1000 iterations)
Similarly, a reduction costs basically one message (on two processors).
You can see this benchmark in action at ~olawlor/641_demo/commbench/.
MPI on the Powerwall
I've given you all accounts on the UAF Bioinformatics Powerwall
up in Chapman 205. You can SSH in directly from on-campus, but
from off-campus the UAF firewall will block you unless you SSH to port
80 (normally the web port). On UNIX systems, that's like "ssh -p
80 fsxxx@powerwall0.cs.uaf.edu".
Once you're logged into the powerwall, copy over the "mandel" example program with:
cp -r ~olawlor/641_demo/mandel .
Now build and run the "mandel" program like so:
cd mandel
make
mpirun -np 2 ./mandel
You can edit the source code with pico, a friendly little editor (use the arrow keys, press Ctrl-X to exit)
pico main.cpp
You can make a backup copy of the code with
cp main.cpp bak_v1_original.cpp
If you have a UNIX machine, you can view the resulting fractal image with:
xv out.ppm
Or you can copy the files off the powerwall for local display; you may prefer to
convert out.ppm out.jpg
to get a normal JPEG.
Parallelization Generally
So you've got this application. The application is
sequential--it's a normal top-to-bottom C++ program. You want to
make it work on multiple CPUs.
Step one is to figure out what parts of the program are taking the
time--there's no point in parallelizing the idle loop. If the
program is slow because it's waiting for the disk, or network card, or
graphics card, STOP, because using multiple CPUs is probably not going to help.
If the program spends most of its time computing (doing arithmetic or
memory operations), then using multiple CPUs should be able to speed up
the program substantially. The first and hardest part is
splitting up the big glob-o-code into little pieces that can run
independently. Once you've split up the program into pieces, you then
just have to run each piece. Below are several ways to run pieces
of code simultaneously.
Splitting up Programs into Pieces
So you've got a big ball of C code. You want to run pieces of
that code on multiple CPUs. This is almost always possible, but
it's almost never trivially easy.
- For programs that are computing a big array of stuff, you can
compute pieces of the array separately. For example, most
graphics programs can divide up the image (or 3D model) into
pieces. OpenMP works great for this. In MPI, you have to
figure out "your" part of the array yourself, and loop over just
that part (see the sketch after this list).
- For programs that search over a big area for stuff, you can search subareas separately.
- For programs that total up a bunch of stuff, you can usually
total up the pieces independently, and then come up with a grand total
at the end.
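For example, here's the usual MPI idiom for the first and last cases above--a sketch, where "n" is the total number of elements, "compute" stands in for whatever you do to one element, and rank and size come from MPI_Comm_rank and MPI_Comm_size as usual. Each rank loops over its own contiguous slice, and a reduction combines the per-rank totals:
/* Block decomposition: this rank handles elements [first,last) of 0..n-1 */
int first =  rank   *n/size;
int last  = (rank+1)*n/size; /* rank size-1 gets last==n, so nothing is skipped */
double my_total=0;
for (int i=first;i<last;i++)
    my_total += compute(i); /* "your" part of the array */
double grand_total=0;
MPI_Reduce(&my_total,&grand_total,1,MPI_DOUBLE, MPI_SUM,0,MPI_COMM_WORLD);
/* grand_total is only valid on rank 0 */
Note that the rounding in the index arithmetic is what keeps pieces from being done twice or skipped: every index lands in exactly one rank's [first,last) interval.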
There's often a substantial amount of rethinking required to split a
program into pieces. It's easy, for example, to mess up the
division of work so some things get done twice, and others never get
done at all!
Stuff that's really hard to do in parallel includes sorting,
complicated nonlinear transformations like compiling,
and talking to existing sequential code. It's also usually
harmful to try to split up one file's I/O operations
into pieces--if you can eventually make the file locking and seeking
work reliably,
you'll often still get bad performance. I always try to switch to
in-memory operations on arrays first, or write out to multiple files
(which works nicely), or have only one thread do most of the I/O.
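For the multiple-files approach, it's usually enough to just put the rank in the filename, like this sketch (what you actually write is up to you):
/* One output file per rank: no locking, no seeking, no contention */
char fname[100];
sprintf(fname,"out.%d.ppm",rank); /* e.g. out.0.ppm, out.1.ppm, ... */
FILE *f=fopen(fname,"wb");
/* ... write this rank's piece of the output ... */
fclose(f);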
One well-known problem with dividing a problem into pieces is "load
balance"--the pieces are almost never all the same size. For
example, if you divide a problem into two pieces for two processors,
but the left piece is way bigger, the right processor will spend most
of its time waiting for the left one to finish. If you have 100
processors (the 2017 Intel Centium), it's easy to screw up your work
division so that 99 of the processors are waiting for the last one to
finish. This can destroy the performance improvement you'd get
from parallelism, so load imbalance is a bad thing.
Luckily, there's a cool trick for fixing load balance problems--just
make way more pieces than you think you'll need
("overdecomposition"). For example, if you've got two processors,
make like ten threads. Then if one thread finishes early, there
are still nine more for your processor to choose from. In
general, you'll get good performance from threads until each thread is
doing less than a few dozen milliseconds of work (or until the OS can't
create any more threads for you!). (See my friend Milind Bhandarkar's thesis for some performance graphs of this.)
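In plain MPI there are no extra threads to lean on, but you can get a static version of the same effect: cut the work into many more pieces than ranks and deal them out round-robin, so no rank gets stuck with one big unlucky chunk. A sketch ("npieces" and "do_piece" are assumptions; a fancier dynamic version would have idle ranks ask a master rank for the next piece):
int npieces=10*size; /* overdecomposition: many more pieces than processors */
for (int piece=rank; piece<npieces; piece+=size)
    do_piece(piece); /* pieces are dealt out round-robin across ranks */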
Finally, load balance is usually impossible to fix if you've divided the problem into inherently
unequal pieces--for example, if you divide a compiler into three
threads (preprocessor, syntax parse, and code generation). The
problem with this "heterogenous" each-thread-does-a-different-thing
approach is that you've only got three threads, so you can't
overdecompose. You're also limited to how many CPUs you can
support. Doing a threaded compiler that has one thread per *input
file* would be a better approach, since almost anytime you're waiting
for the compiler, there are lots of different files to compile.