Parallel Programming: MPI & Messaging

CS 641 Lecture, Dr. Lawlor

First, read this basic MPI tutorial, which has a good survey of all the MPI routines. The gory details are in the MPI 1.1 standard, which is fairly readable for a standard (but then, most standards are hideously unreadable). Pay attention to the "big eight" MPI functions: MPI_Init, MPI_Finalize, MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv, MPI_Bcast, and MPI_Reduce. Those are really the only functions you need in MPI 1.1; the rest are just small variations on those themes.

For example, here's an idiomatic MPI program where the first process sends one integer to the last process:

#include <mpi.h> /* for MPI_ functions */
#include <stdio.h> /* for printf */
#include <stdlib.h> /* for atexit */

void call_finalize(void) {MPI_Finalize();}
int main(int argc,char *argv[]) {
        MPI_Init(&argc,&argv);
        atexit(call_finalize); /*<- important to avoid weird errors! */

        int rank=0,size=1;
        MPI_Comm_rank(MPI_COMM_WORLD,&rank);
        MPI_Comm_size(MPI_COMM_WORLD,&size);

        int tag=17; /*<- arbitrary integer ID for this message exchange */
        if (rank==0) {
                int val=1234;
                MPI_Send(&val,1,MPI_INT, size-1,tag,MPI_COMM_WORLD);
                printf("Rank %d sent value %d\n",rank,val);
        }
        if (rank==size-1) {
                MPI_Status sts;
                int val=0;
                MPI_Recv(&val,1,MPI_INT, MPI_ANY_SOURCE,tag,MPI_COMM_WORLD,&sts);
                printf("Rank %d received value %d\n",rank,val);
        }
        return 0;
}

You can copy this same program on the UAF powerwall from ~olawlor/641_demo/simplempi/.

MPI on the Powerwall

I've given you all accounts on the UAF Bioinformatics Powerwall up in Chapman 205. You can SSH in directly from on-campus, but from off-campus the UAF firewall will block you unless you SSH to port 80 (normally the web port). On UNIX systems, that's like "ssh -p 80 fsxxx@powerwall0.cs.uaf.edu".

Once you're logged into the powerwall, copy over the "mandel" example program with:
    cp -r ~olawlor/441_demo/mandel .

Now build and run the "mandel" program like so:
    cd mandel
    make
    mpirun -np 2 ./mandel

You can edit the source code with pico, a friendly little editor (use the arrow keys; press Ctrl-X to exit):
    pico main.cpp

You can make a backup copy of the code with:
    cp main.cpp bak_v1_original.cpp

If you have a UNIX machine, you can view the resulting fractal image with:
    xv out.ppm
Or you can copy the files off the powerwall for local display; you may prefer to
    convert out.ppm out.jpg
to get a normal JPEG.

Theoretical Message-Passing Performance

Most network cards require some (fairly large) fixed amount of time per message, plus some (smaller) amount of time per byte of the message:
tnet = a + b * L

Where

tnet: network time. The total time, in seconds, spent on the network.
a: network latency. The time, in seconds, to send a zero-length message. For MPICH running on gigabit Ethernet, this is something like 50us/message, which is absurd. Fancier, more expensive networks such as Myrinet or Infiniband have latencies as low as 5us/message (in 2005; Wikipedia now claims 1.3us/message). (Network latency is often written as alpha or α).
b: 1/(network bandwidth). The time, in seconds, to send each byte of the message. Gigabit Ethernet signals at 1000Mb/s (megabits), which is about 100MB/s (megabytes) in practice, so this is 10ns/byte. For 4x Infiniband, which sends 1000MB/s, this is 1ns/byte. (Network 1/bandwidth is often written as beta or β).
L: number of bytes in message.

The bottom line is that shorter messages are always faster. *But* due to the per-message overhead, they may not be *appreciably* faster. For example, for Ethernet a zero-length message takes 50us. A 100-byte message takes 51us (50us/message+100*10ns/byte). So you might as well send 100 bytes if you're going to bother sending anything!

In general, you'll hit 50% of the network's achievable bandwidth when messages are just long enough that the per-byte time equals the per-message time:
a = b * L
or
L = a / b
For Ethernet, this breakeven point is at L = 5000 bytes/message, which is amazingly long! Shorter messages are "alpha dominated", which means you're waiting for the network latency (speed of light, acknowledgements, OS overhead). Longer messages are "beta dominated", which means you're waiting for the network bandwidth (signalling rate). Smaller messages don't help much if you're alpha dominated!
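The latency/bandwidth model above is easy to play with in code. Here is a minimal sketch in C; the function names (net_time, breakeven_bytes, bandwidth_fraction) are made up for illustration, and the constants you pass in are just the rough Ethernet figures quoted above, not measurements.

```c
#include <assert.h>

/* Network time model from the text: tnet = a + b * L.
   a: latency, seconds/message.  b: seconds/byte.  L: bytes in the message. */
double net_time(double a, double b, double L) {
    assert(a >= 0.0 && b > 0.0 && L >= 0.0);
    return a + b * L;
}

/* Message length where the per-message and per-byte costs are equal: L = a / b. */
double breakeven_bytes(double a, double b) {
    assert(b > 0.0);
    return a / b;
}

/* Fraction of the network's peak bandwidth an L-byte message achieves:
   useful transfer time (b * L) divided by total time (a + b * L). */
double bandwidth_fraction(double a, double b, double L) {
    return (b * L) / net_time(a, b, L);
}
```

With a = 50e-6 and b = 10e-9, net_time for a 100-byte message comes out to about 51us, breakeven_bytes is about 5000, and bandwidth_fraction at that length is about 0.5, matching the numbers above.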

The large per-message cost of most networks means many applications cannot get good performance with an "obvious" parallel strategy, like sending individual floats when they're needed. Instead, you have to figure out beforehand what data needs to be sent, and send everything at once.
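To see how much batching matters, here is a small sketch comparing the two strategies under the same tnet = a + b * L model. The function names are hypothetical, and the model ignores any overlap between messages, so treat the results as rough estimates.

```c
#include <assert.h>

/* Time to send `count` items one message at a time:
   every item pays the full per-message latency a. */
double time_one_at_a_time(double a, double b, long count, double bytes_each) {
    assert(count >= 0 && bytes_each >= 0.0);
    return (double)count * (a + b * bytes_each);
}

/* Time to send the same items packed into a single message:
   the latency a is paid only once. */
double time_batched(double a, double b, long count, double bytes_each) {
    assert(count >= 0 && bytes_each >= 0.0);
    return a + b * (double)count * bytes_each;
}
```

For 1000 four-byte floats on the Ethernet figures above (a = 50e-6, b = 10e-9), sending them individually costs about 50ms, while one batched message costs about 90us.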

Parallel CPU Performance

OK, that's the network's performance. What about the CPU? A typical model is:
tcpu = k * n / p

Where

tcpu is the total time spent computing, in seconds.
k is the (average) time spent computing each element. For example, Mandelbrot set rendering takes about 300ns/pixel. Some folks break k down into the (average) number of floating-point operations per element (flops/element), multiplied by the time taken per floating-point operation (seconds/flop, typically around 1ns (CPU) to 0.1ns (GPU)).
n is the number of elements in the computation. For example, a 1000x1000 image has 1 million pixels.
p is the number of processors. Dividing the total work by p implicitly assumes the problem's load is (more or less) evenly balanced!

Parallel Problem Performance

Say we're doing an n = 1 million pixel Mandelbrot rendering in parallel on p=10 processors. Step one is to render all the pixels:
    tcpu = k * n / p
    tcpu = 300 ns/pixel * 1 million pixels / 10 processors = 30ms / processor

After rendering the pixels, we've got to send them to one place for output. So each processor sends a message of size L = (3 bytes/pixel * n / p) = 300Kbytes/processor, and thus:
    tnet = a + b * L
    tnet = 50us + 10ns/byte * (3 bytes/pixel * 1 million pixels / 10 processors) = 3.05 ms/processor
(Careful: this assumes all the messages sent can be received at one place in parallel, which often isn't the case!)

Hence in this case we spend almost 10 times longer computing the data than we do sending it, which is pretty good.
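The arithmetic above can be double-checked with a few lines of C. This is only a sketch of the model, with hypothetical function names; the inputs are the same rough figures used in the text (k = 300ns/pixel, a = 50us, b = 10ns/byte).

```c
#include <assert.h>

/* CPU time per processor: tcpu = k * n / p. */
double cpu_time(double k, double n, double p) {
    assert(p > 0.0);
    return k * n / p;
}

/* Bytes each processor sends when gathering its share of the image. */
double gather_message_bytes(double n, double p, double bytes_per_pixel) {
    assert(p > 0.0);
    return bytes_per_pixel * n / p;
}

/* Network time for one message: tnet = a + b * L. */
double net_time(double a, double b, double L) {
    return a + b * L;
}
```

Plugging in n = 1e6 and p = 10 gives cpu_time of about 0.030 seconds, a message of 300000 bytes, and net_time of about 0.00305 seconds, for a ratio of roughly 9.8.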

Unlike the network, the CPU spends its time actually solving the problem. So CPU time is usually seen as "good" useful work, while network time is "bad" parallel overhead. More generally, you'd like the "computation to communication ratio", tcpu/tnet, to be as high as possible. For the Mandelbrot problem:

    tcpu / tnet = k * n / p / ( a + b * n / p * 3)

Or, simplifying:
    tcpu / tnet = k / ( a * p / n + b * 3)

Remember, a higher CPU/net ratio means more time spent computing, and hence higher efficiency. So we can immediately see that:

As the amount of work k per element grows, the efficiency goes up. (This makes sense: more time spent computing is "better", at least from the point of view of CPU/net ratio!)
As the network's per-message cost a grows, efficiency goes down. (Slower network is worse)
As you add processors p to a fixed-size problem, the per-message cost matters more, and efficiency goes down. (More processors means more, shorter messages.)
As the problem size n grows, the per-message cost gets less important, and efficiency goes up. (Messages get longer with bigger problems, which uses the network more efficiently.)
As the network's time per byte b goes down, efficiency goes up. (Faster network is better)

Or, to spend more time computing *relative* to the time spent communicating:

Get a faster network.
Find a bigger problem.
Do *more* work per element (or get a *slower* CPU!).
Use *fewer* processors.
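The trends listed above can be checked numerically. Here is a sketch with both the direct and the simplified form of the ratio; the function names are invented for illustration, and the 3 bytes/pixel is the Mandelbrot-specific figure from the text.

```c
#include <assert.h>

/* Direct form: tcpu / tnet = (k * n / p) / (a + b * n / p * 3). */
double ratio_direct(double k, double a, double b, double n, double p) {
    assert(n > 0.0 && p > 0.0);
    return (k * n / p) / (a + b * n / p * 3.0);
}

/* Simplified form from the text: k / (a * p / n + b * 3). */
double ratio_simplified(double k, double a, double b, double n, double p) {
    assert(n > 0.0 && p > 0.0);
    return k / (a * p / n + b * 3.0);
}
```

With k = 300e-9, a = 50e-6, b = 10e-9, n = 1e6, and p = 10, both forms give about 9.8. Raising p to 100 drops the ratio to about 8.6, while raising n to 1e7 lifts it to just under 10.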

Of course, what users really care about is the *total* time spent, tcpu + tnet:
tcpu + tnet = k * n / p + a + b * n / p * 3

To spend less time overall:

Get a faster network.
Find a *smaller* problem.
Do *less* work per element (or get a *faster* CPU).
Use more processors.

Note that even if you do all these things, you can *never* solve the Mandelbrot problem in parallel in less than the per-message time a. For Ethernet, this means parallel problems are always going to take at least 50us. In particular, if the problem can be solved serially in less than 50us, going parallel can't possibly help! Luckily, it's rare you care about a problem that takes less than 50us total. And if that sub-50us computation is just a part of a bigger problem, you can parallelize the *bigger* problem.

In general, the per-message cost means you want to parallelize the *coarsest* pieces of the computation you can!
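Here is a sketch of the total-time model, to show the latency floor. The function name is hypothetical, and like the formula above it assumes all the messages can be received in parallel, which is optimistic.

```c
#include <assert.h>

/* Total time per the text: tcpu + tnet = k * n / p + a + b * n / p * 3.
   The 3 is bytes/pixel for the Mandelbrot image. */
double total_time(double k, double a, double b, double n, double p) {
    assert(n >= 0.0 && p > 0.0);
    return k * n / p + a + b * n / p * 3.0;
}
```

For the 1 million pixel example (k = 300e-9, a = 50e-6, b = 10e-9), p = 1 gives about 0.330 seconds and p = 10 gives about 0.033 seconds. Even with an absurd p = 1e9, the result stays just above 50us: adding processors shrinks the first and last terms, but never the a in the middle.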

Parallelism Generally

Here's a survey of data structures used by scientific simulation software. I prepared this while in Illinois, working on a variety of scientific simulation software.

Some applications are quite easy to parallelize. Others aren't.

"Naturally Parallel" applications don't need any communication. For example, Mandelbrot set rendering is naturally parallel, because each Mandelbrot pixel is a stand-alone problem you can solve independently of all the other Mandelbrot pixels. GPUs can handle naturally parallel applications in a single pass. Other common terms for this are "Trivially Parallel" or "Embarrassingly Parallel", which makes it sound like a bad thing--but parallelism is both natural and good!
"Neighbor" applications communicate with their immediate neighbors, and that's it. For example, heat flow in a plate can be computed based solely on one's immediate neighbors (new value at arr[i] is a function of the old value of arr[i-1], arr[i], and arr[i+1]). Neighbor applications naturally fit into the "ghost exchange" communication pattern, and for large problem sizes usually can be tweaked to get good performance.
"Other" applications have a weirder, often collective communication pattern. Depending on the structure of the problem, such applications can sometimes have relatively good performance, but are often network bound.
"Sequential" applications might have multiple threads or processes, but they don't have any parallelism--only one thread/process can do useful work at a time. Many I/O limited applications are basically sequential, since ten CPUs can wait on one disk no faster than one CPU! (Solution: add more disks!)
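To make the "Neighbor" pattern concrete, here is a serial sketch of a 1D heat-flow step using ghost cells. Everything here is an assumption for illustration: the function names, the simple three-point averaging rule, and the use of plain array copies to stand in for the ghost exchange. In a real MPI program the two copies in exchange_ghosts would be a send/receive pair between neighboring ranks.

```c
#include <assert.h>
#include <stddef.h>

/* One time step on a local chunk. old_vals and new_vals each hold count+2
   entries: indices 1..count are the cells this chunk owns, while index 0
   and index count+1 are ghost cells holding copies of the neighbors' edges.
   The averaging rule is an illustrative assumption, not the only choice. */
void heat_step(const double *old_vals, double *new_vals, size_t count) {
    assert(count >= 1);
    for (size_t i = 1; i <= count; i++) {
        new_vals[i] = (old_vals[i - 1] + old_vals[i] + old_vals[i + 1]) / 3.0;
    }
}

/* Serial stand-in for a ghost exchange between two adjacent chunks:
   each chunk's ghost cell gets a copy of the neighbor's owned edge cell. */
void exchange_ghosts(double *left, size_t left_count, double *right) {
    assert(left_count >= 1);
    left[left_count + 1] = right[1]; /* left's right ghost <- right's first owned cell */
    right[0] = left[left_count];     /* right's left ghost <- left's last owned cell */
}
```

The point of the ghost cells is that heat_step never needs to know whether a neighbor value is local or came from another process. After the exchange, two 2-cell chunks produce the same values as a single 4-cell array would.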