Parallel Algorithm Design & Performance Analysis

CS 641 Lecture, Dr. Lawlor 

Parallelism Generally

Here's a survey of data structures used by scientific simulation software, which I prepared while in Illinois working on a variety of simulation codes.

 Some applications are quite easy to parallelize.  Others... aren't.

Amdahl's Law

Fact: Amdahl's Law limits the speedup of a partially parallelized program to 1/F, where F is the fraction of the program's runtime that is still sequential, even assuming everything else is so parallel it takes no time at all.
    speedup = old time / new time = sequential time / parallel time
            = 1 / ( F + (1-F)/P )  ->  1/F     (as the number of processors P approaches infinity)
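
Here's a minimal C sketch of that formula, using an assumed sequential fraction F = 5% (a made-up example value, not a measurement) just to show how the speedup flattens out well below the processor count, toward the 1/F limit:

    /* Amdahl's law: speedup = 1 / (F + (1-F)/P), for an assumed F. */
    #include <stdio.h>
    int main(void) {
        double F = 0.05;   /* 5% of the runtime stays sequential (example value) */
        for (int P = 1; P <= 1024; P *= 4) {
            double speedup = 1.0 / (F + (1.0 - F) / P);
            printf("P = %4d   speedup = %6.2f\n", P, speedup);
        }
        printf("limit as P -> infinity: 1/F = %.2f\n", 1.0 / F);
        return 0;
    }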

Question: what fraction of real programs MUST be sequential?  I claim it's only a vanishingly small amount--the user interface probably should be sequential in places, since there's only one user (one mouse, one button).  So Amdahl's law is in some sense true but useless--if the sequential part of the code is a 2-ns button click, a 2-second program has F=1/billion, so the speedup is limited to one billion.  Big deal!

However, it really takes programming effort to parallelize some parts of the program--some algorithms are sequential (you need new algorithms), some hardware like I/O is sequential (you need better hardware), etc. 

Mathematically, since
    F proportional to 1/pain
then Amdahl's law really says
    speedup is proportional to pain

Which is both a true and useful statement.

I/O Performance

One recurring problem in parallel computing is I/O: talking to the disk, network, or output devices.  Oddly, most real hardware has wildly different latency (time for one individual operation) and bandwidth (total bytes per second)--you might expect latency = 1/bandwidth, but in practice that first byte is way more expensive than the next bytes.  For example, loading a single byte from RAM requires the whole row to be addressed and sensed, so reading adjacent bytes gets much cheaper.  Sending one byte over the network requires an entire packet, so sending multiple bytes at once decreases the cost substantially.

Storage Device      Latency (per request)   Requests/sec   Bandwidth               Source
Floppy Drive        500 ms                  2              5 KB/s   - Horrible     Hazy memories
Modem               100 ms (round trip)     10             5 KB/s   - Horrible     Hazy memories
DVD-ROM             150 ms (seek time)      8              5 MB/s   - Sad          Experiment (laptop drive)
Flash Drive         0.5 ms                  2,000          13 MB/s  - Sad          Experiment (USB flash drive)
Hard Disk           10 ms                   100            50 MB/s  - Decent       Experiment (laptop hard drive)
Gigabit Ethernet    100 us                  10,000         100 MB/s - Decent       Experiment (powerwall)
RAM                 50 ns                   20 million     2 GB/s   - Great        Experiment (netrun Pentium 4)
Cache               1 ns                    1 billion      4 GB/s   - Great        Experiment (netrun Pentium 4)
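
You can see the latency-versus-bandwidth gap on your own machine with a rough sketch like the one below, which times a megabyte written as many 1-byte requests against the same megabyte written as one big request.  (This is just an illustration, assuming a POSIX system for clock_gettime; the file name and sizes are arbitrary choices, and because stdio and the OS buffer writes it mostly measures per-request software overhead rather than the raw device--but that's exactly the point: every request has a fixed cost.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double now(void) {              /* wall-clock seconds (POSIX clock_gettime) */
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + 1.0e-9 * ts.tv_nsec;
    }

    int main(void) {
        enum { N = 1 << 20 };              /* 1 MB total, written both ways */
        char *buf = calloc(N, 1);
        FILE *f = fopen("io_test.tmp", "wb");
        if (!f) { perror("fopen"); return 1; }

        double t0 = now();                 /* N separate 1-byte requests */
        for (int i = 0; i < N; i++) { fwrite(&buf[i], 1, 1, f); fflush(f); }
        double t_small = now() - t0;

        t0 = now();                        /* one N-byte request */
        fwrite(buf, 1, N, f); fflush(f);
        double t_big = now() - t0;

        printf("1-byte requests:  %.3f s  (%.1f KB/s)\n", t_small, N / t_small / 1e3);
        printf("one big request:  %.6f s  (%.1f MB/s)\n", t_big, N / t_big / 1e6);
        fclose(f); remove("io_test.tmp"); free(buf);
        return 0;
    }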

Parallel Performance In MPI

Selected outputs, running under MPICH 1.2.7 on our gigabit ethernet powerwall:

Test send/recv_int took 52.468us per iteration (10000 iterations)
    Sending even a small, one-int message takes a *long* time (high alpha cost as listed below).
Test sandbag (100) took 104.236us per iteration (1000 iterations)
    The "sandbag" is my function that does CPU work.
Test send/recv_int + sandbag(100) took 176.449us per iteration (10000 iterations)
    Nonlinear slowdown!  I think the CPU and network are interfering with one another here--the bottom line is that communication is expensive, especially when it won't overlap with computation.
Test isend/irecv_int + sandbag(100) took 114.860us per iteration (100 iterations)
    For better performance, use Isend and Irecv, which are "nonblocking": the CPU can keep working while the network talks.

Test send/recv_1meg took 10529.570us per iteration (100 iterations)
    Communication costs just 10ns/byte for long messages, since the startup overhead amortizes away.
Test isend/irecv_1meg + sandbag(10000) took 20864.780us per iteration (100 iterations)
    However, MPI doesn't seem to overlap long messages very well--this is clearly first compute, then communicate.  (I might need to add an "MPI_Test" call or something inside my sandbag function.)

Test barrier took 61.669us per iteration (1000 iterations)
    A barrier requires only one synchronizing message per CPU.
Test bcast_int + barrier took 111.750us per iteration (1000 iterations)
    A broadcast costs about the same as one network message on two processors.
Test reduce_int + barrier took 116.753us per iteration (1000 iterations)
    Similarly, a reduction costs basically one message (on two processors).
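
Here's roughly what the blocking-versus-nonblocking comparison above looks like in MPI code.  This is only a sketch, not the actual benchmark: sandbag() below is a stand-in for CPU work, and the timings will differ from the numbers quoted above.  Run it on exactly two processors (e.g., mpirun -np 2 ./a.out):

    #include <mpi.h>
    #include <stdio.h>

    /* Stand-in for the benchmark's "sandbag" CPU work. */
    double sandbag(int n) {
        volatile double sum = 0.0;
        for (int i = 0; i < n * 1000; i++) sum += i * 1.0e-6;
        return sum;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int other = 1 - rank;            /* assumes exactly 2 ranks */
        int msg = rank, incoming = -1;

        /* Blocking: communicate first, then compute. */
        double t0 = MPI_Wtime();
        if (rank == 0) {
            MPI_Send(&msg, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
            MPI_Recv(&incoming, 1, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&incoming, 1, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&msg, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
        }
        sandbag(100);
        double t_block = MPI_Wtime() - t0;

        /* Nonblocking: start the communication, compute while it's in flight, then wait. */
        MPI_Request reqs[2];
        t0 = MPI_Wtime();
        MPI_Isend(&msg, 1, MPI_INT, other, 1, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&incoming, 1, MPI_INT, other, 1, MPI_COMM_WORLD, &reqs[1]);
        sandbag(100);                    /* CPU keeps working while the network talks */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        double t_overlap = MPI_Wtime() - t0;

        if (rank == 0)
            printf("blocking: %.1f us    overlapped: %.1f us\n",
                   t_block * 1e6, t_overlap * 1e6);
        MPI_Finalize();
        return 0;
    }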

Theoretical Message-Passing Performance

Most network cards require some (fairly large) fixed amount of time per message, plus some (smaller) amount of time per byte of the message:
    tnet = a + b * L;

Where:
    tnet is the time to deliver one message,
    a is the per-message cost (the "alpha" or latency cost: about 50us/message on our gigabit Ethernet),
    b is the per-byte cost (the "beta" or bandwidth cost: about 10ns/byte), and
    L is the length of the message in bytes.

The bottom line is that shorter messages are always faster.  *But* due to the per-message overhead, they may not be *appreciably* faster.  For example, for Ethernet a zero-length message takes 50us.  A 100-byte message takes 51us (50us/message+100*10ns/byte).  So you might as well send 100 bytes if you're going to bother sending anything!

In general, you'll hit 50% of the network's achievable bandwidth when the messages are just long enough that
    a = b * L
or
    L = a / b
For Ethernet, this breakeven point is at L = 5000 bytes/message, which is amazingly long!  Shorter messages are "alpha dominated", which means you're waiting for the network latency (speed of light, acknowledgements, OS overhead).  Longer messages are "beta dominated", which means you're waiting for the network bandwidth (signalling rate).  Smaller messages don't help much if you're alpha dominated!
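
To make the alpha/beta tradeoff concrete, here's a small C sketch that plugs the Ethernet numbers used above (a = 50us/message, b = 10ns/byte) into tnet = a + b * L and prints what fraction of peak bandwidth each message length actually achieves:

    #include <stdio.h>
    int main(void) {
        double a = 50e-6;   /* per-message cost, seconds (alpha) */
        double b = 10e-9;   /* per-byte cost, seconds (beta) */
        double lengths[] = {1, 100, 5000, 100000, 1000000};
        for (int i = 0; i < 5; i++) {
            double L = lengths[i];
            double tnet = a + b * L;          /* time to deliver one message */
            double frac = (b * L) / tnet;     /* fraction of peak bandwidth achieved */
            printf("L = %8.0f bytes: tnet = %9.1f us, %5.1f%% of peak bandwidth\n",
                   L, tnet * 1e6, 100.0 * frac);
        }
        printf("breakeven (50%% of peak) at L = a/b = %.0f bytes\n", a / b);
        return 0;
    }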

The large per-message cost of most networks means many applications cannot get good performance with an "obvious" parallel strategy, like sending individual floats when they're needed.  Instead, you have to figure out beforehand what data needs to be sent, and send everything at once.
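
As a rough sketch of how much this matters, the toy MPI program below (run on two processors) ships 10,000 floats either as 10,000 one-float messages--paying the per-message cost a every time--or as a single batched message that pays a only once.  The count of 10,000 is an arbitrary choice for illustration:

    #include <mpi.h>
    #include <stdio.h>

    enum { N = 10000 };   /* arbitrary number of floats to ship */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        static float data[N];
        double t0;

        /* One message per float: pay the per-message cost a, N times. */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        if (rank == 0)
            for (int i = 0; i < N; i++)
                MPI_Send(&data[i], 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            for (int i = 0; i < N; i++)
                MPI_Recv(&data[i], 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0) printf("%d one-float messages: %.3f ms\n", N, (MPI_Wtime() - t0) * 1e3);

        /* One big message: pay a once, then stream at the per-byte rate b. */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        if (rank == 0)      MPI_Send(data, N, MPI_FLOAT, 1, 1, MPI_COMM_WORLD);
        else if (rank == 1) MPI_Recv(data, N, MPI_FLOAT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0) printf("one %d-float message:  %.3f ms\n", N, (MPI_Wtime() - t0) * 1e3);

        MPI_Finalize();
        return 0;
    }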

Parallel CPU Performance

OK, that's the network's performance.  What about the CPU?  A typical model is:
    tcpu = k * n / p

Where:
    tcpu is the compute time per processor,
    k is the time to process one element of the problem (e.g., ns/pixel),
    n is the problem size in elements, and
    p is the number of processors.

Parallel Problem Performance

Say we're doing an n = 1 million pixel Mandelbrot rendering in parallel on p = 10 processors.  Step one is to render all the pixels:
    tcpu = k * n / p
    tcpu = 300 ns/pixel * 1m pixels / 10 processors = 30ms / processor

After rendering the pixels, we've got to send them to one place for output.  So each processor sends a message of size L = (3 bytes/pixel * n / p) = 300Kbytes/processor, and thus:
    tnet = a + b * L
    tnet = a + b * (3 bytes/pixel * 1m pixels / 10 processors) = 50us + 10ns/byte * 300Kbytes = 3.05 ms/processor
(Careful: this assumes all the messages sent can be received at one place in parallel, which often isn't the case!)

Hence in this case we spend almost 10 times longer computing the data than we do sending it, which is pretty good.
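
Here's that arithmetic as a small C sketch, plugging in the same assumed model constants (k = 300 ns/pixel, a = 50 us/message, b = 10 ns/byte, 3 bytes/pixel):

    #include <stdio.h>
    int main(void) {
        double k = 300e-9;          /* CPU cost per pixel, seconds */
        double n = 1e6;             /* pixels */
        double p = 10;              /* processors */
        double a = 50e-6;           /* per-message cost, seconds */
        double b = 10e-9;           /* per-byte cost, seconds */
        double bytes_per_pixel = 3;

        double tcpu = k * n / p;                          /* render time per processor */
        double tnet = a + b * bytes_per_pixel * n / p;    /* time to ship the finished pixels */
        printf("tcpu = %.2f ms   tnet = %.2f ms   ratio = %.1f\n",
               tcpu * 1e3, tnet * 1e3, tcpu / tnet);
        return 0;
    }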

Unlike the network, the CPU spends its time actually solving the problem.  So CPU time is usually seen as "good" useful work, while network time is "bad" parallel overhead.  More generally, you'd like the "computation to communication ratio", tcpu/tnet, to be as high as possible.  For the Mandelbrot problem:

    tcpu / tnet = (k * n / p) / ( a + 3 * b * n / p )

Or, simplifying:
    tcpu / tnet = k / ( a * p / n + 3 * b )

Remember, a higher CPU/net ratio means more time spent computing, and hence higher efficiency.  So we can immediately see from the simplified formula that:
    - As the problem size n grows (with p fixed), the a * p / n term fades away and the ratio approaches k / (3 * b).
    - Adding processors while keeping the problem size fixed shrinks the ratio: each processor does less computing, but the per-message cost a stays the same.
Or, to spend more time computing *relative* to the time spent communicating:
    - do more computation per element (increase k),
    - solve a bigger problem (increase n),
    - use fewer processors (decrease p), or
    - get a faster network (decrease a and b).
Of course, what users really care about is the *total* time spent, tcpu + tnet:
    tcpu + tnet = k * n / p + a + b * n / p * 3

To spend less time overall:
    - compute faster (decrease k),
    - communicate faster (decrease a and b), or
    - use more processors (increase p)--this shrinks both the compute term and the per-byte communication term, but the per-message cost a never shrinks.
Note that even if you do all these things, you can *never* solve the Mandelbrot problem in parallel in less than the per-message time a.  For Ethernet, this means parallel problems are always going to take at least 50us.  In particular, if the problem can be solved serially in less than 50us, going parallel can't possibly help!  Luckily, it's rare that you care about a problem that takes less than 50us total.  And if that sub-50us computation is just a part of a bigger problem, you can parallelize the *bigger* problem.
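
Sweeping the same simplified model over processor counts makes this floor visible--the total time keeps dropping until it runs into the 50us per-message cost.  (Remember, the model is still ignoring the fact that all those messages have to land at one receiver.)

    #include <stdio.h>
    int main(void) {
        double k = 300e-9, n = 1e6, a = 50e-6, b = 10e-9, bpp = 3;
        for (int p = 1; p <= 100000; p *= 10) {
            double tcpu = k * n / p;                 /* compute time per processor */
            double tnet = a + b * bpp * n / p;       /* communication time per processor */
            printf("p = %6d: tcpu = %10.1f us  tnet = %9.1f us  total = %10.1f us\n",
                   p, tcpu * 1e6, tnet * 1e6, (tcpu + tnet) * 1e6);
        }
        return 0;
    }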

In general, the per-message cost means you want to parallelize the *coarsest* pieces of the computation you can!