Parallel Algorithm Design & Performance Analysis
CS 641 Lecture, Dr. Lawlor
Parallelism Generally
Here's a survey of the data structures used by scientific simulation software; I prepared
it while in Illinois, working on a variety of such codes.
Some applications are quite easy to parallelize. Others... aren't.
- "Naturally Parallel" applications don't need any
communication. For example, Mandelbrot set rendering is naturally
parallel, because each Mandelbrot pixel is a stand-alone problem you
can solve independently of all the other Mandelbrot pixels. GPUs
can handle naturally parallel applications in a single pass. The
other common term for this is "Trivially Parallel" or "Embarrassingly
Parallel", which makes it sound like a bad thing--but parallelism is both natural and good!
- "Neighbor" applications communicate with their immediate
neighbors, and that's it. For example, heat flow in a plate can
be computed based solely on one's immediate neighbors (new value at
arr[i] is a function of the old value of arr[i-1], arr[i], and
arr[i+1]). Neighbor applications naturally fit into the "ghost
exchange" communication pattern, and for large problem sizes usually
can be tweaked to get good performance.
- "Other" applications have a weirder, often collective
communication pattern. Depending on the structure of the problem,
such applications can sometimes have relatively good performance, but
are often network bound.
- "Sequential" applications might have multiple threads or
processes, but they don't have any parallelism--only one thread/process
can do useful work at a time. Many I/O limited applications are
basically sequential, since ten CPUs can wait on one disk no faster
than one CPU! (Solution: add more disks!)
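For concreteness, here's a minimal sketch in plain C (hypothetical array and function names) of one timestep of the 1D heat-flow update described in the "Neighbor" item above. In a distributed version, each processor would own a slice of the array and only need to receive the single "ghost" value just outside each end of its slice from its neighbors.

    /* One timestep of 1D heat flow: each new value depends only on the
       old value and its two immediate neighbors.  In a parallel "ghost
       exchange" version, each processor owns a slice of the array and
       only receives the one element just outside each end of its slice. */
    void heat_step(const float *old_arr, float *new_arr, int n)
    {
        for (int i = 1; i < n - 1; i++)   /* interior points only */
            new_arr[i] = 0.25f * old_arr[i-1]
                       + 0.50f * old_arr[i]
                       + 0.25f * old_arr[i+1];
        /* the boundary values (i = 0 and i = n-1) are handled separately */
    }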
Amdahl's Law
Fact: Amdahl's Law
limits the speedup of a partially parallelized program to 1/F, where F
is the fraction of the program's runtime that is still sequential, even
assuming everything else is so parallel it takes no time at all.
speedup = old time / new time = sequential time / parallel time
        = 1 / ( F + (1-F)/P )  =>  1/F  (as the number of processors P approaches infinity)
Question: what fraction of real programs MUST be sequential? I
claim it's only a vanishingly small amount--the user interface probably
should be sequential in places, since there's only one user (one mouse,
one button). So Amdahl's law is in some sense true but
useless--if the sequential part of the code is a 2-ns button click, a
2-second program has F=1/billion, so the speedup is limited to one
billion. Big deal!
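As a sanity check, here's a tiny C sketch of the formula above, plugging in that (made-up) 2-ns button click in a 2-second run:

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / (F + (1-F)/P), where F is the
       sequential fraction of the runtime and P the processor count. */
    double amdahl_speedup(double F, double P) {
        return 1.0 / (F + (1.0 - F) / P);
    }

    int main(void) {
        double F = 2e-9 / 2.0;   /* 2 ns sequential out of a 2 s run: F = 1/billion */
        printf("P = 10      : speedup = %.6f\n", amdahl_speedup(F, 10));
        printf("P = 1000000 : speedup = %.0f\n", amdahl_speedup(F, 1e6));
        printf("P -> inf    : limit   = %.0f\n", 1.0 / F);   /* one billion */
        return 0;
    }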
However, it really takes programming effort to parallelize some parts
of the program--some algorithms are sequential (you need new
algorithms), some hardware like I/O is sequential (you need better
hardware), etc.
Mathematically, since
    F is proportional to 1/pain
then Amdahl's law really says
    speedup is proportional to pain
which is both a true and useful statement.
I/O Performance
One recurring problem in parallel computing is I/O: talking to the
disk, network, or output devices. Oddly, most real hardware has
wildly different latency (time for one individual operation) and
bandwidth (total bytes per second)--you might expect latency =
1/bandwidth, but in practice that first byte is way more expensive than
the next bytes. For example, loading a single byte from RAM
requires the whole row to be addressed and sensed, so reading adjacent
bytes gets much cheaper. Sending one byte over the network
requires an entire packet, so sending multiple bytes at once decreases
the cost substantially.
Storage Device    | Latency (sec/request) | Requests/sec | Bandwidth (bytes/sec) | Source
------------------|-----------------------|--------------|-----------------------|------------------------------
Floppy Drive      | 500ms                 | 2            | 5KB/s - Horrible      | Hazy memories
Modem             | 100ms (round trip)    | 10           | 5KB/s - Horrible      | Hazy memories
DVD-ROM           | 150ms (seek time)     | 8            | 5MB/s - Sad           | Experiment (laptop drive)
Flash Drive       | 0.5ms                 | 2000         | 13MB/s - Sad          | Experiment (USB flash drive)
Hard Disk         | 10ms                  | 100          | 50MB/s - Decent       | Experiment (laptop hard drive)
Gigabit Ethernet  | 100us                 | 10,000       | 100MB/s - Decent      | Experiment (powerwall)
RAM               | 50ns                  | 20 million   | 2GB/s - Great         | Experiment (netrun Pentium 4)
Cache             | 1ns                   | 1 billion    | 4GB/s - Great         | Experiment (netrun Pentium 4)
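To see how these two numbers combine, here's a small C sketch (my own illustration, not part of the table) that estimates the time for a single request as latency + bytes/bandwidth, using round numbers from the table above:

    #include <stdio.h>

    /* Estimated time for one request: one latency charge, plus the
       requested bytes moving at the device's bandwidth. */
    double transfer_time(double latency_s, double bandwidth_Bps, double bytes) {
        return latency_s + bytes / bandwidth_Bps;
    }

    int main(void) {
        double disk_lat = 10e-3, disk_bw = 50e6;  /* hard disk: 10ms, 50MB/s */
        double ram_lat  = 50e-9, ram_bw  = 2e9;   /* RAM: 50ns, 2GB/s        */
        printf("Disk, 1 byte : %g s\n", transfer_time(disk_lat, disk_bw, 1));
        printf("Disk, 1 MB   : %g s\n", transfer_time(disk_lat, disk_bw, 1e6));
        printf("RAM,  1 byte : %g s\n", transfer_time(ram_lat,  ram_bw,  1));
        printf("RAM,  1 MB   : %g s\n", transfer_time(ram_lat,  ram_bw,  1e6));
        return 0;
    }

For small requests the latency term dominates; the bandwidth term only starts to matter once requests get large.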
Parallel Performance In MPI
Selected outputs, running under MPICH 1.2.7 on our gigabit ethernet powerwall:
Test send/recv_int took 52.468us per iteration (10000 iterations)
Sending even a small, one-int message takes a *long* time (high alpha cost as listed below).
Test sandbag (100) took 104.236us per iteration (1000 iterations)
The "sandbag" is my function that does CPU work.
Test send/recv_int + sandbag(100) took 176.449us per iteration (10000 iterations)
Nonlinear slowdown! I think the CPU and network are interfering
with one another here--the bottom line is that communication is
expensive, especially when it won't overlap with computation.
Test isend/irecv_int + sandbag(100) took 114.860us per iteration (100 iterations)
For better performance, use Isend and Irecv, which are "nonblocking": the CPU can keep working while the network talks.
Test send/recv_1meg took 10529.570us per iteration (100 iterations)
Communication costs just 10ns/byte for long messages, since the startup overhead amortizes away.
Test isend/irecv_1meg + sandbag(10000) took 20864.780us per iteration (100 iterations)
However, MPI doesn't seem to overlap long messages very well--this is
clearly first compute, then communicate. (I might need to add an
"MPI_Test" call or something inside my sandbag function.)
Test barrier took 61.669us per iteration (1000 iterations)
A barrier requires only one synchronizing message per CPU.
Test bcast_int + barrier took 111.750us per iteration (1000 iterations)
A broadcast costs about the same as one network message on two processors.
Test reduce_int + barrier took 116.753us per iteration (1000 iterations)
Similarly, a reduction costs basically one message (on two processors).
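The actual test harness isn't shown here, but a minimal MPI sketch in the same spirit as the "send/recv_int" test looks like this: two processes bounce one int back and forth and time the round trips.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal two-process ping-pong: times a blocking send/recv of one int. */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int iters = 10000, msg = 0;
        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - start;
        if (rank == 0)
            printf("Round trip took %.3f us per iteration\n", 1e6 * elapsed / iters);
        MPI_Finalize();
        return 0;
    }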
Theoretical Message-Passing Performance
Most network cards require some (fairly large) fixed amount of time per message, plus some (smaller) amount of time per byte of the message:
tnet = a + b * L;
Where
- tnet: network time. The total time, in seconds, spent on the network.
- a:
network latency. The time, in seconds, to send a
zero-length message. For MPICH running on gigabit Ethernet, this
is something like 50us/message, which is absurd (it's the time from the
CPU to the northbridge to the PCI controller to the network card to the
network to the switch to the other network card to the CPU interrupt
controller through the OS to MPI and finally to the application).
Fancier, more
expensive networks such as Myrinet or Infiniband have latencies as low as 5us/message (in 2005; Wikipedia now claims 1.3us/message). (Network latency is often written as alpha or α).
- b: 1/(network bandwidth). The time, in seconds, to send
each byte of the message. For gigabit Ethernet, which delivers about 100MB/s
(the "gigabit" is 1000Mb/s measured in megabits), this is 10ns/byte. For 4x Infiniband, which sends 1000MB/s, this is 1ns/byte. (Network 1/bandwidth is often written as beta or β).
- L: number of bytes in message.
The bottom line is that shorter messages are always faster. *But*
due to the per-message overhead, they may not be *appreciably*
faster. For example, for Ethernet a zero-length message takes
50us. A 100-byte message takes 51us
(50us/message+100*10ns/byte). So you might as well send 100 bytes
if you're going to bother sending anything!
In general, you hit 50% of the network's achievable bandwidth when sending messages just big enough that the per-message cost equals the per-byte cost:
a = b * L
or
L = a / b
For Ethernet, this breakeven point is at L = 5000 bytes/message, which
is amazingly long! Shorter messages are "alpha dominated", which
means you're waiting for the network latency (speed of light,
acknowledgements, OS overhead). Longer messages are "beta
dominated", which means you're waiting for the network bandwidth
(signalling rate). Smaller messages don't help much if you're
alpha dominated!
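Here's the same arithmetic in a few lines of C, using the gigabit Ethernet numbers above (alpha = 50us/message, beta = 10ns/byte):

    #include <stdio.h>

    /* tnet = a + b*L: fixed per-message cost plus per-byte cost. */
    double tnet(double a, double b, double L) { return a + b * L; }

    int main(void) {
        double a = 50e-6;   /* alpha: 50us/message (MPI over gigabit Ethernet) */
        double b = 10e-9;   /* beta:  10ns/byte    (about 100MB/s)             */
        printf("    0 bytes: %.1f us\n", 1e6 * tnet(a, b, 0));
        printf("  100 bytes: %.1f us\n", 1e6 * tnet(a, b, 100));
        printf(" 5000 bytes: %.1f us\n", 1e6 * tnet(a, b, 5000));
        printf("breakeven L = a/b = %.0f bytes\n", a / b);   /* alpha == beta*L here */
        return 0;
    }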
The large per-message cost of most networks means many applications cannot
get good performance with an "obvious" parallel strategy, like sending
individual floats when they're needed. Instead, you have to
figure out beforehand what data needs to be sent, and send everything at once.
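As a sketch of the difference (assumed variable names, not code from the lecture): sending n floats one at a time pays the per-message alpha cost n times, while packing them into a single send pays it once.

    #include <mpi.h>

    /* Slow: n tiny messages, so roughly n * alpha of pure overhead. */
    void send_one_at_a_time(float *data, int n, int dest) {
        for (int i = 0; i < n; i++)
            MPI_Send(&data[i], 1, MPI_FLOAT, dest, 0, MPI_COMM_WORLD);
    }

    /* Fast: one big message, one alpha, same number of bytes on the wire. */
    void send_all_at_once(float *data, int n, int dest) {
        MPI_Send(data, n, MPI_FLOAT, dest, 0, MPI_COMM_WORLD);
    }

With Ethernet's 50us alpha, sending 10,000 floats individually costs about half a second in overhead alone; one 40KB message costs well under a millisecond.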
Parallel CPU Performance
OK, that's the network's performance. What about the CPU? A typical model is:
tcpu = k * n / p
Where
- tcpu is the total time spent computing, in seconds.
- k
is the (average) time spent computing each element. For example,
Mandelbrot set rendering takes about 300ns/pixel. Some folks
break k down into the (average) number of floating-point operations per
element (flops/element), multiplied by the time taken per
floating-point operation (seconds/flop, typically around 1ns (CPU) to
0.1ns (GPU)).
- n is the number of elements in the computation. For example, a 1000x1000 image has 1m pixels.
- p is the number of processors. Dividing the total work by p
implicitly assumes the problem's load is (more or less) evenly balanced!
Parallel Problem Performance
Say we're doing an n = 1m pixel Mandelbrot rendering in parallel on p = 10 processors. Step one is to render all the pixels:
tcpu = k * n / p
tcpu = 300 ns/pixel * 1m pixels / 10 processors = 30ms / processor
After rendering the pixels, we've got to send them to one place for
output. So each processor sends a message of size L = (3
bytes/pixel * n / p) = 300Kbytes/processor, and thus:
tnet = a + b * L
tnet = a + b * (3 bytes/pixel * 1m pixels / 10 processors) = 3.05 ms/processor
(Careful: this assumes all the messages sent can be received at one place in parallel, which often isn't the case!)
Hence in this case we spend almost 10 times longer computing the data than we do sending it, which is pretty good.
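The same back-of-the-envelope numbers, written out in C so you can play with the parameters:

    #include <stdio.h>

    int main(void) {
        double k = 300e-9;     /* CPU work per pixel (seconds)     */
        double n = 1e6;        /* pixels (1000 x 1000 image)       */
        double p = 10;         /* processors                       */
        double a = 50e-6;      /* network alpha, seconds/message   */
        double b = 10e-9;      /* network beta, seconds/byte       */
        double bytes_per_pixel = 3;

        double tcpu = k * n / p;                          /* 30 ms/processor   */
        double tnet = a + b * bytes_per_pixel * n / p;    /* 3.05 ms/processor */
        printf("tcpu = %.2f ms   tnet = %.2f ms   ratio = %.1f\n",
               1e3 * tcpu, 1e3 * tnet, tcpu / tnet);
        return 0;
    }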
Unlike the network, the CPU spends its time actually solving the
problem. So CPU time is usually seen as "good" useful work, while
network time is "bad" parallel overhead. So more generally, you'd
like to make sure the "computation to communication ratio", tcpu/tnet
is as high as possible. For the mandelbrot problem:
tcpu / tnet = (k * n / p) / ( a + 3 * b * n / p )
Or, simplifying:
tcpu / tnet = k / ( a * p / n + 3 * b )
Remember, a higher CPU/net ratio means more time spent computing, and
hence higher efficiency. So we can immediately see that:
- As the amount of work k per element grows, the efficiency goes
up. (This makes sense: more time spent computing is "better", at least from the point of view of CPU/net ratio!)
- As the network's per-message cost a grows, efficiency goes down. (Slower network is worse)
- As you add processors to a fixed-size problem, each processor's message gets shorter, so the per-message cost matters more, and efficiency goes down. (More processors means more, smaller messages.)
- As the problem size n grows, the per-message cost gets less
important, and efficiency goes up. (Messages get longer with
bigger problems, which uses the network more efficiently.)
- As the network's time per byte b goes down, efficiency goes up. (Faster network is better)
Or, to spend more time computing *relative* to the time spent communicating:
- Get a faster network.
- Find a bigger problem.
- Do *more* work per element (or get a *slower* CPU!).
- Use *fewer* processors.
Of course, what users really care about is the *total* time spent, tcpu + tnet:
tcpu + tnet = k * n / p + a + b * n / p * 3
To spend less time overall:
- Get a faster network.
- Find a *smaller* problem.
- Do *less* work per element (or get a *faster* CPU).
- Use more processors.
Note that even if you do all these things, you can *never* solve the
mandelbrot problem in parallel in less than the per-message time
a. For Ethernet, this means parallel problems are always going to
take at least 50us. In particular, if the problem can be solved serially
in less than 50us, going parallel can't possibly help! Luckily,
it's rare you care about a problem that takes less than 50us
total. And if that sub-50us computation is just a part of a
bigger problem, you can parallelize the *bigger* problem.
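A quick sketch of the total time for the Mandelbrot numbers above as the processor count grows; whatever p you choose, the total never drops below the per-message cost a:

    #include <stdio.h>

    /* Total time = compute + communicate, for the Mandelbrot example above. */
    double total_time(double k, double n, double p, double a, double b) {
        return k * n / p + a + 3 * b * n / p;   /* 3 bytes/pixel of output */
    }

    int main(void) {
        double k = 300e-9, n = 1e6, a = 50e-6, b = 10e-9;
        for (double p = 1; p <= 10000; p *= 10)
            printf("p = %5.0f : %9.1f us total\n",
                   p, 1e6 * total_time(k, n, p, a, b));
        /* The totals shrink toward a = 50us, but can never go below it. */
        return 0;
    }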
In general, the per-message cost means you want to parallelize the *coarsest* pieces of the computation you can!