Parallel Performance Theory

CS 441 Lecture, Dr. Lawlor

Network Performance

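Most network cards require some (fairly large) fixed amount of time per message, plus some (smaller) amount of time per byte of the message:
    tnet = a + b * L
Here tnet is the total time to send one message, a is the per-message cost (in seconds/message), b is the per-byte cost (in seconds/byte), and L is the message length in bytes.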

The bottom line is that shorter messages are always faster.  *But* due to the per-message overhead, they may not be *appreciably* faster.  For example, for Ethernet (a = 50us/message, b = 10ns/byte) a zero-length message takes 50us, while a 100-byte message takes 51us (50us/message + 100 bytes * 10ns/byte).  So you might as well send 100 bytes if you're going to bother sending anything!
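As a quick sanity check on those numbers, here is a minimal Python sketch of the model. The function name t_net and the constant names are only illustrative; the values are the Ethernet figures quoted above.

```python
# Illustrative constants from the Ethernet example (assumed, not measured).
ALPHA = 50e-6   # per-message cost a, in seconds (50 us)
BETA = 10e-9    # per-byte cost b, in seconds (10 ns/byte)

def t_net(length_bytes, a=ALPHA, b=BETA):
    """Time in seconds to send one message of length_bytes bytes."""
    return a + b * length_bytes

# Print the modeled send time for a few message lengths.
for length in (0, 100, 1000, 10000):
    print(f"{length:6d} bytes: {t_net(length) * 1e6:7.1f} us")
```

Going from 0 to 100 bytes barely moves the total, because the per-message term dominates.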

In general, you'll hit 50% of the network's achievable bandwidth when the message is long enough that the per-byte time equals the per-message time:
    a = b * L
    L = a / b
For Ethernet, this breakeven point is at L = 50us / (10ns/byte) = 5000 bytes/message, which is amazingly long!  Shorter messages are "alpha dominated", which means you're waiting for the network latency (speed of light, acknowledgements, OS overhead).  Longer messages are "beta dominated", which means you're waiting for the network bandwidth (signalling rate).  Making messages smaller doesn't help much if you're already alpha dominated!
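The breakeven point can be checked numerically. This sketch assumes the same Ethernet constants (a = 50us, b = 10ns/byte); the helper names bandwidth_fraction and breakeven_length are made up for illustration.

```python
def breakeven_length(a=50e-6, b=10e-9):
    """Message length in bytes where per-message time equals per-byte time."""
    return a / b

def bandwidth_fraction(length_bytes, a=50e-6, b=10e-9):
    """Fraction of the peak bandwidth (1/b) achieved at this message length.

    Achieved bandwidth is L / (a + b*L); dividing by the peak 1/b gives
    b*L / (a + b*L).
    """
    return (b * length_bytes) / (a + b * length_bytes)

print(f"breakeven: {breakeven_length():.0f} bytes")
for length in (100, 1000, 5000, 50000, 500000):
    tag = "alpha" if length < breakeven_length() else "beta"
    print(f"{length:7d} bytes: {bandwidth_fraction(length):6.1%} of peak ({tag} dominated)")
```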

The large per-message cost of most networks means many applications cannot get good performance with an "obvious" parallel strategy, like sending individual floats when they're needed.  Instead, you have to figure out beforehand what data needs to be sent, and send everything at once.
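To see how much batching matters under this model, the sketch below compares sending 1000 floats one message at a time against sending them all in a single message. It assumes 4-byte floats and the Ethernet constants from earlier; the function names are hypothetical.

```python
def t_net(length_bytes, a=50e-6, b=10e-9):
    """Modeled time in seconds for one message: a + b * L."""
    return a + b * length_bytes

def time_individual(count, item_bytes=4):
    """Send each item in its own message: pay the per-message cost count times."""
    return count * t_net(item_bytes)

def time_batched(count, item_bytes=4):
    """Send all items in one message: pay the per-message cost once."""
    return t_net(count * item_bytes)

count = 1000
print(f"individual: {time_individual(count) * 1e3:.2f} ms")
print(f"batched:    {time_batched(count) * 1e6:.2f} us")
print(f"speedup:    {time_individual(count) / time_batched(count):.0f}x")
```

Under these assumptions the batched send is hundreds of times faster, almost entirely because it pays a once instead of 1000 times.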

Parallel CPU Performance

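OK, that's the network's performance.  What about the CPU?  A typical model is:
    tcpu = k * n / p
Here k is the time to process one item, n is the number of items (the problem size), and p is the number of processors.  This assumes the work divides evenly across the processors.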


Parallel Problem Performance

Say we're doing an n = 1 megapixel Mandelbrot rendering in parallel on p = 10 processors, at k = 300 ns/pixel.  Step one is to render all the pixels:
    tcpu = k * n / p
    tcpu = 300 ns/pixel * 1M pixels / 10 processors = 30 ms/processor

After rendering the pixels, we've got to send them to one place for output.  So each processor sends a message of size L = (3 bytes/pixel * n / p) = 300 Kbytes/processor, and thus:
    tnet = a + b * L
    tnet = 50 us + 10 ns/byte * (3 bytes/pixel * 1M pixels / 10 processors) = 3.05 ms/processor
(Careful: this assumes all the messages sent can be received at one place in parallel, which often isn't the case!)

Hence in this case we spend almost 10 times longer computing the data than we do sending it, which is pretty good.
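The arithmetic for this example can be reproduced with a few lines of Python. The constant names are illustrative, and like the text this assumes every processor's message can be received in parallel.

```python
A = 50e-6            # per-message cost a: 50 us (Ethernet example)
B = 10e-9            # per-byte cost b: 10 ns/byte
K = 300e-9           # compute cost k: 300 ns/pixel
N = 1_000_000        # problem size n: 1M pixels
P = 10               # number of processors p
BYTES_PER_PIXEL = 3

t_cpu = K * N / P                      # seconds of rendering per processor
msg_len = BYTES_PER_PIXEL * N // P     # bytes each processor sends
t_net = A + B * msg_len                # seconds to send that one message

print(f"tcpu  = {t_cpu * 1e3:.2f} ms")
print(f"L     = {msg_len} bytes")
print(f"tnet  = {t_net * 1e3:.2f} ms")
print(f"ratio = {t_cpu / t_net:.2f}")
```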

Unlike the network, the CPU spends its time actually solving the problem.  So CPU time is usually seen as "good" useful work, while network time is "bad" parallel overhead.  More generally, you'd like the "computation to communication ratio", tcpu/tnet, to be as high as possible.  For the Mandelbrot problem:

    tcpu / tnet = (k * n / p) / (a + b * 3 * n / p)

Or, simplifying (dividing top and bottom by n / p):
    tcpu / tnet = k / (a * p / n + b * 3)
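As a check on the algebra, this sketch evaluates both forms of the ratio and prints how it changes with the processor count. The function names are hypothetical, and the default values are the Mandelbrot and Ethernet numbers used earlier.

```python
def ratio_direct(p, k=300e-9, n=1_000_000, a=50e-6, b=10e-9, bytes_per_pixel=3):
    """tcpu / tnet computed from the two separate models."""
    t_cpu = k * n / p
    t_net = a + b * bytes_per_pixel * n / p
    return t_cpu / t_net

def ratio_simplified(p, k=300e-9, n=1_000_000, a=50e-6, b=10e-9, bytes_per_pixel=3):
    """The same ratio after dividing top and bottom by n / p."""
    return k / (a * p / n + b * bytes_per_pixel)

for p in (1, 10, 100, 1000, 10000):
    print(f"p = {p:5d}: direct = {ratio_direct(p):6.3f}, simplified = {ratio_simplified(p):6.3f}")
```

With these numbers the ratio starts near k / (3 * b) = 10 for small p and drops as p grows, since the a * p / n term in the denominator gets larger.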

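Remember, a higher CPU/net ratio means more time spent computing, and hence higher efficiency.  So we can immediately see from the formula that the ratio rises as k or n increases, and falls as a, b, or p increases.
Or, to spend more time computing *relative* to the time spent communicating: do more work per pixel (higher k), solve a bigger problem (higher n), use fewer processors (lower p), or use a network with a lower per-message cost a or per-byte cost b.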
Of course, what users really care about is the *total* time spent, tcpu + tnet:
    tcpu + tnet = k * n / p + a + b * n / p * 3

To spend less time overall: make the per-pixel work cheaper (lower k), shrink the problem (lower n), add processors (higher p), or use a network with a lower a or b.
Note that even if you do all these things, you can *never* solve the Mandelbrot problem in parallel in less than the per-message time a.  For Ethernet, this means parallel problems are always going to take at least 50us.  In particular, if the problem can be solved serially in less than 50us, going parallel can't possibly help!  Luckily, it's rare you care about a problem that takes less than 50us total.  And if that sub-50us computation is just a part of a bigger problem, you can parallelize the *bigger* problem.
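The floor at a is easy to see by evaluating the total-time formula for increasing p. This is a sketch under the same assumptions as before (Ethernet constants, 1M pixels, all messages received in parallel); total_time is an illustrative name.

```python
def total_time(p, k=300e-9, n=1_000_000, a=50e-6, b=10e-9, bytes_per_pixel=3):
    """tcpu + tnet in seconds for p processors."""
    return k * n / p + a + b * bytes_per_pixel * n / p

A = 50e-6  # per-message cost a, the lower bound on total time
for p in (1, 10, 100, 1000, 10_000, 100_000, 1_000_000):
    t = total_time(p)
    print(f"p = {p:8d}: total = {t * 1e6:12.2f} us  ({t / A:9.2f} x a)")
```

Adding processors shrinks the k * n / p and b * 3 * n / p terms toward zero, but the a term never goes away, so the total approaches 50us without ever dropping below it.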

In general, the per-message cost means you want to parallelize the *coarsest* pieces of the computation you can!