Network Performance Implications: Message Combining

CS 441 Lecture, Dr. Lawlor

So if I send messages back and forth between two CPUs, like this:
    const int nreps=100; /* ping-pong iterations to average over */
    char *buf=new char[nbytes]; /* message payload (contents don't matter here) */
    double start=MPI_Wtime();
    for (int rep=0;rep<nreps;rep++) {
        int evenrep=rep%2; /* alternate which of the two ranks sends */
        if (rank==evenrep) { /* our turn to send */
            MPI_Send(buf,nbytes,MPI_BYTE, !rank,0,c);
        } else { /* our turn to receive */
            MPI_Status sts;
            MPI_Recv(buf,nbytes,MPI_BYTE, !rank,0,c, &sts);
        }
    }
    double elapsed1=MPI_Wtime()-start;
    std::cout<<"Rank "<<rank<<" comm takes "<<elapsed1/nreps*1.0e6<<" us per message\n";
If you look at the performance of this for various nbytes, you immediately notice the problem with small messages:
nbytes    time/msg    bandwidth    msgs/second
1         47us        0.021MB/s    21,000
10        47us        0.21MB/s     21,000
100       47us        2.1MB/s      21,000
1000      73us        14MB/s       13,000
10000     175us       57MB/s       6,000
100000    1047us      95MB/s       1,000

It doesn't take longer to send a 100-byte message than to send a 1-byte message!  Thus, you might as well always send at least 100 bytes at a time.

Theoretical Message-Passing Performance

Most network cards require some (fairly large) fixed amount of time per message, plus some (smaller) amount of time per byte of the message:
    tnet = a + b * L;

Where:
    tnet = total time to deliver the message, in seconds;
    a = fixed per-message cost ("alpha"): latency, in seconds per message;
    b = per-byte cost ("beta"): 1/bandwidth, in seconds per byte;
    L = length of the message, in bytes.
The bottom line is that shorter messages are always faster.  *But* due to the per-message overhead, they may not be *appreciably* faster.  For example, for Ethernet a zero-length message takes 50us.  A 100-byte message takes 51us (50us/message+100*10ns/byte).  So you might as well send 100 bytes if you're going to bother sending anything!

In general, you'll hit 50% of the network's achievable bandwidth when sending messages just long enough that the per-message and per-byte costs are equal:
    a = b * L
or
    L = a / b
For Ethernet (a = 50us/message, b = 10ns/byte), this breakeven point is at L = 5000 bytes/message, which is amazingly long!  Shorter messages are "alpha dominated", which means you're mostly waiting on the per-message latency (speed of light, acknowledgements, OS overhead).  Longer messages are "beta dominated", which means you're mostly waiting on the network bandwidth (signalling rate).  Shrinking a message doesn't help much if you're alpha dominated!

The large per-message cost of most networks means many applications cannot get good performance with an "obvious" parallel strategy, like sending individual floats when they're needed.  Instead, you have to figure out beforehand what data needs to be sent, and send everything at once.