Network Performance Implications: Message Combining
CS 441 Lecture, Dr. Lawlor
So if I send messages back and forth between two CPUs, like this:
// rank, nbytes, and the communicator c are set up earlier;
// this benchmark assumes exactly two ranks, 0 and 1.
char *buf=new char[nbytes];
double start=MPI_Wtime();
int rep;
for (rep=0;rep<100;rep++) {
	int evenrep=rep%2; /* alternate send direction each iteration */
	if (rank==evenrep) { /* my turn to send */
		MPI_Send(buf,nbytes,MPI_BYTE, !rank,0,c);
	} else { /* my turn to receive */
		MPI_Status sts;
		MPI_Recv(buf,nbytes,MPI_BYTE, !rank,0,c, &sts);
	}
}
double elapsed1=MPI_Wtime()-start;
std::cout<<"Rank "<<rank<<" comm takes "<<elapsed1/rep*1.0e6<<" us per message\n";
If you look at the performance of this for various nbytes, you immediately notice the problem with small messages:
nbytes | time/msg | bandwidth | msgs/second
-------|----------|-----------|------------
1      | 47us     | 0.021MB/s | 21,000
10     | 47us     | 0.21MB/s  | 21,000
100    | 47us     | 2.1MB/s   | 21,000
1000   | 73us     | 14MB/s    | 13,000
10000  | 175us    | 57MB/s    | 6,000
100000 | 1047us   | 95MB/s    | 1,000
It doesn't take longer to send a 100-byte message than to send a 1-byte
message! Thus, you might as well always send at least 100 bytes
at a time.
Theoretical Message-Passing Performance
Most network cards require some (fairly large) fixed amount of time per message, plus some (smaller) amount of time per byte of the message:
tnet = a + b * L;
Where
- tnet: network time. The total time, in seconds, spent on the network.
- a:
network latency. The time, in seconds, to send a
zero-length message. For MPICH running on gigabit Ethernet, this
is something like 50us/message, which is absurd (it's the time from the
CPU to the northbridge to the PCI controller to the network card to the
network to the switch to the other network card to the CPU interrupt
controller through the OS to MPI and finally to the application).
Fancier, more
expensive networks such as Myrinet or Infiniband have latencies as low as 5us/message (in 2005; Wikipedia now claims 1.3us/message). (Network latency is often written as alpha or α).
- b: 1/(network bandwidth). The time, in seconds, to send
each byte of the message. For gigabit Ethernet, rated at 1000Mb/s
in megabits but only about 100MB/s in megabytes, this is 10ns/byte. For 4x Infiniband, which sends 1000MB/s, this is 1ns/byte. (Network 1/bandwidth is often written as beta or β).
- L: number of bytes in message.
The bottom line is that shorter messages are always faster. *But*
due to the per-message overhead, they may not be *appreciably*
faster. For example, for Ethernet a zero-length message takes
50us. A 100-byte message takes 51us
(50us/message+100*10ns/byte). So you might as well send 100 bytes
if you're going to bother sending anything!
In general, you'll hit 50% of the network's achievable bandwidth at the message length where the per-message cost equals the per-byte cost, that is, where
a = b * L
or
L = a / b
For Ethernet, this breakeven point is at L = 5000 bytes/message, which
is amazingly long! Shorter messages are "alpha dominated", which
means you're waiting for the network latency (speed of light,
acknowledgements, OS overhead). Longer messages are "beta
dominated", which means you're waiting for the network bandwidth
(signalling rate). Smaller messages don't help much if you're
alpha dominated!
The large per-message cost of most networks means many applications cannot
get good performance with an "obvious" parallel strategy, like sending
individual floats when they're needed. Instead, you have to
figure out beforehand what data needs to be sent, and send everything at once.