Building Broadcast From Point-to-Point

CS 641 Lecture, Dr. Lawlor

Say you've got some data on one MPI process, for example rank zero, that you need to communicate out to every process. The obvious way to do it is for process zero to loop over every destination:

	if (rank==0) { /* I have the data */
		std::vector<MPI_Request> reqs; MPI_Request req;
		for (int dest=0;dest<size;dest++) {
			if (rank!=dest) { /* somebody needs my data */
				MPI_Isend(&val,n,MPI_INT, 
					dest,tag,MPI_COMM_WORLD,&req); 
				reqs.push_back(req);
			}
		}
		MPI_Waitall(reqs.size(),&reqs[0],NULL);
	} else { /* I get the data */
		MPI_Status sts;
		int src=0;
		MPI_Recv(&val,n,MPI_INT, src,tag,MPI_COMM_WORLD,&sts);
	}

This works, but it's O(P) at the sender, which turns into a real bottleneck on more than ten or so processors.

A much better algorithm is the hypercube broadcast. At each step, all the processors with the data send it to a new processor without the data, doubling the number of processors that have the data ("nwithdata"). It's called hypercube because you work up from 0D (a single point, rank zero) to 1D (two processors, ranks zero and one) and 2D (four processors, 0-3) and so on:

	std::vector<MPI_Request> reqs; MPI_Request req;
	for (int nwithdata=1;nwithdata<size;nwithdata*=2) {
		if (rank<nwithdata) { /* I have the data */
			int dest=rank+nwithdata;
			if (dest<size) { /* somebody needs my data */
				MPI_Isend(&val,n,MPI_INT, 
					dest,tag,MPI_COMM_WORLD,&req); 
				reqs.push_back(req);
			}
		}
		else if (rank<2*nwithdata) { /* my chance to get the data */
			MPI_Status sts;
			int src=rank-nwithdata;
			MPI_Recv(&val,n,MPI_INT, src,tag,MPI_COMM_WORLD,&sts);
		}
	}
	MPI_Waitall(reqs.size(),&reqs[0],NULL);

Or, finally, you could just call MPI's builtin broadcast, which on high-end hardware actually is a hardware broadcast:

	MPI_Bcast(&val,n,MPI_INT, 0,MPI_COMM_WORLD);

In practice, all three methods are about the same speed for 2 processors, and on 10 processors, dimensional exchange is about twice as fast as the brute-force send to all, and MPI_Bcast is about twofold faster yet!

Curiously, a reduction or gather can use the same hypercube to decrease message alpha cost. A variant of hypercube called dimensional exchange can even do all-to-all operations like gather-to-all.