Parallel Graphics

CS 481/681 2007 Lecture, Dr. Lawlor

Parallel in General

The basic idea is just to divide up your work into pieces.  The question of "which pieces?" depends on the problem.  Usually there's a natural answer where the pieces are almost totally independent, and hence can be worked on in parallel.  For images, it's natural to divide up the image into separate regions.  For *any* program that outputs a big array, it's natural to divide up the output array.  For more complicated programs, sometimes it's tricky to figure out how to make things parallel, but luckily most graphics tasks are "embarrasingly parallel", or as I prefer to say, "naturally parallel".

There is one quite common problem with doing parallel work--unequal work distribution.  If two processors are working on a problem, but one processor ended up with 90% of the work (even if it's only 50% of the image), then that processor isn't getting much help from the other processor.  In general, to get good performance, you've got to keep all your CPUs busy doing useful stuff.  One trivial but suprisingly effective technique for this is called "overdecomposition", where you just make way more pieces than processors.  This lets the OS switch between pieces dynamically, improving the "load balance" and overall performance.

One technique I can't recommend is making threads for each of your program's major software units.  For example, a game might consist of an AI module, a physics module, and a rendering module.  You could make three threads out of these modules.  But there are several problems with this approach:
Instead, I try to split up the work done inside each module into threads--data decomposition instead of task decomposition.  This lets you support way more CPUs, can give you acceptable load balance, and is usually a better overall approach.


See my CS321 lecture for how to split your program into threads or processes.  This is an easy-but-limiting approach: since threads and processes usually communicate by sharing memory, the physical hardware you run on MUST have shared memory.  Shared memory is supposedly somewhat easier to program (I'm not sure about that), but it definitely imposes some additional hardware cost.  In particular, with shared memory you can't just have a big pile of separate boxes connected with a network.  For that, you need MPI.


MPI is a very strange but powerful interface.  It's designed to run on separate machines that share only a network.  So MPI actually calls your "main" routine once on every processor, almost like "main" is already the thread worker function.

Every MPI program must begin with a call to MPI_Init, and end with a call to MPI_Finalize:
#include <mpi.h>
int main(int argc,char **argv) {
... all work here ..
If you forget to call MPI_Finalize, under MPICH you'll get weird errors like this when your program exits--and this is extremely embarassing if it happens in class and you don't recognize it!
p17_20106:  p4_error: net_recv read:  probable EOF on socket: 1
rm_l_17_20107: (0.286554) net_send: could not write to fd=5, errno = 32
bm_list_30320: (2.274217) wakeup_slave: unable to interrupt slave 0 pid 30319
Once you've initialized MPI, you've got to figure out which processor you are, and how many processors exist:
std::cout<<"Hello-- I am processor "<<rank<<" of "<<size<<"\n";
You compile your program (say "main.cpp") using mpiCC:
	mpiCC  main.cpp  -o main
Now you can run using mpirun.  "-np" controls the number of processes to create.
	mpirun -np 2 ./main
My version (on the powerwall in ~olawlor/481_demo/simple_mpi/) gives:
olawlor@powerwall0:~/481_demo/simple_mpi$ mpiCC  main.cpp  -o main
olawlor@powerwall0:~/481_demo/simple_mpi$ mpirun -np 2 ./main
Hello-- I am processor 0 of 2
Hello-- I am processor 1 of 2
olawlor@powerwall0:~/481_demo/simple_mpi$ mpirun -np 7 ./main
Hello-- I am processor 0 of 7
Hello-- I am processor 6 of 7
Hello-- I am processor 4 of 7
Hello-- I am processor 2 of 7
Hello-- I am processor 3 of 7
Hello-- I am processor 1 of 7
Hello-- I am processor 5 of 7
Based only on the rank and size information, you can usually figure out what part of the image you should work on.  So you do some computing.  You then want to forward the result to some central processor to be stored in a file.  Unlike with threads, you can't just stuff your results into a big shared data structure, nor can the central processor just pull the data out of your arrays.  Instead, you usually communicate your results using sends and receives...

MPI Sends and Receives

MPI_Send takes six parameters: a pointer to the data to send, the number of bytes to send, MPI_BYTE, the processor to send the data to, a "tag" used to match sends and receives (make up a number here, like zero), and MPI_COMM_WORLD.
	MPI_Send(&myWork,sizeof(myWork),MPI_BYTE,  bossRank, tag, MPI_COMM_WORLD);
MPI_Recv is the other side of send.  It takes the exact same six parameters, this time describing the receiver.  It's also got a (silly) "MPI_Status" tacked on at the end, which can be used to figure out the message sender, tag, and length if you don't know them.
	MPI_Status stat;
hisRank, tag, MPI_COMM_WORLD,&stat);
You MUST make sure every MPI_Send has a matching MPI_Recv in your programs.  Sometimes this is tricky!

For example, to have the boss receive one piece of work from each minion, the boss would have to loop over minions like so:
/* send all our data to the boss processor */
int tag=1234;
int boss=0;
if (rank==boss) { /* I'm the boss */
for (int minion=0;minion<size;minion++)
if (minion!=boss) {
MPI_Status stat;
char c;
minion, tag, MPI_COMM_WORLD,&stat);
std::cout<<"Minion "<<minion<<" computed the value '"<<c<<"'.\n";
} else { /* I'm a lowly minion */
char myWork='I';
boss, tag, MPI_COMM_WORLD);
(You can also use the slightly faster and cleaner MPI_Gather for this if all the minion's results are the same size.)

See the example on the powerwall in ~olawlor/cs481_demo/sendrecv/.