Image Output and Load Balance

CS 441 Lecture, Dr. Lawlor

Today we figured out how to output image data, using a PPM-family image file (the code below writes the binary greyscale "P5" variant). We used local image coordinates for rendering, and global image coordinates only for image reassembly.
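
As a minimal sketch of the output step, the following writes a greyscale byte image with the same "P5" header the lecture code uses, and maps a local row to its global row. The helper names `write_pgm` and `global_row` are assumptions made for illustration; they do not appear in the lecture code.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: write a wid x ht greyscale image in binary "P5" format.
// The header is plain text; the pixels follow as raw bytes, one per pixel.
void write_pgm(std::ostream &out, const std::vector<unsigned char> &pixels,
               int wid, int ht) {
	assert(pixels.size() == (std::size_t)wid * ht);
	out << "P5\n";                      // greyscale, binary
	out << wid << " " << ht << "\n";    // image size
	out << "255\n";                     // maximum pixel value (byte image)
	out.write(reinterpret_cast<const char *>(pixels.data()),
	          static_cast<std::streamsize>(pixels.size()));
}

// Hypothetical helper: convert a row index inside one process's piece
// (local coordinates) to a row index in the assembled image (global).
int global_row(int rank, int procpiece, int local_y) {
	return rank * procpiece + local_y;
}
```

Passing a `std::ofstream` opened with `std::ios_base::binary` as `out` would produce a file most image viewers can open.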

Finally, we scaled our distributed-memory parallel program to support an arbitrary number of cores, and measured load balance.

1 process, 0.459 seconds
2 processes, 0.404 seconds
4 processes, 0.285 seconds
8 processes, 0.155 seconds
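
These timings can be turned into speedup and efficiency figures. A minimal sketch, where `speedup` and `efficiency` are hypothetical helper names rather than part of the lecture code:

```cpp
#include <cassert>

// Speedup: one-process time divided by the n-process time.
double speedup(double t1, double tn) {
	assert(tn > 0.0);
	return t1 / tn;
}

// Efficiency: speedup divided by the number of processes (1.0 is ideal).
double efficiency(double t1, double tn, int n) {
	assert(n > 0);
	return speedup(t1, tn) / n;
}
```

With the measurements above, `speedup(0.459, 0.155)` comes to about 2.96 for 8 processes, an efficiency of roughly 0.37 per process.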

We found that adding more processes than CPUs helped load balance substantially: with some spare processes, the OS scheduler can fill in idle CPUs, resulting in better overall performance. We can fork off up to 9 processes on NetRun (but that's the limit).
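
One way to see why the pieces are unbalanced is to count Mandelbrot iterations per strip. The sketch below assumes the same pixel-to-plane mapping and the same equal-height strips as the lecture code; `mset_count`, `piece_work`, and `imbalance` are hypothetical names introduced only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <vector>

// Iteration count for one pixel, using the lecture's mapping
// (scale=3.0/wid, offset -1.0 on both axes) and a 256-iteration cap.
int mset_count(int x, int y, int wid) {
	double scale = 3.0 / wid;
	std::complex<double> c(x * scale - 1.0, y * scale - 1.0), z(0.0);
	int count;
	for (count = 0; count < 256; count++) {
		z = z * z + c;
		if (std::norm(z) > 4.0) break; // norm() is the squared magnitude
	}
	return count;
}

// Total iteration count inside each of "size" equal-height strips.
std::vector<long long> piece_work(int size, int wid, int ht) {
	assert(size > 0);
	int procpiece = ht / size;
	std::vector<long long> work(size, 0);
	for (int rank = 0; rank < size; rank++)
		for (int y = 0; y < procpiece; y++)
			for (int x = 0; x < wid; x++)
				work[rank] += mset_count(x, rank * procpiece + y, wid);
	return work;
}

// Busiest strip divided by the average strip: 1.0 means perfect balance.
double imbalance(const std::vector<long long> &work) {
	long long total = 0, biggest = 0;
	for (long long w : work) {
		total += w;
		biggest = std::max(biggest, w);
	}
	assert(total > 0);
	return (double)biggest * work.size() / total;
}
```

Calling `imbalance(piece_work(size, wid, ht))` for several values of `size` shows how uneven the strips are. With more strips than CPUs, a CPU that finishes a cheap strip can be handed another one by the OS scheduler instead of sitting idle.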

// Socket-based multicore parallelism (for quad-core machine)
#include "osl/socket.h"
#include "osl/socket.cpp"
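#include <sys/wait.h> /* for wait() */
#include <unistd.h> /* for fork() */
#include <complex>
#include <cstdlib> /* for exit() */
#include <cstring> /* for memcpy() */
#include <fstream> /* for std::ofstream */
#include <iostream> /* for cout */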

/**
 A linear function in 2 dimensions: returns a double as a function of (x,y).
*/
class linear2d_function {
public:
	double a,b,c;
	void set(double a_,double b_,double c_) {a=a_;b=b_;c=c_;}
	linear2d_function(double a_,double b_,double c_) {set(a_,b_,c_);}
	double evaluate(double x,double y) const {return x*a+y*b+c;}
};

const int wid=1000, ht=1000;
// Set up coordinate system to render the Mandelbrot Set:
double scale=3.0/wid;
linear2d_function fx(scale,0.0,-1.0); // real part of c, given pixel (x,y)
linear2d_function fy(0.0,scale,-1.0); // imaginary part of c, given pixel (x,y)

char render_mset(int x,int y) {
/* Walk this Mandelbrot Set pixel */
	typedef std::complex<double> COMPLEX;
	COMPLEX c(fx.evaluate(x,y),fy.evaluate(x,y));
	COMPLEX z(0.0);
	int count;
	enum {max_count=256};
	for (count=0;count<max_count;count++) {
		z=z*z+c;
		if ((z.real()*z.real()+z.imag()*z.imag())>4.0) break;
	}
		
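	/* Narrowing to char keeps only the low 8 bits, so points that never
	   escape (count==max_count==256) typically come out as 0 (black). */
	return count;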
}

class row {
public:
	char data[wid];
};

/* Run as process "rank", one process among "size" others.  
   Each socket connects you with another rank: s[0] connects to rank 0.
*/
void run(int rank,int size,SOCKET *s) {
	int procpiece=ht/size; int gystart=rank*procpiece;
	row limg[procpiece]; /* local copy of the final image */
	
	/* Render our piece of the image */
	for (int y=0;y<procpiece;y++)
	{
		for (int x=0;x<wid;x++) limg[y].data[x]=render_mset(x,gystart+y);
	}
	
	if (rank>0) 
	{ /* send our partial piece to rank 0 */
		skt_sendN(s[0],&limg[0].data[0],sizeof(row)*procpiece);
	}
	else
	{ /* rank 0: receive partial pieces from other ranks */
		row gimg[ht];
		for (int r=0;r<size;r++) 
		if (r==0) {
			memcpy(gimg,limg,sizeof(row)*procpiece);
		} else {
			skt_recvN(s[r],&gimg[r*procpiece].data[0],
				sizeof(row)*procpiece);
		}
		/* Print out assembled image */
		std::ofstream of("out.ppm",std::ios_base::binary);
		of<<"P5\n"; // greyscale, binary
		of<<wid<<" "<<ht<<"\n"; // image size
		of<<"255\n"; // byte image
		of.write(&gimg[0].data[0],sizeof(row)*ht);
	}
}


int foo(void) {
	double start=time_in_seconds();
	unsigned int port=0;
	const int size=read_input(); /* ???-core machine */
	SOCKET s[size];
	SERVER_SOCKET serv=skt_server(&port);
	for (int child=1;child<size;child++) {
		int newpid=fork();
		if (newpid==0) { /* I'm the child */
			s[0]=skt_connect(skt_lookup_ip("127.0.0.1"),port,2);
			run(child,size,s);
			skt_close(s[0]);
			exit(0); /* close out child process when done */
		}
		/* else I'm the parent */
		s[child]=skt_accept(serv,0,0);
	}
	/* Now that all children are created, run as parent */
	run(0,size,s);
	/* Once parent is done, collect all the children */
	for (int child=1;child<size;child++) {
		skt_close(s[child]);
		int status=0; 
		wait(&status); /* wait for child to finish */
	}
	std::cout<<"That took "<<(time_in_seconds()-start)<<" seconds\n";
	return 0;
}

(Try this in NetRun now!)
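
One thing to watch when trying other process counts: the code computes `procpiece=ht/size`, which rounds down, so when `size` does not divide `ht` evenly (for example 3 processes and 1000 rows), the last few rows are never rendered. A minimal sketch of one possible fix, using the hypothetical helpers `strip_start` and `strip_rows` (not part of the lecture code), gives every row to exactly one rank:

```cpp
#include <cassert>

// First global row owned by "rank"; rank==size gives one past the last row.
int strip_start(int rank, int size, int ht) {
	assert(size > 0 && rank >= 0 && rank <= size);
	return (int)((long long)rank * ht / size);
}

// Number of rows owned by "rank"; strips differ in height by at most one row.
int strip_rows(int rank, int size, int ht) {
	return strip_start(rank + 1, size, ht) - strip_start(rank, size, ht);
}
```

In `run`, `gystart` would become `strip_start(rank,size,ht)` and the loop bound and the send and receive sizes would use `strip_rows`, since the strips no longer all share one height.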