Image Output and Load Balance

CS 441 Lecture, Dr. Lawlor

Today we figured out how to output image data, using a PPM-family image file (the code below writes the binary greyscale "P5" variant).  We used local image coordinates for rendering, and global image coordinates only for image reassembly.
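As a minimal sketch of the file format, here is a standalone writer for the binary greyscale "P5" layout: a three-line text header (magic number, width and height, maximum value) followed by the raw pixel bytes, one byte per pixel, top row first. The names write_pgm and make_gradient_pgm are hypothetical helpers made up for this illustration; they are not part of the lecture code or of NetRun.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Write an 8-bit greyscale image in binary "P5" format.
// pixels holds wid*ht bytes in row-major order, top row first.
// (write_pgm is a hypothetical helper, not from the lecture code.)
void write_pgm(std::ostream &out, int wid, int ht,
               const std::vector<unsigned char> &pixels) {
    assert(wid > 0 && ht > 0);
    assert(pixels.size() == static_cast<std::size_t>(wid) * ht);
    out << "P5\n";                      // greyscale, binary
    out << wid << " " << ht << "\n";    // image size
    out << "255\n";                     // one byte per pixel
    out.write(reinterpret_cast<const char *>(pixels.data()),
              static_cast<std::streamsize>(pixels.size()));
}

// Build a small left-to-right gradient and return the file contents
// as a string, so the format can be inspected without touching disk.
// (Also a hypothetical helper, for illustration only.)
std::string make_gradient_pgm(int wid, int ht) {
    std::vector<unsigned char> pixels(static_cast<std::size_t>(wid) * ht);
    int denom = (wid > 1) ? (wid - 1) : 1;
    for (int y = 0; y < ht; y++)
        for (int x = 0; x < wid; x++)
            pixels[static_cast<std::size_t>(y) * wid + x] =
                static_cast<unsigned char>((x * 255) / denom);
    std::ostringstream out;
    write_pgm(out, wid, ht, pixels);
    return out.str();
}
```

To write a real file, pass a std::ofstream opened with std::ios_base::binary as the first argument, so the pixel bytes are written without newline translation.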

Finally, we scaled our distributed-memory parallel program to support an arbitrary number of cores, and measured load balance by timing runs with different process counts:
1 process, 0.459 seconds
2 processes, 0.404 seconds
4 processes, 0.285 seconds
8 processes, 0.155 seconds
We found that adding more processes than CPUs helped load balance substantially: with some spare processes, the OS scheduler can fill in idle CPUs, resulting in better overall performance.  We can fork off up to 9 processes on NetRun (but that's the limit).
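To put numbers on those timings, here is a small sketch that computes speedup relative to the 1-process run. The timings are the ones listed above; the names timing, speedup, and print_speedups are made up for this illustration. Speedup here just means (1-process time) / (n-process time), which works out to roughly 1.14x, 1.61x, and 2.96x for 2, 4, and 8 processes.

```cpp
#include <cassert>
#include <cstdio>

// Wall-clock timings reported in the lecture.
struct timing { int processes; double seconds; };
const timing timings[] = {
    {1, 0.459}, {2, 0.404}, {4, 0.285}, {8, 0.155}
};
const int n_timings = sizeof(timings) / sizeof(timings[0]);

// Speedup of an n-process run relative to the 1-process run.
double speedup(double t1, double tn) {
    assert(t1 > 0.0 && tn > 0.0);
    return t1 / tn;
}

// Print one line per measured process count.
void print_speedups() {
    for (int i = 0; i < n_timings; i++)
        std::printf("%d processes: %.3f s, speedup %.2fx\n",
                    timings[i].processes, timings[i].seconds,
                    speedup(timings[0].seconds, timings[i].seconds));
}
```

Calling print_speedups() from foo() in NetRun (or from main() elsewhere) prints the table.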
// Socket-based multicore parallelism (for quad-core machine)
#include "osl/socket.h"
#include "osl/socket.cpp"
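#include <sys/wait.h> /* for wait() */
#include <unistd.h> /* for fork() */
#include <cstdlib> /* for exit() */
#include <cstring> /* for memcpy() */
#include <fstream> /* for std::ofstream */
#include <iostream> /* for std::cout */
#include <complex>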

/**
A linear function in 2 dimensions: returns a double as a function of (x,y).
*/
class linear2d_function {
public:
double a,b,c;
void set(double a_,double b_,double c_) {a=a_;b=b_;c=c_;}
linear2d_function(double a_,double b_,double c_) {set(a_,b_,c_);}
double evaluate(double x,double y) const {return x*a+y*b+c;}
};

const int wid=1000, ht=1000;
// Set up coordinate system to render the Mandelbrot Set:
double scale=3.0/wid;
linear2d_function fx(scale,0.0,-1.0); // returns c given pixels
linear2d_function fy(0.0,scale,-1.0);

char render_mset(int x,int y) {
/* Walk this Mandelbrot Set pixel */
typedef std::complex<double> COMPLEX;
COMPLEX c(fx.evaluate(x,y),fy.evaluate(x,y));
COMPLEX z(0.0);
int count;
enum {max_count=256};
for (count=0;count<max_count;count++) {
z=z*z+c;
if ((z.real()*z.real()+z.imag()*z.imag())>4.0) break;
}

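/* Note: count==max_count (256) does not fit in a char; on typical
   platforms it wraps to 0, so points inside the set come out black. */
return count;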
}

class row {
public:
char data[wid];
};

/* Run as process "rank", one process among "size" others.
Each socket connects you with another rank: s[0] connects to rank 0.
*/
void run(int rank,int size,SOCKET *s) {
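/* Each rank renders procpiece rows, starting at global row gystart.
   Note: if size does not divide ht evenly, the last ht%size rows are never rendered. */
int procpiece=ht/size; int gystart=rank*procpiece;
row limg[procpiece]; /* local copy of our piece of the final image (variable-length array: a compiler extension in C++) */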

/* Render our piece of the image */
for (int y=0;y<procpiece;y++)
{
for (int x=0;x<wid;x++) limg[y].data[x]=render_mset(x,gystart+y);
}

if (rank>0)
{ /* send our partial piece to rank 0 */
skt_sendN(s[0],&limg[0].data[0],sizeof(row)*procpiece);
}
else
{ /* rank 0: receive partial pieces from other ranks */
row gimg[ht];
for (int r=0;r<size;r++)
if (r==0) {
memcpy(gimg,limg,sizeof(row)*procpiece);
} else {
skt_recvN(s[r],&gimg[r*procpiece].data[0],
sizeof(row)*procpiece);
}
/* Print out assembled image */
std::ofstream of("out.ppm",std::ios_base::binary);
of<<"P5\n"; // greyscale, binary
of<<wid<<" "<<ht<<"\n"; // image size
of<<"255\n"; // byte image
of.write(&gimg[0].data[0],sizeof(row)*ht);
}
}


int foo(void) {
double start=time_in_seconds();
unsigned int port=0;
const int size=read_input(); /* number of processes, read from the program's input */
SOCKET s[size];
SERVER_SOCKET serv=skt_server(&port);
for (int child=1;child<size;child++) {
int newpid=fork();
if (newpid==0) { /* I'm the child */
s[0]=skt_connect(skt_lookup_ip("127.0.0.1"),port,2);
run(child,size,s);
skt_close(s[0]);
exit(0); /* close out child process when done */
}
/* else I'm the parent */
s[child]=skt_accept(serv,0,0);
}
/* Now that all children are created, run as parent */
run(0,size,s);
/* Once parent is done, collect all the children */
for (int child=1;child<size;child++) {
skt_close(s[child]);
int status=0;
wait(&status); /* wait for child to finish */
}
std::cout<<"That took "<<(time_in_seconds()-start)<<" seconds\n";
return 0;
}

(Try this in NetRun now!)
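One caveat when picking an arbitrary process count: run() computes procpiece=ht/size, which rounds down, so any leftover rows are never rendered. With ht=1000 and 9 processes, each rank gets 111 rows and the last row of the image is skipped. Here is a minimal sketch of one way to hand out the leftover rows; row_range and rows_for_rank are hypothetical names for this illustration, not part of the lecture code.

```cpp
#include <cassert>

// Rows [start, start+count) of the global image owned by one rank.
struct row_range { int start; int count; };

// Split ht rows among size ranks as evenly as possible.
// The first (ht % size) ranks each take one extra row,
// so every row of the image is covered exactly once.
row_range rows_for_rank(int rank, int size, int ht) {
    assert(size > 0 && rank >= 0 && rank < size && ht >= 0);
    int base = ht / size;   // rows every rank gets
    int extra = ht % size;  // leftover rows to hand out
    row_range r;
    r.count = base + (rank < extra ? 1 : 0);
    // Ranks before this one hold rank*base rows, plus one extra row
    // for each of them that is among the first "extra" ranks.
    r.start = rank * base + (rank < extra ? rank : extra);
    return r;
}
```

In run(), each rank would then use r.start in place of gystart and r.count in place of procpiece, so local row y still maps to global row r.start+y. Rank 0 would call rows_for_rank(r,size,ht) for each sender to know where in gimg to receive and how many rows to expect.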