Simulating Heat Flow in Parallel

CS 441/641 Lecture, Dr. Lawlor

Theoretically, we're solving Laplace's Equation (a special case of the Poisson Problem).  In practice, we're just doing repeated neighborhood averaging:
     new temperature (x,y) = 0.25 * (temperature(x-1,y) + temperature(x+1,y) + temperature(x,y-1) + temperature(x,y+1));
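Where that 0.25 comes from (a standard finite-difference derivation, not spelled out in the code below): applying central differences to Laplace's equation on a unit grid gives

    \nabla^2 T \approx T(x-1,y) + T(x+1,y) + T(x,y-1) + T(x,y+1) - 4\,T(x,y) = 0

and solving for T(x,y) yields

    T(x,y) = \tfrac{1}{4}\bigl(T(x-1,y) + T(x+1,y) + T(x,y-1) + T(x,y+1)\bigr)

which is exactly the neighborhood average above.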

Typically boundary elements, which have no neighbor on one side, are computed in a totally different way, such as being held at a fixed temperature for the whole simulation (which is what the code below does), or wrapping around periodically to the far side of the grid.
This is a naturally parallel problem, but it requires communication along the boundaries between processors every step.  It's a simple toy example that has the same general communication pattern as more complex physics codes, such as fluid dynamics or ice sheet modeling.

Here's a simple sequential implementation.
#include <iostream>
#include <fstream>
#include <vector>
#include <cmath>
#include <algorithm>

/* Store a 2D array as a row-major 1D array */
template <class T>
class array2D {
    int wid,ht;
    std::vector<T> data; /* wid * ht elements */
public:
    array2D(int w,int h) :wid(w), ht(h), data(w*h) {}

    // Return array size
    inline int nx() const {return wid;}
    inline int ny() const {return ht;}

    // Manipulate array elements
    T &operator() (int x,int y) {return data[y*wid+x];}
    T operator() (int x,int y) const {return data[y*wid+x];}

    // Swap our data with this array
    void swap(array2D<T> &other) {
        std::swap(wid,other.wid);
        std::swap(ht ,other.ht );
        std::swap(data,other.data);
    }
};

/* Dump a 2D array to a grayscale PGM image file */
template <class T>
void write(const array2D<T> &arr,const char *name) {
    std::ofstream f(name,std::ios_base::binary);
    f<<"P5\n"; // binary grayscale (PGM)
    f<<arr.nx()<<" "<<arr.ny()<<"\n"; // dimensions
    f<<"255\n"; // byte data
    for (int y=0;y<arr.ny();y++)
    for (int x=0;x<arr.nx();x++)
    {
        float v=arr(x,y)*255.99;
        if (v<0) v=0; // clamp before converting to a byte
        if (v>255) v=255;
        unsigned char c=(unsigned char)v;
        f.write((char *)&c,1);
    }
}

int foo(void) {
    int w=1000, h=1000;
    array2D<float> cur(w,h);
    array2D<float> next(w,h);

    // Make initial conditions
    for (int y=0;y<cur.ny();y++)
    for (int x=0;x<cur.nx();x++)
    {
        cur(x,y)=fmod(0.01*sqrt(x*x+y*y),1.0);
        // subtle: need boundary conditions for next array
        next(x,y)=cur(x,y);
    }

    // Run a few iterations of blurring
    enum {nblur=100};
    double start=time_in_seconds(); // wall-clock timer provided by NetRun
    for (int blur=0;blur<nblur;blur++)
    {
        for (int y=1;y<cur.ny()-1;y++)
        for (int x=1;x<cur.nx()-1;x++)
        {
            next(x,y)=0.25*(cur(x-1,y)+cur(x+1,y)+cur(x,y-1)+cur(x,y+1));
        }
        cur.swap(next);
    }
    double elapsed=time_in_seconds()-start;
    std::cout<<"Performance: "<<elapsed/((w-2)*(h-2)*nblur)*1.0e9<<" ns/pixel\n";

    // Dump final image (good for debugging)
    write(cur,"out.pgm");
    return 0;
}

(Try this in NetRun now!)

This takes about 10.9 nanoseconds per pixel on my Q6600 in pure sequential mode. 

Adding OpenMP around the outer "blur" timestep loop runs fast and doesn't crash, but it gives the wrong answer: you can't start working on timestep 50 until you've finished timestep 49, so the timesteps can't run in parallel.  The right place to add OpenMP is around the "y" loop, where the rows within one timestep are independent; this gives decent performance, as in the sketch below.
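For example, here's a sketch of the timestep loop from the code above with just the row loop parallelized (build with OpenMP enabled, e.g. -fopenmp):

// Sketch: same timestep loop as above, with the row loop parallelized.
// Each thread writes its own rows of "next" and only reads "cur", so
// there is no data race inside one timestep; the swap stays serial.
for (int blur=0;blur<nblur;blur++)
{
    int w1=cur.nx()-1, h1=cur.ny()-1;
#pragma omp parallel for
    for (int y=1;y<h1;y++)
    for (int x=1;x<w1;x++)
    {
        next(x,y)=0.25*(cur(x-1,y)+cur(x+1,y)+cur(x,y-1)+cur(x,y+1));
    }
    cur.swap(next);
}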

It's a little tricky to solve this kind of problem using SIMD instructions.  The big problem is reading the left and right neighbors, which are offset by one float from the pixels you're writing; in theory those shifted reads can be implemented with either unaligned loads or shuffles (swizzles).
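Here's one possible SSE sketch of the inner loop using unaligned loads.  The function name blur_row_sse is made up for illustration, and it assumes the array2D class defined above:

#include <xmmintrin.h> /* SSE intrinsics */

/* Sketch: blur one interior row y of cur into next, 4 floats at a time. */
void blur_row_sse(array2D<float> &next, array2D<float> &cur, int y) {
    int x=1, wid=cur.nx();
    for (; x+4<=wid-1; x+=4) {
        __m128 L=_mm_loadu_ps(&cur(x-1,y)); // left neighbors (unaligned)
        __m128 R=_mm_loadu_ps(&cur(x+1,y)); // right neighbors (unaligned)
        __m128 U=_mm_loadu_ps(&cur(x,y-1)); // row above
        __m128 D=_mm_loadu_ps(&cur(x,y+1)); // row below
        __m128 avg=_mm_mul_ps(_mm_set1_ps(0.25f),
                   _mm_add_ps(_mm_add_ps(L,R),_mm_add_ps(U,D)));
        _mm_storeu_ps(&next(x,y),avg);
    }
    for (; x<wid-1; x++) // scalar cleanup for the last few pixels
        next(x,y)=0.25*(cur(x-1,y)+cur(x+1,y)+cur(x,y-1)+cur(x,y+1));
}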

In MPI, the hard part is that if we divide up "cur" across processors, each processor needs to exchange boundary rows with its neighbors on every step of the outer loop.  This involves sends and receives, and is simplest if you only divide up one axis, like the Y axis, so each processor owns a contiguous block of rows.
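Here's a minimal sketch of that per-step boundary exchange, assuming each rank stores its block of rows plus one extra "ghost" row above and below (the function name and storage layout are just illustrative, not taken from the code above):

#include <mpi.h>

/* Sketch: exchange ghost rows with the ranks above and below us.
   "local" holds localrows+2 rows of wid floats; row 0 and row localrows+1
   are the ghost rows, and rows 1..localrows are the rows this rank owns. */
void exchange_ghost_rows(array2D<float> &local, int rank, int size) {
    int wid=local.nx(), localrows=local.ny()-2;
    int up   = (rank>0)      ? rank-1 : MPI_PROC_NULL; // neighbor above
    int down = (rank<size-1) ? rank+1 : MPI_PROC_NULL; // neighbor below
    // Send my top owned row up; receive my bottom ghost row from below:
    MPI_Sendrecv(&local(0,1),           wid, MPI_FLOAT, up,   0,
                 &local(0,localrows+1), wid, MPI_FLOAT, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // Send my bottom owned row down; receive my top ghost row from above:
    MPI_Sendrecv(&local(0,localrows),   wid, MPI_FLOAT, down, 1,
                 &local(0,0),           wid, MPI_FLOAT, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

Using MPI_Sendrecv (and MPI_PROC_NULL for the ranks at the top and bottom edges) avoids the deadlock you can get if every rank does a blocking send before its receive.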