Theoretically, we're solving Laplace's Equation (a special case of the Poisson Problem). In practice, we're just doing repeated neighborhood averaging:

new temperature (x,y) = 0.25 * (temperature(x-1,y) + temperature(x+1,y) + temperature(x,y-1) + temperature(x,y+1));

Boundary elements, which are missing a neighbor on at least one side, are typically computed in a different way, such as:

- Imposing a fixed value, like an inside or outside temperature
- Copying over a value from the interior, making an insulating boundary
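As a minimal sketch of those two boundary treatments, assuming a w x h grid stored row-major (the function name and parameters here are illustrative, not from the code below):

```cpp
#include <vector>

// Sketch: apply a fixed-value boundary to the left edge and an
// insulating (copy-from-interior) boundary to the right edge.
void apply_boundaries(std::vector<float> &t,int w,int h,float outside)
{
    for (int y=0;y<h;y++) {
        t[y*w+0]   = outside;     // fixed value: imposed outside temperature
        t[y*w+w-1] = t[y*w+w-2];  // insulating: copy the adjacent interior value
    }
}
```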

Here's a simple sequential implementation.

#include <chrono>
#include <cmath>
#include <fstream>
#include <iostream>
#include <utility>
#include <vector>

/* Wall-clock time in seconds (any high-resolution timer works here) */
double time_in_seconds(void) {
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now().time_since_epoch()).count();
}

/* Store a 2D array as a row-major 1D array */
template <class T>
class array2D {
    int wid,ht;
    std::vector<T> data; /* wid * ht elements */
public:
    array2D(int w,int h) :wid(w), ht(h), data(w*h) {}

    // Return array size
    inline int nx() const {return wid;}
    inline int ny() const {return ht;}

    // Manipulate array elements
    T &operator() (int x,int y) {return data[y*wid+x];}
    T operator() (int x,int y) const {return data[y*wid+x];}

    // Swap our data with this array
    void swap(array2D<T> &other) {
        std::swap(wid,other.wid);
        std::swap(ht ,other.ht );
        std::swap(data,other.data);
    }
};

/* Dump a 2D array to a grayscale PGM file */
template <class T>
void write(const array2D<T> &arr,const char *name) {
    std::ofstream f(name,std::ios_base::binary);
    f<<"P5\n"; // grayscale
    f<<arr.nx()<<" "<<arr.ny()<<"\n"; // dimensions
    f<<"255\n"; // byte data
    for (int y=0;y<arr.ny();y++)
    for (int x=0;x<arr.nx();x++)
    {
        float v=arr(x,y)*255.99;
        unsigned char c=(unsigned char)v;
        if (v<0) c=0;
        if (v>255) c=255;
        f.write((char *)&c,1);
    }
}

int foo(void) {
    int w=1000, h=1000;
    array2D<float> cur(w,h);
    array2D<float> next(w,h);

    // Make initial conditions
    for (int y=0;y<cur.ny();y++)
    for (int x=0;x<cur.nx();x++)
    {
        cur(x,y)=fmod(0.01*sqrt((double)(x*x+y*y)),1.0);
        // subtle: need boundary conditions for next array
        next(x,y)=cur(x,y);
    }

    // Run a few iterations of blurring
    enum {nblur=100};
    double start=time_in_seconds();
    for (int blur=0;blur<nblur;blur++)
    {
        for (int y=1;y<cur.ny()-1;y++)
        for (int x=1;x<cur.nx()-1;x++)
        {
            next(x,y)=0.25*(cur(x-1,y)+cur(x+1,y)+cur(x,y-1)+cur(x,y+1));
        }
        cur.swap(next);
    }
    double elapsed=time_in_seconds()-start;
    std::cout<<"Performance: "<<elapsed/((w-2)*(h-2)*nblur)*1.0e9<<" ns/pixel\n";

    // Dump final image (good for debugging)
    write(cur,"out.pgm");
    return 0;
}

This takes about 10.9 nanoseconds per pixel on my Q6600 in pure sequential mode.

Adding OpenMP around the outer "blur" loop compiles and doesn't crash, but it gives the wrong answer: iteration 50 can't start until iteration 49 has finished, so those iterations can't legally run in parallel. The right place to add OpenMP is around the "y" loop, which gives decent performance.

It's a little tricky to solve this kind of problem using SIMD instructions. The big problem is the shifted left and right neighbor accesses, although in theory they can be implemented with either unaligned loads or shuffles (swizzles).

In MPI, the hard part is that once we divide up "cur", the processors need to exchange boundary data every step of the outer loop. This involves sends and receives, and is simplest if you only divide up one axis, like the Y loop.