My new-this-year language EPGPU is built on top of OpenCL, a parallel programming language that currently runs on NVIDIA GPUs, ATI GPUs, AMD multicore CPUs, and Intel multicore/SSE/AVX CPUs, all with quite good performance. The big drawback of OpenCL, the many function calls needed to get anything done, is hidden entirely inside EPGPU. The basic interface is quite simple:

#include "epgpu.h"
#include "epgpu.cpp"
#include <iostream>

GPU_FILLKERNEL(float,
simple_floats,(float start),
{
	result=start+i;
}
)

int main() {
	int n=10;
	gpu_array<float> arr(n);
	arr=simple_floats(1000.23);
	for (int i=0;i<n;i++) std::cout<<arr[i]<<"\n";
	return 0;
}

"GPU_FILLKERNEL" makes a kernel (like a function) that can be run on the GPU (or other parallel hardware). "i" is effectively a keyword, giving the array index for your float. "result" is a keyword, storing the value of your float.

Floats are stored in a "gpu_array", which is basically my GPU version of a std::vector (and similar to a thrust::device_vector). It's easy to run a FILL kernel into an array:

arr=simple_floats(1000.23);

This runs the simple_floats function on every array element, splitting into threads, blocks, or SIMD units as needed by the current hardware.

For more complex functions, where functions and array elements aren't in a one-to-one mapping, you use a non-FILL kernel. Now you need to manually pass in the array address, and explicitly run however many copies of the function you need.

#include "epgpu.h"
#include "epgpu.cpp"
#include <iostream>

GPU_KERNEL(
scatter_floats,(__global<float *> array,float param),
{
	array[i]=param;
	array[10+i]=7;
}
)

int main() {
	int n=20;
	gpu_array<float> arr(n);
	scatter_floats(arr,1.234).run(n/2);
	for (int i=0;i<n;i++) std::cout<<"["<<i<<"]= "<<arr[i]<<"\n";
	return 0;
}

FILL and non-FILL kernels are enough to write quite a few simple applications, as I explore in the technical paper on the EPGPU download page here:

http://www.cs.uaf.edu/sw/EPGPU/

total=0;
for (int i=0;i<length;i++) total+=arr[i];

In parallel, the problem is a lot harder--"total" is a sequential bottleneck, since everybody has to add to it simultaneously. In OpenMP, with a small number of threads, "reduction(+:total)" creates private copies, and adds them up with atomic operations, but this doesn't scale well to the many thousands of threads on a GPU, because it takes time in O(p) for p processors.

Another algorithm that scales much better is a tree-based summation, shown here for 16 elements. We begin by summing pairs of two elements, then summing across groups of four elements, then eight, and finally all sixteen.

The code is as follows:

#include "epgpu.h"
#include "epgpu.cpp"
#include "lib/inc.c" /* netrun, for timing */
#include <iostream>

GPU_KERNEL(
writeFloats,(__global<float *> arr,float start),
{
	arr[i]=start;
}
)

GPU_KERNEL(
totalFloats,(__global<float *> arr,int length,int step),
{
	int srcIdx=step*i+step/2;
	if (srcIdx<length)
		arr[step*i]+=arr[srcIdx];
}
)

/* Total up the n values in this array into arr[0].
   Does this recursively, using log(n) kernel invocations. */
void totalArray(gpu_array<float> &arr,int n)
{
	for (int step=2;step/2<n;step*=2) {
		totalFloats(arr,n,step).run((n+step-1)/step);
	}
}

/* Single-thread version: easy to write, but extraordinarily slow (250ns/float!) */
GPU_FILLKERNEL(float,
linearSum,(__global<float *> arr,int length),
{
	result=0;
	for (int i=0;i<length;i++) result+=arr[i];
}
)

int main() {
	int n=8000;
	gpu_array<float> arr(n);
	float start=1000.23;
	writeFloats(arr,start).run(n);

	double t=time_in_seconds();
	gpu_array<float> total(1);
	total=linearSum(arr,n);
	float result=total[0];
	std::cout<<"Linear total="<<result<<" ("<<(time_in_seconds()-t)*1.0e6<<"us)\n";

	t=time_in_seconds();
	totalArray(arr,n);
	result=arr[0];
	std::cout<<"Recursive total="<<result<<" ("<<(time_in_seconds()-t)*1.0e6<<"us)\n";

	std::cout<<"Analytic total="<<start*n<<"\n";
	return 0;
}

On the NetRun GTX 280, this gives:

OpenCL initialized with 30 compute units
Linear total=8.00104e+06 (3102.08us)
Recursive total=8.00184e+06 (345.117us)
Analytic total=8.00184e+06

Note that even with a small 8000-entry array, the sequential linear summation is already much slower (O(n) versus O(lg n)) and less accurate (due to roundoff) than the tree-based recursive summation. Both effects only get worse as the array grows.

You can do even better by putting the lower levels of the tree summation into GPU "__local" memory, which allows you to use local thread synchronization instead of calling a second kernel, and by running the highest levels of the tree on the CPU, where sequential summation is faster. But both of these implementation adjustments just tweak the constant factor--the big speedup comes from using a good parallel algorithm!