Graphics Processing Unit (GPU) Programming in CUDA and OpenCL

CS 641 Lecture, Dr. Lawlor

First, a brief word on GPU hardware--it's really not as weird as you think. Basically, the 1000 or so "stream processors" in a modern GPU typically consist of:

16 or 32-way SIMD execution units, driven by a weakly superscalar scheduling unit, making up one core. The GPU compiler hides the SIMD instructions, so they're never in your face from a programmability standpoint. But for performance, branch granularity is very important, since all the SIMD units will take the same branches. Unlike the CPU, GPU SIMD width has been shrinking lately (ATI just switched from 64 to 16). NVIDIA/CUDA calls one SIMD block a "thread warp"; ATI calls the same thing a "wavefront".
Between 2 and 48 cores. Nobody calls them cores (for marketing reasons), but conceptually that's exactly what they are. Again, the graphics driver spreads your code across the cores for you.
Very deep hyperthreading, with each core running up to several hundred threads.

If you multiply these numbers, you get about a thousand arithmetic units executing instructions chosen from 10K-100K threads, which as you might expect can deliver serious performance, currently about a teraflop.

But if you look at a modern CPU, like an Intel Sandy Bridge, it might have:

8-way AVX SIMD units, driven by a very agressive scheduling unit. Sadly, as a programmer the SIMD is right in your face.
4 or 8 cores. Sadly, these are again right in your face, unless you're using OpenMP or some similar decent tool.
2-way hyperthreading, doubling the number of software-visible cores.

Multiplying these out shows about 100-way parallelism is certainly possible, although to get there the traditional programming interface is nasty. Last year Intel and AMD both released very impressive multicore/SIMD OpenCL drivers for their CPU hardware--thus using the GPU programming model on the CPU!

So the modern contest is really not "ordinary serial C++" versus "crazy GPU stuff". Instead, it's a three-way battle, between the increasingly dated sequential C++, a wide variety of multicore/SIMD CPU models, and a few new GPU programming languages.

I foresee increasingly little difference between a heavily SIMD+multithreaded CPU, and a GPU.

CUDA

CUDA is NVIDIA corporation's C++-style GPU programming language. It only runs on NVIDIA hardware, which is annoying, but the language is really quite good, so it's currently the most popular.

Here's probably the simplest possible CUDA program:

#include <iostream>
#include <cuda.h>

/* GPU code: set an array to a value */
__global__ void set_array(float *vals,float param) {
	int i=threadIdx.x; /* find my index */
	vals[i]=i+param;
}
int main(int argc,char *argv[]) {
	int n=16; /* total number of floats */
	float *vals; /* device array of n values */
	cudaMalloc( (void**) &vals, n*sizeof(float) ); //Allocate GPU space
	
	set_array<<<1,n>>>(vals,0.1234); /* Initialize the space on the GPU */ 

/* Copy a few elements back to CPU for printing */
	int i=7;
	float f=-999.0; /* CPU copy of value */
	cudaMemcpy(&f,&vals[i],sizeof(float),cudaMemcpyDeviceToHost);
	std::cout<<"vals["<<i<<"] = "<<f<<"\n";
        return 0;
}

(Try this in NetRun now!)

Basically, we:

Start by including cuda.h.
Define a funky GPU (__global__) function "set_array". It writes to a GPU array named "vals" at its own thread index.
Define an ordinary C++ CPU main function.
Allocate space on the GPU with cudaMalloc. You get a pointer back, but it's a pointer into GPU space, so you can't dereference it, only hand it back to the GPU!
Call your funky GPU function using the funky "kernel invocation syntax". Inside the <<< >>> goes the number of GPU "blocks" of threads, then the number of threads per block. The total parallelism is blocks*threads.
Copy data back to main using a cudaMemcpy. Be sure to check for errors when you do this, unlike above!

In the version below, we actually check errors, and we're allocating threads in 'blocks', which lets you use multiple blocks to work around the hardware's thread count limit of 512 threads per block:

#include <iostream>
#include <cuda.h>
/* error checking */
#define check(cudacall) { int err=cudacall; if (err!=cudaSuccess) std::cout<<"CUDA ERROR "<<err<<" at line "<<__LINE__<<"'s "<<#cudacall<<"\n";}

/* GPU code: set an array to a value */
__global__ void set_array(float *vals,float param) {
	int i=threadIdx.x+blockDim.x*blockIdx.x; /* find my index */
	vals[i]=i+param;
}
int main(int argc,char *argv[]) {
	int w=4, h=4; /* number of blocks, threads per block */
	int n=w*h; /* total number of floats */
	float *vals; /* device array of n values */
	check(cudaMalloc( (void**) &vals, n*sizeof(float) )); //Allocate some space
	
	set_array<<<w,h>>>(vals,0.1234); /* Initialize the space on the GPU */ 

/* Copy a few elements back to CPU for printing */
	for (int i=0;i<n;i+=3) {
		float f=-999.0; /* CPU copy of value */
		check(cudaMemcpy(&f,&vals[i],sizeof(float),cudaMemcpyDeviceToHost));
		std::cout<<"vals["<<i<<"] = "<<f<<"\n";
	}
        return 0;
}

(Try this in NetRun now!)

Several interesting architectural details crop up in the CUDA documentation:

The execution model is very unusual, especially compared to typical "what one thread should do" machine code.

A "thread" means one single execution of a kernel. This is mostly conceptual, since the hardware operates on warps of threads.
A "warp" is a group of 32 threads that all take the same branches. A warp is really a SIMD group: a bunch of floats sharing one instruction decoder. The hardware does a good job with predication, so warps aren't "in your face" like with SSE instructions.
A "block" is a group of a few hundred threads that have access to the same "__shared__" memory. The block size is specified in software, but limited by hardware to 256, 512, or 1024 threads maximum. More threads per block is generally better, with serious slowdowns for less than 100 or so threads per block. The PTX manual calls blocks "CTAs" (Cooperative Thread Arrays).
The entire kernel consists of a set of blocks of threads.

The memory model is also highly segmented and specialized, unlike the flat memory of modern CPUs.

"registers" are unique to that thread. Early 8000-series cards had 8192 registers available; GTX 200 series had 16K registers; and the new Fermi GTX 400s have 32K registers. Registers are divided up among threads, so the fewer registers each thread uses, the more threads the machine can keep in flight, hiding latency.
"shared" memory is declared with __shared__, and can be read or written by all the threads in a block. This is handy for neighborhood communication where global memory would be too slow. There is at least 16KB of shared memory available per thread block; Fermi cards can expose up to 48KB with special config options. Use "__syncthreads()__" to synchronize shared writes across a whole thread block.
"global" memory is the central gigabyte or so of GPU RAM. All cudaMemcpy calls go to global memory, which is considered "slow" at only 100GB/second! Older hardware had very strict rules on "global memory coalescing", but luckily newer (Fermi-era) hardware just prefers locality, if you can manage it.
"param" is the PTX abstraction around the parameter-passing protocol. They reserve the right to change this, as hardware and software changes.
constants and compiled program code are stored in their own read-only memories.
"local" memory is unique to each thread, but paradoxically slower than shared memory. Don't use it!

CUDA Performance: Use Big Arrays!

One universal aspect of CUDA is that kernel calls (<<<), mallocs, and memcpy all take quite a long time to get running. Once they're running, they're fairly fast, but for maximum efficiency you should plan on accessing about a megabyte each time! Here's an example where I benchmark this length dependence:

#include <iostream>
#include <cuda.h>
#include "lib/inc.c"

float *dev_ptr=0, *host_ptr=0;
int len=0;

int time_memcpy(void) {
	cudaMemcpy(dev_ptr,host_ptr,len,cudaMemcpyHostToDevice);
	cudaThreadSynchronize();
	return 0;
}

__global__ void doGPUdatawrite(float *arr,float val) {
	int i=blockIdx.x*blockDim.x+threadIdx.x;
	arr[i]=val;
}
int time_datawrite(void) {
	int blocks=1, tpb=len;
	while (tpb>=512) {blocks*=2; tpb/=2;}
	doGPUdatawrite<<<blocks,tpb>>>(dev_ptr,1.2345);
	cudaThreadSynchronize();
	return 0;
}

void time_sweep(const char *name,timeable_fn f) {
	int max=1024*1024*4;
	cudaMalloc((void **)&dev_ptr,max*sizeof(float));
	cudaMallocHost((void **)&host_ptr,max*sizeof(float));
	std::cout<<"Did mallocs\n";
	for (len=64;len<=max;len*=4) {
		double t=time_function(f);
		printf("%s size %d: %.2f ns/float (%.1f us)\n",
			name,len,t/len*1.0e9,t*1.0e6);
	}
	cudaFreeHost(host_ptr);
	cudaFree(dev_ptr);
}

int main(int argc,char *argv[]) {
	std::cout<<"Starting up...\n";
	time_sweep("memcpy",time_memcpy);
	time_sweep("GPU write",time_datawrite);
	return 0;
}

(Try this in NetRun now!)

Here's the output on NetRun's NVIDIA GeForce GTX 280:

memcpy size 64: 163.73 ns/float (10.5 us)
memcpy size 256: 41.11 ns/float (10.5 us)
memcpy size 1024: 10.91 ns/float (11.2 us)
memcpy size 4096: 2.95 ns/float (12.1 us)
memcpy size 16384: 1.01 ns/float (16.6 us)
memcpy size 65536: 0.53 ns/float (34.7 us)
memcpy size 262144: 0.41 ns/float (108.3 us)
memcpy size 1048576: 0.38 ns/float (401.3 us)
memcpy size 4194304: 0.38 ns/float (1574.1 us)

GPU write size 64: 200.15 ns/float (12.8 us)
GPU write size 256: 50.37 ns/float (12.9 us)
GPU write size 1024: 12.65 ns/float (13.0 us)
GPU write size 4096: 3.21 ns/float (13.2 us)
GPU write size 16384: 0.85 ns/float (13.9 us)
GPU write size 65536: 0.25 ns/float (16.5 us)
GPU write size 262144: 0.10 ns/float (25.7 us)
GPU write size 1048576: 0.06 ns/float (62.8 us)
GPU write size 4194304: 0.05 ns/float (210.4 us)

Note that really big arrays, with millions of floats, take much less time per float than smaller arrays. The basic trouble is that going through the OS, across the PCIe bus, into the GPU, and back takes like 10 microseconds (10us = 10000ns). If you go to all that just to get one or two floats, you'll get terrible performance.

CUDA Application Performance: Mandelbrot Rendering

Here's a little Mandelbrot set rendering application in CUDA. It includes benchmarking code, which shows some surprising results.

#include <cuda.h>
#include <iostream>
#include <fstream>
#include "lib/inc.c" /* NetRun utility functions, like time_in_seconds */

#define check(cudacall) { int err=cudacall; if (err!=cudaSuccess) std::cout<<"CUDA ERROR "<<err<<" at line "<<__LINE__<<"'s "<<#cudacall<<"\n";}

/* GPU Code! */
__global__ void fill_in_array(float *arr,int wid) {
	int i=threadIdx.x+blockDim.x*blockIdx.x; // my thread's global number
	int x=i%wid, y=i/wid; // my thread's pixel to work on

	float cr=x*1.0/wid, ci=y*1.0/wid;
	float zr=cr, zi=ci;
	int count;
	const int max_count=256;
	for (count=0;count<max_count;count++) {
		// z= z*z+c
		float nzr=zr*zr-zi*zi + cr;
		float nzi=2.0*zr*zi + ci;
		if ((nzr*nzr+nzi*nzi)>4.0) break;
		zr=nzr; zi=nzi;
	}
	arr[y*wid+x]=count;
}

/* Run on CPU */
int main(int argc,char *argv[]) {
	int wid=512,ht=512;
	float *arr=0; /* LIVES ON THE GPU!!!! */
	double start=time_in_seconds(), elapsed;
	check(cudaMalloc((void **)&arr, wid*ht*sizeof(float)));
	elapsed=time_in_seconds()-start;
	std::cout<<"Allocated array: "<<elapsed*1.0e9/(wid*ht)<<"ns/pixel, time "<<elapsed<<" seconds\n";
	start=time_in_seconds(); // <- reset the timer!
	int threadsPerBlock=512;
	int nBlocks=wid*ht/threadsPerBlock;
	fill_in_array<<<nBlocks,threadsPerBlock>>>(arr,wid);

	check(cudaThreadSynchronize()); // look for errors in kernel
	elapsed=time_in_seconds()-start;
	std::cout<<"Rendered array: "<<elapsed*1.0e9/(wid*ht)<<"ns/pixel, time "<<elapsed<<" seconds\n";
	
	float harr[wid*ht];
	check(cudaMemcpy(harr,arr,wid*ht*sizeof(float),cudaMemcpyDeviceToHost));
	elapsed=time_in_seconds()-start;
	std::cout<<"Copied out array: "<<elapsed*1.0e9/(wid*ht)<<"ns/pixel, time "<<elapsed<<" seconds\n";

	std::ofstream of("out.ppm",std::ios_base::binary);
	of<<"P5\n"; // greyscale, binary
	of<<wid<<" "<<ht<<"\n"; // image size
	of<<"255\n"; // byte image
	for (int i=0;i<wid*ht;i++) {
		char c=(char)harr[i];
		of<<c;
	}
	return 0;
}

(Try this in NetRun now!)

The surprising part about this is the timing breakdown:

Allocated array: 152.6ns/pixel, time 0.040 seconds
Rendered array: 3.4ns/pixel, time 0.001 seconds
Copied out array: 10.6ns/pixel, time 0.003 seconds

The biggest surprise is the cudaMalloc time, which is ridiculously huge because this is the first CUDA call in the program, so the driver has to locate and set up the graphics card. Adding a "dummy" cudaMalloc of four bytes reduces the allocation time to less than a nanosecond per pixel.

Second, it takes longer to copy the data off the card than it does to compute it! Switching to a "char" data type cuts both the copy-out time (due to less data copied) as well as the rendering time (due to less data written)

(Try this in NetRun now!)

Allocated array: 0.02ns/pixel, time 5.11634e-06 seconds
Rendered array: 2.74ns/pixel, time 0.000719074 seconds
Copied out array: 4.63ns/pixel, time 0.00121307 seconds

CUDA and Thrust

A nice library for more complex operations is called Thrust, sort of like STL or Boost for the GPU. As of CUDA 4.0, Thrust is an official part of the CUDA distribution, and it's preinstalled on NetRun.

Here's how to use thrust::reduce, which can be used to add up all the values in an array, or find their maximum or minimum. 0 is the additive identity. plus is the operation to perform.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdlib>

int main(void)
{
  // generate some random data on the host
	thrust::host_vector<int> h_vec(10);
	for (unsigned int i=0;i<h_vec.size();i++) h_vec[i]=rand()%10;

  // transfer to device
	thrust::device_vector<int> d_vec = h_vec;

  // sum on device
	int final_sum = thrust::reduce(d_vec.begin(), d_vec.end(), 
		0, thrust::plus<int>());

	std::cout<<"Final sum="<<final_sum<<"\n";

	return 0;
}

(Try this in NetRun now!)

Here's how to use thrust::reduce to find the biggest and smallest elements in the array too.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdlib>

int main(void)
{
  // generate some random data on the host
	thrust::host_vector<int> h_vec(10);
	for (unsigned int i=0;i<h_vec.size();i++) h_vec[i]=rand()%10;

  // transfer to device
	thrust::device_vector<int> d_vec = h_vec;

  // sum on device
	int final_sum = thrust::reduce(d_vec.begin(), d_vec.end(), 
		0, thrust::plus<int>());
	int final_max = thrust::reduce(d_vec.begin(), d_vec.end(), 
		0, thrust::maximum<int>());
	int final_min = thrust::reduce(d_vec.begin(), d_vec.end(), 
		999, thrust::minimum<int>());

	std::cout<<"Final sum="<<final_sum<<"  max="<<final_max<<"  min="<<final_min<<"\n";

	return 0;
}

(Try this in NetRun now!)

It's not very efficient (see below) to call thrust::reduce three times on the same vector. It's more efficient to call it once, and collect up the sum, min, and max all in one go. To do this, we need to write a weird 'functor' to pass to thrust::reduce.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdlib>

// This is used to store everything we know about the ints seen so far:
class sum_min_max {
public:
	int sum, min, max;
	sum_min_max() {sum=0; min=1000000000; max=-1000000000;}
	__device__ __host__ sum_min_max(int value) {sum=value; min=value; max=value;}
};

// This 'functor' function object combines two sum_min_max objects
class smm_combiner {
public:
__device__ __host__ 
sum_min_max operator()(sum_min_max l,const sum_min_max &r) {
	l.sum+=r.sum;
	if (l.min>r.min) l.min=r.min;
	if (l.max<r.max) l.max=r.max;
	return l;
}
};

int main(void)
{
  // generate some random data on the host
	thrust::host_vector<int> h_vec(10);
	for (unsigned int i=0;i<h_vec.size();i++) h_vec[i]=rand()%10;

  // transfer to device
	thrust::device_vector<int> d_vec = h_vec;

  // sum/min/max on device
	sum_min_max final = thrust::reduce(d_vec.begin(), d_vec.end(), 
		sum_min_max(), smm_combiner());

	std::cout<<"Final sum="<<final.sum<<"  max="<<final.max<<"  min="<<final.min<<"\n";

	return 0;
}

(Try this in NetRun now!)

This same idea could probably be better written as a thrust::tuple.

Here's how to use thrust::sort.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cstdlib>
#include "lib/inc.c" /* for netrun timing functions */

int main(void)
{
  // generate some random data on the host
	thrust::host_vector<int> h_vec(10);
	for (unsigned int i=0;i<h_vec.size();i++) h_vec[i]=rand();

  // transfer to device
	thrust::device_vector<int> d_vec = h_vec;

  // sort on device
	thrust::sort(d_vec.begin(), d_vec.end());

  // copy back, and print
	h_vec = d_vec;
	for (unsigned int i=0;i<h_vec.size();i++) 
		std::cout<<"val["<<i<<"] = "<<h_vec[i]<<"\n";

	return 0;
}

(Try this in NetRun now!)

Here's a version of thrust::sort where we time the sort for various data sizes.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>
#include "lib/inc.c" /* for netrun timing functions */

thrust::device_vector<int> d_vec;
int sort_device_vector(void) {
	// sort on device
	thrust::sort(d_vec.begin(), d_vec.end());
	return 0;
}

int main(void)
{
  for (int vec_size=16;vec_size<=4*1024*1024;vec_size*=4) 
  {
  // generate random data on the host
	thrust::host_vector<int> h_vec(vec_size);
	for (unsigned int i=0;i<h_vec.size();i++) h_vec[i]=rand();

  // transfer to device
	d_vec = h_vec;

	double time=time_function(sort_device_vector);
	printf("Time for sort of size %d: %.3f ns/elt (%.3f ms)\n",
		h_vec.size(), time/h_vec.size()*1.0e9, time*1.0e3);

  // copy back and print
	h_vec = d_vec;

	if (vec_size<=16)
		for (unsigned int i=0;i<h_vec.size();i++) 
			std::cout<<"val["<<i<<"] = "<<h_vec[i]<<"\n";

  }

  return 0;
}

(Try this in NetRun now!)

This outputs:

Time for sort of size 16: 21099.579 ns/elt (0.338 ms)
Time for sort of size 64: 5302.252 ns/elt (0.339 ms)
Time for sort of size 256: 1377.077 ns/elt (0.353 ms)
Time for sort of size 1024: 312.655 ns/elt (0.320 ms)
Time for sort of size 4096: 78.331 ns/elt (0.321 ms)
Time for sort of size 16384: 20.922 ns/elt (0.343 ms)
Time for sort of size 65536: 8.636 ns/elt (0.566 ms)
Time for sort of size 262144: 5.332 ns/elt (1.398 ms)
Time for sort of size 1048576: 3.527 ns/elt (3.699 ms)
Time for sort of size 4194304: 3.027 ns/elt (12.694 ms)

Note how sorting 16 elements and 16 thousand elements takes nearly exactly the same amount of time. This is common on the GPU--the startup time to get a kernel running is measured in microseconds, not nanoseconds like a function call. This means you need to do work in huge batches, millions of data items at a time, to get good efficiency.

CUDA Near Peak Performance

This program hits nearly peak performance on a GTX 280, about 520 gigaflops:

#include <iostream>
#include <cuda.h>
#include "lib/inc.c"

const int ops_per=100;

/* GPU code: set an array to a value */
__global__ void set_array(float *vals,float param) {
	int i=blockIdx.x*blockDim.x+threadIdx.x; /* find my index */
	float x=0.9f+i*0.0000001f;
	for (int rep=0;rep<ops_per;rep++) {
		x=x*0.1f+0.97f;
	}
	vals[i]=x;
}
int main(int argc,char *argv[]) {
	int n=10000000; /* total number of floats */
	float *vals; /* device array of n values */
	cudaMalloc( (void**) &vals, n*sizeof(float) ); //Allocate GPU space
	
	double start=time_in_seconds();
	int threads=256; int blocks=n/threads;
	set_array<<<blocks,threads>>>(vals,765.4321); /* Initialize the space on the GPU */
	float f=-999.0; /* CPU copy of value */
	cudaMemcpy(&f,&vals[0],sizeof(float),cudaMemcpyDeviceToHost); /* forces kernel to actually time */
	double per=(time_in_seconds()-start)/n;
	std::cout<<"That took "<<per*1.0e9<<" ns/element, or "<<2*ops_per/per*1.0e-9<<" GF ("<<blocks<<" blocks)\n";

/* Copy a few elements back to CPU for printing */
	for (int i=0;i<n;i+=n/20) {
		double start=time_in_seconds();
		cudaMemcpy(&f,&vals[i],sizeof(float),cudaMemcpyDeviceToHost);
		double per=(time_in_seconds()-start);
		std::cout<<"vals["<<i<<"] = "<<f<<"   ("<<per*1.0e6<<" microseconds)\n";
	}
        return 0;
}

(Try this in NetRun now!)

Dr. Lawlor's EPGPU Language

My 2011 language EPGPU is built on top of OpenCL, a parallel programming language that currently runs on NVIDIA GPUs, ATI GPUs, AMD multicore CPUs, and Intel multicore/SSE/AVX CPUs, all with quite good performance. The big drawback of OpenCL, the many function calls needed to get anything done, are all hidden inside of EPGPU.

The basic interface is quite simple: GPU code goes inside a macro, which builds a C++ object.

#include "epgpu.h"
#include "epgpu.cpp"
#include <iostream>

GPU_FILLKERNEL(float,
 simple_floats,(float start),
 {
	result=start+i;
 }
)

int main() {
	int n=10;
	gpu_vector<float> arr(n);
	arr=simple_floats(1000.23);
	for (int i=0;i<n;i++) std::cout<<arr[i]<<"\n";
	return 0;
}

(Try this in NetRun now!)

"GPU_FILLKERNEL" makes a kernel (like a function) that can be run on the GPU (or other parallel hardware). "i" is effectively a keyword, giving the array index for your float. "result" is a keyword, storing the value of your float.

Floats are stored in a "gpu_vector", which is basically my GPU version of a std::vector (and similar to a thrust::device_vector). It's easy to run a FILL kernel into an array:

    arr=simple_floats(1000.23);

This runs the simple_floats function on every array element, splitting into threads, blocks, or SIMD units as needed by the current hardware.

For more complex functions, where functions and array elements aren't in a one-to-one mapping, you use a non-FILL kernel. Now you need to manually pass in the array address, and explicitly run however many copies of the function you need.

#include "epgpu.h"
#include "epgpu.cpp"
#include <iostream>

GPU_KERNEL(
 scatter_floats,(__global<float *> array,float param),
 {
	array[i]=param;
	array[10+i]=7;
 }
)

int main() {
	int n=20;
	gpu_vector<float> arr(n);
	scatter_floats(arr,1.234).run(n/2);
	for (int i=0;i<n;i++) std::cout<<"["<<i<<"]= "<<arr[i]<<"\n";
	return 0;
}

(Try this in NetRun now!)

FILL and non-FILL kernels are enough to write quite a few simple applications, as I explore at the technical paper in the EPGPU download page here:
http://www.cs.uaf.edu/sw/EPGPU/

Parallel Summation Example

One common problem is to total up the values in an array. Serially, the code is trivial:

	total=0;
	for (int i=0;i<length;i++) total+=arr[i];

In parallel, the problem is a lot harder--"total" is a sequential bottleneck, since everybody has to add to it simultaniously. In OpenMP, with a small number of threads, "reduction(+:total)" creates private copies, and adds them up with atomic operations, but this doesn't scale well to the many thousands of threads on a GPU, because it takes time in O(p) for p processors.

Another algorithm that scales much better is a tree-based summation, shown here for 16 elements. We begin by summing pairs of two elements, then summing across groups of four elements, then eight, and finally all sixteen.
parallel prefix-like tree summation (reduction) [image is derivative work from public domain WikipediaFile:Prefix_sum_16.svg]

parallel prefix-like tree summation (reduction) [image is derivative work from public domain WikipediaFile:Prefix_sum_16.svg]

The code is as follows:

#include "epgpu.h"
#include "epgpu.cpp"
#include "lib/inc.c" /* netrun, for timing */
#include <iostream>

GPU_KERNEL(
 writeFloats,(__global<float *> arr,float start),
 {
	arr[i]=start;
 }
)

GPU_KERNEL(
 totalFloats,(__global<float *> arr,int length,int step),
 {
	int srcIdx=step*i+step/2;
	if (srcIdx<length)
		arr[step*i]+=arr[srcIdx];
 }
)

/* Total up the n values in this array into arr[0].
   Does this recursively, using log(n) kernel invocations. */
void totalArray(gpu_array<float> &arr,int n)
{
	for (int step=2;step/2<n;step*=2) {
		totalFloats(arr,n,step).run((n+step-1)/step);
	}
}

/* Single thread version: easy to write, but extraordinarily slow (250ns/float!) */
GPU_FILLKERNEL(float,
  linearSum,(__global<float *> arr,int length),
  {
	result=0;
	for (int i=0;i<length;i++) result+=arr[i];
  }
)

int main() {
	int n=8000;
	gpu_vector<float> arr(n);
	float start=1000.23;
	writeFloats(arr,start).run(n);

	double t=time_in_seconds();
	gpu_vector<float> total(1);
	total=linearSum(arr,n);
	float result=total[0];
	std::cout<<"Linear total="<<result<<"  ("<<(time_in_seconds()-t)*1.0e6<<"us)\n";

	t=time_in_seconds();
	totalArray(arr,n);
	result=arr[0];
	std::cout<<"Recursive total="<<result<<"  ("<<(time_in_seconds()-t)*1.0e6<<"us)\n";

	std::cout<<"Analytic total="<<start*n<<"\n";
	return 0;
}

(Try this in NetRun now!)

On the NetRun GTX 280, this gives:

OpenCL initialized with 30 compute units
Linear total=8.00104e+06  (3102.08us)
Recursive total=8.00184e+06  (345.117us)
Analytic total=8.00184e+06

Note that even with a small 8000-entry array, the sequential linear summation already much slower (O(n) versus O(lg n)) and less accurate (due to roundoff) than the tree-based recursive summation. Both effects just get worse as the array gets bigger.

You can do even better by putting the lower levels of the tree summation into GPU "__local__" memory, which allows you to use local thread synchronization instead of calling a second kernel, and by running the highest levels of the tree on the CPU, where sequential summation is faster. But both of these implementation adjustments just tweak the asymptotic constant--the big speedup is using a good parallel algorithm!