Neural Network processing on the GPU
CS 641 Lecture, Dr. Lawlor
The Basic Sigmoid
So... you're now able to write C++-like "GLSL" code that runs on the
graphics card. For example, this Neural-Network-friendly
"sigmoid" function runs in 0.4ns on the graphics card:
float sigmoid(float x) {
  return 2.0/(1.0 + exp2(-x)) - 1.0;
}
void main(void) {
  gl_FragColor=vec4(sigmoid(texcoords.x*10.0-5.0)*0.5+0.5);
}
(Try this in NetRun now!)
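Incidentally, this sigmoid is just a rescaled hyperbolic tangent, so its outputs lie strictly between -1 and +1; the *0.5+0.5 in main() then shifts them into the displayable 0-to-1 color range:
  2/(1+2^-x) - 1 = (1 - e^(-x ln 2))/(1 + e^(-x ln 2)) = tanh(x ln(2)/2)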
On the CPU, the same function runs in over 100ns:
#include <cmath>
float sigmoid(float x) {
  return 2.0/(1.0 + exp2(-x)) - 1.0;
}
int foo(void) {
  return sigmoid(2.6); /* the float return truncates to int; we only care about the timing */
}
(Try this in NetRun now!)
So, the graphics card is 200x faster for doing sigmoids. Of course, this is sort of an unfair comparison, because:
- We're using g++, a fairly poor optimizing compiler; supposedly the Intel compiler can make the CPU version several-fold faster.
- Our CPU is a fairly old hyperthreaded Pentium 4 at 2.8 GHz; a modern Core2 Extreme might again be several times faster.
- "exp2", like a lot of strange functions, is in hardware on the
graphics card, but is a big complicated software function on the CPU.
So 200x overall speedup is a bit much to expect. Yet the 200x speedup is not a mistake--graphics cards really are crazy-fast at floating-point arithmetic.
One Neural Network Layer
OK, if we've got sigmoids, the only other thing we need is weighted
inputs. So we need to store weights, input floats, and output
floats somehow. The standard graphics-card way to store these is
in "textures", which are just 2D rectangular arrays of color pixels
(color == 4 floats). So we've got to pick what the X axis, Y
axis, and colors mean.
Here's a decent choice:
- For input and output values, I'm going to make the Y axis be the
neural network's node number, the X axis be the problem number/4, and
the colors will be four separate problems inside each pixel.
Making the X axis and the color "axis" index the same thing (here, the problem number) is usually a good idea,
because textures are normally stored in RAM with X pixels contiguous,
and the four colors of each pixel contiguous inside that.
- For weights, I'm going to make the Y axis be the output node
number, the X axis be the input node number/4, and the colors be four
consecutive input nodes within each pixel. (A CPU-side packing sketch of both layouts follows this list.)
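To make that layout concrete, here's a minimal CPU-side packing sketch. None of this is from the demo code linked below; the names (packInputs, inputValue, weight, ...) and the 32-node, 4-problem sizes are made up for illustration, and the resulting arrays would be uploaded as RGBA float textures (e.g. with glTexImage2D and GL_RGBA32F_ARB):
#include <vector>

const int nNodes = 32;    // nodes per layer (matches the n=32 in the shader below)
const int nProblems = 4;  // problems evaluated at once (one RGBA pixel wide)

// Placeholder data sources -- replace with real inputs and trained weights.
float inputValue(int node, int problem) { return 0.0f; }
float weight(int out, int in) { return 0.0f; }

// Input/output texture: Y = node number, X = problem number/4, colors = 4 problems.
std::vector<float> packInputs(void) {
  int w = nProblems/4, h = nNodes;
  std::vector<float> tex(4*w*h);
  for (int node = 0; node < h; node++)
    for (int p = 0; p < nProblems; p++)
      tex[4*(node*w + p/4) + p%4] = inputValue(node, p);
  return tex;  // upload as an RGBA float texture (this becomes "tex1")
}

// Weight texture: Y = output node, X = input node/4, colors = 4 input nodes.
std::vector<float> packWeights(void) {
  int w = nNodes/4, h = nNodes;
  std::vector<float> tex(4*w*h);
  for (int out = 0; out < h; out++)
    for (int in = 0; in < nNodes; in++)
      tex[4*(out*w + in/4) + in%4] = weight(out, in);
  return tex;  // upload as an RGBA float texture (this becomes "tex3")
}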
So our inner loop looks like this. You have to imagine we've
somehow loaded up "tex1" with our input node values, and "tex3" with
our weights. The "texture2D" function looks up a texture's pixel
value given a 0-to-1 texture coordinate:
vec4 sigmoid(vec4 x) { /* sigmoid of four values at once */
  return 2.0/(1.0 + exp2(-x)) - 1.0;
}
void main(void) {
  vec4 activation=vec4(0.0); /* weighted sums for four problems at once */
  int n=32; /* number of input nodes */
  float pix2tex=1.0/256.0; /* y step of one pixel in a 256-pixel-tall texture */
  for (int i=0;i<n;i+=4) { /* walk the input nodes four at a time */
    /* w: four packed weights for this output pixel */
    vec4 w=texture2D(tex3,vec2(texcoords.y,(float(i)+0.5)*(1.0/float(n))));
    /* each weight scales one input node's values (four problems per lookup) */
    activation+=w.r*texture2D(tex1,vec2(0,(float(i)+0.5)*pix2tex));
    activation+=w.g*texture2D(tex1,vec2(0,(float(i+1)+0.5)*pix2tex));
    activation+=w.b*texture2D(tex1,vec2(0,(float(i+2)+0.5)*pix2tex));
    activation+=w.a*texture2D(tex1,vec2(0,(float(i+3)+0.5)*pix2tex));
  }
  /* map the (-1,1) sigmoid outputs into the displayable 0-to-1 color range */
  gl_FragColor=sigmoid(activation)*0.5+0.5;
}
(Try this in NetRun now!)
This runs in 5.4 ns per output pixel. Now, each output pixel
consists of the neural network node values for four separate problems.
Surprisingly, when this code runs for every pixel, it computes an
entire output layer for a whole set of input problems. If we run
this code several times (sketched below), we can push data through a Neural Network's
layers and get the final outputs for the whole batch of problems.
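For concreteness, here's a rough sketch of what that CPU-side driver loop could look like with OpenGL framebuffer objects. This is only an illustration, not the structure of the demo code below: the framebuffer, shader program, and texture handles (fbo, prog, weightTex, scratchTex, ...) are assumed to already exist from setup, and the shader is the one above, with its texcoords varying supplied by a trivial vertex shader.
#include <GL/glew.h>

GLuint fbo, prog;       // framebuffer object and GLSL program (created during setup)
GLuint inputTex;        // packed input values for the whole batch
GLuint weightTex[8];    // one weight texture per layer
GLuint scratchTex[2];   // two output textures to ping-pong between
int nLayers = 3, texW = 1, texH = 32; // batch width in pixels (problems/4) and node count

// Run one layer: read srcTex and the weights, write node outputs into dstTex.
void runLayer(GLuint srcTex, GLuint weights, GLuint dstTex)
{
  glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
  glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                            GL_TEXTURE_2D, dstTex, 0);
  glViewport(0, 0, texW, texH);

  glUseProgram(prog);
  glActiveTexture(GL_TEXTURE0); glBindTexture(GL_TEXTURE_2D, srcTex);
  glUniform1i(glGetUniformLocation(prog, "tex1"), 0);
  glActiveTexture(GL_TEXTURE2); glBindTexture(GL_TEXTURE_2D, weights);
  glUniform1i(glGetUniformLocation(prog, "tex3"), 2);

  // One full-screen quad runs the pixel program once per output pixel,
  // i.e. once per (output node, group-of-4-problems) pair.
  glBegin(GL_QUADS);
  glTexCoord2f(0,0); glVertex2f(-1,-1);
  glTexCoord2f(1,0); glVertex2f(+1,-1);
  glTexCoord2f(1,1); glVertex2f(+1,+1);
  glTexCoord2f(0,1); glVertex2f(-1,+1);
  glEnd();
}

// Push the whole batch through every layer of the network.
void runNetwork(void)
{
  GLuint cur = inputTex;
  for (int layer = 0; layer < nLayers; layer++) {
    GLuint dst = scratchTex[layer % 2];
    runLayer(cur, weightTex[layer], dst);
    cur = dst; // this layer's output texture is the next layer's input
  }
  // the final layer's values are now in "cur" (read back, or keep on the GPU)
}
The key point is the ping-pong: each layer renders into one of two scratch textures, which then becomes the input texture for the next layer, so the data never has to leave the graphics card between layers.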
Overall
Here's some example code to implement these ideas: 405_neuralnet, a neural network checkerboard evaluation on the graphics card. Download the GPU Neural Network code as a .zip w/exe (640K) or .tar.gz (581K). I've written some simpler examples using the same libraries; you can download the simpler examples as a .zip w/exe (636K) or .tar.gz (571K).
But the speed, sadly, is pathetic for small sets of problems:
4 problem-sized batch: 120000.000 ns per board; 8333 evaluations/second
Luckily, the fix is easy: use bigger batches (more problems at once, bigger textures, less overhead).
8 problem-sized batch: 63750.000 ns per board; 15686 evaluations/second
16 problem-sized batch: 31250.000 ns per board; 32000 evaluations/second
32 problem-sized batch: 16250.000 ns per board; 61538 evaluations/second
64 problem-sized batch: 8593.750 ns per board; 116364 evaluations/second
128 problem-sized batch: 4687.500 ns per board; 213333 evaluations/second
256 problem-sized batch: 2812.500 ns per board; 355556 evaluations/second
512 problem-sized batch: 1914.063 ns per board; 522449 evaluations/second
1024 problem-sized batch: 1425.781 ns per board; 701370 evaluations/second
2048 problem-sized batch: 1181.641 ns per board; 846281 evaluations/second
4096 problem-sized batch: 1044.922 ns per board; 957009 evaluations/second
8192 problem-sized batch: 1068.115 ns per board; 936229 evaluations/second
That's close to five times the CPU's maximum speed! Notice how the per-board time levels off near 1000 ns once the batch is big enough to amortize the fixed setup overhead.
In general, graphics card programming can be very fast if:
- You've got a big dataset, big enough to amortize away the
microseconds of overhead from talking to the hardware. "Big"
means a few hundred thousand floating-point operations, at least.
- You're doing a lot of floating point, especially "weird" floating
point like sines, cosines, exponentials, divides, square roots,
etc. The GPU just chews up these high-latency operations; the CPU bogs down horribly.
- You can write your code independently for every output value (every pixel program).
By contrast, small problems tend to just waste all their time
contacting the graphics card: in general, any CPU-GPU synchronizing
operation costs at least 10us == 10,000ns! (At 0.4 ns per sigmoid,
one such round trip costs as much as roughly 25,000 sigmoid evaluations.)
The graphics hardware also doesn't natively support integer or double-precision
operations, and is currently restricted to running pixel programs that
output colors to 2D textures.
Despite these limitations, graphics card programming can *immensely*
speed up a lot of problems, especially scientific computing/simulation
type problems.