Neural Network processing on the GPU

CS 641 Lecture, Dr. Lawlor

The Basic Sigmoid

So... you're now able to write C++-like "GLSL" code that runs on the graphics card.  For example, this Neural-Network-friendly "sigmoid" function runs in 0.4ns on the graphics card:
float sigmoid(float x) {
    return 2/(1 + exp2(-x)) - 1; /* smoothly maps any x into (-1,1) */
}
void main(void) {
    /* sweep x from -5 to +5 across the screen; bias the result into [0,1] for display */
    gl_FragColor=vec4(sigmoid(texcoords.x*10.0-5.0)*0.5+0.5);
}

(Try this in NetRun now!)
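The sigmoid maps any input into (-1, 1); the *0.5+0.5 then fits that into the [0, 1] range a color channel can store. Here's a minimal CPU-side sketch (plain C++, nothing NetRun-specific) that just prints a few values to confirm those ranges:

#include <cstdio>
#include <cmath>

/* Same sigmoid as the shader: maps any real x into (-1, 1). */
float sigmoid(float x) {
    return 2.0f/(1.0f + exp2f(-x)) - 1.0f; /* exp2f(x) == 2^x */
}

int main(void) {
    for (float x=-5.0f; x<=5.0f; x+=2.5f) {
        float s = sigmoid(x);
        /* s*0.5+0.5 is what the shader stores in the color channel */
        printf("x=%5.2f  sigmoid=%6.3f  as color=%5.3f\n", x, s, s*0.5f+0.5f);
    }
    return 0;
}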

On the CPU, the same function runs in over 100ns:
float sigmoid(float x) {
    return 2/(1 + exp2(-x)) - 1;
}
int foo(void) { /* NetRun times this function */
    return sigmoid(2.6);
}

(Try this in NetRun now!)

So the graphics card is roughly 200x faster at computing sigmoids. Of course, this is sort of an unfair comparison: the GPU's figure is throughput amortized over a huge number of pixels computed in parallel, the CPU's figure includes the overhead of a single isolated function call, and a real application still has to move data onto and off of the card.

So a 200x overall speedup is a bit much to expect. Yet the 200x figure is not a mistake: graphics cards really are crazy-fast at floating-point arithmetic.
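To put the GPU's amortized figure in perspective, here's a rough back-of-the-envelope sketch, assuming (purely for illustration) that the 0.4 ns comes from dividing the render time of a 1024x1024 frame by its pixel count:

#include <cstdio>

int main(void) {
    /* Rough arithmetic only; the frame size is an assumption. */
    double ns_per_sigmoid_gpu = 0.4;   /* amortized throughput, from above */
    double pixels = 1024.0*1024.0;     /* assumed frame size */
    double frame_ns = ns_per_sigmoid_gpu*pixels;
    printf("one full frame of sigmoids: %.2f ms\n", frame_ns*1e-6);
    /* The CPU's ~100 ns, by contrast, times one isolated call,
       including its function-call overhead. */
    return 0;
}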

One Neural Network Layer

OK, if we've got sigmoids, the only other thing we need is weighted inputs.  So we need to store weights, input floats, and output floats somehow.  The standard graphics-card way to store these is in "textures", which are just 2D rectangular arrays of color pixels (color == 4 floats).  So we've got to pick what the X axis, Y axis, and colors mean.

Here's a decent choice: let the texture's Y axis index the node within a layer, pack four separate problems into the R, G, B, and A channels of each pixel, and use the X axis to tell apart groups of problems (in the node-value textures) or output nodes (in the weight texture, where the four channels instead hold four consecutive input weights).
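Concretely, the CPU-side indexing might look like this sketch (the helper nodeCoord is made up for illustration); the (index+0.5)/size arithmetic samples at the center of each texel, which is the same convention the shader below uses.

#include <cstdio>

struct TexCoord { float x, y; };

/* Hypothetical helper: where to sample node 'node' of problem-group
   'prob4' in a layerWide x layerTall texture of node values.
   Each problem-group packs 4 problems into the R,G,B,A channels. */
TexCoord nodeCoord(int prob4, int node, int layerWide, int layerTall) {
    TexCoord t;
    t.x = (prob4 + 0.5f)/layerWide;   /* X: which group of 4 problems */
    t.y = (node  + 0.5f)/layerTall;   /* Y: which node in the layer */
    return t;
}

int main(void) {
    /* Example: node 7 of problem-group 0, in a 1 x 256 texture */
    TexCoord t = nodeCoord(0, 7, 1, 256);
    printf("sample the node-value texture at (%.4f, %.4f)\n", t.x, t.y);
    return 0;
}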
So our inner loop looks like this.  You have to imagine we've somehow loaded up "tex1" with our input node values, and "tex3" with our weights.  The "texture2D" function looks up a texture's pixel value given a location in the texture:
vec4 sigmoid(vec4 x) { /* componentwise sigmoid: four problems at once */
    return 2/(1 + exp2(-x)) - 1;
}
void main(void) {
    vec4 activation=vec4(0.0); /* running weighted sum, one problem per color channel */
    int n=32;                  /* number of input nodes */
    float pix2tex=1.0/256;     /* one pixel step in a 256-pixel-tall texture */
    for (int i=0;i<n;i+=4) {
        /* weights for this output node: R,G,B,A hold 4 consecutive input weights */
        vec4 w=texture2D(tex3,vec2(texcoords.y,(i+0.5)*(1.0/n)));
        /* input node values: R,G,B,A hold 4 separate problems */
        activation+=w.r*texture2D(tex1,vec2(0,(i+0.5)*pix2tex));
        activation+=w.g*texture2D(tex1,vec2(0,(i+1+0.5)*pix2tex));
        activation+=w.b*texture2D(tex1,vec2(0,(i+2+0.5)*pix2tex));
        activation+=w.a*texture2D(tex1,vec2(0,(i+3+0.5)*pix2tex));
    }
    gl_FragColor=sigmoid(activation)*0.5+0.5; /* bias into [0,1] for color storage */
}

(Try this in NetRun now!)

This runs in 5.4 ns per output pixel. Each output pixel holds the neural network node values for four separate problems.

Surprisingly, when this code runs for every pixel, it computes an entire output layer for a whole set of input problems at once. Run it once per layer, and we can push data through every layer of a neural network and read out the results for the whole batch of problems.
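On the card, "running this code several times" means rendering one layer's outputs into a texture, then binding that texture as the input for the next layer's pass. The control flow is the same as this CPU sketch (sizes and weights here are made up for illustration):

#include <cmath>
#include <vector>

float sigmoid(float x) { return 2.0f/(1.0f + exp2f(-x)) - 1.0f; }

/* One layer (one GPU pass): out[o] = sigmoid( sum_i w[o][i]*in[i] ) */
std::vector<float> runLayer(const std::vector<std::vector<float> > &w,
                            const std::vector<float> &in)
{
    std::vector<float> out(w.size());
    for (size_t o=0; o<w.size(); o++) {
        float act = 0.0f;
        for (size_t i=0; i<in.size(); i++) act += w[o][i]*in[i];
        out[o] = sigmoid(act);
    }
    return out;
}

int main(void) {
    /* Two layers of made-up weights: 32 inputs -> 16 hidden -> 1 output */
    std::vector<std::vector<std::vector<float> > > layers;
    layers.push_back(std::vector<std::vector<float> >(16, std::vector<float>(32, 0.05f)));
    layers.push_back(std::vector<std::vector<float> >(1,  std::vector<float>(16, 0.1f)));

    std::vector<float> values(32, 0.5f);   /* the input "texture" */
    for (size_t L=0; L<layers.size(); L++) /* one rendering pass per layer */
        values = runLayer(layers[L], values);
    return (int)(values[0]*100.0f);        /* final network output */
}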

Overall

Here's some example code implementing these ideas: 405_neuralnet, a neural network checkerboard evaluation running on the graphics card. Download the GPU Neural Network: .zip w/exe (640K) or .tar.gz (581K). I've also written some simpler examples using the same libraries; download those here: .zip w/exe (636K) or .tar.gz (571K).

But the speed, sadly, is pathetic for small sets of problems:
4 problem-sized batch: 120000.000 ns per board; 8333 evaluations/second

Luckily, the fix is easy: use bigger batches (more problems at once, bigger textures, less overhead).
8 problem-sized batch: 63750.000 ns per board; 15686 evaluations/second
16 problem-sized batch: 31250.000 ns per board; 32000 evaluations/second
32 problem-sized batch: 16250.000 ns per board; 61538 evaluations/second
64 problem-sized batch: 8593.750 ns per board; 116364 evaluations/second
128 problem-sized batch: 4687.500 ns per board; 213333 evaluations/second
256 problem-sized batch: 2812.500 ns per board; 355556 evaluations/second
512 problem-sized batch: 1914.063 ns per board; 522449 evaluations/second
1024 problem-sized batch: 1425.781 ns per board; 701370 evaluations/second
2048 problem-sized batch: 1181.641 ns per board; 846281 evaluations/second
4096 problem-sized batch: 1044.922 ns per board; 957009 evaluations/second
8192 problem-sized batch: 1068.115 ns per board; 936229 evaluations/second

That's close to five times the CPU's maximum speed!
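Why do bigger batches help so much? A simple two-parameter model (a fixed per-batch overhead plus a marginal per-board cost) fits the numbers above fairly well. This is only a back-of-the-envelope fit from two of the measurements; the overhead and marginal cost it prints are estimates, not separately measured values:

#include <cstdio>

int main(void) {
    /* Model: per-board time ~ overhead/batch_size + marginal.
       Solve for the two unknowns from two measurements above. */
    double n1 = 4,    t1 = 120000.000;  /* ns per board, 4-problem batch */
    double n2 = 4096, t2 = 1044.922;    /* ns per board, 4096-problem batch */
    double T1 = n1*t1, T2 = n2*t2;      /* total time per batch */
    double marginal = (T2 - T1)/(n2 - n1);  /* ~ ns of real work per board */
    double overhead = T1 - n1*marginal;     /* ~ fixed ns per batch */
    printf("estimated overhead ~ %.0f ns/batch, marginal ~ %.0f ns/board\n",
           overhead, marginal);
    /* Sanity check against an intermediate measurement
       (256-problem batch, measured 2812.5 ns/board): */
    printf("predicted at batch size 256: %.0f ns/board\n", overhead/256.0 + marginal);
    return 0;
}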

In general, graphics card programming can be very fast if the problem is large enough to amortize the setup cost, keeps its data on the card between passes, and consists mostly of single-precision floating-point arithmetic. By contrast, small problems tend to just waste all their time contacting the graphics card (in general, any CPU-GPU synchronizing operation costs at least 10us == 10,000ns!). The graphics hardware doesn't natively support integer or double-precision operations, and is currently restricted to running pixel programs that output colors to 2D textures.

Despite these limitations, graphics card programming can *immensely* speed up a lot of problems, especially scientific computing and simulation work.