General-Purpose Graphical Processing Units
or, Solving Non-Graphics Problems on Graphics Cards
CS 641 Lecture, Dr. Lawlor
So, graphics cards.  As of about five years ago, they became fully-programmable full-fledged computers with:
- 32-bit IEEE floating point numbers ("floats").
- All the usual operations: +, -, *, /, sqrt, sin, cos, etc.
- Branching, looping, etc.
Graphics cards are now several times faster than the CPU.  How do they achieve this speed?
It's not because graphics card designers are better paid or smarter than CPU designers, or that the industry is so much bigger:
- Graphics card maker nVidia takes in around $4 billion per year, and has about 4,000 employees (source).  ATI is about the same size.
- CPU maker Intel takes in over $40 billion per year, and has about 85,000 employees (source).  AMD has $5 billion/year sales, and 16,000 employees.
The difference is that graphics cards run "pixel programs"--a sequence
of instructions to calculate the color of one pixel.  The programs
for two adjacent pixels cannot
interact with one another, which means that all the pixel programs are
independent of each other.  This implies all the pixels can be
rendered in parallel, with no waiting or synchronization between pixels.
Read that again.  That means graphics cards execute a parallel programming language.  
Parallelism theoretically allows you to get lots of computing done at a
very low cost.  For example, say you've got a 1000x1000 pixel
image.  That's a million pixels.  If you can build a circuit
that does one floating-point operation on one pixel in 1ns (one
billionth of a second, a typical flop speed nowadays), and you can fit
a million of those circuits on one chip (this is the part that can't be
done at the moment), you've just built a 1,000 teraflop computer.
That's three times faster than the fastest computer in the world, the
$100 million, 128,000-way parallel Blue Gene.
We're not there yet, because we can't fit that much floating-point
circuitry on one chip, but this is the advantage of parallel execution.
As of 2006, the fastest graphics card on the market
renders at least 32 pixels simultaneously.  Values stored at each
pixel consist of four 32-bit IEEE floating-point numbers.  This
means every clock cycle, the cards are operating on 128 floats at
once.  The "LRP" instruction does about 3 flops per float, and
executes in a single clock.  At a leisurely 1GHz, the $500 32-pipe nVidia
GeForce 8800 thus would do at least:
     3 flops/float * 4 floats/pixel * 32 pixels/clock * 1 Gclocks/second = 384 billion flops/second (384 gigaflops)
Recall that a regular FPU only handles one or two (with superscalar
execution) floats at a time, and the SSE/AltiVec extensions only handle
four floats at a time.  Even with SSE, the Pentium 4 theoretical
peak performance is about 15 gigaflops, but I can't get more than about
3 gigaflops doing any real work. 
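The peak-rate arithmetic above is easy to reproduce. Here's a minimal sketch in C++; the function name `peakGigaflops` and its parameter names are made up for illustration, and the numbers plugged in are just the ones quoted in the text (they're theoretical peaks, not measurements).

```cpp
#include <cassert>
#include <cstdio>

// Theoretical peak rate, in gigaflops:
//   flops/float * floats/pixel * pixels/clock * Gclocks/second
// (Hypothetical helper for illustration; not from any library.)
double peakGigaflops(double flopsPerFloat, double floatsPerPixel,
                     double pixelsPerClock, double clockGHz) {
    assert(clockGHz > 0.0);
    return flopsPerFloat * floatsPerPixel * pixelsPerClock * clockGHz;
}

// Print the GPU estimate next to the Pentium 4 figures quoted in the text.
void printComparison() {
    double gpu = peakGigaflops(3.0, 4.0, 32.0, 1.0); // 32-pipe card at 1GHz
    double cpuPeak = 15.0; // Pentium 4 with SSE, theoretical peak (from the text)
    double cpuReal = 3.0;  // what the text reports for real work
    std::printf("GPU peak: %.0f gigaflops\n", gpu);
    std::printf("vs. CPU peak: %.1fx, vs. CPU real work: %.0fx\n",
                gpu / cpuPeak, gpu / cpuReal);
}
```

The same function covers the earlier thought experiment: `peakGigaflops(1, 1, 1e6, 1.0)` is a million one-flop circuits at 1ns each, which comes out to 1,000,000 gigaflops, or 1,000 teraflops.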
Graphics Card Programming (for graphics)
Back in 2002, if you wanted to write a "pixel program" to run on the
graphics card, you had to write nasty, unportable and very low-level
code that only worked with one manufacturer's cards.  Today, you
can write code using the (C++-like) GL Shader Language (GLSL), and run
the exact same code on your ATI and nVidia cards, on your Windows
machine, Linux box, or Mac OS machine.  
The high-level languages available today are:
- OpenGL Shading Language (GLSL), which is a portable C++-like language.  Its performance isn't always quite as good as assembly, but it's much easier to write complicated shaders.
- Microsoft's Windows-specific DirectX High-Level Shading Language is another C++-like language, best explored with ATI's RenderMonkey application.
- nVidia's Cg (C for graphics) is another portable C++-like language with bindings to OpenGL or DirectX.  It's competing with languages built directly into the graphics drivers, though, so it's not faring too well.
Here's a very simple GLSL program.  This is the main function that
runs for each pixel.  It's got one return value, the color of the
pixel ("gl_FragColor" in GLSL).  The datatype of a color is
"vec4", which consists of four floats: red, green, blue, and alpha
(transparency).  So this code renders every pixel red:
void main(void) {
	gl_FragColor=vec4(1,0,0,1);
}
(Try this in NetRun now!)
Here's a program that renders each pixel in a color that corresponds to its location onscreen (its "texture coordinate").
void main(void) {
	gl_FragColor=vec4(texcoords.x,texcoords.y,0,0);
}
(Try this in NetRun now!)
Here's a more complicated GLSL program that uses each pixel's texture coordinates to set up a Mandelbrot Set iteration.
vec2 c=vec2(3.0,2.0)*(texcoords-0.5)+vec2(0.0,0.0); /* constant c, varies onscreen*/
vec2 z=c;
/* Mandelbrot iteration: keep iterating until z gets big */
for (int i=0;i<15;i++) {
	/* break once |z|^2 >= 4.0, i.e. once the length of z is >= 2.0 */
	if (z.r*z.r+z.g*z.g>=4.0) break;
	/* z = z^2 + c;  (where z and c are complex numbers) */
	z=vec2(
		z.r*z.r-z.g*z.g,
		2.0*z.r*z.g
	)+c;
}
gl_FragColor=fract(vec4(z.r,z.g,0.25*length(z),0));
(Try this in NetRun now!)
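To see what the graphics card is doing for you, here's a rough CPU-side sketch of the same Mandelbrot pixel program in C++. The names `Vec2`, `mandelbrotPixel`, and `renderImage` are made up for this example, and sampling at pixel centers is my assumption about how texture coordinates map to pixels. It returns the final z rather than a color, so the `fract` coloring step is left out. The point is that each call depends only on its own coordinates, so the pixels could be computed in any order, or all at once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };

// CPU port of the per-pixel Mandelbrot program.
// (tx, ty) play the role of texcoords, each running from 0 to 1.
Vec2 mandelbrotPixel(float tx, float ty) {
    Vec2 c = { 3.0f * (tx - 0.5f), 2.0f * (ty - 0.5f) }; // constant c, varies onscreen
    Vec2 z = c;
    for (int i = 0; i < 15; i++) {
        // break once |z|^2 >= 4.0
        if (z.x * z.x + z.y * z.y >= 4.0f) break;
        // z = z^2 + c (complex arithmetic)
        z = Vec2{ z.x * z.x - z.y * z.y + c.x, 2.0f * z.x * z.y + c.y };
    }
    return z;
}

// Run the pixel program once per pixel of a w x h image.
// No iteration reads another pixel's result, so the loop order is irrelevant.
std::vector<Vec2> renderImage(int w, int h) {
    assert(w > 0 && h > 0);
    std::vector<Vec2> image(static_cast<std::size_t>(w) * h);
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            image[static_cast<std::size_t>(y) * w + x] =
                mandelbrotPixel((x + 0.5f) / w, (y + 0.5f) / h); // pixel center (assumed)
    return image;
}
```

On the CPU those two nested loops run one pixel at a time; on the graphics card the loops disappear and the hardware runs many copies of the pixel program simultaneously.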
Graphics Card Programming (for non-graphics)
This is all well and good for drawing pictures, but there are lots of
other problems out there that don't involve pictures in any way. 
Or do they?  All computation is just data manipulation, and we can
write *anything* in a pixel shader--floats are floats, after all!
Deep down, the GPU supports a fairly small number of primitives:
- Pixel programs, which we write in GLSL like above.  You can compile a whole set of different pixel programs, and then switch between them pretty quickly.
- Textures, which are 2D rectangular arrays of pixels.  That sounds graphics-specific, but you can think of this as "2D array of floats".  Again, you can have several textures in one computation.
- Framebuffer objects, which let you run pixel programs that read from arbitrary textures and write to one texture.
So you could interpret the first GLSL program as corresponding to this non-graphics code:
	for (int i=0;i<n;i++) array[i]=myStruct(1.0,0.0,0.0,0.0);
And so on.  The only annoying part is that though a pixel program can read from any location on any texture it likes, it can only write
to its own pixel.  And no, you can't bind the same texture for
both reads and writes.  Hey, that's because read-after-read (RAR) is not a
dependency, but WAR, RAW, and WAW all are!
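Here's a small CPU-side sketch of that read-anywhere, write-only-your-own-pixel model in C++. Everything in it is hypothetical and just for illustration: `blurPixel` stands in for a pixel program, the two `std::vector<float>` buffers stand in for textures, and the three-point wraparound average is an arbitrary choice of computation. Since a pass can't read and write the same buffer, multi-pass computations swap the two buffers between passes.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// A stand-in "pixel program": it may read ANY element of src,
// but it only produces the value for output element i.
float blurPixel(const std::vector<float>& src, int i) {
    int n = static_cast<int>(src.size());
    float left  = src[(i + n - 1) % n]; // wraparound neighbors
    float right = src[(i + 1) % n];
    return (left + src[i] + right) / 3.0f;
}

// One pass: read from src, write to dst.  The two must be different
// buffers, just as one texture can't be bound for both reads and writes.
void runPass(const std::vector<float>& src, std::vector<float>& dst) {
    assert(&src != &dst);
    assert(src.size() == dst.size());
    for (int i = 0; i < static_cast<int>(dst.size()); i++)
        dst[i] = blurPixel(src, i); // each output depends only on src
}

// Several passes: swap the roles of the two buffers after each pass.
std::vector<float> runPasses(std::vector<float> a, int passes) {
    std::vector<float> b(a.size());
    for (int p = 0; p < passes; p++) {
        runPass(a, b);
        std::swap(a, b); // last pass's output becomes the next pass's input
    }
    return a;
}
```

For example, one pass over `{3, 0, 0, 0}` gives `{1, 1, 0, 1}`: the 3 gets spread over itself and its two wraparound neighbors.  Because no output element depends on another output element, the loop inside `runPass` has no RAW, WAR, or WAW hazards and could run fully in parallel.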
I've written a little set of classes that can be used to simplify
graphics card programming (which is normally packed with OpenGL calls,
and kinda messy/ugly).  They're linked off the 481 page:
481_gpgpu--Perform interesting non-graphics computations on the graphics card.
Download: .zip w/exe (636K), .tar.gz (571K)