"GPGPU": General-Purpose computations on Graphics Processing Units
CS 641 Lecture, Dr. Lawlor
Graphics Card Programming
Back in 2002, if you wanted to write a "pixel program" to run on the
graphics card, you had to write nasty, unportable and very low-level
code that only worked with one manufacturer's cards. Today, you
can write code using a variety of C++-lookalike languages, run
the exact same code on both ATI and nVidia cards, and build binaries that work the same on your Windows
machine, Linux box, or Mac OS X machine.
The graphics-specific C++-like languages available are:
- OpenGL Shading Language,
"GLSL", is a portable C++-like language (but without pointers or memory
allocation). It's now part of OpenGL, and the compiler comes as
part of every recent OpenGL driver. This is the best
cross-platform language right now.
- Microsoft's Windows-specific DirectX High-Level Shading Language is another C++-like language, best explored with ATI's Rendermonkey application. Again, it's part of DirectX 9, so you've probably already got the compiler.
- nVidia's Cg
("C for graphics") was among the first portable C++-like languages, and
it's unique in binding to either OpenGL or DirectX. This has
largely been replaced by CUDA or GLSL.
There are also several older (2003 or so) assembly-like, graphics-targeted interfaces, but they are now essentially obsolete.
The recent development (since 2007 or so) is full C++ interfaces, including bitwise operations, pointers, etc., all running on the GPU:
- nVidia's CUDA provides a
most-of-C++-including-pointers interface to the GPU's many
processors. It's not particularly intended for graphics, but CUDA
programs can still expose the thousand-way parallelism supported by GPU
hardware. nVidia GPUs after the GeForce 8000 series support
CUDA. Nothing from ATI or any other vendor supports CUDA, nor are
they likely to.
- The upcoming OpenCL standard promises CUDA-like functionality across all platforms, although as of 2009-04 nobody can actually run OpenCL code yet.
The bottom line is the graphics card is more or less "just another computer" inside your computer.
The biggest difference is that your programs run once for each *pixel* (in parallel), not
just *once* (in serial).
We'll be looking at OpenGL Shading Language (GLSL). It looks a lot like C++. The hardware is SSE-like: on the graphics card, all variables
are stored in four-float registers. In GLSL, they're called
"vec4"s, although you can also declare variables of type "float",
"vec2", and "vec3". You can think of any vector as storing "RGBA"
colors, "XYZW" 3D positions, or you can just think of them as four
floats. The output of your GLSL program is "gl_FragColor", the
color of the pixel your program is rendering. Your program runs
once for each pixel onscreen--hundreds of thousands of times!
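For instance, here's a small sketch (not one of the original NetRun examples) showing how these vector types are declared and combined; the variable names are made up for illustration:
vec2 v2=vec2(0.1,0.2); /* two floats */
vec3 v3=vec3(v2,0.3); /* build a vec3 from a vec2 and a float */
vec4 v4=vec4(v3,1.0); /* four floats: an RGBA color or XYZW position */
float f=v4.z; /* pick out one component by name */
gl_FragColor=v4.bgra; /* "swizzle": reorder components by name */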
So the simplest OpenGL fragment program returns a constant color, like this program, which always returns red pixels:
gl_FragColor=vec4(1,0,0,0);
(Try this in NetRun now!)
Note there's no loop here, but this program by definition
runs on every pixel. In general, you control the pixels you want
drawn using some polygon geometry, and the program runs on every pixel
touched by that geometry.
Here's what our incoming texture coordinates (2D location onscreen) look like:
gl_FragColor=vec4(texcoords.x,texcoords.y,0,0);
(Try this in NetRun now!)
0 means black, and 1 means fully saturated color (all
colors saturated means white). The X coordinate of the onscreen
location becomes the Red component of the output color--note how the
image gets redder from left to right. The Y coordinate of the
onscreen location becomes the Green component of the output color--note
how the image gets greener from bottom to top. Red and green add
up to yellow, so the top-right corner (where X=Y=1) is yellow.
We can make the left half of the screen red, and the right half blue, like this:
if (texcoords.x<0.5) /* left half of screen? */
    gl_FragColor=vec4(1,0,0,0); /* red */
else
    gl_FragColor=vec4(0,0,1,0); /* blue */
(Try this in NetRun now!)
We can make a little circle in the middle of the screen red like this:
float x=(texcoords.x-0.5)*1.4, y=texcoords.y-0.5;
float radius=sqrt(x*x+y*y);
if (radius<0.3) /* inside the circle? */
    gl_FragColor=vec4(1,0,0,0); /* red */
else
    gl_FragColor=vec4(0,0,1,0); /* blue */
(Try this in NetRun now!)
We can make a whole grid of circles across the screen like this:
vec2 v=fract(texcoords*10.0);
float x=(v.x-0.5)*1.4, y=v.y-0.5;
float radius=sqrt(x*x+y*y);
if (radius<0.3) /* inside the circle? */
    gl_FragColor=vec4(1,0,0,0); /* red */
else
    gl_FragColor=vec4(0,0,1,0); /* blue */
(Try this in NetRun now!)
The calling convention for a GLSL program in NetRun is slightly
simplified from the general case used for real graphics programs:
- Your pixel's location onscreen is stored in the vec2 variable
"texcoords". The x coordinate of this vector gives your on-screen x
coordinate, which varies from 0 on the left side of the screen to 1 on
the right side. The y coordinate gives onscreen y, from 0 at
bottom to 1 at top. In a more general program, input data can be obtained from texture
coordinates passed in from the calling program and/or set up inside a
"vertex program" that runs on each polygon vertex. You can view these texture coordinates as follows:
- There are several predefined 2D textures called "tex1" (me
making a silly face), "tex3", "tex4", and "tex5" (used for various
homeworks).
Here's how you look up the color in a "texture" (a stored 2D image,
used like memory/arrays on graphics cards) at a particular texture
coordinate:
gl_FragColor=texture2D(tex1,vec2(texcoords.x,texcoords.y));
(Try this in NetRun now!)
This looks up in the texture "tex1" at the coordinates given by
texcoords. You can store the looked-up vec4 in a new variable "v"
like so:
vec4 v=texture2D(tex1,vec2(texcoords.x,texcoords.y));
gl_FragColor=1.0-v;
(Try this in NetRun now!)
Note how we've flipped black to white by computing "1.0-v"!
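As another per-pixel image operation (a sketch in the same style, not one of the original NetRun examples), you can convert the texture to grayscale by blending its red, green, and blue channels with the builtin "dot" function, here using the standard Rec. 601 luminance weights:
vec4 v=texture2D(tex1,texcoords); /* look up this pixel's color */
float gray=dot(v.rgb,vec3(0.299,0.587,0.114)); /* weighted sum of R, G, and B */
gl_FragColor=vec4(gray,gray,gray,1.0); /* write the same value to all three color channels */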
The builtin functions include "vec2" through "vec4" (build a vector),
"length" (return the float length of any vector), "fract" (return the
fractional part of any vector), and many other functions.
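For example, here's a short sketch (again, not from the original examples) that uses "length" and "fract" together to draw concentric gray rings around the center of the screen:
float r=length(texcoords-vec2(0.5,0.5)); /* distance from the screen center */
float ring=fract(r*10.0); /* ramps from 0 to 1 ten times per unit of distance */
gl_FragColor=vec4(ring,ring,ring,1.0); /* draw the ramp as a grayscale ring pattern */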
Remember that you can compute *anything* on the GPU!
For example, here's the Mandelbrot set fractal rendered on the graphics card:
vec2 c=vec2(3.0,2.0)*(texcoords-0.5)+vec2(0.0,0.0); /* constant c, varies onscreen*/
vec2 z=c;
/* Mandelbrot iteration: keep iterating until z gets big */
for (int i=0;i<15;i++) {
    /* break if the squared length of z is >= 4.0 (i.e., |z| >= 2.0) */
    if (z.r*z.r+z.g*z.g>=4.0) break;
    /* z = z^2 + c; (where z and c are complex numbers) */
    z=vec2(
        z.r*z.r-z.g*z.g,
        2.0*z.r*z.g
    )+c;
}
gl_FragColor=fract(vec4(z.r,z.g,0.25*length(z),0));
(Try this in NetRun now!)
Graphics Cards' Crazy Performance
Graphics cards are now several times faster than the CPU. How do they achieve this speed?
It's not because graphics card designers are better paid or smarter
than CPU designers, or that the graphics card industry is any bigger:
- Graphics card maker nVidia takes in around $2 billion per year, and has about 2,000 employees (source). ATI is about the same size.
- CPU maker Intel takes in over $30 billion per year, and has about 85,000 employees (source). AMD has $5 billion/year sales, and 16,000 employees.
The difference is that graphics cards run "pixel programs"--a sequence
of instructions to calculate the color of one pixel. The programs
for two adjacent pixels cannot
interact with one another, which means that all the pixel programs are
independent of each other. This implies all the pixels can be
rendered in parallel, with no waiting or synchronization between pixels.
Read that again. That means graphics cards execute a parallel programming language.
Parallelism theoretically allows you to get lots of computing done at a
very low cost. For example, say you've got a 1000x1000 pixel
image. That's a million pixels. If you can build a circuit
to do one floating-point operation to those pixels in 1ns (one
billionth of a second, a typical flop speed nowadays), and you could fit
a million of those circuits on one chip (this is the part that can't be
done at the moment), then you've just built a 1,000 teraflop computer:
a million circuits times a billion flops per second each is 10^15 flops per second.
That's three times faster than the fastest computer in the world, the
$100 million, 128,000-way parallel Blue Gene.
We're not there yet, because we can't fit that much floating-point
circuitry on one chip, but this is the advantage of parallel execution.
As of 2006, the fastest graphics card on the market
renders at least 32 pixels simultaneously. Values stored at each
pixel consist of four 32-bit IEEE floating-point numbers. This
means every clock cycle, the cards are operating on 128 floats at
once. The "LRP" instruction does about 3 flops per float, and
executes in a single clock. At a leisurely 1GHz, the $500 32-pipe nVidia
GeForce 8800 thus would do at least:
3 flops/float * 4 floats/pixel * 32 pixels/clock * 1 Gclock/second = 384 billion flops/second (384 gigaflops)
Recall that a regular FPU only handles one or two (with superscalar
execution) floats at a time, and the SSE/AltiVec extensions only handle
four floats at a time. Even with SSE, the Pentium 4 theoretical
peak performance is about 15 gigaflops, but I can't get more than about
3 gigaflops doing any real work. By contrast, the now-obsolete
Mobility Radeon 9600 graphics card in my laptop handles 4 pixels (16
floats) simultaneously, and pulls about 16 gigaflops, handily beating
the Pentium 4.
Graphics Card Programming (for non-graphics)
This is all well and good for drawing pictures, but there are lots of
other problems out there that don't involve pictures in any way.
Or do they? All computation is just data manipulation, and we can
write *anything* in a pixel shader--floats are floats, after all!
Deep down, the GPU supports a fairly small number of primitives:
- Pixel programs, which we write in GLSL like above. You can
compile a whole set of different pixel programs, and then switch
between them pretty quickly.
- Textures, which are 2D rectangular arrays of pixels. That
sounds graphics-specific, but you can think of this as "2D array of
floats". Again, you can have several textures in one computation.
- Framebuffer objects, which let you run pixel programs that read from arbitrary textures and write to one texture.
So you could interpret the first GLSL program as corresponding to this non-graphics code:
for (int i=0;i<n;i++) array[i]=myStruct(1.0,0.0,0.0,0.0);
And so on. The only annoying part is that though a pixel program can read from any location on any texture it likes, it can only write
to its own pixel. And no, you can't bind the same texture for
both reads and writes. Hey, that's because RAR is not a
dependency, but WAR/RAW/WAW are!
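For example, here's a sketch (not one of the original examples) of "array addition" written as a pixel program, treating two of NetRun's predefined textures, tex3 and tex4, as the input arrays; in a real GPGPU program the output pixel would land in a framebuffer texture, like the demo code below:
/* Each pixel reads one element from each input "array" (texture)... */
vec4 a=texture2D(tex3,texcoords);
vec4 b=texture2D(tex4,texcoords);
/* ...and writes their sum to its own output pixel, and nowhere else. */
gl_FragColor=a+b;
Run over an n-pixel texture, this is the parallel version of the serial loop above, just computing a sum instead of storing a constant.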
I've written a little set of classes that can be used to simplify
graphics card programming (which is normally packed with OpenGL calls,
and kinda messy/ugly). They're linked off my old 481 page: 481_gpgpu--Perform interesting non-graphics computations on the graphics card.
Download: .zip w/exe (636K) or .tar.gz (571K).
Here's the little demo code; see the "demos" directory in the above archive for more examples.
/**
General-Purpose Graphics Processing Unit (GPGPU)
Demo Application
Orion Sky Lawlor, olawlor@acm.org, 2008-04-21 (Public Domain)
*/
#include "ogl/gpgpu.h" /* Orion's GPGPU library */
#include "ogl/glsl.cpp" /* for easy linking, just #include the library code */
int main(int argc,char *argv[]) {
gpgpu_init();
/* Upload a small lookup-table texture */
static gpgpu_texture2D *table=new gpgpu_texture2D(1000,2000);
/* Run a pixel program into that texture */
static GLhandleARB program=makeProgramObject(GPGPU_V,
"varying vec4 location; /* onscreen coordinates */\n"
"void main(void) { \n"
" gl_FragColor=vec4(sin(location.x*100),0,0,0); \n"
"}"
);
static gpgpu_framebuffer *fb=new gpgpu_framebuffer(table);
fb->run_time(program);
/* Grab the floating-point values out of the rendered framebuffer */
float data[4];
fb->read(data,1,1); /* read 1x1 pixel region */
std::cout<<"Read data: ";
for (int i=0;i<4;i++) std::cout<<" "<<data[i];
std::cout<<"\n";
/* Show the resulting texture values */
gpgpu_show_and_pause(table,"GPGPU: Lookup table");
return 0;
}