Back in 2002, if you wanted to write a "pixel program" to run on the graphics card, you had to write nasty, unportable and very low-level code that only worked with one manufacturer's cards. Today, you can write code using a variety of C++-lookalike languages, and then run the exact same code on your ATI and nVidia cards; and build binaries that work the same on your Windows machine, Linux box, or Mac OS X machine.

The languages available are:

- C style (pointers, bytes, arrays)

  - CUDA and OpenCL are cutting-edge languages that support even ints and pointers on the graphics card, although they both require quite recent hardware and a special software development kit.

- Graphics style (pixels, textures)

  - OpenGL Shading Language (GLSL) is a portable C++-like language. It's now part of OpenGL, and the compiler comes as part of every recent OpenGL driver. It's currently my favorite language.

  - Microsoft's Windows-specific DirectX High-Level Shading Language (HLSL) is another C++-like language, best explored with ATI's RenderMonkey application. Again, it's part of DirectX 9, so you've probably already got the compiler.

  - nVidia's Cg ("C for graphics") was among the first portable C++-like languages, and it's unique in binding to either OpenGL or DirectX, although it does require a special runtime library.

- Assembly language style (instructions, registers)

  - NVIDIA PTX, which is the backend for CUDA.

  - OpenGL ARB_fragment_program, which is a portable assembly language.

  - Microsoft's DirectX Shader Model 3.0 is an assembly-like backend for DirectX 9 cards.

We'll be looking at the OpenGL Shading Language (GLSL). It looks a lot like C++. The hardware is SSE-like: on the graphics card, variables are all floats, either a single "float" or a four-float register called a "vec4". You can also declare variables of type "vec2" and "vec3", and even "int", although int is often emulated with float. You can think of a vec4 as storing an "RGBA" color or an "XYZW" 3D position, or you can just think of it as four adjacent floats.

The output of your GLSL program is "gl_FragColor", the color of the pixel your program is rendering. Your program runs once for each pixel onscreen--hundreds of thousands of times!

So the simplest OpenGL fragment program returns a constant color, like this program, which always returns red pixels:

gl_FragColor=vec4(1,0,0,0);

Note there's no loop here, but this program by definition runs on every pixel. In a bigger graphics program, you can control the pixels you want drawn using some polygon geometry, and the program runs on every pixel touched by that geometry.

Here's what our incoming texture coordinates (2D location onscreen) look like:

gl_FragColor=vec4(texcoords.x,texcoords.y,0,0);

0 means black, and 1 means fully saturated color (all channels saturated means white). The X coordinate of the onscreen location becomes the Red component of the output color--note how the image gets redder from left to right. The Y coordinate of the onscreen location becomes the Green component of the output color--note how the image gets greener from bottom to top. Red and green add up to yellow, so the top-right corner (where X=Y=1) is yellow.

We can make the left half of the screen red, and the right half blue, like this:

if (texcoords.x<0.5) /* left half of screen? */
	gl_FragColor=vec4(1,0,0,0); /* red */
else
	gl_FragColor=vec4(0,0,1,0); /* blue */

We can make a little circle in the middle of the screen red like this:

float x=texcoords.x-0.5, y=texcoords.y-0.5;
float radius=sqrt(x*x+y*y);
if (radius<0.3) /* inside the circle? */
	gl_FragColor=vec4(1,0,0,0); /* red */
else
	gl_FragColor=vec4(0,0,1,0); /* blue */

We can make a whole grid of circles across the screen like this:

vec2 v=fract(texcoords*10);
float x=(v.x-0.5)*1.4, y=v.y-0.5;
float radius=sqrt(x*x+y*y);
if (radius<0.3) /* inside the circle? */
	gl_FragColor=vec4(1,0,0,0); /* red */
else
	gl_FragColor=vec4(0,0,1,0); /* blue */

We can make smooth transitions between the colors with a blending operation; this is the "fuzzy logic" equivalent of our mask-based if-then-else implementation:

float x=texcoords.x-0.5, y=texcoords.y-0.5;
float radius=sqrt(x*x+y*y);
float greeny=sin(30.0*radius*3.1415926); // fract(10.0*radius);
greeny=greeny*greeny; // sin-squared
vec4 green=vec4(0,texcoords.y,0,0);
vec4 notgreen=vec4(1.0,0.0,0.6,0);
gl_FragColor=greeny*green+(1.0-greeny)*notgreen;

Graphics cards also support loops, like this loop over different sine-wave sources:

float finalgreen=0.0;
for (int source=0;source<2;source++)
{
	float x=texcoords.x-0.4-0.1*source, y=texcoords.y-0.5;
	float radius=sqrt(x*x+y*y);
	float greeny=sin(30.0*radius*3.1415926); // fract(10.0*radius);
	greeny=greeny*greeny; // sin-squared
	finalgreen+=greeny*0.33;
}
vec4 green=vec4(0,texcoords.y,0,0);
vec4 notgreen=vec4(1.0,0.0,0.6,0);
gl_FragColor=finalgreen*green+(1.0-finalgreen)*notgreen;

Note that the performance of this on the graphics card (a GeForce GTX 280) is 0.1 nanoseconds per pixel. (Essentially) the same code run on the CPU takes 200 nanoseconds per pixel, which is, er, two *thousand* times slower. Ouch!

Here's that (essentially) same code as run on the CPU, with static variables standing in for the changing texture coordinates:

static float texcoords_x=0.1, texcoords_y=0.2; texcoords_x+=0.1;
float finalgreen=0.0;
for (int source=0;source<2;source++)
{
	float x=texcoords_x-0.4-0.1*source, y=texcoords_y-0.5;
	float radius=sqrt(x*x+y*y);
	float greeny=sin(30.0*radius*3.1415926); // fract(10.0*radius);
	greeny=greeny*greeny; // sin-squared
	finalgreen+=greeny*0.33;
}
float green=texcoords_y; // vec4(0,texcoords.y,0,0);
float notgreen=100.0; // vec4(1.0,0.0,0.6,0);
return finalgreen*green+(1.0-finalgreen)*notgreen;

The calling convention for a GLSL program in NetRun is slightly simplified from the general case used for real graphics programs:

- Your pixel's location onscreen is stored in the vec2 variable "texcoords". The x coordinate of this vector gives your onscreen x coordinate, which varies from 0 on the left side of the screen to 1 on the right side. The y coordinate gives onscreen y, from 0 at bottom to 1 at top. In a more general program, input data can be obtained from texture coordinates passed in from the calling program and/or set up inside a "vertex program" that runs on each polygon vertex. (We viewed these texture coordinates directly in the red-green gradient example above.)

- There are several predefined 2D textures called "tex1" (me making a silly face), "tex3", "tex4", and "tex5" (used for various homeworks).

gl_FragColor=texture2D(tex1,vec2(texcoords.x,texcoords.y));

This looks up in the texture "tex1" at the coordinates given by texcoords. You can store the looked-up vec4 in a new variable "v" like so:

vec4 v=texture2D(tex1,vec2(texcoords.x,texcoords.y));
gl_FragColor=1.0-v;

Note how we've flipped black to white by computing "1.0-v"!

The built-in functions include "vec2" through "vec4" (build a vector), "length" (return the float length of any vector), "fract" (return the fractional part of any vector), and many others.

Remember that you can compute *anything* on the GPU!

For example, here's the Mandelbrot set fractal rendered on the graphics card:

vec2 c=vec2(2.0)*(texcoords-0.5)+vec2(0.0,0.0); /* constant c, varies onscreen */
vec2 z=c;
/* Mandelbrot iteration: keep iterating until z gets big */
for (int i=0;i<15;i++) {
	/* break if squared length of z is >= 4.0 */
	if (z.r*z.r+z.g*z.g>=4.0) break;
	/* z = z^2 + c; (where z and c are complex numbers) */
	z=vec2(
		z.r*z.r-z.g*z.g,
		2.0*z.r*z.g
	)+c;
}
gl_FragColor=fract(vec4(z.r,z.g,0.25*length(z),0));

How can a graphics card be so much faster than a CPU? It's not because graphics card designers are better paid or smarter than CPU designers, or that the graphics card industry is any bigger:

- Graphics card maker nVidia takes in around $2 billion per year, and has about 2,000 employees. ATI is about the same size.

- CPU maker Intel takes in over $30 billion per year, and has about 85,000 employees. AMD has $5 billion/year in sales, and 16,000 employees.

The difference is that your fragment program runs independently on every pixel at once: graphics cards execute a parallel programming language.

Parallelism theoretically allows you to get lots of computing done at a very low cost. For example, say you've got a 1000x1000 pixel image. That's a million pixels. If you can build a circuit to do one floating-point operation per pixel in 1ns (one billionth of a second, a typical flop speed nowadays), and you could fit a million of those circuits on one chip (this is the part that can't be done at the moment), then you've just built a 1,000-teraflop computer. That's about as fast as 2006's fastest computer in the world, the $100 million, 128,000-way parallel Blue Gene. We're not there yet, because we can't fit that much floating-point circuitry on one chip, but this is the advantage of parallel execution.

As of 2009, the fastest graphics cards on the market render up to 1600 pixels simultaneously. This means every clock cycle, the cards are operating on over 1600 floats at once. The "MAD" instruction does 2 flops per float, and executes in a single clock. At a leisurely 0.85GHz, the $400 Radeon 5870 thus would do at least:

2 flops/float*1600 floats/clock*0.85 Gclocks/second=2720 billion flops/second (2.7 teraflops)

Recall that a regular FPU only handles one or two (with superscalar execution) floats at a time, and the SSE/AltiVec extensions only handle four floats at a time. Even with SSE, an 8-way Core i7's theoretical peak performance is only 128 gigaflops, less than one twentieth of the GPU's, and in practice it's very hard to even reach that.