Graphics Processing Unit (GPU) Computing
CS 441 Lecture, Dr. Lawlor
Back in 2002, if you wanted to write a "pixel program" to run on the
graphics card, you had to write nasty, unportable, and very low-level
code that only worked with one manufacturer's cards. Today, you
can write code in a variety of C++-lookalike languages, run
the exact same code on your ATI and nVidia cards, and build binaries that work the same on your Windows
machine, Linux box, or Mac OS X machine.
The languages available are:
- OpenGL Shading Language is a portable C++-like language. It's now part of OpenGL, and the compiler comes as part of every recent OpenGL driver.
- Microsoft's Windows-specific DirectX High-Level Shading Language is another C++-like language, best explored with ATI's Rendermonkey application. Again, it's part of DirectX 9, so you've probably already got the compiler.
- nVidia's Cg
("C for graphics") was among the first portable C++-like languages, and
it's unique in binding to either OpenGL or DirectX, although it does
require a special runtime driver.
- CUDA or OpenCL
are cutting edge languages that support even ints and pointers on the
graphics card, although they both require quite recent hardware and a
special software development kit.
- OpenGL ARB_fragment_program, which is a portable assembly code.
- Microsoft's DirectX Shader Model 3.0 is an assembly-like backend for DirectX 9 cards.
The bottom line is the graphics card is more or less "just another computer" inside your computer.
The biggest difference is that your programs run once for each *pixel* (in parallel), not
just *once* (in serial).
We'll be looking at OpenGL Shading Language (GLSL). It looks a lot like C++. The hardware works much like SSE: on the graphics card, variables
are all floats, either a single "float" or a four-float register called
a "vec4". You can also declare variables of type "vec2" and
"vec3", and even "int", although int is often emulated with
float. You can think of a vec4 as storing "RGBA"
colors or "XYZW" 3D positions, or you can just think of it as four
adjacent floats.
The output of your GLSL program is "gl_FragColor", the
color of the pixel your program is rendering. Your program runs
once for each pixel onscreen--hundreds of thousands of times!
So the simplest OpenGL fragment program returns a constant color, like this program, which always returns red pixels:
gl_FragColor=vec4(1,0,0,0);
(Try this in NetRun now!)
Note there's no loop here, but this program by definition
runs on every pixel. In general, you control the pixels you want
drawn using some polygon geometry, and the program runs on every pixel
touched by that geometry.
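To make the "no loop in the shader" idea concrete, here is a minimal CPU sketch in Python of what the driver and hardware do on your behalf. The names render, shade_red, and shade_gradient are made up for illustration (they are not part of NetRun or OpenGL), and a real GPU runs the per-pixel calls in parallel rather than in a serial loop like this one.

```python
# CPU sketch of how a fragment program gets invoked: the loop over pixels
# lives outside the shader. All names here are hypothetical, not a real API.

def shade_red(texcoords):
    """Stand-in for the shader body: always returns red as (r, g, b, a)."""
    return (1.0, 0.0, 0.0, 0.0)

def shade_gradient(texcoords):
    """Stand-in for a shader that shows the incoming coordinates as colors."""
    return (texcoords[0], texcoords[1], 0.0, 0.0)

def render(shade, width, height):
    """Call shade() once per pixel; row 0 is the bottom of the screen."""
    image = []
    for py in range(height):
        row = []
        for px in range(width):
            # Sample at pixel centers so coordinates stay inside 0..1.
            texcoords = ((px + 0.5) / width, (py + 0.5) / height)
            row.append(shade(texcoords))
        image.append(row)
    return image

image = render(shade_red, 4, 3)
gradient = render(shade_gradient, 4, 3)
print(len(image) * len(image[0]), "shader invocations")
```

The shader functions never see the loop; they only see one pixel's coordinates, which is what lets the hardware run them in any order.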
Here's what our incoming texture coordinates (2D location onscreen) look like:
gl_FragColor=vec4(texcoords.x,texcoords.y,0,0);
(Try this in NetRun now!)
0 means black, and 1 means fully saturated color (all
colors saturated means white). The X coordinate of the onscreen
location becomes the Red component of the output color--note how the
image gets redder from left to right. The Y coordinate of the
onscreen location becomes the Green component of the output color--note
how the image gets greener from bottom to top. Red and green add
up to yellow, so the top-right corner (where X=Y=1) is yellow.
We can make the left half of the screen red, and the right half blue, like this:
if (texcoords.x<0.5) /* left half of screen? */
gl_FragColor=vec4(1,0,0,0); /* red */
else
gl_FragColor=vec4(0,0,1,0); /* blue */
(Try this in NetRun now!)
We can make a little circle in the middle of the screen red like this:
float x=(texcoords.x-0.5)*1.4, y=texcoords.y-0.5;
float radius=sqrt(x*x+y*y);
if (radius<0.3) /* inside the circle? */
gl_FragColor=vec4(1,0,0,0); /* red */
else
gl_FragColor=vec4(0,0,1,0); /* blue */
(Try this in NetRun now!)
We can make a whole grid of circles across the screen like this:
vec2 v=fract(texcoords*10.0);
float x=(v.x-0.5)*1.4, y=v.y-0.5;
float radius=sqrt(x*x+y*y);
if (radius<0.3) /* inside the circle? */
gl_FragColor=vec4(1,0,0,0); /* red */
else
gl_FragColor=vec4(0,0,1,0); /* blue */
(Try this in NetRun now!)
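The trick in the grid version is that "fract" throws away the integer part, so every cell of the screen sees the same 0..1 coordinates. A rough Python sketch of the same test follows; in_circle is a hypothetical helper name, and the ASCII printout uses 3 cells per axis only so that it fits in a terminal.

```python
import math

def fract(v):
    """GLSL-style fract: the fractional part, v - floor(v)."""
    return v - math.floor(v)

def in_circle(tx, ty, cells=10.0):
    """Same test as the shader: tile the screen, then check the radius."""
    vx, vy = fract(tx * cells), fract(ty * cells)
    x, y = (vx - 0.5) * 1.4, vy - 0.5
    return math.sqrt(x * x + y * y) < 0.3

# Crude ASCII rendering, top row first because y=1 is the top of the screen.
for row in range(12):
    ty = 1.0 - (row + 0.5) / 12
    print("".join("#" if in_circle((col + 0.5) / 24, ty, cells=3.0) else "."
                  for col in range(24)))
```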
The calling convention for a GLSL program in NetRun is slightly
simplified from the general case used for real graphics programs:
- Your pixel's location onscreen is stored in the vec2 variable
"texcoords". The x coordinate of this vector gives your on-screen x
coordinate, which varies from 0 on the left side of the screen to 1 on
the right side. The y coordinate gives onscreen y, from 0 at
bottom to 1 at top. In a more general program, input data can be obtained from texture
coordinates passed in from the calling program and/or set up inside a
"vertex program" that runs on each polygon vertex. You can view these texture coordinates as colors, as in the red/green gradient example above.
- There are several predefined 2D textures called "tex1" (me
making a silly face), "tex3", "tex4", and "tex5" (used for various
homeworks).
Here's how you look up the color in a "texture" (a stored 2D image,
used like memory/arrays on graphics cards) at a particular texture
coordinate:
gl_FragColor=texture2D(tex1,vec2(texcoords.x,texcoords.y));
(Try this in NetRun now!)
This looks up in the texture "tex1" at the coordinates given by
texcoords. You can store the looked-up vec4 in a new variable "v"
like so:
vec4 v=texture2D(tex1,vec2(texcoords.x,texcoords.y));
gl_FragColor=1.0-v;
(Try this in NetRun now!)
Note how we've flipped black to white by computing "1.0-v"!
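If it helps to see the lookup as plain array indexing, here is a simplified Python sketch. The helper texture2d_nearest is hypothetical: it picks the single nearest texel and clamps at the edges, whereas the real texture2D may also filter between texels and wrap, depending on how the texture is set up. The tiny 2x2 "texture" is made-up data.

```python
def texture2d_nearest(tex, coord):
    """Look up the texel under a 0..1 coordinate (nearest texel, clamped)."""
    height, width = len(tex), len(tex[0])
    px = min(max(int(coord[0] * width), 0), width - 1)
    py = min(max(int(coord[1] * height), 0), height - 1)
    return tex[py][px]

def invert(color):
    """The per-component equivalent of GLSL's 1.0-v on a vec4."""
    return tuple(1.0 - c for c in color)

# Made-up 2x2 texture; row 0 is the bottom (y=0), colors are (r, g, b, a).
tex = [[(0.0, 0.0, 0.0, 1.0), (1.0, 1.0, 1.0, 1.0)],
       [(1.0, 0.0, 0.0, 1.0), (0.0, 0.0, 1.0, 1.0)]]

v = texture2d_nearest(tex, (0.25, 0.25))
print(v, "->", invert(v))
```

Note that "1.0-v" subtracts in all four components, so the alpha channel gets inverted along with red, green, and blue.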
The builtin functions include "vec2" through "vec4" (build a vector),
"length" (return the float length of any vector), "fract" (return the
fractional part of any vector), and many other functions.
Remember that you can compute *anything* on the GPU!
For example, here's the Mandelbrot set fractal rendered on the graphics card:
vec2 c=vec2(3.0,2.0)*(texcoords-0.5)+vec2(0.0,0.0); /* constant c, varies onscreen*/
vec2 z=c;
/* Mandelbrot iteration: keep iterating until z gets big */
for (int i=0;i<15;i++) {
/* break if squared length of z is >= 4.0 (that is, length >= 2.0) */
if (z.r*z.r+z.g*z.g>=4.0) break;
/* z = z^2 + c; (where z and c are complex numbers) */
z=vec2(
z.r*z.r-z.g*z.g,
2.0*z.r*z.g
)+c;
}
gl_FragColor=fract(vec4(z.r,z.g,0.25*length(z),0));
(Try this in NetRun now!)
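For checking the shader's math away from the GPU, the same iteration can be written as a CPU reference. This is only a sketch: mandelbrot_z and escaped are hypothetical names, and the code mirrors the per-pixel loop above for one texcoords value at a time.

```python
def mandelbrot_z(texcoords, max_iter=15):
    """Run the shader's iteration for one pixel and return the final z."""
    # c = vec2(3.0,2.0)*(texcoords-0.5), as in the shader.
    cr = 3.0 * (texcoords[0] - 0.5)
    ci = 2.0 * (texcoords[1] - 0.5)
    zr, zi = cr, ci
    for _ in range(max_iter):
        # Stop once the squared length reaches 4.0 (length 2.0).
        if zr * zr + zi * zi >= 4.0:
            break
        # z = z^2 + c, treating (zr, zi) as a complex number.
        zr, zi = zr * zr - zi * zi + cr, 2.0 * zr * zi + ci
    return zr, zi

def escaped(texcoords):
    """True if z grew past length 2.0 within the iteration budget."""
    zr, zi = mandelbrot_z(texcoords)
    return zr * zr + zi * zi >= 4.0

print(mandelbrot_z((0.5, 0.5)), escaped((0.5, 0.5)))
print(mandelbrot_z((1.0, 0.5)), escaped((1.0, 0.5)))
```

The center of the screen has c=0, so z never moves; the right edge has c=1.5, which escapes after a single step.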
Graphics Cards' Crazy Performance
Graphics cards are now several times faster than the CPU. How do they achieve this speed?
It's not because graphics card designers are better paid or smarter
than CPU designers, or that the graphics card industry is any bigger:
- Graphics card maker nVidia takes in around $2 billion per year, and has about 2,000 employees (source). ATI is about the same size.
- CPU maker Intel takes in over $30 billion per year, and has about 85,000 employees (source). AMD has $5 billion/year sales, and 16,000 employees.
The difference is that graphics cards run "pixel programs"--a sequence
of instructions to calculate the color of one pixel. The programs
for two adjacent pixels cannot
interact with one another, which means that all the pixel programs are
independent of each other. This implies all the pixels can be
rendered in parallel, with no waiting or synchronization between pixels.
Read that again. That means graphics cards execute a parallel programming language.
Parallelism theoretically allows you to get lots of computing done at a
very low cost. For example, say you've got a 1000x1000 pixel
image. That's a million pixels. If you can build a circuit
that does one floating-point operation on one pixel in 1ns (one
billionth of a second, a typical flop speed nowadays), and you could fit
a million of those circuits on one chip (this is the part that can't be
done at the moment), then you've just built a 1,000 teraflop computer.
That's three times faster than the fastest computer in the world, the
$100 million, 128,000-way parallel Blue Gene.
We're not there yet, because we can't fit that much floating-point
circuitry on one chip, but this is the advantage of parallel execution.
As of 2009, the fastest graphics cards on the market
render up to 1600 pixels simultaneously. This
means every clock cycle, the cards are operating on over 1600 floats at
once. The "MAD" instruction does 2 flops per float, and
executes in a single clock. At a leisurely 0.85GHz, the $400 Radeon 5870 thus would do at least:
2 flops/float*1600 floats/clock*0.85 Gclocks/second=2720 billion flops/second (2.7 teraflops)
Recall that a regular FPU only handles one or two (with superscalar
execution) floats at a time, and the SSE/AltiVec extensions only handle
four floats at a time. Even with SSE, an 8-way Core i7's theoretical
peak performance is only 128 gigaflops, less than one twentieth of the GPU's, and in practice it's very hard to reach even that.
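The peak-rate arithmetic in this section is just a product of three numbers, which a few lines of Python can double-check. The function name peak_flops is made up for this sketch; every figure plugged in comes from the text above, including the 128 gigaflop Core i7 number, which is taken as given rather than derived.

```python
def peak_flops(flops_per_float, floats_per_clock, clocks_per_second):
    """Theoretical peak: flops/float * floats/clock * clocks/second."""
    return flops_per_float * floats_per_clock * clocks_per_second

# Hypothetical chip: a million circuits, each doing one flop per nanosecond.
dream_chip = peak_flops(1, 1000 * 1000, 1e9)

# Radeon 5870: MAD counts as 2 flops, 1600 floats per clock, 0.85 GHz.
radeon_5870 = peak_flops(2, 1600, 0.85e9)

# Core i7 peak quoted in the text, in flops/second.
core_i7_peak = 128e9

print(dream_chip / 1e12, "teraflops for the hypothetical chip")
print(radeon_5870 / 1e12, "teraflops for the Radeon 5870")
print(radeon_5870 / core_i7_peak, "times the Core i7's peak")
```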
GPU: Texture Cache Performance Effects
Here's a little program that reads a texture, tex1, at five slightly shifted locations; visually this produces a blurred image.
vec4 sum=vec4(0.0);
for (int i=0;i<5;i++)
sum+=texture2D(tex1,texcoords*1.0+float(i)*0.01);
gl_FragColor=sum*(1.0/5.0);
(Try this in NetRun now!)
At the given default scale factor of 1.0, this program takes 0.1 ns per pixel (on NetRun's fast GeForce GTX 280 card).
If we zoom in, to a scale factor of 0.5 or 0.1, the program takes
exactly the same time. We're still accessing nearby pixels.
But if we zoom out, to a scale factor of 2.0, like this, then
adjacent pixels onscreen get fairly distant texture pixels, and
suddenly the program slows down to over 0.23ns per pixel.
...
sum+=texture2D(tex1,texcoords*2.0+float(i)*0.01);
...
(Try this in NetRun now!)
Zooming out farther slows the access down even more, up to 3ns per pixel with a scale of 16. That's a 30-fold slowdown!
The reason for this is the "texture cache":
- When you read a value from a texture, the hardware fetches both it and nearby parts of the texture from RAM.
- All the fetched texture is stored in the texture cache, a chunk of on-chip memory designed for speed.
- Subsequent reads from that part of the texture are thus quite fast, because they don't have to go to RAM.
When we're zoomed way out, adjacent pixels onscreen are read from
distant locations in the texture (texcoords * 16.0 means there are 16
pixels between each read!).
The bottom line: for high performance, read textures in a contiguous
fashion (nearby pixels), not random-access (distant pixels).
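A toy model shows why zooming out hurts. The Python sketch below counts cache misses for the five-sample blur at different scale factors. Everything about the cache is an assumption made for illustration, not a description of any real GPU: 4x4-texel cache tiles, room for 128 tiles, least-recently-used replacement, pixels visited in plain row order, one texel per screen pixel at scale 1.0, and no texture wrap-around. It counts misses only and says nothing about actual timings.

```python
import math
from collections import OrderedDict

def misses_per_pixel(scale, screen=64, tile=4, capacity=128):
    """Average texture-cache misses per pixel in a toy LRU cache model."""
    cache = OrderedDict()            # keys are (tile column, tile row)
    misses = 0
    for py in range(screen):
        for px in range(screen):
            for i in range(5):       # the five shifted texture reads
                shift = i * 0.01 * screen
                tx = math.floor(px * scale + shift)
                ty = math.floor(py * scale + shift)
                key = (tx // tile, ty // tile)
                if key in cache:
                    cache.move_to_end(key)         # mark as recently used
                else:
                    misses += 1                    # had to go out to RAM
                    cache[key] = True
                    if len(cache) > capacity:
                        cache.popitem(last=False)  # evict least recently used
    return misses / (screen * screen)

for scale in (0.5, 1.0, 2.0, 16.0):
    print("scale", scale, "->", misses_per_pixel(scale), "misses per pixel")
```

In this model, nearby pixels share tiles at small scale factors, so most reads hit the cache; at a scale of 16 every pixel lands in its own tile and pays for at least one trip to RAM.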
GPU: Branch Divergence Penalty
On the CPU, the branch performance model is really simple. If you've got code like:
if (something) A(); else B();
On the CPU, if the branch is taken, this takes time A, else it takes time B.
On the GPU, depending on how other adjacent pixels take the branch, this code could take time A+B.
GPU branches work like CPU branches (one or the other) as long as
nearby regions of the screen branch the same way ("coherent branches",
sorta the branch equivalent of access locality). For example,
each call to "cool_colors" takes about 0.1ns per pixel, but because we
branch in big onscreen blocks here, this takes a reasonable
0.11ns/pixel overall:
vec4 cool_colors(vec2 where) { /* <- takes about 0.1ns per pixel to execute */
return vec4(sin(where.x),log(cos(where.y)),exp(where.x),pow(where.x,where.y));
}
void main(void) {
float doit=fract(texcoords.x*1.0);
if (doit<0.3)
gl_FragColor=cool_colors(texcoords).wxyz*0.5;
else if (doit<0.7)
gl_FragColor=cool_colors(1.0-texcoords)*4.5; /* purple */
else
gl_FragColor=cool_colors(1.0-texcoords)*0.15; /* dark */
}
(Try this in NetRun now!)
If I change this so "doit" varies much faster onscreen, then
adjacent pixels will be taking different branches. The GPU
implements this like SSE: you figure out the answer for both branches,
then use bitwise operations to mask off the untaken branch. So
now the hardware actually has to run "cool_colors" three times for
every pixel (one per branch), and our time goes up to 0.285ns/pixel!
vec4 cool_colors(vec2 where) { /* <- takes about 0.1ns per pixel to execute */
return vec4(sin(where.x),log(cos(where.y)),exp(where.x),pow(where.x,where.y));
}
void main(void) {
float doit=fract(texcoords.x*100.0);
if (doit<0.3)
gl_FragColor=cool_colors(texcoords).wxyz*0.5;
else if (doit<0.7)
gl_FragColor=cool_colors(1.0-texcoords)*4.5; /* purple */
else
gl_FragColor=cool_colors(1.0-texcoords)*0.15; /* dark */
}
(Try this in NetRun now!)
Internally, even a graphics card with a thousand "shader cores"
really has only a few dozen execution cores, running an SSE-like
multiple-floats-per-instruction program. Each execution core is
responsible for a small contiguous block of pixels, so if all those
pixels branch together, the core can skip the entire "else" case.
If some of a core's pixels branch one way, and some branch the other way, the core has to take both branches, and the program slows down appreciably.
NVIDIA
calls a group of threads that branch the same way a "warp", and the
overall GPU architecture "SIMT": Single Instruction, Multiple
Thread. Current NVIDIA machines have 32 threads (floats) per warp.
ATI calls a group of threads that branch the same way a "wavefront", typically 64 floats per wavefront.
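The cost model in this section can be sketched in a few lines of Python. This is a simplification under stated assumptions: pixels are grouped in 1D runs of 32 along a single 1024-pixel row (real hardware groups 2D blocks of pixels), each branch costs the 0.1ns per pixel quoted above for cool_colors, and a group pays for every branch that any of its pixels takes. The names branch_taken and average_cost are made up for this sketch.

```python
import math

def fract(v):
    """GLSL-style fract: the fractional part, v - floor(v)."""
    return v - math.floor(v)

def branch_taken(x, frequency):
    """Which of the shader's three branches a pixel at screen x takes."""
    doit = fract(x * frequency)
    if doit < 0.3:
        return 0
    elif doit < 0.7:
        return 1
    return 2

def average_cost(frequency, width=1024, group=32, call_cost=0.1):
    """Modeled ns per pixel when each group runs every branch it contains."""
    total = 0.0
    for start in range(0, width, group):
        branches = {branch_taken((px + 0.5) / width, frequency)
                    for px in range(start, start + group)}
        total += len(branches) * call_cost * group
    return total / width

print("coherent  (x*1.0):  ", average_cost(1.0), "ns/pixel")
print("divergent (x*100.0):", average_cost(100.0), "ns/pixel")
```

With the slow-varying "doit", only the two groups that straddle a branch boundary pay double, so the average stays near 0.1ns; with the fast-varying version every group contains all three branches and pays for all of them, which lines up reasonably well with the measured 0.11 and 0.285 ns/pixel.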