General-Purpose Graphics Processor Unit (GPGPU) Computing
CS 441 Lecture, Dr. Lawlor
It's actually pretty darn easy to run non-graphics computations on the
GPU. For example, here's the ballistics computation program from
HW4:
/**
 Trivial OpenGL program showing a ballistics computation on the GPU.
 Must be compiled with GLEW and glut, like -lGLEW -lglut

 Orion Sky Lawlor, olawlor@acm.org, 2009-12-03 (Public Domain)
*/
#include <stdio.h>
#include <stdlib.h>
#include "ogl/gpgpu.h" /* Dr. Lawlor's OpenGL simplification library for GPGPU */
#include "ogl/glsl.cpp" /* Dr. Lawlor's GLSL compilation library */

/* Data type for pixels: color floats */
enum {pixeltype=GL_RGBA32F_ARB};

int main() {
  int width=1,height=100;
  gpu_env e; /* stores programs and GLSL names of your arrays */
  gpu_array *DEST=new gpu_array(e,"DEST",width,height,0,pixeltype);
  GPU_RUN(*DEST, /* This is GLSL code, run on the GPU: */
    /* create initial vx & vy */
    int i=int(location.y*100.0);
    float vx=0.73-i*0.005;
    float vy=0.01+i*0.0005;
    float coeff_drag=0.007;
    float dt=0.001;
    float coeff_g=1/6.0*(32/5280.0);
    float x=0.0; float y=0.0; float t=0.0;
    while (1) {
      float ax=-coeff_drag*vx; float ay=-coeff_drag*vy-coeff_g; /* acceleration */
      vx+=dt*ax; vy+=dt*ay; /* velocity */
      if (y+dt*vy<0.0) { /* we're about to hit the ground */
        break;
      }
      x+=dt*vx; y+=dt*vy; /* position */
      t+=dt;
    }
    gl_FragColor=vec4(x,y,t,vy);
  )

  /* Copy ballistics data back from GPU */
  float *data=new float[4*width*height];
  DEST->read(data,width,height);

  /* Print out output grid */
  for (int y=0;y<height;y++) {
    for (int x=0;x<4;x++) printf("%.2f ",data[x+4*y]);
    printf("\n");
  }
  return 0;
}
You can't do normal output from inside the GPU. In fact, the big
limitation is you can *ONLY* write to your own pixel (gl_FragColor) of the
current output texture (DEST, in this case). We used a
four-float-per-pixel RGBA texture (in pixeltype) so we can write out
four floats at once (x, y, t, and vy); to write more floats than this, you
have to replicate the computation, or use multiple render targets (MRT).
In the above program, we don't read from anything, but you can also
read from another input texture. Here we're using greyscale (one
float per pixel) textures, which are a little easier to follow from the
CPU side than RGBA (four floats per pixel):
/**
 Trivial OpenGL program showing how to get data into and out of the GPU.
 Must be compiled with GLEW and glut, like -lGLEW -lglut

 Orion Sky Lawlor, olawlor@acm.org, 2009-12-03 (Public Domain)
*/
#include <stdio.h>
#include <stdlib.h>
#include "ogl/gpgpu.h"
#include "ogl/glsl.cpp"

/* Data type for pixels: greyscale floats */
enum {pixeltype=GL_LUMINANCE32F_ARB};

int main() {
  gpu_env e; /* stores programs, names of arrays */
  gpu_array *SRC,*DEST;

  /* Start with data on CPU */
  int width=2,height=3;
  float *data=new float[width*height];
  for (int y=0;y<height;y++) for (int x=0;x<width;x++)
    data[x+y*width]=0.1*x+1.0*y;

  /* Copy data over to GPU */
  SRC=new gpu_array(e,"SRC",width,height,data,pixeltype);
  DEST=new gpu_array(e,"DEST",width,height,data,pixeltype);
  GPU_RUN(*DEST,
    float c=texture2D(SRC,location).r; /* read SRC at our location */
    gl_FragColor=vec4(c+0.2); /* add and write out to DEST */
  )

  /* Copy grid back from GPU */
  DEST->read(data,width,height);

  /* Print out output grid */
  for (int y=0;y<height;y++) {
    for (int x=0;x<width;x++) printf("%.2f ",data[x+y*width]);
    printf("\n");
  }
  return 0;
}
These programs and my "gpgpu.h" header are on the powerwall in
"~olawlor/441_demo/gpgpu/". You can build and run them like:
cp -r ~olawlor/441_demo/gpgpu/ .
cd gpgpu/simplegl
make && ./main
If you're running from a UNIX machine with X forwarding, turn that off
with "export DISPLAY=:0" first, or you'll get an error like "freeglut
(foo): Unable to create direct context rendering for window 'GPGPU
Program'. Error! OpenGL hardware or software too old--no GLSL!".
For some reason GLSL doesn't forward over X yet.
You can also download these programs and my gpgpu.h header:
- gpgpu_basics.tgz (16KB) contains just the above source files and minimal GPGPU support header above.
- gpgpu_full (Zip, Tar-gzip) (800KB) contains a Windows executable, makefiles, and all my GPU support code.
(Note that after I taught class this morning, I realized the GPU_RUN
macro can quote (stringify) the GLSL code itself; leaving out the
quotes makes the source code look a lot cleaner.)
CUDA
NVIDIA's CUDA
interface removes the "only write to your pixel" restriction from GPU
code, giving you full pointer-based read/write access to all of GPU
memory. I think this unrestricted shared-memory programming model
is actually quite dangerous, because it allows you to make all the same
memory-race bugs you can get with multithreaded programming, and most
applications don't need that power anyway.
/**
 Dr. Lawlor's Example Program using NVIDIA CUDA

 Orion Sky Lawlor, olawlor@acm.org, 2008-07-28 (Public Domain)
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <cutil.h> /* <- from NVIDIA CUDA SDK */

enum {n=64}; /* size of my array */

__global__ void funky_GPU_stuff(float *vals,float param) {
  int i=threadIdx.x+blockIdx.x*blockDim.x; /* my location in array */
  vals[i]=i*param;
}

int main(int argc,char** argv)
{
  float param=0.01;

  /* Allocate an output array in GPU memory with cudaMalloc */
  float *d_vals=0; /* device values */
  CUDA_SAFE_CALL( cudaMalloc( (void**) &d_vals, n*sizeof(float)));
  printf("CudaMalloc'd a GPU pointer: %p\n",d_vals);

  /* Run the kernel on the GPU: 1 block of n threads */
  funky_GPU_stuff<<<1,n>>>(d_vals,param);

  /* Copy the results back to the CPU, and print them */
  float h_vals[n];
  CUDA_SAFE_CALL( cudaMemcpy(h_vals,d_vals,n*sizeof(float),cudaMemcpyDeviceToHost));
  printf("Got back values from device:\n");
  for (int i=0;i<n;i++) {
    printf(" h_vals[%d] = %f\n",i,h_vals[i]);
  }
  return 0;
}
The important stuff to note here:
- cudaMalloc is called from the CPU to allocate GPU memory.
It returns a pointer to GPU memory that you can only use in
future CUDA calls--never dereference it directly from the CPU!
- Declaring a function __global__ makes it run on the GPU.
You call a GPU function (a "kernel") with the bizarre
<<<nBlocks,threadsPerBlock>>> syntax. This runs
the function over a given number of "blocks" (up to 65K blocks) and a
given number of "threads" (pixel shaders, up to 512) per block.
Here, I've got just one block of 64 threads--certainly not
enough parallelism to fill up the entire GPU!
- You normally leave your data on the GPU, but for output you need to copy it to the CPU, with a cudaMemcpy.
This program is on the powerwall in "~olawlor/441_demo/cuda/".