General-Purpose Graphics Processor Unit (GPGPU) Computing

CS 441 Lecture, Dr. Lawlor

It's actually pretty darn easy to run non-graphics computations on the GPU.  For example, here's the ballistics computation program from HW4:
```cpp
/**
Trivial OpenGL program showing a ballistics computation on the GPU.
Must be compiled with GLEW and glut, like -lGLEW -lglut

Orion Sky Lawlor, olawlor@acm.org, 2009-12-03 (Public Domain)
*/
#include <stdio.h>
#include <stdlib.h>
#include "ogl/gpgpu.h" /* Dr. Lawlor's OpenGL simplification library for GPGPU */
#include "ogl/glsl.cpp" /* Dr. Lawlor's GLSL compilation library */

/* Data type for pixels: color floats */
enum {pixeltype=GL_RGBA32F_ARB};

int main() {
	int width=1,height=100;
	gpu_env e; /* stores programs and GLSL names of your arrays */
	gpu_array *DEST=new gpu_array(e,"DEST",width,height,0,pixeltype);
	GPU_RUN(*DEST, /* This is GLSL code, run on the GPU: */
		/* create initial vx & vy */
		int i=(int)(location.y*100.0);
		float vx=0.73-i*0.005;
		float vy=0.01+i*0.0005;
		float coeff_drag=0.007;
		float dt=0.001;
		float coeff_g=1/6.0*(32/5280.0);
		float x=0.0; float y=0.0; float t=0.0;
		while (1) {
			float ax=-coeff_drag*vx; float ay=-coeff_drag*vy-coeff_g; /* acceleration */
			vx+=dt*ax; vy+=dt*ay; /* velocity */
			if (y+dt*vy<0.0) { /* we're about to hit the ground */
				break;
			}
			x+=dt*vx; y+=dt*vy;  /* position */
			t+=dt;
		}
		gl_FragColor=vec4(x,y,t,vy);
	)
	/* Copy ballistics data back from GPU */
	float *data=new float[4*width*height];
	DEST->read(data,width,height);
	/* Print out output grid */
	for (int y=0;y<height;y++) {
		for (int x=0;x<4;x++) printf("%.2f ",data[x+4*y]);
		printf("\n");
	}
	return 0;
}
```
You can't do normal output (like printf) from inside GPU code.  Actually, the big limitation is you can *ONLY* write to your own pixel (gl_FragColor) of the current output texture (DEST, in this case).  We used a four-float-per-pixel RGBA texture (see pixeltype) so we can write out four floats at once (x, y, t, and vy); to write more floats than this, you have to replicate the computation, or use multiple render targets (MRT).

In the above program, we don't read from anything, but you can also read from another input texture.  Here we're using greyscale (one float per pixel) textures, which are a little easier to follow from the CPU side than RGBA (four floats per pixel):
```cpp
/**
Trivial OpenGL program showing how to get data into and out of the GPU.
Must be compiled with GLEW and glut, like -lGLEW -lglut

Orion Sky Lawlor, olawlor@acm.org, 2009-12-03 (Public Domain)
*/
#include <stdio.h>
#include <stdlib.h>
#include "ogl/gpgpu.h"
#include "ogl/glsl.cpp"

/* Data type for pixels: greyscale floats */
enum {pixeltype=GL_LUMINANCE32F_ARB};

int main() {
	gpu_env e; /* stores programs, names of arrays */
	gpu_array *SRC,*DEST;
	/* Start with data on CPU */
	int width=2,height=3;
	float *data=new float[width*height];
	for (int y=0;y<height;y++) for (int x=0;x<width;x++)
		data[x+y*width]=0.1*x+1.0*y;
	/* Copy data over to GPU */
	SRC=new gpu_array(e,"SRC",width,height,data,pixeltype);
	DEST=new gpu_array(e,"DEST",width,height,data,pixeltype);
	GPU_RUN(*DEST,
		float c=texture2D(SRC,location).r; /* read SRC at our location */
		gl_FragColor=c+0.2; /* add and write out to DEST */
	)
	/* Copy grid back from GPU */
	DEST->read(data,width,height);
	/* Print out output grid */
	for (int y=0;y<height;y++) {
		for (int x=0;x<width;x++) printf("%.2f ",data[x+y*width]);
		printf("\n");
	}
	return 0;
}
```
These programs and my "gpgpu.h" header are on the powerwall in "~olawlor/441_demo/gpgpu/".  You can build and run them like:
```shell
cp -r ~olawlor/441_demo/gpgpu/ .
cd gpgpu/simplegl
make && ./main
```
If you're logged in from a UNIX machine with X forwarding, turn that off with "export DISPLAY=:0" first, or you'll get an error like "freeglut (foo): Unable to create direct context rendering for window 'GPGPU Program'" followed by "Error!  OpenGL hardware or software too old--no GLSL!".  For some reason GLSL doesn't forward over X yet.

• gpgpu_basics.tgz (16KB) contains just the above source files and the minimal GPGPU support header.
• gpgpu_full (Zip, Tar-gzip) (800KB) contains a Windows executable, makefiles, and all my GPU support code.
(Note that after I taught class this morning, I realized I can quote the GLSL code inside the GPU_RUN macro; leaving out the quotes makes the source code look a lot cleaner.)

CUDA

NVIDIA's CUDA interface removes the "only write to your pixel" restriction from GPU code, giving you full pointer-based read/write access to all of GPU memory.  I think this unrestricted shared-memory programming model is actually quite dangerous, because it allows you to make all the same memory-race bugs you can get with multithreaded programming, and most applications don't need that power anyway.

```cpp
/**
  Dr. Lawlor's Example Program using NVIDIA CUDA
  Orion Sky Lawlor, olawlor@acm.org, 2008-07-28 (Public Domain)
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <cutil.h> /* <- from NVIDIA CUDA SDK */

enum{n=64}; /* size of my array */

__global__ void funky_GPU_stuff(float *vals,float param) {
        int i=threadIdx.x+blockIdx.x*blockDim.x; /* my location in array */
        vals[i]=i*param;
}

int main(int argc,char** argv)
{
        float param=0.01;
        /* Allocate an output array in GPU memory with cudaMalloc */
        float *d_vals=0; /* device values */
        CUDA_SAFE_CALL( cudaMalloc( (void**) &d_vals, 2*n*sizeof(float)));
        printf("CudaMalloc'd a GPU pointer: %p\n",d_vals);
        funky_GPU_stuff<<<1,n>>>(d_vals,param);
        float h_vals[n];
        CUDA_SAFE_CALL( cudaMemcpy(h_vals,d_vals,n*sizeof(float),cudaMemcpyDeviceToHost));
        printf("Got back values from device:\n");
        for (int i=0;i<n;i++) {
                printf("        h_vals[%d] = %f\n",i,h_vals[i]);
        }
}
```
The important stuff to note here:
• cudaMalloc is called from the CPU to allocate GPU memory.  It returns a pointer to GPU memory that you can use only in later CUDA calls; never dereference it directly from the CPU!
• Declaring a function __global__ makes it run on the GPU.  You call a GPU function (a "kernel") with the bizarre <<<nBlocks,threadsPerBlock>>> syntax.  This runs the function over a given number of "blocks" (up to 65,535) and a given number of "threads" (pixel shaders, up to 512) per block.  Here, I've got a single block of 64 threads--certainly not enough parallelism to fill up the entire GPU!
• You normally leave your data on the GPU, but for output you need to copy it to the CPU, with a cudaMemcpy.
This program is on the powerwall in "~olawlor/441_demo/cuda/".