Graphics Processing Unit (GPU) Programming in CUDA

CS 301: Assembly Language Programming Lecture, Dr. Lawlor

The GPU is not just for graphics anymore, it's a fully programmable general-purpose computer with incredibly high parallelism. The idea is you use the GPU alongside the CPU, so a typical CUDA program has this division of labor:

Normal CPU, the "__host__"

Graphics Card, the "__device__"

Runs main, in the usual way

Reads files

Talks on network

Allocates memory for both sides

Invokes GPU kernels

Runs special kernel functions, using blocks of threads

Delivers high performance on compute-intensive tasks

The host and device have their own dedicated cores and memory stack. In particular, this means you need a special copy to get data from one side to the other.

CUDA Kernels

CUDA uses a special compiler "nvcc", but by default all your CUDA code is normal C++, running normally:

#include <stdio.h>

int main() {
	printf("Hello, world!\n");
	return 0;
}

(Try this in NetRun now!)

To have code run on the GPU, you define a kernel using the "__global__" keyword before the function. You then call the kernel using a special syntax <<<number_of_blocks_of_threads, threads_per_block>>>:

#include <stdio.h>

__global__ void do_stuff(void)
{
	printf("Hello from GPU land!\n");
}

int main() {
	printf("Hello, world!\n");
	
	do_stuff<<<1,2>>>(); // run kernel with 1 block of 2 threads each
	
	cudaError_t error = cudaPeekAtLastError(); // check if kernel launch worked
	if (error == 0) error = cudaDeviceSynchronize(); // wait for GPU threads to finish
	if (error != 0) printf("CUDA Error %s (%d)", cudaGetErrorString(error), (int)error);
	
	return 0;
}

(Try this in NetRun now!)

(It's important to check for CUDA errors, since there are lots of ways for CUDA calls to go wrong, including CUDA driver misconfigurations like card/driver mismatch.)

Inside a kernel, you can ask which thread and block you are using the "threadIdx.x" and "blockIdx.x" builtin variables:

#include <stdio.h>

__global__ void do_stuff(void)
{
	int thr=threadIdx.x;
	int blk=blockIdx.x;
	printf("Hello from GPU thread %d of block %d\n",thr,blk);
}

int main() {
	printf("Hello, world!\n");
	do_stuff<<<2,8>>>(); // run kernel with 2 blocks of 8 threads each
	cudaDeviceSynchronize(); // wait for GPU threads to finish
	return 0;
}

(Try this in NetRun now!)

For reasonable performance, you should run at least a million GPU threads! This means you don't want to print directly from each thread, but you should write an array.

CUDA Memory

Writing to a GPU array is a little tricky, because the GPU cannot allocate its own memory, you need to allocate the memory on the CPU and pass a pointer in to the GPU. Here gpu_vec is my own simple C++ class to hide the rather ugly C style CUDA functions cudaMalloc called on the CPU to allocate GPU memory, and cudaMemcpy to copy data between CPU and GPU memory. CUDA memory allocation calls are quite slow, so typically you want to only do the allocation once at startup if possible.

#include "osl/gpu_vec.h"  /* for gpu_vec class, avoids cudaMalloc / cudaMemcpy */

#include "lib/inc.h" /* for time_in_seconds benchmarking */
#include "lib/inc.c" /* same *

/* GPU Code! */
__global__ void fill_in_array(int *dest,const int *src,int v) {
	int i=threadIdx.x+blockIdx.x*blockDim.x;
	
	dest[i]=src[i]+v;
}

/* Run on CPU */
int main(int argc,char *argv[]) {
	int wid=256,ht=10000;
	std::vector<int> data(wid*ht);
	for (int i=0;i<wid*ht;i++) data[i]=1000+i; // make up some source data
	gpu_vec<int> src(data); // copy data to GPU
	
	gpu_vec<int> dest(wid*ht); // make space for output data

	double start=time_in_seconds();
	fill_in_array<<<ht,wid>>>(dest,src,3);
	int i=dest[0]; // read one element, to make sure process actually finished
	double elapsed=time_in_seconds()-start;

	std::cout<<"Calculated array at "<<elapsed*1.0e9/(wid*ht)<<" ns/entry, total time "<<elapsed*1.0e6<<" microseconds\n";

	// print a few elements, to show it worked:
	for (int i=0;i<10;i++) std::cout<<dest[i]<<"\n";

	return 0;
}

(Try this in NetRun now!)

To get decent performance, you should usually make blocks of 256 threads each, and make thousands of blocks. If you do those things, the GPU hardware can run thousands of threads at a time, resulting in incredible arithmetic performance of up to several thousand floating-point operations per nanosecond(!).

Here's some performance data from a slightly older NVIDIA GeForce GTX 670:

memcpy size 64: 0.01 GB/sec (8.3 us)
memcpy size 256: 0.03 GB/sec (8.3 us)
memcpy size 1024: 0.12 GB/sec (8.5 us)
memcpy size 4096: 0.47 GB/sec (8.7 us)
memcpy size 16384: 1.45 GB/sec (11.3 us)
memcpy size 65536: 3.21 GB/sec (20.4 us) <- half of peak copy speed
memcpy size 262144: 5.29 GB/sec (49.5 us)
memcpy size 1048576: 6.28 GB/sec (166.9 us)
memcpy size 4194304: 6.60 GB/sec (635.9 us)
memcpy size 16777216: 6.67 GB/sec (2513.8 us)

GPU write size 64: 0.01 GB/sec (9.8 us)
GPU write size 256: 0.03 GB/sec (9.8 us)
GPU write size 1024: 0.10 GB/sec (9.8 us)
GPU write size 4096: 0.42 GB/sec (9.8 us)
GPU write size 16384: 1.64 GB/sec (10.0 us)
GPU write size 65536: 6.03 GB/sec (10.9 us)
GPU write size 262144: 17.32 GB/sec (15.1 us) <- half of peak write speed
GPU write size 1048576: 32.76 GB/sec (32.0 us)
GPU write size 4194304: 41.38 GB/sec (101.4 us)
GPU write size 16777216: 39.99 GB/sec (419.5 us)

Time for sort of size 16: 21099.579 ns/elt (0.338 ms)
Time for sort of size 64: 5302.252 ns/elt (0.339 ms)
Time for sort of size 256: 1377.077 ns/elt (0.353 ms)
Time for sort of size 1024: 312.655 ns/elt (0.320 ms)
Time for sort of size 4096: 78.331 ns/elt (0.321 ms)
Time for sort of size 16384: 20.922 ns/elt (0.343 ms)  <- nearly the same time as tiny array!
Time for sort of size 65536: 8.636 ns/elt (0.566 ms)
Time for sort of size 262144: 5.332 ns/elt (1.398 ms) <- half of peak sort speed
Time for sort of size 1048576: 3.527 ns/elt (3.699 ms)
Time for sort of size 4194304: 3.027 ns/elt (12.694 ms)

Note that really big arrays, with millions of floats, deliver way better performance than smaller arrays. The basic trouble is that going through the OS, across the PCIe bus, into the GPU, and back takes like 10000ns (usually similar to the time to create one C++11 thread). If you go to all that just to get one or two floats, or even a few thousand, you'll get terrible performance--the GPU prefers really big blocks of data with many thousands of elements.

CUDA Image processing

The reason the GPU supports thousands of cores running millions of threads is because there are normally millions of pixels onscreen. Here's a simple example, drawing simple gradients and a circle:

#include "osl/gpu_vec.h"  /* for gpu_vec class, avoids cudaMalloc / cudaMemcpy */

#include <fstream> /* ofstream for image write */
#include "lib/inc.h" /* for time_in_seconds benchmarking */
#include "lib/inc.c" /* same *

/* Class to represent one image pixel */
struct pixel {
	unsigned char r,g,b;	
};

/* GPU Code! */
__global__ void draw_image(pixel *image,int wid,int ht) {
	int i=threadIdx.x+blockIdx.x*blockDim.x;
	int x=i%wid, y=i/wid;
	image[i].r=255;
	image[i].g=x*255/wid;
	image[i].b=y*255/ht;
	if (sqrt((float)(x*x+y*y))<200) { // blue circle in top left corner
		image[i].r=0;
		image[i].g=0;
		image[i].b=255;
	}
}

/* Run on CPU */
int main(int argc,char *argv[]) {
	int wid=512,ht=512, threads_per_block=256;
	
	gpu_vec<pixel> image(wid*ht); // make space for output data

	double start=time_in_seconds();
	draw_image<<<wid*ht/threads_per_block,threads_per_block>>>(image,wid,ht);
	pixel p=image[0]; // read one pixel, to make sure process actually finished
	double elapsed=time_in_seconds()-start;

	std::cout<<"Drew image at "<<elapsed*1.0e9/(wid*ht)<<"ns/pixel, time "<<elapsed*1.0e6<<" microseconds\n";

	// Copy the image data back to the CPU, and write to file
	std::vector<pixel> img; image.copy_to(img);
	std::ofstream imgfile("out.ppm",std::ios_base::binary);
	imgfile<<"P6\n"<<wid<<" "<<ht<<" 255\n";
	imgfile.write((char *)&img[0],wid*ht*sizeof(pixel));

	return 0;
}

(Try this in NetRun now!)

Here's a slightly more complex example, drawing the Mandelbrot Set fractal:

/* GPU Code! */
__global__ void draw_image(pixel *image,int wid,int ht) {
	int i=threadIdx.x+blockIdx.x*blockDim.x;
	int x=i%wid, y=i/wid;
	float fx=x*(1.0/wid), fy=y*(1.0/ht);
	float scale=1.0; // amount of the mandelbrot set to draw
	fx*=scale; fy*=scale;
	
	float ci=fy, cr=fx; // complex constant: x,y coordinates
	float zi=ci, zr=cr; // complex number to iterate
	int iter;
	for (iter=0;iter<100;iter++) {
		if (zi*zi+zr*zr>4.0) break; // number too big--stop iterating
		// z = z*z + c
		float zr_new=zr*zr-zi*zi+cr;
		float zi_new=2*zr*zi+ci;
		zr=zr_new; zi=zi_new;
	}
	
	image[i].r=zr*255/4.0;
	image[i].g=zi*255/4.0;
	image[i].b=iter;
}

(Try this in NetRun now!)

CUDA Summary

The important things to remember about GPU programming are:

It is not CPU programming. In particular, you currently have:

No std:: C++ standard library functions in GPU code.
Many weird limits on things like malloc in GPU code.
No reasonably efficient way to do mutexes in GPU code. (You can use CUDA atomics to eliminate multi-thread race conditions on simple values.)

It is a good fit for applications that spend most of their time doing arithmetic on large arrays of simple values like int and float. This includes simulations, brute force cryptography, AI training, and many other "big data" problems.