Graphics Processing Unit (GPU) Programming in CUDA

The GPU is not just for graphics anymore: it's a fully programmable, massively parallel general-purpose computer.  The idea is to use the GPU alongside the CPU, so a typical CUDA program has this division of labor:

Normal CPU, the "__host__":
	Runs main, in the usual way
	Reads files
	Talks on the network
	Allocates memory for both sides
	Invokes GPU kernels

Graphics Card, the "__device__":
	Runs special kernel functions, using blocks of threads
	Delivers high performance on compute-intensive tasks

Several interesting architectural details crop up in the CUDA documentation; the most important are the ones visible to the programmer, summarized here.

thread
	One single execution of a kernel.  This is mostly conceptual, since the hardware operates on blocks of threads.
	How big: 1.   CPU equivalent: a function call.

warp (of threads)
	A group of 16-32 threads that all take the same branches.  The compiler hides this, and the hardware does a good job with predication, so warps aren't "in your face" like SSE instructions; there's just a mild performance improvement if you can improve "branch coherence".
	How big: 16-32 "threads".   CPU equivalent: a SIMD unit ("floats").

block (of threads)
	A group of a few hundred threads that all have access to the same "__shared__" memory, and are scheduled at the same time.  The number of threads per block is specified in software when you start the kernel, though it is capped by hardware at 512 threads (older cards) or 1024 threads maximum, and small block sizes (less than 100 threads) are much slower.  For small kernels on current hardware, 256 threads per block performs best.
	How big: about 256 threads.   CPU equivalent: one multicore thread (blocks run SMP/SMT style).

kernel
	A set of blocks of threads, all running one __global__ function.  You call a kernel using a special syntax:

  yourKernel<<<nBlocks,threadsPerBlock>>>(your args...);

	How big: about a million threads.   CPU equivalent: a parallel loop.
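
To make the thread and block bookkeeping concrete, here's a minimal sketch of the standard global-index idiom inside a kernel (hypothetical kernel and variable names); the bounds check matters whenever n isn't an exact multiple of the block size:

// Each thread handles one element of a length-n array.
__global__ void scale_array(float *data,int n,float factor) {
	int i=threadIdx.x + blockIdx.x*blockDim.x; // my global thread number
	if (i<n)                 // guard: the last block may have extra threads
		data[i]*=factor;
}

// Launch with enough blocks to cover n, rounding up:
//   int threadsPerBlock=256;
//   int nBlocks=(n+threadsPerBlock-1)/threadsPerBlock;
//   scale_array<<<nBlocks,threadsPerBlock>>>(devPtr,n,2.0f);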

Notice that despite the terminology gap, deep down there's a whole lot of similarity between GPU and CPU architectures; the GPU is just focused on wider problems.  The GPU's SIMD units are much wider: 32 floats at a time, versus 8 at a time for AVX or 16 for Xeon Phi.  A typical GPU has perhaps 8 multiprocessors (each called an SM or SMX), versus about 4 cores for a CPU.  The big difference is that each GPU multiprocessor is deeply hyperthreaded, able to run hundreds of threads at once, typically limited by the physical register count, nowadays an absurd 65,536 registers per multiprocessor.  So you get ridiculously good GPU performance if your code can fill all those threads, but if you've only got single-threaded work, the GPU's low clock rate and weak superscalar execution give ridiculously bad performance per thread!
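
If you want to see these limits on your own card, here's a minimal sketch that queries them through the CUDA runtime's cudaGetDeviceProperties call:

#include <iostream>
#include <cuda_runtime.h>

int main(void) {
	cudaDeviceProp p;
	cudaGetDeviceProperties(&p,0); // properties of device 0
	std::cout<<"Device: "<<p.name<<"\n";
	std::cout<<"  Multiprocessors: "<<p.multiProcessorCount<<"\n";
	std::cout<<"  Warp size: "<<p.warpSize<<" threads\n";
	std::cout<<"  Max threads per block: "<<p.maxThreadsPerBlock<<"\n";
	std::cout<<"  Registers per block: "<<p.regsPerBlock<<"\n";
	std::cout<<"  Shared memory per block: "<<p.sharedMemPerBlock<<" bytes\n";
	std::cout<<"  Global memory: "<<p.totalGlobalMem/(1024*1024)<<" MB\n";
	return 0;
}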

The on-GPU memory model is also highly segmented and specialized.  Again, concentrate on the programmer-visible memory types; sizes vary by card, and speeds are approximate.

registers
	Example: int i;
	Local storage within a thread.  Allocated automatically by the compiler.
	Size: tens of thousands per multiprocessor, shared among all resident threads.   Speed: 10TB/s (!!)

shared memory
	Example: __shared__ float arr[256];
	Communication within a thread block.
	Size: 16KB - 48KB per block.   Speed: 1TB/s

constant memory
	Example: __constant__ float c[100];
	Read-only prebroadcast memory, for kernel parameters and such.
	Size: 64KB total, 8KB cache per multiprocessor.   Speed: 1TB/s?

global memory
	Example: float *ptr (allocated with cudaMalloc)
	Communication across blocks, and general on-GPU storage.
	Size: 1GB or so.   Speed: 100GB/s

texture memory
	Example: texture<float, 3> myTex;
	2D and 3D read-only images, with caching and interpolation.
	Size: (same as global memory)   Speed: 100GB/s

local memory
	Example: float arr[10]; (placed here by the compiler, not by you)
	Place for the compiler to spill variables.  Much slower than registers--do not use this.
	Size: (same as global memory)   Speed: 100GB/s

unified memory
	Example: cudaMallocManaged((void **)&ptr,sizeof(*ptr));
	Shared between CPU and GPU.  New in CUDA 6.  You still need the CPU to do memory allocation, file I/O, network comms, etc.
	Size: 8-32GB.   Speed: 2-4GB/s (PCIe)

host memory
	Example: cudaMemcpy(&device[0], &host[0], sizeof(float)*n, cudaMemcpyHostToDevice)
	Like unified memory, but with explicit copies.  This is a little more efficient, but more painful to use.
	Size: 8-32GB.   Speed: 2-4GB/s (PCIe)

network
	Example: skt_sendN(&buf[0],buflen);
	Communicate across a cluster.
	Size: (big).   Speed: 100MB/s (gigabit)
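
Of these, __shared__ memory is the one you manage most directly.  Here's a minimal sketch of its classic use, a block-level sum reduction, where __syncthreads() keeps all the threads in a block in step (hypothetical kernel name; launch with exactly 256 threads per block so buf matches blockDim.x):

// Sketch: each block sums 256 input values into one output value.
__global__ void block_sum(const float *in,float *blockTotals) {
	__shared__ float buf[256];                   // one float per thread in this block
	int tid=threadIdx.x;
	buf[tid]=in[blockIdx.x*blockDim.x + tid];    // every thread loads one value
	__syncthreads();                             // wait until buf is fully written

	// Tree reduction: halve the number of active threads each step.
	for (int step=blockDim.x/2; step>0; step/=2) {
		if (tid<step) buf[tid]+=buf[tid+step];
		__syncthreads();                     // everyone must finish this step
	}
	if (tid==0) blockTotals[blockIdx.x]=buf[0];  // one partial sum per block
}

// Launch (assuming the input length n is a multiple of 256):
//   block_sum<<<n/256, 256>>>(devIn, devTotals);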


Note the large performance variation between top and bottom.  This means you can drastically improve performance by moving data closer to arithmetic.  For example, instead of storing most data on the host, leave everything in GPU global memory between kernel calls.  
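
For instance, a rough sketch of that pattern (hypothetical kernels, usual #include <cuda.h> boilerplate assumed): two kernels run back to back on the same device array, and only the final answer crosses the PCIe bus.

// Sketch: data stays resident in GPU global memory between kernel calls.
__global__ void fill_data(float *d,int n)   { int i=threadIdx.x+blockIdx.x*blockDim.x; if (i<n) d[i]=i; }
__global__ void square_data(float *d,int n) { int i=threadIdx.x+blockIdx.x*blockDim.x; if (i<n) d[i]*=d[i]; }

void run_pipeline(int n) {
	float *d=0;
	cudaMalloc((void **)&d, n*sizeof(float));  // array lives on the GPU
	fill_data<<<(n+255)/256, 256>>>(d,n);      // first kernel writes it
	square_data<<<(n+255)/256, 256>>>(d,n);    // second kernel updates it in place: no copy in between
	float last=0;                              // only the one value we need comes back to the CPU
	cudaMemcpy(&last, &d[n-1], sizeof(float), cudaMemcpyDeviceToHost);
	cudaFree(d);
}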

CUDA Syntax, with Unified Memory

Since CUDA 6.0, "unified" memory allocated with cudaMallocManaged(&ptr,nBytes); is shared between the CPU and the GPU.  (Before this, you had to manually cudaMemcpy data back and forth between normal CPU-side buffers and special cudaMalloc GPU buffers, which was a real pain.)  Now, as long as you use cudaMallocManaged to allocate your shared data structures, and call cudaDeviceSynchronize to avoid race conditions while the GPU is accessing shared data, the CPU and GPU can share data, including complicated interlinked data structures.

#include <iostream>
#include <cuda.h>

/* GPU kernel: set an array to a value */
__global__ void set_array(float *vals,float param) {
	int i=threadIdx.x + blockIdx.x*blockDim.x; // <- my thread index
	vals[i]=i+param;
}

int main(int argc,char *argv[]) 
{
// Allocate space shared between CPU and GPU
	int n=16; // total number of floats
	float *vals; // shared array of n values 
	cudaMallocManaged( &vals, n*sizeof(float) ); 

// Run "GPU kernel" on shared space
	int nBlocks=1; // GPU thread blocks to run
	int blockDim=n; // threads/block, should be 256 for best performance
	set_array<<<nBlocks,blockDim>>>(vals,0.1234); /* run kernel on GPU */ 

	cudaDeviceSynchronize(); /* Wait for kernel to finish filling vals array */

// Show results
	int i=7;
	std::cout<<"vals["<<i<<"] = "<<vals[i]<<"\n";
	return 0;
}

(Try this in NetRun now!)

The example above uses an array of floats.  You can actually share classes between CPU and GPU, as long as all the memory for all the parts of the class is allocated using cudaMallocManaged, which we can get by overloading operator new and delete.

#include <iostream>
#include <cuda.h>

/**
  Allocate this class in CPU/GPU unified memory.  
  Inherit to always be unified.
*/
class Unified {
public:
/** Allocate instances in CPU/GPU unified memory */
  void *operator new(size_t len) {
	void *ptr;
	cudaMallocManaged(&ptr, len);
	return ptr;
  }
  void operator delete(void *ptr) {
	cudaFree(ptr);
  }

/** Allocate all arrays in CPU/GPU unified memory */
  void* operator new[] (std::size_t size) {
	void *ptr; 
	cudaMallocManaged(&ptr,size);
	return ptr;
  }
  void operator delete[] (void* ptr) {
	cudaFree(ptr);
  }
};


// The application would be built with Unified classes,
//   which are accessible from either CPU or GPU.
class widget : public Unified 
{
public:
	float value;

	/*
	This method is meant to run on the GPU (__device__)
	By default methods run on the CPU (__host__)
	*/
	__device__ void setValue(float v) { value=v; }
};

/* GPU kernel: set an array of widgets to a value */
__global__ void set_array(widget *w,float param) {
	int i=threadIdx.x + blockIdx.x*blockDim.x; // <- my thread index
	w[i].setValue(i+param);
}

int main(int argc,char *argv[]) 
{
// Allocate space shared between CPU and GPU
	int n=16; // total number of widgets
	widget *w=new widget[n]; // shared array of n values (overloaded new[])

// Run "GPU kernel" on shared space
	int nBlocks=1; // GPU thread blocks to run
	int blockDim=n; // threads/block, should be 256 for best performance
	set_array<<<nBlocks,blockDim>>>(w,0.1234); /* run kernel on GPU */ 

	cudaDeviceSynchronize(); /* Wait for kernel to finish filling the widget array */

// Show results
	int i=7;
	std::cout<<"widget["<<i<<"] = "<<w[i].value<<"\n";
	return 0;
}

(Try this in NetRun now!)

For a real program, running 16 array elements will never be faster than running on the CPU--it's too small to overcome the memory copying and kernel startup latency.  Typically you'd run a kernel over a large array using thread blocks of 256 threads each, like this:

// Allocate space shared between CPU and GPU
	int n=4*1024*1024; // total number of widgets
	widget *w=new widget[n]; // shared array of n values (overloaded new[])

// Run "GPU kernel" on shared space
	int blockDim=256; // threads/block, should be 256 for best performance
	int nBlocks=n/blockDim; // GPU thread blocks to run
	set_array<<<nBlocks,blockDim>>>(w,0.1234); /* run kernel on GPU */ 

(Try this in NetRun now!)

If you time the various parts of this code, you find:

Startup: 112.644 ms
Memory Allocation: 3.90291 ms
Kernel: 2.94685 ms
widget[7] = 7.1234
Readback: 0.00691414 ms

Note that the kernel fills 16MB of floats (4 million widgets) in about 3 milliseconds, a rate of roughly 5 gigabytes per second.  This is actually not very good performance, due to the overhead of accessing unified memory across the PCI-Express bus.

Switching to GPU-only cudaMalloc'd device global memory, as shown below, produces a memory write speed of 72 gigabytes per second!

CUDA with Explicit Copies

Here's a simple CUDA program using explicit cudaMalloc and cudaMemcpy commands, rather than the simpler unified memory we used above:

#include <iostream>
#include <cuda.h>

/* GPU code: set an array to a value */
__global__ void set_array(float *vals,float param) {
	int i=threadIdx.x; /* find my index */
	vals[i]=i+param;
}

/* CPU code: memory movement and kernel calls */
int main(int argc,char *argv[]) {
	int n=16; /* total number of floats */
	float *vals; /* device array of n values */
	cudaMalloc( (void**) &vals, n*sizeof(float) ); //Allocate GPU space

	set_array<<<1,n>>>(vals,0.1234); /* Initialize the space on the GPU */

	/* Copy a few elements back to CPU for printing */
	int i=7;
	float f=-999.0; /* CPU copy of value */
	cudaMemcpy(&f,&vals[i],sizeof(float),cudaMemcpyDeviceToHost);
	std::cout<<"vals["<<i<<"] = "<<f<<"\n";
	return 0;
}

(Try this in NetRun now!)

Basically, we:

	1. Allocate GPU storage with cudaMalloc.
	2. Run the kernel on that storage, using the <<<blocks, threads>>> launch syntax.
	3. Copy the results we need back to the CPU with cudaMemcpy.

In the version below, we actually check errors, and we allocate threads in 'blocks', which lets you use multiple blocks to work around the hardware's limit of 512 or 1024 threads per block:

#include <iostream>
#include <cuda.h>
/* error checking */
#define check(cudacall) { int err=cudacall; if (err!=cudaSuccess) std::cout<<"CUDA ERROR "<<err<<" at line "<<__LINE__<<"'s "<<#cudacall<<"\n";}

/* GPU code: set an array to a value */
__global__ void set_array(float *vals,float param) {
	int i=threadIdx.x+blockDim.x*blockIdx.x; /* find my index */
	vals[i]=i+param;
}
int main(int argc,char *argv[]) {
	int w=4, h=4; /* number of blocks, threads per block */
	int n=w*h; /* total number of floats */
	float *vals; /* device array of n values */
	check(cudaMalloc( (void**) &vals, n*sizeof(float) )); //Allocate some space

	set_array<<<w,h>>>(vals,0.1234); /* Initialize the space on the GPU */

	/* Copy a few elements back to CPU for printing */
	for (int i=0;i<n;i+=3) {
		float f=-999.0; /* CPU copy of value */
		check(cudaMemcpy(&f,&vals[i],sizeof(float),cudaMemcpyDeviceToHost));
		std::cout<<"vals["<<i<<"] = "<<f<<"\n";
	}
	return 0;
}

(Try this in NetRun now!)

 

CUDA Performance: Use Big Arrays!

One universal aspect of CUDA is that kernel calls (<<<), mallocs, and memcpy all take quite a long time to get running.  Once they're running, they're fairly fast, but for maximum efficiency you should plan on accessing about a megabyte each time!  Here's an example where I benchmark this length dependence:

#include <iostream>
#include <cuda.h>
#include "lib/inc.c"

float *dev_ptr=0, *host_ptr=0;
int len=0;

int time_memcpy(void) {
	cudaMemcpy(dev_ptr,host_ptr,len,cudaMemcpyHostToDevice);
	cudaThreadSynchronize();
	return 0;
}

__global__ void doGPUdatawrite(float *arr,float val) {
	int i=blockIdx.x*blockDim.x+threadIdx.x;
	arr[i]=val;
}
int time_datawrite(void) {
	int blocks=1, tpb=len;
	while (tpb>=512) {blocks*=2; tpb/=2;} // 256 threads/block target
	while (blocks>=65536) {blocks/=2; tpb*=2;} // stupid limit on size
	doGPUdatawrite<<<blocks,tpb>>>(dev_ptr,1.2345);
	cudaThreadSynchronize();
	return 0;
}

void time_sweep(const char *name,timeable_fn f) {
	int max=1024*1024*16;
	cudaMalloc((void **)&dev_ptr,max*sizeof(float));
	cudaMallocHost((void **)&host_ptr,max*sizeof(float));
	std::cout<<"Did mallocs\n";
	for (len=64;len<=max;len*=4) {
		double t=time_function(f);
		printf("%s size %d: %.2f GB/sec (%.1f us)\n",
			name,len,len*1.0e-9/t,t*1.0e6);
	}
	cudaFreeHost(host_ptr);
	cudaFree(dev_ptr);
}

int main(int argc,char *argv[]) {
	std::cout<<"Starting up...\n";
	time_sweep("memcpy",time_memcpy);
	time_sweep("GPU write",time_datawrite);
	return 0;
}

(Try this in NetRun now!)

Here's the output on NetRun's NVIDIA GeForce GTX 670:

memcpy size 64: 0.01 GB/sec (8.3 us)
memcpy size 256: 0.03 GB/sec (8.3 us)
memcpy size 1024: 0.12 GB/sec (8.5 us)
memcpy size 4096: 0.47 GB/sec (8.7 us)
memcpy size 16384: 1.45 GB/sec (11.3 us)
memcpy size 65536: 3.21 GB/sec (20.4 us) <- half of peak copy speed
memcpy size 262144: 5.29 GB/sec (49.5 us)
memcpy size 1048576: 6.28 GB/sec (166.9 us)
memcpy size 4194304: 6.60 GB/sec (635.9 us)
memcpy size 16777216: 6.67 GB/sec (2513.8 us)
GPU write size 64: 0.01 GB/sec (9.8 us)
GPU write size 256: 0.03 GB/sec (9.8 us)
GPU write size 1024: 0.10 GB/sec (9.8 us)
GPU write size 4096: 0.42 GB/sec (9.8 us)
GPU write size 16384: 1.64 GB/sec (10.0 us)
GPU write size 65536: 6.03 GB/sec (10.9 us)
GPU write size 262144: 17.32 GB/sec (15.1 us) <- half of peak write speed
GPU write size 1048576: 32.76 GB/sec (32.0 us)
GPU write size 4194304: 41.38 GB/sec (101.4 us)
GPU write size 16777216: 39.99 GB/sec (419.5 us)

Note that really big arrays, with millions of floats, deliver way better performance than smaller arrays.  The basic trouble is that going through the OS, across the PCIe bus, into the GPU, and back takes about 10 microseconds (10us = 10,000ns).  If you go through all that just to move one or two floats, or even a few thousand, you'll get terrible performance.

CUDA Application Performance: Mandelbrot Rendering

Here's a little Mandelbrot set rendering application in CUDA.  It includes benchmarking code, which shows some surprising results.

#include <cuda.h>
#include <iostream>
#include <fstream>
#include <vector>
#include "lib/inc.c" /* NetRun utility functions, like time_in_seconds */

#define check(cudacall) { int err=cudacall; if (err!=cudaSuccess) std::cout<<"CUDA ERROR "<<err<<" at line "<<__LINE__<<"'s "<<#cudacall<<"\n";}

/* GPU Code! */
__global__ void fill_in_array(float *arr,int wid) {
	int i=threadIdx.x+blockDim.x*blockIdx.x; // my thread's global number
	int x=i%wid, y=i/wid; // my thread's pixel to work on

	float cr=x*1.0/wid, ci=y*1.0/wid;
	float zr=cr, zi=ci;
	int count;
	const int max_count=256;
	for (count=0;count<max_count;count++) {
		// z = z*z+c
		float nzr=zr*zr-zi*zi + cr;
		float nzi=2.0*zr*zi + ci;
		if ((nzr*nzr+nzi*nzi)>4.0) break;
		zr=nzr; zi=nzi;
	}
	arr[y*wid+x]=count;
}

/* Run on CPU */
int main(int argc,char *argv[]) {
	int wid=512,ht=512;
	float *arr=0; /* LIVES ON THE GPU!!!! */
	double start=time_in_seconds(), elapsed;
	check(cudaMalloc((void **)&arr, wid*ht*sizeof(float)));
	elapsed=time_in_seconds()-start;
	std::cout<<"Allocated array: "<<elapsed*1.0e9/(wid*ht)<<"ns/pixel, time "<<elapsed<<" seconds\n";

	start=time_in_seconds(); // <- reset the timer!
	int threadsPerBlock=512;
	int nBlocks=wid*ht/threadsPerBlock;
	fill_in_array<<<nBlocks,threadsPerBlock>>>(arr,wid);

	check(cudaThreadSynchronize()); // look for errors in kernel
	elapsed=time_in_seconds()-start;
	std::cout<<"Rendered array: "<<elapsed*1.0e9/(wid*ht)<<"ns/pixel, time "<<elapsed<<" seconds\n";

	std::vector<float> harr(wid*ht); /* CPU-side copy (avoids a huge array on the stack) */
	check(cudaMemcpy(&harr[0],arr,wid*ht*sizeof(float),cudaMemcpyDeviceToHost));
	elapsed=time_in_seconds()-start;
	std::cout<<"Copied out array: "<<elapsed*1.0e9/(wid*ht)<<"ns/pixel, time "<<elapsed<<" seconds\n";

	std::ofstream of("out.ppm",std::ios_base::binary);
	of<<"P5\n"; // greyscale, binary
	of<<wid<<" "<<ht<<"\n"; // image size
	of<<"255\n"; // byte image
	for (int i=0;i<wid*ht;i++) {
		char c=(char)harr[i];
		of<<c;
	}
	return 0;
}

(Try this in NetRun now!)

The surprising part about this is the timing breakdown:

Allocated array: 152.6ns/pixel, time 0.040 seconds
Rendered array: 3.4ns/pixel, time 0.001 seconds
Copied out array: 10.6ns/pixel, time 0.003 seconds

The biggest surprise is the cudaMalloc time, which is ridiculously huge because this is the first CUDA call in the program, so the driver has to locate and set up the graphics card.   Adding a "dummy" cudaMalloc of four bytes reduces the allocation time to less than a nanosecond per pixel.
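
A sketch of that warm-up trick (the exact first-call cost varies by driver and card):

	// Warm up the CUDA driver before any timed allocations (sketch):
	void *junk=0;
	cudaMalloc(&junk,4);   // the first CUDA call pays the driver/GPU setup cost
	cudaFree(junk);        // we never actually use the 4-byte buffer
	// ...now the cudaMalloc of the real array measures only allocation time.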

Second, it takes longer to copy the data off the card than it does to compute it!  Switching to a "char" data type cuts both the copy-out time (due to less data copied) and the rendering time (due to less data written).

(Try this in NetRun now!)

Allocated array: 0.02ns/pixel, time 5.11634e-06 seconds
Rendered array: 2.74ns/pixel, time 0.000719074 seconds
Copied out array: 4.63ns/pixel, time 0.00121307 seconds
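
Here's roughly what that "char" change looks like, as a sketch (the actual NetRun version may differ in details): the kernel writes one unsigned char per pixel, so both the global-memory writes and the copy back move a quarter of the bytes.

/* Sketch: byte-per-pixel version of the Mandelbrot kernel */
__global__ void fill_in_array_char(unsigned char *arr,int wid) {
	int i=threadIdx.x+blockDim.x*blockIdx.x; // my thread's global number
	int x=i%wid, y=i/wid; // my thread's pixel to work on

	float cr=x*1.0/wid, ci=y*1.0/wid;
	float zr=cr, zi=ci;
	int count;
	const int max_count=255; // fits in one byte
	for (count=0;count<max_count;count++) {
		float nzr=zr*zr-zi*zi + cr;
		float nzi=2.0*zr*zi + ci;
		if ((nzr*nzr+nzi*nzi)>4.0) break;
		zr=nzr; zi=nzi;
	}
	arr[y*wid+x]=(unsigned char)count; // 1 byte written instead of 4
}

// On the CPU side, allocate and copy wid*ht*sizeof(unsigned char) bytes instead of floats.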

CUDA and Thrust

A nice library for more complex operations is called Thrust, sort of like STL or Boost for the GPU.  As of CUDA 4.0, Thrust is an official part of the CUDA distribution, and it's preinstalled on NetRun.

Here's how to use thrust::reduce, which can add up all the values in an array, or find their maximum or minimum.  The 0 argument is the starting value for the reduction (the additive identity), and thrust::plus is the operation used to combine values.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdlib>

int main(void)
{
  // generate some random data on the host
	thrust::host_vector<int> h_vec(100000);
	for (unsigned int i=0;i<h_vec.size();i++) h_vec[i]=rand()%10;

  // transfer to device
	thrust::device_vector<int> d_vec = h_vec;

  // sum on device
	int final_sum = thrust::reduce(d_vec.begin(), d_vec.end(), 
		0, thrust::plus<int>());

	std::cout<<"Final sum="<<final_sum<<"\n";

	return 0;
}

(Try this in NetRun now!)

Here's how to use thrust::reduce to find the biggest and smallest elements in the array too. 

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdlib>

int main(void)
{
	// generate some random data on the host
	thrust::host_vector<int> h_vec(10);
	for (unsigned int i=0;i<h_vec.size();i++) h_vec[i]=rand()%10;

	// transfer to device
	thrust::device_vector<int> d_vec = h_vec;

	// sum, max, and min on device (three separate reduction passes)
	int final_sum = thrust::reduce(d_vec.begin(), d_vec.end(),
		0, thrust::plus<int>());
	int final_max = thrust::reduce(d_vec.begin(), d_vec.end(),
		0, thrust::maximum<int>());
	int final_min = thrust::reduce(d_vec.begin(), d_vec.end(),
		999, thrust::minimum<int>());

	std::cout<<"Final sum="<<final_sum<<" max="<<final_max<<" min="<<final_min<<"\n";

	return 0;
}

(Try this in NetRun now!)

It's not very efficient (see below) to call thrust::reduce three times on the same vector.  It's more efficient to call it once, and collect up the sum, min, and max all in one go.  To do this, we need to write a weird 'functor' to pass to thrust::reduce.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdlib>

// This is used to store everything we know about the ints seen so far:
class sum_min_max {
public:
	int sum, min, max;
	sum_min_max() {sum=0; min=1000000000; max=-1000000000;}
	__device__ __host__ sum_min_max(int value) {sum=value; min=value; max=value;}
};

// This 'functor' function object combines two sum_min_max objects
class smm_combiner {
public:
	__device__ __host__
	sum_min_max operator()(sum_min_max l,const sum_min_max &r) {
		l.sum+=r.sum;
		if (l.min>r.min) l.min=r.min;
		if (l.max<r.max) l.max=r.max;
		return l;
	}
};

int main(void)
{
	// generate some random data on the host
	thrust::host_vector<int> h_vec(10);
	for (unsigned int i=0;i<h_vec.size();i++) h_vec[i]=rand()%10;

	// transfer to device
	thrust::device_vector<int> d_vec = h_vec;

	// sum/min/max on device
	sum_min_max final = thrust::reduce(d_vec.begin(), d_vec.end(),
		sum_min_max(), smm_combiner());

	std::cout<<"Final sum="<<final.sum<<" max="<<final.max<<" min="<<final.min<<"\n";

	return 0;
}

(Try this in NetRun now!)

This same idea could probably be better written as a thrust::tuple.
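
For example, here's a sketch of that tuple version (untested, using thrust::transform_reduce): each int becomes a (sum, min, max) tuple, and a single reduction pass combines them.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/tuple.h>
#include <climits>
#include <cstdlib>
#include <iostream>

typedef thrust::tuple<int,int,int> smm; // (sum, min, max)

// Turn one input value into a one-element summary.
struct make_smm {
	__host__ __device__ smm operator()(int v) const {
		return thrust::make_tuple(v,v,v);
	}
};

// Combine two summaries into one.
struct combine_smm {
	__host__ __device__ smm operator()(const smm &a,const smm &b) const {
		int mn = thrust::get<1>(a)<thrust::get<1>(b) ? thrust::get<1>(a) : thrust::get<1>(b);
		int mx = thrust::get<2>(a)>thrust::get<2>(b) ? thrust::get<2>(a) : thrust::get<2>(b);
		return thrust::make_tuple(thrust::get<0>(a)+thrust::get<0>(b), mn, mx);
	}
};

int main(void)
{
	thrust::host_vector<int> h_vec(10);
	for (unsigned int i=0;i<h_vec.size();i++) h_vec[i]=rand()%10;
	thrust::device_vector<int> d_vec = h_vec;

	smm final = thrust::transform_reduce(d_vec.begin(), d_vec.end(),
		make_smm(), thrust::make_tuple(0,INT_MAX,INT_MIN), combine_smm());

	std::cout<<"Final sum="<<thrust::get<0>(final)
		<<" max="<<thrust::get<2>(final)
		<<" min="<<thrust::get<1>(final)<<"\n";
	return 0;
}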

Here's how to use thrust::sort.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cstdlib>
#include "lib/inc.c" /* for netrun timing functions */

int main(void)
{
	// generate some random data on the host
	thrust::host_vector<int> h_vec(10);
	for (unsigned int i=0;i<h_vec.size();i++) h_vec[i]=rand();

	// transfer to device
	thrust::device_vector<int> d_vec = h_vec;

	// sort on device
	thrust::sort(d_vec.begin(), d_vec.end());

	// copy back, and print
	h_vec = d_vec;
	for (unsigned int i=0;i<h_vec.size();i++)
		std::cout<<"val["<<i<<"] = "<<h_vec[i]<<"\n";

	return 0;
}

(Try this in NetRun now!)


Here's a version of thrust::sort where we time the sort for various data sizes. 

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>
#include "lib/inc.c" /* for netrun timing functions */

thrust::device_vector<int> d_vec;
int sort_device_vector(void) {
	// sort on device
	thrust::sort(d_vec.begin(), d_vec.end());
	return 0;
}

int main(void)
{
	for (int vec_size=16;vec_size<=4*1024*1024;vec_size*=4)
	{
		// generate random data on the host
		thrust::host_vector<int> h_vec(vec_size);
		for (unsigned int i=0;i<h_vec.size();i++) h_vec[i]=rand();

		// transfer to device
		d_vec = h_vec;

		double time=time_function(sort_device_vector);
		printf("Time for sort of size %d: %.3f ns/elt (%.3f ms)\n",
			(int)h_vec.size(), time/h_vec.size()*1.0e9, time*1.0e3);

		// copy back and print
		h_vec = d_vec;

		if (vec_size<=16)
			for (unsigned int i=0;i<h_vec.size();i++)
				std::cout<<"val["<<i<<"] = "<<h_vec[i]<<"\n";
	}

	return 0;
}

(Try this in NetRun now!)

This outputs:

Time for sort of size 16: 21099.579 ns/elt (0.338 ms)
Time for sort of size 64: 5302.252 ns/elt (0.339 ms)
Time for sort of size 256: 1377.077 ns/elt (0.353 ms)
Time for sort of size 1024: 312.655 ns/elt (0.320 ms)
Time for sort of size 4096: 78.331 ns/elt (0.321 ms)
Time for sort of size 16384: 20.922 ns/elt (0.343 ms)
Time for sort of size 65536: 8.636 ns/elt (0.566 ms)
Time for sort of size 262144: 5.332 ns/elt (1.398 ms)
Time for sort of size 1048576: 3.527 ns/elt (3.699 ms)
Time for sort of size 4194304: 3.027 ns/elt (12.694 ms)

Note how sorting 16 elements and 16 thousand elements take nearly the same amount of time.  This is common on the GPU: the startup time to get a kernel running (latency) is measured in microseconds, not nanoseconds like a function call, while the throughput (bandwidth) is huge, hundreds of gigabytes per second.  This means you need to do work in huge batches, millions of data items at a time, to get good efficiency.  Latency matters!
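
To put rough numbers on that, using the measurements above: each sort call has a fixed overhead of roughly 0.33 ms (visible at the small sizes), plus about 3 ns per element of actual sorting work (visible at the large sizes).  For the fixed cost to drop below 10% of the total, you need roughly 0.33 ms / (0.1 * 3 ns) ≈ 1 million elements, which is why the small sorts above are so inefficient per element.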

 


CS 441 Lecture Note, 2014, Dr. Orion Lawlor, UAF Computer Science Department.