CUDA PTX: GPU assembly language

CS 641 Lecture, Dr. Lawlor

CUDA's underlying quasi-assembly language is called PTX.  The NVIDIA PTX documentation is the official source, but reading the output from NetRun's "Disassemble" command is pretty illuminating too.  If you actually want to write raw PTX for some reason, Kenneth details the commands to use.
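
If you just want to see the PTX your own code turns into, "nvcc -ptx foo.cu" emits the generated PTX as a text file. You can also embed short stretches of raw PTX inside an ordinary kernel with nvcc's inline-asm syntax; here's a minimal sketch (the kernel name and the "multiply by 2" computation are invented for illustration):

/* Sketch: inline PTX inside a normal CUDA kernel. */
__global__ void set_array_inline(float *vals,float param) {
	int i=threadIdx.x+blockDim.x*blockIdx.x;
	float doubled;
	/* %0 is the output, %1 and %2 are inputs; constraint "f" means a .f32 register */
	asm("mul.f32 %0, %1, %2;" : "=f"(doubled) : "f"(param), "f"(2.0f));
	vals[i]=doubled;
}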

There are some good pictures, details, and performance numbers in this 2009 IEEE paper (the full-text PDF is only accessible from on campus, so download it while you're there).
For example, this CUDA code:
#include <iostream>
#include <cuda.h>

/* error checking */
#define check(cudacall) { int err=cudacall; if (err!=cudaSuccess) std::cout<<"CUDA ERROR "<<err<<" at line "<<__LINE__<<"'s "<<#cudacall<<"\n";}

/* GPU code: set an array to a value */
__global__ void set_array(float *vals,float param) {
	int i=threadIdx.x+blockDim.x*blockIdx.x;
	vals[i]=param;
}

int main(int argc,char *argv[]) {
	int w=128, h=2; /* number of threads per block, number of blocks */
	int n=w*h; /* total number of floats */
	float *vals; /* device array of n values */
	check(cudaMalloc( (void**) &vals, n*sizeof(float) )); // Allocate some space

	set_array<<<h,w>>>(vals,1.234); /* Initialize the space on the GPU */

	/* Copy selected elements back to CPU for printing */
	for (int i=0;i<n;i+=5) {
		float f=-999.0; /* CPU copy of value */
		check(cudaMemcpy(&f,&vals[i],sizeof(float),cudaMemcpyDeviceToHost));
		std::cout<<"vals["<<i<<"] = "<<f<<"\n";
	}
	return 0;
}

(Try this in NetRun now!)

Results in this PTX assembly:
.entry _Z9set_arrayPff (
	.param .u32 __cudaparm__Z9set_arrayPff_vals,
	.param .f32 __cudaparm__Z9set_arrayPff_param)
{
	.reg .u16 %rh<4>;
	.reg .u32 %r<8>;
	.reg .f32 %f<3>;
	.loc 28 9 0
	// 8 /* GPU code: set an array to a value */
	// 9 __global__ void set_array(float *vals,float param) {
$LBB1__Z9set_arrayPff:
	.loc 28 11 0
	// 10 int i=threadIdx.x+blockDim.x*blockIdx.x;
	// 11 vals[i]=param;
	ld.param.f32 %f1, [__cudaparm__Z9set_arrayPff_param];
	ld.param.u32 %r1, [__cudaparm__Z9set_arrayPff_vals];
	cvt.u32.u16 %r2, %tid.x;
	mov.u16 %rh1, %ctaid.x;
	mov.u16 %rh2, %ntid.x;
	mul.wide.u16 %r3, %rh1, %rh2;
	add.u32 %r4, %r2, %r3;
	mul.lo.u32 %r5, %r4, 4;
	add.u32 %r6, %r1, %r5;
	st.global.f32 [%r6+0], %f1;
	.loc 28 12 0
	// 12 }
	exit;
$LDWend__Z9set_arrayPff:
} // _Z9set_arrayPff
Briefly: the kernel loads its two parameters from "param" space, builds the global thread index from the special registers %tid.x (threadIdx.x), %ntid.x (blockDim.x), and %ctaid.x (blockIdx.x), multiplies that index by 4 to turn it into a byte offset, adds the offset to the array's base pointer, and finally stores the parameter to global memory with st.global.f32.
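
In C terms, the kernel body boils down to roughly this (a sketch mirroring the PTX above, with invented variable names):

/* Rough C equivalent of the PTX body; names are made up for clarity */
unsigned int tid    = threadIdx.x;             /* cvt.u32.u16 %r2, %tid.x           */
unsigned int block  = blockIdx.x;              /* mov.u16 %rh1, %ctaid.x            */
unsigned int perblk = blockDim.x;              /* mov.u16 %rh2, %ntid.x             */
unsigned int i      = tid + block*perblk;      /* mul.wide.u16 + add.u32            */
unsigned int bytes  = i*4;                     /* mul.lo.u32: float index -> bytes  */
float *addr = (float *)((char *)vals + bytes); /* add.u32: base pointer + offset    */
*addr = param;                                 /* st.global.f32 [%r6+0], %f1        */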
Several interesting architectural details crop up in the CUDA documentation:

Branching with Predication

PTX conditional branches are implemented using predication.  You declare a .pred register, use a "setp" (set predicate) instruction to compare two values and set the predicate register, and then the @ prefix predicates any instruction, such as a branch, on the result of that comparison.  For example:
__global__ void set_array(float *vals,float param) {
	int i=threadIdx.x+blockDim.x*blockIdx.x;
	if (param<3.0f) vals[i]=param;
}

(Try this in NetRun now!)

.entry _Z9set_arrayPff (
	.param .u32 __cudaparm__Z9set_arrayPff_vals,
	.param .f32 __cudaparm__Z9set_arrayPff_param)
{
	.reg .u16 %rh<4>;
	.reg .u32 %r<8>;
	.reg .f32 %f<4>;
	.reg .pred %p<3>;
	// 9 __global__ void set_array(float *vals,float param) {
$LBB1__Z9set_arrayPff:
	ld.param.f32 %f1, [__cudaparm__Z9set_arrayPff_param];
	mov.f32 %f2, 0f40400000; // 3
	setp.lt.f32 %p1, %f1, %f2;
	@!%p1 bra $Lt_0_1026;
	// 10 int i=threadIdx.x+blockDim.x*blockIdx.x;
	// 11 if (param<3.0f) vals[i]=param;
	ld.param.u32 %r1, [__cudaparm__Z9set_arrayPff_vals];
	cvt.u32.u16 %r2, %tid.x;
	mov.u16 %rh1, %ctaid.x;
	mov.u16 %rh2, %ntid.x;
	mul.wide.u16 %r3, %rh1, %rh2;
	add.u32 %r4, %r2, %r3;
	mul.lo.u32 %r5, %r4, 4;
	add.u32 %r6, %r1, %r5;
	ld.param.f32 %f1, [__cudaparm__Z9set_arrayPff_param];
	st.global.f32 [%r6+0], %f1;
$Lt_0_1026:
	exit;
$LDWend__Z9set_arrayPff:
} // _Z9set_arrayPff
Here the predicated branch (@!%p1 bra) skips over all the index calculation and the global memory store whenever the comparison comes out false.

Predicate registers can also be inputs to and, or, xor, not, and mov instructions, which is how compound or nested branch conditions get computed.
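
For example, a compound condition like "if (a<b && c<d)" could turn into PTX along these lines (a hand-written sketch, not compiler output; the register names are invented):

	setp.lt.f32 %p1, %fa, %fb;     // %p1 = (a < b)
	setp.lt.f32 %p2, %fc, %fd;     // %p2 = (c < d)
	and.pred %p3, %p1, %p2;        // %p3 = %p1 && %p2
	@!%p3 bra $L_skip;             // skip the body unless both conditions hold
	st.global.f32 [%r6+0], %f1;    // body of the if: runs only when %p3 is true
$L_skip: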