OpenGL ARB_fragment_program Quick Reference ("Cheat Sheet")

Instruction    Inputs  Output	Description
-----------    ------  ------	--------------------------------
ABS	       v       v	absolute value (fabs(A))
ADD	       v,v     v	add (A + B)
CMP	       v,v,v   v	compare (A<0?B:C)
COS	       s       ssss	cosine with reduction to [-PI,PI]
DP3	       v,v     ssss	3-component dot product
DP4	       v,v     ssss	4-component dot product
DPH	       v,v     ssss	homogeneous dot product (DP3(A,B)+B.w)
DST	       v,v     v	distance vector 
EX2	       s       ssss	exponential base 2 (pow(2,A))
FLR	       v       v	floor (floor(A))
FRC	       v       v	fraction (A-floor(A))
KIL	       v       v	kill fragment (if (A<0) exit())
LG2	       s       ssss	logarithm base 2 (log(A)/log(2))
LIT	       v       v	compute light coefficients
LRP	       v,v,v   v	linear interpolation (A*B+(1-A)*C)
MAD	       v,v,v   v	multiply and add (A*B+C)
MAX	       v,v     v	maximum (max(A,B))
MIN	       v,v     v	minimum (min(A,B))
MOV	       v       v	move
MUL	       v,v     v	multiply (A * B)
POW	       s,s     ssss	scalar power function (pow(A,B))
RCP	       s       ssss	reciprocal (1.0/A)
RSQ	       s       ssss	reciprocal square root (1.0/sqrt(A))
SCS	       s       ss--	sine/cosine without reduction (x=cos(A), y=sin(A))
SIN	       s       ssss	sine with reduction to [-PI,PI]
SGE	       v,v     v	set on greater than or equal (A>=B)
SLT	       v,v     v	set on less than (A<B)
SUB	       v,v     v	subtract (A-B)
SWZ	       v       v	extended swizzle
TEX	       v,u,t   v	texture sample
TXB	       v,u,t   v	texture sample with mipmap LOD bias
TXP	       v,u,t   v	texture sample with perspective divide
XPD	       v,v     v	3D vector cross product (A x B)

Table X.5: Summary of fragment program instructions. "v" indicates a floating-point vector input or output, "s" indicates a floating-point scalar input, "ssss" indicates a scalar output replicated across a 4-component result vector, "ss--" indicates two scalar outputs in the first two components, "u" indicates a texture image unit identifier (e.g., "texture[3]" for texture unit 3, GL_TEXTURE3_ARB), and "t" indicates a texture target (e.g., "2D" for a 2D texture).
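
To make the table's shorthand concrete, here is a small CPU-side sketch of a few of the instructions. It is only an illustration of the formulas in the Description column, not how any driver implements them; the Python function names (splat, dph, cmp_, lrp, sge) are my own and appear nowhere in the standard.

```python
def splat(s):
    """Replicate a scalar across all four components (the "ssss" output form)."""
    return [s, s, s, s]

def dph(a, b):
    # Homogeneous dot product: A.xyz . B.xyz + B.w (A.w is ignored).
    return splat(a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + b[3])

def cmp_(a, b, c):
    # CMP, per component: A < 0 ? B : C
    return [bi if ai < 0 else ci for ai, bi, ci in zip(a, b, c)]

def lrp(a, b, c):
    # LRP, per component: A*B + (1-A)*C
    return [ai * bi + (1 - ai) * ci for ai, bi, ci in zip(a, b, c)]

def sge(a, b):
    # SGE, per component: 1.0 where A >= B, otherwise 0.0
    return [1.0 if ai >= bi else 0.0 for ai, bi in zip(a, b)]
```

For example, dph([1, 2, 3, 99], [4, 5, 6, 7]) fills all four components with 4 + 10 + 18 + 7 = 39, and cmp_ picks B only where A is strictly negative.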

Many funny-named operations are fast or totally free (they do not slow down the program at all). These look like:

Swizzle: arbitrarily rearrange the order of input vector components (may cost a clock or two)	ADD a,b, c.yxwz;
Negate: flip the sign of an input (free)	ADD a,b, -c;
Saturate: clamp output values to lie between 0 and 1 (free)	ADD_SAT a,b,c;
Writemask: only change certain components of the output vector (free)	ADD a.xz, b,c;
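
The four modifiers above can be combined on one instruction. The sketch below emulates them on the CPU with plain Python lists, assuming 4-component vectors in x,y,z,w order; the helper names (swizzle, negate, saturate, writemask, add) are hypothetical and only for illustration.

```python
COMPONENT = {"x": 0, "y": 1, "z": 2, "w": 3}

def swizzle(v, pattern):
    # c.yxwz -> [c.y, c.x, c.w, c.z]
    return [v[COMPONENT[ch]] for ch in pattern]

def negate(v):
    return [-c for c in v]

def saturate(v):
    # Clamp every component to [0, 1], as the _SAT suffix does.
    return [min(max(c, 0.0), 1.0) for c in v]

def writemask(dest, result, mask):
    # Only components named in the mask are overwritten.
    return [r if ch in mask else d for d, r, ch in zip(dest, result, "xyzw")]

def add(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

# Emulates: ADD_SAT a.xz, b, -c.yxwz;
a = [9.0, 9.0, 9.0, 9.0]
b = [0.5, 0.5, 0.5, 0.5]
c = [0.25, -1.0, 2.0, 0.0]
a = writemask(a, saturate(add(b, negate(swizzle(c, "yxwz")))), "xz")
```

After this runs, a is [1.0, 9.0, 0.5, 9.0]: x and z hold the clamped sums, while y and w keep their old values because the writemask excludes them.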

The official standard is ARB_fragment_program, the last half of which is somewhat readable. Table X.5 is reproduced from this standard.

I can't find anything on the net that lists even an approximation of the basic time per pixel for the various instructions, so I'm posting what I've measured myself using this code. Beware measurement error! Check real benchmark sites for absolute performance, which scales linearly with clock rate and number of pipelines anyway (and hence nonlinearly with card cost). The relative performance of the various instructions, however, seems roughly the same on all ATI (9550, 9600, Mobility 9600, x300) or nVidia (5200, 5600, 6800) cards of a given generation, although as mentioned the absolute scale does vary with card price.

Rough Speed Estimates (ATI Cards)

Opcode  Time/pixel     Maximum Count     Slow options (Swizzle, saturate, negate, writemask)
  ABS:	-0.00 ns 	 128 times	 1.00 ns Swizzle	 0.46 ns Saturate	
  ADD:	 0.47 ns 	  64 times	 2.05 ns Swizzle	
  CMP:	 0.46 ns 	  64 times	 2.21 ns Swizzle	
  COS:	 5.34 ns 	   6 times	
  DP3:	 0.46 ns 	  64 times	 2.05 ns Swizzle	
  DP4:	 0.46 ns 	  64 times	 2.64 ns Swizzle	
  DPH:	 0.46 ns 	  64 times	 2.67 ns Swizzle	
  DST:	 0.49 ns 	  62 times	 1.46 ns Saturate	 0.43 ns Writemask	
  EX2:	 0.47 ns 	  64 times	
  FLR:	 0.97 ns 	  32 times	 2.43 ns Swizzle	
  FRC:	 0.46 ns 	  64 times	 1.94 ns Swizzle	
  LG2:	 0.46 ns 	  64 times	
  LIT:	-0.18 ns 	 128 times	 4.37 ns Swizzle	 0.00 ns Negate	
  LRP:	 0.46 ns 	  64 times	 2.25 ns Swizzle	 0.97 ns Negate	
  MAD:	 0.46 ns 	  64 times	 2.21 ns Swizzle	
  MAX:	 0.46 ns 	  64 times	 2.05 ns Swizzle	
  MIN:	 0.46 ns 	  64 times	 2.05 ns Swizzle	
  MOV:	-0.00 ns 	 128 times	 0.46 ns Saturate	
  MUL:	 0.47 ns 	  64 times	 2.05 ns Swizzle	
  POW:	 1.46 ns 	  21 times	
  RCP:	 0.46 ns 	  64 times	
  RSQ:	 0.46 ns 	  64 times	
  SCS:	 3.40 ns 	   9 times	 4.30 ns Saturate	
  SGE:	 0.97 ns 	  32 times	 2.72 ns Swizzle	
  SLT:	 0.97 ns 	  32 times	 2.72 ns Swizzle	
  SIN:	 4.37 ns 	   7 times	 4.86 ns Saturate	
  SUB:	 0.46 ns 	  64 times	 2.05 ns Swizzle	
  SWZ:	 1.00 ns 	  29 times	 1.94 ns Saturate	 0.05 ns Writemask	
  TEX:	 0.48 ns 	   3 times	 0.97 ns Swizzle	
  TXP:	 0.49 ns 	   3 times	 0.97 ns Swizzle	
  TXB:	 0.49 ns 	   3 times	 0.97 ns Swizzle	
  XPD:	 0.97 ns 	  32 times	 3.88 ns Swizzle

These numbers were collected using November 2005 drivers on an ATI Radeon 9550, a fairly low-end card.

The compiler optimizes away repeated "MOV", "ABS", or "LIT" instructions--they don't actually take 0ns.

The instruction limits are fairly low on these cards. However, "TEX" has a limit of 3 only if each "TEX" depends on the results of the preceding instruction (i.e., 3 dependent textures); you can have lots of non-dependent textures in a single program. Apparently the next-generation ATI cards have much longer instruction limits.

Commonly-used instructions like arithmetic, dot products, and compares are 1 clock. Floor, cross-product, swizzles, and set-if-compares are 2 clocks. Adding a swizzle slows down almost all instruction types by 2 or more clocks. COS, SIN, and SCS are about 10 clocks, and should be avoided.
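
The clock counts above can be recovered from the ATI table by dividing each time by the cost of a one-clock instruction. The sketch below assumes one clock is 0.46 ns per pixel (the most common value in the table) and copies a handful of rows; the names ATI_NS, BASE_NS, clocks, and estimate_ns are made up for this example. Summing per-instruction times to price a whole program is also an assumption, since the compiler may reorder or remove instructions.

```python
# Time per pixel in ns, copied from the ATI table (Radeon 9550).
ATI_NS = {
    "ADD": 0.47, "MAD": 0.46, "FLR": 0.97, "XPD": 0.97,
    "POW": 1.46, "SCS": 3.40, "SIN": 4.37, "COS": 5.34,
}

BASE_NS = 0.46  # assumed cost of one clock, from the 1-clock instructions

def clocks(op):
    """Approximate clock count for one instruction."""
    return ATI_NS[op] / BASE_NS

def estimate_ns(ops):
    """Naive cost of a program: the sum of its instruction times."""
    return sum(ATI_NS[op] for op in ops)

for op in ("ADD", "FLR", "POW", "SCS", "COS"):
    print(f"{op}: about {clocks(op):.1f} clocks")
```

By this estimate COS lands between 11 and 12 clocks and SCS between 7 and 8, which is where the "about 10 clocks" figure comes from.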

Rough Speed Estimates (nVidia Cards)

Opcode  Time/pixel     Maximum Count     Slow options (Swizzle, saturate, negate, writemask)
  ABS:	-0.00 ns 	1024 times	 0.07 ns Saturate	
  ADD:	 0.00 ns 	1024 times	 0.13 ns Saturate	
  CMP:	-0.01 ns 	1024 times	
  COS:	 0.13 ns 	1024 times	
  DP3:	 0.07 ns 	1024 times	
  DP4:	 0.07 ns 	1024 times	 0.13 ns Writemask	
  DPH:	 0.14 ns 	1024 times	
  DST:	 0.13 ns 	1024 times	 0.02 ns Swizzle	
  EX2:	 0.13 ns 	1024 times	
  FLR:	 0.13 ns 	1024 times	
  FRC:	 0.13 ns 	1024 times	
  LG2:	 0.13 ns 	1024 times	
  LIT:	 0.39 ns 	1024 times	 0.26 ns Writemask	
  LRP:	 0.14 ns 	1024 times	
  MAD:	 0.13 ns 	1024 times	
  MAX:	 0.01 ns 	1024 times	 0.13 ns Swizzle	 0.13 ns Negate	
  MIN:	 0.01 ns 	1024 times	 0.13 ns Swizzle	 0.13 ns Negate	
  MOV:	 0.00 ns 	1024 times	
  MUL:	 0.00 ns 	1024 times	 0.07 ns Saturate	
  POW:	 0.26 ns 	1024 times	
  RCP:	-0.01 ns 	1024 times	 0.13 ns Negate	 0.13 ns Saturate	
  RSQ:	 0.25 ns 	1024 times	
  SCS:	 0.13 ns 	1024 times	
  SGE:	 0.13 ns 	1024 times	
  SLT:	 0.13 ns 	1024 times	
  SIN:	 0.13 ns 	1024 times	
  SUB:	-0.00 ns 	1024 times	 0.13 ns Saturate	
  SWZ:	-0.00 ns 	1024 times	 0.07 ns Saturate	
  TEX:	 0.13 ns 	1024 times	
  TXP:	 0.13 ns 	1024 times	
  TXB:	 0.13 ns 	1024 times	
  XPD:	 0.14 ns 	1024 times

These numbers were collected using November 2005 drivers from an nVidia GeForce 6800 (stock), which is a much more expensive card than the ATI card tested above, so don't infer anything from the relative performance.

The nVidia drivers are heavily optimizing my code here--it's not the case that adds are free, it's that *repeated* adds (like I use for performance testing) are turned into a single multiply by the compiler. Most of the zero-clock instructions tested as single-clock instructions using the January 2005 drivers.
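
I don't know exactly what transformation the driver applies, but the algebra behind this kind of folding is simple. The sketch below assumes the test program adds the same value over and over; repeated_add and folded are hypothetical names, and the values are chosen so the floating-point results match exactly.

```python
def repeated_add(x, c, n):
    # What the test program asks for: n separate ADD instructions.
    for _ in range(n):
        x = x + c
    return x

def folded(x, c, n):
    # What an optimizer can emit instead: one multiply-add, c*n + x.
    return x + n * c

x, c, n = 1.0, 0.25, 64
assert repeated_add(x, c, n) == folded(x, c, n) == 17.0
```

With the chain collapsed to one instruction, the measured cost per ADD divides a single instruction's time by the repeat count, which is why it rounds to zero.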

The instruction limits for nVidia hardware are very high--I stopped testing at 1024 instructions.

Generally, instructions take one or two clocks. Two-clock instructions include RSQ, POW, and LIT.

O. Lawlor, ffosl@uaf.edu