# OpenGL ARB_fragment_program Quick Reference ("Cheat Sheet")

| Instruction | Inputs | Output | Description |
|---|---|---|---|
| ABS | v | v | absolute value (`fabs(A)`) |
| ADD | v,v | v | add (`A + B`) |
| CMP | v,v,v | v | compare (`A<0?B:C`) |
| COS | s | ssss | cosine with reduction to [-PI,PI] |
| DP3 | v,v | ssss | 3-component dot product |
| DP4 | v,v | ssss | 4-component dot product |
| DPH | v,v | ssss | homogeneous dot product (`DP3,B.w`) |
| DST | v,v | v | distance vector |
| EX2 | s | ssss | exponential base 2 (`pow(2,A)`) |
| FLR | v | v | floor (`floor(A)`) |
| FRC | v | v | fraction (`A-floor(A)`) |
| KIL | v | v | kill fragment (`if (A<0) exit()`) |
| LG2 | s | ssss | logarithm base 2 (`log(A)/log(2)`) |
| LIT | v | v | compute light coefficients |
| LRP | v,v,v | v | linear interpolation (`A*B+(1-A)*C`) |
| MAD | v,v,v | v | multiply and add (`A*B+C`) |
| MAX | v,v | v | maximum (`max(A,B)`) |
| MIN | v,v | v | minimum (`min(A,B)`) |
| MOV | v | v | move |
| MUL | v,v | v | multiply (`A * B`) |
| POW | s,s | ssss | scalar power function (`pow(A,B)`) |
| RCP | s | ssss | reciprocal (`1.0/A`) |
| RSQ | s | ssss | reciprocal square root (`1.0/sqrt(A)`) |
| SCS | s | ss-- | sine/cosine without reduction (`sin(A),cos(A)`) |
| SIN | s | ssss | sine with reduction to [-PI,PI] |
| SGE | v,v | v | set on greater than or equal (`A>=B`) |
| SLT | v,v | v | set on less than (`A<B`) |

Table X.5: Summary of fragment program instructions. "v" indicates a floating-point vector input or output, "s" indicates a floating-point scalar input, "ssss" indicates a scalar output replicated across a 4-component result vector, "ss--" indicates two scalar outputs in the first two components, "u" indicates a texture image unit identifier (e.g., "texture[3]" for texture unit GL_TEXTURE3_ARB), and "t" indicates a texture target (e.g., "2D" for a 2D texture).
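A few of the less obvious rows can be spelled out as a rough reference emulation on 4-component vectors. This is only a sketch of the formulas given in the Description column, written in Python for readability; the function names are made up for illustration and nothing here talks to a GPU.

```python
import math

def cmp_(a, b, c):
    """CMP: per component, take b where a < 0, otherwise c."""
    return [bi if ai < 0 else ci for ai, bi, ci in zip(a, b, c)]

def lrp(a, b, c):
    """LRP: per component, a*b + (1-a)*c."""
    return [ai * bi + (1.0 - ai) * ci for ai, bi, ci in zip(a, b, c)]

def dp3(a, b):
    """DP3: 3-component dot product, replicated to all four outputs (ssss)."""
    d = a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
    return [d] * 4

def dph(a, b):
    """DPH: DP3 of a and b, plus b.w, replicated to all four outputs."""
    d = a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + b[3]
    return [d] * 4

def frc(a):
    """FRC: per component, a - floor(a)."""
    return [x - math.floor(x) for x in a]

print(cmp_([-1.0, 0.0, 2.0, -3.0], [10.0] * 4, [20.0] * 4))
print(lrp([0.0, 1.0, 0.5, 0.25], [8.0] * 4, [4.0] * 4))
print(dph([1.0, 2.0, 3.0, 99.0], [4.0, 5.0, 6.0, 7.0]))
```

Note that `dph` ignores `a.w` entirely, which is what makes it handy for transforming a point by one row of a matrix.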

Several oddly named operand and result modifiers are cheap or completely free (they do not slow the program down at all). They look like this:

- Swizzle: arbitrarily rearrange the order of an input vector's components (may cost a clock or two): `ADD a,b, c.yxwz;`
- Negate: flip the sign of an input (free): `ADD a,b, -c;`
- Saturate: clamp output values to lie between 0 and 1 (free): `ADD_SAT a,b,c;`
- Writemask: only change certain components of the output vector (free): `ADD a.xz, b,c;`
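To make the four modifiers concrete, here is a small Python emulation of what they do to a 4-component vector. It is a sketch under the assumption that vectors are plain lists in x, y, z, w order; the helper names are hypothetical and only mirror the descriptions above.

```python
COMPONENTS = "xyzw"

def swizzle(v, pattern):
    """Rearrange components, e.g. 'yxwz'; a single letter replicates it."""
    if len(pattern) == 1:
        pattern = pattern * 4
    return [v[COMPONENTS.index(ch)] for ch in pattern]

def negate(v):
    """Flip the sign of every component."""
    return [-x for x in v]

def saturate(v):
    """Clamp every component to the range [0, 1]."""
    return [min(max(x, 0.0), 1.0) for x in v]

def writemask(old, new, mask):
    """Overwrite only the components named in mask, e.g. 'xz'."""
    return [n if ch in mask else o for ch, o, n in zip(COMPONENTS, old, new)]

def add(a, b):
    """Plain component-wise ADD."""
    return [x + y for x, y in zip(a, b)]

# Emulates: ADD_SAT a.xz, b, -c.yxwz;
a = [9.0, 9.0, 9.0, 9.0]
b = [0.5, 0.5, 0.5, 0.5]
c = [0.25, 1.0, -2.0, 0.0]
a = writemask(a, saturate(add(b, negate(swizzle(c, "yxwz")))), "xz")
print(a)
```

All four modifiers can be combined on one instruction, as the last lines show: the swizzle and negate apply to the input, the saturate and writemask to the result.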

The official standard is ARB_fragment_program, the last half of which is somewhat readable.  Table X.5 is reproduced from this standard.

I can't find anything on the net that lists even an approximate time per pixel for the various instructions, so I'm posting what I've measured myself using this code. Beware measurement error! Check real benchmark sites for absolute performance, which scales linearly with clock rate and number of pipelines anyway (and hence nonlinearly with card cost). The relative performance of the various instructions seems roughly the same across all ATI (9550, 9600, Mobility 9600, x300) or nVidia (5200, 5600, 6800) cards of a given generation, although, as mentioned, the absolute scale varies with card price.
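The linear scaling claim can be written as a one-line model. This is a simplified sketch, not a measurement: it assumes every pipeline retires one instruction per listed clock, and the clock rates and pipeline counts below are hypothetical numbers chosen only to show the scaling.

```python
def ns_per_pixel(clocks, clock_mhz, pipelines):
    """Idealized time per pixel in ns for an instruction costing `clocks`
    clocks, on a card running at clock_mhz with `pipelines` pixel pipelines."""
    return clocks * 1000.0 / (clock_mhz * pipelines)

# Hypothetical cards: doubling either the clock rate or the pipeline
# count halves the time per pixel in this model.
print(ns_per_pixel(1, 500, 4))
print(ns_per_pixel(1, 1000, 4))
print(ns_per_pixel(1, 500, 8))
```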

## Rough Speed Estimates (ATI Cards)

| Opcode | Time/pixel | Maximum count | Slow options (swizzle, saturate, negate, writemask) |
|---|---|---|---|
| ABS | -0.00 ns | 128 times | 1.00 ns Swizzle, 0.46 ns Saturate |
| ADD | 0.47 ns | 64 times | 2.05 ns Swizzle |
| CMP | 0.46 ns | 64 times | 2.21 ns Swizzle |
| COS | 5.34 ns | 6 times | |
| DP3 | 0.46 ns | 64 times | 2.05 ns Swizzle |
| DP4 | 0.46 ns | 64 times | 2.64 ns Swizzle |
| DPH | 0.46 ns | 64 times | 2.67 ns Swizzle |
| DST | 0.49 ns | 62 times | 1.46 ns Saturate, 0.43 ns Writemask |
| EX2 | 0.47 ns | 64 times | |
| FLR | 0.97 ns | 32 times | 2.43 ns Swizzle |
| FRC | 0.46 ns | 64 times | 1.94 ns Swizzle |
| LG2 | 0.46 ns | 64 times | |
| LIT | -0.18 ns | 128 times | 4.37 ns Swizzle, 0.00 ns Negate |
| LRP | 0.46 ns | 64 times | 2.25 ns Swizzle, 0.97 ns Negate |
| MAD | 0.46 ns | 64 times | 2.21 ns Swizzle |
| MAX | 0.46 ns | 64 times | 2.05 ns Swizzle |
| MIN | 0.46 ns | 64 times | 2.05 ns Swizzle |
| MOV | -0.00 ns | 128 times | 0.46 ns Saturate |
| MUL | 0.47 ns | 64 times | 2.05 ns Swizzle |
| POW | 1.46 ns | 21 times | |
| RCP | 0.46 ns | 64 times | |
| RSQ | 0.46 ns | 64 times | |
| SCS | 3.40 ns | 9 times | 4.30 ns Saturate |
| SGE | 0.97 ns | 32 times | 2.72 ns Swizzle |
| SLT | 0.97 ns | 32 times | 2.72 ns Swizzle |
| SIN | 4.37 ns | 7 times | 4.86 ns Saturate |
| SUB | 0.46 ns | 64 times | 2.05 ns Swizzle |
| SWZ | 1.00 ns | 29 times | 1.94 ns Saturate, 0.05 ns Writemask |
| TEX | 0.48 ns | 3 times | 0.97 ns Swizzle |
| TXP | 0.49 ns | 3 times | 0.97 ns Swizzle |
| TXB | 0.49 ns | 3 times | 0.97 ns Swizzle |
| XPD | 0.97 ns | 32 times | 3.88 ns Swizzle |

These numbers were collected using November 2005 drivers on an ATI Radeon 9550, a fairly low-end card.

The compiler optimizes away repeated "MOV", "ABS", or "LIT" instructions; they don't actually take 0 ns.

The instruction limits are fairly low on these cards. However, "TEX" has a limit of 3 only if each "TEX" depends on the result of the preceding instruction (i.e., 3 dependent textures); you can have lots of non-dependent textures in a single program. Apparently the next-generation ATI cards have much longer instruction limits.

Commonly-used instructions like arithmetic, dot products, and compares are 1 clock.  Floor, cross-product, swizzles, and set-if-compares are 2 clocks.  Adding a swizzle slows down almost all instruction types by 2 or more clocks.  COS, SIN, and SCS are about 10 clocks, and should be avoided.
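The clock counts above come from dividing the measured times by the time of the cheapest real instruction. A small Python sketch of that arithmetic follows; the 0.46 ns per clock figure is my assumption, read off the most common entry in the ATI table, and only a handful of opcodes are copied in.

```python
# Per-pixel times in ns, copied from the ATI table above.
ATI_NS_PER_PIXEL = {
    "ADD": 0.47, "DP3": 0.46, "CMP": 0.46,
    "FLR": 0.97, "XPD": 0.97, "SGE": 0.97,
    "POW": 1.46, "SCS": 3.40, "SIN": 4.37, "COS": 5.34,
}

# Assumption: one clock costs about 0.46 ns per pixel on this card.
NS_PER_CLOCK = 0.46

def approx_clocks(opcode):
    """Estimated clock count for an opcode, as a float."""
    return ATI_NS_PER_PIXEL[opcode] / NS_PER_CLOCK

# Print the opcodes from cheapest to most expensive.
for opcode in sorted(ATI_NS_PER_PIXEL, key=approx_clocks):
    print(f"{opcode}: about {approx_clocks(opcode):.1f} clocks")
```

Given the measurement error, the fractional part of each estimate should not be taken seriously; the point is only the rough grouping into 1-clock, 2-clock, and roughly 10-clock instructions.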

## Rough Speed Estimates (nVidia Cards)

| Opcode | Time/pixel | Maximum count | Slow options (swizzle, saturate, negate, writemask) |
|---|---|---|---|
| ABS | -0.00 ns | 1024 times | 0.07 ns Saturate |
| ADD | 0.00 ns | 1024 times | 0.13 ns Saturate |
| CMP | -0.01 ns | 1024 times | |
| COS | 0.13 ns | 1024 times | |
| DP3 | 0.07 ns | 1024 times | |
| DP4 | 0.07 ns | 1024 times | 0.13 ns Writemask |
| DPH | 0.14 ns | 1024 times | |
| DST | 0.13 ns | 1024 times | 0.02 ns Swizzle |
| EX2 | 0.13 ns | 1024 times | |
| FLR | 0.13 ns | 1024 times | |
| FRC | 0.13 ns | 1024 times | |
| LG2 | 0.13 ns | 1024 times | |
| LIT | 0.39 ns | 1024 times | 0.26 ns Writemask |
| LRP | 0.14 ns | 1024 times | |
| MAD | 0.13 ns | 1024 times | |
| MAX | 0.01 ns | 1024 times | 0.13 ns Swizzle, 0.13 ns Negate |
| MIN | 0.01 ns | 1024 times | 0.13 ns Swizzle, 0.13 ns Negate |
| MOV | 0.00 ns | 1024 times | |
| MUL | 0.00 ns | 1024 times | 0.07 ns Saturate |
| POW | 0.26 ns | 1024 times | |
| RCP | -0.01 ns | 1024 times | 0.13 ns Negate, 0.13 ns Saturate |
| RSQ | 0.25 ns | 1024 times | |
| SCS | 0.13 ns | 1024 times | |
| SGE | 0.13 ns | 1024 times | |
| SLT | 0.13 ns | 1024 times | |
| SIN | 0.13 ns | 1024 times | |
| SUB | -0.00 ns | 1024 times | 0.13 ns Saturate |
| SWZ | -0.00 ns | 1024 times | 0.07 ns Saturate |
| TEX | 0.13 ns | 1024 times | |
| TXP | 0.13 ns | 1024 times | |
| TXB | 0.13 ns | 1024 times | |
| XPD | 0.14 ns | 1024 times | |

These numbers were collected using November 2005 drivers from an nVidia GeForce 6800 (stock), which is a much more expensive card than the ATI card tested above, so don't infer anything from the relative performance.

The nVidia drivers are heavily optimizing my code here--it's not the case that adds are free, it's that *repeated* adds (like I use for performance testing) are turned into a single multiply by the compiler.  Most of the zero-clock instructions tested as single-clock instructions using the January 2005 drivers.
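To see why repeated adds can collapse, here is a Python illustration of the algebra involved. It is not the driver's actual optimizer, just a sketch assuming the benchmark repeats `ADD r, r, c;` a fixed number of times; the function names are hypothetical.

```python
def run_repeated_adds(r, c, n):
    """What the benchmark asks for: n back-to-back component-wise ADDs."""
    for _ in range(n):
        r = [ri + ci for ri, ci in zip(r, c)]
    return r

def run_folded(r, c, n):
    """What a compiler could emit instead: one multiply-add, c*n + r."""
    return [ci * n + ri for ri, ci in zip(r, c)]

r = [0.0, 1.0, 2.0, 3.0]
c = [0.5, 0.25, -1.0, 2.0]
print(run_repeated_adds(r, c, 64))
print(run_folded(r, c, 64))
```

The two agree exactly here because the inputs are small binary fractions; with arbitrary floats the folded version can differ in the last bits, which is presumably a tradeoff the driver accepts.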

The instruction limits for nVidia hardware are very high--I stopped testing at 1024 instructions.

Generally, instructions take one or two clocks.  Two-clock instructions include RSQ, POW, and LIT.

O. Lawlor, ffosl@uaf.edu