OpenGL ARB_fragment_program Quick Reference ("Cheat Sheet")



Instruction  Inputs  Output  Description
-----------  ------  ------  --------------------------------
ABS          v       v       absolute value (fabs(A))
ADD          v,v     v       add (A+B)
CMP          v,v,v   v       compare (A<0 ? B : C)
COS          s       ssss    cosine with reduction to [-PI,PI]
DP3          v,v     ssss    3-component dot product
DP4          v,v     ssss    4-component dot product
DPH          v,v     ssss    homogeneous dot product (DP3(A,B)+B.w)
DST          v,v     v       distance vector
EX2          s       ssss    exponential base 2 (pow(2,A))
FLR          v       v       floor (floor(A))
FRC          v       v       fraction (A-floor(A))
KIL          v       v       kill fragment (discard if any component of A<0)
LG2          s       ssss    logarithm base 2 (log(A)/log(2))
LIT          v       v       compute light coefficients
LRP          v,v,v   v       linear interpolation (A*B+(1-A)*C)
MAD          v,v,v   v       multiply and add (A*B+C)
MAX          v,v     v       maximum (max(A,B))
MIN          v,v     v       minimum (min(A,B))
MOV          v       v       move
MUL          v,v     v       multiply (A*B)
POW          s,s     ssss    scalar power function (pow(A,B))
RCP          s       ssss    reciprocal (1.0/A)
RSQ          s       ssss    reciprocal square root (1.0/sqrt(A))
SCS          s       ss--    sine/cosine without reduction (x=cos(A), y=sin(A))
SIN          s       ssss    sine with reduction to [-PI,PI]
SGE          v,v     v       set on greater than or equal (A>=B)
SLT          v,v     v       set on less than (A<B)
SUB          v,v     v       subtract (A-B)
SWZ          v       v       extended swizzle
TEX          v,u,t   v       texture sample
TXB          v,u,t   v       texture sample with mipmap LOD bias
TXP          v,u,t   v       texture sample with perspective divide
XPD          v,v     v       3D vector cross product (A x B)

Table X.5:  Summary of fragment program instructions.  "v" indicates a floating-point vector input or output, "s" indicates a floating-point scalar input, "ssss" indicates a scalar output replicated across a 4-component result vector, "ss--" indicates two scalar outputs in the first two components, "u" indicates a texture image unit identifier (e.g., "texture[3]" for texture unit 3, GL_TEXTURE3_ARB), and "t" indicates a texture target (e.g., "2D" for a 2D texture).
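To make the shorthand in the Description column concrete, here is a minimal CPU-side sketch in Python of a few of the instructions.  It is only an illustration: the function names and the use of plain 4-element lists for vectors are my own assumptions, and it models neither GPU precision nor the texture instructions.

```python
def replicate(s):
    """Model an 'ssss' output: one scalar copied into all four components."""
    return [s, s, s, s]

def CMP(a, b, c):
    # Per component: A < 0 ? B : C
    return [bi if ai < 0.0 else ci for ai, bi, ci in zip(a, b, c)]

def LRP(a, b, c):
    # Per component: A*B + (1-A)*C
    return [ai * bi + (1.0 - ai) * ci for ai, bi, ci in zip(a, b, c)]

def DP3(a, b):
    # 3-component dot product, replicated to all four outputs
    return replicate(a[0] * b[0] + a[1] * b[1] + a[2] * b[2])

def DPH(a, b):
    # Homogeneous dot product: DP3(A,B) + B.w (A.w is ignored)
    return replicate(a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + b[3])

def SGE(a, b):
    # Per component: 1.0 if A >= B, else 0.0
    return [1.0 if ai >= bi else 0.0 for ai, bi in zip(a, b)]

def SLT(a, b):
    # Per component: 1.0 if A < B, else 0.0
    return [1.0 if ai < bi else 0.0 for ai, bi in zip(a, b)]

def KIL(a):
    # True if the fragment would be discarded: any component below zero
    return any(ai < 0.0 for ai in a)
```

For example, DPH([1, 2, 3, 99], [4, 5, 6, 7]) ignores the 99 and adds B.w to the 3-component dot product.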

Many oddly named operations are fast or totally free (they do not slow the program down at all).  They look like this:
Swizzle: arbitrarily rearrange the order of input vector components (may cost a clock or two)
    ADD a, b, c.yxwz;
Negate: flip the sign of an input (free)
    ADD a, b, -c;
Saturate: clamp output values to lie between 0 and 1 (free)
    ADD_SAT a, b, c;
Writemask: only change certain components of the output vector (free)
    ADD a.xz, b, c;
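The four modifiers above can be emulated on the CPU to check what a given line will compute.  This is a rough Python sketch, not driver code; the helper names (swizzle, negate, saturate, write_masked) are hypothetical and vectors are plain 4-element lists.

```python
COMPONENT = {"x": 0, "y": 1, "z": 2, "w": 3}

def swizzle(v, pattern):
    # c.yxwz -> pick components of v in the order named by the pattern
    return [v[COMPONENT[ch]] for ch in pattern]

def negate(v):
    # -c -> flip the sign of every component
    return [-x for x in v]

def saturate(v):
    # _SAT suffix -> clamp every component of the result to [0, 1]
    return [min(max(x, 0.0), 1.0) for x in v]

def write_masked(dest, result, mask):
    # a.xz -> only components named in the mask are overwritten
    return [r if ch in mask else d for d, r, ch in zip(dest, result, "xyzw")]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

b = [1.0, 2.0, 3.0, 4.0]
c = [10.0, 20.0, 30.0, 40.0]

swizzled = add(b, swizzle(c, "yxwz"))                  # ADD a, b, c.yxwz;
negated = add(b, negate(c))                            # ADD a, b, -c;
clamped = saturate(add([0.25, 0.5, 0.75, -2.0],
                       [0.5, 0.75, 0.5, 1.0]))         # ADD_SAT a, b, c;
masked = write_masked([0.0] * 4, add(b, c), "xz")      # ADD a.xz, b, c;
```

Note that swizzle and negate act on an input operand, while saturate and the writemask act on the result.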

The official standard is ARB_fragment_program, the last half of which is somewhat readable.  Table X.5 is reproduced from this standard.

I can't find anything on the net that lists even an approximation of the basic time per pixel for the various instructions, so I'm posting what I've measured myself using this code.  Beware measurement error!  Check real benchmark sites for absolute performance, which scales linearly with clock rate and number of pipelines anyway (and hence nonlinearly with card cost).  The relative performance of the various instructions, however, seems roughly the same across all ATI (9550, 9600, Mobility 9600, x300) or nVidia (5200, 5600, 6800) cards of a given generation, although, as mentioned, the absolute scale does vary with card price.

Rough Speed Estimates (ATI Cards)

Opcode  Time/pixel     Maximum Count     Slow options (Swizzle, saturate, negate, writemask)
ABS: -0.00 ns 128 times 1.00 ns Swizzle 0.46 ns Saturate
ADD: 0.47 ns 64 times 2.05 ns Swizzle
CMP: 0.46 ns 64 times 2.21 ns Swizzle
COS: 5.34 ns 6 times
DP3: 0.46 ns 64 times 2.05 ns Swizzle
DP4: 0.46 ns 64 times 2.64 ns Swizzle
DPH: 0.46 ns 64 times 2.67 ns Swizzle
DST: 0.49 ns 62 times 1.46 ns Saturate 0.43 ns Writemask
EX2: 0.47 ns 64 times
FLR: 0.97 ns 32 times 2.43 ns Swizzle
FRC: 0.46 ns 64 times 1.94 ns Swizzle
LG2: 0.46 ns 64 times
LIT: -0.18 ns 128 times 4.37 ns Swizzle 0.00 ns Negate
LRP: 0.46 ns 64 times 2.25 ns Swizzle 0.97 ns Negate
MAD: 0.46 ns 64 times 2.21 ns Swizzle
MAX: 0.46 ns 64 times 2.05 ns Swizzle
MIN: 0.46 ns 64 times 2.05 ns Swizzle
MOV: -0.00 ns 128 times 0.46 ns Saturate
MUL: 0.47 ns 64 times 2.05 ns Swizzle
POW: 1.46 ns 21 times
RCP: 0.46 ns 64 times
RSQ: 0.46 ns 64 times
SCS: 3.40 ns 9 times 4.30 ns Saturate
SGE: 0.97 ns 32 times 2.72 ns Swizzle
SLT: 0.97 ns 32 times 2.72 ns Swizzle
SIN: 4.37 ns 7 times 4.86 ns Saturate
SUB: 0.46 ns 64 times 2.05 ns Swizzle
SWZ: 1.00 ns 29 times 1.94 ns Saturate 0.05 ns Writemask
TEX: 0.48 ns 3 times 0.97 ns Swizzle
TXP: 0.49 ns 3 times 0.97 ns Swizzle
TXB: 0.49 ns 3 times 0.97 ns Swizzle
XPD: 0.97 ns 32 times 3.88 ns Swizzle
These numbers were collected using November 2005 drivers on an ATI Radeon 9550, a fairly low-end card. 

The compiler optimizes away repeated "MOV", "ABS", or "LIT" instructions; they don't actually take 0 ns.

The instruction limits are fairly low on these cards.  However, "TEX" has a limit of 3 only if each "TEX" depends on the result of the preceding instruction (i.e., 3 dependent textures); you can have lots of non-dependent textures in a single program.  Apparently the next-generation ATI cards have much longer instruction limits.

Commonly-used instructions like arithmetic, dot products, and compares are 1 clock.  Floor, cross-product, swizzles, and set-if-compares are 2 clocks.  Adding a swizzle slows down almost all instruction types by 2 or more clocks.  COS, SIN, and SCS are about 10 clocks, and should be avoided.
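The clock counts above can be recovered from the ATI table by dividing each time by the time of the cheapest real instruction.  The sketch below assumes that 0.46 ns per pixel corresponds to one clock on this card (an inference from the table, not a datasheet figure); the function name and the sample dictionary are illustrative only.

```python
ATI_NS_PER_CLOCK = 0.46  # assumption: the most common time in the ATI table is one clock

def clocks(ns_per_pixel, ns_per_clock=ATI_NS_PER_CLOCK):
    """Round a measured time per pixel to a whole number of clocks."""
    return max(0, round(ns_per_pixel / ns_per_clock))

# A few measured times copied from the ATI table above (ns per pixel).
measured = {
    "ADD": 0.47,
    "DP3": 0.46,
    "FLR": 0.97,
    "XPD": 0.97,
    "POW": 1.46,
    "SCS": 3.40,
    "COS": 5.34,
}

estimated = {op: clocks(ns) for op, ns in measured.items()}
```

Given the measurement error, treat the results as rough: the trigonometric instructions land somewhere around 7 to 12 clocks rather than exactly 10.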

Rough Speed Estimates (nVidia Cards)

Opcode  Time/pixel     Maximum Count     Slow options (Swizzle, saturate, negate, writemask)
ABS: -0.00 ns 1024 times 0.07 ns Saturate
ADD: 0.00 ns 1024 times 0.13 ns Saturate
CMP: -0.01 ns 1024 times
COS: 0.13 ns 1024 times
DP3: 0.07 ns 1024 times
DP4: 0.07 ns 1024 times 0.13 ns Writemask
DPH: 0.14 ns 1024 times
DST: 0.13 ns 1024 times 0.02 ns Swizzle
EX2: 0.13 ns 1024 times
FLR: 0.13 ns 1024 times
FRC: 0.13 ns 1024 times
LG2: 0.13 ns 1024 times
LIT: 0.39 ns 1024 times 0.26 ns Writemask
LRP: 0.14 ns 1024 times
MAD: 0.13 ns 1024 times
MAX: 0.01 ns 1024 times 0.13 ns Swizzle 0.13 ns Negate
MIN: 0.01 ns 1024 times 0.13 ns Swizzle 0.13 ns Negate
MOV: 0.00 ns 1024 times
MUL: 0.00 ns 1024 times 0.07 ns Saturate
POW: 0.26 ns 1024 times
RCP: -0.01 ns 1024 times 0.13 ns Negate 0.13 ns Saturate
RSQ: 0.25 ns 1024 times
SCS: 0.13 ns 1024 times
SGE: 0.13 ns 1024 times
SLT: 0.13 ns 1024 times
SIN: 0.13 ns 1024 times
SUB: -0.00 ns 1024 times 0.13 ns Saturate
SWZ: -0.00 ns 1024 times 0.07 ns Saturate
TEX: 0.13 ns 1024 times
TXP: 0.13 ns 1024 times
TXB: 0.13 ns 1024 times
XPD: 0.14 ns 1024 times
These numbers were collected using November 2005 drivers from an nVidia GeForce 6800 (stock), which is a much more expensive card than the ATI card tested above, so don't infer anything from the relative performance.

The nVidia drivers are heavily optimizing my code here--it's not the case that adds are free, it's that *repeated* adds (like I use for performance testing) are turned into a single multiply by the compiler.  Most of the zero-clock instructions tested as single-clock instructions using the January 2005 drivers.
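As an illustration of the kind of folding described here, the Python sketch below shows that a chain of identical adds gives the same result as a single multiply-add.  It assumes the test program adds the same operand over and over; I have not inspected what the driver's compiler actually emits, so this is only the algebra behind the effect.

```python
def repeated_add(x, c, n):
    """What the test program asks for: n separate ADD instructions."""
    for _ in range(n):
        x = x + c
    return x

def folded(x, c, n):
    """What an optimizing compiler could emit instead: one multiply-add."""
    return x + n * c

# Values chosen to be exactly representable in binary floating point,
# so both forms give bit-identical results.
slow = repeated_add(1.0, 0.25, 64)
fast = folded(1.0, 0.25, 64)
```

This is why a benchmark made of repeated identical instructions can report a near-zero time per instruction.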

The instruction limits for nVidia hardware are very high--I stopped testing at 1024 instructions.

Generally, instructions take one or two clocks.  Two-clock instructions include RSQ, POW, and LIT.


O. Lawlor, ffosl@uaf.edu