Floating-Point Numbers

CS 641 Lecture, Dr. Lawlor
To represent fractional values before fast floating-point hardware was available, we used fixed-point arithmetic, where the location of the decimal point is fixed at compile time.  Floating point lets the decimal point move at runtime, which makes it more flexible than fixed point.
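
As a rough illustration of what fixing the point's location at compile time looks like, here is a minimal 16.16 fixed-point sketch.  The names `fix16`, `FRAC_BITS`, and the `fix_*` helpers are made up for this example and do not come from any particular library.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 16.16 fixed-point format: the binary point sits FRAC_BITS
// bits from the bottom, and that position never changes at runtime.
typedef int32_t fix16;
const int FRAC_BITS = 16;

fix16 fix_from_double(double v) {
	assert(v > -32768.0 && v < 32768.0); // must fit in 16 integer bits
	return (fix16)(v * (1 << FRAC_BITS));
}
double fix_to_double(fix16 f) { return (double)f / (1 << FRAC_BITS); }

// Addition is plain integer addition: the binary points already line up.
fix16 fix_add(fix16 a, fix16 b) { return a + b; }

// A product has twice as many fraction bits, so shift back down afterwards.
fix16 fix_mul(fix16 a, fix16 b) {
	return (fix16)(((int64_t)a * b) >> FRAC_BITS);
}
```

The cost of this simplicity is range: anything that does not fit in 16 integer bits or needs more than 16 fraction bits is lost, which is the inflexibility that floating point removes.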

If you need a review of floats, see these CS 301 lecture notes:
This patent provides a decent summary of a floating-point add circuit.  The basic idea is usually:
• Shift the two input numbers so their decimal points line up.
• Add the two aligned numbers as integers.
• Renormalize the sum: count off zero bits until you hit a one, and shift significant digits up.
You can shift both input numbers into huge fixed-point values (for example, a 32-bit float can be shifted into a fx128.153 fixed-point number), but it's much more circuitry-efficient to shift the smaller number so it matches up with the larger value, as we discussed in class.
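
The shift-the-smaller-number approach can be sketched in software.  This is a deliberately restricted illustration, not a model of any real adder circuit: it assumes both inputs are positive normal floats, and it truncates the result instead of rounding to nearest.  The names `soft_add`, `float_bits`, and `bits_float` are invented for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a float's bits as an integer and back (memcpy avoids aliasing trouble).
uint32_t float_bits(float f) { uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }
float bits_float(uint32_t u) { float f; std::memcpy(&f, &u, sizeof f); return f; }

// Sketch only: a and b must be positive normal floats, and the result is
// truncated (rounded toward zero) instead of rounded to nearest.
float soft_add(float a, float b) {
	uint32_t ua = float_bits(a), ub = float_bits(b);
	if (ua < ub) { uint32_t t = ua; ua = ub; ub = t; } // make a the larger input
	int ea = (int)(ua >> 23), eb = (int)(ub >> 23); // biased exponents (sign bit must be 0)
	assert(ea > 0 && ea < 255 && eb > 0 && eb < 255); // positive normal numbers only
	uint32_t ma = (ua & 0x7FFFFF) | 0x800000; // 24-bit mantissa including the implicit 1
	uint32_t mb = (ub & 0x7FFFFF) | 0x800000;

	// Step 1: shift the smaller number down so the binary points line up.
	int shift = ea - eb;
	mb = (shift < 32) ? (mb >> shift) : 0;

	// Step 2: add the aligned mantissas as plain integers.
	uint32_t sum = ma + mb; // at most 25 bits
	int e = ea;

	// Step 3: renormalize so the leading 1 is back at bit 23.
	// Two positive inputs can only carry out by one bit; an adder that also
	// subtracts would instead count leading zeros and shift up.
	if (sum & 0x1000000) { sum >>= 1; e++; }
	assert(e < 255); // overflow to infinity is not handled in this sketch

	return bits_float(((uint32_t)e << 23) | (sum & 0x7FFFFF));
}
```

For example, `soft_add(1.5f, 1.5f)` carries out of the 24-bit mantissa and is renormalized to 3.0f, while `soft_add(1.0f, 1e-30f)` shifts the smaller input completely off the end and returns 1.0f.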

Software Examples

x86 ancient (1980's) interface: floating-point register stack.
```
fldpi ; Push "pi" onto floating-point stack
fld DWORD[my_float]  ; push constant
faddp ; add one and pi
sub esp,8 ; Make room on the stack for an 8-byte double
fstp QWORD [esp]; Push printf's double parameter onto the stack
push my_string ; Push printf's string parameter (below)
extern printf
call printf  ; Print string
add esp,12    ; Clean up stack
ret ; Done with function
my_string: db "Yo!  Here's our float: %f",0xa,0
my_float: dd 1.0 ; floating-point DWORD
```

(Try this in NetRun now!)

x86 newer (1990's) interface: SSE registers
```
movups xmm0,[my_arr] ; load up array
addps xmm0,xmm0 ; add array to itself
movups [my_arr],xmm0 ; store back to memory
push 4 ; number of values to print
push my_arr ; array to print
extern farray_print
call farray_print  ; Print the array
add esp,8    ; Clean up stack
ret ; Done with function
section ".data"
my_arr: dd 1.0, 2.0, 3.0, 4.0 ; four floating-point DWORDs
```

(Try this in NetRun now!)

Bits in Floating-point Numbers

We can pretty easily count the bits in a float, by making the float smaller and smaller until roundoff loses the "x":
```
float x=1.0+1.0e-9*(rand()%2); /* FEAR ME, OPTIMIZER!!! */
int itcount=0;
while (x+1.0f!=1.0f) {
	x=x*0.5;
	itcount++;
}
std::cout<<"itcount=="<<itcount<<"\n";
```

(Try this in NetRun now!)

We can't do the cout on the GPU, but we can print out different colors.  Green is right on the money; blue is too small, red is too big.
```
float x=1.0+texcoords.x*1.0e-10; /* FEAR ME, OPTIMIZER!!! */
int itcount=0;
while (x+1.0!=1.0) {
	x=x*0.5;
	itcount++;
}
if (itcount>24) gl_FragColor=vec4(1,0,0,0); /* red */
else if (itcount==24) gl_FragColor=vec4(0,1,0,0); /* green */
else if (itcount<24) gl_FragColor=vec4(0,0,1,0); /* blue */
```
This outputs green, indicating that "float" on the GPU has exactly the same number of mantissa bits as on the CPU: 23 stored bits plus the implicit leading 1, which is why the loop runs 24 times.
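
On the CPU, the same count can be cross-checked against what the C++ standard library reports.  The sketch below assumes IEEE 754 single precision; `count_mantissa_halvings`, `float_digits`, and `float_eps` are names made up for this example.

```cpp
#include <cassert>
#include <limits>

// Count halvings until adding x to 1.0f no longer changes it.
// The volatile temporary forces the sum to be rounded to a real 32-bit float,
// even if the compiler would otherwise keep extra precision in a register.
int count_mantissa_halvings() {
	assert(std::numeric_limits<float>::is_iec559); // assumption: IEEE 754 floats
	float x = 1.0f;
	int itcount = 0;
	while (true) {
		volatile float sum = x + 1.0f;
		if (sum == 1.0f) break;
		x = x * 0.5f;
		itcount++;
	}
	return itcount;
}

// What the standard library says about float:
const int float_digits = std::numeric_limits<float>::digits; // mantissa bits, counting the implicit 1
const float float_eps = std::numeric_limits<float>::epsilon(); // gap between 1.0f and the next float up
```

The loop stops once x reaches 2^-24, because 1+2^-24 is exactly halfway between 1.0f and the next float and rounds back to 1.0f.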

Similarly, we can keep shrinking an "x" until it gets rounded off to zero.  We use up all the normal exponent values from 2^0 down to 2^-126 (127 of them), then all the mantissa bits (23) as denormals down to 2^-149, to get a total of 150 places: 2^-150 "underflows" to zero.
```
float x=1.0+1.0e-9*(rand()%2); /* FEAR ME, OPTIMIZER!!! */
int itcount=0;
while (x!=0.0f) {
	x=x*0.5;
	itcount++;
}
std::cout<<"itcount=="<<itcount<<"\n";
```

(Try this in NetRun now!)
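
To see where the CPU count comes from, the halving loop can be split into normal and denormal steps.  This is a sketch assuming IEEE 754 floats with denormals enabled (no flush-to-zero mode and no fast-math compiler flags); `HalvingCounts` and `count_halvings_to_zero` are hypothetical names for this example.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// How many nonzero values the loop visits, split by kind.
struct HalvingCounts { int normal, denormal; };

// Halve x until it rounds to zero, classifying each nonzero value on the way.
HalvingCounts count_halvings_to_zero() {
	assert(std::numeric_limits<float>::is_iec559); // assumption: IEEE 754 floats
	HalvingCounts c = {0, 0};
	volatile float x = 1.0f; // volatile: keep x a real 32-bit float at each step
	while (x != 0.0f) {
		float cur = x;
		if (std::fpclassify(cur) == FP_SUBNORMAL) c.denormal++;
		else c.normal++;
		x = cur * 0.5f;
	}
	return c;
}
```

The standard library exposes the two boundaries directly: `std::numeric_limits<float>::min()` is the smallest normal float, and `std::numeric_limits<float>::denorm_min()` is the smallest denormal.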

Now on the GPU:
```
float x=1.0+texcoords.x*1.0e-10; /* FEAR ME, OPTIMIZER!!! */
int itcount=0;
while (x!=0.0) {
	x=x*0.5;
	itcount++;
}
int expected=150;
if (itcount>expected) gl_FragColor=vec4(1,0,0,0); /* red */
else if (itcount==expected) gl_FragColor=vec4(0,1,0,0); /* green */
else if (itcount<expected) gl_FragColor=vec4(0,0,1,0); /* blue */
```

(Try this in NetRun now!)

Wait, the GPU is returning blue, indicating fewer iterations than we expected!  Experimentally, we can find that the GPU underflows to zero at 2^-127.  That is exactly where denormals would kick in: 2^-126 is the last of the normal numbers.  So the GPU doesn't do denormals, and indeed you can measure that there is zero performance penalty for using values near them.
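
The GPU's behavior can be imitated on the CPU by flushing every denormal result to zero in software.  This is only an emulation under the assumption of IEEE 754 floats, not a description of how any particular GPU is built; `flush_to_zero` and `count_halvings_flushed` are hypothetical helpers.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Software stand-in for flush-to-zero hardware: any result too small
// to be a normal float is replaced by zero.
float flush_to_zero(float v) {
	return (std::fpclassify(v) == FP_SUBNORMAL) ? 0.0f : v;
}

// The same halving loop, but with every result flushed.
int count_halvings_flushed() {
	assert(std::numeric_limits<float>::is_iec559); // assumption: IEEE 754 floats
	volatile float x = 1.0f; // volatile: keep x a real 32-bit float at each step
	int itcount = 0;
	while (x != 0.0f) {
		x = flush_to_zero(x * 0.5f);
		itcount++;
	}
	return itcount;
}
```

With flushing, the 127th halving would produce 2^-127, which is a denormal and so becomes zero, giving a count of 127 instead of 150.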