# Non-SSE Floating Point: Stack-Based x87

CS 301 Lecture, Dr. Lawlor

As a reminder, here's how we print a float living out there in memory:
```asm
push rax ; <- gotta align stack before calling farray_print
mov rdi,val
mov rsi,1 ; number of floats to print
extern farray_print
call farray_print
pop rax
ret

section .data
val: dd 1.234
```

(Try this in NetRun now!)

Here's how we'd modify that float using SSE instructions:
```asm
movss xmm0,DWORD[val]
addss xmm0,xmm0
movss DWORD[val],xmm0
push rax ; <- gotta align stack before calling farray_print
mov rdi,val
mov rsi,1 ; number of floats to print
extern farray_print
call farray_print
pop rax
ret

section .data
val: dd 1.234
```

Here's how we'd overwrite that float with the constant "pi" using the pre-SSE ancient "x87" floating point instructions:

```asm
fldpi
fstp DWORD[val]
push rax ; <- gotta align stack before calling farray_print
mov rdi,val
mov rsi,1 ; number of floats to print
extern farray_print
call farray_print
pop rax
ret

section .data
val: dd 1.234
```

And here's how we'd double the float with x87 instructions, by loading two copies and adding them:

```asm
fld DWORD[val] ; load one copy
fld DWORD[val] ; load a second copy
faddp  ; add the top copies (and pop)
fstp DWORD[val] ; store and pop
push rax ; <- gotta align stack before calling farray_print
mov rdi,val
mov rsi,1 ; number of floats to print
extern farray_print
call farray_print
pop rax
ret

section .data
val: dd 1.234
```

(Try this in NetRun now!)

Obvious question...

WHAT THE HECK IS HAPPENING HERE!?

## x87 Stack-Based Floating Point

Ever use an HP calculator?  Instead of punching "2 + 3", you type "2 <enter> 3 +".  The numbers get loaded onto the "operand stack", and the "+" key adds the two operands on top of the stack.  So instead of "2 + 3 + 4", you just load up 2, 3, and 4, then "+" and "+".

That's how they designed the old 387 math coprocessor, the first hardware floating point for x86.  Like a calculator.

So "fld" just loads a new value onto the "register stack" (totally unrelated to any other register, or the real stack in memory).
"fstp" stores the top value out to memory.

You can also refer to values on the stack by name.  In NASM, this is like "st0" for the top of the stack, "st1" for one down from that, "st2" for two down, all the way through "st7" for the last value (there are only 8 values in the register stack).  So I can duplicate the top value with "fld st0":
```asm
fld DWORD[val]
fld st0
faddp
fstp DWORD[val]
```

(Try this in NetRun now!)

Unlike "faddp", which pops after adding the top two values, plain "fadd" can manipulate existing values on the stack without popping.  For example, I can add the top value to itself like this:
```asm
fld DWORD[val]
fadd st0,st0
fstp DWORD[val]
```

(Try this in NetRun now!)

You can also dig deeper in the stack, here ignoring st1 and adding st0 and st2:
```asm
fldpi
fld1
fld DWORD[val]
fadd st0,st2  ; st0: val, st1: 1, st2: pi
fstp DWORD[val]
```

But try an operation where neither operand is the top of the stack, like "fadd st1,st2", and the assembler refuses:

`error: invalid combination of opcode and operands`

Why not?  They didn't have enough bits to specify an arbitrary source and arbitrary destination, so one value needs to be on the top of the stack for every operation.  This is easy for simple functions, but ends up taking lots of loads and stores for realistically long functions.  This shuffling is one big reason we don't use the old stack-based floating point anymore.

Curiously, saving bits in the instructions is very important for bytecode languages, like Java or .NET, so they typically use a stack-based arithmetic model instead of registers.  The Just-In-Time compiler can usually pick registers when generating machine code, though, so this doesn't cost much speed.

## Pre-SSE Float to Integer Conversions

The other big reason that they had to rebuild floating point with the SSE instructions was floating-point to integer conversion. Fortran- and C-derived languages always round floats to integers "toward zero": 3.7 rounds down to 3, and -3.7 rounds up to -3.  Arguably rounding toward the nearest integer would have been better: 3.7 would round up to 4, 3.4 would round down to 3.  For arithmetic operations, you normally have to round to the number of digits available.

But the old x87 only had one set of rounding-mode bits, used both for integer conversions (which should truncate toward zero) and ordinary arithmetic (which should round toward the nearest value).  This means the compiler has to generate the following insanely slow code to switch rounding modes around the "fistp" (Floating-point to Integer STore, with Pop):

```asm
fldpi  ; value to convert
fnstcw WORD [oldcw] ; store old rounding mode
mov    ax,WORD [oldcw]
mov    ah,0xc ; overwrite high bits (rounding mode)
mov    WORD [newcw],ax ; store to memory
fldcw  WORD [newcw] ; change rounding mode
fistp  DWORD [value] ; do conversion to memory
fldcw  WORD [oldcw] ; change back to original mode
mov eax,[value] ; load from memory
ret

section .data
oldcw: dw 0 ; storage for old control word
newcw: dw 0 ; storage for new control word
value: dd 0 ; place to put new value
```

(Try this in NetRun now!)

Ouch!  Even on a new machine, this takes over 13 nanoseconds, while an SSE "cvttss2si" only takes 3ns:
```asm
movss xmm0,[value]  ; load value to convert
cvttss2si eax,xmm0 ; convert it (with truncation: round toward zero)
ret

section .data
value: dd 3.14159
```

## BONUS: Fast Float-to-Int Conversion without SSE

Recall from our float bits lecture that floats are stored using 32 perfectly ordinary bits:

| Field | Size | Meaning |
|---|---|---|
| Sign | 1 bit | 0 for positive, 1 for negative |
| Exponent | 8 unsigned bits | 127 means 2^0, 137 means 2^10 |
| Fraction (or "Mantissa") | 23 bits | a binary fraction; don't forget the implicit leading 1! |
The sign is in the highest-order bit, the exponent in the next 8 bits, and the fraction in the low bits.

The correct way to see the bits inside a float is to use an "unholy union":
```cpp
union unholy_t { /* a union between a float and an integer */
public:
	float f;
	int i;
};
int foo(void) {
	unholy_t unholy;
	unholy.f=3.0; /* put in a float */
	return unholy.i; /* take out an integer */
}
```

(Try this in NetRun now!)

For example, we can use integer bitwise operations to zero out the float's sign bit, making a quite cheap floating-point absolute value operation:
```cpp
float val=-3.1415;
int foo(void) {
	unholy_t unholy;
	unholy.f=val; /* put in a negative float */
	unholy.i=unholy.i&0x7fFFffFF; /* mask off the float's sign bit */
	return unholy.f; /* now the float is positive! */
}
```

(Try this in NetRun now!)

Back before SSE, floating point to integer conversion in C++ was really really slow.  As we list above, the problem is that the same x86 FPU control word bits affect rounding both for float operations like addition and for float-to-int conversion.  Thus you've got to save the old control word out to memory, switch its rounding mode to integer, load the new control word, do the integer conversion, and finally load the original control word to resume normal operation.

But our unholy union to the rescue!  If you add a value like 1<<23 to a float, the floating-point hardware will round off all the bits after the decimal point, and shift the integer value of the float down into the low bits.  We can then extract those bits with the float-to-int union above, mask away the exponent, and we've sped up float-to-int conversion about six-fold.
```cpp
union unholy_t { /* a union between a float and an integer */
public:
	float f;
	int i;
};
float val=+3.1415;
int foo(void) {
	unholy_t unholy;
	unholy.f=val+(1<<23); /* scrape off the fraction bits with the weird constant */
	return unholy.i&0x7FffFF; /* mask off the float's sign and exponent bits */
}
```

(Try this in NetRun now!)

This "fast float-to-integer trick" has been independently discovered by many smart people, including:
- Chris Hecker, "Let's get to the (floating) point", which includes a good history of floating point for 3D game programming.
- Mike Herf, "know your FPU".
- Sree Kotay, "Fixing Floating Fast".
Thankfully, since Intel added the curious "cvttss2si" instruction to SSE, and "fisttp" to SSE3, this trick isn't needed on modern CPUs.  But it's still useful to understand it!