non-SSE floating point: stack based x87

CS 301 Lecture, Dr. Lawlor

As a reminder, here's how we print a float living out there in memory:
push rax ; <- gotta align stack before calling farray_print
mov rdi,val
mov rsi,1 ; number of floats to print
extern farray_print
call farray_print
pop rax
ret

section .data
val: dd 1.234

(Try this in NetRun now!)

Here's how we'd modify that float using SSE instructions:
movss xmm0,DWORD[val]
addss xmm0,xmm0
movss DWORD[val],xmm0

push rax ; <- gotta align stack before calling farray_print
mov rdi,val
mov rsi,1 ; number of floats to print
extern farray_print
call farray_print
pop rax
ret

section .data
val: dd 1.234

(Try this in NetRun now!)

Here's how we'd overwrite that float with the constant "pi" using the pre-SSE ancient "x87" floating point instructions:

fldpi
fstp DWORD[val]

push rax ; <- gotta align stack before calling farray_print
mov rdi,val
mov rsi,1 ; number of floats to print
extern farray_print
call farray_print
pop rax
ret

section .data
val: dd 1.234

(Try this in NetRun now!)

Here's how we'd load the float, and add it to itself:

fld DWORD[val] ; load one copy
fld DWORD[val] ; load a second copy
faddp ; add the top copies (and pop)
fstp DWORD[val] ; store and pop

push rax ; <- gotta align stack before calling farray_print
mov rdi,val
mov rsi,1 ; number of floats to print
extern farray_print
call farray_print
pop rax
ret

section .data
val: dd 1.234

(Try this in NetRun now!)

Obvious question...

WHAT THE HECK IS HAPPENING HERE!?

x87 Stack-Based Floating Point

Ever use an HP calculator?  Instead of punching "2 + 3", you type "2 <enter> 3 +".  The numbers get loaded onto the "operand stack", and the "+" key adds the two operands on top of the stack.  So instead of "2 + 3 + 4", you just load up 2, 3, and 4, then "+" and "+".

That's how they designed the old 8087 math coprocessor, the first hardware floating point for x86.  Like a calculator.

So "fld" just loads a new value onto the "register stack" (totally unrelated to any other register, or the real stack in memory).
"faddp" adds the two values on top of the register stack.
"fstp" stores the top value out to memory.

You can also refer to values on the stack by name.  In NASM, these are named "st0" for the top of the stack, "st1" for one down from that, "st2" for two down, all the way through "st7" for the last value (the register stack only holds 8 values).  So I can duplicate the top value with "fld st0":
fld DWORD[val]
fld st0
faddp
fstp DWORD[val]

(Try this in NetRun now!)

Unlike "faddp", which pops the top two values after adding them, plain "fadd" can manipulate existing values on the stack in place.  For example, I can add the top value to itself like this:
fld DWORD[val]
fadd st0,st0
fstp DWORD[val]

(Try this in NetRun now!)

You can also dig deeper into the stack; here we ignore st1 and add st2 into st0:
fldpi 
fld1
fld DWORD[val]
fadd st0,st2 ; st0: val, st1: 1, st2: pi
fstp DWORD[val]

(Try this in NetRun now!)

But you can't add st1 and st2: "fadd st1,st2" gives

error: invalid combination of opcode and operands

Why not?  They didn't have enough bits in the instruction to specify both an arbitrary source and an arbitrary destination, so one operand must be on the top of the stack for every operation.  This is easy for simple functions, but ends up taking lots of loads, stores, and swaps (via "fxch") for realistically long functions.  This shuffling is one big reason we don't use the old stack-based floating point anymore.
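
For example, here's a sketch of computing (a*b + c*d) / (a*b), where a, b, c, d, and result are hypothetical floats declared in .data.  Note how the first product has to be duplicated early, because only the top of the stack is freely usable:

fld DWORD[a]
fmul DWORD[b]      ; st0 = a*b
fld st0            ; duplicate the product, for the divide later
fld DWORD[c]
fmul DWORD[d]      ; st0 = c*d, st1 = a*b, st2 = a*b
faddp              ; st0 = a*b + c*d, st1 = a*b
fdivrp             ; st0 = (a*b + c*d) / (a*b); the reversed divide saves an fxch
fstp DWORD[result]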

Curiously, saving bits in the instructions is very important for bytecode languages, like Java or .NET, so they typically use a stack-based arithmetic model instead of registers.  The Just-In-Time compiler can usually pick registers when generating machine code, though, so this doesn't cost much speed.

Pre-SSE Float to Integer Conversions

The other big reason that they had to rebuild floating point with the SSE instructions was floating point to integer conversion.  Fortran- and C-derived languages always round floats to integers "toward zero" (truncation): 3.7 rounds down to 3, and -3.7 rounds up to -3.  Arguably rounding to the nearest integer would have been better: 3.7 would round up to 4, 3.4 would round down to 3.  And round-to-nearest is what you want for ordinary arithmetic operations, where you have to round results to the number of digits available.

But the old x87 only had one set of "rounding" bits, used both for integer conversions (which C requires to truncate toward zero), and ordinary arithmetic (which should round to the nearest value).  This means the compiler has to generate the following insanely slow code to switch rounding modes before the "fistp" (Floating-point to Integer STore, with Pop):

fldpi  ; value to convert

fnstcw WORD [oldcw] ; store old rounding mode
mov ax,WORD [oldcw]
mov ah,0xc ; overwrite high bits: rounding control = truncate
mov WORD [newcw],ax ; store to memory
fldcw WORD [newcw] ; change rounding mode
fistp DWORD [value] ; do conversion to memory
fldcw WORD [oldcw] ; change back to original mode

mov eax,[value] ; load from memory
ret

section .data
oldcw: dw 0 ; storage for old control word
newcw: dw 0 ; storage for new control word
value: dd 0 ; place to put new value

(Try this in NetRun now!)

Ouch!  Even on a new machine, this takes over 13 nanoseconds, while an SSE "cvttss2si" only takes 3ns:
movss xmm0,[value]  ; load value to convert
cvttss2si eax,xmm0 ; convert it (with truncation: round toward zero)
ret
section .data
value: dd 3.14159

(Try this in NetRun now!)

BONUS: Fast Float-to-Int Conversion without SSE

Recall from our float bits lecture that floats are stored using 32 perfectly ordinary bits:

Sign: 1 bit -- 0 for positive, 1 for negative.
Exponent: 8 unsigned bits -- stored with a bias, so 127 means 2^0, and 137 means 2^10.
Fraction (or "Mantissa"): 23 bits -- a binary fraction.  Don't forget the implicit leading 1!

The sign is in the highest-order bit, the exponent in the next 8 bits, and the fraction in the low bits.

The correct way to see the bits inside a float is to use an "unholy union":
union unholy_t { /* a union between a float and an integer */
public:
    float f;
    int i;
};
int foo(void) {
    unholy_t unholy;
    unholy.f=3.0; /* put in a float */
    return unholy.i; /* take out an integer */
}

(Try this in NetRun now!)
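
As a worked example, 3.0 is 1.5 x 2^1: the sign bit is 0, the exponent field is 127+1 = 128 = 0x80, and the fraction is 0x400000 (the 0.5 left after the implicit leading 1).  Packed together, that's 0x40400000, so the function above returns 1077936128.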

For example, we can use integer bitwise operations to zero out the float's sign bit, making a quite cheap floating-point absolute value operation:
float val=-3.1415;
int foo(void) {
    unholy_t unholy;
    unholy.f=val; /* put in a negative float */
    unholy.i=unholy.i&0x7fFFffFF; /* mask off the float's sign bit */
    return unholy.f; /* now the float is positive! */
}

(Try this in NetRun now!)
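
The same trick works directly in assembly, where the integer "and" instruction is perfectly happy to operate on a float's bits in memory.  Here's a minimal sketch, reusing the val and farray_print boilerplate from the examples above:

and DWORD[val],0x7fFFffFF ; mask off the float's sign bit: val is now positive

push rax ; <- gotta align stack before calling farray_print
mov rdi,val
mov rsi,1 ; number of floats to print
extern farray_print
call farray_print
pop rax
ret

section .data
val: dd -3.1415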

Back before SSE, floating point to integer conversion in C++ was really, really slow.  As shown above, the problem is that the same x87 FPU control word bits affect rounding both for float operations like addition and for float-to-int conversion.  Thus you've got to save the old control word out to memory, switch its rounding mode to truncation, load the new control word, do the integer conversion, and finally reload the original control word to resume normal operation.

But our unholy union to the rescue!  If you add a value like 1<<23 to a float (assuming the float is in the range 0 to 2^23), the floating-point hardware will round off all the bits after the decimal point, and shift the integer value of the float down into the low bits of the fraction.  We can then extract those bits with the float-to-int union above, mask away the sign and exponent, and we've sped up float-to-int conversion by about 6 fold.
union unholy_t { /* a union between a float and an integer */
public:
    float f;
    int i;
};
float val=+3.1415;
int foo(void) {
    unholy_t unholy;
    unholy.f=val+(1<<23); /* scrape off the fraction bits with the weird constant */
    return unholy.i&0x7FffFF; /* mask off the float's sign and exponent bits */
}

(Try this in NetRun now!)

This "fast float-to-integer trick" has been independently discovered by many smart people, including:
Thankfully, since Intel added the curious "cvttss2si" instruction to SSE, and "fisttp" to SSE3, this trick isn't needed on modern CPUs.  But it's still useful to understand it!
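
For comparison, here's a sketch of SSE3's "fisttp", which always truncates regardless of the control word's rounding mode; ivalue is a scratch integer label I'm adding:

fld DWORD[value]     ; load the float onto the x87 register stack
fisttp DWORD[ivalue] ; store as a truncated integer and pop: no control word dance
mov eax,[ivalue]     ; load the integer result
ret

section .data
value: dd 3.14159
ivalue: dd 0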