Bitwise Floating-Point Tricks

CS 301 Lecture, Dr. Lawlor

Recall from our float bits lecture that floats are stored using 32 perfectly ordinary bits:

Sign: 1 bit--
  0 for positive
  1 for negative
Exponent: 8 unsigned bits--
  127 means 2^0
  137 means 2^10
Fraction (or "Mantissa"): 23 bits-- a binary fraction.
  Don't forget the implicit leading 1!

The sign is in the highest-order bit, the exponent in the next 8 bits, and the fraction in the low bits.

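A common way to see the bits inside a float is to use an "unholy union" (reading a union member other than the one last written is formally undefined in standard C++, but GCC and most other compilers support it):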
union unholy_t { /* a union between a float and an integer */
public:
float f;
int i;
};
int foo(void) {
unholy_t unholy;
unholy.f=3.0; /* put in a float */
return unholy.i; /* take out an integer */
}

(Try this in NetRun now!)
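
To connect the raw integer back to the sign/exponent/fraction layout, here is a minimal sketch that splits a float into its three fields with shifts and masks. The names float_fields and decode_float are made up for illustration, and it copies the bits with std::memcpy instead of the union, which is the strictly portable way to do the same thing. It assumes float is a 32-bit IEEE value.

```cpp
#include <cstdint>
#include <cstring>

/* Hypothetical helper type: the three bit fields of a 32-bit float. */
struct float_fields {
    std::uint32_t sign;     /* 1 bit: 0 for positive, 1 for negative */
    std::uint32_t exponent; /* 8 bits, stored with a bias of 127 */
    std::uint32_t fraction; /* 23 bits; the implicit leading 1 is not stored */
};

/* Hypothetical helper: pull a float apart with shifts and masks. */
float_fields decode_float(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits); /* copy out the raw bits */
    float_fields out;
    out.sign     = bits >> 31;           /* highest-order bit */
    out.exponent = (bits >> 23) & 0xFF;  /* next 8 bits */
    out.fraction = bits & 0x7FFFFF;      /* low 23 bits */
    return out;
}

int foo(void) {
    float_fields f = decode_float(3.0f); /* 3.0 is 1.5 * 2^1 */
    return (int)f.exponent - 127;        /* remove the bias: returns 1 */
}
```

For 3.0 the stored exponent is 128 (meaning 2^1), and the fraction field is 0x400000, the ".5" in 1.5.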

For example, we can use integer bitwise operations to zero out the float's sign bit, making a quite cheap floating-point absolute value operation:
float val=-3.1415;
int foo(void) {
unholy_t unholy;
unholy.f=val; /* put in a negative float */
unholy.i=unholy.i&0x7fFFffFF; /* mask off the float's sign bit */
return unholy.f; /* now the float is positive! (foo returns int, so this comes back as 3) */
}

(Try this in NetRun now!)

Back before SSE, floating-point to integer conversion in C++ was really, really slow.  The problem is that the same x86 FPU control word bits affect rounding both for float operations like addition (which normally round to nearest) and for float-to-int conversion (which C++ requires to truncate toward zero).  For example, this float-to-int code takes 55ns(!) on a pre-SSE Pentium III:
float val=+3.1415;
int foo(void) {
return (int)(val+0.0001);
}

(Try this in NetRun now!)

The problem is evident in the assembly code--you've got to save the old control word out to memory, switch its rounding mode to truncation (round toward zero), load the new control word, do the integer conversion, and finally load the original control word to resume normal operation.
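
The same save/switch/convert/restore dance can be sketched in portable C++ with the <cfenv> rounding-mode calls. This is only an illustration of the sequence, not the x87 assembly the compiler actually generated; the function name is made up, and the volatile variables are just there to discourage the compiler from folding or reordering the conversion.

```cpp
#include <cfenv>
#include <cmath>

/* Hypothetical helper: truncate by temporarily switching the rounding mode. */
int truncate_via_rounding_mode(float val) {
    volatile float v = val;                 /* keep the conversion at run time */
    const int old_mode = std::fegetround(); /* save the current rounding mode */
    std::fesetround(FE_TOWARDZERO);         /* switch to truncation */
    volatile long result = std::lrint(v);   /* convert using the current mode */
    std::fesetround(old_mode);              /* restore normal operation */
    return (int)result;
}
```

Every conversion pays for two rounding-mode changes, which is the overhead the paragraph above describes.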

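But our unholy union to the rescue!  If you add a value like 1<<23 to a float, the floating-point hardware will round off all the bits after the binary point, and shift the integer value of the float down into the low bits of the fraction field.  We can then extract those bits with the float-to-int union above, mask away the exponent, and we've sped up float-to-int conversion by about sixfold.  Two caveats: this only works for non-negative values below 1<<23, and in the default rounding mode it rounds to nearest instead of truncating (so 3.9 becomes 4, not 3).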
union unholy_t { /* a union between a float and an integer */
public:
float f;
int i;
};
float val=+3.1415;
int foo(void) {
unholy_t unholy;
unholy.f=val+(1<<23); /* scrape off the fraction bits with the weird constant */
return unholy.i&0x7FffFF; /* mask off the float's sign and exponent bits */
}

(Try this in NetRun now!)
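
To see why the weird constant works, it helps to look at the bits of the sum before masking. The sketch below uses made-up function names and std::memcpy in place of the union, and it assumes a 32-bit IEEE float in the default round-to-nearest mode. The assert spells out the input range the trick relies on.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

/* Hypothetical helper: raw bits of val + 2^23. */
std::uint32_t shifted_bits(float val) {
    float shifted = val + 8388608.0f; /* 8388608 = 1<<23; fraction bits get rounded away */
    std::uint32_t bits;
    std::memcpy(&bits, &shifted, sizeof bits);
    return bits;
}

/* Hypothetical helper: the add-and-mask conversion, with its precondition checked. */
int fast_float_to_int(float val) {
    assert(val >= 0.0f && val <= 8388607.0f); /* sum must stay below 2^24 */
    return (int)(shifted_bits(val) & 0x7FFFFF); /* keep only the 23 fraction bits */
}

int foo(void) {
    return fast_float_to_int(3.1415f); /* sum is 8388611.0, bits 0x4B000003, returns 3 */
}
```

Between 2^23 and 2^24 a float has no room left for bits after the binary point, so the fraction field holds exactly the rounded integer. Note that 3.9 comes out as 4 and 2.5 comes out as 2 (ties go to the even value), where a C++ cast would give 3 and 2.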

This "fast float-to-integer trick" has been independently discovered by many smart people.

Thankfully, since Intel added the curious "cvttss2si" instruction to SSE, and "fisttp" to SSE3, this trick isn't needed on modern CPUs.  But it's still useful to understand it!
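
For comparison, here is a minimal sketch of asking for the SSE truncating conversion directly, using the _mm_cvttss_si32 intrinsic from <xmmintrin.h>. The function name and the HAVE_SSE_SKETCH macro are made up, and the plain-cast fallback is only there so the code still builds on non-x86 machines. On an SSE-capable target a plain (int) cast will typically compile to the same instruction, but that depends on the compiler and flags.

```cpp
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_SSE_SKETCH 1 /* hypothetical macro, just for this sketch */
#endif

/* Hypothetical helper: truncating float-to-int without touching the rounding mode. */
int truncate_with_sse(float val) {
#ifdef HAVE_SSE_SKETCH
    return _mm_cvttss_si32(_mm_set_ss(val)); /* maps to cvttss2si */
#else
    return (int)val; /* fallback: ordinary C++ truncation */
#endif
}

int foo(void) {
    return truncate_with_sse(3.1415f);
}
```

Unlike the add-and-mask trick, this truncates toward zero and handles negative inputs.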