Floats represent continuous values. But they do it using discrete bits.

A "float" (as defined by IEEE Standard 754) consists of three bitfields:

Sign | Exponent | Fraction (or "Mantissa") |

1 bit: 0 for positive, 1 for negative | 8 unsigned bits: 127 means 2^{0}, 137 means 2^{10} | 23 bits: a binary fraction. Don't forget the implicit leading 1! |

The hardware interprets a float as having the value:

value = (-1)^{sign} * 2^{(exponent-127)} * 1.fraction

Note that the mantissa has an implicit leading binary 1 applied (unless the exponent field is zero, when it's an implicit leading 0; a "denormalized" number).

For example, the value "8" would be stored with sign bit 0, exponent 130 (==3+127), and mantissa 000... (without the leading 1), since:

8 = (-1)^{0} * 2^{(130-127)} * 1.000...

You can actually dissect the parts of a float using a "union" and a bitfield like so:

/* IEEE floating-point number's bits: sign exponent mantissa */
#include <iostream>

/* Bitfield order is compiler-specific; this layout assumes low bits come first (e.g., gcc on x86) */
struct float_bits {
    unsigned int fraction:23; /**< Value is binary 1.fraction ("mantissa") */
    unsigned int exp:8; /**< Value is 2^(exp-127) */
    unsigned int sign:1; /**< 0 for positive, 1 for negative */
};

/* A union is a struct where all the fields *overlap* each other */
union float_dissector {
    float f;
    float_bits b;
};

int main() {
    float_dissector s;
    s.f=8.0;
    std::cout<<s.f<<"= sign "<<s.b.sign<<" exp "<<s.b.exp<<" fract "<<s.b.fraction<<"\n";
    return 0;
}

In addition to the 32-bit "float", there are several different sizes of floating-point types:

C Datatype | Size | Approx. Precision | Approx. Range | Exponent Bits | Fraction Bits | +-1 range |

float | 4 bytes (everywhere) | 1.0x10^{-7} | 10^{38} | 8 | 23 | 2^{24} |

double | 8 bytes (everywhere) | 2.0x10^{-15} | 10^{308} | 11 | 52 | 2^{53} |

long double | 12-16 bytes (if it even exists) | 2.0x10^{-20} | 10^{4932} | 15 | 64 (the leading 1 is stored explicitly) | 2^{64} |

Nowadays floats have roughly the same performance as integers: addition takes about two nanoseconds (slightly slower than integer addition); multiplication takes a few nanoseconds; and division takes a dozen or more nanoseconds. That is, floats are now cheap, and you can consider using floats for all sorts of stuff--even when you don't care about fractions! The advantages of using floats are:

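- Floats can store fractional numbers.
- Floats never wrap around on overflow; a result that is too big becomes "infinity" instead.
- "double" has more precision than a 32-bit "int": 53 bits of mantissa versus 32 bits.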

x86 floating-point does not work the way ordinary register-to-register arithmetic does.

The problem is that the x86 instruction set wasn't designed with floating-point in mind; they added floating-point instructions to the CPU later (with the 8087, a separate chip that handled all floating-point instructions). Unfortunately, there weren't many unused opcode bytes left, and (being the 1980's, when bytes were expensive) the designers really didn't want to make the instructions longer. So instead of the usual instructions like "add register A to register B", x86 floating-point has just "add", which saves the bits that would be needed to specify the source and destination registers!

But the question is, what the heck are you adding? The answer is the "top two values on the floating-point register stack". That's not "the stack" (the memory area used by function calls); it's a separate set of values totally internal to the CPU's floating-point hardware. There are various load instructions that push values onto the floating-point register stack, and most of the arithmetic instructions read from the top of the floating-point register stack. So to compute stuff, you load the values you want to manipulate onto the floating-point register stack, and then use some arithmetic instructions.

For example, to add together the three values a, b, and c, you'd "load a; load b; add; load c; add;". Or, you could "load a; load b; load c; add; add;". If you've ever used an HP calculator, or written PostScript or Forth code, you've seen this "Reverse Polish Notation".

fldpi ; Push "pi" onto floating-point stack
sub esp,8 ; Make room on the stack for an 8-byte double
fstp QWORD [esp] ; Push printf's double parameter onto the stack
push my_string ; Push printf's string parameter (below)
extern printf
call printf ; Print string
add esp,12 ; Clean up stack
ret ; Done with function

my_string: db "Yo! Here's our float: %f",0xa,0

There are lots of useful floating-point instructions:

Assembly | Description |

fld1 | Pushes into the floating-point registers the constant 1.0 |

fldz | Pushes into the floating-point registers the constant 0.0 |

fldpi | Pushes the constant pi. |

fld DWORD [eax] | Pushes into the floating-point registers the 4-byte "float" loaded from memory at address eax. This is how most constants get loaded into the program. |

fild DWORD [eax] | Pushes into the floating-point registers the 4-byte "int" loaded from memory at address eax. |

fld QWORD [eax] | Pushes an 8-byte "double" loaded from address eax. |

fld st0 | Duplicates the top float, so there are now two copies of it. |

fstp DWORD [eax] | Pops the top floating-point value, and stores it as a "float" to address eax. |

fst DWORD [eax] | Reads the top floating-point value and stores it as a "float" to address eax. This doesn't change the value stored on the floating-point stack. |

fstp QWORD [eax] | Pops the top floating-point value, and stores it as a "double" to address eax. |

faddp | Pops the top two values, adds them, and pushes the result. |

fsubp | Pops the top two values, subtracts them, and pushes the result. Note "fld A; fld B; fsubp;" computes A-B. There's also a "fsubrp" that subtracts in the opposite order (computing B-A). |

fmulp | Pops the top two values, multiplies them, and pushes the result. |

fdivp | Pops the top two values, divides them, and pushes the result. Note "fld A; fld B; fdivp;" computes A/B. There's also a "fdivrp" that divides in the opposite order (computing B/A). |

fabs | Take the absolute value of the top floating-point value. |

fsqrt | Take the square root of the top floating-point value. |

fsin | Take the sin() of the top floating-point value, treated as radians. |

In general, the "p" instructions pop a value from the floating-point stack. The non-"p" instructions don't. For example, there isn't a "fsinp" instruction, since sin only takes one argument, so the stack stays the same height after doing a sin().

x86 has quite a few really bizarre-sounding floating-point instructions. Intel's Reference Volume 2 has the complete list (Section 3, alphabetized under "f"). The "+1" and "-1" versions are designed to decrease roundoff, by shifting the input to the most sensitive region.

F2XM1 | 2^{x} - 1 |

FYL2X | y*log_{2}(x), where x is on top of the floating-point stack. |

FYL2XP1 | y*log_{2}(x+1), where x is on top |

FCHS | -x |

FSINCOS | Computes *both* sin(x) and cos(x). cos(x) ends up on top. |

FPATAN | atan2(a,b), where b is on top |

FPREM | fmod(a,b), where a is on top |

FRNDINT | Round to the nearest integer |

FXCH | Swap the top two values on the floating-point stack |