Floating Point Computation and x86 Assembly

CS 441 Lecture, Dr. Lawlor

Ordinary integers can only represent integral values. "Floating-point numbers" can represent non-integral values. This is useful for engineering, science, statistics, graphics, and any time you need to represent numbers from the real world, which are rarely integral!

Floats store numbers in an odd way--they're really storing the number in scientific notation, like
x = + 3.785746 * 10⁵
Note that:

You only need one bit to represent the sign--plus or minus.
The exponent's just an integer, so you can store it as an integer.
The 3.785746 part can be stored as the integer 3785746 (at least as long as you can figure out where the decimal point goes!)

Scientific notation is designed to be compatible with slide rules (here's a circular slide rule demo); slide rules are basically a log table starting at 1. This works because log(1) = 0, and log(a) + log(b) = log(ab). But slide rules only give you the mantissa; you need to figure out the exponent yourself. The "order of magnitude" guess that engineers (and I) like so much is just a calculation using zero significant digits--no mantissa, all exponent.

Normalized Numbers

Scientific notation can represent the same number in several different ways:
x = + 3.785746 * 10⁵ = + 0.3785746 * 10⁶ = + 0.003785746 * 10⁷ = + 37.85746 * 10⁴

It's common to "normalize" a number in scientific notation so that:

There's exactly one digit to the left of the decimal point.
And that digit ain't zero.

This means the 10⁵ version is the "normal" way to write the number above.

In binary, a "normalized" number *always* has a 1 at the left of the decimal point (if it ain't zero, it's gotta be one). So there's no reason to even store the 1; you just know it's there!

(Note that there are also "denormalized" numbers, like 0.0, that don't have a leading 1. This is how zero is represented--there's an implicit leading 1 only if the exponent field is nonzero, an implicit leading 0 if the exponent field is zero...)

Float as Bits

Floats represent continuous values. But they do it using discrete bits.

A "float" (as defined by IEEE Standard 754) consists of three bitfields:

Sign	Exponent	Fraction (or "Mantissa")
1 bit-- 0 for positive 1 for negative	8 unsigned bits-- 127 means 2⁰ 137 means 2¹⁰	23 bits-- a binary fraction. Don't forget the implicit leading 1!

The sign is in the highest-order bit, the exponent in the next 8 bits, and the fraction in the remaining bits.

The hardware interprets a float as having the value:

value = (-1) ^sign * 2 ^{(exponent-127)}* 1.fraction

Note that the mantissa has an implicit leading binary 1 applied (unless the exponent field is zero, when it's an implicit leading 0; a "denormalized" number).

For example, the value "8" would be stored with sign bit 0, exponent 130 (==3+127), and mantissa 000... (without the leading 1), since:

8 = (-1) ⁰ * 2 ^(130-127)* 1.0000....

You can actually dissect the parts of a float using a "union" and a bitfield like so:

/* IEEE floating-point number's bits:  sign  exponent   mantissa */
struct float_bits {
	unsigned int fraction:23; /**< Value is binary 1.fraction ("mantissa") */
	unsigned int exp:8; /**< Value is 2^(exp-127) */
	unsigned int sign:1; /**< 0 for positive, 1 for negative */
};

/* A union is a struct where all the fields *overlap* each other */
union float_dissector {
	float f;
	float_bits b;
};

float_dissector s;
s.f=8.0;
std::cout<<s.f<<"= sign "<<s.b.sign<<" exp "<<s.b.exp<<"  fract "<<s.b.fraction<<"\n";
return 0;

(Executable NetRun link)

There are several different sizes of floating-point types:

C Datatype	Size	Approx. Precision	Approx. Range	Exponent Bits	Fraction Bits	+-1 range
float	4 bytes (everywhere)	1.0x10^-7	10³⁸	8	23	2²⁴
double	8 bytes (everywhere)	2.0x10^-15	10³⁰⁸	11	52	2⁵³
long double	12-16 bytes (if it exists)	2.0x10^-20	10⁴⁹³²	15	64	2⁶⁵

Nowadays floats have roughly the same performance as integers: addition takes a little over a nanosecond (slightly slower than integer addition); multiplication takes a few nanoseconds; and division takes a dozen or more nanoseconds. That is, floats are now cheap, and you can consider using floats for all sorts of stuff--even when you don't care about fractions.

Roundoff in Representation

0.1 decimal is an infinitely repeating pattern in binary (0.0(0011), with 0011 repeating). This means multiplying by some *finite* pattern to approximate 0.1 is only an approximation of really dividing by the integer 10.0. The exact difference is proportional to the precision of the numbers and the size of the input data:

for (int i=1;i<1000000000;i*=10) {
	double mul01=i*0.1;
	double div10=i/10.0;
	double diff=mul01-div10;
	std::cout<<"i="<<i<<"  diff="<<diff<<"\n";
}

(executable NetRun link)

On my P4, this gives:

i=1  diff=5.54976e-18
i=10  diff=5.55112e-17
i=100  diff=5.55112e-16
i=1000  diff=5.55112e-15
i=10000  diff=5.55112e-14
i=100000  diff=5.55112e-13
i=1000000  diff=5.54934e-12
i=10000000  diff=5.5536e-11
i=100000000  diff=5.54792e-10
Program complete.  Return 0 (0x0)

That is, there's a factor of 10^-15 difference between double-precision 0.1 and the true 1/10!

Roundoff in Arithmetic

They're funny old things, floats. The fraction part only stores so much precision; further bits are lost. For example, in reality,
1.2347654 * 10⁴ = 1.234* 10⁴ + 7.654* 10⁰
But to three decimal places,
1.234 * 10⁴ = 1.234* 10⁴ + 7.654* 10⁰
which is to say, adding a tiny value to a great big value might not change the great big value at all, because the tiny value gets lost when rounding off to 3 places. This "roundoff" has implications.

For example, adding one repeatedly will eventually stop doing anything:

float f=0.73;
while (1) {
	volatile float g=f+1;
	if (g==f) {
		printf("f+1 == f  at f=%.3f, or 2^%.3f\n",
			f,log(f)/log(2.0));
		return 0;
	}
	else f=g;
}

(executable NetRun link)
Recall that for integers, adding one repeatedly will *never* give you the same value--eventually the integer will wrap around, but it won't just stop moving like floats!

For another example, floating-point arithmetic isn't "associative"--if you change the order of operations, you change the result (up to roundoff):
    1.2355308 * 10⁴ = 1.234* 10⁴ + (7.654* 10⁰ + 7.654* 10⁰)
    1.2355308 * 10⁴ = (1.234* 10⁴ + 7.654* 10⁰) + 7.654* 10⁰
In other words, parenthesis don't matter if you're computing the exact result. But to three decimal places,
    1.235 * 10⁴ = 1.234* 10⁴ + (7.654* 10⁰ + 7.654* 10⁰)
    1.234 * 10⁴ = (1.234* 10⁴ + 7.654* 10⁰) + 7.654* 10⁰
In the first line, the small values get added together, and together they're enough to move the big value. But separately, they splat like bugs against the windshield of the big value, and don't affect it at all!

double lil=1.0;
double big=pow(2.0,64);
printf(" big+(lil+lil) -big = %.0f\n", big+(lil+lil) -big);
printf("(big+lil)+lil  -big = %.0f\n",(big+lil)+lil  -big);

(executable NetRun link)

float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
for (int i=0;i<1000;i++)
	windshield += gnats;

if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";

(executable NetRun link)

In fact, if you've got a bunch of small values to add to a big value, it's more roundoff-friendly to add all the small values together first, then add them all to the big value:

float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
volatile float gnatcup=0.0;
for (int i=0;i<1000;i++)
	gnatcup += gnats;
windshield+=gnatcup; /* add all gnats to the windshield at once */

if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";

(executable NetRun link)

Roundoff is very annoying, but it doesn't matter if you don't care about exact answers, like in simulation (where "exact" means the same as the real world, which you'll never get anyway) or games.

Denormal Numbers

A denormal (tiny) number doesn't have an implicit leading 1; it's got an implicit leading 0. This means the ciruitry for doing arithmetic on denormal numbers doesn't look much like the circuitry for doing arithmetic on ordinary normalized numbers. On modern machines, there *is* no circuitry for computing on denormal numbers--instead, the hardware traps out to software when it hits a denormal value.

This would be fine, except software is way slower than hardware--about 25x slower on the NetRun machine!

float f=pow(2,-128); // denormal

int foo(void) {
	f*=1.00001; //<- you can do almost any operation here...
	return 0;
}

(executable NetRun link)

As written, this takes 328ns per execution of foo, which is crazy slow.

If you initialize f to, say, 3.0 (or any non-denormal value), this takes like 13ns per execution, which is reasonable.

That is, floating-point code can take absurdly longer when computing denormals (infinities have the same problem). Denormals can have a huge impact on the performance of real code--I've written a paper on this.

The easiest way to fix denormals is to round them off to zero--one trick for doing this is to just add a big value ("big" compared to a denormal can be like 1.0e-10) and then subtract it off again!

Floating-Point Assembly

On many CPUs, floating-point values are usually stored in special "floating-point registers", and are added, subtracted, etc with special "floating-point instructions", but other than the name these registers and instructions are exactly analogous to regular integer registers and instructions. For example, the integer PowerPC assembly code to add registers 1 and 2 into register 3 is "add r3,r1,r2"; the floating-point code to add floating-point registers 1 and 2 into floating-point register 3 is "fadd fr3,fr1,fr2".

x86 is not like that.

The problem is that the x86 instruction set wasn't designed with floating-point in mind; they added floating-point instructions to the CPU later (with the 8087, a separate chip that handled all floating-point instructions). Unfortunately, there weren't many unused opcode bytes left, and (being the 1980's, when bytes were expensive) the designers really didn't want to make the instructions longer. So instead of the usual instructions like "add register A to register B", x86 floating-point has just "add", which saves the bits that would be needed to specify the source and destination registers!

But the question is, what the heck are you adding? The answer is the "top two values on the floating-point register stack". That's not "the stack" (the memory area used by function calls), it's a separate set of values totally internal to the CPU's floating-point hardware. There are various load functions that push values onto the floating-point register stack, and most of the arithmetic functions read from the top of the floating-point register stack. So to compute stuff, you load the values you want to manipulate onto the floating-point register stack, and then use some arithmetic instructions.

For example, to add together the three values a, b, and c, you'd "load a; load b; add; load c; add;". Or, you could "load a; load b; load c; add; add;". If you've ever used an HP calculator, or written Postscript or Forth code, you've seen this "Reverse Polish Notation".

x86 Floating-Point in Practice

Here's what this looks like. The whole bottom chunk of code just prints the float on the top of the x86 register stack, with the assembly equivalent of the C code: printf("Yo! Here's our float: %f\n",f);

fldpi ; Push "pi" onto floating-point stack

sub esp,8 ; Make room on the stack for an 8-byte double
fstp QWORD [esp]; Push printf's double parameter onto the stack
push my_string ; Push printf's string parameter (below)
extern printf
call printf  ; Print string
add esp,12    ; Clean up stack

ret ; Done with function

my_string: db "Yo!  Here's our float: %f",0xa,0

(Try this in NetRun now!)

There are lots of useful floating-point instructions:

Assembly	Description
fld1	Pushes into the floating-point registers the constant 1.0
fldz	Pushes into the floating-point registers the constant 0.0
fldpi	Pushes the constant pi. (Try this in NetRun now!)
fld DWORD [eax]	Pushes into the floating-point registers the 4-byte "float" loaded from memory at address eax. This is how most constants get loaded into the program. (Try this in NetRun now!)
fild DWORD [eax]	Pushes into the floating-point registers the 4-byte "int" loaded from memory at address eax.
fld QWORD [eax]	Pushes an 8-byte "double" loaded from address eax. (Try this in NetRun now!)
fld st0	Duplicates the top float, so there are now two copes of it. (Try this in NetRun now!)
fstp DWORD [eax]	Pops the top floating-point value, and stores it as a "float" to address eax.
fst DWORD [eax]	Reads the top floating-point value and stores it as a "float" to address eax. This doesn't change the value stored on the floating-point stack.
fstp QWORD [eax]	Pops the top floating-point value, and stores it as a "double" to address eax.
faddp	Add the top two values, pushes the result. (Try this in NetRun now!)
fsubp	Subtract the two values, pushes the result. Note "fld A; fld B; fsubp;" computes A-B. (Try this in NetRun now!) There's also a "fsubrp" that subtracts in the opposite order (computing B-A).
fmulp	Multiply the top two values.
fdivp	Divide the top two values. Note "fld A; fld B; fdivp;" computes A/B. (Try this in NetRun now!) There's also a "fdivrp" that divides in the opposite order (computing B/A).
fabs	Take the absolute value of the top floating-point value.
fsqrt	Take the square root of the top floating-point value.
fsin	Take the sin() of the top floating-point value, treated as radians. (Try this in NetRun now!)

Remember, "stack" here means the floating-point register stack, not the memory area used for passing parameters and such.

In general, the "p" instructions pop a value from the floating-point stack.

The non-"p" instructions don't. For example, there isn't a "fsinp" instruction, since sin only takes one argument, so the stack stays the same height after doing a sin().

x86 has quite a few really bizarre-sounding floating-point instructions. Intel's Reference Volume 2 has the complete list (Section 3, alphabetized under "f"). The "+1" and "-1" versions are designed to decrease roundoff, by shifting the input to the most sensitive region.

F2XM1	2^x - 1
FYL2X	y*log₂(x), where x is on top of the floating-point stack.
FYL2XP1	y*log₂(x+1), where x is on top
FCHS	-x
FSINCOS	Computes both sin(x) and cos(x). cos(x) ends up on top.
FPATAN	atan2(a/b), where b is on top
FPREM	fmod(a,b), where b is on top
FRNDINT	Round to the nearest integer
FXCH	Swap the top two values on the floating-point stack