Bits in Floating-Point Numbers

CS 301 Lecture, Dr. Lawlor

Ordinary integers can only represent integral values. "Floating-point numbers" can represent non-integral values. This is useful for engineering, science, statistics, graphics, and any time you need to represent numbers from the real world, which are rarely integral!

In binary, you can represent a non-integer like "two and three-eighths" as "10.011". That is, there's:

a 1 in the 2's place (2=2¹)
a 0 in the 1's place (1=2⁰)
a 0 in the 1/2's place, just beyond the "binary point" (1/2=2^-1),
a 1 in the 1/4's place (1/4=2^-2), and
a 1 in the 1/8's place (1/8=2^-3)

for a total of two plus 1/4 plus 1/8, or "2+3/8". Note that this is a natural measurement in carpenter's fractional inches, but it's a weird unnatural thing in metric-style decimal inches. That is, fractions that are (negative) powers of two have a nice binary representation, but look weird in decimal (1/16 = 0.0001_base2 = 0.0625_base10). Conversely, short decimal numbers have a nice decimal representation, but often look weird as a binary fraction (0.2_base10 = 0.001100110011..._base2).

Normalized Numbers

In C++, "float" and "double" store numbers in an odd way--they're really storing the number in scientific notation, like
x = + 3.785746 * 10⁵
Note that:

You only need one bit to represent the sign--plus or minus.
The exponent's just an integer, so you can store it as an integer.
The 3.785746 part, called the "mantissa" or just "fraction" part, can be stored as the integer 3785746 (at least as long as you can figure out where the decimal point goes!)

Scientific notation is designed to be compatible with slide rules (here's a circular slide rule demo); slide rules are basically a log table starting at 1. This works because log(1) = 0, and log(a) + log(b) = log(ab). But slide rules only give you the mantissa; you need to figure out the exponent yourself. The "order of magnitude" guess that engineers (and I) like so much is just a calculation using zero significant digits--no mantissa, all exponent.

Scientific notation can represent the same number in several different ways:
x = + 3.785746 * 10⁵ = + 0.3785746 * 10⁶ = + 0.03785746 * 10⁷ = + 37.85746 * 10⁴

It's common to "normalize" a number in scientific notation so that:

There's exactly one digit to the left of the decimal point.
And that digit ain't zero.

This means the 10⁵ version above is the "normal" way to write the number above.

In binary, a "normalized" number *always* has a 1 at the left of the decimal point (if it ain't zero, it's gotta be one). So sometimes there's no reason to even store the 1; you just know it's there!

(Note that there are also "denormalized" numbers, like 0.0, that don't have a leading 1. This is how zero is represented--there's an implicit leading 1 only if the exponent field is nonzero, an implicit leading 0 if the exponent field is zero...)

Roundoff in Arithmetic

They're funny old things, floats. The fraction part (mantissa) only stores so much precision; further bits are lost. For example, in reality,
    1.234* 10⁴ + 7.654* 10⁰= 1.2347654 * 10⁴
But if we only keep three decimal places,
    1.234* 10⁴ + 7.654* 10⁰ = 1.234 * 10⁴
which is to say, adding a tiny value to a great big value might not change the great big value at all, because the tiny value gets lost when rounding off to 3 places.   To avoid this "roundoff error", when you're doing arithmetic by hand, people recommend keeping lots of digits, and only rounding once, at the end. But for a given value of "lots of digits", did you keep enough?

For example, on a real computer adding one repeatedly will eventually stop doing anything:

float f=0.73;
while (1) {
	volatile float g=f+1;
	if (g==f) {
		printf("f+1 == f  at f=%.3f, or 2^%.3f\n",
			f,log(f)/log(2.0));
		return 0;
	}
	else f=g;
}

(executable NetRun link)
Recall that for integers, adding one repeatedly will *never* give you the same value--eventually the integer will wrap around, but it won't just stop moving like floats!

For another example, floating-point arithmetic isn't "associative"--if you change the order of operations, you change the result (up to roundoff):
    1.2355308 * 10⁴ = 1.234* 10⁴ + (7.654* 10⁰ + 7.654* 10⁰)
    1.2355308 * 10⁴ = (1.234* 10⁴ + 7.654* 10⁰) + 7.654* 10⁰
In other words, parentheses don't matter if you're computing the exact result. But to three decimal places,
    1.235 * 10⁴ = 1.234* 10⁴ + (7.654* 10⁰ + 7.654* 10⁰)
    1.234 * 10⁴ = (1.234* 10⁴ + 7.654* 10⁰) + 7.654* 10⁰
In the first line, the small values get added together, and together they're enough to move the big value. But separately, they splat like bugs against the windshield of the big value, and don't affect it at all!

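double lil=1.0;
double big=pow(2.0,53); /* at 2^53, neighboring doubles are 2.0 apart */
volatile double together=big+(lil+lil); /* volatile forces rounding to double precision */
volatile double separate=big+lil;
separate=separate+lil;
printf(" big+(lil+lil) -big = %.0f\n", together-big);
printf("(big+lil)+lil  -big = %.0f\n", separate-big);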

(executable NetRun link)

float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
for (int i=0;i<1000;i++)
	windshield += gnats;

if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";

(executable NetRun link)

In fact, if you've got a bunch of small values to add to a big value, it's more roundoff-friendly to add all the small values together first, then add them all to the big value:

float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
volatile float gnatcup=0.0;
for (int i=0;i<1000;i++)
	gnatcup += gnats;
windshield+=gnatcup; /* add all gnats to the windshield at once */

if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";

(executable NetRun link)

Roundoff can be very annoying, but it doesn't matter if you don't care about exact answers, like in many simulations (where "exact" means the same as the real world, which you'll never get anyway) or games.

One very frustrating fact is that roundoff depends on the precision you keep in your numbers. This, in turn, depends on how many bytes each number takes up. For example, a "float" is just 4 bytes, but it's not very precise. A "double" is 8 bytes, but it's more precise. A "long double" is 12 bytes (or more!), but it's got tons of precision. There's often a serious tradeoff between precision and space (and time), so just using long double for everything isn't a good idea: your program may get bigger and slower, and you still might not have enough precision.

Roundoff in Representation

Sadly, 0.1 decimal is an infinitely repeating pattern in binary: 0.0(0011), with 0011 repeating. This means multiplying by some *finite* bit pattern that approximates 0.1 is only an approximation of really dividing by 10. The size of the difference depends on the precision of the numbers, and grows with the size of the input data:

for (int i=1;i<1000000000;i*=10) {
	double mul01=i*0.1;
	double div10=i/10.0;
	double diff=mul01-div10;
	std::cout<<"i="<<i<<"  diff="<<diff<<"\n";
}

(executable NetRun link)

In a perfect world, multiplying by 0.1 and dividing by 10 would give the exact same result. But in reality, 0.1 has to be approximated by a finite series of binary digits, while the integer 10 can be stored exactly, so on NetRun's Pentium4 CPU, this code gives:

i=1  diff=5.54976e-18
i=10  diff=5.55112e-17
i=100  diff=5.55112e-16
i=1000  diff=5.55112e-15
i=10000  diff=5.55112e-14
i=100000  diff=5.55112e-13
i=1000000  diff=5.54934e-12
i=10000000  diff=5.5536e-11
i=100000000  diff=5.54792e-10
Program complete.  Return 0 (0x0)

That is, double-precision 0.1 differs from the true 1/10 by about 5.5x10^-18, and the error grows in proportion to the number you multiply by! This can add up over time.

Roundoff Taking Over Control

One place roundoff is very annoying is in your control structures. For example, this loop will execute *seven* times, even though it looks like it should only execute *six* times:

for (double k=0.0;k<1.0;k+=1.0/6.0) {
	printf("k=%a (about %.15f)\n",k,k);
}

(executable NetRun link)

The trouble is of course that 1/6 can't be represented exactly in floating-point, so if we add our approximation for 1/6 six times, we haven't quite hit 1.0, so the loop executes one additional time. There are several possible fixes for this:

Don't use floating-point as your loop variable. Loop over an integer i (without roundoff), and divide by six to get k. This is the recommended approach if you care about the exact number of times around the loop.
Or you could adjust the loop termination condition so it's "k<1.0-0.00001", where the "0.00001" provides some safety margin for roundoff. This sort of "epsilon" value is common along floating-point boundaries, although too small and you can still get roundoff, and too big and you've screwed up the computation.
Or you could use a lower-precision comparison, like "(float)k<1.0f". This also provides roundoff margin, because the comparison is taking place at the lower "float" precision.

Any of these fixes will work, but you do have to realize this is a potential problem, and put the precision-compensation code in there!

Bits in a Floating-Point Number

Floats represent continuous values. But they do it using discrete bits.

A "float" (as defined by IEEE Standard 754) consists of three bitfields:

Sign	Exponent	Fraction (or "Mantissa")
1 bit-- 0 for positive, 1 for negative	8 unsigned bits-- 127 means 2⁰, 137 means 2¹⁰	23 bits-- a binary fraction. Don't forget the implicit leading 1!

The sign is in the highest-order bit, the exponent in the next 8 bits, and the fraction in the remaining bits.

The hardware interprets a float as having the value:

value = (-1)^sign * 2^(exponent-127) * 1.fraction

Note that the mantissa has an implicit leading binary 1 applied (unless the exponent field is zero, when it's an implicit leading 0; a "denormalized" number).

For example, the value "8" would be stored with sign bit 0, exponent 130 (==3+127), and mantissa 000... (without the leading 1), since:

8 = (-1)^0 * 2^(130-127) * 1.0000....

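You can stare at the bits inside a float by converting it to an integer. The quick and dirty way to do this is via a pointer typecast. Strictly speaking, reading a float through an integer pointer breaks C++'s "strict aliasing" rule, so modern compilers will sometimes over-optimize this, especially in inlined code (copying the bytes with memcpy is the safe alternative):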

void print_bits(float f) {
	unsigned int i=*reinterpret_cast<unsigned int *>(&f); /* read bits with "pointer shuffle" */
	std::cout<<" float "<<std::setw(10)<<f<<" = ";
	for (int bit=31;bit>=0;bit--) {
		if (i&(1u<<bit)) std::cout<<"1"; else std::cout<<"0"; /* 1u: shifting a signed 1 into bit 31 overflows */
		if (bit==31) std::cout<<" ";
		if (bit==23) std::cout<<" (implicit 1).";
	}
	std::cout<<std::endl;
}

int foo(void) {
	print_bits(0.0);
	print_bits(-1.0);
	print_bits(1.0);
	print_bits(2.0);
	print_bits(4.0);
	print_bits(8.0);
	print_bits(1.125);
	print_bits(1.25);
	print_bits(1.5);
	print_bits(1+1.0/10);
	return sizeof(float);
}

(Try this in NetRun now!)

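Another common way to dissect the parts of a float is using a "union" and a bitfield like so. (This leans on compiler-specific behavior: the bitfield order below matches typical little-endian x86 compilers, and reading a union member other than the one you last wrote is allowed in C but not officially in C++, although compilers such as GCC support it.)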

/* IEEE floating-point number's bits:  sign  exponent   mantissa */
struct float_bits {
	unsigned int fraction:23; /**< Value is binary 1.fraction ("mantissa") */
	unsigned int exp:8; /**< Value is 2^(exp-127) */
	unsigned int sign:1; /**< 0 for positive, 1 for negative */
};

/* A union is a struct where all the fields *overlap* each other */
union float_dissector {
	float f;
	float_bits b;
};

float_dissector s;
s.f=8.0;
std::cout<<s.f<<"= sign "<<s.b.sign<<" exp "<<s.b.exp<<"  fract "<<s.b.fraction<<"\n";
return 0;

(Executable NetRun link)

In addition to the 32-bit "float", there are several other different sizes of floating-point types:

C Datatype	Size	Approx. Precision	Approx. Range	Exponent Bits	Fraction Bits	+-1 range (integers exact up to)
float	4 bytes (everywhere)	1.0x10^-7	10³⁸	8	23	2²⁴
double	8 bytes (everywhere)	2.0x10^-16	10³⁰⁸	11	52	2⁵³
long double	12-16 bytes (if it even exists)	1.0x10^-19	10⁴⁹³²	15	64 (including an explicit leading 1)	2⁶⁴

Nowadays floats have roughly the same performance as integers: addition takes about a nanosecond, multiplication takes a few nanoseconds; and division takes a dozen or more nanoseconds. That is, floats are now cheap, and you can consider using floats for all sorts of stuff--even when you don't care about fractions! The advantages of using floats are:

Floats can store fractional numbers.
Floats never overflow; they hit "infinity" as explored below.
"double" can exactly store any 32-bit "int", since its mantissa has 53 bits (but that's fewer bits than a 64-bit "long").

Normal (non-Weird) Floats

To summarize, a "float" as defined by IEEE Standard 754 consists of three bitfields:

Sign	Exponent	Mantissa (or Fraction)
1 bit-- 0 for positive, 1 for negative	8 bits-- 127 means 2⁰, 137 means 2¹⁰	23 bits-- a binary fraction.

The hardware usually interprets a float as having the value:

value = (-1)^sign * 2^(exponent-127) * 1.fraction

Note that the mantissa normally has an implicit leading 1 applied.