x = + 3.785746 * 10^{exponent}

Note that:

- You only need one bit to represent the sign--plus or minus.
- The exponent's just an integer, so you can store it as an integer (in binary).
- The 3.785746 part, called the "mantissa" or just the "fraction", can be stored as the integer 3785746 (at least as long as you can figure out where the decimal point goes!). Of course, the machine stores the fraction part as a binary number.


It's common to "normalize" a number in scientific notation so that:

- There's exactly one digit to the left of the decimal point.
- And that digit is not zero.

In binary, a "normalized" number *always* has a 1 to the left of the binary point (if it ain't zero, it's gotta be one). So most machines don't even store the 1; it's implicit--you just know it's there!

A "float" (as defined by IEEE Standard 754) consists of three bitfields:

Sign | Exponent | Fraction (or "Mantissa") |
1 bit-- 0 for positive, 1 for negative | 8 unsigned bits-- 127 means 2^{0}, 137 means 2^{10} | 23 bits-- a binary fraction. Don't forget the implicit leading 1! |

The hardware interprets a float as having the value:

value = (-1)^{sign} * 2^{(exponent-127)} * 1.fraction

Note that the mantissa has an implicit leading binary 1 applied. The 1 isn't stored, which actually causes some headaches. (Even worse, if the exponent field is zero, then it's an implicit leading 0; a "denormalized" number as we'll talk about on Wednesday.)

For example, the value "8" would be stored with sign bit 0, exponent 130 (==3+127), and mantissa 000... (without the leading 1), since:

8 = (-1)^{0} * 2^{(130-127)} * 1.000...

You can stare at the bits inside a float by converting it to an integer. The quick and dirty way to do this is via a pointer typecast, but modern compilers will sometimes over-optimize this, especially in inlined code:

#include <iostream>
#include <iomanip>

void print_bits(float f) {
    unsigned int i=*reinterpret_cast<unsigned int *>(&f); /* read bits with "pointer shuffle" */
    std::cout<<" float "<<std::setw(10)<<f<<" = ";
    for (int bit=31;bit>=0;bit--) {
        if (i&(1u<<bit)) std::cout<<"1"; else std::cout<<"0"; /* unsigned, so shifting into bit 31 is safe */
        if (bit==31) std::cout<<" "; /* gap after the sign bit */
        if (bit==23) std::cout<<" (implicit 1)."; /* exponent ends here; fraction follows */
    }
    std::cout<<std::endl;
}

int foo(void) {
    print_bits(0.0);
    print_bits(-1.0);
    print_bits(1.0);
    print_bits(2.0);
    print_bits(4.0);
    print_bits(8.0);
    print_bits(1.125);
    print_bits(1.25);
    print_bits(1.5);
    print_bits(1+1.0/10);
    return sizeof(float);
}

The tidier way to dissect the parts of a float is using a "union" and a bitfield. (C allows reading a union like this; standard C++ technically does not, though GCC documents it as a supported extension.) Like so:

/* IEEE floating-point number's bits: sign exponent mantissa */

/* Bitfield order is compiler-specific; this low-bits-first layout matches the usual x86 compilers */
struct float_bits {
    unsigned int fraction:23; /**< Value is binary 1.fraction ("mantissa") */
    unsigned int exp:8; /**< Value is 2^(exp-127) */
    unsigned int sign:1; /**< 0 for positive, 1 for negative */
};

/* A union is a struct where all the fields *overlap* each other */
union float_dissector {
    float f;
    float_bits b;
};

int foo(void) {
    float_dissector s;
    s.f=8.0;
    std::cout<<s.f<<"= sign "<<s.b.sign<<" exp "<<s.b.exp<<" fract "<<s.b.fraction<<"\n";
    return 0;
}

I like to joke that a union used to convert bits between incompatible types is an "unholy union".

In addition to the 32-bit "float", there are several other different sizes of floating-point types:

C Datatype | Size | Approx. Precision | Approx. Range | Exponent Bits | Fraction Bits | +-1 range (integers are exact up to here) |
float | 4 bytes (everywhere) | 1.0x10^{-7} | 10^{38} | 8 | 23 | 2^{24} |
double | 8 bytes (everywhere) | 2.0x10^{-16} | 10^{308} | 11 | 52 | 2^{53} |
long double | 12-16 bytes (if it even exists) | 1.0x10^{-19} | 10^{4932} | 15 | 64 (the leading 1 is stored, not implicit) | 2^{64} |
half float | 2 bytes (only on GPUs) | 1.0x10^{-3} | 10^{5} | 5 | 10 | 2^{11} |

Nowadays floats have roughly the same performance as integers: addition, subtraction, or multiplication all take about a nanosecond. That is, floats are now cheap, and you can consider using floats for all sorts of stuff--even when you don't care about fractions! The advantages of using floats are:

- Floats can store fractional numbers.

- Floats never overflow; they hit "infinity" instead.

- "double" can count exactly through more bits (53) than a 32-bit "int", though fewer than a 64-bit "long".

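The catch is roundoff. In exact arithmetic:

1.234 * 10^{4} + 7.654 * 10^{0} = 1.2347654 * 10^{4}

But if we only keep three decimal places (chopping off the rest),

1.234 * 10^{4} + 7.654 * 10^{0} = 1.234 * 10^{4}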

which is to say, adding a tiny value to a great big value might not change the great big value at all, because the tiny value gets lost when rounding off to 3 places. To avoid this "roundoff error", when you're doing arithmetic by hand, people recommend keeping lots of digits, and only rounding once, at the end. But for a given value of "lots of digits", did you keep enough?

For example, on a real computer adding one to a float repeatedly will eventually stop changing the float!

#include <cmath>
#include <iostream>

int foo(void) {
    float f=0.73;
    while (1) {
        volatile float g=f+1; /* volatile forces g to be rounded to float precision */
        if (g==f) {
            std::cout<<"f+1 == f at f="<< f <<", or 2^"<< log(f)/log(2.0) <<std::endl;
            return 0;
        }
        else f=g;
    }
}

For "double", you can keep adding one for far longer, but eventually the double will stop changing despite your additions. Recall that for integers, adding one repeatedly will *never* give you the same value--eventually the integer will wrap around, but it won't just stop moving like floats!

This has really weird effects. For example, floating-point arithmetic isn't "associative"--if you change the order of operations, you change the result due to accumulated roundoff. In exact arithmetic:

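1.2355308 * 10^{4} = 1.234 * 10^{4} + (7.654 * 10^{0} + 7.654 * 10^{0})

1.2355308 * 10^{4} = (1.234 * 10^{4} + 7.654 * 10^{0}) + 7.654 * 10^{0}

In other words, parentheses don't matter if you're computing the exact result. But to three decimal places,

1.235 * 10^{4} = 1.234 * 10^{4} + (7.654 * 10^{0} + 7.654 * 10^{0})

1.234 * 10^{4} = (1.234 * 10^{4} + 7.654 * 10^{0}) + 7.654 * 10^{0}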

In the first line, the small values get added together, and together they're enough to move the big value. But separately, they splat like bugs against the windshield of the big value, and don't affect it at all!

double lil=1.0;
double big=pow(2.0,53); //<- carefully chosen for IEEE 64-bit float (52 bits of fraction + implicit 1)
std::cout<<" big+(lil+lil) -big = "<< big+(lil+lil) -big <<std::endl;
std::cout<<"(big+lil)+lil -big = "<< (big+lil)+lil -big <<std::endl;

float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
for (int i=0;i<1000;i++)
    windshield += gnats;
if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";

In fact, if you've got a bunch of small values to add to a big value, it's more roundoff-friendly to add all the small values together first, then add them all to the big value:

float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
volatile float gnatcup=0.0;
for (int i=0;i<1000;i++)
    gnatcup += gnats;
windshield+=gnatcup; /* add all gnats to the windshield at once */
if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";

Roundoff can be very annoying. But it's not the end of the world if you don't care about exact answers, like in computer games, and even in many simulations (where "exact" is unmeasureable anyway). You just need to be able to estimate the amount of roundoff, and make sure it's not too much.

However, the amount of roundoff depends on the precision you keep in your numbers. This, in turn, depends on the size of the numbers. For example, a "float" is just 4 bytes, so it's not very precise. A "double" is 8 bytes, and so more precise. A "long double" is 12 bytes (or more!), using more memory, but it's got tons of precision. There's often a serious tradeoff between precision and space (and time), so just using long double for everything isn't a good idea: your program may get bigger and slower, and you still might not have enough precision.

for (int i=1;i<1000000000;i*=10) {
    double mul01=i*0.1;
    double div10=i/10.0;
    double diff=mul01-div10;
    std::cout<<"i="<<i<<" diff="<<diff<<"\n";
}

In a perfect world, multiplying by 0.1 and dividing by 10 would give the exact same result. But in reality, 0.1 has to be approximated by a finite series of binary digits, while the integer 10 can be stored exactly, so on NetRun's Pentium4 CPU, this code gives:

i=1 diff=5.54976e-18
i=10 diff=5.55112e-17
i=100 diff=5.55112e-16
i=1000 diff=5.55112e-15
i=10000 diff=5.55112e-14
i=100000 diff=5.55112e-13
i=1000000 diff=5.54934e-12
i=10000000 diff=5.5536e-11
i=100000000 diff=5.54792e-10
Program complete. Return 0 (0x0)

That is, there's a difference of about 5.5x10^{-18} between double-precision 0.1 and the true 1/10! This can add up over time. (These exact digits come from the Pentium 4's extended-precision x87 arithmetic; a build that rounds every step to double may well print zeros instead.)