# Floating-Point Roundoff

CS 301 Lecture, Dr. Lawlor

## Roundoff in Arithmetic

They're funny old things, floats.  The fraction part only stores so much precision; further bits are lost.  For example, in reality,
1.2347654 * 10^4 = 1.234 * 10^4 + 7.654 * 10^0
But keeping only three decimal places (and simply dropping the extra digits),
1.234 * 10^4 = 1.234 * 10^4 + 7.654 * 10^0
which is to say, adding a tiny value to a great big value might not change the great big value at all, because the tiny value gets lost when rounding off to 3 places.  This "roundoff" has implications.

For example, adding one repeatedly will eventually stop doing anything:
```cpp
float f=0.73;
while (1) {
    volatile float g=f+1; /* volatile: g really gets rounded to float precision */
    if (g==f) {
        printf("f+1 == f at f=%.3f, or 2^%.3f\n",
            f,log(f)/log(2.0));
        return 0;
    }
    else f=g;
}
```
Recall that for integers, adding one repeatedly will *never* give you the same value--eventually the integer will wrap around, but it won't just stop moving like floats!

For another example, floating-point arithmetic isn't "associative"--if you change the order of operations, you change the result (up to roundoff):
1.2355308 * 10^4 = 1.234 * 10^4 + (7.654 * 10^0 + 7.654 * 10^0)
1.2355308 * 10^4 = (1.234 * 10^4 + 7.654 * 10^0) + 7.654 * 10^0
In other words, parentheses don't matter if you're computing the exact result.  But keeping only three decimal places (and simply dropping the extra digits),
1.235 * 10^4 = 1.234 * 10^4 + (7.654 * 10^0 + 7.654 * 10^0)
1.234 * 10^4 = (1.234 * 10^4 + 7.654 * 10^0) + 7.654 * 10^0
In the first line, the small values get added together, and together they're enough to move the big value.  But separately, they splat like bugs against the windshield of the big value, and don't affect it at all!
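```cpp
double lil=1.0;
double big=pow(2.0,64);
/* 2^64 shows the effect when intermediates live in 80-bit x87 registers (64-bit fraction);
   with strict 64-bit doubles both lines print 0, and 2^53 shows the effect instead. */
printf(" big+(lil+lil) -big = %.0f\n", big+(lil+lil) -big);
printf("(big+lil)+lil -big = %.0f\n",(big+lil)+lil -big);
```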
```cpp
float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
for (int i=0;i<1000;i++)
    windshield += gnats;

if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";
```

In fact, if you've got a bunch of small values to add to a big value, it's more roundoff-friendly to add all the small values together first, then add them all to the big value:
```cpp
float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
volatile float gnatcup=0.0;
for (int i=0;i<1000;i++)
    gnatcup += gnats;
windshield+=gnatcup; /* add all gnats to the windshield at once */

if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";
```

Roundoff can be very annoying, but it doesn't matter if you don't care about exact answers, like in simulation (where "exact" means the same as the real world, which you'll never get anyway) or games.

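One very frustrating fact is that roundoff depends on the precision you keep in your numbers.  This, in turn, depends on the storage size of the type.  For example, a "float" is just 4 bytes, so it's not very precise: 24 significant bits, or about 7 decimal digits.  A "double" is 8 bytes, and it's more precise: 53 significant bits, or about 16 decimal digits.  A "long double" is 12 bytes (or more, depending on the compiler!), and on x86 it typically has 64 significant bits, which is tons of precision.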

## Roundoff in Representation

Sadly, 0.1 decimal is an infinitely repeating pattern in binary: 0.0(0011), with 0011 repeating.  This means multiplying by the *finite* bit pattern that approximates 0.1 only approximates really dividing by 10.  The exact difference depends on the precision of the numbers, and grows in proportion to the size of the input data:
```cpp
for (int i=1;i<1000000000;i*=10) {
    double mul01=i*0.1;
    double div10=i/10.0;
    double diff=mul01-div10;
    std::cout<<"i="<<i<<" diff="<<diff<<"\n";
}
```

On the NetRun Pentium4 CPU, this gives:
```
i=1 diff=5.54976e-18
i=10 diff=5.55112e-17
i=100 diff=5.55112e-16
i=1000 diff=5.55112e-15
i=10000 diff=5.55112e-14
i=100000 diff=5.55112e-13
i=1000000 diff=5.54934e-12
i=10000000 diff=5.5536e-11
i=100000000 diff=5.54792e-10
Program complete. Return 0 (0x0)
```
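That is, double-precision 0.1 differs from the true 1/10 by about 5.5 * 10^-18, and the error grows right along with i!  (These nonzero differences show up because the Pentium4's x87 unit keeps intermediate values in 80-bit registers; a compiler that rounds every operation to a 64-bit double may print a diff of 0 on several of these lines.)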

## Roundoff Taking Over Control

One place roundoff is very annoying is in your control structures.  For example, this loop will execute *seven* times, even though it looks like it should only execute *six* times:
for (double k=0.0;k<1.0;k+=1.0/6.0) {