# Floating-Point Roundoff & Assembly

CS 301 Lecture, Dr. Lawlor, 2005/11/18

## Roundoff & how to avoid it

Floating-point only keeps a certain number of digits around.  This can lead to some really weird results.  For example,
`int foo(void) {	double x=0.123456789;	printf("x=%.10f \n",x);	return 0;}`
This prints out exactly what you'd expect--  x=0.1234567890

But if we add something and then subtract it from x, like this:
`int foo(void) {	double y=1.0e10;	double x=0.123456789+y-y;	printf("x=%.10f \n",x);	return 0;}`
We *don't* get the same result, even though mathematically x should be unchanged.  The problem is that
`	 10000000000.0	+          0.123456789	----------------------	 10000000000.123456789 `
If we now round to (say) 17 significant digits, we get just 10000000000.123457; in other words, we've just rounded off the last three digits!  When we subtract off 1.0e10 again, those lost digits don't magically grow back, so
`	double y=1.0e10;	double x=0.123456789+y-y;`
This gives x=0.1234569550 on my machine--we've lost digits to roundoff!  And it gets worse--if y=1.0e20, then x=0.  However, note that if we do the additions in another order, like this:
`	double x=0.123456789+(y-y);`
Now because (y-y) is exactly zero, we get the right answer, x=0.123456789.  This is super funky, because the *order* matters!  In other words, adding parenthesis can significantly change the result.

In general, roundoff is a problem when:
• You mix huge and tiny values in the same computation--the huge values will overwhelm the tiny values, losing digits.
Because of roundoff, you can't freely perform mathematical transformations, like turning (a+b)+c into a+(b+c), because these have slightly different roundoff behavior.

## Fighting Roundoff

Roundoff happens in real life.  For example, let's say you're trying to find the average color of a 10,000x10,000 image:
`int foo(void) {	enum {n=10000}; /* size of image */	float sum=0.0;	for (int y=0;y<n;y++)	for (int x=0;x<n;x++)		sum+=0.1234;	float average=sum/(n*n);	printf("Average: %.10f\n",average);	return 0;}`
Without roundoff, this would always print out 0.1234.  But due to roundoff, the result is 0.02097152, which is totally wrong.

What's happening is that eventually "sum" gets so big, that adding 0.1234 doesn't actually do anything--the effect of 0.1234 is just rounded off!

There are two natural solutions to roundoff:
• Use a bigger datatype, like "double" instead of float, or "long double" instead of double.  Using "double" above gives the right answer (unless the image gets really really big). Using a bigger type costs memory (and hence cache performance), but it's definitely the easiest and most common way to avoid roundoff.
• Do the operations in a more roundoff-friendly order.  For example, if we keep the (small) sum for each row protected in another variable, and only add it to the (big, bad) total sum at the end of the row, we have significantly less roundoff error:
`int foo(void) {    enum {n=10000}; /* size of image */    float sum=0.0;    for (int y=0;y<n;y++) {        float sumRow=0.0;        for (int x=0;x<n;x++)            sumRow+=0.1234;        sum+=sumRow;    }    float average=sum/(n*n);    printf("Average: %.10f\n",average);    return 0;}`
Of course, you can always do both--use bigger datatypes, and use them in a roundoff-friendly way.

## Floating-Point Assembly

Normal machines, like the PowerPC, just have a dedicated set of floating-point registers and instructions--in other words, everything works in exactly the same way as with integers.  For example, this PowerPC assembly subroutine treats its integer argument as a double pointer to load up its argument into floating-point register 0 using lfd (load floating-point double), and then squares its argument using fmul (floating-point multiply).

`square_it:	lfd	%f0,(%r3)	fmul	%f1,%f0,%f0	blr`

The x86, by contrast, has a really funky floating-point assembly.  Instead of normal registers, the x86 keeps a "stack" of 8 slots.  This has *nothing* to do with the hardware stack; they're dedicated registers, not memory locations.  When you load up a number, it goes on the top of the stack.  To return a value, you just leave it on top of the stack.  So the simplest x86 assembly program is this:
`bar:  fldz  ret`
`bar:  fld1   # load 1.0  fldpi  # load pi (3.141592...)  faddp  # Pops above two, pushes sum  ret`