Floating-Point Roundoff & Assembly

CS 301 Lecture, Dr. Lawlor, 2005/11/18

Roundoff & how to avoid it

Floating-point only keeps a certain number of digits around. This can lead to some really weird results. For example,

int foo(void) {
	double x=0.123456789;
	printf("x=%.10f \n",x);
	return 0;
}

This prints out exactly what you'd expect-- x=0.1234567890

But if we add something and then subtract it from x, like this:

int foo(void) {
	double y=1.0e10;
	double x=0.123456789+y-y;
	printf("x=%.10f \n",x);
	return 0;
}

We *don't* get the same result, even though mathematically x should be unchanged. The problem is that

	 10000000000.0
	+          0.123456789
	----------------------
	 10000000000.123456789

If we now round to (say) 17 significant digits, we get just 10000000000.123457; in other words, we've just rounded off the last three digits! When we subtract off 1.0e10 again, those lost digits don't magically grow back, so

	double y=1.0e10;
	double x=0.123456789+y-y;

This gives x=0.1234569550 on my machine--we've lost digits to roundoff! And it gets worse--if y=1.0e20, then x=0. However, note that if we do the additions in another order, like this:

	double x=0.123456789+(y-y);

Now because (y-y) is exactly zero, we get the right answer, x=0.123456789. This is super funky, because the *order* matters! In other words, adding parenthesis can significantly change the result.

In general, roundoff is a problem when:

You care about lots of digits in the answer, AND
You mix huge and tiny values in the same computation--the huge values will overwhelm the tiny values, losing digits.

Because of roundoff, you can't freely perform mathematical transformations, like turning (a+b)+c into a+(b+c), because these have slightly different roundoff behavior.

Fighting Roundoff

Roundoff happens in real life. For example, let's say you're trying to find the average color of a 10,000x10,000 image:

int foo(void) {
	enum {n=10000}; /* size of image */
	float sum=0.0;
	for (int y=0;y<n;y++)
	for (int x=0;x<n;x++)
		sum+=0.1234;
	float average=sum/(n*n);
	printf("Average: %.10f\n",average);
	return 0;
}

Without roundoff, this would always print out 0.1234. But due to roundoff, the result is 0.02097152, which is totally wrong.

What's happening is that eventually "sum" gets so big, that adding 0.1234 doesn't actually do anything--the effect of 0.1234 is just rounded off!

There are two natural solutions to roundoff:

Use a bigger datatype, like "double" instead of float, or "long double" instead of double. Using "double" above gives the right answer (unless the image gets really really big). Using a bigger type costs memory (and hence cache performance), but it's definitely the easiest and most common way to avoid roundoff.

Do the operations in a more roundoff-friendly order. For example, if we keep the (small) sum for each row protected in another variable, and only add it to the (big, bad) total sum at the end of the row, we have significantly less roundoff error:

int foo(void) {
    enum {n=10000}; /* size of image */
    float sum=0.0;
    for (int y=0;y<n;y++) {
        float sumRow=0.0;
        for (int x=0;x<n;x++)
            sumRow+=0.1234;
        sum+=sumRow;
    }
    float average=sum/(n*n);
    printf("Average: %.10f\n",average);
    return 0;
}

Of course, you can always do both--use bigger datatypes, and use them in a roundoff-friendly way.

Floating-Point Assembly

Normal machines, like the PowerPC, just have a dedicated set of floating-point registers and instructions--in other words, everything works in exactly the same way as with integers. For example, this PowerPC assembly subroutine treats its integer argument as a double pointer to load up its argument into floating-point register 0 using lfd (load floating-point double), and then squares its argument using fmul (floating-point multiply).

square_it:
	lfd	%f0,(%r3)
	fmul	%f1,%f0,%f0
	blr

The x86, by contrast, has a really funky floating-point assembly. Instead of normal registers, the x86 keeps a "stack" of 8 slots. This has *nothing* to do with the hardware stack; they're dedicated registers, not memory locations. When you load up a number, it goes on the top of the stack. To return a value, you just leave it on top of the stack. So the simplest x86 assembly program is this:
(Executable NetRun Link)

bar:
  fldz
  ret

"fldz" (floating-point load zero) pushes 0.0 onto the floating point stack. The value on top of the stack is the floating-point return value (the equivalent of the integer register eax). So this assembly is equivalent to "double bar(void) {return 0.0;}".

To do arithmetic, like adding two numbers using "faddp" (floating-point add with pop), the hardware takes the inputs from the top two locations on the stack, and leaves the result sitting on top of the stack. So you can add two numbers like this:
(Executable NetRun Link)

bar:
  fld1   # load 1.0
  fldpi  # load pi (3.141592...)
  faddp  # Pops above two, pushes sum
  ret

There are alternate flavors of the arithmetic instructions that pull one source value from memory or from deep in the stack (using the funky syntax "%st" to refer to the top of the floating-point stack, or "%st(3)" to refer to the 3rd element in the stack), or to swap, rotate, and otherwise shuffle the values in the stack. But every operation still needs to take at least one input from the top of the stack, which means real programs end up spending a lot of time doing extra loads and stores, or stack shuffles, to find the values they need.