Speed of Floating-Point Operations and Weird Floats

CS 301 Lecture, Dr. Lawlor

Floating-point arithmetic is quite fast when everything works properly. For example, I can add, subtract, and multiply ordinary floats in about 1ns each:

   =: 1.063 ns/float
   +: 1.286 ns/float
   -: 1.390 ns/float
   *: 1.318 ns/float

Other operations are not very fast. Like integer, float divide is slow. The other operations are even slower:

   /: 16.317 ns/float
sqrt: 16.632 ns/float
cbrt: 148.874 ns/float
 sin: 82.280 ns/float
 cos: 65.373 ns/float
 tan: 109.266 ns/float
asin: 134.639 ns/float
acos: 174.358 ns/float
atan: 108.406 ns/float
 exp: 173.219 ns/float
 log: 79.628 ns/float
 pow: 322.878 ns/float

Here's the code I used to measure these values. Note that I'm being very careful in this code to keep the values from exploding to infinity, because (as we see below) such a malfunction can change the speed of the code!

enum {n_X=1000};
double X[n_X];
const double k=0.99, k2=1.01, k3=0.000001;

int assign_X(void) { for (int i=0;i<n_X-1;i++) X[i]=k;return 0;}
int add_X(void) { for (int i=0;i<n_X-1;i++) X[i]=X[i]+k;return 0;}
int sub_X(void) { for (int i=0;i<n_X-1;i++) X[i]=X[i]-k;return 0;}
int mul_X(void) { for (int i=0;i<n_X-1;i++) X[i]=X[i]*k;return 0;}
int div_X(void) { for (int i=0;i<n_X-1;i++) X[i]=X[i]/k2;return 0;}
int sqrt_X(void) { for (int i=0;i<n_X-1;i++) X[i]=sqrt(X[i]);return 0;}
int cbrt_X(void) { for (int i=0;i<n_X-1;i++) X[i]=cbrt(X[i]);return 0;}
int sin_X(void) { for (int i=0;i<n_X-1;i++) X[i]=sin(X[i]);return 0;}
int cos_X(void) { for (int i=0;i<n_X-1;i++) X[i]=cos(X[i]);return 0;}
int tan_X(void) { for (int i=0;i<n_X-1;i++) X[i]=tan(X[i]);return 0;}
int asin_X(void) { for (int i=0;i<n_X-1;i++) X[i]=asin(X[i]);return 0;}
int acos_X(void) { for (int i=0;i<n_X-1;i++) X[i]=k3*acos(X[i]);return 0;}
int atan_X(void) { for (int i=0;i<n_X-1;i++) X[i]=atan(X[i]);return 0;}
int exp_X(void) { for (int i=0;i<n_X-1;i++) X[i]=k3*exp(X[i]);return 0;}
int log_X(void) { for (int i=0;i<n_X-1;i++) X[i]=k2+log(X[i]);return 0;}
int pow_X(void) { for (int i=0;i<n_X-1;i++) X[i]=pow(X[i],k);return 0;}

int foo(void) {
	int i;
	printf("   =: %.3f ns/float\n",time_function(assign_X)/n_X*1.0e9);
	printf("   +: %.3f ns/float\n",time_function(add_X)/n_X*1.0e9);
	printf("   -: %.3f ns/float\n",time_function(sub_X)/n_X*1.0e9);
	for (i=0;i<n_X;i++) X[i]=1.0; 
	printf("   *: %.3f ns/float\n",time_function(mul_X)/n_X*1.0e9);
	printf("   /: %.3f ns/float\n",time_function(div_X)/n_X*1.0e9);
	printf("sqrt: %.3f ns/float\n",time_function(sqrt_X)/n_X*1.0e9);
	printf("cbrt: %.3f ns/float\n",time_function(cbrt_X)/n_X*1.0e9);
	printf(" sin: %.3f ns/float\n",time_function(sin_X)/n_X*1.0e9);
	printf(" cos: %.3f ns/float\n",time_function(cos_X)/n_X*1.0e9);
	printf(" tan: %.3f ns/float\n",time_function(tan_X)/n_X*1.0e9);
	printf("asin: %.3f ns/float\n",time_function(asin_X)/n_X*1.0e9);
	printf("acos: %.3f ns/float\n",time_function(acos_X)/n_X*1.0e9);
	printf("atan: %.3f ns/float\n",time_function(atan_X)/n_X*1.0e9);
	for (i=0;i<n_X;i++) X[i]=0.0; 
	printf(" exp: %.3f ns/float\n",time_function(exp_X)/n_X*1.0e9);
	for (i=0;i<n_X;i++) X[i]=1.0; 
	printf(" log: %.3f ns/float\n",time_function(log_X)/n_X*1.0e9);
	printf(" pow: %.3f ns/float\n",time_function(pow_X)/n_X*1.0e9);
	printf("Final X values: %.4f  %.4f  %.4f\n",X[0],X[n_X/2],X[n_X-2]);
	return 0;
}

(Try this in NetRun now!)

Normal (non-Weird) Floats

Recall that a "float" as as defined by IEEE Standard 754 consists of three bitfields:

Sign	Exponent	Mantissa (or Fraction)
1 bit-- 0 for positive 1 for negative	8 bits-- 127 means 2⁰ 137 means 2¹⁰	23 bits-- a binary fraction.

The hardware usually interprets a float as having the value:

value = (-1) ^sign * 2 ^{(exponent-127)}* 1.fraction

Note that the mantissa normally has an implicit leading 1 applied.

Weird: Zeros and Denormals

However, if the "exponent" field is exactly zero, the implicit leading digit is taken to be 0, like this:

value = (-1) ^sign * 2 ^(-126)* 0.fraction

Supressing the leading 1 allows you to exactly represent 0: the bit pattern for 0.0 is just exponent==0 and fraction==00000000 (that is, everything zero). If you set the sign bit to negative, you have "negative zero", a strange curiosity. Positive and negative zero work the same way in arithmetic operations, and as far as I know there's no reason to prefer one to the other. The "==" operator claims positive and negative zero are the same!

If the fraction field isn't zero, but the exponent field is, you have a "denormalized number"--these are numbers too small to represent with a leading one. You always need denormals to represent zero, but denormals (also known as "subnormal" values) also provide a little more range at the very low end--they can store values down to around 1.0e-40 for "float", and 1.0e-310 for "double".

See below for the performance problem with denormals.

Weird: Infinity

If the exponent field is as big as it can get (for "float", 255), this indicates another sort of special number. If the fraction field is zero, the number is interpreted as positive or negative "infinity". The hardware will generate "infinity" when dividing by zero, or when another operation exceeds the representable range.

float z=0.0;
float f=1.0/z;
std::cout<<f<<"\n";
return (int)f;

(Try this in NetRun now!)

Arithmetic on infinities works just the way you'd expect:infinity plus 1.0 gives infinity, etc. (See tables below). Positive and negative infinities exist, and work as you'd expect. Note that while divide-by-integer-zero causes a crash (divide by zero error), divide-by-floating-point-zero just happily returns infinity by default.

Weird: NaN

If you do an operation that doesn't make sense, like:

0.0/0.0 (neither zero nor infinity, because we'd want (x/x)==1.0; but not 1.0 either, because we'd want (2*x)/x==2.0...)
infinity-infinity (might cancel out to anything)
infinity*0

The machine just gives a special "error" number called a "NaN" (Not-a-Number). The idea is if you run some complicated program that screws up, you don't want to get a plausible but wrong answer like "4" (like we get with integer overflow!); you want something totally implausible like "nan" to indicate an error happened. For example, this program prints "nan" and returns -2147483648 (0x80000000):

float f=sqrt(-1.0);
std::cout<<f<<"\n";
return (int)f;

(Try this in NetRun now!)

This is a "NaN", which is represented with a huge exponent and a *nonzero* fraction field. Positive and negative nans exist, but like zeros both signs seem to work the same. x86 seems to rewrite the bits of all NaNs to a special pattern it prefers (0x7FC00000 for float, with exponent bits and the leading fraction bit all set to 1).

Performance impact of special values

Machines properly handle ordinary floating-point numbers and zero in hardware at full speed.

However, most modern machines *don't* handle denormals, infinities, or NaNs in hardware--instead when one of these special values occurs, they trap out to software which handles the problem and restarts the computation. This trapping process takes time, as shown in the following program:
(Executable NetRun Link)

enum {n_vals=1000};
double vals[n_vals];

int average_vals(void) {
	for (int i=0;i<n_vals-1;i++) 
		vals[i]=0.5*(vals[i]+vals[i+1]);
	return 0;
}

int foo(void) {
	int i;
	for (i=0;i<n_vals;i++) vals[i]=0.0; 
	printf(" Zeros: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
	for (i=0;i<n_vals;i++) vals[i]=1.0; 
	printf(" Ones: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
	for (i=0;i<n_vals;i++) vals[i]=1.0e-310; 
	printf(" Denorm: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
	float x=0.0;
	for (i=0;i<n_vals;i++) vals[i]=1.0/x; 
	printf(" Inf: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
	for (i=0;i<n_vals;i++) vals[i]=x/x; 
	printf(" NaN: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
	return 0;
}

On my P4, this gives 3ns for zeros and ordinary values, 300ns for denormals (a 100x slowdown), and 700ns for infinities and NaNs (a 200x slowdown)!

On my PowerPC 604e, this gives 35ns for zeros, 65ns for denormals (a 2x slowdown), and 35ns for infinities and NaNs (no penalty).

My friends at Illinois and I wrote a paper on this with many more performance details.

Arithmetic Tables for Special Floating-Point Numbers:

These tables were computed for "float", but should be identical with any number size on any IEEE machine (which virtually everything is). "big" is a large but finite number, here 1.0e30. "lil" is a denormalized number, here 1.0e-40. "inf" is an infinity. "nan" is a Not-A-Number. Here's the source code to generate these tables.

These all go exactly how you'd expect--"inf" for things that are too big (or -inf for too small), "nan" for things that don't make sense (like 0.0/0.0, or infinity times zero, or nan with anything else).

Addition

-nan

-inf

-big

-1

-lil

-0

+lil

+big

+inf

+nan

-nan

nan

-inf

nan

-inf

nan

-big

nan

-inf

-2e+30

-big

+inf

nan

-1

nan

-inf

-big

-2

-1

+big

+inf

nan

-lil

nan

-inf

-big

-1

-2e-40

-lil

+big

+inf

nan

-0

nan

-inf

-big

-1

-lil

-0

+lil

+big

+inf

nan

-inf

-big

-1

-lil

+lil

+big

+inf

nan

+lil

nan

-inf

-big

-1

+lil

2e-40

+big

+inf

nan

-inf

-big

+big

+inf

nan

+big

nan

-inf

+big

2e+30

+inf

nan

+inf

nan

+inf

nan

+nan

nan

Note how infinity-infinity gives nan, but infinity+infinity is infinity.

Subtraction

-nan

-inf

-big

-1

-lil

-0

+lil

+big

+inf

+nan

-nan

nan

-inf

nan

-inf

nan

-big

nan

+inf

-big

-2e+30

-inf

nan

-1

nan

+inf

+big

-1

-2

-big

-inf

nan

-lil

nan

+inf

+big

-lil

-2e-40

-1

-big

-inf

nan

-0

nan

+inf

+big

+lil

-0

-lil

-1

-big

-inf

nan

+inf

+big

+lil

-lil

-1

-big

-inf

nan

+lil

nan

+inf

+big

2e-40

+lil

-1

-big

-inf

nan

+inf

+big

-big

-inf

nan

+big

nan

+inf

2e+30

+big

-inf

nan

+inf

nan

+inf

nan

+nan

nan

Multiplication

-nan

-inf

-big

-1

-lil

-0

+lil

+big

+inf

+nan

-nan

nan

-inf

nan

+inf

nan

-inf

nan

-big

nan

+inf

+big

1e-10

-0

-1e-10

-big

-inf

nan

-1

nan

+inf

+big

+lil

-0

-lil

-1

-big

-inf

nan

-lil

nan

+inf

1e-10

+lil

-0

-lil

-1e-10

-inf

nan

-0

nan

-0

nan

-0

nan

+lil

nan

-inf

-1e-10

-lil

-0

+lil

1e-10

+inf

nan

-inf

-big

-1

-lil

-0

+lil

+big

+inf

nan

+big

nan

-inf

-big

-1e-10

-0

1e-10

+big

+inf

nan

+inf

nan

-inf

nan

+inf

nan

+nan

nan

Note that 0*infinity gives nan, and out-of-range multiplications give infinities.

Division

-nan

-inf

-big

-1

-lil

-0

+lil

+big

+inf

+nan

-nan

nan

-inf

nan

+inf

-inf

nan

-big

nan

+big

+inf

-inf

-big

-1

-0

nan

-1

nan

1e-30

+inf

-inf

-1

-1e-30

-0

nan

-lil

nan

+lil

+inf

-inf

-1

-lil

-0

nan

-0

nan

-0

nan

-0

nan

+lil

nan

-0

-lil

-1

-inf

+inf

+lil

nan

-0

-1e-30

-1

-inf

+inf

1e-30

nan

+big

nan

-0

-1

-big

-inf

+inf

+big

nan

+inf

nan

-inf

+inf

nan

+nan

nan

Note that 0/0, and inf/inf give NaNs; while out-of-range divisions like big/lil or 1.0/0.0 give infinities (and not errors!).

Equality

==	-nan	-inf	-big	-1	-lil	-0	+0	+lil	+1	+big	+inf	+nan
-nan	0	0	0	0	0	0	0	0	0	0	0	0
-inf	0	+1	0	0	0	0	0	0	0	0	0	0
-big	0	0	+1	0	0	0	0	0	0	0	0	0
-1	0	0	0	+1	0	0	0	0	0	0	0	0
-lil	0	0	0	0	+1	0	0	0	0	0	0	0
-0	0	0	0	0	0	+1	+1	0	0	0	0	0
+0	0	0	0	0	0	+1	+1	0	0	0	0	0
+lil	0	0	0	0	0	0	0	+1	0	0	0	0
+1	0	0	0	0	0	0	0	0	+1	0	0	0
+big	0	0	0	0	0	0	0	0	0	+1	0	0
+inf	0	0	0	0	0	0	0	0	0	0	+1	0
+nan	0	0	0	0	0	0	0	0	0	0	0	0

Note that positive and negative zeros are considered equal, and a "NaN" doesn't equal anything--even itself!

Less-Than

<	-nan	-inf	-big	-1	-lil	-0	+0	+lil	+1	+big	+inf	+nan
-nan	0	0	0	0	0	0	0	0	0	0	0	0
-inf	0	0	+1	+1	+1	+1	+1	+1	+1	+1	+1	0
-big	0	0	0	+1	+1	+1	+1	+1	+1	+1	+1	0
-1	0	0	0	0	+1	+1	+1	+1	+1	+1	+1	0
-lil	0	0	0	0	0	+1	+1	+1	+1	+1	+1	0
-0	0	0	0	0	0	0	0	+1	+1	+1	+1	0
+0	0	0	0	0	0	0	0	+1	+1	+1	+1	0
+lil	0	0	0	0	0	0	0	0	+1	+1	+1	0
+1	0	0	0	0	0	0	0	0	0	+1	+1	0
+big	0	0	0	0	0	0	0	0	0	0	+1	0
+inf	0	0	0	0	0	0	0	0	0	0	0	0
+nan	0	0	0	0	0	0	0	0	0	0	0	0

Note that "NaN" returns false to all comparisons--it's neither smaller nor larger than the other numbers.