Floats represent continuous values. But they do it using discrete bits.

A "float" (as defined by IEEE Standard 754) consists of three bitfields:

Sign | Exponent | Fraction (or "Mantissa") |
1 bit-- 0 for positive, 1 for negative | 8 unsigned bits-- 127 means 2^{0}, 137 means 2^{10} | 23 bits-- a binary fraction. Don't forget the implicit leading 1! |

The hardware interprets a float as having the value:

value = (-1)^{sign} * 2^{exponent-127} * 1.fraction

Note that the mantissa has an implicit leading binary 1 applied (unless the exponent field is zero, when it's an implicit leading 0; a "denormalized" number).

For example, the value "8" would be stored with sign bit 0, exponent 130 (==3+127), and mantissa 000... (without the leading 1), since:

8 = (-1)^{0} * 2^{130-127} * 1.0000...

You can actually dissect the parts of a float using a "union" and a bitfield like so:

/* IEEE floating-point number's bits: sign exponent mantissa */
#include <iostream>

/* Bitfields are listed lowest bit first; this layout assumes a
   little-endian compiler such as gcc on x86 */
struct float_bits {
    unsigned int fraction:23; /**< Value is binary 1.fraction ("mantissa") */
    unsigned int exp:8; /**< Value is 2^(exp-127) */
    unsigned int sign:1; /**< 0 for positive, 1 for negative */
};

/* A union is a struct where all the fields *overlap* each other */
union float_dissector {
    float f;
    float_bits b;
};

int main() {
    float_dissector s;
    s.f=8.0;
    std::cout<<s.f<<"= sign "<<s.b.sign<<" exp "<<s.b.exp<<" fract "<<s.b.fraction<<"\n";
    return 0;
}

In addition to the 32-bit "float", there are several different sizes of floating-point types:

C Datatype | Size | Approx. Precision | Approx. Range | Exponent Bits | Fraction Bits | +-1 range (integers stored exactly) |
float | 4 bytes (everywhere) | 1.0x10^{-7} | 10^{38} | 8 | 23 | 2^{24} |
double | 8 bytes (everywhere) | 2.0x10^{-15} | 10^{308} | 11 | 52 | 2^{53} |
long double | 12-16 bytes (if it even exists) | 2.0x10^{-20} | 10^{4932} | 15 | 64 | 2^{65} |

Nowadays floats have roughly the same performance as integers: addition takes about two nanoseconds (slightly slower than integer addition); multiplication takes a few nanoseconds; and division takes a dozen or more nanoseconds. That is, floats are now cheap, and you can consider using floats for all sorts of stuff--even when you don't care about fractions! The advantages of using floats are:

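- Floats can store fractional numbers.
- Floats never wrap around on overflow: a value that gets too big becomes infinity instead.
- "double" has more bits of integer precision (53) than a 32-bit "int" (but fewer than a 64-bit "long").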

Sign | Exponent | Mantissa (or Fraction) |
1 bit-- 0 for positive, 1 for negative | 8 bits-- 127 means 2^{0}, 137 means 2^{10} | 23 bits-- a binary fraction. |

The hardware usually interprets a float as having the value:

value = (-1)^{sign} * 2^{exponent-127} * 1.fraction

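Note that the mantissa normally has an implicit leading 1 applied. But if the exponent field is zero, the leading 1 is suppressed, and the value is instead:

value = (-1)^{sign} * 2^{-126} * 0.fraction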

Suppressing the leading 1 allows you to exactly represent 0: the bit pattern for 0.0 is just exponent==0 and fraction==00000000 (that is, everything zero). If you set the sign bit to negative, you have "negative zero", a strange curiosity. Positive and negative zero work the same way in most arithmetic operations (although dividing by them gives infinities of opposite sign), and as far as I know there's no reason to prefer one to the other. The "==" operator claims positive and negative zero are the same!

If the fraction field isn't zero, but the exponent field is, you have a "denormalized number"--these are numbers too small to represent with a leading one. You always need denormals to represent zero, but denormals (also known as "subnormal" values) also provide a little more range at the very low end--they can store values down to around 1.4e-45 for "float", and 4.9e-324 for "double", with fewer and fewer bits of precision as they shrink.

See below for the performance problem with denormals.

float z=0.0;
float f=1.0/z;
std::cout<<f<<"\n";
return (int)f;

Arithmetic on infinities works just the way you'd expect: infinity plus 1.0 gives infinity, etc. (See tables below). Positive and negative infinities exist, and work as you'd expect. Note that while divide-by-integer-zero causes a crash (divide by zero error), divide-by-floating-point-zero just happily returns infinity by default.

Some operations have no sensible numeric answer:

- 0.0/0.0 (neither zero nor infinity, because we'd want (x/x)==1.0; but not 1.0 either, because we'd want (2*x)/x==2.0...)
- infinity-infinity (might cancel out to anything)
- infinity*0

float f=sqrt(-1.0);
std::cout<<f<<"\n";
return (int)f;

This is a "NaN", which is represented with a huge exponent and a *nonzero* fraction field. Positive and negative NaNs exist, but like zeros both signs seem to work the same. x86 seems to rewrite the bits of all NaNs to a special pattern it prefers (0x7FC00000 for float, with exponent bits and the leading fraction bit all set to 1).

However, most modern machines *don't* handle denormals, infinities, or NaNs in hardware--instead when one of these special values occurs, they trap out to software which handles the problem and restarts the computation. This trapping process takes time, as shown in the following program:

enum {n_vals=1000};
double vals[n_vals];

/* Average each value with its neighbor: one add and one multiply per element */
int average_vals(void) {
    for (int i=0;i<n_vals-1;i++)
        vals[i]=0.5*(vals[i]+vals[i+1]);
    return 0;
}

/* time_function (supplied by NetRun) is assumed to return seconds per call */
int foo(void) {
    int i;
    for (i=0;i<n_vals;i++) vals[i]=0.0;
    printf(" Zeros: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
    for (i=0;i<n_vals;i++) vals[i]=1.0;
    printf(" Ones: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
    for (i=0;i<n_vals;i++) vals[i]=1.0e-310;
    printf(" Denorm: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
    float x=0.0;
    for (i=0;i<n_vals;i++) vals[i]=1.0/x;
    printf(" Inf: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
    for (i=0;i<n_vals;i++) vals[i]=x/x;
    printf(" NaN: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
    return 0;
}

On my P4, this gives 3ns for zeros and ordinary values, 300ns for denormals (a 100x slowdown), and 700ns for infinities and NaNs (a 200x slowdown)!

On my PowerPC 604e, this gives 35ns for zeros, 65ns for denormals (a 2x slowdown), and 35ns for infinities and NaNs (no penalty).

My friends at Illinois and I wrote a paper on this with many more performance details.

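The tables below show what each operation gives for every combination of ordinary and special values; the row label is the left operand and the column label is the right. Judging by the entries, "big" stands for a large value (1e+30) and "lil" for a tiny one (1e-40). These all go exactly how you'd expect--"inf" for things that are too big (or "-inf" for too negative), "nan" for things that don't make sense (like 0.0/0.0, or infinity times zero, or nan with anything else).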

+ | -nan | -inf | -big | -1 | -lil | -0 | +0 | +lil | +1 | +big | +inf | +nan |
-nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
-inf | nan | -inf | -inf | -inf | -inf | -inf | -inf | -inf | -inf | -inf | nan | nan |
-big | nan | -inf | -2e+30 | -big | -big | -big | -big | -big | -big | 0 | +inf | nan |
-1 | nan | -inf | -big | -2 | -1 | -1 | -1 | -1 | 0 | +big | +inf | nan |
-lil | nan | -inf | -big | -1 | -2e-40 | -lil | -lil | 0 | +1 | +big | +inf | nan |
-0 | nan | -inf | -big | -1 | -lil | -0 | 0 | +lil | +1 | +big | +inf | nan |
+0 | nan | -inf | -big | -1 | -lil | 0 | 0 | +lil | +1 | +big | +inf | nan |
+lil | nan | -inf | -big | -1 | 0 | +lil | +lil | 2e-40 | +1 | +big | +inf | nan |
+1 | nan | -inf | -big | 0 | +1 | +1 | +1 | +1 | 2 | +big | +inf | nan |
+big | nan | -inf | 0 | +big | +big | +big | +big | +big | +big | 2e+30 | +inf | nan |
+inf | nan | nan | +inf | +inf | +inf | +inf | +inf | +inf | +inf | +inf | +inf | nan |
+nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |

- | -nan | -inf | -big | -1 | -lil | -0 | +0 | +lil | +1 | +big | +inf | +nan |
-nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
-inf | nan | nan | -inf | -inf | -inf | -inf | -inf | -inf | -inf | -inf | -inf | nan |
-big | nan | +inf | 0 | -big | -big | -big | -big | -big | -big | -2e+30 | -inf | nan |
-1 | nan | +inf | +big | 0 | -1 | -1 | -1 | -1 | -2 | -big | -inf | nan |
-lil | nan | +inf | +big | +1 | 0 | -lil | -lil | -2e-40 | -1 | -big | -inf | nan |
-0 | nan | +inf | +big | +1 | +lil | 0 | -0 | -lil | -1 | -big | -inf | nan |
+0 | nan | +inf | +big | +1 | +lil | 0 | 0 | -lil | -1 | -big | -inf | nan |
+lil | nan | +inf | +big | +1 | 2e-40 | +lil | +lil | 0 | -1 | -big | -inf | nan |
+1 | nan | +inf | +big | 2 | +1 | +1 | +1 | +1 | 0 | -big | -inf | nan |
+big | nan | +inf | 2e+30 | +big | +big | +big | +big | +big | +big | 0 | -inf | nan |
+inf | nan | +inf | +inf | +inf | +inf | +inf | +inf | +inf | +inf | +inf | nan | nan |
+nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |

* | -nan | -inf | -big | -1 | -lil | -0 | +0 | +lil | +1 | +big | +inf | +nan |
-nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
-inf | nan | +inf | +inf | +inf | +inf | nan | nan | -inf | -inf | -inf | -inf | nan |
-big | nan | +inf | +inf | +big | 1e-10 | 0 | -0 | -1e-10 | -big | -inf | -inf | nan |
-1 | nan | +inf | +big | +1 | +lil | 0 | -0 | -lil | -1 | -big | -inf | nan |
-lil | nan | +inf | 1e-10 | +lil | 0 | 0 | -0 | -0 | -lil | -1e-10 | -inf | nan |
-0 | nan | nan | 0 | 0 | 0 | 0 | -0 | -0 | -0 | -0 | nan | nan |
+0 | nan | nan | -0 | -0 | -0 | -0 | 0 | 0 | 0 | 0 | nan | nan |
+lil | nan | -inf | -1e-10 | -lil | -0 | -0 | 0 | 0 | +lil | 1e-10 | +inf | nan |
+1 | nan | -inf | -big | -1 | -lil | -0 | 0 | +lil | +1 | +big | +inf | nan |
+big | nan | -inf | -inf | -big | -1e-10 | -0 | 0 | 1e-10 | +big | +inf | +inf | nan |
+inf | nan | -inf | -inf | -inf | -inf | nan | nan | +inf | +inf | +inf | +inf | nan |
+nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |

/ | -nan | -inf | -big | -1 | -lil | -0 | +0 | +lil | +1 | +big | +inf | +nan |
-nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
-inf | nan | nan | +inf | +inf | +inf | +inf | -inf | -inf | -inf | -inf | nan | nan |
-big | nan | 0 | +1 | +big | +inf | +inf | -inf | -inf | -big | -1 | -0 | nan |
-1 | nan | 0 | 1e-30 | +1 | +inf | +inf | -inf | -inf | -1 | -1e-30 | -0 | nan |
-lil | nan | 0 | 0 | +lil | +1 | +inf | -inf | -1 | -lil | -0 | -0 | nan |
-0 | nan | 0 | 0 | 0 | 0 | nan | nan | -0 | -0 | -0 | -0 | nan |
+0 | nan | -0 | -0 | -0 | -0 | nan | nan | 0 | 0 | 0 | 0 | nan |
+lil | nan | -0 | -0 | -lil | -1 | -inf | +inf | +1 | +lil | 0 | 0 | nan |
+1 | nan | -0 | -1e-30 | -1 | -inf | -inf | +inf | +inf | +1 | 1e-30 | 0 | nan |
+big | nan | -0 | -1 | -big | -inf | -inf | +inf | +inf | +big | +1 | 0 | nan |
+inf | nan | nan | -inf | -inf | -inf | -inf | +inf | +inf | +inf | +inf | nan | nan |
+nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |

== | -nan | -inf | -big | -1 | -lil | -0 | +0 | +lil | +1 | +big | +inf | +nan |
-nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
-inf | 0 | +1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
-big | 0 | 0 | +1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
-1 | 0 | 0 | 0 | +1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
-lil | 0 | 0 | 0 | 0 | +1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
-0 | 0 | 0 | 0 | 0 | 0 | +1 | +1 | 0 | 0 | 0 | 0 | 0 |
+0 | 0 | 0 | 0 | 0 | 0 | +1 | +1 | 0 | 0 | 0 | 0 | 0 |
+lil | 0 | 0 | 0 | 0 | 0 | 0 | 0 | +1 | 0 | 0 | 0 | 0 |
+1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | +1 | 0 | 0 | 0 |
+big | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | +1 | 0 | 0 |
+inf | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | +1 | 0 |
+nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

< | -nan | -inf | -big | -1 | -lil | -0 | +0 | +lil | +1 | +big | +inf | +nan |
-nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
-inf | 0 | 0 | +1 | +1 | +1 | +1 | +1 | +1 | +1 | +1 | +1 | 0 |
-big | 0 | 0 | 0 | +1 | +1 | +1 | +1 | +1 | +1 | +1 | +1 | 0 |
-1 | 0 | 0 | 0 | 0 | +1 | +1 | +1 | +1 | +1 | +1 | +1 | 0 |
-lil | 0 | 0 | 0 | 0 | 0 | +1 | +1 | +1 | +1 | +1 | +1 | 0 |
-0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | +1 | +1 | +1 | +1 | 0 |
+0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | +1 | +1 | +1 | +1 | 0 |
+lil | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | +1 | +1 | +1 | 0 |
+1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | +1 | +1 | 0 |
+big | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | +1 | 0 |
+inf | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |