Bits in a Float, and Infinity, NaN, and denormal

CS 301 Lecture, Dr. Lawlor

Bits in a Floating-Point Number

Floats represent continuous values.  But they do it using discrete bits.

A "float" (as defined by IEEE Standard 754) consists of three bitfields:

    Sign                       1 bit: 0 for positive, 1 for negative
    Exponent                   8 unsigned bits: 127 means 2^0, 137 means 2^10
    Fraction (or "Mantissa")   23 bits: a binary fraction.  Don't forget the implicit leading 1!

The sign is in the highest-order bit, the exponent in the next 8 bits, and the fraction in the remaining bits.

The hardware interprets a float as having the value:

    value = (-1)^sign * 2^(exponent-127) * 1.fraction

Note that the mantissa has an implicit leading binary 1 applied.  The 1 isn't stored, which actually causes some headaches.  (Even worse, if the exponent field is zero, then it's an implicit leading 0; a "denormalized" number as we'll talk about on Wednesday.)

For example, the value "8" would be stored with sign bit 0, exponent 130 (==3+127), and mantissa 000... (without the leading 1), since:

    8 = (-1)^0 * 2^(130-127) * 1.0000....

You can stare at the bits inside a float by converting it to an integer.  The quick and dirty way to do this is via a pointer typecast, but modern compilers will sometimes over-optimize this, especially in inlined code:
#include <iostream>
#include <iomanip>

void print_bits(float f) {
    int i=*reinterpret_cast<int *>(&f); /* read bits with "pointer shuffle" */
    std::cout<<" float "<<std::setw(10)<<f<<" = ";
    for (int bit=31;bit>=0;bit--) {
        if (i&(1u<<bit)) std::cout<<"1"; else std::cout<<"0"; /* 1u: shifting a signed 1 into bit 31 overflows */
        if (bit==31) std::cout<<" "; /* space after the sign bit */
        if (bit==23) std::cout<<" (implicit 1)."; /* exponent ends here, fraction follows */
    }
    std::cout<<std::endl;
}

int foo(void) {
    print_bits(0.0);
    print_bits(-1.0);
    print_bits(1.0);
    print_bits(2.0);
    print_bits(4.0);
    print_bits(8.0);
    print_bits(1.125);
    print_bits(1.25);
    print_bits(1.5);
    print_bits(1+1.0/10);
    return sizeof(float);
}

(Try this in NetRun now!)

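The traditional way to dissect the parts of a float is using a "union" and a bitfield like so (this relies on the compiler packing bitfields starting from the lowest-order bit, as compilers for x86 do):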
/* IEEE floating-point number's bits:  sign  exponent   mantissa */
struct float_bits {
    unsigned int fraction:23; /**< Value is binary 1.fraction ("mantissa") */
    unsigned int exp:8; /**< Value is 2^(exp-127) */
    unsigned int sign:1; /**< 0 for positive, 1 for negative */
};

/* A union is a struct where all the fields *overlap* each other */
union float_dissector {
    float f;
    float_bits b;
};

float_dissector s;
s.f=8.0;
std::cout<<s.f<<"= sign "<<s.b.sign<<" exp "<<s.b.exp<<" fract "<<s.b.fraction<<"\n";
return 0;
(Executable NetRun link)

I like to joke that a union misused to convert bits between incompatible types is an "unholy union".

In addition to the 32-bit "float", there are several other different sizes of floating-point types:

    C Datatype   Size                             Approx. Precision  Approx. Range  Exponent Bits  Fraction Bits  +-1 range
    float        4 bytes (everywhere)             1.0x10^-7          10^38          8              23             2^24
    double       8 bytes (everywhere)             2.0x10^-15         10^308         11             52             2^53
    long double  12-16 bytes (if it even exists)  2.0x10^-20         10^4932        15             64             2^65

Nowadays floats have roughly the same performance as integers: addition, subtraction, or multiplication all take about a nanosecond.  That is, floats are now cheap, and you can consider using floats for all sorts of stuff--even when you don't care about fractions!  The advantages of using floats are:

Normal (non-Weird) Floats

Recall that a "float" as defined by IEEE Standard 754 consists of three bitfields:

    Sign                       1 bit: 0 for positive, 1 for negative
    Exponent                   8 bits: 127 means 2^0, 137 means 2^10
    Mantissa (or "Fraction")   23 bits: a binary fraction.

The hardware usually interprets a float as having the value:

    value = (-1)^sign * 2^(exponent-127) * 1.fraction

Note that the mantissa normally has an implicit leading 1 applied.  

Weird: Zeros and Denormals

However, if the "exponent" field is exactly zero, the implicit leading digit is taken to be 0, like this:

    value = (-1)^sign * 2^(-126) * 0.fraction

Suppressing the leading 1 allows you to exactly represent 0: the bit pattern for 0.0 is just exponent==0 and fraction==0 (that is, everything zero).  If you set the sign bit to negative, you have "negative zero", a strange curiosity.  Positive and negative zero work the same way in nearly all arithmetic operations (the sign does show up when you divide by them; see the division table below), and as far as I know there's no reason to prefer one to the other.  The "==" operator claims positive and negative zero are the same!

If the fraction field isn't zero, but the exponent field is, you have a "denormalized number"--these are numbers too small to represent with a leading one.  Zero itself uses this same exponent==0 encoding, but denormals (also known as "subnormal" values) also provide a little more range at the very low end: ordinary floats stop at about 1.2e-38, while denormals reach down to about 1.4e-45 for "float"; for "double" the corresponding limits are about 2.2e-308 and 4.9e-324.

See below for the performance problem with denormals.

Weird: Infinity

If the exponent field is as big as it can get (for "float", 255), this indicates another sort of special number.  If the fraction field is zero, the number is interpreted as positive or negative "infinity".  The hardware will generate "infinity" when dividing by zero, or when another operation exceeds the representable range.
float z=0.0;
float f=1.0/z;
std::cout<<f<<"\n";
return (int)f;

(Try this in NetRun now!)

Arithmetic on infinities works just the way you'd expect: infinity plus 1.0 gives infinity, etc. (See tables below).  Positive and negative infinities exist, and work as you'd expect.  Note that while divide-by-integer-zero causes a crash (divide by zero error), divide-by-floating-point-zero just happily returns infinity by default.

You can also get to infinity by adding a number to itself repeatedly, for example:
float x=1.0;
while (true) {
    float old_x=x;
    x=x+x;
    std::cout<<x<<"\n";
    if (x==old_x) {
        std::cout<<"Finally hit "<<x<<" and stopped.\n";
        return 0;
    }
}

(Try this in NetRun now!)

This is the same type of infinity you'd get by dividing by zero.

Weird: NaN

If you do an operation that doesn't make sense, like taking the square root of a negative number, dividing 0.0 by 0.0, or subtracting infinity from infinity, the machine just gives a special "error" number called a "NaN" (Not-a-Number).  The idea is if you run some complicated program that screws up, you don't want to get a plausible but wrong answer like "4" (like we get with integer overflow!); you want something totally implausible like "nan" to indicate an error happened.   For example, this program prints "nan" and returns -2147483648 (0x80000000):
float f=sqrt(-1.0);
std::cout<<f<<"\n";
return (int)f;

(Try this in NetRun now!)

This is a "NaN", which is represented with a huge exponent and a *nonzero* fraction field.  Positive and negative nans exist, but like zeros both signs seem to work the same.  x86 seems to rewrite the bits of all NaNs to a special pattern it prefers (0x7FC00000 for float, with exponent bits and the leading fraction bit all set to 1).

Performance impact of special values

Machines properly handle ordinary floating-point numbers and zero in hardware at full speed.

However, many machines *don't* handle denormals, infinities, or NaNs in hardware at full speed--instead when one of these special values occurs, they trap out to software (or slow microcode) which handles the problem and restarts the computation.  This trapping process takes time, as shown in the following program:
(Executable NetRun Link)
enum {n_vals=1000};
double vals[n_vals];

/* The timed kernel: average each value with its neighbor */
int average_vals(void) {
    for (int i=0;i<n_vals-1;i++)
        vals[i]=0.5*(vals[i]+vals[i+1]);
    return 0;
}

/* time_function() comes from the NetRun environment */
int foo(void) {
    int i;
    for (i=0;i<n_vals;i++) vals[i]=0.0;
    printf(" Zeros: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
    for (i=0;i<n_vals;i++) vals[i]=1.0;
    printf(" Ones: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
    for (i=0;i<n_vals;i++) vals[i]=1.0e-310;
    printf(" Denorm: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
    float x=0.0;
    for (i=0;i<n_vals;i++) vals[i]=1.0/x;
    printf(" Inf: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
    for (i=0;i<n_vals;i++) vals[i]=x/x;
    printf(" NaN: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
    return 0;
}
Many machines run *seriously* slower for the weird numbers.  Here are the results of the above program, in nanoseconds per float operation, on a variety of machines:


          Intel P3  Intel P4  Core2   Q6600   Sandy Bridge  Phenom II  PPC G5  MIPS R5000  Intel 486
Zero      4.0       1.6       1.6     1.1     0.6           1.0        2.3     131.0       1215.8
One       4.0       1.6       1.9     1.1     0.6           1.0        2.2     130.6       864.8
Denorm    335.1     295.5     517.9   130.0   46.3          109.0      10.1    24437.0     3879.0
Infinity  191.9     706.4     346.9   1.1     0.6           1.0        2.1     153.2       2558.2
NaN       206.2     772.2     356.3   1.1     0.6           1.0        2.1     10924.1     3103.7

Generally, no machine has any performance penalty for zero, despite it being somewhat "weird".

Virtually all current machines have some performance penalty for denormalized numbers, sometimes hundreds of times slower than ordinary numbers.

Infinities and NaN are fast again on most recent machines.

My friends at Illinois and I wrote a paper on this with many more performance details.


Bonus: Arithmetic Tables for Special Floating-Point Numbers

These tables were computed for "float", but should be identical with any number size on any IEEE machine (which virtually everything is).  "big" is a large but finite number, here 1.0e30.  "lil" is a denormalized number, here 1.0e-40. "inf" is an infinity.  "nan" is a Not-A-Number.  Here's the source code to generate these tables.

These all go about how you'd expect--"inf" for things that are too big (or -inf for too small), "nan" for things that don't make sense (like 0.0/0.0, or infinity times zero, or nan with anything else).

Addition

+ -nan -inf -big -1 -lil -0 +0 +lil +1 +big +inf +nan
-nan nan nan nan nan nan nan nan nan nan nan nan nan
-inf nan -inf -inf -inf -inf -inf -inf -inf -inf -inf nan nan
-big nan -inf -2e+30 -big -big -big -big -big -big 0 +inf nan
-1 nan -inf -big -2 -1 -1 -1 -1 0 +big +inf nan
-lil nan -inf -big -1 -2e-40 -lil -lil 0 +1 +big +inf nan
-0 nan -inf -big -1 -lil -0 0 +lil +1 +big +inf nan
+0 nan -inf -big -1 -lil 0 0 +lil +1 +big +inf nan
+lil nan -inf -big -1 0 +lil +lil 2e-40 +1 +big +inf nan
+1 nan -inf -big 0 +1 +1 +1 +1 2 +big +inf nan
+big nan -inf 0 +big +big +big +big +big +big 2e+30 +inf nan
+inf nan nan +inf +inf +inf +inf +inf +inf +inf +inf +inf nan
+nan nan nan nan nan nan nan nan nan nan nan nan nan
Note how infinity-infinity gives nan, but infinity+infinity is infinity.

Subtraction

- -nan -inf -big -1 -lil -0 +0 +lil +1 +big +inf +nan
-nan nan nan nan nan nan nan nan nan nan nan nan nan
-inf nan nan -inf -inf -inf -inf -inf -inf -inf -inf -inf nan
-big nan +inf 0 -big -big -big -big -big -big -2e+30 -inf nan
-1 nan +inf +big 0 -1 -1 -1 -1 -2 -big -inf nan
-lil nan +inf +big +1 0 -lil -lil -2e-40 -1 -big -inf nan
-0 nan +inf +big +1 +lil 0 -0 -lil -1 -big -inf nan
+0 nan +inf +big +1 +lil 0 0 -lil -1 -big -inf nan
+lil nan +inf +big +1 2e-40 +lil +lil 0 -1 -big -inf nan
+1 nan +inf +big 2 +1 +1 +1 +1 0 -big -inf nan
+big nan +inf 2e+30 +big +big +big +big +big +big 0 -inf nan
+inf nan +inf +inf +inf +inf +inf +inf +inf +inf +inf nan nan
+nan nan nan nan nan nan nan nan nan nan nan nan nan

Multiplication

* -nan -inf -big -1 -lil -0 +0 +lil +1 +big +inf +nan
-nan nan nan nan nan nan nan nan nan nan nan nan nan
-inf nan +inf +inf +inf +inf nan nan -inf -inf -inf -inf nan
-big nan +inf +inf +big 1e-10 0 -0 -1e-10 -big -inf -inf nan
-1 nan +inf +big +1 +lil 0 -0 -lil -1 -big -inf nan
-lil nan +inf 1e-10 +lil 0 0 -0 -0 -lil -1e-10 -inf nan
-0 nan nan 0 0 0 0 -0 -0 -0 -0 nan nan
+0 nan nan -0 -0 -0 -0 0 0 0 0 nan nan
+lil nan -inf -1e-10 -lil -0 -0 0 0 +lil 1e-10 +inf nan
+1 nan -inf -big -1 -lil -0 0 +lil +1 +big +inf nan
+big nan -inf -inf -big -1e-10 -0 0 1e-10 +big +inf +inf nan
+inf nan -inf -inf -inf -inf nan nan +inf +inf +inf +inf nan
+nan nan nan nan nan nan nan nan nan nan nan nan nan
Note that 0*infinity gives nan, and out-of-range multiplications give infinities.

Division

/ -nan -inf -big -1 -lil -0 +0 +lil +1 +big +inf +nan
-nan nan nan nan nan nan nan nan nan nan nan nan nan
-inf nan nan +inf +inf +inf +inf -inf -inf -inf -inf nan nan
-big nan 0 +1 +big +inf +inf -inf -inf -big -1 -0 nan
-1 nan 0 1e-30 +1 +inf +inf -inf -inf -1 -1e-30 -0 nan
-lil nan 0 0 +lil +1 +inf -inf -1 -lil -0 -0 nan
-0 nan 0 0 0 0 nan nan -0 -0 -0 -0 nan
+0 nan -0 -0 -0 -0 nan nan 0 0 0 0 nan
+lil nan -0 -0 -lil -1 -inf +inf +1 +lil 0 0 nan
+1 nan -0 -1e-30 -1 -inf -inf +inf +inf +1 1e-30 0 nan
+big nan -0 -1 -big -inf -inf +inf +inf +big +1 0 nan
+inf nan nan -inf -inf -inf -inf +inf +inf +inf +inf nan nan
+nan nan nan nan nan nan nan nan nan nan nan nan nan
Note that 0/0, and inf/inf give NaNs; while out-of-range divisions like big/lil or 1.0/0.0 give infinities (and not errors!).

Equality

== -nan -inf -big -1 -lil -0 +0 +lil +1 +big +inf +nan
-nan 0 0 0 0 0 0 0 0 0 0 0 0
-inf 0 +1 0 0 0 0 0 0 0 0 0 0
-big 0 0 +1 0 0 0 0 0 0 0 0 0
-1 0 0 0 +1 0 0 0 0 0 0 0 0
-lil 0 0 0 0 +1 0 0 0 0 0 0 0
-0 0 0 0 0 0 +1 +1 0 0 0 0 0
+0 0 0 0 0 0 +1 +1 0 0 0 0 0
+lil 0 0 0 0 0 0 0 +1 0 0 0 0
+1 0 0 0 0 0 0 0 0 +1 0 0 0
+big 0 0 0 0 0 0 0 0 0 +1 0 0
+inf 0 0 0 0 0 0 0 0 0 0 +1 0
+nan 0 0 0 0 0 0 0 0 0 0 0 0
Note that positive and negative zeros are considered equal, and a "NaN" doesn't equal anything--even itself!

Less-Than

< -nan -inf -big -1 -lil -0 +0 +lil +1 +big +inf +nan
-nan 0 0 0 0 0 0 0 0 0 0 0 0
-inf 0 0 +1 +1 +1 +1 +1 +1 +1 +1 +1 0
-big 0 0 0 +1 +1 +1 +1 +1 +1 +1 +1 0
-1 0 0 0 0 +1 +1 +1 +1 +1 +1 +1 0
-lil 0 0 0 0 0 +1 +1 +1 +1 +1 +1 0
-0 0 0 0 0 0 0 0 +1 +1 +1 +1 0
+0 0 0 0 0 0 0 0 +1 +1 +1 +1 0
+lil 0 0 0 0 0 0 0 0 +1 +1 +1 0
+1 0 0 0 0 0 0 0 0 0 +1 +1 0
+big 0 0 0 0 0 0 0 0 0 0 +1 0
+inf 0 0 0 0 0 0 0 0 0 0 0 0
+nan 0 0 0 0 0 0 0 0 0 0 0 0
Note that "NaN" returns false to all comparisons--it's neither smaller nor larger than the other numbers.