Bits in a Float, and Infinity, NaN, and denormal
CS 301 Lecture, Dr. Lawlor
Bits in a Floating-Point Number
Floats represent continuous values, but they do it using discrete bits.

A "float" (as defined by IEEE Standard 754) consists of three bitfields:

    Sign:  1 bit.  0 for positive, 1 for negative.
    Exponent:  8 unsigned bits.  127 means 2^0; 137 means 2^10.
    Fraction (or "Mantissa"):  23 bits, a binary fraction.
        Don't forget the implicit leading 1!

The sign is in the highest-order bit, the exponent in the next 8 bits,
and the fraction in the remaining bits.
The hardware interprets a float as having the value:

    value = (-1)^sign * 2^(exponent-127) * 1.fraction

Note that the mantissa has an implicit leading binary 1 applied.  The 1
isn't stored, which actually causes some headaches.  (Even worse, if the
exponent field is zero, then it's an implicit leading 0: a
"denormalized" number, as we'll talk about on Wednesday.)

For example, the value "8" would be stored with sign bit 0, exponent
130 (==3+127), and mantissa 000... (without the leading 1), since:

    8 = (-1)^0 * 2^(130-127) * 1.0000...
You can stare at the bits inside a float by converting it to an
integer.  The quick and dirty way to do this is via a pointer
typecast, but modern compilers will sometimes over-optimize this,
especially in inlined code:

#include <iostream>
#include <iomanip>

void print_bits(float f) {
	int i=*reinterpret_cast<int *>(&f); /* read bits with "pointer shuffle" */
	std::cout<<" float "<<std::setw(10)<<f<<" = ";
	for (int bit=31;bit>=0;bit--) {
		if (i&(1<<bit)) std::cout<<"1"; else std::cout<<"0";
		if (bit==31) std::cout<<" ";
		if (bit==23) std::cout<<" (implicit 1).";
	}
	std::cout<<std::endl;
}
int foo(void) {
	print_bits(0.0);
	print_bits(1.0);
	print_bits(-1.0);
	print_bits(2.0);
	print_bits(4.0);
	print_bits(8.0);
	print_bits(1.125);
	print_bits(1.25);
	print_bits(1.5);
	print_bits(1+1.0/10);
	return sizeof(float);
}
(Try this in NetRun now!)
The official way to dissect the parts of a float is using a "union" and
a bitfield like so:

/* IEEE floating-point number's bits: sign, exponent, mantissa */
struct float_bits {
	unsigned int fraction:23; /**< Value is binary 1.fraction ("mantissa") */
	unsigned int exp:8; /**< Value is 2^(exp-127) */
	unsigned int sign:1; /**< 0 for positive, 1 for negative */
};
/* A union is a struct where all the fields *overlap* each other */
union float_dissector {
	float f;
	float_bits b;
};
int foo(void) {
	float_dissector s;
	s.f=8.0;
	std::cout<<s.f<<" = sign "<<s.b.sign<<"  exp "<<s.b.exp<<"  fract "<<s.b.fraction<<"\n";
	return 0;
}
(Executable NetRun link)
I like to joke that a union misused to convert bits between incompatible types is an "unholy union".
In addition to the 32-bit "float", there are several other sizes of floating-point types:

C Datatype    Size                              Approx. Precision   Approx. Range   Exponent Bits   Fraction Bits   "+1" range
float         4 bytes (everywhere)              1.0x10^-7           10^38           8               23              2^24
double        8 bytes (everywhere)              2.0x10^-15          10^308          11              52              2^53
long double   12-16 bytes (if it even exists)   2.0x10^-20          10^4932         15              64              2^65
Nowadays floats have roughly the same performance as integers:
addition, subtraction, or multiplication all take about a nanosecond.
That is, floats are now cheap, and you can consider using floats for
all sorts of stuff, even when you don't care about fractions!  The
advantages of using floats are:
 - Floats can store fractional numbers.
 - Floats never overflow; they hit "infinity" as explored below.
 - "double" has more bits than "int" (but fewer than "long").
Normal (non-Weird) Floats

Recall that a "float" as defined by IEEE Standard 754 consists of three bitfields:

    Sign:  1 bit.  0 for positive, 1 for negative.
    Exponent:  8 bits.  127 means 2^0; 137 means 2^10.
    Mantissa (or Fraction):  23 bits, a binary fraction.

The hardware usually interprets a float as having the value:

    value = (-1)^sign * 2^(exponent-127) * 1.fraction

Note that the mantissa normally has an implicit leading 1 applied.
Weird: Zeros and Denormals

However, if the "exponent" field is exactly zero, the implicit leading
digit is taken to be 0, like this:

    value = (-1)^sign * 2^(-126) * 0.fraction

Suppressing the leading 1 allows you to exactly represent 0: the bit
pattern for 0.0 is just exponent==0 and fraction==00000000 (that is,
everything zero).  If you set the sign bit to negative, you have
"negative zero", a strange curiosity.  Positive and negative zero work
the same way in arithmetic operations, and as far as I know there's no
reason to prefer one to the other.  The "==" operator claims positive
and negative zero are the same!

If the fraction field isn't zero, but the exponent field is, you have a
"denormalized number": these are numbers too small to represent with a
leading one.  You always need denormals to represent zero, but denormals
(also known as "subnormal" values) also provide a little more range at
the very low end: they can store values down to around 1.0e-40 for
"float", and 1.0e-310 for "double".  See below for the performance
problem with denormals.
Weird: Infinity
If the exponent field is as big as it can get (for "float", 255), this
indicates another sort of special number. If the fraction field
is zero, the number is interpreted as positive or negative
"infinity". The hardware will generate "infinity" when dividing
by zero, or when another operation exceeds the representable range.
float z=0.0;
float f=1.0/z;
std::cout<<f<<"\n";
return (int)f;
(Try this in NetRun now!)
Arithmetic on infinities works just the way you'd expect: infinity plus
1.0 gives infinity, etc. (see tables below).  Positive and negative
infinities exist, and work as you'd expect.  Note that while
divide-by-integer-zero causes a crash (divide by zero error),
divide-by-floating-point-zero just happily returns infinity by
default.
You can also get to infinity by adding a number to itself repeatedly, for example:
float x=1.0;
while (true) {
float old_x=x;
x=x+x;
std::cout<<x<<"\n";
if (x==old_x) {
std::cout<<"Finally hit "<<x<<" and stopped.\n";
return 0;
}
}
(Try this in NetRun now!)
This is the same type of infinity you'd get by dividing by zero.
Weird: NaN

If you do an operation that doesn't make sense, like:
 - 0.0/0.0 (neither zero nor infinity, because we'd want (x/x)==1.0; but not 1.0 either, because we'd want (2*x)/x==2.0...)
 - infinity - infinity (might cancel out to anything)
 - infinity * 0

the machine just gives a special "error" number called a "NaN"
(Not-a-Number).  The idea is that if you run some complicated program
that screws up, you don't want to get a plausible but wrong answer like
"4" (like we get with integer overflow!); you want something totally
implausible like "nan" to indicate an error happened.  For example,
this program prints "nan" and returns -2147483648 (0x80000000):

float f=sqrt(-1.0);
std::cout<<f<<"\n";
return (int)f;
(Try this in NetRun now!)
This is a "NaN", which is represented with a huge exponent and a
*nonzero* fraction field.  Positive and negative NaNs exist, but like
zeros, both signs seem to work the same.  x86 seems to rewrite the bits
of all NaNs to a special pattern it prefers (0x7FC00000 for float, with
the exponent bits and the leading fraction bit all set to 1).
Performance impact of special values
Machines properly handle ordinary floating-point numbers and zero in hardware at full speed.
However, most modern machines *don't* handle denormals, infinities, or
NaNs in hardware; instead, when one of these special values occurs,
they trap out to software which handles the problem and restarts the
computation.  This trapping process takes time, as shown in the
following program:
(Executable NetRun Link)
enum {n_vals=1000};
double vals[n_vals];
int average_vals(void) {
	for (int i=0;i<n_vals-1;i++)
		vals[i]=0.5*(vals[i]+vals[i+1]);
	return 0;
}
int foo(void) {
	int i;
	for (i=0;i<n_vals;i++) vals[i]=0.0;
	printf("   Zeros: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
	for (i=0;i<n_vals;i++) vals[i]=1.0;
	printf("    Ones: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
	for (i=0;i<n_vals;i++) vals[i]=1.0e-310;
	printf("  Denorm: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
	float x=0.0;
	for (i=0;i<n_vals;i++) vals[i]=1.0/x;
	printf("     Inf: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
	for (i=0;i<n_vals;i++) vals[i]=x/x;
	printf("     NaN: %.3f ns/float\n",time_function(average_vals)/n_vals*1.0e9);
	return 0;
}
Many machines run *seriously* slower for the weird numbers. Here
are the results of the above program, in nanoseconds per float
operation, on a variety of machines:

           Intel P3   Intel P4   Core2   Q6600   Sandy Bridge   Phenom II   PPC G5   MIPS R5000   Intel 486
Zero            4.0        1.6     1.6     1.1            0.6         1.0      2.3        131.0      1215.8
One             4.0        1.6     1.9     1.1            0.6         1.0      2.2        130.6       864.8
Denorm        335.1      295.5   517.9   130.0           46.3       109.0     10.1      24437.0      3879.0
Infinity      191.9      706.4   346.9     1.1            0.6         1.0      2.1        153.2      2558.2
NaN           206.2      772.2   356.3     1.1            0.6         1.0      2.1      10924.1      3103.7
Generally, no machine has any performance penalty for zero, despite it being somewhat "weird".
Virtually all current machines have some performance penalty for
denormalized numbers, sometimes hundreds of times slower than ordinary
numbers.
Infinities and NaN are fast again on most recent machines.
My friends at Illinois and I wrote a paper on this with many more performance details.
Bonus: Arithmetic Tables for Special Floating-Point Numbers

These tables were computed for "float", but should be identical with
any number size on any IEEE machine (which virtually everything is).
"big" is a large but finite number, here 1.0e30.  "lil" is a
denormalized number, here 1.0e-40.  "inf" is an infinity.  "nan" is a
Not-A-Number.  Here's the source code to generate these tables.

These all go about how you'd expect: "inf" for things that are too big
(or "-inf" for things too far negative), "nan" for things that don't
make sense (like 0.0/0.0, or infinity times zero, or nan with anything
else).
Addition

    +  |  -nan   -inf   -big     -1   -lil     -0     +0   +lil     +1   +big   +inf   +nan
 -nan  |   nan    nan    nan    nan    nan    nan    nan    nan    nan    nan    nan    nan
 -inf  |   nan   -inf   -inf   -inf   -inf   -inf   -inf   -inf   -inf   -inf    nan    nan
 -big  |   nan   -inf -2e+30   -big   -big   -big   -big   -big   -big      0   +inf    nan
   -1  |   nan   -inf   -big     -2     -1     -1     -1     -1      0   +big   +inf    nan
 -lil  |   nan   -inf   -big     -1 -2e-40   -lil   -lil      0     +1   +big   +inf    nan
   -0  |   nan   -inf   -big     -1   -lil     -0      0   +lil     +1   +big   +inf    nan
   +0  |   nan   -inf   -big     -1   -lil      0      0   +lil     +1   +big   +inf    nan
 +lil  |   nan   -inf   -big     -1      0   +lil   +lil  2e-40     +1   +big   +inf    nan
   +1  |   nan   -inf   -big      0     +1     +1     +1     +1      2   +big   +inf    nan
 +big  |   nan   -inf      0   +big   +big   +big   +big   +big   +big  2e+30   +inf    nan
 +inf  |   nan    nan   +inf   +inf   +inf   +inf   +inf   +inf   +inf   +inf   +inf    nan
 +nan  |   nan    nan    nan    nan    nan    nan    nan    nan    nan    nan    nan    nan
Note how infinity - infinity gives nan, but infinity + infinity is infinity.
Subtraction

    -  |  -nan   -inf   -big     -1   -lil     -0     +0   +lil     +1   +big   +inf   +nan
 -nan  |   nan    nan    nan    nan    nan    nan    nan    nan    nan    nan    nan    nan
 -inf  |   nan    nan   -inf   -inf   -inf   -inf   -inf   -inf   -inf   -inf   -inf    nan
 -big  |   nan   +inf      0   -big   -big   -big   -big   -big   -big -2e+30   -inf    nan
   -1  |   nan   +inf   +big      0     -1     -1     -1     -1     -2   -big   -inf    nan
 -lil  |   nan   +inf   +big     +1      0   -lil   -lil -2e-40     -1   -big   -inf    nan
   -0  |   nan   +inf   +big     +1   +lil      0     -0   -lil     -1   -big   -inf    nan
   +0  |   nan   +inf   +big     +1   +lil      0      0   -lil     -1   -big   -inf    nan
 +lil  |   nan   +inf   +big     +1  2e-40   +lil   +lil      0     -1   -big   -inf    nan
   +1  |   nan   +inf   +big      2     +1     +1     +1     +1      0   -big   -inf    nan
 +big  |   nan   +inf  2e+30   +big   +big   +big   +big   +big   +big      0   -inf    nan
 +inf  |   nan   +inf   +inf   +inf   +inf   +inf   +inf   +inf   +inf   +inf    nan    nan
 +nan  |   nan    nan    nan    nan    nan    nan    nan    nan    nan    nan    nan    nan
Multiplication

    *  |  -nan   -inf   -big     -1   -lil     -0     +0   +lil     +1   +big   +inf   +nan
 -nan  |   nan    nan    nan    nan    nan    nan    nan    nan    nan    nan    nan    nan
 -inf  |   nan   +inf   +inf   +inf   +inf    nan    nan   -inf   -inf   -inf   -inf    nan
 -big  |   nan   +inf   +inf   +big  1e-10      0     -0 -1e-10   -big   -inf   -inf    nan
   -1  |   nan   +inf   +big     +1   +lil      0     -0   -lil     -1   -big   -inf    nan
 -lil  |   nan   +inf  1e-10   +lil      0      0     -0     -0   -lil -1e-10   -inf    nan
   -0  |   nan    nan      0      0      0      0     -0     -0     -0     -0    nan    nan
   +0  |   nan    nan     -0     -0     -0     -0      0      0      0      0    nan    nan
 +lil  |   nan   -inf -1e-10   -lil     -0     -0      0      0   +lil  1e-10   +inf    nan
   +1  |   nan   -inf   -big     -1   -lil     -0      0   +lil     +1   +big   +inf    nan
 +big  |   nan   -inf   -inf   -big -1e-10     -0      0  1e-10   +big   +inf   +inf    nan
 +inf  |   nan   -inf   -inf   -inf   -inf    nan    nan   +inf   +inf   +inf   +inf    nan
 +nan  |   nan    nan    nan    nan    nan    nan    nan    nan    nan    nan    nan    nan
Note that 0*infinity gives nan, and out-of-range multiplications give infinities.
Division

    /  |  -nan   -inf   -big     -1   -lil     -0     +0   +lil     +1   +big   +inf   +nan
 -nan  |   nan    nan    nan    nan    nan    nan    nan    nan    nan    nan    nan    nan
 -inf  |   nan    nan   +inf   +inf   +inf   +inf   -inf   -inf   -inf   -inf    nan    nan
 -big  |   nan      0     +1   +big   +inf   +inf   -inf   -inf   -big     -1     -0    nan
   -1  |   nan      0  1e-30     +1   +inf   +inf   -inf   -inf     -1 -1e-30     -0    nan
 -lil  |   nan      0      0   +lil     +1   +inf   -inf     -1   -lil     -0     -0    nan
   -0  |   nan      0      0      0      0    nan    nan     -0     -0     -0     -0    nan
   +0  |   nan     -0     -0     -0     -0    nan    nan      0      0      0      0    nan
 +lil  |   nan     -0     -0   -lil     -1   -inf   +inf     +1   +lil      0      0    nan
   +1  |   nan     -0 -1e-30     -1   -inf   -inf   +inf   +inf     +1  1e-30      0    nan
 +big  |   nan     -0     -1   -big   -inf   -inf   +inf   +inf   +big     +1      0    nan
 +inf  |   nan    nan   -inf   -inf   -inf   -inf   +inf   +inf   +inf   +inf    nan    nan
 +nan  |   nan    nan    nan    nan    nan    nan    nan    nan    nan    nan    nan    nan
Note that 0/0 and inf/inf give NaNs; while out-of-range divisions like big/lil or 1.0/0.0 give infinities (and not errors!).
Equality

   ==  |  -nan   -inf   -big     -1   -lil     -0     +0   +lil     +1   +big   +inf   +nan
 -nan  |     0      0      0      0      0      0      0      0      0      0      0      0
 -inf  |     0     +1      0      0      0      0      0      0      0      0      0      0
 -big  |     0      0     +1      0      0      0      0      0      0      0      0      0
   -1  |     0      0      0     +1      0      0      0      0      0      0      0      0
 -lil  |     0      0      0      0     +1      0      0      0      0      0      0      0
   -0  |     0      0      0      0      0     +1     +1      0      0      0      0      0
   +0  |     0      0      0      0      0     +1     +1      0      0      0      0      0
 +lil  |     0      0      0      0      0      0      0     +1      0      0      0      0
   +1  |     0      0      0      0      0      0      0      0     +1      0      0      0
 +big  |     0      0      0      0      0      0      0      0      0     +1      0      0
 +inf  |     0      0      0      0      0      0      0      0      0      0     +1      0
 +nan  |     0      0      0      0      0      0      0      0      0      0      0      0
Note that positive and negative zeros are considered equal, and a "NaN" doesn't equal anything, even itself!
Less-Than

    <  |  -nan   -inf   -big     -1   -lil     -0     +0   +lil     +1   +big   +inf   +nan
 -nan  |     0      0      0      0      0      0      0      0      0      0      0      0
 -inf  |     0      0     +1     +1     +1     +1     +1     +1     +1     +1     +1      0
 -big  |     0      0      0     +1     +1     +1     +1     +1     +1     +1     +1      0
   -1  |     0      0      0      0     +1     +1     +1     +1     +1     +1     +1      0
 -lil  |     0      0      0      0      0     +1     +1     +1     +1     +1     +1      0
   -0  |     0      0      0      0      0      0      0     +1     +1     +1     +1      0
   +0  |     0      0      0      0      0      0      0     +1     +1     +1     +1      0
 +lil  |     0      0      0      0      0      0      0      0     +1     +1     +1      0
   +1  |     0      0      0      0      0      0      0      0      0     +1     +1      0
 +big  |     0      0      0      0      0      0      0      0      0      0     +1      0
 +inf  |     0      0      0      0      0      0      0      0      0      0      0      0
 +nan  |     0      0      0      0      0      0      0      0      0      0      0      0
Note that "NaN" returns false to all comparisons; it's neither smaller nor larger than the other numbers.