Floating-Point Numbers
CS 301: Assembly Language Programming Lecture, Dr. Lawlor
Interactive demo: float.exposed
Nowadays floats have roughly the same
performance as integers: addition, subtraction, or multiplication
all take about a nanosecond. That is, floats are now cheap,
and you can consider using floats for all sorts of stuff--even
when you don't care about fractions! The advantages of using
floats are:
- Floats can store fractional numbers.
- Floats never overflow; they hit "infinity" instead.
- "double" has more bits than "int" (but less than "long").
Due to these advantages, many interpreted languages including
JavaScript have only one numeric type, usually double-precision
float.
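For example, here's a NetRun-style sketch (the values are just illustrative) showing a float saturating to infinity instead of wrapping around the way an integer overflow does:

float f=1.0e38f; /* near the top of float's range */
f=f*10.0f;       /* too big to represent... */
std::cout<<f<<"\n"; /* ...prints "inf" instead of wrapping */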
Floats: Normalized Numbers
In C++, "float" and "double" store
numbers in an odd way--internally they're really storing the
number in scientific notation, like
x = + 3.785746 * 10^5
Note that:
- You only need one bit to represent the sign--plus or minus.
- The exponent's just an integer, so you can store it as an
integer (in binary).
- The 3.785746 part, called the "mantissa" or just "fraction"
part, can be stored as the integer 3785746 (at least as long as
you can figure out where the decimal point goes!). Of
course, we store the fraction part as a binary number.
One problem is that scientific notation can
represent the same number in several different ways:
x = + 3.785746 * 10^5 = + 0.3785746 * 10^6 = + 0.03785746 * 10^7 = + 37.85746 * 10^4
It's common to "normalize" a number in
scientific notation so that:
- There's exactly one digit to the left of the decimal point.
- And that digit is not zero.
In binary, a "normalized" number *always* has a 1 at the left of the decimal point (if it ain't zero, it's gotta be one). So most machines don't even store the 1, it's implicit--you just know it's there!
This means the 10^5 version above is the "normalized" way to write the number.
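For example, in binary the number 6 is 110.0, which normalizes to 1.10 * 2^2: the machine stores only the exponent 2 (biased, as described below) and the fraction bits 10..., leaving the leading 1 implicit.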
Bits in a Floating-Point Number
Floats represent continuous
values. But they do it using discrete bits.
[Bit layout diagram: the 32 bits of a float, from highest to lowest: 1 sign bit (s), 8 exponent bits (e), then 23 fraction bits (f).]
A "float" (as defined by IEEE Standard 754) consists of three bitfields:
Sign
|
Exponent
|
Fraction
(or "Mantissa")
|
1 bit--
0 for positive
1 for negative
|
8 unsigned bits--
117 means 2-10
127 means 20
137 means 210
|
23 bits-- a binary
fraction.
There's an implicit leading 1
(unless the exponent field is zero)
|
The sign is in the highest-order bit,
the exponent in the next 8 bits, and the fraction in the lower
bits.
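In other words, given the raw 32 bits of a float in an unsigned int i, you could pull the three fields out with shifts and masks, as in this sketch (the variable names are just illustrative):

unsigned int sign = (i>>31)&1;       /* top bit */
unsigned int exponent = (i>>23)&0xFF; /* next 8 bits */
unsigned int fraction = i&0x7FFFFF;   /* low 23 bits */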
The hardware interprets a normal float
as having the value:
value = (-1)^sign * 2^(exponent-127) * 1.fraction
Note that the mantissa has an implicit leading
binary 1 applied. The 1 isn't stored, which actually causes
some headaches. (Even worse, if the exponent field is zero,
then it's an implicit leading 0; a "denormalized" number.)
For example, the value "8" would be
stored with sign bit 0, exponent 130 (==3+127), and mantissa
000... (without the leading 1), since:
8 = (-1)^0 * 2^(130-127) * 1.0000....
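Likewise, the value "1.5" would be stored with sign bit 0, exponent 127 (==0+127), and mantissa 100... (binary 1.1 with the leading 1 dropped), since:
1.5 = (-1)^0 * 2^(127-127) * 1.1000....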
You can stare at the bits inside a float
by converting it to an integer. The quick and dirty way to
do this is via a pointer typecast, but modern compilers will
sometimes over-optimize this, especially in inlined code:
void print_bits(float f) {
	unsigned int i=*reinterpret_cast<unsigned int *>(&f); /* read bits with "pointer shuffle" */
	std::cout<<" float "<<std::setw(10)<<f<<" = ";
	for (int bit=31;bit>=0;bit--) {
		if (i&(1u<<bit)) std::cout<<"1"; else std::cout<<"0"; /* print each bit, high to low */
		if (bit==31) std::cout<<" "; /* gap after the sign bit */
		if (bit==23) std::cout<<" (implicit 1)."; /* marks the exponent/fraction boundary */
	}
	std::cout<<std::endl;
}
int foo(void) {
print_bits(0.0);
print_bits(-1.0);
print_bits(1.0);
print_bits(2.0);
print_bits(4.0);
print_bits(8.0);
print_bits(1.125);
print_bits(1.25);
print_bits(1.5);
print_bits(1+1.0/10);
return sizeof(float);
}
(Try this in NetRun now!)
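If the strict-aliasing optimizations do bite you, a safer way to grab the bits is to copy them with std::memcpy, which compilers reliably turn into a single move instruction. This is a sketch, not the lecture's original code (float_to_bits is just an illustrative name):

#include <cstring> /* for std::memcpy */
unsigned int float_to_bits(float f) {
	unsigned int i;
	std::memcpy(&i,&f,sizeof(i)); /* copy the 4 bytes; no pointer aliasing */
	return i;
}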
The official way to dissect the parts of
a float is using a "union" and a bitfield like so:
/* IEEE floating-point number's bits: sign exponent mantissa */
struct float_bits {
unsigned int fraction:23; /**< Value is binary 1.fraction ("mantissa") */
unsigned int exp:8; /**< Value is 2^(exp-127) */
unsigned int sign:1; /**< 0 for positive, 1 for negative */
};
/* A union is a struct where all the fields *overlap* each other */
union float_dissector {
float f;
float_bits b;
};
float_dissector s;
s.f=8.0;
std::cout<<s.f<<"= sign "<<s.b.sign<<" exp "<<s.b.exp<<" fract "<<s.b.fraction<<"\n";
return 0;
(Executable NetRun link)
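On a C++20 compiler you can also do this with std::bit_cast from the <bit> header (assuming your toolchain is new enough), which is fully defined behavior:

#include <bit> /* C++20 */
unsigned int i = std::bit_cast<unsigned int>(8.0f); /* gives 0x41000000 */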
I like to joke that a union used to
convert bits between incompatible types is an "unholy union".
In addition to the 32-bit "float", there
are several other different sizes of floating-point types:
C Datatype | Size | Approx. Precision | Approx. Range | Exponent Bits | Fraction Bits | +-1 range
float | 4 bytes (everywhere) | 1.0x10^-7 | 10^38 | 8 | 23 | 2^24
double | 8 bytes (everywhere) | 2.0x10^-15 | 10^308 | 11 | 52 | 2^53
long double | 12-16 bytes (if it even exists) | 2.0x10^-20 | 10^4932 | 15 | 64 | 2^65
half float / __fp16 | 2 bytes (only on late-2010ish machines) | 1.0x10^-3 | 10^5 | 5 | 10 | 2^11
fp8 float | 1 byte (only on 2023+ GPUs) | approx 1.0x10^-1 | 10^4 | 4 or 5 | 3 or 2 | 2^3
Many of the various new 16 or 8 bit float types are just truncated versions of larger floats, so the only
circuit change is to throw away some fraction bits.
(Image adapted from the NVIDIA fp8 primer.)
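For example, bfloat16 (a 16-bit type popular for machine learning) is literally the top half of a 32-bit float: the same sign and 8 exponent bits, with only 7 fraction bits surviving. Here's a sketch of the conversion, with round-to-nearest omitted for simplicity (float_to_bfloat16 is an illustrative name):

#include <cstring>
unsigned short float_to_bfloat16(float f) {
	unsigned int bits;
	std::memcpy(&bits,&f,sizeof(bits)); /* grab the float's 32 bits */
	return (unsigned short)(bits>>16); /* keep sign, exponent, and top 7 fraction bits */
}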
In the 2020s AI era there's also some interest in formats that provide more precision with fewer bits. A "posit" adds a "regime" field to the usual float exponent field, which can then be shorter, leaving more precision for reasonable-sized numbers. This gives a number format that smoothly scales down to even 8-bit floats (posit8_t) while maintaining a consistent format, though posits are pure research for now.
Roundoff in Arithmetic
They're funny old things, floats.
The fraction part (mantissa) only stores so much precision;
further bits are lost. For example, in reality,
1.234 * 10^4 + 7.654 * 10^0 = 1.2347654 * 10^4
But if we only keep three decimal
places,
1.234 * 10^4 + 7.654 * 10^0 = 1.234 * 10^4
which is to say, adding a tiny value to a great big value might not change the great big value at all, because the tiny value gets lost when rounding off to 3 places. To avoid this "roundoff error" when doing arithmetic by hand, people recommend keeping lots of digits and only rounding once, at the end. But for a given value of "lots of digits", did you keep enough?
For example, on a real computer adding
one to a float repeatedly will eventually stop changing the float!
float f=0.73;
while (1) {
volatile float g=f+1;
if (g==f) {
std::cout<<"f+1 == f at f="<< f <<", or 2^"<< log(f)/log(2.0) <<std::endl;
return 0;
}
else f=g;
}
(Try this in NetRun now!)
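On a typical machine this loop stops at f = 2^24 = 16777216: at that size adjacent floats are 2 apart, so f+1 rounds right back down to f.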
For "double", you can add one more
times, but eventually the double will stop changing despite your
additions. Recall that for integers, adding one repeatedly
will *never* give you the same value--eventually the integer will
wrap around, but it won't just stop moving like floats!
This has really weird effects. For
example, floating-point arithmetic isn't "associative"--if you
change the order of
operations, you change the result due to accumulated
roundoff. In exact arithmetic:
1.2355308 * 10^4 = 1.234 * 10^4 + (7.654 * 10^0 + 7.654 * 10^0)
1.2355308 * 10^4 = (1.234 * 10^4 + 7.654 * 10^0) + 7.654 * 10^0
In other words, parentheses don't matter if you're computing the exact result. But to three decimal places,
1.235 * 10^4 = 1.234 * 10^4 + (7.654 * 10^0 + 7.654 * 10^0)
1.234 * 10^4 = (1.234 * 10^4 + 7.654 * 10^0) + 7.654 * 10^0
In the first line, the small values get
added together, and together they're enough to move the big
value. But separately, they splat like bugs against the
windshield of the big value, and don't affect it at all!
double lil=1.0;
double big=pow(2.0,53); //<- carefully chosen for IEEE 64-bit float (52 bits of fraction + implicit 1)
std::cout<<" big+(lil+lil) -big = "<< big+(lil+lil) -big <<std::endl;
std::cout<<"(big+lil)+lil -big = "<< (big+lil)+lil -big <<std::endl;
(Try this in NetRun now!)
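The first line prints 2--the two lils combine into something big enough to register against big--while the second line prints 0, because each lil alone rounds away.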
float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
for (int i=0;i<1000;i++)
windshield += gnats;
if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";
(executable NetRun link)
In fact, if you've got a bunch of small
values to add to a big value, it's more roundoff-friendly to add
all the small values together first, then add them all to the big
value:
float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
volatile float gnatcup=0.0;
for (int i=0;i<1000;i++)
gnatcup += gnats;
windshield+=gnatcup; /* add all gnats to the windshield at once */
if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";
(executable NetRun link)
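Taking this idea further, "Kahan summation" carries the rounded-off low bits along in a separate compensation variable, so even a plain float accumulator loses almost nothing. This is a sketch, not part of the original lecture (and note an aggressive optimizer with -ffast-math can "simplify" the compensation away):

float kahan_sum(const float *vals,int n) {
	float sum=0.0f, c=0.0f; /* c holds the low-order bits lost so far */
	for (int i=0;i<n;i++) {
		float y=vals[i]-c; /* re-inject the previously lost bits */
		float t=sum+y; /* big+small: low bits of y fall off here */
		c=(t-sum)-y; /* algebraically zero; in floats, exactly what fell off */
		sum=t;
	}
	return sum;
}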
Roundoff can be very annoying. But
it's not the end of the world if you don't care about exact
answers, like in computer games, and even in many simulations
(where "exact" is unmeasureable anyway). You just need to be
able to estimate the amount of roundoff, and make sure it's not
too much.
However, the amount of roundoff depends
on the precision you keep in your numbers. This, in turn,
depends on the size of the numbers. For example, a "float"
is just 4 bytes, so it's not very precise. A "double" is 8
bytes, and so more precise. A "long double" is 12 bytes (or
more!), using more memory, but it's got tons of precision.
There's often a serious tradeoff between precision and space (and
time), so just using long double for everything isn't a good idea:
your program may get bigger and slower, and you still might not
have enough precision.
Roundoff in Representation
Sadly, 0.1 decimal is an infinitely repeating pattern in binary: 0.0(0011), with the 0011 repeating forever. This means any *finite* binary pattern used to approximate 0.1 differs slightly from the true value, so multiplying by it is only an approximation of really dividing by the integer 10.0. The difference is proportional to both the precision of the numbers and the size of the input data:
for (int i=1;i<1000000000;i*=10) {
double mul01=i*0.1;
double div10=i/10.0;
double diff=mul01-div10;
std::cout<<"i="<<i<<" diff="<<diff<<"\n";
}
(executable NetRun link)
In a perfect world, multiplying by 0.1
and dividing by 10 would give the exact same result. But in
reality, 0.1 has to be approximated by a finite series of binary
digits, while the integer 10 can be stored exactly, so on NetRun's
Pentium4 CPU, this code gives:
i=1 diff=5.54976e-18
i=10 diff=5.55112e-17
i=100 diff=5.55112e-16
i=1000 diff=5.55112e-15
i=10000 diff=5.55112e-14
i=100000 diff=5.55112e-13
i=1000000 diff=5.54934e-12
i=10000000 diff=5.5536e-11
i=100000000 diff=5.54792e-10
Program complete. Return 0 (0x0)
That is, there's an absolute difference of about 5.5*10^-18 between double-precision 0.1 and the true 1/10, and the error grows in proportion to the number being multiplied! This can add up over time.