Floating-Point Numbers

CS 301: Assembly Language Programming Lecture, Dr. Lawlor


The fact is, variables on a computer only have so many bits.  If the value gets bigger than can fit in those bits, the extra bits first go negative and then "overflow".  By default they're then ignored completely.

For example:

int value=1; /* value to test, starts at first (lowest) bit */
for (int bit=0;bit<100;bit++) {
	std::cout<<"at bit "<<bit<<" the value is "<<value<<"\n";
	value=value+value; /* moves over by one bit */
	if (value==0) break;
return 0;

(Try this in NetRun now!)

Because "int" currently has 32 bits, if you start at one, and add a variable to itself 32 times, the one overflows and is lost completely. 

In assembly, there's a handy instruction "jo" (jump if overflow) to check for overflow from the previous instruction.  The C++ compiler doesn't bother to use jo, though!

mov edi,1 ; loop variable
mov eax,0 ; counter

	add eax,1 ; increment bit counter

	add edi,edi ; add variable to itself
	jo noes ; check for overflow in the above add

	cmp edi,0
	jne start


noes: ; called for overflow
	mov eax,999

(Try this in NetRun now!)

Notice the above program returns 999 on overflow, which somebody else will need to check for.  (Responding correctly to overflow is actually quite difficult--see, e.g., Ariane 5 explosion, caused by poor handling of a detected overflow.  Ironically, ignoring the overflow would have caused no problems!)

Signed versus Unsigned Numbers

If you watch closely right before overflow, you see something funny happen:

signed char value=1; /* value to test, starts at first (lowest) bit */
for (int bit=0;bit<100;bit++) {
	std::cout<<"at bit "<<bit<<" the value is "<<(long)value<<"\n";
	value=value+value; /* moves over by one bit (value=value<<1 would work too) */
	if (value==0) break;
return 0;

(Try this in NetRun now!)

This prints out:

at bit 0 the value is 1
at bit 1 the value is 2
at bit 2 the value is 4
at bit 3 the value is 8
at bit 4 the value is 16
at bit 5 the value is 32
at bit 6 the value is 64
at bit 7 the value is -128 
Program complete.  Return 0 (0x0)

Wait, the last bit's value is -128?  Yes, it really is!

This negative high bit is called the "sign bit", and it has a negative value in two's complement signed numbers.  This means to represent -1, for example, you set not only the high bit, but all the other bits as well: in unsigned, this is the largest possible value.  The reason binary 11111111 represents -1 is the same reason you might choose 9999 to represent -1 on a 4-digit odometer: if you add one, you wrap around and hit zero.

A very cool thing about two's complement is addition is the same operation whether the numbers are signed or unsigned--we just interpret the result differently.  Subtraction is also identical for signed and unsigned.  Register names are identical in assembly for signed and unsigned.  However, when you change register sizes using an instruction like "movsxd rax,eax", when you check for overflow, when you compare numbers, multiply or divide, or shift bits, you need to know if the number is signed (has a sign bit) or unsigned (no sign bit, no negative numbers).

Signed Unsigned Language
int unsigned int C++, int is signed by default.
signed char unsigned char C++, char may be signed or unsigned.
movsxd movzxd Assembly, sign extend or zero extend to change register sizes.
jo jc Assembly, overflow is calculated for signed values, carry for unsigned values.
jg ja Assembly, jump greater is signed, jump above is unsigned.
jl jb Assembly, jump less signed, jump below unsigned.
imul mul Assembly, imul is signed (and more modern), mul is for unsigned (and ancient and horrible!). idiv/div work similarly.

Floats: Normalized Numbers

In C++, "float" and "double" store numbers in an odd way--internally they're really storing the number in scientific notation, like
    x = + 3.785746 * 105
Note that:
Scientific notation is designed to be compatible with slide rules (here's a circular slide rule demo); slide rules are basically a log table starting at 1.  This works because log(1) = 0, and log(a) + log(b) = log(ab).  But slide rules only give you the mantissa; you need to figure out the exponent yourself.  The "order of magnitude" guess that engineers (and I) like so much is just a calculation using zero significant digits--no mantissa, all exponent.

 One problem is scientific notation can represent the same number in several different ways:
    x = + 3.785746 * 105  = + 0.3785746 * 106 = + 0.03785746 * 107 = + 37.85746 * 104  

It's common to "normalize" a number in scientific notation so that:
  1. There's exactly one digit to the left of the decimal point.
  2. And that digit is not zero.
This means the 105 version above is the "normal" way to write the number above.

In binary, a "normalized" number *always* has a 1 at the left of the decimal point (if it ain't zero, it's gotta be one).  So there's no reason to even store the 1; you just know it's there!

Bits in a Floating-Point Number

Floats represent continuous values.  But they do it using discrete bits.

A "float" (as defined by IEEE Standard 754) consists of three bitfields:
Fraction (or "Mantissa")
1 bit-- 
  0 for positive
  1 for negative
8 unsigned bits--
  127 means 20
  137 means 210
23 bits-- a binary fraction. 

Don't forget the implicit leading 1!
The sign is in the highest-order bit, the exponent in the next 8 bits, and the fraction in the remaining bits.

The hardware interprets a float as having the value:

    value = (-1) sign * 2 (exponent-127) * 1.fraction

Note that the mantissa has an implicit leading binary 1 applied.  The 1 isn't stored, which actually causes some headaches.  (Even worse, if the exponent field is zero, then it's an implicit leading 0; a "denormalized" number as we'll talk about on Wednesday.)

For example, the value "8" would be stored with sign bit 0, exponent 130 (==3+127), and mantissa 000... (without the leading 1), since:

    8 = (-1) 0 * 2 (130-127) * 1.0000....

You can stare at the bits inside a float by converting it to an integer.  The quick and dirty way to do this is via a pointer typecast, but modern compilers will sometimes over-optimize this, especially in inlined code:
void print_bits(float f) {
int i=*reinterpret_cast<int *>(&f); /* read bits with "pointer shuffle" */
std::cout<<" float "<<std::setw(10)<<f<<" = ";
for (int bit=31;bit>=0;bit--) {
if (i&(1<<bit)) std::cout<<"1"; else std::cout<<"0";
if (bit==31) std::cout<<" ";
if (bit==23) std::cout<<" (implicit 1).";

int foo(void) {
return sizeof(float);

(Try this in NetRun now!)

The official way to dissect the parts of a float is using a "union" and a bitfield like so:
/* IEEE floating-point number's bits:  sign  exponent   mantissa */
struct float_bits {
unsigned int fraction:23; /**< Value is binary 1.fraction ("mantissa") */
unsigned int exp:8; /**< Value is 2^(exp-127) */
unsigned int sign:1; /**< 0 for positive, 1 for negative */

/* A union is a struct where all the fields *overlap* each other */
union float_dissector {
float f;
float_bits b;

float_dissector s;
std::cout<<s.f<<"= sign "<<s.b.sign<<" exp "<<s.b.exp<<" fract "<<s.b.fraction<<"\n";
return 0;
(Executable NetRun link)

I like to joke that a union used to convert bits between incompatible types is an "unholy union".

In addition to the 32-bit "float", there are several other different sizes of floating-point types:
C Datatype
Approx. Precision
Approx. Range
Exponent Bits
Fraction Bits
+-1 range
4 bytes (everywhere)
8 bytes (everywhere)
long double
12-16 bytes (if it even exists)
half float
2 bytes (only on GPUs)
1.0x10-3 105 5

Nowadays floats have roughly the same performance as integers: addition, subtraction, or multiplication all take about a nanosecond.  That is, floats are now cheap, and you can consider using floats for all sorts of stuff--even when you don't care about fractions!  The advantages of using floats are:
Due to these advantages, many interpreted languages including JavaScript have only one numeric type, usually double-precision float.

Roundoff in Arithmetic

They're funny old things, floats.  The fraction part (mantissa) only stores so much precision; further bits are lost.  For example, in reality,
    1.234* 104 + 7.654* 100 = 1.2347654 * 104
But if we only keep three decimal places,
    1.234* 104 + 7.654* 100 = 1.234 * 104
which is to say, adding a tiny value to a great big value might not change the great big value at all, because the tiny value gets lost when rounding off to 3 places.   To avoid this "roundoff error", when you're doing arithmetic by hand, people recommend keeping lots of digits, and only rounding once, at the end.  But for a given value of "lots of digits", did you keep enough?

For example, on a real computer adding one to a float repeatedly will eventually stop changing the float!
float f=0.73;
while (1) {
volatile float g=f+1;
if (g==f) {
std::cout<<"f+1 == f at f="<< f <<", or 2^"<< log(f)/log(2.0) <<std::endl;
return 0;
else f=g;

(Try this in NetRun now!)

For "double", you can add one more times, but eventually the double will stop changing despite your additions.  Recall that for integers, adding one repeatedly will *never* give you the same value--eventually the integer will wrap around, but it won't just stop moving like floats!

This has really weird effects.  For example, floating-point arithmetic isn't "associative"--if you change the order of operations, you change the result due to accumulated roundoff.  In exact arithmetic:
    1.2355308 * 104 = 1.234* 104 + (7.654* 100 + 7.654* 100)
    1.2355308 * 104 = (1.234* 104 + 7.654* 100) + 7.654* 100
In other words, parenthesis don't matter if you're computing the exact result.  But to three decimal places, 
    1.235 * 104 = 1.234* 104 + (7.654* 100 + 7.654* 100)
    1.234 * 104 = (1.234* 104 + 7.654* 100) + 7.654* 100
In the first line, the small values get added together, and together they're enough to move the big value.  But separately, they splat like bugs against the windshield of the big value, and don't affect it at all!
double lil=1.0;
double big=pow(2.0,53); //<- carefully chosen for IEEE 64-bit float (52 bits of fraction + implicit 1)
std::cout<<" big+(lil+lil) -big = "<< big+(lil+lil) -big <<std::endl;
std::cout<<"(big+lil)+lil -big = "<< (big+lil)+lil -big <<std::endl;

(Try this in NetRun now!)

float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
for (int i=0;i<1000;i++)
windshield += gnats;

if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";
(executable NetRun link)

In fact, if you've got a bunch of small values to add to a big value, it's more roundoff-friendly to add all the small values together first, then add them all to the big value:
float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
volatile float gnatcup=0.0;
for (int i=0;i<1000;i++)
gnatcup += gnats;
windshield+=gnatcup; /* add all gnats to the windshield at once */

if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";
(executable NetRun link)

Roundoff can be very annoying.  But it's not the end of the world if you don't care about exact answers, like in computer games, and even in many simulations (where "exact" is unmeasureable anyway).  You just need to be able to estimate the amount of roundoff, and make sure it's not too much.

However, the amount of roundoff depends on the precision you keep in your numbers.  This, in turn, depends on the size of the numbers.  For example, a "float" is just 4 bytes, so it's not very precise.  A "double" is 8 bytes, and so more precise.  A "long double" is 12 bytes (or more!), using more memory, but it's got tons of precision.  There's often a serious tradeoff between precision and space (and time), so just using long double for everything isn't a good idea: your program may get bigger and slower, and you still might not have enough precision.

Roundoff in Representation

Sadly, 0.1 decimal is an infinitely repeating pattern in binary: 0.0(0011), with 0011 repeating.  This means multiplying by some *finite* pattern to approximate 0.1 is only an approximation of really dividing by the integer 10.0.  The exact difference is proportional to the precision of the numbers and the size of the input data:
for (int i=1;i<1000000000;i*=10) {
double mul01=i*0.1;
double div10=i/10.0;
double diff=mul01-div10;
std::cout<<"i="<<i<<" diff="<<diff<<"\n";
(executable NetRun link)

In a perfect world, multiplying by 0.1 and dividing by 10 would give the exact same result.  But in reality, 0.1 has to be approximated by a finite series of binary digits, while the integer 10 can be stored exactly, so on NetRun's Pentium4 CPU, this code gives:
i=1  diff=5.54976e-18
i=10 diff=5.55112e-17
i=100 diff=5.55112e-16
i=1000 diff=5.55112e-15
i=10000 diff=5.55112e-14
i=100000 diff=5.55112e-13
i=1000000 diff=5.54934e-12
i=10000000 diff=5.5536e-11
i=100000000 diff=5.54792e-10
Program complete. Return 0 (0x0)
That is, there's a factor of 10^-18 difference between double-precision 0.1 and the true 1/10!  This can add up over time.