# Bits used to Implement Floating-Point Numbers

CS 301 Lecture, Dr. Lawlor

Floats represent continuous values.  But they do it using discrete bits.

A "float" (as defined by IEEE Standard 754) consists of three bitfields:
| Sign | Exponent | Fraction (or "Mantissa") |
|------|----------|--------------------------|
| 1 bit: 0 for positive, 1 for negative | 8 unsigned bits: 127 means 2^0, 137 means 2^10 | 23 bits: a binary fraction. Don't forget the implicit leading 1! |
The sign is in the highest-order bit, the exponent in the next 8 bits, and the fraction in the remaining bits.

The hardware interprets a float as having the value:

value = (-1)^sign * 2^(exponent-127) * 1.fraction

Note that the mantissa has an implicit leading binary 1 applied, unless the exponent field is zero, in which case the implicit leading digit is a 0 and the value is a "denormalized" number.

For example, the value "8" would be stored with sign bit 0, exponent 130 (==3+127), and mantissa 000... (without the leading 1), since:

8 = (-1)^0 * 2^(130-127) * 1.0000....

You can actually dissect the parts of a float using a "union" and a bitfield like so:
```cpp
/* IEEE floating-point number's bits:  sign  exponent  mantissa */
#include <iostream>

struct float_bits {
	unsigned int fraction:23; /**< Value is binary 1.fraction ("mantissa") */
	unsigned int exp:8;       /**< Value is 2^(exp-127) */
	unsigned int sign:1;      /**< 0 for positive, 1 for negative */
};

/* A union is a struct where all the fields *overlap* each other */
union float_dissector {
	float f;
	float_bits b;
};

int main() {
	float_dissector s;
	s.f = 8.0;
	std::cout << s.f << "= sign " << s.b.sign
	          << " exp " << s.b.exp
	          << "  fract " << s.b.fraction << "\n";
	return 0;
}
```

In addition to the 32-bit "float", there are several different sizes of floating-point types:
| C Datatype | Size | Approx. Precision | Approx. Range | Exponent Bits | Fraction Bits | +-1 range |
|------------|------|-------------------|---------------|---------------|---------------|-----------|
| float | 4 bytes (everywhere) | 1.0x10^-7 | 10^38 | 8 | 23 | 2^24 |
| double | 8 bytes (everywhere) | 2.0x10^-15 | 10^308 | 11 | 52 | 2^53 |
| long double | 12-16 bytes (if it even exists) | 2.0x10^-20 | 10^4932 | 15 | 64 | 2^65 |

Nowadays floats have roughly the same performance as integers: addition takes about two nanoseconds (slightly slower than integer addition); multiplication takes a few nanoseconds; and division takes a dozen or more nanoseconds.  That is, floats are now cheap, and you can consider using floats for all sorts of stuff--even when you don't care about fractions!  The advantages of using floats are:
• Floats can store fractional numbers.
• Floats never overflow: a result too big to represent saturates to the special value "infinity" instead of silently wrapping around like an integer.
• "double" has more precision than "int".

## x86 Floating-Point Assembly Language

On many CPUs, floating-point values are stored in special "floating-point registers", and are added, subtracted, etc. with special "floating-point instructions", but other than the name, these registers and instructions are exactly analogous to regular integer registers and instructions.  For example, the integer PowerPC assembly code to add registers 1 and 2 into register 3 is "add r3,r1,r2"; the floating-point code to add floating-point registers 1 and 2 into floating-point register 3 is "fadd fr3,fr1,fr2".

x86 is not like that.

The problem is that the x86 instruction set wasn't designed with floating-point in mind; floating-point instructions were added to the CPU later (with the 8087, a separate coprocessor chip that handled all floating-point instructions).  Unfortunately, there weren't many unused opcode bytes left, and (this being the 1980s, when bytes were expensive) the designers really didn't want to make the instructions longer.  So instead of the usual instructions like "add register A to register B", x86 floating-point has just "add", which saves the bits that would be needed to specify the source and destination registers!

But the question is, what the heck are you adding?  The answer is the "top two values on the floating-point register stack".  That's not "the stack" (the memory area used by function calls), it's a separate set of values totally internal to the CPU's floating-point hardware.  There are various load functions that push values onto the floating-point register stack, and most of the arithmetic functions read from the top of the floating-point register stack.  So to compute stuff, you load the values you want to manipulate onto the floating-point register stack, and then use some arithmetic instructions.

## x86 Floating-Point Assembly in Practice

Here's what this looks like.  The whole bottom chunk of code just prints the float on top of the x86 floating-point register stack, with the assembly equivalent of the C code: printf("Yo!  Here's our float: %f\n",f);
```asm
fldpi            ; Push "pi" onto floating-point stack
sub esp,8        ; Make room on the stack for an 8-byte double
fstp QWORD [esp] ; Push printf's double parameter onto the stack
push my_string   ; Push printf's string parameter (below)
extern printf
call printf      ; Print string
add esp,12       ; Clean up stack
ret              ; Done with function

my_string: db "Yo!  Here's our float: %f",0xa,0
```

(Try this in NetRun now!)

There are lots of useful floating-point instructions: