Ordinary integers can only represent integral values. "Floating-point numbers" can represent non-integral values. This is useful for engineering, science, statistics, graphics, and any time you need to represent numbers from the real world, which are rarely integral!

Floats store numbers in an odd way--they're really storing the number in scientific notation, like

x = + 3.785746 * 10^{exponent}

Note that:

- You only need one bit to represent the sign--plus or minus.
- The exponent's just an integer, so you can store it as an integer.
- The 3.785746 part, called the "mantissa", can be stored as the integer 3785746 (at least as long as you can figure out where the decimal point goes!)

In binary, you can represent a non-integer like "two and three-eighths" as "10.011". That is, there's:

- a 1 in the 2's place (2 = 2^{1})
- a 0 in the 1's place (1 = 2^{0})
- a 0 in the 1/2's place (1/2 = 2^{-1}), just beyond the "binary point"
- a 1 in the 1/4's place (1/4 = 2^{-2}), and
- a 1 in the 1/8's place (1/8 = 2^{-3})


It's common to "normalize" a number in scientific notation so that:

- There's exactly one digit to the left of the decimal point.
- And that digit ain't zero.

In binary, a "normalized" number *always* has a 1 at the left of the decimal point (if it ain't zero, it's gotta be one). So sometimes there's no reason to even store the 1; you just know it's there!

(Note that there are also "denormalized" numbers, like 0.0, that don't have a leading 1. This is how zero is represented--there's an implicit leading 1 only if the exponent field is nonzero, an implicit leading 0 if the exponent field is zero...)

A "float" (as defined by IEEE Standard 754) consists of three bitfields:

| Sign | Exponent | Fraction (or "Mantissa") |
|------|----------|--------------------------|
| 1 bit: 0 for positive, 1 for negative | 8 unsigned bits: 127 means 2^{0}, 137 means 2^{10} | 23 bits: a binary fraction. Don't forget the implicit leading 1! |

The hardware interprets a float as having the value:

value = (-1)^{sign} * 1.fraction * 2^{exponent - 127}

Note that the mantissa has an implicit leading binary 1 applied (unless the exponent field is zero, when it's an implicit leading 0; a "denormalized" number).

For example, the value "8" would be stored with sign bit 0, exponent 130 (==3+127), and mantissa 000... (without the leading 1), since:

8 = (-1)^{0} * 1.000...0 * 2^{130 - 127} = 1 * 2^{3}

You can actually dissect the parts of a float using a "union" and a bitfield like so (Executable NetRun link):

#include <iostream>

/* IEEE floating-point number's bits: sign, exponent, mantissa */
struct float_bits {
    unsigned int fraction:23; /**< Value is binary 1.fraction ("mantissa") */
    unsigned int exp:8; /**< Value is 2^(exp-127) */
    unsigned int sign:1; /**< 0 for positive, 1 for negative */
};

/* A union is a struct where all the fields *overlap* each other */
union float_dissector {
    float f;
    float_bits b;
};

int foo(void) {
    float_dissector s;
    s.f=8.0;
    std::cout<<s.f<<"= sign "<<s.b.sign<<" exp "<<s.b.exp<<" fract "<<s.b.fraction<<"\n";
    return 0;
}

There are several different sizes of floating-point types:

| C Datatype | Size | Approx. Precision | Approx. Range | Exponent Bits | Fraction Bits | +-1 range |
|------------|------|-------------------|---------------|---------------|---------------|-----------|
| float | 4 bytes (everywhere) | 1.0x10^{-7} | 10^{38} | 8 | 23 | 2^{24} |
| double | 8 bytes (everywhere) | 2.0x10^{-15} | 10^{308} | 11 | 52 | 2^{53} |
| long double | 12-16 bytes (if it even exists) | 2.0x10^{-20} | 10^{4932} | 15 | 64 | 2^{65} |

Nowadays floats have roughly the same performance as integers: addition takes about two nanoseconds (slightly slower than integer addition); multiplication takes a few nanoseconds; and division takes a dozen or more nanoseconds. That is, floats are now cheap, and you can consider using floats for all sorts of stuff--even when you don't care about fractions.

One very cool thing about C++ is that because everything we do with the "my_float" class is "inline", the compiler is smart enough to "see through" our my_float class to the double underneath. This means our "my_float" class actually costs nothing at runtime--it's just as fast to use our own wrapper around a "double" as it is to use a plain "double":

#include <iostream>

/* A floating-point number, written inside a class (for no real reason) */
class my_float {
public:
    double v; /* value I represent */

    /* Create a "my_float" from an actual hardware double. */
    my_float(double value) :v(value) {}
};

/** Output operator, for easy cout-style printing */
std::ostream &operator<<(std::ostream &o,const my_float &f) {
    o<<f.v;
    return o;
}

/** Like "-a". Make this my_float have the opposite sign */
inline my_float operator-(const my_float &a)
{
    return my_float(-a.v);
}

/** Like "a+b". Add these two my_floats */
inline my_float operator+(const my_float &a,const my_float &b) {
    return my_float(a.v+b.v);
}

/** Like "a-b". Subtract these two my_floats */
inline my_float operator-(const my_float &a,const my_float &b) {
    return my_float(a.v-b.v);
}

my_float ma(1.0), mb(1.0);
int my_fadd(void) {for (int i=0;i<1000;i++) ma=ma+mb; return 0;}

double fa(1.0), fb(1.0);
int hw_fadd(void) {for (int i=0;i<1000;i++) fa=fa+fb; return 0;}

int foo(void) {
    print_time("my_float",my_fadd); /* print_time: NetRun's timing helper */
    print_time("hw_float",hw_fadd);
    my_float a(1.0);
    my_float b(0.25);
    std::cout<<" a="<<a<<" b="<<b<<" a-b="<<(a-b)<<"\n";
    return 0;
}

my_float: 2216.08 ns/call
hw_float: 2211.89 ns/call
a=1 b=0.25 a-b=0.75
Program complete. Return 0 (0x0)

#include <iostream>
#include <cmath>

/* A floating-point number, written in software */
class my_float {
public:
    int sign; /* 0 for +, 1 for - */
    int exponent; /* scaling on float is 2^exponent */
    int mantissa; /* value of float */

    /* Create a "my_float" from sign, exponent, and mantissa fields */
    my_float(int sign_,int exponent_,int mantissa_)
        :sign(sign_), exponent(exponent_), mantissa(mantissa_) {}
};

Here's how we do output. I'm outputting the mantissa in hex, the exponent in signed decimal (just like printf's "%a" format!), and then I'm also computing the floating-point value we represent:

/** Output operator, for easy cout-style printing */
std::ostream &operator<<(std::ostream &o,const my_float &f) {
    o<<(f.sign?"-":"+")<<
        "0x"<<std::hex<<f.mantissa<<
        "p"<<std::dec<<f.exponent<<
        " ("<<(f.sign?-1.0:+1.0)*f.mantissa*pow(2,f.exponent)<<") ";
    return o;
}

OK. Let's start with something easy. How do we implement "-x"? Well, let's just flip the sign bit:

/** Like "-a". Make this my_float have the opposite sign */
inline my_float operator-(const my_float &a)
{
    return my_float(!a.sign,a.exponent,a.mantissa);
}

Let's try this out. We'll start with the number +1 times two to the zero power, and negate it:

int foo(void) {
    my_float a(0,0,1);
    std::cout<<" a="<<a<<" -a="<<(-a)<<"\n";
    return 0;
}

This prints out:

a=+0x1p0 (1) -a=-0x1p0 (-1)
Program complete. Return 0 (0x0)

OK! Looks like we've got "negate" down!

(executable NetRun link)

/** Like "a+b". Add these two my_floats */
inline my_float operator+(const my_float &a,const my_float &b) {
    int s=a.sign; /* sign of return value (FIXME: what if a.sign!=b.sign?)*/
    int e=a.exponent; /* exponent (FIXME: what if a.exponent!=b.exponent?) */
    int m=a.mantissa + b.mantissa; /* mantissa (FIXME: what about a carry?) */
    return my_float(s,e,m);
}

int foo(void) {
    my_float a(0,0,1), b(0,0,1);
    std::cout<<" a="<<a<<" b="<<b<<" a+b="<<(a+b)<<"\n";
    return 0;
}

Wow, this actually works! And better yet, it's even faster than real hardware floating-point addition (because hardware integer addition, in "a.mantissa+b.mantissa", is a bit faster than hardware floating-point addition):

my_float: 639.15 ns/call
hw_float: 2171.61 ns/call
a=+0x1p0 (1) b=+0x1p0 (1) a+b=+0x2p0 (2)
Program complete. Return 0 (0x0)

But this simple version has several problems, flagged by the FIXMEs:

- If we're adding numbers of opposite sign, we really should be subtracting.
- If we're adding numbers with different exponents, we should shift the mantissas first.
- If there's a carry out of the mantissa sum, we should re-normalize the mantissa. (If you don't re-normalize, you'll eventually have overflow/wraparound problems, since you're actually only using integer arithmetic!)

Let's fix normalization first, by making the constructor normalize incoming values (executable NetRun link):

class my_float { ...
    enum {mantissa_min=1u<<16}; /* <- minimum value to store in mantissa */
    enum {mantissa_max=1u<<17}; /* <- maximum value to store in mantissa */

    /* Create a "my_float" from an integer value (FIXME: value==0 would loop forever) */
    my_float(int value) {
        if (value<0) {sign=1; value=-value;} else {sign=0;}
        exponent=0; /* find exponent needed to "normalize" value. */
        while (value<mantissa_min) {value*=2;exponent--;}
        while (value>=mantissa_max) {value=value>>1;exponent++;}
        mantissa=value; /*<- value has now been scaled properly */
    }
};

Now we normalize a value like "1" as 0x10000 times 2^{-16}.

OK, so our mantissas are now normalized before we add them. Let's make sure they're still normalized after we add them:

/** Like "a+b". Add these two my_floats */
inline my_float operator+(const my_float &a,const my_float &b) {
    int s=a.sign; /* sign of return value (FIXME: what if a.sign!=b.sign?)*/
    int e=a.exponent; /* exponent (FIXME: what if a.exponent!=b.exponent?) */
    int m=a.mantissa + b.mantissa; /* mantissa */
    while (m>=my_float::mantissa_max) {m=m>>1;e++;} /* handle mantissa carry */
    return my_float(s,e,m);
}

This prints:

a=+0x10000p-16 (1) b=+0x10000p-16 (1) a+b=+0x10000p-15 (2)

So our mantissas are normalized coming out!

We've got to use the exponent fields to make the mantissas line up. The usual way to do this is figure out which exponent is bigger, then shift both incoming mantissas to use that exponent:

/** Like "a+b". Add these two my_floats */
inline my_float operator+(const my_float &a,const my_float &b) {
    int s=a.sign; /* sign of return value (FIXME: what if a.sign!=b.sign?)*/
    int e=std::max(a.exponent,b.exponent); /* exponent of return value */
    int am=a.mantissa>>(e-a.exponent); /* shifted mantissas (lined-up on e) */
    int bm=b.mantissa>>(e-b.exponent);
    int m=am+bm; /* outgoing mantissa */
    while (m>=my_float::mantissa_max) {m=m>>1;e++;} /* handle mantissa carry */
    return my_float(s,e,m);
}

OK! Now 1 + 2 == 3. I claim this software floating-point code actually works pretty well, although if you look at the timings you'll notice that our software version is now about 5x slower than hardware floating-point!

I'm going to leave the opposite-sign/subtract case as a homework problem...