Floating-Point Arithmetic

CS 301 Lecture, Dr. Lawlor

Ordinary integers can only represent integral values.  "Floating-point numbers" can represent non-integral values.   This is useful for engineering, science, statistics, graphics, and any time you need to represent numbers from the real world, which are rarely integral!

Floats store numbers in an odd way--they're really storing the number in scientific notation, like
    x = + 3.785746 * 10^5
Note that scientific notation is designed to be compatible with slide rules (here's a circular slide rule demo); slide rules are basically a log table starting at 1.  This works because log(1) = 0, and log(a) + log(b) = log(ab).  But slide rules only give you the mantissa; you need to figure out the exponent yourself.  The "order of magnitude" guess that engineers (and I) like so much is just a calculation using zero significant digits--no mantissa, all exponent.

In binary, you can represent a non-integer like "two and three-eighths" as "10.011".  That is, there's a 1 in the twos place, a 0 in the ones place, then (past the binary point) a 0 in the halves place, a 1 in the quarters place, and a 1 in the eighths place, for a total of two plus 1/4 plus 1/8, or "2+3/8".  Note that this is a natural measurement in carpenter's fractional inches, but it's a weird unnatural thing in metric-style decimal inches.  That is, fractions that are (negative) powers of two have a nice binary representation, but look weird in decimal (2 3/8 = 2.375); decimal fractions like 0.1 have a nice decimal representation, but look very weird as a binary fraction.
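If you want to verify that sum, here's a tiny stand-alone sketch (my own example, not one of the lecture's NetRun programs) that adds up the binary places by hand:

#include <iostream>

int main(void) {
  /* "10.011" in binary: 1 in the twos place, 0 in the ones place,
     0 in the halves place, 1 in the quarters place, 1 in the eighths place */
  double x = 1*2.0 + 0*1.0 + 0*0.5 + 1*0.25 + 1*0.125;
  std::cout << x << "\n"; /* prints 2.375, which is 2+3/8 */
  return 0;
}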

Normalized Numbers

Scientific notation can represent the same number in several different ways:
    x = + 3.785746 * 10^5  = + 0.3785746 * 10^6 = + 0.003785746 * 10^8 = + 37.85746 * 10^4

It's common to "normalize" a number in scientific notation so that:
  1. There's exactly one digit to the left of the decimal point.
  2. And that digit ain't zero.
This means the 10^5 version is the "normal" way to write the number above.

In binary, a "normalized" number *always* has a 1 at the left of the decimal point (if it ain't zero, it's gotta be one).  So sometimes there's no reason to even store the 1; you just know it's there!

(Note that there are also "denormalized" numbers, like 0.0, that don't have a leading 1.  This is how zero is represented--there's an implicit leading 1 only if the exponent field is nonzero, an implicit leading 0 if the exponent field is zero...)
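For the 32-bit float layout described below, I'd summarize the two cases like this (the 2^-126 scale for denormals comes from the IEEE 754 standard, not from anything shown later in these notes):

    normalized (exponent field nonzero):  value = (-1)^sign * 2^(exponent-127) * 1.fraction
    denormalized (exponent field zero):   value = (-1)^sign * 2^(-126) * 0.fraction

So zero is just an all-zero exponent field together with an all-zero fraction.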

Float as Bits

Floats represent continuous values.  But they do it using discrete bits.

A "float" (as defined by IEEE Standard 754) consists of three bitfields:
    Sign:      1 bit -- 0 for positive, 1 for negative
    Exponent:  8 unsigned bits -- 127 means 2^0, 137 means 2^10
    Fraction (or "Mantissa"):  23 bits -- a binary fraction.  Don't forget the implicit leading 1!

The sign is in the highest-order bit, the exponent in the next 8 bits, and the fraction in the remaining bits.

The hardware interprets a float as having the value:

    value = (-1)^sign * 2^(exponent-127) * 1.fraction

Note that the mantissa has an implicit leading binary 1 applied (unless the exponent field is zero, when it's an implicit leading 0; a "denormalized" number).

For example, the value "8" would be stored with sign bit 0, exponent 130 (==3+127), and mantissa 000... (without the leading 1), since:

    8 = (-1)^0 * 2^(130-127) * 1.0000....
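As a sanity check (my own sketch, not one of the lecture's examples), you can copy a float's raw bits into an integer with memcpy and print them in hex; for 8.0 the sign bit 0, exponent 130 (0x82), and all-zero fraction pack together as 0x41000000:

#include <cstdio>
#include <cstring>

int main(void) {
  float f=8.0f;
  unsigned int bits;
  std::memcpy(&bits,&f,sizeof(bits)); /* copy the float's raw bit pattern */
  std::printf("%f = 0x%08x\n",f,bits); /* expect 0x41000000 */
  return 0;
}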

You can actually dissect the parts of a float using a "union" and a bitfield like so:
/* IEEE floating-point number's bits:  sign  exponent   mantissa */
struct float_bits {
  unsigned int fraction:23; /**< Value is binary 1.fraction ("mantissa") */
  unsigned int exp:8; /**< Value is 2^(exp-127) */
  unsigned int sign:1; /**< 0 for positive, 1 for negative */
};

/* A union is a struct where all the fields *overlap* each other */
union float_dissector {
  float f;
  float_bits b;
};

int foo(void) {
  float_dissector s;
  s.f=8.0;
  std::cout<<s.f<<"= sign "<<s.b.sign<<" exp "<<s.b.exp<<" fract "<<s.b.fraction<<"\n";
  return 0;
}
(Executable NetRun link)
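On a typical little-endian x86 machine (where the compiler allocates bitfields starting at the low-order bits), this should print something like:
 8= sign 0 exp 130 fract 0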

There are several different sizes of floating-point types:
C Datatype   Size                             Approx. Precision   Approx. Range   Exponent Bits   Fraction Bits   +-1 range
float        4 bytes (everywhere)             1.0x10^-7           10^38           8               23              2^24
double       8 bytes (everywhere)             2.0x10^-15          10^308          11              52              2^53
long double  12-16 bytes (if it even exists)  2.0x10^-20          10^4932         15              64              2^65

Nowadays floats have roughly the same performance as integers: addition takes about two nanoseconds (slightly slower than integer addition); multiplication takes a few nanoseconds; and division takes a dozen or more nanoseconds.  That is, floats are now cheap, and you can consider using floats for all sorts of stuff--even when you don't care about fractions.
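If you want to check the "Size" column on your own machine (the long double size in particular varies by compiler and platform), a quick sketch like this will do it:

#include <iostream>

int main(void) {
  std::cout<<"float:       "<<sizeof(float)<<" bytes\n";
  std::cout<<"double:      "<<sizeof(double)<<" bytes\n";
  std::cout<<"long double: "<<sizeof(long double)<<" bytes\n";
  return 0;
}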

Floating-Point in Software: The Syntax

It's informative to try writing your own floating-point code.  Here's a rather silly syntax example of what the C++ looks like when we write our own "my_float" class.  Internally, this class just uses an ordinary "double", so we haven't actually done any work yet:
/* A floating-point number, written inside a class (for no real reason) */
class my_float {
public:
  double v; /* value I represent */

  /* Create a "my_float" from an actual hardware float. */
  my_float(double value) :v(value) {}
};

/** Output operator, for easy cout-style printing */
std::ostream &operator<<(std::ostream &o,const my_float &f) {
  o<<f.v;
  return o;
}

/** Like "-a". Make this my_float have the opposite sign */
inline my_float operator-(const my_float &a)
{
  return my_float(-a.v);
}

/** Like "a+b". Add these two my_floats */
inline my_float operator+(const my_float &a,const my_float &b) {
  return my_float(a.v+b.v);
}

/** Like "a-b". Subtract these two my_floats */
inline my_float operator-(const my_float &a,const my_float &b) {
  return my_float(a.v-b.v);
}

my_float ma(1.0), mb(1.0);
int my_fadd(void) {for (int i=0;i<1000;i++) ma=ma+mb; return 0;}
double fa(1.0), fb(1.0);
int hw_fadd(void) {for (int i=0;i<1000;i++) fa=fa+fb; return 0;}

int foo(void) {
  print_time("my_float",my_fadd); /* print_time is a NetRun helper that times a function */
  print_time("hw_float",hw_fadd);
  my_float a(1.0);
  my_float b(0.25);
  std::cout<<" a="<<a<<" b="<<b<<" a-b="<<(a-b)<<"\n";
  return 0;
}

(executable NetRun link)

One very cool thing about C++ is that because everything we do with the "my_float" class is "inline", the compiler is smart enough to "see through" our my_float class to the double underneath.  This means our "my_float" class actually costs nothing at runtime--it's just as fast to use our own wrapper around a "double" as it is to use a plain "double":
my_float: 2216.08 ns/call
hw_float: 2211.89 ns/call
a=1 b=0.25 a-b=0.75
Program complete. Return 0 (0x0)

Software Floating-Point Basics

OK!  Now that we've seen the syntax, let's actually do some work instead of just using the hardware "double".  First we'll need to store the sign, exponent, and mantissa as members of our class; and we'll need a constructor:
/* A floating-point number, written in software */
class my_float {
public:
  int sign; /* 0 for +, 1 for - */
  int exponent; /* scaling on float is 2^exponent */
  int mantissa; /* value of float is mantissa * 2^exponent */

  /* Create a "my_float" from sign, exponent, and mantissa fields */
  my_float(int sign_,int exponent_,int mantissa_)
    :sign(sign_), exponent(exponent_), mantissa(mantissa_) {}
};
Here's how we do output.  I'm outputting the mantissa in hex, the exponent in signed decimal (just like printf's new "%a" format!), and then I'm also computing the floating-point value we represent:
/** Output operator, for easy cout-style printing */
std::ostream &operator<<(std::ostream &o,const my_float &f) {
  o<<(f.sign?"-":"+")<<
    "0x"<<std::hex<<f.mantissa<<
    "p"<<std::dec<<f.exponent<<
    " ("<<(f.sign?-1.0:+1.0)*f.mantissa*pow(2,f.exponent)<<") ";
  return o;
}
OK.  Let's start with something easy.  How do we implement "-x"?  Well, let's just flip the sign bit:
/** Like "-a".  Make this my_float have the opposite sign */
inline my_float operator-(const my_float &a)
{
return my_float(!a.sign,a.exponent,a.mantissa);
}
Let's try this out.  We'll start with the number +1 times two to the zero power, and negate it:
int foo(void) {
  my_float a(0,0,1);
  std::cout<<" a="<<a<<" -a="<<(-a)<<"\n";
  return 0;
}

(executable NetRun link)

This prints out:
 a=+0x1p0 (1)    -a=-0x1p0 (-1)  
Program complete. Return 0 (0x0)
OK!  Looks like we've got "negate" down!

Software Floating-Point Addition (The Fast & Stupid Way)

Now let's try floating-point addition.  Here's our first guess at how to do it--just add mantissas:
/** Like "a+b".  Add these two my_floats */
inline my_float operator+(const my_float &a,const my_float &b) {
int s=a.sign; /* sign of return value (FIXME: what if a.sign!=b.sign?)*/
int e=a.exponent; /* exponent (FIXME: what if a.exponent!=b.exponent?) */
int m=a.mantissa + b.mantissa; /* mantissa (FIXME: what about a carry?) */
return my_float(s,e,m);
}

int foo(void) {
my_float a(0,0,1), b(0,0,1);
std::cout<<" a="<<a<<" b="<<b<<" a+b="<<(a+b)<<"\n";
return 0;
}
(executable NetRun link)

Wow, this actually works!  And better yet, it's even faster than real hardware floating-point addition (because hardware integer addition, in "a.mantissa+b.mantissa", is a bit faster than hardware floating-point addition):
my_float: 639.15 ns/call
hw_float: 2171.61 ns/call
a=+0x1p0 (1) b=+0x1p0 (1) a+b=+0x2p0 (2)
Program complete. Return 0 (0x0)

Software Floating-Point Addition (The Slow & Correct Way)

So far, we've ignored three different things in our floating-point addition function:
  1. If we're adding numbers of opposite sign, we really should be subtracting.
  2. If we're adding numbers with different exponents, we should shift the mantissas first.
  3. If there's a carry out of the mantissa sum, we should re-normalize the mantissa.  (If you don't re-normalize, you'll eventually have overflow/wraparound problems, since you're actually only using integer arithmetic!)
We'll look at these one at a time, in opposite order.

Floating-Point Normalization

First, none of our floats are normalized yet, so let's normalize the incoming numbers:
class my_float { ...
  enum {mantissa_min=1u<<16}; /* <- minimum value to store in mantissa */
  enum {mantissa_max=1u<<17}; /* <- maximum value to store in mantissa */

  /* Create a "my_float" from an integer value */
  my_float(int value) {
    if (value<0) {sign=1; value=-value;} else {sign=0;}
    exponent=0; /* find exponent needed to "normalize" value. */
    while (value<mantissa_min) {value*=2;exponent--;}
    while (value>=mantissa_max) {value=value>>1;exponent++;}
    mantissa=value; /* <- value has now been scaled properly */
  }
... };
(executable NetRun link)
Now we normalize a value like "1" as "0x10000" times 2^-16 (which is... 1).
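As another check (my own example, not from the notes above): constructing my_float(6) would multiply 6 by 2 fourteen times to reach mantissa 0x18000 with exponent -14, and indeed 0x18000 * 2^-14 = 98304/16384 = 6.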

OK, so our mantissas are now normalized before we add them.  Let's make sure they're still normalized after we add them:
/** Like "a+b".  Add these two my_floats */
inline my_float operator+(const my_float &a,const my_float &b) {
int s=a.sign; /* sign of return value (FIXME: what if a.sign!=b.sign?)*/
int e=a.exponent; /* exponent (FIXME: what if a.exponent!=b.exponent?) */
int m=a.mantissa + b.mantissa; /* mantissa */
while (m>=my_float::mantissa_max) {m=m>>1;e++;} /* handle mantissa carry */
return my_float(s,e,m);
}

(executable NetRun link)

OK!  Now
 a=+0x10000p-16 (1)    b=+0x10000p-16 (1)     a+b=+0x10000p-15 (2) 
So our mantissas are normalized coming out!

Exponent Mismatch

Second, our addition function currently ignores b's exponent.  That's bad, because 1+1 isn't the same as 1+(1*2^-16)!

We've got to use the exponent fields to make the mantissas line up.  The usual way to do this is to figure out which exponent is bigger, then shift both incoming mantissas to use that exponent:
/** Like "a+b".  Add these two my_floats */
inline my_float operator+(const my_float &a,const my_float &b) {
int s=a.sign; /* sign of return value (FIXME: what if a.sign!=b.sign?)*/
int e=std::max(a.exponent,b.exponent); /* exponent of return value */
int am=a.mantissa>>(e-a.exponent); /* shifted mantissas (lined-up on e) */
int bm=b.mantissa>>(e-b.exponent);
int m=am+bm; /* outgoing mantissa */
while (m>=my_float::mantissa_max) {m=m>>1;e++;} /* handle mantissa carry */
return my_float(s,e,m);
}

(executable NetRun link)

OK!  Now 1 + 2 == 3.  I claim this software-floating-point code actually works pretty well, although if you look at the timings you'll notice that now our software version is about 5x slower than hardware floating-point!
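Tracing that 1 + 2 by hand (my own arithmetic, using the fields above): a=+0x10000p-16 and b=+0x10000p-15, so e=max(-16,-15)=-15; shifting gives am=0x10000>>1=0x8000 and bm=0x10000>>0=0x10000; their sum m=0x18000 is still below mantissa_max, so the result is +0x18000p-15, which is 3.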

I'm going to leave the opposite-sign/subtract case as a homework problem...