Circuit-Level Floating Point Implementation

To understand the circuit-level operations happening during floating point computations, you first need to understand the bit representation of floating point numbers (read it!). In short, a 32-bit float packs a sign bit, an 8-bit exponent biased by 127, and a 23-bit fraction with an implicit leading 1. If you're interested in why these values were chosen, there's a good floating point design rationale here.

We'll be using this code to show what's inside floats:

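#include <iostream>
#include <string>

// Bitfield order is compiler-dependent; this layout matches GCC on little-endian x86.
struct float_bits {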
	unsigned int frac:23; // fraction bits, except for implicit leading 1
	unsigned int exp:8; // exponent bits, biased by 127
	unsigned int sign:1; // sign bit: 0 +, 1 -
};

union float_dissector {
	float f; // the float
	float_bits b; // its bits
};

// Convert this integer into a string of binary 1's and 0's.
std::string dump_bits(long value,long bitcount)
{
	std::string ret="";
	for (int bit=bitcount-1;bit>=0;bit--)
		if ((1L<<bit)&value) 
			ret+="1";
		else	ret+="0";
	return ret;
}
// Show the contents of this float
void dump(float f) {
	float_dissector ds; ds.f=f;
	std::cout<<" float	"<<f<<
		"	sign "<<ds.b.sign<<
		"	exp "<<ds.b.exp-127<<
		"	frac (1)."<<dump_bits(ds.b.frac,23)<<
		std::endl;
	
}

void foo(void) {
	dump(1.0);
	dump(2.0);
	dump(0.5);
	dump(1.125);
	dump(1.25);
	dump(1.5);
	dump(1.625);
	dump(3.0);
	dump(0.0);
}

(Try this in NetRun now!)
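
To check that you're reading the three fields correctly, it can help to go the other way and rebuild the value from them as (-1)^sign * (1 + frac/2^23) * 2^(exp-127). The sketch below is not from the lecture: the names float_fields, split, and rebuild are made up for illustration, it assumes a 32-bit IEEE 754 float, and it uses shifts and masks on a memcpy'd integer rather than bitfields. It only handles normalized numbers, so zero, denormals, infinity, and NaN are rejected by an assert.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <iostream>

// Hypothetical helper type (not from the lecture): the three fields of a float.
struct float_fields {
	unsigned int sign; // 1 bit: 0 +, 1 -
	unsigned int exp;  // 8 bits, biased by 127
	unsigned int frac; // 23 bits, implicit leading 1 not stored
};

// Pull the fields out with shifts and masks instead of bitfields.
float_fields split(float f) {
	static_assert(sizeof(float)==4,"this sketch assumes a 32-bit float");
	std::uint32_t u; std::memcpy(&u,&f,sizeof u); // copy out the raw bits
	float_fields r;
	r.sign=u>>31;
	r.exp=(u>>23)&0xFF;
	r.frac=u&0x7FFFFF;
	return r;
}

// Rebuild a normalized float from its fields:
//   (-1)^sign * (1 + frac/2^23) * 2^(exp-127)
float rebuild(float_fields r) {
	assert(r.exp!=0 && r.exp!=255); // zero, denormals, inf, and NaN not handled
	float mant=std::ldexp((float)((1u<<23)+r.frac),-23); // 1.frac, in [1,2)
	float v=std::ldexp(mant,(int)r.exp-127); // scale by the unbiased exponent
	return r.sign?-v:v;
}

void foo(void) {
	float tests[]={1.0f,2.0f,0.5f,1.125f,1.625f,-3.0f};
	for (float f:tests) {
		float_fields r=split(f);
		std::cout<<f<<"	sign "<<r.sign<<"	exp "<<(int)r.exp-127
			<<"	frac "<<r.frac<<"	rebuilt "<<rebuild(r)<<"\n";
	}
}
```

For every input in the list, the rebuilt value should print the same as the original float.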

 

The trick to doing floating-point addition is preconditioning the inputs: it's easy enough to add two numbers with the same sign and exponent fields--just integer add their fraction fields (with the implicit leading 1 put back in), and bump the exponent if the sum carries out the top. If they don't have the same exponent, you can shift the smaller number's fraction right by the exponent difference to line it up with the bigger number. If they don't have the same sign, you're really doing subtraction, not addition.

// Add two floats, without touching the floating point hardware.
// Uses float_dissector and dump from the listing above.
// CHATTY is not defined in this excerpt; assumed here to be a debug-print
// wrapper. Define it with an empty body to silence the output.
#define CHATTY(x) x

float add(float a,float b) 
{
	float_dissector ad; ad.f=a;
	float_dissector bd; bd.f=b;
	
	// Precondition the inputs
	if (a<b) return add(b,a); // swap so a>=b
	if (a<0.0) return -add(-a,-b); // crude handling of negative numbers
	// if (b<0.0) return sub(a,-b); // FIXME: need subtract for negative b
	
	// Now a and b are non-negative, with a>=b
	CHATTY( dump(a); dump(b); )
	unsigned long afrac=(1<<23)+ad.b.frac; // include the implicit 1
	unsigned long bfrac=(1<<23)+bd.b.frac;
	int expshift=ad.b.exp - bd.b.exp; // distance between exponents
	// line up b with a's exponent (FIXME: rounding?)
	// Shifting by 24 or more leaves nothing of b, and huge shifts are undefined.
	bfrac=(expshift<24)?(bfrac>>expshift):0;
	CHATTY( std::cout<<"Exponent shift "<<expshift<<" bit\n"; )
	
	// Now that the fraction fields are aligned, do integer addition
	unsigned long sfrac = afrac + bfrac;
	
	float_dissector sd; // sum
	sd.b.sign=0; // positive result
	if (sfrac&(1<<24)) { // carry!
		CHATTY( std::cout<<"Carry!\n"; )
		sd.b.exp=ad.b.exp+1;
		sd.b.frac=sfrac>>1; // lose precision (rounding mode?)
	} else { // no carry, use a's exponent in output
		CHATTY( std::cout<<"No carry\n"; )
		sd.b.exp=ad.b.exp;
		sd.b.frac=sfrac; // exact result
	}
	CHATTY( std::cout<<" sum: "<<sd.f<<"\n\n"; )
	return sd.f;
}

void foo(void) {
	add(2.0,2.0);
	add(2.3,2.3);
	add(2.25,1.25);
	add(32.25,1.25);
}

(Try this in NetRun now!)
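
The add code leaves subtraction as a FIXME. Here is one rough way the missing piece could look, as a sketch rather than the lecture's own version: sub, to_bits, and from_bits are hypothetical names, the code assumes a 32-bit IEEE 754 float, and it only handles normalized inputs with a>=b>0. The alignment step is the same as in addition; the new part is renormalizing, since a subtraction can cancel the leading 1 and the result has to be shifted back up. Like the add code, it truncates instead of rounding, and it flushes results too small to normalize to zero.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <iostream>

// Hypothetical helpers (not from the lecture): move between a float and its raw bits.
std::uint32_t to_bits(float f) { std::uint32_t u; std::memcpy(&u,&f,sizeof u); return u; }
float from_bits(std::uint32_t u) { float f; std::memcpy(&f,&u,sizeof f); return f; }

// Subtract b from a without the floating point subtract hardware.
// Sketch only: requires normalized inputs with a>=b>0.
float sub(float a,float b) {
	std::uint32_t ab=to_bits(a), bb=to_bits(b);
	std::uint32_t aexp=(ab>>23)&0xFF, bexp=(bb>>23)&0xFF;
	assert(a>=b && b>0 && bexp!=0 && aexp!=255);
	
	std::uint32_t afrac=(1u<<23)|(ab&0x7FFFFF); // include the implicit 1
	std::uint32_t bfrac=(1u<<23)|(bb&0x7FFFFF);
	std::uint32_t expshift=aexp-bexp; // distance between exponents
	bfrac=(expshift<32)?(bfrac>>expshift):0; // line up b with a (truncates)
	
	// Aligned fractions: integer subtract. a>=b guarantees no borrow.
	std::uint32_t dfrac=afrac-bfrac;
	if (dfrac==0) return 0.0f; // total cancellation
	
	// Renormalize: shift left until the leading 1 is back in bit 23
	std::uint32_t dexp=aexp;
	while (!(dfrac&(1u<<23)) && dexp>1) { dfrac<<=1; dexp--; }
	if (!(dfrac&(1u<<23))) return 0.0f; // would be denormal: flushed to zero here
	
	return from_bits((dexp<<23)|(dfrac&0x7FFFFF)); // sign bit 0: positive result
}

void foo(void) {
	std::cout<<sub(2.25f,1.25f)<<"\n";
	std::cout<<sub(32.25f,1.25f)<<"\n";
	std::cout<<sub(1.5f,0.5f)<<"\n";
}
```

With something like this in place, the commented-out line in add could call sub(a,-b) when b is negative and a>=-b; the remaining sign cases would still need their own handling.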

Floating point multiplication is conceptually similar: we perform integer operations on the sign, exponent, and fraction fields to get the right answer.  Again, the trick to keeping the circuit simple is normalizing the input, such as by sign.
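
As a rough illustration of those integer operations, here's a minimal multiply sketch. It is not the lecture's code: mul, to_bits, and from_bits are hypothetical names, and it assumes a 32-bit IEEE 754 float with normalized, finite, nonzero inputs. The sign bits are XORed, the biased exponents are added (subtracting one copy of the bias), and the 24-bit fractions are multiplied into a 48-bit product. The product of two values in [1,2) lies in [1,4), so at most one renormalizing shift is needed. The low bits are truncated rather than rounded, and exponent overflow or underflow just trips an assert.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <iostream>

// Hypothetical helpers (not from the lecture): move between a float and its raw bits.
std::uint32_t to_bits(float f) { std::uint32_t u; std::memcpy(&u,&f,sizeof u); return u; }
float from_bits(std::uint32_t u) { float f; std::memcpy(&f,&u,sizeof f); return f; }

// Multiply two floats without the floating point multiply hardware.
// Sketch only: normalized inputs, truncation instead of rounding.
float mul(float a,float b) {
	std::uint32_t ab=to_bits(a), bb=to_bits(b);
	std::uint32_t aexp=(ab>>23)&0xFF, bexp=(bb>>23)&0xFF;
	assert(aexp!=0 && aexp!=255 && bexp!=0 && bexp!=255); // no zero, denormal, inf, NaN
	
	std::uint32_t sign=(ab>>31)^(bb>>31); // signs differ -> negative result
	int exp=(int)aexp+(int)bexp-127; // both are biased, so remove one bias
	
	std::uint64_t afrac=(1u<<23)|(ab&0x7FFFFF); // include the implicit 1
	std::uint64_t bfrac=(1u<<23)|(bb&0x7FFFFF);
	std::uint64_t prod=afrac*bfrac; // 48-bit product, in [2^46, 2^48)
	
	if (prod&(1ull<<47)) { // product is in [2,4): shift down, bump exponent
		prod>>=1;
		exp++;
	}
	// The leading 1 is now at bit 46; keep the 23 bits below it (truncating the rest)
	std::uint32_t frac=(std::uint32_t)(prod>>23)&0x7FFFFF;
	
	assert(exp>0 && exp<255); // overflow and underflow not handled
	return from_bits((sign<<31)|((std::uint32_t)exp<<23)|frac);
}

void foo(void) {
	std::cout<<mul(1.5f,1.5f)<<"\n";
	std::cout<<mul(2.0f,3.0f)<<"\n";
	std::cout<<mul(-0.5f,4.0f)<<"\n";
}
```

Note that unlike addition, no alignment shift is needed before the integer operation; the only shift is the renormalization afterward.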


CS 441 Lecture Note, 2014, Dr. Orion Lawlor, UAF Computer Science Department.