# Fixed & Floating Point Arithmetic

CS 641 Lecture, Dr. Lawlor

To represent fractional values before fast floating-point hardware existed, we used fixed-point arithmetic, where you keep track of the binary point's location at compile time.  Floating point lets the binary point move at runtime, which makes it more flexible than fixed point.
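As a rough illustration of the fixed-point idea, here is a minimal sketch of a 16.16 format: a 32-bit integer whose low 16 bits are the fraction.  The names `fix16`, `FIX_SHIFT`, `fix_add`, and `fix_mul` are made up for this example, and it assumes non-negative values (right-shifting a negative integer is implementation-defined before C++20).  The entry point is `foo`, following the NetRun convention used in these notes.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>

/* Hypothetical 16.16 fixed-point type: the binary point sits
   FIX_SHIFT bits from the right, and that position never changes. */
typedef std::int32_t fix16;
enum { FIX_SHIFT = 16 };

fix16 fix_from_double(double v) { return (fix16)(v * (1 << FIX_SHIFT)); }
double fix_to_double(fix16 f) { return f / (double)(1 << FIX_SHIFT); }

/* Add: the binary points already line up, so this is a plain integer add. */
fix16 fix_add(fix16 a, fix16 b) { return a + b; }

/* Multiply: the raw product has 2*FIX_SHIFT fraction bits,
   so shift FIX_SHIFT of them back out. */
fix16 fix_mul(fix16 a, fix16 b) {
	return (fix16)(((std::int64_t)a * b) >> FIX_SHIFT);
}

int foo(void) {
	fix16 a = fix_from_double(1.5), b = fix_from_double(2.25);
	assert(fix_mul(a, b) == fix_from_double(3.375)); /* exact in 16.16 */
	std::cout << "a+b = " << fix_to_double(fix_add(a, b)) << "\n";
	std::cout << "a*b = " << fix_to_double(fix_mul(a, b)) << "\n";
	return 0;
}
```

Because the shift amount is a compile-time constant, add and multiply cost about the same as integer operations.  The price is a fixed range and resolution: this format cannot hold anything at or above 32768 or finer than 2^-16.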

Here's the software floating point class we built during class, with a little cleanup and a working multiply:
```cpp
#include <cmath>
#include <iostream>

/* Lawlor floating point number class.
   Only supports positive numbers, and only + and * operations. */
class Lfloat {
public:
	unsigned short fraction; /* implicit binary point at left side */
	signed short exponent; /* * 2^exponent */
	enum {fractbits=8*sizeof(fraction)}; /* number of bits in fraction field */

	/* Initialize us to this fraction/exponent pair.
	   Internally, normalizes the values. */
	void normalize(unsigned long f,long e) {
		if (f==0) { /* no leading one */
			fraction=0;exponent=0;
		}
		else {
			//std::cout<<std::dec<<"pre exp = "<<e<<"  frac = "<<std::hex<<f<<"\n";
			/* FIXME: both loops could use binary bit search */
			/* Push leading one down into fraction field */
			while (f>=(1ul<<fractbits)) {
				e++;
				f=f>>1;
			}
			/* Pull leading one up into fraction field */
			while (f<(1ul<<(fractbits-1))) {
				e--;
				f=f<<1;
			}
			//std::cout<<std::dec<<"scaled exp = "<<e<<"  frac = "<<std::hex<<f<<"\n";
			fraction=f; exponent=e;
		}
	}

	/* Shift right by n bits, returning 0 once every fraction bit is gone
	   (shifting by the full register width or more is undefined in C++). */
	static unsigned long shr(unsigned long f,long n) { return n<fractbits ? f>>n : 0; }

	/* Initialize us to this integer value. */
	Lfloat(long value) { normalize(value,fractbits); }
	/* Initialize us to this fraction/exponent pair. */
	Lfloat(long fraction_,long exponent_) {normalize(fraction_,exponent_);}

	// Add: basically just bit shift to line up exponents.
	friend Lfloat operator+(const Lfloat &a,const Lfloat &b) {
		long exp=a.exponent; // exponent field of output
		unsigned long frac=0; // fraction field of output
		if (exp>=b.exponent) { // line up output with a's exponent
			frac=a.fraction+shr(b.fraction,exp-b.exponent);
		}
		else /* (exp<b.exponent) */ { // line up with b's exponent
			frac=shr(a.fraction,b.exponent-exp)+b.fraction;
			exp=b.exponent; // shifted a to line up with b
		}
		return Lfloat(frac,exp);
	}

	// Multiply: actually easier, since we don't need to line up exponents.
	friend Lfloat operator*(const Lfloat &a,const Lfloat &b) {
		long exp=(a.exponent+b.exponent)-fractbits;
		unsigned long frac=a.fraction*(unsigned long)b.fraction; // scaled as 0.(2*fractbits)
		return Lfloat(frac,exp);
	}

	// Output
	friend std::ostream &operator<<(std::ostream &o,const Lfloat &f) {
		/* Converting to a machine float is sort of cheating:
		   we could manually extract decimal chars here with enough work... */
		float fv=f.fraction * std::pow(2.0,f.exponent-fractbits);
		o<<fv<<" ";
		return o;
	}
};

int foo(void) {
	Lfloat a=51235, b=5000;
	Lfloat c=a+b;
	for (int reps=0;reps<10;reps++) {
		std::cout<<c<<"\n";
		c=c*10;
	}
	return 0;
}
```

(Try this in NetRun now!)

## Hardware Floating Point Numbers

If you need a review of floats, see the CS 301 lecture notes.

This patent provides a decent summary of a floating-point add circuit.  The basic idea is usually:
• Shift the two input numbers so their binary points line up.
• Add the aligned values as integers.
• Renormalize the sum: count off zero bits until you hit a one, and shift significant digits up.
You can shift both input numbers into huge fixed-point values (for example, a 32-bit float can be shifted into a fx128.153 fixed-point number), but it takes much less circuitry to shift the smaller number so it matches up with the larger value, as we discussed in class.
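The "shift the smaller number to match the larger" approach can be sketched in software on the raw bits of a 32-bit float.  This is only an illustration under stated assumptions, not how any particular circuit works: the function name `soft_add` is made up, it assumes 32-bit IEEE floats, handles only positive normal inputs, truncates instead of rounding, and ignores overflow.  With positive inputs the sum can only grow, so renormalizing needs at most one right shift; counting off leading zeros only becomes necessary once subtraction is supported.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <utility>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit IEEE floats");

/* Hypothetical software adder: positive normal floats only. */
float soft_add(float a, float b) {
	std::uint32_t ab, bb;
	std::memcpy(&ab, &a, sizeof(ab));
	std::memcpy(&bb, &b, sizeof(bb));
	/* For positive floats, bit-pattern order matches value order. */
	if (ab < bb) std::swap(ab, bb);
	int ea = (int)(ab >> 23), eb = (int)(bb >> 23); /* biased exponents */
	std::uint32_t ma = (ab & 0x7FFFFF) | 0x800000; /* restore implicit leading 1 */
	std::uint32_t mb = (bb & 0x7FFFFF) | 0x800000;

	/* Step 1: shift the smaller number down to line up with the larger. */
	int shift = ea - eb;
	mb = (shift < 32) ? (mb >> shift) : 0; /* too small to matter: all bits lost */

	/* Step 2: add the aligned values as integers. */
	std::uint32_t sum = ma + mb;

	/* Step 3: renormalize.  A carry into bit 24 means one shift down. */
	if (sum & 0x1000000) { sum >>= 1; ea++; }

	std::uint32_t rb = ((std::uint32_t)ea << 23) | (sum & 0x7FFFFF);
	float r;
	std::memcpy(&r, &rb, sizeof(r));
	return r;
}

int foo(void) {
	float a = 51235.0f, b = 5000.0f;
	assert(soft_add(a, b) == a + b); /* exact, so truncation does not matter */
	std::cout << "software: " << soft_add(a, b) << "  hardware: " << a + b << "\n";
	return 0;
}
```

Only the smaller input is ever shifted, and never by more than the width of one register, which is the saving over converting both inputs to a huge fixed-point value.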

## Software Examples

x86 ancient (1980's) interface: floating-point register stack.
```nasm
fldpi ; Push "pi" onto floating-point stack
fld DWORD[my_float]  ; push constant
faddp ; add one and pi
sub esp,8 ; Make room on the stack for an 8-byte double
fstp QWORD [esp]; Push printf's double parameter onto the stack
push my_string ; Push printf's string parameter (below)
extern printf
call printf  ; Print string
add esp,12    ; Clean up stack
ret ; Done with function

my_string: db "Yo!  Here's our float: %f",0xa,0
my_float: dd 1.0 ; floating-point DWORD
```

(Try this in NetRun now!)

x86 newer (1990's) interface: SSE registers
```nasm
movups xmm0,[my_arr] ; load up array
addps xmm0,xmm0 ; add array to itself
movups [my_arr],xmm0 ; store back to memory
push 4 ; number of values to print
push my_arr ; array to print
extern farray_print
call farray_print  ; Print array
add esp,8    ; Clean up stack
ret ; Done with function

section ".data"
my_arr: dd 1.0, 2.0, 3.0, 4.0 ; floating-point DWORD
```

(Try this in NetRun now!)

## Bits in Floating-point Numbers

We can pretty easily count the mantissa bits in a float, by making "x" smaller and smaller until roundoff loses it when we add it to 1:
```cpp
float x=1.0+1.0e-9*(rand()%2); /* FEAR ME, OPTIMIZER!!! */
int itcount=0;
while (x+1.0f!=1.0f) {
	x=x*0.5;
	itcount++;
}
std::cout<<"itcount=="<<itcount<<"\n";
```

(Try this in NetRun now!)
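The same experiment can be repeated for `double`, and the counts checked against what the standard library reports.  This is a sketch rather than the version from class: the template name `count_bits` is made up, and it uses `volatile` instead of the `rand()` trick to stop the compiler from carrying extra precision or folding the loop away.  It assumes binary floating point with the default round-to-nearest mode.

```cpp
#include <cassert>
#include <iostream>
#include <limits>

/* Count how many times x can be halved before x+1 rounds back to 1. */
template <typename T>
int count_bits(void) {
	volatile T x = 1, sum = 2; /* volatile: every result gets rounded to T */
	int itcount = 0;
	while (sum != 1) {
		x = x * (T)0.5;
		sum = x + 1;
		itcount++;
	}
	return itcount;
}

int foo(void) {
	int fbits = count_bits<float>(), dbits = count_bits<double>();
	/* digits counts mantissa bits, including the implicit leading one */
	assert(fbits == std::numeric_limits<float>::digits);
	assert(dbits == std::numeric_limits<double>::digits);
	std::cout << "float itcount==" << fbits << "\n";
	std::cout << "double itcount==" << dbits << "\n";
	return 0;
}
```

For IEEE floats the count is one more than the number of stored fraction bits, because the leading one is implicit: 23 stored bits give a count of 24 for `float`, and 52 stored bits give 53 for `double`.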