Fixed & Floating Point Arithmetic

CS 641 Lecture, Dr. Lawlor
Before fast floating-point hardware was common, we usually represented fractional values with fixed-point arithmetic, where the location of the binary point (the "decimal point") is decided at compile time.  Floating point lets the point move at runtime, which makes it more flexible than fixed point.
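
To make the contrast concrete, here is a minimal fixed-point sketch. The 16.16 layout and the names fix16, to_fix, to_double, fix_add, and fix_mul are assumptions chosen for illustration, not code from the lecture. The point to notice is that the shift amount is a compile-time constant:

```cpp
#include <cstdint>

// Hypothetical 16.16 fixed-point type: 16 integer bits, 16 fraction bits.
// The binary point's position (FRACBITS) is fixed at compile time.
typedef int32_t fix16;
const int FRACBITS = 16;

fix16 to_fix(double v) { return (fix16)(v * (1 << FRACBITS)); }
double to_double(fix16 f) { return f / (double)(1 << FRACBITS); }

// Addition needs no adjustment: both operands share the same scale.
fix16 fix_add(fix16 a, fix16 b) { return a + b; }

// A product carries 2*FRACBITS fraction bits, so shift back down by FRACBITS.
fix16 fix_mul(fix16 a, fix16 b) {
    return (fix16)(((int64_t)a * b) >> FRACBITS);
}
```

For instance, to_double(fix_mul(to_fix(1.5), to_fix(2.25))) evaluates to 3.375. Values outside roughly +-32768 overflow in this layout, which is exactly the inflexibility that a runtime exponent removes.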

Here's the software floating point class we built during class, with a little cleanup and a working multiply:
#include <iostream>
#include <cmath>

/* Lawlor floating point number class.
   Only supports positive numbers, and only + and * operations. */
class Lfloat {
    unsigned short fraction; /* implicit binary point at left side */
    signed short exponent; /* value is scaled by 2^exponent */
    enum {fractbits=8*sizeof(fraction)}; /* number of bits in fraction field */

    /* Initialize us to this fraction/exponent pair.
       Internally, normalizes the values. */
    void normalize(unsigned long f,long e) {
        if (f==0) { /* no leading one */
            e=-32768; /* assumed: park zero at the smallest exponent */
        }
        else {
            //std::cout<<std::dec<<"pre exp = "<<e<<" frac = "<<std::hex<<f<<"\n";
            /* FIXME: both loops could use binary bit search */
            /* Push leading one down into fraction field */
            while (f>=(1<<fractbits)) {
                f=f>>1; e++;
            }
            /* Pull leading one up into fraction field */
            while (f<(1<<(fractbits-1))) {
                f=f<<1; e--;
            }
            //std::cout<<std::dec<<"scaled exp = "<<e<<" frac = "<<std::hex<<f<<"\n";
        }
        fraction=f; exponent=e;
    }

public:
    /* Initialize us to this integer value. */
    Lfloat(long value) { normalize(value,fractbits); }
    /* Initialize us to this fraction/exponent pair. */
    Lfloat(long fraction_,long exponent_) {normalize(fraction_,exponent_);}

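    // Add: basically just bit shift to line up exponents.
    friend Lfloat operator+(const Lfloat &a,const Lfloat &b) {
        long exp=a.exponent; // exponent field of output
        unsigned long frac=0; // fraction field of output
        if (exp>=b.exponent) { // line up output with a's exponent
            long shift=exp-b.exponent; // b is shifted down to line up with a
            frac=a.fraction+(shift<fractbits?(b.fraction>>shift):0);
        }
        else /* (exp<b.exponent) */ { // line up with b's exponent
            long shift=b.exponent-exp;
            frac=b.fraction+(shift<fractbits?(a.fraction>>shift):0);
            exp=b.exponent; // shifted a to line up with b
        }
        return Lfloat(frac,exp);
    }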

    // Multiply: actually easier, since we don't need to line up exponents.
    friend Lfloat operator*(const Lfloat &a,const Lfloat &b) {
        long exp=(a.exponent+b.exponent)-fractbits;
        unsigned long frac=a.fraction*(unsigned long)b.fraction; // scaled as 0.(2*fractbits)
        return Lfloat(frac,exp);
    }

    // Output
    friend std::ostream &operator<<(std::ostream &o,const Lfloat &f) {
        /* Converting to a machine float is sort of cheating:
           we could manually extract decimal chars here with enough work...*/
        float fv=f.fraction * pow(2.0,f.exponent-fractbits);
        o<<fv<<" ";
        return o;
    }
};

int foo(void) {
    Lfloat a=51235, b=5000;
    Lfloat c=a+b;
    for (int reps=0;reps<10;reps++) {
        /* ... */
    }
    return 0;
}

(Try this in NetRun now!)
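
The FIXME in normalize says both loops could use a binary bit search. Here is a minimal sketch of that idea, assuming a 16-bit fraction field as in Lfloat and a 64-bit input. The names leading_one, Norm, and normalize_fast are made up for this example and are not part of the class above:

```cpp
#include <cstdint>

// Index of the highest set bit of f (0 = least significant), found by
// binary search over shift widths 32,16,8,4,2,1. Requires f != 0.
int leading_one(uint64_t f) {
    int pos = 0;
    for (int step = 32; step > 0; step /= 2) {
        if (f >> step) { f >>= step; pos += step; }
    }
    return pos;
}

struct Norm { uint16_t fraction; long exponent; }; // hypothetical result pair

// Normalize so the leading one lands in bit 15, using a single shift
// instead of a bit-at-a-time loop. Zero is passed through untouched.
Norm normalize_fast(uint64_t f, long e) {
    const int fractbits = 16;
    if (f != 0) {
        int shift = leading_one(f) - (fractbits - 1); // >0: too big, <0: too small
        if (shift > 0) f >>= shift; else f <<= -shift;
        e += shift;
    }
    Norm n = { (uint16_t)f, e };
    return n;
}
```

The search takes six steps for any 64-bit input, where the loops take up to one step per bit. On real hardware a count-leading-zeros instruction would do the same job in one operation.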

Hardware Floating Point Numbers

If you need a review of floats, see these CS 301 lecture notes:
This patent provides a decent summary of a floating-point add circuit.  The basic idea is usually:
You can shift both input numbers into huge fixed-point values (for example, a 32-bit float can be shifted into an fx128.153 fixed-point number) and add those, but it takes far less circuitry to shift only the smaller number so that it lines up with the larger one, as we discussed in class.
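
The shift-the-smaller-operand approach can be sketched in software. This is only an illustration, not the patent's circuit: it assumes positive, finite, normal floats, it truncates the shifted-out bits where real hardware would round, and the name add_by_alignment is made up. It uses the standard std::frexp and std::ldexp to split and rebuild the floats:

```cpp
#include <cmath>
#include <cstdint>
#include <utility>

// Add two positive, finite floats by aligning the smaller operand to the
// larger one's exponent. Truncates shifted-out bits (real hardware rounds).
float add_by_alignment(float a, float b) {
    if (a < b) std::swap(a, b);            // make a the larger operand
    int ea, eb;
    // frexp returns a mantissa in [0.5,1); scale it to a 24-bit integer.
    uint64_t ma = (uint64_t)std::ldexp(std::frexp(a, &ea), 24);
    uint64_t mb = (uint64_t)std::ldexp(std::frexp(b, &eb), 24);
    int shift = ea - eb;                   // how far b sits below a
    mb = (shift >= 0 && shift < 64) ? (mb >> shift) : 0; // slide b right
    uint64_t sum = ma + mb;                // plain integer add, up to 25 bits
    if (sum >> 24) { sum >>= 1; ea++; }    // carry out: renormalize to 24 bits
    return std::ldexp((float)sum, ea - 24);
}
```

Only one operand ever moves, and the adder only needs to be about as wide as the mantissa. When the exponents differ by more than the mantissa width, the smaller operand shifts away completely and the sum is just the larger input.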

Software Examples

x86 ancient (1980's) interface: floating-point register stack.
fldpi ; Push "pi" onto floating-point stack
fld DWORD[my_float] ; push constant
faddp ; add one and pi

sub esp,8 ; Make room on the stack for an 8-byte double
fstp QWORD [esp]; Pop the sum off the floating-point stack, storing it as printf's double parameter
push my_string ; Push printf's string parameter (below)
extern printf
call printf ; Print string
add esp,12 ; Clean up stack

ret ; Done with function

my_string: db "Yo! Here's our float: %f",0xa,0
my_float: dd 1.0 ; floating-point DWORD

(Try this in NetRun now!)

x86 newer (1990's) interface: SSE registers
movups xmm0,[my_arr] ; load up array
addps xmm0,xmm0 ; add array to itself
movups [my_arr],xmm0 ; store back to memory

push 4 ; number of values to print
push my_arr ; array to print
extern farray_print
call farray_print ; Print array
add esp,8 ; Clean up stack

ret ; Done with function

section ".data"
my_arr: dd 1.0, 2.0, 3.0, 4.0 ; four floating-point DWORDs

(Try this in NetRun now!)

Bits in Floating-point Numbers

We can pretty easily count the bits in a float's mantissa, by making a float "x" smaller and smaller until roundoff makes x+1 indistinguishable from 1:
float x=1.0+1.0e-9*(rand()%2); /* FEAR ME, OPTIMIZER!!! */
int itcount=0;
while (x+1.0f!=1.0f) {
    x*=0.5; itcount++; /* move x down one bit and count it */
}
return itcount;

(Try this in NetRun now!)
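
The same loop generalizes to other types. The sketch below is an assumption-laden variant, not the lecture's code: count_bits is a made-up name, and every intermediate value is stored through a volatile variable so the compiler cannot keep it in a wider register (the x87 stack's 80-bit registers would otherwise give a larger count).

```cpp
#include <limits>

// Count the mantissa bits of T: halve x until x+1 rounds back to exactly 1.
template <class T>
int count_bits() {
    volatile T x = 1;     // volatile forces every value back to T's precision
    volatile T sum;
    int itcount = 0;
    for (;;) {
        sum = x + (T)1;
        if (sum == (T)1) break;   // x has dropped off the end of the mantissa
        x = x * (T)0.5;
        itcount++;
    }
    return itcount;
}
```

On IEEE hardware this should return 24 for float and 53 for double, matching std::numeric_limits<T>::digits, which counts the implicit leading one along with the stored fraction bits.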