Addition and Subtraction

Floating point addition is analogous to addition using scientific notation. For example, to add $2.25 \times 10^{0}$ to $1.340625 \times 10^{2}$:

Shift the decimal point of the smaller number to the left until the exponents are equal. Thus, the first number becomes $0.0225 \times 10^{2}$.
Add the numbers with decimal points aligned:

$1.340625 \times 10^{2} + 0.0225 \times 10^{2} = 1.363125 \times 10^{2}$

Normalize the result. Here the sum is already normalized.

Once the decimal points are aligned, the addition can be performed by ignoring the decimal point and using integer addition.
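The decimal procedure can be sketched in a few lines of Python. This is only an illustration: the function name `add_decimal` and the choice of storing each mantissa as an integer with six digits after the decimal point (`SCALE`) are assumptions made for the example, and digits shifted out during alignment are simply dropped.

```python
SCALE = 10 ** 6  # assumed: mantissas are integers with 6 digits after the point


def add_decimal(m1, e1, m2, e2):
    """Add m1 x 10^e1 and m2 x 10^e2 using integer addition only."""
    # Make (m1, e1) the operand with the larger exponent.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    # Align: shift the smaller number right one digit per exponent step.
    m2 //= 10 ** (e1 - e2)
    total = m1 + m2
    # Normalize: a carry out of the units digit needs a right shift.
    while total >= 10 * SCALE:
        total //= 10
        e1 += 1
    return total, e1


# 2.25 x 10^0 + 1.340625 x 10^2
print(add_decimal(2_250_000, 0, 1_340_625, 2))  # (1363125, 2), i.e. 1.363125 x 10^2
```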
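The addition of two IEEE FPS numbers is performed in a similar manner. The number 2.25 in IEEE FPS is $1.001_2 \times 2^{1}$, with E = 128:

    0 10000000 00100000000000000000000

The number 134.0625 in IEEE FPS is $1.00001100001_2 \times 2^{7}$, with E = 134:

    0 10000110 00001100001000000000000

To align the binary points, the smaller exponent is incremented and the mantissa is shifted right until the exponents are equal. Thus, 2.25 becomes (hidden bit shown explicitly, shifted right by 6 places):

    0 10000110 0.00000100100000000000000

The mantissas are added using integer addition:

      1.00001100001000000000000
    + 0.00000100100000000000000
    ---------------------------
      1.00010000101000000000000

This is $1.00010000101_2 \times 2^{7} = 136.3125$.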

The result is already in normal form. If the sum overflows the position of the hidden bit, then the mantissa must be shifted one bit to the right and the exponent incremented. Each mantissa, including its hidden bit, is less than 2, so the sum is less than 4 and its integer part can be no more than 3 (binary 11). A single right shift is therefore always enough.
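The align, add, and normalize steps for 2.25 + 134.0625 can be traced with a short Python sketch. The helper names `fields` and `add_positive` are hypothetical, and the sketch makes simplifying assumptions: both operands are positive and normalized, bits shifted out during alignment are truncated rather than rounded, and exponent overflow is not checked. Real hardware keeps extra guard bits and rounds.

```python
import struct


def fields(x):
    """Split x (rounded to single precision) into sign, E, and the 23-bit M."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def add_positive(x, y):
    """Add two positive normalized singles by integer mantissa addition."""
    _, ex, mx = fields(x)
    _, ey, my = fields(y)
    mx |= 1 << 23            # restore the hidden bit
    my |= 1 << 23
    if ex < ey:              # make (ex, mx) the operand with the larger exponent
        ex, mx, ey, my = ey, my, ex, mx
    my >>= ex - ey           # align the binary points (shifted-out bits are lost)
    total = mx + my          # integer addition of the mantissas
    if total >> 24:          # sum overflowed the hidden bit position
        total >>= 1
        ex += 1
    bits = (ex << 23) | (total & 0x7FFFFF)   # drop the hidden bit again
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(fields(2.25)[1], fields(134.0625)[1])  # 128 134
print(add_positive(2.25, 134.0625))          # 136.3125
```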

The exponents can be positive or negative with no change in the algorithm. A smaller exponent simply means a more negative one. In the bias-127 representation, the smaller exponent also has the smaller value of E when the exponent field is interpreted as an unsigned integer.
An important case occurs when the numbers differ widely in magnitude. If the exponents differ by more than 24, the smaller number will be shifted right entirely out of the mantissa field, producing a zero mantissa. The sum will then equal the larger number. Such truncation errors occur when the numbers differ by a factor of more than $2^{24}$, which is approximately $1.7 \times 10^{7}$. The precision of IEEE single precision floating point arithmetic is approximately 7 decimal digits.
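This effect is easy to observe. The sketch below assumes a hypothetical helper `to_single`, which rounds a Python float (double precision) to the nearest single precision value by packing it with the standard `struct` module. Note that it rounds rather than truncates, so it approximates the hardware behaviour rather than the simplified algorithm above.

```python
import struct


def to_single(x):
    """Round a Python float to the nearest IEEE single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


big = to_single(2.0 ** 25)   # exponent is 25 larger than that of 1.0

# Adding 1.0 changes nothing: the smaller operand is shifted out entirely.
print(to_single(big + 1.0) == big)   # True

# Adding 4.0 still registers, because it lands on the last mantissa bit.
print(to_single(big + 4.0) == big)   # False
```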
Negative mantissas are handled by first converting to 2's complement and then performing the addition. After the addition is performed, the result is converted back to sign-magnitude form.
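A minimal sketch of this conversion, assuming the mantissas are already aligned and carry their hidden bit, is given below. The 26-bit working width (24 mantissa bits, one carry bit, one sign bit) and the function names are assumptions made for the illustration, not a description of any particular hardware.

```python
WIDTH = 26                    # assumed: 24 mantissa bits + carry bit + sign bit
MASK = (1 << WIDTH) - 1


def to_twos(sign, magnitude):
    """Convert sign-magnitude to a WIDTH-bit 2's complement pattern."""
    return (-magnitude if sign else magnitude) & MASK


def from_twos(value):
    """Convert a WIDTH-bit 2's complement pattern back to sign-magnitude."""
    if value >> (WIDTH - 1):              # top bit set: the value is negative
        return 1, (-value) & MASK
    return 0, value


def add_signed(s1, m1, s2, m2):
    """Add two aligned sign-magnitude mantissas via 2's complement."""
    return from_twos((to_twos(s1, m1) + to_twos(s2, m2)) & MASK)


# 1.1 (binary) minus 1.0 (binary), mantissas with hidden bit: result 0.1
print(add_signed(0, 0xC00000, 1, 0x800000))   # (0, 4194304)
```

The result still has to be normalized afterwards, exactly as for the sum of two positive mantissas.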
When adding numbers of opposite sign, cancellation may occur, resulting in a sum which is arbitrarily small, or even zero if the numbers are equal in magnitude. Normalization in this case may require shifting by the total number of bits in the mantissa, resulting in a large loss of accuracy.
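The loss of accuracy can be seen in a small Python experiment. The helper `to_single` is a hypothetical name for a function that rounds to single precision using the standard `struct` module; the operand values are chosen only for illustration.

```python
import struct


def to_single(x):
    """Round a Python float to the nearest IEEE single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


a = to_single(1.0000001)   # stored as 1 + 2^-23, not exactly 1.0000001
b = to_single(1.0)

diff = to_single(a - b)    # the leading bits cancel
print(diff)                # 1.1920928955078125e-07, where 1e-07 was intended

# Only one significant bit of the true difference survives.
relative_error = abs(diff - 1e-7) / 1e-7
print(relative_error > 0.19)   # True: roughly 19 percent off
```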
When the mantissa of the sum is zero, no amount of shifting will produce a 1 in the hidden bit. This case must be detected in the normalization step and the result set to the representation for 0, E = M = 0. This result does not mean that the quantities being computed are equal. Each operand may already carry rounding error, so a zero result means only that any difference between them is smaller than the precision of the floating point representation.
Floating point subtraction is achieved simply by inverting the sign bit and performing addition of signed mantissas as outlined above.
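As an illustration, the sign bit of a single precision value can be inverted directly in its bit pattern. In the sketch below the names `negate`, `subtract`, and `to_single` are hypothetical, and Python's own `+` stands in for the signed mantissa addition described above.

```python
import struct


def to_single(x):
    """Round a Python float to the nearest IEEE single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def negate(x):
    """Invert the sign bit (bit 31) of x's single precision representation."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits ^ 0x80000000))[0]


def subtract(x, y):
    """Compute x - y as x + (-y), rounded to single precision."""
    return to_single(x + negate(y))


print(negate(2.25))               # -2.25
print(subtract(134.0625, 2.25))   # 131.8125
```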

CS 301 Class Account
Mon Sep 13 11:15:41 ADT 1999