Next: Data Structures Up: Floating Point Arithmetic Previous: Division

Rounding Errors

In integer arithmetic, the result of an operation is well-defined: either the exact result is obtained or overflow occurs and the result cannot be represented.

In floating point arithmetic, rounding errors occur as a result of the limited precision of the mantissa. For example, consider the average of two floating point numbers with identical exponents, but mantissas which differ by 1. The average should be a number midway between the original numbers, but the average cannot be represented without increasing the size of the mantissa. Although the mathematical operation is well-defined and the result is within the range of representable numbers, the average of two adjacent floating point values cannot be represented exactly.

The IEEE FPS defines four rounding rules for selecting the floating point result when the exact result cannot be represented:

RN
Round to Nearest. Break ties by choosing the least significant bit = 0.
RZ
Round toward Zero. Same as truncation in sign-magnitude.
RP
Round toward Positive infinity.
RM
Round toward Minus infinity. Same as truncation in 2's complement.

RN is generally preferred and introduces less systematic error than the other rules.

The absolute error introduced by rounding is the actual difference between the exact value and the floating point representation. The size of the absolute error is proportional to the magnitude of the number. For numbers in IEEE FPS format, the absolute error is less than

	2^(-23) × 2^(E-127)

The largest absolute rounding error occurs when the exponent is 127 and is approximately 2^104 since

	2^(-23) × 2^127 = 2^104

The relative error is the absolute error divided by the magnitude of the number which is approximated. For normalized floating point numbers, the relative error is approximately 2^(-23) since

	(2^(-23) × 2^(E-127)) / (1.F × 2^(E-127)) ≈ 2^(-23)

For denormalized numbers (E = 0), relative errors increase as the magnitude of the number decreases toward zero. However, the absolute error of a denormalized number is less than 2^(-149) since the truncation error in a denormalized number is

	2^(-23) × 2^(-126) = 2^(-149)

Rounding errors affect the outcome of floating point computations in several ways:

  1. Exact comparison of floating point variables often produces incorrect results. Floating point variables should not be used as loop counters or loop increments, and convergence tests for iterative algorithms are limited by the precision of the floating point computations.
  2. Operations performed in different orders may give different results, because each intermediate result is rounded: (a+b)+c may differ from a+(b+c), and on machines predating IEEE arithmetic even a+b could differ from b+a.
  3. Errors accumulate over time. While the relative error for a single operation in single precision floating point is about 2^(-23), algorithms which iterate many times may experience an accumulation of errors which is much larger.



CS 301 Class Account
Mon Sep 13 11:15:41 ADT 1999