IEEE Floating Point Standard


The IEEE Floating Point Standard (FPS), published as IEEE 754, is the most widely accepted standard representation for floating point numbers. The standard provides definitions for single precision and double precision representations.
The single precision IEEE FPS format is composed of 32 bits, divided into a 23 bit mantissa, M, an 8 bit exponent, E, and a sign bit, S:

    bit 31: S | bits 30-23: E | bits 22-0: M

The normalized mantissa, m, is stored in bits 0-22 with the hidden bit (the leading 1) omitted. Thus M = m-1.
The exponent, e, is represented as a bias-127 integer in bits 23-30. Thus, E = e+127.
The sign bit, S, indicates the sign of the mantissa, with S=0 for positive values and S=1 for negative values.
Zero is represented by E = M = 0. Since S may be 0 or 1, there are different representations for +0 and -0.
The maximum value of E = 255 is reserved for special values, usually the result of floating point arithmetic that overflows because the exponent is too large to be represented. Results whose exponents are too small underflow toward zero and use E = 0 instead.
The special interpretations for E = 255 and M = 0 are +infinity for S = 0 and -infinity for S = 1. A number with E = 255 and nonzero M is called NaN (Not a Number). NaN is produced by invalid operations such as 0/0, whereas dividing a nonzero value by zero produces an infinity.
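The field layout and the reserved values described above can be sketched in Python. This is an illustration only: decode_fields and classify are hypothetical helper names, not part of any library, and the standard struct module is assumed to round the argument to single precision the way the hardware does.

```python
import struct

def decode_fields(x):
    """Split x, rounded to single precision, into the fields (S, E, M)."""
    # Pack as a big-endian 32 bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31            # bit 31: sign
    e = (bits >> 23) & 0xFF   # bits 23-30: bias-127 exponent
    m = bits & 0x7FFFFF       # bits 0-22: mantissa with the hidden bit omitted
    return s, e, m

def classify(s, e, m):
    """Name the kind of value that a field triple encodes."""
    if e == 255:
        if m == 0:
            return "-infinity" if s else "+infinity"
        return "NaN"
    if e == 0 and m == 0:
        return "-0" if s else "+0"
    if e == 0:
        return "denormalized"  # E = 0 with nonzero M; not covered in the text
    return "normalized"

for value in (0.0, -0.0, 1.0, float("inf"), float("-inf"), float("nan")):
    fields = decode_fields(value)
    print(value, fields, classify(*fields))
```

Note that +0 and -0 differ only in the sign field, which is why the two zeros have different bit patterns.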
To convert decimal 17.15 to IEEE FPS:

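Convert decimal 17 to binary 10001. Convert decimal 0.15 to the repeating binary fraction 0.001001100110011... (the pattern 1001 repeats). Combine integer and fraction to obtain binary 10001.001001100110011...
Normalize the binary number to obtain 1.0001001001100110011... x 2^4, so e = 4. Thus, M = m-1 = 0001 0010 0110 0110 0110 011 (the first 23 fraction bits) and E = e+127 = 131 = 1000 0011.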
The number is positive, so S=0.
Align the values for S, E, and M in the correct fields: 0 | 1000 0011 | 0001 0010 0110 0110 0110 011.

The hexadecimal value is 0x41893333.
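The worked example can be checked mechanically. The sketch below assembles the three fields by hand and compares the result with the bits Python produces when it rounds 17.15 to single precision. The variable names by_hand and by_machine are illustrative, and the struct module is assumed to use round-to-nearest.

```python
import struct

# Fields from the worked example for decimal 17.15.
S = 0                                    # positive number
E = 4 + 127                              # e = 4 with bias 127 gives 131
M = int("00010010011001100110011", 2)    # first 23 fraction bits of m

# Place S in bit 31, E in bits 23-30, and M in bits 0-22.
by_hand = (S << 31) | (E << 23) | M

# Let the machine round 17.15 to single precision and expose the same 32 bits.
(by_machine,) = struct.unpack(">I", struct.pack(">f", 17.15))

print(f"0x{by_hand:08X}", f"0x{by_machine:08X}")
```

Both values should match the hexadecimal result stated above.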
The range of values for the mantissa, m, is between 1 and 2 - 2^-23 (just under 2).
Because E=0 and E=255 are reserved, the range of values for the exponent, e, is between -126 and +127.
The largest positive number that can be represented is approximately 2 x 2^127 = 2^128. The decimal value of this number is approximately 3.4 x 10^38, since 2^128 = 10^(128 log10 2) and 128 log10 2 is approximately 38.53.
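As a rough check of this estimate, the largest value can be computed from the largest mantissa and exponent and compared with the bit pattern S = 0, E = 254, M = all ones (0x7F7FFFFF). The names largest, from_bits, and exponent10 are illustrative only.

```python
import math
import struct

# Largest mantissa (2 - 2^-23) times the largest power of two (2^127).
largest = (2 - 2**-23) * 2.0**127

# The same value read back from its single precision bit pattern.
(from_bits,) = struct.unpack(">f", bytes.fromhex("7F7FFFFF"))

# Power of ten corresponding to 2^128.
exponent10 = 128 * math.log10(2)

print(largest, from_bits, exponent10)
```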

The mantissa represents a 24 bit binary fraction which corresponds to approximately 7 decimal digits since 2^24 is approximately 1.7 x 10^7 (24 log10 2 is approximately 7.2).
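The 7 digit estimate can be illustrated with a small experiment. The helper to_single is a hypothetical name for a round trip through the 32 bit format, assuming the struct module rounds to nearest. The integer 2^24 + 1 = 16777217 needs 25 significant bits, so it cannot survive the round trip.

```python
import math
import struct

def to_single(x):
    """Round x to single precision and return it as a Python float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Number of decimal digits carried by 24 binary digits.
digits = 24 * math.log10(2)
print(digits)

# 2^24 fits in 24 significant bits and is stored exactly.
print(to_single(16777216.0))

# 2^24 + 1 does not fit and is rounded to a neighboring value.
print(to_single(16777217.0))
```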


Mitch Roth
Wed Oct 9 13:38:30 ADT 1996