Nowadays floats have roughly the same
performance as integers: addition, subtraction, or multiplication
all take about a nanosecond. That is, floats are now cheap,
and you can consider using floats for all sorts of stuff--even
when you don't care about fractions! The advantages of using
floats are:
| Sign | Exponent | Fraction (or "Mantissa") |
| 1 bit-- 0 for positive, 1 for negative | 8 unsigned bits-- 117 means 2^-10, 127 means 2^0, 137 means 2^10 | 23 bits-- a binary fraction. There's an implicit leading 1 (unless the exponent field is zero) |
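#include <cstring>
#include <iomanip>
#include <iostream>

/* Print the raw bits of a float: sign, then exponent, then fraction */
void print_bits(float f) {
  unsigned int i; /* assumes float and unsigned int are both 32 bits */
  std::memcpy(&i,&f,sizeof(i)); /* copy out the bits (avoids the aliasing problems of a pointer cast) */
  std::cout<<" float "<<std::setw(10)<<f<<" = ";
  for (int bit=31;bit>=0;bit--) {
    if (i&(1u<<bit)) std::cout<<"1"; else std::cout<<"0"; /* 1u: shifting a signed 1 into bit 31 is not portable */
    if (bit==31) std::cout<<" "; /* space after the sign bit */
    if (bit==23) std::cout<<" (implicit 1)."; /* exponent ends here, fraction follows */
  }
  std::cout<<std::endl;
}
int foo(void) {
  print_bits(0.0);
  print_bits(-1.0);
  print_bits(1.0);
  print_bits(2.0);
  print_bits(4.0);
  print_bits(8.0);
  print_bits(1.125);
  print_bits(1.25);
  print_bits(1.5);
  print_bits(1+1.0/10);
  return sizeof(float);
}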
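A common way to dissect the parts of a float is using a "union" and a bitfield, like so. (Reading a union member other than the one you last wrote is allowed in C; in standard C++ it is officially undefined, although compilers such as GCC support it. Bitfield layout is also compiler-dependent, so treat this as a platform-specific trick.)
/* IEEE floating-point number's bits: sign exponent mantissa */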
struct float_bits {
unsigned int fraction:23; /**< Value is binary 1.fraction ("mantissa") */
unsigned int exp:8; /**< Value is 2^(exp-127) */
unsigned int sign:1; /**< 0 for positive, 1 for negative */
};
/* A union is a struct where all the fields *overlap* each other */
union float_dissector {
float f;
float_bits b;
};
float_dissector s;
s.f=8.0;
std::cout<<s.f<<"= sign "<<s.b.sign<<" exp "<<s.b.exp<<" fract "<<s.b.fraction<<"\n";
return 0;
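The union-and-bitfield approach above depends on how the compiler lays out bitfields. As a rough alternative sketch, the same three fields can be pulled out with memcpy, shifts, and masks, and then the value rebuilt from them as a check. The names float_fields, dissect, and rebuild are made up for this illustration, and the code assumes a 32-bit IEEE float. It is written NetRun-style, with foo() as the entry point.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <iostream>

/* Hypothetical helper type: the three fields of a 32-bit IEEE float */
struct float_fields {
  std::uint32_t sign;     /* 0 for positive, 1 for negative */
  std::uint32_t exp;      /* biased exponent: value is 2^(exp-127) */
  std::uint32_t fraction; /* 23 bits of binary fraction */
};

/* Split a float into its fields using shifts and masks */
float_fields dissect(float f) {
  static_assert(sizeof(float)==sizeof(std::uint32_t),"expects a 32-bit float");
  std::uint32_t bits;
  std::memcpy(&bits,&f,sizeof(bits)); /* copy out the raw bits */
  float_fields r;
  r.sign=bits>>31;
  r.exp=(bits>>23)&0xFFu;
  r.fraction=bits&0x7FFFFFu;
  return r;
}

/* Rebuild the value from the fields.
   Only valid for normal numbers (exp not 0 and not 255). */
double rebuild(const float_fields &r) {
  double mantissa=1.0+r.fraction/8388608.0; /* implicit 1, plus fraction/2^23 */
  double v=std::ldexp(mantissa,(int)r.exp-127); /* mantissa * 2^(exp-127) */
  return r.sign?-v:v;
}

int foo(void) {
  float_fields r=dissect(8.0f);
  std::cout<<"sign "<<r.sign<<" exp "<<r.exp<<" fract "<<r.fraction
           <<" rebuilt "<<rebuild(r)<<"\n";
  return 0;
}
```

The shift-and-mask version does not care which order the compiler packs bitfields in, at the cost of spelling out the bit positions by hand.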
| C Datatype | Size | Approx. Precision | Approx. Range | Exponent Bits | Fraction Bits | +-1 range |
| float | 4 bytes (everywhere) | 1.0x10^-7 | 10^38 | 8 | 23 | 2^24 |
| double | 8 bytes (everywhere) | 2.0x10^-15 | 10^308 | 11 | 52 | 2^53 |
| long double | 12-16 bytes (if it even exists) | 2.0x10^-20 | 10^4932 | 15 | 64 | 2^65 |
| half float / __fp16 | 2 bytes (only on late-2010ish machines) | 1.0x10^-3 | 10^5 | 5 | 10 | 2^11 |
| fp8 float | 1 byte (only on 2023+ GPUs) | Approx 1.0x10^-1 | 10^4 | 4 or 5 | 3 or 2 | 2^3 |
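To check the table's numbers on your own machine, one option is to ask the compiler via std::numeric_limits. This is only a sketch: print_limits and significand_bits are names invented here. Note that numeric_limits::digits counts the leading 1 bit too, so it reports 24 for a float with 23 stored fraction bits.

```cpp
#include <iostream>
#include <limits>

/* Number of significand bits, counting the leading 1 */
template <typename T>
int significand_bits(void) {
  return std::numeric_limits<T>::digits;
}

/* Print size, precision, and range for one floating-point type */
template <typename T>
void print_limits(const char *name) {
  std::cout<<name<<": "<<sizeof(T)<<" bytes, "
           <<significand_bits<T>()<<" significand bits, epsilon "
           <<std::numeric_limits<T>::epsilon()<<", max "
           <<std::numeric_limits<T>::max()<<"\n";
}

int foo(void) {
  print_limits<float>("float");
  print_limits<double>("double");
  print_limits<long double>("long double"); /* results vary by platform */
  return 0;
}
```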
Many of the various new 16 or 8 bit float types are just truncated versions of larger floats, so the only circuit change is to throw away some fraction bits.
(Image adapted from the NVIDIA fp8 primer.)
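As one concrete illustration of "just throw away fraction bits", here is a sketch using the bfloat16 layout, which keeps the top 16 bits of a 32-bit float: the sign, all 8 exponent bits, and 7 fraction bits. The function names are made up for this example, and real hardware typically rounds rather than truncating as this does.

```cpp
#include <cstdint>
#include <cstring>
#include <iostream>

/* Keep only the top 16 bits of a 32-bit float (bfloat16 layout).
   This truncates the fraction; it does not round. */
std::uint16_t float_to_bf16(float f) {
  std::uint32_t bits;
  std::memcpy(&bits,&f,sizeof(bits));
  return (std::uint16_t)(bits>>16);
}

/* Expand back to a 32-bit float by filling the lost fraction bits with zeros */
float bf16_to_float(std::uint16_t h) {
  std::uint32_t bits=((std::uint32_t)h)<<16;
  float f;
  std::memcpy(&f,&bits,sizeof(f));
  return f;
}

int foo(void) {
  float orig=3.14159f;
  float back=bf16_to_float(float_to_bf16(orig));
  std::cout<<orig<<" survives as "<<back<<" after truncation to 16 bits\n";
  return 0;
}
```

Values with short fractions such as 1.5 make the round trip unchanged, while values like 3.14159 lose their low digits but keep their full exponent range.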
Posits are pure research for now, but in the 2020s AI era there's some interest in formats that provide more precision with fewer bits. A "posit" adds a "regime" field to the usual float exponent field, which can then be shorter, leaving more precision for reasonable-sized numbers. This gives a number format that smoothly scales down to even 8 bit floats (posit8_t) while maintaining a consistent format.
float f=0.73;
while (1) {
  volatile float g=f+1;
  if (g==f) {
    std::cout<<"f+1 == f at f="<< f <<", or 2^"<< log(f)/log(2.0) <<std::endl;
    return 0;
  }
  else f=g;
}
For "double", you can add one many more times, but eventually the double will stop changing despite your additions. Recall that for integers, adding one repeatedly will *never* give you the same value--eventually the integer will wrap around, but it won't just stop moving like floats!
double lil=1.0;
double big=pow(2.0,53); //<- carefully chosen for IEEE 64-bit float (52 bits of fraction + implicit 1)
std::cout<<" big+(lil+lil) -big = "<< big+(lil+lil) -big <<std::endl;
std::cout<<"(big+lil)+lil -big = "<< (big+lil)+lil -big <<std::endl;
float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
for (int i=0;i<1000;i++)
windshield += gnats;
if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";
float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
volatile float gnatcup=0.0;
for (int i=0;i<1000;i++)
gnatcup += gnats;
windshield+=gnatcup; /* add all gnats to the windshield at once */
if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";
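The two gnat experiments above can be put side by side as functions, which makes it easier to try other sizes and counts. This is a sketch with invented function names; the volatile variables are there, as in the originals, to force every intermediate result to be rounded to a 32-bit float.

```cpp
#include <iostream>

/* Add n small values to a big one, one at a time */
float add_one_at_a_time(float windshield,float gnat,int n) {
  volatile float w=windshield;
  for (int i=0;i<n;i++)
    w=w+gnat; /* each tiny addition may round away to nothing */
  return w;
}

/* Collect the small values in a separate sum first, then add once */
float add_via_cup(float windshield,float gnat,int n) {
  volatile float cup=0.0f;
  for (int i=0;i<n;i++)
    cup=cup+gnat; /* small values add to each other without loss here */
  volatile float w=windshield;
  w=w+cup; /* one addition of a value big enough to register */
  return w;
}

int foo(void) {
  float big=1<<24;
  std::cout<<"one at a time added "<<add_one_at_a_time(big,1.0f,1000)-big<<"\n";
  std::cout<<"via the cup added   "<<add_via_cup(big,1.0f,1000)-big<<"\n";
  return 0;
}
```

The general lesson is to add numbers of similar size to each other first, so the small contributions can grow large enough to survive the final addition.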
for (int i=1;i<1000000000;i*=10) {
double mul01=i*0.1;
double div10=i/10.0;
double diff=mul01-div10;
std::cout<<"i="<<i<<" diff="<<diff<<"\n";
}
i=1 diff=5.54976e-18
i=10 diff=5.55112e-17
i=100 diff=5.55112e-16
i=1000 diff=5.55112e-15
i=10000 diff=5.55112e-14
i=100000 diff=5.55112e-13
i=1000000 diff=5.54934e-12
i=10000000 diff=5.5536e-11
i=100000000 diff=5.54792e-10
Program complete. Return 0 (0x0)
That is, there's a difference of about 5.5x10^-18 between double-precision 0.1 and the true 1/10! The error scales with i, so it can add up over time.
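A quick way to watch this error add up is to sum 0.1 repeatedly and compare against the exact answer. The sketch below uses an invented helper, sum_tenths, and assumes ordinary IEEE 64-bit doubles; the volatile forces each partial sum to be rounded to double precision.

```cpp
#include <iomanip>
#include <iostream>

/* Add 0.1 to itself n times, rounding to double after every step */
double sum_tenths(int n) {
  volatile double sum=0.0;
  for (int i=0;i<n;i++)
    sum=sum+0.1;
  return sum;
}

int foo(void) {
  std::cout<<std::setprecision(20);
  std::cout<<"0.1 is stored as "<<0.1<<"\n";
  double s=sum_tenths(10);
  std::cout<<"ten tenths = "<<s<<", off from 1.0 by "<<s-1.0<<"\n";
  if (s==1.0) std::cout<<"Exactly one!\n";
  else std::cout<<"Not exactly one.\n";
  return 0;
}
```

This is why comparing a running float total against an exact constant with == is risky: loop on an integer counter, or compare within a tolerance instead.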