For example:
int value=1; /* value to test, starts at first (lowest) bit */
for (int bit=0;bit<100;bit++) {
    std::cout<<"at bit "<<bit<<" the value is "<<value<<"\n";
    value=value+value; /* moves over by one bit */
    if (value==0) break;
}
return 0;
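Because "int" currently has 32 bits, if you start at one and add the variable to itself 32 times, the one bit gets pushed off the high end and is lost completely, leaving zero. (Strictly speaking, the C++ standard leaves signed integer overflow undefined; this is what typical compilers produce on x86.)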
In assembly, there's a handy instruction "jo" (jump if overflow) to check for overflow from the previous instruction. The C++ compiler doesn't bother to use jo, though!
mov edi,1 ; loop variable
mov eax,0 ; counter

start:
    add eax,1 ; increment bit counter
    add edi,edi ; add variable to itself
    jo noes ; check for overflow in the above add
    cmp edi,0
    jne start
ret

noes: ; called for overflow
    mov eax,999
    ret
Notice the above program returns 999 on overflow, which somebody else will need to check for. (Responding correctly to overflow is actually quite difficult--see, e.g., Ariane 5 explosion, caused by poor handling of a detected overflow. Ironically, ignoring the overflow would have caused no problems!)
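Since the C++ compiler won't emit "jo" for you, one portable option is to test whether an add would overflow before doing it. The sketch below is only an illustration: the names checked_add and count_doublings are made up for this example, and it assumes a C++17 compiler (for std::optional).

```cpp
#include <limits>
#include <optional>

// Hypothetical helper: returns a+b, or std::nullopt if the sum would not fit in an int.
// The checks happen *before* the add, so no signed overflow ever actually occurs.
std::optional<int> checked_add(int a, int b) {
    if (b > 0 && a > std::numeric_limits<int>::max() - b) return std::nullopt; // too big
    if (b < 0 && a < std::numeric_limits<int>::min() - b) return std::nullopt; // too small
    return a + b;
}

// Doubles a value starting at 1 until the next doubling would overflow.
// Returns how many doublings succeeded.
int count_doublings() {
    int value = 1;
    int count = 0;
    while (auto next = checked_add(value, value)) {
        value = *next;
        ++count;
    }
    return count;
}
```

Like the assembly version returning 999, the caller still has to decide what to do with the "no value" result; the check only makes the overflow visible.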
If you watch closely right before overflow, you see something funny happen:
signed char value=1; /* value to test, starts at first (lowest) bit */
for (int bit=0;bit<100;bit++) {
    std::cout<<"at bit "<<bit<<" the value is "<<(long)value<<"\n";
    value=value+value; /* moves over by one bit (value=value<<1 would work too) */
    if (value==0) break;
}
return 0;
This prints out:
at bit 0 the value is 1
at bit 1 the value is 2
at bit 2 the value is 4
at bit 3 the value is 8
at bit 4 the value is 16
at bit 5 the value is 32
at bit 6 the value is 64
at bit 7 the value is -128
Program complete. Return 0 (0x0)
Wait, the last bit's value is -128? Yes, it really is!
This negative high bit is called the "sign bit", and it has a negative value in two's complement signed numbers. This means to represent -1, for example, you set not only the high bit, but all the other bits as well: in unsigned, this is the largest possible value. The reason binary 11111111 represents -1 is the same reason you might choose 9999 to represent -1 on a 4-digit odometer: if you add one, you wrap around and hit zero.
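To make the "negative high bit" concrete, here is a small sketch that computes the value of an 8-bit pattern by hand, giving the top bit a weight of -128. The function names are invented for this example and are not from any library.

```cpp
#include <cstdint>

// Reinterpret a signed 8-bit value as unsigned (conversion is modulo 256).
std::uint8_t as_unsigned(std::int8_t v) {
    return static_cast<std::uint8_t>(v);
}

// Two's complement value of an 8-bit pattern, computed by hand:
// bits 0..6 have their usual positive weights, bit 7 weighs -128.
int twos_complement_value(std::uint8_t bits) {
    int value = 0;
    for (int bit = 0; bit < 7; bit++)
        if (bits & (1u << bit)) value += (1 << bit);
    if (bits & 0x80u) value -= 128; // the sign bit
    return value;
}
```

With all eight bits set, the low seven bits contribute 127 and the sign bit contributes -128, for a total of -1; read as unsigned, the same pattern is 255, and adding one wraps it around to zero just like the odometer.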
A very cool thing about two's complement is addition is the same operation whether the numbers are signed or unsigned--we just interpret the result differently. Subtraction is also identical for signed and unsigned. Register names are identical in assembly for signed and unsigned. However, when you change register sizes using an instruction like "movsxd rax,eax", when you check for overflow, when you compare numbers, multiply or divide, or shift bits, you need to know if the number is signed (has a sign bit) or unsigned (no sign bit, no negative numbers).
Signed | Unsigned | Language |
int | unsigned int | C++, int is signed by default. |
signed char | unsigned char | C++, char may be signed or unsigned. |
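movsx, movsxd | movzx | Assembly, sign extend or zero extend to change register sizes. (There is no "movzxd": writing a 32-bit register like eax already zeroes the upper half of rax.) |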
jo | jc | Assembly, overflow is calculated for signed values, carry for unsigned values. |
jg | ja | Assembly, jump greater is signed, jump above is unsigned. |
jl | jb | Assembly, jump less signed, jump below unsigned. |
imul | mul | Assembly, imul is signed (and more modern), mul is for unsigned (and ancient and horrible!). idiv/div work similarly. |
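The same split shows up in C++ through the types. The sketch below takes one 32-bit pattern and compares or widens it both ways. The function names are made up for illustration, and the conversion from an out-of-range unsigned value to a signed type is only guaranteed to wrap since C++20 (mainstream two's complement compilers behaved this way earlier too).

```cpp
#include <cstdint>

// Comparison: the same bits can be "less than 1" or not, depending on the type.
bool signed_less_than_one(std::uint32_t bits) {
    return static_cast<std::int32_t>(bits) < 1;   // signed compare, like jl
}
bool unsigned_less_than_one(std::uint32_t bits) {
    return bits < 1u;                             // unsigned compare, like jb
}

// Widening 32 -> 64 bits: copy the sign bit upward, or fill with zeros.
std::int64_t sign_extend(std::uint32_t bits) {
    return static_cast<std::int32_t>(bits);       // like movsxd
}
std::uint64_t zero_extend(std::uint32_t bits) {
    return bits;                                  // upper 32 bits become zero
}
```

For the all-ones pattern 0xFFFFFFFF, the signed view is -1 (less than one, and still -1 after widening), while the unsigned view is 4294967295 (not less than one, and unchanged after widening).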
Sign | Exponent | Fraction (or "Mantissa") |
1 bit-- 0 for positive, 1 for negative | 8 unsigned bits-- 127 means 2^0, 137 means 2^10 | 23 bits-- a binary fraction. Don't forget the implicit leading 1! |
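As a worked version of the three fields, the sketch below pulls them out of a float with shifts and masks, then rebuilds the value as (1 + fraction/2^23) times 2^(exponent-127). The struct and function names are hypothetical, and the rebuild step only handles ordinary (normalized) numbers, not zero, denormals, infinity, or NaN.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical holder for the three fields of a 32-bit IEEE float.
struct FloatFields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, stored with a bias of 127
    std::uint32_t fraction;  // 23 bits, the part after the implicit leading 1
};

FloatFields split_float(float f) {
    static_assert(sizeof(float) == 4, "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // copy out the raw bits
    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}

// Rebuild the value from its fields (normalized numbers only).
double rebuild(const FloatFields &p) {
    double mantissa = 1.0 + p.fraction / 8388608.0;  // 8388608 = 2^23
    double value = std::ldexp(mantissa, static_cast<int>(p.exponent) - 127);
    return p.sign ? -value : value;
}
```

For example, 8.0 splits into sign 0, exponent 130 (that is, 2^3), and fraction 0, while 1024.0 has exponent 137, matching the table above.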
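#include <cstdint>  /* std::uint32_t */
#include <cstring>  /* std::memcpy */
#include <iomanip>  /* std::setw */
#include <iostream>

void print_bits(float f) {
    std::uint32_t i; /* raw bits of the float */
    std::memcpy(&i,&f,sizeof(i)); /* read bits by copying bytes (a pointer cast here would break C++ aliasing rules) */
    std::cout<<" float "<<std::setw(10)<<f<<" = ";
    for (int bit=31;bit>=0;bit--) {
        if (i&(1u<<bit)) std::cout<<"1"; else std::cout<<"0"; /* 1u: shifting a signed 1 into bit 31 overflows */
        if (bit==31) std::cout<<" "; /* space after the sign bit */
        if (bit==23) std::cout<<" (implicit 1)."; /* end of the exponent, start of the fraction */
    }
    std::cout<<std::endl;
}

int foo(void) {
    print_bits(0.0);
    print_bits(-1.0);
    print_bits(1.0);
    print_bits(2.0);
    print_bits(4.0);
    print_bits(8.0);
    print_bits(1.125);
    print_bits(1.25);
    print_bits(1.5);
    print_bits(1+1.0/10);
    return sizeof(float);
}

The traditional way to dissect the parts of a float is using a "union" and a bitfield like so. (This relies on the compiler's bitfield layout and on reading a union member other than the one last written, which GCC-style compilers on little-endian x86 support.)

/* IEEE floating-point number's bits: sign exponent mantissa */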
struct float_bits {
unsigned int fraction:23; /**< Value is binary 1.fraction ("mantissa") */
unsigned int exp:8; /**< Value is 2^(exp-127) */
unsigned int sign:1; /**< 0 for positive, 1 for negative */
};
/* A union is a struct where all the fields *overlap* each other */
union float_dissector {
float f;
float_bits b;
};
float_dissector s;
s.f=8.0;
std::cout<<s.f<<"= sign "<<s.b.sign<<" exp "<<s.b.exp<<" fract "<<s.b.fraction<<"\n";
return 0;
C Datatype | Size | Approx. Precision | Approx. Range | Exponent Bits | Fraction Bits | +-1 range |
float | 4 bytes (everywhere) | 1.0x10^-7 | 10^38 | 8 | 23 | 2^24 |
double | 8 bytes (everywhere) | 2.0x10^-15 | 10^308 | 11 | 52 | 2^53 |
long double | 12-16 bytes (if it even exists) | 2.0x10^-20 | 10^4932 | 15 | 64 | 2^65 |
half float | 2 bytes (only on GPUs) | 1.0x10^-3 | 10^5 | 5 | 10 | 2^11 |
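You can ask your own compiler for its version of these numbers through std::numeric_limits. The sketch below is one way to do that; FloatInfo, describe, and print_row are names invented for this example, and the values you get depend on your machine (long double in particular varies a lot).

```cpp
#include <iostream>
#include <limits>

// Hypothetical summary record for one floating-point type.
template <typename T>
struct FloatInfo {
    int bytes;          // sizeof(T)
    int mantissa_bits;  // fraction bits plus the leading 1
    T epsilon;          // gap between 1.0 and the next representable value
    T max;              // largest finite value
};

template <typename T>
FloatInfo<T> describe() {
    using L = std::numeric_limits<T>;
    return { static_cast<int>(sizeof(T)), L::digits, L::epsilon(), L::max() };
}

// Print one row in the spirit of the table above.
template <typename T>
void print_row(const char *name) {
    FloatInfo<T> info = describe<T>();
    std::cout << name << ": " << info.bytes << " bytes, epsilon " << info.epsilon
              << ", max " << info.max
              << ", every integer exact up to 2^" << info.mantissa_bits << "\n";
}
```

Calling print_row<float>("float"), print_row<double>("double"), and print_row<long double>("long double") from your main function prints one line per type.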
float f=0.73;
while (1) {
    volatile float g=f+1; /* volatile forces the sum to be rounded to a real float */
    if (g==f) {
        std::cout<<"f+1 == f at f="<< f <<", or 2^"<< log(f)/log(2.0) <<std::endl;
        return 0;
    }
    else f=g;
}

For "double", you can add one many more times, but eventually the double will stop changing despite your additions. Recall that for integers, adding one repeatedly will *never* leave the value unchanged--eventually the integer will wrap around, but it won't just stop moving like floats!
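Counting up by ones until a double gets stuck would take far too long, so the sketch below jumps straight through the powers of two and reports the first one where adding 1 has no effect. It also shows the integer behavior for contrast, using an unsigned type so the wraparound is well defined. Both function names are made up for this example, and the result assumes IEEE 64-bit doubles.

```cpp
#include <cmath>
#include <cstdint>

// Returns the smallest p such that 2^p + 1 == 2^p in double arithmetic,
// or -1 if no such p is found.
int first_stuck_power() {
    for (int p = 0; p < 1024; p++) {
        volatile double x = std::ldexp(1.0, p);  // x = 2^p
        volatile double y = x + 1.0;             // volatile forces rounding to double
        if (y == x) return p;
    }
    return -1;
}

// An unsigned integer never gets stuck: adding 1 to the largest value wraps to 0.
std::uint32_t wrap_after_max() {
    std::uint32_t u = 0xFFFFFFFFu;
    return u + 1u;
}
```

On IEEE doubles the stuck point is 2^53, which is the 52 fraction bits plus the implicit leading 1.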
double lil=1.0;
double big=pow(2.0,53); //<- carefully chosen for IEEE 64-bit float (52 bits of fraction + implicit 1)
std::cout<<" big+(lil+lil) -big = "<< big+(lil+lil) -big <<std::endl;
std::cout<<"(big+lil)+lil -big = "<< (big+lil)+lil -big <<std::endl;
float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
for (int i=0;i<1000;i++)
windshield += gnats;
if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";
float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
volatile float gnatcup=0.0;
for (int i=0;i<1000;i++)
gnatcup += gnats;
windshield+=gnatcup; /* add all gnats to the windshield at once */
if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";
for (int i=1;i<1000000000;i*=10) {
double mul01=i*0.1;
double div10=i/10.0;
double diff=mul01-div10;
std::cout<<"i="<<i<<" diff="<<diff<<"\n";
}
i=1 diff=5.54976e-18
i=10 diff=5.55112e-17
i=100 diff=5.55112e-16
i=1000 diff=5.55112e-15
i=10000 diff=5.55112e-14
i=100000 diff=5.55112e-13
i=1000000 diff=5.54934e-12
i=10000000 diff=5.5536e-11
i=100000000 diff=5.54792e-10
Program complete. Return 0 (0x0)

That is, there's a difference of about 5.5x10^-18 between double-precision 0.1 and the true 1/10! This can add up over time.
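A quick way to watch the error in 0.1 add up is to sum it repeatedly and compare against the exact answer. This is only a sketch: sum_of_tenths is a name invented for this example, and it assumes IEEE doubles with ordinary round-to-nearest arithmetic.

```cpp
// Adds 0.1 to a running total n times and returns the total.
// volatile forces every partial sum to be rounded to a real double.
double sum_of_tenths(int n) {
    volatile double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum = sum + 0.1;
    return sum;
}
```

Ten tenths should be exactly 1.0, but sum_of_tenths(10) == 1.0 comes out false: the total lands a hair below 1.0, because each 0.1 is slightly off and each addition rounds again.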