Bits, Sizes, and Signed vs Unsigned

Variables on a computer only have so many bits. If a value gets too big to fit in those bits, the extra bits "overflow", and by default they're ignored.

For example:

int value=1; /* value to test, starts at first (lowest) bit */
for (int bit=0;bit<100;bit++) {
	std::cout<<"at bit "<<bit<<" the value is "<<value<<"\n";
	value=value+value; /* moves over by one bit (value=value<<1 would work too) */
	if (value==0) break;
}
return 0;

(Try this in NetRun now!)

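Because "int" currently has 32 bits, if you start at one and add the variable to itself 32 times, the one overflows and is lost completely. (Strictly speaking, the C++ standard leaves signed integer overflow undefined, so an optimizing compiler is allowed to do something else here; the wraparound described is what the underlying x86 add instruction does.)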

In assembly, there's a handy instruction "jo" (jump if overflow) to check for overflow from the previous instruction. The C++ compiler doesn't bother to use jo, though!

mov edi,1 ; loop variable
mov eax,0 ; counter

start:
	add eax,1 ; increment bit counter

	add edi,edi ; add variable to itself
	jo noes ; check for overflow in the above add

	cmp edi,0
	jne start

ret

noes: ; called for overflow
	mov eax,999
	ret

(Try this in NetRun now!)

Notice the above program returns 999 on overflow, which somebody else will need to check for. (Responding correctly to overflow is actually quite difficult: see, e.g., the Ariane 5 explosion, which was triggered by a detected overflow.)

C++ Storage Sizes

Eight bits make a "byte" (note: it's pronounced exactly like "bite", but always spelled with a 'y'), although in some rare networking manuals (and in French) the same eight bits would be called an "octet" (hard drive sizes are in "Go", Giga-octets, when sold in French). In DOS and Windows programming, 16 bits is a "WORD", 32 bits is a "DWORD" (double word), and 64 bits is a "QWORD"; but in other contexts "word" means the machine's natural binary processing size, which ranges from 32 to 64 bits nowadays. "word" should now be considered ambiguous. Giving an actual bit count is the best approach ("The file begins with a 32-bit binary integer describing...").

Object	C++ Name	Bits	Bytes (8 bits)	Hex Digits (4 bits)	Octal Digits (3 bits)	Unsigned Range	Signed Range
Bit	none!	1	less than 1	less than 1	less than 1	0..1	-1..0
Byte, or octet	char	8	1	2	two and two thirds	0 .. 255	-128 .. +127
Windows WORD	short	16	2	4	five and one third	0 .. 65535	-32768 .. +32767
Windows DWORD	int	32	4	8	ten and two thirds	0 .. >4 billion	-2G .. +2G
Windows QWORD	long	64	8	16	twenty-one and one third	0 .. >18 quintillion	about -9 quintillion .. +9 quintillion

Register Sizes in Assembly

Like C++ variables, registers are actually available in several sizes:

rax is the 64-bit, "long" size register. It was added in 2003. I've marked the added-with-64-bit registers in red below.
eax is the 32-bit, "int" size register. It was added in 1985. I'm in the habit of using this register size, since these registers also work in 32-bit mode, although I should probably use the longer rax registers for everything.
ax is the 16-bit, "short" size register. It was added in 1978, with the 8086.
al and ah are the 8-bit, "char" size parts of the register. al is the low 8 bits (like ax&0xff), ah is the high 8 bits (like ax>>8). They're original back to 1972.

Curiously, you can write a 64-bit value into rax, then read off the low 32 bits from eax, or the low 16 bits from ax, or the low 8 bits from al. It's just one register, but they keep on extending it!

rax: 64-bit

eax: 32-bit

ax: 16-bit

For example,

mov rcx,0xf00d00d2beefc03; load 64-bit constant
mov eax,ecx; pull out low 32 bits
ret

(Try this in NetRun now!)

Here's the full list of x86 registers. The 64 bit registers are shown in red. "Scratch" registers you're allowed to overwrite and use for anything you want. "Preserved" registers serve some important purpose somewhere else, so as we'll talk about next week you have to put them back ("save" the register) if you use them--for now, just leave them alone!

Notes	64-bit long	32-bit int	16-bit short	8-bit char
Values are returned from functions in this register. Multiply instructions put the low bits of the result here too.	rax	eax	ax	ah and al
Typical scratch register. Some instructions use it as a counter (such as SAL or REP).	rcx	ecx	cx	ch and cl
Scratch register. Multiply instructions put the high bits of the result here.	rdx	edx	dx	dh and dl
Preserved register: don't use it without saving it!	rbx	ebx	bx	bh and bl
The stack pointer. Points to the top of the stack (details next week!)	rsp	esp	sp	spl
Preserved register. Sometimes used to store the old value of the stack pointer, or the "base".	rbp	ebp	bp	bpl
Scratch register. Also used to pass function argument #2 in 64-bit mode (on Linux).	rsi	esi	si	sil
Scratch register. Function argument #1.	rdi	edi	di	dil
Scratch register. These were added in 64-bit mode, so the names are slightly different.	r8	r8d	r8w	r8b
Scratch register.	r9	r9d	r9w	r9b
Scratch register.	r10	r10d	r10w	r10b
Scratch register.	r11	r11d	r11w	r11b
Preserved register.	r12	r12d	r12w	r12b
Preserved register.	r13	r13d	r13w	r13b
Preserved register.	r14	r14d	r14w	r14b
Preserved register.	r15	r15d	r15w	r15b

Signed versus Unsigned Numbers

If you watch closely right before overflow, you see something funny happen:

signed char value=1; /* value to test, starts at first (lowest) bit */
for (int bit=0;bit<100;bit++) {
	std::cout<<"at bit "<<bit<<" the value is "<<(long)value<<"\n";
	value=value+value; /* moves over by one bit (value=value<<1 would work too) */
	if (value==0) break;
}
return 0;

(Try this in NetRun now!)

This prints out:

at bit 0 the value is 1
at bit 1 the value is 2
at bit 2 the value is 4
at bit 3 the value is 8
at bit 4 the value is 16
at bit 5 the value is 32
at bit 6 the value is 64
at bit 7 the value is -128 
Program complete.  Return 0 (0x0)

Wait, the last bit's value is -128? Yes, it really is!

This negative high bit is called the "sign bit", and it has a negative value in two's complement signed numbers. This means to represent -1, for example, you set not only the high bit, but all the other bits as well: in unsigned, this is the largest possible value. The reason binary 11111111 represents -1 is the same reason you might choose 9999 to represent -1 on a 4-digit odometer: if you add one, you wrap around and hit zero.

A very cool thing about two's complement is addition is the same operation whether the numbers are signed or unsigned--we just interpret the result differently. Subtraction is also identical for signed and unsigned. Register names are identical in assembly for signed and unsigned. However, when you change register sizes using an instruction like "movsxd rax,eax", when you check for overflow, when you compare numbers, multiply or divide, or shift bits, you need to know if the number is signed (has a sign bit) or unsigned (no sign bit, no negative numbers).

Signed	Unsigned	Language
int	unsigned int	C++, “int” is signed by default.
signed char	unsigned char	C++, “char” may be signed or unsigned.
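movsx, movsxd	movzx	Assembly, sign extend or zero extend to change register sizes. (There is no "movzxd": writing a 32-bit register already zeroes the upper 32 bits of the 64-bit register.)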
jo	jc	Assembly, “overflow” is calculated for signed values, “carry” for unsigned values.
jg	ja	Assembly, “jump greater” is signed, “jump above” is unsigned.
jl	jb	Assembly, “jump less” signed, “jump below” unsigned.
imul	mul	Assembly, “imul” is signed (and more modern), “mul” is for unsigned (and ancient and horrible!). idiv/div work similarly.

CS 301 Lecture Note, 2014, Dr. Orion Lawlor, UAF Computer Science Department.