Negative Numbers, and Creative Misuse of Branch Instructions

CS 301 Lecture, Dr. Lawlor

Signed and Unsigned

C++ supports integral values both "signed" (positive or negative) and "unsigned" (never negative). The only difference is the value represented by the highest bit. The highest bit is often called the "sign bit" because if it's 0, the number is positive, and if it's 1, the value is negative (for signed values).

Here's a hypothetical 3 bit version:

Bit Pattern	unsigned	signed
000	0	0
001	1	1
010	2	2
011	3	3
100	4	-4
101	5	-3
110	6	-2
111	7	-1

Here are the bit values for 8 bit numbers of both flavors:

Bit:	7	6	5	4	3	2	1	0
unsigned char	128	64	32	16	8	4	2	1
signed char	-128	64	32	16	8	4	2	1

So, for example, binary 10000010 represents decimal 130 (128+2) if it's unsigned, but -126 (-128+2) if that same value is signed. Negative one is 0xff, since 64+32+16+8+4+2+1==127.

When adding two positive signed numbers, a carry into the highest bit will cause the sum to go negative. If you add two negative numbers together, you get a spurious carry out of the highest bit, but luckily there's a separate "overflow flag" that represents signed wraparound.

The "obvious" way to represent negative numbers, just using the sign bit to store the +/- sign, has the weird behavior with zero plus the sign bit set--you get negative zero! With the weird-value sign bit above, it turns out that integer addition is the same operation for signed *or* unsigned values. For these reasons, the approach above with a big negative sign bit, called two's complement, is the standard today.

Background info on Storage Sizes

Eight bits make a "byte" (note: it's pronounced exactly like "bite", but always spelled with a 'y'), although in some rare networking manuals (and in French) the same eight bits would be called an "octet" (hard drive sizes are in "Go", Giga-octets, when sold in French). Some folks call four bits (one hex digit) a "nibble" or even "nybble". In DOS and Windows programming, 16 bits is a "WORD", 32 bits is a "DWORD" (double word), and 64 bits is a "QWORD"; but in other contexts "word" means the machine's natural binary processing size, which ranges from 32 to 64 bits nowadays. "word" should now be considered ambiguous. Giving an actual bit count is the best approach ("The file begins with a 32-bit binary signed integer describing...").

Regster	Typical Name	Unsigned Range	Signed Range	Bits	Hex Digits (4 bits)	Bytes (8 bits)
n/a	Bit	0..1	-1..0	1	less than 1	less than 1
al	"char" (or byte)	255	-128 .. 127	8	2	1
ax	"short" (or WORD)	65535	-32768 .. +32767	16	4	2
eax	"int" (DWORD)	>4 billion	-2G .. +2G	32	8	4
rax	"long" (or "long long")	>16 quadrillion	-8Q .. +8Q	64	16	8

You can determine the usable range of a value by experimentally measuring when the value wraps around, for example with code like this:

int value=1; /* value to test, starts at first (lowest) bit */
for (int bit=0;bit<100;bit++) {
	std::cout<<"at bit "<<bit<<" the value is "<<value<<"\n";
	value=value+value; /* moves over by one bit (value=value<<1 would work too) */
}
return 0;

(Try this in NetRun now!)

If I have an unsigned value in a small register, I can stick the value into a bigger register by just sticking zeros into the high bits.

mov cl,0xde
movzx eax,cl
ret

(Try this in NetRun now!)

If I have a signed value in a small register, and I want to copy the *value* to a bigger register, I have to copy the sign bit upwards into the higher bits. The instruction "movsz" (move with sign extend) does this:

mov cl,0xde
movsx eax,cl
ret

(Try this in NetRun now!)

Bit Shifting in Signed and Unsigned

If you're shifting bits using the << and >> bit shift operators, there are times that the sign bit is important. For example, if you left-shift a positive signed number, pushing the bits up to higher values, the result will eventually go negative. If you right shift, for an unsigned value, you need to tack on zeros; but for a signed value, you need to preserve the sign bit. There are dedicated instructions for this:

	signed	unsigned
<< left	sal	shl
>> right	sar	shr

The signed "sar":

mov eax,0x87654321
sar eax,16
ret

(Try this in NetRun now!)

Result: 0xFFFF8765
(a negative number)

The unsigned "shr":

mov eax,0x87654321
shr eax,16
ret

(Try this in NetRun now!)

Result: 0x8765
(a positive number)

Java is a little unusual because they don't have "unsigned int". But they do have an unsigned-type logical bit shift operation ">>>"!

Wraparound Checking in Assembly

C++ just blindly ignores integer wraparound, so it's possible for a program to fail silently and give you the wrong answer. In assembly, you can check for signed overflow with "jo" (jump-if-overflow), as shown below; or for unsigned use "jc" (jump-if-carry), as we did last class.

mov edi,1000000000

add edi,edi
jo uhoh ; error-check the add

mov eax,edi
ret ; ordinary return

uhoh:
mov eax,-999 ; Special value to indicate we had an overflow
ret ; error return

(Try this in NetRun now!)

This program will hence return the right answer, or admit something failed. You do have to be careful to check after every possible arithmetic instruction, since the various flags get re-set after each "add".

warning: comparison between signed and unsigned integer expressions

Why do you care signed versus unsigned values?

Well, for one thing, it affects how you can write C++ code. For example, this code produces a warning, then gives the wrong answer "999":

unsigned int i=3000000000;
if (i<-3) return 999; /* what?! */
else return 0;

(Try this in NetRun now!)

The problem is, in assembly language:

When comparing unsigned numbers, you use "ja" and "jb", which read the CF (carry flag).
When comparing signed numbers, you use "jg" and "jl", which read SF and OF (sign and overflow flags).

When comparing unsigned i to signed -3, which jump do you use? Well, sadly, they both give the wrong answer! This is a hardware limitation (no mixed-sign comparisons) that ends up as a language limitation (warning and wrong answer for mixed-sign comparisons).

In the above code, if the compiler decides to perform the signed comparison:
unsigned 3 billion < signed -3
(convert to signed)
signed -1 billion < signed -3
So this will return true: 3 billion is less than -3 if they're both converted to signed 32-bit values.

It doesn't actually help if we perform an unsigned comparison:
unsigned 3 billion < signed -3
(convert to unsigned)
unsigned 3 billion < signed 4 billion
This will also return true: 3 billion is less than -3 if they're both converted to unsigned 32-bit values.

On a 64-bit machine, you can fix this problem by typecasting both sides to a "long" before doing the compare. This works because a signed long has enough bits to represent +3 billion without wraparound:

unsigned int i=3000000000;
if ((long)i<(long)-3) return 999;
else return 0;

(Try this in NetRun now!)

This fixes the problem, but "unsigned long" suffers from the same exact malfunction at 15 quadrillion:

unsigned long i=15000000000000000000ul;
if ((long)i<(long)-3) return 999;
else return 0;

(Try this in NetRun now!)

Bottom line: you can't reliably compare signed and unsigned numbers. Pick one or the other, and stick with it.

Reminder: negative signed numbers are represented by setting the high bit, which corresponds to a really huge unsigned number; similarly, huge unsigned numbers like above correspond to negative signed numbers.

Low-Branch Bounds Checking

You can actually exploit the signed/unsigned comparison differences to speed up this common bounds-checking code:

int i=3;
const int n=10;
int arr[n];
int foo(void) {
	if (i>=0 && i<n) return arr[i];
	else return 0;
}

(Try this in NetRun now!)

Note that we have to do two separate tests on "i": it could be negative, so we check against zero; or it could be n or bigger, so we compare against n. But if you look at the code the compiler generates, there's only one test on "i":

	mov    edx, <<i>>
	mov    eax,0x0
	cmp    edx,0x9
	ja     bad
	... load up arr[i] ...
bad:	ret

Wait, only one comparison, and it's an *unsigned* compare? But "i" is signed!

This is intentional: if "i" is negative (signed), then treated as an unsigned number, it's huge. This lets us replace a two-test signed "negative or too big" with a single unsigned "too big" test!