Bytes, ASCII, Big and Little Endian Integers

CS 301 Lecture, Dr. Lawlor

ASCII: Bytes as Characters

The American Standard Code for Information Interchange (ASCII) is a mapping from byte values to characters: printable letters, digits, and punctuation, plus a handful of control codes.

You can look up an ASCII code in C++ by converting a single-quoted 'char' constant into an integer, like:
return '\n';

(Try this in NetRun now!)

This returns 10 (0xA), because that's the byte the ASCII committee chose to represent a newline.
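Since a char is really just a one-byte integer, you can do arithmetic on ASCII codes.  The digits and letters have contiguous codes, so conversions are just addition and subtraction; here's a minimal sketch:

#include <iostream>

int main(void) {
	char digit='7';
	int value=digit-'0'; /* '7' is 0x37, '0' is 0x30, so this gives 7 */
	char c='q';
	char upper=c-'a'+'A'; /* lowercase and uppercase codes differ by 0x20 */
	std::cout<<"value="<<value<<", upper="<<upper<<"\n"; /* prints: value=7, upper=Q */
	return 0;
}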

You can print the whole ASCII table pretty easily, like:
for (int i=0;i<200;i++) { /* also wanders past 127 into "high ASCII" */
	char c=(char)i; /* reinterpret the integer as a character */
	std::cout<<" ASCII "<<i<<" = "<<c<<"\n";
}

(Try this in NetRun now!)

You can think of ASCII codes as decimal or hexadecimal bytes.  It's the same data, deep down.
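For instance, here's a quick way to print one byte all three ways; the cast to int makes cout print the numeric code instead of the character:

#include <iostream>

int main(void) {
	char c='A';
	std::cout<<"char "<<c
	         <<" = decimal "<<std::dec<<(int)c
	         <<" = hex 0x"<<std::hex<<(int)c<<"\n";
	/* prints: char A = decimal 65 = hex 0x41 */
	return 0;
}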

Dec   Hex   Char
--------------------
0 00 NUL '\0'
1 01 SOH
2 02 STX
3 03 ETX
4 04 EOT
5 05 ENQ
6 06 ACK
7 07 BEL '\a'
8 08 BS '\b'
9 09 HT '\t'
10 0A LF '\n'
11 0B VT '\v'
12 0C FF '\f'
13 0D CR '\r'
14 0E SO
15 0F SI
16 10 DLE
17 11 DC1
18 12 DC2
19 13 DC3
20 14 DC4
21 15 NAK
22 16 SYN
23 17 ETB
24 18 CAN
25 19 EM
26 1A SUB
27 1B ESC
28 1C FS
29 1D GS
30 1E RS
31 1F US
Dec   Hex   Char
-----------------
32 20 SPACE
33 21 !
34 22 "
35 23 #
36 24 $
37 25 %
38 26 &
39 27 '
40 28 (
41 29 )
42 2A *
43 2B +
44 2C ,
45 2D -
46 2E .
47 2F /
48 30 0
49 31 1
50 32 2
51 33 3
52 34 4
53 35 5
54 36 6
55 37 7
56 38 8
57 39 9
58 3A :
59 3B ;
60 3C <
61 3D =
62 3E >
63 3F ?
Dec   Hex   Char
--------------------
64 40 @
65 41 A
66 42 B
67 43 C
68 44 D
69 45 E
70 46 F
71 47 G
72 48 H
73 49 I
74 4A J
75 4B K
76 4C L
77 4D M
78 4E N
79 4F O
80 50 P
81 51 Q
82 52 R
83 53 S
84 54 T
85 55 U
86 56 V
87 57 W
88 58 X
89 59 Y
90 5A Z
91 5B [
92 5C \ '\\'
93 5D ]
94 5E ^
95 5F _
Dec   Hex   Char
----------------
96 60 `
97 61 a
98 62 b
99 63 c
100 64 d
101 65 e
102 66 f
103 67 g
104 68 h
105 69 i
106 6A j
107 6B k
108 6C l
109 6D m
110 6E n
111 6F o
112 70 p
113 71 q
114 72 r
115 73 s
116 74 t
117 75 u
118 76 v
119 77 w
120 78 x
121 79 y
122 7A z
123 7B {
124 7C |
125 7D }
126 7E ~
127 7F DEL
ASCII values less than 32 ("control characters") or 128 and above ("high ASCII") show up in various weird ways depending on which machine and web browser you're running.

Here's the same thing, indexed by hex digits, and including all the funny "high ASCII" characters (shown here as Windows-1252/Latin-1 displays them; blank cells are unprintable):

ASCII  x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 xA xB xC xD xE xF
0x     \0                   \a \b \t \n \v \f \r
1x     (unprintable control characters)
2x        !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /
3x     0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
4x     @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
5x     P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _
6x     `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
7x     p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~
8x     €     ‚  ƒ  „  …  †  ‡  ˆ  ‰  Š  ‹  Œ     Ž
9x        ‘  ’  “  ”  •  –  —  ˜  ™  š  ›  œ     ž  Ÿ
Ax        ¡  ¢  £  ¤  ¥  ¦  §  ¨  ©  ª  «  ¬     ®  ¯
Bx     °  ±  ²  ³  ´  µ  ¶  ·  ¸  ¹  º  »  ¼  ½  ¾  ¿
Cx     À  Á  Â  Ã  Ä  Å  Æ  Ç  È  É  Ê  Ë  Ì  Í  Î  Ï
Dx     Ð  Ñ  Ò  Ó  Ô  Õ  Ö  ×  Ø  Ù  Ú  Û  Ü  Ý  Þ  ß
Ex     à  á  â  ã  ä  å  æ  ç  è  é  ê  ë  ì  í  î  ï
Fx     ð  ñ  ò  ó  ô  õ  ö  ÷  ø  ù  ú  û  ü  ý  þ  ÿ

ASCII table.  The horizontal axis gives the low hex digit, the vertical axis the high hex digit, and each entry is the character for that byte value.  E.g., "A" is 0x41.

Inside a C++ string constant, the characters you type are stored as their ASCII byte values.  You can also enter raw hex bytes in the middle of a string using a backslash-x escape: the string "Funky symbol is \x80" ends with the byte 0x80 (which Windows-1252 renders as the Euro sign, €).

Possibly the world's most obtuse first program, which spells out "Hello World!\n" entirely in hex escapes, is:
std::cout<<"\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x21\x0A";

(Try this in NetRun now!)

Here are the bytes inside the "cout" object itself, printed out as ASCII characters.  They look like random garbage.

std::cout<<(const char *)&std::cout;

(Try this in NetRun now!)

"sizeof": Get Number of Bytes

Eight bits make a "byte" (note: it's pronounced exactly like "bite", but always spelled with a 'y').  In some rare networking manuals, and in French, the same eight bits are called an "octet"; hard drives sold in French are labeled in "Go" or "To", giga-octets or tera-octets.  In DOS and Windows programming, 16 bits is a "WORD", 32 bits is a "DWORD" (double word), and 64 bits is a "QWORD"; but in other contexts "word" means the machine's natural binary processing size, which ranges from 32 to 64 bits nowadays.  So "word" should be considered ambiguous, while "bit" and "byte" mean the same thing everywhere.
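Because "word" is so ambiguous, modern C++ code that cares about exact sizes usually reaches for the fixed-width types in the standard <cstdint> header (C++11 and later) instead; a minimal sketch:

#include <cstdint>
#include <iostream>

int main(void) {
	std::uint8_t  b=0;  /* always exactly 8 bits (one byte) */
	std::uint16_t w=0;  /* 16 bits, like a Windows WORD */
	std::uint32_t dw=0; /* 32 bits, like a DWORD */
	std::uint64_t qw=0; /* 64 bits, like a QWORD */
	std::cout<<sizeof(b)<<" "<<sizeof(w)<<" "<<sizeof(dw)<<" "<<sizeof(qw)<<"\n";
	/* prints: 1 2 4 8, on every conforming machine */
	return 0;
}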

Object                                  Overflow Value   Bits  Hex Digits (4 bits each)  Bytes (8 bits each)
Bit                                     2                1     less than 1               less than 1
Byte, char                              256              8     2                         1
"short" (or Windows WORD)               65536            16    4                         2
"int" (Windows DWORD)                   >4 billion       32    8                         4
"long" (or in 32-bit C++, "long long")  >18 quintillion  64    16                        8
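You can watch those overflow values in action: when a type runs out of bits, arithmetic wraps around to zero.  A small sketch using unsigned types (whose wraparound behavior is guaranteed by the C++ standard):

#include <iostream>

int main(void) {
	unsigned char c=255; /* the biggest value one byte can hold */
	c=c+1;               /* 256 doesn't fit in 8 bits: wraps to 0 */
	std::cout<<(int)c<<"\n"; /* prints 0 */
	unsigned short s=65535;  /* the biggest 16-bit value */
	s=s+1;                   /* likewise wraps to 0 */
	std::cout<<s<<"\n";      /* prints 0 */
	return 0;
}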

There's a nice little builtin operator in C/C++ called "sizeof" that returns the number of bytes (technically, the number of chars) used by a variable or data type.  Sadly, C/C++ don't specify how many bytes data types like "int" occupy, so it depends on the machine:
                     32-bit x86         32-bit PowerPC     64-bit x86         Java / C#
                     (little endian)    (big endian)       or Itanium
sizeof(char)         1                  1                  1                  1 (Java "byte"; Java "char" is 2)
sizeof(short)        2                  2                  2                  2
sizeof(int)          4                  4                  4                  4
sizeof(long)         4                  4                  8                  8
sizeof(long long)    8                  8                  8                  (no need for long long)
sizeof(void *)       4                  4                  8                  (no pointers in Java)
sizeof(float)        4                  4                  4                  4
sizeof(double)       8                  8                  8                  8
sizeof(long double)  12                 8                  16                 (no long double in Java)
Model name           "ILP32"            "ILP32"            "LP64"
Note that the deciding difference between "32-bit machines" and "64-bit machines" is the size of a pointer--4 or 8 bytes.  "int" is 4 bytes on all modern machines.  "long" is 8 bytes in Java and on 64-bit machines, but just 4 bytes on 32-bit machines.

Here's a program that prints out the above:
char c;
short s;
int i;
long l;
long long ll;
void *v;
float f;
double d;
long double ld;
std::cout<<"sizeof(char)=="<<sizeof(c)<<"\n";
std::cout<<"sizeof(short)=="<<sizeof(s)<<"\n";
std::cout<<"sizeof(int)=="<<sizeof(i)<<"\n";
std::cout<<"sizeof(long)=="<<sizeof(l)<<"\n";
std::cout<<"sizeof(long long)=="<<sizeof(ll)<<"\n";
std::cout<<"sizeof(void *)=="<<sizeof(v)<<"\n";
std::cout<<"sizeof(float)=="<<sizeof(f)<<"\n";
std::cout<<"sizeof(double)=="<<sizeof(d)<<"\n";
std::cout<<"sizeof(long double)=="<<sizeof(ld)<<"\n";
return 0;
(executable NetRun link)

Try this out on some different machines!  Note that on some Windows compilers, you might need to say "__int64" instead of "long long".  Also note that "long long" has nothing to do with the Chinese concert pianist Lang Lang.

Big and Little Endian Memory Access

Let's say we ask the CPU to treat four bytes as a single integer, using a typecast like so:
const unsigned char table[]={
	1,2,3,4
};

int foo(void) {
	typedef int *myPtr; /* a pointer-to-int type */
	myPtr p=(myPtr)table; /* reinterpret the table's 4 bytes as one int */
	return p[0];
}

(Try this in NetRun now!)

This program returns "0x4030201", which is rather the opposite of what you might expect.  The mismatch is that we write Arabic numerals with the least-significant digit on the right (just as in Arabic), but we write table entries (and everything else) left to right.

So the CPU reads the first, leftmost table entry (1) as the lowest-valued byte (0x01), which we write on the right side (0x...01).  Similarly, the last table entry (4) is interpreted as the highest-valued byte (0x04), which we write on the left side (0x04...).

But this depends on the CPU!  All x86 CPUs start with the lowest-valued byte (the "little end" of the integer comes first, hence "little endian"), but many other CPUs, such as the PowerPC, MIPS, and SPARC CPUs, start with the highest-valued byte (the "big end" of the integer, hence "big endian").  So this same code above returns 0x01020304 on a PowerPC--try this! 
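You can test which flavor of machine you're on at runtime by storing a known integer and peeking at its first byte in memory; a minimal sketch:

#include <iostream>

int main(void) {
	int x=1; /* bytes in memory: 01 00 00 00 (little endian) or 00 00 00 01 (big endian) */
	if (*(unsigned char *)&x==1)
		std::cout<<"Little endian: the lowest-valued byte comes first\n";
	else
		std::cout<<"Big endian: the highest-valued byte comes first\n";
	return 0;
}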

The big- and little-endian naming confusion exists even in the non-computer world.  Dates written day/month/year, and postal addresses that run from house number up to country, are little-endian: they start with the least-significant information.  Yet ISO dates written year-month-day, and ordinary decimal numerals, are big-endian: they start with the biggest information.
You can see big- and little-endian byte storage going not just from bytes to ints, but also from ints to bytes:
int foo(void) {
	int x=0xa0b0c0d0; /* Integer value we'll pick apart into bytes */
	typedef unsigned char *myTable; /* We'll make it an array of chars */
	myTable table=(myTable)&x; /* point to the bytes of the integer x */
	for (int i=0;i<4;i++) /* print each byte of the integer x */
		std::cout<<std::hex<<(int)table[i]<<" ";
	std::cout<<std::endl;
	return 0;
}

(Try this in NetRun now!)

This code prints "d0 c0 b0 a0" on a little-endian machine--the first byte in memory is the lowest-valued byte, 0xd0.
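If you assemble the integer with shifts and ORs instead of a pointer cast, the result no longer depends on the machine's byte order, which is why portable file and network code is usually written that way.  A sketch, reusing the table from the earlier example:

#include <iostream>

int main(void) {
	const unsigned char table[]={1,2,3,4};
	/* Shifts operate on values, not memory layout, so this gives */
	/* 0x04030201 on both big- and little-endian machines: */
	unsigned int v=table[0]
		|((unsigned int)table[1]<<8)
		|((unsigned int)table[2]<<16)
		|((unsigned int)table[3]<<24);
	std::cout<<std::hex<<"0x"<<v<<"\n"; /* prints 0x4030201 */
	return 0;
}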

Machine Code as Bytes

Here's some x86 machine code encoded into a C++ string, then run on the CPU.  Once it's compiled, the bytes of this string work fine as CPU machine code, just like the char arrays in the homework.  (The six bytes are just "mov eax,7; ret", so the function returns 7.)

const char *fn="\xb8\x07\x00\x00\x00\xc3";
return ((int (*)(void))(fn))();

(Try this in NetRun now!)

Here's the same machine code, encoded into a C++ "long".  Remember that the long is stored in memory little-endian, so the low-order byte 0xb8 comes first in memory!
const static long fn=0xc300000007b8;
return ((int (*)(void))(&fn))();

(Try this in NetRun now!)

Note that newer x86 machines mark the stack with the "NX" (No eXecute) bit to prevent the CPU from executing code there.  This is a useful security feature, but it means the code above crashes without the "const static", which moves fn off the stack and into static data.  Here's one more variant, with the machine code bytes pushed into a std::vector:
std::vector<char> fn;
fn.push_back(0xb8);
fn.push_back(0x07);
fn.push_back(0x00);
fn.push_back(0x00);
fn.push_back(0x00);
fn.push_back(0xc3);
return ((int (*)(void))(&fn[0]))();

(Try this in NetRun now!)

Here, the bytes of machine code go into a std::vector.  This only works on my 32-bit machine; on my 64-bit machine, the heap storage a std::vector allocates is marked NX, so this code crashes rather than running.
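If you really need to run dynamically generated machine code on an NX machine, the fix is to ask the operating system for memory that's explicitly marked executable.  Here's a sketch using the POSIX mmap call (Linux syntax; it assumes the same 6-byte "mov eax,7; ret" x86 code as above):

#include <sys/mman.h>
#include <cstring>
#include <iostream>

int main(void) {
	const char code[]="\xb8\x07\x00\x00\x00\xc3"; /* mov eax,7; ret */
	/* Ask the OS for one page that's readable, writable, AND executable: */
	void *buf=mmap(0,4096,PROT_READ|PROT_WRITE|PROT_EXEC,
	               MAP_PRIVATE|MAP_ANONYMOUS,-1,0);
	if (buf==MAP_FAILED) return 1;
	memcpy(buf,code,sizeof(code)); /* copy the machine code into the page */
	int r=((int (*)(void))buf)();  /* jump to it */
	std::cout<<"Machine code returned "<<r<<"\n"; /* prints 7 */
	munmap(buf,4096);
	return 0;
}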