# Bytes, ASCII, Big and Little Endian Integers

CS 301 Lecture, Dr. Lawlor

## ASCII: Bytes as Characters

The American Standard Code for Information Interchange is a mapping from byte values to printable characters.

You can look up an ASCII code in C++ by converting a single-quoted 'char' constant into an integer, like:
`return '\n';`

(Try this in NetRun now!)

This returns 10 (0xA), because that's the byte the ASCII committee chose to represent a newline.

You can print the whole ASCII table pretty easily, like:
```cpp
for (int i=0;i<200;i++) {
	char c=(char)i;
	std::cout<<" ASCII "<<i<<" = "<<c<<"\n";
}
```

You can think of ASCII codes as decimal or hexadecimal bytes.  It's the same data, deep down.

```
Dec Hex Char        Dec Hex Char     Dec Hex Char     Dec Hex Char
------------------------------------------------------------------
  0  00 NUL '\0'     32  20 SPACE     64  40 @          96  60 `
  1  01 SOH          33  21 !         65  41 A          97  61 a
  2  02 STX          34  22 "         66  42 B          98  62 b
  3  03 ETX          35  23 #         67  43 C          99  63 c
  4  04 EOT          36  24 $         68  44 D         100  64 d
  5  05 ENQ          37  25 %         69  45 E         101  65 e
  6  06 ACK          38  26 &         70  46 F         102  66 f
  7  07 BEL '\a'     39  27 '         71  47 G         103  67 g
  8  08 BS  '\b'     40  28 (         72  48 H         104  68 h
  9  09 HT  '\t'     41  29 )         73  49 I         105  69 i
 10  0A LF  '\n'     42  2A *         74  4A J         106  6A j
 11  0B VT  '\v'     43  2B +         75  4B K         107  6B k
 12  0C FF  '\f'     44  2C ,         76  4C L         108  6C l
 13  0D CR  '\r'     45  2D -         77  4D M         109  6D m
 14  0E SO           46  2E .         78  4E N         110  6E n
 15  0F SI           47  2F /         79  4F O         111  6F o
 16  10 DLE          48  30 0         80  50 P         112  70 p
 17  11 DC1          49  31 1         81  51 Q         113  71 q
 18  12 DC2          50  32 2         82  52 R         114  72 r
 19  13 DC3          51  33 3         83  53 S         115  73 s
 20  14 DC4          52  34 4         84  54 T         116  74 t
 21  15 NAK          53  35 5         85  55 U         117  75 u
 22  16 SYN          54  36 6         86  56 V         118  76 v
 23  17 ETB          55  37 7         87  57 W         119  77 w
 24  18 CAN          56  38 8         88  58 X         120  78 x
 25  19 EM           57  39 9         89  59 Y         121  79 y
 26  1A SUB          58  3A :         90  5A Z         122  7A z
 27  1B ESC          59  3B ;         91  5B [         123  7B {
 28  1C FS           60  3C <         92  5C \ '\\'    124  7C |
 29  1D GS           61  3D =         93  5D ]         125  7D }
 30  1E RS           62  3E >         94  5E ^         126  7E ~
 31  1F US           63  3F ?         95  5F _         127  7F DEL
```
ASCII codes less than 32 ("control characters") or greater than 127 ("high ASCII") show up in various weird ways depending on which machine and web browser you're running.

Here's the same thing, indexed by hex digits, and including all the funny "high ASCII" characters:
```
ASCII x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 xA xB xC xD xE xF
  0x  \0                            \n       \r
  1x
  2x      !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /
  3x   0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
  4x   @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
  5x   P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _
  6x   `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
  7x   p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~
  8x   €     ‚  ƒ  „  …  †  ‡  ˆ  ‰  Š  ‹  Œ     Ž
  9x      ‘  ’  “  ”  •  –  —  ˜  ™  š  ›  œ     ž  Ÿ
  Ax      ¡  ¢  £  ¤  ¥  ¦  §  ¨  ©  ª  «  ¬     ®  ¯
  Bx   °  ±  ²  ³  ´  µ  ¶  ·  ¸  ¹  º  »  ¼  ½  ¾  ¿
  Cx   À  Á  Â  Ã  Ä  Å  Æ  Ç  È  É  Ê  Ë  Ì  Í  Î  Ï
  Dx   Ð  Ñ  Ò  Ó  Ô  Õ  Ö  ×  Ø  Ù  Ú  Û  Ü  Ý  Þ  ß
  Ex   à  á  â  ã  ä  å  æ  ç  è  é  ê  ë  ì  í  î  ï
  Fx   ð  ñ  ò  ó  ô  õ  ö  ÷  ø  ù  ú  û  ü  ý  þ  ÿ
```

ASCII table.  The horizontal axis gives the low hex digit, the vertical axis the high hex digit, and the entry is the character with that byte value.  E.g., "A" is 0x41.

Inside a C++ string, the characters you type are automatically converted from ASCII to binary.  You can enter hex bytes in the middle of a string using a backslash-x: the string "Funky symbol is \x80" ends with a Euro byte (character €, value 0x80).

Possibly the world's most obtuse first program is:
`std::cout<<"\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x21\x0A";`

Here are the bytes inside the "cout" object itself, printed out as ASCII characters.  They look like random garbage.

`cout<<(const char *)&cout;`

(Try this in NetRun now!)

## "sizeof": Get Number of Bytes

Eight bits make a "byte" (note: it's pronounced exactly like "bite", but always spelled with a 'y'), although in some rare networking manuals (and in French) the same eight bits would be called an "octet" (hard drive sizes are in "Go" or "To", Giga-octets or Tera-octets, when sold in French).  In DOS and Windows programming, 16 bits is a "WORD", 32 bits is a "DWORD" (double word), and 64 bits is a "QWORD"; but in other contexts "word" means the machine's natural binary processing size, which ranges from 32 to 64 bits nowadays.  "word" should be considered ambiguous; "bit" and "byte" have the same meaning everywhere.

| Object | Overflow Value | Bits | Hex Digits (4 bits each) | Bytes (8 bits each) |
|---|---|---|---|---|
| Bit | 2 | 1 | less than 1 | less than 1 |
| Byte, "char" | 256 | 8 | 2 | 1 |
| "short" (or Windows WORD) | 65536 | 16 | 4 | 2 |
| "int" (Windows DWORD) | >4 billion | 32 | 8 | 4 |
| "long" (or in 32-bit C++, "long long") | >16 quadrillion | 64 | 16 | 8 |

There's a nice little builtin function in C/C++ called "sizeof" that returns the number of bytes (technically, the number of characters) used by a variable or data type.  Sadly, C/C++ don't specify how many bytes various data types like "int" have, so it depends on the machine:
| | 32-bit x86 (little endian) | 32-bit PowerPC (big endian) | 64-bit x86 or Itanium | Java / C# |
|---|---|---|---|---|
| sizeof(char) | 1 | 1 | 1 | 1 (byte) |
| sizeof(short) | 2 | 2 | 2 | 2 |
| sizeof(int) | 4 | 4 | 4 | 4 |
| sizeof(long) | 4 | 4 | 8 | 8 (no need for long long) |
| sizeof(long long) | 8 | 8 | 8 | (none) |
| sizeof(void *) | 4 | 4 | 8 | (no pointers in Java) |
| sizeof(float) | 4 | 4 | 4 | 4 |
| sizeof(double) | 8 | 8 | 8 | 8 |
| sizeof(long double) | 12 | 8 | 16 | (no long double in Java) |
| Data model | "ILP32" | "ILP32" | "LP64" | (Java "char" is 2 bytes) |
Note that the deciding difference between "32-bit machines" and "64-bit machines" is the size of a pointer--4 or 8 bytes.  "int" is 4 bytes on all modern machines.  "long" is 8 bytes in Java or on a 64-bit machine, but just 4 bytes on a 32-bit machine.

Here's a program that prints out the above:
```cpp
char c; short s; int i; long l; long long ll;
void *v; float f; double d; long double ld;
std::cout<<"sizeof(char)=="<<sizeof(c)<<"\n";
std::cout<<"sizeof(short)=="<<sizeof(s)<<"\n";
std::cout<<"sizeof(int)=="<<sizeof(i)<<"\n";
std::cout<<"sizeof(long)=="<<sizeof(l)<<"\n";
std::cout<<"sizeof(long long)=="<<sizeof(ll)<<"\n";
std::cout<<"sizeof(void *)=="<<sizeof(v)<<"\n";
std::cout<<"sizeof(float)=="<<sizeof(f)<<"\n";
std::cout<<"sizeof(double)=="<<sizeof(d)<<"\n";
std::cout<<"sizeof(long double)=="<<sizeof(ld)<<"\n";
return 0;
```

Try this out on some different machines!  Note that on some Windows compilers, you might need to say "__int64" instead of "long long".  Also note that "long long" has nothing to do with the Chinese concert pianist Lang Lang.

## Big and Little Endian Memory Access

Let's say we ask the CPU to treat four bytes as a single integer, using a typecast like so:
```cpp
const unsigned char table[]={
	1,2,3,4
};
int foo(void) {
	typedef int *myPtr;
	myPtr p=(myPtr)table;
	return p[0];
}
```

(Try this in NetRun now!)

This program returns "0x4030201", which is rather the opposite of what you might expect.  The mismatch here is that we write Arabic numerals with the least-significant digit on the right (effectively right-to-left, just like Arabic script), but we write table entries (and everything else) left-to-right.

So the CPU reads the first, leftmost table entry (1) to get the lowest-valued byte (0x01), which we write on the right side (0x...01).  Similarly, the last table entry (4) is interpreted as the highest-valued byte (0x04), which we write on the left side (0x04...).

But this depends on the CPU!  All x86 CPUs start with the lowest-valued byte (the "little end" of the integer comes first, hence "little endian"), but many other CPUs, such as the PowerPC, MIPS, and SPARC CPUs, start with the highest-valued byte (the "big end" of the integer, hence "big endian").  So this same code above returns 0x01020304 on a PowerPC--try this!

The big- and little-endian naming confusion exists even in the non-computer world.  Consider that the following are all little-endian (starting with the least-significant information):
• John Smith, Carpet Cleaner
• Pittsburgh Technical Institute
Yet the following are all big-endian (starting with the biggest information):
• Dates written year-month-day, like 2024-01-15 (the ISO standard ordering)
• Names listed surname-first, like "Lawlor, Orion"

You can see big- and little-endian byte storage going not just from bytes to ints, but also from ints to bytes:
```cpp
int foo(void) {
	int x=0xa0b0c0d0; /* Integer value we'll pick apart into bytes */
	typedef unsigned char *myTable; /* We'll make it an array of chars */
	myTable table=(myTable)&x; /* point to the bytes of the integer x */
	for (int i=0;i<4;i++) /* print each byte of the integer x */
		std::cout<<std::hex<<(int)table[i]<<" ";
	std::cout<<std::endl;
	return 0;
}
```

(Try this in NetRun now!)

This code prints "d0 c0 b0 a0" on a little-endian machine--the first byte is the lowest-value "0xd0".

## Machine Code as Bytes

Here's some x86 machine code encoded into a C++ string, and run on the CPU.  Once it's compiled, the bytes of this string work fine as CPU machine code, just like the char arrays in the homework.

```cpp
const char *fn="\xb8\x07\x00\x00\x00\xc3"; /* mov eax,7 ; ret */
return ((int (*)(void))(fn))();
```

(Try this in NetRun now!)

Here's the same machine code, encoded into a C++ "long".  Remember that the long is stored in memory little-endian!
```cpp
const static long fn=0xc300000007b8; /* bytes in memory: b8 07 00 00 00 c3 */
return ((int (*)(void))(&fn))();
```

(Try this in NetRun now!)

Note that newer x86 machines mark the stack with the "NX" (No eXecute) bit to prevent the CPU from executing code there.  This is a useful security feature, but it means the above code crashes without the "const static".
```cpp
std::vector<char> fn;
fn.push_back(0xb8);
fn.push_back(0x07);
fn.push_back(0x00);
fn.push_back(0x00);
fn.push_back(0x00);
fn.push_back(0xc3);
return ((int (*)(void))(&fn[0]))();
```

(Try this in NetRun now!)

Here, I'm putting the bytes of machine code into a std::vector.  This only works on my 32-bit machine; on my 64-bit machine, std::vector's storage space is marked NX, so this code crashes rather than running.