Bytes, ASCII, and UTF-8

CS 301 Lecture, Dr. Lawlor

ASCII: Bytes as Characters

The American Standard Code for Information Interchange (ASCII, pronounced "ask-ee") is a mapping from byte values to printable characters.  It's nearly universal, with the exception of IBM mainframes, which use EBCDIC (pronounced "eb-sah-dic"), which has the weird property that A-Z aren't contiguous because it's based on the even more ancient binary-coded decimal.

You can look up an ASCII code in C++ by converting a single-quoted 'char' constant into an integer, like:
return 'A';

(Try this in NetRun now!)

This returns 65 (0x41), because that's the byte the ASCII committee chose to represent a capital letter A.

You can print the whole ASCII table pretty easily, like:

for (int i=0;i<200;i++) {
    char c=(char)i; // reinterpret the integer as a character
    std::cout<<" ASCII "<<i<<" = "<<c<<"\n";
}

(Try this in NetRun now!)

You can think of ASCII codes as decimal or hexadecimal bytes.  It's the same data, deep down.

Dec   Hex   Char
--------------------
0 00 NUL '\0'
1 01 SOH
2 02 STX
3 03 ETX
4 04 EOT
5 05 ENQ
6 06 ACK
7 07 BEL '\a'
8 08 BS '\b'
9 09 HT '\t'
10 0A LF '\n'
11 0B VT '\v'
12 0C FF '\f'
13 0D CR '\r'
14 0E SO
15 0F SI
16 10 DLE
17 11 DC1
18 12 DC2
19 13 DC3
20 14 DC4
21 15 NAK
22 16 SYN
23 17 ETB
24 18 CAN
25 19 EM
26 1A SUB
27 1B ESC
28 1C FS
29 1D GS
30 1E RS
31 1F US
Dec   Hex   Char
-----------------
32 20 SPACE
33 21 !
34 22 "
35 23 #
36 24 $
37 25 %
38 26 &
39 27 '
40 28 (
41 29 )
42 2A *
43 2B +
44 2C ,
45 2D -
46 2E .
47 2F /
48 30 0
49 31 1
50 32 2
51 33 3
52 34 4
53 35 5
54 36 6
55 37 7
56 38 8
57 39 9
58 3A :
59 3B ;
60 3C <
61 3D =
62 3E >
63 3F ?
Dec   Hex   Char
--------------------
64 40 @
65 41 A
66 42 B
67 43 C
68 44 D
69 45 E
70 46 F
71 47 G
72 48 H
73 49 I
74 4A J
75 4B K
76 4C L
77 4D M
78 4E N
79 4F O
80 50 P
81 51 Q
82 52 R
83 53 S
84 54 T
85 55 U
86 56 V
87 57 W
88 58 X
89 59 Y
90 5A Z
91 5B [
92 5C \ '\\'
93 5D ]
94 5E ^
95 5F _
Dec   Hex   Char
----------------
96 60 `
97 61 a
98 62 b
99 63 c
100 64 d
101 65 e
102 66 f
103 67 g
104 68 h
105 69 i
106 6A j
107 6B k
108 6C l
109 6D m
110 6E n
111 6F o
112 70 p
113 71 q
114 72 r
115 73 s
116 74 t
117 75 u
118 76 v
119 77 w
120 78 x
121 79 y
122 7A z
123 7B {
124 7C |
125 7D }
126 7E ~
127 7F DEL

ASCII less than 32 ("control characters") or greater than 127 ("high ASCII") show up in various weird ways depending on which machine and web browser you're running.

Here's the same thing, indexed by hex digits:

ASCII  x0  x1  x2  x3  x4  x5  x6  x7  x8  x9  xA  xB  xC  xD  xE  xF
0x     \0                          \a  \b  \t  \n  \v  \f  \r
1x     (unprintable control characters; ESC is 0x1B)
2x         !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3x     0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4x     @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5x     P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6x     `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7x     p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL
8x     €       ‚   ƒ   „   …   †   ‡   ˆ   ‰   Š   ‹   Œ       Ž       (not ASCII: Windows-1252)

ASCII table.  Horizontal axis gives the low hex digit, vertical axis the high hex digit, and each entry is the character for that byte.  E.g., "A" is 0x41.

Inside a C++ string, the characters you type are automatically converted from ASCII to binary bytes.  You can enter hex bytes in the middle of a string using a backslash-x: the string "Funky symbol is \x80" ends with the byte value 0x80, which displays as a Euro sign € on machines using the Windows-1252 code page (it's not valid ASCII, and not valid UTF-8 either).

Possibly the world's most obtuse first program is:

std::cout<<"\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x21\x0A";

(Try this in NetRun now!)

This just prints each char using its ASCII hex value instead of its actual ASCII character.

Reading a Number in ASCII

If you call the C function getchar() (from <cstdio>), you read the next char from stdin as its ASCII value.  It returns an int rather than a char, so the special end-of-input value EOF (usually -1) can't be confused with a real byte.

#include <cstdio>
long number=0;
for (int place=0;place<16;place++)
{
    int c=getchar(); // int, so EOF fits
    if (c==EOF) break; // the input is over
    if (c>='0' && c<='9')
    { // We have a normal digit
        int value=c-'0'; // ASCII to numeric value
        number = number*10 + value; // shift old places up, add new value
    }
    else // unknown char
        break;
}
return number;
(Try this in NetRun now!)


UTF-8: Supporting the World's Languages

ASCII alone is a US standard, and doesn't even support European languages (which is not schön).  Initially an attempt was made to support each language with its own "code page" describing the high ASCII (>=128) chars, but this doesn't have enough space for most non-European languages.

Unicode is how we encode the entire world's languages in a single character system.  This requires a *lot* of characters; they initially thought 16 bits would be enough, but there turn out to be a lot of languages in the world.  Microsoft standardized early on "UTF-16" using 16-bit short ints to store Unicode characters, so they have "ASCII" and "Unicode" versions of every function, have two incompatible types of "text files", and spend a lot of time converting back and forth between normal strings and "wide" strings.

By contrast, UTF-8 supports full Unicode using only a string of bytes, which lets UTF-8 be interspersed with normal ASCII easily.  This is the newer standard increasingly used everywhere (97.8% of all web pages as of September 2022).

This is the byte layout of UTF-8.  Note that ASCII (<128) lives happily as a single byte as before.
Layout of UTF-8 byte sequences
Bytes  First code point  Last code point  Byte 1    Byte 2    Byte 3    Byte 4
1      U+0000            U+007F           0xxxxxxx
2      U+0080            U+07FF           110xxxxx  10xxxxxx
3      U+0800            U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
4      U+10000           U+10FFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
True text processing like wrapping text is still complicated, but string concatenation (like for filenames or user interface messages) is super easy in UTF-8.  If some interface hands you a byte string today, you should assume it's UTF-8 unless told otherwise.