Bytes, ASCII, and UTF-8

CS 301 Lecture, Dr. Lawlor

ASCII: Bytes as Characters

The American Standard Code for Information Interchange (ASCII, pronounced "ask-ee") is a mapping from byte values to printable characters.  It's nearly universal, with the exception of IBM mainframes, which use EBCDIC (pronounced "eb-sah-dic"), which has the weird property that A-Z aren't contiguous because it's based on the even more ancient binary-coded decimal.

You can look up an ASCII code in C++ by converting a single-quoted 'char' constant into an integer, like:
return 'A';

(Try this in NetRun now!)

This returns 65 (0x41), because that's the byte the ASCII committee chose to represent a capital letter A.

You can print the whole ASCII table pretty easily, like:

for (int i=0;i<200;i++) {
    char c=(char)i; // reinterpret the integer as a character
    std::cout<<" ASCII "<<i<<" = "<<c<<"\n";
}

(Try this in NetRun now!)

You can think of ASCII codes as decimal or hexadecimal bytes.  It's the same data, deep down.

Dec   Hex   Char
--------------------
0 00 NUL '\0'
1 01 SOH
2 02 STX
3 03 ETX
4 04 EOT
5 05 ENQ
6 06 ACK
7 07 BEL '\a'
8 08 BS '\b'
9 09 HT '\t'
10 0A LF '\n'
11 0B VT '\v'
12 0C FF '\f'
13 0D CR '\r'
14 0E SO
15 0F SI
16 10 DLE
17 11 DC1
18 12 DC2
19 13 DC3
20 14 DC4
21 15 NAK
22 16 SYN
23 17 ETB
24 18 CAN
25 19 EM
26 1A SUB
27 1B ESC
28 1C FS
29 1D GS
30 1E RS
31 1F US
Dec   Hex   Char
-----------------
32 20 SPACE
33 21 !
34 22 "
35 23 #
36 24 $
37 25 %
38 26 &
39 27 '
40 28 (
41 29 )
42 2A *
43 2B +
44 2C ,
45 2D -
46 2E .
47 2F /
48 30 0
49 31 1
50 32 2
51 33 3
52 34 4
53 35 5
54 36 6
55 37 7
56 38 8
57 39 9
58 3A :
59 3B ;
60 3C <
61 3D =
62 3E >
63 3F ?
Dec   Hex   Char
--------------------
64 40 @
65 41 A
66 42 B
67 43 C
68 44 D
69 45 E
70 46 F
71 47 G
72 48 H
73 49 I
74 4A J
75 4B K
76 4C L
77 4D M
78 4E N
79 4F O
80 50 P
81 51 Q
82 52 R
83 53 S
84 54 T
85 55 U
86 56 V
87 57 W
88 58 X
89 59 Y
90 5A Z
91 5B [
92 5C \ '\\'
93 5D ]
94 5E ^
95 5F _
Dec   Hex   Char
----------------
96 60 `
97 61 a
98 62 b
99 63 c
100 64 d
101 65 e
102 66 f
103 67 g
104 68 h
105 69 i
106 6A j
107 6B k
108 6C l
109 6D m
110 6E n
111 6F o
112 70 p
113 71 q
114 72 r
115 73 s
116 74 t
117 75 u
118 76 v
119 77 w
120 78 x
121 79 y
122 7A z
123 7B {
124 7C |
125 7D }
126 7E ~
127 7F DEL

ASCII less than 32 ("control characters") or greater than 127 ("high ASCII") show up in various weird ways depending on which machine and web browser you're running.

Here's the same thing, indexed by hex digits:

ASCII  x0  x1  x2  x3  x4  x5  x6  x7  x8  x9  xA  xB  xC  xD  xE  xF
0x     \0                          \a  \b  \t  \n  \v  \f  \r
1x     (unprintable control characters; ESC is 0x1B)
2x         !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3x     0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4x     @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5x     P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6x     `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7x     p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL
8x     €       ‚   ƒ   „   …   †   ‡   ˆ   ‰   Š   ‹   Œ       Ž       (not ASCII: Windows-1252)

ASCII table.  Horizontal axis gives the low hex digit, vertical axis the high hex digit, and each entry is the character for that byte.  E.g., "A" is 0x41.

Inside a C++ string, the characters you type are automatically converted from ASCII to binary bytes.  You can enter hex bytes in the middle of a string using a backslash-x: the string "Funky symbol is \x80" ends with the byte value 0x80, which displays as a Euro sign € on machines using the Windows-1252 code page (it's not valid ASCII, and not valid UTF-8 either).

Possibly the world's most obtuse first program is:

std::cout<<"\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x21\x0A";

(Try this in NetRun now!)

This just prints each char using its ASCII hex value instead of its actual ASCII character.

Reading a Number in ASCII

If you call the C function getchar() (from <cstdio>), you read the next char from stdin as its ASCII value.  It returns an int rather than a char, so the special end-of-input value EOF (usually -1) can't be confused with a real byte.

#include <cstdio>
long number=0;
for (int place=0;place<16;place++)
{
    int c=getchar(); // int, so EOF fits
    if (c==EOF) break; // the input is over
    if (c>='0' && c<='9')
    { // We have a normal digit
        int value=c-'0'; // ASCII to numeric value
        number = number*10 + value; // shift old places up, add new value
    }
    else // unknown char
        break;
}
return number;
(Try this in NetRun now!)


UTF-8: Supporting the World's Languages

ASCII alone is a US standard, and doesn't even support European languages (which is not schön).  Initially an attempt was made to support each language with its own "code page" describing the high ASCII (>=128) chars, but this doesn't have enough space for most non-European languages.

Unicode is how we encode the entire world's languages in a single character system.  This requires a *lot* of characters; they initially thought 16 bits would be enough, but there turn out to be a lot of languages in the world.  Microsoft standardized early on "UTF-16" using 16-bit short ints to store Unicode characters, so they have "ASCII" and "Unicode" versions of every function, have two incompatible types of "text files", and spend a lot of time converting back and forth between normal strings and "wide" strings.

By contrast, UTF-8 supports full Unicode using only a string of bytes, which lets UTF-8 be interspersed with normal ASCII easily.  This is the newer standard increasingly used everywhere (97.8% of all web pages as of September 2022).

This is the byte layout of UTF-8.  Note that ASCII (<128) lives happily as a single byte as before.
Layout of UTF-8 byte sequences
Bytes  First code point  Last code point  Byte 1    Byte 2    Byte 3    Byte 4
1      U+0000            U+007F           0xxxxxxx
2      U+0080            U+07FF           110xxxxx  10xxxxxx
3      U+0800            U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
4      U+10000           U+10FFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
True text processing like wrapping text is still complicated, but string concatenation (like for filenames or user interface messages) is super easy in UTF-8.  If some interface hands you a byte string today, you should assume it's UTF-8 unless told otherwise.