

CS 411 Fall 2025
Outline & Supplemental Notes
for November 7, 2025

Outline

Huffman Codes [L 9.4]

Supplemental Notes

UTF-8

Unicode

Unicode (from Universal Coded Character Set) is a standard aimed at consistent handling of characters from any of the writing systems in use throughout the world, as well as text made up of such characters. The standardization effort that led to Unicode began in the late 1980s. The first standard, Unicode 1.0, was released in 1991, with roughly yearly releases in following years. Since 2021 there has been a new Unicode release each September. The most recent version, Unicode 17.0, was released in September 2025 and contains information on over 150,000 characters.

Unicode includes a large number of accented letters, punctuation marks, mathematical symbols, arrows, and various dingbats. For each character, Unicode specifies a reference for its glyph—visual representation—and an official name. Information on conventions for display and reading is included; for example, some languages are written right-to-left. Thus Unicode is a rather large and complex standard.

For our purposes, the most important part of Unicode is relatively simple: it assigns a number to each character; this is the character’s codepoint. For characters in the ASCII set, the codepoint is the same as the ASCII code. For example, the ASCII code for upper-case A is 65; this is also that character’s Unicode codepoint. Other characters may have much larger codepoints.

Here are a few examples. Some of the glyphs below may not display correctly if your system does not have the proper fonts installed.

Glyph Codepoint Name
A 65  LATIN CAPITAL LETTER A
[ 91  LEFT SQUARE BRACKET
© 169  COPYRIGHT SIGN
é 233  LATIN SMALL LETTER E WITH ACUTE
Θ 920  GREEK CAPITAL LETTER THETA
ש 1513  HEBREW LETTER SHIN
→ 8594  RIGHTWARDS ARROW
≤ 8804  LESS-THAN OR EQUAL TO
☻ 9787  BLACK SMILING FACE
✌ 9996  VICTORY HAND
ま 12414  HIRAGANA LETTER MA
ﬁ 64257  LATIN SMALL LIGATURE FI
😿 128575  CRYING CAT FACE
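
The correspondence between characters, codepoints, and names can be checked directly. The following is a minimal Python sketch, not part of the Unicode standard itself; the helper name describe is our own, and the names returned depend on the Unicode database that ships with your Python version.

```python
import unicodedata

def describe(ch: str) -> tuple[int, str]:
    """Return (codepoint, official Unicode name) for a single character."""
    return ord(ch), unicodedata.name(ch)

# A few rows from the table above.
for ch in ["A", "©", "Θ", "😿"]:
    codepoint, name = describe(ch)
    print(ch, codepoint, name)

# chr() goes the other way: from codepoint to character.
print(chr(920))
```

Note that ord and chr deal only with codepoints. Nothing here says anything yet about how a character is stored as bits.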

The UTF-8 Encoding

Unicode is not an encoding. However, the Unicode standard includes descriptions of a number of encodings. The most important is UTF-8 (for Unicode Transformation Format—8-bit).

UTF-8 is a variable-length encoding that represents each character with a single codeword. All codewords are sequences of bits whose length is divisible by 8. The shortest codewords have 8 bits, and the longest have 32. Codewords of up to 48 bits have been defined, but those longer than 32 bits are not currently used.

Here is the format of the UTF-8 codewords, including those specified but currently unused. Each “b” represents an arbitrary bit.

Length Format
8 bits 0bbbbbbb
16 bits 110bbbbb 10bbbbbb
24 bits 1110bbbb 10bbbbbb 10bbbbbb
32 bits 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
40 bits (unused) 111110bb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb
48 bits (unused) 1111110b 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb
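
Counting the bs in each format tells us how large a codepoint each codeword length can hold: 7, 11, 16, and 21 bits for the four lengths in use. As a rough sketch (the names FORMATS and codeword_bits are ours, and the sketch ignores the additional restrictions Unicode places on which codepoints are valid), we can pick the shortest format whose bs are enough:

```python
# (codeword length in bits, number of "b" payload bits) for the formats in use
FORMATS = [(8, 7), (16, 11), (24, 16), (32, 21)]

def codeword_bits(codepoint: int) -> int:
    """Return the length, in bits, of the shortest format that fits codepoint."""
    if codepoint < 0:
        raise ValueError("codepoints are nonnegative")
    for length, payload in FORMATS:
        if codepoint < (1 << payload):  # fits in `payload` bits?
            return length
    raise ValueError("codepoint does not fit in a 32-bit codeword")

for cp in [65, 920, 8594, 128575]:
    print(cp, codeword_bits(cp))
```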

For example, consider the capital theta (“Θ”). This character has codepoint 920, or 1110011000 in binary.

The codepoint is a 10-bit number, so we cannot represent it using the 8-bit format, as this only allows for 7-bit codepoints (there are 7 bs in the 8-bit format).

However, the 16-bit format has 11 bs, which is sufficient. We pad the codepoint with a leading zero to get 11 bits (01110011000), and then replace each b in the format with the appropriate digit.

110bbbbb 10bbbbbb
   01110   011000

And here is the result.

11001110 10011000
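
The same steps can be written out in code. The following is a minimal sketch of an encoder for a single codepoint, assuming only the four formats currently in use; the function name utf8_encode is ours, and the sketch does not check the further restrictions Unicode places on valid codepoints. We compare its result against Python's built-in encoder.

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one codepoint using the shortest of the four formats in use."""
    if codepoint < 0:
        raise ValueError("codepoints are nonnegative")
    if codepoint < 0x80:                 # 7 bits or fewer: 0bbbbbbb
        return bytes([codepoint])
    # (payload bits, prefix of first byte, number of 10bbbbbb bytes)
    multibyte = [(11, 0b11000000, 1), (16, 0b11100000, 2), (21, 0b11110000, 3)]
    for payload, prefix, ncont in multibyte:
        if codepoint < (1 << payload):
            out = []
            for _ in range(ncont):       # fill 10bbbbbb bytes, last one first
                out.append(0b10000000 | (codepoint & 0b00111111))
                codepoint >>= 6
            out.append(prefix | codepoint)  # remaining high bits go in the first byte
            return bytes(reversed(out))
    raise ValueError("codepoint does not fit in a 32-bit codeword")

theta = utf8_encode(920)
print(" ".join(f"{byte:08b}" for byte in theta))  # 11001110 10011000
print(theta == "Θ".encode("utf-8"))               # True
```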

The 24-bit format has 16 bs, so we could conceivably represent a capital theta using a 24-bit codeword, too. However, UTF-8 specifies that only the shortest possible codeword is used; others are overlong and are considered invalid.
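
To see this rule in action, we can build the overlong 24-bit codeword for capital theta by hand and give it to a decoder. The sketch below relies on Python's built-in strict UTF-8 decoder, which raises UnicodeDecodeError on invalid input; the helper name is_valid_utf8 is ours.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict decoder accepts data as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Codepoint 920 padded to 16 bits is 0000 001110 011000.
# Placed in the 24-bit format 1110bbbb 10bbbbbb 10bbbbbb:
overlong_theta = bytes([0b11100000, 0b10001110, 0b10011000])
shortest_theta = bytes([0b11001110, 0b10011000])

print(is_valid_utf8(overlong_theta))  # False: overlong codeword is rejected
print(is_valid_utf8(shortest_theta))  # True
```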

Note that, for characters with codepoints from 0 to 127, the codeword is simply the 8-bit binary representation of the codepoint: a leading 0 followed by the 7-bit value. In other words, ASCII characters are represented exactly as they are in ASCII. This makes UTF-8 backward-compatible with ASCII.
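
A quick check of this backward compatibility, as a small sketch using Python's built-in encoders (the sample string is arbitrary):

```python
text = "Huffman Codes [L 9.4]"   # any string of ASCII characters

as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")

print(as_ascii == as_utf8)               # True: identical bytes
print(all(byte < 128 for byte in as_utf8))  # True: every byte has the form 0bbbbbbb
```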

Properties of UTF-8