CS 411 Fall 2025
Outline & Supplemental Notes
for November 7, 2025
Outline
Huffman Codes [L 9.4]
- Encodings
- An encoding is a way of representing
data as a stream of items from some fixed set.
The items are codewords.
- Our codewords will be strings of bits.
- Possible goals of an encoding method
- Secrecy (limitations on who can read encoded data)
- Authentication (limitations on who can write encoded data)
- Small size
- Generality, system independence
- Possible properties of an encoding method
- Each codeword represents a single symbol OR a codeword can represent multiple symbols.
- Fixed-length (every codeword contains the same number of bits) OR variable-length (different codewords may contain different numbers of bits)
- Prefix-free
- Generally a necessary property for a variable-length encoding.
- There is a correspondence between prefix-free codes and decision trees.
- Suffix-free
- Often not a necessary property.
- Backspace problem.
- UTF-8
- Unicode gives a uniform way of dealing with a very large set of characters, which includes pretty much all of the major character sets used to write human languages.
- Unicode assigns each character a number, its codepoint.
- Unicode is not an encoding. There are a number of standard Unicode encodings. The most important is UTF-8. See the Supplemental Notes.
- Huffman Trees & Codes
- Huffman tree: decision tree whose leaves correspond to elements of a weighted symbol set.
- Greedy algorithm for forming a Huffman tree, given a weighted set of symbols.
- Property of Huffman Trees: for symbols whose frequencies are given by their weights, Huffman tree is decision tree that minimizes average number of decisions.
- Prefix-free code corresponding to a Huffman tree:
Huffman code.
- Huffman codes are generally not suffix-free.
- Property of Huffman Codes: encoded text has shortest average length, among all encodings in which each codeword is a string of bits representing a single symbol.
- Improvements
- Lempel-Ziv: family of encodings, goal is small size, allows for codewords that represent multiple symbols.
- Because better forms of L-Z were patent-encumbered, the DEFLATE algorithm, which combines ideas from L-Z and Huffman coding, became the basis for major compression software (e.g., zip and gzip). It is also used in the PNG image file format.
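The greedy construction of a Huffman tree mentioned in the outline can be sketched as follows. This is a minimal illustration, not the textbook's code: the function name `huffman_code`, the tuple representation of trees, and the sample weights are all assumptions made for this example. The idea is to merge the two lowest-weight trees repeatedly until one tree remains, then read each codeword off the path from the root to a leaf.

```python
import heapq
from itertools import count

def huffman_code(weights):
    """Map each symbol in weights (symbol -> weight) to a bit-string codeword."""
    if not weights:
        return {}
    tiebreak = count()  # unique counter so the heap never compares trees
    # A tree is ("leaf", symbol) or ("node", left_subtree, right_subtree).
    heap = [(w, next(tiebreak), ("leaf", sym)) for sym, w in weights.items()]
    heapq.heapify(heap)
    # Greedy step: merge the two lowest-weight trees until one remains.
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), ("node", t1, t2)))
    _, _, root = heap[0]

    codes = {}
    def walk(tree, prefix):
        if tree[0] == "leaf":
            # A one-symbol alphabet still needs a nonempty codeword.
            codes[tree[1]] = prefix or "0"
        else:
            walk(tree[1], prefix + "0")  # left branch contributes a 0
            walk(tree[2], prefix + "1")  # right branch contributes a 1
    walk(root, "")
    return codes

# Hypothetical weights (percent frequencies) chosen for illustration.
codes = huffman_code({"a": 35, "b": 10, "c": 20, "d": 20, "_": 15})
```

With these weights, the two rarest symbols (`b` and `_`) receive 3-bit codewords and the others receive 2-bit codewords. The exact bit patterns depend on how ties are broken, but the codeword lengths do not in this example.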
Supplemental Notes
UTF-8
Unicode
Unicode (from Universal Coded Character Set) is a standard aimed at consistent handling of characters from any of the writing systems in use throughout the world, as well as text made up of such characters. The standardization effort that led to Unicode began in the late 1980s. The first standard, Unicode 1.0, was released in 1991, with roughly yearly releases in following years. Since 2021 there has been a new Unicode release each September. The most recent version, Unicode 17.0, was released in September 2025 and contains information on over 150,000 characters.
Unicode includes a large number of accented letters, punctuation marks, mathematical symbols, arrows, and various dingbats. For each character, Unicode specifies a reference for its glyph—visual representation—and an official name. Information on conventions for display and reading is included; for example, some languages are written right-to-left. Thus Unicode is a rather large and complex standard.
For our purposes, the most important part of Unicode is relatively simple: it assigns a number to each character; this is the character’s codepoint. For characters in the ASCII set, the codepoint is the same as the ASCII code. For example, the ASCII code for upper-case A is 65; this is also that character’s Unicode codepoint. Other characters may have much larger codepoints.
Here are a few examples. Some of the glyphs below may not display correctly, if your system does not have the proper fonts installed.
| Glyph | Codepoint | Name |
|---|---|---|
| A | 65 | LATIN CAPITAL LETTER A |
| [ | 91 | LEFT SQUARE BRACKET |
| © | 169 | COPYRIGHT SIGN |
| é | 233 | LATIN SMALL LETTER E WITH ACUTE |
| Θ | 920 | GREEK CAPITAL LETTER THETA |
| ש | 1513 | HEBREW LETTER SHIN |
| → | 8594 | RIGHTWARDS ARROW |
| ≤ | 8804 | LESS-THAN OR EQUAL TO |
| ☻ | 9787 | BLACK SMILING FACE |
| ✌ | 9996 | VICTORY HAND |
| ま | 12414 | HIRAGANA LETTER MA |
| fi | 64257 | LATIN SMALL LIGATURE FI |
| 😿 | 128575 | CRYING CAT FACE |
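The correspondence between characters and codepoints in the table can be checked in Python, whose built-in `ord` and `chr` convert between a one-character string and its codepoint, and whose standard `unicodedata` module looks up official names. The variable names `samples` and `names` are just for this sketch.

```python
import unicodedata

# A few (glyph, codepoint) pairs taken from the table above.
samples = [("A", 65), ("Θ", 920), ("→", 8594), ("ま", 12414), ("😿", 128575)]

for glyph, codepoint in samples:
    assert ord(glyph) == codepoint   # character -> codepoint
    assert chr(codepoint) == glyph   # codepoint -> character

# Official Unicode names, looked up from the character itself.
names = {glyph: unicodedata.name(glyph) for glyph, _ in samples}
```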
The UTF-8 Encoding
Unicode is not an encoding. However, the Unicode standard includes descriptions of a number of encodings. The most important is UTF-8 (for Unicode Transformation Format—8-bit).
UTF-8 is a variable-length encoding that represents each character with a single codeword. All codewords are sequences of bits whose length is divisible by 8. The shortest codewords have 8 bits, and the longest have 32. Codewords of up to 48 bits have been defined, but those longer than 32 bits are not currently used.
Here is the format of the UTF-8 codewords,
including those specified but currently unused.
Each “b” represents an arbitrary bit.
| Length | Format |
|---|---|
| 8 bits | 0bbbbbbb |
| 16 bits | 110bbbbb 10bbbbbb |
| 24 bits | 1110bbbb 10bbbbbb 10bbbbbb |
| 32 bits | 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb |
| 40 bits (unused) | 111110bb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb |
| 48 bits (unused) | 1111110b 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb |
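One consequence of the formats in the table is that the first byte of a codeword reveals the codeword's length, and every later byte begins with 10. The sketch below reads the length from the leading bits. The function names `codeword_length` and `is_continuation` are made up for this example, and the code only matches bit patterns for the four formats in use; it does not check that the bytes form valid UTF-8.

```python
def codeword_length(first_byte):
    """Return the number of bytes in the codeword that begins with first_byte."""
    if first_byte & 0b10000000 == 0b00000000:   # 0bbbbbbb
        return 1
    if first_byte & 0b11100000 == 0b11000000:   # 110bbbbb
        return 2
    if first_byte & 0b11110000 == 0b11100000:   # 1110bbbb
        return 3
    if first_byte & 0b11111000 == 0b11110000:   # 11110bbb
        return 4
    raise ValueError("not the first byte of a codeword in current use")

def is_continuation(byte):
    """True if byte has the form 10bbbbbb (a non-initial byte of a codeword)."""
    return byte & 0b11000000 == 0b10000000
```

For instance, `codeword_length("Θ".encode("utf-8")[0])` is 2, matching the 16-bit format.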
For example, consider the capital theta
(“Θ”).
This character has codepoint 920, or 1110011000 in binary.
The codepoint is a 10-bit number,
so we cannot represent it using the 8-bit format,
as this only allows for 7-bit codepoints
(there are 7 bs in the 8-bit format).
However, the 16-bit format has 11 bs, which is sufficient.
We pad the codepoint with a leading zero to 11 bits (01110011000)
and replace each b in the format
with the corresponding digit.
Format: 110bbbbb 10bbbbbb
Digits: 01110 011000
And here is the result.
11001110 10011000
The 24-bit format has 16 bs,
so we could conceivably represent a capital theta
using a 24-bit codeword, too.
However, UTF-8 specifies that only the shortest possible codeword
is used;
others are overlong and are considered invalid.
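The procedure in the capital-theta example, choosing the shortest format that fits and then filling in the b positions, can be sketched in Python. The function name `utf8_encode` is an assumption for this example; in practice one would simply call `str.encode("utf-8")`. The sketch covers only the four formats in use, relies on the fact that Unicode codepoints do not exceed 0x10FFFF, and does not reject the surrogate range, which a conforming encoder would.

```python
def utf8_encode(codepoint):
    """Encode one codepoint as bytes, using the shortest format that fits."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("codepoint outside the Unicode range")
    if codepoint < 2**7:        # fits in the 7 b's of the 8-bit format
        return bytes([codepoint])
    if codepoint < 2**11:       # fits in the 11 b's of the 16-bit format
        return bytes([0b11000000 | (codepoint >> 6),
                      0b10000000 | (codepoint & 0b00111111)])
    if codepoint < 2**16:       # fits in the 16 b's of the 24-bit format
        return bytes([0b11100000 | (codepoint >> 12),
                      0b10000000 | ((codepoint >> 6) & 0b00111111),
                      0b10000000 | (codepoint & 0b00111111)])
    # Otherwise use the 32-bit format (21 b's).
    return bytes([0b11110000 | (codepoint >> 18),
                  0b10000000 | ((codepoint >> 12) & 0b00111111),
                  0b10000000 | ((codepoint >> 6) & 0b00111111),
                  0b10000000 | (codepoint & 0b00111111)])

theta = utf8_encode(920)
theta_bits = " ".join(f"{byte:08b}" for byte in theta)  # "11001110 10011000"
```

Because each branch is tried in order of increasing length, the function never produces an overlong codeword.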
Note that, for characters with codepoints from 0 to 127, the codeword is simply the binary representation of the codepoint. In other words, ASCII characters are represented as always. This makes UTF-8 backward-compatible with ASCII.
Properties of UTF-8
- Supports the entire Unicode character set.
- Backward-compatible with ASCII.
- Variable-length.
- Prefix-free.
- Suffix-free (so backspace is constant-time).
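The backspace remark can be illustrated as follows. Since every non-initial byte of a codeword has the form 10bbbbbb, the start of the last character is found by stepping backward over at most three such bytes, no matter how long the text is. The function name `backspace` is hypothetical, and the sketch assumes its input is valid UTF-8.

```python
def backspace(data):
    """Return data (valid UTF-8 bytes) with its last character removed."""
    if not data:
        return data
    i = len(data) - 1
    # Step back over continuation bytes (10bbbbbb) to the codeword's first byte.
    # For valid UTF-8 this loop runs at most three times.
    while i > 0 and data[i] & 0b11000000 == 0b10000000:
        i -= 1
    # Slicing copies the bytes; only locating the boundary is constant-time.
    # A mutable bytearray could be truncated in place instead.
    return data[:i]
```

For example, `backspace("aΘ".encode("utf-8"))` yields `b"a"`: the two bytes of the theta are removed together.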