CS 411 Fall 2025
Outline & Supplemental Notes
for October 20, 2025
Outline
Hashing [L 7.3]
- Hashing basics
  - Second way of storing a dictionary: structure holding associative data. (1st way was balanced search tree, e.g., Red-Black Tree.)
  - Example of prestructuring.
  - Idea: Hash function, given key, returns hash address. Store key-value pairs in an array, called a Hash Table, indexed by hash address.
  - Two keys with the same hash address: collision.
  - Hash function needs to:
    - Be computable quickly.
    - Spread data around evenly, as much as possible.
    - And maybe (esp. when using open hashing): avoid patterned output when given patterned input. (See the Supplemental Notes.)
 
- Two major categories of collision-resolution methods.
  - Open hashing. Array item can hold arbitrarily many key-value pairs.
  - Closed hashing (a.k.a. open addressing). Array item holds single key-value pair. If collision, look for other place to store key.
 
 
- Open hashing
  - Array item is a bucket that can hold arbitrarily many key-value pairs. Buckets are almost always singly linked lists: separate chaining.
  - Load factor: \(\alpha\) = number of items / number of buckets. Want \(\alpha\) to stay well below \(1\).
  - Higher load factor means less efficient searches.
  - When load factor gets too large: rehashing.
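The open-hashing bullets above can be sketched as a minimal separate-chaining Hash Table. This is my own illustrative sketch, not code from the text; the class name, the 0.75 rehash threshold, and the use of `std::forward_list` for the singly linked buckets are all my choices.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Minimal separate-chaining Hash Table: each array item is a bucket
// holding arbitrarily many key-value pairs in a singly linked list.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t buckets = 8)
        : _data(buckets), _size(0) {}

    // Insert or update. Rehashes first if the load factor would
    // get too high (0.75 is an arbitrary illustrative threshold).
    void insert(const std::string & key, int value)
    {
        if (int * v = find(key)) { *v = value; return; }  // Update
        if (loadFactor() >= 0.75)
            rehash(2 * _data.size());
        _data[index(key)].push_front({ key, value });
        ++_size;
    }

    // Return pointer to value for key, or nullptr if absent.
    int * find(const std::string & key)
    {
        for (auto & kv : _data[index(key)])
            if (kv.first == key) return &kv.second;
        return nullptr;
    }

    // alpha = number of items / number of buckets
    double loadFactor() const
    { return double(_size) / double(_data.size()); }

private:
    using Bucket = std::forward_list<std::pair<std::string, int>>;

    std::size_t index(const std::string & key) const
    { return std::hash<std::string>{}(key) % _data.size(); }

    // Rehashing is Theta(n): every stored key is hashed again.
    void rehash(std::size_t newBucketCount)
    {
        std::vector<Bucket> old(newBucketCount);
        old.swap(_data);  // _data is now empty, with more buckets
        for (auto & bucket : old)
            for (auto & kv : bucket)
                _data[index(kv.first)].push_front(kv);
    }

    std::vector<Bucket> _data;
    std::size_t _size;
};
```

Note how `insert` keeps \(\alpha\) below the threshold by doubling the bucket count, matching the "rehashing" bullet above.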
 
- Closed hashing (= open addressing)
  - Array item holds single key-value pair. Can also be marked as empty or as deleted.
  - If hash address for a key is taken, look elsewhere. These searches are probes. The list of probes that may be done is the probe sequence. (See the Supplemental Notes.)
 
- Analysis
  (See the Supplemental Notes.)
  - Worst case for CRUD operations: \(\Theta(n)\) for each.
  - Average case for CRUD operations in well-written Hash Table: amortized constant time for insert (“C”)—due to rehashing—and constant time for others.
  - Compare: balanced search trees have \(\Theta(\log n)\) worst- and average-case time for all CRUD operations.
 
- Hashing in Practice
  - Good option for external associative data: in-memory Hash Table indexing external buckets. Extendible hashing.
  - (See the Supplemental Notes.)
 
Supplemental Notes
Good Hash Functions
The text lists two important properties of a good hash function.
- It is computable quickly.
- Its values are spread evenly among the possible values.
I will add a third that is sometimes important.
- It avoids producing patterned output when given patterned input.
Real-world input often has structure. When we use open hashing, we do not want such structure to cause many keys to get the same hash address. When we use closed hashing, this third property can be less important; we might just let a good probe sequence (see below) deal with the problem.
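To see why this third property matters, here is a small sketch of my own (the division hash, strides, and table size are illustrative choices, not from the text): with the common hash function \(k \bmod m\), patterned keys whose stride shares a factor with \(m\) pile into a few buckets, while a stride coprime to \(m\) spreads them out.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Simple division-method hash function.
std::size_t hashAddr(std::size_t key, std::size_t tableSize)
{ return key % tableSize; }

// Count how many distinct hash addresses the patterned key set
// { 0, stride, 2*stride, ... } occupies in a table of the given size.
std::size_t distinctAddrs(std::size_t stride, std::size_t count,
                          std::size_t tableSize)
{
    std::set<std::size_t> addrs;
    for (std::size_t i = 0; i < count; ++i)
        addrs.insert(hashAddr(i * stride, tableSize));
    return addrs.size();
}
```

With a table of size 100, one hundred keys with stride 10 occupy only 10 distinct addresses (heavy collisions), while one hundred keys with stride 7, which is coprime to 100, occupy all 100.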
Probe Sequences
Linear probing means that, if \(k\) is the hash address, then the probe sequence is
\[ k, k+1, k+2, k+3, k+4, \dots. \]
An important probe sequence that the text does not mention is quadratic probing. In this probe sequence, we add consecutive squares to the hash address:
\[ k, k+1, k+4, k+9, k+16, \dots. \]
Quadratic probing greatly reduces the formation of clusters.
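Both probe sequences can be generated mechanically. A small sketch (the function names are mine; indices wrap modulo the table size, which the formulas above leave implicit):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// First `count` indices of the linear probe sequence
// k, k+1, k+2, ... in a table of the given size.
std::vector<std::size_t> linearProbes(std::size_t k, std::size_t count,
                                      std::size_t tableSize)
{
    std::vector<std::size_t> seq;
    for (std::size_t i = 0; i < count; ++i)
        seq.push_back((k + i) % tableSize);
    return seq;
}

// First `count` indices of the quadratic probe sequence
// k, k+1, k+4, k+9, ... in a table of the given size.
std::vector<std::size_t> quadraticProbes(std::size_t k, std::size_t count,
                                         std::size_t tableSize)
{
    std::vector<std::size_t> seq;
    for (std::size_t i = 0; i < count; ++i)
        seq.push_back((k + i * i) % tableSize);
    return seq;
}
```

For example, in a table of size 11 with hash address 3, linear probing examines indices 3, 4, 5, 6, 7, … while quadratic probing examines 3, 4, 7, 1, 8, … — the quadratic version quickly jumps away from the initial cluster.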
Analysis
CRUD Efficiency
I would like to be more precise than the text about the time efficiency of hash-table CRUD operations.
First, consider worst-case time. There are generally many more possible keys than there are locations in the table. It is thus possible that all keys will be given the same hash address. If separate chaining is used, this means that every key will be stored in the same bucket, and so a search for a given key may require looking at every key in the table. If closed hashing is used, this means that a search for a given key may require a probe of every item in the table. In either case, we see that all CRUD operations are \(\Theta(n)\) time.
Now consider average-case time. Suppose for the moment that we are using separate chaining. An unsuccessful search averages \(\alpha\) comparisons, while a successful search averages about \(1+\frac{\alpha}{2}\) comparisons. In either case, it is constant-time on average, assuming we keep the value of \(\alpha\) low. That takes care of the Read and Update operations. The Delete operation requires the additional operation of removing a node from a linked list, after the node has been found; it is also constant-time on average.
Inserting a new key (the Create operation) is trickier. This may raise \(\alpha\) unacceptably high, resulting in rehashing, which requires calling the hash function for every key in the table, a \(\Theta(n)\) operation. Since \(\alpha\) depends only on the number of items in the table, and not on their values, the Create operation is linear-time even on average.
Just as with inserting into a smart array, the linear-time operation usually does not need to happen very often. In a well-written Hash Table, the average time for a large number of consecutive insertions of average data is constant. That is, for repeated insertions, Create is amortized constant-time for average data.
It should be noted that the previous sentence is only true if both kinds of averages are done. The Create operation is not amortized constant-time, since every key may lie in the same bucket, making every insertion require a linear number of steps. The Create operation is also not constant-time for average data, as even average data will eventually require rehashing, a linear-time operation. However, Create is amortized constant-time for average data.
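One way to make the amortized claim concrete is to count hash-function calls over a long run of insertions, rehashing included. The counting scheme below is my own sketch (the doubling policy and the \(1/2\) threshold are illustrative choices): even though each rehash is linear-time, doubling the table keeps the total number of calls linear in the number of insertions, so the per-insertion average is a small constant.

```cpp
#include <cassert>
#include <cstddef>

// Count hash-function calls over n insertions into a table that
// doubles its bucket count (rehashing every stored key) whenever
// the load factor would reach 1/2.
std::size_t totalHashCalls(std::size_t n)
{
    std::size_t buckets = 8, stored = 0, calls = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
        if (stored + 1 > buckets / 2)  // Rehash: Theta(stored) calls
        {
            buckets *= 2;
            calls += stored;
        }
        ++calls;                       // Hash the newly inserted key
        ++stored;
    }
    return calls;
}
```

For n = 1000 insertions, the rehash costs form a geometric series (4 + 8 + 16 + … + 512), so the total stays under 3n: amortized constant-time, despite the occasional linear-time rehash.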
For closed hashing, the analysis is much trickier, but the conclusions are essentially the same.
“Rare” Behavior
The text notes that worst-case behavior is rare in a well-written Hash Table with a good hash function. This is correct; however, exactly what “rare” means is worth thinking about. In particular, if our input is provided by a malicious user who can produce data that leads to worst-case hash-table behavior, then it may not matter how rare such data is.
Such concerns are not merely academic. For example, in 2003 security researchers Scott A. Crosby and Dan S. Wallach showed that by feeding carefully selected data to the Bro network intrusion detection system, they could cause poor behavior in a Hash Table, rendering the intrusion detection ineffective.
Hashing in Practice
It is somewhat surprising that there is no consensus on what the best Hash Table implementation is.
A little searching and reading suggests that open hashing is more popular than closed hashing for ordinary in-memory associative data. Apparently, the built-in Hash Tables for the Perl and Ruby programming languages use open hashing.
In the C++ Standard Library, the interface to std::unordered_map includes functionality giving client code access to specific buckets. Thus, implementations would seem to be required to use open hashing. Regardless, a quick check shows that the GNU implementation of the C++11 Standard Library does in fact use open hashing in std::unordered_map (v. 4.8.1 checked).
On the other hand, the built-in Hash Table in the standard implementation of the Python programming language uses closed hashing (CPython v3.3 checked). Little effort is made to avoid patterned output from the hash function. The table size is always a power of \(2\), and the load factor is kept under \(2/3\). The probe sequence used is illustrated by the following code.
[C++]
size_t hash_addr;   // Hash address
size_t table_size;  // Locations in table; ALWAYS POWER OF 2
size_t perturb = hash_addr;
size_t i = hash_addr % table_size;
while (!probe(i))   // Probe @ index i; returns true on success
{
    i = (5*i + 1 + perturb) % table_size;
    perturb >>= 5;
}
// Now i is the index where the value is stored
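For experimentation, the loop above can be packaged as a self-contained function that simply records the first few probe indices. This mimics the recurrence rather than calling CPython's actual code; the function name and sample numbers below are mine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// First `count` probe indices for CPython-style probing:
// i = (5*i + 1 + perturb) % table_size, with perturb >>= 5 each step.
std::vector<std::size_t> cpythonProbes(std::size_t hash_addr,
                                       std::size_t table_size,  // power of 2
                                       std::size_t count)
{
    std::vector<std::size_t> seq;
    std::size_t perturb = hash_addr;
    std::size_t i = hash_addr % table_size;
    for (std::size_t n = 0; n < count; ++n)
    {
        seq.push_back(i);
        i = (5 * i + 1 + perturb) % table_size;
        perturb >>= 5;
    }
    return seq;
}
```

For example, with hash address 42 in a table of size 8, the first probes fall at indices 2, 5, 3, 0: once `perturb` shifts down to zero, the sequence settles into the pure \(i \mapsto 5i + 1 \pmod{\text{table\_size}}\) recurrence, which visits every index of a power-of-2 table.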