CS 311 Fall 2024
Exam Review Problems, Set G
This is the seventh of seven sets of exam review problems. A complete review for the Final Exam includes all seven problem sets.
Problems
Review problems are given below. Answers are in the Answers section of this document. Do not turn these in.
- What problems occur when a Hash Table gets too full?
- What is the load factor of a Hash Table?
- What symbol do we use to represent the load factor?
- Typically, how does the code for a Hash Table determine whether the Hash Table is too full?
- What is done when a Hash Table gets too full?
- What are the downsides of the fact that the procedure from the previous part is required?
- “How collisions are resolved is the primary design decision involved in a Hash-Table implementation.” Explain.
- What advantages do Hash Tables have over self-balancing search trees, as a Table implementation?
- What disadvantages do Hash Tables have (list three disadvantages)?
- “Use Hash Tables intelligently,” says your instructor. Explain.
- Consider the following C++ code fragment:
  ```cpp
  vector<double> data(20);
  for (int i = 0; i < 20; ++i)
      data[i] = double(i) + 7.1;
  cout << data[3] << endl;
  ```
- Change this code so that instead of a `vector`, it uses one of the STL Table implementations. Make as few changes as possible. You may assume that the required header files have already been included, and there are appropriate “using” lines.
- The change made in the previous part is probably not a great idea; it is better to use `vector`, as the original code does. Why?
- The bracket operator of `std::map` and `std::unordered_map` is non-const. So, for example, we have the following issue:

  ```cpp
  const std::map<int, int> m = mymap;  // No problem
  cout << m[0] << endl;                // COMPILER ERROR! m is const
  ```

  Why is there no `const` version of this operator?
- List the C++ STL containers that are implemented using Hash Tables, and explain how these containers differ from each other. Hint. There are four such containers.
- In a `std::vector`, if one has an iterator (“`iter`”) to an item, one can say `*iter = value;` to change the item’s value—if the `vector` is non-const. The same is true for a `std::deque` and a `std::list`. However, this is not the case for the STL Table implementations (`std::set`, `std::map`, etc.). Why not?
- Draw a Prefix Tree (Trie) holding the following words:
a, an, and, ankle, any
- Discuss the efficiency of the Table insert, delete, and retrieve operations
for a Table implemented using a Prefix Tree.
- Your instructor refers to Prefix Trees as
“the Radix Sort of Table implementations”.
Explain.
- Suppose we put our data structures on a mass storage device.
How does this affect the optimal design of …
- … a Hash Table?
- … a self-balancing search tree?
- What is a “B-Tree”?
- What is a “B+ Tree”?
- A graph is shown below.
- What is a “greedy” algorithm?
- Give a problem that is correctly solved by a greedy algorithm.
- What does Prim’s Algorithm find?
- Outline how Prim’s Algorithm works.
- What does Kruskal’s Algorithm find?
- Outline how Kruskal’s Algorithm works.
- We would say that insert-at-beginning is inefficient for a smart array.
On the other hand, we would say that one can efficiently iterate
through all items in an array.
And yet both of these operations are linear time.
Why do we say one is “inefficient”
and the other is “efficient”,
when they have the same order?
- In each part below, a problem is given.
This problem has two solutions that are generally considered to be the best.
In each part,
- Name the two solutions.
- Discuss trade-offs between the two solutions.
- Is there a situation in which we might not use one of the two “best” solutions? Explain.
- Sorting a sequence.
- Searching for an item in a sequence.
- Implementing a Table.
- Modern processors all use caching,
in which, when a memory location is read,
nearby memory locations are also read and stored on the processor
for faster access.
What implications does this have for the relative desirability
of smart arrays vs. Linked Lists?
- For each of the following time efficiency classes,
we have discussed a quintessential operation or algorithm
that lies in that class.
For each, indicate what this is.
- Constant time.
- Logarithmic time.
- Linear time.
- Log-linear time.
- Quadratic time.
Answers
- When a Hash Table gets too full, performance can degrade.
- The load factor of a Hash Table is the number of items stored in the table divided by the number of locations in the array. (When open hashing is used, these locations are buckets, and the load factor is the average number of items per bucket.)
- We denote the load factor by \(\alpha\) (lower-case Greek alpha).
- The usual signal that a Hash Table is too full is that the load factor gets too high. “Too high” means that it reaches some fixed, implementation-dependent value—typically a number a bit less than \(1\). For example, the Hash Table implementation built into CPython is considered too full when \(\alpha\) reaches \(2/3\) or greater.
- When the load factor of a Hash Table gets too high, rehashing is necessary. This corresponds to the reallocate-and-copy operation of a smart array; however, it is more complex and time-consuming (although still linear time). The Hash Table must be reallocated, and every item must be inserted into the new table. Thus, the hash function must be called for every item.
- The requirement of occasional rehashing means that, for an expanding Hash Table, every now and then a slow operation is required. Of course, just as with smart arrays, we can avoid this problem if we preallocate, that is, if we start our Hash Table with a large enough array to accommodate all the items we will insert without pushing the load factor too high.
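The rehash step can be sketched in code. Below is a minimal separate-chaining table of strings; the class name, growth policy, and the 0.75 load-factor limit are illustrative assumptions, not the course’s implementation. The point to notice is that `rehash` reallocates the bucket array and calls the hash function again for every stored item.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Sketch of an open-hashing (separate-chaining) Hash Table of string keys.
class ChainedSet {
public:
    explicit ChainedSet(std::size_t nBuckets = 8) : _buckets(nBuckets), _size(0) {}

    double loadFactor() const {              // alpha = items / buckets
        return double(_size) / double(_buckets.size());
    }

    void insert(const std::string & key) {
        if (contains(key)) return;           // no duplicate keys in a Table
        _buckets[bucketOf(key, _buckets.size())].push_back(key);
        ++_size;
        if (loadFactor() > _maxLoadFactor)   // table too full?
            rehash(_buckets.size() * 2);
    }

    bool contains(const std::string & key) const {
        for (const auto & k : _buckets[bucketOf(key, _buckets.size())])
            if (k == key) return true;
        return false;
    }

    std::size_t bucketCount() const { return _buckets.size(); }

private:
    static std::size_t bucketOf(const std::string & key, std::size_t n) {
        return std::hash<std::string>{}(key) % n;  // hash code mod array size
    }

    // Rehash: allocate a new bucket array and re-insert every item.
    // The hash function must be called again for each item; linear time.
    void rehash(std::size_t newBucketCount) {
        std::vector<std::vector<std::string>> newBuckets(newBucketCount);
        for (const auto & bucket : _buckets)
            for (const auto & key : bucket)
                newBuckets[bucketOf(key, newBucketCount)].push_back(key);
        _buckets = std::move(newBuckets);
    }

    std::vector<std::vector<std::string>> _buckets;
    std::size_t _size;
    static constexpr double _maxLoadFactor = 0.75;  // illustrative limit
};
```

After a rehash the load factor drops, so a burst of inserts pays for an occasional slow operation—the same amortized pattern as smart-array reallocate-and-copy.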
- Conceptually, a Hash Table is very simple: it is an array whose index is computed as the hash code modulo the array size. The complexity in the implementation comes from what we do in the case of a collision. Thus, nearly all the implementation details of a Hash Table are determined by the collision-resolution technique.
- On the average, a Hash Table has performance significantly better than a Table implemented using a self-balancing search tree. Hash Tables may also be easier to implement (but now that we have generic programming and the internet, who cares?).
- Here are three disadvantages of Hash Tables:
- Poor worst-case performance—when a dataset with many collisions is used, and also due to the overhead involved in rehashing.
- Requirement that a hash function be specified for user-defined types.
- Traverse is unsorted.
- Hash Tables are often presented as an option that always works well. However, they can have serious disadvantages (see the previous part). If the drawbacks of Hash Tables are acceptable, then by all means use one. But first think about those drawbacks and how they may affect the performance and quality of your code.
- There are both keys and associated values, and there are no duplicate keys. Thus, the appropriate Table implementation is either `std::map` or `std::unordered_map`. Since each of these has a bracket operator, the only thing that needs to be changed is the declaration of the variable `data`. I will use `std::map`.

  Old: `vector<double> data(20);`

  New: `map<int, double> data;`

  OR: `unordered_map<int, double> data;`
- Using `std::map` or `std::unordered_map` for array-style data gives certain advantages, such as compact storage of sparse datasets, and the availability of non-integer key types. However, none of those applies here. On the other hand, the disadvantages of `map`/`unordered_map`, like slower execution, do apply.
- The `operator[]` of `std::map` and `std::unordered_map` returns a reference to an item in the structure. The item must exist; otherwise we cannot return a reference to it. Thus, if no item with the given key exists, one must be inserted. This modifies the structure, so it cannot be a `const` member function.
- There are four C++ STL containers that are implemented using Hash Tables:
  - `std::unordered_set`
  - `std::unordered_map`
  - `std::unordered_multiset`
  - `std::unordered_multimap`
  All of the above containers are Table implementations. In the …set containers, the key is the whole value, while the …map containers store key-data pairs. The …multi… containers allow duplicate keys; the others do not. `std::unordered_map` has a bracket operator; none of the other three containers has this.
- In an implementation of a value-oriented ADT, the location an item is stored in depends on what its key is. Thus, we cannot change the value of an item in-place; we must delete and re-insert the item. This may be time-consuming, so the STL Table implementations do not provide a single function that performs this operation.
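Related to the bracket-operator answer above: for read-only access to a const map, the member function `at` can be used. It returns a reference to an existing item and throws `std::out_of_range` if the key is absent; since it never inserts, it has a `const` overload. A brief sketch:

```cpp
#include <map>
#include <stdexcept>

// const-friendly read access via at(): throws instead of inserting.
double lookup(const std::map<int, double> & m, int key) {
    return m.at(key);  // m[key] would not compile here: m is const
}
```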
- Here is the Prefix Tree, drawn using the conventions from the lecture slides:
- If we implement a Table using a Prefix Tree,
then the number of steps required for a Table insert, delete, or retrieve operation
is essentially the number of characters in the given key.
(Each key is a string,
so we can talk about the number of characters in it.)
Thus, if we consider the maximum length of a key to be fixed, then all three operations
are constant time.
On the other hand, if we want each item to have a different key, then with larger data sets, we need longer keys. The key length needs to be something like the log of the number of keys. Thus, there is a hidden logarithm (just as there was with Radix Sort); for arbitrarily large data sets, it is reasonable to consider the three operations to be logarithmic time.
- Prefix Trees are “the Radix Sort of Table implementations” because, like Radix Sort:
- They are a bit off the beaten path, handling data rather differently from more mainstream methods.
- They are applicable only to the special case of datasets that are collections of strings in a general sense—for example, a positive integer can be considered as a string of digits.
- They would seem to be extremely fast. They are pretty fast, but there is a hidden logarithm involving the length of a string, which means that, in practice, their performance is comparable to more mainstream methods.
- They are not difficult to implement well. So implementing them yourself, for use in production code, might not be a bad idea.
- When we keep a data structure on external storage, our primary concern is to minimize the number of block accesses. Thus, we would avoid closed hashing, since a typical probe sequence involves multiple locations within the table, and thus multiple block reads. We would prefer to use buckets designed so that items in a bucket are stored close together, and in the same block (or small number of blocks) if possible.
- Again, we wish to minimize the number of block accesses. Making nodes large (around the size of a block?) can help with this. Thus, we use a B-Tree of high degree—or a variation, like a B+ Tree—which has large nodes.
- B-Trees are a generalization of 2-3 Trees. A B-Tree of degree \(m\) is an \(\left\lceil\frac{m}{2}\right\rceil\)-…-\(m\) Tree. That is, nodes can be \(\left\lceil\frac{m}{2}\right\rceil\)-nodes, …, \(m-1\)-nodes, or \(m\)-nodes. (Recall: A \(k\)-node has \(k-1\) data items and, if it is not a leaf, exactly \(k\) children). The exception is the root node, which can have \(1\), …, \(m-1\) data items; the number of children of the root is still one more than the number of data items in it, unless it is a leaf. The algorithms for a B-Tree generalize those for a 2-3 Tree.
- A B+ Tree is a variation on a B-Tree in which each data item that is in a non-leaf node is duplicated in a leaf node, associated values are found only in the leaf nodes, and the leaf nodes are joined into an auxiliary Linked List.
- DFS: 0, 1, 2, 4, 3, 6, 5.
- BFS: 0, 1, 6, 2, 5, 3, 4.
- A greedy algorithm is one that
makes a series of choices.
Choices are:
- feasible (they make sense),
- locally optimal (they are best-possible based on currently known information), and
- irrevocable (once a choice is made, it is permanent).
- A greedy algorithm correctly finds a minimum spanning tree in a connected weighted graph.
- Prim’s Algorithm finds a minimum spanning tree in a connected weighted graph.
- Prim’s Algorithm begins with one vertex that is declared to be reachable. It then repeatedly adds to the spanning tree the edge with least weight that joins a reachable vertex to a not-reachable vertex, with the not-reachable vertex then becoming reachable.
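The outline above can be sketched with a priority queue of candidate edges (names and graph representation are illustrative; `adj[v]` holds `(neighbor, weight)` pairs, and the function returns the total MST weight for a connected graph):

```cpp
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Prim's Algorithm sketch: total weight of a minimum spanning tree
// of a connected weighted graph given as an adjacency list.
int primMSTWeight(const std::vector<std::vector<std::pair<int,int>>> & adj) {
    std::vector<bool> reachable(adj.size(), false);
    // Min-heap of (edge weight, vertex) candidates.
    std::priority_queue<std::pair<int,int>,
                        std::vector<std::pair<int,int>>,
                        std::greater<>> pq;
    pq.push({0, 0});  // start vertex 0, reached at cost 0
    int total = 0;
    while (!pq.empty()) {
        auto [w, v] = pq.top();
        pq.pop();
        if (reachable[v]) continue;  // stale candidate; vertex already in tree
        reachable[v] = true;         // irrevocably take the least-weight edge
        total += w;
        for (auto [u, wt] : adj[v])
            if (!reachable[u]) pq.push({wt, u});
    }
    return total;
}
```

Each popped edge joins a reachable vertex to a not-reachable one (stale candidates are skipped), which is exactly the greedy choice in the outline.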
- Kruskal’s Algorithm finds a minimum spanning tree in a connected weighted graph.
- Kruskal’s Algorithm repeatedly adds to the spanning tree the edge with least weight that joins two vertices that cannot yet be reached from each other using only edges added to the tree so far.
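Kruskal’s outline can likewise be sketched; here a simple union-find structure (an assumption of this sketch, not required by the course outline) tracks which vertices can already reach each other via tree edges:

```cpp
#include <algorithm>
#include <numeric>
#include <tuple>
#include <vector>

// Kruskal's Algorithm sketch: total weight of a minimum spanning tree.
// Edges are (weight, vertex, vertex) tuples.
int kruskalMSTWeight(int nVertices,
                     std::vector<std::tuple<int,int,int>> edges) {
    std::vector<int> parent(nVertices);
    std::iota(parent.begin(), parent.end(), 0);  // each vertex its own root
    auto find = [&](int v) {                     // root of v's component
        while (parent[v] != v) {
            parent[v] = parent[parent[v]];       // path halving
            v = parent[v];
        }
        return v;
    };

    std::sort(edges.begin(), edges.end());       // by increasing weight
    int total = 0;
    for (auto [w, a, b] : edges) {
        int ra = find(a), rb = find(b);
        if (ra != rb) {       // a and b cannot yet reach each other
            parent[ra] = rb;  // adding the edge merges the two components
            total += w;
        }
    }
    return total;
}
```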
- Insertion is a single-item operation. Such operations are often done in bunches. Thus, we need insertion to be faster than linear time for it to be very useful. Iteration through all items, on the other hand, involves all items in the structure. Thus it cannot be faster than linear time, so we call a linear-time implementation “efficient”.
- Introsort, Merge Sort.
- Both algorithms are \(O(n\log n)\). Introsort uses much less memory when sorting an array, but requires random-access data and is not stable. Merge Sort is stable and can be written to work on a Linked List. In different situations, each algorithm may be faster than the other.
- We may use Insertion Sort to handle nearly sorted lists or small lists. Other algorithms (e.g., Radix Sort) may be used when sorting special kinds of data. We may use Heap Sort in low-memory situations. We may use variants of Heap Sort to handle more general situations, for example, when a sequence is modified during the sorting process, or when we only want to sort the greatest or least items in a sequence.
- Binary Search, Sequential Search.
- Binary Search is \(O(\log n)\), while Sequential Search is \(O(n)\). However, Binary Search requires sorted data, and to be efficient, it requires random-access data, while Sequential Search works on any sequence.
- For a simple sequence type (array, Linked List), we would pretty much always use Binary Search or Sequential Search. But in a fancier data structure (various trees, Hash Tables) we would use the search algorithm appropriate to the structure.
- Self-Balancing Search Tree, Hash Table.
- Self-Balancing Search Trees have better worst-case performance (\(O(\log n)\) for Insert, Delete, Retrieve) and sorted traverse. They may require a user-provided ordering. We generally use a Red-Black Tree for in-memory data structures and a B-Tree (or B+ Tree or other B-Tree variation) of high degree for external data structures. Hash Tables have better typical-case performance (\(O(1)\) on average for Insert, Delete, Retrieve if Hash Table is not overly full), but worse worst-case performance (\(O(n)\) Insert, Delete, Retrieve). Their traverse is unsorted. They also may require occasional rehashing and a user-provided hash function.
- If we need Table-style functionality, but our application requires an extended Insert phase, followed by an extended Retrieve phase, with little or no deletion, then we may not use a Table implementation at all. Instead, we simply place all our data in an (unsorted) array, sort it, and use Binary Search for our retrievals.
- If the application generally accesses items in a list in such a way that, when an item is accessed, nearby items are likely to be accessed next (locality of reference), then the existence of caching makes arrays more desirable and Linked Lists less desirable. This is because consecutive items in an array are stored in adjacent memory locations, which will be loaded into the cache. Linked lists, on the other hand, do not have this property.
- Constant time. Look-up by index (bracket operator) for an array.
- Logarithmic time. Binary Search.
- Linear time. Any simple loop that goes through all the items in a list and performs a simple (constant time) operation on each. For example: copying an array, summing a list of numbers.
- Log-linear time. Sorting, using a good algorithm (e.g., Merge Sort, Introsort).
- Quadratic time. Inefficient sorting algorithms (e.g., Bubble Sort); more generally, two typical nested loops, each over all the items.