|CS 321 Spring 2013 > Lecture Notes for Friday, March 29, 2013|
The process of starting a computer using a storage device is curiously paradoxical. To start up, we need to read information on the device. But the code to read the device is part of the OS, which is stored on the device. How can we read the code if we do not have the code to read yet? The situation is rather like locking your keys in your car. You need the keys to unlock the car, but the keys are in the car, so you need the keys to unlock the car and get the keys.
Thus, starting a computer from a storage device has been likened to “pulling oneself up by one’s bootstraps”. It is therefore called bootstrapping, or, more commonly, booting.
In order to boot, code that can read some small part of the device must be a permanent part of the computer. This code can then read more sophisticated code, which may in turn read still more sophisticated code, until the entire OS is loaded.
Modern storage devices follow a standard in which the first sector on a disk is the master boot record. The master boot record contains the partition table, which describes how the device is partitioned into logical volumes. On booting, we read the master boot record, then go to the first partition and read its first block, called the boot block. The boot block is followed by the superblock, which contains information about the file system: its type, the number of blocks used, etc.
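To make the role of the partition table concrete, here is a minimal sketch that pulls the partition entries out of a 512-byte master boot record. It assumes the classic PC layout (four 16-byte entries starting at byte 446, and the signature bytes 0x55 0xAA at the end of the sector); the function name `parse_mbr` is made up for this illustration.

```python
import struct

SECTOR_SIZE = 512
PARTITION_TABLE_OFFSET = 446   # classic PC MBR layout (assumed here)
ENTRY_SIZE = 16
NUM_ENTRIES = 4

def parse_mbr(sector: bytes):
    """Return (type, first_sector, sector_count) for each used partition entry."""
    if len(sector) != SECTOR_SIZE or sector[510:512] != b"\x55\xaa":
        raise ValueError("not a valid master boot record")
    partitions = []
    for i in range(NUM_ENTRIES):
        start = PARTITION_TABLE_OFFSET + i * ENTRY_SIZE
        entry = sector[start:start + ENTRY_SIZE]
        # Entry fields: status (1 byte), CHS start (3, skipped), type (1),
        # CHS end (3, skipped), first sector (4), sector count (4).
        _status, ptype, first_sector, count = struct.unpack("<B3xB3xII", entry)
        if ptype != 0:             # type 0 marks an unused entry
            partitions.append((ptype, first_sector, count))
    return partitions
```

Given the first sector of a disk image as `bytes`, the boot code's next step would be to seek to `first_sector` of the chosen partition and read the boot block from there.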
How do we tell which blocks hold the contents of a file? There are three systems: contiguous allocation, linked list, and index block.
Contiguous allocation is a file system version of the simple address space. We store a file’s contents in a contiguous range of blocks. A directory entry gives the first block and the number of blocks. This method allows for fast file reads, but it can be difficult to find space for a file.
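The difficulty of finding space can be seen in a small sketch. It assumes a first-fit search over a free-block map, which is only one possible policy; `first_fit` and `free_map` are hypothetical names.

```python
def first_fit(free_map, n_blocks):
    """Return the first block of a run of n_blocks free blocks, or None.

    free_map[i] is True when block i is free. Assumes n_blocks >= 1.
    """
    run_start, run_len = 0, 0
    for i, is_free in enumerate(free_map):
        if is_free:
            if run_len == 0:
                run_start = i      # a new run of free blocks begins here
            run_len += 1
            if run_len == n_blocks:
                return run_start
        else:
            run_len = 0            # run broken by a used block
    return None

free_map = [True, False, True, True, False, True, True, True]
print(first_fit(free_map, 3))   # 5
print(first_fit(free_map, 4))   # None, even though 6 blocks are free
```

The second call fails because the free blocks are scattered: no single run is long enough, which is exactly the problem described above.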
In the linked list approach, the directory entry holds a pointer to the first block. Then each block contains a pointer to the next. A variation is that a master table for the volume has an entry for each block, and if a block is part of a file, then its entry holds a pointer to the next block. (This latter approach is used by the FAT family of file systems.) The linked list idea allows for simple data structures that do not fill up, but it can give poor performance when we do random-access I/O.
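A rough sketch of the master-table variation follows. The table is modeled as a Python list in which entry `i` holds the number of the block that follows block `i`; the end-of-chain marker and the function names are assumptions made for illustration, not the actual FAT on-disk format.

```python
END_OF_CHAIN = -1   # hypothetical marker for "no next block"

def file_blocks(table, first_block):
    """Follow the chain from first_block and return every block of the file."""
    blocks = []
    block = first_block
    while block != END_OF_CHAIN:
        blocks.append(block)
        block = table[block]
    return blocks

def nth_block(table, first_block, n):
    """Find the n-th block of a file (n = 0 is the first block).

    We must walk n links to get there, which is why random access is slow.
    """
    block = first_block
    for _ in range(n):
        block = table[block]
    return block

# A file stored in blocks 2 -> 5 -> 3; unused entries are None.
table = [None, None, 5, END_OF_CHAIN, None, 3]
print(file_blocks(table, 2))    # [2, 5, 3]
print(nth_block(table, 2, 2))   # 3
```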
An index block for a file is a block holding a list of all the blocks containing part of the file’s contents—and probably some metadata too. If an index block fills up, then we need additional index blocks. We may list these in a higher-level index block, or make them into a linked list.
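Here is a minimal sketch of the higher-level index block idea, assuming a two-level scheme in which the top block lists the lower-level index blocks. The disk is modeled as a dictionary from block number to a list of pointers, and the tiny `POINTERS_PER_BLOCK` value is chosen only to keep the example readable.

```python
POINTERS_PER_BLOCK = 4   # unrealistically small, for illustration

def lookup(disk, top_index_block, file_block_no):
    """Return the disk block holding block number file_block_no of the file."""
    which_index, offset = divmod(file_block_no, POINTERS_PER_BLOCK)
    index_block_no = disk[top_index_block][which_index]   # first block access
    return disk[index_block_no][offset]                   # second block access

disk = {
    10: [20, 21],               # top-level index block
    20: [100, 101, 102, 103],   # full lower-level index block
    21: [104, 105],             # partly filled lower-level index block
}
print(lookup(disk, 10, 5))   # 105
```

Unlike the linked list approach, finding any block of the file costs a fixed number of block accesses, no matter how far into the file it lies.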
A directory can be thought of as a key-value store, where a key is a filename and the associated value is the file’s metadata and contents. We can thus implement a directory using some of the same techniques we use for in-memory key-value data structures (see CS 311).
Most modern file systems store a directory using some variation on a B-Tree. This is a search tree with very large nodes. It can store many key-value pairs while keeping its height very small. We might store one node per block. The number of node accesses required for a search-tree operation is generally something like the height or two times the height. Thus, since each node is a block, and the height is small, a B-Tree results in a small number of block accesses when we do a directory look-up.
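To see how much large nodes help, the following sketch computes the smallest number of levels a search tree needs in order to hold a given number of keys. It assumes the best case, in which every node is full; a real B-Tree allows partly filled nodes and may be somewhat taller. The function name is made up for this example.

```python
def min_levels(n_keys, max_children):
    """Fewest levels a search tree needs to hold n_keys keys.

    A node with at most max_children children holds max_children - 1 keys,
    so a completely full tree with h levels holds max_children**h - 1 keys.
    """
    levels, capacity = 0, 0
    while capacity < n_keys:
        levels += 1
        capacity = max_children ** levels - 1
    return levels

print(min_levels(1_000_000, 2))     # 20 (binary search tree)
print(min_levels(1_000_000, 256))   # 3  (node with 256 children)
```

With one node per block, a look-up among a million entries touches about 3 blocks rather than about 20.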
The actual data structures used are generally not B-Trees per se, but close relatives of them. These have names like “B+ Tree” and “B* Tree”. For example, a B+ Tree is like a B-Tree, except that, for each value in an internal node, there is a duplicate in a leaf node, and the leaves are arranged into a linked list. This allows for fast traversal of a directory.
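The fast traversal can be sketched as follows. Only the leaf level is modeled; the internal nodes, and the class and function names, are omitted or invented for illustration.

```python
class Leaf:
    """A B+ Tree leaf: sorted (filename, metadata) pairs plus a link to the next leaf."""
    def __init__(self, entries, next_leaf=None):
        self.entries = entries
        self.next_leaf = next_leaf

def list_directory(first_leaf):
    """Yield every filename in order by walking the linked list of leaves."""
    leaf = first_leaf
    while leaf is not None:
        for name, _metadata in leaf.entries:
            yield name
        leaf = leaf.next_leaf    # no need to go back up through internal nodes

last = Leaf([("notes.txt", {}), ("todo.txt", {})])
first = Leaf([("a.out", {}), ("main.c", {})], next_leaf=last)
print(list(list_directory(first)))   # ['a.out', 'main.c', 'notes.txt', 'todo.txt']
```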
Due to caching, we generally expect most disk accesses to be writes. Since many small writes are inefficient, we add a log of changes to the file system, buffer our writes, and write all at once.
We can actually arrange things so that the log is the file system. The result is a log-structured file system. The log holds inodes, data blocks, and metadata. We keep it in a circular list on the device; when this fills up, we “clean” the volume, making the changes listed in the log, which will compact the data greatly.
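A heavily simplified sketch of the idea is below. It assumes every write appends a new version of a block to the log, and that cleaning keeps only the newest version of each block; the circular layout, inodes, and metadata are left out, and the class name `LogFS` is hypothetical.

```python
class LogFS:
    """Toy log-structured store: the log itself holds all the data."""
    def __init__(self):
        self.log = []       # append-only list of (block_id, data) records
        self.latest = {}    # block_id -> position in log of its newest version

    def write(self, block_id, data):
        self.latest[block_id] = len(self.log)
        self.log.append((block_id, data))    # never overwrite in place

    def read(self, block_id):
        return self.log[self.latest[block_id]][1]

    def clean(self):
        """Drop superseded versions, keeping records in their original order."""
        newest_first_written = sorted(self.latest.items(), key=lambda kv: kv[1])
        live = [(block_id, self.log[pos][1]) for block_id, pos in newest_first_written]
        self.log = live
        self.latest = {block_id: i for i, (block_id, _data) in enumerate(live)}

fs = LogFS()
fs.write(1, "a")
fs.write(2, "b")
fs.write(1, "c")       # supersedes the first record
fs.clean()
print(len(fs.log))     # 2
print(fs.read(1))      # c
```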
In a journalling file system, we write a description of each action before it is done. This aids crash recovery.
Suppose there is a crash. If the description of an action was not written, then we know the action was not performed. If the description was written, then we cannot be sure whether the action was performed. We make all the actions idempotent, meaning that doing an action twice in a row has the same effect as doing it once. Thus, if a description was written, we can simply perform the action during recovery. Whether the action was previously performed or not, the result will be the same.
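The recovery argument can be illustrated with a small sketch. It assumes the only kind of action is "set block b to value v", which is idempotent; an action such as "append v to block b" would not be, since replaying it would append twice. The journal format and function names are invented for this example.

```python
def apply_action(disk, action):
    """Perform one journaled action on the disk (a dict of block -> value)."""
    op, block, value = action
    if op != "set":
        raise ValueError("unknown action: " + op)
    disk[block] = value    # setting twice has the same effect as setting once

def recover(disk, journal):
    """After a crash, redo every action whose description reached the journal."""
    for action in journal:
        apply_action(disk, action)

journal = [("set", 1, "a"), ("set", 2, "b")]
disk = {1: "a"}            # crash happened after the first action was performed
recover(disk, journal)
print(disk)                # {1: 'a', 2: 'b'}
```

Running `recover` a second time leaves the disk unchanged, so a crash during recovery itself is also harmless.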