Binary I/O and Filesystems

CS 321 2007 Lecture, Dr. Lawlor

So a file's full of bytes.  You don't want bytes.  You first want to stick bytes together to make ints, doubles, and the other types in your program.  You then want to stick those together into structs, like "std::string".  These structs need to be laid out into data structures, which all need to get stored on disk somehow.

Let's take those one at a time.

Storing Ints as Bytes

Say you've got the hex value 0xa0b1c2d3.  The obvious way to store this 32-bit int in memory is using 4 bytes, as follows:
Byte:   0     1     2     3
Value:  0xa0  0xb1  0xc2  0xd3

This is called "big-endian" notation--the first byte is the big end of the int.  Many CPUs over the years, including the Motorola 68000, SPARC, and PowerPC, have used big-endian storage.

Not every CPU does, though: the ancient VAX and the modern x86, among others, are little-endian.  Sadly, those are exceedingly popular machines, which store that same int 0xa0b1c2d3 in "little-endian" notation:
Byte:   0     1     2     3
Value:  0xd3  0xc2  0xb1  0xa0

The difference between big and little endian machines ("endianness") stinks, but that's life.  You can verify what's happening by writing out an int and reading bytes, or by copying memory between byte and int variables:
#include <string.h>

int foo(void)
{
    // unsigned char, because values like 0xa0 don't fit in a signed char
    unsigned char b[4]={0xa0, 0xb1, 0xc2, 0xd3}; // 4 bytes
    int i=0; // 1 int
    memcpy(&i,&b[0],sizeof(i)); // copy bytes from b to i
    return i;
}

On a little-endian x86 machine, this program will print 0xd3c2b1a0.
On a big-endian PowerPC machine, this program will print 0xa0b1c2d3.

Note that this is a function of how the CPU stores an "int" in the bytes of memory, so you have the exact same endian-dependent situation with files:
#include <fstream>

int foo(void)
{
    // Write out 4 bytes into this file:
    std::ofstream fo("test.bin",std::ios_base::binary);
    // unsigned char, because values like 0xa0 don't fit in a signed char
    unsigned char b[4]={0xa0, 0xb1, 0xc2, 0xd3}; // 4 bytes
    fo.write((char *)&b[0],4);
    fo.flush(); //<- else fo leaves our bytes in output buffer!

    // Read those same bytes out as a binary int:
    std::ifstream fi("test.bin",std::ios_base::binary);
    int i=0;
    fi.read((char *)&i,sizeof(i));
    return i;
}

Stupid ways to deal with endianness

Smart ways to deal with endianness

That last technique, the magic class, is by far my favorite.  The biggest advantage of this is that now if you have fourteen things you need to store on disk, you can make a new class out of Big32 objects, and the new class will also have a known on-disk byte layout:
class stuffpile {
public:
Big32 foos;
Big32 bars[11];
Big32 baz,boz;
};
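The stuffpile class above only works if Big32 itself has a fixed byte layout.  The Big32 definition isn't reproduced at this point in the notes, so what follows is only a minimal sketch of how such a class might look, assuming it wraps exactly four bytes and uses shifts to convert.  The member names (d, set, big32_demo) are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of a Big32-style class: a 32-bit value that is
// always stored as 4 bytes in big-endian order, whatever the CPU does.
class Big32 {
    unsigned char d[4]; // d[0] is the most significant ("big end") byte
public:
    Big32(std::uint32_t v = 0) { set(v); }

    // Split the value into bytes with shifts, so CPU endianness never matters.
    void set(std::uint32_t v) {
        d[0] = (unsigned char)(v >> 24);
        d[1] = (unsigned char)(v >> 16);
        d[2] = (unsigned char)(v >> 8);
        d[3] = (unsigned char)(v);
    }

    // Reassemble the bytes into a native integer.
    operator std::uint32_t() const {
        return ((std::uint32_t)d[0] << 24) | ((std::uint32_t)d[1] << 16) |
               ((std::uint32_t)d[2] << 8)  |  (std::uint32_t)d[3];
    }
};

// Only unsigned char members, so there is no padding to worry about.
static_assert(sizeof(Big32) == 4, "Big32 must be exactly 4 bytes");

// Demonstration: the object's raw bytes are the same on any machine.
void big32_demo() {
    Big32 x(0xa0b1c2d3u);
    unsigned char raw[4];
    std::memcpy(raw, &x, sizeof(x));         // peek at the stored bytes
    assert(raw[0] == 0xa0 && raw[1] == 0xb1); // big end comes first
    assert(raw[2] == 0xc2 && raw[3] == 0xd3);
    assert(std::uint32_t(x) == 0xa0b1c2d3u);  // converts back unchanged
}
```

With a definition along these lines, any class built purely out of Big32 members is itself just a run of bytes in a known order.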
"stuffpile" objects can now be written and read easily and portably as bytes, just like Big32s.

Argh!  I Hate Binary!  Why not just use ASCII?

ASCII really is fine if you don't care too much about:
Unfortunately, we often do care about all four of these things.  Hence it's important for you to learn about reading binary files.

Real-life complicated binary files

A real binary file usually has an interesting structure.  The first thing in the file is a "header".  This is a sequence of stuff at known locations.  For simple files everything in the file is at a known fixed location, but real life is rarely simple.  Instead, often the header will give the file locations ("offsets") to where you can find the other stuff in the file.

Example: EXE file format

Modern Windows executables are in the "PE" format (Portable Executable).  They start with an old MS-DOS program header, but that data isn't used anymore (it's just a tiny DOS program that prints "This program cannot be run in DOS mode").  To find the real executable info, you seek to byte 0x3c in the file and read 4 bytes, which hold a little-endian byte offset.  At that offset in the file (again, you get there with a seek), there's a whole struct full of information about the program.  Here's a complete example program.  Through the use of the "lil32" class, this program can run on any machine, not just little-endian Windows machines.
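The seek-read-seek dance described above can be sketched roughly as follows.  This is not the lecture's example program, just a small illustration.  The helper names (read_lil32, find_pe_header, find_pe_header_in) are made up, and the check for the 4-byte signature "PE\0\0" at the target offset is my addition, based on how the PE format marks the start of its header.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Read 4 bytes and assemble them as a little-endian 32-bit value.
// Shifts make this work the same way on big- and little-endian CPUs.
std::uint32_t read_lil32(std::istream &f) {
    unsigned char b[4] = {0, 0, 0, 0};
    f.read((char *)b, 4);
    return  (std::uint32_t)b[0]        | ((std::uint32_t)b[1] << 8) |
           ((std::uint32_t)b[2] << 16) | ((std::uint32_t)b[3] << 24);
}

// Return the file offset of the PE header, or -1 if it can't be found.
long find_pe_header(std::istream &f) {
    f.seekg(0x3c);                     // the offset field lives at byte 0x3c
    std::uint32_t off = read_lil32(f); // little-endian byte offset
    f.seekg(off);                      // jump to where the real header starts
    char sig[4] = {1, 1, 1, 1};
    f.read(sig, 4);                    // expect the signature 'P','E',0,0
    if (!f || sig[0] != 'P' || sig[1] != 'E' || sig[2] != 0 || sig[3] != 0)
        return -1;
    return (long)off;
}

// Convenience wrapper for trying the parser on bytes held in memory.
long find_pe_header_in(const std::string &bytes) {
    std::istringstream f(bytes);
    return find_pe_header(f);
}
```

For a real file you would pass a std::ifstream opened with std::ios_base::binary.  The in-memory wrapper is only there to make the function easy to try out on hand-built bytes.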

Example: FAT file system

Any filesystem is just a big binary data structure sitting on your disk.  One common filesystem, used in USB keychain drives, floppy disks, and old hard disks, is the "File Allocation Table" filesystem.

The first thing on disk is the FAT "boot sector", which tells you how many entries are actually in the FAT.  Then comes the FAT itself (read the Wikipedia article, it's good!).  Then come the blocks of data in the normal user files sitting on the disk.  Because the boot sector, FAT, and user data blocks are all a known size, the OS can directly seek (the disk) to a particular location to read a particular file.
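To make the "known size" idea concrete, here is a rough sketch of pulling a few layout numbers out of a boot sector and computing where the FAT begins and ends.  The field offsets (11, 14, 16, 22) are the FAT12/FAT16 boot sector fields as I understand them.  FAT32 stores its FAT size elsewhere, so treat this as an assumption-laden illustration rather than a full parser.  The struct and function names are invented for this example.

```cpp
#include <cstdint>

// Hypothetical summary of where things live on a FAT12/FAT16 disk.
struct FatLayout {
    std::uint32_t bytes_per_sector;
    std::uint32_t reserved_sectors;  // sectors before the first FAT (boot sector included)
    std::uint32_t fat_count;         // number of copies of the FAT
    std::uint32_t sectors_per_fat;
    std::uint32_t fat_offset;        // byte offset of the first FAT
    std::uint32_t after_fats_offset; // byte offset just past the last FAT copy
};

// Assemble a little-endian 16-bit value from 2 bytes (FAT is little-endian).
static std::uint32_t lil16(const unsigned char *p) {
    return (std::uint32_t)p[0] | ((std::uint32_t)p[1] << 8);
}

// boot must point at the first 512 bytes of the disk.
FatLayout parse_boot_sector(const unsigned char *boot) {
    FatLayout L;
    L.bytes_per_sector = lil16(boot + 11);
    L.reserved_sectors = lil16(boot + 14);
    L.fat_count        = boot[16];
    L.sectors_per_fat  = lil16(boot + 22);
    // Everything has a known size, so the offsets are plain arithmetic:
    L.fat_offset = L.reserved_sectors * L.bytes_per_sector;
    L.after_fats_offset = L.fat_offset +
        L.fat_count * L.sectors_per_fat * L.bytes_per_sector;
    // (On FAT12/FAT16 the root directory sits here, then the file data.)
    return L;
}
```

With values typical of a 1.44MB floppy (512-byte sectors, 1 reserved sector, 2 FATs of 9 sectors each), the first FAT starts at byte 512 and the FAT copies end at byte 9728, so the OS can seek straight to either spot.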