Reading and Writing Binary Files

CS 493 Lecture Notes, Dr. Lawlor

Quite a bit of computer security involves reading and writing raw binary data files:

Hex Editing

A key tool in understanding binary files is a hex editor, which works like a text editor but for binary files.
There are also some standard UNIX tools for manipulating binary files:
Executable files usually have their own analysis tools:

Object Formatting

There are a variety of situations where we need to communicate entire objects:
There are essentially only two ways to store objects: fixed-size formats, and variable-size formats.

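Fixed-Size Format                        | Variable-Size Format
-----------------------------------------+----------------------------------------------------------------------
Constant number of bytes (sizeof)        | Non-constant number of bytes
-----------------------------------------+----------------------------------------------------------------------
C++ builtin type, class or struct        | C++ string, vector, map, list, ...
Most binary file headers (e.g., BMP)     | Delimited ASCII data
Fixed-width ASCII records                | JSON
                                         | XML
-----------------------------------------+----------------------------------------------------------------------
Very fast to allocate and deallocate     | Surprisingly slow to allocate and deallocate (must use malloc or new)
-----------------------------------------+----------------------------------------------------------------------
Easy to allocate in a C++ array          | Cannot be directly stored in an array (must use pointers or offsets)
-----------------------------------------+----------------------------------------------------------------------
Not extensible (tempted to squeeze bits) | Can be extended (though you must plan ahead!)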

Fixed-size integers lead to the frankly ridiculous problem of integer overflow when you need more bits to represent the result than are available in the integer size you have chosen.  Running out of room is the classic problem with fixed-size objects, and encourages the hack of repurposing bits for other uses, such as the "0x3FFF rowBytes problem" on classic MacOS.
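
To make the overflow problem concrete, here is a minimal sketch using a 16-bit unsigned integer.  The function names add_wrapping and add_checked are made up for this illustration; they are not from any library.

```cpp
#include <cassert>
#include <cstdint>

// Plain fixed-size addition: the result is reduced modulo 2^16,
// so any bits that do not fit in 16 bits are silently lost.
std::uint16_t add_wrapping(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::uint16_t>(a + b);
}

// Checked addition (hypothetical helper): do the sum in a wider type,
// and refuse to store it if it does not fit back into 16 bits.
bool add_checked(std::uint16_t a, std::uint16_t b, std::uint16_t *out) {
    assert(out != nullptr);
    std::uint32_t wide = static_cast<std::uint32_t>(a) + b;
    if (wide > 0xFFFFu) return false;  // would overflow 16 bits
    *out = static_cast<std::uint16_t>(wide);
    return true;
}
```

With these definitions, add_wrapping(65535, 1) returns 0, while add_checked(65535, 1, &result) returns false and leaves result untouched.  Doing the arithmetic in a wider type and checking the range is only one of several ways to detect overflow.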

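To make any variable-size format, you need to be able to mark the end of the thing you're currently on.  But because you want to be able to send anything, including whatever symbol you picked to mark the end, you have two choices.  The first is to encode binary data as plain text, so the end symbol can never appear inside the data.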
The alternative to encoding is marking the end.  But this means you need a way to quote out "the end" symbol so you can send a literal end symbol, and then you need a way to quote the quotes for when you want to send actual quotes.  This means you immediately run into the exponential quoting growth problem for information nested several levels deep.  Each line below prints the line above it:

std::cout<<"Hello World!\n";
std::cout<<"std::cout<<\"Hello World!\\n\";\n";
std::cout<<"std::cout<<\"std::cout<<\\\"Hello World!\\\\n\\\";\\n\";\n";
std::cout<<"std::cout<<\"std::cout<<\\\"std::cout<<\\\\\\\"Hello World!\\\\\\\\n\\\\\\\";\\\\n\\\";\\n\";\n";

The solution to exponential quoting growth, and the speed and security flaws it invites, is to send a byte count, then arbitrary bytes.  This means each additional nested layer just needs to add another byte count, and it doesn't care what data is being sent, so the data can contain its own inner structures, including byte counts, without causing problems.
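
Here is a minimal sketch of the byte count approach.  The 4-byte big-endian count and the names append_chunk and read_chunk are assumptions chosen for this illustration, not any particular file format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Append a 4-byte big-endian byte count, then the payload bytes verbatim.
// The payload is never inspected, so it may contain quotes, newlines,
// zero bytes, or other chunks.
void append_chunk(std::string &out, const std::string &payload) {
    assert(payload.size() <= 0xFFFFFFFFu);  // count must fit in 32 bits
    std::uint32_t n = static_cast<std::uint32_t>(payload.size());
    out.push_back(static_cast<char>((n >> 24) & 0xFF));
    out.push_back(static_cast<char>((n >> 16) & 0xFF));
    out.push_back(static_cast<char>((n >> 8) & 0xFF));
    out.push_back(static_cast<char>(n & 0xFF));
    out += payload;
}

// Read one chunk starting at pos.  On success, stores the payload and
// advances pos past it.  Returns false if the data is truncated or the
// count claims more bytes than actually remain.
bool read_chunk(const std::string &in, std::size_t &pos, std::string &payload) {
    if (pos > in.size() || in.size() - pos < 4) return false;
    std::uint32_t n = 0;
    for (int i = 0; i < 4; i++)
        n = (n << 8) | static_cast<unsigned char>(in[pos + i]);
    if (in.size() - pos - 4 < n) return false;  // never trust the count blindly
    payload = in.substr(pos + 4, n);
    pos += 4 + static_cast<std::size_t>(n);
    return true;
}
```

Wrapping an already-wrapped chunk with append_chunk adds exactly four more bytes per layer, no matter what the inner data contains.  Checking the count against the bytes that actually remain is my own addition here, since a count read from a file is attacker-controlled.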

This "byte count" approach is nearly universal in complicated media formats:

Endian-ness in Binary Data Exchange

One recurring issue in exchanging binary data is that there is no single standard for byte order or integer size.  x86 and ARM machines store data to memory starting with the little end, so 0x4321 gets stored as 0x21, then 0x43.  Other CPUs like PowerPC (by default; it's switchable) store data starting with the big end, which was once the standard, and so is called "network byte order".  I've programmed on machines where "int" is 2 bytes; most set it to 4 bytes for now, but there are hints of an eventual transition to 8 bytes.  This means you might send a binary "int", and get any of three sizes in either of two orders!  The fix is to just specify the format, such as "big-endian 32 bit unsigned integer", and then everybody will know what to use.
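
You can watch the byte order directly by copying an integer's bytes out of memory.  This is a small sketch; bytes_in_memory and host_is_little_endian are names invented for this example, and the result depends on the machine you run it on.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(std::uint16_t) == 2, "this sketch assumes 8-bit bytes");

// Copy the two bytes of v, in the order they sit in memory, into out[0..1].
void bytes_in_memory(std::uint16_t v, unsigned char *out) {
    assert(out != nullptr);
    std::memcpy(out, &v, 2);
}

// True if this machine stores the little end (low byte) first.
bool host_is_little_endian() {
    unsigned char b[2];
    bytes_in_memory(0x4321, b);
    return b[0] == 0x21;
}
```

On a little-endian machine, bytes_in_memory(0x4321, b) leaves b[0] == 0x21 and b[1] == 0x43; on a big-endian machine the two bytes come out swapped.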

How do you read a 32-bit big endian integer on a little endian machine?  There are several ways to handle byte order, such as ntohl() and htonl(), and other ways to handle sizes, such as uint32_t from <stdint.h>, but my favorite fix is to solve both at once by specifying the bytes manually, via a special class.

class Big32 { //Big-endian (network byte order) 32-bit integer
        typedef unsigned char byte;
        byte d[4]; //d[0] is the most significant byte
public:
        Big32() {}
        Big32(unsigned int i) { set(i); }
        //Widen each byte to unsigned before shifting, so d[0]<<24 cannot overflow a signed int
        operator unsigned int () const {
                return ((unsigned int)d[0]<<24)|((unsigned int)d[1]<<16)|((unsigned int)d[2]<<8)|(unsigned int)d[3];
        }
        unsigned int operator=(unsigned int i) { set(i); return i; }
        void set(unsigned int i) {
                d[0]=(byte)(i>>24); d[1]=(byte)(i>>16); d[2]=(byte)(i>>8); d[3]=(byte)i;
        }
};

The cool part about this class is it's always the correct size, and it never needs alignment padding, so you can stick together a long struct to represent the file's byte layout and it will match the file byte for byte.  My osl/socket.h header includes this class (and Big16) by default.
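
As a sketch of how such a struct might look, here is a made-up 8-byte header built from two Big32 fields.  DemoHeader, its field names, and parse_header are hypothetical, and the block carries its own trimmed-down copy of Big32 so it compiles by itself; it assumes unsigned int is at least 32 bits.

```cpp
#include <cassert>
#include <cstring>

// Trimmed copy of the Big32 class, repeated so this example stands alone.
class Big32 {
    unsigned char d[4];  // d[0] is the most significant byte
public:
    Big32() : d{0, 0, 0, 0} {}
    Big32(unsigned int i) { set(i); }
    operator unsigned int() const {
        return ((unsigned int)d[0] << 24) | ((unsigned int)d[1] << 16) |
               ((unsigned int)d[2] << 8) | (unsigned int)d[3];
    }
    void set(unsigned int i) {
        d[0] = (unsigned char)(i >> 24); d[1] = (unsigned char)(i >> 16);
        d[2] = (unsigned char)(i >> 8);  d[3] = (unsigned char)i;
    }
};

// Hypothetical file header: two big-endian 32-bit fields, 8 bytes on disk.
struct DemoHeader {
    Big32 magic;          // file type tag
    Big32 payload_bytes;  // byte count of the data that follows
};

// Make the "no padding" assumption explicit, so a surprising compiler
// fails at build time instead of silently misreading files.
static_assert(sizeof(DemoHeader) == 8, "DemoHeader must match the file layout");

// Copy raw file bytes straight into the struct; no byte swapping is
// needed because Big32 decodes itself when it is read as an integer.
DemoHeader parse_header(const unsigned char *raw) {
    assert(raw != nullptr);
    DemoHeader h;
    std::memcpy(&h, raw, sizeof(h));
    return h;
}
```

Given the raw bytes 12 34 56 78 00 00 01 00 (hex), parse_header yields magic == 0x12345678 and payload_bytes == 256 on both little-endian and big-endian machines.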