Reading and Writing Binary Files
CS 493 Lecture Notes, Dr. Lawlor
Quite a bit of computer security involves reading and writing raw
binary data files:
- Direct attack vectors like shellcode consist of binary machine
code.
- In malware reverse engineering, often we only have a binary
executable file to examine.
- In network intrusion detection, often the key evidence is a
recorded packet trace consisting of binary packet data.
- A huge variety of security flaws center on non-ASCII data handling, like embedded NUL characters or Unicode handling errors, that only make sense when the data is viewed in binary.
Hex Editing
A key tool in understanding binary files is a hex editor, which
works like a text editor but for binary files.
- shed
(sudo apt install shed) is a very simple UNIX console
application that can do basic hex editing. Its single
column format is a bit limiting, but it simultaneously shows
ASCII, hex, decimal, octal, and binary data. It also lets
you view the same bytes as 1, 2, or 4 byte values, in big or
little endian format. It's also just 1300 lines of code in
2 files, so it's very simple to modify.
There are also some standard UNIX tools for manipulating binary
files:
- strings
extracts all contiguous ASCII text strings from a binary file,
stripping out the binary garbage. It's handy for a first
pass at a binary file, just to see if the data you need is
easily extracted as plain text.
- od or Octal Dump is an ancient UNIX utility for showing binary files. I usually use the command below to dump both char (c) and hex 1-byte (x1) values while listing decimal addresses (-A d); a minimal C++ sketch of this kind of dump appears at the end of this section.
- od -t cx1 -A d < infile | less
- xxd
is most commonly used as the opposite of od, to automatically
reassemble binary data from hex, or insert a hex patch at a
known location.
- echo -n "61 62 63" | xxd -r -p
Executable files usually have their own analysis tools:
- objdump disassembles compiled code for your CPU.
- objdump -M intel -drC /bin/ls | less
- radare2
is a super complicated reverse engineering toolkit.
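If you want to see exactly what these tools are doing, it's easy to write a miniature dump yourself. Here is a minimal od/xxd-style sketch in C++ (the formatting choices are mine, not copied from any of the tools above): it prints a decimal offset, up to 16 hex bytes, and the printable ASCII for each line.
#include <cstdio>
#include <cctype>

int main(int argc, char *argv[]) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    std::FILE *f = std::fopen(argv[1], "rb"); // "rb": read raw bytes, no text translation
    if (!f) { std::perror(argv[1]); return 1; }

    unsigned char buf[16];
    size_t n, offset = 0;
    while ((n = std::fread(buf, 1, sizeof(buf), f)) > 0) {
        std::printf("%07zu ", offset);               // decimal address, like od -A d
        for (size_t i = 0; i < sizeof(buf); i++) {
            if (i < n) std::printf("%02x ", buf[i]); // hex byte, like od -t x1
            else       std::printf("   ");           // pad out a short final line
        }
        for (size_t i = 0; i < n; i++)
            std::putchar(std::isprint(buf[i]) ? buf[i] : '.'); // ASCII column, '.' for non-printables
        std::putchar('\n');
        offset += n;
    }
    std::fclose(f);
    return 0;
}
Compile it and run it on any binary file to compare its output against od or xxd.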
Object Formatting
There are a variety of situations where we need to communicate
entire objects:
- When doing any work across the network, both the
request and response are in general multi-part objects.
- When reading or writing files, we often have a variety
of data stored in an object.
- A filesystem or database is essentially just a big complicated
object with many separate pieces.
There are essentially only two ways to store objects: fixed-size
formats, and variable-size formats.
Fixed-Size Format                         | Variable-Size Format
------------------------------------------|-----------------------------------------------
Constant number of bytes (sizeof)         | Non-constant number of bytes
C++ builtin type, class or struct;        | C++ string, vector, map, list, ...;
most binary file headers (e.g., BMP);     | delimited ASCII data; JSON; XML
fixed-width ASCII records                 |
Very fast to allocate and deallocate      | Surprisingly slow to allocate and
                                          | deallocate (must use malloc or new)
Easy to allocate in a C++ array           | Cannot be directly stored in an array
                                          | (must use pointers or offsets)
Not extensible (tempted to squeeze bits)  | Can be extended (though you must plan ahead!)
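To make the contrast concrete, here is a small sketch of both flavors in C++; the field names and sizes are invented for illustration, not taken from any real file format.
#include <cstdint>
#include <string>

// Fixed-size record: sizeof is a compile-time constant, so it can be
// written with one fwrite, or packed into a plain array.
struct FixedRecord {
    uint32_t id;
    uint32_t length;
    char name[16];   // fixed width: longer names must be truncated
};
static_assert(sizeof(FixedRecord) == 24, "constant, known size"); // 4+4+16 bytes, no padding needed

// Variable-size record: can hold a name of any length, but the string
// lives on the heap, and writing it to a file needs a length prefix
// (or a delimiter) so the reader knows where the name ends.
struct VariableRecord {
    uint32_t id;
    std::string name;   // non-constant number of bytes
};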
Fixed-size integers lead to the frankly ridiculous problem of integer
overflow when you need more bits to represent the result than
are available in the integer size you have chosen. Running out
of room is the classic problem with fixed-size objects, and
encourages the hack of repurposing bits for
other uses, such as the "0x3FFF rowBytes problem" on classic
MacOS.
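Here is the overflow problem in miniature (the dimensions are made up): two values that each fit comfortably in 16 bits, whose product does not.
#include <cstdint>
#include <cstdio>

int main() {
    uint16_t width = 300, height = 300;
    uint16_t pixels16 = width * height;            // 90000 doesn't fit: wraps mod 65536 to 24464
    uint32_t pixels32 = (uint32_t)width * height;  // 90000, as intended
    std::printf("16-bit result: %u   32-bit result: %u\n",
                (unsigned)pixels16, (unsigned)pixels32);
    return 0;
}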
To make any variable-size format, you need to be able to mark the end of the thing you're currently reading or writing. But because you want to be able to send arbitrary bytes, you either need to encode the binary data as plain text:
- Hexadecimal uses two chars to represent each binary byte. This works everywhere, even where lowercase/uppercase handling gets broken, but it's space inefficient. (A small encode/decode sketch follows this list.)
- Base64 uses
blocks of four chars to represent blocks of three binary
bytes. This is more space efficient than hex, but uses a
larger character set.
- UTF-8 encodes the full Unicode character set (over a million possible code points) using variable-length byte sequences.
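As a concrete example of the first option, here is a small hex encode/decode sketch (the helper names hexEncode and hexDecode are my own, not a standard API): every byte, including NULs, becomes two harmless ASCII characters, so the encoded string can be delimited safely.
#include <cstdio>
#include <string>
#include <vector>

// Encode arbitrary bytes as two hex characters per byte.
std::string hexEncode(const std::vector<unsigned char> &data) {
    std::string out;
    char buf[3];
    for (unsigned char b : data) {
        std::snprintf(buf, sizeof(buf), "%02x", (unsigned)b);
        out += buf;
    }
    return out;
}

// Decode pairs of hex characters back into bytes.
std::vector<unsigned char> hexDecode(const std::string &hex) {
    std::vector<unsigned char> out;
    for (size_t i = 0; i + 1 < hex.size(); i += 2)
        out.push_back((unsigned char)std::stoi(hex.substr(i, 2), nullptr, 16));
    return out;
}

int main() {
    std::vector<unsigned char> raw = { 0x00, 'a', 0xff };  // a NUL and a non-ASCII byte
    std::string text = hexEncode(raw);                     // "0061ff": plain text, safe to delimit
    std::vector<unsigned char> back = hexDecode(text);
    std::printf("%s decodes back to %u bytes\n", text.c_str(), (unsigned)back.size());
    return 0;
}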
The alternative to encoding is marking the end with a special delimiter symbol. But this means you need a way to quote the end symbol so you can send a literal end symbol, and then you need a way to quote the quotes when you want to send actual quote characters. This means you immediately run into the exponential quoting growth problem for information nested several levels deep:
std::cout<<"Hello World!\n";
std::cout<<"std::cout<<\"Hello World!\\n\";\n";
std::cout<<"std::cout<<\"std::cout<<\\\"Hello World!\\\\n\\\";\\n\";\n";
std::cout<<"std::cout<<\"std::cout<<\\\"std::cout<<\\\\\\\"Hello World!\\\\\\\\n\\\\\\\";\\\\n\\\";\\n\"\n";
The solution to exponential quoting growth, and the speed and security flaws it invites, is to send a byte count, then arbitrary bytes. This means each additional nested layer just needs to add another byte count, and it doesn't care what data is being sent, so the data can contain its own inner structures, including byte counts, without causing problems.
This "byte count" approach is nearly universal in complicated media formats:
- AVI (RIFF) files are assembled from "chunks" that have a 4-byte type, and a 32-bit size field.
- QuickTime files are assembled from "atoms" that have a 4-byte type, a 32-bit size field, and a 32-bit integer ID.
- TIFF files use a 16-bit tag identifier, a 16-bit data type field, a 32-bit count field, and a 32-bit file offset field (all the tags are stored together at the start of the file, so it's fast to skim the file structure, but you need the offset to find the actual data).
- MPEG-4 (ISO) files are assembled from "boxes" that have 32-bit type and size fields, with options for a 128-bit UUID or a 64-bit size field.
- Filesystems are assembled from "files" that have a string name, a length field, access permissions, and other metadata such as access times.
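For instance, here is a sketch of the chunk-walking idea for a flat stream of RIFF-style [4-byte type][32-bit little-endian size][payload] chunks. Real RIFF/AVI files nest chunks inside one outer "RIFF" chunk, so a real parser has to recurse; the point here is just that the byte count lets you skip any payload without understanding it.
#include <cstdint>
#include <cstdio>

int main(int argc, char *argv[]) {
    if (argc < 2) return 1;
    std::FILE *f = std::fopen(argv[1], "rb");
    if (!f) return 1;

    unsigned char hdr[8];
    while (std::fread(hdr, 1, 8, f) == 8) {
        // Assemble the 32-bit size by hand so the code doesn't depend on
        // the CPU's own byte order (RIFF sizes are little endian).
        uint32_t size = (uint32_t)hdr[4] | ((uint32_t)hdr[5] << 8)
                      | ((uint32_t)hdr[6] << 16) | ((uint32_t)hdr[7] << 24);
        std::printf("chunk '%c%c%c%c', %u payload bytes\n",
                    hdr[0], hdr[1], hdr[2], hdr[3], (unsigned)size);
        // The byte count is all we need to skip the payload.
        // (RIFF pads each chunk to an even number of bytes.)
        if (std::fseek(f, (long)(size + (size & 1)), SEEK_CUR) != 0) break;
    }
    std::fclose(f);
    return 0;
}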
Endian-ness in Binary Data Exchange
One recurring issue in exchanging binary data is that machines don't agree on byte order or integer sizes. x86 and ARM machines store data to memory starting with the little end, so the 16-bit value 0x4321 gets stored as 0x21, then 0x43. Other CPUs like PowerPC (by default; it's switchable) store data starting with the big end, which is the traditional standard for network protocols and so is called "network byte order". I've programmed on machines where "int" is 2 bytes; most compilers set it to 4 bytes for now, but there are hints of an eventual transition to 8 bytes. This means you might send a binary "int" and get any of three sizes in either of two byte orders! The fix is to specify the format exactly, such as "big-endian 32-bit unsigned integer", and then everybody knows what to use.
How do you read a 32-bit big-endian integer on a little-endian machine? There are several ways to handle byte order, such as htonl()/ntohl(), and other ways to handle sizes, such as <stdint.h>'s uint32_t, but my favorite fix solves both at once: specify the bytes manually, via a special class.
class Big32 { // Big-endian (network byte order) 32-bit integer
    typedef unsigned char byte;
    byte d[4]; // raw bytes, big end first: always 4 bytes, no padding
public:
    Big32() {}
    Big32(unsigned int i) { set(i); }
    // Read: reassemble the value from the bytes, regardless of CPU byte order.
    operator unsigned int () const {
        return ((unsigned int)d[0]<<24) | ((unsigned int)d[1]<<16) | ((unsigned int)d[2]<<8) | d[3];
    }
    unsigned int operator=(unsigned int i) { set(i); return i; }
    // Write: split the value into bytes, big end first.
    void set(unsigned int i) {
        d[0]=(byte)(i>>24);
        d[1]=(byte)(i>>16);
        d[2]=(byte)(i>>8);
        d[3]=(byte)i;
    }
};
The cool part about this class is that it's always exactly the right size and never needs alignment padding, so you can stick Big32 fields together in a long struct that mirrors the file's layout, and it will match the file byte for byte. My osl/socket.h header includes this class (and Big16) by default.
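For example, here is a hypothetical 12-byte header built from Big32 fields (the layout and field names are invented for illustration, and this assumes the Big32 class above, or osl/socket.h, is in scope). The struct can be memcpy'd or fread() straight out of the file and the fields just work, on any CPU.
#include <cstdio>
#include <cstring>

struct DemoHeader {          // hypothetical layout: 12 bytes, no padding
    Big32 magic;             // file type identifier
    Big32 version;
    Big32 payloadBytes;
};

int main() {
    unsigned char raw[12] = {
        0xCA,0xFE,0xBA,0xBE,    // magic, big end first
        0x00,0x00,0x00,0x02,    // version 2
        0x00,0x00,0x10,0x00 };  // 0x1000 = 4096 payload bytes

    DemoHeader h;
    std::memcpy(&h, raw, sizeof(h));  // sizeof(DemoHeader) == 12, matching the file byte for byte
    std::printf("magic %08x  version %u  payload %u bytes\n",
                (unsigned)h.magic, (unsigned)h.version, (unsigned)h.payloadBytes);
    return 0;
}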