Reading and Writing Binary Files
CS 493 Lecture Notes, Dr. Lawlor
Quite a bit of computer security involves reading and writing raw
binary data files:
- Direct attack vectors like shellcode consist of binary machine
code.
- In malware reverse engineering, often we only have a binary
executable file to examine.
- In network intrusion detection, often the key evidence is a
recorded packet trace consisting of binary packet data.
- A huge variety of security flaws center on non-ASCII data handling, like embedded NUL characters or Unicode handling errors, that only make sense when the data is viewed in binary.
Hex Editing
A key tool in understanding binary files is a hex editor, which
works like a text editor but for binary files.
- shed
(sudo apt install shed) is a very simple UNIX console
application that can do basic hex editing. Its single
column format is a bit limiting, but it simultaneously shows
ASCII, hex, decimal, octal, and binary data. It also lets
you view the same bytes as 1, 2, or 4 byte values, in big or
little endian format. It's also just 1300 lines of code in
2 files, so it's very simple to modify.
There are also some standard UNIX tools for manipulating binary
files:
- strings
extracts all contiguous ASCII text strings from a binary file,
stripping out the binary garbage. It's handy for a first
pass at a binary file, just to see if the data you need is
easily extracted as plain text.
- od or Octal Dump is an ancient UNIX utility for showing binary files. I usually use the command below to dump both char (c) and hex 1-byte (x1) values while listing decimal addresses (-A d); a minimal C++ sketch of this kind of dump appears at the end of this section.
- od -t cx1 -A d < infile | less
- xxd
is most commonly used as the opposite of od, to automatically
reassemble binary data from hex, or insert a hex patch at a
known location.
- echo -n "61 62 63" | xxd -r -p
Executable files usually have their own analysis tools:
- objdump disassembles compiled code for your CPU.
- objdump -M intel -drC /bin/ls | less
- radare2
is a super complicated reverse engineering toolkit.
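If you want to see exactly what these tools are doing, it's easy to write a miniature dump yourself. Here is a minimal od/xxd-style sketch in C++ (the formatting choices are mine, not copied from any of the tools above): it prints a decimal offset, up to 16 hex bytes, and the printable ASCII for each line.
#include <cstdio>
#include <cctype>

int main(int argc, char *argv[]) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    std::FILE *f = std::fopen(argv[1], "rb"); // "rb": read raw bytes, no text translation
    if (!f) { std::perror(argv[1]); return 1; }

    unsigned char buf[16];
    size_t n, offset = 0;
    while ((n = std::fread(buf, 1, sizeof(buf), f)) > 0) {
        std::printf("%07zu ", offset);               // decimal address, like od -A d
        for (size_t i = 0; i < sizeof(buf); i++) {
            if (i < n) std::printf("%02x ", buf[i]); // hex byte, like od -t x1
            else       std::printf("   ");           // pad out a short final line
        }
        for (size_t i = 0; i < n; i++)
            std::putchar(std::isprint(buf[i]) ? buf[i] : '.'); // ASCII column, '.' for non-printables
        std::putchar('\n');
        offset += n;
    }
    std::fclose(f);
    return 0;
}
Compile it and run it on any binary file to compare its output against od or xxd.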
Object Formatting
There are a variety of situations where we need to communicate
entire objects:
- When doing any work across the network, both the
request and response are in general multi-part objects.
- When reading or writing files, we often have a variety
of data stored in an object.
- A filesystem or database is essentially just a big complicated
object with many separate pieces.
There are essentially only two ways to store objects: fixed-size
formats, and variable-size formats.
Fixed-Size Format                         | Variable-Size Format
------------------------------------------|-----------------------------------------------
Constant number of bytes (sizeof)         | Non-constant number of bytes
C++ builtin type, class or struct;        | C++ string, vector, map, list, ...;
most binary file headers (e.g., BMP);     | delimited ASCII data; JSON; XML
fixed-width ASCII records                 |
Very fast to allocate and deallocate      | Surprisingly slow to allocate and
                                          | deallocate (must use malloc or new)
Easy to allocate in a C++ array           | Cannot be directly stored in an array
                                          | (must use pointers or offsets)
Not extensible (tempted to squeeze bits)  | Can be extended (though you must plan ahead!)
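To make the contrast concrete, here is a small sketch of both flavors in C++; the field names and sizes are invented for illustration, not taken from any real file format.
#include <cstdint>
#include <string>

// Fixed-size record: sizeof is a compile-time constant, so it can be
// written with one fwrite, or packed into a plain array.
struct FixedRecord {
    uint32_t id;
    uint32_t length;
    char name[16];   // fixed width: longer names must be truncated
};
static_assert(sizeof(FixedRecord) == 24, "constant, known size"); // 4+4+16 bytes, no padding needed

// Variable-size record: can hold a name of any length, but the string
// lives on the heap, and writing it to a file needs a length prefix
// (or a delimiter) so the reader knows where the name ends.
struct VariableRecord {
    uint32_t id;
    std::string name;   // non-constant number of bytes
};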
Fixed-size integers lead to the frankly ridiculous problem of integer
overflow when you need more bits to represent the result than
are available in the integer size you have chosen. Running out
of room is the classic problem with fixed-size objects, and
encourages the hack of repurposing bits for
other uses, such as the "0x3FFF rowBytes problem" on classic
MacOS.
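Here is the overflow problem in miniature (the dimensions are made up): two values that each fit comfortably in 16 bits, whose product does not.
#include <cstdint>
#include <cstdio>

int main() {
    uint16_t width = 300, height = 300;
    uint16_t pixels16 = width * height;            // 90000 doesn't fit: wraps mod 65536 to 24464
    uint32_t pixels32 = (uint32_t)width * height;  // 90000, as intended
    std::printf("16-bit result: %u   32-bit result: %u\n",
                (unsigned)pixels16, (unsigned)pixels32);
    return 0;
}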
To make any variable-size format, you need to be able to mark the end of the thing you're currently reading or writing. But because you want to be able to send arbitrary bytes, you either need to encode the binary data as plain text:
- Hexadecimal uses two chars to represent each binary byte. This works everywhere, even where lowercase/uppercase handling gets broken, but it's space inefficient. (A small encode/decode sketch follows this list.)
- Base64 uses
blocks of four chars to represent blocks of three binary
bytes. This is more space efficient than hex, but uses a
larger character set.
- UTF-8 encodes the full Unicode character set (over a million possible code points) using variable-length byte sequences.
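As a concrete example of the first option, here is a small hex encode/decode sketch (the helper names hexEncode and hexDecode are my own, not a standard API): every byte, including NULs, becomes two harmless ASCII characters, so the encoded string can be delimited safely.
#include <cstdio>
#include <string>
#include <vector>

// Encode arbitrary bytes as two hex characters per byte.
std::string hexEncode(const std::vector<unsigned char> &data) {
    std::string out;
    char buf[3];
    for (unsigned char b : data) {
        std::snprintf(buf, sizeof(buf), "%02x", (unsigned)b);
        out += buf;
    }
    return out;
}

// Decode pairs of hex characters back into bytes.
std::vector<unsigned char> hexDecode(const std::string &hex) {
    std::vector<unsigned char> out;
    for (size_t i = 0; i + 1 < hex.size(); i += 2)
        out.push_back((unsigned char)std::stoi(hex.substr(i, 2), nullptr, 16));
    return out;
}

int main() {
    std::vector<unsigned char> raw = { 0x00, 'a', 0xff };  // a NUL and a non-ASCII byte
    std::string text = hexEncode(raw);                     // "0061ff": plain text, safe to delimit
    std::vector<unsigned char> back = hexDecode(text);
    std::printf("%s decodes back to %u bytes\n", text.c_str(), (unsigned)back.size());
    return 0;
}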
The alternative to encoding is marking the end with a special delimiter symbol. But this means you need a way to quote the end symbol so you can send a literal end symbol, and then you need a way to quote the quotes when you want to send actual quote characters. This means you immediately run into the exponential quoting growth problem for information nested several levels deep:
std::cout<<"Hello World!\n";
std::cout<<"std::cout<<\"Hello World!\\n\";\n";
std::cout<<"std::cout<<\"std::cout<<\\\"Hello World!\\\\n\\\";\\n\";\n";
std::cout<<"std::cout<<\"std::cout<<\\\"std::cout<<\\\\\\\"Hello World!\\\\\\\\n\\\\\\\";\\\\n\\\";\\n\"\n";
The solution to exponential quoting growth, and the speed and security flaws it invites, is to send a byte count, then arbitrary bytes. This means each additional nested layer just needs to add another byte count, and it doesn't care what data is being sent, so the data can contain its own inner structures, including byte counts, without causing problems.
This "byte count" approach is nearly universal in complicated media formats:
- AVI (RIFF) files are assembled from "chunks" that have a 4-byte type, and a 32-bit size field.
- QuickTime files are assembled from "atoms" that have a 4-byte type, a 32-bit size field, and a 32-bit integer ID.
- TIFF files use a 16-bit tag identifier, a 16-bit data type field, a 32-bit count field, and a 32-bit file offset field (all the tags are stored together at the start of the file, so it's fast to skim the file structure, but you need the offset to find the actual data).
- MPEG-4 (ISO) files are assembled from "boxes" that have 32-bit type and size fields, with options for a 128-bit UUID or a 64-bit size field.
- Filesystems are assembled from "files" that have a string name, a length field, access permissions, and other metadata such as access times.
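For instance, here is a sketch of the chunk-walking idea for a flat stream of RIFF-style [4-byte type][32-bit little-endian size][payload] chunks. Real RIFF/AVI files nest chunks inside one outer "RIFF" chunk, so a real parser has to recurse; the point here is just that the byte count lets you skip any payload without understanding it.
#include <cstdint>
#include <cstdio>

int main(int argc, char *argv[]) {
    if (argc < 2) return 1;
    std::FILE *f = std::fopen(argv[1], "rb");
    if (!f) return 1;

    unsigned char hdr[8];
    while (std::fread(hdr, 1, 8, f) == 8) {
        // Assemble the 32-bit size by hand so the code doesn't depend on
        // the CPU's own byte order (RIFF sizes are little endian).
        uint32_t size = (uint32_t)hdr[4] | ((uint32_t)hdr[5] << 8)
                      | ((uint32_t)hdr[6] << 16) | ((uint32_t)hdr[7] << 24);
        std::printf("chunk '%c%c%c%c', %u payload bytes\n",
                    hdr[0], hdr[1], hdr[2], hdr[3], (unsigned)size);
        // The byte count is all we need to skip the payload.
        // (RIFF pads each chunk to an even number of bytes.)
        if (std::fseek(f, (long)(size + (size & 1)), SEEK_CUR) != 0) break;
    }
    std::fclose(f);
    return 0;
}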
Endian-ness in Binary Data Exchange
One recurring issue in exchanging binary data is that machines don't agree on byte order or integer sizes. x86 and ARM machines store data to memory starting with the little end, so the 16-bit value 0x4321 gets stored as 0x21, then 0x43. Other CPUs like PowerPC (by default; it's switchable) store data starting with the big end, which is the traditional standard for network protocols and so is called "network byte order". I've programmed on machines where "int" is 2 bytes; most compilers set it to 4 bytes for now, but there are hints of an eventual transition to 8 bytes. This means you might send a binary "int" and get any of three sizes in either of two byte orders! The fix is to specify the format exactly, such as "big-endian 32-bit unsigned integer", and then everybody knows what to use.
How do you read a 32-bit big-endian integer on a little-endian machine? There are several ways to handle byte order, such as htonl()/ntohl(), and other ways to handle sizes, such as <stdint.h>'s uint32_t, but my favorite fix solves both at once: specify the bytes manually, via a special class.
class Big32 { // Big-endian (network byte order) 32-bit integer
    typedef unsigned char byte;
    byte d[4]; // raw bytes, big end first: always 4 bytes, no padding
public:
    Big32() {}
    Big32(unsigned int i) { set(i); }
    // Read: reassemble the value from the bytes, regardless of CPU byte order.
    operator unsigned int () const {
        return ((unsigned int)d[0]<<24) | ((unsigned int)d[1]<<16) | ((unsigned int)d[2]<<8) | d[3];
    }
    unsigned int operator=(unsigned int i) { set(i); return i; }
    // Write: split the value into bytes, big end first.
    void set(unsigned int i) {
        d[0]=(byte)(i>>24);
        d[1]=(byte)(i>>16);
        d[2]=(byte)(i>>8);
        d[3]=(byte)i;
    }
};
The cool part about this class is that it's always exactly the right size and never needs alignment padding, so you can stick Big32 fields together in a long struct that mirrors the file's layout, and it will match the file byte for byte. My osl/socket.h header includes this class (and Big16) by default.
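For example, here is a hypothetical 12-byte header built from Big32 fields (the layout and field names are invented for illustration, and this assumes the Big32 class above, or osl/socket.h, is in scope). The struct can be memcpy'd or fread() straight out of the file and the fields just work, on any CPU.
#include <cstdio>
#include <cstring>

struct DemoHeader {          // hypothetical layout: 12 bytes, no padding
    Big32 magic;             // file type identifier
    Big32 version;
    Big32 payloadBytes;
};

int main() {
    unsigned char raw[12] = {
        0xCA,0xFE,0xBA,0xBE,    // magic, big end first
        0x00,0x00,0x00,0x02,    // version 2
        0x00,0x00,0x10,0x00 };  // 0x1000 = 4096 payload bytes

    DemoHeader h;
    std::memcpy(&h, raw, sizeof(h));  // sizeof(DemoHeader) == 12, matching the file byte for byte
    std::printf("magic %08x  version %u  payload %u bytes\n",
                (unsigned)h.magic, (unsigned)h.version, (unsigned)h.payloadBytes);
    return 0;
}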