Binary vs. ASCII Communication Protocols for Cryptography

CS 463 Lecture, Dr. Lawlor

There are generally two types of communication protocol in the world:

ASCII-style protocols let you know when things are done with a special character, like a newline.
Binary-style protocols have fixed-size pieces, like a 4-byte binary integer.

It's informative to look at HTTP for this:

The header (variable-length, extensible, semi-structured, small in size) is ASCII. But if you want to send a strange character in the header, you need to URL-encode (%7E) it.
The content (often fixed-length, of uniform data type, large in size) is binary. The length is stored in the "Content-Length" field from the ASCII section. If you want to send a strange character in the content, you just send it.
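Here's a minimal sketch of how a receiver might combine the two styles: scan the ASCII header line by line for the terminator, then take the body as a counted block of raw bytes. The function name extract_body is made up for illustration, and the sketch assumes the whole message is already in memory, that lines end in "\r\n", and that the field is spelled exactly "Content-Length: " (real HTTP field names are case-insensitive).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from any library): pull the binary body out of a
// complete HTTP-style message, using the ASCII header only to find its length.
// Returns false if the header is malformed or the body is shorter than claimed.
bool extract_body(const std::string &msg, std::string &body)
{
	const std::string field = "Content-Length: ";
	std::size_t pos = 0, length = 0;
	bool have_length = false;
	while (true) {
		// ASCII side: we have to search for the line terminator.
		std::size_t eol = msg.find("\r\n", pos);
		if (eol == std::string::npos) return false; // header never ended
		std::string line = msg.substr(pos, eol - pos);
		pos = eol + 2;
		if (line.empty()) break; // blank line: end of header
		if (line.size() > field.size() &&
		    line.compare(0, field.size(), field) == 0) {
			length = 0;
			for (std::size_t i = field.size(); i < line.size(); i++) {
				if (line[i] < '0' || line[i] > '9') return false;
				length = length * 10 + (std::size_t)(line[i] - '0');
			}
			have_length = true;
		}
	}
	// Binary side: no scanning and no escapes, just take 'length' raw bytes.
	if (!have_length || msg.size() - pos < length) return false;
	body = msg.substr(pos, length);
	return true;
}
```

Note that the body can contain newlines or zero bytes without confusing anything, because nothing ever searches it for a terminator.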

Plain old line-oriented ASCII works great for most stuff:

It's easy for people to read and debug. You don't need any special tools or skills for this.
It's supported well by every computer language ever created.
It's easy to extend--for example, you can add a new "Content-Color: green\n" or "Classified-Level: taupe secret\n" field (e.g., to an HTTP header). Numbers, in particular, extend very smoothly in ASCII from something like 6 to a giant 4096-bit RSA prime. Extending binary protocols is usually harder, sometimes impossible if the designer didn't plan the protocol well.

But there are problems with ASCII communication:

It's not always space efficient. For example, 2^32-1 is 10 bytes in decimal ASCII (4294967295), 8 bytes in hex (ffffffff), but only 4 bytes as raw binary data. (There are counterexamples to this. We once converted an ASCII file of numbers to binary, thinking it'd be smaller. But the numbers were mostly single-digit values like 0 or 3, so most values were only 2 bytes including the separating space. This made the file *bigger* in binary, so we actually switched back.)
It's not very time efficient, because you need to look at each byte (e.g., to see if it's the newline). This means there's no parallelism, and the I/O batch size is a tiny 1 byte at a time. Binary communication, by contrast, normally occurs in blocks of size >1 byte, so you minimize block setup cost, and can send or process parts of the block in parallel.
It's not easy to send some values. For example, a newline ends the HTTP "GET" header, but for NetRun I'd like to put the code (including newlines) into the URL.
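To check the space-efficiency point for yourself, it's easy to count the bytes each representation needs. This is only a sketch; the three function names are invented here, and the ASCII sizes don't include the separator (space or newline) that a real ASCII protocol would also have to send.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Bytes needed to send v as decimal ASCII digits (separator not counted).
std::size_t decimal_size(std::uint32_t v)
{
	return std::to_string(v).size();
}

// Bytes needed to send v as hex ASCII digits (separator not counted).
std::size_t hex_size(std::uint32_t v)
{
	char buf[16];
	return (std::size_t)std::snprintf(buf, sizeof(buf), "%x", (unsigned)v);
}

// Bytes needed to send v as a fixed-width raw binary integer:
// always the same, no matter how small the value is.
std::size_t binary_size(std::uint32_t)
{
	return sizeof(std::uint32_t);
}
```

Comparing decimal_size(v)+1 against binary_size(v) for typical values of v in your data is a quick way to find out which side of the tradeoff you're on before converting a file format.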

Most ASCII protocols have some form of 'escape code' for encoding disallowed values like field separators in the data:

Escape style	Example	Specification	Meta-usage
Backslash Escapes in C, C++, Java, etc	\n	Backslash followed by: n (meaning 0x0A), r (meaning 0x0D), t (meaning tab), " (meaning literal quote), \ (meaning literal backslash), x<hex digit><hex digit><NON hex digit>, or 0-7<octal digit><octal digit>	"\\\"Hello\\\""
URL Encoding	%2F	%<hex digit><hex digit>	%2525
Ampersand Encoding in HTML	&lt;	&<short string>;	&amp;
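As a concrete instance of the URL-encoding row, here's a minimal percent-encoder and decoder. The function names are invented for this sketch, and the set of bytes passed through unescaped (letters, digits, and "-_.~") is the "unreserved" set from RFC 3986; a real application may need a different set depending on which part of the URL it's building.

```cpp
#include <cstddef>
#include <string>

// Bytes allowed through unescaped: the RFC 3986 "unreserved" set.
bool is_unreserved(unsigned char c)
{
	return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
	       (c >= '0' && c <= '9') ||
	       c == '-' || c == '_' || c == '.' || c == '~';
}

// Replace every other byte with %<hex digit><hex digit>.
std::string url_encode(const std::string &in)
{
	static const char hex[] = "0123456789ABCDEF";
	std::string out;
	for (unsigned char c : in) {
		if (is_unreserved(c)) { out += (char)c; continue; }
		out += '%';
		out += hex[c >> 4];
		out += hex[c & 15];
	}
	return out;
}

// Value of one hex digit, or -1 if c is not a hex digit.
int hex_value(char c)
{
	if (c >= '0' && c <= '9') return c - '0';
	if (c >= 'a' && c <= 'f') return c - 'a' + 10;
	if (c >= 'A' && c <= 'F') return c - 'A' + 10;
	return -1;
}

// Undo exactly one level of encoding.
// Returns false on a truncated or non-hex escape.
bool url_decode(const std::string &in, std::string &out)
{
	out.clear();
	for (std::size_t i = 0; i < in.size(); i++) {
		if (in[i] != '%') { out += in[i]; continue; }
		if (in.size() - i < 3) return false; // escape cut off
		int hi = hex_value(in[i + 1]), lo = hex_value(in[i + 2]);
		if (hi < 0 || lo < 0) return false;
		out += (char)(hi * 16 + lo);
		i += 2;
	}
	return true;
}
```

The escape character itself has to be escaped, which is where the meta-usage column comes from: encoding the already-encoded text "%25" gives "%2525", and it takes two decode passes to get back to a bare "%".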

Generally, the hardest thing is to encode strings from the language in which you're doing the encoding--for example, escaping the HTML for a web page so you can build a web page to explain how to write web pages, or correctly escaping the code for the parser generator that will get unescaped first to generate the compiler and then again to generate the actual source code.

The traditional approach for escapes is to leave them out entirely on the first version, then say "oops" repeatedly as various cases arise requiring the value to be protected. These aren't just complicated and tricky to program and maintain--they're often serious security holes. For example, you might security check for certain bad characters like quotes before inserting a string into SQL, and the check might find none at the firewall level. But then some intermediate stage might decode a URL-encoded or backslash-encoded value, making a quote appear and thus allowing an SQL injection attack (and note the variety of responses for how to simply escape a quote into SQL). An amazing variety of these, including things like "../" or "\.\.\/" or "\x2E\x2E\x2F" (and Unicode variants), can allow attackers to manipulate filenames to point outside an intended area, allowing system files to be read or even overwritten. Some buffer overflows only occur due to misguided attempts to encode or decode bad strings at the wrong moment. Generally, best practice has converged on only allowing known-good characters through instead of trying to filter out all the bad ones.
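A known-good filter can be very short. This sketch checks a filename against an allow-list; the function name and the specific policy (letters, digits, underscore, dash, and dot, at most 64 bytes, no leading dot) are assumptions made up for illustration, not a standard.

```cpp
#include <string>

// Hypothetical policy: accept a name only if every byte is on the
// known-good list. Anything not listed is rejected, including slashes,
// backslashes, percent signs, and all non-ASCII bytes.
bool is_safe_filename(const std::string &name)
{
	if (name.empty() || name.size() > 64) return false;
	if (name[0] == '.') return false; // rules out ".", "..", and hidden files
	for (unsigned char c : name) {
		bool ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
		          (c >= '0' && c <= '9') ||
		          c == '_' || c == '-' || c == '.';
		if (!ok) return false;
	}
	return true;
}
```

Because the escape characters themselves ('\', '%', '&') aren't on the list, an encoded "../" gets rejected whether or not some later stage would have decoded it. The check still belongs as close as possible to the point where the name is actually used, after all decoding is finished.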

By contrast, sending a byte count followed by arbitrary data requires no escapes, and it's much easier to get a high-performance and reliable system with no special cases that always works. A typical usage would be:

	// ... somehow get a length len ...
	std::vector<char> buf(len); // room for exactly len bytes of data
	skt_recvN(s,buf.data(),len); // read exactly len bytes from socket s
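To make the "somehow get a length" step concrete, here's a sketch of one simple framing scheme: a 4-byte big-endian byte count followed by the raw data. It works on an in-memory byte string rather than a socket so it can be run standalone; the names frame_encode and frame_decode, and the choice of a 4-byte big-endian count, are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Frame = 4-byte big-endian byte count, then that many raw bytes.
// Assumes the payload is smaller than 4 GiB.
std::string frame_encode(const std::string &payload)
{
	std::uint32_t len = (std::uint32_t)payload.size();
	std::string out;
	for (int shift = 24; shift >= 0; shift -= 8)
		out += (char)((len >> shift) & 0xFF);
	out += payload; // arbitrary bytes, no escaping needed
	return out;
}

// Read one frame starting at offset pos in stream. On success, store the
// data in payload, advance pos past the frame, and return true.
// Return false (leaving pos alone) if the frame is incomplete.
bool frame_decode(const std::string &stream, std::size_t &pos,
                  std::string &payload)
{
	if (pos > stream.size() || stream.size() - pos < 4) return false;
	std::uint32_t len = 0;
	for (int i = 0; i < 4; i++)
		len = (len << 8) | (unsigned char)stream[pos + i];
	if (stream.size() - pos - 4 < len) return false; // data cut off
	payload = stream.substr(pos + 4, len);
	pos += 4 + (std::size_t)len;
	return true;
}
```

The decoder never looks inside the payload, so newlines, quotes, and zero bytes all pass through untouched. In a real server you'd also want to cap len before allocating, since the count comes from the other side.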

You can have a fixed size for the len field itself (e.g., always 4 or 8 bytes), although then you need to do something special for exchanges bigger than that field can count. DER, for all its flaws, uses an interesting variable-size encoding that starts with a single byte:

If the high bit is clear, the low 7 bits give the length directly. For example, 0x07 means the data is 7 bytes long.
If the high bit is set, the low 7 bits give the length *of the length*, in bytes. For example, 0x84 would be followed by a 4-byte length (most significant byte first), followed by the actual data.
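The two rules above can be sketched in a few lines. The function names are invented here, and this is not a full DER parser: it handles only the length field, limits lengths to 64 bits, and the decoder doesn't verify that the sender used the shortest possible form (which strict DER requires).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode len using the shortest form: one byte if it fits in 7 bits,
// otherwise 0x80|n followed by n big-endian length bytes.
std::vector<std::uint8_t> der_encode_length(std::uint64_t len)
{
	std::vector<std::uint8_t> out;
	if (len < 0x80) { // high bit clear: the byte is the length
		out.push_back((std::uint8_t)len);
		return out;
	}
	int nbytes = 0; // how many bytes the length itself needs
	for (std::uint64_t t = len; t != 0; t >>= 8) nbytes++;
	out.push_back((std::uint8_t)(0x80 | nbytes));
	for (int i = nbytes - 1; i >= 0; i--)
		out.push_back((std::uint8_t)(len >> (8 * i)));
	return out;
}

// Decode a length starting at in[pos]. On success, store it in len,
// advance pos past the length field, and return true.
bool der_decode_length(const std::vector<std::uint8_t> &in,
                       std::size_t &pos, std::uint64_t &len)
{
	if (pos >= in.size()) return false;
	std::uint8_t first = in[pos];
	if ((first & 0x80) == 0) { len = first; pos += 1; return true; }
	std::size_t nbytes = first & 0x7F; // length of the length
	if (nbytes == 0 || nbytes > 8) return false; // not handled in this sketch
	if (in.size() - pos - 1 < nbytes) return false; // length field cut off
	len = 0;
	for (std::size_t i = 0; i < nbytes; i++)
		len = (len << 8) | in[pos + 1 + i];
	pos += 1 + nbytes;
	return true;
}
```

Short items pay only one byte of overhead, while huge items are still representable, which is the same smooth extension that ASCII numbers get for free.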

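UTF-8 uses a similar trick to send variable-length Unicode characters over an 8-bit communication channel:

If the high bit of a byte is clear, the low 7 bits are just plain low ASCII.
If the high bit of the first byte is set, the number of leading 1 bits in that byte gives the total number of bytes in the Unicode character (2 to 4). The remaining bytes of the character each start with the bits 10, so they can't be mistaken for ASCII or for the start of a new character.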
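Here's a sketch of both directions of the UTF-8 trick: figuring out a character's length from its first byte, and encoding a code point into bytes. The function names are invented for illustration, and the encoder assumes its input is a valid code point (at most 0x10FFFF, not a surrogate) without checking.

```cpp
#include <cstdint>
#include <string>

// Total bytes in the character introduced by this first byte,
// or 0 if the byte can't start a character.
int utf8_sequence_length(unsigned char lead)
{
	if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: plain ASCII
	if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
	if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
	if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
	return 0; // 10xxxxxx (a continuation byte) or invalid
}

// Encode one code point as UTF-8. Assumes cp is a valid code point.
std::string utf8_encode(std::uint32_t cp)
{
	std::string out;
	if (cp < 0x80) {
		out += (char)cp;
	} else if (cp < 0x800) {
		out += (char)(0xC0 | (cp >> 6));
		out += (char)(0x80 | (cp & 0x3F));
	} else if (cp < 0x10000) {
		out += (char)(0xE0 | (cp >> 12));
		out += (char)(0x80 | ((cp >> 6) & 0x3F));
		out += (char)(0x80 | (cp & 0x3F));
	} else {
		out += (char)(0xF0 | (cp >> 18));
		out += (char)(0x80 | ((cp >> 12) & 0x3F));
		out += (char)(0x80 | ((cp >> 6) & 0x3F));
		out += (char)(0x80 | (cp & 0x3F));
	}
	return out;
}
```

Unlike the DER length byte, the count here is stored in unary (one leading 1 bit per byte), which leaves the rest of the first byte free to carry data bits.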