Binary vs. ASCII Communication Protocols for Cryptography

CS 463 Lecture, Dr. Lawlor

There are generally two types of communication protocol in the world: human-readable ASCII text, and raw binary.
It's informative to look at HTTP for this:
Plain old line-oriented ASCII works great for most stuff:
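For instance, a minimal HTTP/1.1 exchange (a hand-written illustration, not captured traffic) is nothing but lines of ASCII text, each terminated by CRLF:

```
GET /index.html HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 5

hello
```

You can type this into a telnet session and read the response by eye, which is a big part of why ASCII protocols are so popular for debugging.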
But there are problems with ASCII communication: what happens when the data itself contains the bytes the protocol uses as delimiters, like newlines or field separators?
Most ASCII protocols have some form of 'escape code' for encoding disallowed values like field separators in the data:

Backslash Escapes
(in C, C++, Java, etc.)
	\n (meaning 0x0A, newline)
	\r (meaning 0x0D, carriage return)
	\t (meaning tab)
	\" (meaning a literal quote)
	\\ (meaning a literal backslash)
	\x<hex digit><hex digit> (the escape ends at the first non-hex digit)
	\<octal digit 0-7><octal digit><octal digit>
URL Encoding
	%<hex digit><hex digit>
Ampersand Encoding
	&<short string>; (e.g., &amp; for a literal ampersand)
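As a concrete sketch (my own illustration, not code from the lecture), URL encoding can be implemented by passing through known-good characters and percent-encoding every other byte:

```cpp
#include <string>
#include <cctype>
#include <cassert>

// Percent-encode every byte except the "unreserved" ASCII characters:
// letters, digits, and - _ . ~ pass through unchanged.
std::string url_encode(const std::string &s) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s) {
        if (std::isalnum(c) || c=='-' || c=='_' || c=='.' || c=='~') {
            out += (char)c;       // known-good character: pass through
        } else {
            out += '%';           // everything else becomes %<hex><hex>
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}
```

Note the allowlist structure: rather than hunting for bad characters, anything not explicitly known-good gets encoded.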

Generally, the hardest thing is to encode strings in the same language you're doing the encoding in--for example, escaping the HTML for a web page that explains how to write web pages, or correctly escaping the code for a parser generator, which gets unescaped once to generate the compiler and then again to generate the actual source code.

The traditional approach to escapes is to leave them out entirely in the first version, then say "oops" repeatedly as various cases arise requiring a value to be protected.  These fixes aren't just complicated and tricky to program and maintain--they're often serious security holes.  For example, you might check for certain bad characters like quotes before inserting a string into SQL, and a firewall-level check might find none.  But then some intermediate stage might decode a URL-encoded or backslash-encoded value, making a quote appear and thus allowing an SQL injection attack (and note the variety of answers for how to simply escape a quote in SQL).  An amazing variety of these, including things like "../" or "\.\.\/" or "\x2E\x2E\x2F" (and Unicode variants), can allow attackers to manipulate filenames to point outside an intended area, allowing system files to be read or even overwritten.  Some buffer overflows only occur due to misguided attempts to encode or decode bad strings at the wrong moment.  Generally, best practice has converged on only allowing known-good characters through, instead of trying to filter out all the bad ones.
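As a sketch of that allow-known-good approach (my own illustration; the function name and the exact character set are assumptions), a filename validator can simply reject anything surprising rather than trying to strip out every spelling of "..":

```cpp
#include <string>
#include <cassert>

// Allowlist validation: accept a filename only if every character is
// known-good.  Slashes, backslashes, percent signs, high bytes, and
// extra dots are all rejected outright, so no "../" variant survives.
bool is_safe_filename(const std::string &name) {
    if (name.empty() || name.size() > 64) return false;
    int dots = 0;
    for (unsigned char c : name) {
        if (c == '.') { dots++; continue; }
        bool ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
               || (c >= '0' && c <= '9') || c == '_' || c == '-';
        if (!ok) return false;
    }
    return dots <= 1;  // at most one dot, so ".." can never appear
}
```

Because this never tries to decode or repair bad input, there is no "wrong moment" for a decoding stage to resurrect a dangerous character.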

By contrast, sending a byte count followed by arbitrary data requires no escapes, and it's much easier to get a high-performance, reliable system with no special cases that always works.  A typical usage would be:
	... somehow get a length len ...
	std::vector<char> buf(len);
	... read exactly len bytes into &buf[0] ...
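Fleshing that pattern out (a sketch of length-prefixed framing; the 4-byte big-endian header and the sanity limit are assumed conventions, not something the notes mandate):

```cpp
#include <istream>
#include <sstream>
#include <vector>
#include <stdexcept>
#include <cstdint>
#include <cassert>

// Read one length-prefixed message: a 4-byte big-endian length,
// then exactly that many bytes of arbitrary payload.  No escapes,
// no delimiters, no special cases for any byte value.
std::vector<char> read_message(std::istream &in, uint32_t max_len = 1 << 20) {
    unsigned char hdr[4];
    if (!in.read((char *)hdr, 4)) throw std::runtime_error("short header");
    uint32_t len = ((uint32_t)hdr[0] << 24) | ((uint32_t)hdr[1] << 16)
                 | ((uint32_t)hdr[2] << 8)  |  (uint32_t)hdr[3];
    if (len > max_len) throw std::runtime_error("length too big"); // sanity check
    std::vector<char> buf(len);
    if (len && !in.read(&buf[0], len)) throw std::runtime_error("short payload");
    return buf;
}
```

The sanity check on len matters: an attacker who controls the length field shouldn't be able to make you allocate gigabytes.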
You can have a fixed length for len (e.g., always 4 or 8 bytes), although then you need to do something special for exchanges bigger than this.  DER, for all its flaws, uses an interesting variable-length encoding based on a single initial byte: if the high bit is clear, that byte itself is the length (0 to 127); if the high bit is set, the low 7 bits give the number of following bytes that hold the actual length.
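A decoder for that DER-style length field is short (a sketch; function name and error handling are my own):

```cpp
#include <vector>
#include <stdexcept>
#include <cstdint>
#include <cstddef>
#include <cassert>

// Decode a DER-style length starting at data[pos].
// Short form: one byte 0x00-0x7F is the length itself.
// Long form: 0x80|n means the next n bytes hold a big-endian length.
// Returns the length and advances pos past the length field.
uint64_t der_length(const std::vector<unsigned char> &data, size_t &pos) {
    if (pos >= data.size()) throw std::runtime_error("truncated");
    unsigned char first = data[pos++];
    if ((first & 0x80) == 0) return first;        // short form
    int n = first & 0x7F;                         // count of length bytes
    if (n == 0 || n > 8) throw std::runtime_error("bad length");
    uint64_t len = 0;
    for (int i = 0; i < n; i++) {
        if (pos >= data.size()) throw std::runtime_error("truncated");
        len = (len << 8) | data[pos++];           // accumulate big-endian
    }
    return len;
}
```

So a 5-byte field is just the byte 0x05, while a 256-byte field is 0x82 0x01 0x00: small lengths stay compact, yet huge lengths remain possible.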
UTF-8 uses a similar trick to send variable-length Unicode characters over an 8-bit communication channel: the number of leading 1 bits in the first byte tells you the total number of bytes in the character.
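That lead-byte rule can be sketched directly from the bit patterns (my own illustration):

```cpp
#include <cassert>

// Number of bytes in a UTF-8 sequence, determined from its first byte:
// 0xxxxxxx -> 1, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4.
// Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte.
int utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;   // plain ASCII
    if (lead < 0xC0) return 0;   // 10xxxxxx: continuation, not a lead byte
    if (lead < 0xE0) return 2;   // 110xxxxx
    if (lead < 0xF0) return 3;   // 1110xxxx
    if (lead < 0xF8) return 4;   // 11110xxx
    return 0;                    // 11111xxx: invalid in UTF-8
}
```

As with DER, one byte is enough to tell the receiver how much more to read, so arbitrary values never need escaping.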