Binary vs. ASCII Communication Protocols for Cryptography
Lecture, Dr. Lawlor
There are generally two types of communication protocol in the world:
- ASCII-style protocols let you know when things are done with a
special character, like a newline.
- Binary-style protocols have fixed-size pieces, like a 4-byte
length field followed by that many bytes of data.
It's informative to look at HTTP for this:
- The header (variable-length, extensible, semi-structured,
small in size) is ASCII. But if you want to send a strange
character in the header, you need to URL-encode it (e.g., %7E for "~").
- The content (often fixed-length, of uniform data type, large
in size) is binary. The length is stored in the
"Content-Length" field from the ASCII section. If you want
to send a strange character in the content, you just send it.
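For example, here's a C++ sketch that splits a (made-up,
already-in-memory) HTTP response into its ASCII header and binary
content using the Content-Length field; a real HTTP parser needs
much more care than this:

    #include <cstdio>
    #include <cstdlib>
    #include <string>

    int main() {
        std::string response =
            "HTTP/1.1 200 OK\r\n"
            "Content-Type: application/octet-stream\r\n"
            "Content-Length: 4\r\n"
            "\r\n"               // blank line ends the ASCII header
            "\x01\x02\x03\x04";  // binary content, sent raw--no escaping

        size_t header_end = response.find("\r\n\r\n");
        size_t field = response.find("Content-Length:");
        long len = atol(response.c_str() + field + 15);  // ASCII decimal -> number
        const char *content = response.c_str() + header_end + 4;
        printf("content is %ld bytes; first byte = 0x%02x\n",
               len, (unsigned)(unsigned char)content[0]);
        return 0;
    }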
Plain old line-oriented ASCII works great for most stuff:
- It's easy for people to read and debug. You don't need
any special tools or skills for this.
- It's supported well by every computer language ever created.
- It's easy to extend--for example, you can add a new
"Content-Color: green\n" or "Classified-Level: taupe secret\n"
field (e.g., to an HTTP header). Numbers, in particular,
extend very smoothly in ASCII from something like 6 to a giant
4096-bit RSA prime. Extending binary protocols is usually
harder, sometimes impossible if the designer didn't plan ahead
for extensions.
But there are problems with ASCII communication:
- It's not always space efficient. For example, 2^32-1 is
10 bytes in ASCII (4294967295), 8 bytes in hex (ffffffff), but
only 4 bytes as raw binary data; a sketch after this list
checks these sizes. (There are
counterexamples to this. We once converted an ASCII file
of numbers to binary, thinking it'd be smaller. But the
numbers were mostly single-digit values like 0 or 3, so most
values were only 2 bytes including the separating space.
This made the file *bigger* in binary, so we actually switched
back to ASCII.)
- It's not very time efficient, because you need to look at each
byte (e.g., to see if it's the newline). This means
there's no parallelism, and the I/O batch size is a tiny 1 byte
at a time. Binary communication, by contrast, normally
occurs in blocks of size >1 byte, so you minimize block setup
cost, and can send or process parts of the block in parallel.
- It's not easy to send some values. For example, a
newline ends the HTTP "GET" header, but for NetRun I'd like to
put the code (including newlines) into the URL.
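To make the space comparison above concrete, here's a tiny C++
check that prints the size of the same 32-bit value as ASCII
decimal, ASCII hex, and raw binary:

    #include <cstdio>
    #include <cstring>
    #include <cstdint>

    int main() {
        uint32_t v = 4294967295u;  // 2^32 - 1
        char dec[32], hex[32];
        snprintf(dec, sizeof dec, "%u", (unsigned)v);  // "4294967295" -> 10 bytes
        snprintf(hex, sizeof hex, "%x", (unsigned)v);  // "ffffffff"   -> 8 bytes
        printf("decimal: %zu bytes, hex: %zu bytes, raw: %zu bytes\n",
               strlen(dec), strlen(hex), sizeof v);
        return 0;
    }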
Most ASCII protocols have some form of 'escape code' for encoding
disallowed values like field separators in the data.
In C, C++, Java, etc.:
    \n (meaning newline, 0x0A)
    \r (meaning carriage return, 0x0D)
    \t (meaning tab)
    \" (meaning literal quote)
    \\ (meaning literal backslash)
    \x<hex digit><hex digit> (ended by the first non-hex digit)
    \<octal digit, 0-7><octal digit><octal digit>
In URLs:
    %<hex digit><hex digit>
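For example, here's a sketch of URL-style escaping in C++
(url_encode is an illustrative name, not a library function): it
passes known-good bytes through and escapes everything else as a
percent sign plus two hex digits.

    #include <cctype>
    #include <cstdio>
    #include <string>

    std::string url_encode(const std::string &in) {
        std::string out;
        for (unsigned char c : in) {
            if (isalnum(c) || c=='-' || c=='_' || c=='.' || c=='~') {
                out += c;  // known-good character: send as-is
            } else {
                char buf[4];
                snprintf(buf, sizeof buf, "%%%02X", (unsigned)c);  // e.g. '\n' -> "%0A"
                out += buf;
            }
        }
        return out;
    }

    int main() {
        // newlines and spaces in the code become %0A and %20 in the URL
        printf("%s\n", url_encode("int main() {\n return 0;\n}").c_str());
        return 0;
    }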
Generally, the hardest thing is to encode strings from the language
in which you're doing the encoding--for example, escaping the
HTML for a web page so you can build a web page to explain how to
write web pages, or correctly escaping code fed to a parser
generator, where the string gets unescaped once to generate the
compiler and then again to generate the actual source code.
The traditional approach for escapes is to leave them out entirely
on the first version, then say "oops" repeatedly as various cases
arise requiring the value to be protected. These aren't just
complicated and tricky to program and maintain--they're often
serious security holes. For example, you might check a string
at the firewall level for bad characters like quotes before
inserting it into SQL, and find none. But then some
intermediate stage might decode a URL-encoded or
backslash-encoded value, making a quote appear and thus allowing an
SQL injection attack (and note the wide variety of answers for how
to simply escape a quote in SQL). An amazing variety
of these, including things like "../" or "\.\.\/" or "\x2E\x2E\x2F"
(and unicode variants), can allow attackers to manipulate filenames
to point outside an intended area, allowing system files to be read
or even overwritten. Some buffer overflows only occur due to
misguided attempts to encode or decode bad strings at the wrong
moment. Generally, best practice has converged on only
allowing known-good characters through instead of trying to filter
out all the bad ones.
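A minimal sketch of that whitelist approach in C++ (the function
name and its deliberately tiny whitelist are illustrative, not a
drop-in sanitizer):

    #include <cctype>
    #include <stdexcept>
    #include <string>

    // Accept only characters known to be safe; reject everything else.
    // This avoids chasing every possible encoding of every bad byte.
    std::string require_safe(const std::string &in) {
        for (unsigned char c : in) {
            if (!(isalnum(c) || c=='_' || c=='-'))  // the whole whitelist
                throw std::runtime_error("unsafe character in input");
        }
        return in;  // e.g., now safe to use as a filename component
    }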
By contrast, sending a byte count followed by arbitrary data
requires no escapes, and it's much easier to build a reliable,
high-performance system with no special cases. A
typical usage would be:
... somehow get a length len ...
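Fleshed out in C++, the receive side might look like the sketch
below, assuming a POSIX socket skt and a 4-byte big-endian length
prefix (recv_all and recv_message are illustrative names, not a
standard API):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    // Loop until exactly n bytes arrive--recv can return short reads.
    void recv_all(int skt, void *buf, size_t n) {
        char *p = (char *)buf;
        while (n > 0) {
            ssize_t got = recv(skt, p, n, 0);
            if (got <= 0) throw std::runtime_error("connection lost");
            p += got; n -= got;
        }
    }

    std::vector<char> recv_message(int skt) {
        unsigned char hdr[4];
        recv_all(skt, hdr, 4);  // fixed-size length field
        uint32_t len = (uint32_t(hdr[0]) << 24) | (uint32_t(hdr[1]) << 16)
                     | (uint32_t(hdr[2]) << 8)  |  uint32_t(hdr[3]);
        std::vector<char> data(len);
        recv_all(skt, data.data(), len);  // arbitrary bytes, no escapes needed
        return data;
    }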
You can have a fixed length for len (e.g., always 4 or 8 bytes),
although then you need to do something special for exchanges bigger
than this. DER, for all its flaws, uses an interesting
encoding based on a single byte:
- If the high bit is clear, the low 7 bits give the length
directly. For example, 0x07 means the data is 7 bytes long.
- If the high bit is set, the low 7 bits give the length *of the
length*, in bytes. For example, 0x84 would be followed by
a 4-byte length, followed by the actual data.
UTF-8 uses a similar trick to send variable-length unicode
characters over an 8-bit communication channel:
- If the high bit is clear, the low 7 bits are just plain low ASCII.
- If the high bit is set, the number of leading 1 bits indicates
the number of bytes in the unicode character.
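In C++, decoding both schemes looks like the sketch below
(illustrative helper functions, with no bounds checking):

    #include <cstdint>
    #include <cstddef>

    // DER: one byte gives either the length, or the length of the length.
    // Sets *used to the number of header bytes consumed.
    uint64_t der_length(const unsigned char *p, size_t *used) {
        if (!(p[0] & 0x80)) {  // high bit clear: low 7 bits are the length
            *used = 1;
            return p[0];
        }
        size_t n = p[0] & 0x7f;  // high bit set: n more bytes of length follow
        uint64_t len = 0;
        for (size_t i = 0; i < n; i++)
            len = (len << 8) | p[1 + i];  // big-endian length-of-length
        *used = 1 + n;
        return len;
    }

    // UTF-8: the number of leading 1 bits in the first byte gives the
    // number of bytes in the character (0 leading 1s = plain ASCII).
    int utf8_char_bytes(unsigned char first) {
        if (!(first & 0x80)) return 1;  // plain low ASCII
        int n = 0;
        while (first & 0x80) { n++; first <<= 1; }
        return n;  // e.g. 0xE2 = 1110 0010 -> a 3-byte character
    }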