Probability & Statistics

CS 463 Lecture, Dr. Lawlor

One difficult aspect of deciphering encrypted data is recognizing that you've actually got plaintext.  For example, here are some decryption results from my attempts to solve hw1_0:
key=0x13739 results in:

key=0x13795 results in:
A little bird told me a local Fairbanks stolen milling machine would have been entirely shredded by an imported Crotobaltislavonian answering machine?!

key=0x13799 results in:

key=0x1379d results in:

key=0x137a1 results in:

key=0x137a5 results in:
Clearly, we want the program to automatically find the key=0x13795 result.  There are several ways to do this:
  1. Permitted values.  Decide which byte values could be in the plaintext, and throw out decrypts that include any other byte values.  This is by far the fastest and simplest approach, and above I'm actually filtering out the many decryption keys that result in binary output data (bytes above 127, or non-whitespace bytes below 32).  It's especially fast because you can throw out a possible key as soon as it spits out a bad byte, which usually happens very early in the decryption process.  The downside?  If we assume weird characters like ] ! " # $ never occur, we will throw out any decrypt where they do occur.  This can even be gamed, by mixing blocks of binary garbage into the actual plaintext.
  2. Permitted sets.  This works as above, but we reduce the chance of false negatives by checking several bytes.  For example, "!m" rarely occurs in English, but "! " or "!!" are possible.  This is just a generalization of the above across multiple bytes, and suffers from the same flaws.  In particular, now we'll be thrown off by things like proper names or foreign words.
  3. Likely values or sets.  Since a binary choice is error-prone and easy to subvert, we can instead compute the probability that a message is actually English, given that it contains this value or set of values.  We can then save decrypts that have a high enough probability, or rank decrypts by their probability.  This is essentially a generalization of the histogram approach.
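Here's a minimal sketch of approach 1, the permitted-values filter, in Python (the byte ranges match the filtering described above; the function name and the exact whitespace set allowed are my choices):

```python
def looks_like_text(data: bytes) -> bool:
    """Approach 1: reject any decrypt containing a byte outside
    printable ASCII plus common whitespace."""
    for b in data:
        if b > 127:                          # high bytes: binary garbage
            return False
        if b < 32 and b not in (9, 10, 13):  # allow only tab, newline, CR
            return False
    return True
```

In a real key search you'd call this incrementally, bailing out of a candidate key the moment the first bad byte appears rather than decrypting the whole message.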
This last approach requires some statistical knowledge.  For example, something like 12% of the letters in English words are 'e', so:
      P(this letter is 'e' | message is English) = 0.12
(The bar | means "given that", making this a conditional probability.)

This is unfortunately not the probability we want.  For example, it tells us that a specific 5-letter message like "Hello" has probability less than (12%)^5, or about 0.0025%, within the space of all possible 5-letter English messages.  Instead, we actually want the conditioning facing the other way:
      P(message is English | this letter is 'e') = ?

We know conditional probability says:
     P(A and B) = P(A|B)P(B) = P(B|A)P(A)

So solving for the conditional probability:
    P(A|B) = P(B|A) * P(A) / P(B)
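To see Bayes' rule in action, here's a toy computation with made-up numbers (the prior and the garbage-byte model below are assumptions for illustration, not measurements):

```python
# Assumed prior: 1% of candidate decrypts are actually English.
p_english = 0.01
# From the frequency table: P('e' | English) is about 12%.
p_e_given_english = 0.12
# Assumption: a non-English (binary garbage) decrypt produces any
# given byte uniformly, so it hits 'e' 1 time in 256.
p_e_given_not = 1.0 / 256.0

# Total probability of seeing an 'e' at all:
#   P('e') = P('e'|English)P(English) + P('e'|not)P(not)
p_e = p_e_given_english * p_english + p_e_given_not * (1 - p_english)

# Bayes' rule: P(English | 'e') = P('e' | English) * P(English) / P('e')
p_english_given_e = p_e_given_english * p_english / p_e
```

With these numbers, one observed 'e' lifts the 1% prior to roughly a 24% chance of English; each additional English-looking letter compounds the update.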

Substituting A="message is English" and B="this letter is 'e'" gives:
    P(English | 'e') = P(English) * P('e' | English) / P('e')
    Chance of English *= chance of this letter in English / chance of letter overall
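In log space that running product becomes a sum, which avoids underflow on long messages.  A sketch (the letter frequencies are standard published approximations, rounded; the floor value for unlisted letters is my assumption):

```python
import math

# Approximate English letter frequencies (fraction of all letters),
# rounded from standard published tables; only the most common shown.
FREQ = {'e': 0.127, 't': 0.091, 'a': 0.082, 'o': 0.075, 'i': 0.070,
        'n': 0.067, 's': 0.063, 'h': 0.061, 'r': 0.060, 'd': 0.043,
        'l': 0.040, 'u': 0.028}
FLOOR = 0.001  # assumed small probability for any letter not listed

def english_score(text: str) -> float:
    """Sum of log P(letter | English): higher means more English-like.
    This is the running product 'Chance of English *= ...' done in
    log space."""
    score = 0.0
    for ch in text.lower():
        if ch.isalpha():
            score += math.log(FREQ.get(ch, FLOOR))
    return score
```

Since every candidate decrypt of one ciphertext has the same length, you can simply rank decrypts by this score and keep the highest.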

This of course can and should be repeated with pairs of letters, triplets, etc.  You're only limited by the quantity of plaintext and the care with which you'd like to do your statistical frequency analysis.
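The same scoring idea extends directly to letter pairs.  Here's a sketch with a tiny, made-up bigram table (a real table would be measured from a large plaintext corpus, and would cover all 26x26 pairs):

```python
import math

# A few illustrative English bigram probabilities (ballpark values
# for this sketch only, not measured from a corpus):
BIGRAM = {'th': 0.027, 'he': 0.023, 'in': 0.020, 'er': 0.018,
          'an': 0.016, 're': 0.014, 'on': 0.013, 'es': 0.013,
          'ed': 0.012, 'st': 0.011}
FLOOR = 1e-4  # assumed probability for any pair not in the table

def bigram_score(text: str) -> float:
    """Sum of log P(pair | English) over adjacent letter pairs."""
    letters = [c for c in text.lower() if c.isalpha()]
    score = 0.0
    for a, b in zip(letters, letters[1:]):
        score += math.log(BIGRAM.get(a + b, FLOOR))
    return score
```

Pairs catch garbage that a single-letter histogram misses: "eeeee" scores well letter-by-letter, but the pair "ee" is far rarer than 'e' alone.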