Histograms, Coincidence, and Correlation

CS 463 Lecture, Dr. Lawlor

Basically the only thing you can do to analyze ciphertext is to look for patterns.  One place many patterns show up is in the statistics of the bytes.  For text ciphers, it's common to build text histograms, showing how often each letter occurs.  For binary data, a plain 1D histogram showing how often each byte occurs is also useful.

You get more info by looking at more data, so it's common to build a 'coincidence table' showing how often values occur together.  Here's a tiny C++ coincidence counting program:
/*
Binary file byte-coincidence histogramming tool.

Dr. Orion Sky Lawlor, lawlor@alaska.edu, 2013-01-28 (Public Domain)
*/
#include <iostream>
#include <fstream>
#include <string>
#include <cmath>

double clamp(double v) {
if (v<0.0) return 0.0;
if (v>1.0) return 1.0;
else return v;
}

class histo2d {
public:
enum {n=256}; // byte histogram
unsigned int data[n][n]; /* [first byte][next byte] */
int last;

histo2d() { // zero the histogram
for (int i=0;i<n;i++)
for (int j=0;j<n;j++)
data[i][j]=0;
last=-1;
}

void add(unsigned char next) {
if (last!=-1) data[last][next]++;
last=next;
}

void save(std::string filename)
{ // save to a PPM file
filename+=".ppm";
std::ofstream of(filename.c_str(),std::ios_base::binary);
of<<"P6\n"<<n<<" "<<n<<"\n255\n";

// Find the biggest value (to scale output to bytes)
unsigned int big=0;
for (int i=0;i<n;i++)
for (int j=0;j<n;j++)
if (data[i][j]>big) big=data[i][j];
double scale=1.0/big;

// Scale the output to bytes, and write them to disk.
for (int i=0;i<n;i++)
for (int j=0;j<n;j++)
{
unsigned char out=255*clamp(-0.1*log(data[i][j]*scale));

unsigned char out2=out;
if (((i&0x1f)==0 || (j&0x1f)==0))
out2*=0.75; // make grid lines
of.write((char *)&out2,1);
of.write((char *)&out2,1);
of.write((char *)&out,1);
}
}
};

int main(int argc,char *argv[]) {
if (argc<=1) {
std::cout<<"Usage: coincidence <input file...>\n";
return 1;
}

for (int argi=1;argi<argc;argi++) {
const char *fileName=argv[argi];
std::cout<<"Histogramming file "<<fileName<<"\n";
histo2d h;
std::ifstream file(fileName,std::ios_base::binary);
unsigned char c;
while (file.read((char *)&c,1)) h.add(c);
h.save(fileName);
}
}
The output of this program is a PPM image, a weird format that's very easy to write.  It's readable by The GIMP.

In the output image, the X axis is the first byte, from 0 on the left to 255 on the right.  The Y axis is the value of the next byte, from 0 at the top to 255 at the bottom.  White indicates the value never happens; black indicates the value happens all the time.  I also put in light blue grid lines every 32 pixels to keep you oriented.

Just looking at this byte coincidence pattern can tell you a lot about a file.
binary zeros: note the black dot in the upper left at (0,0).
correlation
              image for zeros.ppm
hexadecimal ASCII: numbers, lowercase letters, and newlines.
correlation
              image for hex.ppm
base64 text: lowercase, uppercase, punctuation, and newlines.
correlation
              image for base64.ppm
HP Lovecraft text: typical english text.  Note mostly lowercase, plenty of spaces.  Note the assymetry around capital letters--capital-lowercase occurs much more often than vice versa.
correlation image for lovecraft.ppm
Wikipedia XML: more high ASCII/unicode, more punctuation, more complex structure.
correlation image for wikipedia.ppm
tar-gzip file: Note classic binary fill-everywhere pattern.  White dot is intriguing.
correlation
              image for tgz.ppm
zip file: similar to tar-gzip.  Biased toward lower numbers.
correlation
              image for zip.ppm
bz2 file: much less biased (better compression), although patterns still visible.
correlation
              image for bz2.ppm
JPEG image: weird fine horizontal structures
correlation
              image for jpeg.ppm
MP3 Audio: relatively few heavily used codes.
correlation
              image for mp3.ppm
/dev/urandom: almost perfectly uniform, except for random correlations.
correlation
              image for random.ppm
AES-128 in CBC mode: looks perfectly uniform to me, even though we were encrypting just binary zeros.
correlation image for AES CBC
AES-128 in ECB mode: uh-oh! That ain't random!
correlation image for AES ECB