# Histograms, Coincidence, and Correlation

CS 463 Lecture, Dr. Lawlor

Basically the only thing you can do to analyze ciphertext is to look for patterns.  One place many patterns show up is in the statistics of the bytes.  For text ciphers, it's common to build text histograms, showing how often each letter occurs.  For binary data, a plain 1D histogram showing how often each byte occurs is also useful.

You get more info by looking at more data, so it's common to build a 'coincidence table' showing how often values occur together.  Here's a tiny C++ coincidence counting program:
`/* Binary file byte-coincidence histogramming tool. Dr. Orion Sky Lawlor, lawlor@alaska.edu, 2013-01-28 (Public Domain)*/#include <iostream>#include <fstream>#include <string>#include <cmath>double clamp(double v) {	if (v<0.0) return 0.0;	if (v>1.0) return 1.0;	else return v;}class histo2d {public:	enum {n=256}; // byte histogram	unsigned int data[n][n]; /* [first byte][next byte] */	int last;		histo2d() { // zero the histogram		for (int i=0;i<n;i++)		for (int j=0;j<n;j++)			data[i][j]=0;		last=-1;	}		void add(unsigned char next) {		if (last!=-1) data[last][next]++;		last=next;	}		void save(std::string filename) 	{ // save to a PPM file		filename+=".ppm";		std::ofstream of(filename.c_str(),std::ios_base::binary);		of<<"P6\n"<<n<<" "<<n<<"\n255\n";		// Find the biggest value (to scale output to bytes)		unsigned int big=0;		for (int i=0;i<n;i++)		for (int j=0;j<n;j++)			if (data[i][j]>big) big=data[i][j];		double scale=1.0/big;		// Scale the output to bytes, and write them to disk.		for (int i=0;i<n;i++)		for (int j=0;j<n;j++)		{			unsigned char out=255*clamp(-0.1*log(data[i][j]*scale));						unsigned char out2=out;			if (((i&0x1f)==0 || (j&0x1f)==0)) 				out2*=0.75; // make grid lines			of.write((char *)&out2,1);			of.write((char *)&out2,1);			of.write((char *)&out,1);		}	}	};int main(int argc,char *argv[]) {	if (argc<=1) {		std::cout<<"Usage: coincidence <input file...>\n";		return 1;	}		for (int argi=1;argi<argc;argi++) {		const char *fileName=argv[argi];		std::cout<<"Histogramming file "<<fileName<<"\n";		histo2d h;		std::ifstream file(fileName,std::ios_base::binary);		unsigned char c;		while (file.read((char *)&c,1)) h.add(c);		h.save(fileName);	}}`
The output of this program is a PPM image, a weird format that's very easy to write.  It's readable by The GIMP.

In the output image, the X axis is the first byte, from 0 on the left to 255 on the right.  The Y axis is the value of the next byte, from 0 at the top to 255 at the bottom.  White indicates the value never happens; black indicates the value happens all the time.  I also put in light blue grid lines every 32 pixels to keep you oriented.

Just looking at this byte coincidence pattern can tell you a lot about a file.
 binary zeros: note the black dot in the upper left at (0,0). hexadecimal ASCII: numbers, lowercase letters, and newlines. base64 text: lowercase, uppercase, punctuation, and newlines. HP Lovecraft text: typical english text.  Note mostly lowercase, plenty of spaces.  Note the assymetry around capital letters--capital-lowercase occurs much more often than vice versa. Wikipedia XML: more high ASCII/unicode, more punctuation, more complex structure. tar-gzip file: Note classic binary fill-everywhere pattern.  White dot is intriguing. zip file: similar to tar-gzip.  Biased toward lower numbers. bz2 file: much less biased (better compression), although patterns still visible. JPEG image: weird fine horizontal structures MP3 Audio: relatively few heavily used codes. /dev/urandom: almost perfectly uniform, except for random correlations. AES-128 in CBC mode: looks perfectly uniform to me, even though we were encrypting just binary zeros. AES-128 in ECB mode: uh-oh! That ain't random! 