Branch Prediction vs. SSE

CS 441 Lecture, Dr. Lawlor

Here's a little benchmark to compare ordinary sequential C++ with the SSE "fourfloats" class we developed last time. The surprising fact is that the sequential code's performance depends heavily on the predictability of the "if (src[i]<4.0)" branch:

enum {n=1000};
float src[n]={1.0,5.0,3.0,4.0};
float dest[n];

int serial_loop(void) {
	for (int i=0;i<n;i++) {
		if (src[i]<4.0) dest[i]=src[i]*2.0; else dest[i]=17.0;
	}
	return 0;
}
int sse_loop(void) {
	for (int i=0;i<n;i+=4) {
		fourfloats s(&src[i]);
		fourfloats d=(s<4.0).if_then_else(s+s,17.0);
		d.store(&dest[i]);
	}
	return 0;
}
int sort_loop(void) {
	std::sort(&src[0],&src[n]);
	return 0;
}


int foo(void) {
	for (int i=0;i<n;i++) {src[i]=rand()%8;}
	print_time("serial(rand)",serial_loop);
	print_time("SSE(rand)",sse_loop);
	print_time("sort",sort_loop);
	print_time("serial(post-sort)",serial_loop);
	print_time("SSE(post-sort)",sse_loop);
	//farray_print(dest,4); // <- for debugging
	return 0;
}

(Try this in NetRun now!)

The performance of this code on our various NetRun machines is summarized here, in nanoseconds/float:


	
	
		
			

			Serial (rand)
			Serial (sorted)
			SSE (rand)
			SSE (sorted)
			Sort time
		
		
			Q6600
			7.5
			1.1
			1.2
			1.2
			16.7
		
		
			Core2
			9.4
			1.9
			1.6
			1.6
			21.7
		
		
			Pentium 4
			12.1
			2.8
			1.4
			1.4
			31.9
		
		
			Pentium III
			13.0
			7.2
			3.0
			3.0
			42.9

Unpredictable branches cause a serious performance impact even on modern CPUs. Most of the benefit of modern superscalar, out-of-order, etc is lost if branch prediction fails.
On modern CPUs, when the branch predictor is working perfectly, ordinary C++ code is competitive with SSE code.
SSE performance is not affected by unpredictable branches.
Sorting the data makes the branches more predictable, but it takes much longer to sort than to branch!

	Serial (rand)	Serial (sorted)	SSE (rand)	SSE (sorted)	Sort time
Q6600	7.5	1.1	1.2	1.2	16.7
Core2	9.4	1.9	1.6	1.6	21.7
Pentium 4	12.1	2.8	1.4	1.4	31.9
Pentium III	13.0	7.2	3.0	3.0	42.9