Branch instructions & SIMD Intro

CS 441 Lecture, Dr. Lawlor

Branches are tricky to integrate into today's deep pipelines: the CPU has to guess which way each branch will go (branch prediction), and a wrong guess means throwing away all the work started down the wrong path.

This piece of code is basically a little pseudo-random number generator feeding a conditional branch.
add r11,1 ; treat r11 as counter.  HACK: assumes r11 not used by NetRun's timing loop
mov r8,r11 ; copy the counter
imul r8,65747 ; scale the counter by some big number (not a power of two, though!)
and r8,4 ; grab a random bit from inside the scaled counter
cmp r8,0 ; check if that bit is zero
je same_thing ; if it's zero, jump
ret

same_thing: ; basically the same as not jumping, except for performance!
ret
(Try this in NetRun now!)

What's curious about this code is how much the performance varies as you adjust how often the branch is taken: a branch that is almost always (or almost never) taken is cheap, because the CPU predicts it correctly, while an unpredictable 50/50 branch like the one above is much slower.
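For example, here's a hedged C++ sketch of the same idea (the function and variable names are mine, not NetRun's); the mask value controls how often the branch is taken, and hence how predictable it is:

static unsigned counter = 0;   // plays the role of r11 above

// mask==0: (scaled & 0) is always zero, so the branch is always taken and easy to predict.
// mask==4: one pseudo-random bit survives, so the branch is taken about half the time and hard to predict.
int branchy(unsigned mask) {
	counter++;                            // bump the counter
	unsigned scaled = counter * 65747u;   // scale by a big non-power-of-two constant
	if ((scaled & mask) == 0) return 1;   // the "je same_thing" case
	return 0;                             // the fall-through case
}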

Avoiding Branching with the "If Then Else" Trick

There are several interesting ways to avoid the performance cost of unpredictable branches:

You can use a "conditional" instruction, like Intel's cmovXX instructions (cmovle, cmovg, etc.), which do a move only if some condition is true.

You can fold the conditional into arithmetic (a "select" operation), transforming
if (x) y=a; else y=b;   // conditional
into any of these forms:
y=b+x*(a-b); // linear version (assumes x is 0 or 1, and that a-b doesn't overflow)
y=x*a+(1-x)*b; // rearranged version of the above
y=(x&a)|((~x)&b); // bitwise version (assumes x is all zero bits, or all one bits)
Note that this last one is just a software version of a hardware multiplexer!
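Here's a minimal C++ sketch of that bitwise version (my own example, not the lecture's code), building the all-ones/all-zeros mask from an ordinary comparison:

// Branch-free select: returns a if cond is true, b otherwise.
int select_branchless(int cond, int a, int b) {
	int x = -(cond != 0);        // cond!=0 is 0 or 1; negating gives all-zeros or all-ones (two's complement)
	return (x & a) | (~x & b);   // picks a where the mask bits are set, b elsewhere
}

There is no branch in this expression for the predictor to guess; the comparison cond!=0 typically compiles down to a flag-setting instruction rather than a jump.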

SIMD

We covered an old, simple, but powerful idea today: "SIMD", which stands for Single Instruction, Multiple Data.
You can do lots of interesting SIMD work without using any special instructions: plain old C will do, if you treat an "int" as 32 completely independent bits, because any normal "int" instruction operates on all the bits at once.  This use of bitwise operations is often called "SIMD within a register" (SWAR) or "word-SIMD"; see Sean Anderson's "Bit Twiddling Hacks" for a variety of amazing examples.
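For instance, here's a tiny SWAR sketch (my own example, assuming 32-bit unsigned ints): one bitwise instruction processes 32 boolean flags at once.

#include <cstdint>

// Each bit of the two words is an independent true/false flag for one item.
// A single AND instruction combines all 32 flags at once.
uint32_t passed_both_tests(uint32_t passed_A, uint32_t passed_B) {
	return passed_A & passed_B;   // bit i is set only if item i passed both tests
}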

Back in the 1980's, "vector" machines were quite popular in supercomputing centers.  For example, the 1988 Cray Y-MP was a typical vector machine.  When I was an undergraduate, ARSC still had a Y-MP vector machine.  The Y-MP had eight "vector" registers, each of which held 64 doubles (that's a total of 4KB of registers!).  A single Y-MP machine language instruction could add all 64 corresponding numbers in two vector registers, which enabled the Y-MP to achieve the (then) mind-blowing speed of *millions* of floating-point operations per second.  Vector machines have now almost completely died out; the Cray SV1 and the Japanese "Earth Simulator" are the last of this breed.  Vector machines are classic SIMD, because one instruction can modify 64 doubles.

But the most common form of SIMD today is the "multimedia" instruction set extension built into ordinary CPUs.  The usual arrangement is for a single instruction to operate on four 32-bit floats at once.  These four-float instructions exist almost everywhere nowadays; we'll look at the x86 version, SSE, in the most detail.
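As a quick preview, here's a minimal sketch using the standard SSE compiler intrinsics (my own example, not the wrapper class we'll build): one instruction adds four pairs of floats.

#include <xmmintrin.h>   // SSE intrinsics

// Adds four pairs of floats with a single SSE "addps" instruction.
void add4(const float *a, const float *b, float *out) {
	__m128 va = _mm_loadu_ps(a);       // load four floats from a
	__m128 vb = _mm_loadu_ps(b);       // load four floats from b
	__m128 sum = _mm_add_ps(va, vb);   // one instruction: four additions
	_mm_storeu_ps(out, sum);           // store four results
}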

Branch Prediction vs SSE

Here's a little benchmark to compare ordinary sequential C++ with the SSE "fourfloats" class we will develop next class.  The "if_then_else" method is written using the bitwise branch trick described above.  The surprising fact is that the sequential code's performance depends heavily on the predictability of the "if (src[i]<4.0)" branch:
enum {n=1000};
float src[n]={1.0,5.0,3.0,4.0};
float dest[n];

// Plain sequential loop: one float at a time, with a data-dependent branch.
int serial_loop(void) {
	for (int i=0;i<n;i++) {
		if (src[i]<4.0) dest[i]=src[i]*2.0; else dest[i]=17.0;
	}
	return 0;
}
// SSE loop: four floats per iteration; the branch becomes a bitwise select.
// (The "fourfloats" class is developed next lecture.)
int sse_loop(void) {
	for (int i=0;i<n;i+=4) {
		fourfloats s(&src[i]);
		fourfloats d=(s<4.0).if_then_else(s+s,17.0);
		d.store(&dest[i]);
	}
	return 0;
}
// Sorting src makes the "src[i]<4.0" branch predictable: all the "true" cases come first.
int sort_loop(void) {
	std::sort(&src[0],&src[n]);
	return 0;
}

int foo(void) {
	for (int i=0;i<n;i++) {src[i]=rand()%8;} // random data: the branch is unpredictable
	print_time("serial(rand)",serial_loop);
	print_time("SSE(rand)",sse_loop);
	print_time("sort",sort_loop);
	print_time("serial(post-sort)",serial_loop);
	print_time("SSE(post-sort)",sse_loop);
	//farray_print(dest,4); // <- for debugging
	return 0;
}

(Try this in NetRun now!)
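For reference, here's a rough sketch of what a "fourfloats"-style wrapper could look like on top of SSE intrinsics.  The real class gets built next lecture, so treat everything below as an assumption: just enough interface (construct, +, <, if_then_else, store) to make the benchmark above compile.

#include <xmmintrin.h>

// Hypothetical fourfloats sketch; the class we actually develop may differ in detail.
class fourfloats {
	__m128 v; // four packed 32-bit floats
public:
	fourfloats(__m128 x) :v(x) {}
	fourfloats(float x) :v(_mm_set1_ps(x)) {}         // broadcast one value into all four lanes
	fourfloats(const float *p) :v(_mm_loadu_ps(p)) {} // load four floats from memory
	friend fourfloats operator+(fourfloats a,fourfloats b) {return _mm_add_ps(a.v,b.v);}
	// Per-lane compare: each lane becomes all-one bits (true) or all-zero bits (false).
	friend fourfloats operator<(fourfloats a,fourfloats b) {return _mm_cmplt_ps(a.v,b.v);}
	// Bitwise multiplexer: exactly the "if then else" trick above, applied to four lanes at once.
	fourfloats if_then_else(fourfloats then,fourfloats els) const {
		return _mm_or_ps(_mm_and_ps(v,then.v),_mm_andnot_ps(v,els.v));
	}
	void store(float *p) const {_mm_storeu_ps(p,v);}
};

Because the comparison produces an all-ones/all-zeros mask per lane, if_then_else needs no branch at all, which is why the SSE times below don't depend on whether the data is sorted.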

The performance of this code on our various NetRun machines is summarized here, in nanoseconds/float:


Machine        Serial (rand)   Serial (sorted)   SSE (rand)   SSE (sorted)   Sort time
Q6600               7.5             1.1              1.2          1.2          16.7
Core2               9.4             1.9              1.6          1.6          21.7
Pentium 4          12.1             2.8              1.4          1.4          31.9
Pentium III        13.0             7.2              3.0          3.0          42.9