Branch instructions & SIMD Intro
CS 441 Lecture, Dr. Lawlor
Branches are tricky to integrate into today's deep pipelines.  Branch prediction is hard.  
This piece of code is basically a random number generator, combined with a little branch code.
add r11,1 ; treat r11 as counter.  HACK: assumes r11 not used by NetRun's timing loop
mov r8,r11 ; copy the counter
imul r8,65747 ; scale the counter by some big number (not a power of two, though!)
and r8,4 ; grab a random bit from inside the scaled counter
cmp r8,0 ; check if that bit is zero
je same_thing ; if it's zero, jump
ret
same_thing:   ; basically the same as not jumping, except for performance!
ret
(Try this in NetRun now!)
What's curious about this code is how the performance varies if you adjust the frequency of branching:
- Never branching is really fast, like 2.9ns
- Always branching is only a little slower, like 3.7ns
- Alternating branching or not, every time around, is 3.3ns (about halfway in between).
- Randomly branching at some unpredictable interval is quite bad, like 7.5ns!
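To experiment with the four cases above outside of assembly, here is a minimal C++ sketch of how each branch pattern could be generated from a call counter. The helper names (never_taken, always_taken, alternating, scrambled, count_taken) and the mask parameter are illustrative assumptions, not part of NetRun. The code only counts how often the branch would be taken, and makes no timing claims.

```cpp
#include <cstdint>

// Each hypothetical helper answers: would the "je" be taken on call number `counter`?
bool never_taken(std::uint64_t) { return false; }
bool always_taken(std::uint64_t) { return true; }
bool alternating(std::uint64_t counter) { return (counter & 1u) == 0; }

// Same recipe as the assembly above: scale the counter, keep one bit,
// and jump when that bit is zero.  `mask` picks the bit (assumed parameter).
bool scrambled(std::uint64_t counter, std::uint64_t mask) {
    return ((counter * 65747u) & mask) == 0;
}

// Count how many of the calls numbered 1..calls would take the branch.
template <class Pattern>
std::uint64_t count_taken(Pattern pattern, std::uint64_t calls) {
    std::uint64_t taken = 0;
    for (std::uint64_t c = 1; c <= calls; c++) {
        if (pattern(c)) taken++;
    }
    return taken;
}
```

For example, count_taken(alternating, 1000) gives 500. One thing worth checking when you experiment: bit k of an odd multiple of the counter repeats every 2^(k+1) calls, so a low mask bit gives a short repeating pattern and a higher mask bit gives a longer one.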
Avoiding Branching with the "If Then Else" Trick
There are several interesting ways to avoid the performance cost of unpredictable branches:
- You can use a "conditional" instruction, like Intel's cmovXX instructions (cmovle, cmovg, etc), which only does a move if some condition is true.
- You can fold a conditional into arithmetic, transforming
if (x) y=a; else y=b;   // conditional
into any of these forms:
y=b+x*(a-b); // linear version (assumes x is 0 or 1, and a-b works)
y=x*a+(1-x)*b; // rearranged version of above
y=(x&a)|((~x)&b); // bitwise version (assumes x is all zero bits, or all one bits)
Note that this last one is just a software version of a hardware multiplexer!
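As a concrete check of the forms above, here is a minimal C++ sketch that compares each branch-free version against the ordinary "if". The function names (select_branch, select_linear, select_rearranged, make_mask, select_bitwise) are made up for this example, and the code assumes 32-bit integer values where a-b does not overflow.

```cpp
#include <cassert>
#include <cstdint>

// Reference version: an ordinary branch.
std::int32_t select_branch(bool x, std::int32_t a, std::int32_t b) {
    if (x) return a; else return b;
}

// Linear version: x must be exactly 0 or 1, and a-b must not overflow.
std::int32_t select_linear(std::int32_t x, std::int32_t a, std::int32_t b) {
    assert(x == 0 || x == 1);
    return b + x * (a - b);
}

// Rearranged version: same requirement on x.
std::int32_t select_rearranged(std::int32_t x, std::int32_t a, std::int32_t b) {
    assert(x == 0 || x == 1);
    return x * a + (1 - x) * b;
}

// Turn a condition into an all-ones (true) or all-zeros (false) mask.
std::uint32_t make_mask(bool condition) {
    return 0u - static_cast<std::uint32_t>(condition);
}

// Bitwise version: a software multiplexer, picking a where the mask is set.
std::uint32_t select_bitwise(std::uint32_t mask, std::uint32_t a, std::uint32_t b) {
    assert(mask == 0u || mask == 0xFFFFFFFFu);
    return (mask & a) | (~mask & b);
}
```

With a=7 and b=3, every version returns 7 when the condition is true and 3 when it is false. Whether a compiler turns the plain "if" into a cmov on its own is up to the compiler, so the branch-free forms are worth timing rather than assuming.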
SIMD
We covered an old, simple, but powerful idea today-- "SIMD", which stands for Single Instruction Multiple Data:
- Single, meaning just one.
- Instruction, as in a machine code instruction, executed by hardware.
- Multiple, as in more than one--from 2 to a thousand or so.
- Data, as in floats or ints.
You can do lots of interesting SIMD work without using
any special instructions--plain old C will do, if you treat an "int" as
32 completely independent bits, because any normal "int" instructions
will operate on all the bits at once.  This use of
bitwise operations is often called "SIMD within a register (SWAR)" or
"word-SIMD"; see Sean Anderson's "Bit Twiddling Hacks" for a variety of amazing examples.
Back in the 1980's, "vector" machines were quite popular in 
supercomputing centers.  For example, the 1988 Cray Y-MP was a 
typical vector machine.  When I was an undergraduate, ARSC
still had a Y-MP vector machine.  The Y-MP had eight "vector"
registers, each of which held 64 doubles (that's a total of 4KB of
registers!).  A single Y-MP machine language instruction could add
all 64 corresponding numbers in two vector registers, which enabled the
Y-MP to achieve the (then) mind-blowing speed of *millions* of
floating-point operations per second.  Vector machines have now
almost completely died out; the NEC SV-1 and the Japanese "Earth
Simulator" are the last of this breed.  Vector machines are
classic SIMD, because one instruction can modify 64 doubles.
But the most common form of SIMD today is the "multimedia" instruction
set extensions in normal CPUs.  The usual arrangement for
multimedia instructions is for single instructions to operate on four
32-bit floats.  These four-float instructions exist almost
everywhere nowadays:
- x86, where they are called "SSE" (Streaming SIMD Extensions) or "AVX" (Advanced Vector eXtensions)
- PowerPC, where they are called "AltiVec"
- Cell Broadband Engine, where they are part of the "Synergistic Processing Elements"
- Graphics cards, where they are part of the pixel processors
We'll look at the x86 version, SSE, in the most detail.
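To make "one instruction, four floats" concrete, here is a minimal sketch using the SSE intrinsics from the xmmintrin.h header (_mm_loadu_ps, _mm_add_ps, _mm_storeu_ps). The function name add4 and the SKETCH_HAVE_SSE macro are assumptions for this example, and a scalar fallback is included so the code still builds on machines without SSE.

```cpp
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>      // SSE intrinsics
#define SKETCH_HAVE_SSE 1   // hypothetical macro, local to this sketch
#endif

// Compute out[i] = a[i] + b[i] for i = 0..3.
void add4(const float *a, const float *b, float *out) {
#ifdef SKETCH_HAVE_SSE
    __m128 va = _mm_loadu_ps(a);             // load 4 floats (no alignment required)
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  // a single add does all four sums
#else
    for (std::size_t i = 0; i < 4; i++) out[i] = a[i] + b[i];  // scalar fallback
#endif
}
```

The unaligned load and store are used here only to keep the sketch free of alignment requirements on the arrays.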
Branch Prediction vs SSE
Here's a little benchmark to compare ordinary sequential C++ with the
SSE "fourfloats" class we will develop next class.  The
"if_then_else" method is written using the bitwise branch trick
described above.  The surprising
fact is that the sequential code's performance depends heavily on the
predictability of the "if (src[i]<4.0)" branch:
#include <algorithm> // for std::sort
#include <cstdlib>   // for rand
enum {n=1000};
float src[n]={1.0,5.0,3.0,4.0};
float dest[n];
int serial_loop(void) {
	for (int i=0;i<n;i++) {
		if (src[i]<4.0) dest[i]=src[i]*2.0; else dest[i]=17.0;
	}
	return 0;
}
int sse_loop(void) {
	for (int i=0;i<n;i+=4) {
		fourfloats s(&src[i]);
		fourfloats d=(s<4.0).if_then_else(s+s,17.0);
		d.store(&dest[i]);
	}
	return 0;
}
int sort_loop(void) {
	std::sort(&src[0],&src[n]);
	return 0;
}
int foo(void) {
	for (int i=0;i<n;i++) {src[i]=rand()%8;}
	print_time("serial(rand)",serial_loop);
	print_time("SSE(rand)",sse_loop);
	print_time("sort",sort_loop);
	print_time("serial(post-sort)",serial_loop);
	print_time("SSE(post-sort)",sse_loop);
	//farray_print(dest,4); // <- for debugging
	return 0;
}
(Try this in NetRun now!)
The performance of this code on our various NetRun machines is summarized here, in nanoseconds/float:
	
	
		
| CPU | Serial (rand) | Serial (sorted) | SSE (rand) | SSE (sorted) | Sort time |
| Q6600 | 7.5 | 1.1 | 1.2 | 1.2 | 16.7 |
| Core2 | 9.4 | 1.9 | 1.6 | 1.6 | 21.7 |
| Pentium 4 | 12.1 | 2.8 | 1.4 | 1.4 | 31.9 |
| Pentium III | 13.0 | 7.2 | 3.0 | 3.0 | 42.9 |
- Unpredictable branches cause a serious performance impact even on
modern CPUs.  Most of the benefit of superscalar, out-of-order,
and similar techniques is lost if branch prediction fails.
- On modern CPUs, when the branch predictor is working perfectly, ordinary C++ code is competitive with SSE code.
- SSE performance is not affected by unpredictable branches.
- Sorting the data makes the branches more predictable, but it takes much longer to sort than to branch!
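To see why the SSE loop has no branch to mispredict, the "(s<4.0).if_then_else(s+s,17.0)" line in the benchmark can be emulated one lane at a time in portable C++. This is only a sketch: lanes4, mask4, less_than, if_then_else4 and kernel are hypothetical stand-ins, not the "fourfloats" class itself, and each scalar loop here stands in for what would be a single SSE instruction.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit floats");

struct lanes4 { float v[4]; };          // stand-in for one four-float register
struct mask4  { std::uint32_t m[4]; };  // per lane: all ones (true) or all zeros (false)

// Reinterpret a float's bits as an integer and back, without changing them.
std::uint32_t to_bits(float f) { std::uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }
float from_bits(std::uint32_t u) { float f; std::memcpy(&f, &u, sizeof f); return f; }

// Compare every lane against a limit, producing a mask instead of branching.
mask4 less_than(const lanes4 &s, float limit) {
    mask4 r;
    for (int i = 0; i < 4; i++) r.m[i] = 0u - static_cast<std::uint32_t>(s.v[i] < limit);
    return r;
}

// Bitwise select: take then_v where the mask is set, else_v elsewhere.
lanes4 if_then_else4(const mask4 &mask, const lanes4 &then_v, const lanes4 &else_v) {
    lanes4 r;
    for (int i = 0; i < 4; i++) {
        r.v[i] = from_bits((mask.m[i] & to_bits(then_v.v[i])) |
                           (~mask.m[i] & to_bits(else_v.v[i])));
    }
    return r;
}

// One step of the benchmark: dest = (src<4.0) ? src+src : 17.0, for four lanes.
lanes4 kernel(const lanes4 &s) {
    lanes4 doubled, constant;
    for (int i = 0; i < 4; i++) { doubled.v[i] = s.v[i] + s.v[i]; constant.v[i] = 17.0f; }
    return if_then_else4(less_than(s, 4.0f), doubled, constant);
}
```

Feeding in the first four values of src from the benchmark, {1.0, 5.0, 3.0, 4.0}, gives {2.0, 17.0, 6.0, 17.0}. Both candidate results are computed for every lane and the mask picks between them, so the same instructions run no matter which lanes pass the test.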