# Branch instructions & SIMD Intro

CS 441 Lecture, Dr. Lawlor

Branches are tricky to integrate into today's deep pipelines.  Branch prediction is hard.

This piece of code is basically a random number generator, combined with a little branch code.
```
add r11,1     ; treat r11 as counter.  HACK: assumes r11 not used by NetRun's timing loop
mov r8,r11    ; copy the counter
imul r8,65747 ; scale the counter by some big number (not a power of two, though!)
and r8,4      ; grab a random bit from inside the scaled counter
cmp r8,0      ; check if that bit is zero
je same_thing ; if it's zero, jump
ret
same_thing:   ; basically the same as not jumping, except for performance!
ret
```
(Try this in NetRun now!)

• Never branching is really fast, like 2.9ns
• Always branching is only a little slower, like 3.7ns
• Alternating between branching and not branching each time around is 3.3ns (about halfway in between).
• Randomly branching at some unpredictable interval is quite bad, like 7.5ns!

## Avoiding Branching with the "If Then Else" Trick

There are several interesting ways to avoid the performance cost of unpredictable branches:

You can use a "conditional" instruction, like Intel's cmovXX instructions (cmovle, cmovg, etc), which only does a move if some condition is true.

You can fold a conditional into arithmetic, transforming
`if (x) y=a; else y=b;   // conditional`
into any of these forms:
```
y=b+x*(a-b);      // linear version (assumes x is 0 or 1, and a-b works)
y=x*a+(1-x)*b;    // rearranged version of the above
y=(x&a)|((~x)&b); // bitwise version (assumes x is all zero bits, or all one bits)
```
Note that this last one is just a software version of a hardware multiplexer!

## SIMD

We covered an old, simple, but powerful idea today-- "SIMD", which stands for Single Instruction Multiple Data:
• Single, meaning just one.
• Instruction, as in a machine code instruction, executed by hardware.
• Multiple, as in more than one--from 2 to a thousand or so.
• Data, as in floats or ints.
You can do lots of interesting SIMD work without using any special instructions--plain old C will do, if you treat an "int" as 32 completely independent bits, because any normal "int" instruction will operate on all the bits at once.  This use of bitwise operations is often called "SIMD within a register (SWAR)" or "word-SIMD"; see Sean Anderson's "Bit Twiddling Hacks" for a variety of amazing examples.

Back in the 1980's, "vector" machines were quite popular in supercomputing centers.  For example, the 1988 Cray Y-MP was a typical vector machine.  When I was an undergraduate, ARSC still had a Y-MP vector machine.  The Y-MP had eight "vector" registers, each of which held 64 doubles (that's a total of 4KB of registers!).  A single Y-MP machine language instruction could add all 64 corresponding numbers in two vector registers, which enabled the Y-MP to achieve the (then) mind-blowing speed of *millions* of floating-point operations per second.  Vector machines have now almost completely died out; the Cray SV-1 and the Japanese "Earth Simulator" are the last of this breed.  Vector machines are classic SIMD, because one instruction can modify 64 doubles.

But the most common form of SIMD today is the "multimedia" instruction set extensions in normal CPUs.  The usual arrangement is for a single instruction to operate on four 32-bit floats at once.  These four-float instructions exist almost everywhere nowadays:
• x86, where they are called "SSE" (Streaming SIMD Extensions) or "AVX" (Advanced Vector eXtensions)
• PowerPC, where they are called "AltiVec"
• Cell Broadband Engine, where they are part of the "Synergistic Processing Elements"
• Graphics cards, where they are part of the pixel processors
We'll look at the x86 version, SSE, in the most detail.

## Branch Prediction vs SSE

Here's a little benchmark to compare ordinary sequential C++ with the SSE "fourfloats" class we will develop next class.  The "if_then_else" method is written using the bitwise branch trick described above.  The surprising fact is that the sequential code's performance depends heavily on the predictability of the "if (src[i]<4.0)" branch:
```
enum {n=1000};
float src[n]={1.0,5.0,3.0,4.0};
float dest[n];
int serial_loop(void) {
	for (int i=0;i<n;i++) {
		if (src[i]<4.0) dest[i]=src[i]*2.0; else dest[i]=17.0;
	}
	return 0;
}
int sse_loop(void) {
	for (int i=0;i<n;i+=4) {
		fourfloats s(&src[i]);
		fourfloats d=(s<4.0).if_then_else(s+s,17.0);
		d.store(&dest[i]);
	}
	return 0;
}
int sort_loop(void) {
	std::sort(&src[0],&src[n]);
	return 0;
}
int foo(void) {
	for (int i=0;i<n;i++) {src[i]=rand()%8;}
	print_time("serial(rand)",serial_loop);
	print_time("SSE(rand)",sse_loop);
	print_time("sort",sort_loop);
	print_time("serial(post-sort)",serial_loop);
	print_time("SSE(post-sort)",sse_loop);
	//farray_print(dest,4); // <- for debugging
	return 0;
}
```

The performance of this code on our various NetRun machines is summarized here, in nanoseconds/float:

| CPU | Serial (rand) | Serial (sorted) | SSE (rand) | SSE (sorted) | Sort time |
|---|---|---|---|---|---|
| Q6600 | 7.5 | 1.1 | 1.2 | 1.2 | 16.7 |
| Core2 | 9.4 | 1.9 | 1.6 | 1.6 | 21.7 |
| Pentium 4 | 12.1 | 2.8 | 1.4 | 1.4 | 31.9 |
| Pentium III | 13.0 | 7.2 | 3.0 | 3.0 | 42.9 |
• Unpredictable branches cause a serious performance impact even on modern CPUs.  Most of the benefit of modern superscalar, out-of-order execution, etc., is lost if branch prediction fails.
• On modern CPUs, when the branch predictor is working perfectly, ordinary C++ code is competitive with SSE code.
• SSE performance is not affected by unpredictable branches.
• Sorting the data makes the branches more predictable, but it takes much longer to sort than to branch!