Branch instructions & SIMD Intro

CS 441 Lecture, Dr. Lawlor

Branches are tricky to integrate into today's deep pipelines: the CPU has to guess which way each branch will go (branch prediction), and a wrong guess means throwing away all the work started down the wrong path.

This piece of code is basically a little pseudo-random number generator feeding a conditional branch.
add r11,1 ; treat r11 as counter.  HACK: assumes r11 not used by NetRun's timing loop
mov r8,r11 ; copy the counter
imul r8,65747 ; scale the counter by some big number (not a power of two, though!)
and r8,4 ; grab a random bit from inside the scaled counter
cmp r8,0 ; check if that bit is zero
je same_thing ; if it's zero, jump
ret

same_thing: ; basically the same as not jumping, except for performance!
ret
(Try this in NetRun now!)

What's curious about this code is how much the performance varies as you adjust how often the branch is taken: a branch that is almost always (or almost never) taken is cheap, because the CPU predicts it correctly, while an unpredictable 50/50 branch like the one above is much slower.
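For example, here's a hedged C++ sketch of the same idea (the function and variable names are mine, not NetRun's); the mask value controls how often the branch is taken, and hence how predictable it is:

static unsigned counter = 0;   // plays the role of r11 above

// mask==0: (scaled & 0) is always zero, so the branch is always taken and easy to predict.
// mask==4: one pseudo-random bit survives, so the branch is taken about half the time and hard to predict.
int branchy(unsigned mask) {
	counter++;                            // bump the counter
	unsigned scaled = counter * 65747u;   // scale by a big non-power-of-two constant
	if ((scaled & mask) == 0) return 1;   // the "je same_thing" case
	return 0;                             // the fall-through case
}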

Avoiding Branching with the "If Then Else" Trick

There are several interesting ways to avoid the performance cost of unpredictable branches:

You can use a "conditional" instruction, like Intel's cmovXX instructions (cmovle, cmovg, etc.), which do a move only if some condition is true.

You can fold the conditional into arithmetic (a "select" operation), transforming
if (x) y=a; else y=b;   // conditional
into any of these forms:
y=b+x*(a-b); // linear version (assumes x is 0 or 1, and that a-b doesn't overflow)
y=x*a+(1-x)*b; // rearranged version of the above
y=(x&a)|((~x)&b); // bitwise version (assumes x is all zero bits, or all one bits)
Note that this last one is just a software version of a hardware multiplexer!
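Here's a minimal C++ sketch of that bitwise version (my own example, not the lecture's code), building the all-ones/all-zeros mask from an ordinary comparison:

// Branch-free select: returns a if cond is true, b otherwise.
int select_branchless(int cond, int a, int b) {
	int x = -(cond != 0);        // cond!=0 is 0 or 1; negating gives all-zeros or all-ones (two's complement)
	return (x & a) | (~x & b);   // picks a where the mask bits are set, b elsewhere
}

There is no branch in this expression for the predictor to guess; the comparison cond!=0 typically compiles down to a flag-setting instruction rather than a jump.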

SIMD

We covered an old, simple, but powerful idea today: "SIMD", which stands for Single Instruction, Multiple Data.
You can do lots of interesting SIMD work without using any special instructions: plain old C will do, if you treat an "int" as 32 completely independent bits, because any normal "int" instruction operates on all the bits at once.  This use of bitwise operations is often called "SIMD within a register" (SWAR) or "word-SIMD"; see Sean Anderson's "Bit Twiddling Hacks" for a variety of amazing examples.
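For instance, here's a tiny SWAR sketch (my own example, assuming 32-bit unsigned ints): one bitwise instruction processes 32 boolean flags at once.

#include <cstdint>

// Each bit of the two words is an independent true/false flag for one item.
// A single AND instruction combines all 32 flags at once.
uint32_t passed_both_tests(uint32_t passed_A, uint32_t passed_B) {
	return passed_A & passed_B;   // bit i is set only if item i passed both tests
}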

Back in the 1980's, "vector" machines were quite popular in supercomputing centers.  For example, the 1988 Cray Y-MP was a typical vector machine.  When I was an undergraduate, ARSC still had a Y-MP vector machine.  The Y-MP had eight "vector" registers, each of which held 64 doubles (that's a total of 4KB of registers!).  A single Y-MP machine language instruction could add all 64 corresponding numbers in two vector registers, which enabled the Y-MP to achieve the (then) mind-blowing speed of *millions* of floating-point operations per second.  Vector machines have now almost completely died out; the Cray SV1 and the Japanese "Earth Simulator" are the last of this breed.  Vector machines are classic SIMD, because one instruction can modify 64 doubles.

But the most common form of SIMD today is the "multimedia" instruction set extension built into ordinary CPUs.  The usual arrangement is for a single instruction to operate on four 32-bit floats at once.  These four-float instructions exist almost everywhere nowadays; we'll look at the x86 version, SSE, in the most detail.
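As a quick preview, here's a minimal sketch using the standard SSE compiler intrinsics (my own example, not the wrapper class we'll build): one instruction adds four pairs of floats.

#include <xmmintrin.h>   // SSE intrinsics

// Adds four pairs of floats with a single SSE "addps" instruction.
void add4(const float *a, const float *b, float *out) {
	__m128 va = _mm_loadu_ps(a);       // load four floats from a
	__m128 vb = _mm_loadu_ps(b);       // load four floats from b
	__m128 sum = _mm_add_ps(va, vb);   // one instruction: four additions
	_mm_storeu_ps(out, sum);           // store four results
}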

Branch Prediction vs SSE

Here's a little benchmark to compare ordinary sequential C++ with the SSE "fourfloats" class we will develop next class.  The "if_then_else" method is written using the bitwise branch trick described above.  The surprising fact is that the sequential code's performance depends heavily on the predictability of the "if (src[i]<4.0)" branch:
enum {n=1000};
float src[n]={1.0,5.0,3.0,4.0};
float dest[n];

// Plain sequential loop: one float at a time, with a data-dependent branch.
int serial_loop(void) {
	for (int i=0;i<n;i++) {
		if (src[i]<4.0) dest[i]=src[i]*2.0; else dest[i]=17.0;
	}
	return 0;
}
// SSE loop: four floats per iteration; the branch becomes a bitwise select.
// (The "fourfloats" class is developed next lecture.)
int sse_loop(void) {
	for (int i=0;i<n;i+=4) {
		fourfloats s(&src[i]);
		fourfloats d=(s<4.0).if_then_else(s+s,17.0);
		d.store(&dest[i]);
	}
	return 0;
}
// Sorting src makes the "src[i]<4.0" branch predictable: all the "true" cases come first.
int sort_loop(void) {
	std::sort(&src[0],&src[n]);
	return 0;
}

int foo(void) {
	for (int i=0;i<n;i++) {src[i]=rand()%8;} // random data: the branch is unpredictable
	print_time("serial(rand)",serial_loop);
	print_time("SSE(rand)",sse_loop);
	print_time("sort",sort_loop);
	print_time("serial(post-sort)",serial_loop);
	print_time("SSE(post-sort)",sse_loop);
	//farray_print(dest,4); // <- for debugging
	return 0;
}

(Try this in NetRun now!)
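For reference, here's a rough sketch of what a "fourfloats"-style wrapper could look like on top of SSE intrinsics.  The real class gets built next lecture, so treat everything below as an assumption: just enough interface (construct, +, <, if_then_else, store) to make the benchmark above compile.

#include <xmmintrin.h>

// Hypothetical fourfloats sketch; the class we actually develop may differ in detail.
class fourfloats {
	__m128 v; // four packed 32-bit floats
public:
	fourfloats(__m128 x) :v(x) {}
	fourfloats(float x) :v(_mm_set1_ps(x)) {}         // broadcast one value into all four lanes
	fourfloats(const float *p) :v(_mm_loadu_ps(p)) {} // load four floats from memory
	friend fourfloats operator+(fourfloats a,fourfloats b) {return _mm_add_ps(a.v,b.v);}
	// Per-lane compare: each lane becomes all-one bits (true) or all-zero bits (false).
	friend fourfloats operator<(fourfloats a,fourfloats b) {return _mm_cmplt_ps(a.v,b.v);}
	// Bitwise multiplexer: exactly the "if then else" trick above, applied to four lanes at once.
	fourfloats if_then_else(fourfloats then,fourfloats els) const {
		return _mm_or_ps(_mm_and_ps(v,then.v),_mm_andnot_ps(v,els.v));
	}
	void store(float *p) const {_mm_storeu_ps(p,v);}
};

Because the comparison produces an all-ones/all-zeros mask per lane, if_then_else needs no branch at all, which is why the SSE times below don't depend on whether the data is sorted.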

The performance of this code on our various NetRun machines is summarized here, in nanoseconds/float:


Machine        Serial (rand)   Serial (sorted)   SSE (rand)   SSE (sorted)   Sort time
Q6600               7.5             1.1              1.2          1.2          16.7
Core2               9.4             1.9              1.6          1.6          21.7
Pentium 4          12.1             2.8              1.4          1.4          31.9
Pentium III        13.0             7.2              3.0          3.0          42.9