Branching from SIMD Code

CS 301 Lecture, Dr. Lawlor

There are a really curious set of instructions in SSE to support per-float branching:
These funky compare-and-AND instructions are actually useful to simulate branches.  The situation where these are useful is when you're trying to convert a loop like this to SSE:
	for (int i=0;i<n;i++) { 
        if (vec[i]<7)
vec[i]=vec[i]*a+b;
else
vec[i]=c;
}
(Try this in NetRun now!)

You can implement this branch by setting a mask indicating where vals[i]<7, and then using the mask to pick the correct side of the branch to squash:
	for (int i=0;i<n;i++) { 
        unsigned int mask=(vec[i]<7)?0xffFFffFF:0;
vec[i]=((vec[i]*a+b)&mask) | (c&~mask);
}
Written in ordinary sequential code, this is actually a slowdown, not a speedup!  But in SSE this branch-to-logical transformation means you can keep barreling along in parallel, without having to switch to sequential floating point to do the branches:
	__m128 A=_mm_load1_ps(&a), B=_mm_load1_ps(&b), C=_mm_load1_ps(&c);
__m128 Thresh=_mm_load1_ps(&thresh);
for (int i=0;i<n;i+=4) {
__m128 V=_mm_load_ps(&vec[i]);
__m128 mask=_mm_cmplt_ps(V,Thresh); // Do all four comparisons
__m128 V_then=_mm_add_ps(_mm_mul_ps(V,A),B); // "then" half of "if"
__m128 V_else=C; // "else" half of "if"
V=_mm_or_ps( _mm_and_ps(V_then,mask), _mm_andnot_ps(V_else,mask) );
_mm_store_ps(&vec[i],V);
}

(Try this in NetRun now!)

This gives about a 3.8x speedup over the original loop on my machine!

Intel hinted in their Larrabee paper that NVIDIA is actually doing this exact float-to-SSE branch transformation in CUDA, NVIDIA's very high-performance language for running sequential-looking code in parallel on the graphics card.

Apple explains how to use this bitwise branching technique when translating code written for PowerPC's version of SSE, called "AltiVec".