There is a really curious set of instructions in SSE to support per-float branching:

- Comparison instructions,
like cmpeq, that fill each float of the destination SSE register with
0xffffffff (binary all ones) if the corresponding float
comparison comes out true, and all zeros if it comes out
false. That is, these are per-component comparisons, so you could
get an output like {0xffffffff,0x0,0xffffffff,0x0} if the even
components come out true, and the odd components come out
false. Note that as floats, these values are useless--0xffffffff
is some crazy negative NaN.

- Logical operations, like and, andnot, or, and xor, that combine SSE registers bitwise.
Note that bitwise AND'ing two ordinary floats is totally useless (the sign,
exponent, and mantissa fields get ANDed!). But bitwise AND'ing
with all zeros zeros out the corresponding float, and AND'ing with all
ones leaves the float untouched.

For example, here's a loop with a branch inside:

for (int i=0;i<n;i++) {
    if (vec[i]<7)
        vec[i]=vec[i]*a+b;
    else
        vec[i]=c;
}

You can implement this branch by setting a mask indicating where vec[i]<7, and then using the mask to pick the correct side of the branch to squash:

for (int i=0;i<n;i++) {
    unsigned int mask=(vec[i]<7)?0xffFFffFF:0;
    vec[i]=((vec[i]*a+b)&mask) | (c&~mask); /* pseudocode: real C needs type-punning to bitwise-AND float bits */
}

Written in ordinary sequential code, this is actually a slowdown, not a speedup! But in SSE this branch-to-logical transformation means you can keep barreling along in parallel, without having to switch to sequential floating point to do the branches:

__m128 A=_mm_load1_ps(&a), B=_mm_load1_ps(&b), C=_mm_load1_ps(&c);
__m128 Thresh=_mm_load1_ps(&thresh);
for (int i=0;i<n;i+=4) {
    __m128 V=_mm_load_ps(&vec[i]);
    __m128 mask=_mm_cmplt_ps(V,Thresh); // Do all four comparisons
    __m128 V_then=_mm_add_ps(_mm_mul_ps(V,A),B); // "then" half of "if"
    __m128 V_else=C; // "else" half of "if"
    V=_mm_or_ps( _mm_and_ps(mask,V_then), _mm_andnot_ps(mask,V_else) ); // andnot complements its FIRST argument
    _mm_store_ps(&vec[i],V);
}

This gives about a 3.8x speedup over the original loop on my machine!

Intel hinted in their Larrabee paper that NVIDIA is actually doing this exact float-to-SSE branch transformation in CUDA, NVIDIA's very high-performance language for running sequential-looking code in parallel on the graphics card.

Apple explains how to use this bitwise branching technique when translating code written for "AltiVec", the PowerPC equivalent of SSE.