Branching from SIMD Code
CS 301 Lecture, Dr. Lawlor
SSE includes a curious set of instructions to support per-float branching:
- Comparison instructions,
like cmpeq, that fill the corresponding SSE register's float with
0xffffffff (binary all ones) if the corresponding float component
comparison comes out true, and all zeros if the comparison comes out
false. That is, these are per-component comparisons, so you could
get an output like {0xffffffff,0x0,0xffffffff,0x0} if the even
components compare true and the odd components compare
false. Note that as floats, these values are useless--0xffffffff
is the bit pattern of some crazy negative NaN.
- Logical operations, like and, andnot, or, and xor, that combine SSE registers bitwise.
Note that bitwise AND'ing two ordinary floats is totally useless (the sign,
exponent, and mantissa fields all get AND'ed together!). But bitwise AND'ing
with all zeros zeros out the corresponding float, and AND'ing with all
ones leaves the float untouched.
These funky compare-and-AND instructions let you simulate branches. They
come in handy when you're trying to convert a
loop like this to SSE:
for (int i=0;i<n;i++) {
	if (vec[i]<7)
		vec[i]=vec[i]*a+b;
	else
		vec[i]=c;
}
You can implement this branch by setting a mask indicating where
vec[i]<7, and then using the mask to keep the correct side of the
branch and squash the other:
for (int i=0;i<n;i++) {
	unsigned int mask=(vec[i]<7)?0xffFFffFF:0;
	float then=vec[i]*a+b; /* "then" side of the branch */
	unsigned int tb, cb, rb; /* C won't AND floats directly, so view their bits as integers */
	memcpy(&tb,&then,4); memcpy(&cb,&c,4);
	rb=(tb&mask)|(cb&~mask); /* keep one side, squash the other */
	memcpy(&vec[i],&rb,4);
}
Written in ordinary sequential code, this is actually a slowdown, not a
speedup! But in SSE this branch-to-logical transformation means
you can keep barreling along in parallel, without having to switch to
sequential floating point to do the branches:
__m128 A=_mm_load1_ps(&a), B=_mm_load1_ps(&b), C=_mm_load1_ps(&c);
__m128 Thresh=_mm_load1_ps(&thresh);
for (int i=0;i<n;i+=4) {
	__m128 V=_mm_load_ps(&vec[i]);
	__m128 mask=_mm_cmplt_ps(V,Thresh); // Do all four comparisons at once
	__m128 V_then=_mm_add_ps(_mm_mul_ps(V,A),B); // "then" half of "if"
	__m128 V_else=C; // "else" half of "if"
	// andnot computes (~first)&second, so the mask goes first here
	V=_mm_or_ps( _mm_and_ps(V_then,mask), _mm_andnot_ps(mask,V_else) );
	_mm_store_ps(&vec[i],V);
}
This gives about a 3.8x speedup over the original loop on my machine!
Intel hinted in their Larrabee paper that NVIDIA is actually doing this exact float-to-SSE branch transformation in CUDA, NVIDIA's very high-performance language for running sequential-looking code in parallel on the graphics card.
Apple explains how to use this bitwise branching technique when translating code written for AltiVec, the PowerPC equivalent of SSE.