<xmmintrin.h>: SSE instructions from C++

CS 301 Lecture, Dr. Lawlor
The x86 SSE instructions can be accessed from C/C++ via the header <xmmintrin.h>.  (Corresponding Apple headers exist for PowerPC AltiVec; the AltiVec instructions have different names but are almost identical.)  The xmmintrin header exists and works out-of-the-box with most modern compilers.
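For example, a minimal test program along these lines should build with a plain g++ or clang++ invocation (this is just a sketch; alignas needs C++11, and older compilers can use __attribute__((aligned(16))) instead):
	#include <xmmintrin.h> /* SSE intrinsics: __m128 and the _mm_* functions */
	#include <stdio.h>
	
	int main(void) {
		alignas(16) float v[4]={1.0f,2.0f,3.0f,4.0f}; /* 16-byte aligned, so _mm_load_ps is safe */
		__m128 x=_mm_load_ps(v);   /* load all 4 floats at once */
		x=_mm_add_ps(x,x);         /* add each float to itself */
		_mm_store_ps(v,x);         /* store the 4 results back */
		printf("%f %f %f %f\n",v[0],v[1],v[2],v[3]); /* prints 2 4 6 8 */
		return 0;
	}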
Documentation can be found by just reading the xmmintrin header.  The underlying instructions have different names, which are listed in the x86 reference manual and summarized below.

The C version of an SSE register is the user-friendly and self-explanatory type "__m128".  All the instructions start with "_mm_" (i.e., MultiMedia).  The suffix indicates the data type; in these examples, we'll just be talking about 4 floats, which use the suffix "_ps" (Packed Single-precision floats).  SSE supports other data types in the same 128 bits, but 4 floats seems to be the sweet spot.  So, for example, "_mm_load_ps" loads up 4 floats into a __m128, "_mm_add_ps" adds 4 corresponding floats together, etc.  Major useful operations are:
__m128 _mm_load_ps(float *src)
	Load 4 floats from a 16-byte aligned address.  WARNING: Segfaults if the address isn't a multiple of 16!
__m128 _mm_loadu_ps(float *src)
	Load 4 floats from an unaligned address (about 4x slower!)
__m128 _mm_load1_ps(float *src)
	Load 1 individual float into all 4 fields of an __m128
__m128 _mm_setr_ps(float a,float b,float c,float d)
	Load 4 separate floats from parameters into an __m128
void _mm_store_ps(float *dest,__m128 src)
	Store 4 floats to an aligned address.
void _mm_storeu_ps(float *dest,__m128 src)
	Store 4 floats to an unaligned address.
__m128 _mm_add_ps(__m128 a,__m128 b)
	Add corresponding floats (also "sub")
__m128 _mm_mul_ps(__m128 a,__m128 b)
	Multiply corresponding floats (also "div", but it's slow)
__m128 _mm_min_ps(__m128 a,__m128 b)
	Take corresponding minimum (also "max")
__m128 _mm_sqrt_ps(__m128 a)
	Take square roots of 4 floats (12ns, slow like divide)
__m128 _mm_rcp_ps(__m128 a)
	Compute rough (12-bit accuracy) reciprocal of all 4 floats (as fast as an add!)
__m128 _mm_rsqrt_ps(__m128 a)
	Rough (12-bit) reciprocal-square-root of all 4 floats (fast)
__m128 _mm_shuffle_ps(__m128 lo,__m128 hi, _MM_SHUFFLE(hi3,hi2,lo1,lo0))
	Interleave inputs into low 2 floats and high 2 floats of output.  Basically
	   out[0]=lo[lo0];
	   out[1]=lo[lo1];
	   out[2]=hi[hi2];
	   out[3]=hi[hi3];
	For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float a[i] into all 4 output floats, as in the sketch below.
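Here's a quick sketch of that broadcast trick (illustrative values only; it assumes the same includes as the example above):
	alignas(16) float src[4]={10.0f,20.0f,30.0f,40.0f};
	__m128 x=_mm_load_ps(src);
	__m128 all2=_mm_shuffle_ps(x,x,_MM_SHUFFLE(2,2,2,2)); /* every output lane gets src[2]==30.0 */
	alignas(16) float out[4];
	_mm_store_ps(out,all2);
	printf("%f %f %f %f\n",out[0],out[1],out[2],out[3]); /* prints 30 30 30 30 */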
There are also instructions for integer conversion, comparisons, various bitwise operations, and cache-friendly prefetches and streaming store operations.

So take your classic "loop over floats":
	for (int i=0;i<n_vals;i++) {
		vals[i]=vals[i]*a+b;
	}
(executable NetRun link)

This takes about 4.5 ns per float.

Step 1 is to unroll this loop 4 times, since SSE works on blocks of 4 floats at a time:
	for (int i=0;i<n_vals;i+=4) {
		vals[i+0]=vals[i+0]*a+b;
		vals[i+1]=vals[i+1]*a+b;
		vals[i+2]=vals[i+2]*a+b;
		vals[i+3]=vals[i+3]*a+b;
	}
(Try this in NetRun now!)

This alone speeds the code up by about 2x, because we don't have to check the loop counter as often.

We can then replace the guts of the loop with SSE instructions, hoisting the loop-invariant loads of a and b out of the loop:
	__m128 SSEa=_mm_load1_ps(&a); // broadcast a into all 4 floats (only needs to happen once)
	__m128 SSEb=_mm_load1_ps(&b);
	for (int i=0;i<n_vals;i+=4) {
		__m128 v=_mm_load_ps(&vals[i]); // load 4 floats (16-byte aligned)
		v=_mm_add_ps(_mm_mul_ps(v,SSEa),SSEb); // v=v*a+b for 4 floats at once
		_mm_store_ps(&vals[i],v);
	}

(Try this in NetRun now!)

This gives us about a 4x speedup over the original, and still a 2x speedup over the unrolled version!

Note that if we use the unaligned loads and stores (loadu and storeu), on the Pentium 4 we lose almost all the performance benefit from using SSE in the first place!
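For comparison, the unaligned variant only swaps the two memory intrinsics (a sketch, assuming the same vals, a, b, and n_vals as above):
	__m128 SSEa=_mm_load1_ps(&a);
	__m128 SSEb=_mm_load1_ps(&b);
	for (int i=0;i<n_vals;i+=4) {
		__m128 v=_mm_loadu_ps(&vals[i]); // works for any address, aligned or not
		v=_mm_add_ps(_mm_mul_ps(v,SSEa),SSEb);
		_mm_storeu_ps(&vals[i],v); // ditto for the store
	}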

Branching from SIMD Code

There is a really curious set of instructions in SSE to support per-float branching:
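(A few of the relevant ones, in the style of the table above; these are the ones used below, and the header has the full set.)
__m128 _mm_cmplt_ps(__m128 a,__m128 b)
	Compare corresponding floats: each output float's bits are all ones where a<b, all zeros otherwise (also "le", "gt", "ge", "eq", "neq")
__m128 _mm_and_ps(__m128 a,__m128 b)
	Bitwise AND of all 128 bits
__m128 _mm_andnot_ps(__m128 a,__m128 b)
	Bitwise AND of (NOT a) with b
__m128 _mm_or_ps(__m128 a,__m128 b)
	Bitwise OR of all 128 bits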
These funky compare-and-AND instructions are actually useful for simulating branches.  The typical situation is when you're trying to convert a loop like this to SSE:
	for (int i=0;i<n;i++) {
		if (vec[i]<7)
			vec[i]=vec[i]*a+b;
		else
			vec[i]=c;
	}
(Try this in NetRun now!)

You can implement this branch by setting a mask indicating where vec[i]<7, and then using the mask to keep the correct side of the branch and squash the other:
	for (int i=0;i<n;i++) {
		unsigned int mask=(vec[i]<7)?0xffFFffFF:0; // all-ones bits where the "then" side applies
		vec[i]=((vec[i]*a+b)&mask) | (c&~mask); // conceptual: & and | here act on the floats' bit patterns (plain C won't compile this directly)
	}
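To make that scalar version actually compile, the float bits have to be moved into integer variables first; here's a sketch using memcpy (from <string.h>) as the portable way to reinterpret the bits:
	for (int i=0;i<n;i++) {
		unsigned int mask=(vec[i]<7)?0xffFFffFF:0;
		float then_val=vec[i]*a+b, else_val=c;
		unsigned int t,e,r;
		memcpy(&t,&then_val,4); memcpy(&e,&else_val,4); /* grab the raw float bits */
		r=(t&mask)|(e&~mask); /* blend the two bit patterns */
		memcpy(&vec[i],&r,4); /* write the winner back as a float */
	}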
Written in ordinary sequential code, this is actually a slowdown, not a speedup!  But in SSE this branch-to-logical transformation means you can keep barreling along in parallel, without having to switch to sequential floating point to do the branches:
	__m128 A=_mm_load1_ps(&a), B=_mm_load1_ps(&b), C=_mm_load1_ps(&c);
	float thresh=7.0f; // the cutoff from the scalar "if" above
	__m128 Thresh=_mm_load1_ps(&thresh);
	for (int i=0;i<n;i+=4) {
		__m128 V=_mm_load_ps(&vec[i]);
		__m128 mask=_mm_cmplt_ps(V,Thresh); // Do the comparison: all-ones bits where vec[i]<thresh
		__m128 V_then=_mm_add_ps(_mm_mul_ps(V,A),B); // "then" half of "if"
		__m128 V_else=C; // "else" half of "if"
		V=_mm_or_ps( _mm_and_ps(mask,V_then), _mm_andnot_ps(mask,V_else) ); // pick per float
		_mm_store_ps(&vec[i],V);
	}

(Try this in NetRun now!)

This gives about a 3.8x speedup over the original loop on my machine!

Intel hinted in their Larrabee paper that NVIDIA is actually doing this same branch-to-mask transformation inside CUDA, NVIDIA's very high-performance language for running sequential-looking code in parallel on the graphics card.

Apple explains how to use this bitwise branching technique when translating code written for PowerPC's version of SSE, called "AltiVec".

Here's another CPU example:
unsigned long THEN=0xa7a6a5a4a3a2a1a0;
unsigned long ELSE=0xb7b6b5b4b3b2b1b0;
unsigned long MASK=0xffff00ff00ff0000;
printf("%s%016lx\n","MASK         =",MASK);
printf("%s%016lx\n","MASK & THEN  =",(MASK&THEN));
printf("%s%016lx\n","~MASK & ELSE =",((~MASK)&ELSE));
unsigned long BRANCH=(MASK&THEN)|((~MASK)&ELSE);
printf("%s%016lx\n","BRANCH       =",BRANCH);

(Try this in NetRun now!)

This prints out:
MASK         =ffff00ff00ff0000
MASK & THEN  =a7a600a400a20000
~MASK & ELSE =0000b500b300b1b0
BRANCH       =a7a6b5a4b3a2b1b0

Next-Generation SSE: Larrabee

Intel seems to have started getting interested in GPU-style computing, so they're pushing their upcoming Larrabee hardware. It's basically a set of in-order but dual-issue x86 cores (derived from the original 1993-era Pentium!), each hyperthreaded 4 ways (to hide latency) and extended with 16-float SIMD instructions ("the VPU").  Throwing away out-of-order execution, branch prediction, and most of the remaining 1994-2006 CPU technology allows them to pack 10-50 such cores onto a single chip with current technology, and reach GPU-competitive floating-point performance.  For non-graphics work, the plan is to support pthreads (if you can afford creating and scheduling 100+ threads!), Intel Threading Building Blocks, and OpenMP.

Intel has released a header file, the "Larrabee Prototype Library", that lets you try out this new instruction set.  It's got 32 64-byte registers, each of which can hold either 16 floats or 8 doubles.  There are "masked" versions of each operation that read per-float bit masks from special mask registers, set by comparison functions.  This is the replacement for the bizarre and horrible SSE compare and logical operations above, and it looks substantially cleaner.