Parallelism: Single-Instruction-Multiple-Data (SIMD) and SSE

CS 301 Lecture, Dr. Lawlor, 2005/11/30

Modern machines do tons of crazy stuff to get things happening in parallel.  The problem with all these heroics is that they're getting more and more complicated (more expensive to design, test, and build; more power and heat) while producing less and less performance improvement.

But consider a program like this:
	for (int i=0;i<n_vals;i++) {
		vals[i]=vals[i]*a+b;
	}
There's actually n_vals-way parallelism in this program--every iteration is independent, so they could all run at once--but extracting that parallelism from the C program is really tricky, since n_vals, vals, a, or b could all be changing underneath the loop (because of another thread, or because the array aliases one of them).

What we really mean above is just:
for all i simultaneously
but that's not what the code above means in C; the C version means "set i to 0; do the loop; i++; if (i<n_vals) repeat", which is totally different!

Consider how different the loop becomes if we write:
	for (int i=1;i<n_vals;i++) {
		vals[i]=vals[i-1]*a+b;
	}
Now it's not a parallel loop at all: each iteration reads the value the previous iteration just wrote, so the iterations have to run one after another.

Fundamentally, there's a bottleneck in the way we write code: a sequential language makes us describe inherently parallel work one step at a time.  Lots of people have tried to bridge this "parallelism gap", with varying degrees of success.

SIMD: Parallel Arithmetic

The key acronym here is "SIMD": Single Instruction, Multiple Data.  For example, in Intel's x86 "SSE" (Streaming SIMD Extensions), there are special 128-bit registers (the "XMM" registers) that each hold 4 32-bit floating-point values.  You can then issue a Single Instruction (e.g., "addps") which works on all 4 floats at once ("Multiple Data").

SIMD has real advantages (one instruction does four operations' worth of work) and real disadvantages (the data must be packed together four at a time, and per-element branching is awkward).  The advantages are so compelling, though, that 4-way 32-bit float SIMD instructions now exist on all major processors, including x86 (SSE) and PowerPC (AltiVec).

x86 SSE Interface

The x86 SSE instructions can be accessed from C/C++ via the header <xmmintrin.h>.  (Corresponding Apple headers exist for PowerPC AltiVec; the AltiVec instructions have different names but are almost identical.)  The xmmintrin header exists and works out-of-the-box with most modern compilers.
Documentation can be found by just reading the xmmintrin header.  The underlying instructions have different names, which are listed in the x86 reference manual and summarized on several helpful graphical SSE reference sites.

The C version of an SSE register is the friendly and self-explanatory type "__m128".  All the instructions start with "_mm_" (i.e., MultiMedia).  The suffix indicates the data type; in these examples, we'll just be talking about 4 floats, which use the suffix "_ps" (Packed Single-precision floats).  SSE supports other data types in the same 128 bits, but 4 floats seems to be the sweet spot.  So, for example, "_mm_load_ps" loads up 4 floats into a __m128, "_mm_add_ps" adds 4 corresponding floats together, etc.  Major useful operations are:
__m128 _mm_load_ps(float *src)
	Load 4 floats from a 16-byte aligned address.
__m128 _mm_loadu_ps(float *src)
	Load 4 floats from an unaligned address (4x slower!).
__m128 _mm_load1_ps(float *src)
	Load 1 float into all 4 fields of an __m128.
__m128 _mm_setr_ps(float a,float b,float c,float d)
	Load 4 floats from parameters into an __m128.
void _mm_store_ps(float *dest,__m128 src)
	Store 4 floats to a 16-byte aligned address.
void _mm_storeu_ps(float *dest,__m128 src)
	Store 4 floats to an unaligned address.
__m128 _mm_add_ps(__m128 a,__m128 b)
	Add corresponding floats (also "sub").
__m128 _mm_mul_ps(__m128 a,__m128 b)
	Multiply corresponding floats (also "div").
__m128 _mm_min_ps(__m128 a,__m128 b)
	Take corresponding minimum (also "max").
__m128 _mm_sqrt_ps(__m128 a)
	Take square roots of 4 floats (12ns, slow like divide).
__m128 _mm_rcp_ps(__m128 a)
	Compute rough (12-bit accuracy) reciprocal of all 4 floats (as fast as an add!).
__m128 _mm_rsqrt_ps(__m128 a)
	Rough (12-bit accuracy) reciprocal-square-root of all 4 floats (fast).
__m128 _mm_shuffle_ps(__m128 lo,__m128 hi,_MM_SHUFFLE(hi3,hi2,lo1,lo0))
	Interleave inputs into low 2 floats and high 2 floats of output.  Basically output = {lo[lo0], lo[lo1], hi[hi2], hi[hi3]}.
	For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float a[i] into all 4 output floats.
There are also instructions for integer conversion, comparisons, various bitwise operations, and cache-friendly prefetches and streaming store operations.

Be careful!  The non "u" versions of load and store are much faster (and if you want performance, you really do need to use them), but they will segfault if the address you pass in isn't a multiple of 16 bytes!  Luckily, most "malloc" implementations return you 16-byte aligned data, and the stack is aligned to 16 bytes with most compilers, especially if you've got __m128 objects in your stack frame.  But for portable code, you may have to manually add padding (using pointer arithmetic), which is really painful.

When things work, like in the "times A plus B" array loop above, we can unroll the loop 4 times, and replace the guts of the loop with SSE instructions:
	__m128 va=_mm_load1_ps(&a); /* va contains 4 copies of a */
	__m128 vb=_mm_load1_ps(&b); /* vb contains 4 copies of b */
	for (int i=0;i<n_vals;i+=4) { /* careful! n_vals must be a multiple of 4! */
		__m128 v=_mm_load_ps(&vals[i]); /* careful about alignment! */
		v=_mm_add_ps(_mm_mul_ps(v,va),vb); /* vals[i]*a+b, 4 at a time */
		_mm_store_ps(&vals[i],v);
	}
This gives us about a 4x speedup over the original, and still a 2x speedup over the unrolled version!