Parallelism: Single-Instruction-Multiple-Data (SIMD) and SSE

CS 301 Lecture, Dr. Lawlor, 2005/11/30

Modern machines do tons of crazy stuff to get things happening in parallel:

"Pipelining" is just a way to break up a long series of tasks into smaller pieces that can be executed simultaneously. Machines have been pipelined since the 386.
"Superscalar" machines use fancy lookahead and scheduling algorithms to try to execute several instructions in parallel. All modern machines since about the Pentium era are superscalar.
"Out-of-order" machines will actually analyze dependencies and execute instructions that have their data available before previous instructions have even finished. The Pentium II and subsequent machine execute out-of-order.
"Load buffers" and "store buffers" allow memory reads and writes to proceed simultaneously with other processing. The Pentium 4 has a 24-entry store buffer, which means the processor can have 24 store instructions "in flight" at once!

The problem with all these heroics is that they're getting more and more complicated (more expensive to design, test, and build; more power and heat) and they're producing less and less performance improvement.

But consider a program like this:

for (int i=0;i<n_vals;i++) { 
        vals[i]=vals[i]*a+b;
}

There's actually n_vals-way parallelism in this program, but to extract that parallelism from the C program is really tricky, since n_vals, vals, a, or b could all be changing (because of another thread, because the array points to it, etc.).

What we really mean above is just:

for all i simultaneously
	vals[i]=vals[i]*a+b;

but that's not what the code above means in C; the C version means "set i to 0; do the loop; i++; if (i<n_vals) repeat", which is totally different!

Consider how different the loop becomes if we write:

for (int i=1;i<n_vals;i++) { 
        vals[i]=vals[i-1]*a+b;
}

Now it's not a parallel loop at all.

Fundamentally, there's a bottleneck in the way we write code:

The problems we want to solve often (but not always) have an obvious, trivially parallel structure.
The code we write to solve the problem is written using the inherently sequential, one-thing-after-another model of all assembly language, C, C++, Java, etc.
The machine we run the code on then goes to great lengths to rediscover the original parallelism, and hence execute our sequential code in parallel.

Lots of people have tried to bridge this "parallelism gap", with varying degrees of success:

People that write "parallelizing compilers" (e.g., OpenMP) either automatically recognize "parallel loops" or allow you to provide hints in your source code as to which loops can be parallelized. These attempts are somewhat hamstrung by the fact that they're still based on inherently sequential languages, which prevents many of the kind of transformations you want to make for performance (e.g., load balancing).
People that write "parallel languages" (e.g., Parallel Prolog) attempt to describe the entire problem in an inherently parallel way. This is really tough for people to do, and usually isn't a good match to modern sequential-centric machines.
People have added "parallel arithmetic" to virtually all modern hardware. This is a low-benefit but low-cost way to capture more parallelism.

SIMD: Parallel Arithmetic

The key acronym here is "SIMD": Single Instruction, Multiple Data. For example, in Intel's x86 "SSE" (Streaming SIMD Extensions), there are special registers (the "SSE" registers) that hold 4 32-bit floating-point values. You can then issue a Single Instruction (e.g., "padd") which works on all 4 floats at once ("Multiple Data").

The advantages of SIMD are:

Up to a 4x speedup over normal sequential code. Note that 4 32-bit floats is 128 bits, so you can impress people by saying you've got a "128-bit machine", which is true in some sense.
4 elements is perfect for doing calculations on colors (Red-Green-Blue-Alpha), 3D locations (XYZW), or 4-channel stereo.
Easy to implement in hardware, because the hardware doesn't have to *recognize* all 4 operations are independent, it's just *told* they are.

The disadvantages are:

The "4" is hardcoded into the hardware and source code. For higher levels of parallelism, you're back to pipelining and chasing dependencies.
Sometimes it's tricky to get exactly 4 copies of what you need--sometimes you need to add extra padding.

The advantages are so compelling, though, that 4-way 32-bit float SIMD instructions now exist on all major processors:

x86: SSE, SSE2, SSE3. "MMX" is the integer version.
PowerPC: AltiVec
Cell Processor
All graphics cards (see next lecture)

x86 SSE Interface

The x86 SSE instructions can be accessed from C/C++ via the header <xmmintrin.h>. (Corresonding Apple headers exist for PowerPC AltiVec; the AltiVec instructions have different names but are almost identical.) The xmmintrin header exists and works out-of-the-box with most modern compilers:

GCC 3.x under linux. Be sure to compile with "-msse", or the header will whine.
Visual C++ 7.0 under Windows.
Intel C/C++ compiler under Linux or Windows. Makes bogus complaint about "no EMMS instruction before return", although the annoying emms() call is thankfully only needed with MMX, not SSE.

Documentation can be found by just reading the xmmintrin header. The underlying instructions have different names, which are listed in the x86 reference manual and summarized at this graphical and helpful SSE site.

The C version of an SSE register is the friendly and self-explanatory type "__m128". All the instructions start with "_mm_" (i.e., MultiMedia). The suffix indicates the data type; in these examples, we'll just be talking about 4 floats, which use the suffix "_ps" (Packed Single-precision floats). SSE supports other data types in the same 128 bits, but 4 floats seems to be the sweet spot. So, for example, "_mm_load_ps" loads up 4 floats into a __m128, "_mm_add_ps" adds 4 corresponding floats together, etc. Major useful operations are:

__m128 _mm_load_ps(float *src)	Load 4 floats from a 16-byte aligned address.
__m128 _mm_loadu_ps(float *src)	Load from an unaligned address (4x slower!)
__m128 _mm_load1_ps(float *src)	Load 1 float into all 4 fields of an __m128
__m128 _mm_setr_ps(float a,float b,float c,float d)	Load 4 floats from parameters into an __m128
void _mm_store_ps(float *dest,__m128 src)	Store 4 floats to an aligned address.
void _mm_storeu_ps(float *dest,__m128 src)	Store 4 floats to unaligned address
__m128 _mm_add_ps(__m128 a,__m128 b)	Add corresponding floats (also "sub")
__m128 _mm_mul_ps(__m128 a,__m128 b)	Multiply corresponding floats (also "div")
__m128 _mm_min_ps(__m128 a,__m128 b)	Take corresponding minimum (also "max")
__m128 _mm_sqrt_ps(__m128 a)	Take square roots of 4 floats (12ns, slow like divide)
__m128 _mm_rcp_ps(__m128 a)	Compute rough (12-bit accuracy) reciprocal of all 4 floats (fast as an add!)
__m128 _mm_rsqrt_ps(__m128 a)	Rough (12-bit) reciprocal-square-root of all 4 floats (fast)
__m128 _mm_shuffle_ps(__m128 lo,__m128 hi, _MM_SHUFFLE(hi3,hi2,lo1,lo0))	Interleave inputs into low 2 floats and high 2 floats of output. Basically out[0]=lo[lo0]; out[1]=lo[lo1]; out[2]=hi[hi2]; out[3]=hi[hi3]; For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float a[i] into all 4 output floats.

There are also instructions for integer conversion, comparsions, various bitwise operations, and cache-friendly prefetches and streaming store operations.

Be careful! The non "u" versions of load and store are much faster (and if you want performance, you really do need to use them), but they will segfault if the address you pass in isn't a multiple of 16 bytes! Luckily, most "malloc" implementations return you 16-byte aligned data, and the stack is aligned to 16 bytes with most compilers, especially if you've got __m128 objects in your stack frame. But for portable code, you may have to manually add padding (using pointer arithmetic), which is really painful.

When things work, like in the "times A plus B" array loop above, we can unroll the loop 4 times, and replace the guts of the loop with SSE instructions:

	__m128 va=_mm_load_ps1(&a); /* va contains 4 copies of a */
	__m128 vb=_mm_load_ps1(&b); /* vb contains 4 copies of b */
	for (int i=0;i<n_vals;i+=4) { /* careful!  n_vals must be multiple of 4! */
		__m128 v=_mm_load_ps(&vals[i]); /* careful about alignment! */
		v=_mm_mul_ps(v,va);
		v=_mm_add_ps(v,vb);
		_mm_store_ps(&vals[i],v);
	}

This gives us about a 4x speedup over the original, and still a 2x speedup over the unrolled version!