Streaming SIMD Extensions: SSE
CS 441 Lecture, Dr. Lawlor
SIMD
This is an old, simple, but powerful idea-- "SIMD", which stands for Single Instruction Multiple Data:
- Single, meaning just one.
- Instruction, as in a machine code instruction, executed by hardware.
- Multiple, as in more than one--from 2 to a thousand or so.
- Data, as in floats or ints.
You can do lots interesting SIMD work without using
any special instructions--plain old C will do, if you treat an "int" as
32 completely independent bits, because any normal "int" instructions
will operate on all the bits at once. This use of
bitwise operations is often called "SIMD within a register (SWAR)" or
"word-SIMD"; see Sean Anderson's "Bit Twiddling Hacks" for a variety of amazing examples.
Vector Machines
Back in the 1980's, "vector" machines were quite popular in
supercomputing centers. For example, the 1988 Cray Y-MP was a
typical vector machine. When I was an undergraduate, ARSC
still had a Y-MP vector machine. The Y-MP had eight "vector"
registers, each of which held 64 doubles (that's a total of 4KB of
registers!). A single Y-MP machine language instruction could add
all 64 corresponding numbers in two vector registers, which enabled the
Y-MP to achieve the (then) mind-blowing speed of *millions* of
floating-point operations per second. Vector machines have now
almost completely died out; the NEC SV-1 and the Japanese "Earth
Simulator" were the last of this breed. Vector machines are
classic SIMD, because one instruction can modify 64 doubles.
SIMD Floats
The most common form of SIMD today are the "multimedia" instruction
set extensions in normal CPUs. The usual arrangment for
multimedia instructions is for single instructions to operate on four
32-bit floats. These four-float instructions exist almost
everywhere nowdays:
- x86, where they are called "SSE", or "Streaming SIMD Extensions" ("MMX" is actually an older integer-only set of extensions)
- PowerPC, where they are called "AltiVec"
- Cell Broadband Engine, where they are part of the "Synergistic Processing Elements"
- Graphics cards, where they are part of the pixel processors
We'll look at the x86 version below.
SSE Assembly
The old 1970's floating-point on x86 was so bad, they actually built a
better optional way to do floating point called "SSE". Unlike the
old floating point:
- SSE registers can be accessed in any order, not just as a stack.
- All SSE registers can be trashed (no need to clean up the stack across function calls).
- SSE instructions look very similar to integer instructions.
- SSE won't work on ancient processors, like the Pentium I (but anymore, who cares?).
SSE instructions were first introduced with the Intel Pentium II, but
they're now found on all modern x86 processors, including the 64-bit
versions. SSE introduces 8 new registers, called xmm0 through
xmm7, that each contain four 32-bit single-precision floats. New
instructions that operate on these registers have the suffix "ps", for
"Packed Single-precision". See the x86 reference manual for a complete list of SSE instructions.
For example, "add" adds two integer registers, like eax and ebx.
"addps" adds two SSE registers, like xmm3 and xmm6. There
are SSE versions of most other arithmetic operations: subps, mulps, divps,
etc.
There's one curious operation called "shufps", which does a
permutation of the incoming floats. The permutation is controlled
by 4 2-bit "select" fields, one for each destination float, that
determines which incoming float goes to that slot. For example, a
swizzle can copy the first float into all four destination slots, with
a select field of 0x00. To copy the last float into all four
destination slots, the select field is 0xff. A select field of
00010111, or 0x17, would set the first float from the last (float 3),
copy float 1 into the middle two slots, and fill the last slot with the
first float (float 0). This is the *only* operation which
rearranges floats between different slots in the registers. (In
graphics language, this operation is called a "swizzle".)
For example, if xmm1 contains the four floats "(0.0, 1.1, 2.2, 3.3)",
then:
shufps xmm1,xmm1, 0xff
will set all 4 of xmm1's floats to the last value (index 11).
There are actually two flavors of the SSE move instruction: "movups"
moves a value between *unaligned* addresses (not a multiple of 16);
"movaps" moves a value between *aligned* addresses. The aligned
move is substantially faster, but it will segfault if the address you give isn't a multiple of 16! Luckily,
most "malloc" implementations return you 16-byte aligned data, and the
stack is aligned to 16 bytes with most compilers. But for maximum portability,
for the aligned move to work you sometimes have to manually add padding (using pointer arithmetic), which
is really painful.
Here's a simple SSE assembly example where we load up a constant, add it to itself with SSE, and then write it out:
movups xmm1,[thing1]; <- copy the four floats into xmm1
addps xmm1,xmm1; <- add floats to themselves
movups [retval],xmm1; <- move that constant into the global "retval"
; Print out retval
extern farray_print
push 4 ;<- number of floats to print
push retval ;<- points to array of floats
call farray_print
add esp,8 ; <- pop off arguments
ret
section .data
thing1: dd 10.2, 100.2, 1000.2, 10000.2;<- source constant
retval: dd 0.0, 0.0, 0.0, 0.0 ;<- our return value
(Try this in NetRun now!)
Here's a version that loads up two separate float constants, and adds them:
movups xmm1,[thing1]; <- copy the four floats into xmm1
movups xmm6,[thing2]; <- copy the four floats into xmm1
addps xmm1,xmm6; <- add floats
movups [retval],xmm1; <- move that constant into the global "retval"
; Print out retval
extern farray_print
push 4 ;<- number of floats to print
push retval ;<- points to array of floats
call farray_print
add esp,8 ; <- pop off arguments
ret
section .data
thing1: dd 10.2, 100.2, 1000.2, 10000.2;<- source constant
thing2: dd 1.2, 2.2, 3.2, 4.2;<- source constant
retval: dd 0.0, 0.0, 0.0, 0.0 ;<- our return value
(Try this in NetRun now!)
The "dd" lines declare a 32-bit constant. NASM is smart enough to
automatically use float format if you type a constant with a decimal
point!
SSE in C/C++
The x86 SSE instructions can be accessed from C/C++ via the header
<xmmintrin.h>. (Corresonding Apple headers exist for PowerPC AltiVec;
the AltiVec instructions have different names but are almost
identical.) The xmmintrin header exists and works out-of-the-box
with most modern compilers:
- GCC 3.x under linux. Be sure to compile with "-msse", or the header will whine.
- Visual C++ 7.0 under Windows.
- Intel C/C++ compiler under Linux or Windows. Makes bogus
complaint about "no EMMS instruction before return", although the
annoying emms() call is thankfully only needed with MMX, not SSE.
Documentation can be found by just reading the xmmintrin header. The underlying instructions have different names, which are listed in the x86 reference manual and summarized below and at this graphical and helpful SSE site.
The C version of an SSE register is the user-friendly and self-explanatory
type "__m128". All the instructions start with "_mm_" (i.e.,
MultiMedia). The suffix indicates the data type; in these
examples, we'll just be talking about 4 floats, which use the suffix
"_ps" (Packed Single-precision floats). SSE supports other data
types in the same 128 bits, but 4 floats seems to be the sweet
spot. So, for example, "_mm_load_ps" loads up 4 floats into a
__m128, "_mm_add_ps" adds 4 corresponding floats together, etc.
Major useful operations are:
__m128 _mm_load_ps(float *src)
|
Load 4 floats from a 16-byte aligned address. WARNING: Segfaults if the address isn't a multiple of 16!
|
__m128 _mm_loadu_ps(float *src) |
Load 4 floats from an unaligned address (4x slower!)
|
__m128 _mm_load1_ps(float *src) |
Load 1 individual float into all 4 fields of an __m128
|
__m128 _mm_setr_ps(float a,float b,float c,float d)
|
Load 4 separate floats from parameters into an __m128
|
void _mm_store_ps(float *dest,__m128 src)
|
Store 4 floats to an aligned address.
|
void _mm_storeu_ps(float *dest,__m128 src) |
Store 4 floats to unaligned address
|
__m128 _mm_add_ps(__m128 a,__m128 b)
|
Add corresponding floats (also "sub")
|
__m128 _mm_mul_ps(__m128 a,__m128 b) |
Multiply corresponding floats (also "div", but it's slow)
|
__m128 _mm_min_ps(__m128 a,__m128 b) |
Take corresponding minimum (also "max")
|
__m128 _mm_sqrt_ps(__m128 a) |
Take square roots of 4 floats (12ns, slow like divide)
|
__m128 _mm_rcp_ps(__m128 a) |
Compute rough (12-bit accuracy) reciprocal of all 4 floats (as fast as an add!)
|
__m128 _mm_rsqrt_ps(__m128 a) |
Rough (12-bit) reciprocal-square-root of all 4 floats (fast)
|
__m128 _mm_shuffle_ps(__m128 lo,__m128 hi,
_MM_SHUFFLE(hi3,hi2,lo1,lo0)) |
Interleave inputs into low 2 floats and high 2 floats of output. Basically
out[0]=lo[lo0];
out[1]=lo[lo1];
out[2]=hi[hi2];
out[3]=hi[hi3];
For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float a[i] into all 4 output floats.
|
There are also instructions for integer conversion, comparsions,
various bitwise operations, and cache-friendly prefetches and streaming
store operations.
So take your classic "loop over floats":
for (int i=0;i<n_vals;i++) {
vals[i]=vals[i]*a+b;
}
(executable NetRun link)
This takes about 4.5 ns per float.
Step 1 is to unroll this loop 4 times, since SSE works on blocks of 4 floats at a time:
for (int i=0;i<n_vals;i+=4) {
vals[i+0]=vals[i+0]*a+b;
vals[i+1]=vals[i+1]*a+b;
vals[i+2]=vals[i+2]*a+b;
vals[i+3]=vals[i+3]*a+b;
}
(Try this in NetRun now!)
This alone speeds the code up by about 2x, because we don't have to check the loop counter as often.
We can then replace the guts of the loop with SSE
instructions:
__m128 SSEa=_mm_load1_ps(&a);
__m128 SSEb=_mm_load1_ps(&b);
__m128 v=_mm_load_ps(&vec[i]);
v=_mm_add_ps(_mm_mul_ps(v,SSEa),SSEb);
_mm_store_ps(&vec[i],v);
(Try this in NetRun now!)
This gives us about a 4x speedup over the original, and still a 2x speedup over the unrolled version!
Note that if we use the unaligned loads and stores (loadu and storeu),
on the Pentium 4 we lose almost all the performance benefit from using SSE in the first
place! Luckily, newer machines have a lower unaligned load penalty.
A nice C++ interface for SSE operations
It's pretty easy to use C++ operator overloading to build a decent C++ interface to SSE:
#include <xmmintrin.h> /* Standard Intel header, has _mm_... functions */
/* A "vec4" is four floats: one SSE register. */
class vec4 {
public:
inline vec4(void) {}
inline vec4(__m128 val) :v(val) {}
/* Load up 4 floats from this 16-BYTE ALIGNED pointer */
inline void operator=(float *a) {v=_mm_load_ps(a);}
inline vec4(float *a) {(*this)=a;}
/* Load up 4 copies of 1 float */
inline void operator=(float a) {v=_mm_load1_ps(&a);}
inline vec4(float a) {(*this)=a;}
/* Extract the underlying xmmintrin value from this vec4 */
inline __m128 get(void) const {return v;}
/* Copy our 4 floats out to this 16-BYTE ALIGNED pointer */
inline void write(float *a) {_mm_store_ps(a,v);}
private:
__m128 v; // We contain one four-float (128-bit) value.
};
/* Make arithmetic work on vec4's */
inline vec4 operator+(const vec4 &a,const vec4 &b) { return _mm_add_ps(a.get(),b.get()); }
inline vec4 operator-(const vec4 &a,const vec4 &b) { return _mm_sub_ps(a.get(),b.get()); }
inline vec4 operator*(const vec4 &a,const vec4 &b) { return _mm_mul_ps(a.get(),b.get()); }
inline vec4 operator/(const vec4 &a,const vec4 &b) { return _mm_div_ps(a.get(),b.get()); }
/* User code */
enum {n=1024};
float arr[n];
float a=1.2, b=0.3;
int bar(void) {
vec4 A=a, B=b;
for (int i=0;i<n;i+=8) /* each loop iteration does 8 floats */
{
vec4 V=&arr[i]; /* load arr[i] .. arr[i+3] */
V=(V+A)*B;
V.write(&arr[i]); /* write back */
/* Unrolled SSE: second set of four floats */
V=&arr[i+4];
V=(V+A)*B;
V.write(&arr[i+4]);
}
return 0;
}
int foo(void) {
print_time("Bar",bar);
return 0;
}
(Try this in NetRun now!)
Branching from SIMD Code
There are a really curious set of instructions in SSE to support per-float branching:
- Comparison instructions,
like cmpeq, that fill the corresponding SSE register's float with
0xffffffff (binary all ones) if the corresponding float component
comparison comes out true, and all zeros if the comparison comes out
false. That is, these are per-component comparisons, so you could
get an output like {0xffffffff,0x0,0xffffffff,0x0} if the even two
components come out true, and the odd two components come out
false. Note that as floats, these values are useless--0xffffffff
is some crazy negative NaN.
- Logical operations, like and, nand, or, and xor, that bitwise-and SSE registers.
Note that bitwising AND'ing two ordinary floats is totally useless (the sign,
exponent, and mantissa fields get anded!). But bitwise AND'ing
with all-zeros zeros out the corresponding float, and AND'ing with all
ones leaves the float untouched.
These funky compare-and-AND instructions are actually useful to simulate branches. The
situation where these are useful is when you're trying to convert a
loop like this to SSE:
for (int i=0;i<n;i++) {
if (vec[i]<7)
vec[i]=vec[i]*a+b;
else
vec[i]=c;
}
(Try this in NetRun now!)
You can implement this branch by setting a mask indicating where
vals[i]<7, and then using the mask to pick the correct side of the
branch to squash:
for (int i=0;i<n;i++) {
unsigned int mask=(vec[i]<7)?0xffFFffFF:0;
vec[i]=((vec[i]*a+b)&mask) | (c&~mask);
}
Written in ordinary sequential code, this is actually a slowdown, not a
speedup! But in SSE this branch-to-logical transformation means
you can keep barreling along in parallel, without having to switch to
sequential floating point to do the branches:
__m128 A=_mm_load1_ps(&a), B=_mm_load1_ps(&b), C=_mm_load1_ps(&c);
__m128 Thresh=_mm_load1_ps(&thresh);
for (int i=0;i<n;i+=4) {
__m128 V=_mm_load_ps(&vec[i]);
__m128 mask=_mm_cmplt_ps(V,Thresh); // Do the comparison
__m128 V_then=_mm_add_ps(_mm_mul_ps(V,A),B); // "then" half of "if"
__m128 V_else=C; // "else" half of "if"
V=_mm_or_ps( _mm_and_ps(mask,V_then), _mm_andnot_ps(mask,V_else) );
_mm_store_ps(&vec[i],V);
}
(Try this in NetRun now!)
This gives about a 3.8x speedup over the original loop on my
machine! (Note: I'd flipped the arguments to _mm_andnot_ps above
until 2009-10-30!)
Intel hinted in their Larrabee paper that NVIDIA is actually doing this exact float-to-SSE branch transformation in CUDA, NVIDIA's very high-performance language for running sequential-looking code in parallel on the graphics card.
Apple explains how to use this bitwise branching technique when translating code written for PowerPC's version of SSE, called "AltiVec".
Next-Generation SSE: Larrabee
Intel seems to have started getting interested in GPU-style computing, so they're pushing their upcoming Larrabee
hardware. It's basically a set of in-order but dual-issue x86 cores
(derived from the 1992-era Pentium Pro!), each hyperthreaded 4 ways (to
hide latency) and with 16-float
SIMD instructions ("the VPU"). Throwing away out-of-order, branch
prediction, and most of the remaining 1994-2006 CPU technology allows
them to pack 10-50 such cores onto a single chip with current
technology, and reach GPU-competitive floating point performance.
For non-graphics, the plan is to support pthreads (if you can afford
creating and scheduling 100+ threads!), Intel Threading Building
Blocks, and OpenMP.
Intel has released a header file, the "Larrabee Prototype Library",
that lets you try out this new instruction set. It's got 32
64-byte registers, that can hold either 16 floats or 8 doubles.
There are "masked" versions of each operation that read per-float bit
masks from special mask registers, set by comparison functions.
This is the replacement for the bizarre and horrible SSE compare and
logical operations, and it looks substantially cleaner.