We covered an old, simple, but powerful idea today-- "SIMD", which stands for Single Instruction Multiple Data:

- Single, meaning just one.

- Instruction, as in a machine code instruction, executed by hardware.

- Multiple, as in more than one--from 2 to a thousand or so.

- Data, as in floats or ints.

Back in the 1980's, "vector" machines were quite popular in supercomputing centers. For example, the 1988 Cray Y-MP was a typical vector machine. When I was an undergraduate, ARSC still had a Y-MP vector machine. The Y-MP had eight "vector" registers, each of which held 64 doubles (that's a total of 4KB of registers!). A single Y-MP machine language instruction could add all 64 corresponding numbers in two vector registers, which enabled the Y-MP to achieve the (then) mind-blowing speed of *millions* floating-point operations per second. Vector machines have now almost completely died out; the NEC SV-1 and the Japanese "Earth Simulator" are the last of this breed. Vector machines are classic SIMD, because one instruction can modify 64 doubles.

But the most common form of SIMD today are the "multimedia" instruction set extensions in normal CPUs. The usual arrangment for multimedia instructions is for single instructions to operate on four 32-bit floats. These four-float instructions exist almost everywhere nowdays:

- x86, where they are called "SSE", or "Streaming SIMD Extensions"
- PowerPC, where they are called "AltiVec"
- Cell Broadband Engine, where they are part of the "Synergistic Processing Elements"
- Graphics cards, where they are part of the pixel processors

For example, "add" adds two integer registers, like eax and ebx. "addps" adds two SSE registers, like xmm3 and xmm6. There are SSE versions of most other arithmetic operations: subps, mulps, divps, etc.

There are actually two flavors of the SSE move instruction: "movups" moves a value between *unaligned* addresses (not a multiple of 16); "movaps" moves a value between *aligned* addresses. The aligned move is substantially faster, but it will segfault if the address you give isn't a multiple of 16! Luckily, most "malloc" implementations return you 16-byte aligned data, and the stack is aligned to 16 bytes with most compilers. But for maximum portability, for the aligned move to work you sometimes have to manually add padding (using pointer arithmetic), which is really painful.

Here's a simple SSE assembly example where we load up a constant, add it to itself with SSE, and then write it out:

movups xmm1,[thing1]; <- copy the four floats into xmm1

addps xmm1,xmm1; <- add floats to themselves

movups [retval],xmm1; <- move that constant into the global "retval"

; Print out retval

extern farray_print

push 4 ;<- number of floats to print

push retval ;<- points to array of floats

call farray_print

add esp,8 ; <- pop off arguments

ret

section .data

thing1: dd 10.2, 100.2, 1000.2, 10000.2;<- source constant

retval: dd 0.0, 0.0, 0.0, 0.0 ;<- our return value

Here's a version that loads up two separate float constants, and adds them:

movups xmm1,[thing1]; <- copy the four floats into xmm1The "dd" lines declare a 32-bit constant. NASM is smart enough to automatically use float format if you type a constant with a decimal point!

movups xmm6,[thing2]; <- copy the four floats into xmm1

addps xmm1,xmm6; <- add floats

movups [retval],xmm1; <- move that constant into the global "retval"

; Print out retval

extern farray_print

push 4 ;<- number of floats to print

push retval ;<- points to array of floats

call farray_print

add esp,8 ; <- pop off arguments

ret

section .data

thing1: dd 10.2, 100.2, 1000.2, 10000.2;<- source constant

thing2: dd 1.2, 2.2, 3.2, 4.2;<- source constant

retval: dd 0.0, 0.0, 0.0, 0.0 ;<- our return value

There's one curious operation called "shufps", which does a permutation of the incoming floats. The permutation is controlled by 4 2-bit "select" fields, one for each destination float, that determines which incoming float goes to that slot. For example, a swizzle can copy the first float into all four destination slots, with a select field of 0x00. To copy the last float into all four destination slots, the select field is 0xff. A select field of 00010111, or 0x17, would set the first float from the last (float 3), copy float 1 into the middle two slots, and fill the last slot with the first float (float 0). This is the *only* operation which rearranges floats between different slots in the registers. (In graphics language, this operation is called a "swizzle".) For example, if xmm1 contains the four floats "(0.0, 1.1, 2.2, 3.3)", then:

shufps xmm1,xmm1, 0xff

will set all 4 of xmm1's floats to the last value (index 11).

- GCC 3.x under linux or MacOS X. Be sure to compile with "-msse", or the header will whine.

- Visual C++ 2003 or better under Windows.

- Intel C/C++ compiler under Linux or Windows. Makes bogus
complaint about "no EMMS instruction before return", although the
annoying emms() call is thankfully only needed with MMX, not SSE.

The C version of an SSE register is the user-friendly and self-explanatory type "__m128". All the instructions start with "_mm_" (i.e., MultiMedia). The suffix indicates the data type; in these examples, we'll just be talking about 4 floats, which use the suffix "_ps" (Packed Single-precision floats). SSE supports other data types in the same 128 bits, but 4 floats seems to be the sweet spot. So, for example, "_mm_load_ps" loads up 4 floats into a __m128, "_mm_add_ps" adds 4 corresponding floats together, etc. Major useful operations are:

__m128 _mm_load_ps(float *src) |
Load 4 floats from a 16-byte aligned address. WARNING: Segfaults if the address isn't a multiple of 16! |

__m128 _mm_loadu_ps(float *src) | Load 4 floats from an unaligned address (4x slower!) |

__m128 _mm_load1_ps(float *src) | Load 1 individual float into all 4 fields of an __m128 |

__m128 _mm_setr_ps(float a,float b,float c,float d) |
Load 4 separate floats from parameters into an __m128 |

void _mm_store_ps(float *dest,__m128 src) |
Store 4 floats to an aligned address. |

void _mm_storeu_ps(float *dest,__m128 src) | Store 4 floats to unaligned address |

__m128 _mm_add_ps(__m128 a,__m128 b) |
Add corresponding floats (also "sub") |

__m128 _mm_mul_ps(__m128 a,__m128 b) | Multiply corresponding floats (also "div", but it's slow) |

__m128 _mm_min_ps(__m128 a,__m128 b) | Take corresponding minimum (also "max") |

__m128 _mm_sqrt_ps(__m128 a) | Take square roots of 4 floats (12ns, slow like divide) |

__m128 _mm_rcp_ps(__m128 a) | Compute rough (12-bit accuracy) reciprocal of all 4 floats (as fast as an add!) |

__m128 _mm_rsqrt_ps(__m128 a) | Rough (12-bit) reciprocal-square-root of all 4 floats (fast) |

__m128 _mm_shuffle_ps(__m128 lo,__m128 hi, _MM_SHUFFLE(hi3,hi2,lo1,lo0)) |
Interleave inputs into low 2 floats and high 2 floats of output. Basically out[0]=lo[lo0]; out[1]=lo[lo1]; out[2]=hi[hi2]; out[3]=hi[hi3]; For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float a[i] into all 4 output floats. |

So take your classic "loop over floats":

for (int i=0;i<n_vals;i++) {(executable NetRun link)

vals[i]=vals[i]*a+b;

}

This takes about 4.5 ns per float.

Step 1 is to unroll this loop 4 times, since SSE works on blocks of 4 floats at a time:

for (int i=0;i<n_vals;i+=4) {(Try this in NetRun now!)

vals[i+0]=vals[i+0]*a+b;

vals[i+1]=vals[i+1]*a+b;

vals[i+2]=vals[i+2]*a+b;

vals[i+3]=vals[i+3]*a+b;

}

This alone speeds the code up by about 2x, because we don't have to check the loop counter as often.

We can then replace the guts of the loop with SSE instructions:

__m128 SSEa=_mm_load1_ps(&a);This gives us about a 4x speedup over the original, and still a 2x speedup over the unrolled version!

__m128 SSEb=_mm_load1_ps(&b);

__m128 v=_mm_load_ps(&vec[i]);

v=_mm_add_ps(_mm_mul_ps(v,SSEa),SSEb);

_mm_store_ps(&vec[i],v);

(Note: I'd previously called "_mm_load_ps1" in the above example, which works, but is a bit slower.)

Note that if we use the unaligned loads and stores (loadu and storeu), we lose almost all the performance benefit from using SSE in the first place!

Apple points out that even when you're doing arithmetic on actual 3D vectors (for example), you might *not* want to store one vector per SSE register. The reason is that not every vector operation maps nicely onto SSE, for example the final summation in a dot product takes two calls to the ever-so-funky haddps instruction. It's sometimes better to just unroll each of your vector loops by four, collect up four adjacent vector X, Y, or Z coordinates, and then do your usual computation on these blocks of adjacent vectors.

- Comparison instructions,
like cmpeq, that fill the corresponding SSE register's float with
0xffffffff (binary all ones) if the corresponding float component
comparison comes out true, and all zeros if the comparison comes out
false. That is, these are per-component comparisons, so you could
get an output like {0xffffffff,0x0,0xffffffff,0x0} if the even two
components come out true, and the odd two components come out false.

- Logical operations, like and, nand, or, and xor, that bitwise-and SSE registers.
Note that bitwising AND'ing two floats is totally useless (the sign,
exponent, and mantissa fields get anded!). But bitwise AND'ing
with all-zeros zeros out the corresponding float, and AND'ing with all
ones leaves the float untouched.

for (int i=0;i<n;i++) {(Try this in NetRun now!)

if (vec[i]<7)

vec[i]=vec[i]*a+b;

else

vec[i]=c;

}

You can implement this branch by setting a mask indicating where vals[i]<7, and then using the mask to pick the correct side of the branch to squash:

for (int i=0;i<n;i++) {Written in ordinary sequential code, this is actually a slowdown, not a speedup! But in SSE this branch-to-logical transformation means you can keep barreling along in parallel, without having to switch to sequential floating point to do the branches:

unsigned int mask=(vec[i]<7)?0xffFFffFF:0;

vec[i]=((vec[i]*a+b)&mask) | (c&~mask);

}

__m128 A=_mm_load1_ps(&a), B=_mm_load1_ps(&b), C=_mm_load1_ps(&c);This gives about a 3.8x speedup over the original loop on my machine!

__m128 Thresh=_mm_load1_ps(&thresh);

for (int i=0;i<n;i+=4) {

__m128 V=_mm_load_ps(&vec[i]);

__m128 mask=_mm_cmplt_ps(V,Thresh); // Do the comparison

__m128 V_then=_mm_add_ps(_mm_mul_ps(V,A),B); // "then" half of "if"

__m128 V_else=C; // "else" half of "if"

V=_mm_or_ps( _mm_and_ps(mask,V_then), _mm_andnot_ps(mask,V_else) );

_mm_store_ps(&vec[i],V);

}

Intel hinted in their Larrabee paper that NVIDIA is actually doing this exact float-to-SSE branch transformation in CUDA, NVIDIA's very high-performance language for running sequential-looking code in parallel on the graphics card.