More SSE, SSE vs non-x86 SIMD

CS 441 Lecture, Dr. Lawlor

SSE Instruction List


Instruction | Scalar single (float) | Scalar double (double) | Packed single (4 floats) | Packed double (2 doubles) | Comments
------------|-----------------------|------------------------|--------------------------|---------------------------|---------
add  | addss | addsd | addps | addpd | sub, mul, and div all work the same way.
min  | minss | minsd | minps | minpd | max works the same way.
sqrt | sqrtss | sqrtsd | sqrtps | sqrtpd | Square root (sqrt), reciprocal (rcp), and reciprocal square root (rsqrt) all work the same way.
mov  | movss | movsd | movaps (aligned), movups (unaligned) | movapd (aligned), movupd (unaligned) | Aligned loads are up to 4x faster, but will crash if given an unaligned address!  The stack is kept 16-byte aligned specifically for this instruction; use the "align 16" directive for static data.
cvt  | cvtss2sd, cvtss2si, cvttss2si | cvtsd2ss, cvtsd2si, cvttsd2si | cvtps2pd, cvtps2dq, cvttps2dq | cvtpd2ps, cvtpd2dq, cvttpd2dq | Convert to ("2", get it?) a single integer (si, stored in an ordinary register like eax) or four DWORDs (dq, stored in an xmm register).  The "cvtt" versions truncate (round toward zero, like a C cast); the "cvt" versions round to nearest.
com  | ucomiss | ucomisd | n/a | n/a | Sets CPU flags from SSE registers, like the normal x86 "cmp" instruction.
cmp  | cmpeqss | cmpeqsd | cmpeqps | cmpeqpd | Compare for equality ("lt", "le", "neq", "nlt", and "nle" versions work the same way).  Sets all bits of the float to zero if false (0.0), or all bits to one if true (a NaN).  The result is useful as a bitmask for the bitwise AND and OR operations.
and  | n/a | n/a | andps, andnps | andpd, andnpd | Bitwise AND operation.  The "andn" versions are bitwise AND-NOT operations (A = (~A) & B).  "or" works the same way.
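
The cmp and and/andn/or rows above combine into a handy branch-free "select" pattern: compare to build an all-ones/all-zeros mask, then blend two values through it.  Here's a minimal sketch using the C intrinsics (it reimplements maxps by hand purely to show the pattern; the function name select_max is just for illustration):

#include <xmmintrin.h>

/* Branchless select: result[i] = (a[i] < b[i]) ? b[i] : a[i]. */
__m128 select_max(__m128 a, __m128 b) {
  __m128 mask = _mm_cmplt_ps(a, b);      /* all 1s where a<b, all 0s elsewhere */
  __m128 takeB = _mm_and_ps(mask, b);    /* b where a<b, else 0.0 */
  __m128 takeA = _mm_andnot_ps(mask, a); /* (~mask)&a: a where a>=b, else 0.0 */
  return _mm_or_ps(takeA, takeB);        /* blend the two halves together */
}

In real code you'd just use maxps here, but the same compare-and-blend works for any pair of expressions, with no branch for the CPU to mispredict.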

Simple SSE Output Code

The easy way to get SSE output is to just convert to integer, like this:
movss xmm3,[pi] ; load up constant
addss xmm3,xmm3 ; add pi to itself
cvtss2si eax,xmm3 ; round to integer (round-to-nearest by default)
ret
section .data
pi: dd 3.14159265358979 ; constant

(Try this in NetRun now!)

It's annoyingly tricky to display full floating-point values.  The trouble here is that our function "foo" returns an int to main, so to print floating-point values we have to call another function.  Also, with SSE floating point on a 64-bit machine, you're supposed to keep the stack aligned to a 16-byte boundary (the SSE "movaps" instruction crashes if it isn't given a 16-byte aligned address).  Sadly, the "call" instruction breaks that alignment by pushing an 8-byte return address, so we have to burn another 8 bytes of stack space purely to realign the stack, like this:
movss xmm3,[pi] ; load up constant
addss xmm3,xmm3 ; add pi to itself
movss [output],xmm3 ; write register out to memory

; Print floating-point output
mov rdi,output ; first parameter: pointer to floats
mov rsi,1 ; second parameter: number of floats
sub rsp,8 ; keep stack 16-byte aligned (else get crash!)
extern farray_print
call farray_print
add rsp,8

ret

section .data
pi: dd 3.14159265358979 ; constant
output: dd 0.0 ; overwritten at runtime

(Try this in NetRun now!)

Array of Structures, or Structure of Arrays?

There's a good description on pages 5 and 6 of this PDF.
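
In short: an array of structures (AoS) interleaves the fields of each element (x y z x y z ...), while a structure of arrays (SoA) keeps each field contiguous in memory (all the x's, then all the y's, ...).  SoA usually fits SSE much better, because one packed load grabs four consecutive values of the same field.  Here's a minimal sketch of the two layouts (the names vec3, Points, and sum_x are just for illustration):

#include <xmmintrin.h>

enum {N=1024};

/* Array of structures (AoS): fields interleaved (x y z x y z ...) */
struct vec3 { float x, y, z; };
struct vec3 aos[N];

/* Structure of arrays (SoA): each field contiguous (all x's, then all y's, ...) */
struct Points { float x[N], y[N], z[N]; };
struct Points soa;

float sum_x(void) {
  /* SoA: one packed load grabs four consecutive x values. */
  __m128 sum = _mm_setzero_ps();
  for (int i = 0; i < N; i += 4)
    sum = _mm_add_ps(sum, _mm_loadu_ps(&soa.x[i]));
  /* (With AoS, the four x values are 12 bytes apart, so you'd need
     slow scattered loads or shuffles to pack them into one register.) */
  float out[4];
  _mm_storeu_ps(out, sum);
  return out[0] + out[1] + out[2] + out[3];
}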

SSE Matrix-Vector Multiply

It's informative to look at the performance of matrix-vector multiply.  I'll pick a 4x4 matrix, just to match the SSE data size.  To start with, the naive float version takes 45ns on a Pentium 4, and runs at very nearly the same speed on a newer Q6600 (the serial performance of newer processors is pretty much identical).
enum {n=4};
float mat[n][n];
float vec[n];
float outvector[n];

int foo(void) {
  for (int row=0;row<n;row++) {
    float sum=0.0;
    for (int col=0;col<n;col++) {
      float m=mat[row][col];
      float v=vec[col];
      sum+=m*v;
    }
    outvector[row]=sum;
  }
  return 0;
}

(Try this in NetRun now!)

Unrolling the inner loop, as ugly as it is, speeds things up substantially, to 26ns:
enum {n=4};
float mat[n][n];
float vec[n];
float outvector[n];

int foo(void) {
  for (int row=0;row<n;row++) {
    float sum=0.0, m, v;
    m=mat[row][0]; v=vec[0]; sum+=m*v;
    m=mat[row][1]; v=vec[1]; sum+=m*v;
    m=mat[row][2]; v=vec[2]; sum+=m*v;
    m=mat[row][3]; v=vec[3]; sum+=m*v;
    outvector[row]=sum;
  }
  return 0;
}

(Try this in NetRun now!)

Making a line-by-line transformation to SSE doesn't really buy any performance, at 25ns:
#include <pmmintrin.h>

enum {n=4};
__m128 mat[n]; /* rows */
__m128 vec;
float outvector[n];

int foo(void) {
  for (int row=0;row<n;row++) {
    __m128 mrow=mat[row];
    __m128 v=vec;
    __m128 sum=mrow*v; /* elementwise multiply (GCC vector extension) */
    sum=_mm_hadd_ps(sum,sum); /* adds adjacent pairs of floats */
    _mm_store_ss(&outvector[row],_mm_hadd_ps(sum,sum)); /* adds the remaining pair */
  }
  return 0;
}

(Try this in NetRun now!)

The trouble here is that we can cheaply operate on 4-vectors, but summing up the elements of a 4-vector (the hadd instruction) is expensive.  We can eliminate that horizontal summation entirely by operating on columns instead of rows, although now we need a column-major matrix layout.  This version is down to 19ns on a Pentium 4, and just 12ns on the Q6600!
#include <xmmintrin.h>

enum {n=4};
__m128 mat[n]; /* by column */
float vec[n];
__m128 outvector;

int foo(void) {
  float z=0.0;
  __m128 sum=_mm_load1_ps(&z); /* four copies of 0.0 */
  for (int col=0;col<n;col++) {
    __m128 mcol=mat[col];
    float v=vec[col];
    sum+=mcol*_mm_load1_ps(&v); /* scale the whole column by v, and accumulate */
  }
  outvector=sum;
  return 0;
}

(Try this in NetRun now!)
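
Note that this last version needs the matrix stored by column.  If your data starts out in the usual row order, the _MM_TRANSPOSE4_PS macro from xmmintrin.h can flip it for you.  Here's a sketch (the function name rows_to_columns is just for illustration):

#include <xmmintrin.h>

/* Convert a row-major 4x4 matrix into the column-major __m128 form
   the column-at-a-time multiply above expects. */
void rows_to_columns(__m128 rows[4], __m128 cols[4]) {
  __m128 r0=rows[0], r1=rows[1], r2=rows[2], r3=rows[3];
  _MM_TRANSPOSE4_PS(r0, r1, r2, r3); /* in-place 4x4 transpose */
  cols[0]=r0; cols[1]=r1; cols[2]=r2; cols[3]=r3;
}

If you multiply by the same matrix many times, you only pay for this transpose once, so the 12ns inner loop still wins.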