SIMD: Parallel Floating Point with SSE

CS 301: Assembly Language Programming Lecture, Dr. Lawlor

This is an old, simple, but powerful idea--"SIMD", which stands for Single Instruction Multiple Data:

Single, meaning just one.
Instruction, as in a machine code instruction, executed by hardware.
Multiple, as in more than one--from 2 to a thousand or so.
Data, as in floats or ints.

You can do lots of interesting SIMD work without using any special instructions--plain old C will do, if you treat an "int" as 32 completely independent bits, because any normal "int" instructions will operate on all the bits at once. This use of bitwise operations is often called "SIMD within a register (SWAR)" or "word-SIMD"; see Sean Anderson's "Bit Twiddling Hacks" for a variety of amazing examples.

The most common form of SIMD today are the "multimedia" instruction set extensions in normal CPUs. The usual arrangement for multimedia instructions is for single instructions to operate on four 32-bit floats. These four-float instructions exist almost everywhere nowdays:

x86, where they are called "SSE", or "Streaming SIMD Extensions" for the 4-float versions (or "AVX" or "Advanced Vector eXtensions" for 8-float, or even AVX-512 for 16-float versions)
ARM, where they are called NEON float32x4
PowerPC, where they are called "AltiVec"
Cell Broadband Engine, where they are part of the "Synergistic Processing Elements"
Graphics cards, where they are part of the pixel processors

We'll look at the x86 version below.

Parallel Floats in Assembly

Instead of "ss" (Scalar Single-precision) using "ps" (Packed Single-precision) makes each operation work on four floats at once!

movaps xmm0,[a]
addps xmm0,xmm0 ; add each float to itself
movaps [out],xmm0 ;    copy floats out to memory

push rax; align stack
extern farray_print ;     prototype
mov rdi, out ;               print the floats in memory
mov rsi,4 ;                    count of floats: four to print
call farray_print
pop rax ;                     clean up stack
ret

section .data align=16
a: dd 3.141592,1.0,10.0,100.0
out: dd 0,0,0,0

(Try this in NetRun now!)

Note we need to use the align directives so "movaps" (Aligned Packed Single-precision) works.

Speeding up Floating Point code Using Streaming SIMD Extensions (SSE)

All the SSE instructions (listed below) scale out to parallel execution, SIMD style. So "addss" adds one float, "addps" adds four floats, SIMD style.

	scalar/ serial	packed/ parallel
single-precision "float"	ss	ps (4 floats)
double-precision "double"	sd	pd (2 doubles)

Here's some silly floating-point code. It takes 2.7ns/float.

enum {n=1024};
float a[n];
for (int i=0;i<n;i++) a[i]=3.4;
for (int i=0;i<n;i++) a[i]+=1.2;
return a[0];

(Try this in NetRun now!)

Staring at the assembly language, there are a number of "cvtss2sd" and back again, due to the double-precision constants and single-precision data. So we can get a good speedup to 1.4ns/float, just by making the constants floating point.

enum {n=1024};
float a[n];
for (int i=0;i<n;i++) a[i]=3.4f;
for (int i=0;i<n;i++) a[i]+=1.2f;
return a[0];

(Try this in NetRun now!)

We can run a *lot* faster by using SSE parallel instructions. I'm going to do this the "hard way," making separate functions to do the assembly computation. Essentially, we're converting the loops to run across four iterations at once, so they can operate on blocks of floats.

Here's the C++ conversion to call assembly language functions on the array.

extern "C" void init_array(float *arr,int n);
extern "C" void add_array(float *arr,int n);

int foo(void) {
	enum {n=1024};
	float a[n];
	init_array(a,n);
	add_array(a,n);
	return a[0]*1000;
}

(Try this in NetRun now!)

Here are the two assembly language functions called above. Together, we're down to under 0.5ns/float!

; extern "C" void init_array(float *arr,int n);
;for (int i=0;i<n;i+=4) {
;	a[i]=3.4f;
;	a[i+1]=3.4f;
;	a[i+2]=3.4f;
;	a[i+3]=3.4f;
;}

global init_array
init_array:
	; rdi points to arr
	; rsi is n, the array length
	mov rcx,0 ; i
	movaps xmm1,[constant3_4]

	jmp loopcompare
loopstart:
	movaps [rdi+4*rcx],xmm1 ; init array with xmm1

	add rcx,4
loopcompare:
	cmp rcx,rsi
	jl loopstart
	ret

section .data align=16
constant3_4:
	dd 3.4,3.4,3.4,3.4 ; movaps!

section .text
; extern "C" void add_array(float *arr,int n);
;for (int i=0;i<n;i++) a[i]+=1.2f;

global add_array
add_array:
	; rdi points to arr
	; rsi is n, the array length
	mov rcx,0 ; i
	movaps xmm1,[constant1_2]

	jmp loopcompare2
loopstart2:
	movaps xmm0,[rdi+4*rcx] ; loads arr[i] through arr[i+3]
	addps xmm0,xmm1
	movaps [rdi+4*rcx],xmm0

	add rcx,4
loopcompare2:
	cmp rcx,rsi
	jl loopstart2
	ret

section .data align=16
constant1_2:
	dd 1.2,1.2,1.2,1.2 ; movaps!

(Try this in NetRun now!)

0.5ns/float is pretty impressive performance for this code, since:

We store each float once in init_array, then load and store it again in add_array. That's 3 trips through memory per float.
Each float is 4 bytes.
0.5ns/float means we're doing over 2 billion floats per second, or 24 gigabytes per second to and from memory! (In this case, the processor's cache.)

SSE Instruction List

	Scalar Single-precision (float)	Scalar Double-precision (double)	Packed Single-precision (4 floats)	Packed Double-precision (2 doubles)	Example	Comments
Arithmetic	addss	addsd	addps	addpd	addss xmm0,xmm1	sub, mul, div all work the same way
Compare	minss	minsd	minps	minpd	minps xmm0,xmm1	max works the same way
Sqrt	sqrtss	sqrtsd	sqrtps	sqrtpd	sqrtss xmm0,xmm1	Square root (sqrt), reciprocal (rcp), and reciprocal-square-root (rsqrt) all work the same way
Move	movss	movsd	movaps (aligned) movups (unaligned)	movapd (aligned) movupd (unaligned)	movss xmm0,xmm1	Aligned loads are up to 4x faster, but will crash if given an unaligned address! Stack is always 16-byte aligned specifically for this instruction. Use "align 16" directive for static data.
Convert	cvtss2sd cvtss2si cvttss2si	cvtsd2ss cvtsd2si cvttsd2si	cvtps2pd cvtps2dq cvttps2dq	cvtpd2ps cvtpd2dq cvttpd2dq	cvtsi2ss xmm0,eax	Convert to ("2", get it?) Single Integer (si, stored in register like eax) or four DWORDs (dq, stored in xmm register). "cvtt" versions do truncation (round down); "cvt" versions round to nearest.
High Bits	n/a	n/a	movmskps	movmskpd	movmskps eax,xmm0	Extract the sign bits of an xmm register into an integer register. Often used to see if all the floats are "done" and you can exit.
Compare to flags	ucomiss	ucomisd	n/a	n/a	ucomiss xmm0,xmm1 jbe dostuff	Sets CPU flags like normal x86 "cmp" instruction, but from SSE registers. Use with "jb", "jbe", "je", "jae", or "ja" for normal comparisons. Sets "pf", the parity flag, if either input is a NaN.
Compare to bitwise mask	cmpeqss	cmpeqsd	cmpeqps	cmpeqpd	cmpleps xmm0,xmm1	Compare for equality ("lt", "le", "neq", "nlt", "nle" versions work the same way). There's also a "cmpunordss" that marks NaN values. Sets all bits of float to zero if false (0.0), or all bits to ones if true (a NaN). Result is used as a bitmask for the bitwise AND and OR operations.
Bitwise	n/a	n/a	andps andnps	andpd andnpd	andps xmm0,xmm1	Bitwise AND operation. "andn" versions are bitwise AND-NOT operations (A=(~A) & B). "or" version works the same way. This is for the "branchless if-then-else" covered below: result = (maskV & thenV) \| ((~maskV) & elseV);

Bitwise Branching in SIMD

One big problem in SIMD is branching. If half the elements in a single SIMD register need one instruction, and half need a different instruction, you can't handle them both in a single instruction.

So the AND-OR trick is used to simulate branches. The situation where these are useful is when you're trying to convert a loop like this to SIMD:

	for (int i=0;i<n;i++) { 
	        if (vec[i]<7) 
			vec[i]=vec[i]*a+b;
		else
			vec[i]=c;
	}

(Try this in NetRun now!)

You can implement this branch by setting a mask indicating where vals[i]<7, and then using the mask to pick the correct side of the branch to AND with zeros.

	for (int i=0;i<n;i++) { 
	        unsigned int mask=(vec[i]<7)?0xffFFffFF:0;
		vec[i]=(mask&(vec[i]*a+b)) | ((~mask)&c);
	}

SSE totally wants you to write branches this way--the comparison instructions like cmpltps return each float filled with all one bits (0xffFFffFF, NaN in float) if the comparison came out true, and all zeros (0.0 in float) if the comparison came out false. These are useless by themselves, but then you can use the andps, andnps, and orps instructions to combine them to make a branch.

Yes, this is very weird!