This is an old, simple, but powerful idea--"SIMD", which stands for Single Instruction Multiple Data:
You
can do lots of interesting SIMD work without using any special
instructions--plain old C will do, if you treat an "int" as 32
completely independent bits, because any normal "int"
instructions will operate on all the bits at once. This
use of bitwise operations is often called "SIMD within a
register (SWAR)" or "word-SIMD"; see Sean
Anderson's "Bit Twiddling Hacks" for a variety of
amazing examples.
The
most common form of SIMD today are the "multimedia" instruction
set extensions in normal CPUs. The usual arrangement for
multimedia instructions is for single instructions to operate on
four 32-bit floats. These four-float instructions exist
almost everywhere nowdays:
We'll look at the x86 version below.
All
the SSE instructions (full list below) scale out to parallel
execution, SIMD style.
So "addss" adds one float, "addps" adds four floats.
scalar/ serial |
packed/ parallel |
|
single-precision "float" | ss | ps (4 floats) |
double-precision "double" | sd | pd (2 doubles) |
Instead
of "ss" (Scalar Single-precision) using "ps" (Packed
Single-precision) makes each operation work on four floats at
once!
movaps xmm0,[a] ; load up 4 floats addps xmm0,xmm0 ; add each float to itself ret section .data align=16 a: dd 3.141592,1.0,10.0,100.0Note we need to use the align directive so "movaps" (Aligned Packed Single-precision) works.
xmm0 register
contains: |
||||
ss instructions |
float |
|||
sd instructions |
double |
|||
ps instructions |
float |
float |
float |
float |
pd instructions |
double |
double |
Here's some silly floating-point code. It takes 2.7ns/float.
enum {n=1024};
float a[n];
for (int i=0;i<n;i++) a[i]=3.4;
for (int i=0;i<n;i++) a[i]+=1.2;
return a[0];
Staring at the assembly language, there are a number of "cvtss2sd" and back again, due to the double-precision constants and single-precision data. So we can get a good speedup to 1.4ns/float, just by making the constants floating point.
enum {n=1024};
float a[n];
for (int i=0;i<n;i++) a[i]=3.4f;
for (int i=0;i<n;i++) a[i]+=1.2f;
return a[0];
We
can run a *lot* faster by using SSE parallel instructions.
I'm going to do this the "hard way," making separate functions
to do the assembly computation. Essentially, we're
converting the loops to run across four iterations at once, so
they can operate on blocks of floats.
Here's
the C++ conversion to call assembly language functions on the
array.
extern "C" void init_array(float *arr,int n);
extern "C" void add_array(float *arr,int n);
int foo(void) {
enum {n=1024};
float a[n];
init_array(a,n);
add_array(a,n);
return a[0]*1000;
}
Here are the two assembly language functions called above. Together, we're down to under 0.5ns/float!
; extern "C" void init_array(float *arr,int n);
;for (int i=0;i<n;i+=4) {
; a[i]=3.4f;
; a[i+1]=3.4f;
; a[i+2]=3.4f;
; a[i+3]=3.4f;
;}
global init_array
init_array:
; rdi points to arr
; rsi is n, the array length
mov rcx,0 ; i
movaps xmm1,[constant3_4]
jmp loopcompare
loopstart:
movaps [rdi+4*rcx],xmm1 ; init array with xmm1
add rcx,4
loopcompare:
cmp rcx,rsi
jl loopstart
ret
section .data align=16
constant3_4:
dd 3.4,3.4,3.4,3.4 ; movaps!
section .text
; extern "C" void add_array(float *arr,int n);
;for (int i=0;i<n;i++) a[i]+=1.2f;
global add_array
add_array:
; rdi points to arr
; rsi is n, the array length
mov rcx,0 ; i
movaps xmm1,[constant1_2]
jmp loopcompare2
loopstart2:
movaps xmm0,[rdi+4*rcx] ; loads arr[i] through arr[i+3]
addps xmm0,xmm1
movaps [rdi+4*rcx],xmm0
add rcx,4
loopcompare2:
cmp rcx,rsi
jl loopstart2
ret
section .data align=16
constant1_2:
dd 1.2,1.2,1.2,1.2 ; movaps!
0.5ns/float is pretty impressive performance for this code, since:
Scalar Single-precision (float) |
Scalar Double-precision (double) |
Packed Single-precision (4 floats) |
Packed Double-precision (2 doubles) |
Example | Comments | |
Arithmetic | addss | addsd | addps | addpd | addss xmm0,xmm1 | sub, mul, div all work the same way |
Compare | minss | minsd | minps | minpd | minps xmm0,xmm1 | max works the same way |
Sqrt | sqrtss | sqrtsd | sqrtps | sqrtpd | sqrtss xmm0,xmm1 | Square root (sqrt), reciprocal (rcp), and reciprocal-square-root (rsqrt) all work the same way |
Move | movss | movsd | movaps (aligned) movups (unaligned) |
movapd (aligned) movupd (unaligned) |
movss xmm0,xmm1 | Aligned loads are up to 4x faster, but will crash if given an unaligned address! Stack is always 16-byte aligned specifically for this instruction. Use "align 16" directive for static data. |
Convert | cvtss2sd cvtss2si cvttss2si |
cvtsd2ss cvtsd2si cvttsd2si |
cvtps2pd cvtps2dq cvttps2dq |
cvtpd2ps cvtpd2dq cvttpd2dq |
cvtsi2ss xmm0,eax | Convert to ("2", get it?) Single Integer (si, stored in register like eax) or four DWORDs (dq, stored in xmm register). "cvtt" versions do truncation (round down); "cvt" versions round to nearest. |
High Bits | n/a | n/a | movmskps | movmskpd | movmskps eax,xmm0 | Extract the sign bits of an xmm register into an integer register. Often used to see if all the floats are "done" and you can exit. |
Compare to flags | ucomiss | ucomisd | n/a | n/a | ucomiss xmm0,xmm1 jbe dostuff |
Sets CPU flags like normal x86 "cmp" instruction, but from SSE registers. Use with "jb", "jbe", "je", "jae", or "ja" for normal comparisons. Sets "pf", the parity flag, if either input is a NaN. |
Compare to bitwise mask | cmpeqss | cmpeqsd | cmpeqps | cmpeqpd | cmpleps xmm0,xmm1 | Compare for equality ("lt", "le", "neq", "nlt", "nle" versions work the same way). There's also a "cmpunordss" that marks NaN values. Sets all bits of float to zero if false (0.0), or all bits to ones if true (a NaN). Result is used as a bitmask for the bitwise AND and OR operations. |
Bitwise | n/a | n/a | andps andnps |
andpd andnpd |
andps xmm0,xmm1 | Bitwise AND operation.
"andn" versions are bitwise AND-NOT operations (A=(~A) &
B). "or" version works the same way. This is for
the "branchless if-then-else" covered below: result = (maskV & thenV) | ((~maskV) & elseV); |
One big
problem in SIMD is branching. If half the elements in a
single SIMD register need one instruction, and half need a
different instruction, you can't handle them both in a single
instruction.
So the AND-OR trick is used to simulate branches. The
situation where these are useful is when you're trying to convert
a loop like this to SIMD:
for (int i=0;i<n;i++) {
if (vec[i]<7)
vec[i]=vec[i]*a+b;
else
vec[i]=c;
}
(Try
this in NetRun now!)
You can implement this branch by setting a mask indicating where
vals[i]<7, and then using the mask to pick the correct side of
the branch to AND with zeros.
for (int i=0;i<n;i++) {
unsigned int mask=(vec[i]<7)?0xffFFffFF:0;
vec[i]=(mask&(vec[i]*a+b)) | ((~mask)&c);
}
SSE
totally wants you to write branches this way--the comparison
instructions like cmpltps return each float filled with all one
bits (0xffFFffFF, NaN in float) if the comparison came out true,
and all zeros (0.0 in float) if the comparison came out false.
These are useless by themselves, but then you can use the
andps, andnps, and orps instructions to combine them to make a
branch. This is easiest with the three-operand VEX
instruction encoding:
; Input:
; xmm6 vec
; xmm7 constant to compare against (7)
; xmm5 vec[i]*a+b, "then" value
; xmm4 c "else" value
vcmpltps xmm2, xmm6, xmm7 ; xmm2 = the mask from comparing vec[i]<7
vandps xmm0, xmm2, xmm5 ; xmm0 = mask & then value
vandnps xmm1, xmm2, xmm5 ; xmm1 = (~mask) & else value
vorps xmm0, xmm0, xmm1 ; xmm0 = OR together the then (xmm0) and else (xmm1) values
; Output: xmm0 is the new value for vec
Yes,
this is very weird!
Loops
are even harder: you need to keep going around the loop until
*all* the SIMD lanes are done with the loop. movmskps
is useful for extracting the sign bits of the comparisons, but you
need to use the same branching trick on every assignment inside
the loop.