This simple C++ program loops over an array of floats, and does some work on each float:

enum {n=1024}; float arrIn[n]; float arrOut[n]; int foo(void) { for (int i=0;i<n;i++) arrOut[i]=arrIn[i]*2.1+1.2; return 3; }

Currently, it takes over 2000 nanoseconds for 1000 floats. This is slower than you expect. Examining the assembly code, you see this inner loop code:

1f: f3 0f 10 14 02 movss xmm2,DWORD PTR [rdx+rax*1] 24: 0f 5a d2 cvtps2pd xmm2,xmm2 27: f2 0f 59 d1 mulsd xmm2,xmm1 2b: f2 0f 58 d0 addsd xmm2,xmm0 2f: f2 0f 12 d2 movddup xmm2,xmm2 33: 66 0f 5a d2 cvtpd2ps xmm2,xmm2 37: f3 0f 11 14 01 movss DWORD PTR [rcx+rax*1],xmm2 3c: 48 83 c0 04 add rax,0x4 40: 48 3d 00 10 00 00 cmp rax,0x1000 46: 75 d7 jne 1f

Note the conversion from "ss" (serial single-precision float) to "sd" (serial double-precision). This is caused by the floating-point constants like 2.1, which have type "double" by default. We can speed the program up by over twofold(!) by adding the "f" suffix to the constants, making them floats instead of doubles, and eliminating the sd <-> ss conversions.

arrOut[i]=arrIn[i]*2.1f+1.2f;

For this new version, the inner loop is now very clean, similar to what we might write by hand:

1f: f3 0f 10 04 02 movss xmm0,DWORD PTR [rdx+rax*1] 24: f3 0f 59 c2 mulss xmm0,xmm2 28: f3 0f 58 c1 addss xmm0,xmm1 2c: f3 0f 11 04 01 movss DWORD PTR [rcx+rax*1],xmm0 31: 48 83 c0 04 add rax,0x4 35: 48 3d 00 10 00 00 cmp rax,0x1000 3b: 75 e2 jne 1f

It is possible to speed this up even further by using "ps" instructions to operate on four floats at once, as discussed last lecture. We can start to understand this process by writing the single-float loop in assembly language. This is exactly as fast as the C++ version above, 1100 nanoseconds per run.

mov rcx,0 ; i movss xmm10,[ten] movss xmm2,[mulConst] start: movss xmm0,[arrIn + 4*rcx] addss xmm0,xmm10 mulss xmm0,xmm2 movss [arrOut + 4*rcx],xmm0 add rcx,1 cmp rcx,1024 jl start ret section .data arrIn: times 1024 dd 3.4 arrOut: times 1024 dd 0.0 ten: dd 10.0 mulConst: dd 2.1234566

The "times 1024" prefix indicates the following instruction should be repeated that many times.

To convert this serial loop to operating on four floats at once, we need to:

- Convert "ss" arithmetic to "ps" arithmetic.
- Convert "movss" loads and stores to either "movups" (unaligned, always works but slower) or "movaps" (must be 16-byte aligned or it segfaults, but it's faster). We may need to replicate constant instructions to do this.
- Convert "ucomiss" branch instructions into "maxps", "minps", or the complicated masked if-then-else conditionals (see last lecture).
- Change the loop to jump over 4 floats at each iteration.

Here are the
code changes in ** red italics**.
We don't need to change the register names in SSE.
Because I'm using the fast aligned loads, this takes almost
1/4 the time, 290 ns per run!

mov rcx,0 ; i movaps xmm10,[ten] movaps xmm2,[mulConst] start: movaps xmm0,[arrIn + 4*rcx] addps xmm0,xmm10 mulps xmm0,xmm2 movaps [arrOut + 4*rcx],xmm0 add rcx,4cmp rcx,1024 jl start ret section .dataalign 16arrIn: times 1024 dd 3.4 arrOut: times 1024 dd 0.0 ten:times 4dd 10.0 mulConst:times 4dd 2.1234566

I see these common bugs year after year, sometimes even in my own code!

- Load four-wide constant from a single isolated float? You'll get the wrong answer, since the other three floats are garbage.
- Increment the loop by one float instead of four floats? It'll crash with movaps, since i==1 is misaligned, and it'll be very slow with movups anyway. It'll also run several floats off the end of the array when i==n-1.
- Miss a "ps" on one instruction? "ss" will only modify the low float of the four.

In other words, it's assembly: it's trickier than it looks!

Since nobody likes writing in assembly (even me!), the x86 SSE instructions can be accessed from C/C++ via the header <xmmintrin.h>. This header exists and works out-of-the-box with most modern compilers:

- GCC 3.x or newer under linux or MacOS X. Be sure to compile with "-msse", or the header will whine.
- Visual C++ 2003 or better under Windows.

Documentation
can be found by just reading the xmmintrin header.
The underlying instructions have different names, which are
listed in the x86 reference manual and
summarized below and at this graphical and helpful SSE site.
Keep in mind that the names of these things are absurdly ugly; I
take this as a hint that you should wrap them in a nicer
interface, such as a "floats" or "vec4" class (see below!).

The
C version of an SSE register is the type "__m128". gcc
provides factory overloads for arithmetic on __m128's. All
the instructions start with "_mm_", while the suffix indicates
the data type; in these examples, we'll just be talking about 4
floats, which use the suffix "_ps" (Packed Single-precision
floats). SSE supports other data types in the same 128
bits, but 4 floats seems to be the sweet spot. So, for
example, "_mm_load_ps" loads up 4 floats into a __m128,
"_mm_add_ps" adds 4 corresponding floats together, etc.
Major useful operations are:

__m128 _mm_load_ps(float *src) | Load 4 floats from a 16-byte aligned address. WARNING: Segfaults if the address isn't a multiple of 16! |

__m128 _mm_loadu_ps(float *src) | Load 4 floats from an unaligned address (4x slower on older machines, same speed on latest machines *if* data is aligned) |

__m128 _mm_load1_ps(float *src) | Load 1 individual float into all 4 fields of an __m128 |

__m128 _mm_setr_ps(float a,float b,float c,float d) | Load 4 separate floats from parameters into an __m128 |

void _mm_store_ps(float *dest,__m128 src) | Store 4 floats to an aligned address. |

void _mm_storeu_ps(float *dest,__m128 src) | Store 4 floats to unaligned address |

__m128 _mm_add_ps(__m128 a,__m128 b) | Add corresponding floats (also "sub") |

__m128 _mm_mul_ps(__m128 a,__m128 b) | Multiply corresponding floats (also "div", but it's slow) |

__m128 _mm_min_ps(__m128 a,__m128 b) | Take corresponding minimum (also "max") |

__m128 _mm_sqrt_ps(__m128 a) | Take square roots of 4 floats (12ns, slow like divide) |

__m128 _mm_rcp_ps(__m128 a) | Compute rough (12-bit accuracy) reciprocal of all 4 floats (as fast as an add!) |

__m128 _mm_rsqrt_ps(__m128 a) | Rough (12-bit) reciprocal-square-root of all 4 floats (fast) |

__m128 _mm_shuffle_ps(__m128
lo,__m128 hi, _MM_SHUFFLE(hi3,hi2,lo1,lo0)) |
Interleave inputs into low 2
floats and high 2 floats of output. Basically out[0]=lo[lo0]; out[1]=lo[lo1]; out[2]=hi[hi2]; out[3]=hi[hi3]; For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float a[i] into all 4 output floats. |

There
are also instructions for integer conversion, comparsions,
various bitwise operations, and cache-friendly prefetches and
streaming store operations.

So
take your classic "loop over floats":

for (int i=0;i<n_vals;i++) {

vals[i]=vals[i]*a+b;

}

(executable
NetRun link)

This
takes about 4.5 ns per float.

Step
1 is to unroll this loop 4 times, since SSE works on blocks of 4
floats at a time:

for (int i=0;i<n_vals;i+=4) {

vals[i+0]=vals[i+0]*a+b;

vals[i+1]=vals[i+1]*a+b;

vals[i+2]=vals[i+2]*a+b;

vals[i+3]=vals[i+3]*a+b;

}

(Try
this in NetRun now!)

This
alone speeds the code up by about 2x, because we don't have to
check the loop counter as often.

We
can then replace the guts of the loop with SSE
instructions:

__m128 SSEa=_mm_load1_ps(&a);

__m128 SSEb=_mm_load1_ps(&b);

__m128 v=_mm_load_ps(&vec[i]);

v=_mm_add_ps(_mm_mul_ps(v,SSEa),SSEb);

_mm_store_ps(&vec[i],v);

This gives us about a 4x speedup over the original, and still a 2x speedup over the unrolled version! Note that if we use the unaligned loads and stores (loadu and storeu), we lose almost all the performance benefit from using SSE in the first place.

The intrinsics above are much cleaner than assembly, but they're still quite ugly. You can make a very nice interface using C++ classes and operator overloading, such as the "floats" class we developed during today's lecture:

class floats { __m128 v; // our four floats public: enum {n=4}; // how many floats?

floats(const float *ptr) { v=_mm_load_ps(ptr); } floats(float c) { v=_mm_load1_ps(&c); } floats(__m128 m) { v=m; }

void store(float *dest) { _mm_store_ps(dest,v); } friend floats operator+(const floats &a,const floats &b) { return _mm_add_ps(a.v,b.v); } friend floats operator*(const floats &a,const floats &b) { return _mm_mul_ps(a.v,b.v); } }; int foo(void) { for (long i=0;i<n;i+=floats::n) { ((floats(&arrIn[i])+10.0)*2.12345).store(&arrOut[i]); } return 3; }

The thing I really like about C++ is this kind of wrapper can have very nice syntax, a complete feature set, and yet compile into no-overhead machine code just like I'd write by hand.

If you want a real version of this wrapper, supporting all the math operations, integers, bools, and floats, you should try Agner Fog's Vector Class Library (VCL).