# SSE via C++ Intrinsics

CS 301: Assembly Language Programming Lecture, Dr. Lawlor

This simple C++ program loops over an array of floats, and does some work on each float:

```
enum {n=1024};
float arrIn[n];
float arrOut[n];

int foo(void) {
  for (int i=0;i<n;i++) arrOut[i]=arrIn[i]*2.1+1.2;
  return 3;
}
```

(Try this in NetRun now!)

Currently, it takes over 2000 nanoseconds for 1024 floats.  This is slower than you might expect.  Examining the assembly code, you see this inner loop:

```
  1f:	f3 0f 10 14 02       	movss  xmm2,DWORD PTR [rdx+rax*1]
  24:	0f 5a d2             	cvtps2pd xmm2,xmm2
  27:	f2 0f 59 d1          	mulsd  xmm2,xmm1
  2b:	f2 0f 58 d0          	addsd  xmm2,xmm0
  2f:	f2 0f 12 d2          	movddup xmm2,xmm2
  33:	66 0f 5a d2          	cvtpd2ps xmm2,xmm2
  37:	f3 0f 11 14 01       	movss  DWORD PTR [rcx+rax*1],xmm2
  3c:	48 83 c0 04          	add    rax,0x4
  40:	48 3d 00 10 00 00    	cmp    rax,0x1000
  46:	75 d7                	jne    1f
```

Note the conversion from "ss" (scalar single-precision float) to "sd" (scalar double-precision).  This is caused by the floating-point constants like 2.1, which have type "double" by default.  We can speed the program up by over twofold(!) by adding the "f" suffix to the constants, making them floats instead of doubles, and eliminating the sd <-> ss conversions.

`arrOut[i]=arrIn[i]*2.1f+1.2f;`

(Try this in NetRun now!)

For this new version, the inner loop is now very clean, similar to what we might write by hand:

```
  1f:	f3 0f 10 04 02       	movss  xmm0,DWORD PTR [rdx+rax*1]
  24:	f3 0f 59 c2          	mulss  xmm0,xmm2
  28:	f3 0f 58 c1          	addss  xmm0,xmm1
  2c:	f3 0f 11 04 01       	movss  DWORD PTR [rcx+rax*1],xmm0
  31:	48 83 c0 04          	add    rax,0x4
  35:	48 3d 00 10 00 00    	cmp    rax,0x1000
  3b:	75 e2                	jne    1f
```

## Parallel SSE in Assembly

It is possible to speed this up even further by using "ps" instructions to operate on four floats at once, as discussed last lecture.  We can start to understand this process by writing the single-float loop in assembly language.  This is exactly as fast as the C++ version above, 1100 nanoseconds per run.

```
mov rcx,0 ; i
movss xmm10,[ten]
movss xmm2,[mulConst]

start:
movss xmm0,[arrIn + 4*rcx]
mulss xmm0,xmm2
movss [arrOut + 4*rcx],xmm0
add rcx,1 ; next float
cmp rcx,1024
jl start
ret

section .data
arrIn:
times 1024 dd 3.4
arrOut:
times 1024 dd 0.0
ten:
dd 10.0
mulConst:
dd 2.1234566
```

(Try this in NetRun now!)

The "times 1024" prefix indicates the following data definition should be repeated that many times, so each of these reserves a 1024-entry array of floats.

To convert this serial loop to operating on four floats at once, we need to:

• Convert "ss" arithmetic to "ps" arithmetic.
• Convert "movss" loads and stores to either "movups" (unaligned, always works but slower) or "movaps" (must be 16-byte aligned or it segfaults, but it's faster).  We may need to replicate constant instructions to do this.
• Convert "ucomiss" branch instructions into "maxps", "minps", or the complicated masked if-then-else conditionals (see last lecture).
• Change the loop to jump over 4 floats at each iteration.

Here is the converted code.  The changes are the "ps" arithmetic, the aligned "movaps" loads and stores, the four-wide constants, and the loop increment of four.  We don't need to change the register names in SSE.  Because I'm using the fast aligned loads, this takes almost 1/4 the time, 290 ns per run!

```
mov rcx,0 ; i
movaps xmm10,[ten]
movaps xmm2,[mulConst]

start:
movaps xmm0,[arrIn + 4*rcx]
mulps xmm0,xmm2
movaps [arrOut + 4*rcx],xmm0
add rcx,4 ; four floats at a time
cmp rcx,1024
jl start
ret

section .data
align 16
arrIn:
times 1024 dd 3.4
arrOut:
times 1024 dd 0.0
ten:
times 4 dd 10.0
mulConst:
times 4 dd 2.1234566
```

(Try this in NetRun now!)

I see these common bugs year after year, sometimes even in my own code!

• Load four-wide constant from a single isolated float?  You'll get the wrong answer, since the other three floats are garbage.
• Increment the loop by one float instead of four floats?  It'll crash with movaps, since i==1 is misaligned, and it'll be very slow with movups anyway.  It'll also run several floats off the end of the array when i==n-1.
• Miss a "ps" on one instruction?  "ss" will only modify the low float of the four.

In other words, it's assembly: it's trickier than it looks!

## Beyond Assembly: xmmintrin Functions for SSE

Since nobody likes writing in assembly (even me!), the x86 SSE instructions can be accessed from C/C++ via the header <xmmintrin.h>. This header exists and works out-of-the-box with most modern compilers:

• GCC 3.x or newer under Linux or Mac OS X.  Be sure to compile with "-msse", or the header will whine.
• Visual C++ 2003 or newer under Windows.

Documentation can be found by just reading the xmmintrin header.  The underlying instructions have different names, which are listed in the x86 reference manual and summarized below.  Keep in mind that the names of these things are absurdly ugly; I take this as a hint that you should wrap them in a nicer interface, such as a "floats" or "vec4" class (see below!).

The C version of an SSE register is the type "__m128".  gcc also provides overloaded arithmetic operators directly on __m128's.  All the intrinsics start with "_mm_", while the suffix indicates the data type; in these examples, we'll just be talking about 4 floats, which use the suffix "_ps" (Packed Single-precision floats).  SSE supports other data types in the same 128 bits, but 4 floats seems to be the sweet spot.  So, for example, "_mm_load_ps" loads up 4 floats into a __m128, "_mm_add_ps" adds 4 corresponding floats together, etc.  Major useful operations include _mm_load_ps and _mm_store_ps (aligned load and store), _mm_loadu_ps and _mm_storeu_ps (the unaligned versions), _mm_load1_ps (broadcast one float into all four slots), and the arithmetic _mm_add_ps, _mm_sub_ps, _mm_mul_ps, _mm_div_ps, _mm_sqrt_ps, _mm_min_ps, and _mm_max_ps.

There are also instructions for integer conversion, comparisons, various bitwise operations, and cache-friendly prefetches and streaming store operations.

So take your classic "loop over floats":

```
for (int i=0;i<n_vals;i++) {
        vals[i]=vals[i]*a+b;
}
```

This takes about 4.5 ns per float.

Step 1 is to unroll this loop 4 times, since SSE works on blocks of 4 floats at a time:

```
for (int i=0;i<n_vals;i+=4) {
        vals[i+0]=vals[i+0]*a+b;
        vals[i+1]=vals[i+1]*a+b;
        vals[i+2]=vals[i+2]*a+b;
        vals[i+3]=vals[i+3]*a+b;
}
```

(Try this in NetRun now!)

This alone speeds the code up by about 2x, because we don't have to check the loop counter as often.

We can then replace the guts of the loop with SSE instructions:

```
__m128 SSEa=_mm_load1_ps(&a);
__m128 SSEb=_mm_load1_ps(&b);
__m128 v=_mm_load_ps(&vals[i]);
v=_mm_add_ps(_mm_mul_ps(v,SSEa),SSEb);
_mm_store_ps(&vals[i],v);
```

(Try this in NetRun now!)

This gives us about a 4x speedup over the original, and still a 2x speedup over the unrolled version!  Note that if we use the unaligned loads and stores (loadu and storeu), we lose almost all the performance benefit from using SSE in the first place.

## C++ Classes and SSE

The intrinsics above are much cleaner than assembly, but they're still quite ugly.  You can make a very nice interface using C++ classes and operator overloading, such as the "floats" class we developed during today's lecture:

```
class floats {
__m128 v; // our four floats
public:
enum {n=4}; // how many floats?
floats(const float *ptr) { v=_mm_load_ps(ptr); }
floats(__m128 m) { v=m; }
floats(float x) { v=_mm_set1_ps(x); } // broadcast one float to all four
void store(float *dest) { _mm_store_ps(dest,v); }

friend floats operator+(const floats &a,const floats &b) {
	return _mm_add_ps(a.v,b.v);
}
friend floats operator*(const floats &a,const floats &b) {
	return _mm_mul_ps(a.v,b.v);
}
};

int foo(void) {
for (long i=0;i<n;i+=floats::n) {
	((floats(&arrIn[i])+10.0f)*2.12345f).store(&arrOut[i]);
}
return 3;
}
```

(Try this in NetRun now!)

The thing I really like about C++ is that this kind of wrapper can have very nice syntax, a complete feature set, and yet compile into no-overhead machine code just like I'd write by hand.

If you want a real version of this wrapper, supporting all the math operations, integers, bools, and floats, you should try Agner Fog's Vector Class Library (VCL).