Faster Floating Point with SSE: Streaming SIMD Extensions

CS 301 Lecture, Dr. Lawlor

Aside: The Bit Shuffle

In HW8, you had to read some bits out of the guts of a float.  This is tricky, because none of the bitwise operators *work* on a float, and casting to integer (like "int x=(int)f;") translates the *value* of the float, not the bits inside of it.

To read out the bits of any type, you've got to make it an integer without changing the underlying bytes.  Up until 2004 or so, you could do this reliably in practice using "the pointer shuffle" (my term), which is:
int x = * (int *) &f;
This is just the do-nothing "* & f" (take f's address, then dereference the address, leaving f again), but with a pointer typecast in the middle to reinterpret the bytes of f as the bytes of an int.  Unfortunately, this breaks the language's "strict aliasing" rule (an int pointer isn't supposed to point at a float), so more recent optimizing compilers may not recognize that f is getting read during this code, and this approach isn't reliable anymore.

The modern way to do this sort of bitwise conversion is with a "union" (the creepy uncle of "class"):
```
union floatConverter {
	int i;
	float f;
};
floatConverter u;
u.f=1.5; /* write into the float side */
return u.i; /* read out the bits from the integer side */
```

(Try this in NetRun now!)
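A hedged aside: reading the "wrong" side of a union is something gcc documents as supported, but the C++ standard itself doesn't promise it. A portable alternative (not the approach this lecture uses) is to copy the bytes across with memcpy. The sketch below assumes float is 32 bits; the helper names float_bits_of and float_value_as_int are made up for illustration.

```cpp
#include <cstdint>
#include <cstring>

/* Hypothetical helper (not from the lecture): return the raw bits of a float. */
std::uint32_t float_bits_of(float f) {
	static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");
	std::uint32_t bits;
	std::memcpy(&bits, &f, sizeof bits); /* copy the bytes; the value is not converted */
	return bits;
}

/* For contrast: a cast converts the *value*, dropping the fractional part. */
int float_value_as_int(float f) {
	return (int)f;
}
```

On a machine with IEEE floats, float_bits_of(1.5f) comes back as 0x3FC00000, while float_value_as_int(1.5f) is just 1.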

And a super-fancy way to do this is with a "bitfield" class, where you assign bit sizes to the class's fields:
```
/* IEEE floating-point number's bits:  sign  exponent   mantissa */
struct float_bits {
	unsigned int fraction:23; /**< Value is binary 1.fraction ("mantissa") */
	unsigned int exp:8; /**< Value is 2^(exp-127) */
	unsigned int sign:1; /**< 0 for positive, 1 for negative */
};
/* A union is a struct where all the fields *overlap* each other */
union float_dissector {
	float f;
	float_bits b;
};
float_dissector s;
s.f=8.0;
std::cout<<s.f<<"= sign "<<s.b.sign<<" exp "<<s.b.exp<<"  fract "<<s.b.fraction<<"\n";
return 0;
```
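The order in which a compiler packs bitfields is implementation-defined, so the struct above depends on the compiler laying out "fraction" in the low bits. As a rough sketch of the same idea without bitfields, the three fields can be pulled out with shifts and masks. This assumes a 32-bit IEEE float; the names float_fields, split_float, and rebuild_value are hypothetical, not part of the lecture code.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

/* Hypothetical struct: the three fields of a 32-bit IEEE float. */
struct float_fields {
	std::uint32_t sign;     /* 1 bit: 0 for positive, 1 for negative */
	std::uint32_t exp;      /* 8 bits, stored with a bias of 127 */
	std::uint32_t fraction; /* 23 bits after the implicit leading 1 */
};

/* Split a float into its fields using shifts and masks. */
float_fields split_float(float f) {
	std::uint32_t bits;
	std::memcpy(&bits, &f, sizeof bits); /* grab the raw bits */
	float_fields r;
	r.fraction = bits & 0x7FFFFFu;   /* low 23 bits */
	r.exp = (bits >> 23) & 0xFFu;    /* next 8 bits */
	r.sign = bits >> 31;             /* top bit */
	return r;
}

/* Rebuild the value from the fields (normalized numbers only:
   no zeros, denormals, infinities, or NaNs). */
double rebuild_value(const float_fields &p) {
	double mantissa = 1.0 + p.fraction / 8388608.0; /* 8388608 = 2^23 */
	double v = std::ldexp(mantissa, (int)p.exp - 127); /* mantissa * 2^(exp-127) */
	return p.sign ? -v : v;
}
```

For 8.0, split_float gives sign 0, exp 130 (that is, 127+3), and fraction 0, and rebuild_value turns those fields back into 8.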

SSE Assembly

The old 1970's floating-point on x86 was so bad, they actually built a better optional way to do floating point called "SSE".  Unlike the old floating point:
• SSE registers can be accessed in any order, not just as a stack.
• All SSE registers can be trashed (no need to clean up the stack across function calls).
• SSE instructions look very similar to integer instructions.
• SSE won't work on ancient processors, like the Pentium I (but anymore, who cares?).
SSE instructions were first introduced with the Intel Pentium III, but they're now found on all modern x86 processors, including the 64-bit versions.  SSE introduces 8 new registers, called xmm0 through xmm7 (64-bit mode adds xmm8 through xmm15).  Each register is 128 bits wide, enough to hold four 32-bit single-precision floats, but the instructions we'll use here only touch the lowest float.  These float instructions have the suffix "ss", for "Scalar Single-precision" (one float).  See the x86 reference manual for a complete list of SSE instructions.

For example, "add" adds two integer registers, like eax and ebx.  "addss" adds two SSE registers, like xmm3 and xmm6.  There are  SSE versions of most other operations: movss, subss, mulss, divss, cmpss, etc.

Here we're adding two floats using SSE:
```
	movss xmm1,[thing1]; <- copy the float into xmm1
	movss xmm6,[thing2]; <- copy the float into xmm6
	addss xmm1,xmm6; <- add floats
	movss [retval],xmm1; <- store the sum into the global "retval"
	; Print out retval
	extern farray_print
	push 1 ;<- number of floats to print
	push retval ;<- points to array of floats
	call farray_print
	add esp,8 ; <- pop off arguments
	ret

section .data
thing1: dd 10.2;<- source constant
thing2: dd 1.2;<- source constant
retval: dd 0.0;<- our return value
```

(Try this in NetRun now!)

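You can make gcc use these new SSE registers for all its floating point operations with the "-mfpmath=sse" flag (on 32-bit x86 you also need "-msse", or "-msse2" for doubles, to enable the instructions).  When these notes were written, this was the default only on Mac OS X Intel machines; gcc also uses SSE floating point by default when compiling for 64-bit x86.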