Faster Floating Point with SSE: Streaming SIMD Extensions

Aside: The Bit Shuffle

In HW8, you had to read some bits out of the guts of a float. This is tricky, because none of the bitwise operators *work* on a float, and casting to integer (like "int x=(int)f;") translates the *value* of the float, not the bits inside of it.

To read out the bits of any type, you've got to make it an integer without changing the underlying bytes. Up until 2004 or so, you could do this reliably in practice using "the pointer shuffle" (my term), which is:
int x = * (int *) &f;
This is just the do-nothing "* & f" (take f's address, then dereference the address, leaving f again), but with a pointer typecast in the middle to reinterpret the bytes of f as the bytes of an int. Unfortunately, this breaks the language's "strict aliasing" rule (an int pointer isn't supposed to point at a float), and more recent optimizing compilers rely on that rule: they may not recognize that f is getting read by this code, so this approach isn't reliable anymore.

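A common modern way to do this sort of bitwise conversion is with a "union" (the creepy uncle of "class"). gcc documents that writing one member and reading another works, although the C++ standard itself doesn't promise it: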

union floatConverter {
	int i;
	float f;
};
floatConverter u;
u.f=1.5; /* write into the float side */
return u.i; /* read out the bits from the integer side */

(Try this in NetRun now!)
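
If you'd rather not lean on compiler-specific union behavior, another option is to copy the bytes with memcpy. The sketch below is not from the homework or the notes above: the names float_to_bits and check_conversions are made up for illustration, and it assumes a 32-bit IEEE float. It also shows the difference between converting the *value* and copying the *bits*:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "this sketch assumes a 32-bit float");

/* Hypothetical helper (not from the notes): return the raw bits of f. */
std::uint32_t float_to_bits(float f) {
	std::uint32_t bits;
	std::memcpy(&bits, &f, sizeof bits); /* copy the bytes; the value is never converted */
	return bits;
}

/* Self-check contrasting the two kinds of conversion on 1.5 */
void check_conversions() {
	float f = 1.5f;
	assert((int)f == 1);                     /* value conversion: truncates to 1 */
	assert(float_to_bits(f) == 0x3FC00000u); /* bit copy: sign 0, exp 127, fraction 0x400000 */
}
```

Compilers generally turn a small fixed-size memcpy like this into a single register move, so it shouldn't cost more than the union.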

And a super-fancy way to do this is with a "bitfield" class, where you assign bit sizes to the class's fields:

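/* IEEE floating-point number's bits:  sign  exponent   mantissa
   The fields are declared low bits first, which matches how gcc on x86
   lays out bitfields; other compilers and CPUs may order them differently. */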
struct float_bits {
	unsigned int fraction:23; /**< Value is binary 1.fraction ("mantissa") */
	unsigned int exp:8; /**< Value is 2^(exp-127) */
	unsigned int sign:1; /**< 0 for positive, 1 for negative */
};

/* A union is a struct where all the fields *overlap* each other */
union float_dissector {
	float f;
	float_bits b;
};

float_dissector s;
s.f=8.0;
std::cout<<s.f<<"= sign "<<s.b.sign<<" exp "<<s.b.exp<<"  fract "<<s.b.fraction<<"\n";
return 0;

(Executable NetRun link)
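
Bitfield layout is up to the compiler, so here is a sketch of the same dissection done with plain shifts and masks, plus a function that rebuilds the value from the formula in the comments above: (1.fraction) times 2^(exp-127), negated if sign is set. The names float_fields, split_float, and rebuild_value are hypothetical, and the rebuild step only handles normalized numbers (not zero, denormals, infinity, or NaN):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

/* Hypothetical plain struct (no bitfields) holding the three pieces of a float */
struct float_fields {
	unsigned int sign;     /* 0 for positive, 1 for negative */
	unsigned int exp;      /* biased exponent: value is 2^(exp-127) */
	unsigned int fraction; /* low 23 bits of the mantissa */
};

/* Pull the fields out of a 32-bit IEEE float with shifts and masks */
float_fields split_float(float f) {
	std::uint32_t bits;
	std::memcpy(&bits, &f, sizeof bits);
	float_fields r;
	r.fraction = bits & 0x7FFFFFu;    /* bits 0..22 */
	r.exp      = (bits >> 23) & 0xFFu; /* bits 23..30 */
	r.sign     = bits >> 31;           /* bit 31 */
	return r;
}

/* Rebuild the numeric value from the fields (normalized numbers only) */
double rebuild_value(const float_fields &p) {
	assert(p.exp != 0 && p.exp != 255);              /* skip zero/denormal/inf/NaN */
	double mantissa = 1.0 + p.fraction / 8388608.0;  /* 1.fraction; 8388608 = 2^23 */
	double value = std::ldexp(mantissa, (int)p.exp - 127); /* mantissa * 2^(exp-127) */
	return p.sign ? -value : value;
}
```

For 8.0, this should give sign 0, exp 130 (that's 127+3, since 8 is 2^3), and fraction 0.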

SSE Assembly

The old stack-based floating-point on x86 (the "x87" instructions, dating from about 1980) was so awkward that Intel actually built a better optional way to do floating point called "SSE". Unlike the old floating point:

SSE registers can be accessed in any order, not just as a stack.
All SSE registers can be trashed (no need to clean up the stack across function calls).
SSE instructions look very similar to integer instructions.
SSE won't work on ancient processors, like the Pentium I (but these days, who cares?).

SSE instructions were first introduced with the Intel Pentium III, but they're now found on all modern x86 processors, including the 64-bit versions. SSE introduces 8 new registers, called xmm0 through xmm7 (64-bit mode adds xmm8 through xmm15). Each register is actually 128 bits wide, enough for four 32-bit single-precision floats, but the scalar instructions used here only operate on the lowest 32 bits, so for now you can treat each register as holding one float. The new float instructions that operate on these registers have the suffix "ss", for "Scalar Single-precision" (one float). See the x86 reference manual for a complete list of SSE instructions.

For example, "add" adds two integer registers, like eax and ebx. "addss" adds two SSE registers, like xmm3 and xmm6. There are SSE versions of most other operations: movss, subss, mulss, divss, cmpss, etc.

Here we're adding two floats using SSE:

	movss xmm1,[thing1]; <- copy the float into xmm1
	movss xmm6,[thing2]; <- copy the float into xmm6
	addss xmm1,xmm6; <- add floats
	movss [retval],xmm1; <- store the sum into the global "retval"
	
; Print out retval
	extern farray_print
	push 1 ;<- number of floats to print
	push retval ;<- points to array of floats
	call farray_print
	add esp,8 ; <- pop off arguments
	ret

section .data
thing1: dd 10.2;<- source constant
thing2: dd 1.2;<- source constant
retval: dd 0.0;<- our return value

(Try this in NetRun now!)
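
You can also reach the same instructions from C++ using the compiler's SSE "intrinsics" from the xmmintrin.h header. This is only a sketch, assuming an x86 or x86-64 compiler with SSE enabled; add_with_sse and check_add are made-up names, and the compiler picks the actual registers, so don't expect xmm1 and xmm6 specifically:

```cpp
#include <cassert>
#include <xmmintrin.h> /* SSE intrinsics; x86 and x86-64 compilers only */

/* Hypothetical helper: add two floats with the scalar SSE add */
float add_with_sse(float a, float b) {
	__m128 x = _mm_set_ss(a);      /* a in the low 32 bits, upper lanes zero (like movss from memory) */
	__m128 y = _mm_set_ss(b);
	__m128 sum = _mm_add_ss(x, y); /* addss: adds only the low floats */
	return _mm_cvtss_f32(sum);     /* extract the low float */
}

/* Self-check using the same constants as the assembly above */
void check_add() {
	float d = add_with_sse(10.2f, 1.2f) - 11.4f;
	assert(d < 1e-5f && d > -1e-5f); /* 11.4, up to float roundoff */
}
```

The comparison uses a small tolerance because neither 10.2 nor 1.2 can be represented exactly as a float.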

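You can make gcc use these new SSE registers for all its floating point operations with the "-mfpmath=sse" flag (on 32-bit x86 you also need "-msse" to enable the instructions). This started out as the default only on Mac OS X Intel machines, but it's also gcc's default on all 64-bit x86 targets.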