x86 Floating Point Assembly: SSE

CS 301 Lecture, Dr. Lawlor

SSE Instruction List


| | Scalar single-precision (float) | Scalar double-precision (double) | Packed single-precision (4 floats) | Packed double-precision (2 doubles) | Example | Comments |
|---|---|---|---|---|---|---|
| Arithmetic | addss | addsd | addps | addpd | addss xmm0,xmm1 | sub, mul, div all work the same way |
| Compare | minss | minsd | minps | minpd | minps xmm0,xmm1 | max works the same way |
| Sqrt | sqrtss | sqrtsd | sqrtps | sqrtpd | sqrtss xmm0,xmm1 | Square root (sqrt), reciprocal (rcp), and reciprocal-square-root (rsqrt) all work the same way, although rcp and rsqrt are approximate and exist only in single precision |
| Move | movss | movsd | movaps (aligned), movups (unaligned) | movapd (aligned), movupd (unaligned) | movss xmm0,xmm1 | Aligned loads can be up to 4x faster on older CPUs, but will crash if given an unaligned address! On 64-bit systems the stack is kept 16-byte aligned specifically for this instruction. Use the "align 16" directive for static data. |
| Convert | cvtss2sd, cvtss2si, cvttss2si | cvtsd2ss, cvtsd2si, cvttsd2si | cvtps2pd, cvtps2dq, cvttps2dq | cvtpd2ps, cvtpd2dq, cvttpd2dq | cvtsi2ss xmm0,eax | Convert to ("2", get it?) Single Integer (si, stored in a register like eax) or four DWORDs (dq, stored in an xmm register). "cvtt" versions truncate (round toward zero); "cvt" versions use the current rounding mode, which is round-to-nearest by default. The example goes the other way, from integer to float. |
| High Bits | n/a | n/a | movmskps | movmskpd | movmskps eax,xmm0 | Extract the sign bits of an xmm register into an integer register. Often used to see if all the floats are "done" and you can exit. |
| Compare to flags | ucomiss | ucomisd | n/a | n/a | ucomiss xmm0,xmm1 then jbe dostuff | Sets CPU flags like the normal x86 "cmp" instruction, but from SSE registers. Use with "jb", "jbe", "je", "jae", or "ja" for normal comparisons. Sets "pf", the parity flag, if either input is a NaN. |
| Compare to bitwise mask | cmpeqss | cmpeqsd | cmpeqps | cmpeqpd | cmpleps xmm0,xmm1 | Compare for equality ("lt", "le", "neq", "nlt", "nle" versions work the same way). There's also a "cmpunordss" that marks NaN values. Sets all bits of the float to zero if false (0.0), or all bits to ones if true (a NaN). The result is used as a bitmask for the bitwise AND and OR operations. |
| Bitwise | n/a | n/a | andps, andnps | andpd, andnpd | andps xmm0,xmm1 | Bitwise AND operation. "andn" versions are bitwise AND-NOT operations (A=(~A) & B). The "or" version works the same way. |
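
The "High Bits" row is easier to picture with a portable model. The C sketch below is not SSE code: it imitates what "movmskps eax,xmm0" computes, with a plain array standing in for the four lanes of xmm0. The function names (float_bits, sign_mask4, all_lanes_set) are made up for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw 32 bits of a float (memcpy avoids undefined type punning). */
uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Portable stand-in for "movmskps eax,xmm0": bit i of the result is the
   sign bit of lane i. lanes[0] plays the role of the lowest float in xmm0. */
int sign_mask4(const float lanes[4]) {
    int mask = 0;
    for (int i = 0; i < 4; i++) {
        mask |= (int)(float_bits(lanes[i]) >> 31) << i;
    }
    return mask;
}

/* Typical "are we done?" test: all four sign bits set gives a mask of 0xF. */
int all_lanes_set(const float lanes[4]) {
    return sign_mask4(lanes) == 0xF;
}
```

Because the all-ones pattern produced by the "cmp" mask instructions has its sign bit set, the same mask test works on comparison results as well as on ordinary negative numbers.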

Simple SSE Output Code

The easy way to get SSE output is to just convert to integer, like this:
movss xmm3,[pi]; load up constant
addss xmm3,xmm3 ; add pi to itself
cvtss2si eax,xmm3 ; round to integer
ret
section .data
pi: dd 3.14159265358979 ; constant

(Try this in NetRun now!)

It's annoyingly tricky to display full floating-point values.  The trouble here is that our function "foo" returns an int to main, so we have to call a function to print floating-point values.  Also, with SSE floating-point, on a 64-bit machine you're supposed to keep the stack aligned to a 16-byte boundary whenever you call a function (the SSE "movaps" instruction crashes if it's given an address that isn't 16-byte aligned).  Sadly, the "call" instruction that got us into foo already knocked the stack out of alignment by pushing an 8-byte return address, so we've got to use up another 8 bytes of stack space purely for stack alignment, like this:
movss xmm3,[pi]; load up constant
addss xmm3,xmm3 ; add pi to itself
movss [output],xmm3; write register out to memory

; Print floating-point output
mov rdi,output ; first parameter: pointer to floats
mov rsi,1 ; second parameter: number of floats
sub rsp,8 ; keep stack 16-byte aligned (else get crash!)
extern farray_print
call farray_print
add rsp,8

ret

section .data
pi: dd 3.14159265358979 ; constant
output: dd 0.0 ; overwritten at runtime

(Try this in NetRun now!)
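
The "sub rsp,8" in the listing above is just arithmetic on the stack pointer, and it can be checked on paper. Here is a small C sketch of that arithmetic; the helper names are invented for this note, and any stack address you feed it is an arbitrary example value, not something read from a real process.

```c
#include <stdint.h>

/* Nonzero if addr sits on a 16-byte boundary (low four bits are zero). */
int is_aligned16(uint64_t addr) {
    return (addr & 15u) == 0;
}

/* "call" pushes an 8-byte return address, so rsp drops by 8. */
uint64_t rsp_after_call(uint64_t rsp) {
    return rsp - 8;
}

/* "sub rsp,8" pads the stack by another 8 bytes. */
uint64_t rsp_after_pad(uint64_t rsp) {
    return rsp - 8;
}
```

Starting from a caller whose rsp is 16-byte aligned, rsp_after_call gives a misaligned value (this is the state our function starts in), and rsp_after_pad applied to that result is aligned again, which is the state we want just before "call farray_print".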

Here's how to use the "ucomiss" comparison: do the compare, and check the flags.
movss xmm1,DWORD[A]
movss xmm2,DWORD[B]

ucomiss xmm1,xmm2 ; set CF for ja/jb, ZF for je/jne, and PF for jp (NaN)
jb isbelow
mov eax,0
ret
isbelow:
mov eax,999
ret

section .data
A: dd 1.0
B: dd 2.0

(Try this in NetRun now!)
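
One wrinkle with "ucomiss" is the NaN case: an unordered compare sets ZF, PF, and CF all at once, so "jb" and "je" would both be taken unless you check "jp" first. The C sketch below models the four possible outcomes; it assumes default compiler flags (no -ffast-math), and make_nan and compare_floats are names invented for this note.

```c
/* Build a NaN at run time; the volatile keeps the compiler from folding 0/0. */
float make_nan(void) {
    volatile float zero = 0.0f;
    return zero / zero;
}

/* Model the outcomes of "ucomiss a,b":
   -1 = below (jb), 0 = equal (je), 1 = above (ja), 2 = unordered (jp). */
int compare_floats(float a, float b) {
    if (a != a || b != b) return 2;  /* a NaN never compares equal to itself */
    if (a < b) return -1;
    if (a > b) return 1;
    return 0;
}
```

With A=1.0 and B=2.0 as in the listing, compare_floats returns -1, matching the "jb isbelow" path.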

Here's how to use those weird "cmp" instructions: they create a bitmask you can use in a later bitwise-AND instruction:
movss xmm1,DWORD[A]
movss xmm2,DWORD[B]

cmpltss xmm1,xmm2 ; makes xmm1 into a mask: 0's for false, 1's for true
movss xmm3,DWORD[one] ; "then" case
andps xmm3,xmm1 ; if (compare) xmm3 keeps its value (999.0), else xmm3=0.0
cvttss2si eax,xmm3
ret

section .data
A: dd 1.0
B: dd 2.0
one: dd 999.0

(Try this in NetRun now!)
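
If the mask trick seems like magic, here is the same idea spelled out in C on the raw bits of a single float. This is only a model of the low lane of the xmm registers, not real SSE code, and select_if_less is a name invented for this note.

```c
#include <stdint.h>
#include <string.h>

/* Model of the cmpltss + andps pair: returns then_value if a < b, else 0.0f. */
float select_if_less(float a, float b, float then_value) {
    /* What cmpltss leaves in the low lane: all ones if true, all zeros if false. */
    uint32_t mask = (a < b) ? 0xFFFFFFFFu : 0u;

    uint32_t bits;
    memcpy(&bits, &then_value, sizeof bits);  /* raw bits of the "then" value */
    bits &= mask;                             /* what andps does to the low lane */

    float result;
    memcpy(&result, &bits, sizeof result);    /* all-zero bits read back as 0.0f */
    return result;
}
```

Calling select_if_less(1.0f, 2.0f, 999.0f) keeps 999.0, like the listing; swapping the first two arguments gives 0.0. To get a real if/else you would AND the "else" value with the inverted mask (that is what "andnps" is for) and OR the two halves together.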

Floating-Point Bit Counts

Here's an example where we're using the SSE floating-point instructions to determine how many mantissa bits you can store in a "float" (single-precision number).  We can do this by adding smaller and smaller numbers to 1.0 until roundoff causes the result to equal 1.0.
; Count number of bits in floating-point mantissa
movss xmm10,[one]; load constants
movss xmm5,[one_half]
movss xmm0,xmm10; testbit--drops by half every iteration
mov eax,0 ; bit count
loopstart:
add eax,1 ; increment bit count
mulss xmm0,xmm5 ; multiply by one half: drops down to next test bit
movss xmm2,xmm0 ; build test pattern, starting at 1.0
addss xmm2,xmm10 ; compute 1+testbit
ucomiss xmm2,xmm10 ; compare test pattern against 1.0
jne loopstart ; if they're not equal, try again
ret

section .data
one: dd 1.0 ; constants
one_half: dd 0.5

(Try this in NetRun now!)

This returns 24, meaning my 32-bit float can represent 1.0+1.0*2^-23 exactly, but 1.0+1.0*2^-24 gets rounded off to 1.0.

Here's the exact same experiment on 64-bit "double"s.  Now we're using quadwords to store the numbers, and "sd" (scalar double-precision) SSE instructions:
; Count number of bits in floating-point mantissa
movsd xmm10,[one]; load constants
movsd xmm5,[one_half]
movsd xmm0,xmm10; testbit--drops by half every iteration
mov eax,0 ; bit count
loopstart:
add eax,1 ; increment bit count
mulsd xmm0,xmm5 ; multiply by one half: drops down to next test bit
movsd xmm2,xmm0 ; build test pattern, starting at 1.0
addsd xmm2,xmm10 ; compute 1+testbit
ucomisd xmm2,xmm10 ; compare test pattern against 1.0
jne loopstart ; if they're not equal, try again
ret

section .data
one: dq 1.0 ; constants
one_half: dq 0.5

(Try this in NetRun now!)

This returns 53: clearly, a 64-bit "double" uses most of its bits to represent the mantissa!
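
Both counts can be cross-checked from C, which is handy if you don't have NetRun in front of you. The sketch below repeats the two loops above; it assumes an IEEE 754 machine with the default round-to-nearest mode, and the function names are invented for this note.

```c
/* Halve a test bit until 1.0 + testbit rounds back to 1.0 (float version). */
int count_float_bits(void) {
    volatile float sum;        /* volatile forces the sum to be rounded to float */
    float testbit = 1.0f;
    int count = 0;
    do {
        count++;               /* increment bit count */
        testbit *= 0.5f;       /* drop down to the next test bit */
        sum = 1.0f + testbit;  /* compute 1+testbit */
    } while (sum != 1.0f);     /* if they're not equal, try again */
    return count;
}

/* Same experiment with doubles. */
int count_double_bits(void) {
    volatile double sum;       /* volatile forces the sum to be rounded to double */
    double testbit = 1.0;
    int count = 0;
    do {
        count++;
        testbit *= 0.5;
        sum = 1.0 + testbit;
    } while (sum != 1.0);
    return count;
}
```

On such a machine these return 24 and 53, matching the assembly. Only 23 and 52 of those bits are actually stored; the leading 1 of a normalized number is implied.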