| Operation | Scalar single-precision (float) | Scalar double-precision (double) | Packed single-precision (4 floats) | Packed double-precision (2 doubles) | Example | Comments |
|---|---|---|---|---|---|---|
| Arithmetic | addss | addsd | addps | addpd | addss xmm0,xmm1 | sub, mul, and div all work the same way. |
| Compare | minss | minsd | minps | minpd | minps xmm0,xmm1 | max works the same way. |
| Sqrt | sqrtss | sqrtsd | sqrtps | sqrtpd | sqrtss xmm0,xmm1 | Square root (sqrt), reciprocal (rcp), and reciprocal-square-root (rsqrt) all work the same way, although rcp and rsqrt are approximations and exist only in single precision. |
| Move | movss | movsd | movaps (aligned), movups (unaligned) | movapd (aligned), movupd (unaligned) | movss xmm0,xmm1 | Aligned loads are up to 4x faster, but will crash if given an unaligned address! The calling convention keeps the stack 16-byte aligned at each call specifically for this instruction. Use the "align 16" directive for static data. |
| Convert | cvtss2sd, cvtss2si, cvttss2si | cvtsd2ss, cvtsd2si, cvttsd2si | cvtps2pd, cvtps2dq, cvttps2dq | cvtpd2ps, cvtpd2dq, cvttpd2dq | cvtsi2ss xmm0,eax | Convert to ("2", get it?) a single integer (si, stored in a register like eax) or packed DWORDs (dq, stored in an xmm register: four from ps, two from pd). "cvtt" versions truncate (round toward zero); "cvt" versions use the current rounding mode, which is round-to-nearest by default. |
| High bits | n/a | n/a | movmskps | movmskpd | movmskps eax,xmm0 | Extract the sign bits of an xmm register into an integer register. Often used to see if all the floats are "done" and you can exit. |
| Compare to flags | ucomiss | ucomisd | n/a | n/a | ucomiss xmm0,xmm1 followed by jbe dostuff | Sets CPU flags like the normal x86 "cmp" instruction, but from SSE registers. Use with "jb", "jbe", "je", "jae", or "ja" for normal comparisons. Sets "pf", the parity flag, if either input is a NaN. |
| Compare to bitwise mask | cmpeqss | cmpeqsd | cmpeqps | cmpeqpd | cmpleps xmm0,xmm1 | Compare for equality ("lt", "le", "neq", "nlt", "nle" versions work the same way). There's also a "cmpunordss" that marks NaN values. Sets all bits of the float to zero if false (0.0), or all bits to ones if true (a NaN). The result is used as a bitmask for the bitwise AND and OR operations. |
| Bitwise | n/a | n/a | andps, andnps | andpd, andnpd | andps xmm0,xmm1 | Bitwise AND operation. "andn" versions are bitwise AND-NOT operations (A=(~A) & B). The "or" version works the same way. |
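
To make the "Convert" and "High bits" rows concrete, here is a minimal sketch in C rather than assembly, using compiler intrinsics that map onto cvtss2si, cvttss2si, and movmskps. It assumes an x86 or x86-64 compiler that provides xmmintrin.h; the function names convert_round, convert_truncate, and sign_mask are made up for this illustration.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 or x86-64 compiler */

/* cvtss2si: convert using the current MXCSR rounding mode
   (round-to-nearest-even unless the program has changed it). */
int convert_round(float f) {
    return _mm_cvtss_si32(_mm_set_ss(f));
}

/* cvttss2si: always truncate toward zero, whatever the rounding mode. */
int convert_truncate(float f) {
    return _mm_cvttss_si32(_mm_set_ss(f));
}

/* movmskps: gather the four sign bits into bits 0..3 of an int.
   Bit 0 comes from a, bit 1 from b, bit 2 from c, bit 3 from d. */
int sign_mask(float a, float b, float c, float d) {
    /* _mm_set_ps takes its arguments highest lane first */
    return _mm_movemask_ps(_mm_set_ps(d, c, b, a));
}
```

With the default rounding mode, convert_round(-2.7f) gives -3 while convert_truncate(-2.7f) gives -2, and sign_mask(-1.0f, 2.0f, -3.0f, 4.0f) gives 5 (bits 0 and 2 set). A loop that waits for every lane to go negative can exit once the mask equals 15.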

movss xmm3,[pi] ; load up constant
addss xmm3,xmm3 ; add pi to itself
cvtss2si eax,xmm3 ; round to integer
ret
section .data
pi: dd 3.14159265358979 ; constant

It's annoyingly tricky to display full floating-point values. The trouble here is that our function "foo" returns an int to main, so we have to call a function to print floating-point values. Also, with SSE floating point, on a 64-bit machine you're supposed to keep the stack aligned to a 16-byte boundary (the SSE "movaps" instruction crashes if it's not given a 16-byte aligned address). Sadly, the "call" instruction messes up your stack's alignment by pushing an 8-byte return address, so we've got to use up another 8 bytes of stack space purely for stack alignment, like this:

movss xmm3,[pi] ; load up constant
addss xmm3,xmm3 ; add pi to itself
movss [output],xmm3 ; write register out to memory
; Print floating-point output
mov rdi,output ; first parameter: pointer to floats
mov rsi,1 ; second parameter: number of floats
sub rsp,8 ; keep stack 16-byte aligned (else get crash!)
extern farray_print
call farray_print
add rsp,8 ; give back the alignment padding
ret
section .data
pi: dd 3.14159265358979 ; constant
output: dd 0.0 ; overwritten at runtime

Here's how to use the "ucomiss" comparison: do the compare, and check the flags.

movss xmm1,DWORD[A]
movss xmm2,DWORD[B]
ucomiss xmm1,xmm2 ; set CF for ja/jb, ZF for je/jne, and PF for jp (NaN)
jb isbelow
mov eax,0
ret
isbelow:
mov eax,999
ret
section .data
A: dd 1.0
B: dd 2.0

Here's how to use those weird "cmp" instructions: they create a bitmask you can use in a later bitwise-AND instruction:

movss xmm1,DWORD[A]
movss xmm2,DWORD[B]
cmpltss xmm1,xmm2 ; makes xmm1 into a mask: all 0 bits if false, all 1 bits if true
movss xmm3,DWORD[one] ; "then" case
andps xmm3,xmm1 ; if (A<B) xmm3 keeps its value (999.0), else xmm3 becomes 0
cvttss2si eax,xmm3
ret
section .data
A: dd 1.0
B: dd 2.0
one: dd 999.0
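
The same mask trick extends to a full branch-free if/else by combining andps, andnps, and orps. The following is a sketch in C under the assumption that xmmintrin.h is available; select_if_less and select_less are hypothetical helper names, not part of any library.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 or x86-64 compiler */

/* Mirrors the assembly above: returns then_value if a < b, else 0.0f. */
float select_if_less(float a, float b, float then_value) {
    __m128 mask = _mm_cmplt_ss(_mm_set_ss(a), _mm_set_ss(b)); /* cmpltss */
    __m128 result = _mm_and_ps(_mm_set_ss(then_value), mask); /* andps */
    return _mm_cvtss_f32(result);
}

/* Full if/else without a branch: (mask & then) | (~mask & else). */
float select_less(float a, float b, float then_value, float else_value) {
    __m128 mask = _mm_cmplt_ss(_mm_set_ss(a), _mm_set_ss(b));
    __m128 t = _mm_and_ps(mask, _mm_set_ss(then_value));
    /* andnps computes (~first) & second, so the mask goes first */
    __m128 e = _mm_andnot_ps(mask, _mm_set_ss(else_value));
    return _mm_cvtss_f32(_mm_or_ps(t, e));                     /* orps */
}
```

Because the mask is either all zeros or all ones, exactly one of the two AND results survives, and the OR merges them. With the packed (ps) versions, the same three instructions select independently in all four lanes.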

; Count number of bits in floating-point mantissa
movss xmm10,[one] ; load constants
movss xmm5,[one_half]
movss xmm0,xmm10 ; test bit: drops by half every iteration
mov eax,0 ; bit count
loopstart:
add eax,1 ; increment bit count
mulss xmm0,xmm5 ; multiply by one half: drops down to next test bit
movss xmm2,xmm0 ; build test pattern, starting at 1.0
addss xmm2,xmm10 ; compute 1+testbit
ucomiss xmm2,xmm10 ; compare test pattern against 1.0
jne loopstart ; if they're not equal, try again
ret
section .data
one: dd 1.0 ; constants
one_half: dd 0.5

This returns 24, meaning my 32-bit float can represent 1.0+1.0*2^{-23} exactly, but 1.0+1.0*2^{-24} gets rounded off to 1.0 (it lies exactly halfway between 1.0 and the next float, and round-to-nearest-even picks 1.0).

; Count number of bits in floating-point mantissa
movsd xmm10,[one] ; load constants
movsd xmm5,[one_half]
movsd xmm0,xmm10 ; test bit: drops by half every iteration
mov eax,0 ; bit count
loopstart:
add eax,1 ; increment bit count
mulsd xmm0,xmm5 ; multiply by one half: drops down to next test bit
movsd xmm2,xmm0 ; build test pattern, starting at 1.0
addsd xmm2,xmm10 ; compute 1+testbit
ucomisd xmm2,xmm10 ; compare test pattern against 1.0
jne loopstart ; if they're not equal, try again
ret
section .data
one: dq 1.0 ; constants
one_half: dq 0.5

This returns 53: clearly, a 64-bit "double" uses most of its bits to represent the mantissa!
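
For comparison, the same halving loop can be sketched in plain C and checked against the FLT_MANT_DIG and DBL_MANT_DIG constants from float.h. This is an illustration, not the course's code: the function names are made up, and the volatile stores are there on the assumption that some compilers would otherwise keep the sum in a wider register.

```c
#include <float.h>  /* FLT_MANT_DIG, DBL_MANT_DIG */

/* Halve a test bit until adding it to 1.0f no longer changes the sum. */
int float_mantissa_bits(void) {
    volatile float sum;   /* volatile forces a real 32-bit store each pass */
    float testbit = 1.0f;
    int bits = 0;
    do {
        bits++;                 /* increment bit count */
        testbit *= 0.5f;        /* drop down to the next test bit */
        sum = 1.0f + testbit;   /* compute 1+testbit */
    } while (sum != 1.0f);      /* stop once the test bit is rounded away */
    return bits;
}

/* Same loop in double precision. */
int double_mantissa_bits(void) {
    volatile double sum;  /* volatile forces a real 64-bit store each pass */
    double testbit = 1.0;
    int bits = 0;
    do {
        bits++;
        testbit *= 0.5;
        sum = 1.0 + testbit;
    } while (sum != 1.0);
    return bits;
}
```

On an IEEE 754 machine with the default rounding mode these return 24 and 53, matching FLT_MANT_DIG and DBL_MANT_DIG, which count the implicit leading 1 along with the stored fraction bits.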