# High-Performance Floating Point with SSE "ps"

| | single / serial | packed / parallel |
|---|---|---|
| single-precision "float" | ss | ps (4 floats) |
| double-precision "double" | sd | pd (2 doubles) |
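To make the suffixes concrete, here is a minimal sketch using the compiler's SSE intrinsics rather than raw assembly. It assumes an x86-64 compiler that provides `<xmmintrin.h>`; the helper names `add_ss_demo` and `add_ps_demo` are made up for illustration. `_mm_add_ss` corresponds to `addss` (only the lowest float is added) and `_mm_add_ps` corresponds to `addps` (all four floats are added).

```cpp
#include <xmmintrin.h>  // SSE intrinsics; x86 / x86-64 only (assumption)

// Hypothetical helper: "ss" = scalar single. Only lane 0 is added;
// lanes 1..3 of the result are copied unchanged from a.
void add_ss_demo(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);   // load 4 floats (unaligned is OK here)
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ss(va, vb));
}

// Hypothetical helper: "ps" = packed single. All 4 lanes are added at once.
void add_ps_demo(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

Given a = {1,2,3,4} and b = {10,20,30,40}, the "ss" version produces {11,2,3,4} while the "ps" version produces {11,22,33,44}.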

Here's some silly floating-point code.  It takes 2.7ns/float.
```cpp
enum {n=1024};
float a[n];
for (int i=0;i<n;i++) a[i]=3.4;
for (int i=0;i<n;i++) a[i]+=1.2;
return a[0];
```

Staring at the assembly language, there are a number of "cvtss2sd" conversions, and "cvtsd2ss" conversions back again, due to the double-precision constants and single-precision data.  So we can get a good speedup, to 1.4ns/float, just by making the constants single-precision with an "f" suffix.
```cpp
enum {n=1024};
float a[n];
for (int i=0;i<n;i++) a[i]=3.4f;
for (int i=0;i<n;i++) a[i]+=1.2f;
return a[0];
```
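The reason the compiler emits those conversions is C++'s usual arithmetic conversions: a constant like `1.2` has type double, so a float operand is widened to double before the add and the result is narrowed back afterward. Here is a small sketch that checks this at compile time; the two helper functions are hypothetical names used only for illustration.

```cpp
#include <type_traits>

// float + double constant: the float is promoted, so the sum has type double.
static_assert(std::is_same<decltype(1.0f + 1.2), double>::value,
              "float + double constant is computed in double");

// float + float constant: the sum stays in single precision.
static_assert(std::is_same<decltype(1.0f + 1.2f), float>::value,
              "float + float constant is computed in float");

// Hypothetical helper: widen x to double, add, then narrow the result to float.
float add_double_constant(float x) { return static_cast<float>(x + 1.2); }

// Hypothetical helper: a plain single-precision add, no conversions needed.
float add_float_constant(float x) { return x + 1.2f; }
```

Both helpers return nearly the same value; the difference is in how much conversion work the processor does to get there.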

We can run a *lot* faster by using SSE parallel instructions.  I'm going to do this the "hard way," making separate functions to do the assembly computation.

Here's the C++ conversion to call assembly language functions on the array.
```cpp
extern "C" void init_array(float *arr,int n);
extern "C" void add_array(float *arr,int n);
int foo(void) {
	enum {n=1024};
	float a[n];
	init_array(a,n);
	add_array(a,n);
	return a[0]*1000;
}
```

Here are the two assembly language functions called above.  Together, we're down to under 0.5ns/float!
```nasm
; extern "C" void init_array(float *arr,int n);
;for (int i=0;i<n;i+=4) {
;	a[i]=3.4f;
;	a[i+1]=3.4f;
;	a[i+2]=3.4f;
;	a[i+3]=3.4f;
;}
global init_array
init_array:
	; rdi points to arr
	; rsi is n, the array length
	mov rcx,0 ; i
	movaps xmm1,[constant3_4]
	jmp loopcompare
loopstart:
	movaps [rdi+4*rcx],xmm1 ; init array with xmm1
	add rcx,4
loopcompare:
	cmp rcx,rsi
	jl loopstart
	ret

section .data
align 16
constant3_4:
	dd 3.4,3.4,3.4,3.4 ; movaps!

section .text
; extern "C" void add_array(float *arr,int n);
;for (int i=0;i<n;i++) a[i]+=1.2f;
global add_array
add_array:
	; rdi points to arr
	; rsi is n, the array length
	mov rcx,0 ; i
	movaps xmm1,[constant1_2]
	jmp loopcompare2
loopstart2:
	movaps xmm0,[rdi+4*rcx] ; loads arr[i] through arr[i+3]
	addps xmm0,xmm1
	movaps [rdi+4*rcx],xmm0
	add rcx,4
loopcompare2:
	cmp rcx,rsi
	jl loopstart2
	ret

section .data
align 16
constant1_2:
	dd 1.2,1.2,1.2,1.2 ; movaps!
```
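For comparison, the same two loops can be sketched in C++ with SSE intrinsics instead of hand-written assembly. This is not the version timed above, just an illustration under stated assumptions: an x86-64 compiler with `<xmmintrin.h>`, n a multiple of 4, and arr aligned to 16 bytes (the aligned load/store intrinsics have the same alignment requirement as movaps). The names `init_array_sse` and `add_array_sse` are hypothetical.

```cpp
#include <xmmintrin.h>  // SSE intrinsics; x86 / x86-64 only (assumption)

// Hypothetical counterpart of init_array: store 3.4f into 4 floats per iteration.
// Assumes n is a multiple of 4 and arr is 16-byte aligned.
void init_array_sse(float *arr, int n) {
    const __m128 v = _mm_set1_ps(3.4f);      // {3.4, 3.4, 3.4, 3.4}
    for (int i = 0; i < n; i += 4)
        _mm_store_ps(arr + i, v);            // aligned 16-byte store
}

// Hypothetical counterpart of add_array: add 1.2f to 4 floats per iteration.
// Assumes n is a multiple of 4 and arr is 16-byte aligned.
void add_array_sse(float *arr, int n) {
    const __m128 c = _mm_set1_ps(1.2f);      // {1.2, 1.2, 1.2, 1.2}
    for (int i = 0; i < n; i += 4) {
        __m128 x = _mm_load_ps(arr + i);     // loads arr[i] through arr[i+3]
        x = _mm_add_ps(x, c);                // four adds in one instruction
        _mm_store_ps(arr + i, x);            // aligned 16-byte store
    }
}
```

A caller would declare the array as `alignas(16) float a[1024];` to satisfy the alignment assumption, then call the two functions in order just as `foo` does above.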

0.5ns/float is pretty impressive performance for this code, since:
• We store each float once in init_array, then load and store it again in add_array.  That's 3 trips through memory per float.
• Each float is 4 bytes.
• 0.5ns/float means we're doing over 2 billion floats per second; at 3 trips × 4 bytes = 12 bytes per float, that's 24 gigabytes per second to and from memory!  (In this case, the processor's cache.)
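The arithmetic in the list above can be written out as a tiny calculation. This is only a back-of-the-envelope sketch; the function names are made up for illustration, and the inputs are the numbers quoted in the text.

```cpp
// Traffic per float: one store in init_array, then a load and a store in add_array.
constexpr int trips_per_float = 3;
constexpr int bytes_per_float = sizeof(float);  // 4 bytes on the machines discussed here

// Hypothetical helper: floats processed per second at a given time per float.
constexpr double floats_per_second(double ns_per_float) {
    return 1.0e9 / ns_per_float;
}

// Hypothetical helper: bytes moved to and from memory (or cache) per second.
constexpr double bytes_per_second(double ns_per_float) {
    return floats_per_second(ns_per_float) * trips_per_float * bytes_per_float;
}
```

Plugging in 0.5 ns/float gives 2 billion floats per second and 24 billion bytes per second, matching the figures in the list.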