Intel's Advanced Vector Extensions (AVX)

CS 641 Lecture, Dr. Lawlor

AVX has upgraded SSE to use 256-bit registers.  These can contain up to 8 floats, or 4 doubles!  Clearly, with registers that wide, you want to use struct-of-arrays, not array-of-structs.  A single 3D XYZ point is pretty lonely inside an 8-wide vector, but three 8-wide vectors holding XXXXXXXX, YYYYYYYY, and ZZZZZZZZ work fine.
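For example, here's a minimal sketch of that struct-of-arrays layout (my own illustration, using only standard AVX intrinsics; the scale_points name and the assumption that n is a multiple of 8 are just to keep it short):

#include <immintrin.h>

// Scale n 3D points stored struct-of-arrays style: one __m256 per coordinate,
// 8 points per trip through the loop.
void scale_points(float *x,float *y,float *z,int n,float s) {
    __m256 vs=_mm256_broadcast_ss(&s); // s copied into all 8 lanes
    for (int i=0;i<n;i+=8) {
        _mm256_storeu_ps(&x[i],_mm256_mul_ps(_mm256_loadu_ps(&x[i]),vs));
        _mm256_storeu_ps(&y[i],_mm256_mul_ps(_mm256_loadu_ps(&y[i]),vs));
        _mm256_storeu_ps(&z[i],_mm256_mul_ps(_mm256_loadu_ps(&z[i]),vs));
    }
}

Here's a wrapper class that makes this sort of thing much less painful to write: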
#include <immintrin.h> /* AVX + SSE4 intrinsics header */
#include <iostream> /* for the ostream print operator below */
using std::ostream;

// Internal class: do not use...
class not_vec8 {
__m256 v; // bitwise inverse of our value (!!)
public:
not_vec8(__m256 val) {v=val;}
__m256 get(void) const {return v;} // returns INVERSE of our value (!!)
};

// This is the class to use!
class vec8 {
__m256 v;
public:
vec8(__m256 val) {v=val;}
vec8(const float *src) {v=_mm256_loadu_ps(src);}
vec8(float x) {v=_mm256_broadcast_ss(&x);}

vec8 operator+(const vec8 &rhs) const {return _mm256_add_ps(v,rhs.v);}
vec8 operator-(const vec8 &rhs) const {return _mm256_sub_ps(v,rhs.v);}
vec8 operator*(const vec8 &rhs) const {return _mm256_mul_ps(v,rhs.v);}
vec8 operator/(const vec8 &rhs) const {return _mm256_div_ps(v,rhs.v);}
vec8 operator&(const vec8 &rhs) const {return _mm256_and_ps(v,rhs.v);}
vec8 operator|(const vec8 &rhs) const {return _mm256_or_ps(v,rhs.v);}
vec8 operator^(const vec8 &rhs) const {return _mm256_xor_ps(v,rhs.v);}
vec8 operator==(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_EQ_OQ);}
vec8 operator!=(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_NEQ_OQ);}
vec8 operator<(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_LT_OQ);}
vec8 operator<=(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_LE_OQ);}
vec8 operator>(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_GT_OQ);}
vec8 operator>=(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_GE_OQ);}

not_vec8 operator~(void) const {return not_vec8(v);}

__m256 get(void) const {return v;}

float *store(float *ptr) { // note: _mm256_store_ps needs ptr to be 32-byte aligned (use _mm256_storeu_ps otherwise)
_mm256_store_ps(ptr,v);
return ptr;
}

float &operator[](int index) { return ((float *)&v)[index]; }
float operator[](int index) const { return ((const float *)&v)[index]; }

friend ostream &operator<<(ostream &o,const vec8 &y) { // print all 8 floats
o<<y[0]<<" "<<y[1]<<" "<<y[2]<<" "<<y[3]<<" ";
o<<y[4]<<" "<<y[5]<<" "<<y[6]<<" "<<y[7];
return o;
}
friend vec8 operator&(const vec8 &lhs,const not_vec8 &rhs)
{return _mm256_andnot_ps(rhs.get(),lhs.get());}
friend vec8 operator&(const not_vec8 &lhs, const vec8 &rhs)
{return _mm256_andnot_ps(lhs.get(),rhs.get());}
vec8 if_then_else(const vec8 &then,const vec8 &else_part) const {
return _mm256_or_ps( _mm256_and_ps(v,then.v),
_mm256_andnot_ps(v, else_part.v)
);
}
};

vec8 sqrt(const vec8 &v) {
return _mm256_sqrt_ps(v.get());
}
vec8 rsqrt(const vec8 &v) {
return _mm256_rsqrt_ps(v.get());
}
/* Return value: per-lane dot product of a & b.  Note _mm256_hadd_ps only adds
   within each 128-bit lane, so the low 4 floats hold the dot product of
   elements 0-3 (replicated 4 times) and the high 4 floats hold the dot
   product of elements 4-7. */
inline vec8 dot(const vec8 &a,const vec8 &b) {
vec8 t=a*b;
__m256 vt=_mm256_hadd_ps(t.get(),t.get());
vt=_mm256_hadd_ps(vt,vt);
return vt;
}

float data[8]={1.2,2.3,3.4,1.5,10.0,100.0,1000.0,10000.0};
vec8 a(data);
vec8 b(10.0);
vec8 c(0.0);

int foo(void) {
c=(a<b).if_then_else(3.7,c);
return 0;
}

(Try this in NetRun now!)

Performance, as you might expect, is still one arithmetic instruction per clock cycle despite the longer vectors--Intel just built wider floating point hardware!

The whole list of instructions is in "avxintrin.h" (/usr/lib/gcc/x86_64-linux-gnu/4.4/include/avxintrin.h on my machine).  Note that the compare functions still work in basically the same way as SSE, returning a mask that you then AND and OR to keep the values you want.
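For example, here's the same keep-what-you-want idiom written with the raw intrinsics (a hypothetical clamp_to helper, not part of the class above):

#include <immintrin.h>

// Clamp each of 8 floats to an upper bound, lane by lane.
__m256 clamp_to(__m256 v,__m256 maxval) {
    __m256 mask=_mm256_cmp_ps(v,maxval,_CMP_GT_OQ); // all-ones bits where v>maxval
    return _mm256_or_ps(_mm256_and_ps(mask,maxval), // take maxval where the mask is set...
                        _mm256_andnot_ps(mask,v));  // ...and the original v everywhere else
}

AVX also has _mm256_blendv_ps, which does the whole AND/ANDNOT/OR dance in a single instruction.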

Encoding 64-bit values in x86 Machine Code

Back in 2003, when AMD wanted to add 64 bit capability to their processors, they had two options:
  1. Throw away the old 32-bit machine code, and start from scratch.  This destroys backward compatibility--Intel has tried rebooting the instruction set with Itanium, which has not yet been a huge success.
  2. Try to patch the old 32-bit machine code into 64-bit mode.  This is what AMD chose to do, and it's what all 64-bit x86 machines are based on today.
Here's how 8-bit, 16-bit, 32-bit, and 64-bit registers are written in modern x86_64:
mov al,1234
mov ax,1234
mov eax,1234
mov rax,1234

(Try this in NetRun now!)

And the machine code:
   0:  b0 d2                   mov    al,0xd2
   2:  66 b8 d2 04             mov    ax,0x4d2
   6:  b8 d2 04 00 00          mov    eax,0x4d2
   b:  48 c7 c0 d2 04 00 00    mov    rax,0x4d2

prefix    opcode    data           assembly         meaning
          b0        d2             mov al,0xd2      8-bit load
66        b8        d2 04          mov ax,0x4d2     load with a 16-bit prefix (0x66)
          b8        d2 04 00 00    mov eax,0x4d2    load with default size of 32 bits
48        c7 c0     d2 04 00 00    mov rax,0x4d2    sign-extended load using REX 64-bit prefix (0x48)

Everything but that last line is unchanged 32-bit machine code.  This incremental approach was good for everybody--old 32-bit binaries keep running on the new chips--though it does make the 64-bit encodings a bit of a patchwork, as we'll see below.
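To make that byte layout concrete, here's a little decoder sketch (my own code, not something from NetRun) that picks apart exactly those four byte sequences--note how the rax version sign-extends its 4-byte immediate out to 64 bits:

#include <cstdint>
#include <cstdio>
#include <vector>

// Decode just the four "mov register,immediate" encodings shown in the table above.
void decode_mov(const std::vector<unsigned char> &b) {
    size_t i=0;
    bool size16=false, rex_w=false;
    if (b[i]==0x66) { size16=true; i++; }                 // 16-bit operand-size prefix
    if ((b[i]&0xf0)==0x40) { rex_w=(b[i]&0x08)!=0; i++; } // REX prefix; bit 3 is REX.W
    unsigned char op=b[i++];
    if (op==0xb0)                 // mov al,imm8
        printf("mov al,0x%x\n",b[i]);
    else if (op==0xb8 && size16)  // mov ax,imm16
        printf("mov ax,0x%x\n",b[i]|(b[i+1]<<8));
    else if (op==0xb8)            // mov eax,imm32
        printf("mov eax,0x%x\n",b[i]|(b[i+1]<<8)|(b[i+2]<<16)|((unsigned)b[i+3]<<24));
    else if (op==0xc7 && rex_w) { // mov rax: ModR/M 0xc0, then an imm32 sign-extended to 64 bits
        int32_t imm=(int32_t)(b[i+1]|(b[i+2]<<8)|(b[i+3]<<16)|((unsigned)b[i+4]<<24));
        printf("mov rax,0x%llx\n",(unsigned long long)(int64_t)imm);
    }
}

int main(void) {
    decode_mov({0xb0,0xd2});
    decode_mov({0x66,0xb8,0xd2,0x04});
    decode_mov({0xb8,0xd2,0x04,0x00,0x00});
    decode_mov({0x48,0xc7,0xc0,0xd2,0x04,0x00,0x00});
    return 0;
}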

Encoding New Registers in 64-bit x86 code

In addition to allowing 64-bit operations, AMD extended the x86 register set to a total of 16 general-purpose registers: the original eight (rax through rdi) plus the new r8 through r15.

Regarding bit allocations, the REX 64-bit prefix byte always starts with "0x4" in its high nibble.  It then has four additional bits: the high bit (REX.W) says we're doing a 64-bit operation, and the low three bits (REX.R, REX.X, and REX.B) supply the missing fourth bit of the register numbers--R extends the ModR/M reg field, X the SIB index, and B the ModR/M r/m (or base) field!
   0:  48 01 c0    add    rax,rax
   3:  48 01 c8    add    rax,rcx
   6:  48 01 d0    add    rax,rdx
   9:  48 01 d8    add    rax,rbx
   c:  48 01 e0    add    rax,rsp
   f:  48 01 e8    add    rax,rbp
  12:  48 01 f0    add    rax,rsi
  15:  48 01 f8    add    rax,rdi
  18:  4c 01 c0    add    rax,r8
  1b:  4c 01 c8    add    rax,r9
  1e:  4c 01 d0    add    rax,r10
  21:  4c 01 d8    add    rax,r11
  24:  4c 01 e0    add    rax,r12
  27:  4c 01 e8    add    rax,r13
  2a:  4c 01 f0    add    rax,r14
  2d:  4c 01 f8    add    rax,r15

(Try this in NetRun now!)

Note how the same opcode and ModR/M bytes "0x01 0xc0" read the source from rax when the REX prefix is 0x48, but from r8 when the REX prefix is 0x4c--the extra register bit lives in the prefix.

The fact that the low three bits of the source register number are in the last byte, while the high bit is in the first byte, is a consequence of this strange patching.  At the circuit level, this sort of thing is fairly easy to handle--just pull out the bits wherever they show up, and run all the wires to the destination--but it looks weird and is a bit more work for the compiler.
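Here's a tiny encoder sketch (hypothetical helper name, but the output bytes match the disassembly above) that rebuilds the "add rax,<register>" encodings by splitting a 4-bit register number across the REX prefix and the ModR/M byte:

#include <cstdio>

// Emit the 3 bytes of "add rax,reg n" for register number n (0=rax ... 15=r15).
// ADD r/m64,r64 is opcode 0x01; the source register goes in the ModR/M "reg" field.
void encode_add_rax(int n) {
    unsigned char rex   = 0x48 | ((n>>3)<<2);    // 0100WRXB: W=1 (64-bit), R=high bit of n
    unsigned char modrm = 0xc0 | ((n&7)<<3) | 0; // mod=11 (register), reg=low 3 bits of n, r/m=rax
    printf("%02x 01 %02x\n",rex,modrm);
}

int main(void) {
    for (int n=0;n<16;n++) encode_add_rax(n); // prints the same 16 encodings as the listing above
    return 0;
}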

It's still worth it!

Encoding AVX Instructions

In addition to the wider 256-bit ymm registers, AVX allows three-operand arithmetic.
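You get the three-operand form for free from the intrinsics--the compiler can keep both inputs live (a hypothetical snippet, just to show the difference):

#include <immintrin.h>

// With SSE, addps overwrites its destination; with AVX, vaddps writes a third register,
// so both inputs survive without extra register-to-register copies.
__m256 add_and_keep(__m256 a,__m256 b) {
    __m256 sum=_mm256_add_ps(a,b); // compiles to a three-operand vaddps under -mavx
    return _mm256_mul_ps(sum,a);   // a is still available here
}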

Here's how SSE instructions are encoded:
   7:  0f 58 c4    addps  xmm0,xmm4
   a:  0f 58 cc    addps  xmm1,xmm4
   d:  0f 58 d4    addps  xmm2,xmm4
  10:  0f 58 dc    addps  xmm3,xmm4

  14:  0f 58 c4    addps  xmm0,xmm4
  17:  0f 58 c5    addps  xmm0,xmm5
  1a:  0f 58 c6    addps  xmm0,xmm6
  1d:  0f 58 c7    addps  xmm0,xmm7
(Try this in NetRun now!)
The 0x0f byte is the escape prefix that introduces this family of opcodes; with no 0x66, 0xf2, or 0xf3 prefix in front of it, the operation is packed single precision ("ps").  0x58 is the add opcode.  The source and destination registers are both packed into the ModR/M byte at the end.
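As a quick sketch (hypothetical helper, same spirit as the REX example earlier), the whole addps encoding is just three bytes, with both register numbers packed into the ModR/M byte:

#include <cstdio>

// Emit "addps xmm<dest>,xmm<src>" for 0 <= dest,src <= 7.
void encode_addps(int dest,int src) {
    unsigned char modrm = 0xc0 | (dest<<3) | src; // mod=11 (register-register), reg=dest, r/m=src
    printf("0f 58 %02x   addps xmm%d,xmm%d\n",modrm,dest,src);
}

int main(void) {
    for (int d=0;d<4;d++) encode_addps(d,4); // reproduces the first group above
    for (int s=4;s<8;s++) encode_addps(0,s); // reproduces the second group above
    return 0;
}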

Here's how AVX instructions are encoded.
   7:  c4 c1 5c 58 c0    vaddps ymm0,ymm4,ymm8
   c:  c4 c1 5c 58 c8    vaddps ymm1,ymm4,ymm8
  11:  c4 c1 5c 58 d0    vaddps ymm2,ymm4,ymm8
  16:  c4 c1 5c 58 d8    vaddps ymm3,ymm4,ymm8
                   ^^              ^    (ModR/M byte changes: the destination register)

  1c:  c4 c1 5c 58 c0    vaddps ymm0,ymm4,ymm8
  21:  c4 c1 54 58 c0    vaddps ymm0,ymm5,ymm8
  26:  c4 c1 4c 58 c0    vaddps ymm0,ymm6,ymm8
  2b:  c4 c1 44 58 c0    vaddps ymm0,ymm7,ymm8
             ^^                         ^    (second VEX byte changes: the first source register, the vvvv field)

  31:  c4 c1 5c 58 c0    vaddps ymm0,ymm4,ymm8
  36:  c4 c1 5c 58 c1    vaddps ymm0,ymm4,ymm9
  3b:  c4 c1 5c 58 c2    vaddps ymm0,ymm4,ymm10
  40:  c4 c1 5c 58 c3    vaddps ymm0,ymm4,ymm11
                   ^^                        ^    (ModR/M byte changes: the second source register)

(Try this in NetRun now!)

The 0xc4 byte is a VEX prefix.  This is one of several dozen opcodes that AMD bulldozed back in 2003 when it defined 64-bit mode, now rehabilitated for an entirely new use by Intel.  The next two bytes are VEX info: they select 256-bit operation, name the middle source register (the vvvv field, stored inverted), and carry the high bits of the other register numbers.  0x58 is still the add opcode, and the remaining source and the destination are in the ModR/M byte at the end.
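Here's a short decoder sketch (my own code; the bit layout is from Intel's VEX documentation) that pulls those fields back out of the first instruction above, c4 c1 5c 58 c0:

#include <cstdio>

int main(void) {
    // vaddps ymm0,ymm4,ymm8  =  c4 c1 5c 58 c0
    unsigned char vex1=0xc1, vex2=0x5c, modrm=0xc0; // the two VEX info bytes and the ModR/M byte
    int R    = !((vex1>>7)&1);        // stored inverted: high bit of the destination register
    int B    = !((vex1>>5)&1);        // stored inverted: high bit of the r/m source register
    int vvvv = (~(vex2>>3))&0xf;      // stored inverted: the middle source register
    int L    = (vex2>>2)&1;           // 1 = 256-bit ymm operation
    int dest = ((modrm>>3)&7)|(R<<3); // ModR/M reg field plus VEX.R
    int src2 = (modrm&7)|(B<<3);      // ModR/M r/m field plus VEX.B
    printf("vaddps ymm%d,ymm%d,ymm%d (%s)\n",dest,vvvv,src2,L?"256-bit":"128-bit");
    return 0;
}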

Weird, but it works!