# Intel's Advanced Vector Extensions (AVX)

CS 641 Lecture, Dr. Lawlor

AVX has upgraded SSE to use 256-bit registers.  These can contain up to 8 floats, or 4 doubles!  Clearly, with registers that wide, you want to use struct-of-arrays, not array-of-structs.  A single 3D XYZ point is pretty lonely inside an 8-wide vector, but three 8-wide vectors of XXXXXXXX YYYYYYYY ZZZZZZZZ works fine.
```cpp
#include <iostream> /* for std::ostream */
#include <immintrin.h> /* AVX + SSE4 intrinsics header */
using std::ostream;

// Internal class: do not use...
class not_vec8 {
	__m256 v; // bitwise inverse of our value (!!)
public:
	not_vec8(__m256 val) {v=val;}
	__m256 get(void) const {return v;} // returns INVERSE of our value (!!)
};

// This is the class to use!
class vec8 {
	__m256 v;
public:
	vec8(__m256 val) {v=val;}
	vec8(const float *src) {v=_mm256_loadu_ps(src);}
	vec8(float x) {v=_mm256_broadcast_ss(&x);}

	vec8 operator+(const vec8 &rhs) const {return _mm256_add_ps(v,rhs.v);}
	vec8 operator-(const vec8 &rhs) const {return _mm256_sub_ps(v,rhs.v);}
	vec8 operator*(const vec8 &rhs) const {return _mm256_mul_ps(v,rhs.v);}
	vec8 operator/(const vec8 &rhs) const {return _mm256_div_ps(v,rhs.v);}
	vec8 operator&(const vec8 &rhs) const {return _mm256_and_ps(v,rhs.v);}
	vec8 operator|(const vec8 &rhs) const {return _mm256_or_ps(v,rhs.v);}
	vec8 operator^(const vec8 &rhs) const {return _mm256_xor_ps(v,rhs.v);}
	vec8 operator==(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_EQ_OQ);}
	vec8 operator!=(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_NEQ_OQ);}
	vec8 operator<(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_LT_OQ);}
	vec8 operator<=(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_LE_OQ);}
	vec8 operator>(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_GT_OQ);}
	vec8 operator>=(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_GE_OQ);}
	not_vec8 operator~(void) const {return not_vec8(v);}
	__m256 get(void) const {return v;}
	float *store(float *ptr) { /* ptr must be 32-byte aligned */
		_mm256_store_ps(ptr,v);
		return ptr;
	}
	float &operator[](int index) { return ((float *)&v)[index]; }
	float operator[](int index) const { return ((const float *)&v)[index]; }
	friend ostream &operator<<(ostream &o,const vec8 &y) {
		for (int i=0;i<8;i++) o<<y[i]<<" ";
		return o;
	}
	friend vec8 operator&(const vec8 &lhs,const not_vec8 &rhs)
		{return _mm256_andnot_ps(rhs.get(),lhs.get());}
	friend vec8 operator&(const not_vec8 &lhs, const vec8 &rhs)
		{return _mm256_andnot_ps(lhs.get(),rhs.get());}
	vec8 if_then_else(const vec8 &then,const vec8 &else_part) const {
		return _mm256_or_ps( _mm256_and_ps(v,then.v),
			 _mm256_andnot_ps(v, else_part.v)
		);
	}
};
vec8 sqrt(const vec8 &v) {
	return _mm256_sqrt_ps(v.get());
}
vec8 rsqrt(const vec8 &v) {
	return _mm256_rsqrt_ps(v.get());
}
/* Dot product of a & b.  Note _mm256_hadd_ps adds within each 128-bit
   half, so each half ends up with the dot product of that half,
   replicated 4 times. */
inline vec8 dot(const vec8 &a,const vec8 &b) {
	vec8 t=a*b;
	__m256 vt=_mm256_hadd_ps(t.get(),t.get());
	vt=_mm256_hadd_ps(vt,vt);
	return vt;
}

float data[8]={1.2,2.3,3.4,1.5,10.0,100.0,1000.0,10000.0};
vec8 a(data);
vec8 b(10.0);
vec8 c(0.0);
int foo(void) {
	c=(a<b).if_then_else(3.7,c);
	return 0;
}
```

(Try this in NetRun now!)

Performance, as you might hope, is still one vector instruction per clock cycle despite the doubled width--Intel simply built wider floating-point hardware!

The whole list of instructions is in "avxintrin.h" (/usr/lib/gcc/x86_64-linux-gnu/4.4/include/avxintrin.h on my machine).  Note that the compare functions still work in basically the same way as SSE, returning a mask that you then AND and OR to keep the values you want.

## Encoding 64-bit values in x86 Machine Code

Back in 2003, when AMD wanted to add 64 bit capability to their processors, they had two options:
1. Throw away the old 32-bit machine code, and start from scratch.  This destroys backward compatibility--Intel has tried rebooting the instruction set with Itanium, which has not yet been a huge success.
2. Try to patch the old 32-bit machine code into 64-bit mode.  This is what AMD chose to do, and it's what all 64-bit x86 machines are based on today.
Here's how 8-bit, 16-bit, 32-bit, and 64-bit registers are written in modern x86_64:
```asm
mov al,1234
mov ax,1234
mov eax,1234
mov rax,1234
```

(Try this in NetRun now!)

And the machine code:
```
   0:	b0 d2                	mov    al,0xd2
   2:	66 b8 d2 04          	mov    ax,0x4d2
   6:	b8 d2 04 00 00       	mov    eax,0x4d2
   b:	48 c7 c0 d2 04 00 00 	mov    rax,0x4d2
```

| prefix | opcode | data | assembly | meaning |
|--------|--------|------|----------|---------|
|        | `b0` | `d2` | `mov al,0xd2` | 8-bit load |
| `66`   | `b8` | `d2 04` | `mov ax,0x4d2` | load with a 16-bit prefix (0x66) |
|        | `b8` | `d2 04 00 00` | `mov eax,0x4d2` | load with default size of 32 bits |
| `48`   | `c7 c0` | `d2 04 00 00` | `mov rax,0x4d2` | sign-extended load using REX 64-bit prefix (0x48) |

Everything but that last line is exactly 100% 32-bit machine code.  This incremental approach is beneficial for everybody:
• Compiler writers only need to make minor changes for 64-bit machines.
• Ancient 32-bit machine code still runs fine, and can be incrementally rewritten to use 64-bit as needed.
• The 64-bit instruction decode hardware can be almost entirely shared with the old 32-bit decode path.
There are downsides, however:
• You're stuck with a lot of the old junk from the 1970's.  Note how 8-bit loads use far fewer bytes than 64-bit loads, despite the fact that 64-bit loads are probably more common in modern code.
• Some of the bit allocations are... strange.

## Encoding New Registers in 64-bit x86 code

In addition to allowing 64-bit operations, AMD extended the x86 register set to have a total of 16 registers.

Regarding bit allocations, the REX prefix byte always has 0x4 as its high nibble.  The remaining four bits are flags: the high bit (REX.W) says we're doing a 64-bit operation, and the low three bits (REX.R, REX.X, and REX.B) each supply the high (fourth) bit of a register number!
```
   0:	48 01 c0             	add    rax,rax
   3:	48 01 c8             	add    rax,rcx
   6:	48 01 d0             	add    rax,rdx
   9:	48 01 d8             	add    rax,rbx
   c:	48 01 e0             	add    rax,rsp
   f:	48 01 e8             	add    rax,rbp
  12:	48 01 f0             	add    rax,rsi
  15:	48 01 f8             	add    rax,rdi
  18:	4c 01 c0             	add    rax,r8
  1b:	4c 01 c8             	add    rax,r9
  1e:	4c 01 d0             	add    rax,r10
  21:	4c 01 d8             	add    rax,r11
  24:	4c 01 e0             	add    rax,r12
  27:	4c 01 e8             	add    rax,r13
  2a:	4c 01 f0             	add    rax,r14
  2d:	4c 01 f8             	add    rax,r15
```

(Try this in NetRun now!)

Note how the same bytes "0x01 0xc0" read from rax with a REX prefix of 0x48, but read from r8 with a REX prefix of 0x4c.

The fact that the low three bits of the source register number live in the last byte, while the high bit lives in the first byte, is a consequence of this strange patching.  At the circuit level, this sort of thing is fairly easy to handle--just pull out the bits wherever they show up, and run all the wires to the destination--but it looks weird and is a bit more work for the compiler.

It's still worth it!

## Encoding AVX Instructions

In addition to the wider 256-bit ymm registers, AVX allows three-operand arithmetic.

Here's how SSE instructions are encoded:
```
   7:	0f 58 c4             	addps  xmm0,xmm4
   a:	0f 58 cc             	addps  xmm1,xmm4
   d:	0f 58 d4             	addps  xmm2,xmm4
  10:	0f 58 dc             	addps  xmm3,xmm4
  14:	0f 58 c4             	addps  xmm0,xmm4
  17:	0f 58 c5             	addps  xmm0,xmm5
  1a:	0f 58 c6             	addps  xmm0,xmm6
  1d:	0f 58 c7             	addps  xmm0,xmm7
```

(Try this in NetRun now!)
The 0x0f byte is an escape byte selecting the two-byte opcode map; 0x58 is the opcode, and with no size prefix in front it means the packed-single "ps" form of add.  The source and destination are in the ModR/M byte at the end.

Here's how AVX instructions are encoded.
```
   7:	c4 c1 5c 58 c0       	vaddps ymm0,ymm4,ymm8
   c:	c4 c1 5c 58 c8       	vaddps ymm1,ymm4,ymm8
  11:	c4 c1 5c 58 d0       	vaddps ymm2,ymm4,ymm8
  16:	c4 c1 5c 58 d8       	vaddps ymm3,ymm4,ymm8
                    ^                     ^
  1c:	c4 c1 5c 58 c0       	vaddps ymm0,ymm4,ymm8
  21:	c4 c1 54 58 c0       	vaddps ymm0,ymm5,ymm8
  26:	c4 c1 4c 58 c0       	vaddps ymm0,ymm6,ymm8
  2b:	c4 c1 44 58 c0       	vaddps ymm0,ymm7,ymm8
               ^                               ^
  31:	c4 c1 5c 58 c0       	vaddps ymm0,ymm4,ymm8
  36:	c4 c1 5c 58 c1       	vaddps ymm0,ymm4,ymm9
  3b:	c4 c1 5c 58 c2       	vaddps ymm0,ymm4,ymm10
  40:	c4 c1 5c 58 c3       	vaddps ymm0,ymm4,ymm11
                     ^                              ^
```

(Try this in NetRun now!)

The 0xc4 is a VEX prefix.  This is one of several dozen opcodes that AMD bulldozed back in 2003, now rehabilitated for an entirely new use by Intel.  The next two bytes are VEX info, saying 256-bit operation, giving the middle register and all the high bits of the register numbers.  0x58 is still the opcode.  The last source and destination are in the ModR/M byte at the end.

Weird, but it works!