Intel's Advanced Vector Extensions (AVX)

CS 641 Lecture, Dr. Lawlor

AVX has upgraded SSE to use 256-bit registers.  These can contain up to 8 floats, or 4 doubles!  Clearly, with registers that wide, you want to use struct-of-arrays, not array-of-structs.  A single 3D XYZ point is pretty lonely inside an 8-wide vector, but three 8-wide vectors holding XXXXXXXX, YYYYYYYY, and ZZZZZZZZ work fine.
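For example, here's a minimal sketch of that struct-of-arrays layout (my own illustration, using only standard AVX intrinsics; the scale_points name and the assumption that n is a multiple of 8 are just to keep it short):

#include <immintrin.h>

// Scale n 3D points stored struct-of-arrays style: one __m256 per coordinate,
// 8 points per trip through the loop.
void scale_points(float *x,float *y,float *z,int n,float s) {
    __m256 vs=_mm256_broadcast_ss(&s); // s copied into all 8 lanes
    for (int i=0;i<n;i+=8) {
        _mm256_storeu_ps(&x[i],_mm256_mul_ps(_mm256_loadu_ps(&x[i]),vs));
        _mm256_storeu_ps(&y[i],_mm256_mul_ps(_mm256_loadu_ps(&y[i]),vs));
        _mm256_storeu_ps(&z[i],_mm256_mul_ps(_mm256_loadu_ps(&z[i]),vs));
    }
}

Here's a wrapper class that makes this sort of thing much less painful to write: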
#include <immintrin.h> /* AVX + SSE4 intrinsics header */
#include <iostream> /* for the ostream print operator below */
using std::ostream;

// Internal class: do not use...
class not_vec8 {
__m256 v; // bitwise inverse of our value (!!)
public:
not_vec8(__m256 val) {v=val;}
__m256 get(void) const {return v;} // returns INVERSE of our value (!!)
};

// This is the class to use!
class vec8 {
__m256 v;
public:
vec8(__m256 val) {v=val;}
vec8(const float *src) {v=_mm256_loadu_ps(src);}
vec8(float x) {v=_mm256_broadcast_ss(&x);}

vec8 operator+(const vec8 &rhs) const {return _mm256_add_ps(v,rhs.v);}
vec8 operator-(const vec8 &rhs) const {return _mm256_sub_ps(v,rhs.v);}
vec8 operator*(const vec8 &rhs) const {return _mm256_mul_ps(v,rhs.v);}
vec8 operator/(const vec8 &rhs) const {return _mm256_div_ps(v,rhs.v);}
vec8 operator&(const vec8 &rhs) const {return _mm256_and_ps(v,rhs.v);}
vec8 operator|(const vec8 &rhs) const {return _mm256_or_ps(v,rhs.v);}
vec8 operator^(const vec8 &rhs) const {return _mm256_xor_ps(v,rhs.v);}
vec8 operator==(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_EQ_OQ);}
vec8 operator!=(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_NEQ_OQ);}
vec8 operator<(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_LT_OQ);}
vec8 operator<=(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_LE_OQ);}
vec8 operator>(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_GT_OQ);}
vec8 operator>=(const vec8 &rhs) const {return _mm256_cmp_ps(v,rhs.v,_CMP_GE_OQ);}

not_vec8 operator~(void) const {return not_vec8(v);}

__m256 get(void) const {return v;}

float *store(float *ptr) { // note: _mm256_store_ps needs ptr to be 32-byte aligned (use _mm256_storeu_ps otherwise)
_mm256_store_ps(ptr,v);
return ptr;
}

float &operator[](int index) { return ((float *)&v)[index]; }
float operator[](int index) const { return ((const float *)&v)[index]; }

friend ostream &operator<<(ostream &o,const vec8 &y) { // print all 8 floats
o<<y[0]<<" "<<y[1]<<" "<<y[2]<<" "<<y[3]<<" ";
o<<y[4]<<" "<<y[5]<<" "<<y[6]<<" "<<y[7];
return o;
}
friend vec8 operator&(const vec8 &lhs,const not_vec8 &rhs)
{return _mm256_andnot_ps(rhs.get(),lhs.get());}
friend vec8 operator&(const not_vec8 &lhs, const vec8 &rhs)
{return _mm256_andnot_ps(lhs.get(),rhs.get());}
vec8 if_then_else(const vec8 &then,const vec8 &else_part) const {
return _mm256_or_ps( _mm256_and_ps(v,then.v),
_mm256_andnot_ps(v, else_part.v)
);
}
};

vec8 sqrt(const vec8 &v) {
return _mm256_sqrt_ps(v.get());
}
vec8 rsqrt(const vec8 &v) {
return _mm256_rsqrt_ps(v.get());
}
/* Return value: per-lane dot product of a & b.  Note _mm256_hadd_ps only adds
   within each 128-bit lane, so the low 4 floats hold the dot product of
   elements 0-3 (replicated 4 times) and the high 4 floats hold the dot
   product of elements 4-7. */
inline vec8 dot(const vec8 &a,const vec8 &b) {
vec8 t=a*b;
__m256 vt=_mm256_hadd_ps(t.get(),t.get());
vt=_mm256_hadd_ps(vt,vt);
return vt;
}

float data[8]={1.2,2.3,3.4,1.5,10.0,100.0,1000.0,10000.0};
vec8 a(data);
vec8 b(10.0);
vec8 c(0.0);

int foo(void) {
c=(a<b).if_then_else(3.7,c);
return 0;
}

(Try this in NetRun now!)

Performance, as you might expect, is still one arithmetic instruction per clock cycle despite the longer vectors--Intel just built wider floating point hardware!

The whole list of instructions is in "avxintrin.h" (/usr/lib/gcc/x86_64-linux-gnu/4.4/include/avxintrin.h on my machine).  Note that the compare functions still work in basically the same way as SSE, returning a mask that you then AND and OR to keep the values you want.
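For example, here's the same keep-what-you-want idiom written with the raw intrinsics (a hypothetical clamp_to helper, not part of the class above):

#include <immintrin.h>

// Clamp each of 8 floats to an upper bound, lane by lane.
__m256 clamp_to(__m256 v,__m256 maxval) {
    __m256 mask=_mm256_cmp_ps(v,maxval,_CMP_GT_OQ); // all-ones bits where v>maxval
    return _mm256_or_ps(_mm256_and_ps(mask,maxval), // take maxval where the mask is set...
                        _mm256_andnot_ps(mask,v));  // ...and the original v everywhere else
}

AVX also has _mm256_blendv_ps, which does the whole AND/ANDNOT/OR dance in a single instruction.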

Encoding 64-bit values in x86 Machine Code

Back in 2003, when AMD wanted to add 64 bit capability to their processors, they had two options:
  1. Throw away the old 32-bit machine code, and start from scratch.  This destroys backward compatibility--Intel has tried rebooting the instruction set with Itanium, which has not yet been a huge success.
  2. Try to patch the old 32-bit machine code into 64-bit mode.  This is what AMD chose to do, and it's what all 64-bit x86 machines are based on today.
Here's how 8-bit, 16-bit, 32-bit, and 64-bit registers are written in modern x86_64:
mov al,1234
mov ax,1234
mov eax,1234
mov rax,1234

(Try this in NetRun now!)

And the machine code:
   0:  b0 d2                   mov    al,0xd2
   2:  66 b8 d2 04             mov    ax,0x4d2
   6:  b8 d2 04 00 00          mov    eax,0x4d2
   b:  48 c7 c0 d2 04 00 00    mov    rax,0x4d2

prefix    opcode    data           assembly         meaning
          b0        d2             mov al,0xd2      8-bit load
66        b8        d2 04          mov ax,0x4d2     load with a 16-bit prefix (0x66)
          b8        d2 04 00 00    mov eax,0x4d2    load with default size of 32 bits
48        c7 c0     d2 04 00 00    mov rax,0x4d2    sign-extended load using REX 64-bit prefix (0x48)

Everything but that last line is unchanged 32-bit machine code.  This incremental approach was good for everybody--old 32-bit binaries keep running on the new chips--though it does make the 64-bit encodings a bit of a patchwork, as we'll see below.
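To make that byte layout concrete, here's a little decoder sketch (my own code, not something from NetRun) that picks apart exactly those four byte sequences--note how the rax version sign-extends its 4-byte immediate out to 64 bits:

#include <cstdint>
#include <cstdio>
#include <vector>

// Decode just the four "mov register,immediate" encodings shown in the table above.
void decode_mov(const std::vector<unsigned char> &b) {
    size_t i=0;
    bool size16=false, rex_w=false;
    if (b[i]==0x66) { size16=true; i++; }                 // 16-bit operand-size prefix
    if ((b[i]&0xf0)==0x40) { rex_w=(b[i]&0x08)!=0; i++; } // REX prefix; bit 3 is REX.W
    unsigned char op=b[i++];
    if (op==0xb0)                 // mov al,imm8
        printf("mov al,0x%x\n",b[i]);
    else if (op==0xb8 && size16)  // mov ax,imm16
        printf("mov ax,0x%x\n",b[i]|(b[i+1]<<8));
    else if (op==0xb8)            // mov eax,imm32
        printf("mov eax,0x%x\n",b[i]|(b[i+1]<<8)|(b[i+2]<<16)|((unsigned)b[i+3]<<24));
    else if (op==0xc7 && rex_w) { // mov rax: ModR/M 0xc0, then an imm32 sign-extended to 64 bits
        int32_t imm=(int32_t)(b[i+1]|(b[i+2]<<8)|(b[i+3]<<16)|((unsigned)b[i+4]<<24));
        printf("mov rax,0x%llx\n",(unsigned long long)(int64_t)imm);
    }
}

int main(void) {
    decode_mov({0xb0,0xd2});
    decode_mov({0x66,0xb8,0xd2,0x04});
    decode_mov({0xb8,0xd2,0x04,0x00,0x00});
    decode_mov({0x48,0xc7,0xc0,0xd2,0x04,0x00,0x00});
    return 0;
}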

Encoding New Registers in 64-bit x86 code

In addition to allowing 64-bit operations, AMD extended the x86 register set to a total of 16 general-purpose registers: the original eight (rax through rdi) plus the new r8 through r15.

Regarding bit allocations, the REX 64-bit prefix byte always starts with "0x4" in its high nibble.  It then has four additional bits: the high bit (REX.W) says we're doing a 64-bit operation, and the low three bits (REX.R, REX.X, and REX.B) supply the missing fourth bit of the register numbers--R extends the ModR/M reg field, X the SIB index, and B the ModR/M r/m (or base) field!
   0:  48 01 c0    add    rax,rax
   3:  48 01 c8    add    rax,rcx
   6:  48 01 d0    add    rax,rdx
   9:  48 01 d8    add    rax,rbx
   c:  48 01 e0    add    rax,rsp
   f:  48 01 e8    add    rax,rbp
  12:  48 01 f0    add    rax,rsi
  15:  48 01 f8    add    rax,rdi
  18:  4c 01 c0    add    rax,r8
  1b:  4c 01 c8    add    rax,r9
  1e:  4c 01 d0    add    rax,r10
  21:  4c 01 d8    add    rax,r11
  24:  4c 01 e0    add    rax,r12
  27:  4c 01 e8    add    rax,r13
  2a:  4c 01 f0    add    rax,r14
  2d:  4c 01 f8    add    rax,r15

(Try this in NetRun now!)

Note how the same opcode and ModR/M bytes "0x01 0xc0" read the source from rax when the REX prefix is 0x48, but from r8 when the REX prefix is 0x4c--the extra register bit lives in the prefix.

The fact that the low three bits of the source register number are in the last byte, while the high bit is in the first byte, is a consequence of this strange patching.  At the circuit level, this sort of thing is fairly easy to handle--just pull out the bits wherever they show up, and run all the wires to the destination--but it looks weird and is a bit more work for the compiler.
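Here's a tiny encoder sketch (hypothetical helper name, but the output bytes match the disassembly above) that rebuilds the "add rax,<register>" encodings by splitting a 4-bit register number across the REX prefix and the ModR/M byte:

#include <cstdio>

// Emit the 3 bytes of "add rax,reg n" for register number n (0=rax ... 15=r15).
// ADD r/m64,r64 is opcode 0x01; the source register goes in the ModR/M "reg" field.
void encode_add_rax(int n) {
    unsigned char rex   = 0x48 | ((n>>3)<<2);    // 0100WRXB: W=1 (64-bit), R=high bit of n
    unsigned char modrm = 0xc0 | ((n&7)<<3) | 0; // mod=11 (register), reg=low 3 bits of n, r/m=rax
    printf("%02x 01 %02x\n",rex,modrm);
}

int main(void) {
    for (int n=0;n<16;n++) encode_add_rax(n); // prints the same 16 encodings as the listing above
    return 0;
}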

It's still worth it!

Encoding AVX Instructions

In addition to the wider 256-bit ymm registers, AVX allows three-operand arithmetic.
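You get the three-operand form for free from the intrinsics--the compiler can keep both inputs live (a hypothetical snippet, just to show the difference):

#include <immintrin.h>

// With SSE, addps overwrites its destination; with AVX, vaddps writes a third register,
// so both inputs survive without extra register-to-register copies.
__m256 add_and_keep(__m256 a,__m256 b) {
    __m256 sum=_mm256_add_ps(a,b); // compiles to a three-operand vaddps under -mavx
    return _mm256_mul_ps(sum,a);   // a is still available here
}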

Here's how SSE instructions are encoded:
   7:  0f 58 c4    addps  xmm0,xmm4
   a:  0f 58 cc    addps  xmm1,xmm4
   d:  0f 58 d4    addps  xmm2,xmm4
  10:  0f 58 dc    addps  xmm3,xmm4

  14:  0f 58 c4    addps  xmm0,xmm4
  17:  0f 58 c5    addps  xmm0,xmm5
  1a:  0f 58 c6    addps  xmm0,xmm6
  1d:  0f 58 c7    addps  xmm0,xmm7
(Try this in NetRun now!)
The 0x0f byte is the escape prefix that introduces this family of opcodes; with no 0x66, 0xf2, or 0xf3 prefix in front of it, the operation is packed single precision ("ps").  0x58 is the add opcode.  The source and destination registers are both packed into the ModR/M byte at the end.
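As a quick sketch (hypothetical helper, same spirit as the REX example earlier), the whole addps encoding is just three bytes, with both register numbers packed into the ModR/M byte:

#include <cstdio>

// Emit "addps xmm<dest>,xmm<src>" for 0 <= dest,src <= 7.
void encode_addps(int dest,int src) {
    unsigned char modrm = 0xc0 | (dest<<3) | src; // mod=11 (register-register), reg=dest, r/m=src
    printf("0f 58 %02x   addps xmm%d,xmm%d\n",modrm,dest,src);
}

int main(void) {
    for (int d=0;d<4;d++) encode_addps(d,4); // reproduces the first group above
    for (int s=4;s<8;s++) encode_addps(0,s); // reproduces the second group above
    return 0;
}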

Here's how AVX instructions are encoded.
   7:  c4 c1 5c 58 c0    vaddps ymm0,ymm4,ymm8
   c:  c4 c1 5c 58 c8    vaddps ymm1,ymm4,ymm8
  11:  c4 c1 5c 58 d0    vaddps ymm2,ymm4,ymm8
  16:  c4 c1 5c 58 d8    vaddps ymm3,ymm4,ymm8
                   ^^              ^    (ModR/M byte changes: the destination register)

  1c:  c4 c1 5c 58 c0    vaddps ymm0,ymm4,ymm8
  21:  c4 c1 54 58 c0    vaddps ymm0,ymm5,ymm8
  26:  c4 c1 4c 58 c0    vaddps ymm0,ymm6,ymm8
  2b:  c4 c1 44 58 c0    vaddps ymm0,ymm7,ymm8
             ^^                         ^    (second VEX byte changes: the first source register, the vvvv field)

  31:  c4 c1 5c 58 c0    vaddps ymm0,ymm4,ymm8
  36:  c4 c1 5c 58 c1    vaddps ymm0,ymm4,ymm9
  3b:  c4 c1 5c 58 c2    vaddps ymm0,ymm4,ymm10
  40:  c4 c1 5c 58 c3    vaddps ymm0,ymm4,ymm11
                   ^^                        ^    (ModR/M byte changes: the second source register)

(Try this in NetRun now!)

The 0xc4 byte is a VEX prefix.  This is one of several dozen opcodes that AMD bulldozed back in 2003 when it defined 64-bit mode, now rehabilitated for an entirely new use by Intel.  The next two bytes are VEX info: they select 256-bit operation, name the middle source register (the vvvv field, stored inverted), and carry the high bits of the other register numbers.  0x58 is still the add opcode, and the remaining source and the destination are in the ModR/M byte at the end.
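Here's a short decoder sketch (my own code; the bit layout is from Intel's VEX documentation) that pulls those fields back out of the first instruction above, c4 c1 5c 58 c0:

#include <cstdio>

int main(void) {
    // vaddps ymm0,ymm4,ymm8  =  c4 c1 5c 58 c0
    unsigned char vex1=0xc1, vex2=0x5c, modrm=0xc0; // the two VEX info bytes and the ModR/M byte
    int R    = !((vex1>>7)&1);        // stored inverted: high bit of the destination register
    int B    = !((vex1>>5)&1);        // stored inverted: high bit of the r/m source register
    int vvvv = (~(vex2>>3))&0xf;      // stored inverted: the middle source register
    int L    = (vex2>>2)&1;           // 1 = 256-bit ymm operation
    int dest = ((modrm>>3)&7)|(R<<3); // ModR/M reg field plus VEX.R
    int src2 = (modrm&7)|(B<<3);      // ModR/M r/m field plus VEX.B
    printf("vaddps ymm%d,ymm%d,ymm%d (%s)\n",dest,vvvv,src2,L?"256-bit":"128-bit");
    return 0;
}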

Weird, but it works!