AVX: Advanced Vector Extensions

Good reference info: Intel's online Intrinsics Guide documents every AVX intrinsic.

The C/C++ AVX intrinsic functions are in the header "immintrin.h".

AVX uses dedicated 256-bit registers, with these C/C++ types:
	__m256 for single-precision floats
	__m256d for double-precision floats
	__m256i for integer data

The 256 bit (32 byte) registers have enough space to store:
	8 floats, or
	4 doubles, or
	8 32-bit ints (or 16 shorts, or 32 chars)

For example, here's how you operate on 8 floats at a time, using the dedicated AVX _mm256 intrinsic functions.  Note how we use _ps (packed single-precision) instructions for both the load and the add.

#include "immintrin.h"

void foo(void)
{
	float f[8] __attribute__((aligned(32)))={1.0,2.0,1.2,2.1, 5.2,5.3,10.1,11.0};
	__m256 v=_mm256_load_ps(&f[0]); /* load 8 floats at once */
	v=_mm256_add_ps(v,v); /* add all 8 pairs at once */
	_mm256_store_ps(&f[0],v); /* store 8 floats back */
}

(Try this in NetRun now!)

And here's how you operate on 8 ints at a time.  For ints, the arithmetic uses _epi32, and the load and store use the weird _si256 type, which just means one giant 256-bit block of integer data.

#include "immintrin.h"

void foo(void)
{
	int f[8] __attribute__((aligned(32)))={1,2,0,3, 5,5,10,11};
	__m256i v=_mm256_load_si256((const __m256i *)&f[0]);
	v=_mm256_add_epi32(v,v); /* add 8 ints at once (AVX2) */
	_mm256_store_si256((__m256i *)&f[0],v);
}

(Try this in NetRun now!)


Using AVX to Speed Up Array Code

Usually it's as easy as just operating on 8 iterations of the loop at a time!  But there are often minor complications, like scalar values that need to be broadcast to fill all 8 slots:

#include "immintrin.h"

const int n=1024;
float a[n], b[n];
float c=3.0;

long foo(void) {
	bool use_AVX=true;
	if (use_AVX) 
	{ // fancy AVX loop:
		__m256 C=_mm256_broadcast_ss(&c); // splat c across all SIMD lanes
		for (int i=0;i<n;i+=8) {
			// b[i]=a[i]*c;
			__m256 A=_mm256_load_ps(&a[i]);
			__m256 B=_mm256_mul_ps(A,C);
			_mm256_store_ps(&b[i],B);
		}
	}
	else
	{ // simple float loop: 
		for (int i=0;i<n;i++) {
			b[i]=a[i]*c;
		}
	}
	return b[0];
}

(Try this in NetRun now!)

On my Skylake machine, using AVX takes only 44ns to compute 1000 floats; the simple float loop takes 294ns (a 6.68x speedup!).

Where things get really tricky is when each float wants to do its own separate operations, like per-float branching.  AVX handles branches exactly like SSE does: you compute both sides of the branch for every lane, use a compare instruction to build a per-lane mask, and then blend the two results together.

Alignment in AVX

AVX load and store instructions expect their memory operands to be aligned on a 32-byte boundary.  On most 32-bit systems, malloc and new only return pointers aligned to an 8-byte boundary, which caused problems for SSE's required 16-byte alignment; so on most 64-bit systems, malloc and new return data aligned to a 16-byte boundary.  Half the time you'll get lucky and the pointer will also happen to land on a 32-byte boundary, but the other half of the time it won't, and your AVX loads will crash.

If you're doing allocations yourself, C style, the function "_mm_malloc(size_in_bytes,32)" will return you a 32-byte aligned pointer, and it's available in the same Intel headers as the other _mm_ intrinsics.  

For std::vector, you should use an aligning allocator like my osl/alignocator.h as an additional template argument:

   std::vector<float, alignocator<float,32> > myVec;

If you don't do this, your first AVX load has a 50% chance of crashing.

Of course, you still need to advance through the vector in steps of 8 floats (32 bytes) from its start, so each load stays aligned.