High Performance Programming via SIMD: Single Instruction, Multiple Data

"SIMD" stands for Single Instruction Multiple Data:

Single, meaning just one.
Instruction, as in a machine code instruction, executed by hardware.
Multiple, as in more than one--from 2 to a thousand or so.
Data, as in floats or ints.

The basic idea: one instruction operates on multiple data items simultaneously. This is a form of data parallelism.

You can do lots of interesting SIMD work without using any special instructions--plain old C will do, if you treat an "int" as 32 completely independent bits, because any normal bitwise "int" instruction will operate on all the bits at once. This use of bitwise operations is often called "SIMD within a register (SWAR)" or "word-SIMD"; see Sean Anderson's "Bit Twiddling Hacks" for a variety of amazing examples.

But the most common form of SIMD today is the "multimedia" instruction set extensions in normal CPUs. The usual arrangement for multimedia instructions is for single instructions to operate on four 32-bit floats. These four-float instructions exist almost everywhere nowadays:

x86, where they are called "SSE" (Streaming SIMD Extensions) or "AVX" (Advanced Vector eXtensions)
PowerPC, where they are called "AltiVec"
Cell Broadband Engine, where they are part of the "Synergistic Processing Elements"
Graphics cards, where they are part of the pixel processors

We'll look at the x86 version, SSE, in detail later this week.

Branching in SIMD

One big problem in SIMD is branching. If half the elements in a single SIMD register need one instruction, and half need a different instruction, you can't do them both in a single instruction.

So the AND-OR trick is used to simulate branches. It is useful when you're trying to convert a loop like this to SIMD:

	for (int i=0;i<n;i++) {
		if (vec[i]<7)
			vec[i]=vec[i]*a+b;
		else
			vec[i]=c;
	}

(Try this in NetRun now!)

You can implement this branch by setting a mask indicating where vec[i]<7, and then using the mask to pick the correct side of the branch to squash. Note that this code is the *exact* software version of a 2-input mux circuit!

	for (int i=0;i<n;i++) {
		unsigned int mask=(vec[i]<7)?0xffFFffFF:0;
		vec[i]=((vec[i]*a+b)&mask) | (c&~mask);
	}

Written in ordinary sequential code, this is actually a slowdown, not a speedup! There are lots of tricks to speed this up, like building the 0xffFFffFF 'mask' from the sign bit of "vec[i]-7" (a negative difference means vec[i]<7; zero or positive means the other side of the branch). Also, many SIMD instruction sets, including SSE, have a multi-compare instruction that returns a mask of all zeros or all ones in each element. The implied intent is for you to use the bitwise trick above to avoid any branching!

SIMD Within a Register (SWAR)

SIMD execution can speed things up even without fancy instructions. For example, consider the fact that this single bitwise AND instruction (&) computes 64 separate bit operations, assuming "long" is 64 bits, as it is on 64-bit Linux and Mac:

long a=0x00FF00FF00FFFFFFL;
long b=0x0F0FFFFF00000FFFL;
long c=a&b;
printf("a=%016lX\nb=%016lX\nc=%016lX\n",a,b,c);

(Try this in NetRun now!)

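Results, with the hex digits grouped in fours for readability, and the OR and XOR of the same two inputs shown for comparison: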

a=00FF 00FF 00FF FFFF
b=0F0F FFFF 0000 0FFF
& 000F 00FF 0000 0FFF
| 0FFF FFFF 00FF FFFF
^ 0FF0 FF00 00FF F000

This is useful anytime you need a bunch of bit operations.

For example, the key operation in Conway's Game of Life is to collect up neighbor counts. We can use SIMD by keeping the three-bit neighbor counts for 64 cells stored "vertically" inside corresponding bits of three 64-bit long integers: one integer holds the low bit of every count, one the middle bit, and one the high bit. The machine's native addition operation doesn't do the right thing, but we can manually build half adders in software, like this:

/* Conway's Game of Life using SIMD bitwise operations 
Dr. Orion Lawlor, lawlor@alaska.edu, 2011-10-18 (Public Domain)
*/

#include <iostream>
#include <cstdlib> /* srand, rand */
#include <utility> /* std::swap */

/* This treats an integer as an array of bits: SIMD Within A Register */
typedef unsigned long swar; 

/* This stores an array of 3-bit counters *vertically* */
class set_of_counters {
public:
	swar L, M, H; // 3-bit saturating counters, one per *bit*
	set_of_counters() {L=M=H=0;}

	/* Add one bit from N to each of our counters */
	void add(swar N) {
		// low bit half adder
		swar Lcarry=L&N;
		L=L^N;
		
		// middle bit half adder
		swar Mcarry=M&Lcarry;
		M=M^Lcarry;
		
		// last bit saturates
		H=H|Mcarry; 
	}
};
/*
 Run the rules for the game of life on these three rows.
 prev is the row above you, next is the row below you,
 cur is your current row.

 A 1 bit indicates a living cell.
*/
swar run_game(swar prev,swar cur,swar next) {
// Each counter stores the number of neighbors around this cell
	set_of_counters c;

// Add all eight neighbors to our counts
	c.add(prev>>1);	c.add(prev); c.add(prev<<1);
	c.add(cur>>1);               c.add(cur<<1);
	c.add(next>>1); c.add(next); c.add(next<<1);

// Run the rules on the resulting counts
	swar are2=(~c.H)&c.M&(~c.L); // 2==010
	swar are3=(~c.H)&c.M&c.L; // 3==011
	return (are2&cur)|are3; // if 2, unchanged.  If 3, alive.
}

// A 2D grid containing game of life cells.
class grid {
public:
	enum {ht=30};
	swar data[ht];
	
	// Create random initial conditions
	void randomize(int seed=1) {
		srand(seed);
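		swar mask=1; mask=~mask; // all ones except the low bit
		mask=mask<<1; mask=mask>>1; // shift up and back down to also knock off the high bit
		for (int y=0;y<ht;y++) data[y]=rand()&mask; // rand() only fills bits up to RAND_MAX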
		data[0]=data[ht-1]=0; // clear top and bottom rows
	}
	
	// Run one game of life update, writing new cells into dest.
	void update(grid &dest) const {
		for (int y=1;y<ht-1;y++) {
			dest.data[y]=run_game(data[y-1],data[y],data[y+1]);
		}
	}

	// Print the grid onscreen
	void print(void) {
		for (int y=1;y<ht-1;y++) {
			for (unsigned int x=0;x<8*sizeof(swar);x++) {
				int bit=(data[y]>>x)&1;
				if (bit) std::cout<<"X"; else std::cout<<" ";
			}
			std::cout<<"\n";
		}
	}
};	

int foo(void) {
	grid a,b;
	a.randomize(1);
	
// Run time loop 
	for (int step=0;step<50;step++) {
		std::cout<<"iteration "<<step<<"\n";
		a.print();

		a.update(b);
		std::swap(a,b);
	}
	return 0;
}
(Try this in NetRun now!)

Compared to a straightforward loop across x, this is much, much faster:

update_swar: 381.17 ns/call
update_bits: 15880.91 ns/call
Program complete.  Return 0 (0x0)
(Try this in NetRun now!)

This is a speedup of over 40x (15880.91/381.17 is about 41.7); not quite the 64x you'd get with perfect SIMD execution on a 64-bit machine, but pretty close!

The downside, of course, is that the software is deeply bizarre; it feels much closer to a circuit design than code! Unresolved question: how can we design a language/compiler that will allow a computer to make the sequential-to-SIMD code transformation?

CS 441 Lecture Note, 2015, Dr. Orion Lawlor, UAF Computer Science Department.