Single Instruction, Multiple Data (SIMD)

CS 641 Lecture, Dr. Lawlor

"SIMD" stands for Single Instruction Multiple Data:

Single, meaning just one.
Instruction, as in a machine code instruction, executed by hardware.
Multiple, as in more than one--from 2 to a thousand or so.
Data, as in floats or ints.

The basic idea: one instruction operates on multiple data items simultaniously. This is a form of software parallelism.

You can do lots interesting SIMD work without using any special instructions--plain old C will do, if you treat an "int" as 32 completely independent bits, because any normal "int" instructions will operate on all the bits at once. This use of bitwise operations is often called "SIMD within a register (SWAR)" or "word-SIMD"; see Sean Anderson's "Bit Twiddling Hacks" for a variety of amazing examples.

Back in the 1980's, "vector" machines were quite popular in supercomputing centers. For example, the 1988 Cray Y-MP was a typical vector machine. When I was an undergraduate, ARSC still had a Y-MP vector machine. The Y-MP had eight "vector" registers, each of which held 64 doubles (that's a total of 4KB of registers!). A single Y-MP machine language instruction could add all 64 corresponding numbers in two vector registers, which enabled the Y-MP to achieve the (then) mind-blowing speed of *millions* floating-point operations per second. Vector machines have now almost completely died out; the NEC SV-1 and the Japanese "Earth Simulator" are the last of this breed. Vector machines are classic SIMD, because one instruction can modify 64 doubles.

But the most common form of SIMD today are the "multimedia" instruction set extensions in normal CPUs. The usual arrangment for multimedia instructions is for single instructions to operate on four 32-bit floats. These four-float instructions exist almost everywhere nowdays:

x86, where they are called "SSE" (Streaming SIMD Extensions) or "AVX" (Advanced Vector eXtensions)
PowerPC, where they are called "AltiVec"
Cell Broadband Engine, where they are part of the "Synergistic Processing Elements"
Graphics cards, where they are part of the pixel processors

We'll look at the x86 version, SSE, in detail below.

SIMD Within a Register (SWAR)

SIMD execution can speed things up even without fancy instructions. For example, the key operation in Conway's Game of Life is to collect up neighbor counts. We can use SIMD by keeping the three-bit neighbor counts for 64 neighbors stored "vertically" inside corresponding bits of three 64-bit long integers. The machine's native addition operation doesn't do the right thing, but we can manually build a half adder in software, like this:

/* Conway's Game of Life using SIMD bitwise operations 
Dr. Orion Lawlor, lawlor@alaska.edu, 2011-10-18 (Public Domain)
*/

/* This treats an integer as an array of bits: SIMD Within A Register */
typedef unsigned long swar; 

/* This stores an array of 3-bit counters *vertically* */
class set_of_counters {
public:
	swar L, M, H; // 3-bit saturating counters, one per *bit*
	set_of_counters() {L=M=H=0;}

	/* Add one bit from N to each of our counters */
	void add(long N) {
		// low bit half adder
		swar Lcarry=L&N;
		L=L^N;
		
		// middle bit half adder
		swar Mcarry=M&Lcarry;
		M=M^Lcarry;
		
		// last bit saturates
		H=H|Mcarry; 
	}
};
/*
 Run the rules for the game of life on these three rows.
 prev is the row above you, next is the row below you,
 cur is your current row.

 A 1 bit indicates a living cell.
*/
swar run_game(swar prev,swar cur,swar next) {
// Each counter stores the number of neighbors around this cell
	set_of_counters c;

// Add all eight neighbors to our counts
	c.add(prev>>1);	c.add(prev); c.add(prev<<1);
	c.add(cur>>1);               c.add(cur<<1);
	c.add(next>>1); c.add(next); c.add(next<<1);

// Run the rules on the resulting counts
	long are2=(~c.H)&c.M&(~c.L); // 2==010
	long are3=(~c.H)&c.M&c.L; // 3==011
	return (are2&cur)|are3; // if 2, unchanged.  If 3, alive.
}

// A 2D grid containing game of life cells.
class grid {
public:
	enum {ht=30};
	swar data[ht];
	
	// Create random initial conditions
	void randomize(int seed=1) {
		srand(seed);
		swar mask=1; mask=~mask; 
		mask=mask<<1; mask=mask>>1;// knock off high bit
		for (int y=0;y<ht;y++) data[y]=rand()&mask;
		data[0]=data[ht-1]=0; // clear top and bottom rows
	}
	
	// Run one game of life update, writing new cells into dest.
	void update(grid &dest) const {
		for (int y=1;y<ht-1;y++) {
			dest.data[y]=run_game(data[y-1],data[y],data[y+1]);
		}
	}

	// Print the grid onscreen
	void print(void) {
		for (int y=1;y<ht-1;y++) {
			for (unsigned int x=0;x<8*sizeof(swar);x++) {
				int bit=(data[y]>>x)&1;
				if (bit) std::cout<<"X"; else std::cout<<" ";
			}
			std::cout<<"\n";
		}
	}
};	

int foo(void) {
	grid a,b;
	a.randomize(1);
	
// Run time loop 
	for (int step=0;step<50;step++) {
		std::cout<<"iteration "<<step<<"\n";
		a.print();

		a.update(b);
		std::swap(a,b);
	}
	return 0;
}
(Try this in NetRun now!)

Compared to a straightforward loop across x, this is much, much faster:

update_swar: 381.17 ns/call
update_bits: 15880.91 ns/call
Program complete.  Return 0 (0x0)
(Try this in NetRun now!)

This is a factor of over 40x speedup; not quite the 64x you'd get with perfect SIMD execution on a 64-bit machine, but pretty close!

The downside, of course, is that the software is deeply bizarre; it feels closer to a circuit design than code.