SIMD Within a Register: SWAR

CS 441 Lecture, Dr. Lawlor

SIMD execution can speed things up even without fancy instructions.  For example, the key operation in Conway's Game of Life is to collect up neighbor counts.  We can use SIMD by keeping the three-bit neighbor counts for 64 neighbors stored "vertically" inside corresponding bits of three 64-bit long integers.  The machine's native addition operation doesn't do the right thing, but we can manually build a half adder in software, like this:
/* Conway's Game of Life using SIMD bitwise operations 
Dr. Orion Lawlor, lawlor@alaska.edu, 2011-10-18 (Public Domain)
*/

/* This treats an integer as an array of bits: SIMD Within A Register */
typedef unsigned long swar;

/* This stores an array of 3-bit counters *vertically* */
class set_of_counters {
public:
swar L, M, H; // 3-bit saturating counters, one per *bit*
set_of_counters() {L=M=H=0;}

/* Add one bit from N to each of our counters */
void add(long N) {
// low bit half adder
swar Lcarry=L&N;
L=L^N;

// middle bit half adder
swar Mcarry=M&Lcarry;
M=M^Lcarry;

// last bit saturates
H=H|Mcarry;
}
};
/*
Run the rules for the game of life on these three rows.
prev is the row above you, next is the row below you,
cur is your current row.

A 1 bit indicates a living cell.
*/
swar run_game(swar prev,swar cur,swar next) {
// Each counter stores the number of neighbors around this cell
set_of_counters c;

// Add all eight neighbors to our counts
c.add(prev>>1); c.add(prev); c.add(prev<<1);
c.add(cur>>1); c.add(cur<<1);
c.add(next>>1); c.add(next); c.add(next<<1);

// Run the rules on the resulting counts
long are2=(~c.H)&c.M&(~c.L); // 2==010
long are3=(~c.H)&c.M&c.L; // 3==011
return (are2&cur)|are3; // if 2, unchanged. If 3, alive.
}

// A 2D grid containing game of life cells.
class grid {
public:
enum {ht=30};
swar data[ht];

// Create random initial conditions
void randomize(int seed=1) {
srand(seed);
swar mask=1; mask=~mask;
mask=mask<<1; mask=mask>>1;// knock off high bit
for (int y=0;y<ht;y++) data[y]=rand()&mask;
data[0]=data[ht-1]=0; // clear top and bottom rows
}

// Run one game of life update, writing new cells into dest.
void update(grid &dest) const {
for (int y=1;y<ht-1;y++) {
dest.data[y]=run_game(data[y-1],data[y],data[y+1]);
}
}

// Print the grid onscreen
void print(void) {
for (int y=1;y<ht-1;y++) {
for (unsigned int x=0;x<8*sizeof(swar);x++) {
int bit=(data[y]>>x)&1;
if (bit) std::cout<<"X"; else std::cout<<" ";
}
std::cout<<"\n";
}
}
};

int foo(void) {
grid a,b;
a.randomize(1);

// Run time loop
for (int step=0;step<50;step++) {
std::cout<<"iteration "<<step<<"\n";
a.print();

a.update(b);
std::swap(a,b);
}
return 0;
}
(Try this in NetRun now!)
Compared to a straightforward loop across x, this is much, much faster:
update_swar: 381.17 ns/call
update_bits: 15880.91 ns/call
Program complete. Return 0 (0x0)
(Try this in NetRun now!)
This is a factor of over 40x speedup; not quite the 64x you'd get with perfect SIMD execution on a 64-bit machine, but pretty close!

The downside, of course, is that the software is deeply bizarre; it feels closer to a circuit design than code.