- Single, meaning just one.

- Instruction, as in a machine code instruction, executed by hardware.

- Multiple, as in more than one--from 2 to a thousand or so.

- Data, as in floats or ints.

- & AND, bitwise AND. Output bits are 1 only if both corresponding input bits are 1. This is useful to "mask out" bits you don't want, by ANDing them with zero.

As Ints

As Bits

3&5 == 1

0011&0101 == 0001

3&6 == 2

0011&0110 == 0010

3&4 == 0

0011&0100 == 0000

- 0=A&0 (AND by 0's creates 0's--used for masking)
- A=A&~0 (AND by 1's has no effect)
- A=A&A (AND by yourself has no effect)
- | OR, bitwise OR. Output bits are 1 if either input bit is 1. E.g., 3|5 == 7; or 011 | 101 == 111.

As Ints

As Bits

3|0 == 3

0011|0000 == 0011

3|3 == 3

0011|0011 == 0011

1|4 == 5

0001|0100 == 0101

- A=A|0 (OR by 0's has no effect)
- ~0=A|~0 (OR by 1's creates 1's)
- A=A|A (OR by yourself has no effect)
- ^ XOR, bitwise XOR. Output bits are 1 if either input bit is 1, but not both. E.g., 3^5 == 6; or 011 ^ 101 == 110. Note how the low bit is 0, because both input bits are 1.

As Ints

As Bits

3^5 == 6

0011&0101 == 0110

3^6 == 5

0011&0110 == 0101

3^4 == 7

0011&0100 == 0111

- A=A^0 (XOR by zeros has no effect)
- ~A = A ^ ~0 (XOR by 1's inverts all the bits)

- 0=A^A (XOR by yourself creates 0's--used in cryptography)
- ~ NOT, bitwise invert. Output bits are 1 if the
corresponding input bit is zero. E.g., ~011 ==
111....111100. (The number of leading ones depends on the size of
the machine's "int".)

As Ints

As Bits

~0 == -1

~...0000 == ...1111

- << Left shift (points left). Makes values bigger, by shifting them
into higher-valued places. E.g., 3<<1 == 6. You can
shift by as many bits as you want; 1<<n pushes a 1 up into bit
n. You can't shift by a negative number of bits.

As Ints

As Bits

3<<0 == 3

0011<<0 == 0011

3<<1 == 6

0011<<1 == 0110

3<<2 == 12

0011<<2 == 1100

- >> Right shift (points right). Makes values smaller, by shifting
them into lower-valued places. E.g., 6>>1 == 3. Note
the bits in the lowest places just "fall off the end" and vanish.

As Ints

As Bits

3>>0 == 3

0011>>0 == 0011

3>>1 == 1

0011>>1 == 0001

3>>2 == 0

0011>>2 == 0000

A |
B |
A&B (AND) |
A|B (OR) |
A^B (XOR) |

0 |
0 |
0 |
0 |
0 |

0 |
1 |
0 |
1 |
1 |

1 |
0 |
0 |
1 |
1 |

1 |
1 |
1 |
1 |
0 |

Interpretations: |
Both-1 | Either-1 | Different | |

Spreads-0 |
Spreads-1 |
Flip A if B is set |

One warning: bitwise operations have strange precedence. Specifically, "A&B==C" is interpreted by default like "A&(B==C)", not "(A&B)==C"; and "A>>B+C" is interpreted as "A>>(B+C)". This can cause horrible, hard-to-understand errors, so I just stick in parenthesis around everything when I'm using bitwise operators.

For example, say you're Google. For each possible word, you store a big list of documents containing one 32-bit integer per document. Each bit of that integer corresponds to a section of the document where the word might occur (e.g., bit 0 is set if the word occurs in the first 3% of the document, bit 1 represents the next 3%, and so on). If you've got two search terms, for a given document you can look up two integers, A and B, that represent where those terms appear in the document. If both terms appear near the same place in the document, that's good--and we can find a list of all those places (where both bits are set to 1) in just one clock cycle using just "A&B"! The same idea shows up in the "region codes" of Cohen-Sutherland clipping, used in computer graphics.

But what if I want to compare two documents for similarity? Note that C-like languages don't provide a "bitwise-compare" operator, that sets equal bits to 1 and unequal bits to 0 (because == compares whole integers, not individual bits). So how can you compute this?

Well, all the bitwise operations do interesting and useful things when inverted:

A |
B |
~(A&B) (NAND) |
~(A|B) (NOR) |
~(A^B) (XNOR) |

0 |
0 |
1 |
1 |
1 |

0 |
1 |
1 |
0 |
0 |

1 |
0 |
1 |
0 |
0 |

1 |
1 |
0 |
0 |
1 |

Interpretations: |
Either-0 |
Both-0 | Equality |

Note that "~(A^B)" is 1 if both input bits are identical, and 0 if they're different--so it's perfect for computing document similarity!

The final major bitwise operation is bit shifting, represented in C-like languages using ">>" and "<<". Left shift ("<<", pointing left), shifts bits into higher positions, filling the now-empty lower positions with zeros. For unsigned numbers, right shifting (">>", pointing right) shifts bits into lower positions and fills the empties with zeros. If you wants to search for both-term matches in B that are one bit off of those in A, you can use "A&(B<<1)" or "A&(B>>1)", depending on the direction you want to check. To check both directions, use "(A&(B<<1))|(A&(B>>1))".

The genome uses just 4 digits (ACGT, for the acids Adenine, Cytosine, Guanine, and Thymine), and so can be represented in as few as 2 bits per nucleotide (genetic digit). Former UAF CS undergraduate James Long, now up the hill doing bioinformatics research, developed a very fast sofware implementation of the Smith-Waterman string (or gene) matching algorithm called CBL that uses this SIMD-within-a-register approach.

Floats store numbers in an odd way--they're really storing the number in scientific notation, like

x = + 3.785746 * 10

Note that:

- You only need one bit to represent the sign--plus or minus.
- The exponent's just an integer, so you can store it as an integer.
- The 3.785746 part, called the "mantissa", can be stored as the integer 3785746 (at least as long as you can figure out where the decimal point goes!)

In binary, you can represent a non-integer like "two and three-eighths" as "10.011". That is, there's:

- a 1 in the 2's place (2=2
^{1}) - a 0 in the 1's place (1=2
^{0}) - a 0 in the (beyond the "binary point") 1/2's place (1/2=2
^{-1}), - a 1 in the 1/4's place (1/4=2
^{-2}), and - a 1 in the 1/8's place (1/8=2
^{-3})

x = + 3.785746 * 10

It's common to "normalize" a number in scientific notation so that:

- There's exactly one digit to the left of the decimal point.
- And that digit ain't zero.

In binary, a "normalized" number *always* has a 1 at the left of the decimal point (if it ain't zero, it's gotta be one). So sometimes there's no reason to even store the 1; you just know it's there!

(Note that there are also "denormalized" numbers, like 0.0, that don't have a leading 1. This is how zero is represented--there's an implicit leading 1 only if the exponent field is nonzero, an implicit leading 0 if the exponent field is zero...)

A "float" (as defined by IEEE Standard 754) consists of three bitfields:

Sign |
Exponent |
Fraction (or
"Mantissa") |

1 bit-- 0 for positive 1 for negative |
8 unsigned bits-- 127 means 2 ^{0
} 137 means 2^{10
} |
23 bits-- a binary fraction. Don't forget the implicit leading 1! |

The hardware interprets a float as having the value:

value = (-1)

Note that the mantissa has an implicit leading binary 1 applied (unless the exponent field is zero, when it's an implicit leading 0; a "denormalized" number).

For example, the value "8" would be stored with sign bit 0, exponent 130 (==3+127), and mantissa 000... (without the leading 1), since:

8 = (-1)

You can actually dissect the parts of a float using a "union" and a bitfield like so:

/* IEEE floating-point number's bits: sign exponent mantissa */(Executable NetRun link)

struct float_bits {

unsigned int fraction:23; /**< Value is binary 1.fraction ("mantissa") */

unsigned int exp:8; /**< Value is 2^(exp-127) */

unsigned int sign:1; /**< 0 for positive, 1 for negative */

};

/* A union is a struct where all the fields *overlap* each other */

union float_dissector {

float f;

float_bits b;

};

float_dissector s;

s.f=8.0;

std::cout<<s.f<<"= sign "<<s.b.sign<<" exp "<<s.b.exp<<" fract "<<s.b.fraction<<"\n";

return 0;

There are several different sizes of floating-point types:

C Datatype |
Size |
Approx. Precision |
Approx. Range |
Exponent Bits |
Fraction Bits |
+-1 range |

float |
4 bytes (everywhere) |
1.0x10^{-7} |
10^{38} |
8 |
23 |
2^{24} |

double |
8 bytes (everywhere) |
2.0x10^{-15} |
10^{308} |
11 |
52 |
2^{53} |

long double |
12-16 bytes (if it even exists) |
2.0x10^{-20} |
10^{4932} |
15 |
64 |
2^{65} |

Nowadays floats have roughly the same performance as integers: addition takes about two nanoseconds (slightly slower than integer addition); multiplication takes a few nanoseconds; and division takes a dozen or more nanoseconds. That is, floats are now cheap, and you can consider using floats for all sorts of stuff--even when you don't care about fractions.

- x86, where they are called "SSE", or "Streaming SIMD Extensions" ("MMX" is actually an older integer-only set of extensions)

- PowerPC, where they are called "AltiVec"
- Cell Broadband Engine, where they are part of the "Synergistic Processing Elements"
- Graphics cards, where they are part of the pixel processors

For example, "add" adds two integer registers, like eax and ebx. "addps" adds two SSE registers, like xmm3 and xmm6. There are SSE versions of most other arithmetic operations: subps, mulps, divps, etc.

There's one curious operation called "shufps", which does a permutation of the incoming floats. The permutation is controlled by 4 2-bit "select" fields, one for each destination float, that determines which incoming float goes to that slot. For example, a swizzle can copy the first float into all four destination slots, with a select field of 0x00. To copy the last float into all four destination slots, the select field is 0xff. A select field of 00010111, or 0x17, would set the first float from the last (float 3), copy float 1 into the middle two slots, and fill the last slot with the first float (float 0). This is the *only* operation which rearranges floats between different slots in the registers. (In graphics language, this operation is called a "swizzle".) For example, if xmm1 contains the four floats "(0.0, 1.1, 2.2, 3.3)", then:

shufps xmm1,xmm1, 0xff

will set all 4 of xmm1's floats to the last value (index 11).

There are actually two flavors of the SSE move instruction: "movups" moves a value between *unaligned* addresses (not a multiple of 16); "movaps" moves a value between *aligned* addresses. The aligned move is substantially faster, but it will segfault if the address you give isn't a multiple of 16! Luckily, most "malloc" implementations return you 16-byte aligned data, and the stack is aligned to 16 bytes with most compilers. But for maximum portability, for the aligned move to work you sometimes have to manually add padding (using pointer arithmetic), which is really painful.

Here's a simple SSE assembly example where we load up a constant, add it to itself with SSE, and then write it out:

movups xmm1,[thing1]; <- copy the four floats into xmm1

addps xmm1,xmm1; <- add floats to themselves

movups [retval],xmm1; <- move that constant into the global "retval"

; Print out retval

extern farray_print

push 4 ;<- number of floats to print

push retval ;<- points to array of floats

call farray_print

add esp,8 ; <- pop off arguments

ret

section .data

thing1: dd 10.2, 100.2, 1000.2, 10000.2;<- source constant

retval: dd 0.0, 0.0, 0.0, 0.0 ;<- our return value

Here's a version that loads up two separate float constants, and adds them:

movups xmm1,[thing1]; <- copy the four floats into xmm1The "dd" lines declare a 32-bit constant. NASM is smart enough to automatically use float format if you type a constant with a decimal point!

movups xmm6,[thing2]; <- copy the four floats into xmm1

addps xmm1,xmm6; <- add floats

movups [retval],xmm1; <- move that constant into the global "retval"

; Print out retval

extern farray_print

push 4 ;<- number of floats to print

push retval ;<- points to array of floats

call farray_print

add esp,8 ; <- pop off arguments

ret

section .data

thing1: dd 10.2, 100.2, 1000.2, 10000.2;<- source constant

thing2: dd 1.2, 2.2, 3.2, 4.2;<- source constant

retval: dd 0.0, 0.0, 0.0, 0.0 ;<- our return value

- GCC 3.x under linux. Be sure to compile with "-msse", or the header will whine.

- Visual C++ 7.0 under Windows.

- Intel C/C++ compiler under Linux or Windows. Makes bogus
complaint about "no EMMS instruction before return", although the
annoying emms() call is thankfully only needed with MMX, not SSE.

The C version of an SSE register is the user-friendly and self-explanatory type "__m128". All the instructions start with "_mm_" (i.e., MultiMedia). The suffix indicates the data type; in these examples, we'll just be talking about 4 floats, which use the suffix "_ps" (Packed Single-precision floats). SSE supports other data types in the same 128 bits, but 4 floats seems to be the sweet spot. So, for example, "_mm_load_ps" loads up 4 floats into a __m128, "_mm_add_ps" adds 4 corresponding floats together, etc. Major useful operations are:

__m128 _mm_load_ps(float *src) |
Load 4 floats from a 16-byte aligned address. WARNING: Segfaults if the address isn't a multiple of 16! |

__m128 _mm_loadu_ps(float *src) | Load 4 floats from an unaligned address (4x slower!) |

__m128 _mm_load1_ps(float *src) | Load 1 individual float into all 4 fields of an __m128 |

__m128 _mm_setr_ps(float a,float b,float c,float d) |
Load 4 separate floats from parameters into an __m128 |

void _mm_store_ps(float *dest,__m128 src) |
Store 4 floats to an aligned address. |

void _mm_storeu_ps(float *dest,__m128 src) | Store 4 floats to unaligned address |

__m128 _mm_add_ps(__m128 a,__m128 b) |
Add corresponding floats (also "sub") |

__m128 _mm_mul_ps(__m128 a,__m128 b) | Multiply corresponding floats (also "div", but it's slow) |

__m128 _mm_min_ps(__m128 a,__m128 b) | Take corresponding minimum (also "max") |

__m128 _mm_sqrt_ps(__m128 a) | Take square roots of 4 floats (12ns, slow like divide) |

__m128 _mm_rcp_ps(__m128 a) | Compute rough (12-bit accuracy) reciprocal of all 4 floats (as fast as an add!) |

__m128 _mm_rsqrt_ps(__m128 a) | Rough (12-bit) reciprocal-square-root of all 4 floats (fast) |

__m128 _mm_shuffle_ps(__m128 lo,__m128 hi, _MM_SHUFFLE(hi3,hi2,lo1,lo0)) |
Interleave inputs into low 2 floats and high 2 floats of output. Basically out[0]=lo[lo0]; out[1]=lo[lo1]; out[2]=hi[hi2]; out[3]=hi[hi3]; For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float a[i] into all 4 output floats. |

So take your classic "loop over floats":

for (int i=0;i<n_vals;i++) {(executable NetRun link)

vals[i]=vals[i]*a+b;

}

This takes about 4.5 ns per float.

Step 1 is to unroll this loop 4 times, since SSE works on blocks of 4 floats at a time:

for (int i=0;i<n_vals;i+=4) {(Try this in NetRun now!)

vals[i+0]=vals[i+0]*a+b;

vals[i+1]=vals[i+1]*a+b;

vals[i+2]=vals[i+2]*a+b;

vals[i+3]=vals[i+3]*a+b;

}

This alone speeds the code up by about 2x, because we don't have to check the loop counter as often.

We can then replace the guts of the loop with SSE instructions:

__m128 SSEa=_mm_load1_ps(&a);This gives us about a 4x speedup over the original, and still a 2x speedup over the unrolled version!

__m128 SSEb=_mm_load1_ps(&b);

__m128 v=_mm_load_ps(&vec[i]);

v=_mm_add_ps(_mm_mul_ps(v,SSEa),SSEb);

_mm_store_ps(&vec[i],v);

(Note: I'd previously called "_mm_load_ps1" in the above example, which works, but is a bit slower.)

Note that if we use the unaligned loads and stores (loadu and storeu), we lose almost all the performance benefit from using SSE in the first place!