Modern machines do tons of crazy stuff to get things happening in parallel:

- "Pipelining" is just a way to break up a long series of tasks into smaller pieces that can be executed simultaneously. x86 machines have been pipelined since the 486.

- "Superscalar" machines use fancy lookahead and scheduling algorithms to try to execute several instructions in parallel. All modern machines since about the Pentium era are superscalar.
- "Out-of-order" machines will actually analyze dependencies and execute instructions that have their data available before previous instructions have even finished. The Pentium II and subsequent machines execute out-of-order.
- "Load buffers" and "store buffers" allow memory reads and writes to proceed simultaneously with other processing. The Pentium 4 has a 24-entry store buffer, which means the processor can have 24 store instructions "in flight" at once!

But consider a program like this:

    for (int i=0;i<n_vals;i++) {
        vals[i]=vals[i]*a+b;
    }

There's actually n_vals-way parallelism in this program, but extracting that parallelism from the C program is really tricky, since n_vals, vals, a, or b could all be changing (because of another thread, because some pointer aliases the array, etc.).

What we really mean above is just:

    for all i simultaneously:
        vals[i]=vals[i]*a+b;

but that's not what the code above means in C; the C version means "set i to 0; do the loop; i++; if (i<n_vals) repeat", which is totally different!

Consider how different the loop becomes if we write:

    for (int i=1;i<n_vals;i++) {
        vals[i]=vals[i-1]*a+b;
    }

Now it's not a parallel loop at all: each iteration needs the result of the previous one.

Fundamentally, there's a bottleneck in the way we write code:

- The problems we want to solve often (but not always) have an obvious, trivially parallel structure.
- The code we write to solve the problem is written using the inherently sequential, one-thing-after-another model of all assembly language, C, C++, Java, etc.
- The machine we run the code on then goes to great lengths to
rediscover the original parallelism, and hence execute our sequential
code in parallel.

- People that write "parallelizing compilers" (e.g., OpenMP)
either automatically recognize "parallel loops" or allow you to provide
hints in your source code as to which loops can be parallelized.
These attempts are somewhat hamstrung by the fact that they're still
based on inherently sequential languages, which prevents many of the
kind of transformations you want to make for performance (e.g., load
balancing).

- People that write "parallel languages" (e.g., Parallel Prolog) attempt to describe the entire problem in an inherently parallel way. This is really tough for people to do, and usually isn't a good match to modern sequential-centric machines.
- People have added "parallel arithmetic" to virtually all modern hardware. This is a low-benefit but low-cost way to capture more parallelism.

The advantages of SIMD are:

- Up to a 4x speedup over normal sequential code. Note that 4
32-bit floats is 128 bits, so you can impress people by saying you've
got a "128-bit machine", which is true in some sense.

- 4 elements is perfect for doing calculations on colors (Red-Green-Blue-Alpha), 3D locations (XYZW), or 4-channel audio.

- Easy to implement in hardware, because the hardware doesn't have to *recognize* all 4 operations are independent, it's just *told* they are.

The disadvantages of SIMD:

- The "4" is hardcoded into the hardware and source code. For higher levels of parallelism, you're back to pipelining and chasing dependencies.

- Sometimes it's tricky to get exactly 4 copies of what you need--sometimes you need to add extra padding.

Hardware with this sort of SIMD support:

- x86: SSE, SSE2, SSE3. "MMX" is the older integer version.
- PowerPC: AltiVec
- Cell Processor
- All graphics cards (see next lecture)

Compilers that support the SSE intrinsics:

- GCC 3.x under Linux. Be sure to compile with "-msse", or the header will whine.
- Visual C++ 7.0 under Windows.
- Intel C/C++ compiler under Linux or Windows. It makes a bogus complaint about "no EMMS instruction before return", although the annoying emms() call is thankfully only needed with MMX, not SSE.

The C version of an SSE register is the friendly and self-explanatory type "__m128". All the instructions start with "_mm_" (i.e., MultiMedia). The suffix indicates the data type; in these examples, we'll just be talking about 4 floats, which use the suffix "_ps" (Packed Single-precision floats). SSE supports other data types in the same 128 bits, but 4 floats seems to be the sweet spot. So, for example, "_mm_load_ps" loads up 4 floats into a __m128, "_mm_add_ps" adds 4 corresponding floats together, etc. Major useful operations are:

__m128 _mm_load_ps(float *src) | Load 4 floats from a 16-byte aligned address.
__m128 _mm_loadu_ps(float *src) | Load 4 floats from an unaligned address (4x slower!)
__m128 _mm_load1_ps(float *src) | Load 1 float into all 4 fields of an __m128
__m128 _mm_setr_ps(float a,float b,float c,float d) | Load 4 floats from parameters into an __m128
void _mm_store_ps(float *dest,__m128 src) | Store 4 floats to a 16-byte aligned address.
void _mm_storeu_ps(float *dest,__m128 src) | Store 4 floats to an unaligned address
__m128 _mm_add_ps(__m128 a,__m128 b) | Add corresponding floats (also "sub")
__m128 _mm_mul_ps(__m128 a,__m128 b) | Multiply corresponding floats (also "div")
__m128 _mm_min_ps(__m128 a,__m128 b) | Take corresponding minimum (also "max")
__m128 _mm_sqrt_ps(__m128 a) | Take square roots of 4 floats (12ns, slow like divide)
__m128 _mm_rcp_ps(__m128 a) | Compute rough (12-bit accuracy) reciprocal of all 4 floats (as fast as an add!)
__m128 _mm_rsqrt_ps(__m128 a) | Rough (12-bit) reciprocal-square-root of all 4 floats (fast)
__m128 _mm_shuffle_ps(__m128 lo,__m128 hi, _MM_SHUFFLE(hi3,hi2,lo1,lo0)) | Interleave inputs into low 2 floats and high 2 floats of output. Basically out[0]=lo[lo0]; out[1]=lo[lo1]; out[2]=hi[hi2]; out[3]=hi[hi3]; For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float a[i] into all 4 output floats.

Be careful! The non "u" versions of load and store are much faster (and if you want performance, you really do need to use them), but they will segfault if the address you pass in isn't a multiple of 16 bytes! Luckily, most "malloc" implementations return you 16-byte aligned data, and the stack is aligned to 16 bytes with most compilers, especially if you've got __m128 objects in your stack frame. But for portable code, you may have to manually add padding (using pointer arithmetic), which is really painful.

When things work, like in the "times A plus B" array loop above, we can unroll the loop 4 times, and replace the guts of the loop with SSE instructions:

    __m128 va=_mm_load1_ps(&a); /* va contains 4 copies of a */
    __m128 vb=_mm_load1_ps(&b); /* vb contains 4 copies of b */
    for (int i=0;i<n_vals;i+=4) { /* careful! n_vals must be a multiple of 4! */
        __m128 v=_mm_load_ps(&vals[i]); /* careful about alignment! */
        v=_mm_mul_ps(v,va);
        v=_mm_add_ps(v,vb);
        _mm_store_ps(&vals[i],v);
    }

This gives us about a 4x speedup over the original, and still a 2x speedup over the unrolled version!