Comparing Levels of Parallel Execution
CS 441 Lecture, Dr. Lawlor
We've talked about a bunch of different levels of parallelism in this course so far:
- Pipelining, running different phases of different instructions at
the same time (e.g., fetching the next instruction while decoding the
current one). The limiting factor here is the number of pipeline
stages--more stages means more parallelism, but also more overhead and
higher latency. To hide the pipeline latency during a branch,
modern machines work very hard to predict branches. Pipeline
lengths vary from five stages for the classic
fetch-decode-read-execute-write pipeline to the Pentium 4's giant
30-stage pipeline. Pipelines for more recent multicore machines
are actually using fewer stages.
- Superscalar execution, running the same phases of different instructions at the same time (e.g., running two adds simultaneously). The limiting factor here is dependencies between instructions. Modern CPUs use tricks like register renaming to eliminate write-after-read (WAR) and write-after-write (WAW) dependencies, but true read-after-write (RAW) dependencies can't be renamed away. Modern CPUs also look far ahead in the program, across a large "window" of instructions, to find non-dependent work. Typical modern CPUs can execute a maximum of four instructions simultaneously, but usually true dependencies keep that from occurring: typically even an enormous 1000-instruction window only reveals a few dozen instructions that are ready to execute. Here's a good presentation on superscalar design; see also the dependency-chain sketch after this list.
- Single-Instruction-Multiple-Data (SIMD) instructions contain multiple independent operations in a single instruction. For example, x86 SSE instructions operate on 4 separate floats simultaneously (see the SSE sketch after this list). This is a much cheaper way to get hardware parallelism: instead of making the hardware carefully analyze sequential code to determine independent operations, the programmer just writes blocks of independent operations directly. Future CPUs will have even wider SIMD instructions: Intel's Larrabee x86 GPU will operate on 16 floats at a time. NVIDIA's GPU "thread warps" are really 32-float SIMD batches. ATI's GPU "wavefronts" operate on 64 floats at a time.
- Symmetric Multi-Processing (SMP) is where you replicate the CPU cores to run separate threads simultaneously. Aside from ensuring the threads see the same memory ("cache coherent multiprocessing"), this is almost trivial from the hardware side--you just cookie-cutter out a set of identical CPUs. It's a lot harder to take advantage of multicore from the software side, although OpenMP helps a lot (see the OpenMP sketch after this list). Currently, dual core is standard equipment on laptops, quad core is common on desktops, and six or eight cores are becoming popular. Terminology: multiple processors in a single piece of silicon == multicore; multiple pieces of silicon in a single pluggable device == multichip module; multiple pieces of silicon plugged into separate sockets == multi-socket SMP (and typically a lot more money too!).
- Simultaneous Multi-Threading (SMT) is a cheaper variant of SMP, where you only replicate the registers, but share the arithmetic hardware. The programming model is identical to SMP, typically OpenMP. Intel calls SMT "HyperThreading Technology", and usually has only two sets of registers sharing each core's ALUs. IBM's upcoming POWER7 and Sun's Niagara both have eight cores per die, with four threads per core, for a total of 32 threads per chip. GPUs typically have a ridiculously large fixed pool of registers (e.g., 16,384 registers!) that they divide up into a variable number of threads depending on how many registers the code currently needs; this can result in thousands of threads sharing a single arithmetic unit!
- Clustering is where you replicate the entire box--CPU, RAM, network, and often even disk. The advantage is that everything is separate and replicated, so there are fewer bottlenecks and virtually everything scales up (compute, memory bandwidth, network bandwidth, etc.). The downside is that nothing is shared, so you need to communicate explicitly, typically via MPI (see the MPI sketch after this list).
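To make the superscalar dependency story concrete, here's a minimal C++ sketch (the function and variable names are my own, for illustration): the first loop is one long RAW dependency chain, so the adds must wait on each other; the second uses four independent accumulators, giving the hardware independent instructions to issue simultaneously.

```cpp
// One long RAW dependency chain: each add needs the previous add's
// result, so superscalar hardware can't overlap the adds.
float sum_chained(const float *a, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i];            // depends on the previous iteration's sum
    return sum;
}

// Four independent chains: the four adds in each iteration don't
// depend on each other, so they can issue simultaneously.
float sum_unrolled(const float *a, int n) {  // assumes n is a multiple of 4
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```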
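Here's a minimal SSE sketch of the "4 floats per instruction" idea: each `_mm_add_ps` performs four independent float adds in one instruction. (The array names are mine; the alignment attribute is gcc syntax.)

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstdio>

int main(void) {
    // 16-byte-aligned arrays of 4 floats each (gcc alignment syntax)
    float a[4] __attribute__((aligned(16))) = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] __attribute__((aligned(16))) = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4] __attribute__((aligned(16)));

    __m128 va = _mm_load_ps(a);      // load 4 floats at once
    __m128 vb = _mm_load_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  // 4 independent adds, one instruction
    _mm_store_ps(c, vc);             // store 4 floats at once

    printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```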
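And here's a minimal OpenMP sketch of the SMP programming model: a single pragma splits the loop's iterations across all available cores. (Compile with gcc's `-fopenmp` flag; the array setup is just a placeholder.)

```cpp
#include <omp.h>
#include <cstdio>

int main(void) {
    const int n = 1000000;
    static float a[n], b[n];
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

    // Split the loop's iterations across all available cores/threads.
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] += b[i];

    printf("up to %d threads; a[10]=%f\n", omp_get_max_threads(), a[10]);
    return 0;
}
```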
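Finally, a minimal MPI sketch of the clustering model: since nothing is shared, each process computes on its own data, and results are combined by explicit communication, here with `MPI_Reduce`. (The partial-sum computation is a placeholder; run with, e.g., `mpirun -np 4 ./a.out`.)

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // which process am I?
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // how many processes total?

    // Each process computes its own partial result (no shared memory).
    double partial = rank + 1.0;

    // Explicit communication: sum everyone's partial result into rank 0.
    double total = 0.0;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes = %f\n", size, total);
    MPI_Finalize();
    return 0;
}
```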
Each of these levels of parallelism cascades into the next, so (hypothetically):
- Pipelining lets a POWER7 operate on different stages of 4 instructions at once.
- Superscalar execution lets a POWER7 execute 2 instructions simultaneously.
- SIMD for the Power architecture is called AltiVec, and each instruction can operate on 4 floats.
- SMP: POWER7 chips each have 8 cores.
- SMT: each POWER7 core runs 4 threads.
- Clustering: NCSA is building a giant machine with 25,000 chips.
That's 200,000 cores; 800,000 threads; 3.2 million floats; or, multiplying in the 2-wide superscalar and 4-stage pipeline factors above, 25.6 million in-flight stages. That's a lot of parallelism. They're hoping to hit a dozen petaflops (12,000 trillion floating point operations per second), which they can get by finishing 3 million floating point operations per clock at a 4 GHz clock rate.
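As a sanity check on those multipliers, here's a sketch that just multiplies out the hypothetical per-level factors from the list above:

```cpp
#include <cstdio>

int main(void) {
    long long chips   = 25000;       // hypothetical NCSA machine
    long long cores   = chips   * 8; // 8 cores per POWER7 chip
    long long threads = cores   * 4; // 4 SMT threads per core
    long long floats  = threads * 4; // 4-float AltiVec SIMD
    long long stages  = floats * 2 * 4; // x2 superscalar, x4 pipeline

    double flops = 3.0e6 * 4.0e9;    // 3M float ops/clock at 4 GHz

    printf("cores=%lld threads=%lld floats=%lld stages=%lld\n",
           cores, threads, floats, stages);
    printf("peak = %.1f petaflops\n", flops / 1.0e15);  // prints 12.0
    return 0;
}
```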