Comparing Levels of Parallel Execution
CS 441 Lecture, Dr. Lawlor
We've talked about a bunch of different levels of parallelism in this course so far:
- Pipelining, running different phases of different instructions at
the same time (e.g., fetching the next instruction while decoding the
current one). The limiting factor here is the number of pipeline
stages--more stages means more parallelism, but also more overhead and
higher latency. To hide the pipeline latency during a branch,
modern machines work very hard to predict branches. Pipeline
lengths vary from five stages for the classic
fetch-decode-read-execute-write pipeline to the Pentium 4's giant
30-stage pipeline. Pipelines for more recent multicore machines
are actually using fewer stages.
- Superscalar execution, running the same phases of different instructions at the same time (e.g., running two adds simultaneously). The limiting factor here is dependencies between instructions. Modern CPUs use tricks like register renaming to eliminate write-after-read (WAR) and write-after-write (WAW) dependencies, but true read-after-write (RAW) dependencies can't be renamed away. Modern CPUs also look far ahead in the program, across a large "window" of instructions, to find non-dependent work. Typical modern CPUs can execute a maximum of four instructions simultaneously, but usually true dependencies keep that from occurring: typically even an enormous 1000-instruction window only reveals a few dozen instructions that are ready to execute. Here's a good presentation on superscalar design; see also the dependency-chain sketch after this list.
- Single-Instruction-Multiple-Data (SIMD) instructions contain multiple independent operations in a single instruction. For example, x86 SSE instructions operate on 4 separate floats simultaneously (see the SSE sketch after this list). This is a much cheaper way to get hardware parallelism: instead of making the hardware carefully analyze sequential code to determine independent operations, the programmer just writes blocks of independent operations directly. Future CPUs will have even wider SIMD instructions: Intel's Larrabee x86 GPU will operate on 16 floats at a time. NVIDIA's GPU "thread warps" are really 32-float SIMD batches. ATI's GPU "wavefronts" operate on 64 floats at a time.
- Symmetric Multi-Processing (SMP) is where you replicate the CPU cores to run separate threads simultaneously. Aside from ensuring the threads see the same memory ("cache coherent multiprocessing"), this is almost trivial from the hardware side--you just cookie-cutter out a set of identical CPUs. It's a lot harder to take advantage of multicore from the software side, although OpenMP helps a lot (see the OpenMP sketch after this list). Currently, dual core is standard equipment on laptops, quad core is common on desktops, and six or eight cores are becoming popular. Terminology: multiple processors in a single piece of silicon == multicore; multiple pieces of silicon in a single pluggable device == multichip module; multiple pieces of silicon plugged into separate sockets == multi-socket SMP (and typically a lot more money too!).
- Simultaneous Multi-Threading (SMT) is a cheaper variant of SMP, where you only replicate the registers, but share the arithmetic hardware. The programming model is identical to SMP, typically OpenMP. Intel calls SMT "HyperThreading Technology", and usually has only two sets of registers sharing each core's ALUs. IBM's upcoming POWER7 and Sun's Niagara both have eight cores per die, with four threads per core, for a total of 32 threads per chip. GPUs typically have a ridiculously large fixed pool of registers (e.g., 16,384 registers!) that they divide up into a variable number of threads depending on how many registers the code currently needs; this can result in thousands of threads sharing a single arithmetic unit!
- Clustering is where you replicate the entire box--CPU, RAM, network, and often even disk. The advantage is that everything is separate and replicated, so there are fewer bottlenecks and virtually everything scales up (compute, memory bandwidth, network bandwidth, etc.). The downside is that nothing is shared, so you need to communicate explicitly, typically via MPI (see the MPI sketch after this list).
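To make the superscalar dependency story concrete, here's a minimal C++ sketch (the function and variable names are my own, for illustration): the first loop is one long RAW dependency chain, so the adds must wait on each other; the second uses four independent accumulators, giving the hardware independent instructions to issue simultaneously.

```cpp
// One long RAW dependency chain: each add needs the previous add's
// result, so superscalar hardware can't overlap the adds.
float sum_chained(const float *a, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i];            // depends on the previous iteration's sum
    return sum;
}

// Four independent chains: the four adds in each iteration don't
// depend on each other, so they can issue simultaneously.
float sum_unrolled(const float *a, int n) {  // assumes n is a multiple of 4
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```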
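Here's a minimal SSE sketch of the "4 floats per instruction" idea: each `_mm_add_ps` performs four independent float adds in one instruction. (The array names are mine; the alignment attribute is gcc syntax.)

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstdio>

int main(void) {
    // 16-byte-aligned arrays of 4 floats each (gcc alignment syntax)
    float a[4] __attribute__((aligned(16))) = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] __attribute__((aligned(16))) = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4] __attribute__((aligned(16)));

    __m128 va = _mm_load_ps(a);      // load 4 floats at once
    __m128 vb = _mm_load_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  // 4 independent adds, one instruction
    _mm_store_ps(c, vc);             // store 4 floats at once

    printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```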
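And here's a minimal OpenMP sketch of the SMP programming model: a single pragma splits the loop's iterations across all available cores. (Compile with gcc's `-fopenmp` flag; the array setup is just a placeholder.)

```cpp
#include <omp.h>
#include <cstdio>

int main(void) {
    const int n = 1000000;
    static float a[n], b[n];
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

    // Split the loop's iterations across all available cores/threads.
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] += b[i];

    printf("up to %d threads; a[10]=%f\n", omp_get_max_threads(), a[10]);
    return 0;
}
```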
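Finally, a minimal MPI sketch of the clustering model: since nothing is shared, each process computes on its own data, and results are combined by explicit communication, here with `MPI_Reduce`. (The partial-sum computation is a placeholder; run with, e.g., `mpirun -np 4 ./a.out`.)

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // which process am I?
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // how many processes total?

    // Each process computes its own partial result (no shared memory).
    double partial = rank + 1.0;

    // Explicit communication: sum everyone's partial result into rank 0.
    double total = 0.0;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes = %f\n", size, total);
    MPI_Finalize();
    return 0;
}
```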
Each of these levels of parallelism cascades into the next, so (hypothetically):
- Pipelining lets a POWER7 operate on different stages of 4 instructions at once.
- Superscalar execution lets a POWER7 execute 2 instructions simultaneously.
- SIMD for the Power architecture is called AltiVec, and each instruction can operate on 4 floats.
- SMP: POWER7 chips each have 8 cores.
- SMT: each POWER7 core runs 4 threads.
- Clustering: NCSA is building a giant machine with 25,000 chips.
That's 200,000 cores; 800,000 threads; 3.2 million floats; or, multiplying in the 2-wide superscalar and 4-stage pipeline factors above, 25.6 million in-flight stages. That's a lot of parallelism. They're hoping to hit a dozen petaflops (12,000 trillion floating point operations per second), which they can get by finishing 3 million floating point operations per clock at a 4 GHz clock rate.
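As a sanity check on those multipliers, here's a sketch that just multiplies out the hypothetical per-level factors from the list above:

```cpp
#include <cstdio>

int main(void) {
    long long chips   = 25000;       // hypothetical NCSA machine
    long long cores   = chips   * 8; // 8 cores per POWER7 chip
    long long threads = cores   * 4; // 4 SMT threads per core
    long long floats  = threads * 4; // 4-float AltiVec SIMD
    long long stages  = floats * 2 * 4; // x2 superscalar, x4 pipeline

    double flops = 3.0e6 * 4.0e9;    // 3M float ops/clock at 4 GHz

    printf("cores=%lld threads=%lld floats=%lld stages=%lld\n",
           cores, threads, floats, stages);
    printf("peak = %.1f petaflops\n", flops / 1.0e15);  // prints 12.0
    return 0;
}
```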