Course Review for Final Exam
CS 441 Lecture, Dr. Lawlor
Pre-midterm content on the final:
- Pipelining
- Superscalar execution and dependencies
- Register renaming
- Instruction reordering (out-of-order execution)
- Bit representation and arithmetic of floating-point numbers
Post-midterm content on the final includes various explicit parallel programming models:
- SIMD/vector instructions, including SSE (see the short sketch after this list)
- Multithreaded shared memory programming, including OpenMP
- Distributed-memory programming, including MPI
- GPU programming, including GLSL
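As a quick reminder of what the first of these models looks like in code, here is a minimal SSE sketch using the <xmmintrin.h> intrinsics: a sum over an array, four floats per instruction. The function name and the assumption that n is a multiple of 4 are just for illustration.

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Sum n floats, four at a time (assumes n is a multiple of 4). */
    float sse_sum(const float *a, int n) {
        __m128 sum = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4)
            sum = _mm_add_ps(sum, _mm_loadu_ps(&a[i])); /* add 4 floats per instruction */
        float out[4];
        _mm_storeu_ps(out, sum);
        return out[0] + out[1] + out[2] + out[3]; /* combine the 4 partial sums */
    }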
You should also understand the following performance pitfalls, and how to analyze an application for high performance:
- Load imbalance: not all units are busy, because work is unevenly assigned.
    - The only fix is to reassign work. Static reassignment, such as round-robin task assignment, is usually pretty easy and effective. Dynamic reassignment is pretty easy in OpenMP ("schedule(dynamic,1)"; see the sketch after this list), at least theoretically possible with SSE, but much tougher on the GPU or in MPI.
- Alpha (startup) cost: negligible for SSE, but measured in microseconds for OpenMP (thread startup: 5us/thread), MPI (message startup: up to 50us!), or GPUs (kernel startup: 5us).
    - The general fix is to "use bigger stuff", amortizing away the overhead; see the worked example after this list. This is a limiting factor for some applications.
    - Sometimes the alpha cost can be overlapped with some other useful task.
- Dependencies: some applications have surprisingly little parallelism. For example, sorting is tough to parallelize, while "find the maximum value" is pretty easy (see the reduction sketch after this list).
    - Sometimes a new algorithm is needed. Don't be afraid to replicate work!
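For the load-imbalance bullet above, this is roughly what dynamic reassignment looks like in OpenMP. The do_task function and its printf body are placeholders for real per-task work; the point is that with schedule(dynamic,1), each idle thread grabs the next unclaimed iteration, so one slow task no longer leaves the other threads sitting idle.

    #include <omp.h>
    #include <stdio.h>

    /* Stand-in for a task whose cost varies wildly from task to task. */
    void do_task(int i) { printf("task %d ran on thread %d\n", i, omp_get_thread_num()); }

    void run_tasks(int n_tasks) {
        /* schedule(dynamic,1): threads grab one iteration at a time as they finish,
           instead of being handed a fixed block of iterations up front. */
        #pragma omp parallel for schedule(dynamic,1)
        for (int i = 0; i < n_tasks; i++)
            do_task(i);
    }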
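To see why "use bigger stuff" works for the alpha-cost bullet, plug in the MPI numbers above (the 1GB/s network bandwidth here is an assumed round figure, not a measured one): sending 10,000 doubles as 10,000 separate messages costs about 10,000 x 50us = 0.5 seconds in startup alone, while sending them as one 80KB message pays the 50us startup once plus roughly 80us of transfer time, for well under a millisecond total.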
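And for the dependencies bullet: "find the maximum value" parallelizes cleanly because each thread can compute its own local maximum independently, and the per-thread results combine in one cheap reduction at the end. A minimal sketch using OpenMP's reduction(max:...) clause (added in OpenMP 3.1); the function name is just for illustration, and it assumes n >= 1.

    #include <omp.h>

    /* Maximum of n floats: each thread keeps a private running max,
       and OpenMP combines the per-thread maxima when the loop ends. */
    float parallel_max(const float *a, int n) {
        float m = a[0];
        #pragma omp parallel for reduction(max:m)
        for (int i = 1; i < n; i++)
            if (a[i] > m) m = a[i];
        return m;
    }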