Course Review for Final Exam
CS 441 Lecture, Dr. Lawlor
Pre-midterm content on the final:
- Pipelining
- Superscalar execution and dependencies
- Register renaming
- Instruction reordering (out-of-order execution)
- Bit representation and arithmetic of floating-point numbers
Post-midterm content on the final includes various explicit parallel programming models:
- SIMD/vector instructions, including SSE (see the short sketch after this list)
- Multithreaded shared memory programming, including OpenMP
- Distributed-memory programming, including MPI
- GPU programming, including GLSL
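As a quick reminder of what the first of these models looks like in code, here is a minimal SSE sketch using the <xmmintrin.h> intrinsics: a sum over an array, four floats per instruction. The function name and the assumption that n is a multiple of 4 are just for illustration.

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Sum n floats, four at a time (assumes n is a multiple of 4). */
    float sse_sum(const float *a, int n) {
        __m128 sum = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4)
            sum = _mm_add_ps(sum, _mm_loadu_ps(&a[i])); /* add 4 floats per instruction */
        float out[4];
        _mm_storeu_ps(out, sum);
        return out[0] + out[1] + out[2] + out[3]; /* combine the 4 partial sums */
    }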
You should also understand the following performance pitfalls, and how to analyze an application for high performance:
- Load imbalance: not all units are busy, because work is unevenly assigned.
    - The only fix is to reassign work. Static reassignment, such as round-robin task assignment, is usually pretty easy and effective. Dynamic reassignment is pretty easy in OpenMP ("schedule(dynamic,1)"; see the sketch after this list), at least theoretically possible with SSE, but much tougher on the GPU or in MPI.
- Alpha (startup) cost: negligible for SSE, but measured in microseconds for OpenMP (thread startup: 5us/thread), MPI (message startup: up to 50us!), or GPUs (kernel startup: 5us).
    - The general fix is to "use bigger stuff", amortizing away the overhead; see the worked example after this list. This is a limiting factor for some applications.
    - Sometimes the alpha cost can be overlapped with some other useful task.
- Dependencies: some applications have surprisingly little parallelism. For example, sorting is tough to parallelize, while "find the maximum value" is pretty easy (see the reduction sketch after this list).
    - Sometimes a new algorithm is needed. Don't be afraid to replicate work!
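For the load-imbalance bullet above, this is roughly what dynamic reassignment looks like in OpenMP. The do_task function and its printf body are placeholders for real per-task work; the point is that with schedule(dynamic,1), each idle thread grabs the next unclaimed iteration, so one slow task no longer leaves the other threads sitting idle.

    #include <omp.h>
    #include <stdio.h>

    /* Stand-in for a task whose cost varies wildly from task to task. */
    void do_task(int i) { printf("task %d ran on thread %d\n", i, omp_get_thread_num()); }

    void run_tasks(int n_tasks) {
        /* schedule(dynamic,1): threads grab one iteration at a time as they finish,
           instead of being handed a fixed block of iterations up front. */
        #pragma omp parallel for schedule(dynamic,1)
        for (int i = 0; i < n_tasks; i++)
            do_task(i);
    }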
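To see why "use bigger stuff" works for the alpha-cost bullet, plug in the MPI numbers above (the 1GB/s network bandwidth here is an assumed round figure, not a measured one): sending 10,000 doubles as 10,000 separate messages costs about 10,000 x 50us = 0.5 seconds in startup alone, while sending them as one 80KB message pays the 50us startup once plus roughly 80us of transfer time, for well under a millisecond total.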
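And for the dependencies bullet: "find the maximum value" parallelizes cleanly because each thread can compute its own local maximum independently, and the per-thread results combine in one cheap reduction at the end. A minimal sketch using OpenMP's reduction(max:...) clause (added in OpenMP 3.1); the function name is just for illustration, and it assumes n >= 1.

    #include <omp.h>

    /* Maximum of n floats: each thread keeps a private running max,
       and OpenMP combines the per-thread maxima when the loop ends. */
    float parallel_max(const float *a, int n) {
        float m = a[0];
        #pragma omp parallel for reduction(max:m)
        for (int i = 1; i < n; i++)
            if (a[i] > m) m = a[i];
        return m;
    }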