Designing High-Performance Parallel Code
A modern program can exploit the following hardware parallelism:
- Network parallelism, by using network sockets to split up the work into very big pieces.
- Multicore, by making separate threads to handle big pieces of work, ideally with a simple OpenMP directive (a minimal example follows this list).
- SIMD, by using ugly intrinsic functions, assembly language, or a good parallelizing compiler. Often this means operating on blocks of eight (or more) elements at a time, as in the intrinsics sketch after this list.
- Superscalar and pipelined execution, by (re)structuring the code to minimize dependencies between instructions after renaming. Sometimes this means unrolling even SIMD loops a few times to expose more parallelism, as the intrinsics sketch below also shows.
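For the multicore bullet above, here is a minimal sketch of what a single OpenMP directive buys you. The function and array names are made up for illustration; compile with -fopenmp.

#include <omp.h>

// Scale every element of src into dest; the pragma splits the loop's
// iterations across all available cores.
void scale_array(float *dest, const float *src, int n, float factor) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        dest[i] = src[i] * factor;
    }
}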
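And here is one way the SIMD and superscalar bullets combine in practice: a sketch assuming AVX (eight floats per register, compile with -mavx), with the loop unrolled twice so the two independent eight-float operations per trip can overlap in a superscalar core. It assumes n is a multiple of 16; the names are illustrative.

#include <immintrin.h>

void scale_array_avx(float *dest, const float *src, int n, float factor) {
    __m256 f = _mm256_set1_ps(factor);             // broadcast factor to 8 lanes
    for (int i = 0; i < n; i += 16) {              // unrolled 2x
        __m256 a = _mm256_loadu_ps(&src[i]);       // two independent loads...
        __m256 b = _mm256_loadu_ps(&src[i + 8]);
        _mm256_storeu_ps(&dest[i],     _mm256_mul_ps(a, f)); // ...and two multiplies
        _mm256_storeu_ps(&dest[i + 8], _mm256_mul_ps(b, f)); // keep the pipeline full
    }
}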
Parallel code runs fast if:
- Your software has enough parallelism to keep the hardware busy. Since a desktop machine might have over ten cores today, it's usually not enough to just split off the naturally separate tasks in the application (like one thread for the game network, a second for the game AI, and a third for the game physics); you need to use multiple threads within an individual task.
- You've divided the work into equal parts, so all the hardware stays busy instead of everybody waiting for one server, one thread, or one SIMD lane. On the network, a dedicated "load balancer" splits up work across servers. In OpenMP, you can use the "schedule" options to move work between threads, as sketched after this list. To improve SIMD load balance, you may need to group data by related branches, or use a movemask instruction to figure out which branches are still needed (also sketched below). For superscalar execution, load balance means spreading the work across independent dependency chains so the functional units stay busy.
- Each thread has its own data, so you're not wasting time synchronizing access to the same variables, or thrashing the cache. On the network, the access latency to remote data is huge. For multicore, sharing cache lines can destroy performance. In SIMD, data alignment and cache management are key for performance. It may be cheaper to recompute answers from scratch than to communicate them, due to the dependencies and overheads inherent in communication. The reduction sketch after this list shows the per-thread-data idea on a multicore.
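A sketch of OpenMP's "schedule" clause for load balance, assuming a made-up cost_of_row() whose runtime grows with the row number: schedule(dynamic) hands out iterations a few at a time, so a thread that finishes cheap rows immediately grabs more instead of sitting idle.

#include <omp.h>

// Hypothetical per-row work whose cost varies wildly from row to row.
double cost_of_row(int r) {
    double sum = 0.0;
    for (int i = 0; i < r * 1000; i++) sum += i * 0.5;  // later rows cost more
    return sum;
}

void process_rows(int nrows) {
    // With schedule(dynamic,4), threads take rows in chunks of 4 as they
    // finish, instead of being assigned fixed (and uneven) halves up front.
    #pragma omp parallel for schedule(dynamic, 4)
    for (int r = 0; r < nrows; r++) {
        cost_of_row(r);
    }
}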
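A sketch of the movemask idea, assuming SSE: compare all four lanes at once, then pack the per-lane results into an ordinary integer so scalar code can decide whether any lane still needs the slow branch. The function name and threshold test are illustrative.

#include <xmmintrin.h>

// Returns a 4-bit mask; bit i is set if lane i still needs the slow path.
// A return value of 0 means the expensive branch can be skipped entirely.
int lanes_still_active(__m128 values, __m128 threshold) {
    __m128 mask = _mm_cmpgt_ps(values, threshold); // each lane: all-ones or all-zeros
    return _mm_movemask_ps(mask);                  // pack the four sign bits into an int
}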
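Finally, a sketch of giving each thread its own data, using OpenMP's reduction clause: every thread accumulates into a private partial sum that gets merged at the end, instead of all the threads fighting over the cache line holding one shared total.

#include <omp.h>

double sum_array(const double *a, int n) {
    double total = 0.0;
    // Each thread gets a private copy of total; using "#pragma omp atomic" on
    // one shared total would instead bounce its cache line between cores.
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        total += a[i];
    }
    return total;
}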
CS 441 Lecture Note, 2014, Dr. Orion Lawlor, UAF Computer Science Department.