Computer Architecture in Review
CS 301: Assembly Language Programming Lecture, Dr. Lawlor
To study for the final exam, read the lecture notes! You
should be able to:
- Explain the concepts
- Show how to apply the concepts to concrete examples
- Compare different implementations (e.g., should we use SIMD,
OpenMP, or CUDA for this problem? Why?)
Performance: nanoseconds and timers
Key idea: you can scientifically measure performance, by running
experiments.
double start_time = time_in_seconds();
... run n copies of the stuff being timed ...
double end_time = time_in_seconds();
double elapsed = (end_time - start_time) / n;  // time per copy
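A minimal runnable sketch of this experiment, with std::chrono standing in for the lecture's time_in_seconds helper; run_stuff and the iteration count n are hypothetical placeholders for whatever work is being timed.
#include <chrono>
#include <cstdio>
// Stand-in for time_in_seconds, built on std::chrono's steady clock.
double time_in_seconds(void) {
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now().time_since_epoch()).count();
}
volatile double sink = 0.0;              // volatile keeps the compiler from deleting the work
void run_stuff(void) { sink += 1.0; }    // hypothetical work being timed
int main(void) {
    const int n = 1000000;               // run many copies so timer overhead is negligible
    double start_time = time_in_seconds();
    for (int i = 0; i < n; i++) run_stuff();
    double end_time = time_in_seconds();
    double elapsed = (end_time - start_time) / n;   // seconds per copy
    printf("%.3f ns per copy\n", elapsed * 1.0e9);
    return 0;
}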
Superscalar CPU performance
Key idea: the CPU circuitry tries to run your (sequential) assembly
code in parallel.
Pitfall: the CPU must respect the code's data dependencies.
Circuits are naturally parallel; exploiting this is called "instruction-level parallelism".
Single Instruction, Multiple Data: parallel floats on x86 with SSE
Key idea: write special instructions (like addps) that operate on
several things in parallel.
addps takes exactly the same amount of time as addss, because the
hardware has multiple add circuits.
SSE gives you parallel floats, but you can also get 64 parallel *bits* with ordinary long integers.
Pitfall: for branches, the code needs to compute both sides of the branch, then use a bizarre bitwise selection operation:
answer = (mask & then_answer) | ((~mask) & else_answer);
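The same idea written with SSE intrinsics from C++ rather than raw addps; this sketch assumes the compare/and/or intrinsics from <xmmintrin.h>, and the computation itself (double it, or negate it) is made up for illustration.
#include <xmmintrin.h>     // SSE: __m128 holds four packed floats
// For each of the four floats: answer = (x > 0) ? 2*x : -x;  with no branch at all.
__m128 branchless_select(__m128 x) {
    __m128 then_answer = _mm_mul_ps(x, _mm_set1_ps(2.0f));   // compute the "then" side
    __m128 else_answer = _mm_sub_ps(_mm_setzero_ps(), x);    // compute the "else" side too
    __m128 mask = _mm_cmpgt_ps(x, _mm_setzero_ps());         // all-ones bits where x > 0
    return _mm_or_ps(_mm_and_ps(mask, then_answer),          // (mask & then) |
                     _mm_andnot_ps(mask, else_answer));      // (~mask & else)
}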
Multicore with threads
Key idea: run several functions at the same time using different
cores.
std::thread t1(do_stuff, 0, n/2);   // first half of the stuff
std::thread t2(do_stuff, n/2, n);   // second half of the stuff
t1.join();   // wait for stuff
t2.join();
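A minimal runnable version of this pattern; it assumes do_stuff sums half of a shared array, and adds a hypothetical thread_index argument so each thread writes its own output slot instead of racing on one shared variable.
#include <thread>
#include <iostream>
const int n = 1000000;
double data[n];                 // shared between threads (only read here)
double partial[2];              // one output slot per thread avoids a race condition
void do_stuff(int thread_index, int start, int end) {
    double sum = 0.0;           // local variable: lives on this thread's own stack
    for (int i = start; i < end; i++) sum += data[i];
    partial[thread_index] = sum;
}
int main(void) {
    for (int i = 0; i < n; i++) data[i] = 1.0;
    std::thread t1(do_stuff, 0, 0, n/2);   // first half of the stuff
    std::thread t2(do_stuff, 1, n/2, n);   // second half of the stuff
    t1.join();                             // wait for the stuff to finish
    t2.join();
    std::cout << "total = " << partial[0] + partial[1] << "\n";
    return 0;
}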
Shared between threads: memory, files, network
Per-thread: registers, stack
Pitfalls:
- Avoiding memory race conditions (e.g., malfunction due to
shared global variables)
- Having enough work to outweigh the thread start time (thousands
of nanoseconds)
- Dividing up work evenly, so all the threads stay busy
You can also combine threads with SIMD: four cores each operating on
four floats per clock cycle makes for a lot of parallelism!
Multicore with OpenMP
Key idea: compiler makes threads by splitting up the loops you
indicate.
#pragma omp parallel for reduction(+:total)
for (int i=0;i<n;i++) total+=do_stuff(i);
Much easier than manually making threads, but same pitfalls apply.
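A compilable sketch of the same loop, where do_stuff is a hypothetical per-iteration function; build with the compiler's OpenMP flag (e.g. g++ -fopenmp).
#include <iostream>
double do_stuff(int i) { return i * 0.5; }   // hypothetical per-iteration work
int main(void) {
    const int n = 1000000;
    double total = 0.0;
    // The compiler splits the loop across cores; reduction(+:total) gives each
    // thread a private copy of total and safely adds them together at the end.
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) total += do_stuff(i);
    std::cout << "total = " << total << "\n";
    return 0;
}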
Graphics Card programming with CUDA
Key idea: graphics card is a separate computer that supports a whole
lot of threads.
Example: one sequential C++ for loop:
for (int i=0;i<n;i++)
answer_array[i]=do_stuff(i);
becomes a CUDA kernel call to run n threads (in blocks of 256
threads each):
do_stuff<<<n/256, 256>>>(gpu_answer_array);
Each thread looks up its own index i = threadIdx.x + blockIdx.x * blockDim.x;
You need about a million threads (!) to really get a modern GPU
cranking.
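A sketch of what the kernel side might look like; the per-element work here is invented, n is chosen as a multiple of 256 so no thread runs past the end of the array, and error checking is omitted.
#include <cuda_runtime.h>
#include <cstdio>
// CUDA kernel: each of the n GPU threads computes one array element.
__global__ void do_stuff(float *gpu_answer_array) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;   // this thread's own index
    gpu_answer_array[i] = i * 2.0f;                  // hypothetical per-element work
}
int main(void) {
    const int n = 256 * 1024;                        // a multiple of the block size
    float *gpu_answer_array = 0;
    cudaMalloc((void **)&gpu_answer_array, n * sizeof(float));
    do_stuff<<<n/256, 256>>>(gpu_answer_array);      // n threads, 256 per block
    float first[4];
    cudaMemcpy(first, gpu_answer_array, sizeof(first), cudaMemcpyDeviceToHost);
    printf("first elements: %f %f %f %f\n", first[0], first[1], first[2], first[3]);
    cudaFree(gpu_answer_array);
    return 0;
}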
Rendering Graphics with CUDA, and Branch Divergence
Key idea: parallelism across pixels.
CUDA handles branches SIMD-style internally: when threads in the same SIMD-width group take different sides of a branch, the hardware runs both sides. This is branch divergence, and the group size is the GPU's SIMD vector width.
Macros in NASM and C
Key idea: macro = string replacement. Lets you arbitrarily
change language syntax.
#define SQUARE(n) ((n)*(n)) /* parentheses make it so SQUARE(5+5) works right */
#define DO_STUFF() do { stuff(); } while(0) /* the do/while is so DO_STUFF(); works inside an "if" statement */
#define MAKE_STUDENT(firstname,lastname) void run_##firstname##lastname(void) { std::cout<<"Hi! I'm "<<#firstname<<"\n"; }
Systems programming: registration
Key idea: store a table of function pointers, add and look up
functions from the table at runtime.
Used to implement registration in the OS, make fast switch
statements, build fast interpreters, etc.
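A minimal sketch of such a table using a std::map from names to function pointers; all the names here are hypothetical.
#include <iostream>
#include <map>
#include <string>
typedef void (*handler_fn)(void);                 // the function pointer type stored in the table
std::map<std::string, handler_fn> handler_table;  // the registration table itself
void register_handler(const std::string &name, handler_fn fn) {
    handler_table[name] = fn;                     // add a function at runtime
}
void say_hello(void) { std::cout << "hello\n"; }
void say_bye(void)   { std::cout << "bye\n"; }
int main(void) {
    register_handler("hello", say_hello);
    register_handler("bye",   say_bye);
    handler_table["hello"]();                     // look one up by name and call it
    return 0;
}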
Systems programming: system calls
Key idea: directly talk to the operating system using what is essentially a special kind of crash (the "syscall" or "int 0x80" instruction).
Syscalls are important for understanding the machine's security
boundaries: they're the dividing line between user space (no rights)
and kernel space (all the rights).
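A sketch of crossing that boundary on Linux x86-64 using the C library's syscall() wrapper (which ultimately executes the syscall instruction for you); SYS_write and its argument order come from the kernel's ABI.
#include <unistd.h>        // syscall()
#include <sys/syscall.h>   // SYS_write
int main(void) {
    const char msg[] = "hello from a raw system call\n";
    // Control transfers from user space into kernel space and back;
    // file descriptor 1 is standard output.
    syscall(SYS_write, 1, msg, sizeof(msg) - 1);
    return 0;
}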
Memory Map manipulation with mmap
Key idea: your program's virtual memory shows the real physical
memory indirectly, via the OS-built page table.
This lets your program put memory at any address you choose, and
lets the OS lazily load parts of your program.
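A sketch of asking the OS to build a fresh page table entry with the POSIX mmap call; the hint address 0x40000000 is just an example, and the kernel may place the page elsewhere.
#include <sys/mman.h>
#include <cstdio>
int main(void) {
    void *hint = (void *)0x40000000;      // example address to request (a hint, not a demand)
    char *mem = (char *)mmap(hint, 4096,
                             PROT_READ | PROT_WRITE,        // we want to read and write the page
                             MAP_PRIVATE | MAP_ANONYMOUS,   // zeroed memory, not backed by a file
                             -1, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }
    mem[0] = 'A';                         // the new page really is usable memory
    printf("OS mapped the page at %p\n", (void *)mem);
    munmap(mem, 4096);
    return 0;
}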
ARM Assembly Intro & Machine Code
Similarities to x86: instructions exist, registers exist
Differences from x86: fixed-length instructions instead of variable-length ones (RISC vs. CISC), different names and uses of registers (r0 plays the role of both rdi and rax), and different mechanics for function calls (the return address goes in lr rather than being pushed on the stack by call).
ARM: focused on efficient use of silicon and energy
x86: focused on maximum single-threaded performance
GPU: focused on maximum many-thread performance
Embedded Systems programming & Raspberry Pi GPIO Pins
Key idea: control physical hardware from software by setting GPIO
register values in memory.
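A rough sketch of that idea on a Raspberry Pi, mapping the GPIO register block into our address space with mmap and driving one pin; the /dev/gpiomem device, the BCM2835 register offsets, and the pin number are assumptions here, so treat this as illustrative rather than board-accurate.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>
int main(void) {
    // /dev/gpiomem exposes just the GPIO register block (assumed Raspberry Pi setup).
    int fd = open("/dev/gpiomem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/gpiomem"); return 1; }
    volatile unsigned *gpio = (volatile unsigned *)
        mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if ((void *)gpio == MAP_FAILED) { perror("mmap"); return 1; }
    const int pin = 17;                          // hypothetical pin choice
    // GPFSEL registers (assumed at offset 0): 3 bits per pin, 001 selects "output".
    gpio[pin / 10] = (gpio[pin / 10] & ~(7u << ((pin % 10) * 3)))
                   | (1u << ((pin % 10) * 3));
    gpio[7]  = 1u << pin;                        // GPSET0 (assumed word offset 7): pin high
    sleep(1);
    gpio[10] = 1u << pin;                        // GPCLR0 (assumed word offset 10): pin low
    return 0;
}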