Computer Architecture in Review
CS 301: Assembly Language Programming Lecture, Dr. Lawlor
To study for the final exam, read the lecture notes! You
should be able to:
- Explain the concepts
- Show how to apply the concepts to concrete examples
- Compare different implementations (e.g., should we use SIMD,
OpenMP, or CUDA for this problem? Why?)
Performance: nanoseconds and timers
Key idea: you can scientifically measure performance, by running
experiments.
double start_time = time_in_seconds();
... run n copies of the stuff being timed ...
double end_time = time_in_seconds();
double elapsed = (end_time - start_time) / n;  // time per copy
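A minimal runnable sketch of this experiment, with std::chrono standing in for the lecture's time_in_seconds helper; run_stuff and the iteration count n are hypothetical placeholders for whatever work is being timed.
#include <chrono>
#include <cstdio>
// Stand-in for time_in_seconds, built on std::chrono's steady clock.
double time_in_seconds(void) {
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now().time_since_epoch()).count();
}
volatile double sink = 0.0;              // volatile keeps the compiler from deleting the work
void run_stuff(void) { sink += 1.0; }    // hypothetical work being timed
int main(void) {
    const int n = 1000000;               // run many copies so timer overhead is negligible
    double start_time = time_in_seconds();
    for (int i = 0; i < n; i++) run_stuff();
    double end_time = time_in_seconds();
    double elapsed = (end_time - start_time) / n;   // seconds per copy
    printf("%.3f ns per copy\n", elapsed * 1.0e9);
    return 0;
}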
Superscalar CPU performance
Key idea: the CPU circuitry tries to run your (sequential) assembly
code in parallel.
Pitfall: the CPU must respect the code's data dependencies.
Circuits are naturally parallel; exploiting this is called "instruction-level parallelism".
Single Instruction, Multiple Data: parallel floats on x86 with SSE
Key idea: write special instructions (like addps) that operate on
several things in parallel.
addps takes exactly the same amount of time as addss, because the
hardware has multiple add circuits.
SSE gives you parallel floats, but you can also get 64 parallel *bits* with ordinary long integers.
Pitfall: for branches, the code needs to compute both sides of the branch, then use a bizarre bitwise selection operation:
answer = (mask & then_answer) | ((~mask) & else_answer);
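The same idea written with SSE intrinsics from C++ rather than raw addps; this sketch assumes the compare/and/or intrinsics from <xmmintrin.h>, and the computation itself (double it, or negate it) is made up for illustration.
#include <xmmintrin.h>     // SSE: __m128 holds four packed floats
// For each of the four floats: answer = (x > 0) ? 2*x : -x;  with no branch at all.
__m128 branchless_select(__m128 x) {
    __m128 then_answer = _mm_mul_ps(x, _mm_set1_ps(2.0f));   // compute the "then" side
    __m128 else_answer = _mm_sub_ps(_mm_setzero_ps(), x);    // compute the "else" side too
    __m128 mask = _mm_cmpgt_ps(x, _mm_setzero_ps());         // all-ones bits where x > 0
    return _mm_or_ps(_mm_and_ps(mask, then_answer),          // (mask & then) |
                     _mm_andnot_ps(mask, else_answer));      // (~mask & else)
}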
Multicore with threads
Key idea: run several functions at the same time using different
cores.
std::thread t1(do_stuff, 0, n/2);   // first half of the stuff
std::thread t2(do_stuff, n/2, n);   // second half of the stuff
t1.join();   // wait for stuff
t2.join();
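A minimal runnable version of this pattern; it assumes do_stuff sums half of a shared array, and adds a hypothetical thread_index argument so each thread writes its own output slot instead of racing on one shared variable.
#include <thread>
#include <iostream>
const int n = 1000000;
double data[n];                 // shared between threads (only read here)
double partial[2];              // one output slot per thread avoids a race condition
void do_stuff(int thread_index, int start, int end) {
    double sum = 0.0;           // local variable: lives on this thread's own stack
    for (int i = start; i < end; i++) sum += data[i];
    partial[thread_index] = sum;
}
int main(void) {
    for (int i = 0; i < n; i++) data[i] = 1.0;
    std::thread t1(do_stuff, 0, 0, n/2);   // first half of the stuff
    std::thread t2(do_stuff, 1, n/2, n);   // second half of the stuff
    t1.join();                             // wait for the stuff to finish
    t2.join();
    std::cout << "total = " << partial[0] + partial[1] << "\n";
    return 0;
}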
Shared between threads: memory, files, network
Per-thread: registers, stack
Pitfalls:
- Avoiding memory race conditions (e.g., malfunction due to
shared global variables)
- Having enough work to outweigh the thread start time (thousands
of nanoseconds)
- Dividing up work evenly, so all the threads stay busy
You can also combine threads with SIMD: four cores each operating on
four floats per clock cycle makes for a lot of parallelism!
Multicore with OpenMP
Key idea: compiler makes threads by splitting up the loops you
indicate.
#pragma omp parallel for reduction(+:total)
for (int i=0;i<n;i++) total+=do_stuff(i);
Much easier than manually making threads, but same pitfalls apply.
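A compilable sketch of the same loop, where do_stuff is a hypothetical per-iteration function; build with the compiler's OpenMP flag (e.g. g++ -fopenmp).
#include <iostream>
double do_stuff(int i) { return i * 0.5; }   // hypothetical per-iteration work
int main(void) {
    const int n = 1000000;
    double total = 0.0;
    // The compiler splits the loop across cores; reduction(+:total) gives each
    // thread a private copy of total and safely adds them together at the end.
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) total += do_stuff(i);
    std::cout << "total = " << total << "\n";
    return 0;
}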
Graphics Card programming with CUDA
Key idea: graphics card is a separate computer that supports a whole
lot of threads.
Example: one sequential C++ for loop:
for (int i=0;i<n;i++)
answer_array[i]=do_stuff(i);
becomes a CUDA kernel call to run n threads (in blocks of 256
threads each):
do_stuff<<<n/256, 256>>>(gpu_answer_array);
Each thread looks up its own index i = threadIdx.x + blockIdx.x * blockDim.x;
You need about a million threads (!) to really get a modern GPU
cranking.
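A sketch of what the kernel side might look like; the per-element work here is invented, n is chosen as a multiple of 256 so no thread runs past the end of the array, and error checking is omitted.
#include <cuda_runtime.h>
#include <cstdio>
// CUDA kernel: each of the n GPU threads computes one array element.
__global__ void do_stuff(float *gpu_answer_array) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;   // this thread's own index
    gpu_answer_array[i] = i * 2.0f;                  // hypothetical per-element work
}
int main(void) {
    const int n = 256 * 1024;                        // a multiple of the block size
    float *gpu_answer_array = 0;
    cudaMalloc((void **)&gpu_answer_array, n * sizeof(float));
    do_stuff<<<n/256, 256>>>(gpu_answer_array);      // n threads, 256 per block
    float first[4];
    cudaMemcpy(first, gpu_answer_array, sizeof(first), cudaMemcpyDeviceToHost);
    printf("first elements: %f %f %f %f\n", first[0], first[1], first[2], first[3]);
    cudaFree(gpu_answer_array);
    return 0;
}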
Rendering Graphics with CUDA, and Branch Divergence
Key idea: parallelism across pixels.
CUDA handles branches SIMD-style internally: when threads in the same SIMD-width group take different sides of a branch, the hardware runs both sides. This is branch divergence, and the group size is the GPU's SIMD vector width.
Macros in NASM and C
Key idea: macro = string replacement. Lets you arbitrarily
change language syntax.
#define SQUARE(n) ((n)*(n)) /* parentheses make it so SQUARE(5+5) works right */
#define DO_STUFF() do { stuff(); } while(0) /* the do/while is so DO_STUFF(); works inside an "if" statement */
#define MAKE_STUDENT(firstname,lastname) void run_##firstname##lastname(void) { std::cout<<"Hi! I'm "<<#firstname<<"\n"; }
Systems programming: registration
Key idea: store a table of function pointers, add and look up
functions from the table at runtime.
Used to implement registration in the OS, make fast switch
statements, build fast interpreters, etc.
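A minimal sketch of such a table using a std::map from names to function pointers; all the names here are hypothetical.
#include <iostream>
#include <map>
#include <string>
typedef void (*handler_fn)(void);                 // the function pointer type stored in the table
std::map<std::string, handler_fn> handler_table;  // the registration table itself
void register_handler(const std::string &name, handler_fn fn) {
    handler_table[name] = fn;                     // add a function at runtime
}
void say_hello(void) { std::cout << "hello\n"; }
void say_bye(void)   { std::cout << "bye\n"; }
int main(void) {
    register_handler("hello", say_hello);
    register_handler("bye",   say_bye);
    handler_table["hello"]();                     // look one up by name and call it
    return 0;
}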
Systems programming: system calls
Key idea: directly talk to the operating system using what is essentially a special kind of crash (the "syscall" or "int 0x80" instruction).
Syscalls are important for understanding the machine's security
boundaries: they're the dividing line between user space (no rights)
and kernel space (all the rights).
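A sketch of crossing that boundary on Linux x86-64 using the C library's syscall() wrapper (which ultimately executes the syscall instruction for you); SYS_write and its argument order come from the kernel's ABI.
#include <unistd.h>        // syscall()
#include <sys/syscall.h>   // SYS_write
int main(void) {
    const char msg[] = "hello from a raw system call\n";
    // Control transfers from user space into kernel space and back;
    // file descriptor 1 is standard output.
    syscall(SYS_write, 1, msg, sizeof(msg) - 1);
    return 0;
}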
Memory Map manipulation with mmap
Key idea: your program's virtual memory shows the real physical
memory indirectly, via the OS-built page table.
This lets your program put memory at any address you choose, and
lets the OS lazily load parts of your program.
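A sketch of asking the OS to build a fresh page table entry with the POSIX mmap call; the hint address 0x40000000 is just an example, and the kernel may place the page elsewhere.
#include <sys/mman.h>
#include <cstdio>
int main(void) {
    void *hint = (void *)0x40000000;      // example address to request (a hint, not a demand)
    char *mem = (char *)mmap(hint, 4096,
                             PROT_READ | PROT_WRITE,        // we want to read and write the page
                             MAP_PRIVATE | MAP_ANONYMOUS,   // zeroed memory, not backed by a file
                             -1, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }
    mem[0] = 'A';                         // the new page really is usable memory
    printf("OS mapped the page at %p\n", (void *)mem);
    munmap(mem, 4096);
    return 0;
}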
ARM Assembly Intro & Machine Code
Similarities to x86: instructions exist, registers exist
Differences from x86: fixed-length instructions instead of variable-length ones (RISC vs. CISC), different names and uses of registers (r0 plays the role of both rdi and rax), and different mechanics for function calls (the return address goes in lr rather than being pushed on the stack by call).
ARM: focused on efficient use of silicon and energy
x86: focused on maximum single-threaded performance
GPU: focused on maximum many-thread performance
Embedded Systems programming & Raspberry Pi GPIO Pins
Key idea: control physical hardware from software by setting GPIO
register values in memory.
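A rough sketch of that idea on a Raspberry Pi, mapping the GPIO register block into our address space with mmap and driving one pin; the /dev/gpiomem device, the BCM2835 register offsets, and the pin number are assumptions here, so treat this as illustrative rather than board-accurate.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>
int main(void) {
    // /dev/gpiomem exposes just the GPIO register block (assumed Raspberry Pi setup).
    int fd = open("/dev/gpiomem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/gpiomem"); return 1; }
    volatile unsigned *gpio = (volatile unsigned *)
        mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if ((void *)gpio == MAP_FAILED) { perror("mmap"); return 1; }
    const int pin = 17;                          // hypothetical pin choice
    // GPFSEL registers (assumed at offset 0): 3 bits per pin, 001 selects "output".
    gpio[pin / 10] = (gpio[pin / 10] & ~(7u << ((pin % 10) * 3)))
                   | (1u << ((pin % 10) * 3));
    gpio[7]  = 1u << pin;                        // GPSET0 (assumed word offset 7): pin high
    sleep(1);
    gpio[10] = 1u << pin;                        // GPCLR0 (assumed word offset 10): pin low
    return 0;
}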