Threads: Multicore Parallel Programming in Shared Memory

The typical way C++ code runs is sequential: when we call a function, we stop running, the function starts running, it continues running until the function returns, and only then do we start running again. Even if you have multiple programs, on each core the OS typically runs only one program at a time, switching between programs when interrupts happen, but never running two programs on the same core at the same time.

The basic operation of threads is fundamentally different: if we start a thread, the thread starts running while we keep running. If you have enough cores, both threads will run at the same time, until the program exits or you "join" the threads together again. Unlike with isolated programs, threads share the same memory space, so two threads can read and write anything in each other's memory at the same time. Threads each have their own registers and stack areas, but the stacks are each in their own area of the same memory space.

Registers: separate copy of registers for each thread, so each thread can "add rax,3" without interference from any other thread.
Arithmetic units: these may actually be physically shared between nearby HyperThread / SMT cores, but they're logically kept separate, so each thread only sees the results of its own arithmetic operations.
Stack: separate memory area for each thread, so each thread can call its own functions and store its own local variables without interference.
Memory: all of memory is shared between all threads, which is how threads communicate and exchange data. If a thread wants to not share, it needs to allocate its own unique area of memory, for example using new or malloc. For global variables, the C++11 "thread_local" keyword makes a separate copy of the variable for each thread.
Cache: physically, the L1 cache is normally separate for each core; the last level cache is normally shared between cores. But logically, all of memory is shared.
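
As a minimal sketch of the "thread_local" behavior described above: each thread below counts in its own copy of a variable, so the counts never interfere. The names private_count, count_privately, and thread_local_demo are made up for this illustration.

```cpp
#include <thread>

thread_local int private_count = 0;  // one separate copy per thread

// Hypothetical helper: bump this thread's own counter n times.
void count_privately(int n, int *result) {
	for (int i = 0; i < n; i++) private_count++;
	*result = private_count;  // report this thread's copy
}

// Runs two workers, then returns the calling thread's copy of private_count.
int thread_local_demo(int *a_result, int *b_result) {
	std::thread a(count_privately, 1000, a_result);
	std::thread b(count_privately, 2000, b_result);
	a.join();
	b.join();
	return private_count;  // the workers never touched this copy
}
```

Called from foo() or main(), the two workers should report 1000 and 2000, while the caller's own copy stays at whatever it was before the call.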

C++11 finally added a thread library, in <thread>. This makes it easy to create threads in a portable way, without needing to call the operating system directly to make Windows kernel threads or UNIX pthreads.

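#include <iostream>
#include <thread>

volatile int delay=0; // volatile keeps the compiler from deleting the delay loop
void do_work(char where) {
	for (int i=0;i<10;i++) {
		for (int j=0;j<1000;j++) delay++; // burn a little time between prints
		std::cout<<where<<i<<"\n";
	}
}

void worker_thread(void) {
	do_work('B');
}

void foo(void) {
	std::thread t(worker_thread); // the new thread starts running do_work('B')
	do_work('A'); // meanwhile, this thread runs do_work('A')
	t.join(); // wait for the worker thread to finish
	std::cout<<"Done!\n";
}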


If the function takes arguments, you create the thread by first passing the name of the function to run, then the function arguments. For example, here we're making a thread to call puts (which is probably a waste of a thread!)

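	std::thread t(puts,"I/O from a thread!");
	t.join(); // wait for the thread; destroying a std::thread that was never joined terminates the program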
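
Here is a slightly larger sketch of passing arguments, using a made-up worker named sum_range. One assumption worth flagging: std::thread copies its arguments, so to hand the thread a reference you wrap it in std::ref.

```cpp
#include <functional>  // std::ref
#include <thread>

// Hypothetical worker: adds up first..last-1 and stores the answer in out.
void sum_range(int first, int last, long long &out) {
	long long sum = 0;
	for (int i = first; i < last; i++) sum += i;
	out = sum;
}

long long run_sum_range(void) {
	long long result = 0;
	// Function name first, then its arguments; std::ref passes result by reference.
	std::thread t(sum_range, 0, 100, std::ref(result));
	t.join();  // wait for the thread before reading result
	return result;
}
```

Reading result is only safe after the join, because until then the worker may still be writing it.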


Threads are a very common way to handle:

Slow operations like network or disk I/O. One thread can handle the I/O, while the main thread keeps the user interface responsive. Many servers spawn one thread for each network client.
Compute-intensive operations like rendering. Multiple threads can compute parts of the answer simultaneously.
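
For the compute-intensive case, one possible sketch is to hand each thread a contiguous chunk of the work and a slot of its own to store the partial answer in. The names work_item, sum_chunk, and parallel_sum are hypothetical, and squaring the index just stands in for real work.

```cpp
#include <thread>
#include <vector>

// Hypothetical per-item work: here just squaring the index.
long long work_item(int i) { return (long long)i * i; }

// Thread body: total up items [first,last) into this thread's own slot.
void sum_chunk(int first, int last, long long *slot) {
	long long sum = 0;
	for (int i = first; i < last; i++) sum += work_item(i);
	*slot = sum;
}

long long parallel_sum(int n, int nthreads) {
	std::vector<long long> partial(nthreads, 0);
	std::vector<std::thread> threads;
	for (int t = 0; t < nthreads; t++) {
		int first = (int)((long long)n * t / nthreads);
		int last = (int)((long long)n * (t + 1) / nthreads);
		threads.emplace_back(sum_chunk, first, last, &partial[t]);
	}
	long long total = 0;
	for (int t = 0; t < nthreads; t++) {
		threads[t].join();    // wait for this chunk to finish
		total += partial[t];  // then add in its partial answer
	}
	return total;
}
```

Each thread keeps its running sum in a local variable and writes its slot only once at the end, so the threads never write to the same location.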

However, notice in the above program, sometimes the prints from A and B overlap, and run in an arbitrary order:

AB0
B1
0
A1
B2
A2
B3
B4
A3
B5
A4
BA5
6A6

A7B7

A8
B8A9

B9

Note the weird blank lines that result when A's printouts happen between B's number and newline, or vice versa.

PROBLEM: Parallel access to any single resource, like cout, results in resource contention. Contention can lead to wrong answers, bad performance, or even both at once.

SOLUTION(I): Ignore this, and get the wrong answer, slowly. You will also occasionally crash onstage.
SOLUTION(F): Forget parallelism. You get the right answer, but only using one core. This has been the default solution from 1950 until 2005, and it's the one we still teach!
SOLUTION(A): Use an atomic operation to access the resource. This is typically only possible for single-int or single-float operations, and has some overhead on CPUs, but it's possible for the hardware to make this efficient, like on most GPUs.
SOLUTION(C): Add a mutex, a mutual exclusion device (AKA lock), to control access to the single resource. This gives you the right answer, but costs performance: the whole point of the critical section is to reduce parallelism, and locking and unlocking costs at least a few function calls. In C++, std::mutex supports manual .lock() and .unlock() calls, but for exception safety you should avoid them and use an RAII lock_guard instead:

#include <mutex>
std::mutex big_lock;
... inside your function ...
    std::lock_guard<std::mutex> guard(big_lock); // lock on ctor, unlock on dtor
    ... code here is protected by the big_lock ...

SOLUTION(P): Parallelize (or "privatize") all resources--then there aren't any sharing problems because nothing is shared. This is the best solution, but making several copies of hardware or software can be expensive, and usually requires changing a lot of code.
SOLUTION(H): Hybrid: use any of the above where it's appropriate.
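
As a rough sketch of SOLUTION(A) and SOLUTION(C) together, here is the earlier do_work loop reworked so a counter is updated atomically and each output line is written under a lock. The names hits, out_lock, out, do_locked_work, and run_locked_work are made up for this example, and a std::ostringstream stands in for cout so the result can be inspected.

```cpp
#include <atomic>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>

std::atomic<int> hits(0);  // SOLUTION(A): a single int, updated atomically
std::mutex out_lock;       // SOLUTION(C): guards the shared stream below
std::ostringstream out;    // stand-in for cout

void do_locked_work(char where) {
	for (int i = 0; i < 10; i++) {
		hits++;  // atomic increment, no lock needed
		std::lock_guard<std::mutex> guard(out_lock);  // lock until end of this iteration
		out << where << i << "\n";  // the whole line is written under the lock
	}
}

std::string run_locked_work(void) {
	std::thread t(do_locked_work, 'B');
	do_locked_work('A');
	t.join();
	return out.str();
}
```

The A and B lines can still arrive in any order, but a line can no longer be split in the middle by the other thread.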

For example, we can eliminate contention on the single "cout" variable by writing the results into an array of strings. Generally, arrays or vectors are nice, because multicore machines do a decent job at simultaneous access to different places in memory, but do a bad job at simultaneous access to a single shared data structure like a file or network device: accessing a single resource is likely to be slow if it even works at all!
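
A minimal sketch of that idea, assuming two threads and hypothetical names do_private_work and run_private_work: each thread appends only to its own string, and the strings are collected after the join.

```cpp
#include <string>
#include <thread>
#include <vector>

// Each thread appends only to its own string, so no lock is needed.
void do_private_work(char where, std::string *log) {
	for (int i = 0; i < 10; i++) {
		*log += where;
		*log += std::to_string(i);
		*log += "\n";
	}
}

std::vector<std::string> run_private_work(void) {
	std::vector<std::string> logs(2);  // one private string per thread
	std::thread t(do_private_work, 'B', &logs[1]);
	do_private_work('A', &logs[0]);
	t.join();
	return logs;  // safe to read, or print to cout, after the join
}
```

Printing logs[0] and then logs[1] from a single thread gives all of A's lines followed by all of B's, with no overlap.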

Multicore via OpenMP

Because threaded programming is so ugly and tricky, there's a simple loop-oriented language extension out there called OpenMP, designed to make it substantially easier to write multithreaded code.

The basic idea is you take what looks like an ordinary sequential loop, like:

    for (int i=0;i<n;i++) do_fn(i);

And you add a little note to the compiler saying it's a parallel for loop, so if you've got six CPUs, the iterations should be spread across the CPUs. The particular syntax they chose is a "#pragma" statement, with the "omp" prefix:

#pragma omp parallel for
    for (int i=0;i<n;i++) do_fn(i);

Granted, this pragma costs roughly 5,000ns of overhead per thread, so it won't help tiny loops, but it can really help long loops. After the for loop completes, all the threads go back to waiting, except the main thread, which continues with the (otherwise serial) program. Note that this is still shared-memory threaded programming, so global variables are still (dangerously) shared by default!

This means if you're totalling up values in a shared variable, multiple threads will overwrite each other's updates to the total. But OpenMP has a "reduction" feature that gives each thread its own subtotal, then uses a lock or atomic to add up the subtotals across the threads.

int total=0;
#pragma omp parallel for  reduction(+:total)
   for (int i=0;i<n;i++) total+=fn(i);
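
Wrapped into a complete function, the reduction might look like the sketch below. Here fn is a made-up stand-in (the remainder mod 7) and total_up is a hypothetical name. If OpenMP isn't enabled, the compiler just ignores the pragma and the loop runs serially with the same answer.

```cpp
// Hypothetical stand-in for fn.
int fn(int i) { return i % 7; }

int total_up(int n) {
	int total = 0;
	// Each thread gets a private subtotal; OpenMP adds them into total at the end.
#pragma omp parallel for reduction(+:total)
	for (int i = 0; i < n; i++) total += fn(i);
	return total;
}
```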

Unlike bare threads, with OpenMP:

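The compiler's OpenMP runtime decides how many threads to create (typically the number of CPUs, unless you override it, for example with the OMP_NUM_THREADS environment variable).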
The compiler moves the guts of your loop off into a separate function.
The compiler creates and deallocates the threads.
The compiler can help you find and fix memory race conditions.

Here's how you enable OpenMP in various compilers. All recent compilers support OpenMP, including Visual C++ 2005 and later (but NOT Express!), Intel C++ version 9.0 and later, and gcc version 4.2 and later, although many of them only enable it via an option like "/openmp" (Microsoft) or "-fopenmp" (gcc).

Here's the idiomatic OpenMP program: slap "#pragma omp parallel for" in front of your main loop. You're done!
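
A minimal sketch of that idiom, assuming a loop where each iteration writes only its own array element (fill_squares is a hypothetical name, and squaring stands in for the real loop body):

```cpp
#include <vector>

std::vector<int> fill_squares(int n) {
	std::vector<int> out(n);
	// Iterations are independent: iteration i touches only out[i].
#pragma omp parallel for
	for (int i = 0; i < n; i++) out[i] = i * i;
	return out;
}
```

Because no two iterations touch the same element, this loop needs no locks, atomics, or reduction clause. Built without an option like "-fopenmp", the pragma is ignored and the loop simply runs serially.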

Here's a more complex "who am I"-style OpenMP program from Lawrence Livermore National Labs. Note the compiler "#pragma" statements!

CS 301 Lecture Note, Dr. Orion Lawlor, UAF Computer Science Department.