Threads: Multicore Parallel Programming in Shared Memory

The typical way C++ code runs is sequential: when we call a function, the caller stops running, the function runs until it returns, and only then does the caller resume.  Even if you have multiple programs, on each core the OS typically runs only one program at a time, switching between programs when interrupts happen, but never running two programs on the same core at the same time.

The basic operation of threads is fundamentally different: if we start a thread, the thread starts running while we keep running. If you have enough cores, both threads will run at the same time, until the program exits or you "join" the threads together again.  Unlike with isolated programs, threads share the same memory space, so two threads can read and write anything in each other's memory at the same time.  Threads each have their own registers and stack areas, but the stacks are all in the same memory space.

C++11 finally added a thread library, in <thread>.  This makes it easy to create threads in a portable way, without needing to call the operating system directly to make Windows kernel threads or UNIX pthreads.

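#include <iostream>
#include <thread>

volatile int delay=0;
void do_work(char where) {
	for (int i=0;i<10;i++) {
		std::cout<<where<<i<<"\n"; // letter, number, and newline are separate writes
		for (int j=0;j<1000;j++) delay++; // burn a little time between prints
	}
}

void worker_thread(void) {
	do_work('B');
}

void foo(void) {
	std::thread t(worker_thread); // B starts running here...
	do_work('A'); // ...while we keep running as A
	t.join(); // wait for B to finish
}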


If the function takes arguments, you create the thread by first passing the name of the function to run, then the function arguments.  For example, here we're making a thread to call puts (which is probably a waste of a thread!).

	std::thread t(puts,"I/O from a thread!");
	t.join(); // a std::thread must be joined (or detached) before it is destroyed
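
The same pattern works for your own functions.  One catch: std::thread copies its arguments, so a function that takes a reference needs std::ref.  Here's a minimal sketch; add_range and sum_in_thread are made-up names for illustration, not part of any library.

```cpp
#include <functional> // std::ref
#include <thread>

// Hypothetical worker: adds lo, lo+1, ..., hi-1 into result.
void add_range(int lo,int hi,long &result) {
	for (int i=lo;i<hi;i++) result+=i;
}

long sum_in_thread(int n) {
	long result=0;
	// Function name first, then its arguments; std::ref passes result by reference.
	std::thread t(add_range,0,n,std::ref(result));
	t.join(); // after join, the thread is done and result is safe to read
	return result;
}
```

Reading result before the join would be a race: the thread might still be writing it.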


Threads are a very common way to handle:

However, notice that in the A and B program above, the prints from the two threads sometimes overlap, and run in an arbitrary order.

Note the weird blank lines that result when A's printouts happen between B's number and newline, or vice versa.

PROBLEM: Parallel access to any single resource, like cout, results in resource contention.  Contention can lead to wrong answers, bad performance, or even both at once.

SOLUTION(I): Ignore this, and get the wrong answer, slowly. You will also occasionally crash onstage.
SOLUTION(F): Forget parallelism.  You get the right answer, but only using one core.  This has been the default solution from 1950 until 2005, and it's the one we still teach!
SOLUTION(A): Use an atomic operation to access the resource.  This is typically only possible for single-int or single-float operations, and has some overhead on CPUs, but it's possible for the hardware to make this efficient, like on most GPUs.
SOLUTION(C): Add a mutex, a mutual exclusion device (AKA lock) to control access to the single resource.  This gives you the right answer, but costs performance--the whole point of the critical section (the code that runs while holding the lock) is to reduce parallelism, and locking and unlocking costs at least a few function calls.  In C++, mutex supports manual .lock() and .unlock() calls, but for exception safety you shouldn't use them directly.  Instead, use an RAII lock_guard:

#include <mutex>
std::mutex big_lock;
... inside your function ...
    std::lock_guard<std::mutex> guard(big_lock); // lock on ctor, unlock on dtor
    ... code here is protected by the big_lock ...

SOLUTION(P): Parallelize (or "privatize") all resources--then there aren't any sharing problems because nothing is shared.  This is the best solution, but making several copies of hardware or software can be expensive, and usually requires changing a lot of code.  
SOLUTION(H): Hybrid: use any of the above where it's appropriate.
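
To make SOLUTION(A) and SOLUTION(C) concrete, here's a small sketch that updates one counter atomically and another under a lock.  The names hits, locked_total, count_work, and run_counters are invented for this example.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

std::atomic<int> hits(0); // SOLUTION(A): a single int, updated atomically
std::mutex total_lock;    // SOLUTION(C): a lock guarding locked_total
long locked_total=0;

void count_work(int n) {
	for (int i=0;i<n;i++) {
		hits++; // atomic read-modify-write, no lock needed
		std::lock_guard<std::mutex> guard(total_lock); // unlocks at the end of each iteration
		locked_total+=i;
	}
}

// Hypothetical driver: runs count_work on nthreads threads at once.
void run_counters(int nthreads,int n) {
	hits=0; locked_total=0; // reset before any threads exist
	std::vector<std::thread> threads;
	for (int t=0;t<nthreads;t++) threads.emplace_back(count_work,n);
	for (std::thread &th : threads) th.join();
}
```

Both counters come out right no matter how the threads interleave, but every iteration pays for the atomic and the lock, which is exactly the performance cost described above.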

For example, we can eliminate contention on the single "cout" variable by writing the results into an array of strings.  Generally, arrays or vectors are nice, because multicore machines do a decent job at simultaneous access to different places in memory, but do a bad job at simultaneous access to a single shared data structure like a file or network device: accessing a single resource is likely to be slow if it even works at all!
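
The array-of-strings idea might look like the sketch below.  It assumes the two-thread A and B setup from earlier; do_work_private and run_private are hypothetical names.

```cpp
#include <string>
#include <thread>
#include <vector>

// Each thread appends only to its own string, so nothing is shared.
void do_work_private(char where,std::string *out) {
	for (int i=0;i<10;i++) {
		*out += where;
		*out += std::to_string(i);
		*out += "\n";
	}
}

std::vector<std::string> run_private(void) {
	std::vector<std::string> results(2); // one slot per thread
	std::thread t(do_work_private,'B',&results[1]);
	do_work_private('A',&results[0]);
	t.join();
	return results; // the caller can print these one at a time, from one thread
}
```

Because all the printing happens after the join, each thread's output stays in one piece and the order is up to the caller.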

Multicore via OpenMP

Because threaded programming is so ugly and tricky, there's a simple loop-oriented language extension out there called OpenMP, designed to make it substantially easier to write multithreaded code.

The basic idea is you take what looks like an ordinary sequential loop, like:

    for (int i=0;i<n;i++) do_fn(i);

And you add a little note to the compiler saying it's a parallel for loop, so if you've got six cores, the iterations should be spread across the cores.  The particular syntax they chose is a "#pragma" statement, with the "omp" prefix:

#pragma omp parallel for
    for (int i=0;i<n;i++) do_fn(i);

Granted, this line has like a 5,000ns/thread overhead, so it won't help tiny loops, but it can really help long loops.  After the for loop completes, all the threads go back to waiting, except the master thread, which continues with the (otherwise serial) program.  Note that this is still shared-memory threaded programming, so any variable declared outside the loop, including globals, is still (dangerously) shared by default!

This means if you're totalling up values, the shared total will get overwritten by multiple threads.  But OpenMP has a "reduction" clause that gives each thread its own private subtotal, and then uses a lock or atomic to combine the subtotals across the threads.

int total=0;
#pragma omp parallel for  reduction(+:total)
   for (int i=0;i<n;i++) total+=fn(i);
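
Conceptually, the reduction behaves roughly like this hand-written version with bare threads: each thread adds into a private subtotal, and the subtotals are combined once at the end.  This is a sketch of the idea only, not how any particular OpenMP runtime implements it; fn here is a stand-in, and total_by_hand and nthreads are made-up names.

```cpp
#include <thread>
#include <vector>

int fn(int i) { return i*i; } // stand-in for the per-iteration work

int total_by_hand(int n,int nthreads) {
	std::vector<int> subtotal(nthreads,0); // one private subtotal per thread
	std::vector<std::thread> threads;
	for (int t=0;t<nthreads;t++)
		threads.emplace_back([&subtotal,t,n,nthreads]() {
			int mine=0; // accumulate in a local, not in shared memory
			for (int i=t;i<n;i+=nthreads) mine+=fn(i); // thread t takes every nthreads-th iteration
			subtotal[t]=mine;
		});
	int total=0;
	for (int t=0;t<nthreads;t++) {
		threads[t].join();
		total+=subtotal[t]; // combine only after that thread has finished
	}
	return total;
}
```

In this version the combining happens after each join, so it needs no lock at all; the price is all the extra code that the one-line reduction clause hides.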

Unlike bare threads, with OpenMP:

Here's how you enable OpenMP in various compilers: gcc needs the -fopenmp flag when compiling and linking, and Visual C++ needs /openmp.  Visual C++ 2005 and later (but NOT Express!), Intel C++ version 9.0, and gcc version 4.2 all support OpenMP, although earlier versions do not!

Here's the idiomatic OpenMP program: slap "#pragma omp parallel for" in front of your main loop.  You're done!
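
As a minimal sketch of that pattern, assume the main loop fills an array where each iteration touches only its own element.  The function name fill_roots is hypothetical.

```cpp
#include <cmath>
#include <vector>

// Hypothetical main loop: one independent iteration per array element.
std::vector<double> fill_roots(int n) {
	std::vector<double> out(n);
#pragma omp parallel for
	for (int i=0;i<n;i++) out[i]=std::sqrt((double)i); // each i writes only out[i]
	return out;
}
```

If OpenMP isn't enabled, the compiler just ignores the pragma and you get the ordinary sequential loop, with the same answer.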

Here's a more complex "who am I"-style OpenMP program from Lawrence Livermore National Labs.  Note the compiler "#pragma" statements!
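
For a feel of the "who am I" idea, here's a much smaller sketch (this is not the LLNL code).  It records which thread ran each loop iteration, using the standard omp_get_thread_num call; who_ran is a made-up name, and the _OPENMP checks are only there so it still compiles when OpenMP is turned off.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h> // OpenMP runtime calls, only available when OpenMP is enabled
#endif

// Returns the number of the thread that ran each iteration.
std::vector<int> who_ran(int n) {
	std::vector<int> owner(n,-1);
#pragma omp parallel for
	for (int i=0;i<n;i++) {
#ifdef _OPENMP
		owner[i]=omp_get_thread_num(); // 0 is the master thread
#else
		owner[i]=0; // compiled without OpenMP: one thread does everything
#endif
	}
	return owner;
}
```

Printing the returned vector shows how the iterations got split across threads; the exact split depends on the compiler, the schedule, and how many cores you have.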

CS 301 Lecture Note, Dr. Orion Lawlor, UAF Computer Science Department.