Multicore Programming with Threads

CS 641 Lecture, Dr. Lawlor

"There is always a well-known solution to every human problem — neat, plausible, and wrong."
    - H.L. Mencken, 1917 (not actually discussing threads.  Probably.)

To get into the proper revolutionary mindset, read:

The Free Lunch is Over: A Fundamental Turn Toward Concurrency in Software
   written by Herb Sutter, smart Microsoft guy on the C++ standards committee


Creating Threads

Just like creating a new process, creating a thread is an OS call.


                     Windows                       UNIX: Linux, Mac OS X
User-Level Threads   fibers & others               makecontext/swapcontext (example) & others
Kernel Threads       CreateThread (example)        pthread_create (example)
                                                   Seems to cost about 10us nowadays.
Processes            CreateProcess (example)       fork (example)
                                                   Seems to cost about 300us nowadays.
Virtual Machines     varies (QEMU, VMWare, etc.)

See all the example code as a (Directory, Zip, Tar-gzip).
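To make the user-level row concrete, here's a minimal sketch of makecontext/swapcontext (my own toy example, separate from the linked ones; these calls are obsolescent in POSIX but still work fine on Linux).  The "fiber" gets its own stack, and control bounces between it and main with no kernel involvement:

#include <ucontext.h>
#include <iostream>

ucontext_t mainCtx, fiberCtx;
char fiberStack[64*1024]; /* the fiber's private stack */

void fiberFn(void) {
	std::cout<<"fiber: hello\n";
	swapcontext(&fiberCtx,&mainCtx); /* yield back to main--no kernel call needed */
	std::cout<<"fiber: resumed\n";
} /* returning here swaps to uc_link (mainCtx) */

int main(void) {
	getcontext(&fiberCtx); /* initialize the context struct */
	fiberCtx.uc_stack.ss_sp=fiberStack;
	fiberCtx.uc_stack.ss_size=sizeof(fiberStack);
	fiberCtx.uc_link=&mainCtx; /* where to go when fiberFn returns */
	makecontext(&fiberCtx,fiberFn,0);
	swapcontext(&mainCtx,&fiberCtx); /* run the fiber until it yields */
	std::cout<<"main: fiber yielded\n";
	swapcontext(&mainCtx,&fiberCtx); /* let the fiber finish */
	std::cout<<"main: fiber finished\n";
	return 0;
}

Because no system call is involved, a swapcontext switch is typically far cheaper than creating a kernel thread--which is the whole point of user-level threads.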


Here's an example of creating threads on UNIX:
#include <pthread.h>
#include <iostream>

volatile int delay=0;
void doWork(const char *caller) {
	std::cout<<caller<<"\n";
	for (int i=0;i<1000*1000*1;i++) delay++; /* spin a while, so the threads visibly overlap */
}
void *fnA(void *arg) { /* thread A's work (run by the original thread) */
	for (int i=0;i<10;i++) doWork("A");
	return 0;
}
void *fnB(void *arg) { /* thread B's work (run by the new thread) */
	for (int i=0;i<10;i++) doWork("B");
	return 0;
}
int foo(void) {
	pthread_t B;
	pthread_create(&B,0,fnB,0); /* start a new kernel thread running fnB */
	fnA(0); /* meanwhile, run fnA ourselves */
	void *Bret;
	pthread_join(B,&Bret); /* wait for thread B to finish */
	return delay;
}

(Try this in NetRun now!)
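Outside NetRun, remember to link the pthread library when you build this--for example "g++ threads.cpp -pthread" (the filename here is just a placeholder).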

Annoying things about this include: the two threads' output interleaves unpredictably from run to run, and the pthread interface forces you to shoehorn arguments and return values through void pointers.

Getting the Right Answer with Threads

So basically a kernel thread is just a fancy way to run one of your subroutines simultaneously with all the other subroutines.  A kernel thread can still call other functions, access your global variables, or use anything it can find that belongs to other threads. 

This shared access to common variables immediately introduces the many problems of "thread safety".   For example, consider a piece of code like this:
int shared=0;
void inc(void) {
	int i=shared; /* read the shared value... */
	i++;          /* ...increment our private copy... */
	shared=i;     /* ...and write it back */
}
If two threads try to call "inc" repeatedly, the two executions might interleave like this:
Thread A                                Thread B
int i=shared; // i==0
i++; // i==1
// hit interrupt, switch to B
                                        int i=shared; // i==0 (again)
                                        i++; // i==1
                                        shared=i; // shared==1
                                        int i=shared; // i==1
                                        i++; // i==2
                                        shared=i; // shared==2
                                        int i=shared; // i==2
                                        i++; // i==3
                                        shared=i; // shared==3
                                        // hit interrupt, switch back to A
shared=i; // i is still 1, so shared==1!

Uh oh!  When we switch back to thread A, the value stored in "i" isn't right anymore. It's an older copy of a shared global variable, stored in thread A's stack or registers.  So thread A will happily overwrite the shared variable with its old version, causing all of B's work to be lost!

Here's an executable example of this problem.  Both threads are trying to increment "sum": 3,000 increments each, so "sum" should be 6000 at the end.  But with optimization on, they both store a copy of "sum" in a register, so one thread overwrites the other's work when it writes the modified value back to "sum", and you (usually!) end up with the totally-wrong value 3000:
#include <pthread.h>

int sum=0; /*<- Careful! This variable is shared between threads! */
void doWork(void) {
	for (int i=0;i<1000;i++)
		sum++; /* unprotected read-modify-write of shared data: a race! */
}
void *fnA(void *arg) {
	for (int i=0;i<3;i++) doWork();
	return 0;
}
void *fnB(void *arg) {
	for (int i=0;i<3;i++) doWork();
	return 0;
}
int foo(void) {
	sum=0;
	pthread_t B;
	pthread_create(&B,0,fnB,0); /* thread B increments sum... */
	fnA(0); /* ...while we increment it too */
	void *Bret;
	pthread_join(B,&Bret);
	return sum; /* should be 6000, but usually isn't! */
}
(executable NetRun link)

Generally speaking, parallel race conditions can only happen when:
More than one thread writes to the same data at the same time.

This statement tells you all the ways to eliminate data race conditions:
More than one thread...
Only one thread?  No problem.  This is the solution taken by most modern software.
... writes to ...
No writes?  No problem.  "RAR" (read-after-read) isn't a dependency, so read-only data structures are fine.  As soon as anybody does any writes, you're in dependency land.
... the same data ...
Separate data?  No problem.  One common fix is to "privatize" all your data--make a private copy for each thread to use, then combine the copies at the end (see the sketch at the end of these notes).
... at the same time.
There are several cool ways to separate different threads' accesses in time.  Many architectures support "atomic" instructions that the hardware guarantees will get the right answer, typically by excluding other cores' accesses.   All thread libraries support a mutual exclusion primitive or "lock", typically built in software from atomic instructions; see the sketch just below.
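For example, here's one way to repair the racy "sum" example above with a pthread mutex (a minimal sketch; the lock name sumLock is my own, and only doWork changes--the rest of the program stays as before):

#include <pthread.h>

int sum=0;
pthread_mutex_t sumLock=PTHREAD_MUTEX_INITIALIZER; /* guards every access to sum */

void doWork(void) {
	for (int i=0;i<1000;i++) {
		pthread_mutex_lock(&sumLock); /* exclude all other threads... */
		sum++; /* ...so this read-modify-write can't interleave with theirs */
		pthread_mutex_unlock(&sumLock);
	}
}

With this change foo() reliably returns 6000, at the cost of taking and releasing the lock 6,000 times.  On GCC, the builtin __sync_fetch_and_add(&sum,1) is the atomic-instruction version of the same fix: no lock object at all, just a hardware-arbitrated add.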

Unfortunately, there's a false syllogism in people's minds, where:
   multicore => threads => locks
That is, people assume the only way to write multicore code is using threads, and the only way to avoid problems with threads is to add locks.  Often, locks are a poor solution--they have fairly high overhead (30ns, about the cost of ten function calls!), they reduce concurrency by design (lock contention), and they suffer from their own strange corner cases (deadlock).  Personally, I try to make private data first, then read-only data structures, and finally add locks if no other solution is possible.
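"Private data" deserves a sketch of its own, since it's the fix I reach for first.  Here's the "sum" example again with each thread accumulating into its own counter (partA and partB are my own names; the structure otherwise follows the example above).  The counters are only combined after pthread_join, so no two threads ever write the same data:

#include <pthread.h>

int partA=0, partB=0; /* one private counter per thread: no shared writes, no race */
/* (in production code you'd pad these onto separate cache lines to avoid false sharing) */
void doWork(int *part) {
	for (int i=0;i<1000;i++)
		(*part)++; /* each thread touches only its own counter */
}
void *fnB(void *arg) {
	for (int i=0;i<3;i++) doWork(&partB);
	return 0;
}
int foo(void) {
	partA=partB=0;
	pthread_t B;
	pthread_create(&B,0,fnB,0);
	for (int i=0;i<3;i++) doWork(&partA); /* our own work, on our own counter */
	void *Bret;
	pthread_join(B,&Bret); /* after the join, B's writes are safely visible */
	return partA+partB; /* combine the private copies: always 6000 */
}

Notice there's no lock and no atomic instruction anywhere: the pthread_join is the only synchronization, and it's one we needed anyway.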