Multicore Programming with Threads

CS 641 Lecture, Dr. Lawlor

"There is always a well-known solution to every human problem — neat, plausible, and wrong."
    - H.L. Mencken, 1917 (not actually discussing threads.  Probably.)

To get into the proper revolutionary mindset, read:

The Free Lunch is Over: A Fundamental Turn Toward Concurrency in Software
   written by Herb Sutter, smart Microsoft guy on the C++ standards committee


Creating Threads

Just like creating a new process, creating a thread is an OS call.


                     Windows                       UNIX: Linux, Mac OS X
User-Level Threads   fibers & others               makecontext/swapcontext (example) & others
Kernel Threads       CreateThread (example)        pthread_create (example)
                                                   Seems to cost about 10us nowadays.
Processes            CreateProcess (example)       fork (example)
                                                   Seems to cost about 300us nowadays.
Virtual Machines     varies (QEMU, VMWare, etc.)

See all the example code as a (Directory, Zip, Tar-gzip).
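To make the user-level row concrete, here's a minimal sketch of makecontext/swapcontext (my own toy example, separate from the linked ones; these calls are obsolescent in POSIX but still work fine on Linux).  The "fiber" gets its own stack, and control bounces between it and main with no kernel involvement:

#include <ucontext.h>
#include <iostream>

ucontext_t mainCtx, fiberCtx;
char fiberStack[64*1024]; /* the fiber's private stack */

void fiberFn(void) {
	std::cout<<"fiber: hello\n";
	swapcontext(&fiberCtx,&mainCtx); /* yield back to main--no kernel call needed */
	std::cout<<"fiber: resumed\n";
} /* returning here swaps to uc_link (mainCtx) */

int main(void) {
	getcontext(&fiberCtx); /* initialize the context struct */
	fiberCtx.uc_stack.ss_sp=fiberStack;
	fiberCtx.uc_stack.ss_size=sizeof(fiberStack);
	fiberCtx.uc_link=&mainCtx; /* where to go when fiberFn returns */
	makecontext(&fiberCtx,fiberFn,0);
	swapcontext(&mainCtx,&fiberCtx); /* run the fiber until it yields */
	std::cout<<"main: fiber yielded\n";
	swapcontext(&mainCtx,&fiberCtx); /* let the fiber finish */
	std::cout<<"main: fiber finished\n";
	return 0;
}

Because no system call is involved, a swapcontext switch is typically far cheaper than creating a kernel thread--which is the whole point of user-level threads.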


Here's an example of creating threads on UNIX:
#include <pthread.h>
#include <iostream>

volatile int delay=0;
void doWork(const char *caller) {
	std::cout<<caller<<"\n";
	for (int i=0;i<1000*1000*1;i++) delay++; /* spin a while, so the threads visibly overlap */
}
void *fnA(void *arg) { /* thread A's work (run by the original thread) */
	for (int i=0;i<10;i++) doWork("A");
	return 0;
}
void *fnB(void *arg) { /* thread B's work (run by the new thread) */
	for (int i=0;i<10;i++) doWork("B");
	return 0;
}
int foo(void) {
	pthread_t B;
	pthread_create(&B,0,fnB,0); /* start a new kernel thread running fnB */
	fnA(0); /* meanwhile, run fnA ourselves */
	void *Bret;
	pthread_join(B,&Bret); /* wait for thread B to finish */
	return delay;
}

(Try this in NetRun now!)
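Outside NetRun, remember to link the pthread library when you build this--for example "g++ threads.cpp -pthread" (the filename here is just a placeholder).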

Annoying things about this include: the two threads' output interleaves unpredictably from run to run, and the pthread interface forces you to shoehorn arguments and return values through void pointers.

Getting the Right Answer with Threads

So basically a kernel thread is just a fancy way to run one of your subroutines simultaneously with all the other subroutines.  A kernel thread can still call other functions, access your global variables, or use anything it can find that belongs to other threads. 

This shared access to common variables immediately introduces the many problems of "thread safety".   For example, consider a piece of code like this:
int shared=0;
void inc(void) {
	int i=shared; /* read the shared value... */
	i++;          /* ...increment our private copy... */
	shared=i;     /* ...and write it back */
}
If two threads try to call "inc" repeatedly, the two executions might interleave like this:
Thread A                                Thread B
int i=shared; // i==0
i++; // i==1
// hit interrupt, switch to B
                                        int i=shared; // i==0 (again)
                                        i++; // i==1
                                        shared=i; // shared==1
                                        int i=shared; // i==1
                                        i++; // i==2
                                        shared=i; // shared==2
                                        int i=shared; // i==2
                                        i++; // i==3
                                        shared=i; // shared==3
                                        // hit interrupt, switch back to A
shared=i; // i is still 1, so shared==1!

Uh oh!  When we switch back to thread A, the value stored in "i" isn't right anymore. It's an older copy of a shared global variable, stored in thread A's stack or registers.  So thread A will happily overwrite the shared variable with its old version, causing all of B's work to be lost!

Here's an executable example of this problem.  Both threads are trying to increment "sum": 3,000 increments each, so "sum" should be 6000 at the end.  But with optimization on, they both store a copy of "sum" in a register, so one thread overwrites the other's work when it writes the modified value back to "sum", and you (usually!) end up with the totally-wrong value 3000:
#include <pthread.h>

int sum=0; /*<- Careful! This variable is shared between threads! */
void doWork(void) {
	for (int i=0;i<1000;i++)
		sum++; /* unprotected read-modify-write of shared data: a race! */
}
void *fnA(void *arg) {
	for (int i=0;i<3;i++) doWork();
	return 0;
}
void *fnB(void *arg) {
	for (int i=0;i<3;i++) doWork();
	return 0;
}
int foo(void) {
	sum=0;
	pthread_t B;
	pthread_create(&B,0,fnB,0); /* thread B increments sum... */
	fnA(0); /* ...while we increment it too */
	void *Bret;
	pthread_join(B,&Bret);
	return sum; /* should be 6000, but usually isn't! */
}
(executable NetRun link)

Generally speaking, parallel race conditions can only happen when:
More than one thread writes to the same data at the same time.

This statement tells you all the ways to eliminate data race conditions:
More than one thread...
Only one thread?  No problem.  This is the solution taken by most modern software.
... writes to ...
No writes?  No problem.  "RAR" (read-after-read) isn't a dependency, so read-only data structures are fine.  As soon as anybody does any writes, you're in dependency land.
... the same data ...
Separate data?  No problem.  One common fix is to "privatize" all your data--make a private copy for each thread to use, then combine the copies at the end (see the sketch at the end of these notes).
... at the same time.
There are several cool ways to separate different threads' accesses in time.  Many architectures support "atomic" instructions that the hardware guarantees will get the right answer, typically by excluding other cores' accesses.   All thread libraries support a mutual exclusion primitive or "lock", typically built in software from atomic instructions; see the sketch just below.
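For example, here's one way to repair the racy "sum" example above with a pthread mutex (a minimal sketch; the lock name sumLock is my own, and only doWork changes--the rest of the program stays as before):

#include <pthread.h>

int sum=0;
pthread_mutex_t sumLock=PTHREAD_MUTEX_INITIALIZER; /* guards every access to sum */

void doWork(void) {
	for (int i=0;i<1000;i++) {
		pthread_mutex_lock(&sumLock); /* exclude all other threads... */
		sum++; /* ...so this read-modify-write can't interleave with theirs */
		pthread_mutex_unlock(&sumLock);
	}
}

With this change foo() reliably returns 6000, at the cost of taking and releasing the lock 6,000 times.  On GCC, the builtin __sync_fetch_and_add(&sum,1) is the atomic-instruction version of the same fix: no lock object at all, just a hardware-arbitrated add.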

Unfortunately, there's a false syllogism in people's minds, where:
   multicore => threads => locks
That is, people assume the only way to write multicore code is using threads, and the only way to avoid problems with threads is to add locks.  Often, locks are a poor solution--they have fairly high overhead (30ns, about the cost of ten function calls!), they reduce concurrency by design (lock contention), and they suffer from their own strange corner cases (deadlock).  Personally, I try to make private data first, then read-only data structures, and finally add locks if no other solution is possible.
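"Private data" deserves a sketch of its own, since it's the fix I reach for first.  Here's the "sum" example again with each thread accumulating into its own counter (partA and partB are my own names; the structure otherwise follows the example above).  The counters are only combined after pthread_join, so no two threads ever write the same data:

#include <pthread.h>

int partA=0, partB=0; /* one private counter per thread: no shared writes, no race */
/* (in production code you'd pad these onto separate cache lines to avoid false sharing) */
void doWork(int *part) {
	for (int i=0;i<1000;i++)
		(*part)++; /* each thread touches only its own counter */
}
void *fnB(void *arg) {
	for (int i=0;i<3;i++) doWork(&partB);
	return 0;
}
int foo(void) {
	partA=partB=0;
	pthread_t B;
	pthread_create(&B,0,fnB,0);
	for (int i=0;i<3;i++) doWork(&partA); /* our own work, on our own counter */
	void *Bret;
	pthread_join(B,&Bret); /* after the join, B's writes are safely visible */
	return partA+partB; /* combine the private copies: always 6000 */
}

Notice there's no lock and no atomic instruction anywhere: the pthread_join is the only synchronization, and it's one we needed anyway.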