Parallel Processing with Threads & Processes
CS 441 Lecture, Dr. Lawlor
Kinds of Parallel Computers
- One ALU, one set of registers: serial computer (single processor).
- One ALU, several sets of registers: Symmetric Multi-Threading
(SMT) or HyperThreading. This is only a tiny amount of additional
chip area, but increases ALU utilization way more than wider
superscalar execution. ALU contention limits the parallelism to
perhaps 4-way on modern CPUs, but can reach 64-fold on GPUs.
- Several different CPUs, one shared RAM area: Symmetric
Multi-Processing (SMP) or Multi-Core. Two cores are entirely
standard today, 4 cores are on the way, and commercial SMP nodes with
16- or 32-way access exist. SMP can also combine with SMT for
absurd numbers of threads: e.g., a typical GPU is 128-way SMP, with a
total of 8192-way SMT.
- Several CPUs, each with local RAM, but with a nonlocal RAM
interconnect: Non-Uniform Memory Architecture (NUMA). Unlike SMP,
where all RAM is equal, in a NUMA, it's faster to go to your own RAM
(60ns) compared to some other CPU's RAM (1us).
- Several complete nodes, connected with network: Cluster or
Massively Parallel Processor (MPP). This is the only architecture
known to scale to hundreds of thousands of CPUs--nothing is shared
except the network. If all the computers are running the same
program, as is typical to solve tightly coupled problems, it's called
Single Program Multiple Data (SPMD). The Internet is a really
loosely coupled MPP, because all the computers on the internet are
running different programs at different times for different reasons.
Kinds of Program Control Flows
There are quite a few different levels of saving and restoring that you
can design. Existing levels nest inside one another, and from least to most
context they include:
- Functions, where things execute one at a time. Hopefully you're familiar with normal functions.
- Registers: by convention, some are saved (must restore value for caller), some are scratch (can freely overwrite)
- Stack: organized by convention into "stack frames"
- Memory: shared between all functions
- Files: shared between all functions
- User-level threads,
where you save and restore processor registers yourself, usually via an
explicit "swap" subroutine call. Also called "coroutines" or
"user/userspace threads", or (by Sun Java folks) "green threads".
Check out
my little "minithread" example code
(Directory, Zip, Tar-gzip).
The key routine is just the "mth_swap" subroutine, which saves an old
set of registers and loads up a new one, and the core of this is just:
mov [eax],esp ; Save the old stack pointer
mov esp, [ecx] ; Load up the new stack pointer
Typical examples include Netscape Threads, or any of a thousand tiny projects mostly relying on setjmp/longjmp or "makecontext/swapcontext".
Each thread has its own stack, usually just allocated with "malloc",
but shares the rest of memory with other user-level threads.
- Registers: explicitly saved and restored by a "swap" routine.
- Stack: each user-level thread has a separate stack region in memory.
- Memory & Files: shared between all threads.
- Kernel-level threads,
or just "threads", are where the OS saves and restores processor
registers to switch threads at interrupt time. Also called
"native threads". The most common example is pthreads.
There's actually a hybrid of kernel and user-level threads where you
save and restore processor registers yourself (not in the OS), but you
do it inside a signal handler during an interrupt.
- Registers: implicitly saved and restored by the OS when it decides to switch threads.
- Stack: separate regions of memory for each thread.
- Memory & Files: usually shared between all threads.
- Processes, where
the OS swaps out all of memory and all I/O connections in addition to
the registers. This is what you usually think of as a "program"--we covered it last class.
- Registers: implicitly saved and restored by the OS when it decides to switch processes.
- Stack & memory: completely separate memory area for each process.
- Files: open files are local to a process (I open a file, it's
not open for you), but data in files is shared between all processes.
- Virtual Machines, where the
OS itself and everything in it is swapped out. QEMU is a pretty typical
example. A virtual machine is usually treated as an ordinary
process inside the outer ("host") OS.
- Registers, stack, memory, files, disks, and even CPUs are all private to a virtual machine.
All these levels actually nest. So a virtual machine might
contain several processes, with the emulated OS switching between
them.
Each process might contain several kernel threads, with the OS again
switching between them. Each kernel thread might contain several
user-level threads, which user code explicitly switches between.
Each user-level thread contains several functions, which call one
another.
Creating Different Program Control Flows
See all the example code as a
(Directory, Zip, Tar-gzip).
You should become familiar with both Windows and UNIX flavors of thread and process creation.
Simple Kernel Threads
You probably want to make the OS
kernel do the hard work of creating threads, which are then called
"kernel threads". On UNIX systems, you create a new kernel thread
by calling "pthread_create", which is in "#include <pthread.h>"
and the "-lpthread" library. You just pass pthread_create the
function to run in the thread, and the kernel will then run that
function, switching back and forth between threads at
basically random times (100 to 1000 times per second, usually).
#include <iostream>
#include <pthread.h>
void doWork(const char *caller) {
std::cout<<caller<<"\n";
}
void *fnA(void *arg) {
for (int i=0;i<3;i++) doWork("A");
return 0;
}
void *fnB(void *arg) {
for (int i=0;i<3;i++) doWork("B");
return 0;
}
int foo(void) {
pthread_t B;
pthread_create(&B,0,fnB,0); /* make a thread B, to run fnB */
fnA(0);
void *Bret;
pthread_join(B,&Bret); /* wait until B finishes */
return 0;
}
(executable NetRun link)
To illustrate the way the kernel randomly switches between these functions, run
this thing several times--here's what I got on several different runs:
Run 1:
A A A B B B
Run 2:
AB
AB
AB
Run 3:
B B B A A A
Run 4:
AB A A
B B
Run 5:
AB
B B A A
That is, the kernel runs A for a while, then B for a while, then back
to A. You can't tell when a switch is going to happen. Note that the kernel sometimes switches to B between when A
prints the letter 'A' and when it prints the newline immediately after
it in the string "A\n"!
The danger of threads
So basically a kernel thread is just a fancy way to run one of your subroutines
simultaneously with all the other subroutines. A kernel thread can
still call other functions, access your global variables, or use anything it can find that belongs
to other threads.
This shared access to common variables
immediately introduces the many problems of "thread
safety". For example, consider a piece of code like this:
int shared=0;
void inc(void) {
int i=shared;
i++;
shared=i;
}
If two threads try to call "inc" repeatedly, the two executions might interleave like this:
Thread A:
  int i=shared; // i==0
  i++;          // i==1
  // hit interrupt. switch to B
Thread B:
  int i=shared; // i==0 (again)
  i++;          // i==1
  shared=i;     // shared==1
  int i=shared; // i==1
  i++;          // i==2
  shared=i;     // shared==2
  int i=shared; // i==2
  i++;          // i==3
  shared=i;     // shared==3
  // hit interrupt, switch back to A
Thread A:
  shared=i;     // i is still 1, so shared==1!
Uh oh! When we switch back to thread A, the value stored in "i"
isn't right anymore. It's an older copy of a shared global variable,
stored in thread A's stack or registers. So thread A will happily
overwrite the shared variable with its old version, causing all of B's
work to be lost!
Here's an executable example of this problem. Both threads are
trying to increment "sum". They do this 6,000 times, so "sum"
should be 6000 at the end. But with optimization on, they both
store a copy of "sum" in a register, so one guy overwrites the other
guy's work when they write the modified value back to "sum", and you
(usually!) end up with the totally-wrong value 3000:
#include <pthread.h>
int sum=0; /*<- Careful! This variable is shared between threads! */
void doWork(void) {
for (int i=0;i<1000;i++)
sum++;
}
void *fnA(void *arg) {
for (int i=0;i<3;i++) doWork();
return 0;
}
void *fnB(void *arg) {
for (int i=0;i<3;i++) doWork();
return 0;
}
int foo(void) {
sum=0;
pthread_t B;
pthread_create(&B,0,fnB,0);
fnA(0);
void *Bret;
pthread_join(B,&Bret);
return sum;
}
(executable NetRun link)
Thread Safety with "volatile"
Sometimes, the only problem is the compiler's over-optimization of
access to global variables. You can scare the compiler away from
a variable by putting the keyword "volatile" in front of it. This
makes the compiler fear the variable, and do exactly what you ask with
it:
volatile int sum=0; /*<- Careful! This variable is shared between threads! */
void doWork(void) {
for (int i=0;i<1000;i++)
sum++;
}
(executable NetRun link)
Adding "volatile" does indeed improve the thread safety of this
program--in fact, on a uniprocessor machine, it now gets the right
answer! This is because "sum++" becomes a single instruction
("inc dword [sum]"), and an instruction executes as one piece--if the
kernel switches to another thread, it will be either before or after
this instruction, and never "in the middle of an instruction".
However, on a multiprocessor machine (dual core, dual CPU, or
hyperthreaded like the NetRun machine), this program STILL GIVES THE
WRONG ANSWER.
The reason is curious. On a real multiprocessor, two processors might simultaneously
execute the crucial "inc" instruction. Since the instruction
involves reading "sum" from memory, incrementing it, and writing it
back out, it can still happen that one processor overwrites the result
of another processor.
Thread Safety with "lock" Instruction Prefix
One way to *really* fix this code on a multiprocessor would be to use
the assembly language "lock" prefix, which causes an instruction to
execute "atomically". This changes the memory access mode so that
no other processor can modify that instruction's data until the
instruction completes. Here's what our loop looks like using the
lock prefix. "incl (sum)" is the g++ inline assembly version of
"inc dword [sum]".
volatile int sum=0; /*<- Careful! This variable is shared between threads! */
void doWork(void) {
for (int i=0;i<1000;i++)
__asm__ ( "lock incl (sum)" );
}
(executable NetRun link)
Now, finally, we get the right answer! However, our program is
now 10x slower! The trouble is the "lock" prefix must talk to the
memory bus to actually guarantee the right answer, and the memory bus
is way slower than the CPU.
Thread Safety With Mutexes
There's another way to make this function threadsafe that works even if
you've got more than one assembly-instruction worth of work to
do. It's called a "mutex", which is just a special object with
"lock" and "unlock" operations--while you've got the mutex locked, no
other thread can lock the mutex (think of the lock in a bathroom
stall!). In pthreads, this looks like:
int sum=0; /*<- Careful! This variable is shared between threads! */
pthread_mutex_t sum_lock=PTHREAD_MUTEX_INITIALIZER;
void doWork(void) {
pthread_mutex_lock(&sum_lock);
for (int i=0;i<1000;i++)
sum++;
pthread_mutex_unlock(&sum_lock);
}
This supports long stretches of protected code, is guaranteed to work
on all machines, and requires no ugly unportable assembly.
But there are a few problems:
- Mutex acquisition is kinda slow, like 60ns per lock. This means you want to lock and unlock as rarely as possible.
- Nobody else can do the stuff protected by the lock while you've
got it. This means you want to leave the lock locked as short a
time as possible.
- It's possible to try to re-lock a lock you've locked. This means you're waiting for yourself to unlock it!
The Problem With Threads--Example: THERAC-25 Race Condition
The THERAC-25 was
a piece of medical equipment capable of either sending a
high-strength electron beam through a metal target to generate X-Rays, or
scanning a much weaker raw electron beam over the surface of a cancer
patient. When you chose "Electron" mode, the software would begin
this process:
- Enter an 8-second magnet calibration cycle, and then
- Remove the metal target used to generate X-Rays
When you chose "X-Ray" mode, the software would crank up the electron beam strength and insert the piece of
metal used to generate X-Rays. If you chose "Electron" mode
(starting the calibration cycle) and then within 8 seconds went back
and chose "X-Ray" mode, the end of the Electron mode setup would remove
the crucial piece of metal needed for X-Ray mode, but then send a high-strength
electron beam directly into the patient. At least two
people died as a direct result of this problem, and several more were
permanently maimed.
The problem is just "simple" multithreaded locking--the two modes
should have been exclusive, but because they were both running at the
same time (directly from the keyboard interrupt) on the same machine
they could
interleave in a strange way. Allow me to be more clear:
MULTITHREAD BUGS HAVE KILLED PEOPLE.
For goodness sake, don't let your software kill people!
Parallelism Using Threads (UNIX)
So the whole point of making threads is to get some work done in each thread. This is pretty easy in theory:
// ... prepare program to be run in pieces ...
// Run the pieces as separate threads:
pthread_t *thr=new pthread_t[n];
for (int i=0;i<n;i++) {
worker *w=new ... i'th piece of work ...
pthread_create(&thr[i],0,work,w);
}
// Wait until all pieces are finished (until work routine returns)
for (int i=0;i<n;i++) {
void *ret;
pthread_join(thr[i],&ret);
}
You just have to be very careful with threads that the workers don't
accidentally overwrite each other's work, since all memory is shared
between threads. Sometimes this requires locking, but usually it
requires making separate copies of variables. Keeping track of
the separate copies can be a pain. Separating the copies used
across a big complicated program can be a real pain.
Parallelism Using Processes
Unlike threads, processes don't share memory by default. This
substantially reduces the number of places you have to worry about race
conditions and other parallel problems. However, there are times
when you want to share some memory between processes--for example, to
collect the pieces of the final result. You must allocate
these shared regions in a special way--on UNIX, you can make a shared
memory region really easily using mmap or with more code using shmget/shmat (see Beej's shm example). On Windows, you can create a shared-memory region with CreateFileMapping and then MapViewOfFile (see codeproject example).
Here's my UNIX example (works on Linux and Mac OS X, but not Windows):
#include <cstdio> /* for printf */
#include <unistd.h> /* for fork */
#include <sys/mman.h> /* for mmap */
#include <sys/wait.h> /* for wait */
// Return len bytes of RAM that will automagically be shared after a fork().
void *shared_malloc(int len) {
void *p=mmap(0,len, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANON, -1,0);
if (p==MAP_FAILED) printf("mmap failed!\n");
return p;
}
void shared_free(void *p,int len) {
munmap(p,len);
}
...
// ... prepare program to be run in pieces ...
// ... call shared_malloc for memory shared between pieces ...
// Run the pieces as separate processes:
int *pid=new int[n];
for (int i=0;i<n;i++) {
worker *w=new ... i'th piece of work ...
pid[i]=fork();
if (pid[i]==0) { /* I'm the child laborer */
work(w);
exit(0);
}
}
// Wait until all pieces are finished (until children exit)
for (int i=0;i<n;i++) {
int ret;
waitpid(pid[i],&ret,0);
}
Unlike the threaded version
above, the separate pieces of this program can't possibly mess up each
other's work--in fact, they have no way to communicate except by the
shared_malloc regions allocated by the parent! I claim that for
many real programs, this "private by default, shared only when
explicitly asked" is the correct approach. It means all your old
(buggy, global variable-laden) C++ code still works. It means you
don't get annoying race conditions (as often).