Parallel Programming Mindset, Multithreading, OpenMP
CS 641 Lecture, Dr. Lawlor
Pop quiz--so parallel computing is finally going mainstream. What will this event be most like?
- The new 2006 Democratic majority in the US Congress. (New names, lots of hype, but zero measurable effect.)
- The worst 1950's rumors about communist spies infiltrating honest American institutions. (Reliable components slowly going bad without anybody knowing... until it's too late!)
- The worst rumors about global warming coming true. (Plagues of locusts, drought, heat, and major lifestyle changes for everybody.)
- A meteorite similar to the one that drove the dinosaurs extinct. (Slow suffocating starvation for 90%+ of earth's life.)
- The second coming. (Dead shall rise, seas shall boil, skies shall burn, seven seals/seven bowls, battle of Armageddon, new Jerusalem on a new Earth.)
Answer: all of the above!
- Whatever happens, there's still going to be C++-lookin' code, if for no other reason than the huge quantity of C++ that's out there already! The names may change, and the way the code is called may change a lot, but deep down tons of sequential "variable=value;" code will not change. It's the code that calls your deep-down application code that's likely to change a lot.
- Multiple things happening at once means you have to worry about "race conditions", where two (or more) pieces of code "race" to see which one finishes first. You have to worry about three (nondeterministic!) possibilities: A finishes first, B finishes first, or a tie. Maybe two of these cases are really rare. But any one of them might cause your code to blow up! (Crash, fail, silently give the wrong answer to a crucial query...)
- Parallel programming is a new lifestyle. You've got to
worry about things you didn't have to worry about before.
Concurrency makes some formerly-moral stuff (like static variables)
immoral, and can make some formerly immoral stuff (like public members)
moral, because the situation has changed.
- Many perfectly innocent serial programs will slowly wither and die, because they don't take advantage of parallelism. They will have the embarrassingly limited, dated feeling of command-line DOS applications (8-character filename limit), FAT32 partitions (4GB file limit), or software-rendered 3D games (ugly and slow). Few users will mourn their passing.
- Parallel programming really has the potential to change everything. Just the hardware ranges from staid stuff like SSE to multicore to distributed to GPU to FPGA. FPGAs don't even run C++ code, dangit. It's the end of the world! (As we know it... and I feel fine.)
To get into the proper revolutionary mindset, read:
The Free Lunch is Over: A Fundamental Turn Toward Concurrency in Software
written by Herb Sutter, smart Microsoft guy on the C++ standards committee
Notable quotes:
- "Andy giveth, and Bill
taketh away." That is, Andrew Grove, Intel, keeps building faster
and faster CPUs. Bill Gates, Microsoft, keeps building slower and
slower (yet better and better!) software.
- "Concurrency is the next
major revolution in how we write software." That is, by 2015
purely sequential code will have the same retro feeling that COBOL has
today.
- "Probably
the greatest cost of
concurrency is that concurrency really is hard: The programming model,
meaning the model in the programmer’s head that he needs to reason
reliably about his program, is much harder than it is for sequential
control flow." That is, human reasoning is sequential. Execution is
parallel. Horrible things can crawl into our universe through this gap!
- "For example, Intel is
talking
about someday producing 100-core chips; a single-threaded application
can exploit at most 1/100 of such a chip’s potential throughput."
90% of commercial applications today are single-threaded. They
will either adapt or die out.
- "just because it takes one
woman nine months to produce a baby doesn’t imply that nine women could
produce one baby in one month." So if the problem is "make one
baby", parallelism is useless. But if you change the problem to
"make 1 million babies", suddenly you have million-way parallelism.
- "A few rare classes of
applications are naturally parallelizable, but most aren’t." In
other words, occasionally it's obvious how to run in parallel, and
actually running in parallel is easy. Usually, though, you have
to rethink and reformulate the basic algorithm used to solve the problem!
The Argument for Thread-driven Parallelism
So you're running along in one function. You'd like to stop
running, pack up, and start running some other function for a
while. Easy, right? Just CALL the other function--your
return address will get pushed onto the stack, and the other function
will run until it's done and RET back to you.
But sometimes, CALL and RET are not good enough. Sometimes, it's nice to be able to suspend function A, run function B for a while, then switch back to A before
B is finished, then switch back to where we left off inside B, and so
on. In other words, sometimes you want to be able to split up a
few functions into several pieces, and switch off between the execution
of each of those pieces.
This is actually a form of parallelism--it lets you get several things done at the same time.
For example, if function A is reading stuff from the network, it'll
spend a lot of time just waiting for the next piece of data to
arrive. That's when it would make sense to run some other
function B. If B is writing stuff to disk, it'll spend time
waiting for the disk to finish writing. That's when you want to
switch back to A!
Just to be clear, here's what we want to have happen:
Function A:
  for (int i=0;i<3;i++) {
    std::cout<<"A:"<<i<<"\n";
    swap from A to B
  }

Function B:
  for (int i=0;i<3;i++) {
    std::cout<<" B:"<<i<<"\n";
    swap from B to A
  }
So what we want is to be able to run one iteration of A, then run one
iteration of B, and so on. In a real program, of course, A and B
would be doing something useful instead of just printing stuff, and
they'd only switch off when they need to wait for something slow (a
disk, the network, the keyboard, etc.) instead of swapping every
iteration.
Implementing Threads in User Space
So you can actually write code that switches between functions A and B as shown above. It's pretty easy to do so.
You've just got to save and restore everything that's different in A and B. This includes:
- The instruction pointer. This will automatically get saved to the stack if we just make "swap" a normal function.
- The saved registers (ebp, ebx, esi, and edi). It's easy enough to save these on the stack in assembly.
- The stack pointer. "i" is probably sitting on the stack for both functions.
So the easy part is writing the swap routine:
; Usage: swap32(old struct pointer, new struct pointer)
global swap32
swap32:
mov eax,[esp+4] ; first argument: old thread struct pointer
mov ecx,[esp+8] ; second argument: new thread struct pointer
; Save registers
push ebp
push ebx
push esi
push edi
; Save old stack
mov [eax],esp
; Load new stack
mov esp,[ecx]
; Restore registers
pop edi
pop esi
pop ebx
pop ebp
ret
Now you've just got to call the swap routine to switch from running one
function to running another--and then you can switch back!
struct thread {
void *stack_pointer; /* <- accessed from assembly */
};
/* Save this old thread's stack; and load up from this new thread's stack. */
extern "C" void swap32(thread *old,thread *new_thread);
thread A, B; /* threads to run */
void fnA(void) {
for (int i=0;i<3;i++) {
std::cout<<"A:"<<i<<"\n";
swap32(&A,&B);
}
}
void fnB(void) {
for (int i=0;i<3;i++) {
std::cout<<" B:"<<i<<"\n";
swap32(&B,&A);
}
}
int foo(void) {
/* Set up a separate stack for B (nasty, but only needed once!) */
int Blen=8192; /* small 8KB stack */
char *Bstack=new char[Blen];
void **Btop=(void **)(&Bstack[Blen]); /* end of B's stack */
*(--Btop)=(void *)fnB; /* first swap will return to start of fnB */
for (int crap=0;crap<4;crap++)
*(--Btop)=0; /* set new ebp, ebx, esi, and edi all to zero */
B.stack_pointer=(void *)Btop; /* B's stack starts here */
/* Just call A. It will swap to B, which will swap to A, and so on! */
std::cout<<"Calling fnA\n";
fnA();
std::cout<<"Back to foo\n";
return 0;
}
(executable NetRun link part 1,
NetRun link part 2)
Note the trickiness of setting up B's stack so that A's first call to swap32 will actually fire up function B!
This code produces:
Calling fnA
A:0
B:0
A:1
B:1
A:2
B:2
Back to foo
Program complete. Return 0 (0x0)
Flavors of Threading
Writing your own assembly code to switch between threads is called "user-level threading", or "coroutines". There are several other ways to implement the same thing, like using the built-in library routines setjmp/longjmp, makecontext/swapcontext, or Windows Fibers.
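For example, here's a minimal sketch of the same A/B ping-pong using POSIX makecontext/swapcontext instead of hand-written assembly. It keeps the NetRun-style foo() entry point used in the examples above (rename it main() to build standalone); the rest is just the standard <ucontext.h> calls.
#include <iostream>
#include <ucontext.h> /* getcontext, makecontext, swapcontext */

ucontext_t ctxA, ctxB; /* saved execution state for each function */

void fnB(void) {
	for (int i=0;i<3;i++) {
		std::cout<<" B:"<<i<<"\n";
		swapcontext(&ctxB,&ctxA); /* save B's state, resume A */
	}
}

int foo(void) {
	static char Bstack[16*1024]; /* B needs its own stack */
	getcontext(&ctxB); /* start from a copy of the current context */
	ctxB.uc_stack.ss_sp=Bstack;
	ctxB.uc_stack.ss_size=sizeof(Bstack);
	ctxB.uc_link=&ctxA; /* where to go if fnB ever returns */
	makecontext(&ctxB,fnB,0); /* point ctxB at fnB, running on Bstack */
	for (int i=0;i<3;i++) {
		std::cout<<"A:"<<i<<"\n";
		swapcontext(&ctxA,&ctxB); /* save A's state, resume B */
	}
	return 0;
}
The library does exactly what our swap32 does--save registers and the stack pointer, load the other thread's--just portably and with a bit more overhead.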
You can also get the OS itself to do the swapping whenever it likes;
threads created and managed by the OS are called "kernel threads" or
just plain "threads". You can create OS threads with pthread_create (on UNIX-like systems) or CreateThread (on Windows).
The big advantage of kernel threads is that the kernel knows when an
I/O will block (for example, when you're reading a file), and can thus
automatically swap out that thread with another thread. The big
advantage of user-level threads is that you know exactly when every
thread switch will occur (because you're explicitly calling the "swap"
function!), so it's pretty rare for a thread switch to happen at
"exactly the wrong instant" and hence screw up your program.
My friend Gengbin and I wrote a decent paper on the tradeoffs between user-level and kernel threads for parallel programming.
Generally, user-level threads have less overhead (25ns per context switch for the implementation above), while kernel threads can run better (the kernel swaps its threads out if one blocks for I/O, and they naturally take advantage of multiple CPUs).
Simple Kernel Threads
So far, we've built threads and switched between them on our own. But you can actually get the OS
kernel to do the hard work of creating threads, which are then called
"kernel threads". On UNIX systems, you create a new kernel thread
by calling "pthread_create", which is in "#include <pthread.h>"
and the "-lpthread" library. You just pass pthread_create the
function to run in the thread, and the kernel will then run that
function, switching back and forth between threads at
basically random times (100 or 1000 times per second, usually).
#include <iostream>
#include <pthread.h>
void doWork(const char *caller) {
std::cout<<caller<<"\n";
for (int delay=0;delay<1*1000*1000;delay++) {}
}
void *fnA(void *arg) {
for (int i=0;i<10;i++) doWork("A");
return 0;
}
void *fnB(void *arg) {
for (int i=0;i<10;i++) doWork(" B");
return 0;
}
int foo(void) {
pthread_t B;
pthread_create(&B,0,fnB,0);
fnA(0);
void *Bret;
pthread_join(B,&Bret);
return 0;
}
(executable NetRun link)
To illustrate the way the kernel randomly switches between these functions, run
this thing several times--here's what I got on several different runs on my (uniprocessor) laptop:
Run 1:  A A A B B B
Run 2:  AB  AB  AB
Run 3:  B B B A A A
Run 4:  AB A A  B B
Run 5:  AB  B B A A
That is, the kernel runs A for a while, then B for a while, then back
to A. You can't tell when a switch is going to happen. Note that the kernel sometimes switches to B between when A
prints the letter 'A' and when it prints the newline immediately after
it in the string "A\n"!
The Danger of Threads
So basically a kernel thread is just a fancy way to run one of your subroutines simultaneously with all the other subroutines. A kernel thread can still call other functions, access your global variables, or use anything it can find that belongs to other threads.
This shared access to common variables
immediately introduces the many problems of "thread
safety". For example, consider a piece of code like this:
int shared=0;
void inc(void) {
int i=shared;
i++;
shared=i;
}
If two threads try to call "inc" repeatedly, the two executions might interleave like this:
Thread A:
	int i=shared; // i==0
	i++;          // i==1
	// hit interrupt, switch to B
Thread B:
	int i=shared; // i==0 (again)
	i++;          // i==1
	shared=i;     // shared==1
	int i=shared; // i==1
	i++;          // i==2
	shared=i;     // shared==2
	int i=shared; // i==2
	i++;          // i==3
	shared=i;     // shared==3
	// hit interrupt, switch back to A
Thread A (resuming):
	shared=i;     // i is still 1, so shared==1!
Uh oh! When we switch back to thread A, the value stored in "i"
isn't right anymore. It's an older copy of a shared global variable,
stored in thread A's stack or registers. So thread A will happily
overwrite the shared variable with its old version, causing all of B's
work to be lost!
Here's an executable example of this problem. Both threads are trying to increment "sum". They each do this 3 million times, so "sum" should be 6 million at the end. But with optimization on, they both store a copy of "sum" in a register, so one guy overwrites the other guy's work when they write the modified value back to "sum", and you (usually!) end up with a totally-wrong value closer to 3 million (or so):
#include <pthread.h>
int sum=0; /*<- Careful! This variable is shared between threads! */
void doWork(void) {
for (int i=0;i<1000*1000;i++)
sum++;
}
void *fnA(void *arg) {
for (int i=0;i<3;i++) doWork();
return 0;
}
void *fnB(void *arg) {
for (int i=0;i<3;i++) doWork();
return 0;
}
int foo(void) {
sum=0;
pthread_t B;
pthread_create(&B,0,fnB,0);
fnA(0);
void *Bret;
pthread_join(B,&Bret);
return sum;
}
(executable NetRun link)
Thread Safety with "volatile"
Sometimes, the only problem is the compiler's over-optimization of
access to global variables. You can scare the compiler away from
a variable by putting the keyword "volatile" in front of it. This
makes the compiler fear the variable, and do exactly what you ask with
it:
volatile int sum=0; /*<- Careful! This variable is shared between threads! */
void doWork(void) {
for (int i=0;i<1000*1000;i++)
sum++;
}
(executable NetRun link)
Adding "volatile" does indeed improve the thread safety of this
program--in fact, on a uniprocessor machine, it now gets the right
answer! This is because "sum++" becomes a single instruction
("inc dword [sum]"), and an instruction executes as one piece--if the
kernel switches to another thread, it will be either before or after
this instruction, and never "in the middle of an instruction".
However, on a multiprocessor machine (dual core, dual CPU, or
hyperthreaded like the NetRun machine), this program STILL GIVES THE
WRONG ANSWER.
The reason is curious. On a real multiprocessor, two processors might simultaneously
execute the crucial "inc" instruction. Since the instruction
involves reading "sum" from memory, incrementing it, and writing it
back out, it can still happen that one processor overwrites the result
of another processor.
One way to *really* fix this code on a multiprocessor would be to use
the assembly language "lock" prefix, which causes an instruction to
execute "atomically". This changes the memory access mode so that
no other processor can modify that instruction's data until the
instruction completes. Here's what our loop looks like using the
lock prefix.
volatile int sum=0; /*<- Careful! This variable is shared between threads! */
void doWork(void) {
for (int i=0;i<1000*1000;i++)
__asm__ ( "lock incl (sum)" );
}
(executable NetRun link)
Now, finally, we get the right answer! However, our program is
now 10x slower--"lock" is a memory operation, so now we're running at
memory speed, not at cache speed.
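If you'd rather not write inline assembly, gcc (4.1 and later) exposes the same lock-prefixed hardware operation through its __sync atomic builtins. Here's a sketch of the same loop using __sync_fetch_and_add; it gets the right answer on a multiprocessor for the same reason as the hand-written lock prefix, and at roughly the same memory-speed cost.
int sum=0; /*<- Careful! This variable is shared between threads! */
void doWork(void) {
	for (int i=0;i<1000*1000;i++)
		__sync_fetch_and_add(&sum,1); /* atomic increment: the compiler emits a lock-prefixed add */
}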
Thread Safety With Mutexes
There's another way to make this function threadsafe that works even if
you're writing pure C++ and/or have more than one assembly-instruction worth of work to
do. It's called a "mutex", which is just a special object with
"lock" and "unlock" operations--while you've got the mutex locked, no
other thread can lock the mutex (think of the lock in a bathroom
stall!). In pthreads, this looks like:
int sum=0; /*<- Careful! This variable is shared between threads! */
pthread_mutex_t sum_lock=PTHREAD_MUTEX_INITIALIZER;
void doWork(void) {
pthread_mutex_lock(&sum_lock);
for (int i=0;i<1000*1000;i++)
sum++;
pthread_mutex_unlock(&sum_lock);
}
(executable NetRun link)
This supports long stretches of protected code, is guaranteed to work
on all machines, and requires no ugly unportable assembly.
But there are a few problems:
- Mutex acquisition is kinda slow, like 60ns per lock. This means you want to lock and unlock as rarely as possible.
- Nobody else can do the stuff protected by the lock while you've
got it. This means you want to leave the lock locked as short a
time as possible.
- Everybody who uses "sum" MUST remember to use the lock, as well. If anybody ever forgets, you're right back in a race condition.
- It's possible to try to re-lock a lock you've already locked. This means you're waiting (forever!) for yourself to unlock it--a deadlock!
Thread Safety With Privatization
The ONLY way to get good performance and the right answer is to have each thread working on its own, separate copy of each variable:
int sumA=0, sumB=0;
void doWork(int &sum) {
for (int i=0;i<1000*1000;i++)
sum++;
}
(executable NetRun link)
This is called "privatization". Generally, privatization is the right
answer for multithreaded code--find all the shared variables, and make
separate copies of them.
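Here's a complete sketch of the privatized version, fleshing out the fragment above with pthread wrappers matching the earlier examples (the fnA/fnB wrappers and the final combining step are additions for illustration, not from the original NetRun code):
#include <pthread.h>
int sumA=0, sumB=0; /* one counter per thread--no sharing, no race */
void doWork(int &sum) {
	for (int i=0;i<1000*1000;i++)
		sum++;
}
void *fnA(void *arg) { for (int i=0;i<3;i++) doWork(sumA); return 0; }
void *fnB(void *arg) { for (int i=0;i<3;i++) doWork(sumB); return 0; }
int foo(void) {
	pthread_t B;
	pthread_create(&B,0,fnB,0);
	fnA(0);
	void *Bret;
	pthread_join(B,&Bret); /* wait until B is done before touching sumB */
	return sumA+sumB; /* combine the private copies only after the join */
}
After the join, combining the private copies is safe, and the total comes out to 6 million every time, with no locking overhead.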
OpenMP: Threaded Programming For Mortals
Because threaded programming is so ugly and tricky, there's a newish (mainstream in 2007) language extension out there called OpenMP, designed to make it easier to write multithreaded code.
The basic idea is you take what looks like an ordinary sequential loop, like:
for (int i=0;i<n;i++) do_fn(i);
And you add a little note to the compiler saying it's a parallel for loop, so if you've got six CPUs, the iterations should be spread across the CPUs. The particular syntax they chose is a "#pragma"
statement, with the "omp" prefix:
#pragma omp parallel for
for (int i=0;i<n;i++) do_fn(i);
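For instance, here's a minimal self-contained sketch showing the pragma in context (the result array and do_fn are placeholder work made up for this example, not from the original notes):
#include <iostream>

const int n=1000;
int result[n];

void do_fn(int i) { /* placeholder: any work where iterations are independent */
	result[i]=i*i;
}

int foo(void) {
#pragma omp parallel for
	for (int i=0;i<n;i++) do_fn(i);
	std::cout<<"result[10]="<<result[10]<<"\n";
	return 0;
}
Compile with OpenMP enabled (see the compiler flags below) and the loop iterations get divided among the threads automatically; without the flag, the pragma is simply ignored and the loop runs sequentially.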
You can also add a variety of interesting options to the "parallel for" line:
- "private(n)" would give each CPU a separate read-write copy of the variable "n" (or any other variable).
- "reduction(+:k)" would total up each CPU's (private) copy of the
variable "k" (or any other variable) using the "+" operation (or any
other operation).
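As a sketch of reduction in action, here's the shared-counter race from the pthreads section redone in OpenMP: each thread gets its own private sum, and the copies are added together when the loop finishes, so there's no race and no lock.
int foo(void) {
	int sum=0;
#pragma omp parallel for reduction(+:sum)
	for (int i=0;i<6*1000*1000;i++)
		sum++;
	return sum; /* 6 million every time, no matter how many threads ran */
}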
In addition to "pragma omp parallel for", there are other pragma lines:
- "barrier" tells each thread to not to continue until all threads have reached this point.
- "critical" only allows one thread at a time to execute the following block (wraps a lock around the block).
- "single"
Note that this is still shared-memory threaded programming, so global variables are still (dangerously) shared by default!
Here's how you enable OpenMP
in various compilers. Visual C++ 2005, Intel C++ version 9.0, and
gcc version 4.2 all support OpenMP, although earlier versions do not!
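From memory, the flags look roughly like this (check your compiler's documentation; the Visual C++ and Intel spellings in particular are reconstructed, not from the original page):
cl /openmp my_code.cpp          (Visual C++ 2005)
icc -openmp my_code.cpp         (Intel C++ 9.0, Linux)
g++ -fopenmp my_code.cpp        (gcc 4.2)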
Here's the idiomatic OpenMP program: slap "#pragma omp parallel for" in front of your main loop. You're done!
Here's a more complex "who am I"-style OpenMP program from Lawrence Livermore National Labs. Note the compiler "#pragma" statements!
On the powerwall, you can compile and run OpenMP code as follows:
g++-4.2 get_info.c -o get_info -fopenmp
./get_info
Chris Granade has a nice page on OpenMP on
the Cell Broadband Engine.