Multicore Programming with Threads
CS 641 Lecture, Dr. Lawlor
"There is always a well-known solution to every human problem — neat, plausible, and wrong."
- H.L. Mencken, 1917 (not actually discussing threads. Probably.)
To get into the proper revolutionary mindset, read:
The Free Lunch is Over: A Fundamental Turn Toward Concurrency in Software
written by Herb Sutter, a smart Microsoft guy on the C++ standards committee
Notable quotes:
- "Andy giveth, and Bill taketh away."  That is, Andrew Grove of Intel keeps building faster and faster CPUs; Bill Gates of Microsoft keeps building slower and slower (yet better and better!) software.
- "Concurrency is the next major revolution in how we write software."  That is, by 2020 purely sequential code will have the same retro feeling that COBOL has today.
- "Probably the greatest cost of concurrency is that concurrency really is hard: The programming model, meaning the model in the programmer's head that he needs to reason reliably about his program, is much harder than it is for sequential control flow."  That is, since Aristotle, human reasoning has been sequential.  Execution is parallel.  Horrible things can crawl into our universe through this gap!
- "For example, Intel is talking about someday producing 100-core chips; a single-threaded application can exploit at most 1/100 of such a chip's potential throughput."  90% of commercial applications today are single-threaded.  They will either adapt or die out.
- "just because it takes one woman nine months to produce a baby doesn't imply that nine women could produce one baby in one month."  So if the problem is "make one baby", parallelism is useless.  But if you change the problem to "make 1 million babies", suddenly you have million-way parallelism.
- "A few rare classes of applications are naturally parallelizable, but most aren't."  In other words, occasionally it's obvious how to run in parallel, and actually running in parallel is easy.  Usually, though, you have to rethink and reformulate the basic algorithm used to solve the problem.  I'd add that "A few rare applications are impossible to parallelize, but given enough effort, most aren't."
Creating Threads
Just like creating a new process, creating a thread is an OS call.
See all the example code as a
(Directory, Zip, Tar-gzip).
Here's an example of creating threads on UNIX:
#include <iostream>
#include <pthread.h>

volatile int delay=0;

void doWork(const char *caller) {
    std::cout<<caller<<"\n";
    for (int i=0;i<1000*1000*1;i++) delay++; /* busy-wait, so the two threads interleave */
}
void *fnA(void *arg) {
    for (int i=0;i<10;i++) doWork("A");
    return 0;
}
void *fnB(void *arg) {
    for (int i=0;i<10;i++) doWork("B");
    return 0;
}
int foo(void) {
    pthread_t B;
    pthread_create(&B,0,fnB,0); /* start fnB running in a new thread */
    fnA(0); /* run fnA in this thread, concurrently with B */
    void *Bret;
    pthread_join(B,&Bret); /* wait for the B thread to finish */
    return delay;
}
(Try this in NetRun now!)
Annoying things about this include:
- Moving the per-thread work into a separate function is a pain.
- You need to remember to both create *and* join threads.
- The A and B printouts come in some unpredictable order, depending on the speed of the machine, OS, etc.
- "delay" ends up with a weird, semi-random answer, because both threads modify it at once (more on this below).
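For comparison, here's a sketch of the same A/B demo using C++11's std::thread (a thin wrapper over pthreads on UNIX), which addresses the first annoyance: the per-thread work can be a lambda, so no separate function is needed. The names here (`run_demo`, `delay`) are mine, not from the lecture code, and `delay` is std::atomic only so the final count is exact; the race on a plain int is the topic of the next section.

```cpp
#include <atomic>
#include <iostream>
#include <thread>

// std::atomic so both threads can increment safely (see the race discussion below).
std::atomic<int> delay(0);

int run_demo(void) {
    std::thread B([]{ // lambda runs concurrently with the loop below
        for (int i=0;i<10;i++) { std::cout<<"B\n"; delay++; }
    });
    for (int i=0;i<10;i++) { std::cout<<"A\n"; delay++; }
    B.join(); // you still need to remember to join!
    return delay; // 10 increments from each thread
}
```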
Getting the Right Answer with Threads
So basically a kernel thread is just a fancy way to run one of your subroutines
simultaneously with all the other subroutines. A kernel thread can
still call other functions, access your global variables, or use anything it can find that belongs
to other threads.
This shared access to common variables
immediately introduces the many problems of "thread
safety". For example, consider a piece of code like this:
int shared=0;
void inc(void) {
    int i=shared; /* read */
    i++;          /* modify */
    shared=i;     /* write */
}
If two threads try to call "inc" repeatedly, the two executions might interleave like this:
Thread A                                Thread B
int i=shared; // i==0
i++;          // i==1
// hit interrupt, switch to B
                                        int i=shared; // i==0 (again)
                                        i++;          // i==1
                                        shared=i;     // shared==1
                                        int i=shared; // i==1
                                        i++;          // i==2
                                        shared=i;     // shared==2
                                        int i=shared; // i==2
                                        i++;          // i==3
                                        shared=i;     // shared==3
                                        // hit interrupt, switch back to A
shared=i; // i is still 1, so shared==1!
Uh oh! When we switch back to thread A, the value stored in "i"
isn't right anymore. It's an older copy of a shared global variable,
stored in thread A's stack or registers. So thread A will happily
overwrite the shared variable with its old version, causing all of B's
work to be lost!
Here's an executable example of this problem. Both threads are
trying to increment "sum". They do this 6,000 times, so "sum"
should be 6000 at the end. But with optimization on, they both
store a copy of "sum" in a register, so one guy overwrites the other
guy's work when they write the modified value back to "sum", and you
(usually!) end up with the totally-wrong value 3000:
#include <pthread.h>

int sum=0; /*<- Careful! This variable is shared between threads! */

void doWork(void) {
    for (int i=0;i<1000;i++)
        sum++; /* read-modify-write of a shared variable: race! */
}
void *fnA(void *arg) {
    for (int i=0;i<3;i++) doWork();
    return 0;
}
void *fnB(void *arg) {
    for (int i=0;i<3;i++) doWork();
    return 0;
}
int foo(void) {
    sum=0;
    pthread_t B;
    pthread_create(&B,0,fnB,0);
    fnA(0);
    void *Bret;
    pthread_join(B,&Bret);
    return sum; /* should be 6000, but usually isn't */
}
(executable NetRun link)
Generally speaking, parallel race conditions can only happen when:
More than one thread writes to the same data at the same time.
This statement tells you all the ways to eliminate data race conditions:
More than one thread...
|
Only one thread? No problem. This is the solution taken by most modern software.
|
... writes to ...
|
No writes? No problem.
"RAR" isn't a dependency, so read-only data structures are fine.
As soon as anybody does any writes, you're in dependency land.
|
... the same data ...
|
Separate data? No
problem. One common fix is to "privatize" all your data--make a
private copy for each thread to use.
|
... at the same time.
|
There are several cool ways to separate different threads' accesses in time. Many architectures support "atomic" instructions that
the hardware guarantees will get the right answer, typically by
excluding other cores' accesses. All thread libraries
support a mutual exclusion primitive or "lock", typically built in software from atomic instructions.
|
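As a sketch of the "at the same time" fixes, here's the shared counter done two ways: once with gcc's `__sync_fetch_and_add` atomic builtin, and once with a pthread mutex. The helper names (`add_atomic`, `add_locked`, `run`) are mine, not from the lecture code; `run` mirrors the `foo` driver above.

```cpp
#include <pthread.h>

int sum=0;
pthread_mutex_t sum_lock=PTHREAD_MUTEX_INITIALIZER;

void *add_atomic(void *arg) { /* hardware atomic: no lock needed */
    for (int i=0;i<3000;i++)
        __sync_fetch_and_add(&sum,1); /* gcc/clang builtin atomic increment */
    return 0;
}
void *add_locked(void *arg) { /* mutex: exclude other threads in software */
    for (int i=0;i<3000;i++) {
        pthread_mutex_lock(&sum_lock);
        sum++; /* only one thread can be here at a time */
        pthread_mutex_unlock(&sum_lock);
    }
    return 0;
}
int run(void *(*fn)(void *)) { /* run fn in two threads, like foo() above */
    sum=0;
    pthread_t B;
    pthread_create(&B,0,fn,0);
    fn(0);
    void *ret;
    pthread_join(B,&ret);
    return sum; /* now reliably 6000 */
}
```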
Unfortunately, there's a false syllogism in people's minds, where:
    multicore => threads => locks
That is, people assume the only way to write multicore code is using
threads, and the only way to avoid problems with threads is to add
locks. Often, locks are a poor solution--they have fairly high
overhead (around 30ns, about the cost of ten function calls!), they reduce
concurrency by design (lock contention), and they suffer from odd
corner cases of their own (deadlock). Personally, I try to
make data private first, then use read-only data structures, and only
add locks if no other solution is possible.
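Here's the "private data first" approach applied to the same counter: each thread accumulates into its own local variable, so there is no shared write at all during the parallel part, and the final combine is serialized by the join. A sketch with my own names (`count_private`, `private_sum`):

```cpp
#include <pthread.h>

/* Privatization: each thread sums into a private local, no race possible. */
void *count_private(void *arg) {
    long local=0;             /* private to this thread */
    for (int i=0;i<3000;i++) local++;
    return (void *)local;     /* hand the private total back via join */
}
long private_sum(void) {
    pthread_t B;
    pthread_create(&B,0,count_private,0);
    long total=(long)count_private(0); /* this thread's share */
    void *Bret;
    pthread_join(B,&Bret);    /* join serializes the combine step */
    total+=(long)Bret;
    return total;             /* reliably 6000: no locks, no atomics */
}
```

This is usually the fastest fix as well, since the hot loop touches only thread-private data.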