| More than one thread... | Only one thread? No problem. This is the solution taken by most modern software, and the one you've been using so far. |
| ... writes to ... | No writes? No problem. So read-only data structures are fine. As soon as anybody does any writes, you're in trouble. |
| ... the same data ... | Separate data? No problem. One common fix is to "privatize" all your data: make a private copy for each thread to use. This is probably the highest-performance fix, too. |
| ... at the same time. | There are several cool ways to separate different threads' accesses in time. Many architectures support "atomic" instructions that the hardware guarantees will get the right answer, typically by excluding other cores' accesses. All thread libraries support a mutual exclusion primitive or "lock", typically built in software from atomic instructions. (See the sketch just after this table.) |
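To make those last two fixes concrete, here's a minimal C++11 sketch (not from these notes; the names work, atomic_count, and locked_count are made up for illustration). std::atomic uses the hardware's atomic read-modify-write instructions, and std::mutex is the classic lock. Either way, two threads hammering the same counter still get the right total, where a plain shared int would lose updates.
#include <atomic>
#include <mutex>
#include <thread>
#include <iostream>

std::atomic<int> atomic_count(0); // updated with atomic instructions
int locked_count=0;               // plain int, protected by the lock below
std::mutex count_lock;

void work(void) {
	for (int i=0;i<100000;i++) {
		atomic_count++; // atomic increment: no lock needed
		std::lock_guard<std::mutex> guard(count_lock); // lock excludes the other thread
		locked_count++; // safe: only one thread holds the lock at a time
	}
}

int main(void) {
	std::thread a(work), b(work); // two threads writing the same counters
	a.join(); b.join();
	std::cout<<atomic_count<<" "<<locked_count<<"\n"; // both print 200000
	return 0;
}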
The basic idea of OpenMP is that you start with an ordinary serial loop:
for (int i=0;i<n;i++) do_fn(i);
And you add a little note to the compiler saying it's a parallel for loop, so if you've got six CPUs, the iterations should be spread across the CPUs. The particular syntax they chose is a "#pragma" statement, like so:
#pragma omp parallel for
for (int i=0;i<n;i++) do_fn(i);
This is a deceptively powerful statement. Each iteration of the loop can now potentially run on a separate core, in its own thread. After the for loop completes, all the threads go back to waiting, except the master thread, which continues with the (otherwise serial) program. But there are downsides, as the examples below show.
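(One practical note, not spelled out above: the compiler only honors these pragmas if you turn OpenMP on, e.g. "g++ -fopenmp foo.cpp" with gcc or clang. Without the flag, the pragma is simply ignored and the loop runs serially, which is actually a nice property: the same source still builds and runs as an ordinary serial program.)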
To see what the pragma actually does, here's an ordinary serial loop that prints each index i and its address:
int foo(void) {
for (int i=0;i<10;i++) {
std::cout<<"i="<<i<<" and &i="<<&i<<"\n";
}
return 0;
}
Here's the same loop running in parallel with OpenMP. Notice that
(1) there are 4 separate threads on this quad-core machine, (2) each
thread gets its *own* copy of i (thanks to the compiler), and (3) every
time you run the program, the printouts arrive in a slightly different
order.
int foo(void) {
#pragma omp parallel for
for (int i=0;i<10;i++) {
std::cout<<"i="<<i<<" and &i="<<&i<<"\n";
}
return 0;
}
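If you want to see which thread ran which iteration, OpenMP's runtime library can tell you. Here's a small sketch (not part of the original example) using omp_get_thread_num(); you can also set the environment variable OMP_NUM_THREADS to control how many threads the runtime creates.
#include <omp.h>
#include <iostream>
int foo(void) {
	#pragma omp parallel for
	for (int i=0;i<10;i++) {
		// omp_get_thread_num() returns this thread's ID, from 0 up to the thread count minus 1
		std::cout<<"i="<<i<<" ran on thread "<<omp_get_thread_num()<<"\n";
	}
	return 0;
}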
Here's an example where OpenMP really shines. Each thread handles
its own part of the array, working alone and getting things done.
The net result is about 3x faster than a single core.
enum {n=100};
float array[n];
int foo(void) {
#pragma omp parallel for
for (int i=0;i<n;i++) {
array[i]=exp(-array[i]);
}
return array[0];
}
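These notes quote times per iteration; one way to measure such numbers yourself (not necessarily how these were measured) is OpenMP's wall-clock timer, omp_get_wtime():
#include <omp.h>
#include <cmath>
#include <cstdio>
enum {n=100};
float array[n];
int main(void) {
	double start=omp_get_wtime(); // wall-clock time, in seconds
	#pragma omp parallel for
	for (int i=0;i<n;i++) {
		array[i]=exp(-array[i]);
	}
	double elapsed=omp_get_wtime()-start;
	printf("%.2f ns per element\n",elapsed*1.0e9/n);
	return 0;
}
For a loop this tiny, the cost of spinning up the threads dominates, so in practice you'd time many repetitions of the loop and divide to get a stable per-element figure.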
Here's where the downsides show up. This plain serial loop just increments a counter a million times; it takes 2.5 ns/iteration on my quad core machine:
volatile int sum=0;
int foo(void) {
int i,j;
sum=0;
for (i=0;i<1000;i++)
for (j=0;j<1000;j++) {
sum++;
}
return sum;
}
If we naively add the pragma, the loop gives the wrong answer. There are two problems here: the inner loop counter j is shared between the threads, so they trample each other's loop indices; and sum is shared too, so increments from different threads collide and updates get lost.
volatile int sum=0;
int foo(void) {
int i,j;
sum=0;
#pragma omp parallel for
for (i=0;i<1000;i++)
for (j=0;j<1000;j++) {
sum++;
}
return sum;
}
The first fix is to mark j as private, so each thread gets its own copy of the inner loop counter:
volatile int sum=0;
int foo(void) {
int i,j;
sum=0;
#pragma omp parallel for private(j)
for (i=0;i<1000;i++)
for (j=0;j<1000;j++) {
sum++;
}
return sum;
}
That fixes the loop counters, but sum is still shared, so the total still comes out wrong. To get the right answer, we just ask OpenMP to automatically total up a private copy of "sum":
volatile int sum=0;
int foo(void) {
int i,j;
sum=0;
#pragma omp parallel for private(j) reduction(+:sum)
for (i=0;i<1000;i++)
for (j=0;j<1000;j++) {
sum++;
}
return sum;
}
This is now 0.44 ns/iteration, a solid 4x speedup over the original code, and it gets the right answer!
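As an aside (this variant isn't timed in these notes), the other fix from the table at the top, an atomic instruction, also gives the right answer here: "#pragma omp atomic" turns the increment into a hardware-atomic update. It's correct, but every core is now fighting over the same memory location, so it's generally much slower than the reduction, which lets each thread count in private and only combines the totals at the end.
volatile int sum=0;
int foo(void) {
	int i,j;
	sum=0;
	#pragma omp parallel for private(j)
	for (i=0;i<1000;i++)
		for (j=0;j<1000;j++) {
			#pragma omp atomic // hardware-atomic increment of the shared counter
			sum++;
		}
	return sum;
}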