Multicore Parallel Programming & Pitfalls

CS 441/641 Lecture, Dr. Lawlor

Let's say I want to print out 20 integers, and for some insane reason I care how fast this happens.
const int n=20;
double start=time_in_seconds();

for (int i=0;i<n;i++) {
    std::cout<<"i="<<i<<"\n";
}

double end=time_in_seconds()-start;
double per=end/n;
std::cout<<"Time: "<<per*1.0e6<<" microseconds per iteration\n";

(Try this in NetRun now!)

This takes 10us per print, which is pretty slow.  But it's only using one core.
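(An aside on timing: time_in_seconds() is a NetRun utility.  If you're following along outside NetRun, a minimal stand-in using OpenMP's wall-clock timer--my sketch, not NetRun's actual implementation--looks like this:)

#include <omp.h>

/* Wall-clock time in seconds, as a double.
   omp_get_wtime is standard OpenMP, so this works anywhere OpenMP does. */
double time_in_seconds(void) {
    return omp_get_wtime();
}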

Right, then: let's put more processors on the job, using the very simple OpenMP directives.  These split the loop iterations across the cores, so on a four-core machine core 0 gets iterations 0-4, core 1 iterations 5-9, core 2 iterations 10-14, and core 3 iterations 15-19.
const int n=20;
double start=time_in_seconds();

#pragma omp parallel for /* make this loop multicore! */
for (int i=0;i<n;i++) {
    std::cout<<"i="<<i<<"\n";
}

double end=time_in_seconds()-start;
double per=end/n;
std::cout<<"Time: "<<per*1.0e6<<" microseconds per iteration\n";

(Try this in NetRun now!)

Sadly, this now takes about 28us per print--it got 180% slower!

Here's the other problem: it never quite runs the same way twice.  Worse, cores can interrupt each other in mid-line: the mangled "i=i=0" below is two cores both pushing "i=" onto cout before either one got its number out.  Hmmm.
i=i=0
i=1
i=2
i=3
i=4
i=5
i=6
i=7
i=8
i=9
15
i=16
i=17
i=18
i=19
i=10
i=11
i=12
i=13
i=14
i=i=0
i=1
i=2
i=3
i=4
i=10
i=11
i=12
i=13
i=14
15
i=16
i=17
i=18
i=19
i=5
i=6
i=7
i=8
i=9
i=i=0
i=1
i=2
i=3
i=4
i=15
i=16
i=17
i=18
i=19
5
i=6
i=7
i=8
i=9
i=10
i=11
i=12
i=13
i=14
i=i=0
i=1
i=2
i=3
i=4
i=5
i=6
i=7
i=8
i=9
15
i=16
i=17
i=18
i=19
i=10
i=11
i=12
i=13
i=14
i=i=0
i=1
i=2
i=3
i=4
i=15
i=16
i=17
i=18
i=19
5
i=6
i=7
i=8
i=9
i=10
i=11
i=12
i=13
i=14
i=i=i=0
i=1
i=2
i=3
i=4
5
i=6
i=7
i=8
i=9
15
i=16
i=17
i=18
i=19
i=10
i=11
i=12
i=13
i=14

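By the way, if you want to see which core grabbed which chunk of iterations, one small variant (my sketch, not one of the timing runs above) tags each line with omp_get_thread_num() from <omp.h>:

#include <omp.h>

const int n=20;
#pragma omp parallel for
for (int i=0;i<n;i++) {
    std::cout<<"core "<<omp_get_thread_num()<<" ran i="<<i<<"\n";
}

The lines can still collide mid-print just like before, but the core numbers make the 0-4 / 5-9 / 10-14 / 15-19 split visible.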
We can improve matters by making each entire output line a "critical section", so it's printed as a unit.  This keeps each line intact, and actually improves the speed a bit, to 24us per print, only 140% slower than single core.
const int n=20;
double start=time_in_seconds();

#pragma omp parallel for /* make this loop multicore! */
for (int i=0;i<n;i++) {
#pragma omp critical /* do this stuff single-core */
    {
        std::cout<<"i="<<i<<"\n";
    }
}

double end=time_in_seconds()-start;
double per=end/n;
std::cout<<"Time: "<<per*1.0e6<<" microseconds per iteration\n";

(Try this in NetRun now!)

Now every line comes out whole, though the five-iteration chunks still land in a different order on every run:
i=5
i=6
i=7
i=8
i=9
i=0
i=1
i=2
i=3
i=4
i=15
i=16
i=17
i=18
i=19
i=10
i=11
i=12
i=13
i=14
i=5
i=6
i=7
i=8
i=9
i=0
i=1
i=2
i=3
i=4
i=15
i=16
i=17
i=18
i=19
i=10
i=11
i=12
i=13
i=14
i=10
i=11
i=12
i=13
i=14
i=0
i=1
i=2
i=3
i=4
i=15
i=16
i=17
i=18
i=19
i=5
i=6
i=7
i=8
i=9
i=5
i=6
i=7
i=8
i=9
i=0
i=1
i=2
i=3
i=4
i=15
i=16
i=17
i=18
i=19
i=10
i=11
i=12
i=13
i=14
i=5
i=6
i=7
i=8
i=9
i=0
i=1
i=2
i=3
i=4
i=15
i=16
i=17
i=18
i=19
i=10
i=11
i=12
i=13
i=14

PROBLEM: Parallel access to any single resource results in contention.  Contention leads to wrong answers, bad performance, or both at once.

SOLUTION(I): Ignore this, and get the wrong answer, slowly.
SOLUTION(F): Forget parallelism.  You get the right answer, but only using one core.
SOLUTION(C): Add a critical section (or a lock, or a mutex, etc) to control access to the single resource.  This gives you the right answer, but costs performance--the whole point of the critical section is to reduce parallelism.
SOLUTION(P): Parallelize (or "privatize") all resources.  This is the best solution, but making several copies of hardware or software is expensive.  This is the model that highly scalable software like MPI recommends: even the main function is parallel!
SOLUTION(H): Hybrid: use any of the above where it's appropriate.  This is the model OpenMP recommends: you start serial, add parallelism where it makes sense, and privatize or restrict access to shared things to get the right answer.  (Solutions C and P are contrasted in the little sum sketch below.)
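To make solutions (C) and (P) concrete, here's a minimal sketch (mine, with made-up names sum_critical and sum_reduction) that sums a shared array both ways.  The critical-section version gets the right answer but serializes every add; the reduction clause is OpenMP doing the privatizing for you--each core sums into its own private copy, and the copies get added together after the loop.

#include <omp.h>

const int n=1000000;
float arr[n]; /* pretend this holds real data */

/* SOLUTION(C): right answer, but every add waits its turn */
float sum_critical(void) {
    float sum=0.0;
#pragma omp parallel for
    for (int i=0;i<n;i++) {
#pragma omp critical
        { sum+=arr[i]; }
    }
    return sum;
}

/* SOLUTION(P): reduction(+:sum) gives each core a private sum,
   then combines them--no core ever touches another core's copy */
float sum_reduction(void) {
    float sum=0.0;
#pragma omp parallel for reduction(+:sum)
    for (int i=0;i<n;i++) sum+=arr[i];
    return sum;
}

(One subtlety: floating-point addition isn't associative, so the reduction version can give very slightly different low bits than the serial summation order.)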

For example, we can eliminate contention on the single "cout" variable by writing the results into an array of strings.  Generally, arrays or vectors are nice, because multicore machines do a decent job at simultaneous access to different places in memory, but do a bad job at simultaneous access to a single shared data structure like a tree.
const int n=10000;
double start=time_in_seconds();

char results[n][100]; /* array of resulting strings (so each core can write its own) */
#pragma omp parallel for /* make this loop multicore! */
for (int i=0;i<n;i++) {
    sprintf(results[i],"i=%d\n",i);
}
/* concat & print all strings? (see below) */
double end=time_in_seconds()-start;
double per=end/n;
std::cout<<"Time: "<<per*1.0e6<<" microseconds per iteration\n";

(Try this in NetRun now!)

This version gets the right answer (assuming you concatenate all the strings properly), and by running a big enough problem, here 10K strings, you can soundly beat the single-core performance, by about 200%.
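For completeness, one way to do the "concat & print all strings" step is a plain serial pass after the parallel loop (a minimal sketch, assuming the results array above).  Each core wrote only its own slots, so reading them back in order is safe, and the serial pass restores the original output order:

std::string all;
for (int i=0;i<n;i++) all+=results[i]; /* serial pass: order restored */
std::cout<<all;

This pass is serial, but it's just memory copies, so it costs far less than the formatting work we parallelized.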