Multicore: Who ordered this?

The combination of pipelining and superscalar execution is designed to extract maximum runtime parallelism from sequential machine code. Though a combination of branch prediction, register renaming, and operand forwarding, this is remarkably effective, able to finish up to a half-dozen instructions per clock cycle for real code. The problem is there just isn't much more parallelism available in sequential code.

One solution is to change the code.

For example, we could keep each CPU the same, and just stick several of them together onto the same chip. This is called multicore.

Kinds of Parallel Hardware

One ALU, one set of registers: serial computer (single processor).
Multiple ALUs, one decoder & one set of registers: Single Instruction Multiple Data (SIMD), like SSE.
One set of ALUs, shared by several sets of decoders & registers: Symmetric Multi-Threading (SMT) or HyperThreading. This is only a tiny amount of additional chip area, but increases ALU utilization way more than wider superscalar execution. ALU contention limits the parallelism to perhaps 4-way on modern CPUs, but can reach 64-fold on GPUs.
Several full CPUs, one shared RAM area: multicore or Symmetric Multi-Processing (SMP). Typically we call it multicore when the CPUs are on a single silicon die, and SMP when they're separate dies. 4 cores is entirely standard today, 8 cores are available, and commercial SMP nodes with 64 or 128 way access exist. SMP can also combine with SMT for absurd numbers of threads: e.g., a typical GPU is 128-way SMP, with a total of 8192-way SMT.
Several CPUs, each with local RAM, but with a nonlocal RAM interconnect: Non-Uniform Memory Architecture (NUMA). Unlike SMP, where all RAM is equal, in a NUMA, it's faster to go to your own RAM (20ns) compared to some other CPU's RAM (80ns).
Several complete nodes, connected with network: Cluster or Massively Parallel Processor (MPP). Memory is not shared, so this a distributed system. This is the only architecture known to scale to hundreds of thousands of CPUs--nothing is shared except the network. If all the computers are running the same program, as is typical to solve tightly coupled problems, it's called Single Program Multiple Data (SPMD). It still takes microsecond (1000ns) to get to another node's RAM across even very fast networks. The Cloud is more loosely coupled, typically using virtual machines physically connected with some variant of ethernet within each data center, so it takes a substantial fraction of a millisecond to communicate between nodes. The whole Internet is a huge and very very loosely coupled distributed system, because all the computers on the internet are running different programs at different times for different reasons.

How do we write parallel code?

First, to get into the proper revolutionary mindset, read this now decade-old but prescient article:

The Free Lunch is Over: A Fundamental Turn Toward Concurrency in Software
written by Herb Sutter, smart Microsoft guy on the C++ standards committe

Notable quotes:

"Andy giveth, and Bill taketh away." That is, Andrew Grove, Intel, keeps building faster and faster CPUs. Bill Gates, Microsoft, keeps building slower and slower (yet more features!) software.
"Concurrency is the next major revolution in how we write software." That is, by 2020 purely sequential code will have the same retro feeling that COBOL has today.
"Probably the greatest cost of concurrency is that concurrency really is hard: The programming model, meaning the model in the programmer’s head that he needs to reason reliably about his program, is much harder than it is for sequential control flow." That is, since Aristotle we've trained humans to reason sequentially. Yet physics runs in parallel. Horrible things can crawl into our universe through this gap!
"For example, Intel is talking about someday producing 100-core chips; a single-threaded application can exploit at most 1/100 of such a chip’s potential throughput." This has already happened, with Intel Xeon Phi, a GPU-looking coprocessor with 50+ cores and hundreds of threads. Many commercial applications today are single-threaded. They will either adapt or die out.
"just because it takes one woman nine months to produce a baby doesn’t imply that nine women could produce one baby in one month." So if the problem is "make one object", parallelism is useless. But if you change the problem to "make 1 million objects", like most performance-intensive problems, suddenly you have million-way parallelism.
"A few rare classes of applications are naturally parallel, but most aren’t." In other words, occasionally it's obvious how to run in parallel, and actually running in parallel is easy. Usually, though, you have to rethink and reformulate the basic algorithm used to solve the problem. I'd add that "A few rare applications are impossible to parallelize, but given enough effort, most aren't."

Parallel Threads in C++11

C++11 added a thread library, in <thread>. This makes it easy to create threads in a portable way, without needing to call Windows kernel threads or UNIX pthreads.

#include <thread>

void do_work(char where) {
	for (int i=0;i<10;i++) {
		std::cout<<where<<i<<"\n";
	}
}

void worker_thread(void) {
	do_work('B');
}

void foo(void) {
	std::thread t(worker_thread);
	do_work('A');
	t.join();
	std::cout<<"Done!\n";
}

(Try this in NetRun now!)

Notice in the above program, sometimes the prints from A and B overlap, and run in an arbitrary order.

PROBLEM: Parallel access to any single resource, like cout, results in resource contention. Contention leads to wrong answers, bad performance, or both at once.

SOLUTION(I): Ignore this, and get the wrong answer, slowly. You can also occasionally crash onstage.
SOLUTION(F): Forget parallelism. You get the right answer, but only using one core.
SOLUTION(A): Use an atomic operation to access the resource. This is typically only possible for single-int or single-float operations, and has some overhead on CPUs, but it's possible for the hardware to make this efficient, like on most GPUs.
SOLUTION(C): Add a mutex, a mutual exclusion device (AKA lock) to control access to the single resource. This gives you the right answer, but costs performance--the whole point of the critical section is to reduce parallelism.
SOLUTION(P): Parallelize (or "privatize") all resources--then there aren't any sharing problems because nothing is shared. This is the best solution, but making several copies of hardware or software can be expensive. This is the model that highly scalable software like MPI recommends: even the main function is parallel!
SOLUTION(H): Hybrid: use any of the above where it's appropriate. This is the model OpenMP recommends: you start serial, add parallelism where it makes sense, and privatize or restrict access to shared things to get the right answer.

For example, we can eliminate contention on the single "cout" variable by writing the results into an array of strings. Generally, arrays or vectors are nice, because multicore machines do a decent job at simultaneous access to different places in memory, but do a bad job at simultaneous access to a single shared data structure like a file or network device.

Multicore via OpenMP

Because threaded programming is so ugly and tricky, there's a simple loop-oriented language extension out there called OpenMP, designed to make it substantially easier to write multithreaded code.

The basic idea is you take what looks like an ordinary sequential loop, like:

    for (int i=0;i<n;i++) do_fn(i);

And you add a little note to the compiler saying it's a parallel forloop, so if you've got four CPUs, the iterations should be spread across the CPUs. The particular syntax they chose is a "#pragma" statement, with the "omp" prefix:

#pragma omp parallel for num_threads(4)
    for (int i=0;i<n;i++) do_fn(i);

Granted, this line has like a 5,000ns/thread overhead, so it won't help tiny loops, but it can really help long loops. After the for loop completes, all the threads go back to waiting, except the master thread, which continues with the (otherwise serial) program. Note that this is still shared-memory threaded programming, so global variables are still (dangerously) shared by default!

If your compiler supports OpenMP 4.0 (e.g., gcc 4.9 or later), you can ask the compiler to use SIMD instructions in your loops like this, although at the moment it doesn't seem to combine with threads.

#pragma omp simd
    for (int i=0;i<n;i++) arr[i]=3*src[i];

Unlike bare threads, with OpenMP:

The compiler decides how many threads to create (typically the number of CPUs)
The compiler moves the guts of your loop off into a separate function.
The compiler creates and deallocates the threads.
The compiler can help you find and fix memory race conditions.

Here's how you enable OpenMP in various compilers. Visual C++ 2005 & later (but NOT express!), Intel C++ version 9.0, and gcc version 4.2 all support OpenMP, although earlier versions do not!

Here's the idiomatic OpenMP program: slap "#pragma omp parallel for" in front of your main loop. You're done!

Here's a more complex "who am I"-style OpenMP program from Lawrence Livermore National Labs. Note the compiler "#pragma" statements!

CS 441 Lecture Note, 2014, Dr. Orion Lawlor, UAF Computer Science Department.