Multi-Core Computing Systems


Introduction

Traditionally, the primary way to increase the performance of a computing system was to increase the clock frequency of the processor. Manufacturers were able to keep pace with Moore's Law; however, in a 2006 whitepaper even Intel admitted that continuing along the same line of development was impractical, due to the power required by such systems and the resulting heat generation. To meet the growing demand for high-speed processing, microprocessor developers came up with “Multi-Core” as their solution.


Origins

Symmetric Multi-Processing (SMP) has been available in computing systems for a long time. In fact, by 1975 Honeywell's Multics systems boasted as many as four “central processors”.


The idea behind SMP is that two or more independent processors are connected in parallel to RAM and other system resources via a shared bus or crossbar.


Unfortunately, seeing performance benefits from multiple processors is not as simple as plugging in new hardware.


Operating systems must support multiple processors by scheduling the execution of processes on the available CPUs. Most modern operating systems support SMP. From the perspective of the operating system, two separate “sockets” each holding a single-core CPU look the same as one “socket” holding a dual-core CPU.
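
As an aside, a program can ask the operating system how many logical processors it sees, and the count makes no distinction between sockets and cores. Here is a minimal sketch for a Linux/glibc system (the _SC_NPROCESSORS_ONLN query is a POSIX-optional extension, so its availability is an assumption):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Number of processors currently online as seen by the scheduler;
       two single-core sockets and one dual-core socket both report 2. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online logical processors: %ld\n", n);
    return 0;
}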


Applications must also support multiple execution paths (threads) to gain the benefit of parallel processing.


Theory


Consider a very basic counting algorithm which, given a set of integers and a “target” integer, returns the number of times the target is found in the set. Every time the target is found, that element is set to zero in the original set.


int count(int *set, size_t n, int target)
{
    int matches = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (set[i] == target)
        {
            matches++;      /* tally the match */
            set[i] = 0;     /* zero out the matched element */
        }
    }

    return matches;
}


This is a simple serial algorithm. So how could we make it take advantage of multiple cores?


The key to developing a parallel algorithm is being able to break the problem down into multiple independent sub-problems. In this case we could have three separate threads run the same counting algorithm on one third of the “set” each, and then add up the counts returned by the threads, as in the sketch below.
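
A minimal sketch of that decomposition using POSIX threads follows; the struct slice type, the parallel_count wrapper, and the choice of three threads are illustrative assumptions, not part of the original algorithm:

#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 3

/* Arguments and result for one worker: a slice of the set. */
struct slice
{
    int *set;
    size_t n;
    int target;
    int matches;    /* per-thread count, summed by the caller */
};

/* Each thread runs the same serial algorithm on its own slice. */
static void *count_slice(void *arg)
{
    struct slice *s = arg;
    s->matches = 0;
    for (size_t i = 0; i < s->n; i++)
    {
        if (s->set[i] == s->target)
        {
            s->matches++;
            s->set[i] = 0;
        }
    }
    return NULL;
}

int parallel_count(int *set, size_t n, int target)
{
    pthread_t tid[NUM_THREADS];
    struct slice s[NUM_THREADS];
    size_t chunk = n / NUM_THREADS;
    int total = 0;

    for (int t = 0; t < NUM_THREADS; t++)
    {
        s[t].set = set + t * chunk;
        /* the last thread also takes any remainder */
        s[t].n = (t == NUM_THREADS - 1) ? n - t * chunk : chunk;
        s[t].target = target;
        pthread_create(&tid[t], NULL, count_slice, &s[t]);
    }

    for (int t = 0; t < NUM_THREADS; t++)
    {
        pthread_join(tid[t], NULL);
        total += s[t].matches;    /* reassemble the results */
    }

    return total;
}

Because the slices do not overlap, no locking is needed: each thread writes only to its own elements and its own counter.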


This is great if you are using an array to store your data, but what if you are using a linked list? A linked list cannot be indexed into the middle, so each thread must walk from the first node just to reach its portion of the list (see the sketch below), and much of the ability to process the problem in parallel is lost. So even with a great parallel algorithm, your data storage mechanism can impact your application's performance!
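
To make that traversal cost concrete, here is a hypothetical node type and the walk each thread would need just to find its starting point; advance() is an illustrative helper, and its cost grows with k because every step depends on the one before it:

#include <stddef.h>

struct node
{
    int value;
    struct node *next;
};

/* Reaching the k-th node means walking from the head: O(k) serial
   work that must happen before any parallel counting can begin. */
static struct node *advance(struct node *head, size_t k)
{
    while (head != NULL && k-- > 0)
        head = head->next;
    return head;
}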


Overhead is another consideration which must be addressed: coordinating the threads within an application and reassembling their results requires a certain amount of processing in its own right.


More Problems


Cache coherency is a big concern for hardware and application designers. Given the slow response time inherent in most DRAM architectures relative to CPU clock speeds, proper use of the memory cache is important to performance.


Generally, each core has a separate L1 cache and all cores share a single L2 cache (or at least that is how Intel does it).


Consider the following example:

  1. Core 1 reads and caches variable X

  2. Core 2 reads and caches variable X

  3. Core 1 writes and caches a new value for variable X

  4. Core 2 reads variable X from its own cache


Since the L1 cache for each core is separate, there must be a method to keep all caches synchronized. Otherwise, as in the example, the thread running on Core 2 would continue running with stale data.


Two common methods for keeping data uniform across multiple caches are “update” and “invalidate”.


In an update-based system, each cache informs all other caches whenever a write to memory is performed, so that every cache is updated with the new value.


In an invalidate-based system, when a single core writes to memory, an invalidate message is sent to all other caches informing them that their copy is stale and that they must fetch the new value from RAM on the next read attempt.
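
The following toy model sketches the invalidate approach for the four-step example above; the two-core layout and write-through behavior are simplifying assumptions, not a description of real hardware (an update-based system would instead push the new value into the other core's cache rather than clearing its valid flag):

#include <stdbool.h>
#include <stdio.h>

#define NUM_CORES 2

/* One cached copy of variable X per core. */
struct cache_line
{
    int value;
    bool valid;    /* false: the copy is stale and must be re-fetched */
};

static int ram_x = 42;                      /* backing store for X */
static struct cache_line cache[NUM_CORES];  /* each core's private cache */

static int core_read(int core)
{
    if (!cache[core].valid)                 /* miss, or copy was invalidated */
    {
        cache[core].value = ram_x;          /* fetch the current value from RAM */
        cache[core].valid = true;
    }
    return cache[core].value;
}

static void core_write(int core, int value)
{
    ram_x = value;                          /* write through to RAM */
    cache[core].value = value;
    cache[core].valid = true;
    for (int c = 0; c < NUM_CORES; c++)     /* broadcast the invalidate message */
        if (c != core)
            cache[c].valid = false;
}

int main(void)
{
    printf("Core 1 reads %d\n", core_read(0));   /* steps 1 and 2: both cores cache X */
    printf("Core 2 reads %d\n", core_read(1));
    core_write(0, 7);                            /* step 3: Core 1 writes a new value */
    printf("Core 2 reads %d\n", core_read(1));   /* step 4: invalidated copy re-fetched */
    return 0;
}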


Virtual Machines


Virtual machine platforms make use of multi-core systems by allocating CPU resources to individual virtual machines as needed. Advanced hypervisors can be quite efficient about resource scheduling. In fact, it has been suggested that future mobile embedded systems will benefit in terms of power utilization by combining virtual machine and multi-core technologies.


In theory it seems beneficial to use a hypervisor to schedule usage of CPUs, since individual virtual machines do not share common memory resources. This fits the ideal situation of having multiple independent processes running on each CPU core. There are, of course, some performance overheads due to the hypervisor itself.


However, in order to execute across multiple cores, a virtual machine would have to have at least two virtual CPUs. As such, hypervisors are not a silver bullet.



