Out-of-Order Execution (OOOE): How does it work?

 

Basically, the processor performs instructions out of the order in which they are given in order to save time.  Wikipedia gives a good outline of the process (http://en.wikipedia.org/wiki/Out-of-order_execution):

 

1.    Instruction fetch.

2.    Instruction dispatch to an instruction queue (also called instruction buffer or reservation stations).

3.    The instruction waits in the queue until its input operands are available. The instruction is then allowed to leave the queue before earlier, older instructions.

4.    The instruction is issued to the appropriate functional unit and executed by that unit.

5.    The results are queued.

6.    Only after all older instructions have had their results written back to the register file is this result written back to the register file.  This is called the graduation or retire stage.

 

Executing instructions this way keeps the functional units busy instead of stalling behind one slow instruction, which saves a lot of time.  But there are some drawbacks: the dependencies of all instructions need to be determined, and an instruction cannot execute before, or at the same time as, an instruction on which it depends. (http://en.wikibooks.org/wiki/Microprocessor_Design/Out_Of_Order_Execution).
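The wait-then-issue logic of steps 3 and 4 above, together with the dependency rule, can be sketched as a toy scheduler. This is a simplification, not a real microarchitecture: it assumes unlimited functional units, skips the in-order retire stage, and the registers and latencies are made up for illustration:

```python
# Toy out-of-order issue: an instruction may issue as soon as its source
# operands are ready, even if an older instruction is still waiting.

# (dest register, source registers, latency in cycles) -- all invented
program = [
    ("r1", ["r0"], 3),   # I0: long-latency load into r1
    ("r2", ["r1"], 1),   # I1: depends on I0, must wait
    ("r3", ["r0"], 1),   # I2: independent, can run before I1
    ("r4", ["r3"], 1),   # I3: depends only on I2
]

ready_at = {"r0": 0}     # cycle at which each register value is available
issue_cycle = {}

cycle = 0
pending = list(range(len(program)))
while pending:
    for i in list(pending):
        dest, srcs, lat = program[i]
        # issue only when every input operand has been produced
        if all(s in ready_at and ready_at[s] <= cycle for s in srcs):
            issue_cycle[i] = cycle
            ready_at[dest] = cycle + lat
            pending.remove(i)
    cycle += 1

for i in sorted(issue_cycle):
    print(f"I{i} issued at cycle {issue_cycle[i]}")
```

Note that I2 and I3 issue at cycles 0 and 1, before the older I1, which cannot issue until I0's result is ready at cycle 3 — exactly the "leave the queue before earlier, older instructions" behavior from step 3.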

 

Side note- Register renaming is used to avoid false dependencies that would otherwise make OOOE impossible.  As we've talked about in class, register renaming means the processor has more physical registers than are defined by the architecture.

 

Hyperthreading Example

 

Wikibooks gave a pretty understandable example:

In a hyperthreaded system, it appears to the operating system that there are two separate processors, when there is only one processor. The OOOE engine feeds instructions from the two separate execution threads to the execution cores, to try and keep all of the cores simultaneously busy. In general hyperthreading increases performance, although in some instances this additional complexity actually decreased performance.

 

History

 

The first machine to use OOOE was the CDC 6600, which actually used a scoreboard. That was in 1964 and almost 3 years later, the IBM 360/91 supported full OOOE.  Now we jump to 1990, when IBM made the first OOOE microprocessor called the POWER1; although, it only worked on floating point instructions.  The 90s were when OOOE really took hold.  Throughout the 1990s out-of-order execution became more common, and was featured in the IBM/Motorola PowerPC 601 (1993), Fujitsu/HAL SPARC64 (1995), Intel Pentium Pro(1995), MIPS R10000 (1996), HP PA-8000 (1996), AMD K5 (1996) and DEC Alpha 21264 (1998). Notable exceptions to this trend include the Sun UltraSPARC, HP/Intel Itanium,Transmeta Crusoe

, Intel Atom, and the IBM POWER6. –Wikipedia.

 

 

Branch Prediction

 

 

History

 

Originally branch prediction wasn't needed, because the earliest computers took multiple cycles to perform any instruction.  However, as computer evolved pipelining became important to speeding up a programs performance and this created a need for computer to figure out how a program branches or jumps. Branches are points in programs when the execution could follow two or more paths, such as if-then-else statements or for loops.

 

The basic idea of branch prediction started when Tom McWilliams and Curt Widdoes introduced two-bit predictors. Two-bit predictors used a simple counter from 0 to 3 to predict whether or not a branch instruction is taken or not. If the counter is at 0 or 1 the predictor was predicting that the branch instruction was not taken. If it was 2 or 3  it predicted that the instruction will be taken and the the next instruction should come from the taken path. If the prediction was wrong the correct instructions are fetched and the counter is changed by one depending on whether or not the branch was taken or not. This method was accurate for predicting branches in loops. (http://everything2.com/title/2-bit+predictor)

 

IBM was creating the IBM 7030 also known as the STRETCH. The STRETCH included many different organizational techniques including precoding, memory operand prefetch, out-of-order prediction, branch misprediction recovery, and precise interrupts. The STRETCH pre-executed all branches that used the index registers, this caused a significant number of branch mispredictions. This branch mispredictions caused the STRETCH to perform poorly. (http://www.computer-history.info/Page4.dir/pages/IBM.7030.Stretch.dir/index.html)

 

Burroughs B4900 used a branch prediction history state, stored in memory instructions during a programs execution. This scheme used a 4-state branch prediction by using the branch opcodes to represent the branch operator types. This particular scheme predicted branches correctly 93% of the time.

           

Types of Branches

Conditional

Unconditional

Direct

If-then-else

for loops

(benz, bez, ect)

Precedural calls (jal)

gotos (j)

Indirect

 

Return (jr)

function pointers (jalr)

 

(http://www.cs.virginia.edu/~skadron/cs654/slides/bpred.ppt)

 

RISC, MIPS, and SPARC originally only did the not taken branch prediction, which means any time a program could branch it chose not to. This didn't matter because the original processors could only fetch one instruction per clock cycle. As the processors evolved they kept the old branch predictor, since it was a bad branch predictor it caused the processors to slow down because when the branch mispredicted the processor lost four clock cycles to resolve the bad prediction.

 

In the Intel Pentium III and Pentium 4 had a Branch Prediction Unit. The Pentium III  uses  local branch prediction, which keeps a history buffer to predict branches, has about a 90% accuracy rate. The Pentium 4 just uses a more efficient branch prediction algorithm, and has about 94% accuracy rate.  (http://www.slcentral.com/articles/01/8/p4/)

 

Today Intel has Intel Core i7 uses an array of branches to store the results of the branches as the program runs and uses an algorithm to attempt to determine the next branch. The new algorithm uses a two-level predictor first-level stores recent branch history and the second-level has slower access, but stores more branch history. This is good for branch prediction in large volumes of code, like databases. (http://www.tomshardware.com/reviews/Intel-i7-nehalem-cpu,2041-4.html)

 

AMD processors use a two level predictor like Intel. AMD uses a mixture between a global branch predictor and a saturated counter. The global branch predictor uses the history of all of the jumps in the program. The saturated counter like the two-bit predictor, it keeps a count to decide whether or not the program needs to branch.

 

http://www.omnipelagos.com/entry?n=branch_predictor

 

Rationale

 

The main idea behind branch prediction is stop the program from coming to a halt every time a conditional jump could occur. Since about 15-25% of a program are statements that can potentially branch it is important find a way to speed up the programs, because otherwise programs would waste clock cycles doing very little. (http://www.cs.virginia.edu/~skadron/cs654/slides/bpred.ppt) So to take advantage of the fact that most processors can do multiple instructions per clock cycle the branch predictor tries to figure out whether or not a conditional jump is taken, which significantly increases the performance of the program.

 

There are two main types of branch prediction, Static and Dynamic. Static branch prediction the basic branch prediction is done before runtime. While dynamic branch predictions the predictions may change throughout the program.

 

It is important to create an efficient and accurate branch predictor. A good branch predictor allows the program to increase the number of instructions available for the scheduler. It increases the level of parallelism in the program, which allows for more instructions to be spread out over a shorter time. It also allows more instructions to be completed while waiting for a branch to resolve.

 

http://www.cs.virginia.edu/~skadron/cs654/slides/bpred.ppt

http://www.slcentral.com/articles/01/8/p4/

 

 

Advantages

 

The main advantage to using branch prediction is to accurately pipeline instructions, which allows for the instructions to be completed faster. It does this through allowing more instructions to be completed at a single clock cycle, in turn dramatically increasing the efficiency of the processor.

 

http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts/

 

Disadvantages

 

One of the major problems with branch prediction is what if the system branches when it doesn't need to. Well this is called Branch misprediction and it causes affects the processor in two ways. First it causes the processor to waste its time by executing instructions that are on the mispredicted path. Second the mispredicted paths instructions must be flushed, also called bubbles, into the pipeline until the proper instructions overwrite the pipeline. If main penalty for mispredicting a branch is it causes a significant number of cycles to be wasted anywhere from 1 (very simple branch predictors) to about 40 (very complicated branch predictors) cycles to be used.

 

http://www.cs.nyu.edu/goldberg/pubs/gchp02.pdf  

http://http://www.slcentral.com/articles/01/8/p4/