Computer Architecture
The Anatomy of Modern Processors


Pipelines

In order to increase instruction throughput, high-performance processors make extensive use of a technique called pipelining. A pipelined processor doesn't wait until the result from a previous operation has been written back into the register file or main memory - it fetches and starts to execute the next instruction as soon as it has fetched the first one and despatched it to the instruction register. When the simple processor described on the previous pages is performing an add instruction, there is no need for it to wait for the add operation to complete before it starts fetching the next instruction. So a pipelined processor will start fetching the next instruction from memory as soon as it has latched the current instruction in the instruction register.

Thus a pipelined processor has a pipeline containing a number of stages (4 or 5 was a common number in early RISC processors) arranged so that a new instruction is latched into its input register as the results calculated in this stage are latched into the input register of the following stage. This means that there will be a number of instructions (equal to the number of pipeline stages in the best case) "active" in the processor at any one time.

In a typical early RISC processor (eg the MIPS R3000), there would be four pipeline stages:

IF  - Instruction Fetch: fetch the instruction from memory
DEC - Decode and Operand Fetch: decode the instruction and fetch its operands from the register file
EX  - Execute: execute the instruction in the ALU
WB  - WriteBack: write the result back into a register

Population of the pipeline at each clock cycle:
i1, i2, ... are successive instructions in the instruction stream.
With an n-stage pipeline, after n-1 clock cycles the pipeline becomes full, and thereafter an instruction completes on every clock cycle instead of every n clock cycles. This effectively speeds up the processor by a factor of n.
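The fill-then-stream behaviour above can be checked with a small sketch (the helper function is hypothetical, not from the text), assuming an ideal pipeline in which every stage takes exactly one clock cycle:

```python
def completion_cycles(n_stages, n_instructions):
    """Cycle on which each instruction leaves the last (WB) stage.
    Instruction i enters IF on cycle i+1 and completes n_stages-1 cycles later."""
    return [i + n_stages for i in range(n_instructions)]

# 4-stage pipeline, instructions i1..i6
cycles = completion_cycles(4, 6)
print(cycles)  # first result appears after 4 cycles, then one per cycle
```

The first instruction needs all four stages, but each later instruction finishes exactly one cycle after its predecessor - the pipeline is full.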

Latency and throughput

There's no free lunch!
Each instruction is still taking n cycles to complete. However, the throughput has been increased to an instruction every cycle from one every n cycles. The latency - or lag between the time an instruction is given to the processor and the time that it "emerges" with a result - has not changed. We have increased the throughput by adding a degree of parallelism to the system.
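The "no free lunch" arithmetic can be made concrete with a quick calculation, assuming an ideal n-stage pipeline with one-cycle stages and no bubbles (the function names here are illustrative):

```python
def pipelined_time(n, N):
    """Cycles for N instructions on an n-stage pipeline:
    n cycles to fill, then one completion per cycle."""
    return n + (N - 1)

def unpipelined_time(n, N):
    """Each instruction takes all n cycles serially."""
    return n * N

n = 4
for N in (1, 10, 1000):
    print(N, unpipelined_time(n, N) / pipelined_time(n, N))
```

The latency of a single instruction is still n cycles; only throughput improves, and the speedup approaches n only for long instruction streams.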

Bubbles

We can achieve this high throughput only if every instruction spends exactly one cycle in each stage. Although most arithmetic instructions in a RISC machine complete in a single cycle, some complex instructions (eg divide) take more than one cycle. Such an instruction remains in one of the stages (eg EX) for more than one cycle and creates a pipeline bubble ahead of it in the pipeline.

i2 is a long-latency instruction, eg divide, taking 3 cycles in the EX stage. A "bubble" develops ahead of it in the pipeline as instructions ahead of it move on to the next stage.
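The effect can be sketched with a toy in-order model of the 4-stage pipeline (the model and its simplifying assumptions - one-cycle IF, DEC and WB, no forwarding - are mine, not from the text):

```python
def wb_cycles(ex_cycles):
    """Cycle on which each instruction finishes WB, given how many
    cycles each one spends in EX, in an in-order IF/DEC/EX/WB pipeline."""
    ex_free = 0
    done = []
    for i, ex in enumerate(ex_cycles):
        start = max(i + 3, ex_free + 1)  # earliest EX entry vs. EX still busy
        ex_free = start + ex - 1
        done.append(ex_free + 1)         # one more cycle for WB
    return done

print(wb_cycles([1, 1, 1, 1]))  # ideal: one completion per cycle
print(wb_cycles([1, 3, 1, 1]))  # i2 holds EX for 3 cycles: a 2-cycle bubble
```

With i2 taking 3 cycles in EX, i2 and everything behind it complete two cycles later than in the ideal case - those two lost slots are the bubble.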

Pipeline bubbles are caused by long-latency instructions such as this and, as the next section shows, by branches.

Branching in pipelined processors

Efficient handling of branch instructions presents a significant challenge to a computer architect. The simplest branch - an unconditional one, such as the return from a procedure or function - requires a number of instructions behind it in the pipeline to be squashed.

When i2 reaches the execution stage and the new PC is being calculated, instructions i3 and i4 have already entered the pipeline and must be squashed. This creates a series of bubbles in the pipeline until the target of the branch instruction, ia, and its successors can be fetched from memory.

The diagram shows the cost of branches fairly dramatically - a large number of bubbles are present for quite a few cycles. This diagram also makes an optimistic assumption - that the new instruction stream could be accessed within 3 cycles (not realistic if the new instructions have to come from memory rather than cache!). Branches occur very frequently in typical programs, of the order of one every 10 instructions! This shows the importance of branch prediction strategies which we will examine later.
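The cost of branches can be folded into an average cycles-per-instruction figure with a rough model (the numbers below are illustrative, taken from the frequencies quoted above):

```python
def effective_cpi(branch_freq, branch_penalty, base_cpi=1.0):
    """Average cycles per instruction when each branch squashes
    branch_penalty instructions fetched behind it."""
    return base_cpi + branch_freq * branch_penalty

# One branch per 10 instructions, 2 squashed instructions per branch:
print(effective_cpi(0.1, 2))  # 1.2 cycles per instruction on average
```

A 20% slowdown from branches alone is why branch prediction matters so much.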

Branch delay slots

Some early processors were not able to squash the instruction following a branch in the hardware and required the compiler to insert a NOOP - an instruction that does nothing - into the program following every branch.
Thus instead of emitting this:
mult $4, $2, $1
add $3, $4, $5
ret
sub $4, $6, $7
the compiler would emit:
mult $4, $2, $1
add $3, $4, $5
ret
or $1, $1, $1
sub $4, $6, $7
Note that
or $1, $1, $1
is an effective NOOP - it changes nothing!
The instruction following the branch is said to be in the branch delay slot. It was soon realised that, since the instruction in this slot has progressed well down the pipeline anyway, if it was guaranteed that it would be executed, some improvement in performance would result.

The compiler is asked to move an instruction which precedes the branch, and which must be executed, into the branch delay slot, where it will be executed while the branch target is being fetched. Current RISC machines have a one-instruction branch delay slot - occasionally two.
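The transformation can be sketched as a toy scheduler over the instruction strings from the example above (the scheduling function itself is hypothetical, and it blindly assumes the instruction before the branch is independent of it, as the add is here):

```python
def fill_delay_slot(block):
    """Move the instruction just before the branch into the delay slot,
    assuming it is independent of the branch."""
    *body, last_before_branch, branch = block
    return body + [branch, last_before_branch]  # branch issues, slot executes after

code = ["mult $4, $2, $1", "add $3, $4, $5", "ret"]
print(fill_delay_slot(code))
# ['mult $4, $2, $1', 'ret', 'add $3, $4, $5'] - add runs in the delay slot
```

A real compiler must first prove the moved instruction is executed on every path to the branch and does not feed the branch condition; this sketch skips that analysis.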

This is an example of the modern trend in computer architecture - to expose more details of the underlying machine to the compiler and let it generate the most efficient code. In this case, it is almost trivial for the compiler to find an instruction before the branch which must be executed. Compilers are rapidly becoming more capable and much more complex instruction reordering operations are routinely performed by optimising compilers.

We will see more examples of this movement of responsibility for execution efficiency from the hardware to the software later.

Conditional Branches

Branches which depend on some state of the machine (eg the value stored in a register) introduce another problem into pipelined machines. Consider this sequence of instructions:
start: addi $3, $3, 4  ; increment address
       lw $2, 0($3)    ; load next array element
       add $1, $1, $2  ; add to total
       subu $5, $5, 1  ; decrement count
       bne $5, $0, start ; if not zero, branch back
which sums the values in an array. The element count is stored in register 5 and is decremented on each iteration of the loop. The loop terminates when it reaches 0. The branch tests the value left in register 5 by the subu instruction. If it's not zero, it will branch back to the beginning of the loop. However this value is not available until the subu instruction has reached the WB stage. The branch instruction must therefore stall in the execution stage because one of its operands was not available when it entered this stage. This is the first example of a pipeline hazard - we shall see more in the next section.
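The length of the stall for the loop above can be estimated with a small helper (hypothetical, and assuming the branch needs the new value of register 5 at the start of its EX stage and that there is no forwarding hardware):

```python
def stall_cycles(producer_wb, consumer_ex):
    """Cycles the consumer must wait so its EX stage starts
    strictly after the producer's WB stage has written the value."""
    return max(0, producer_wb - consumer_ex + 1)

# subu is one instruction ahead of bne, so subu reaches WB on the same
# cycle that bne would reach EX - bne must wait one cycle:
print(stall_cycles(producer_wb=7, consumer_ex=7))  # 1 stall cycle
```

Forwarding (bypass) paths, covered with the other hazards in the next section, can remove many such stalls by routing the ALU result directly to the waiting stage.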

© John Morris, 1998
