Pipeline Hazards

We have seen that ideally, a new instruction should enter the pipeline on each clock cycle, so that each pipeline stage is busy on each clock cycle.  Unfortunately, there are some sequences of instructions where that is impossible (assuming we want the pipelined program to run the same as it would on a non-pipelined machine!)  These are called pipeline hazards.  Pipeline hazards are instruction dependencies which prevent an instruction from being executed in the next available clock cycle.

There are three types of pipeline hazards:

1.  Structural hazards.  A structural hazard occurs when a single hardware component is needed by more than one instruction in the pipeline.  For example, if there is a single ALU, which is used by two different pipeline stages, it will not be possible to have two instructions perform those two stages at the same time.  In general, the problem here is insufficient parallelism in the data path.  We need to add hardware components so that components are not shared between stages.

2.  Data hazards.  A data hazard occurs when an instruction depends on the result of a prior instruction, but that result is not available when needed because the prior instruction has not yet completed.

3.  Control hazards.  A control hazard occurs as a result of an instruction which changes the program counter.  The result is that some instructions already in the pipeline may need to be invalidated, thus causing lost cycles.  Control hazards can be caused by branches or interrupts.
 

We will look at how to handle hazards.  Some can be handled effectively without losing any cycles.  Others will cause pipeline stalls.

When stall cycles are included, the speedup from pipelining becomes

speedup =  (non-pipelined CPI) / (pipelined CPI)

              =  (pipeline depth) / (1 + pipeline stall cycles per instruction)
 

Structural hazards

Structural hazards can occur as a result of:

1.  Lack of resource duplication where needed.  The same component is needed by two pipeline stages.  For example, if we have a single memory for data and code, and can only read from one location at a time, there will be a conflict between the IF and MEM stages of the DLX pipeline.

2.  Functional unit not fully pipelined.  If a functional unit (such as the floating point unit) is not pipelined, its operations may require more than one cycle to complete.

These hazards can be handled by introducing stall cycles.  In the case of the IF, MEM conflict, we can introduce a stall cycle as follows:

If EX/MEM.IR contains a load or store instruction, then

Stalls in intermediate stages are a little more complicated.  We will deal with them later.
 

Data hazards

The basic problem here is that the expected order of operations is altered by the pipeline.

For example, consider the code sequence

i1:  add  r1,r2,r3
i2:  add  r4,r1,r5
i3:  sub  r6,r7,r1

The result of i1 is used by i2.  i2 fetches the contents of r1 and r5 from the register file during stage ID, while i1 is in stage EX.  But i1 will not write its result back to the register file until stage WB.  So when i2 fetches r1, it will get the old value of r1.

Consider the timeline:

        1      2       3        4          5           6          7            8          9
i1    IF     ID    EX    MEM    WB
i2            IF     ID     EX        MEM    WB
i3                     IF     ID         EX         MEM   WB
i4                              IF          ID         EX        MEM   WB
i5                                           IF          ID         EX        MEM   WB

There is no conflict between i1 and i5.  i1's result is written to the register file during cycle 5, and i5 does not fetch it until cycle 6.  There appears to be a conflict between i1 and i4, because i4 needs to fetch its register operand during cycle 6.  This conflict may be avoided by requiring that the register writeback be completed early during the WB stage, and allowing the register fetch in the ID stage to be performed late in the cycle, possibly by using a two-phase clock.

It is clear that there is a conflict between i1 and i2, and between i1 and i3.  This type of hazard can be handled by a technique called forwarding or bypassing.  It is based on the observation that, although the value computed by i1 is not written back to the register file until cycle 5, it is actually computed during cycle 3.  During cycle 4, the value is available in EX/MEM.ALUOutput (where it could be used by the EX stage of i2).  During cycle 5, it is available in MEM/WB.ALUOutput (where it could be used by the EX stage of i3)

Another example:

i1:   add  r1,r2,r3
i2:   lw    r4,0(r1)
i3:   sw   12(r1),r4

The value of r1 computed by i1 is needed by i2 and i3 for address calculation.  It can be forwarded from EX/MEM.ALUOutput for use by i2 and from MEM/WB.ALUOutput for use by i3.

We'll cover implementation details a little later.
 

These are examples of RAW hazards (read after write).  (r1 is read by an instruction which comes after an instruction which writes r1.)  The hazard occurs if the reader needs the data before the writer has written it; the reader may get an out-of-date version of the data.

There are two other types of data hazard:  WAW (write after write) and WAR (write after read).

Here is an example of a WAW hazard (on a hypothetical machine that has an add instruction with one register operand and one memory operand):

i1:   add   r3,4(r4)
i2:   mov  r3,r5

The hazard could occur if we allow faster instructions to proceed through the pipeline ahead of slower ones.  The problem in the code above could occur if i2 completes before i1 does.  Then the value written to r3 by i2 could be overwritten by the value computed by i1; in effect, it will appear that the instructions were executed in reverse order.

This sort of hazard cannot occur in DLX.

WAR hazards are rare.  Here is an example:

i1:   mov    20(r3),30(r4)  //  this is a memory-to-memory move, requiring computation of two effective addresses.
i2:   add     r3,1                //  this instruction increments r3

It is possible that in some implementations, the effective address of the first operand might not be completed until r3 has been incremented by i2, giving an incorrect value for the effective address.
 

Some hazards cannot be avoided by forwarding.  Consider the following example:

i1:   lw    r1,0(r2)
i2:   sub  r4,r1,r5
i3:   and  r6,r1,r7
i4:   or    r8,r1,r9

The value loaded from memory by i1 is fetched during the MEM stage (cycle 4) and stored in MEM/WB.LMD.  It is available starting in cycle 5.  i4 fetches the value of r1 during its ID stage, which is cycle 5, so there is no hazard.  i3 needs the value in its EX stage (cycle 5), so the value must be forwarded from MEM/WB.LMD to one of the ALU inputs.  But i2 needs to use the value during its EX stage, which is cycle 4.  The value has not been fetched from memory yet, so there is no way to use forwarding to obtain the value.

So, we need a stall cycle inserted at the EX phase during cycle 4.  i2, i3, and i4 will all be delayed by one clock cycle.  This is known as a pipeline interlock.

One way to avoid this type of hazard is to have the compiler try to avoid using the result of a load immediately after the load, by inserting some other useful code there.  This is called a "load delay."  That is, the compiler tries to avoid the hazard by rearranging instructions.  In order to do this, the compiler must identify dependencies between instructions within "basic blocks."

Another approach is to expose the delay at the ISA level.  With a "delayed load", it is part of the ISA specification that the instruction immediately following a load cannot use the loaded value.  It is then the responsibility of the compiler writer or assembly programmer to either follow the load by an independent instruction (possibly by rearranging instructions), or else simply to insert a nop after the load.  The advantage here is that no stall cycles are required.
 

Implementation of hazard detection, forwarding, and interlocks

The following example illustrates detection of a RAW hazard which requires a stall cycle.

lw     r1,0(r3)
add   r4,r1,r5

After two cycles, the lw instruction is in ID/EX.IR and the add instruction is in IF/ID.IR.  The hazard can be detected during the third cycle.  It occurs if

ID/EX.IR is a load, and
IF/ID.IR is a reg-reg ALU operation, and
(ID/EX.IR11-15 (the load destination) = IF/ID.IR6-10 (one operand of the ALU operation), or
ID/EX.IR11-15 = IF/ID.IR11-15 (the other operand of the ALU operation))

or

ID/EX.IR is a load, and
IF/ID.IR is a reg-imm ALU, branch, load, or store operation, and
ID/EX.IR11-15 = IF/ID.IR11-15

This is a boolean function which can be implemented with combinational logic.

What should be done if this hazard is detected?  Insert a stall cycle by doing:

IF/ID.IR <-- IF/ID.IR    (keep same instruction)
PC <-- PC   (don't increment program counter)
ID/EX.IR <-- no-op instruction

Note:  even with the stall cycle, this sequence requires a forwarding operation.  The value fetched by the lw must be forwarded from MEM/WB.LMD to one of the ALU inputs during the add instruction's EX cycle (which is also the lw instruction's WB cycle).
 

As an example of the detection and implementation of a hazard which can be avoided by forwarding, suppose we have:

add      r1,r2,r3

followed by one of:

add      r4,r1,r5
add      r4,r5,r1
addi     r4,r1,immediate
lw/sw   r4,offset(r1)

In each of these cases, the ALUOutput from the first add must be forwarded to one of the ALU inputs for the following instruction.  The detection condition is in part

EX/MEM.IR is a reg-reg or reg-imm ALU operation,
ID/EX.IR is a reg-reg ALU instruction, and
(EX/MEM.IR16-20 = ID/EX.IR6-10, or
EX/MEM.IR16-20 = ID/EX.IR11-15)

If this condition holds, then either ALUtop or ALUbottom must receive input from EX/MEM.ALUOutput.  If the data dependency occurs between two instructions with another instruction in between, the value produced by the first must be forwarded from MEM/WB.ALUOutput to either ALUtop or ALUbottom.  As a result, the multiplexers for ALUtop and ALUbottom must be extended with additional inputs:

ALUtop:  A, NPC, EX/MEM.ALUOutput, MEM/WB.ALUOutput, MEM/WB.LMD
ALUbottom:  B, Imm, EX/MEM.ALUOutput, MEM/WB.ALUOutput, MEM/WB.LMD
 

e-REdING. Biblioteca de la Escuela Superior de Ingenieros de Sevilla.


:
: Ingeniería