There are three types of pipeline hazards:
1. Structural hazards. A structural hazard occurs when a single hardware component is needed by more than one instruction in the pipeline. For example, if there is a single ALU, which is used by two different pipeline stages, it will not be possible to have two instructions perform those two stages at the same time. In general, the problem here is insufficient parallelism in the data path. We need to add hardware components so that components are not shared between stages.
2. Data hazards. A data hazard occurs when an instruction depends on the result of a prior instruction, but that result is not available when needed because the prior instruction has not yet completed.
3. Control hazards. A control hazard occurs as a result of an
instruction which changes the program counter. The result is that some
instructions already in the pipeline may need to be invalidated, thus causing
lost cycles. Control hazards can be caused by branches or interrupts.
We will look at how to handle hazards. Some can be handled effectively without losing any cycles. Others will cause pipeline stalls.
When stall cycles are included, the speedup from pipelining becomes
speedup = (non-pipelined CPI) / (pipelined CPI)
= (pipeline depth) / (1 + pipeline stall cycles per instruction)
Structural hazards
Structural hazards can occur as a result of:
1. Lack of resource duplication where needed. The same component is needed by two pipeline stages. For example, if we have a single memory for data and code, and can only read from one location at a time, there will be a conflict between the IF and MEM stages of the DLX pipeline.
2. Functional unit not fully pipelined. If a functional unit (such as the floating point unit) is not pipelined, its operations may require more than one cycle to complete.
These hazards can be handled by introducing stall cycles. In the case of the IF, MEM conflict, we can introduce a stall cycle as follows:
If EX/MEM.IR contains a load or store instruction, then
Data hazards
The basic problem here is that the expected order of operations is altered by the pipeline.
For example, consider the code sequence
i1: add r1,r2,r3
i2: add r4,r1,r5
i3:
sub r6,r7,r1
The result of i1 is used by i2. i2 fetches the contents of r1 and r5 from the register file during stage ID, while i1 is in stage EX. But i1 will not write its result back to the register file until stage WB. So when i2 fetches r1, it will get the old value of r1.
Consider the timeline:
1
2
3
4
5
6
7
8 9
i1 IF ID
EX MEM WB
i2
IF ID
EX MEM WB
i3
IF ID
EX MEM WB
i4
IF
ID
EX MEM WB
i5
IF
ID
EX MEM WB
There is no conflict between i1 and i5. i1's result is written to the register file during cycle 5, and i5 does not fetch it until cycle 6. There appears to be a conflict between i1 and i4, because i4 needs to fetch its register operand during cycle 6. This conflict may be avoided by requiring that the register writeback be completed early during the WB stage, and allowing the register fetch in the ID stage to be performed late in the cycle, possibly by using a two-phase clock.
It is clear that there is a conflict between i1 and i2, and between i1 and i3. This type of hazard can be handled by a technique called forwarding or bypassing. It is based on the observation that, although the value computed by i1 is not written back to the register file until cycle 5, it is actually computed during cycle 3. During cycle 4, the value is available in EX/MEM.ALUOutput (where it could be used by the EX stage of i2). During cycle 5, it is available in MEM/WB.ALUOutput (where it could be used by the EX stage of i3)
Another example:
i1: add r1,r2,r3
i2: lw
r4,0(r1)
i3: sw 12(r1),r4
The value of r1 computed by i1 is needed by i2 and i3 for address calculation. It can be forwarded from EX/MEM.ALUOutput for use by i2 and from MEM/WB.ALUOutput for use by i3.
We'll cover implementation details a little later.
These are examples of RAW hazards (read after write). (r1 is read by an instruction which comes after an instruction which writes r1.) The hazard occurs if the reader needs the data before the writer has written it; the reader may get an out-of-date version of the data.
There are two other types of data hazard: WAW (write after write) and WAR (write after read).
Here is an example of a WAW hazard (on a hypothetical machine that has an add instruction with one register operand and one memory operand):
i1: add r3,4(r4)
i2: mov r3,r5
The hazard could occur if we allow faster instructions to proceed through the pipeline ahead of slower ones. The problem in the code above could occur if i2 completes before i1 does. Then the value written to r3 by i2 could be overwritten by the value computed by i1; in effect, it will appear that the instructions were executed in reverse order.
This sort of hazard cannot occur in DLX.
WAR hazards are rare. Here is an example:
i1: mov 20(r3),30(r4) // this is a
memory-to-memory move, requiring computation of two effective addresses.
i2: add
r3,1
// this instruction increments r3
It is possible that in some implementations, the effective address of the
first operand might not be completed until r3 has been incremented by i2, giving
an incorrect value for the effective address.
Some hazards cannot be avoided by forwarding. Consider the following example:
i1: lw r1,0(r2)
i2: sub
r4,r1,r5
i3: and r6,r1,r7
i4:
or r8,r1,r9
The value loaded from memory by i1 is fetched during the MEM stage (cycle 4) and stored in MEM/WB.LMD. It is available starting in cycle 5. i4 fetches the value of r1 during its ID stage, which is cycle 5, so there is no hazard. i3 needs the value in its EX stage (cycle 5), so the value must be forwarded from MEM/WB.LMD to one of the ALU inputs. But i2 needs to use the value during its EX stage, which is cycle 4. The value has not been fetched from memory yet, so there is no way to use forwarding to obtain the value.
So, we need a stall cycle inserted at the EX phase during cycle 4. i2, i3, and i4 will all be delayed by one clock cycle. This is known as a pipeline interlock.
One way to avoid this type of hazard is to have the compiler try to avoid using the result of a load immediately after the load, by inserting some other useful code there. This is called a "load delay." That is, the compiler tries to avoid the hazard by rearranging instructions. In order to do this, the compiler must identify dependencies between instructions within "basic blocks."
Another approach is to expose the delay at the ISA level. With a
"delayed load", it is part of the ISA specification that the instruction
immediately following a load cannot use the loaded value. It is then the
responsibility of the compiler writer or assembly programmer to either follow
the load by an independent instruction (possibly by rearranging instructions),
or else simply to insert a nop after the load. The advantage here is that
no stall cycles are required.
Implementation of hazard detection, forwarding, and interlocks
The following example illustrates detection of a RAW hazard which requires a stall cycle.
lw r1,0(r3)
add r4,r1,r5
After two cycles, the lw instruction is in ID/EX.IR and the add instruction is in IF/ID.IR. The hazard can be detected during the third cycle. It occurs if
ID/EX.IR is a load, and
IF/ID.IR is a reg-reg ALU operation, and
(ID/EX.IR11-15 (the load destination) = IF/ID.IR6-10
(one operand of the ALU operation), or
ID/EX.IR11-15 =
IF/ID.IR11-15 (the other operand of the ALU operation))
or
ID/EX.IR is a load, and
IF/ID.IR is a reg-imm ALU, branch, load, or store
operation, and
ID/EX.IR11-15 = IF/ID.IR11-15
This is a boolean function which can be implemented with combinational logic.
What should be done if this hazard is detected? Insert a stall cycle by doing:
IF/ID.IR <-- IF/ID.IR (keep same instruction)
PC
<-- PC (don't increment program counter)
ID/EX.IR <--
no-op instruction
Note: even with the stall cycle, this sequence requires a forwarding
operation. The value fetched by the lw must be forwarded from MEM/WB.LMD
to one of the ALU inputs during the add instruction's EX cycle (which is also
the lw instruction's WB cycle).
As an example of the detection and implementation of a hazard which can be avoided by forwarding, suppose we have:
add r1,r2,r3
followed by one of:
add r4,r1,r5
add r4,r5,r1
addi
r4,r1,immediate
lw/sw r4,offset(r1)
In each of these cases, the ALUOutput from the first add must be forwarded to one of the ALU inputs for the following instruction. The detection condition is in part
EX/MEM.IR is a reg-reg or reg-imm ALU operation,
ID/EX.IR is a reg-reg
ALU instruction, and
(EX/MEM.IR16-20 = ID/EX.IR6-10,
or
EX/MEM.IR16-20 = ID/EX.IR11-15)
If this condition holds, then either ALUtop or ALUbottom must receive input from EX/MEM.ALUOutput. If the data dependency occurs between two instructions with another instruction in between, the value produced by the first must be forwarded from MEM/WB.ALUOutput to either ALUtop or ALUbottom. As a result, the multiplexers for ALUtop and ALUbottom must be extended with additional inputs:
ALUtop: A, NPC, EX/MEM.ALUOutput, MEM/WB.ALUOutput,
MEM/WB.LMD
ALUbottom: B, Imm, EX/MEM.ALUOutput,
MEM/WB.ALUOutput, MEM/WB.LMD