

### **Simple Instruction Pipelining**

Krste Asanovic
Laboratory for Computer Science
M.I.T.

http://www.csg.lcs.mit.edu/6.823



Krste Asanovic February 21, 2001 6.823. L5--2

### **Processor Performance Equation**

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$

- Instructions per program depends on source code, compiler technology, and ISA
- Microcoded DLX from last lecture had a *minimum* cycles per instruction (CPI) of around 7
- Time per cycle for microcoded DLX is fixed by the microcode cycle time
  - $\rightarrow$ mostly ROM access + next $\mu$PC select logic
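The performance equation above can be sketched in a few lines of Python. The instruction count and cycle time below are hypothetical values chosen only for illustration, and holding the cycle time equal across the two designs is a simplifying assumption, not a claim from the lecture.

```python
def execution_time(instructions, cpi, cycle_time_ns):
    """Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle."""
    return instructions * cpi * cycle_time_ns

# Hypothetical workload: 1 million instructions at a 10 ns cycle (assumed values).
microcoded = execution_time(1_000_000, cpi=7, cycle_time_ns=10.0)
single_cycle = execution_time(1_000_000, cpi=1, cycle_time_ns=10.0)

# With everything else held equal, the ratio of run times is the ratio of CPIs.
print(microcoded / single_cycle)
```

In practice a CPI=1 design would likely need a longer cycle than the microcoded one, so the real ratio depends on all three terms of the equation.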





## **Pipelined DLX**

### To pipeline DLX:

- First build unpipelined DLX with CPI=1
- Next, add pipeline registers to reduce cycle time while maintaining CPI=1



### **A Simple Memory Model**




Reads and writes are always completed in one cycle

- a Read can be done any time (i.e. combinational)
- a Write is performed at the rising clock edge if it is enabled
  - ⇒ the write address and data must be stable at the clock edge
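A minimal behavioural sketch of this memory model, assuming a word-addressed array; the class name `SimpleMemory` and its attribute names are hypothetical and chosen only for illustration.

```python
class SimpleMemory:
    """Toy model: combinational read, write on the rising clock edge."""

    def __init__(self, size_words):
        self.words = [0] * size_words
        # Write-port inputs; they must be stable when rising_edge() is called.
        self.write_enable = False
        self.write_addr = 0
        self.write_data = 0

    def read(self, addr):
        # Combinational read: reflects the current contents at any time.
        return self.words[addr]

    def rising_edge(self):
        # The write takes effect only at the clock edge, and only if enabled.
        if self.write_enable:
            self.words[self.write_addr] = self.write_data


mem = SimpleMemory(4)
mem.write_enable, mem.write_addr, mem.write_data = True, 2, 42
before = mem.read(2)   # still the old value: the edge has not happened yet
mem.rising_edge()
after = mem.read(2)    # the new value is visible after the edge
```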





### **Datapath for Memory Instructions**

Should program and data memory be separate?

Harvard style: separate (Aiken and Mark 1 influence)

- read-only program memory
- read/write data memory

At some level the two memories have to be the same.

Princeton style: the same (von Neumann's influence)

- a Load or Store instruction requires accessing the memory more than once during its execution















## **Single-Cycle Hardwired Control**

We will assume

- clock period is sufficiently long for all of the following steps to be "completed":
  - 1. instruction fetch
  - 2. decode and register fetch
  - 3. ALU operation
  - 4. data fetch if required
  - 5. register write-back setup time

$$\Rightarrow \quad t_{\text{C}} > t_{\text{IFetch}} + t_{\text{RFetch}} + t_{\text{ALU}} + t_{\text{DMem}} + t_{\text{RWB}}$$

 At the rising edge of the following clock, the PC, the register file and the memory are updated
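As a worked instance of this bound, take the illustrative delays used later in this lecture for the staging discussion, and assume (the slides do not state this mapping explicitly) that $t_{\text{IFetch}}$ and $t_{\text{DMem}}$ are the 10-unit memory delays, $t_{\text{ALU}}$ is 5 units, and $t_{\text{RFetch}}$ and $t_{\text{RWB}}$ are the 1-unit register-file delays:

$$t_{\text{C}} > 10 + 1 + 5 + 10 + 1 = 27 \text{ units}$$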








### **Hardwired Control Table: Harvard DLX**

| Opcode                  | ExtSel             | BSrc | OpSel | MemWrite | RegWrite | WBSrc | RegDst | PCSrc |
|-------------------------|--------------------|------|-------|----------|----------|-------|--------|-------|
| ALU                     | *                  | Reg  | Func  | no       | yes      | ALU   | rf3    | ~j    |
| ALUu                    | *                  | Reg  | Func  | no       | yes      | ALU   | rf3    | ~j    |
| ALUi                    | sExt<sub>16</sub>  | Imm  | Op    | no       | yes      | ALU   | rf2    | ~j    |
| ALUui                   | uExt<sub>16</sub>  | Imm  | Op    | no       | yes      | ALU   | rf2    | ~j    |
| LW                      | sExt<sub>16</sub>  | Imm  | +     | no       | yes      | Mem   | rf2    | ~j    |
| SW                      | sExt<sub>16</sub>  | Imm  | +     | yes      | no       | *     | *      | ~j    |
| BEQZ<sub>zero?=1</sub>  | sExt<sub>16</sub>  | *    | 0?    | no       | no       | *     | *      | PCR   |
| BEQZ<sub>zero?=0</sub>  | sExt<sub>16</sub>  | *    | 0?    | no       | no       | *     | *      | ~j    |
| J                       | sExt<sub>26</sub>  | *    | *     | no       | no       | *     | *      | PCR   |
| JAL                     | sExt<sub>26</sub>  | *    | *     | no       | yes      | PC    | R31    | PCR   |
| JR                      | *                  | *    | *     | no       | no       | *     | *      | RInd  |
| JALR                    | *                  | *    | *     | no       | yes      | PC    | R31    | RInd  |

BSrc = Reg / Imm; WBSrc = ALU / Mem / PC; RegDst = rf2 / rf3 / R31

PCSrc1 = j / ~j; PCSrc2 = PCR / RInd
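One way to read the control table is as a pure lookup from opcode class (plus the `zero?` condition for BEQZ) to control signals. The sketch below encodes the table rows as a Python dictionary; the names `Ctrl`, `TABLE`, and `control` are hypothetical, and the string encodings of the signal values are an illustrative choice rather than anything specified in the lecture.

```python
from collections import namedtuple

Ctrl = namedtuple(
    "Ctrl", "ext_sel b_src op_sel mem_write reg_write wb_src reg_dst pc_src"
)
X = "*"  # don't care

# One entry per row of the control table (BEQZ is handled separately below).
TABLE = {
    "ALU":   Ctrl(X,        "Reg", "Func", False, True,  "ALU", "rf3", "~j"),
    "ALUu":  Ctrl(X,        "Reg", "Func", False, True,  "ALU", "rf3", "~j"),
    "ALUi":  Ctrl("sExt16", "Imm", "Op",   False, True,  "ALU", "rf2", "~j"),
    "ALUui": Ctrl("uExt16", "Imm", "Op",   False, True,  "ALU", "rf2", "~j"),
    "LW":    Ctrl("sExt16", "Imm", "+",    False, True,  "Mem", "rf2", "~j"),
    "SW":    Ctrl("sExt16", "Imm", "+",    True,  False, X,     X,     "~j"),
    "J":     Ctrl("sExt26", X,     X,      False, False, X,     X,     "PCR"),
    "JAL":   Ctrl("sExt26", X,     X,      False, True,  "PC",  "R31", "PCR"),
    "JR":    Ctrl(X,        X,     X,      False, False, X,     X,     "RInd"),
    "JALR":  Ctrl(X,        X,     X,      False, True,  "PC",  "R31", "RInd"),
}


def control(opcode, zero=False):
    """Return the control signals for an opcode class.

    `zero` is the ALU's zero? output; it only matters for BEQZ.
    """
    if opcode == "BEQZ":
        pc_src = "PCR" if zero else "~j"
        return Ctrl("sExt16", X, "0?", False, False, X, X, pc_src)
    return TABLE[opcode]


print(control("LW"))
print(control("BEQZ", zero=True).pc_src)
```

BEQZ is the only row whose outputs depend on a datapath condition, which is why it is computed rather than stored.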





### **How to Divide the Datapath into Stages**

Suppose memory is significantly slower than other stages. In particular, suppose

$$t_{IM} = t_{DM} = 10 \text{ units}, \qquad t_{ALU} = 5 \text{ units}, \qquad t_{RF} = t_{RW} = 1 \text{ unit}$$

Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance



The write-back stage takes much less time than the other stages. Suppose we combined it with the memory phase. The combined stage would then take $t_{DM} + t_{RW} = 10 + 1 = 11$ units instead of 10

⇒ increases the critical path by 10%



### **Maximum Speedup by Pipelining**

For the 4-stage pipeline, given $t_{IM} = t_{DM} = 10$ units, $t_{ALU} = 5$ units, and $t_{RF} = t_{RW} = 1$ unit, $t_C$ could be reduced from 27 units to 10 units

$\Rightarrow$ speedup = 2.7

However, if $t_{IM} = t_{DM} = t_{ALU} = t_{RF} = t_{RW} = 5$ units, the same 4-stage pipeline can reduce $t_C$ from 25 units to 10 units

$\Rightarrow$ speedup = 2.5

But, since $t_{IM} = t_{DM} = t_{ALU} = t_{RF} = t_{RW}$, it is possible to achieve a higher speedup with more stages in the pipeline. A 5-stage pipeline can reduce $t_C$ from 25 units to 5 units

$\Rightarrow$ speedup = 5
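The three speedups above can be reproduced with a short sketch. It assumes the 4-stage pipeline merges register fetch with the ALU stage, which is the only grouping of adjacent steps that yields the 10-unit cycle quoted in both cases; it also ignores pipeline-register overhead and assumes CPI stays at 1. The function and variable names are hypothetical.

```python
def cycle_time(stages, delays):
    """The clock is set by the slowest stage (sum of the steps grouped in it)."""
    return max(sum(delays[step] for step in stage) for stage in stages)


def speedup(stages, delays):
    """Unpipelined cycle (sum of all steps) over the pipelined cycle."""
    return sum(delays.values()) / cycle_time(stages, delays)


# Assumed stage groupings (see the lead-in above).
FOUR_STAGE = [["IM"], ["RF", "ALU"], ["DM"], ["RW"]]
FIVE_STAGE = [["IM"], ["RF"], ["ALU"], ["DM"], ["RW"]]

slow_memory = {"IM": 10, "RF": 1, "ALU": 5, "DM": 10, "RW": 1}
balanced = {"IM": 5, "RF": 5, "ALU": 5, "DM": 5, "RW": 5}

print(speedup(FOUR_STAGE, slow_memory))  # 27 / 10
print(speedup(FOUR_STAGE, balanced))     # 25 / 10
print(speedup(FIVE_STAGE, balanced))     # 25 / 5
```

The same helper also shows why balance matters: with `slow_memory`, the 5-stage split still has a 10-unit cycle, so the extra stage buys nothing.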



### **Technology Assumptions**

#### We will assume

- A small amount of very fast memory (caches) backed up by a large, slower memory
- Fast ALU (at least for integers)
- Multiported Register files (slower!).

These technology assumptions make the following timing assumption valid:

$$t_{\text{IM}} \approx t_{\text{RF}} \approx t_{\text{ALU}} \approx t_{\text{DM}} \approx t_{\text{RW}}$$

A 5-stage pipelined Harvard-style architecture will be the focus of our detailed design

















### **How Instructions Can Interact with Each Other in a Pipeline**

- An instruction in the pipeline may need a resource being used by another instruction in the pipeline ⇒ *structural hazard*
- An instruction may produce data that is needed by a later instruction ⇒ *data hazard*
- In the extreme case, an instruction may determine the next instruction to be executed ⇒ *control hazard* (branches, interrupts, ...)
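To make the data-hazard case concrete, the sketch below checks whether a later instruction reads a register that an earlier one writes. The `Instr` record and the register names are hypothetical, and the check ignores details such as DLX's hardwired-zero register `r0`.

```python
from collections import namedtuple

# dest is the register written (or None); srcs are the registers read.
Instr = namedtuple("Instr", ["op", "dest", "srcs"])


def has_data_hazard(earlier, later):
    """True if `later` reads a register that `earlier` produces."""
    return earlier.dest is not None and earlier.dest in later.srcs


add = Instr("ADD", "r1", ("r2", "r3"))   # r1 <- r2 + r3
sub = Instr("SUB", "r4", ("r1", "r5"))   # r4 <- r1 - r5, needs the new r1
sw = Instr("SW", None, ("r6", "r7"))     # a store writes no register

print(has_data_hazard(add, sub))  # sub depends on add's result
print(has_data_hazard(sw, add))   # a store produces no register value
```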



### Feedback to Resolve Hazards




Controlling pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur)

Feedback to previous stages is used to stall or kill instructions
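A rough sketch of the stall case, under the assumption that a stall raised at some stage freezes that stage and every earlier one while later stages keep draining. The list-based pipeline model and the name `advance` are hypothetical; killing an instruction (replacing it with a bubble) is not modelled here.

```python
BUBBLE = None  # an empty pipeline slot (no-op)


def advance(stages, next_instr, stall_at=None):
    """Advance the pipeline by one cycle.

    stages[0] is the first stage; the last element retires this cycle.
    If stall_at is an index, stages 0..stall_at hold their instructions,
    a bubble enters stage stall_at + 1, and later stages move on as usual.
    Returns (new_stages, accepted), where accepted says whether
    next_instr entered the pipeline.
    """
    n = len(stages)
    if stall_at is None:
        return [next_instr] + stages[:n - 1], True
    held = stages[:stall_at + 1]
    drained = stages[stall_at + 1:n - 1]
    return (held + [BUBBLE] + drained)[:n], False


pipe = ["i5", "i4", "i3", "i2", "i1"]          # i1 is the oldest instruction
pipe, accepted = advance(pipe, "i6", stall_at=1)
print(pipe, accepted)
```

The instructions downstream of the stalled stage advance without depending on anything behind them, which is the condition the text gives for this scheme to avoid deadlock.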