







- Must separate executing an instruction from allowing it to finish, or "commit"
- This additional step is called instruction commit
- When an instruction is no longer speculative, allow it to update the register file or memory
- Requires an additional set of buffers to hold results of instructions that have finished execution but have not committed
- This reorder buffer (ROB) is also used to pass results among instructions that may be speculated



### **ROB Details: Fields of Reorder Buffer Entry**

Each entry in the ROB contains four fields:

1. Instruction type
   - a branch (has no destination result), a store (has a memory address destination), or a register operation (ALU operation or load, which has register destinations)
2. Destination
   - Register number (for loads and ALU operations) or memory address (for stores) where the instruction result should be written
3. Value
   - Value of instruction result until the instruction commits
4. Ready
   - Indicates that instruction has completed execution, and the value is ready
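The four ROB fields can be pictured as a small record. The following is only a sketch: the names `InstrType`, `ROBEntry`, and `write_result` are illustrative choices, not taken from the slides or from any real simulator.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class InstrType(Enum):
    BRANCH = "branch"        # has no destination result
    STORE = "store"          # destination is a memory address
    REGISTER_OP = "register" # ALU operation or load; destination is a register


@dataclass
class ROBEntry:
    instr_type: InstrType
    # Register name or memory address; None for a branch (hypothetical encoding).
    destination: Optional[Union[str, int]]
    value: Optional[float] = None  # held here until the instruction commits
    ready: bool = False            # True once execution has completed

    def write_result(self, value: float) -> None:
        """Record the result; the entry still waits in the ROB for commit."""
        self.value = value
        self.ready = True


entry = ROBEntry(InstrType.REGISTER_OP, "F4")
entry.write_result(3.5)
```

After `write_result`, the entry holds the value and is marked ready, but the architectural register `F4` is untouched until commit.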



### **4 Steps of Speculative Tomasulo Algorithm**

#### 1. Issue: get instruction from FP Op Queue

If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination (this stage is sometimes called "dispatch").

#### 2. Execution: operate on operands (EX)

When both operands are ready, execute; if not, watch the CDB for the result. Executing only once both operands are in the reservation station enforces RAW ordering (this stage is sometimes called "issue").

#### 3. Write result: finish execution (WB)

Write the result on the Common Data Bus to all waiting FUs and to the reorder buffer; mark the reservation station available.

#### 4. Commit: update register with reorder buffer result

When an instruction reaches the head of the reorder buffer and its result is present, **update the register with the result** (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch at the head of the ROB flushes the ROB (this stage is sometimes called "**graduation**").
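The commit step can be sketched as a queue that retires only from its head. This is a simplified model under stated assumptions: reservation stations and the CDB are omitted, entries are plain dictionaries, and the names `ReorderBuffer`, `issue`, `write_result`, and `commit` are hypothetical.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()  # head of the ROB is entries[0]

    def issue(self, kind, dest=None):
        """Allocate a ROB slot in program order; None means a structural stall."""
        if len(self.entries) >= self.size:
            return None
        entry = {"kind": kind, "dest": dest, "value": None,
                 "ready": False, "mispredicted": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        """Execution finished: the result is buffered, not yet architectural."""
        entry["value"] = value
        entry["mispredicted"] = mispredicted
        entry["ready"] = True

    def commit(self, regs, mem):
        """Retire the head entry if its result is present; otherwise do nothing."""
        if not self.entries or not self.entries[0]["ready"]:
            return None
        head = self.entries.popleft()
        if head["kind"] == "branch":
            if head["mispredicted"]:
                self.entries.clear()  # flush all younger, speculative entries
        elif head["kind"] == "store":
            mem[head["dest"]] = head["value"]
        else:
            regs[head["dest"]] = head["value"]
        return head


rob = ReorderBuffer(size=4)
regs, mem = {}, {}
add1 = rob.issue("register", "F4")
br = rob.issue("branch")
add2 = rob.issue("register", "F6")       # issued speculatively after the branch
rob.write_result(add2, 9.0)              # finishes out of order
first = rob.commit(regs, mem)            # head (add1) not ready: nothing retires
rob.write_result(add1, 1.5)
rob.write_result(br, mispredicted=True)
rob.commit(regs, mem)                    # add1 updates F4
rob.commit(regs, mem)                    # mispredicted branch flushes add2
```

In this run `F6` is never written, even though `add2` finished first, because its result stayed in the ROB until the branch ahead of it was resolved.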









































### **Superscalar execution illustrated**

- Higher instruction fetch bandwidth
- INT instructions, loads and stores occupy one slot
- FP instructions occupy the second slot
- Need a pipelined FP datapath; otherwise the FP datapath becomes the bottleneck

| Instruction type | Pipe stages |    |    |     |     |    |
|------------------|-------------|----|----|-----|-----|----|
| INT instruction  | IF          | ID | EX | MEM | WB  |    |
| FP instruction   | IF          | ID | EX | MEM | WB  |    |
| INT instruction  |             | IF | ID | EX  | MEM | WB |
| FP instruction   |             | IF | ID | EX  | MEM | WB |

# Example: Loop unrolling and instruction scheduling

`for (i=1; i<=1000; i=i+1) x[i] = x[i] + s;`

Assume latencies:

| Producer    | Consumer          | Latency |
|-------------|-------------------|---------|
| FP ALU op   | Another FP ALU op | 3       |
| FP ALU op   | store double      | 2       |
| Load double | FP ALU op         | 1       |
| Load double | store double      | 0       |

# Loop unrolling and instruction scheduling

#### Example in MIPS assembly

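| Instruction          | Clock cycle issued |
|----------------------|--------------------|
| `Loop: LD F0, 0(R1)` | 1                  |
| `ADDD F4, F0, F2`    | 3                  |
| `SD F4, 0(R1)`       | 6                  |
| `SUBI R1, R1, 8`     | 7                  |
| `BNEZ R1, Loop`      | 8                  |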

9 cycles per loop iteration due to stalls
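The issue cycles in the listing follow from the latency table: a consumer issues no earlier than its producer's issue cycle plus the latency plus one. The sketch below reproduces them under two assumptions that are not stated in the slides: producer/consumer pairs missing from the table (such as `SUBI` feeding `BNEZ`) have latency 0, and the ninth cycle of the iteration is a one-cycle stall after the branch. The function name `issue_cycles` and the instruction encoding are illustrative.

```python
# Stall cycles required between a producer and a dependent consumer.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
}

# (text, kind, destination register, source registers)
LOOP = [
    ("LD F0, 0(R1)", "load", "F0", ["R1"]),
    ("ADDD F4, F0, F2", "fp_alu", "F4", ["F0", "F2"]),
    ("SD F4, 0(R1)", "store", None, ["F4", "R1"]),
    ("SUBI R1, R1, 8", "int_alu", "R1", ["R1"]),
    ("BNEZ R1, Loop", "branch", None, ["R1"]),
]


def issue_cycles(program, latency):
    """In-order, single-issue: return the cycle in which each instruction issues."""
    producer = {}  # register -> (kind, issue cycle) of its most recent writer
    cycles = []
    prev = 0
    for _text, kind, dest, srcs in program:
        earliest = prev + 1  # at most one instruction issues per cycle
        for reg in srcs:
            if reg in producer:
                pkind, pcycle = producer[reg]
                stall = latency.get((pkind, kind), 0)  # assumed 0 if unlisted
                earliest = max(earliest, pcycle + stall + 1)
        cycles.append(earliest)
        if dest is not None:
            producer[dest] = (kind, earliest)
        prev = earliest
    return cycles


cycles = issue_cycles(LOOP, LATENCY)
BRANCH_STALL = 1  # assumption: one stall cycle after the branch
cycles_per_iteration = cycles[-1] + BRANCH_STALL
```

With these inputs the computed issue cycles are 1, 3, 6, 7, 8, matching the listing, and the assumed branch stall brings the iteration to 9 cycles.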

















# Limits to ILP

- How much ILP is available using existing mechanisms with increasing HW budgets?
- Do we need to invent new HW/SW mechanisms to keep on the processor performance curve?

### **Limits to ILP: Quantitative Analysis**

Assumptions for ideal/perfect machine (all constraints on ILP are removed):

1. Register renaming: infinite virtual registers, so all register WAW & WAR hazards are avoided
2. Branch prediction: perfect; no mispredictions
3. Jump prediction: all jumps perfectly predicted (returns, case statements)
   - 2 & 3 ⇒ no control dependencies; perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis: addresses known & a load can be moved before a store provided that the addresses are not equal
   - 1 & 4 eliminate all but RAW

Also: perfect caches; 1-cycle latency for all instructions (including FP *, /); unlimited instructions issued per clock cycle.
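Under these ideal assumptions only true (RAW) dependences constrain execution, so the achievable ILP of a trace is its instruction count divided by the length of its longest dependence chain. The sketch below illustrates that calculation for register dependences only; the function name `ideal_ilp`, the trace format, and the example trace are hypothetical.

```python
def ideal_ilp(program):
    """program: list of (dest, srcs) in program order.

    Each instruction executes one cycle after its latest RAW producer
    (1-cycle latency, unlimited issue width, infinite window).
    Returns (cycles, instructions per cycle).
    """
    ready_at = {}  # register -> cycle in which its current value is produced
    depth = 0
    for dest, srcs in program:
        cycle = 1 + max((ready_at.get(r, 0) for r in srcs), default=0)
        if dest is not None:
            # Overwriting models renaming: a later write to the same register
            # never delays earlier readers (no WAR) or earlier writers (no WAW).
            ready_at[dest] = cycle
        depth = max(depth, cycle)
    return depth, len(program) / depth


TRACE = [
    ("r1", ["r0"]),
    ("r2", ["r1"]),        # RAW on r1
    ("r3", ["r0"]),
    ("r1", ["r3"]),        # WAW with the first write, WAR with the second read
    ("r4", ["r1", "r2"]),
]
cycles, ilp = ideal_ilp(TRACE)
```

For this trace the longest RAW chain has three instructions, so five instructions complete in three cycles. The name-dependence on `r1` costs nothing, which is the effect of assumption 1.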


|                               | Model    | Power 5                            |
|-------------------------------|----------|------------------------------------|
| Instructions Issued per clock | Infinite | 4                                  |
| Instruction Window<br>Size    | Infinite | 200                                |
| Renaming<br>Registers         | Infinite | 48 integer +<br>40 Fl. Pt.         |
| Branch Prediction             | Perfect  | 2% to 6%<br>misprediction<br>(Tournament Branch Predictor) |
| Cache                         | Perfect  | 64KI, 32KD, 1.92MB<br>L2, 36 MB L3 |
| Memory Alias<br>Analysis      | Perfect  | ??                                 |



# Limits to ILP: HW Model comparison

|                                     | New Model                     | Model    | Power 5                            |
|-------------------------------------|-------------------------------|----------|------------------------------------|
| Instructions<br>Issued per<br>clock | Infinite                      | Infinite | 4                                  |
| Instruction<br>Window Size          | Infinite, 2K, 512,<br>128, 32 | Infinite | 200                                |
| Renaming<br>Registers               | Infinite                      | Infinite | 48 integer +<br>40 Fl. Pt.         |
| Branch<br>Prediction                | Perfect                       | Perfect  | 2% to 6%<br>misprediction<br>(Tournament Branch Predictor) |
| Cache                               | Perfect                       | Perfect  | 64KI, 32KD, 1.92MB<br>L2, 36 MB L3 |
| Memory<br>Alias                     | Perfect                       | Perfect  | ??                                 |



|                                     | New Model                                                             | Model    | Power 5                                                       |
|-------------------------------------|-----------------------------------------------------------------------|----------|---------------------------------------------------------------|
| Instructions<br>Issued per<br>clock | 64                                                                    | Infinite | 4                                                             |
| Instruction<br>Window Size          | 2048                                                                  | Infinite | 200                                                           |
| Renaming<br>Registers               | Infinite                                                              | Infinite | 48 integer +<br>40 Fl. Pt.                                    |
| Branch<br>Prediction                | Perfect vs. 8K<br>Tournament vs.<br>512 2-bit vs.<br>profile vs. none | Perfect  | 2% to 6%<br>misprediction<br>(Tournament Branch<br>Predictor) |
| Cache                               | Perfect                                                               | Perfect  | 64KI, 32KD, 1.92MB<br>L2, 36 MB L3                            |
| Memory<br>Alias                     | Perfect                                                               | Perfect  | ??                                                            |





|                                     | New Model                             | Model    | Power 5                            |
|-------------------------------------|---------------------------------------|----------|------------------------------------|
| Instructions<br>Issued per<br>clock | 64                                    | Infinite | 4                                  |
| Instruction<br>Window Size          | 2048                                  | Infinite | 200                                |
| Renaming<br>Registers               | Infinite v. 256,<br>128, 64, 32, none | Infinite | 48 integer +<br>40 Fl. Pt.         |
| Branch<br>Prediction                | 8K 2-level                            | Perfect  | Tournament Branch<br>Predictor     |
| Cache                               | Perfect                               | Perfect  | 64KI, 32KD, 1.92MB<br>L2, 36 MB L3 |
| Memory<br>Alias                     | Perfect                               | Perfect  | ??                                 |



|                                     | New Model                                 | Model    | Power 5                            |
|-------------------------------------|-------------------------------------------|----------|------------------------------------|
| Instructions<br>Issued per<br>clock | 64                                        | Infinite | 4                                  |
| Instruction<br>Window Size          | 2048                                      | Infinite | 200                                |
| Renaming<br>Registers               | 256 Int + 256 FP                          | Infinite | 48 integer +<br>40 Fl. Pt.         |
| Branch<br>Prediction                | 8K 2-level                                | Perfect  | Tournament                         |
| Cache                               | Perfect                                   | Perfect  | 64KI, 32KD, 1.92MB<br>L2, 36 MB L3 |
| Memory<br>Alias                     | Perfect v. Stack<br>v. Inspect v.<br>none | Perfect  | ??                                 |



|                                     | New Model                        | Model    | Power 5                            |
|-------------------------------------|----------------------------------|----------|------------------------------------|
| Instructions<br>Issued per<br>clock | 64 (no<br>restrictions)          | Infinite | 4                                  |
| Instruction<br>Window Size          | Infinite vs. 256,<br>128, 64, 32 | Infinite | 200                                |
| Renaming<br>Registers               | 64 Int + 64 FP                   | Infinite | 48 integer +<br>40 Fl. Pt.         |
| Branch<br>Prediction                | 1K 2-level                       | Perfect  | Tournament                         |
| Cache                               | Perfect                          | Perfect  | 64KI, 32KD, 1.92MB<br>L2, 36 MB L3 |
| Memory<br>Alias                     | HW<br>disambiguation             | Perfect  | Perfect                            |







- Hardware Based Speculation
- Multiple Issue Processors
- Limits of ILP
- TLP - Multithreading









# **1. Fine-Grained Multithreading**

- Switches between threads on each cycle, causing the execution of multiple threads to be interleaved
- Usually done in a round-robin fashion, skipping any stalled threads
- CPU must be able to switch threads every clock
- Advantage: it can hide both short and long stalls, since instructions from other threads execute when one thread stalls
- Disadvantage: it slows down execution of individual threads, since a thread that is ready to execute without stalls will be delayed by instructions from other threads
- Used on Sun's T1 (Niagara, 2005) and T2 (Niagara 2, 2007)
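The round-robin policy with stall skipping can be sketched as follows. This is a simplified model, not the T1's actual thread-select logic: stall cycles are supplied up front as fixed sets rather than arising from cache misses, and the name `fine_grained_schedule` is illustrative.

```python
def fine_grained_schedule(stalled, num_cycles):
    """stalled[t] is the set of cycles in which thread t cannot issue.

    Returns, per cycle, the thread that issues (None if every thread is stalled).
    """
    n = len(stalled)
    schedule = []
    next_thread = 0
    for cycle in range(num_cycles):
        chosen = None
        # Scan round-robin from next_thread, skipping stalled threads.
        for offset in range(n):
            t = (next_thread + offset) % n
            if cycle not in stalled[t]:
                chosen = t
                break
        schedule.append(chosen)
        if chosen is not None:
            next_thread = (chosen + 1) % n
    return schedule


# Four threads; thread 1 is stalled during cycles 1-3 (hypothetical cache miss).
schedule = fine_grained_schedule([set(), {1, 2, 3}, set(), set()], 8)
```

Here the schedule is 0, 2, 3, 0, 1, 2, 3, 0: no issue slot is lost while thread 1 waits, and thread 1 rejoins the rotation once its stall ends.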

### **Fine-Grained Multithreading on the Sun T1**

- Circa 2005; first major processor to focus on TLP rather than ILP

| Characteristic                             | Sun T1                                                                                                                                                              |
|--------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Multiprocessor and multithreading support  | Eight cores per chip; four threads per core. Fine-grained thread scheduling. One shared floating-point unit for eight cores. Supports only on-chip multiprocessing. |
| Pipeline structure                         | Simple, in-order, six-deep pipeline with three-cycle delays for loads and branches.                                                                                 |
| L1 caches                                  | 16 KB instructions; 8 KB data. 64-byte block size. Miss to L2 is 23 cycles, assuming no contention.                                                                 |
| L2 caches                                  | Four separate L2 caches, each 750 KB and associated with a memory bank. 64-byte block size. Miss to main memory is 110 clock cycles, assuming no contention.        |
| Initial implementation                     | 90 nm process; maximum clock rate of 1.2 GHz; power 79 W; 300 M transistors; 379 mm<sup>2</sup> die.                                                                |

**CPI on Sun T1**

| Benchmark | Per-thread CPI | Per-core CPI |
|-----------|----------------|--------------|
| TPC-C     | 7.2            | 1.80         |
| SPECJBB   | 5.6            | 1.40         |
| SPECWeb99 | 6.6            | 1.65         |

- T1 has 8 cores, with 4 threads per core
- Ideal effective CPI per thread?
- Ideal per-core CPI?
- In 2005 when it was introduced, the Sun T1 processor had the best performance on integer applications with extensive TLP and demanding memory performance, such as SPECJBB and transaction processing workloads
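The two CPI columns are related by the thread count: with four threads interleaved on one core, per-core CPI is per-thread CPI divided by four. The sketch below checks this against the table and derives the ideal values, assuming each core issues at most one instruction per cycle (an assumption of this sketch, consistent with the simple in-order pipeline described above). The names are illustrative.

```python
THREADS_PER_CORE = 4
IDEAL_PER_CORE_CPI = 1.0  # assumption: a core issues at most one instruction per cycle

PER_THREAD_CPI = {"TPC-C": 7.2, "SPECJBB": 5.6, "SPECWeb99": 6.6}


def per_core_cpi(per_thread_cpi, threads=THREADS_PER_CORE):
    """Threads share the core's issue slots, so the core retires
    `threads` times as many instructions in the same number of cycles."""
    return per_thread_cpi / threads


# Ideal case: each thread gets one issue slot out of every four cycles.
ideal_per_thread_cpi = IDEAL_PER_CORE_CPI * THREADS_PER_CORE

measured = {name: per_core_cpi(cpi) for name, cpi in PER_THREAD_CPI.items()}
```

Under this assumption the ideal per-thread CPI is 4 and the ideal per-core CPI is 1, so the measured per-core values of 1.40 to 1.80 show how far each workload falls from keeping the pipeline full.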



#### **3. Simultaneous MultiThreading (SMT) for OoO superscalars**

- Techniques presented so far have all been "vertical" multithreading, where each pipeline stage works on one thread at a time
- SMT uses the fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on the same clock cycle. This gives better utilization of machine resources.


### **Simultaneous Multithreading (SMT)**

- Simultaneous multithreading (SMT): a variation of fine-grained multithreading implemented on top of a multiple-issue, dynamically scheduled processor
- Add multiple contexts and fetch engines and allow instructions fetched from different threads to issue simultaneously
- Utilize wide out-of-order superscalar processor issue queue to find instructions to issue from multiple threads
- OoO instruction window already has most of the circuitry required to schedule from multiple threads
- Any single thread can utilize whole machine
- Used in Core i7 (2008) and IBM Power 7 (2010)
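The utilization argument can be illustrated with a toy issue-slot count. This is a deliberately crude model under stated assumptions: the number of ready instructions per thread in each cycle is given as input and does not depend on what issued earlier, and the fine-grained baseline simply alternates threads each cycle. The function names and the `READY` data are hypothetical.

```python
def slots_fine_grained(ready, width):
    """One thread per cycle (round-robin); only its ready instructions can issue."""
    used = 0
    for cycle, per_thread in enumerate(ready):
        thread = cycle % len(per_thread)
        used += min(per_thread[thread], width)
    return used


def slots_smt(ready, width):
    """All threads compete for the same issue slots in every cycle."""
    return sum(min(sum(per_thread), width) for per_thread in ready)


WIDTH = 4
# READY[cycle][thread] = instructions ready to issue (hypothetical values).
READY = [[2, 1], [1, 3], [0, 2]]

fine = slots_fine_grained(READY, WIDTH)  # 2 + 3 + 0
smt = slots_smt(READY, WIDTH)            # 3 + 4 + 2
```

Out of 12 available slots, the single-thread-per-cycle baseline fills 5 and the SMT model fills 9, because slots that one thread leaves empty in a cycle can be taken by the other.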









# Example 2: ARM Cortex A8

- 13-stage pipeline, dual-issue, statically scheduled superscalar with dynamic issue detection, which allows the processor to issue one or two instructions per clock
- Dynamic branch predictor with a 512-entry two-way set associative branch target buffer and a 4K-entry global history buffer, indexed by branch history and current PC
  - An incorrect prediction results in a 13-cycle penalty as the pipeline is flushed
  - An eight-entry return stack is kept to track return addresses
- Has an ideal CPI of 0.5 due to its dual-issue structure
- Hazards:
  - Functional and data (compiler must handle; otherwise issue 1 instr/cycle)
  - Control (arise only when branches are mispredicted)
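The effect of the 13-cycle misprediction penalty on the ideal CPI of 0.5 can be estimated with the usual stall-cycle formula. In the sketch below, the branch fraction and misprediction rate are hypothetical inputs, not measurements from the slides, and stalls from functional and data hazards are ignored.

```python
IDEAL_CPI = 0.5           # dual issue
MISPREDICT_PENALTY = 13   # cycles, from the pipeline flush


def effective_cpi(ideal_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Ideal CPI plus average control-hazard stall cycles per instruction."""
    return ideal_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 15% branches, 5% of them mispredicted.
cpi = effective_cpi(IDEAL_CPI, 0.15, 0.05, MISPREDICT_PENALTY)
```

With these assumed rates the control hazards alone add about 0.1 cycles per instruction, roughly 20% on top of the ideal CPI.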







### **The Future?**

#### Moving away from ILP enhancements

|                          | Power4 | Power5 | Power6 | Power7 |
|--------------------------|--------|--------|--------|--------|
| Introduced               | 2001   | 2004   | 2007   | 2010   |
| Initial clock rate (GHz) | 1.3    | 1.9    | 4.7    | 3.6    |
| Transistor count (M)     | 174    | 276    | 790    | 1200   |
| Issues per clock         | 5      | 5      | 7      | 6      |
| Functional units         | 8      | 8      | 9      | 12     |
| Cores/chip               | 2      | 2      | 2      | 8      |
| SMT threads              | 0      | 2      | 2      | 4      |
| Total on-chip cache (MB) | 1.5    | 2      | 4.1    | 32.3   |

Figure 3.47 Characteristics of four IBM Power processors. All except the Power6, which is static and in-order, were dynamically scheduled, and all the processors support two load/store pipelines. The Power6 has the same functional units as the Power5 except for a decimal unit. Power7 uses DRAM for the L3 cache.