







# **Superscalar Execution**

- Higher instruction fetch bandwidth
- INT instructions, loads and stores occupy one slot
- FP instructions occupy second slot
  - Need pipelined FP datapath, otherwise FP datapath becomes bottleneck

Pipe stages by clock cycle:

| Instruction type | 1  | 2  | 3  | 4    | 5      | 6   | 7  |
|------------------|----|----|----|------|--------|-----|----|
| INT instruction  | IF | ID | EX | MEM  | WB     |     |    |
| FP instruction   | IF | ID | EX | MEM  | WB     |     |    |
| INT instruction  |    | IF | ID | EX   | MEM    | WB  |    |
| FP instruction   |    | IF | ID | EX   | MEM    | WB  |    |
| INT instruction  |    |    | IF | ID   | EX     | MEM | WB |
| FP instruction   |    |    | IF | ID   | EX     | MEM | WB |
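The pipeline diagram above can be reproduced with a small sketch. It assumes ideal conditions (one INT-slot and one FP-slot instruction issued together every cycle, no stalls or hazards); the names `issue_schedule`, `render`, and `total_cycles` are illustrative, not from the slides.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def issue_schedule(n_pairs):
    """Return (kind, start_cycle) per instruction for ideal dual issue.

    Assumption: each cycle issues one INT-slot and one FP-slot instruction,
    and nothing ever stalls.
    """
    schedule = []
    for pair in range(n_pairs):
        schedule.append(("INT", pair))  # INT, load, or store slot
        schedule.append(("FP", pair))   # FP slot, issued in the same cycle
    return schedule

def render(n_pairs):
    """Build the rows of the pipeline diagram: leading blanks, then the stages."""
    return [(kind, [""] * start + STAGES) for kind, start in issue_schedule(n_pairs)]

def total_cycles(n_instructions, issue_width):
    """Cycles to finish n instructions on an ideal pipeline of the given width."""
    groups = -(-n_instructions // issue_width)  # ceiling division
    return groups + len(STAGES) - 1             # last group still has to drain

if __name__ == "__main__":
    for kind, cells in render(3):
        print(f"{kind:3s} | " + " | ".join(f"{c:3s}" for c in cells))
    print("dual issue:", total_cycles(6, 2), "cycles;",
          "single issue:", total_cycles(6, 1), "cycles")
```

Under these assumptions the six instructions in the diagram occupy seven cycles, whereas a single-issue pipeline of the same depth would need more; real code rarely pairs up this cleanly.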






#### **The Future?**

#### Moving away from ILP enhancements

|                          | Power4 | Power5 | Power6 | Power7 |
|--------------------------|--------|--------|--------|--------|
| Introduced               | 2001   | 2004   | 2007   | 2010   |
| Initial clock rate (GHz) | 1.3    | 1.9    | 4.7    | 3.6    |
| Transistor count (M)     | 174    | 276    | 790    | 1200   |
| Issues per clock         | 5      | 5      | 7      | 6      |
| Functional units         | 8      | 8      | 9      | 12     |
| Cores/chip               | 2      | 2      | 2      | 8      |
| SMT threads              | 0      | 2      | 2      | 4      |
| Total on-chip cache (MB) | 1.5    | 2      | 4.1    | 32.3   |

Figure 3.47 Characteristics of four IBM Power processors. All except the Power6, which is statically scheduled and in-order, were dynamically scheduled, and all the processors support two load/store pipelines. The Power6 has the same functional units as the Power5 except for a decimal unit. Power7 uses DRAM for the L3 cache.

















## **Fine-Grained Multithreading on the Sun T1**

- Circa 2005; first major processor to focus on TLP rather than ILP

| Characteristic                                  | Sun T1                                                                                                                                                              |
|-------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Multiprocessor and<br>multithreading<br>support | Eight cores per chip; four threads per core. Fine-grained thread scheduling. One shared floating-point unit for eight cores. Supports only on-chip multiprocessing. |
| Pipeline structure                              | Simple, in-order, six-deep pipeline with three-cycle delays for loads and branches.                                                                                 |
| L1 caches                                       | 16 KB instructions; 8 KB data. 64-byte block size. Miss to L2 is 23 cycles, assuming no contention.                                                                 |
| L2 caches                                       | Four separate L2 caches, each 750 KB and associated with a memory bank. 64-byte block size. Miss to main memory is 110 clock cycles assuming no contention.         |
| Initial implementation                          | 90 nm process; maximum clock rate of 1.2 GHz; power 79 W; 300 M transistors; 379 mm<sup>2</sup> die.                                                                |

CPI on Sun T1:

| Benchmark | Per-thread CPI | Per-core CPI |
|-----------|----------------|--------------|
| TPC-C     | 7.2            | 1.80         |
| SPECJBB   | 5.6            | 1.40         |
| SPECWeb99 | 6.6            | 1.65         |
- T1 has 8 cores with 4 threads per core
- Ideal effective CPI per thread? 4 (each thread gets one issue slot every four cycles)
- Ideal per-core CPI? 1 (the core issues one instruction per cycle)
- In 2005 when it was introduced, the Sun T1 processor had the best performance on integer applications with extensive TLP and demanding memory performance, such as SPECJBB and transaction processing workloads
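The relation between the two CPI columns can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the four threads of a core interleave evenly, so per-core CPI is per-thread CPI divided by the thread count; `per_core_cpi` and `issue_utilization` are illustrative names, and the CPI values are copied from the table above.

```python
# Per-thread CPI values copied from the "CPI on Sun T1" table.
PER_THREAD_CPI = {"TPC-C": 7.2, "SPECJBB": 5.6, "SPECWeb99": 6.6}
THREADS_PER_CORE = 4       # from the slide
IDEAL_PER_CORE_CPI = 1.0   # from the slide: ideal per-core CPI is 1

def per_core_cpi(per_thread_cpi, threads=THREADS_PER_CORE):
    """Assumption: threads interleave evenly on one core, so the core
    completes `threads` instructions in the time one thread completes one."""
    return per_thread_cpi / threads

def issue_utilization(per_thread_cpi, threads=THREADS_PER_CORE):
    """Fraction of the ideal issue rate achieved (ideal CPI / measured CPI)."""
    return IDEAL_PER_CORE_CPI / per_core_cpi(per_thread_cpi, threads)

if __name__ == "__main__":
    for name, cpi in PER_THREAD_CPI.items():
        print(f"{name:10s} per-core CPI = {per_core_cpi(cpi):.2f}, "
              f"utilization = {issue_utilization(cpi):.0%}")
```

The gap between the measured per-core CPI and the ideal of 1 shows how many issue slots remain empty even with four threads available to hide stalls.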















## Outline

- Limits of Instruction Level Parallelism (ILP)
- TLP Hardware Multithreading
- Multiprocessors and Multicores (+ Networks-on-Chip)
- Graphics Processing Units (GPUs)
- Domain Specific Architectures


## **Back to Basics**

- "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."

**Parallel Architecture = Computer Architecture + Communication Architecture**

- Two classes of multiprocessors with respect to memory:
  - 1. Centralized-memory multiprocessor
    - Fewer than a few dozen processor chips (and fewer than 100 cores)
    - Small enough to share a single, centralized memory
  - 2. Physically distributed-memory multiprocessor
    - Larger number of chips and cores than class 1
    - Bandwidth demands ⇒ memory distributed among processors


















































| Type | More descriptive name | Closest old term outside of GPUs | Official CUDA/NVIDIA GPU term | Book definition |
|------|-----------------------|----------------------------------|-------------------------------|-----------------|
| Program Abstractions | Vectorizable Loop | Vectorizable Loop | Grid | A vectorizable loop, executed on the GPU, made up of one or more Thread Blocks (bodies of vectorized loop) that can execute in parallel. |
|  | Body of Vectorized Loop | Body of a (Strip-Mined) Vectorized Loop | Thread Block | A vectorized loop executed on a multithreaded SIMD Processor, made up of one or more threads of SIMD instructions. They can communicate via Local Memory. |
|  | Sequence of SIMD Lane Operations | One iteration of a Scalar Loop | CUDA Thread | A vertical cut of a thread of SIMD instructions corresponding to one element executed by one SIMD Lane. Result is stored depending on mask and predicate register. |
| Machine Object | A Thread of SIMD Instructions | Thread of Vector Instructions | Warp | A traditional thread, but it contains just SIMD instructions that are executed on a multithreaded SIMD Processor. Results stored depending on a per-element mask. |
|  | SIMD Instruction | Vector Instruction | PTX Instruction | A single SIMD instruction executed across SIMD Lanes. |
| Processing Hardware | Multithreaded SIMD Processor | (Multithreaded) Vector Processor | Streaming Multiprocessor | A multithreaded SIMD Processor executes threads of SIMD instructions, independent of other SIMD Processors. |
|  | Thread Block Scheduler | Scalar Processor | Giga Thread Engine | Assigns multiple Thread Blocks (bodies of vectorized loop) to multithreaded SIMD Processors. |
|  | SIMD Thread Scheduler | Thread scheduler in a Multithreaded CPU | Warp Scheduler | Hardware unit that schedules and issues threads of SIMD instructions when they are ready to execute; includes a scoreboard to track SIMD Thread execution. |
|  | SIMD Lane | Vector lane | Thread Processor | A SIMD Lane executes the operations in a thread of SIMD instructions on a single element. Results stored depending on mask. |
| Memory Hardware | GPU Memory | Main Memory | Global Memory | DRAM memory accessible by all multithreaded SIMD Processors in a GPU. |
|  | Local Memory | Local Memory | Shared Memory | Fast local SRAM for one multithreaded SIMD Processor, unavailable to other SIMD Processors. |
|  | SIMD Lane Registers | Vector Lane Registers | Thread Processor Registers | Registers in a single SIMD Lane allocated across a full thread block (body of vectorized loop). |
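The program abstractions in the table (Grid, Thread Block, CUDA Thread) can be illustrated with a serial emulation of the index arithmetic. This is a plain-Python sketch, not CUDA and not a real GPU API: `launch` and `daxpy_body` are hypothetical names, and the loops run sequentially where a GPU would run Thread Blocks and SIMD Lanes in parallel.

```python
def launch(n, block_dim, body):
    """Emulate a 1-D Grid over n elements.

    Grid         = the whole vectorizable loop
    Thread Block = one strip of block_dim iterations (body of the vectorized loop)
    CUDA Thread  = one iteration of the original scalar loop
    """
    grid_dim = -(-n // block_dim)                # ceiling division: blocks needed
    for block_idx in range(grid_dim):            # Thread Blocks are independent of each other
        for thread_idx in range(block_dim):      # CUDA Threads within one block
            i = block_idx * block_dim + thread_idx
            if i < n:                            # mask off threads past the end of the data
                body(i)
    return grid_dim

# Example scalar loop body: y[i] = a * x[i] + y[i] (DAXPY)
a = 2.0
x = [float(i) for i in range(10)]
y = [1.0] * 10

def daxpy_body(i):
    y[i] = a * x[i] + y[i]

grid_dim = launch(len(x), 4, daxpy_body)
```

With 10 elements and a block size of 4, the last Thread Block has two threads that fall outside the data; the bounds check plays the role of the mask mentioned in the definitions above.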







