

  • Slide count: 44

Overview of Computer Architecture, Guest Lecture, ECE 153a, Fall 2001. Modified From Prof. Patterson's Slides at UCB.

Computer Architecture Is … "the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." (Amdahl, Blaauw, and Brooks, 1964)

What are “Machine Structures”?
• Coordination of many levels of abstraction:
  – Application (e.g., Netscape)
  – Software: Operating System (e.g., Windows 98), Compiler, Assembler
  – Hardware: Processor, Memory, I/O system
  – Instruction Set Architecture
  – Datapath & Control
  – Digital Design
  – Circuit Design (transistors)

Levels of Representation
• High Level Language Program (e.g., C):
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
• Compiler
• Assembly Language Program (e.g., MIPS):
    lw  $t0, 0($2)
    lw  $t1, 4($2)
    sw  $t1, 0($2)
    sw  $t0, 4($2)
• Assembler
• Machine Language Program (MIPS): the binary encodings of the instructions above
• Machine Interpretation
• Control Signal Specification

Anatomy: 5 components of any Computer
• Processor (active): Control ("brain") and Datapath ("brawn")
• Memory (passive): where programs and data live when running
• Input devices: keyboard, mouse, disk
• Output devices: display, printer, disk
(The disk is where programs and data live when not running.)

A "Typical" RISC
• 32-bit fixed format instructions (3 formats)
• 32 32-bit GPRs (R0 contains zero; DP take a pair)
• 3-address, reg-reg arithmetic instructions
• Single address mode for load/store: base + displacement
  – no indirection
• Simple branch conditions
• Delayed branch
see: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

Example: MIPS (DLX) instruction formats (bit positions 31 down to 0):
• Register-Register: Op (31–26) | Rs1 (25–21) | Rs2 (20–16) | Rd (15–11) | Opx (10–0)
• Register-Immediate: Op (31–26) | Rs1 (25–21) | Rd (20–16) | immediate (15–0)
• Branch: Op (31–26) | Rs1 (25–21) | Rs2/Opx (20–16) | immediate (15–0)
• Jump / Call: Op (31–26) | target (25–0)

Three Key Subjects
• Pipeline: how can data computation be done faster?
  » Pipelining, hazards, scheduling, prediction, superscalar, etc.
• Cache: how can data access be done faster?
  » L1, L2, L3 caches, pipelined requests, parallel processing, etc.
• Virtual Memory: how can data communication be done faster?
  » Virtual memory, paging, bus control and protocol, etc.

Pipelining: It's Natural!
• Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• "Folder" takes 20 minutes

Sequential Laundry
(Gantt chart, 6 PM to midnight: loads A, B, C, and D each run wash 30, dry 40, and fold 20 back to back.)
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

Pipelined Laundry: start work ASAP
(Gantt chart, 6 PM onward: each load starts washing as soon as the washer is free, so washes, dries, and folds overlap.)
• Pipelined laundry takes 3.5 hours for 4 loads
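The 3.5-hour figure is easy to check with a small simulation. The sketch below is our own illustration (not from the slides): each load flows through washer (30 min), dryer (40 min), and folder (20 min), waiting whenever a stage is still busy with the previous load.

```python
def sequential_time(loads, stage_times):
    # No overlap: every load finishes all stages before the next starts.
    return loads * sum(stage_times)

def pipelined_time(loads, stage_times):
    # finish[i] = time at which stage i finishes its most recent load
    finish = [0] * len(stage_times)
    for _ in range(loads):
        t = 0
        for i, s in enumerate(stage_times):
            t = max(t, finish[i]) + s  # wait for the stage to free up, then run
            finish[i] = t
    return finish[-1]

stages = [30, 40, 20]  # washer, dryer, folder (minutes)
print(sequential_time(4, stages))  # 360 minutes = 6 hours
print(pipelined_time(4, stages))   # 210 minutes = 3.5 hours
```

Note that the dryer, the slowest stage, sets the pipeline rate: the middle of the schedule advances in 40-minute steps.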

Pipelining Lessons
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce the speedup
• Time to "fill" the pipeline and time to "drain" it reduce the speedup

5 Steps of DLX Datapath
Instruction Fetch -> Instr. Decode / Reg. Fetch -> Execute / Addr. Calc -> Memory Access -> Write Back
(The diagram shows pipeline registers such as the IR and the LMD between stages.)

Pipelined DLX Datapath
Instruction Fetch -> Instr. Decode / Reg. Fetch -> Execute / Addr. Calc. -> Memory Access -> Write Back
• Data stationary control
  – local decode for each instruction phase / pipeline stage

Why Pipeline? Because the resources are there!
(Pipeline diagram over time in clock cycles: instructions Inst 0 through Inst 4 each pass through Im (instruction memory), Reg, ALU, Dm (data memory), and Reg, each starting one cycle after the previous one, so every resource is in use every cycle.)

It's Not That Easy for Computers
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  – Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
  – Data hazards: instruction depends on the result of a prior instruction still in the pipeline (missing sock)
  – Control hazards: branches & other instructions that change the PC stall the pipeline until the hazard is resolved, leaving "bubbles" in the pipeline

Control Hazard
• Stall: wait until the decision is clear
  – It is possible to move the decision up to the 2nd stage by adding hardware to check the registers as they are being read
• Impact: 2 clock cycles per branch instruction => slow
(Pipeline diagram: an Add followed by a Beq and a Load; the instructions after the branch wait in the pipeline until the branch resolves.)

Data Hazard on r1
• Dependencies backwards in time are hazards
(Pipeline diagram across IF, ID/RF, EX, MEM, WB for the sequence:)
    add r1, r2, r3
    sub r4, r1, r3
    and r6, r1, r7
    or  r8, r1, r9
    xor r10, r1, r11
(sub, and, or, and xor each need r1 before add has written it back.)

Pipeline Hazards Again
• Structural hazard: two instructions need the same resource in the same cycle (e.g., an operand fetch colliding with an instruction fetch)
• Control hazard: after a jump, the next instruction fetch must wait until the target is known
• RAW (read after write) data hazard
• WAW (write after write) data hazard
• WAR (write after read) data hazard
(The diagram shows overlapping IF / DCD / OF / EX / Mem / WB stage sequences illustrating each case.)

General Solutions
• Forwarding
  – Forward data to the requesting unit(s) as soon as they are available
  – Don't wait until they are written into the register file
• Stalling
  – Wait until things become clear
  – Sacrifice performance for correctness
• Guessing
  – While waiting, why not make a guess and just do it?
  – Helps to improve performance, but with no guarantee

Issuing Multiple Instructions/Cycle
• Superscalar DLX: 2 instructions, 1 FP & 1 anything else
  – Fetch 64 bits/clock cycle; Int on left, FP on right
  – Can only issue the 2nd instruction if the 1st instruction issues
  – More ports for the FP registers to do an FP load & an FP op in a pair

Type              Pipe stages
Int. instruction  IF ID EX MEM WB
FP instruction    IF ID EX MEM WB
Int. instruction     IF ID EX MEM WB
FP instruction       IF ID EX MEM WB
Int. instruction        IF ID EX MEM WB
FP instruction          IF ID EX MEM WB

Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:              Fast code:
    LW  Rb, b               LW  Rb, b
    LW  Rc, c               LW  Rc, c
    ADD Ra, Rb, Rc          LW  Re, e
    SW  a, Ra               ADD Ra, Rb, Rc
    LW  Re, e               LW  Rf, f
    LW  Rf, f               SW  a, Ra
    SUB Rd, Re, Rf          SUB Rd, Re, Rf
    SW  d, Rd               SW  d, Rd

Recap: Who Cares About the Memory Hierarchy?
(Chart, 1980–2000, performance on a log scale:)
• µProc: 60%/yr ("Moore's Law", 2x / 1.5 yr)
• DRAM: 9%/yr (2x / 10 yrs)
• Processor-Memory performance gap (latency): grows 50% / year

Levels of the Memory Hierarchy (upper levels are smaller and faster; lower levels are larger and cheaper)

Level          Capacity     Access time   Cost                      Staging/transfer unit (managed by)
Registers      100s bytes   <10s ns                                 1–8 bytes (prog./compiler)
Cache          K bytes      10–100 ns     1–0.1 cents/bit           8–128 byte blocks (cache cntl)
Main memory    M bytes      200–500 ns    $.0001–.00001 cents/bit   512 B–4 KB pages (OS)
Disk           G bytes      10 ms         10^-5–10^-6 cents/bit     Mbyte files (user/operator)
Tape           infinite     sec–min       10^-8 cents/bit

The Principle of Locality
• The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
• Two different types of locality:
  – Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  – Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
• For the last 15 years, HW has relied on locality for speed

Memory Hierarchy: Terminology
• Hit: the data appears in some block in the upper level (example: Block X)
  – Hit Rate: the fraction of memory accesses found in the upper level
  – Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
• Miss: the data needs to be retrieved from a block in the lower level (Block Y)
  – Miss Rate = 1 - (Hit Rate)
  – Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on the 21264!)

Cache Measures
• Hit rate: fraction found in that level
  – Usually so high that we talk about the miss rate instead
  – Miss rate fallacy: miss rate is to average memory access time as MIPS is to CPU performance
• Average memory access time = Hit time + Miss rate x Miss penalty (ns or clocks)
• Miss penalty: time to replace a block from the lower level, including
  – access time: time to lower level = f(latency to lower level)
  – transfer time: time to transfer the block = f(BW between upper & lower levels)
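The average-memory-access-time formula above is easy to evaluate numerically. A small sketch (the numbers are invented for illustration):

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average memory access time, in the same units as its inputs.
    return hit_time + miss_rate * miss_penalty

# e.g., a 1-cycle hit, 5% miss rate, 100-cycle miss penalty:
print(amat(1, 0.05, 100))  # 6.0 cycles on average
```

Even a 5% miss rate multiplies the average access time by six here, which is why quoting the miss rate alone (the "miss rate fallacy" above) can mislead.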

1 KB Direct Mapped Cache, 32 B blocks
• For a 2^N byte cache:
  – The uppermost (32 - N) bits are always the Cache Tag
  – The lowest M bits are the Byte Select (Block Size = 2^M)
  – The bits in between are the Cache Index
• For this 1 KB cache with 32 B blocks: bits 31–10 are the Cache Tag (e.g., 0x50, stored as part of the cache "state" alongside a Valid Bit), bits 9–5 are the Cache Index (e.g., 0x01), and bits 4–0 are the Byte Select (e.g., 0x00)
(The diagram shows the valid bit, tag, and 32 data bytes, Byte 0 through Byte 31, for each of the 32 cache lines, Byte 1023 being the last byte of the last line.)
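The tag/index/byte-select split is just bit arithmetic. This is our own sketch of the 1 KB, 32-byte-block example above (32-bit addresses assumed):

```python
def split_address(addr, cache_bytes=1024, block_bytes=32):
    m = block_bytes.bit_length() - 1                         # M: byte-select bits
    n = cache_bytes.bit_length() - 1                         # N: log2(cache size)
    byte_select = addr & (block_bytes - 1)                   # lowest M bits
    index = (addr >> m) & (cache_bytes // block_bytes - 1)   # next N - M bits
    tag = addr >> n                                          # uppermost 32 - N bits
    return tag, index, byte_select

# Tag 0x50, index 0x01, byte select 0x00 packed into one address:
addr = (0x50 << 10) | (0x01 << 5) | 0x00
print([hex(x) for x in split_address(addr)])  # ['0x50', '0x1', '0x0']
```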

Two-way Set Associative Cache
• N-way set associative: N entries for each Cache Index
  – N direct mapped caches operate in parallel (N typically 2 to 4)
• Example: two-way set associative cache
  – Cache Index selects a "set" from the cache
  – The two tags in the set are compared in parallel
  – Data is selected based on the tag comparison result
(Diagram: valid bits, tags, and data for the two ways; two comparators feed an OR gate for Hit, and a mux selects the Cache Block.)
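A minimal software model of the N-way lookup described above (our own sketch; eviction here is FIFO for brevity, where a real cache would typically use LRU or random):

```python
class SetAssociativeCache:
    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]  # each set holds up to `ways` tags

    def access(self, block_number):
        index = block_number % self.num_sets   # Cache Index selects a set
        tag = block_number // self.num_sets
        tags = self.sets[index]
        if tag in tags:          # hardware compares all tags in the set in parallel
            return True          # hit
        if len(tags) == self.ways:
            tags.pop(0)          # miss with a full set: evict the oldest way (FIFO)
        tags.append(tag)
        return False             # miss

cache = SetAssociativeCache(num_sets=2, ways=2)
print([cache.access(b) for b in (0, 0, 2, 4, 0)])
# [False, True, False, False, False]: block 4 evicted block 0 from set 0
```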

4 Questions for Memory Hierarchy
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)

Q1: Where can a block be placed in the upper level?
• Block 12 placed in an 8-block cache:
  – Fully associative: anywhere
  – Direct mapped: only into block 4 (12 mod 8)
  – 2-way set associative: anywhere in set 0 (12 mod 4)
  – S.A. mapping = Block Number modulo Number of Sets
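The three placement policies differ only in how many cache blocks a memory block may occupy. A small sketch (our own illustration) listing the candidate cache blocks under each policy:

```python
def placement_options(block_number, cache_blocks, ways):
    # ways = 1 -> direct mapped; ways = cache_blocks -> fully associative
    num_sets = cache_blocks // ways
    s = block_number % num_sets          # S.A. mapping = block number mod number of sets
    return [s * ways + w for w in range(ways)]

print(placement_options(12, 8, 1))  # direct mapped: [4]  (12 mod 8 = 4)
print(placement_options(12, 8, 2))  # 2-way set associative: [0, 1]  (set 12 mod 4 = 0)
print(placement_options(12, 8, 8))  # fully associative: all of [0..7]
```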

Q2: How is a block found if it is in the upper level?
• Tag on each block
  – No need to check the index or block offset
• Increasing associativity shrinks the index and expands the tag

Q3: Which block should be replaced on a miss?
• Easy for direct mapped
• Set associative or fully associative:
  – Random
  – LRU (Least Recently Used)

Miss rates by associativity, size, and policy:

          2-way           4-way           8-way
Size      LRU    Random   LRU    Random   LRU    Random
16 KB     5.2%   5.7%     4.7%   5.3%     4.4%   5.0%
64 KB     1.9%   2.0%     1.5%   1.7%     1.4%   1.5%
256 KB    1.15%  1.17%    1.13%  1.12%
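LRU can be sketched in a few lines with an ordered dictionary (our own illustration, for a fully associative cache):

```python
from collections import OrderedDict

def lru_misses(accesses, capacity):
    # Fully associative cache with LRU replacement; returns the miss count.
    cache = OrderedDict()            # keys ordered least -> most recently used
    misses = 0
    for block in accesses:
        if block in cache:
            cache.move_to_end(block)         # hit: mark most recently used
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)    # evict the least recently used
            cache[block] = None
    return misses

print(lru_misses([1, 2, 3, 1, 4], capacity=3))  # 4: the access to 4 evicts block 2
```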

Q4: What happens on a write?
• Write through: the information is written both to the block in the cache and to the block in the lower-level memory.
• Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
  – is the block clean or dirty?
• Pros and cons of each?
  – WT: read misses cannot result in writes
  – WB: no repeated writes to the same location
• WT is always combined with write buffers so that the CPU doesn't wait for lower-level memory
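The write-traffic difference between the two policies can be shown with a toy count (our own sketch; it assumes every written block stays resident until a single final replacement, with no intervening evictions):

```python
def lower_level_writes(writes, policy):
    # Count writes that reach lower-level memory for a stream of block writes.
    if policy == "write-through":
        return len(writes)        # every store is sent down immediately
    assert policy == "write-back"
    dirty = set(writes)           # each written block is merely marked dirty
    return len(dirty)             # and written down once, when replaced

stream = ["A", "A", "A", "B"]
print(lower_level_writes(stream, "write-through"))  # 4
print(lower_level_writes(stream, "write-back"))     # 2
```

Repeated writes to the same location (the "WB" advantage above) cost one memory write under write back but one per store under write through.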

Quicksort vs. Radix Sort as the number of keys varies: instructions & time
(Charts plot time and instructions executed against set size in keys, for radix sort and quicksort.)

Quicksort vs. Radix Sort as the number of keys varies: cache misses
(Chart plots cache misses against set size in keys, for radix sort and quicksort.)
• What is the proper approach to fast algorithms?

A Modern Memory Hierarchy
• By taking advantage of the principle of locality:
  – Present the user with as much memory as is available in the cheapest technology.
  – Provide access at the speed offered by the fastest technology.

Level                                            Speed                 Size (bytes)
Registers (in the datapath)                      1s ns                 100s
On-chip cache                                    10s ns                Ks
Second-level cache (SRAM) / main memory (DRAM)   100s ns               Ms
Secondary storage (disk)                         10s ms                Gs
Tertiary storage (disk/tape)                     10s sec               Ts

Basic Issues in VM System Design
• Size of the information blocks that are transferred from secondary to main storage (M)
• When a block of information is brought into M and M is full, some region of M must be released to make room for the new block -> replacement policy
• Which region of M is to hold the new block -> placement policy
• A missing item is fetched from secondary memory only on the occurrence of a fault -> demand load policy
• Paging organization: the virtual and physical address spaces are partitioned into blocks of equal size, called pages (virtual) and page frames (physical)
(Diagram: reg - cache - mem - disk, with pages moving between memory frames and disk.)

Address Map
V = {0, 1, ..., n - 1}  virtual address space
M = {0, 1, ..., m - 1}  physical address space,  n > m

MAP: V -> M ∪ {0}  address mapping function
  MAP(a) = a'  if the data at virtual address a is present at physical address a' in M
  MAP(a) = 0   if the data at virtual address a is not present in M (a missing-item fault)
(Diagram: the processor presents virtual address a to the address translation mechanism, which either returns physical address a' to main memory or raises a fault; the fault handler in the OS performs the transfer from secondary memory.)

Paging Organization
• The page (here 1 K) is the unit of mapping and also the unit of transfer from virtual to physical memory
• Example: virtual addresses 0, 1024, ..., 31744 begin pages 0, 1, ..., 31; physical addresses 0, 1024, ..., 7168 begin frames 0, 1, ..., 7
• Address mapping: the VA splits into a page number and a 10-bit displacement; the page number (together with the Page Table Base Reg) indexes into the page table, whose entries hold a valid bit (V), access rights, and the physical frame address; the page table itself is located in physical memory
• The physical memory address is formed from the frame address and the displacement (drawn with a +, but actually concatenation is more likely)
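The mapping on this slide, where the page number indexes the page table and the displacement is appended to the frame address, can be sketched as follows (our own illustration; the dictionary page table and its field names are invented):

```python
def translate(va, page_table, page_size=1024):
    page_number, disp = divmod(va, page_size)   # split VA: page number + displacement
    entry = page_table.get(page_number)
    if entry is None or not entry["valid"]:
        raise LookupError("page fault")         # the OS would fetch the page from disk
    # frame * page_size + disp is exactly concatenation when page_size is a power of 2
    return entry["frame"] * page_size + disp

page_table = {0: {"valid": True, "frame": 7}}
print(translate(100, page_table))  # 7268 = frame 7 * 1024 + displacement 100
```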

Virtual Address and a Cache
(Diagram: the CPU issues a VA, translation produces a PA, the PA accesses the cache, and a miss goes to main memory.)
• It takes an extra memory access to translate a VA to a PA
• This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible
• ASIDE: Why access the cache with the PA at all? VA caches have a problem!
  – Synonym / alias problem: two different virtual addresses map to the same physical address => two different cache entries holding data for the same physical address!
  – On an update, all cache entries with the same physical address must be updated, or memory becomes inconsistent
  – Determining this requires significant hardware: essentially an associative lookup on the physical address tags to see if you have multiple hits
  – Alternatively, a software-enforced alias boundary: aliases must agree in the low-order bits of VA & PA up to the cache size

TLBs
• A way to speed up translation is to use a special cache of recently used page table entries; this has many names, but the most frequently used is Translation Lookaside Buffer, or TLB
• A TLB entry: Virtual Address | Physical Address | Dirty | Ref | Valid | Access
• Really just a cache on the page table mappings
• TLB access time is comparable to cache access time (much less than main memory access time)

Translation Look-Aside Buffers
• Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped
• TLBs are usually small, typically no more than 128–256 entries even on high-end machines. This permits a fully associative lookup on those machines. Most mid-range machines use small n-way set associative organizations.
(Diagram, translation with a TLB: the CPU presents a VA to the TLB; on a hit the PA goes straight to the cache, and on a miss translation walks the page table first. Representative times: TLB lookup 1/2 t, cache access t, main memory 20 t.)
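A TLB in front of the page table can be modeled as a tiny dictionary cache (our own sketch; a real TLB is a small associative hardware memory, not software, and the fill path here ignores capacity limits and replacement):

```python
def tlb_translate(va, tlb, page_table, page_size=4096):
    page_number, disp = divmod(va, page_size)
    frame = tlb.get(page_number)
    hit = frame is not None
    if not hit:
        frame = page_table[page_number]   # slow path: walk the page table
        tlb[page_number] = frame          # fill the TLB for next time
    return frame * page_size + disp, hit

tlb, page_table = {}, {0: 3}
print(tlb_translate(100, tlb, page_table))  # (12388, False): first access misses
print(tlb_translate(200, tlb, page_table))  # (12488, True): now the TLB hits
```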

Overlapped Cache & TLB Access
• The cache index and byte select come entirely from the 12-bit displacement (here a 1 K-set cache with 4-byte blocks: 10 index bits + 2 offset bits), so the cache lookup can start while the TLB does its associative lookup on the 20-bit page number in parallel; the tag comparison then uses the PA from the TLB
• Hit/miss logic:
    IF cache hit AND (cache tag = PA) THEN deliver data to the CPU
    ELSE IF [cache miss OR (cache tag != PA)] AND TLB hit THEN access memory with the PA from the TLB
    ELSE do standard VA translation