
Adapted from "http://wwwinst.EECS.Berkeley.EDU:80/~cs152/fa97/index_lectures.html" and "http://www.cs.berkeley.edu/~pattrsn/252S98/index.html", Copyright 1998 UCB.
Chapter 5: Memory-Hierarchy Design
Soonchunhyang University, School of Computer Science, Sang-Jung Lee

Review
• Speculation: out-of-order execution, in-order commit (reorder buffer + rename registers) => precise exceptions
• Branch Prediction
 – Branch History Table: 2 bits for loop accuracy
 – Recently executed branches correlated with next branch?
 – Branch Target Buffer: includes branch address & prediction
 – Predicated execution can reduce the number of branches and the number of mispredicted branches
• Software Pipelining
 – Symbolic loop unrolling (instructions from different iterations) to optimize the pipeline with little code expansion, little overhead
• Superscalar and VLIW ("EPIC"): CPI < 1 (IPC > 1)
 – Dynamic issue vs. static issue
 – More instructions issue at the same time => larger hazard penalty
 – # independent instructions = # functional units x latency

Review: Theoretical Limits to ILP? (Figure 4.48, Page 332)
[Chart: IPC vs. window size (Infinite, 256, 128, 64, 32, 16, 8, 4), assuming perfect disambiguation (HW), 1K selective prediction, 16-entry return stack, 64 registers, issue as many as the window allows. FP: 8-45, Integer: 6-12]

Review: Instruction-Level Parallelism
• High-speed execution based on instruction-level parallelism (ILP): the potential of short instruction sequences to execute in parallel
• High-speed microprocessors exploit ILP by:
 1) Pipelined execution: overlap instructions
 2) Superscalar execution: issue and execute multiple instructions per clock cycle
 3) Out-of-order execution (commit in-order)
• Memory accesses for a high-speed microprocessor?
 – Data cache, possibly multiported, multiple levels

Introduction: Processor-DRAM Memory Gap (latency)
[Chart: performance (log scale) vs. time, 1980-2000. CPU ("Moore's Law"): 60%/yr (2X / 1.5 yr). DRAM: 9%/yr (2X / 10 yrs). The processor-memory performance gap grows 50% / year.]

Levels of the Memory Hierarchy
• CPU Registers: 100s of bytes, <10s ns; staging/transfer unit: 1-8 bytes (managed by prog./compiler)
• Cache: K bytes, 10-100 ns, 1-0.1 cents/bit; transfer unit: blocks of 8-128 bytes (cache controller)
• Main Memory: M bytes, 200-500 ns, $.0001-.00001 cents/bit; transfer unit: pages of 512-4K bytes (OS)
• Disk: G bytes, 10 ms (10,000,000 ns), 10^-5 - 10^-6 cents/bit; transfer unit: files, Mbytes (user/operator)
• Tape: infinite capacity, sec-min access, 10^-8 cents/bit
Moving from the upper level to the lower level: capacity grows larger, access gets slower, cost per bit drops; the upper level is the faster one.

The Principle of Locality
• The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
• Two different types of locality:
 – Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
 – Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
• For the last 15 years, HW has relied on locality for speed

Memory Hierarchy: Terminology
• Hit: the data appears in some block in the upper level (example: Block X)
 – Hit Rate: the fraction of memory accesses found in the upper level
 – Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
• Miss: the data must be retrieved from a block in the lower level (Block Y)
 – Miss Rate = 1 - (Hit Rate)
 – Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on the 21264!)
[Diagram: the processor exchanges Blk X with the upper level and Blk Y with the lower level]

ABCs of Caches
• Hit rate: fraction found in that level
 – So high that we usually talk about the miss rate
 – Miss rate fallacy: miss rate is to average memory access time what MIPS is to CPU performance (a misleading proxy)
• Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)
• Miss penalty: time to replace a block from the lower level, including time to replace it in the CPU
 – Access time: time to the lower level = f(latency to lower level)
 – Transfer time: time to transfer the block = f(BW between upper & lower levels)

Simplest Cache: Direct Mapped
[Diagram: 16-location memory (addresses 0-F) mapped onto a 4-byte direct-mapped cache with indexes 0-3]
• Location 0 can be occupied by data from:
 – Memory location 0, 4, 8, ... etc.
 – In general: any memory location whose 2 LSBs of the address are 0s
 – Address<1:0> => cache index
• Which one should we place in the cache?
• How can we tell which one is in the cache?

1 KB Direct Mapped Cache, 32 B blocks
• For a 2^N byte cache:
 – The uppermost (32 - N) bits are always the Cache Tag
 – The lowest M bits are the Byte Select (Block Size = 2^M)
• Example: Cache Tag 0x50 (stored as part of the cache "state", with a Valid Bit), Cache Index 0x01, Byte Select 0x00
[Diagram: 32 cache entries, each with a valid bit, a cache tag, and 32 bytes of cache data (Byte 0 ... Byte 1023 across the whole cache)]
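As a minimal sketch of the address decomposition the slide describes (field widths follow the slide's 1 KB direct-mapped cache with 32 B blocks; the structure and function names are invented for illustration):

#include <stdint.h>
#include <stdbool.h>

/* 1 KB direct-mapped cache with 32-byte blocks:
   32 entries, 5 index bits, 5 byte-select bits. */
#define BLOCK_BITS  5                      /* 32 B per block           */
#define INDEX_BITS  5                      /* 1 KB / 32 B = 32 entries */
#define NUM_SETS    (1u << INDEX_BITS)

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[1 << BLOCK_BITS];
} cache_line_t;

static cache_line_t cache[NUM_SETS];

/* Returns true on a hit after splitting the address into
   byte select, index and tag as on the slide. */
bool cache_lookup(uint32_t addr)
{
    uint32_t byte_sel = addr & ((1u << BLOCK_BITS) - 1);
    uint32_t index    = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag      = addr >> (BLOCK_BITS + INDEX_BITS);
    (void)byte_sel;                        /* would select the byte within the block */
    return cache[index].valid && cache[index].tag == tag;
}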

Two-way Set Associative Cache
• N-way set associative: N entries for each Cache Index
 – N direct-mapped caches operate in parallel (N typically 2 to 4)
• Example: two-way set associative cache
 – The Cache Index selects a "set" from the cache
 – The two tags in the set are compared in parallel
 – Data is selected based on the tag comparison result
[Diagram: the index selects one set; both ways' tags are compared with the address tag, the compare outputs are ORed into Hit, and a mux selects the matching way's cache block]
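A small software model of that lookup (parameters here - 2 ways, 64 sets, 32-byte blocks - are assumptions for illustration, not taken from the slide's figure); hardware checks all ways of the set in parallel, while the sketch simply loops over them:

#include <stdint.h>
#include <stdbool.h>

#define WAYS        2
#define SET_BITS    6
#define BLOCK_BITS  5
#define NUM_SETS    (1u << SET_BITS)

typedef struct { bool valid; uint32_t tag; } way_t;
static way_t sets[NUM_SETS][WAYS];

bool sa_lookup(uint32_t addr)
{
    uint32_t index = (addr >> BLOCK_BITS) & (NUM_SETS - 1);
    uint32_t tag   = addr >> (BLOCK_BITS + SET_BITS);
    for (int w = 0; w < WAYS; w++)
        if (sets[index][w].valid && sets[index][w].tag == tag)
            return true;                 /* the matching way's data would be muxed out */
    return false;
}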

Disadvantage of Set Associative Cache
• N-way set associative cache vs. direct mapped cache:
 – N comparators vs. 1
 – Extra MUX delay for the data
 – Data comes AFTER Hit/Miss
• In a direct-mapped cache, the cache block is available BEFORE Hit/Miss:
 – Possible to assume a hit and continue; recover later if it was a miss.
[Diagram: same two-way set associative structure as the previous slide]

4 Questions for Memory Hierarchy
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)

Q1: Where can a block be placed in the upper level?
• Block 12 placed in an 8-block cache:
 – Fully associative: anywhere; direct mapped: only block 4 (12 mod 8); 2-way set associative: anywhere in set 0 (12 mod 4)
 – Set-associative mapping = block number modulo number of sets
[Diagram: block 12 of memory mapped into the cache under each placement policy]

Q2: How is a block found if it is in the upper level?
• Tag on each block
 – No need to check the index or block offset
• Increasing associativity shrinks the index and expands the tag

Q3: Which block should be replaced on a miss?
• Easy for direct mapped
• Set associative or fully associative:
 – Random
 – LRU (Least Recently Used)
Data cache miss rates:
Size    | 2-way LRU | 2-way Random | 4-way LRU | 4-way Random | 8-way LRU | 8-way Random
16 KB   | 5.2%      | 5.7%         | 4.7%      | 5.3%         | 4.4%      | 5.0%
64 KB   | 1.9%      | 2.0%         | 1.5%      | 1.7%         | 1.4%      | 1.5%
256 KB  | 1.15%     | 1.17%        | 1.13%     | 1.13%        | 1.12%     | 1.12%

Q4: What happens on a write?
• Write through - the information is written to both the block in the cache and to the block in the lower-level memory.
• Write back - the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
 – Is the block clean or dirty?
• Pros and cons of each?
 – WT: read misses cannot result in writes
 – WB: no repeated writes to the same location
• WT is always combined with write buffers so that the CPU doesn't wait for the lower-level memory
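A rough illustration of the two policies (not from the slides; names, sizes and the lower_memory stand-in are invented for this sketch): they differ only in what a store hit does and in what eviction must do.

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

typedef struct { bool valid, dirty; uint32_t tag; uint8_t data[32]; } line_t;
static uint8_t lower_memory[1 << 20];      /* stand-in for the lower level */

/* Write-through: update the line and the lower level on every store. */
void store_write_through(line_t *line, uint32_t addr, uint8_t value)
{
    line->data[addr & 31] = value;
    lower_memory[addr]    = value;         /* typically goes through a write buffer */
}

/* Write-back: update only the line and mark it dirty; the lower level
   is updated later, when the dirty line is evicted. */
void store_write_back(line_t *line, uint32_t addr, uint8_t value)
{
    line->data[addr & 31] = value;
    line->dirty = true;
}

void evict(line_t *line, uint32_t block_addr)
{
    if (line->valid && line->dirty)
        memcpy(&lower_memory[block_addr], line->data, sizeof line->data);
    line->valid = line->dirty = false;
}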

Write Buffer for Write Through
[Diagram: Processor -> Cache, with a Write Buffer between the Cache and DRAM]
• A write buffer is needed between the cache and memory
 – Processor: writes data into the cache and the write buffer
 – Memory controller: writes the contents of the buffer to memory
• The write buffer is just a FIFO:
 – Typical number of entries: 4
 – Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
• Memory system designer's nightmare:
 – Store frequency (w.r.t. time) -> 1 / DRAM write cycle
 – Write buffer saturation

Cache Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time
Memory stall clock cycles = (Reads x Read miss rate x Read miss penalty) + (Writes x Write miss rate x Write miss penalty)
Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty

Cache Performance
CPU time = Instruction Count x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time
Misses per instruction = Memory accesses per instruction x Miss rate
CPU time = IC x (CPI_execution + Misses per instruction x Miss penalty) x Clock cycle time
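A worked example of these formulas with assumed numbers (not from the slides): CPI_execution = 1.0, 1.5 memory accesses per instruction, 2% miss rate, 50-cycle miss penalty, 1 ns clock.

#include <stdio.h>

int main(void)
{
    double ic            = 1e9;    /* instruction count */
    double cpi_exec      = 1.0;
    double mem_per_instr = 1.5;
    double miss_rate     = 0.02;
    double miss_penalty  = 50.0;   /* clock cycles      */
    double cycle_time_ns = 1.0;

    double misses_per_instr = mem_per_instr * miss_rate;          /* 0.03        */
    double cpi = cpi_exec + misses_per_instr * miss_penalty;      /* 1.0 + 1.5   */
    double cpu_time_s = ic * cpi * cycle_time_ns * 1e-9;

    /* With these numbers, memory stalls add 1.5 CPI on top of the base 1.0. */
    printf("CPI = %.2f, CPU time = %.3f s\n", cpi, cpu_time_s);
    return 0;
}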

Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Reducing Misses
• Classifying misses: the 3 Cs
 – Compulsory: the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache)
 – Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X)
 – Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X)

3 Cs Absolute Miss Rate (SPEC92)
[Chart: miss rate vs. cache size, broken into conflict, capacity and compulsory components; the compulsory component is vanishingly small]

3 Cs Relative Miss Rate
[Chart: the same data normalized to total miss rate; the conflict component dominates the variation]
Flaws: for a fixed block size. Good: insight => invention

How Can We Reduce Misses?
• 3 Cs: Compulsory, Capacity, Conflict
• In all cases, assume the total cache size is not changed
• What happens if we:
 1) Change block size: which of the 3 Cs is obviously affected?
 2) Change associativity: which of the 3 Cs is obviously affected?
 3) Change the compiler: which of the 3 Cs is obviously affected?

1. Reduce Misses via Larger Block Size
[Chart: miss rate vs. block size for several cache sizes]

2. Reduce Misses via Higher Associativity
• 2:1 Cache Rule:
 – Miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way set-associative cache of size N/2
• Beware: execution time is the only final measure!
 – Will clock cycle time increase?
 – Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%

Example: Avg. Memory Access Time vs. Miss Rate
• Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. the CCT of direct mapped
Cache Size (KB) | 1-way | 2-way | 4-way | 8-way
1   | 2.33 | 2.15 | 2.07 | 2.01
2   | 1.98 | 1.86 | 1.76 | 1.68
4   | 1.72 | 1.67 | 1.61 | 1.53
8   | 1.46 | 1.48 | 1.47 | 1.43
16  | 1.29 | 1.32 | 1.32 | 1.32
32  | 1.20 | 1.24 | 1.25 | 1.27
64  | 1.14 | 1.20 | 1.21 | 1.23
128 | 1.10 | 1.17 | 1.18 | 1.20
(Red in the original figure means A.M.A.T. is not improved by more associativity)

3. Reducing Misses via a "Victim Cache"
• How to combine the fast hit time of direct mapped yet still avoid conflict misses?
• Add a buffer to hold data discarded from the cache
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache
• Used in Alpha, HP machines

4. Reducing Misses via "Pseudo-Associativity"
• How to combine the fast hit time of direct mapped and the lower conflict misses of a 2-way SA cache?
• Divide the cache: on a miss, check the other half of the cache to see if the data is there; if so, we have a pseudo-hit (slow hit)
 – Access time ordering: Hit Time < Pseudo Hit Time < Miss Penalty
• Drawback: the CPU pipeline is hard to design if a hit takes 1 or 2 cycles
 – Better for caches not tied directly to the processor (L2)
 – Used in the MIPS R10000 L2 cache; similar in UltraSPARC

5. Reducing Misses by Hardware Prefetching of Instructions & Data
• E.g., instruction prefetching
 – The Alpha 21064 fetches 2 blocks on a miss
 – The extra block is placed in a stream buffer
 – On a miss, check the stream buffer
• Works with data blocks too:
 – Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4 KB cache; 4 streams caught 43%
 – Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64 KB, 4-way set associative caches
• Prefetching relies on having extra memory bandwidth that can be used without penalty

6. Reducing Misses by Software Prefetching Data
• Data prefetch
 – Load data into a register (HP PA-RISC loads)
 – Cache prefetch: load into cache (MIPS IV, PowerPC, SPARC V9)
 – Special prefetching instructions cannot cause faults; a form of speculative execution
• Issuing prefetch instructions takes time
 – Is the cost of prefetch issues < the savings in reduced misses?
 – Wider superscalar issue reduces the difficulty of finding issue bandwidth

7. Reducing Misses by Compiler Optimizations
• McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, in software
• Instructions
 – Reorder procedures in memory so as to reduce conflict misses
 – Profiling to look at conflicts (using tools they developed)
• Data
 – Merging Arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
 – Loop Interchange: change the nesting of loops to access data in the order stored in memory
 – Loop Fusion: combine 2 independent loops that have the same looping and some variables overlap
 – Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows

Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reducing conflicts between val & key; improves spatial locality

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

2 misses per access to a & c vs. one miss per access; improves temporal locality

Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1) {
      r = r + y[i][k]*z[k][j];
    }
    x[i][j] = r;
  }

• Two inner loops:
 – Read all NxN elements of z[]
 – Read N elements of 1 row of y[] repeatedly
 – Write N elements of 1 row of x[]
• Capacity misses are a function of N & cache size:
 – If 3 NxNx4 bytes fit => no capacity misses; otherwise...
• Idea: compute on a BxB submatrix that fits

Blocking Example

/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B-1, N); j = j+1) {
        r = 0;
        for (k = kk; k < min(kk+B-1, N); k = k+1) {
          r = r + y[i][k]*z[k][j];
        }
        x[i][j] = x[i][j] + r;
      }

• B is called the Blocking Factor
• Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
• Conflict misses too?

Reducing Conflict Misses by Blocking
[Chart: conflict misses in caches that are not fully associative vs. blocking size]
• Lam et al. [1991]: a blocking factor of 24 had a fifth the misses of a factor of 48, despite both fitting in the cache

Summary of Compiler Optimizations to Reduce Cache Misses (by hand)
[Chart: relative improvement from merged arrays, loop interchange, loop fusion and blocking on several programs]

Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

1. Reducing Miss Penalty: Read Priority over Write on Miss
• Write through with write buffers offers RAW conflicts with main-memory reads on cache misses
• If we simply wait for the write buffer to empty, we might increase the read miss penalty (by 50% on the old MIPS 1000)
• Check the write buffer contents before a read; if there are no conflicts, let the memory access continue
• Write back?
 – Read miss replacing a dirty block
 – Normal: write the dirty block to memory, and then do the read
 – Instead: copy the dirty block to a write buffer, then do the read, and then do the write
 – The CPU stalls less since it restarts as soon as the read is done

2. Reduce Miss Penalty: Subblock Placement
• Don't have to load the full block on a miss
• Have valid bits per subblock to indicate validity
• (Originally invented to reduce tag storage)
[Diagram: one tag per block, one valid bit per subblock]

3. Reduce Miss Penalty: Early Restart and Critical Word First
• Don't wait for the full block to be loaded before restarting the CPU
 – Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
 – Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
• Generally useful only with large blocks
• Spatial locality is a problem: the CPU tends to want the next sequential word, so it is not clear whether early restart helps

4. Reduce Miss Penalty: Non-blocking Caches to Reduce Stalls on Misses
• A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
 – Requires an out-of-order execution CPU
• "Hit under miss" reduces the effective miss penalty by working during a miss vs. ignoring CPU requests
• "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
 – Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
 – Requires multiple memory banks (otherwise it cannot be supported)
 – The Pentium Pro allows 4 outstanding memory misses

Value of Hit Under Miss for SPEC
[Chart: AMAT for "hit under n misses" with n = 0->1, 1->2, 2->64, and base, for integer and floating-point programs]
• FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
• Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19
• 8 KB data cache, direct mapped, 32 B blocks, 16-cycle miss penalty

5. Reduce Miss Penalty: Second-Level Cache (L2)
• L2 equations:
 AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
 Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
 AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
• Definitions:
 – Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
 – Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
 – The global miss rate is what matters
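A worked example of these equations with assumed numbers (not from the slides): L1 hit 1 cycle, 4% L1 miss rate, L2 hit 10 cycles, 50% local L2 miss rate, 100-cycle L2 miss penalty.

#include <stdio.h>

int main(void)
{
    double hit_l1 = 1.0,  miss_rate_l1 = 0.04;
    double hit_l2 = 10.0, miss_rate_l2 = 0.50;   /* local L2 miss rate */
    double miss_penalty_l2 = 100.0;

    double miss_penalty_l1 = hit_l2 + miss_rate_l2 * miss_penalty_l2;   /* 60      */
    double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1;              /* 3.4     */
    double global_miss_rate_l2 = miss_rate_l1 * miss_rate_l2;           /* 2%      */

    printf("AMAT = %.2f cycles, global L2 miss rate = %.1f%%\n",
           amat, global_miss_rate_l2 * 100.0);
    return 0;
}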

Comparing Local and Global Miss Rates
• 32 KB first-level cache; increasing second-level cache
• The global miss rate is close to the single-level cache miss rate, provided L2 >> L1
• Don't use the local miss rate
• L2 is not tied to the CPU clock cycle!
• Consider cost & A.M.A.T.
• Generally: fast hit times and fewer misses
• Since hits are few, target miss reduction
[Charts: local and global miss rates vs. L2 size, on linear and log cache-size scales]

Reducing Misses: Which Apply to the L2 Cache?
• Reducing Miss Rate
 1. Reduce misses via larger block size
 2. Reduce conflict misses via higher associativity
 3. Reduce conflict misses via a victim cache
 4. Reduce conflict misses via pseudo-associativity
 5. Reduce misses by HW prefetching of instructions and data
 6. Reduce misses by SW prefetching of data
 7. Reduce capacity/conflict misses by compiler optimizations

L2 Cache Block Size & A.M.A.T.
[Chart: relative execution time vs. L2 block size; 32 KB L1, 8-byte path to memory]

Reducing Miss Penalty Summary
• Five techniques
 – Read priority over write on miss
 – Subblock placement
 – Early restart and critical word first on miss
 – Non-blocking caches (hit under miss, miss under miss)
 – Second-level cache
• Can be applied recursively to multilevel caches
 – The danger is that the time to DRAM will grow with multiple levels in between
 – First attempts at L2 caches can make things worse, since the increased worst case is worse

What is the Impact of What You've Learned About Caches?
• 1960-1985: Speed = f(no. of operations)
• 1990s:
 – Pipelined execution & fast clock rate
 – Out-of-order execution
 – Superscalar instruction issue
• 1998: Speed = f(non-cached memory accesses)
• Superscalar, out-of-order machines hide an L1 data cache miss (a few clocks) but not an L2 cache miss (tens of clocks)?

Cache Optimization Summary (miss-rate and miss-penalty techniques)
Technique                          | MR | MP | HT | Complexity
Larger Block Size                  | +  | -  |    | 0
Higher Associativity               | +  |    | -  | 1
Victim Caches                      | +  |    |    | 2
Pseudo-Associative Caches          | +  |    |    | 2
HW Prefetching of Instr/Data       | +  |    |    | 2
Compiler Controlled Prefetching    | +  |    |    | 3
Compiler Reduce Misses             | +  |    |    | 0
Priority to Read Misses            |    | +  |    | 1
Subblock Placement                 |    | +  | +  | 1
Early Restart & Critical Word 1st  |    | +  |    | 2
Non-Blocking Caches                |    | +  |    | 3
Second Level Caches                |    | +  |    | 2
(MR = miss rate, MP = miss penalty, HT = hit time; + helps, - hurts)

Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

1. Fast Hit Times via Small and Simple Caches
• Why does the Alpha 21164 have 8 KB instruction and 8 KB data caches + a 96 KB second-level cache?
 – Small data cache and fast clock rate
• Direct mapped, on chip

2. Fast Hits by Avoiding Address Translation
• Send the virtual address to the cache? Called a Virtually Addressed Cache or just Virtual Cache, vs. a Physical Cache
 – Every time a process is switched, logically we must flush the cache; otherwise we get false hits
  » Cost is the time to flush + "compulsory" misses from the empty cache
 – Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
 – I/O must interact with the cache, so it needs virtual addresses
• Solution to aliases
 – HW guarantees, or SW guarantees via page coloring
• Solution to cache flushes
 – Add a process-identifier tag that identifies the process as well as the address within the process: can't get a hit if it is the wrong process

Virtually Addressed Caches
[Diagram of three organizations:
 – Conventional: CPU -> TB (translation buffer) -> physically addressed cache -> L2/memory
 – Virtually addressed cache: CPU -> VA cache; translate only on a miss (synonym problem)
 – Overlapped: cache access overlapped with VA translation; requires the cache index to remain invariant across translation]

2. Fast Cache Hits by Avoiding Translation: Process ID Impact
[Chart: miss rate (y-axis, up to 20%) vs. cache size (x-axis, 2 KB to 1024 KB)]
• Black: uniprocess
• Light gray: multiprocess when the cache is flushed on a switch
• Dark gray: multiprocess when a process ID tag is used

2. Fast Cache Hits by Avoiding Translation: Index with Physical Portion of Address
• If the index is in the physical part of the address (the page offset), we can start the tag access in parallel with translation, then compare against the physical tag
 – Address fields: Page Address | Page Offset, with the index and block offset inside the page offset
• Limits the cache to the page size: what if we want bigger caches with the same trick?
 – Higher associativity moves the barrier to the right
 – Page coloring
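A small worked check of that limit (assumed numbers, not from the slide): with 4 KB pages, only the 12 untranslated page-offset bits can feed the index and block offset, so a virtually indexed, physically tagged cache tops out at page size x associativity.

#include <stdio.h>

int main(void)
{
    unsigned page_size  = 4096;   /* bytes available below the translation boundary */
    unsigned block_size = 32;     /* assumed block size */

    for (unsigned ways = 1; ways <= 8; ways <<= 1) {
        unsigned sets      = page_size / block_size;     /* index bits fixed by page size */
        unsigned max_cache = sets * block_size * ways;   /* capacity grows only via ways  */
        printf("%u-way: %u sets x %u ways x %u B = max %u KB cache\n",
               ways, sets, ways, block_size, max_cache / 1024);
    }
    return 0;
}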

3. Fast Hit Times via Pipelined Writes
• Pipeline the tag check and the cache update as separate stages: the current write's tag check overlaps the previous write's cache update
• Only STOREs are in this pipeline; it is empty during a miss
[Example: Store R2,(R1) checks the tag for R1; later, Store R4,(R3) performs M[R1]<-R2 while checking the tag for R3]
• The shaded "Delayed Write Buffer" must be checked on reads; either complete the write or read from the buffer

4. Fast Writes on Misses via Small Subblocks
• If most writes are 1 word, the subblock size is 1 word, and the cache is write through, then always write the subblock & tag immediately
 – Tag match and valid bit already set: writing the block was proper, and nothing is lost by setting the valid bit on again.
 – Tag match and valid bit not set: the tag match means that this is the proper block; writing the data into the subblock makes it appropriate to turn the valid bit on.
 – Tag mismatch: this is a miss and will modify the data portion of the block. Since this is a write-through cache, no harm is done; memory still has an up-to-date copy of the old value. Only the tag needs to change to the address of the write, and the valid bits of the other subblocks need to be cleared, because the valid bit for this subblock has already been set.
• Doesn't work with write back, because of the last case

Cache Optimization Summary (miss-rate, miss-penalty and hit-time techniques)
Technique                          | MR | MP | HT | Complexity
Larger Block Size                  | +  | -  |    | 0
Higher Associativity               | +  |    | -  | 1
Victim Caches                      | +  |    |    | 2
Pseudo-Associative Caches          | +  |    |    | 2
HW Prefetching of Instr/Data       | +  |    |    | 2
Compiler Controlled Prefetching    | +  |    |    | 3
Compiler Reduce Misses             | +  |    |    | 0
Priority to Read Misses            |    | +  |    | 1
Subblock Placement                 |    | +  | +  | 1
Early Restart & Critical Word 1st  |    | +  |    | 2
Non-Blocking Caches                |    | +  |    | 3
Second Level Caches                |    | +  |    | 2
Small & Simple Caches              | -  |    | +  | 0
Avoiding Address Translation       |    |    | +  | 2
Pipelining Writes                  |    |    | +  | 1
(MR = miss rate, MP = miss penalty, HT = hit time; + helps, - hurts)

Main Memory Background
• Performance of main memory:
 – Latency: cache miss penalty
  » Access time: time between the request and the word arriving
  » Cycle time: time between requests
 – Bandwidth: I/O & large-block miss penalty (L2)
• Main memory is DRAM: Dynamic Random Access Memory
 – Dynamic since it needs to be refreshed periodically (8 ms, 1% of time)
 – Addresses divided into 2 halves (memory as a 2D matrix):
  » RAS or Row Access Strobe
  » CAS or Column Access Strobe
• Cache uses SRAM: Static Random Access Memory
 – No refresh (6 transistors/bit vs. 1 transistor/bit)
 – Address not divided: full address
• Size: DRAM/SRAM is 4-8x; cost & cycle time: SRAM/DRAM is 8-16x

Main Memory Deep Background
• "Out-of-Core", "In-Core", "Core Dump"?
• "Core memory"?
• Non-volatile, magnetic
• Lost to 4 Kbit DRAM (today using 64 Mbit DRAM)
• Access time 750 ns, cycle time 1500-3000 ns

DRAM Logical Organization (4 Mbit)
[Diagram: 11 address lines A0...A10 feed the row decoder and column decoder of a 2,048 x 2,048 memory array; sense amps & I/O drive the D and Q pins; each storage cell sits on a word line]
• Square root of bits per RAS/CAS

DRAM Physical Organization (4 Mbit)
[Diagram: the row address feeds a block row decoder (9 : 512) and the column address feeds I/O blocks 0-3, each 8 I/Os wide, driving D and Q]

4 Key DRAM Timing Parameters
• tRAC: minimum time from the RAS line falling to valid data output
 – Quoted as the speed of a DRAM when you buy it (it is what appears on the purchase sheet)
 – A typical 4 Mbit DRAM has tRAC = 60 ns
• tRC: minimum time from the start of one row access to the start of the next
 – tRC = 110 ns for a 4 Mbit DRAM with a tRAC of 60 ns
• tCAC: minimum time from the CAS line falling to valid data output
 – 15 ns for a 4 Mbit DRAM with a tRAC of 60 ns
• tPC: minimum time from the start of one column access to the start of the next
 – 35 ns for a 4 Mbit DRAM with a tRAC of 60 ns

DRAM Performance
• A 60 ns (tRAC) DRAM can
 – perform a row access only every 110 ns (tRC)
 – perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC)
  » In practice, external address delays and turning around buses make it 40 to 50 ns
• These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead!

DRAM History
• DRAMs: capacity +60%/yr, cost -30%/yr
 – 2.5X cells/area, 1.5X die size in 3 years
• A '98 DRAM fab line costs $2B
 – DRAM only: density, leakage vs. speed
• Relies on an increasing number of computers & memory per computer (60% of the market)
 – SIMM or DIMM is the replaceable unit => computers can use any generation of DRAM
• Commodity, second-source industry => high volume, low profit, conservative
 – Little organization innovation in 20 years
• Order of importance: 1) Cost/bit 2) Capacity
 – First RAMBUS: 10X BW, +30% cost => little impact

DRAM Future: 1 Gbit DRAM (ISSCC '96; production '02?)
             | Mitsubishi     | Samsung
Blocks       | 512 x 2 Mbit   | 1024 x 1 Mbit
Clock        | 200 MHz        | 250 MHz
Data Pins    | 64             | 16
Die Size     | 24 x 24 mm     | 31 x 21 mm   (sizes will be much smaller in production)
Metal Layers | 3              | 4
Technology   | 0.15 micron    | 0.16 micron
• Wish we could do this for microprocessors!

Main Memory Performance
• Simple:
 – CPU, cache, bus, memory all the same width (32 or 64 bits)
• Wide:
 – CPU/Mux 1 word; Mux/cache, bus, memory N words (Alpha: 64 bits & 256 bits; UltraSPARC: 512)
• Interleaved:
 – CPU, cache, bus 1 word; memory N modules (4 modules); the example is word interleaved

Main Memory Performance
• Timing model (word size is 32 bits)
 – 1 cycle to send the address
 – 6 cycles access time, 1 cycle to send data
 – Cache block is 4 words
• Simple M.P. = 4 x (1 + 6 + 1) = 32
• Wide M.P. = 1 + 6 + 1 = 8
• Interleaved M.P. = 1 + 6 + 4 x 1 = 11
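A tiny sketch that reproduces the slide's arithmetic for the three organizations (parameters are exactly the slide's: 1 cycle for the address, 6 cycles access, 1 cycle per word transfer, 4-word blocks):

#include <stdio.h>

int main(void)
{
    int send_addr = 1, access = 6, xfer = 1, words = 4;

    int simple      = words * (send_addr + access + xfer);  /* 4 x 8 = 32 */
    int wide        = send_addr + access + xfer;            /* 8          */
    int interleaved = send_addr + access + words * xfer;    /* 11         */

    printf("simple = %d, wide = %d, interleaved = %d cycles\n",
           simple, wide, interleaved);
    return 0;
}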

Independent Memory Banks
• Memory banks for independent accesses vs. faster sequential accesses
 – Multiprocessor
 – I/O
 – CPU with hit under n misses, non-blocking cache
• Superbank: all memory active on one block transfer (also just called a bank)
• Bank: portion within a superbank that is word interleaved (also called a subbank)
[Diagram: superbanks divided into banks]

Independent Memory Banks
• How many banks? Number of banks >= number of clocks to access a word in a bank
 – For sequential accesses; otherwise we return to the original bank before it has its next word ready
 – (as in the vector case)
• Increasing DRAM size => fewer chips => harder to have enough banks

DRAMs per PC over Time
Minimum Memory Size | '86 (1 Mb) | '89 (4 Mb) | '92 (16 Mb) | '96 (64 Mb) | '99 (256 Mb) | '02 (1 Gb)
4 MB    | 32 | 8  |    |    |    |
8 MB    |    | 16 | 4  |    |    |
16 MB   |    |    | 8  | 2  |    |
32 MB   |    |    |    | 4  | 1  |
64 MB   |    |    |    | 8  | 2  |
128 MB  |    |    |    |    | 4  | 1
256 MB  |    |    |    |    | 8  | 2

Avoiding Bank Conflicts
• Lots of banks

int x[256][512];
for (j = 0; j < 512; j = j+1)
  for (i = 0; i < 256; i = i+1)
    x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, the word accesses conflict
• SW: loop interchange or declaring the array not a power of 2 ("array padding")
• HW: prime number of banks
 – bank number = address mod number of banks
 – address within bank = address / number of words in bank
 – modulo & divide per memory access with a prime number of banks?
 – address within bank = address mod number of words in bank
 – bank number? easy if 2^N words per bank
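A minimal sketch of the software fix mentioned above ("array padding"; the padded declaration is my illustration, not the slide's code): the extra column is never used, it only changes the stride so successive column accesses land in different banks.

/* Row length 513 instead of 512, so the stride between x[i][j] and
   x[i+1][j] is 513 words - not a multiple of a power-of-two bank
   count such as 128. */
int x[256][512 + 1];

void scale_column(int j)
{
    for (int i = 0; i < 256; i = i + 1)
        x[i][j] = 2 * x[i][j];   /* successive accesses now hit different banks */
}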

Fast Bank Number
• Chinese Remainder Theorem: as long as two sets of integers ai and bi follow the rules
 bi = x mod ai, 0 <= bi < ai, 0 <= x < a0 x a1 x a2 x ...
 and ai and aj are co-prime if i != j, then the integer x has only one solution (unambiguous mapping):
 – bank number = b0, number of banks = a0 (= 3 in the example)
 – address within bank = b1, number of words in bank = a1 (= 8 in the example)
 – N-word address 0 to N-1, prime number of banks, words per bank a power of 2
Address within Bank | Seq. Interleaved (banks 0, 1, 2) | Modulo Interleaved (banks 0, 1, 2)
0 |  0  1  2 |  0 16  8
1 |  3  4  5 |  9  1 17
2 |  6  7  8 | 18 10  2
3 |  9 10 11 |  3 19 11
4 | 12 13 14 | 12  4 20
5 | 15 16 17 | 21 13  5
6 | 18 19 20 |  6 22 14
7 | 21 22 23 | 15  7 23
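A small sketch of the modulo-interleaved mapping in the table above (parameters from the slide's example: 3 banks, 8 words per bank); the Chinese Remainder Theorem guarantees each address gets a unique (bank, word) pair, and the word-within-bank needs no division because 8 is a power of two.

#include <stdio.h>

#define NUM_BANKS       3
#define WORDS_PER_BANK  8

int main(void)
{
    for (int addr = 0; addr < NUM_BANKS * WORDS_PER_BANK; addr++) {
        int bank        = addr % NUM_BANKS;
        int within_bank = addr & (WORDS_PER_BANK - 1);  /* addr mod 8 */
        printf("addr %2d -> bank %d, word %d\n", addr, bank, within_bank);
    }
    return 0;
}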

Fast Memory Systems: DRAM Specific
• Multiple CAS accesses: several names (page mode)
 – Extended Data Out (EDO): 30% faster in page mode
• New DRAMs to address the gap; what will they cost, will they survive?
 – RAMBUS: startup company; reinvented the DRAM interface
  » Each chip is a module vs. a slice of memory
  » Short bus between CPU and chips
  » Does its own refresh
  » Variable amount of data returned
  » 1 byte / 2 ns (500 MB/s per chip)
 – Synchronous DRAM: 2 banks on chip, a clock signal to the DRAM, transfers synchronous to the system clock (66 - 150 MHz)
 – Intel claims RAMBUS Direct (16 b wide) is the future of PC memory
• Niche memory or main memory?
 – e.g., Video RAM for frame buffers: DRAM + fast serial output

DRAM Latency >> BW
• More application bandwidth => more cache misses => more DRAM RAS/CAS
• Application BW => lower DRAM latency
• RAMBUS and synchronous DRAM increase BW but have higher latency
• EDO DRAM gives < 5% improvement in a PC
[Diagram: Proc -> I$/D$ -> L2$ -> Bus -> DRAM]

Potential DRAM Crossroads?
• After 20 years of 4X every 3 years, running into a wall? (64 Mb - 1 Gb)
• How can we keep $1B fab lines full if we buy fewer DRAMs per computer?
• Cost/bit -30%/yr if we stop 4X/3yr?
• What will happen to the $40B/yr DRAM industry?

Main Memory Summary
• Wider memory
• Interleaved memory: for sequential or independent accesses
• Avoiding bank conflicts: SW & HW
• DRAM-specific optimizations: page mode & specialty DRAMs
• DRAM future less rosy?

A Modern Memory Hierarchy
• By taking advantage of the principle of locality:
 – Present the user with as much memory as is available in the cheapest technology.
 – Provide access at the speed offered by the fastest technology.
[Diagram: processor (registers, datapath, control, on-chip cache) -> second-level cache (SRAM) -> main memory (DRAM) -> secondary storage (disk) -> tertiary storage (disk/tape). Speeds roughly: 1s ns, 10s ns, 100s ns, 10,000,000s ns (10s of ms), 10,000,000,000s ns (10s of sec). Sizes roughly: 100s bytes, K bytes, M bytes, G bytes, T bytes.]

Basic Issues in VM System Design
• Size of the information blocks that are transferred from secondary to main storage (M)
• When a block of information is brought into M and M is full, some region of M must be released to make room for the new block --> replacement policy
• Which region of M is to hold the new block --> placement policy
• A missing item is fetched from secondary memory only on the occurrence of a fault --> demand load policy
[Diagram: disk pages -> memory frames -> cache -> registers]
Paging organization: the virtual and physical address spaces are partitioned into blocks of equal size: page frames (physical) and pages (virtual)

Address Map
V = {0, 1, ..., n - 1} virtual address space
M = {0, 1, ..., m - 1} physical address space, n > m
MAP: V --> M U {0} address mapping function
 MAP(a) = a' if data at virtual address a is present at physical address a' and a' is in M
 MAP(a) = 0 if data at virtual address a is not present in M (a missing-item fault)
[Diagram: the processor issues name-space address a to the address translation mechanism; it either produces physical address a' into main memory, or raises a fault handled by the OS fault handler, which transfers the data from secondary memory]

Paging Organization
[Diagram: physical memory divided into 1 KB frames (frame 0 at address 0, frame 1 at 1024, ..., frame 7 at 7168); virtual memory divided into 1 KB pages (page 0 at 0, page 1 at 1024, ..., page 31 at 31744); the page is the unit of both mapping and transfer from virtual to physical memory]
Address mapping:
• The 10-bit displacement passes through unchanged; the virtual page number indexes the page table (located in physical memory, found via the Page Table Base Register)
• Each page table entry holds a valid bit (V), access rights, and the physical frame number (PA); frame number + displacement form the physical memory address (actually, concatenation is more likely than addition)
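A software sketch of that translation (structure names are invented for illustration; parameters follow the slide: 1 KB pages, i.e. a 10-bit displacement, and a single-level page table indexed by virtual page number):

#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS   10
#define PAGE_SIZE   (1u << PAGE_BITS)
#define NUM_PAGES   32                    /* 32 KB virtual space, as in the figure */

typedef struct {
    bool     valid;                       /* V bit                       */
    uint8_t  access_rights;               /* e.g. read/write permission  */
    uint32_t frame;                       /* physical frame number       */
} pte_t;

static pte_t page_table[NUM_PAGES];       /* base address kept in a register by HW */

/* Returns true and fills *pa on success; false means a page fault,
   which the OS would handle by loading the page from disk. */
bool translate(uint32_t va, uint32_t *pa)
{
    uint32_t vpn  = va >> PAGE_BITS;      /* index into the page table  */
    uint32_t disp = va & (PAGE_SIZE - 1); /* displacement, unchanged    */
    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return false;
    *pa = (page_table[vpn].frame << PAGE_BITS) | disp;   /* concatenation */
    return true;
}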

Virtual Address and a Cache
[Diagram: CPU issues VA -> translation -> PA -> cache; on a miss, main memory]
It takes an extra memory access to translate a VA to a PA. This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible.
ASIDE: why access the cache with the PA at all? VA caches have a problem!
• Synonym / alias problem: two different virtual addresses map to the same physical address => two different cache entries holding data for the same physical address!
 – For an update: must update all cache entries with the same physical address, or memory becomes inconsistent
 – Determining this requires significant hardware, essentially an associative lookup on the physical address tags to see if there are multiple hits; or a software-enforced alias boundary: the same low-order address bits of VA & PA over the cache size

TLBs
A way to speed up translation is to use a special cache of recently used page table entries -- this has many names, but the most frequently used is Translation Lookaside Buffer or TLB.
[TLB entry: Virtual Address | Physical Address | Dirty | Ref | Valid | Access]
Really just a cache of the page table mappings.
TLB access time is comparable to cache access time (much less than main memory access time).

Translation Look-Aside Buffers
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped.
TLBs are usually small, typically no more than 128 - 256 entries even on high-end machines. This permits a fully associative lookup on these machines. Most mid-range machines use small n-way set associative organizations.
[Diagram: the CPU sends the VA to the TLB; a TLB hit yields the PA for the cache lookup, a TLB miss goes through translation; a cache miss goes to main memory. Typical relative access times: TLB 1/2 t, cache t, main memory 20 t]
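A rough software model of that flow (the structure and replacement policy are invented for illustration; a real TLB is a hardware associative memory): check the TLB first, and only walk the page table from the earlier paging sketch on a TLB miss.

#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS   10
#define PAGE_SIZE   (1u << PAGE_BITS)
#define TLB_ENTRIES 8

/* Page-table walk from the earlier paging sketch, declared here so this
   snippet stands alone; returns false on a page fault. */
extern bool translate(uint32_t va, uint32_t *pa);

typedef struct { bool valid; uint32_t vpn, frame; } tlb_entry_t;
static tlb_entry_t tlb[TLB_ENTRIES];
static unsigned next_victim;                     /* trivial round-robin replacement */

bool tlb_translate(uint32_t va, uint32_t *pa)
{
    uint32_t vpn = va >> PAGE_BITS;
    for (int i = 0; i < TLB_ENTRIES; i++)        /* fully associative lookup */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pa = (tlb[i].frame << PAGE_BITS) | (va & (PAGE_SIZE - 1));
            return true;                         /* TLB hit: no page-table access */
        }
    if (!translate(va, pa))                      /* TLB miss: walk the page table */
        return false;                            /* page fault */
    tlb[next_victim] = (tlb_entry_t){ true, vpn, *pa >> PAGE_BITS };
    next_victim = (next_victim + 1) % TLB_ENTRIES;
    return true;
}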

Reducing Translation Time
Machines with TLBs go one step further to reduce the number of cycles per cache access: they overlap the cache access with the TLB access. The high-order bits of the VA are used to look in the TLB while the low-order bits are used as the index into the cache.

Overlapped Cache & TLB Access
[Diagram: the 20-bit virtual page number goes to the TLB (associative lookup) while the 12-bit displacement indexes a 1 K-set cache with 4-byte blocks (10-bit index, 2-bit offset); the PA from the TLB is compared with the cache tag]
IF cache hit AND (cache tag = PA) THEN deliver data to CPU
ELSE IF [cache miss OR (cache tag != PA)] AND TLB hit THEN access memory with the PA from the TLB
ELSE do standard VA translation

Problems With Overlapped TLB Access
Overlapped access only works as long as the address bits used to index into the cache do not change as the result of VA translation.
This usually limits things to small caches, large page sizes, or highly set-associative caches if you want a large cache.
Example: suppose everything is the same except that the cache is increased to 8 K bytes instead of 4 K:
[Diagram: with a 12-bit displacement and a 20-bit virtual page number, the 8 KB cache needs an 11-bit index plus 2-bit offset, so one index bit now comes from the virtual page number - it is changed by VA translation, but it is needed for the cache lookup]
Solutions: go to 8 K byte page sizes; go to a 2-way set associative cache (two 1 K x 4-byte ways); or SW guarantees VA[13] = PA[13]

Alpha 21064
• Separate instruction & data TLBs & caches
• TLBs fully associative
• TLB updates in SW ("Priv Arch Libr")
• Caches 8 KB direct mapped, write through
• Critical 8 bytes first
• Prefetch instruction stream buffer
• 2 MB L2 cache, direct mapped, WB (off-chip)
• 256-bit path to main memory, 4 x 64-bit modules
• Victim buffer: to give reads priority over writes
• 4-entry write buffer between D$ & L2$
[Diagram: instruction and data paths through the stream buffer, write buffer and victim buffer]

Alpha Memory Performance: Miss Rates of SPEC92
[Chart: miss rates of the 8 KB I$, 8 KB D$ and 2 MB L2 across SPEC92 programs. Example points: I$ miss = 6%, D$ miss = 32%, L2 miss = 10%; I$ miss = 2%, D$ miss = 13%, L2 miss = 0.6%; I$ miss = 1%, D$ miss = 21%, L2 miss = 0.3%]

Alpha CPI Components
[Chart: CPI breakdown per program]
• Instruction stall: branch mispredict (green)
• Data cache (blue); instruction cache (yellow); L2$ (pink)
• Other: compute + register conflicts, structural conflicts

Pitfall: Predicting Cache Performance from Different Programs (ISA, compiler, ...)
[Chart: miss rate vs. cache size for the D$ and I$ of gcc, espresso and Tomcatv]
• 4 KB data cache: miss rate 8%, 12%, or 28%?
• 1 KB instruction cache: miss rate 0%, 3%, or 10%?
• Alpha vs. MIPS for an 8 KB data cache: 17% vs. 10%
• Why is Alpha 2X MIPS?

Pitfall: Simulating Too Small an Address Trace
[Chart: cumulative average memory access time vs. number of instructions traced]
• I$ = 4 KB, B = 16 B
• D$ = 4 KB, B = 16 B
• L2 = 512 KB, B = 128 B
• MP = 12, 200

Summary #1
• The Principle of Locality:
 – Programs access a relatively small portion of the address space at any instant of time.
  » Temporal Locality: locality in time
  » Spatial Locality: locality in space
• Three major categories of cache misses:
 – Compulsory misses: sad facts of life. Example: cold start misses.
 – Capacity misses: increase cache size
 – Conflict misses: increase cache size and/or associativity. Nightmare scenario: ping-pong effect!
• Write policy:
 – Write through: needs a write buffer. Nightmare: write-buffer saturation
 – Write back: control can be complex

Summary #2: The Cache Design Space
• Several interacting dimensions
 – cache size
 – block size
 – associativity
 – replacement policy
 – write-through vs write-back
 – write allocation
• The optimal choice is a compromise
 – depends on access characteristics
  » workload
  » use (I-cache, D-cache, TLB)
 – depends on technology / cost
• Simplicity often wins
[Diagram: the design space sketched along the cache size, associativity and block size axes; each factor trades off good vs. bad regions]

Summary #3
• 3 Cs: Compulsory, Capacity, Conflict
• Reducing Miss Rate
 1. Reduce misses via larger block size
 2. Reduce misses via higher associativity
 3. Reduce misses via a victim cache
 4. Reduce misses via pseudo-associativity
 5. Reduce misses by HW prefetching of instructions and data
 6. Reduce misses by SW prefetching of data
 7. Reduce misses by compiler optimizations
• Remember the danger of concentrating on just one parameter when evaluating performance

Summary #4: Reducing Miss Penalty
• Five techniques
 – Read priority over write on miss
 – Subblock placement
 – Early restart and critical word first on miss
 – Non-blocking caches (hit under miss, miss under miss)
 – Second-level cache
• Can be applied recursively to multilevel caches
 – The danger is that the time to DRAM will grow with multiple levels in between
 – First attempts at L2 caches can make things worse, since the increased worst case is worse
• An out-of-order CPU can hide an L1 data cache miss (a few clocks), but stalls on an L2 miss (tens to hundreds of clocks)?

Summary #5: Cache Optimization Summary
Technique                          | MR | MP | HT | Complexity
Larger Block Size                  | +  | -  |    | 0
Higher Associativity               | +  |    | -  | 1
Victim Caches                      | +  |    |    | 2
Pseudo-Associative Caches          | +  |    |    | 2
HW Prefetching of Instr/Data       | +  |    |    | 2
Compiler Controlled Prefetching    | +  |    |    | 3
Compiler Reduce Misses             | +  |    |    | 0
Priority to Read Misses            |    | +  |    | 1
Subblock Placement                 |    | +  | +  | 1
Early Restart & Critical Word 1st  |    | +  |    | 2
Non-Blocking Caches                |    | +  |    | 3
Second Level Caches                |    | +  |    | 2
Small & Simple Caches              | -  |    | +  | 0
Avoiding Address Translation       |    |    | +  | 2
Pipelining Writes                  |    |    | +  | 1
(MR = miss rate, MP = miss penalty, HT = hit time; + helps, - hurts)

Summary #6: Main Memory
• Wider memory
• Interleaved memory: for sequential or independent accesses
• Avoiding bank conflicts: SW & HW
• DRAM-specific optimizations: page mode & specialty DRAMs
• DRAM future less rosy?

Summary #7: TLB, Virtual Memory
• Caches, TLBs, and virtual memory can all be understood by examining how they deal with 4 questions:
 1) Where can a block be placed?
 2) How is a block found?
 3) What block is replaced on a miss?
 4) How are writes handled?
• Page tables map virtual addresses to physical addresses
• TLBs are important for fast translation
• TLB misses are significant in processor performance
 – Funny times, as most systems can't access all of the 2nd-level cache without TLB misses!

Summary #8: Memory Hierarchy
• Virtual memory was controversial at the time: can SW automatically manage 64 KB across many programs?
 – 1000X DRAM growth removed the controversy
• Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than the memory hierarchy
• Today CPU time is a function of (ops, cache misses) vs. just f(ops): what does this mean to compilers, data structures, algorithms?