


Princess Sumaya University for Technology
Computer Architecture
Dr. Esam Al_Qaralleh

Memory Hierarchy Design

5.1 Introduction

Memory Hierarchy Design
- Motivated by the principle of locality - a 90/10 type of rule
  - Take advantage of 2 forms of locality
    - Spatial - nearby references are likely
    - Temporal - the same reference is likely soon
- Also motivated by cost/performance structures
  - Smaller hardware is faster: SRAM, disk, tape
  - Access vs. bandwidth variations
  - Fast memory is more expensive
- Goal: provide a memory system with cost almost as low as the cheapest level and speed almost as fast as the fastest level

Memory Relevance in Computer Design
- A computer's performance is given by the number of instructions executed per time unit
- The time for executing an instruction depends on:
  - The ALU speed (i.e. the data-path cycle duration)
  - The time it takes for each instruction to load/store its operands/result from/into the memory (in brief, the time to access memory)
- The processing speed (CPU speed) grows faster than the memory speed. As a result the CPU speed cannot be fully exploited. This speed gap leads to an unbalanced system!

DRAM/CPU Gap
- CPU performance improves at 55%/year
  - In 1996 it was a phenomenal 18% per month
- DRAM has improved at 7% per year

Levels in a Typical Memory Hierarchy

Sample Memory Hierarchy

Unit of Transfer / Addressable Unit
- Unit of transfer: the number of bits read from, or written into, memory at a time
  - Internal: usually governed by the data bus width
  - External: usually a block of words, e.g. 512 or more
- Addressable unit: the smallest location which can be uniquely addressed
  - Internal: word or byte
  - External: device dependent, e.g. a disk "cluster"

Access Method
- Sequential: data is stored in records, access is in linear sequence (tape)
- Direct: data blocks have a unique and direct access, data within a block is sequential (disk)
- Random: data has unique and direct access (RAM)
- Associative: data is retrieved based on a (partial) match rather than an address (cache)

Sequential Access Method
- Start at the beginning - read through in order
- Access time depends on the location of the data and the previous location
- e.g. tape (diagram: read from the first location through to the location of interest)

Direct Access Method
- Individual blocks have a unique address
- Access is by jumping to the vicinity plus a sequential search (or waiting, e.g. waiting for the disk to rotate)
- Access time depends on the target location and the previous location
- e.g. disk (diagram: jump to the vicinity of block i, then read to the word of interest)

Random Access Method
- Individual addresses identify specific locations
- Access time is independent of location or previous access
- e.g. RAM

Associative Access Method
- Data is located by a comparison with the contents of a portion of the store (Content Addressable Memory)
- Access time is independent of location or previous access
- e.g. cache

Registers
- Used for storing data
- Register files
  - Multiple read/write ports possible
  - Example: 32-bit wide by 16 words deep, dual-port parallel read, single-port parallel write register file

Implementing Registers Using Logic Gates
- Flip-flops
  - Simple SR latch, with truth table:

      S  R  |  Q   Q'
      1  1  |  Q   Q'   (hold)
      1  0  |  0   1
      0  1  |  1   0
      0  0  |  x   x    (not allowed)

  - Flip-flops: JK, D, T; clocked; master-slave (edge-triggered)

Implementing Registers in CMOS
- Direct gate implementation is too costly
  - A master-slave JK flip-flop uses 38 CMOS transistors
- Directly implement in transistors
  - Example: clocked SR flip-flop (transistor-level schematic with S, R, and clock inputs)

Internal Organization of Memory Chips
Figure 5.2. Organization of bit cells in a memory chip: an address decoder (inputs A0-A3) drives the word lines W0-W15; each word line selects a row of flip-flop memory cells; the cells connect through bit-line pairs (b7/b'7 ... b0/b'0) to Sense/Write circuits, which drive the data input/output lines b7-b0 under the R/W and CS control signals.

A Memory Chip
Organization of a 1K x 1 memory chip: the 10-bit address is split into a 5-bit row address (a 5-bit decoder selects one of the word lines W0-W31 of a 32 x 32 memory cell array) and a 5-bit column address (a 32-to-1 output multiplexer and input demultiplexer), with Sense/Write circuitry and the R/W and CS controls feeding the data input/output line.

Static Memories
- The circuits are capable of retaining their state as long as power is applied.
(Figure: a static RAM cell - two cross-coupled inverters holding nodes X and Y, connected to the bit lines b and b' through access transistors T1 and T2 controlled by the word line.)

Static Memories
- CMOS cell: low power consumption

Asynchronous DRAMs
- Static RAMs are fast, but they cost more area and are more expensive.
- Dynamic RAMs (DRAMs) are cheap and area efficient, but they cannot retain their state indefinitely - they need to be periodically refreshed.
(Figure: a single-transistor dynamic memory cell - a capacitor C connected to the bit line through a transistor T controlled by the word line.)

A Dynamic Memory Chip
Internal organization of a 2M x 8 dynamic memory chip: the multiplexed address lines A20-9 / A8-0 feed a row address latch (strobed by RAS, the Row Address Strobe) and a column address latch (strobed by CAS, the Column Address Strobe); a row decoder selects a row of the 4096 x (512 x 8) cell array, Sense/Write circuits and a column decoder select the byte, and the data appears on D7-D0 under the CS and R/W controls.

5.2 Review of the ABCs of Caches

36 Basic Terms on Caches
cache, fully associative, write allocate, virtual memory, dirty bit, unified cache, memory stall cycles, block offset, misses per instruction, direct mapped, write back, block, valid bit, data cache, locality, block address, hit time, address trace, write through, cache miss, set, instruction cache, page fault, random placement, average memory access time, miss rate, index field, cache hit, n-way set associative, no-write allocate, page, least-recently used, write buffer, miss penalty, tag field, write stall

Cache
- The first level of the memory hierarchy encountered once the address leaves the CPU
  - Persistent mismatch between CPU and main-memory speeds
  - Exploit the principle of locality by providing a small, fast memory between CPU and main memory -- the cache memory
- "Cache" is now applied whenever buffering is employed to reuse commonly occurring terms (e.g. file caches)
- Caching - copying information into a faster storage system
  - Main memory can be viewed as a cache for secondary storage

General Hierarchy Concepts
- At each level the block concept is present (the block is the caching unit)
  - Block size may vary depending on the level
    - Amortize longer access by bringing in a larger chunk
    - Works if the locality principle is true
  - Hit - access where the block is present; hit rate is the probability
  - Miss - access where the block is absent (in lower levels); miss rate
- Mirroring and consistency
  - Data residing in a higher level is a subset of the data in the lower level
  - Changes at the higher level must be reflected down - sometime
    - The policy of "sometime" is the consistency mechanism
- Addressing
  - Whatever the organization, you have to know how to get at it!
  - Address checking and protection

Physical Address Structure
- Key is that you want different block sizes at different levels

Latency and Bandwidth
- The time required for the cache miss depends on both the latency and the bandwidth of the memory (or lower level)
- Latency determines the time to retrieve the first word of the block
- Bandwidth determines the time to retrieve the rest of the block
- A cache miss is handled by hardware and causes processors following in-order execution to pause or stall until the data are available

Predicting Memory Access Times
- On a hit: simply the access time to the cache
- On a miss: access time + miss penalty
  - Miss penalty = access time of the lower level + block transfer time
  - Block transfer time depends on
    - Block size - bigger blocks mean longer transfers
    - Bandwidth between the two levels of memory
      - Bandwidth is usually dominated by the slower memory and the bus protocol
- Performance
  - Average-Memory-Access-Time = Hit-Access-Time + Miss-Rate * Miss-Penalty
  - Memory-stall-cycles = IC * Memory-references-per-instruction * Miss-Rate * Miss-Penalty
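As a rough illustration of the two formulas above, here is a minimal C sketch; the function and parameter names are ours, not from the slides, and the sample values are arbitrary.

#include <stdio.h>

/* Average memory access time; units (cycles or ns) just have to be consistent. */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

/* Total memory stall cycles for a whole program. */
double memory_stall_cycles(double ic, double mem_refs_per_instr,
                           double miss_rate, double miss_penalty) {
    return ic * mem_refs_per_instr * miss_rate * miss_penalty;
}

int main(void) {
    /* Example values only: 1-cycle hit, 2% miss rate, 50-cycle penalty. */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.02, 50.0));
    printf("stalls = %.0f cycles\n", memory_stall_cycles(1e6, 1.33, 0.02, 50.0));
    return 0;
}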

Block Sizes, Miss Rates & Penalties, Accesses

Headaches of Memory Hierarchies
- The CPU never knows for sure if an access will hit
- How deep will a miss be - i.e. the miss penalty
  - If short, then the CPU just waits
  - If long, then it is probably best to work on something else - task switch
    - Implies that the amount can be predicted with reasonable accuracy
    - The task switch had better be fast or productivity/efficiency will suffer
- Implies some new needs
  - More hardware accounting
  - Software-readable accounting information (address trace)

Four Standard Questions
- Block placement: where can a block be placed in the upper level?
- Block identification: how is a block found if it is in the upper level?
- Block replacement: which block should be replaced on a miss?
- Write strategy: what happens on a write?
Answer the four questions for the first level of the memory hierarchy.

Block Placement Options
- Direct mapped
  - (Block address) MOD (# of cache blocks)
- Fully associative
  - Can be placed anywhere
- Set associative
  - A set is a group of n blocks -- each block is called a way
  - A block is first mapped into a set: (Block address) MOD (# of cache sets)
  - Then placed anywhere in the set
- Most caches are direct mapped, 2-way or 4-way set associative

Block Placement Options (Cont.)

Block Identification
- Many memory blocks may map to the same cache block
- Each cache block carries tags
- Address tags: which block am I?
  - Physical address now: address tag ## set index ## block offset
  - Note the relationship of block size, cache size, and tag size
  - The smaller the set tag, the cheaper it is to find
- Status tags: what state is the block in? (valid, dirty, etc.)
Physical address = r + m + n bits: r (address tag), m (set index, 2^m addressable sets in the cache), n (block offset, 2^n bytes per block)

Block Identification (Cont.)
Physical address = r + m + n bits: r (address tag), m (set index, 2^m addressable sets in the cache), n (block offset, 2^n bytes per block)
- Caches have an address tag on each block frame that gives the block address.
- A valid bit says whether or not this entry contains a valid address.
- The block frame address can be divided into the tag field and the index field.
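For concreteness, a small C sketch of this decomposition; the field widths (6 offset bits, 8 index bits) and the sample address are example values we chose, not taken from the slides.

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6   /* n: 2^6 = 64-byte blocks        */
#define INDEX_BITS  8   /* m: 2^8 = 256 addressable sets  */

int main(void) {
    uint32_t addr   = 0x12345678u;                          /* example physical address */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);     /* block offset (n bits)    */
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); /* set index    */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);   /* address tag (r bits)     */
    printf("tag = 0x%x, index = %u, offset = %u\n", tag, index, offset);
    return 0;
}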

Block Replacement
- Random: just pick one and chuck it
  - A simple hash game played on the target block frame address
  - Some use truly random
    - But lack of reproducibility is a problem at debug time
- LRU - least recently used
  - Need to keep the time since each block was last accessed
    - Expensive if the number of blocks is large, due to the global compare
    - Hence an approximation is often used: a use bit per tag, and LFU
- FIFO placement
(For direct-mapped caches there is only one choice.)

Data Cache Misses Per 1000 Instructions
64-byte blocks on an Alpha, using 10 SPEC2000 benchmarks

Short Summaries from the Previous Figure
- Higher associativity is better for small caches
- 2-way or 4-way associative performs similarly to 8-way associative for larger caches
- A larger cache size is better
- LRU is the best for small block sizes
- Random works fine for large caches
- FIFO outperforms random in smaller caches
- Little difference between LRU and random for larger caches

Improving Cache Performance
- The MIPS mix is 10% stores and 37% loads
  - Writes are about 10%/(100%+10%+37%) = 7% of overall memory traffic, and 10%/(10%+37%) = 21% of data cache traffic
- Make the common case fast
  - Implies optimizing caches for reads
- Read optimizations
  - The block can be read concurrently with the tag comparison
  - On a hit, the read information is passed on
  - On a miss - nuke the block and start the miss access
- Write optimizations
  - Can't modify until after the tag check - hence writes take longer

Write Options
- Write through: the write is posted to the cache line and through to the next lower level
  - Incurs a write stall (use an intermediate write buffer to reduce the stall)
- Write back
  - Only write to the cache, not to the lower level
  - Implies that cache and main memory are now inconsistent
    - Mark the line with a dirty bit
    - If this block is replaced and dirty, then write it back
- Pros and cons - both are useful
  - Write through
    - No write on a read miss, simpler to implement, no inconsistency with main memory
  - Write back
    - Uses less main memory bandwidth, write times independent of main memory speeds
    - Multiple writes within a block require only one write to the main memory

5.3 Cache Performance

Cache Performance

Cache Performance Example
- Each instruction takes 2 clock cycles (ignoring memory stalls)
- Cache miss penalty - 50 clock cycles
- Miss rate = 2%
- Average of 1.33 memory references per instruction
  - Ideal: IC * 2 * cycle-time
  - With cache: IC * (2 + 1.33 * 2% * 50) * cycle-time = IC * 3.33 * cycle-time
  - No cache: IC * (2 + 1.33 * 100% * 50) * cycle-time
- The importance of cache for CPUs with lower CPI and higher clock rates is greater - Amdahl's Law
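The arithmetic above can be checked with a few lines of C; the constants are the slide's numbers, while the structure of the program is ours.

#include <stdio.h>

int main(void) {
    double base_cpi = 2.0, refs_per_instr = 1.33;
    double miss_rate = 0.02, miss_penalty = 50.0;
    /* Effective CPI with the cache and with no cache at all. */
    printf("with cache: CPI = %.2f\n", base_cpi + refs_per_instr * miss_rate * miss_penalty); /* 3.33  */
    printf("no cache:   CPI = %.2f\n", base_cpi + refs_per_instr * 1.0 * miss_penalty);       /* 68.50 */
    return 0;
}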

Average Memory Access Time vs. CPU Time
- Compare two different cache organizations
  - Miss rate: direct-mapped (1.4%), 2-way associative (1.0%)
  - Clock-cycle-time: direct-mapped (2.0 ns), 2-way associative (2.2 ns)
- CPI with a perfect cache - 2.0; average memory references per instruction - 1.3; miss penalty - 70 ns; hit time - 1 CC
  - Average Memory Access Time = Hit-time + Miss-rate * Miss-penalty
    - AMAT(direct) = 1 * 2 + (1.4% * 70) = 2.98 ns
    - AMAT(2-way) = 1 * 2.2 + (1.0% * 70) = 2.90 ns
  - CPU Time
    - CPU(direct) = IC * (2 * 2 + 1.3 * 1.4% * 70) = 5.27 * IC
    - CPU(2-way) = IC * (2 * 2.2 + 1.3 * 1.0% * 70) = 5.31 * IC
- Since CPU time is our bottom-line evaluation, and since direct mapped is simpler to build, the preferred cache is direct mapped in this example

Unified and Split Caches
- Unified - 32 KB cache; Split - 16 KB IC and 16 KB DC
- Hit time - 1 clock cycle; miss penalty - 100 clock cycles
- A load/store hit takes 1 extra clock cycle for the unified cache
- 36% load/store - references to the cache: 74% instruction, 26% data
  - Miss rate (16 KB instruction) = 3.82/1000/1.0 = 0.004
    Miss rate (16 KB data) = 40.9/1000/0.36 = 0.114
  - Miss rate for the split cache = (74% * 0.004) + (26% * 0.114) = 0.0324
    Miss rate for the unified cache = 43.3/1000/(1 + 0.36) = 0.0318
  - Average-memory-access-time = %inst * (hit-time + inst-miss-rate * miss-penalty) + %data * (hit-time + data-miss-rate * miss-penalty)
  - AMAT(split) = 74% * (1 + 0.004 * 100) + 26% * (1 + 0.114 * 100) = 4.24
  - AMAT(unified) = 74% * (1 + 0.0318 * 100) + 26% * (1 + 1 + 0.0318 * 100) = 4.44 (the extra 1 is the additional load/store hit cycle of the unified cache)
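A short C sketch reproducing the split-vs-unified comparison; the miss counts and access fractions are the slide's, the variable names are ours, and the extra load/store hit cycle of the unified cache appears as the 2.0 hit time on the data side.

#include <stdio.h>

int main(void) {
    double penalty = 100.0;
    double f_inst = 0.74, f_data = 0.26;
    double mr_inst    = 3.82 / 1000.0;          /* 16 KB instruction cache, per access */
    double mr_data    = 40.9 / 1000.0 / 0.36;   /* 16 KB data cache, per access        */
    double mr_unified = 43.3 / 1000.0 / 1.36;   /* 32 KB unified cache, per access     */
    double amat_split   = f_inst * (1.0 + mr_inst * penalty)
                        + f_data * (1.0 + mr_data * penalty);
    double amat_unified = f_inst * (1.0 + mr_unified * penalty)
                        + f_data * (2.0 + mr_unified * penalty);   /* extra hit cycle  */
    printf("split = %.2f, unified = %.2f\n", amat_split, amat_unified); /* 4.24, 4.44 */
    return 0;
}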

Improving Cache Performance
- Average-memory-access-time = Hit-time + Miss-rate * Miss-penalty
- Strategies for improving cache performance
  - Reducing the miss penalty
  - Reducing the miss rate
  - Reducing the miss penalty or miss rate via parallelism
  - Reducing the time to hit in the cache

5.4 Reducing Cache Miss Penalty

Techniques for Reducing Miss Penalty
- Multilevel caches (the most important)
- Critical word first and early restart
- Giving priority to read misses over writes
- Merging write buffer
- Victim caches

Multi-Level Caches
- Probably the best miss-penalty reduction
- Performance measurement for 2-level caches
  - AMAT = Hit-time-L1 + Miss-rate-L1 * Miss-penalty-L1
  - Miss-penalty-L1 = Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2
  - AMAT = Hit-time-L1 + Miss-rate-L1 * (Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2)
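A one-function C sketch of the two-level formula; the names are illustrative, and the sample numbers in the comment match the miss-rate example a few slides below.

/* AMAT for a two-level cache hierarchy, in clock cycles. */
double amat_two_level(double hit_l1, double miss_l1,
                      double hit_l2, double miss_l2, double penalty_l2) {
    double miss_penalty_l1 = hit_l2 + miss_l2 * penalty_l2;
    return hit_l1 + miss_l1 * miss_penalty_l1;
}
/* amat_two_level(1.0, 0.04, 10.0, 0.5, 100.0) evaluates to 3.4 */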

Multi-Level Caches (Cont.)
- Definitions:
  - Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss-rate-L2)
  - Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss-rate-L1 x Miss-rate-L2)
  - The global miss rate is what matters
- Advantages:
  - Capacity misses in L1 end up with a significant penalty reduction since they will likely be supplied from L2
    - No need to go to main memory
  - Conflict misses in L1 similarly will be supplied by L2

Miss Rate Example
- Suppose that in 1000 memory references there are 40 misses in the first-level cache and 20 misses in the second-level cache
  - Miss rate for the first-level cache = 40/1000 (4%)
  - Local miss rate for the second-level cache = 20/40 (50%)
  - Global miss rate for the second-level cache = 20/1000 (2%)

Miss Rate Example (Cont.)
- Assume miss-penalty-L2 is 100 CC, hit-time-L2 is 10 CC, hit-time-L1 is 1 CC, and there are 1.5 memory references per instruction. What are the average memory access time and the average stall cycles per instruction? Ignore the impact of writes.
  - AMAT = Hit-time-L1 + Miss-rate-L1 * (Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2) = 1 + 4% * (10 + 50% * 100) = 3.4 CC
  - Average memory stalls per instruction = Misses-per-instruction-L1 * Hit-time-L2 + Misses-per-instruction-L2 * Miss-penalty-L2 = (40 * 1.5/1000) * 10 + (20 * 1.5/1000) * 100 = 3.6 CC
    - Or (3.4 - 1.0) * 1.5 = 3.6 CC

Critical Word First and Early Restart
- Do not wait for the full block to be loaded before restarting the CPU
  - Critical word first - request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first.
  - Early restart - as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
- The benefits of critical word first and early restart depend on
  - Block size: generally useful only for large blocks
  - The likelihood of another access to the portion of the block that has not yet been fetched
    - Spatial locality problem: the program tends to want the next sequential word, so it is not clear how much the block benefits

Giving Priority to Read Misses Over Writes
- In write through, write buffers complicate memory accesses in that they might hold the updated value of a location needed on a read miss
  - RAW conflicts with main memory reads on cache misses
- Making the read miss wait until the write buffer is empty increases the read miss penalty (by 50% on the old MIPS 1000 with a 4-word buffer)
- Instead, check the write buffer contents before the read, and if there are no conflicts, let the memory access continue
- Write back?
  - Read miss replacing a dirty block
  - Normal: write the dirty block to memory, and then do the read
  - Instead, copy the dirty block to a write buffer, then do the read, and then do the write
  - The CPU stalls less since it restarts as soon as the read is done

Merging Write Buffer
- An entry of the write buffer often contains multiple words. However, a write often involves a single word
  - A single-word write occupies a whole entry if there is no write merging
- Write merging: check to see if the address of new data matches the address of a valid write buffer entry. If so, the new data are combined with that entry
- Advantages
  - Multi-word writes are usually faster than single-word writes
  - Reduces the stalls due to the write buffer being full
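A sketch of the merging check in C; the buffer layout (4 one-word slots per entry, 8-byte words) is a simplification we chose for illustration, not the organization in the figure that follows.

#include <stdint.h>
#include <stdbool.h>

#define WORDS_PER_ENTRY 4

struct wb_entry {
    bool     valid;
    uint64_t block_addr;                  /* address of the aligned 4-word block */
    uint64_t word[WORDS_PER_ENTRY];
    bool     word_valid[WORDS_PER_ENTRY];
};

/* Try to merge a one-word write into an existing buffer entry; true on success. */
bool try_merge(struct wb_entry *buf, int entries, uint64_t addr, uint64_t data) {
    uint64_t block = addr / (WORDS_PER_ENTRY * 8);
    int      word  = (int)((addr / 8) % WORDS_PER_ENTRY);
    for (int i = 0; i < entries; i++) {
        if (buf[i].valid && buf[i].block_addr == block) {
            buf[i].word[word]       = data;   /* combine with the existing entry */
            buf[i].word_valid[word] = true;
            return true;
        }
    }
    return false;                             /* caller allocates a new entry    */
}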

Write-Merging Illustration

Victim Caches
- Remember what was just discarded in case it is needed again
- Add a small fully associative cache (called a victim cache) between the cache and the refill path
  - It contains only blocks discarded from the cache because of a miss
  - It is checked on a miss to see if it has the desired data before going to the next lower level of memory
    - If yes, swap the victim block and the cache block
  - Addressing both the victim and the regular cache at the same time
    - The penalty will not increase
- Jouppi (DEC SRC) shows a miss reduction of 20-95%
  - For a 4 KB direct-mapped cache with 1-5 victim blocks

Victim Cache Organization

5.5 Reducing Miss Rate

Classify Cache Misses - 3 C's
- Compulsory - independent of cache size
  - First access to a block - no choice but to load it
  - Also called cold-start or first-reference misses
- Capacity - decreases as cache size increases
  - The cache cannot contain all the blocks needed during execution, so blocks are discarded and later retrieved
- Conflict (collision) - decreases as associativity increases
  - Side effect of set-associative or direct mapping
  - A block may be discarded and later retrieved if too many blocks map to the same cache block

Techniques for Reducing Miss Rate
- Larger block size
- Larger caches
- Higher associativity
- Way prediction caches
- Compiler optimizations

Larger Block Sizes
- Obvious advantage: reduces compulsory misses
  - The reason is spatial locality
- Obvious disadvantages
  - Higher miss penalty: a larger block takes longer to move
  - May increase conflict misses and capacity misses if the cache is small
- Don't let the increase in miss penalty outweigh the decrease in miss rate

Large Caches
- Help with both conflict and capacity misses
- May need longer hit time and/or higher hardware cost
- Popular in off-chip caches

Higher Associativity
- 8-way set associative is, for practical purposes, as effective in reducing misses as fully associative
- 2:1 rule of thumb
  - A 2-way set associative cache of size N/2 misses about as often as a direct-mapped cache of size N (held for cache sizes < 128 KB)
- Greater associativity comes at the cost of increased hit time
  - Lengthens the clock cycle
  - Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%

Effect of Higher Associativity for AMAT
Clock-cycle-time(2-way) = 1.10 * Clock-cycle-time(1-way)
Clock-cycle-time(4-way) = 1.12 * Clock-cycle-time(1-way)
Clock-cycle-time(8-way) = 1.14 * Clock-cycle-time(1-way)
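To see how the longer cycle time trades against the lower miss rate, here is a tiny C sketch; the miss rates and timing constants are placeholders we made up, not the data behind the slide, while the cycle-time scaling factors are the slide's.

#include <stdio.h>

int main(void) {
    const char *ways[]   = {"1-way", "2-way", "4-way", "8-way"};
    double cycle_scale[] = {1.00, 1.10, 1.12, 1.14};     /* from the slide            */
    double miss_rate[]   = {0.050, 0.042, 0.040, 0.039}; /* hypothetical miss rates   */
    double cycle_ns = 2.0, penalty_ns = 25.0;            /* hypothetical timings      */
    for (int i = 0; i < 4; i++) {
        double amat = cycle_scale[i] * cycle_ns + miss_rate[i] * penalty_ns;
        printf("%s: AMAT = %.2f ns\n", ways[i], amat);
    }
    return 0;
}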

Way Prediction
- Extra bits are kept in the cache to predict the way, or block within the set, of the next cache access
- The multiplexor is set early to select the desired block, and only a single tag comparison is performed that clock cycle
- A miss results in checking the other blocks for matches in subsequent clock cycles
- The Alpha 21264 uses way prediction in its 2-way set-associative instruction cache. Simulation using SPEC95 suggested way prediction accuracy is in excess of 85%

Compiler Optimization for Code
- Code can easily be rearranged without affecting correctness
- Reordering the procedures of a program might reduce instruction miss rates by reducing conflict misses
- McFarling's observation using profiling information [1988]
  - Reduced misses by 50% for a 2 KB direct-mapped instruction cache with 4-byte blocks, and by 75% in an 8 KB cache
  - Optimized programs on a direct-mapped cache missed less than unoptimized ones on an 8-way set-associative cache of the same size

Compiler Optimization for Data
- Idea - improve the spatial and temporal locality of the data
- Lots of options
  - Array merging - allocate arrays so that paired operands show up in the same cache block
  - Loop interchange - exchange inner and outer loop order to improve cache performance
  - Loop fusion - for independent loops accessing the same data, fuse these loops into a single aggregate loop
  - Blocking - do as much as possible on a sub-block before moving on

Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val and key; improves spatial locality.

Loop Interchange Example

/* Before */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improves spatial locality.

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

The two loops perform different computations on the same data, so fuse them: 2 misses per access to a and c become one miss per access; improves temporal locality.

5.7 Reducing Hit Time

Reducing Hit Time
- Hit time is critical because it affects the clock cycle time
  - On many machines, cache access time limits the clock cycle rate
- A fast hit time is multiplied in importance beyond the average memory access time formula because it helps everything
  - Average-Memory-Access-Time = Hit-Access-Time + Miss-Rate * Miss-Penalty
    - The miss penalty is clock-cycle dependent

Techniques for Reducing Hit Time
- Small and simple caches
- Avoiding address translation during indexing of the cache
- Pipelined cache access
- Trace caches

Small and Simple Caches
- A time-consuming portion of a cache hit: using the index portion to read the tag and then comparing it to the address
- Small caches - smaller hardware is faster
  - Keep the L1 cache small enough to fit on the same chip as the CPU
  - Keep the tags on-chip, and the data off-chip, for L2 caches
- Simple caches - direct-mapped cache
  - Trading hit time for increased miss rate
    - A small direct-mapped cache misses more often than a small associative cache
    - But the simpler structure makes the hit go faster

Pipelining Writes for Fast Write Hits - Pipelined Cache
- Write hits usually take longer than read hits
  - The tag must be checked before writing the data
- Pipeline the write
  - 2 stages - tag check and cache update (can be more in practice)
  - The current write's tag check overlaps the previous write's cache update
- Result
  - It looks like a write happens on every cycle
  - Cycle time can stay short since the real write is spread over two stages
  - Mostly works if the CPU is not dependent on data from a write
    - Spot any problems if read and write ordering is not preserved by the memory system?
- Reads play no part in this pipeline since they already operate in parallel with the tag check

Cache Optimization Summary

5.9 Main Memory

Main Memory -- 3 Important Issues
- Capacity
- Latency
  - Access time: the time between when a read is requested and when the word arrives
  - Cycle time: the minimum time between requests to memory (> access time)
    - Memory needs the address lines to be stable between accesses
  - Amortize the latency by addressing big chunks - like an entire cache block
  - Critical to cache performance when the miss goes to main memory
- Bandwidth -- the number of bytes read or written per unit time
  - Affects the time it takes to transfer the block

Example of Memory Latency and Bandwidth
- Consider
  - 4 cycles to send the address
  - 56 cycles per word of access
  - 4 cycles to transmit the data
- Hence, if main memory is organized by word
  - 64 cycles must be spent for every word we want to access
- Given a cache line of 4 words (8 bytes per word)
  - 256 cycles is the miss penalty
  - Memory bandwidth = 1/8 byte per clock cycle (4 * 8 / 256)
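The same arithmetic, spelled out in a short C sketch; the cycle counts come from the slide, everything else is just bookkeeping.

#include <stdio.h>

int main(void) {
    int addr_cycles = 4, access_cycles = 56, transfer_cycles = 4;
    int words_per_block = 4, bytes_per_word = 8;
    int per_word     = addr_cycles + access_cycles + transfer_cycles;   /* 64 cycles  */
    int miss_penalty = words_per_block * per_word;                      /* 256 cycles */
    double bandwidth = (double)(words_per_block * bytes_per_word) / miss_penalty;
    printf("miss penalty = %d cycles, bandwidth = %.3f bytes/cycle\n",
           miss_penalty, bandwidth);                                    /* 0.125 B/CC */
    return 0;
}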

Improving Main Memory Performance
- Simple:
  - CPU, cache, bus, and memory are the same width (32 or 64 bits)
- Wide:
  - CPU/mux is 1 word; mux/cache, bus, and memory are N words (Alpha: 64 bits and 256 bits; UltraSPARC: 512)
- Interleaved:
  - CPU, cache, and bus are 1 word; memory has N modules (e.g. 4 modules); the example is word-interleaved

3 Examples of Bus Width, Memory Width, and Memory Interleaving to Achieve Memory Bandwidth

Wider Main Memory
- Doubling or quadrupling the width of the cache or memory will double or quadruple the memory bandwidth
  - The miss penalty is reduced correspondingly
- Cost and drawbacks
  - Higher cost for the memory bus
  - The multiplexer between the cache and the CPU may be on the critical path (the CPU still accesses the cache one word at a time)
    - Multiplexors can be put between L1 and L2
  - The design of error correction becomes more complicated
    - If only a portion of the block is updated, all other portions must be read for calculating the new error correction code
  - Since main memory is traditionally expandable by the customer, the minimum increment is doubled or quadrupled

Simple Interleaved Memory
- Memory chips are organized into banks to read or write multiple words at a time, rather than a single word
  - Banks share address lines with a memory controller
  - Keep the memory bus the same but make it run faster
  - Takes advantage of the potential memory bandwidth of all DRAM banks
  - The banks are often one word wide
  - Good for accessing consecutive memory locations
- Miss penalty of 4 + 56 + 4 * 4 = 76 CC (0.4 bytes per CC)
- Bank_# = Address MOD #_of_banks
  Address_within_bank = Floor(Address / #_of_banks)
  Interleaving factor = #_of_banks (usually a power of 2)
(Figure: organization of four-way interleaved memory)
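A few lines of C showing the bank-mapping formulas for word addresses; 4 banks are assumed to match the four-way figure, and the loop is only for illustration.

#include <stdio.h>

#define NUM_BANKS 4

int main(void) {
    for (unsigned word_addr = 0; word_addr < 8; word_addr++) {
        unsigned bank   = word_addr % NUM_BANKS;  /* Bank_# = address MOD #_of_banks */
        unsigned offset = word_addr / NUM_BANKS;  /* Floor(address / #_of_banks)     */
        printf("word %u -> bank %u, address within bank %u\n", word_addr, bank, offset);
    }
    return 0;
}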

What Can Interleaving and a Wide Memory Buy?
- Block size = 1, 2, or 4 words; miss rate = 3%, 2%, 1.2% respectively
- Memory bus width = 1 word; memory accesses per instruction = 1.2
- Cache miss penalty = 64 cycles (as above)
- Average cycles per instruction (ignoring cache misses) = 2
- Block size = 1 word
  - CPI = 2 + (1.2 * 3% * 64) = 4.3
- Block size = 2 words
  - 64-bit bus and memory, no interleaving: 2 + (1.2 * 2% * 2 * 64) = 5.07
  - 64-bit bus and memory, interleaving: 2 + (1.2 * 2% * (4 + 56 + 2*4)) = 3.63
  - 128-bit bus and memory, no interleaving: 2 + (1.2 * 2% * 1 * 64) = 3.54
- Block size = 4 words
  - 64-bit bus and memory, no interleaving: 2 + (1.2 * 1.2% * 4 * 64) = 5.69
  - 64-bit bus and memory, interleaving: 2 + (1.2 * 1.2% * (4 + 56 + 4*4)) = 3.09
  - 128-bit bus and memory, no interleaving: 2 + (1.2 * 1.2% * 2 * 64) = 3.84

Simple Interleaved Memory (Cont.)
- Interleaved memory is logically a wide memory, except that accesses to the banks are staged over time to share the bus
- How many banks should be included?
  - More than the number of clock cycles to access a word in a bank
    - The goal is to deliver information from a new bank each clock cycle for sequential accesses, avoiding waiting
- Disadvantages
  - Making multiple banks is expensive as chips grow larger and fewer chips are needed
    - 512 MB RAM: 256 chips of 4M x 4 bits form 16 banks of 16 chips; 16 chips of 64M x 4 bits form only 1 bank
  - More difficulty in main memory expansion (as with wider memory)

5.9 Memory Technology

DRAM Technology
- Semiconductor Dynamic Random Access Memory
- Emphasis is on cost per bit and capacity
- Multiplexed address lines cut the number of address pins in half
  - Row access strobe (RAS) first, then column access strobe (CAS)
  - Memory is organized as a 2D matrix - a row access moves a whole row into a buffer
  - A subsequent CAS selects the subrow
- Uses only a single transistor to store a bit
  - Reading that bit can destroy the information
  - Refresh each bit periodically (e.g. every 8 milliseconds) by writing it back
    - Keep refreshing time to less than 5% of the total time
- DRAM capacity is 4 to 8 times that of SRAM

DRAM Technology (Cont.)
- DIMM: dual inline memory module
  - DRAM chips are commonly sold on small boards called DIMMs
  - DIMMs typically contain 4 to 16 DRAMs
- Slowing down in DRAM capacity growth
  - Four times the capacity every three years, for more than 20 years
  - New chips only double capacity every two years, since 1998
- DRAM performance is growing at a slower rate
  - RAS (related to latency): 5% per year
  - CAS (related to bandwidth): 10%+ per year

RAS Improvement
A performance improvement in RAS of about 5% per year

SRAM Technology
- Cache uses SRAM: Static Random Access Memory
- SRAM uses six transistors per bit to prevent the information from being disturbed when read, so there is no need to refresh
  - SRAM needs only minimal power to retain the charge in standby mode - good for embedded applications
  - There is no difference between access time and cycle time for SRAM
- Emphasis is on speed and capacity
  - SRAM address lines are not multiplexed
- SRAM speed is 8 to 16x that of DRAM

ROM and Flash
- Embedded processor memory
- Read-only memory (ROM)
  - Programmed at the time of manufacture
  - Only a single transistor per bit to represent 1 or 0
  - Used for the embedded program and for constants
  - Nonvolatile and indestructible
- Flash memory:
  - Nonvolatile, but allows the memory to be modified
  - Reads at almost DRAM speeds, but writes 10 to 100 times slower
  - DRAM capacity per chip and megabytes per dollar are about 4 to 8 times greater than flash

Improving Memory Performance in a Standard DRAM Chip
- Fast page mode: timing signals that allow repeated accesses to the row buffer without another row access time
- Synchronous DRAM (SDRAM): add a clock signal to the DRAM interface, so that repeated transfers do not bear the overhead of synchronizing with the controller
  - Asynchronous DRAM involves overhead to sync with the controller
  - Peak speed per memory module 800-1200 MB/sec in 2001
- Double data rate (DDR): transfer data on both the rising edge and the falling edge of the DRAM clock signal
  - Peak speed per memory module 1600-2400 MB/sec in 2001

5.10 Virtual Memory

Virtual Memory
- Virtual memory divides physical memory into blocks (called pages or segments) and allocates them to different processes
- With virtual memory, the CPU produces virtual addresses that are translated by a combination of hardware and software to physical addresses, which access main memory. The process is called memory mapping or address translation
- Today, the two memory-hierarchy levels controlled by virtual memory are DRAMs and magnetic disks

Example of Virtual to Physical Address Mapping by a Page Table

Address Translation Hardware for Paging
(Figure: the physical address consists of a frame number f (l-n bits) and a frame offset d (n bits); the page table supplies the frame number, and the offset is carried through unchanged.)
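A minimal C sketch of the translation step; the 12-bit offset (4 KB pages) and the flat page-table array are assumptions for illustration, not details taken from the figure.

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 12u                       /* n: 4 KB pages (assumed)         */

/* Translate a virtual address using a flat page table of frame numbers. */
uint32_t translate(uint32_t vaddr, const uint32_t *page_table) {
    uint32_t page   = vaddr >> OFFSET_BITS;                /* p: virtual page number */
    uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1u);  /* d: page offset         */
    uint32_t frame  = page_table[page];                    /* f: frame number        */
    return (frame << OFFSET_BITS) | offset;                /* physical address       */
}

int main(void) {
    uint32_t page_table[4] = {7, 3, 0, 5};                 /* toy table: page -> frame */
    printf("0x%x\n", translate(0x00001abcu, page_table));  /* page 1 -> frame 3: 0x3abc */
    return 0;
}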

Page Table When Some Pages Are Not in Main Memory
(Figure: invalid page-table entries mark pages that are not in main memory; referencing one is flagged as an illegal access.)

Virtual Memory (Cont.)
- Permits applications to grow bigger than the main memory size
- Helps with multiple process management
  - Each process gets its own chunk of memory
  - Permits protection of one process's chunks from another
  - Mapping of multiple chunks onto shared physical memory
  - Mapping also facilitates relocation (a program can run in any memory location, and can be moved during execution)
  - The application and CPU run in virtual space (logical memory, 0 - max)
  - Mapping onto physical space is invisible to the application
- Cache vs. VM
  - A block becomes a page or segment
  - A miss becomes a page or address fault

Typical Page Parameters

Cache vs. VM Differences
- Replacement
  - A cache miss is handled by hardware
  - A page fault is usually handled by the OS
- Addresses
  - The VM space is determined by the address size of the CPU
  - The cache space is independent of the CPU address size
- Lower-level memory
  - For caches, the main memory is not shared by something else
  - For VM, most of the disk contains the file system
    - The file system is addressed differently - usually in I/O space
    - The VM lower level is usually called SWAP space

2 VM Styles - Paged or Segmented?
- Virtual systems can be categorized into two classes: pages (fixed-size blocks) and segments (variable-size blocks)
- Words per address: page - one; segment - two (segment and offset)
- Programmer visible? page - invisible to the application programmer; segment - may be visible to the application programmer
- Replacing a block: page - trivial (all blocks are the same size); segment - hard (must find a contiguous, variable-size, unused portion of main memory)
- Memory use inefficiency: page - internal fragmentation (unused portion of a page); segment - external fragmentation (unused pieces of main memory)
- Efficient disk traffic: page - yes (adjust page size to balance access time and transfer time); segment - not always (small segments may transfer just a few bytes)

Virtual Memory - The Same 4 Questions
- Block placement
  - Choice: lower miss rates and complex placement, or vice versa
    - The miss penalty is huge, so choose a low miss rate - place anywhere
    - Similar to the fully associative cache model
- Block identification - both use an additional data structure
  - Fixed-size pages - use a page table
  - Variable-sized segments - use a segment table

Address Translation Hardware for Paging
(Figure repeated: frame number f (l-n bits) and frame offset d (n bits) form the physical address.)

Virtual Memory - The Same 4 Questions (Cont.)
- Block replacement - LRU is the best
  - However, true LRU is a bit complex, so use an approximation
    - The page table contains a use tag, and on an access the use tag is set
    - The OS checks them every so often, records what it sees in a data structure, and then clears them all
    - On a miss, the OS decides which page has been used the least and replaces that one
- Write strategy - always write back
  - Due to the access time of the disk, write through is silly
  - Use a dirty bit to write back only pages that have been modified

Techniques for Fast Address Translation
- The page table is kept in main memory (kernel memory)
  - Each process has a page table
- Every data/instruction access requires two memory accesses
  - One for the page table and one for the data/instruction
  - Can be solved by the use of a special fast-lookup hardware cache called associative registers or translation look-aside buffers (TLBs)
- If locality applies, then cache the recent translations
  - TLB = translation look-aside buffer
  - TLB entry: virtual page number, physical page number, protection bit, use bit, dirty bit

TLB = Translation Look-aside Buffer
- The TLB must be on chip; otherwise it is worthless
  - Fully associative - parallel search
- Typical TLBs
  - Hit time: 1 cycle
  - Miss penalty: 10 to 30 cycles
  - Miss rate: 0.1% to 2%
  - TLB size: 32 B to 8 KB
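A small C sketch of a fully associative TLB lookup; the entry format and the 32-entry size are illustrative, and real hardware searches all entries in parallel, which the loop below only simulates.

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 32

struct tlb_entry {
    bool     valid;
    uint32_t vpn;      /* virtual page number   */
    uint32_t pfn;      /* physical frame number */
    bool     dirty;    /* set when the page has been written */
};

/* Returns true on a TLB hit and fills *pfn; on a miss the caller walks the page table. */
bool tlb_lookup(const struct tlb_entry tlb[TLB_ENTRIES], uint32_t vpn, uint32_t *pfn) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;
            return true;
        }
    }
    return false;
}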

Paging Hardware with TLB

TLB of the Alpha 21264
- Address Space Number: a process ID, so the TLB does not have to be flushed on a context switch
- A total of 128 TLB entries

Page Size - An Architectural Choice
- Large pages are good:
  - Reduce page table size
  - Amortize the long disk access
  - If spatial locality is good, then the hit rate will improve
  - Reduce the number of TLB misses
- Large pages are bad:
  - More internal fragmentation
    - If everything is random, each structure's last page is only half full
  - Process start-up time takes longer