  • Number of slides: 24

RegionScout: Exploiting Coarse Grain Sharing in Snoop Coherence
Andreas Moshovos
moshovos@eecg.toronto.edu
www.eecg.toronto.edu/aenao
Moshovos © 1

Improving Snoop Coherence
[Diagram: CPUs, each with I$ and D$, connected through an interconnect to main memory]
  • Conventional considerations: complexity and correctness, NOT power/bandwidth
  • Can we (1) reduce power/bandwidth and (2) still leverage snoop coherence?
      • Snoop coherence remains attractive: simple, design re-use
  • Yes: exploit program behavior to dynamically identify requests that do not need snooping

RegionScout: Avoid Some Snoops
[Diagram: CPUs, each with I$ and D$, connected through an interconnect to main memory]
  • Frequent case: non-sharing even at a coarse level (Region)
  • RegionScout: dynamically identify non-shared regions
      • First request to a region identifies it as not shared
      • Subsequent requests do not need to be broadcast
  • Uses imprecise information
      • Small structures
      • Layer on top of conventional coherence
      • No additional constraints

Roadmap
  • Conventional Coherence
      • The need for power-aware designs
  • Potential: Program Behavior
  • RegionScout: What and How
  • Implementation
  • Evaluation
  • Summary

Coherence Basics
[Diagram: a request for block X is snooped by the other CPUs; labels: snoop, snoop, hit; main memory below]
  • Given a request for memory block X (address)
  • Detect where its current value resides

Conventional Coherence not Power-Aware/Bandwidth-Effective
[Diagram: CPUs with L2 caches above main memory; labels: miss, miss]
  • All L2 tags see all accesses
  • Performance and complexity: the L2 tags are already there, so why not use them
  • Power: all L2 tags consume power on all accesses
  • Bandwidth: all coherent requests are broadcast

RegionScout Motivation: Sharing is Coarse
[Figure: typical memory space snapshot; addresses colored by owner(s)]
  • Region: large continuous memory area, power of 2 size
  • CPU X asks for data block in region R
      1. No one else has X
      2. No one else has any block in R
  • RegionScout exploits this behavior
  • Layered extension over snoop coherence

Optimization Opportunities
[Diagram: CPUs with I$ and D$ connected through a switch to memory]
  • Power and bandwidth
      • Originating node: avoid asking others
      • Remote node: avoid tag lookup

Potential: Region Miss Frequency
[Figure: global region misses as % of all requests versus region size; an arrow marks the "better" direction]
  • Even with a 16K Region, ~45% of requests miss in all remote nodes

RegionScout at Work: Non-Shared Region Discovery
[Diagram: two CPUs and main memory with numbered steps 1-3; labels: Region Miss, Global Region Miss; each node records non-shared regions and locally cached regions]
  • First request detects a non-shared region

RegionScout at Work: Avoiding Snoops
[Diagram: two CPUs and main memory with numbered steps 1-2; label: Global Region Miss; each node records non-shared regions and locally cached regions]
  • Subsequent request avoids snoops

RegionScout is Self-Correcting
[Diagram: two CPUs and main memory with numbered steps 1-2; each node records non-shared regions and locally cached regions]
  • Request from another node invalidates non-shared record
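The three "RegionScout at Work" slides can be read as one small protocol. The sketch below is only an illustration of that reading, not the authors' implementation: the `Node` and `System` classes, the 16K region size, and the returned strings are all assumptions, and each node tracks its locally cached regions exactly (a plain set of block addresses) rather than with the imprecise structures described later in the deck.

```python
REGION_BITS = 14  # assumed 16K regions: the low 14 address bits are the offset


class Node:
    """One processor node with its two hypothetical RegionScout records."""

    def __init__(self, name):
        self.name = name
        self.cached_blocks = set()  # block addresses held in the local cache
        self.non_shared = set()     # region tags discovered to be non-shared

    def has_region(self, tag):
        # "Do I have a block in the region?"
        return any(block >> REGION_BITS == tag for block in self.cached_blocks)

    def snoop(self, address):
        tag = address >> REGION_BITS
        # Self-correction: another node touched this region, so any
        # non-shared record for it is no longer valid.
        self.non_shared.discard(tag)
        return self.has_region(tag)


class System:
    def __init__(self, names):
        self.nodes = [Node(n) for n in names]
        self.broadcasts = 0

    def request(self, node, address):
        tag = address >> REGION_BITS
        if tag in node.non_shared:
            # Region known to be non-shared: skip the broadcast entirely.
            node.cached_blocks.add(address)
            return "no snoop"
        self.broadcasts += 1
        # A list (not a generator) so that every remote node really is snooped.
        region_hits = [n.snoop(address) for n in self.nodes if n is not node]
        node.cached_blocks.add(address)
        if not any(region_hits):
            node.non_shared.add(tag)  # first request discovers a non-shared region
            return "broadcast, global region miss"
        return "broadcast, region shared"


system = System(["A", "B"])
a, b = system.nodes
print(system.request(a, 0x4000))  # discovery
print(system.request(a, 0x4040))  # same region, snoop avoided
print(system.request(b, 0x4080))  # B's request invalidates A's record
print(system.request(a, 0x40C0))  # A must broadcast again
```

Only the first and the last two requests are broadcast; the second request is filtered because node A had recorded the region as non-shared.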

Implementation: Requirements
[Diagram: address split into Region Tag and offset; the offset is lg(Region Size) bits wide]
  • Requesting node provides the address
  • At originating node, from CPU:
      • Have I discovered that this region is not shared?
  • At remote nodes, from interconnect:
      • Do I have a block in the region?
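The address split on this slide amounts to dropping the low lg(Region Size) bits. A minimal sketch, assuming byte addresses and a 16K region; the names `region_tag` and `same_region` are illustrative and do not come from the slides:

```python
REGION_SIZE = 16 * 1024  # bytes; assumed value, must be a power of two


def region_tag(address: int, region_size: int = REGION_SIZE) -> int:
    """Drop the low lg(region_size) offset bits to obtain the region tag."""
    if region_size <= 0 or region_size & (region_size - 1):
        raise ValueError("region_size must be a power of two")
    offset_bits = region_size.bit_length() - 1  # lg(region_size)
    return address >> offset_bits


def same_region(a: int, b: int, region_size: int = REGION_SIZE) -> bool:
    """True if both addresses fall inside the same region."""
    return region_tag(a, region_size) == region_tag(b, region_size)


print(region_tag(0x3FFF), region_tag(0x4000))  # last byte of region 0, first byte of region 1
print(same_region(0x4000, 0x7FFF))             # both lie in region 1
```

Because the region size is a power of two, the tag is a plain shift, so both the originating node and the remote nodes can derive it from the request address without extra information.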

Remembering Non-Shared Regions
[Diagram: the Region Tag portion of the address indexes the Non-Shared Region Table; each entry holds a valid bit and a region tag]
  • Non-Shared Region Table: few entries, 16 x 4 in most experiments
  • Records non-shared regions
  • Lookup by Region portion prior to issuing a request
  • Snoop requests and invalidate
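One way to picture the table is as a small set-associative structure keyed by region tag. The sketch below is an assumption-laden illustration: it reads "16 x 4" as 16 sets of 4 ways, indexes sets with a simple modulo, and evicts the least recently used entry, none of which the slide specifies. The class and method names are hypothetical.

```python
class NonSharedRegionTable:
    """Hypothetical set-associative table of region tags known to be non-shared."""

    def __init__(self, num_sets: int = 16, ways: int = 4):
        self.num_sets = num_sets
        self.ways = ways
        # Each set is a list of valid tags, least recently used first.
        self.sets = [[] for _ in range(num_sets)]

    def _set_for(self, tag: int) -> list:
        return self.sets[tag % self.num_sets]  # assumed index function

    def contains(self, tag: int) -> bool:
        """Lookup performed before issuing a request."""
        return tag in self._set_for(tag)

    def record(self, tag: int) -> None:
        """Remember a region after a global region miss."""
        entries = self._set_for(tag)
        if tag in entries:
            entries.remove(tag)
        elif len(entries) == self.ways:
            # Dropping a record is always safe: the next request to that
            # region is simply broadcast as in conventional snooping.
            entries.pop(0)
        entries.append(tag)

    def invalidate(self, tag: int) -> None:
        """Called when a snooped request from another node hits this region."""
        entries = self._set_for(tag)
        if tag in entries:
            entries.remove(tag)


nsrt = NonSharedRegionTable()
nsrt.record(5)
print(nsrt.contains(5))  # True
nsrt.invalidate(5)
print(nsrt.contains(5))  # False
```

The table only ever needs to be a subset of the truly non-shared regions, which is why a small capacity and an arbitrary replacement policy do not affect correctness.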

What Regions are Locally Cached?
[Diagram: the Region Tag portion of the address selects a counter]
  • If we had as many counters as regions:
      • Block allocation: counter[region]++
      • Block eviction: counter[region]--
      • Region cached only if counter[region] non-zero
  • Not practical:
      • E.g., 16K Regions and 4G memory: 256K counters
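The counter scheme and the 256K figure can be checked with a few lines. This is a sketch of the idealized one-counter-per-region design the slide rules out; the class name and the underflow check are additions for illustration.

```python
from collections import defaultdict

REGION_SIZE = 16 * 1024      # 16K regions, as in the slide's example
MEMORY_SIZE = 4 * 1024 ** 3  # 4G of memory

# One counter per region: 2^32 / 2^14 = 2^18 = 256K counters.
num_regions = MEMORY_SIZE // REGION_SIZE
print(num_regions, num_regions == 256 * 1024)


class ExactRegionCounters:
    """Idealized scheme: one block counter per region (illustrative only)."""

    def __init__(self):
        self.counter = defaultdict(int)

    def on_block_allocation(self, region: int) -> None:
        self.counter[region] += 1

    def on_block_eviction(self, region: int) -> None:
        if self.counter[region] == 0:
            raise ValueError("eviction from a region with no cached blocks")
        self.counter[region] -= 1

    def is_cached(self, region: int) -> bool:
        # A region is locally cached only if its counter is non-zero.
        return self.counter.get(region, 0) != 0


counters = ExactRegionCounters()
counters.on_block_allocation(7)
counters.on_block_allocation(7)
counters.on_block_eviction(7)
print(counters.is_cached(7))  # one block of region 7 is still cached
counters.on_block_eviction(7)
print(counters.is_cached(7))  # the region has left the cache
```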

What Regions are Locally Cached?
[Diagram: the Region Tag is hashed to index the Cached Region Hash; each entry holds a counter and a p bit]
  • Cached Region Hash: few entries, e.g., 256
  • "Counter": + on block allocation, - on block eviction
  • P-bit: 1 if counter non-zero; used for lookups
  • Use few counters
      • Imprecise: records a superset of locally cached Regions
      • False positives: lost opportunity, correctness preserved
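The Cached Region Hash behaves like a small counting filter. The sketch below assumes a modulo hash and derives the p bit from the counter on demand instead of storing it separately; both choices, and all names, are illustrative rather than taken from the slides.

```python
class CachedRegionHash:
    """Hypothetical counting filter over region tags."""

    def __init__(self, num_entries: int = 256):
        self.counters = [0] * num_entries

    def _index(self, tag: int) -> int:
        return tag % len(self.counters)  # assumed hash function

    def on_block_allocation(self, tag: int) -> None:
        self.counters[self._index(tag)] += 1

    def on_block_eviction(self, tag: int) -> None:
        i = self._index(tag)
        if self.counters[i] == 0:
            raise ValueError("eviction without a matching allocation")
        self.counters[i] -= 1

    def p_bit(self, tag: int) -> bool:
        """True if some locally cached block MAY belong to this region."""
        return self.counters[self._index(tag)] != 0


crh = CachedRegionHash(num_entries=4)
crh.on_block_allocation(1)
print(crh.p_bit(1))  # True: region 1 really is cached
print(crh.p_bit(5))  # True: false positive, region 5 aliases with region 1
print(crh.p_bit(2))  # False: definitely not cached
```

A false positive only makes a remote node report a region hit it did not need to report, so the requester keeps broadcasting. A clear p bit can never be wrong, because every cached block incremented the counter its region maps to.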

Roadmap
  • Conventional Coherence
  • Program Behavior: Region Miss Frequency
  • RegionScout
  • Evaluation
  • Summary

Evaluation Overview
  • Methodology
  • Filter rates
      • Practical filters can capture many Region Misses
  • Interconnect bandwidth reduction

Methodology
  • In-house simulator based on SimpleScalar
      • Execution driven
      • All instructions simulated, MIPS-like ISA
      • System calls faked by passing them to host OS
      • Synchronization using load-linked/store-conditional
      • Simple in-order processors
      • Memory requests complete instantaneously
      • MESI snoop coherence
      • 1 or 2 level memory hierarchy
      • WATTCH power models
  • SPLASH II benchmarks
      • Scientific workloads
      • Feasibility study

Filter Rates
[Figure: identified global region misses versus CRH size; an arrow marks the "better" direction]
  • For a small CRH it is better to use large regions
  • Practical RegionScout filters capture a lot of the potential

Bandwidth Reduction
[Figure: messages versus region size; an arrow marks the "better" direction; panel label: CMP]
  • Moderate bandwidth savings for SMP (15%-22%)
  • More so for CMP (>25%)

Related Work
  • RegionScout
      • Technical Report, Dec. 2003
  • Jetty
      • Moshovos, Memik, Falsafi, Choudhary, HPCA 2001
  • PST
      • Ekman, Dahlgren, and Stenström, ISLPED 2002
  • Coarse-Grain Coherence
      • Cantin, Lipasti and Smith, ISCA 2005

Summary
  • Exploit program behavior/optimize a frequent case
      • Many requests result in a global region miss
  • RegionScout
      • Practical filter mechanism
      • Dynamically detect would-be region misses
      • Avoid broadcasts
      • Save tag lookup power and interconnect bandwidth
      • Small structures
      • Layered extension over existing mechanisms
      • Invisible to programmer and the OS

RegionScout and Directories
  • Different information
      • Directory: block-level sharing
      • RegionScout: Region-level sharing
          • Could build a Region-level directory
          • This work serves as motivation
  • Directories use precise information
      • RegionScout does not have to
  • Implementation
      • RegionScout can approximate a directory
      • If remote nodes sent sharing info as opposed to a single bit