- Number of slides: 22
The Power of Priority: NoC-Based Distributed Cache Coherency
Evgeny Bolotin, Zvika Guz, Israel Cidon, Ran Ginosar, Avinoam Kolodny
QNoC Research Group, EE Department, Technion, Haifa, Israel
E. Bolotin – The Power of Priority, NoCs 2007

Chip Multi-Processor (CMP)
• Cores: Dual-Core, Multi-Core
• Cache: Large cache, Monolithic shared cache, Shared cache, Distributed cache
• NoC-based: How?

Future Cache - Physics Perspective
• Large cache → large access time
• Global wire delay
• Distance reached in a single cycle
§ Today: ~25% of chip
§ In 10 years: ~1% of chip
[Figure: Global Wires Delay. Global wire delay vs. gate delay (log scale, 0.1 to 100) over technology nodes from 250 down to 32. Source: ITRS 2003]
[Figure: Fraction of chip reachable in 1 clock cycle. Source: Keckler et al. ISSCC 2003]
Large monolithic cache is not scalable

NUCA - Non Uniform Cache Architecture
Banked cache over NoC
§ Smaller bank → smaller access time
§ Multiple banks → multiple ports
§ Closer bank → smaller access time
NUCA = non-uniform access times
Cache-line placement policy
• Static NUCA (SNUCA)
• Dynamic NUCA (DNUCA)
Sources: Kim et al. ASPLOS 2002; Beckmann et al. MICRO 2004

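To make the non-uniform access time concrete, the sketch below maps a cache-line address to a bank on a small mesh in a static-NUCA style and estimates latency as a fixed bank access time plus a per-hop NoC delay. The mesh size, the latencies, and the low-order-bits mapping are illustrative assumptions, not values from the talk.

```python
# Hypothetical parameters, chosen only for illustration.
MESH_W, MESH_H = 4, 4      # banks arranged as a 4x4 grid
LINE_BYTES = 64            # cache-line size
BANK_CYCLES = 3            # access time of one small bank
HOP_CYCLES = 1             # NoC delay per router hop


def snuca_bank(addr: int) -> tuple[int, int]:
    """Static NUCA (assumed mapping): low-order line-address bits pick the bank."""
    bank = (addr // LINE_BYTES) % (MESH_W * MESH_H)
    return bank % MESH_W, bank // MESH_W


def access_cycles(core_xy: tuple[int, int], addr: int) -> int:
    """One-way latency estimate: Manhattan hops to the bank plus bank access."""
    bx, by = snuca_bank(addr)
    hops = abs(core_xy[0] - bx) + abs(core_xy[1] - by)
    return hops * HOP_CYCLES + BANK_CYCLES


near = access_cycles((0, 0), 0 * LINE_BYTES)    # line 0 maps to bank (0, 0)
far = access_cycles((0, 0), 15 * LINE_BYTES)    # line 15 maps to bank (3, 3)
print(near, far)
```

Under these assumptions a core at (0, 0) sees a closer bank answer faster than a distant one, which is the effect a dynamic placement policy (DNUCA) tries to exploit by moving frequently used lines nearer to their users.
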
Issues in NUCA-based CMP
• NoC performance → CMP performance
• Cache coherency and transaction order (correctness)
• Search (in DNUCA)
• Different traffic types (e.g. fetch vs. prefetch)
• Synchronization (locks)
NoC services for CMP?

Cache Coherency over NoC
How do we maintain coherency over NoC?
• Snooping
• Central directory
• Distributed directory
[Figure: Cache bank with distributed directory]

Distributed Cache Coherency
Cache access → multiple NoC transactions
Example: simple read transaction
[Figure legend: Ctrl. packet, Data packet]

Read Transaction of Modified Block
[Figure legend: Ctrl. packet, Data packet]

Read Exclusive of Shared Block
[Figure legend: Ctrl. packet, Data packet]

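The diagrams for the three transactions did not survive extraction. As a stand-in, the sketch below lists the packets a generic directory-based protocol might exchange in each case. The message names, the roles (`req`, `home`, `owner`, `sharers`), and the exact sequences are assumptions for illustration and may differ from the protocol shown in the talk.

```python
from dataclasses import dataclass

CTRL, DATA = "ctrl", "data"   # short control packet vs. long data packet


@dataclass(frozen=True)
class Packet:
    kind: str   # CTRL or DATA
    src: str
    dst: str
    msg: str


def read_clean(req, home):
    """Simple read: the home bank holds a valid copy."""
    return [Packet(CTRL, req, home, "read request"),
            Packet(DATA, home, req, "block")]


def read_modified(req, home, owner):
    """Read of a block that another cache holds in modified state."""
    return [Packet(CTRL, req, home, "read request"),
            Packet(CTRL, home, owner, "forward request"),
            Packet(DATA, owner, req, "block"),
            Packet(DATA, owner, home, "write-back")]


def read_exclusive_shared(req, home, sharers):
    """Read-exclusive of a block currently shared by other caches."""
    pkts = [Packet(CTRL, req, home, "read-exclusive request")]
    pkts += [Packet(CTRL, home, s, "invalidate") for s in sharers]
    pkts += [Packet(CTRL, s, home, "invalidate ack") for s in sharers]
    pkts.append(Packet(DATA, home, req, "block"))
    return pkts


def count(pkts):
    """Return (number of ctrl packets, number of data packets)."""
    ctrl = sum(p.kind == CTRL for p in pkts)
    return ctrl, len(pkts) - ctrl


print(count(read_clean("P0", "B5")))
print(count(read_modified("P0", "B5", "P3")))
print(count(read_exclusive_shared("P0", "B5", ["P1", "P2"])))
```

Even in this simplified form, a single cache access expands into several NoC packets, and most of them are short control packets.
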
Basic NoC to Support CMP
Off-the-shelf (Vanilla) NoC:
• Grid of wormhole routers
• Unicast only
• Ordering in network
§ Static routing
§ No virtual channels
• Smart interfaces
Can we do better?
[Figure: Vanilla NoC]

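The slide names static routing without saying which scheme is used. A common choice on a grid is dimension-ordered (XY) routing, sketched below as an assumption. Because the path depends only on source and destination, packets between the same pair of nodes always follow the same route, which is one way a network without virtual channels can keep them in order.

```python
def xy_route(src, dst):
    """Dimension-ordered routing on a grid: travel along X first, then along Y.

    src and dst are (x, y) router coordinates; returns the list of routers visited.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))
```
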
Observations: L2 Access
A) Delay = queueing + NoC transactions
B) All NoC transactions are equally important
C) NoC transactions consist of:
• Short ctrl. packets
• Long data packets
Idea: differentiate between ctrl. and data
Solution: Preemptive Priority NoC
Give priority to short ctrl. packets

Preemptive Priority NoC: QNoC
Service Levels (SL):
• Dedicated wormhole buffer
• Preemptive priority scheduling
[Figures: Multiple SL link; Multiple SL router]

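The two ingredients listed above, a dedicated buffer per service level and preemptive priority scheduling, can be sketched for a single link at flit granularity. This is a simplified model under stated assumptions: one flit crosses the link per cycle, level 0 is the highest priority, and the function and packet names are hypothetical rather than taken from QNoC.

```python
from collections import deque


def schedule_link(arrivals, num_levels=2):
    """Simulate one output link with a separate buffer per service level.

    arrivals: list of (arrival_cycle, service_level, packet_id, num_flits).
    Each cycle the link sends one flit from the highest-priority non-empty
    buffer, so a newly arrived high-priority packet preempts a long
    low-priority packet between flits.
    Returns {packet_id: cycle at which its last flit has left}.
    """
    pending = sorted(arrivals)
    buffers = [deque() for _ in range(num_levels)]
    done = {}
    cycle = 0
    i = 0
    while i < len(pending) or any(buffers):
        # Move packets that have arrived into the buffer of their service level.
        while i < len(pending) and pending[i][0] <= cycle:
            _, level, pid, flits = pending[i]
            buffers[level].append([pid, flits])
            i += 1
        # Send one flit from the highest-priority non-empty buffer.
        for buf in buffers:
            if buf:
                buf[0][1] -= 1
                if buf[0][1] == 0:
                    done[buf[0][0]] = cycle + 1
                    buf.popleft()
                break
        cycle += 1
    return done


# A long data packet (8 flits) is already in flight when a 1-flit ctrl packet arrives.
with_priority = schedule_link([(0, 1, "data", 8), (2, 0, "ctrl", 1)])
same_level = schedule_link([(0, 1, "data", 8), (2, 1, "ctrl", 1)])
print(with_priority, same_level)
```

In this toy run the control packet leaves at cycle 3 with priority and at cycle 9 without it, while the data packet is delayed by only one cycle (9 instead of 8).
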
Example: Vanilla NoC
Without contention:
• X: delay of long packet
• δ: delay of short packet
Transaction 1: long data
Transaction 2: short req., long resp.
[Figure: nodes A and B]
Vanilla NoC example
• Blue delay ~ X
• Red delay ~ 2X + δ
• Average delay ~ 1.5X

Example: Priority NoC
Without contention:
• X: delay of long packet
• δ: delay of short packet
Transaction 1: long data
Transaction 2: short req., long resp.
[Figure: nodes A and B]
Vanilla NoC example
• Blue delay = X
• Red delay = 2X + δ
• Average delay ~ 1.5X
Priority NoC example
• Blue delay = X + δ
• Red delay = X + δ
• Average delay ~ X
Potential delay reduction ~ 0.5X

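The figures in the two examples above can be checked with a few lines of arithmetic. The sketch assumes that both transactions start at the same time, that blue is transaction 1 and red is transaction 2, that the short request shares a link with the long data packet, and that the long response meets no further contention. The numeric values of X and δ are arbitrary.

```python
def vanilla_delays(X, d):
    """No priority: the short request waits behind the whole long packet."""
    blue = X               # transaction 1: long data packet, unobstructed
    red = X + d + X        # transaction 2: wait X, send request (d), receive response (X)
    return blue, red


def priority_delays(X, d):
    """Preemptive priority: the short request overtakes the long packet."""
    blue = X + d           # transaction 1: long packet paused for the short request
    red = d + X            # transaction 2: request sent at once, then the long response
    return blue, red


X, d = 100, 2              # hypothetical delays, with d much smaller than X
v_blue, v_red = vanilla_delays(X, d)
p_blue, p_red = priority_delays(X, d)
v_avg = (v_blue + v_red) / 2
p_avg = (p_blue + p_red) / 2
print(v_avg, p_avg, v_avg - p_avg)
```

The average falls from 1.5X + 0.5δ to X + δ, a saving of 0.5X - 0.5δ, which is close to 0.5X when δ is much smaller than X.
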
Priority NoC: Different Destinations
Very important in wormhole
• When ctrl. packet is blocked by other worms
[Figure labels: Long data, Short req.]

Protocol Correctness
Need state-preserving serialization of transactions in the processor interface

Numerical Evaluation
• CMP simulator (SIMICS)
§ Simulate parallel benchmarks
§ Obtain L2-cache access traces
• QNoC simulator (OPNET)
§ Simulate distributed coherence protocol over NoC
§ Measure total RD/RX L2-access delay
§ Measure total program throughput

Priority NoC: Results
Delay Reduction vs. Network Load
[Figures: RD Delay - Apache; RD/RX Delay Reduction - Apache]
• Short ctrl. packet gets high priority
• Long data packet gets low priority

Priority NoC: Several Benchmarks
[Figures: Delay Reduction; Program Speedup]

So Far: The Power of Priority
• Simplicity - almost for free
• Significant CMP speed-up
Good for:
• Coherency
• Traffic differentiation (e.g. fetch vs. prefetch)
• Search in DNUCA
• Synchronization (locks)

Advanced Support Functions
• Special broadcast for short messages
§ Broadcast service (e.g. search in DNUCA)
§ Wormhole broadcast is slow and expensive → S&F (store-and-forward) broadcast embedded in wormhole
• Virtual ring
§ No additional cost
§ For invalidation multicast
§ Snooping or synchronization

Summary
NoC at CMP service!
• Shared cache over NoC
• Priority is powerful
• Built-in support functions


