

  • Number of slides: 47

18-742 Spring 2011 Parallel Computer Architecture Lecture 14: Interconnects III Prof. Onur Mutlu Carnegie Mellon University

Announcements
• No class Wednesday (Feb 23)
• No class Friday (Feb 25) – Open House
• Project Milestone due Thursday (Feb 24)
2

Reviews
• Due Today (Feb 21)
  • Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978.
  • Chappell et al., “Simultaneous Subordinate Microthreading (SSMT),” ISCA 1999.
• Due Sunday (Feb 27)
  • Reinhardt and Mukherjee, “Transient Fault Detection via Simultaneous Multithreading,” ISCA 2000.
  • Mukherjee et al., “Detailed Design and Evaluation of Redundant Multithreading Alternatives,” ISCA 2002.
3

Last Lectures
• Interconnection Networks
  • Introduction & Terminology
  • Topology
  • Buffering and Flow control
  • Routing
  • Router design
  • Network performance metrics
  • On-chip vs. off-chip differences
• Research on NoCs and packet scheduling
  • The problem with packet scheduling
  • Application-aware packet scheduling
  • Aergia: Latency slack based packet scheduling
4

Today
• Recap interconnection networks
• Multithreading
5

Some Questions
• What are the possible ways of handling contention in a router?
• What is head-of-line blocking?
• What is a non-minimal routing algorithm?
• What is the difference between deterministic, oblivious, and adaptive routing algorithms?
• What routing algorithms need to worry about deadlock?
• What routing algorithms need to worry about livelock?
• How to handle deadlock? How to handle livelock?
• What is zero-load latency? What is saturation throughput?
• What is an application-aware packet scheduling algorithm?
6

Routing Mechanism
• Arithmetic
  • Simple arithmetic to determine route in regular topologies
  • Dimension-order routing in meshes/tori
• Source Based
  • Source specifies output port for each switch in route
  + Simple switches (no control state: strip output port off header)
  - Large header
• Table Lookup Based
  • Index into table for output port
  + Small header
  - More complex switches
7

Routing Algorithm
• Types
  • Deterministic: always choose the same path
  • Oblivious: do not consider network state (e.g., random)
  • Adaptive: adapt to state of the network
• How to adapt
  • Local/global feedback
  • Minimal or non-minimal paths
8

Deterministic Routing
• All packets between the same (source, dest) pair take the same path
• Dimension-order routing
  • E.g., XY routing (used in Cray T3D, and many on-chip networks)
  • First traverse dimension X, then traverse dimension Y
+ Simple
+ Deadlock freedom (no cycles in resource allocation)
- Could lead to high contention
- Does not exploit path diversity
9
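As a concrete illustration of XY dimension-order routing, here is a minimal Python sketch (the function name and direction labels are hypothetical, not from the lecture):

```python
def xy_route(src, dst):
    """XY dimension-order route on a 2D mesh: fully traverse the X
    dimension first, then the Y dimension.  src and dst are (x, y)."""
    (x, y), (dx, dy) = src, dst
    hops = []
    while x != dx:                       # X dimension first
        hops.append("E" if dx > x else "W")
        x += 1 if dx > x else -1
    while y != dy:                       # then Y dimension
        hops.append("N" if dy > y else "S")
        y += 1 if dy > y else -1
    return hops

# Deterministic: the same (source, dest) pair always yields the same path.
print(xy_route((0, 0), (2, 1)))  # ['E', 'E', 'N']
```

Because every packet exhausts its X hops before taking any Y hop, no Y-to-X turn ever occurs, which is exactly why the resource-dependence graph contains no cycles.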

Deadlock
• No forward progress
• Caused by circular dependencies on resources
• Each packet waits for a buffer occupied by another packet downstream
10

Handling Deadlock
• Avoid cycles in routing
  • Dimension-order routing: cannot build a circular dependency
  • Restrict the “turns” each packet can take
• Avoid deadlock by adding virtual channels
• Detect and break deadlock
  • Preemption of buffers
11

Turn Model to Avoid Deadlock
• Idea
  • Analyze directions in which packets can turn in the network
  • Determine the cycles that such turns can form
  • Prohibit just enough turns to break possible cycles
• Glass and Ni, “The Turn Model for Adaptive Routing,” ISCA 1992.
12
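As a toy illustration, the west-first variant of the turn model can be encoded as a table of allowed direction changes: prohibiting just the two turns into West (N to W and S to W) removes every cycle a packet could form on a 2D mesh. The encoding below is a sketch for illustration, not code from the paper:

```python
# West-first turn model on a 2D mesh: all westward hops must come first.
# The only prohibited turns are N->W and S->W (180-degree U-turns are
# disallowed in any case, so they are omitted from the table).
ALLOWED_NEXT = {
    "W": {"W", "N", "S"},
    "N": {"N", "E"},        # turning from North to West is prohibited
    "S": {"S", "E"},        # turning from South to West is prohibited
    "E": {"E", "N", "S"},
}

def path_allowed(hops):
    """Check a hop sequence against the west-first turn restrictions."""
    return all(b in ALLOWED_NEXT[a] for a, b in zip(hops, hops[1:]))
```

A path such as W, W, N, E is legal (all West hops come first), while any path that turns back toward West after heading North or South is rejected.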

Valiant’s Algorithm
• An example of an oblivious algorithm
• Goal: Balance network load
• Idea: Randomly choose an intermediate destination, route to it first, then route from there to the destination
  • Between source-intermediate and intermediate-dest, can use dimension-order routing
+ Randomizes/balances network load
- Non-minimal (packet latency can increase)
• Optimizations:
  • Do this only under high load
  • Restrict the intermediate node to be close (in the same quadrant)
13
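A minimal sketch of Valiant’s idea in Python, using dimension-order routing for both legs (function names and the mesh model are assumptions for illustration):

```python
import random

def dor_route(src, dst):
    """Dimension-order (XY) route between two mesh nodes."""
    (x, y), (dx, dy) = src, dst
    hops = ["E" if dx > x else "W"] * abs(dx - x)
    hops += ["N" if dy > y else "S"] * abs(dy - y)
    return hops

def valiant_route(src, dst, mesh_size, rng=random):
    """Valiant routing: route to a randomly chosen intermediate node
    first, then on to the destination.  Balances load across the
    network at the cost of non-minimal path lengths."""
    mid = (rng.randrange(mesh_size), rng.randrange(mesh_size))
    return dor_route(src, mid) + dor_route(mid, dst)
```

Whatever intermediate node is drawn, the concatenated path still ends at the destination; the cost is that its length can exceed the minimal XY distance, which is the latency penalty the slide notes.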

Adaptive Routing
• Minimal adaptive
  • Router uses network state (e.g., downstream buffer occupancy) to pick which “productive” output port to send a packet to
  • Productive output port: port that gets the packet closer to its destination
  + Aware of local congestion
  - Minimality restricts achievable link utilization (load balance)
• Non-minimal (fully) adaptive
  • “Misroute” packets to non-productive output ports based on network state
  + Can achieve better network utilization and load balance
  - Need to guarantee livelock freedom
14
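The minimal adaptive choice can be sketched in a few lines of Python: among the productive ports, pick the one whose downstream buffer currently holds the fewest flits. The congestion signal and the names below are assumptions for illustration; real routers use richer state.

```python
def minimal_adaptive_port(cur, dst, occupancy):
    """Pick a productive output port (one that moves the packet closer
    to dst) with the lowest downstream buffer occupancy.
    occupancy maps direction -> number of flits queued downstream."""
    (x, y), (dx, dy) = cur, dst
    productive = []
    if dx > x: productive.append("E")
    if dx < x: productive.append("W")
    if dy > y: productive.append("N")
    if dy < y: productive.append("S")
    if not productive:
        return None                      # already at the destination
    return min(productive, key=lambda p: occupancy.get(p, 0))
```

A fully adaptive router would additionally consider the non-productive ports when all productive ones are congested, which is where the livelock-freedom obligation comes from.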

More on Adaptive Routing
• Can avoid faulty links/routers
• Idea: Route around faults
+ Deterministic routing cannot handle faulty components
- Need to change the routing table to disable faulty routes
- Assumes the faulty link/router has been detected
15

Real On-Chip Network Designs
• Tilera Tile64 and Tile100
• Larrabee
• Cell
16

On-Chip vs. Off-Chip Differences
Advantages of on-chip
• Wires are “free”
  • Can build highly connected networks with wide buses
• Low latency
  • Can cross entire network in a few clock cycles
• High reliability
  • Packets are not dropped and links rarely fail
Disadvantages of on-chip
• Sharing resources with the rest of the components on chip
  • Area
  • Power
• Limited buffering available
• Not all topologies map well to a 2D plane
17

Tilera Networks
• 2D Mesh
• Five networks
• Four packet switched
  • Dimension-order routing, wormhole flow control
  • TDN: Cache request packets
  • MDN: Response packets
  • IDN: I/O packets
  • UDN: Core to core messaging
• One circuit switched
  • STN: Low-latency, high-bandwidth static network
  • Streaming data
18

Research Topics in Interconnects
• Plenty of topics in on-chip networks. Examples:
• Energy/power efficient and proportional design
• Reducing complexity: Simplified router and protocol designs
• Adaptivity: Ability to adapt to different access patterns
• QoS and performance isolation
  • Reducing and controlling interference, admission control
• Co-design of NoCs with other shared resources
  • End-to-end performance, QoS, power/energy optimization
• Scalable topologies to many cores
• Fault tolerance
• Request prioritization, priority inversion, coherence, …
• New technologies (optical, 3D)
19

Bufferless Routing
(these slides are not covered in class; they are for your benefit)
Moscibroda and Mutlu, “A Case for Bufferless Routing in On-Chip Networks,” ISCA 2009.
20

On-Chip Networks (NoC)
• Connect cores, caches, memory controllers, etc.
• Examples:
  • Intel 80-core Terascale chip
  • MIT RAW chip
• Design goals in NoC design:
  • High throughput, low latency
  • Fairness between cores, QoS, …
  • Low complexity, low cost
  • Power, low energy consumption
Energy/Power in On-Chip Networks
• Power is a key constraint in the design of high-performance processors
• NoCs consume a substantial portion of system power
  • ~30% in Intel 80-core Terascale [IEEE Micro’07]
  • ~40% in MIT RAW chip [ISCA’04]
• NoCs estimated to consume 100s of Watts [Borkar, DAC’07]

Current NoC Approaches
• Existing approaches differ in numerous ways:
  • Network topology [Kim et al., ISCA’07; Kim et al., ISCA’08; etc.]
  • Flow control [Michelogiannakis et al., HPCA’09; Kumar et al., MICRO’08; etc.]
  • Virtual channels [Nicopoulos et al., MICRO’06; etc.]
  • QoS & fairness mechanisms [Lee et al., ISCA’08; etc.]
  • Routing algorithms [Singh et al., CAL’04]
  • Router architecture [Park et al., ISCA’08]
  • Broadcast, multicast [Jerger et al., ISCA’08; Rodrigo et al., MICRO’08]
• Existing work assumes the existence of buffers in routers!

A Typical Router
[Diagram: N input channels, each feeding an input port with virtual channels VC1…VCv; routing computation, VC arbiter, and switch arbiter control an N x N crossbar to N output channels; credit flow back to upstream routers]
• Buffers are an integral part of existing NoC routers

Buffers in NoC Routers
• Buffers are necessary for high network throughput: buffers increase the total available bandwidth in the network
[Graph: average packet latency vs. injection rate for small, medium, and large buffers; larger buffers delay the latency explosion to higher injection rates]

Buffers in NoC Routers
• Buffers are necessary for high network throughput: buffers increase the total available bandwidth in the network
• Buffers consume significant energy/power
  • Dynamic energy when read/written
  • Static energy even when not occupied
• Buffers add complexity and latency
  • Logic for buffer management
  • Virtual channel allocation
  • Credit-based flow control
• Buffers require significant chip area
  • E.g., in the TRIPS prototype chip, input buffers occupy 75% of total on-chip network area [Gratz et al., ICCD’06]

Going Bufferless…?
• How much throughput do we lose? How is latency affected?
[Graph: latency vs. injection rate, with buffers vs. no buffers]
• Up to what injection rates can we use bufferless routing? Are there realistic scenarios in which the NoC is operated at injection rates below that threshold? If so, how much…?
• Can we achieve energy reduction?
• Can we reduce area, complexity, etc.?

Overview
• Introduction and Background
• Bufferless Routing (BLESS)
  • FLIT-BLESS
  • WORM-BLESS
  • BLESS with buffers
• Advantages and Disadvantages
• Evaluations
• Conclusions

BLESS: Bufferless Routing
• Always forward all incoming flits to some output port
• If no productive direction is available, send the flit in another direction: the packet is deflected
• Hot-potato routing [Baran’64, etc.]
[Diagram: buffered routing vs. BLESS; in BLESS, a flit that loses arbitration for its productive port is deflected instead of buffered]

BLESS: Bufferless Routing
• The VC arbiter and switch arbiter of a buffered router are replaced by a flit-ranking and port-prioritization arbitration policy:
  • Flit-Ranking: 1. Create a ranking over all incoming flits
  • Port-Prioritization: 2. For a given flit in this ranking, find the best free output port
• Apply to each flit in order of ranking

FLIT-BLESS: Flit-Level Routing
• Each flit is routed independently
• Oldest-first arbitration (other policies evaluated in paper)
  • Flit-Ranking: 1. Oldest-first ranking
  • Port-Prioritization: 2. Assign flit to a productive port, if possible; otherwise, assign to a non-productive port
• Network topology: can be applied to most topologies (mesh, torus, hypercube, trees, …) as long as
  1) #output ports ≥ #input ports at every router
  2) every router is reachable from every other router
• Flow control & injection policy: completely local; inject whenever an input port is free
• Absence of deadlocks: every flit is always moving
• Absence of livelocks: guaranteed with oldest-first ranking
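A compact sketch of one FLIT-BLESS arbitration cycle. The data layout is an assumption for illustration: a flit is an (injection_time, id) pair, and productive maps each flit id to the ports that lead toward its destination. Oldest flits pick first; a flit with no free productive port is deflected to some free port, so every flit always moves:

```python
def bless_assign_ports(flits, ports, productive):
    """Oldest-first ranking + port-prioritization for one router cycle.
    With #output ports >= #input ports, every flit gets some port."""
    assignment = {}
    free = set(ports)
    for _, fid in sorted(flits):                 # oldest-first ranking
        wanted = [p for p in productive[fid] if p in free]
        # productive port if one is free, otherwise deflect to any free port
        port = wanted[0] if wanted else min(free)
        assignment[fid] = port
        free.remove(port)
    return assignment

# Two flits both want "N": the older one gets it, the younger is deflected.
print(bless_assign_ports([(5, "b"), (2, "a")],
                         {"N", "S", "E", "W"},
                         {"a": ["N"], "b": ["N"]}))  # {'a': 'N', 'b': 'E'}
```

Oldest-first ranking is what makes this livelock-free: the globally oldest flit always wins its productive port, so it makes forward progress and eventually drains, and every other flit eventually becomes oldest.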

WORM-BLESS: Wormhole Routing
• Potential downsides of FLIT-BLESS
  • Not energy-optimal (each flit needs header information)
  • Increase in latency (different flits take different paths)
  • Increase in receive buffer size
• BLESS with wormhole routing…? [Dally, Seitz’86]
• Problems:
  • Injection problem (not known when it is safe to inject a new worm)
  • Livelock problem (packets can be deflected forever)

WORM-BLESS: Wormhole Routing
• At low congestion, packets travel routed as worms
• Flit-Ranking: 1. Oldest-first ranking
• Port-Prioritization: 2. If the flit is a head-flit,
  a) assign flit to an unallocated, productive port
  b) assign flit to an allocated, productive port
  c) assign flit to an unallocated, non-productive port
  d) assign flit to an allocated, non-productive port
  else, assign flit to the port that is allocated to its worm
• Deflect worms if necessary! Truncate worms if necessary!
[Diagram: a worm allocated to West is truncated and deflected by a worm allocated to North; the body-flit at the truncation point turns into a head-flit]
• See paper for details…

BLESS with Buffers
• BLESS without buffers is the extreme end of a continuum
• BLESS can be integrated with buffers
  • FLIT-BLESS with buffers
  • WORM-BLESS with buffers
• Whenever a buffer is full, its first flit becomes must-schedule
  • must-schedule flits must be deflected if necessary
• See paper for details…

Overview
• Introduction and Background
• Bufferless Routing (BLESS)
  • FLIT-BLESS
  • WORM-BLESS
  • BLESS with buffers
• Advantages and Disadvantages
• Evaluations
• Conclusions

BLESS: Advantages & Disadvantages
Advantages
• No buffers
• Purely local flow control
• Simplicity
  - no credit flows
  - no virtual channels
  - simplified router design
• No deadlocks, livelocks
• Adaptivity
  - packets are deflected around congested areas!
• Router latency reduction
• Area savings
Disadvantages
• Increased latency
• Reduced bandwidth
• Increased buffering at receiver
• Header information at each flit
• Oldest-first arbitration complex
• QoS becomes difficult
• Impact on energy…?

Reduction of Router Latency
• Pipeline stage legend: BW: Buffer Write; RC: Route Computation; VA: Virtual Channel Allocation; SA: Switch Allocation; ST: Switch Traversal; LT: Link Traversal; LA LT: Link Traversal of Lookahead
• Baseline router (speculative) [Dally, Towles’04]: head flit: BW, RC/VA/SA, ST, LT; body flit: BW, SA, ST, LT. Router latency = 3 (can be improved to 2)
• BLESS router (standard): RC, ST, LT. Router latency = 2
• BLESS router (optimized): route computation overlapped with the lookahead link traversal (LA LT with RC/ST, then LT). Router latency = 1
• BLESS gets rid of input buffers and virtual channels

BLESS: Advantages & Disadvantages
Advantages
• No buffers
• Purely local flow control
• Simplicity
  - no credit flows
  - no virtual channels
  - simplified router design
• No deadlocks, livelocks
• Adaptivity
  - packets are deflected around congested areas!
• Router latency reduction
• Area savings
Disadvantages
• Increased latency
• Reduced bandwidth
• Increased buffering at receiver
• Header information at each flit
• Impact on energy…?
Extensive evaluations in the paper!

Evaluation Methodology
• 2D mesh network, router latency is 2 cycles
  • 4x4, 8 cores, 8 L2 cache banks (each node is a core or an L2 bank)
  • 4x4, 16 cores, 16 L2 cache banks (each node is a core and an L2 bank)
  • 8x8, 16 cores, 64 L2 cache banks (each node is an L2 bank and may be a core)
  • 128-bit wide links, 4-flit data packets, 1-flit address packets
  • For baseline configuration: 4 VCs per physical input port, 1 packet deep
• Benchmarks
  • Multiprogrammed SPEC CPU2006 and Windows Desktop applications
  • Heterogeneous and homogeneous application mixes
  • Synthetic traffic patterns: UR, Transpose, Tornado, Bit Complement
• x86 processor model based on Intel Pentium M
  • 2 GHz processor, 128-entry instruction window
  • 64 KB private L1 caches
  • Total 16 MB shared L2 caches; 16 MSHRs per bank
  • DRAM model based on Micron DDR2-800

Evaluation Methodology
• Energy model provided by Orion simulator [MICRO’02]
  • 70 nm technology, 2 GHz routers at 1.0 Vdd
• For BLESS, we model
  • Additional energy to transmit header information
  • Additional buffers needed on the receiver side
  • Additional logic to reorder flits of individual packets at receiver
• We partition network energy into buffer energy, router energy, and link energy, each having static and dynamic components
• Comparisons against non-adaptive and aggressive adaptive buffered routing algorithms (DO, MIN-AD, ROMM)

Evaluation – Synthetic Traces
• First, the bad news: BLESS has significantly lower saturation throughput compared to the buffered baseline
[Graph: average latency vs. injection rate (0.07–0.49 flits per cycle per node) under uniform random injection, for the BLESS variants FLIT-2, WORM-2, FLIT-1, WORM-1, for MIN-AD, and for the best baseline]

Evaluation – Homogeneous Case Study
• milc benchmarks (moderately intensive)
• Perfect caches!
• Very little performance degradation with BLESS (less than 4% in dense network)
• With router latency 1, BLESS can even outperform baseline (by ~10%)
• Significant energy improvements (almost 40%)
[Charts: baseline vs. BLESS vs. BLESS with router latency RL=1]

Evaluation – Homogeneous Case Study
• milc benchmarks (moderately intensive), perfect caches
• Very little performance degradation with BLESS on average (less than 4% in dense network)
• With router latency 1, BLESS can even outperform baseline (by ~10%)
• Significant energy improvements (almost 40%)
Observations:
1) Injection rates are not extremely high on average: self-throttling!
2) For bursts and temporary hotspots, use network links as buffers!

Evaluation – Further Results
• BLESS increases buffer requirement at the receiver by at most 2x; overall, energy is still reduced
• Impact of memory latency: with real caches, very little slowdown! (at most 1.5%)
• See paper for details…

Evaluation – Further Results
• BLESS increases buffer requirement at the receiver by at most 2x; overall, energy is still reduced
• Impact of memory latency: with real caches, very little slowdown! (at most 1.5%)
• Heterogeneous application mixes (we evaluate several mixes of intensive and non-intensive applications)
  • little performance degradation
  • significant energy savings in all cases
  • no significant increase in unfairness across different applications
• Area savings: ~60% of network area can be saved!
• See paper for details…

Evaluation – Aggregate Results
• Aggregate results over all 29 applications

Sparse Network            Perfect L2              Realistic L2
                          Average   Worst-Case    Average   Worst-Case
∆ Network Energy          -39.4%    -28.1%        -46.4%    -41.0%
∆ System Performance      -0.5%     -3.2%         -0.15%    -0.55%

[Bar chart: network energy normalized to BASE, split into buffer energy, router energy, and link energy, for BASE, FLIT, and WORM, mean and worst case]

Evaluation – Aggregate Results
• Aggregate results over all 29 applications

Sparse Network            Perfect L2              Realistic L2
                          Average   Worst-Case    Average   Worst-Case
∆ Network Energy          -39.4%    -28.1%        -46.4%    -41.0%
∆ System Performance      -0.5%     -3.2%         -0.15%    -0.55%

Dense Network             Perfect L2              Realistic L2
                          Average   Worst-Case    Average   Worst-Case
∆ Network Energy          -32.8%    -14.0%        -42.5%    -33.7%
∆ System Performance      -3.6%     -17.1%        -0.7%     -1.5%

BLESS Conclusions
• For a very wide range of applications and network settings, buffers are not needed in NoC
  • Significant energy savings (32% even in dense networks with perfect caches)
  • Area savings of 60%
  • Simplified router and network design (flow control, etc.)
  • Performance slowdown is minimal (performance can even increase!)
• A strong case for a rethinking of NoC design!
• Future research:
  • Support for quality of service, different traffic classes, energy management, etc.