  • Number of slides: 29

A Heterogeneous Lightweight Multithreaded Architecture
Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block
University of Notre Dame
MTAAP 2007, CA

Outline
- Heterogeneous Lightweight Multithreaded Architecture
- Simulation environments, benchmarks and results
- Conclusions and future work
MTAAP 2007, Long Beach, CA

Architecture Highlights
- Processing-In-Memory (PIM) based
  - Effectively attacks the memory wall problem
- Highly multithreaded
  - Successfully hides large latencies and contention
- Heterogeneous, supports Extended Memory Semantics (EMS)
  - Extremely low overhead for context switching and synchronization

Multithreaded Processors
- Multithreading reduces the processor idle time
- Thread context is part of the processor
Multithreading Machines (timeline figure; legend: single-threaded, multithreaded)
- 1960s: CDC 6600
- 1970s: I/O Processor for the Space Shuttle
- 1980s: Denelcor HEP
- 1990s: Cray/Tera MTA
- 2000+: Cray Eldorado
- 2000+: Intel Xeon
- 2000+: Sun Niagara

Lightweight Threads
- Thread context (frame) is 32 double words (256 bytes)
  - Two double words are reserved for the thread status; the remaining 30 are general-purpose registers
  - No other per-thread state, which makes multithreading easy
- Frames are stored in memory (no register file)
  - Registers are aliases for memory locations
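The frame arithmetic above can be sketched in a few lines of Python. The sizes come from the slide; the ordering of the slots (status double words first, then registers) and the helper name register_address are assumptions made only for illustration.

```python
DWORD_BYTES = 8        # one double word carries 64 bits of data
FRAME_DWORDS = 32      # from the slide: 32 double words per frame
STATUS_DWORDS = 2      # from the slide: two double words hold thread status
NUM_REGISTERS = FRAME_DWORDS - STATUS_DWORDS
FRAME_BYTES = FRAME_DWORDS * DWORD_BYTES


def register_address(frame_base, reg):
    """Return the memory address that register `reg` aliases.

    ASSUMPTION: the two status double words occupy the first two slots of
    the frame and registers follow in order. The slide does not say this.
    """
    if not 0 <= reg < NUM_REGISTERS:
        raise ValueError("register index out of range")
    return frame_base + (STATUS_DWORDS + reg) * DWORD_BYTES


if __name__ == "__main__":
    print(FRAME_BYTES, NUM_REGISTERS)
    print(hex(register_address(0x1000, 0)))
```

Because a register is only an address inside the frame, switching threads under this model amounts to switching the frame base pointer.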

Lightweight Multithreading
- Thread creation is fast and inexpensive: a single instruction
  - Contrast with pthread creation: kernel intervention and as many as tens of thousands of instructions
- Unbounded multithreading
  - Threads are part of the memory system rather than the processor state
  - "Unlimited number" of threads per processor
  - Many opportunities for issuing an instruction
- Ultra-lightweight processing
  - Unbounded multithreading requires low-overhead thread management and synchronization
  - At the memory bank: greater data bandwidth, low overhead

Heterogeneous Architecture
- Issue instruction from ready threads on each clock cycle
- Architectural support for low-overhead thread management
Figure: Heterogeneous Architecture, Lightweight Processor Chip (LPC)

Extended Memory Semantics (EMS)
- Memory subsystem is constructed of 65-bit dwords
  - 64 bits of data
  - 1 extension bit; 1: dword is full, 0: dword is empty
- Extends Cray MTA E/F bits
  - Full/Empty: contains data or not
  - Extra states: metadata can contain a frame pointer
- Same semantics apply to thread registers
Figure: dword layout (64 bits of data/metadata, extension bit)
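A 65-bit dword can be modelled as a Python integer with one extra bit above the 64 data bits. This is only a sketch: placing the extension bit at bit position 64 is an assumption, and the helper names pack_dword, is_full and payload are hypothetical.

```python
DATA_BITS = 64
DATA_MASK = (1 << DATA_BITS) - 1
FULL_BIT = 1 << DATA_BITS   # ASSUMPTION: extension bit sits above the data bits


def pack_dword(data, full):
    """Combine 64 bits of data (or metadata) with the extension bit."""
    if not 0 <= data <= DATA_MASK:
        raise ValueError("data does not fit in 64 bits")
    return (FULL_BIT if full else 0) | data


def is_full(dword):
    """Extension bit 1 means full, 0 means empty (as on the slide)."""
    return bool(dword & FULL_BIT)


def payload(dword):
    """Return the 64 data/metadata bits without the extension bit."""
    return dword & DATA_MASK


if __name__ == "__main__":
    word = pack_dword(42, full=True)
    print(is_full(word), payload(word))
```

When the extension bit is 0, the 64 payload bits are free to carry metadata such as a frame pointer, which is how the extra states mentioned above can be encoded.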

Single Producer/Consumer on EMS
- LWP behavior for load_fe with A empty:
  - Location A changes state to "FVE: forward value, leave empty"
  - Content of A is the target address of the forward operation (all registers also have a memory address)

Completing the Load
- How does the LWP complete the load_fe?
  - store_ef arrives at A
  - Data associated with the store is returned to T2:R2; this completes the load_fe
  - Location A changes to the empty state
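The single producer/consumer handshake can be traced with a small state machine. Only the path described on the slides (load_fe on an empty location, then store_ef) is taken from the source. The other branches, the class name Memory, and the concrete addresses are assumptions, and the sketch deliberately refuses the multi-waiter cases that would need an EMS handler.

```python
from enum import Enum


class State(Enum):
    EMPTY = "empty"
    FULL = "full"
    FVE = "forward value, leave empty"


class Memory:
    """Toy model: each address maps to a (state, content) pair."""

    def __init__(self):
        self.cells = {}

    def _get(self, addr):
        return self.cells.get(addr, (State.EMPTY, 0))

    def load_fe(self, addr, target_addr):
        state, content = self._get(addr)
        if state is State.EMPTY:
            # From the slides: A enters FVE and holds the target address.
            self.cells[addr] = (State.FVE, target_addr)
        elif state is State.FULL:
            # ASSUMPTION: a load on a full location completes at once.
            self.cells[target_addr] = (State.FULL, content)
            self.cells[addr] = (State.EMPTY, 0)
        else:
            raise NotImplementedError("second waiter needs an EMS handler")

    def store_ef(self, addr, value):
        state, content = self._get(addr)
        if state is State.FVE:
            # From the slides: forward the data to the waiting register,
            # then leave the location empty.
            self.cells[content] = (State.FULL, value)
            self.cells[addr] = (State.EMPTY, 0)
        elif state is State.EMPTY:
            # ASSUMPTION: with no waiter, the store simply fills the location.
            self.cells[addr] = (State.FULL, value)
        else:
            raise NotImplementedError("store to full needs an EMS handler")


if __name__ == "__main__":
    A, T2_R2 = 0x100, 0x2010      # hypothetical addresses
    mem = Memory()
    mem.load_fe(A, T2_R2)         # consumer arrives first
    mem.store_ef(A, 7)            # producer completes the load
    print(mem.cells[T2_R2], mem.cells[A])
```

The two NotImplementedError branches mark where hardware support ends in this model.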

A More Complex Situation
- Consider a multiple producer/consumer problem such as locks
  - Multiple threads (more than 3) all attempt to acquire the lock
  - Memory requests will be queued up at the target location
  - An EMS handler thread is needed to handle the bookkeeping

EMS Handler Overhead
- Invoking an EMS handler
  - Needed for synchronized memory operations beyond the hardware-supported single producer/consumer scenario
- Overhead
  - Creating the handler threads
  - To queue up memory requests, handlers need to spin on the target memory address to get exclusive access
  - Significant overhead on LWP CPU time, NoC traffic and memory bandwidth
- How to alleviate the overhead?

Ultra-Lightweight Processor
- Alleviates the burden on the LWP
- For thread synchronization and management, and complex atomic memory operations
- Simple design, minimal circuitry
- At the memory bank: greatest data bandwidth (wide-word), no NoC traffic when accessing memory
- Multithreaded

Figure: Large-scale system

Outline
- Heterogeneous Lightweight Multithreaded Architecture
- Simulation environments, benchmarks and results
- Conclusions and future work

Simulation Environment
DimC (Diminished C)
- An extension of ANSI C
- Exposes low-level architectural features
- Supports lightweight multithreading
SALT (Simulator for the Analysis of LWP Timings)
- Contains LWPs, ULWPs, NoC and memory subsystems

Benchmark Suite
- Two categories of irregular problems
- Complicated control structures such as recursion
  - Such programs can achieve decent performance on conventional architectures, but only with great effort
  - Do not necessarily invoke an EMS handler or ULWP
  - N-Queens, Fibonacci
- Complicated control structures and dynamic data structures
  - Very hard to parallelize effectively on conventional SMPs
  - EMS handler or ULWP support is necessary
  - Competing agents, SAT solver kernel

N-Queens
- Find all solutions to the problem of placing N queens on an N×N chessboard such that no queen can attack another
- Irregular problem with dynamic parallel recursion
- Thread behavior is hard to predict
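A serial reference version of the recursion is sketched below in Python. It is not the DimC benchmark code from the talk; the function name count_queens and the bitmask bookkeeping are illustrative choices. Each recursive call explores an independent subtree, which is the unit of work a lightweight thread could take on, and the uneven depth of those subtrees is what makes thread behavior hard to predict.

```python
def count_queens(n, row=0, cols=0, diag1=0, diag2=0):
    """Count placements of n non-attacking queens, one row at a time.

    cols, diag1 and diag2 are bitmasks of occupied columns and diagonals.
    """
    if row == n:
        return 1                      # every row holds a queen: one solution
    total = 0
    for col in range(n):
        d1 = row + col                # index of the "/" diagonal
        d2 = row - col + n - 1        # index of the "\" diagonal
        if cols & (1 << col) or diag1 & (1 << d1) or diag2 & (1 << d2):
            continue                  # square is attacked, prune this branch
        # Independent subproblem: a candidate for its own thread.
        total += count_queens(n, row + 1,
                              cols | (1 << col),
                              diag1 | (1 << d1),
                              diag2 | (1 << d2))
    return total


if __name__ == "__main__":
    for n in (4, 6, 8):
        print(n, count_queens(n))
```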

Competing Agents
- Multiple agents attempt to update a shared memory location simultaneously
- Each agent is implemented by a single thread; all threads are evenly distributed over four LWPs inside a single LPC
- Complicated control structures and dynamic data structures
- Uses separate synchronized loads/stores
- Characterizes the effectiveness of the ULWP in reducing the cost of synchronization
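The access pattern of this benchmark can be imitated on a conventional machine with Python threads. The sketch below emulates a full/empty location in software with a condition variable; it says nothing about how the LWP, EMS handler or ULWP implement it, and the names FullEmptyCell and run_agents are hypothetical. Each agent does a separate synchronized load followed by a separate synchronized store, so the location acts as a lock while it is empty.

```python
import threading


class FullEmptyCell:
    """Software emulation of one location with a full/empty bit."""

    def __init__(self, value):
        self._value = value
        self._full = True
        self._cond = threading.Condition()

    def load_fe(self):
        """Wait until full, read the value, leave the cell empty."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False
            self._cond.notify_all()
            return self._value

    def store_ef(self, value):
        """Wait until empty, write the value, leave the cell full."""
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value = value
            self._full = True
            self._cond.notify_all()


def run_agents(num_agents, updates_per_agent):
    """Each agent is one thread that repeatedly updates the shared cell."""
    cell = FullEmptyCell(0)

    def agent():
        for _ in range(updates_per_agent):
            current = cell.load_fe()      # acquire: cell is now empty
            cell.store_ef(current + 1)    # release: cell is full again

    threads = [threading.Thread(target=agent) for _ in range(num_agents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.load_fe()


if __name__ == "__main__":
    print(run_agents(num_agents=4, updates_per_agent=100))
```

Every waiting agent here blocks inside load_fe, which is the queueing that the slides assign to the EMS handler or the ULWP.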

SAT Solver/zChaff
- SAT: the Boolean satisfiability problem (from propositional logic)
  - Fundamental to many problems in automated reasoning, CAD, CAM, machine vision, databases, robotics, IC design, computer architecture, and network design
  - Given a Boolean formula (usually in CNF), check whether an assignment of Boolean truth values to the variables in the formula exists such that the formula evaluates to true
  - For example, in the slide's three-clause CNF formula, if x1 is true and x3 is false, then all three clauses are satisfied, regardless of the value of x2
- zChaff, a modern variant of the DPLL algorithm, is used to implement the SAT solver
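The satisfiability check can be made concrete with a tiny evaluator. The formula used here is hypothetical (the slide's own formula is not reproduced in this transcript); it was chosen only because it has the same property: x1 true and x3 false satisfy all three clauses whatever x2 is. The search is plain brute force, not DPLL and not zChaff.

```python
from itertools import product

# HYPOTHETICAL formula, not the one from the slide:
# (x1 or x2) and (not x3 or x2) and (x1 or not x3)
# A positive integer k means xk, a negative integer -k means not xk.
FORMULA = [[1, 2], [-3, 2], [1, -3]]


def evaluate(cnf, assignment):
    """True if every clause has at least one literal made true."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in cnf
    )


def brute_force_sat(cnf, num_vars):
    """Try all 2**num_vars assignments; return a satisfying one or None."""
    for values in product([False, True], repeat=num_vars):
        assignment = {i + 1: v for i, v in enumerate(values)}
        if evaluate(cnf, assignment):
            return assignment
    return None


if __name__ == "__main__":
    for x2 in (False, True):
        print(x2, evaluate(FORMULA, {1: True, 2: x2, 3: False}))
    print(brute_force_sat([[1], [-1]], 1))
```

A DPLL-style solver prunes this exponential search with unit propagation and backtracking, which is where the dynamic data structures of this benchmark come from.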

N-Queens
- Successfully exploits all the parallelism
  - Completely dynamic, ideal speedup
  - Saturation is due only to the small data set
- Good performance can be achieved on conventional SMPs, but only with great extra effort

Competing Agents
- The EMS handler is the bottleneck in high-contention situations
- The heterogeneous architecture can achieve unbounded scalability
- High contention is no longer a problem in the heterogeneous architecture

SAT Solver/zChaff on Conventional SMPs
Data from "Parallel Multithreaded Satisfiability Solver: Design and Implementation" by Yulik Feldman et al. (Intel)
- Parallel implementation leads to performance degradation
- The more processors, the worse the performance
- Very hard to achieve good performance on conventional SMPs

SAT Solver/zChaff on Heterogeneous Architecture
Figure: speedup over serial version
- Ideal speedup
- Saturation is due only to the small data set
- Successfully exploits all the parallelism

Outline
- Heterogeneous Lightweight Multithreaded Architecture
- Simulation environments, benchmarks and results
- Conclusions and future work

Conclusions
- The Heterogeneous Lightweight Multithreaded Architecture
  - Is a good solution for irregular problems that are hard or impossible to parallelize on conventional SMPs
  - Has very low overhead for context switching and synchronization
  - Can successfully hide latencies and contention
  - Can provide unbounded multithreading and scalability
  - Can exploit all available parallelism inside an irregular problem

Future Work
- Provide standard language support
- Benchmark suites
- Large-scale system performance
- Comparison with conventional large-scale systems

Acknowledgments
- DARPA
  - This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under its Contract No. NBCH3039003.
- University of Notre Dame
- Caltech/JPL
- Cray

Thank you!