
  • Number of slides: 24

X10: An Object-Oriented Approach to Non-uniform Cluster Computing
Vijay Saraswat, IBM Research
July 23, 2003 IBM PL Day 2005

Overview
  • Introduction and context
    • Clustered computing
  • Language model and constructs
    • Big picture
    • places, atomic, async, finish, clocks, arrays
  • Example programs and demo
  • Conclusion and future work
    • Guarantees
    • Challenges

Acknowledgements
  • X10 core team
    • Philippe Charles
    • Chris Donawa (IBM Toronto)
    • Kemal Ebcioglu
    • Christian Grothoff (Purdue)
    • Allan Kielstra (IBM Toronto)
    • Maged Michael
    • Christoph von Praun
    • Vivek Sarkar
  • Additional contributors to X10 ideas: David Bacon, Bob Blainey, Perry Cheng, Julian Dolby, Guang Gao (U Delaware), Robert O'Callahan, Filip Pizlo (Purdue), Lawrence Rauchwerger (Texas A&M), Mandana Vaziri, Jan Vitek (Purdue), V. T. Rajan, Radha Jagadeesan (DePaul)
  • X10 Tools: Julian Dolby, Steve Fink, Robert Fuhrer, Matthias Hauswirth, Peter Sweeney, Frank Tip, Mandana Vaziri
  • University partners: MIT (StreamIt), Purdue University (X10), UC Berkeley (StreamBit), U. Delaware (Atomic sections), U. Illinois (Fortran plug-in), Vanderbilt University (Productivity metrics), DePaul U (Semantics)
  • X10 PM+Tools Team Lead: Kemal Ebcioglu, Vivek Sarkar
  • PERCS Principal Investigator: Mootaz Elnozahy

Performance and Productivity Challenges
  1) Memory wall: Architectures exhibit severe non-uniformities in bandwidth and latency in the memory hierarchy.
  2) Frequency wall: Architectures introduce hierarchical heterogeneous parallelism to compensate for the frequency scaling slowdown.
  3) Scalability wall: Software will need to deliver ~10^5-way parallelism to utilize peta-scale parallel systems.
  [Figure: memory hierarchy, from processor clusters (PEs with L1 caches) through L2 cache and L3 cache to memory]
  [Figure: levels of parallelism: clusters (scale-out), SMP, multiple cores on a chip, coprocessors (SPUs), SMTs, SIMD, ILP]

High Complexity Limits Development Productivity
  • 1995: the entire chip can be accessed in 1 cycle.
  • 2010: only a small fraction of the chip can be accessed in 1 cycle.
  • One billion transistors in a chip.
  • Major sources of complexity for the application developer:
    1) Severe non-uniformities in data accesses
    2) Applications must exhibit large degrees of parallelism (up to ~10^5 threads)
  • Complexity leads to increases in all phases of the HPC software lifecycle related to parallel code.
  [Figure: memory hierarchy, from processor clusters (PEs with L1 caches) through L2 cache and L3 cache to memory]
  [Figure: HPC Software Lifecycle: Requirements, Written Specification, Algorithm Development, Input Data, Parallel Specification, Source Code; Development of Parallel Source Code (Design, Code, Test, Port, Scale, Optimize); Production Runs of Parallel Code; Maintenance and Porting of Parallel Code]

PERCS Programming Model/Tools: Overall Architecture
  PERCS = Productive Easy-to-use Reliable Computer Systems
  • Performance Exploration, Productivity Metrics
  • Source languages and toolkits
    • X10 source code: X10 Development Toolkit
    • Java + Threads + Conc utils: Java Development Toolkit
    • C/C++/MPI/OpenMP: C Development Toolkit
    • Fortran/MPI/OpenMP: Fortran Development Toolkit
    • ...
  • Integrated Programming Environment: Edit, Compile, Debug, Visualize, Refactor
    • Uses the Eclipse platform (eclipse.org) as the foundation for integrating tools
  • Morphogenic Software: separation of concerns, separation of roles
  • Components and runtimes (connected by a fast extern interface)
    • X10 components, X10 runtime
    • Java components, Java runtime
    • Fortran components, Fortran runtime
    • C/C++ components, C/C++ runtime
  • Integrated Concurrency Library: messages, synchronization, threads
  • Continuous Program Optimization (CPO)
  • PERCS System Software (K42)
  • PERCS System Hardware

X10 Design Assumptions
  • Productivity
    • Axiom: OO provides proven baseline productivity, maintenance, portability benefits.
    • Axiom: Design must rule out large classes of errors (Type safe, Memory safe, Pointer safe, Lock safe, Clock safe …)
    • Axiom: Design must support incremental introduction of explicit place types/remote operations.
    • Axiom: PM must integrate with static tools (Eclipse) -- flag performance problems, refactor code, detect races.
    • Axiom: PM must support automatic static and dynamic optimization (CPO).
  • Scalability
    • Axiom: Programmer must have explicit language constructs to deal with non-uniformity of access.
    • Axiom: Allow specification of a large collection of activities.
    • Axiom: A program must use scalable synchronization constructs.
    • Axiom: The runtime may implement aggregate operations more efficiently than user-specified iterations with index variables.
    • Axiom: The user may know more than the compiler/RTS.
  Support High Productivity (&, possibly U) High Performance Programmer

The X10 Programming Model
  [Figure: places, each with a place-local heap and activities with activity-local storage (heap, stack, control); a partitioned global heap; immutable data; inbound and outbound activities and activity replies between places]
  • Granularity of a place can range from a single register file to an entire SMP system.
  • A program is a collection of places, each containing resident data and a dynamic collection of activities. (place)
  • Program may distribute aggregate data (arrays) across places during allocation. (distribution)
  • Program may directly operate only on local data, using atomic blocks. (atomic, when)
  • Program may spawn multiple (local or remote) activities in parallel. (async, {at/for}each)
  • Program must use asynchronous operations to access/update remote data.
  • Program may detect termination or (repeatedly) detect quiescence of a data-dependent, distributed set of activities. (finish, clock)
  • Cluster Computing: common framework for P >= 1
    • Shared Memory (P = 1)
    • MPI (P > 1)
  Formalized in Saraswat, Jagadeesan “Concurrent Clustered Programming”.

async
  Statement ::= async PlaceExpressionSingleListopt Statement
  • async (P) S
    • Parent activity creates a new child activity at place P, to execute statement S; returns immediately.
    • S may reference final variables in enclosing blocks.
  • cf. Cilk’s spawn

    double A[D] = …;   // Global dist. array
    final int k = …;
    async ( A.distribution[99] ) {
      // Executed at A[99]’s place
      atomic A[99] = k;
    }
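As a rough Java analogue of `async (P) S` (not X10 itself, and not how the X10 runtime is implemented), the sketch below models a place as a single-threaded executor and the child activity as a submitted task. The names `AsyncSketch` and `storeAtPlace` are hypothetical, and the one-executor-per-place mapping is an assumption made only for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Java analogue of `async (P) S`; AsyncSketch and storeAtPlace are hypothetical names.
class AsyncSketch {
    // Mirrors the slide's example: a child activity stores k into A[99].
    static int storeAtPlace(int k) {
        // Assumption: a place is modelled as one single-threaded executor.
        ExecutorService place = Executors.newSingleThreadExecutor();
        AtomicIntegerArray a = new AtomicIntegerArray(100);
        try {
            // submit() returns immediately, like async; k is effectively final.
            Future<?> child = place.submit(() -> a.set(99, k));
            child.get(); // wait here only so the result can be inspected
            return a.get(99);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            place.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(storeAtPlace(17));
    }
}
```

The explicit `child.get()` is there only to observe the result; in X10 the parent would normally rely on an enclosing `finish` instead.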

finish
  Statement ::= finish Statement
  • finish S
    • Execute S, but wait until all (transitively) spawned asyncs have terminated.
    • Trap all exceptions thrown by spawned activities.
    • Throw an (aggregate) exception if any spawned async terminates abruptly.
  • Useful for expressing “synchronous” operations on remote data
    • And potentially, ordering information in a weakly consistent memory model
  • Rooted Exception Model
  • cf. Cilk’s sync

    finish ateach (point [i] : A) A[i] = i;
    finish async (A.distribution[j]) A[j] = 2;
    // All A[i]=i will complete before A[j]=2;
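The two properties of `finish` described here (waiting for transitively spawned activities, and collecting their exceptions into one aggregate) can be approximated in plain Java. The sketch below uses `java.util.concurrent.Phaser` to count live activities; `FinishSketch`, `async`, and `awaitAll` are hypothetical names, and a small thread pool stands in for places.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical Java approximation of a `finish` scope.
class FinishSketch {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final Phaser live = new Phaser(1); // one party for the parent
    private final List<Throwable> failures = new CopyOnWriteArrayList<>();

    // Spawn a child activity; children may call async() again (transitive spawn).
    void async(Runnable body) {
        live.register(); // count the child before it starts running
        pool.execute(() -> {
            try {
                body.run();
            } catch (Throwable t) {
                failures.add(t); // trap, do not propagate on the worker thread
            } finally {
                live.arriveAndDeregister();
            }
        });
    }

    // Block until every registered activity has terminated, then rethrow in aggregate.
    void awaitAll() {
        live.arriveAndAwaitAdvance();
        pool.shutdown();
        if (!failures.isEmpty()) {
            RuntimeException aggregate =
                new RuntimeException(failures.size() + " activities failed");
            failures.forEach(aggregate::addSuppressed);
            throw aggregate;
        }
    }

    // Three children, each spawning two grandchildren: nine increments in total.
    static int demo() {
        FinishSketch finish = new FinishSketch();
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < 3; i++) {
            finish.async(() -> {
                done.incrementAndGet();
                for (int j = 0; j < 2; j++) finish.async(done::incrementAndGet);
            });
        }
        finish.awaitAll();
        return done.get();
    }

    // Two failing children: returns the number of exceptions in the aggregate.
    static int demoFailures() {
        FinishSketch finish = new FinishSketch();
        for (int i = 0; i < 2; i++) {
            finish.async(() -> { throw new IllegalStateException("boom"); });
        }
        try {
            finish.awaitAll();
            return 0;
        } catch (RuntimeException e) {
            return e.getSuppressed().length;
        }
    }
}
```

Registering the child before it is scheduled is what makes the wait transitive: a grandchild is always counted while its spawning child is still unarrived, so the phase cannot advance early.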

atomic
  Statement ::= atomic Statement
  MethodModifier ::= atomic
  • Atomic blocks are conceptually executed in a single step, while other activities are suspended.
  • An atomic block may not include
    • Blocking operations
    • Accesses to data at remote places
    • Creation of activities at remote places

    // target defined in lexically enclosing environment.
    public atomic boolean CAS(Object old, Object newVal) {
      if (target.equals(old)) {
        target = newVal;
        return true;
      }
      return false;
    }

    // push data onto concurrent list-stack
    Node node = new Node(17);
    atomic {
      node.next = head;
      head = node;
    }
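The list-stack push can be imitated in Java by guarding the two assignments with one lock. This is only one possible way to realize an atomic block (the slides list lock assignment for atomic sections as future work), and `AtomicSketch` with its methods is a hypothetical name.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical Java stand-in for the slide's atomic push onto a list-stack.
class AtomicSketch {
    static final class Node {
        final int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    private final Object lock = new Object();
    private Node head;

    void push(int value) {
        Node node = new Node(value); // allocation stays outside the atomic section
        synchronized (lock) {        // assumption: one lock emulates `atomic`
            node.next = head;
            head = node;
        }
    }

    int size() {
        synchronized (lock) {
            int n = 0;
            for (Node p = head; p != null; p = p.next) n++;
            return n;
        }
    }

    // Several threads push concurrently; no node may be lost.
    static int demo(int threads, int perThread) {
        AtomicSketch stack = new AtomicSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < perThread; i++) stack.push(i);
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return stack.size();
    }
}
```

Without the `synchronized` block two threads could read the same `head` and one push would be lost, which is the interleaving the atomic block rules out.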

when
  Statement ::= WhenStatement
  WhenStatement ::= when ( Expression ) Statement
  • Activity suspends until a state in which the guard is true; in that state the body is executed atomically.

    class OneBuffer {
      nullable Object datum = null;
      boolean filled = false;

      public void send(Object v) {
        when ( !filled ) {
          this.datum = v;
          this.filled = true;
        }
      }

      public Object receive() {
        when ( filled ) {
          Object v = datum;
          datum = null;
          filled = false;
          return v;
        }
      }
    }
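A guarded atomic block such as `when (guard) { body }` corresponds closely to the classic Java monitor idiom: wait in a loop until the guard holds, then run the body while still holding the monitor. The sketch below restates `OneBuffer` that way; `OneBufferSketch` and `demo` are hypothetical names, and this is an analogy rather than a description of how X10 compiles `when`.

```java
// Hypothetical Java rendering of the OneBuffer example using a monitor.
class OneBufferSketch {
    private Object datum;
    private boolean filled;

    // when (!filled) { datum = v; filled = true; }
    synchronized void send(Object v) throws InterruptedException {
        while (filled) wait(); // suspend until the guard holds
        datum = v;
        filled = true;
        notifyAll();           // guards of waiting activities may now be true
    }

    // when (filled) { v = datum; datum = null; filled = false; return v; }
    synchronized Object receive() throws InterruptedException {
        while (!filled) wait();
        Object v = datum;
        datum = null;
        filled = false;
        notifyAll();
        return v;
    }

    // A producer sends 1..n through the buffer; the caller sums what it receives.
    static int demo(int n) {
        OneBufferSketch buffer = new OneBufferSketch();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) buffer.send(i);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        try {
            int sum = 0;
            for (int i = 0; i < n; i++) sum += (Integer) buffer.receive();
            producer.join();
            return sum;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

The `while` loop (rather than `if`) re-checks the guard after every wake-up, so the body only ever runs in a state where the guard is true.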

regions, distributions
  • Region
    • A (multi-dimensional) set of indices
  • Distribution
    • A mapping from indices to places
  • High-level algebraic operations are provided on regions and distributions
  • Based on ZPL.

    region R = 0:100;
    region R1 = [0:100, 0:200];
    region RInner = [1:99, 1:199];
    // a local distribution
    D1 = R -> here;
    // a blocked distribution
    D = block(R);
    // union of two distributions
    distribution D = (0:1) -> P0 || (2:N) -> P1;
    distribution DBoundary = D - RInner;
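To make "a mapping from indices to places" concrete, here is a small Java sketch of a block distribution and of a region difference like `D - RInner`. The blocking rule (contiguous chunks of ceil(n / places) indices) is an assumption, since the slides do not define how `block(R)` splits a region, and all names are hypothetical.

```java
// Hypothetical sketch of a block distribution and a region difference.
class DistributionSketch {
    // Place that owns `index` in region lo:hi split over `places` places.
    // Assumption: contiguous chunks of ceil(n / places) indices each.
    static int blockPlace(int index, int lo, int hi, int places) {
        int n = hi - lo + 1;
        int chunk = (n + places - 1) / places;
        return (index - lo) / chunk;
    }

    // Number of points in [0:rowHi, 0:colHi] minus [1:rowHi-1, 1:colHi-1],
    // i.e. the boundary that a region difference would leave.
    static int boundarySize(int rowHi, int colHi) {
        int count = 0;
        for (int i = 0; i <= rowHi; i++) {
            for (int j = 0; j <= colHi; j++) {
                boolean inner = i >= 1 && i <= rowHi - 1 && j >= 1 && j <= colHi - 1;
                if (!inner) count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(blockPlace(26, 0, 100, 4));
        System.out.println(boundarySize(100, 200));
    }
}
```

With the region `0:100` (101 indices) and four places, each chunk holds 26 indices under this rule, so the last place owns fewer points than the others.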

arrays
  • Arrays may be
    • Multidimensional
    • Distributed
    • Value types
    • Initialized in parallel:
        int [D] A = new int[D] (point [i,j]) { return N*i+j; };
  • Array section
    • A[RInner]
  • High-level parallel array, reduction and span operators
    • Highly parallel library implementation
    • A-B (array subtraction)
    • A.reduce(intArray.add, 0)
    • A.sum()
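Parallel initialization and reduction have a loose counterpart in Java parallel streams. The sketch below fills a two-dimensional array with `n*i + j` and sums it with a reduction; it runs on one JVM, so the distribution of the array across places is not modelled, and `ArraySketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical Java counterpart of parallel array initialization and reduce.
class ArraySketch {
    // Like: new int[D] (point [i,j]) { return N*i+j; }
    static int[][] init(int rows, int cols, int n) {
        int[][] a = new int[rows][cols];
        // Each row is written by exactly one task, so no synchronization is needed.
        IntStream.range(0, rows).parallel().forEach(i -> {
            for (int j = 0; j < cols; j++) a[i][j] = n * i + j;
        });
        return a;
    }

    // Like: A.reduce(intArray.add, 0)
    static int sum(int[][] a) {
        return Arrays.stream(a)
                     .parallel()
                     .flatMapToInt(Arrays::stream)
                     .reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(sum(init(3, 4, 10)));
    }
}
```

Addition is associative with identity 0, which is what allows the runtime to combine partial sums in any grouping.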

ateach, foreach
  ateach ( FormalParam : Expression ) Statement
  foreach ( FormalParam : Expression ) Statement
  • ateach (point p : A) S
    • Creates |region(A)| async statements
    • Instance p of statement S is executed at the place where A[p] is located
  • foreach (point p : R) S
    • Creates |R| async statements in parallel at current place
  • Termination of all activities can be ensured using finish.

    public boolean run() {
      distribution D = distribution.factory.block(TABLE_SIZE);
      long[.] table = new long[D] (point [i]) { return i; };
      long[.] RanStarts = new long[distribution.factory.unique()]
        (point [i]) { return starts(i); };
      long[.] SmallTable = new long value[TABLE_SIZE]
        (point [i]) { return i*S_TABLE_INIT; };
      finish ateach (point [i] : RanStarts) {
        long ran = nextRandom(RanStarts[i]);
        for (int count : 1:N_UPDATES_PER_PLACE) {
          int J = f(ran);
          long K = SmallTable[g(ran)];
          async atomic table[J] ^= K;
          ran = nextRandom(ran);
        }
      }
      return table.sum() == EXPECTED_RESULT;
    }
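The update loop in this example can be sketched in Java with one parallel task per starting seed and an atomic XOR per table update. This is a simplified analogue: the random generator, the index function, and the constants are stand-ins chosen for illustration (the slide's `starts`, `f`, `g`, and `SmallTable` are not reproduced), and everything runs in one JVM rather than across places.

```java
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.stream.IntStream;

// Hypothetical single-JVM analogue of the random-access update loop.
class RandomAccessSketch {
    static final int TABLE_SIZE = 1 << 10;

    // Stand-in generator (a 64-bit linear congruential step), not the slide's.
    static long nextRandom(long x) {
        return x * 6364136223846793005L + 1442695040888963407L;
    }

    // One task per activity, like `finish ateach`; forEach returns when all are done.
    static void updateAll(AtomicLongArray table, int activities, int updatesEach) {
        IntStream.range(0, activities).parallel().forEach(p -> {
            long ran = nextRandom(p + 1);
            for (int c = 0; c < updatesEach; c++) {
                int j = (int) ((ran >>> 32) & (TABLE_SIZE - 1));
                // Like `atomic table[J] ^= K`: an indivisible read-xor-write.
                table.accumulateAndGet(j, ran, (x, y) -> x ^ y);
                ran = nextRandom(ran);
            }
        });
    }

    // XOR is its own inverse, so applying the same updates twice restores the table.
    static boolean demo() {
        AtomicLongArray table = new AtomicLongArray(TABLE_SIZE);
        for (int i = 0; i < TABLE_SIZE; i++) table.set(i, i);
        updateAll(table, 8, 1000);
        updateAll(table, 8, 1000);
        for (int i = 0; i < TABLE_SIZE; i++) {
            if (table.get(i) != i) return false;
        }
        return true;
    }
}
```

The self-inverse check only passes if no update is lost, so it also exercises the atomicity of each XOR.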

clocks
  • Operations
    • clock c = new clock();
    • c.resume();
        Signals completion of work by activity in this clock phase.
    • next;
        Blocks until all clocks it is registered on can advance. Implicitly resumes all clocks.
    • c.drop();
        Unregister activity with c.
    • async (P) clock (c1, …, cn) S
        (Clocked async): activity is registered on the clocks (c1, …, cn)
  • Static Semantics
    • An activity may operate only on those clocks it is live on.
    • In finish S, S may not contain any top-level clocked asyncs.
  • Dynamic Semantics
    • A clock c can advance only when all its registered activities have executed c.resume().
  • No explicit operation to register a clock.
  • Supports over-sampling, hierarchical nesting.
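Java's `java.util.concurrent.Phaser` offers operations that line up loosely with the clock operations listed here: `arriveAndAwaitAdvance()` behaves like `next`, and `arriveAndDeregister()` like `c.drop()`. The sketch below is an analogy under that assumption, with hypothetical names; it checks that no worker passes a phase boundary before every worker has finished the phase.

```java
import java.util.concurrent.Phaser;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical Phaser-based analogue of clocked activities.
class ClockSketch {
    // Returns how many times a worker saw a peer that had not finished the phase.
    static int demo(int workers, int phases) {
        Phaser clock = new Phaser(workers); // every worker is registered up front
        AtomicIntegerArray progress = new AtomicIntegerArray(workers);
        AtomicInteger violations = new AtomicInteger();
        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int id = w;
            threads[w] = new Thread(() -> {
                for (int p = 1; p <= phases; p++) {
                    progress.set(id, p);           // work of phase p
                    clock.arriveAndAwaitAdvance(); // like `next`
                    for (int o = 0; o < workers; o++) {
                        if (progress.get(o) < p) violations.incrementAndGet();
                    }
                }
                clock.arriveAndDeregister();       // like `c.drop()`
            });
            threads[w].start();
        }
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return violations.get();
    }
}
```

A non-blocking `clock.arrive()` would be the nearest match for `c.resume()`; it is left out here to keep the sketch short.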

Example: SpecJBB

    finish async {
      clock c = new clock();
      Company company = createCompany(...);
      for (int w : 0:wh_num)
        for (int t : 0:term_num)
          async clocked(c) { // a client
            initialize;
            next; //1.
            while (company.mode != STOP) {
              select a transaction;
              think;
              process the transaction;
              if (company.mode == RECORDING)
                record data;
              if (company.mode == RAMP_DOWN) {
                c.resume(); //2.
              }
            }
            gather global data;
          } // a client
      // master activity
      next; //1.
      company.mode = RAMP_UP;
      sleep rampuptime;
      company.mode = RECORDING;
      sleep recordingtime;
      company.mode = RAMP_DOWN;
      next; //2.
      // All clients in RAMP_DOWN
      company.mode = STOP;
    } // finish
    // Simulation completed.
    print results.

Formal semantics (FX10)
  • Based on Middleweight Java (MJ)
  • Configuration is a tree of located processes
    • Tree necessary for finish.
    • Clocks formalized using short circuits (PODC 88).
  • Bisimulation semantics.
  • Basic theorems
    • Equational laws
    • Clock quiescence is stable.
    • Monotonicity of places.
    • Deadlock freedom (for language w/out when).
    • … Type Safety
    • … Memory Safety

Current Status
  • We have an operational X10 0.41 implementation
    • All X10 programs shown here run.
  • Timeline
    • 09/03 PERCS Kickoff
    • 02/04 X10 Kickoff
    • 07/04 X10 0.32 Spec Draft
    • 02/05 X10 Prototype #1
    • 07/05 X10 Productivity Study
    • 12/05 X10 Prototype #2
    • 06/06 Open Source Release?
  [Figure: translator pipeline: X10 source, Parser (Grammar), Annotated AST, Analysis passes, Code emitter (Code Templates), Target Java, JVM with X10 Multithreaded RTS and native code, Program output, PEM Events]
  • Structure
    • Translator based on Polyglot (Java compiler framework)
    • X10 extensions are modular.
    • Uses Jikes parser generator.
  • Code metrics (* classes+interfaces/LOC)
    • Parser: ~45/14K*
    • Translator: ~112/9K
    • RTS: ~190/10K
    • Polyglot base: ~517/80K
    • Approx 180 test cases.
  • Limitations
    • Clocked final not yet implemented.
    • Type-checking incomplete.
    • No type inference.
    • Implicit syntax not supported.

Future Work: Implementation
  • Type checking/inference
    • Clocked types
    • Place-aware types
  • Consistency management
    • Lock assignment for atomic sections
    • Data-race detection
  • Activity aggregation
    • Batch activities into a single thread.
  • Message aggregation
    • Batch “small” messages.
  • Load-balancing
    • Dynamic, adaptive migration of places from one processor to another.
  • Continuous optimization
  • Efficient implementation of scan/reduce
  • Efficient invocation of components in foreign languages
    • C, Fortran
  • Garbage collection across multiple places
  Welcome University Partners and other collaborators.

Future work: Other topics
  • Design/Theory
    • Atomic blocks
    • Structural study of concurrency and distribution
    • Clocked types
    • Hierarchical places
    • Weak memory model
    • Persistence/Fault tolerance
    • Database integration
  • Tools
    • Refactoring language.
  • Applications
    • Several HPC programs planned currently.
    • Also: web-based applications.
  Welcome University Partners and other collaborators.

Backup material

Type system
  • Value classes
    • May only have final fields.
    • May only be subclassed by value classes.
    • Instances of value classes can be copied freely between places.
  • nullable is a type constructor
    • nullable T contains the values of T and null.
  • Place types: [email protected], specify the place at which the data object lives.
  • Future work: Include generics and dependent types.

Example: Latch

    public interface future {
      boolean forced();
      Object force();
    }

    public class boxed {
      nullable Object val;
    }

    public class Latch implements future {
      protected boolean forced = false;
      protected boxed result = new boxed();
      protected nullable exception z = null;

      public atomic boolean setValue(nullable Object val, nullable exception z) {
        if ( forced ) return false;
        // these assignments happen only once.
        this.result.val = val;
        this.z = z;
        this.forced = true;
        return true;
      }

      public atomic boolean forced() {
        return forced;
      }

      public Object force() {
        when ( forced ) {
          if (z != null) throw z;
          return result.val;
        }
      }
    }
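The set-once behaviour of the latch can be checked with a small Java monitor-based version. This is a sketch by analogy: `atomic` methods become `synchronized` methods, the `when (forced)` guard becomes a wait loop, and the names `LatchSketch` and `demo` are hypothetical.

```java
// Hypothetical Java analogue of the Latch example.
class LatchSketch {
    private boolean forced;
    private Object result;
    private RuntimeException failure;

    // Succeeds only for the first caller; later calls leave the latch unchanged.
    synchronized boolean setValue(Object val, RuntimeException z) {
        if (forced) return false;
        result = val;
        failure = z;
        forced = true;
        notifyAll(); // wake activities blocked in force()
        return true;
    }

    synchronized boolean forced() {
        return forced;
    }

    // Blocks until a value (or exception) has been set, then returns or rethrows it.
    synchronized Object force() {
        while (!forced) {
            try {
                wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        if (failure != null) throw failure;
        return result;
    }

    // The second setValue must be rejected and the first value must win.
    static String demo() {
        LatchSketch latch = new LatchSketch();
        boolean first = latch.setValue("first", null);
        boolean second = latch.setValue("second", null);
        return first + ":" + second + ":" + latch.force();
    }
}
```

Java's `CompletableFuture` provides a comparable set-once result with blocking retrieval, for readers who prefer a library type over a hand-written monitor.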