
- Number of slides: 39
Partitioned Global Address Space Languages Kathy Yelick Lawrence Berkeley National Laboratory and UC Berkeley Joint work with The Titanium Group: S. Graham, P. Hilfinger, P. Colella, D. Bonachea, K. Datta, E. Givelberg, A. Kamil, N. Mai, A. Solar, J. Su, T. Wen The Berkeley UPC Group: C. Bell, D. Bonachea, W. Chen, J. Duell, P. Hargrove, P. Husbands, C. Iancu, R. Nishtala, M. Welcome PGAS Languages 1 Kathy Yelick
The 3 P’s of Parallel Computing • Productivity • Global address space supports construction of complex shared data structures • High level constructs (e.g., multidimensional arrays) simplify programming • Performance • PGAS Languages are Faster than two-sided MPI • Some surprising hints on performance tuning • Portability • These languages are nearly ubiquitous Kathy Yelick, 2
Partitioned Global Address Space • Global address space: any thread/process may directly read/write data allocated by another • Partitioned: data is designated as local (near) or global (possibly far); programmer controls layout [figure: memories of threads p0 … pn, each holding private variables x and y, local pointers l, and global pointers g that may refer to another thread's portion of the shared space] • By default: object heaps are shared, program stacks are private • Current languages: UPC, CAF, and Titanium • Emphasis in this talk on UPC & Titanium (the latter based on Java) Kathy Yelick, 3
PGAS Language Overview • Many common concepts, although specifics differ • Consistent with base language • Both private and shared data • int x[10]; and shared int y[10]; • Support for distributed data structures • Distributed arrays; local and global pointers/references • One-sided shared-memory communication • Simple assignment statements: x[i] = y[i]; or t = *p; • Bulk operations: memcpy in UPC, array ops in Titanium and CAF • Synchronization • Global barriers, locks, memory fences • Collective Communication, IO libraries, etc. Kathy Yelick, 4
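A minimal UPC sketch of these concepts follows (illustrative only, not from the talk): x is private (one copy per thread), y is shared with per-element affinity, and the pointer-to-shared p turns a plain assignment into a one-sided remote read. It assumes compilation with a fixed thread count (e.g. upcc -T4) and at most 10 threads so the index stays in bounds.

    /* Minimal UPC sketch: private vs. shared data and a one-sided read
       through a pointer-to-shared. Assumes static THREADS <= 10. */
    #include <upc.h>
    #include <stdio.h>

    int x[10];          /* private: every thread has its own copy                  */
    shared int y[10];   /* shared: element i has affinity to thread i % THREADS    */

    int main(void) {
        shared int *p = &y[(MYTHREAD + 1) % THREADS];  /* may point to a remote element */
        int i;

        upc_forall (i = 0; i < 10; i++; &y[i])   /* each thread writes the elements it owns */
            y[i] = i;
        upc_barrier;

        x[0] = *p;      /* one-sided read: no matching receive on the owning thread */
        printf("thread %d read %d\n", MYTHREAD, x[0]);
        return 0;
    }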
Example: Titanium Arrays • Ti arrays are created using Domains and indexed using Points: double [3d] gridA = new double [[0, 0, 0] : [10, 10, 10]]; • Eliminates some loop bound errors using foreach: foreach (p in gridA.domain()) gridA[p] = gridA[p]*c + gridB[p]; • Rich domain calculus allows slicing, subarrays, transposes and other operations without data copies • Array copy operations automatically work on the intersection: data[neighborPos].copy(mydata); [figure: ghost-cell exchange — mydata and data[neighborPos] overlap in the intersection (copied) region between the “restrict”-ed (non-ghost) cells and the ghost cells] Kathy Yelick, 5
Productivity: Line Count Comparison • Comparison of NAS Parallel Benchmarks • UPC version has modest programming effort relative to C • Titanium even more compact, especially for MG, which uses multi-d arrays • Caveat: Titanium FT has a user-defined Complex type and uses cross-language support to call FFTW for the serial 1D FFTs UPC results from Tarek El-Ghazawi et al; CAF from Chamberlain et al; Titanium joint with Kaushik Datta & Dan Bonachea Kathy Yelick, 6
Case Study 1: Block-Structured AMR • Adaptive Mesh Refinement (AMR) is challenging • Irregular data accesses and control from boundaries • Mixed global/local view is useful • Titanium AMR benchmarks available. AMR Titanium work by Tong Wen and Philip Colella Kathy Yelick, 7
AMR in Titanium • C++/Fortran/MPI AMR (Chombo package from LBNL): bulk-synchronous communication, boundary data explicitly packed between procs • Titanium AMR: entirely in Titanium, finer-grained communication, no explicit pack/unpack code (automated in the runtime system) Code size in lines: AMR data structures — C++/Fortran/MPI 35000, Titanium 2000; AMR operations — 6500 vs. 1200; Elliptic PDE solver — 4200* vs. 1500 (* somewhat more functionality in the PDE part of the Chombo code) 10X reduction in lines of code! Elliptic PDE solver running time (secs): serial — C++/Fortran/MPI 57, Titanium 53; parallel (28 procs) — C++/Fortran/MPI 113, Titanium 126. Comparable running time. Work by Tong Wen and Philip Colella; communication optimizations joint with Jimmy Su Kathy Yelick, 8
Immersed Boundary Simulation in Titanium • Modeling elastic structures in an incompressible fluid. • Blood flow in the heart, blood clotting, inner ear, embryo growth, and many more • Complicated parallelization • Particle/Mesh method • “Particles” connected into materials Code size in lines: Fortran 8000, Titanium 4000. Joint work with Ed Givelberg, Armando Solar-Lezama Kathy Yelick, 9
The 3 P’s of Parallel Computing • Productivity • Global address space supports complex shared structures • High level constructs simplify programming • Performance • PGAS Languages are Faster than two-sided MPI • Better match to most HPC networks • Some surprising hints on performance tuning • Send early and often is sometimes best • Portability • These languages are nearly ubiquitous Kathy Yelick, 10
PGAS Languages: High Performance Strategy for acceptance of a new language • Make it run faster than anything else Keys to high performance • Parallelism: • Scaling the number of processors • Maximize single node performance • Generate friendly code or use tuned libraries (BLAS, FFTW, etc.) • Avoid (unnecessary) communication cost • Latency, bandwidth, overhead • Berkeley UPC and Titanium use GASNet communication layer • Avoid unnecessary delays due to dependencies • Load balance; Pipeline algorithmic dependencies Kathy Yelick, 11
One-Sided vs Two-Sided [figure: a one-sided put message carries the destination address along with the data payload and is deposited directly into memory by the network interface; a two-sided message carries only a message id and must be matched by the host CPU before the payload can be placed] • A one-sided put/get message can be handled directly by a network interface with RDMA support • Avoid interrupting the CPU or storing data from CPU (preposts) • A two-sided message needs to be matched with a receive to identify the memory address where the data should be put • Offloaded to the network interface in networks like Quadrics • Need to download match tables to the interface (from the host) Kathy Yelick, 12
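To make the one-sided style concrete, a minimal UPC sketch (buffer names and sizes are invented for the example): the initiating thread supplies both the payload and its destination inside the shared array, so the target never posts a receive; the equivalent two-sided MPI transfer would need a matching MPI_Recv to tell the library where the payload lands.

    /* Minimal UPC sketch of a one-sided bulk put (illustrative names/sizes). */
    #include <upc.h>
    #include <string.h>

    #define CHUNK 1024
    shared [CHUNK] double ghost[CHUNK * THREADS];  /* CHUNK slots with affinity to each thread */
    double mine[CHUNK];                            /* private send buffer */

    int main(void) {
        int neighbor = (MYTHREAD + 1) % THREADS;
        memset(mine, 0, sizeof mine);

        /* Deposit this thread's chunk directly into the neighbor's block;
           on RDMA-capable networks the neighbor's CPU is not involved. */
        upc_memput(&ghost[neighbor * CHUNK], mine, sizeof mine);

        upc_barrier;   /* barrier implies a fence: all puts visible before any reads */
        return 0;
    }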
(up is good) Performance Advantage of One-Sided Communication: GASNet vs MPI • Opteron/InfiniBand (Jacquard at NERSC): • GASNet’s vapi-conduit and OSU MPI 0.9.5 MVAPICH • Half power point (N½) differs by one order of magnitude Joint work with Paul Hargrove and Dan Bonachea Kathy Yelick, 13
(down is good) GASNet: Portability and High-Performance GASNet better for latency across machines Joint work with UPC Group; GASNet design by Dan Bonachea Kathy Yelick, 14
(up is good) GASNet: Portability and High-Performance GASNet at least as high (comparable) for large messages Joint work with UPC Group; GASNet design by Dan Bonachea Kathy Yelick, 15
(up is good) GASNet: Portability and High-Performance GASNet excels at mid-range sizes: important for overlap Joint work with UPC Group; GASNet design by Dan Bonachea Kathy Yelick, 16
Case Study 2: NAS FT • Performance of Exchange (Alltoall) is critical • 1D FFTs in each dimension, 3 phases • Transpose after first 2 for locality • Bisection bandwidth-limited • Problem as #procs grows • Three approaches: • Exchange: wait for 2nd-dimension FFTs to finish, send 1 message per processor pair • Slab: wait for a chunk of rows destined for 1 proc, send when ready • Pencil: send each row as it completes Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea Kathy Yelick, 17
Overlapping Communication • Goal: make use of “all the wires all the time” • Schedule communication to avoid network backup • Trade-off: overhead vs. overlap • Exchange has fewest messages, less message overhead • Slabs and pencils have more overlap; pencils the most • Example message sizes, Class D problem on 256 processors: Exchange (all data at once) 512 Kbytes; Slabs (contiguous rows that go to 1 processor) 64 Kbytes; Pencils (single row) 16 Kbytes Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea Kathy Yelick, 18
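A rough UPC sketch of the pencil idea follows (illustrative only: the array sizes, the destination computation, and the stand-in fft_1d are invented, and the real NAS FT code uses Berkeley UPC's non-blocking put extensions where this sketch uses a blocking upc_memput):

    /* Pencil-style decomposition: ship each row as soon as its 1D FFT is done. */
    #include <upc.h>

    #define NX   64
    #define NROW 8     /* pencils (rows) owned by each thread */

    shared [NX*NROW] double transposed[NX*NROW*THREADS];  /* one block per thread */
    double plane[NX*NROW];                                 /* this thread's rows   */

    /* Stand-in for a per-row 1D FFT; only the communication pattern matters here. */
    static void fft_1d(double *row, int n) { for (int i = 0; i < n; i++) row[i] *= 2.0; }

    int main(void) {
        int dest = (MYTHREAD + 1) % THREADS;  /* toy destination; real FT computes it per row */
        for (int r = 0; r < NROW; r++) {
            fft_1d(&plane[r*NX], NX);                      /* finish one pencil...            */
            upc_memput(&transposed[dest*NX*NROW + r*NX],   /* ...and send it right away       */
                       &plane[r*NX], NX*sizeof(double));   /* (non-blocking in the real code) */
        }
        upc_barrier;   /* all pencils delivered before the next FFT phase */
        return 0;
    }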
NAS FT Variants Performance Summary [chart annotation: .5 Tflops] • Slab is always best for MPI; small message cost too high • Pencil is always best for UPC; more overlap Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea Kathy Yelick, 19
Case Study 3: LU Factorization • Direct methods have complicated dependencies • Especially with pivoting (unpredictable communication) • Especially for sparse matrices (dependence graph with holes) • LU Factorization in UPC • Use overlap ideas and multithreading to mask latency • Multithreaded: UPC threads + user threads + threaded BLAS • Panel factorization: including pivoting • Update to a block of U • Trailing submatrix updates • Status: • Dense LU done: HPL-compliant • Sparse version underway Joint work with Parry Husbands Kathy Yelick, 20
UPC HPL Performance • MPI HPL numbers from the HPCC database • Large scaling: 2.2 TFlops on 512p, 4.4 TFlops on 1024p (Thunder) • Comparison to ScaLAPACK on an Altix, a 2x4 process grid • ScaLAPACK (block size 64) 25.25 GFlop/s (tried several block sizes) • UPC LU (block size 256) 33.60 GFlop/s, (block size 64) 26.47 GFlop/s • n = 32000 on a 4x4 process grid • ScaLAPACK 43.34 GFlop/s (block size = 64) • UPC 70.26 GFlop/s (block size = 200) Kathy Yelick, 21 Joint work with Parry Husbands
The 3 P’s of Parallel Computing • Productivity • Global address space supports complex shared structures • High level constructs simplify programming • Performance • PGAS Languages are Faster than two-sided MPI • Some surprising hints on performance tuning • Portability • These languages are nearly ubiquitous • Source-to-source translators are key • Combined with portable communication layer • Specialized compilers are useful in some cases Kathy Yelick, 22
Portability of Titanium and UPC • Titanium and the Berkeley UPC translator use a similar model • Source-to-source translator (generates ISO C) • Runtime layer implements global pointers, etc. • Common communication layer (GASNet), also used by gcc/upc • Both run on most PCs, SMPs, clusters & supercomputers • Supported operating systems: Linux, FreeBSD, Tru64, AIX, IRIX, HPUX, Solaris, Cygwin, MacOSX, Unicos, SuperUX • UPC translator somewhat less portable: we provide an http-based compile server • Supported CPUs: x86, Itanium, Alpha, Sparc, PowerPC, PA-RISC, Opteron • GASNet communication: Myrinet GM, Quadrics Elan, Mellanox Infiniband VAPI, IBM LAPI, Cray X1, SGI Altix, Cray/SGI SHMEM, and (for portability) MPI and UDP • Specific supercomputer platforms: HP AlphaServer, Cray X1, IBM SP, NEC SX-6, Cluster X (Big Mac), SGI Altix 3000 • Underway: Cray XT3, BG/L (both run over MPI) • Can be mixed with MPI, C/C++, Fortran Joint work with Titanium and UPC groups Kathy Yelick, 23
Portability of PGAS Languages Other compilers also exist for PGAS Languages • UPC: • GCC/UPC by Intrepid: runs on GASNet • HP UPC for AlphaServers, clusters, … • MTU UPC uses the HP compiler on MPI (source to source) • Cray UPC • Co-Array Fortran: • Cray CAF Compiler: X1, X1E • Rice CAF Compiler (on ARMCI or GASNet), John Mellor-Crummey • Source to source • Processors: Pentium, Itanium 2, Alpha, MIPS • Networks: Myrinet, Quadrics, Altix, Origin, Ethernet • OS: Linux32 RedHat, IRIX, Tru64 NB: source-to-source requires cooperation by backend compilers Kathy Yelick, 24
Summary • PGAS languages offer performance advantages • Good match to RDMA support in networks • Smaller messages may be faster: • make better use of network: postpone bisection bandwidth pain • can also prevent cache thrashing for packing • PGAS languages offer productivity advantage • Order of magnitude in line counts for grid-based code in Titanium • Push decisions about packing/not into runtime for portability (advantage of language with translator vs. library approach) • Source-to-source translation • The way to ubiquity • Complement highly tuned machine-specific compilers Kathy Yelick, 25
End of Slides PGAS Languages 26 Kathy Yelick
Productizing BUPC • Recent Berkeley UPC release • Supports full 1.2 language spec • Supports collectives (tuning ongoing); memory model compliance • Supports UPC I/O (naïve reference implementation) • Large effort in quality assurance and robustness • Test suite: 600+ tests run nightly on 20+ platform configs • Tests correct compilation & execution of UPC and GASNet • >30,000 UPC compilations and >20,000 UPC test runs per night • Online reporting of results & hookup with bug database • Test suite infrastructure extended to support any UPC compiler • now running nightly with GCC/UPC + UPCR • also support HP-UPC, Cray UPC, … • Online bug reporting database • Over 1100 reports since Jan 03 • >90% fixed (excl. enhancement requests) Kathy Yelick, 27
Benchmarking • Next few UPC and MPI application benchmarks use the following systems • Myrinet: Myrinet 2000 PCI64B, P4-Xeon 2.2 GHz • InfiniBand: IB Mellanox Cougar 4X HCA, Opteron 2.2 GHz • Elan3: Quadrics QsNet1, Alpha 1 GHz • Elan4: Quadrics QsNet2, Itanium 2 1.4 GHz Kathy Yelick, 30
PGAS Languages: Key to High Performance One way to gain acceptance of a new language • Make it run faster than anything else Keys to high performance • Parallelism: • Scaling the number of processors • Maximize single node performance • Generate friendly code or use tuned libraries (BLAS, FFTW, etc.) • Avoid (unnecessary) communication cost • Latency, bandwidth, overhead • Avoid unnecessary delays due to dependencies • Load balance • Pipeline algorithmic dependencies Kathy Yelick, 31
Hardware Latency • Network latency is not expected to improve significantly • Overlapping communication automatically (Chen) • Overlapping manually in the UPC applications (Husbands, Welcome, Bell, Nishtala) • Language support for overlap (Bonachea) Kathy Yelick, 32
Effective Latency Communication wait time from other factors • Algorithmic dependencies • Use finer-grained parallelism, pipeline tasks (Husbands) • Communication bandwidth bottleneck • Message time is: Latency + 1/Bandwidth * Size • Too much aggregation hurts: wait for bandwidth term • De-aggregation optimization: automatic (Iancu); • Bisection bandwidth bottlenecks • Spread communication throughout the computation (Bell) Kathy Yelick, 33
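To make the trade-off concrete, a worked example with illustrative (not measured) numbers: write the per-message cost as T(n) = α + n/β, and assume for illustration a latency of α = 10 µs and a bandwidth of β = 250 MB/s. A 512 KB exchange-style message then costs roughly 10 + 2100 ≈ 2100 µs, dominated by the bandwidth term, while a 16 KB pencil costs roughly 10 + 66 ≈ 76 µs, small enough to hide behind computation, at the price of sending 32 times as many messages and paying the latency/overhead term each time. Too much aggregation means waiting on the bandwidth term; too little means the latency term dominates.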
Fine-grained UPC vs. Bulk-Synch MPI • How to waste money on supercomputers • Pack all communication into single message (spend memory bandwidth) • Save all communication until the last one is ready (add effective latency) • Send all at once (spend bisection bandwidth) • Or, to use what you have efficiently: • Avoid long wait times: send early and often • Use “all the wires, all the time” • This requires having low overhead! Kathy Yelick, 34
What You Won’t Hear Much About • Compiler/runtime/gasnet bug fixes, performance tuning, testing, … • >13,000 e-mail messages regarding cvs checkins • Nightly regression testing • 25 platforms, 3 compilers (head, opt-branch, gcc-upc) • Bug reporting • 1177 bug reports, 1027 fixed • Release scheduled for later this summer • Beta is available • Process significantly streamlined Kathy Yelick, 35
Take-Home Messages • Titanium offers tremendous gains in productivity • High level domain-specific array abstractions • Titanium is being used for real applications • Not just toy problems • Titanium and UPC are both highly portable • Run on essentially any machine • Rigorously tested and supported • PGAS Languages are Faster than two-sided MPI • Better match to most HPC networks • Berkeley UPC and Titanium benchmarks • Designed from scratch with the one-sided PGAS model • Focus on 2 scalability challenges: AMR and Sparse LU Kathy Yelick, 36
Titanium Background • Based on Java, a cleaner C++ • Classes, automatic memory management, etc. • Compiled to C and then machine code, no JVM • Same parallelism model as UPC and CAF • SPMD parallelism • Dynamic Java threads are not supported • Optimizing compiler • Analyzes global synchronization • Optimizes pointers, communication, memory Kathy Yelick, 37
Do these Features Yield Productivity? Joint work with Kaushik Datta, Dan Bonachea Kathy Yelick, 38
GASNet/X1 Performance [charts: single word put, single word get] • GASNet/X1 improves small message performance over shmem and MPI • Leverages global pointers on the X1 • Highlights advantage of languages vs. library approach Joint work with Christian Bell, Wei Chen and Dan Bonachea Kathy Yelick, 39
High Level Optimizations in Titanium • Irregular communication can be expensive • “Best” strategy differs by data size/distribution and machine parameters • E.g., packing, sending bounding boxes, or fine-grained access • Use of runtime optimizations • Inspector-executor • Performance on Sparse MatVec Mult [chart: speedup relative to MPI code (Aztec library); average and maximum speedup of the Titanium version relative to the Aztec version on 1 to 16 processors] • Results: best strategy differs within the machine on a single matrix (~50% better) Joint work with Jimmy Su Kathy Yelick, 40
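For readers unfamiliar with the inspector-executor pattern, the following hand-written UPC sketch illustrates the idea only (in Titanium this is done automatically by the compiler and runtime, and all names below are invented for the example): an inspector pass scans the sparsity structure once to record which remote vector entries will be needed, and the executor then fetches them with one-sided reads before running a purely local multiply.

    /* Inspector-executor sketch for sparse mat-vec in UPC (illustrative only). */
    #include <upc.h>

    #define BLK 512
    shared [BLK] double x[BLK * THREADS];   /* source vector: BLK entries per thread */

    /* Inspector: scan the column indices of the locally owned rows once and
       record each remote element of x the executor will need.
       (Deduplication and a packed communication schedule are omitted.) */
    void inspect(const int *colidx, int nnz, int *needed, int *nneeded) {
        int k, m = 0;
        for (k = 0; k < nnz; k++)
            if ((int)upc_threadof(&x[colidx[k]]) != MYTHREAD)
                needed[m++] = colidx[k];
        *nneeded = m;
    }

    /* Executor, communication half: fetch each needed value once into a private
       cache with one-sided reads; the multiply then runs entirely on local data. */
    void fetch(const int *needed, int nneeded, double *xcache) {
        int k;
        for (k = 0; k < nneeded; k++)
            xcache[k] = x[needed[k]];
    }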
Source to Source Strategy • Source-to-source translation strategy • Tremendous portability advantage • Still can perform significant optimizations • Relies on high quality back-end compilers and some coaxing in code generation • Use of “restrict” pointers in C • Understand multi-D array indexing (Intel/Itanium issue) • Support for pragmas like IVDEP • Robust vectorizers (X1, SSE, NEC, …) [chart annotation: 48x] • On machines with integrated shared memory hardware, need access to shared memory operations Joint work with Jimmy Su Kathy Yelick, 41
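As an illustration of the “coaxing” point, here is a hypothetical fragment of the kind of C a translator might emit (the function and names are invented, not taken from the Berkeley translator): restrict-qualified pointers promise the backend compiler that the operands do not alias, which is what allows it to vectorize the generated loop.

    /* Illustrative generated inner loop: restrict enables backend vectorization. */
    void axpy_row(double *restrict a, const double *restrict b,
                  const double *restrict c, int n, double s) {
        for (int i = 0; i < n; i++)
            a[i] = b[i] * s + c[i];   /* no aliasing, so the loop can be vectorized */
    }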