
  • Number of slides: 119

Autotuning sparse matrix kernels
Richard Vuduc
Center for Applied Scientific Computing (CASC), Lawrence Livermore National Laboratory
February 28, 2007

Predictions (2003)
Ø Need for “autotuning” will increase over time
  Ø Improve performance for given app & machine using automated experiments
Ø Example: Sparse matrix-vector multiply (SpMV), 1987 to present
  Ø Untuned: 10% of peak or less, decreasing
  Ø Tuned: 2x speedup, increasing over time
Ø Tuning is getting harder (qualitative)
  Ø More complex machines & workloads
  Ø Parallelism

Trends in uniprocessor SpMV performance (Mflop/s), pre-2004

Trends in uniprocessor SpMV performance (fraction of peak)

Is tuning getting easier?

// y <-- y + A*x
for all A(i, j):
  y(i) += A(i, j) * x(j)

// Compressed sparse row (CSR)
for each row i:
  t = 0
  for k = ptr[i] to ptr[i+1]-1:
    t += A[k] * x[J[k]]
  y[i] = t

• Exploit 8x8 dense blocks
• As r x c increases, Mflop/s increases
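The CSR pseudocode above can be written as a small, runnable C routine. This is a minimal sketch following the slide's naming (`ptr`, `J`, `A`); the function name is illustrative:

```c
/* Sketch of the CSR SpMV loop from the slide: y <- y + A*x.
   ptr[i]..ptr[i+1]-1 index the nonzeros of row i; J holds their
   column indices and A their values. */
static void spmv_csr(int nrows, const int *ptr, const int *J,
                     const double *A, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i + 1]; k++)
            t += A[k] * x[J[k]];
        y[i] += t;  /* accumulate, matching y <- y + A*x */
    }
}
```

The inner loop has an indirect load `x[J[k]]` per multiply, which is what makes untuned SpMV run so far below peak.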

Speedups on Itanium 2: The need for search
Best Mflop/s (31.1% of peak); Reference Mflop/s (7.6% of peak)

Speedups on Itanium 2: The need for search
Best Mflop/s (31.1% of peak), at block size 4x2; Reference Mflop/s (7.6% of peak)

SpMV Performance—raefsky3

Better, worse, or about the same? Itanium 2, 900 MHz → 1.3 GHz

Better, worse, or about the same? Itanium 2, 900 MHz → 1.3 GHz
* Reference improves *

Better, worse, or about the same? Itanium 2, 900 MHz → 1.3 GHz
* Best possible worsens slightly *

Better, worse, or about the same? Pentium M → Core 2 Duo (1-core)

Better, worse, or about the same? Pentium M → Core 2 Duo (1-core)
* Reference & best improve; relative speedup improves (~1.4x to 1.6x) *

Better, worse, or about the same? Pentium M → Core 2 Duo (1-core)
* Note: Best fraction of peak decreased from 11% to 9.6% *

Better, worse, or about the same? Power4 → Power5

Better, worse, or about the same? Power4 → Power5
* Reference worsens! *

Better, worse, or about the same? Power4 → Power5
* Relative importance of tuning increases *

A framework for performance tuning
Source: SciDAC Performance Engineering Research Institute (PERI)

Outline
Ø Motivation
Ø OSKI: An autotuned sparse kernel library
Ø Application-specific optimization “in the wild”
Ø Toward end-to-end application autotuning
Ø Summary and future work

OSKI: Optimized Sparse Kernel Interface
Ø Autotuned kernels for user’s matrix & machine
  Ø BLAS-style interface: mat-vec (SpMV), triangular solve (TrSV), …
  Ø Hides complexity of run-time tuning
  Ø Includes fast locality-aware kernels: ATA*x, …
Ø Faster than standard implementations
  Ø Standard SpMV < 10% of peak, vs. up to 31% with OSKI
  Ø Up to 4x faster SpMV, 1.8x TrSV, 4x ATA*x, …
Ø For “advanced” users & solver library writers
  Ø PETSc extension available (OSKI-PETSc)
  Ø Kokkos (for Trilinos) by Heroux
  Ø Adopted by ClearShape, Inc. for shipping product (2x speedup)

Tunable matrix-specific optimization techniques
Ø Optimizations for SpMV
  Ø Register blocking (RB): up to 4x over CSR
  Ø Variable block splitting: 2.1x over CSR, 1.8x over RB
  Ø Diagonals: 2x over CSR
  Ø Reordering to create dense structure + splitting: 2x over CSR
  Ø Symmetry: 2.8x over CSR, 2.6x over RB
  Ø Cache blocking: 3x over CSR
  Ø Multiple vectors (SpMM): 7x over CSR
  Ø And combinations…
Ø Sparse triangular solve
  Ø Hybrid sparse/dense data structure: 1.8x over CSR
Ø Higher-level kernels
  Ø AAT*x, ATA*x: 4x over CSR, 1.8x over RB
  Ø Ak*x: 2x over CSR, 1.5x over RB
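Register blocking, the first technique in the list, stores each nonzero cluster as a small dense r x c block so the inner loop can keep its accumulators in registers and amortize index loads. A minimal 2x2 BCSR SpMV sketch (array names `bptr`, `bind`, `val` are illustrative, not OSKI's internals):

```c
/* Illustrative 2x2 BCSR SpMV: y += A*x. Each stored block holds
   r*c = 4 consecutive values in val; bptr indexes block rows and
   bind gives each block's block-column. One column index covers
   four nonzeros, and y0/y1 stay in registers across the row. */
static void spmv_bcsr2x2(int nbrows, const int *bptr, const int *bind,
                         const double *val, const double *x, double *y)
{
    for (int I = 0; I < nbrows; I++) {
        double y0 = 0.0, y1 = 0.0;
        for (int k = bptr[I]; k < bptr[I + 1]; k++) {
            const double *b  = val + 4 * k;
            const double *xp = x + 2 * bind[k];
            y0 += b[0] * xp[0] + b[1] * xp[1];
            y1 += b[2] * xp[0] + b[3] * xp[1];
        }
        y[2 * I]     += y0;
        y[2 * I + 1] += y1;
    }
}
```

The cost is explicit zero fill inside partially full blocks, which is why the best r x c must be searched for rather than fixed.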

Tuning for workloads
Ø Bi-conjugate gradients: equal mix of A*x and AT*y
  Ø 3x1: Ax, ATy = 1053, 343 Mflop/s → 517 Mflop/s combined
  Ø 3x3: Ax, ATy = 806, 826 Mflop/s → 816 Mflop/s combined
Ø Higher-level operation: combined (Ax, ATy) kernel
  Ø 3x1: 757 Mflop/s
  Ø 3x3: 1400 Mflop/s
Ø Matrix powers (Ak*x) with data structure transformations
  Ø A2*x: up to 2x faster
  Ø New latency-tolerant solvers? (Hoemmen’s thesis, ongoing at UCB)

How OSKI tunes (Overview)
Library Install-Time (offline) | Application Run-Time

How OSKI tunes (Overview)
Library Install-Time (offline):
  1. Build for target arch.
  2. Benchmark
  → Generated code variants, benchmark data
Application Run-Time

How OSKI tunes (Overview)
Library Install-Time (offline):
  1. Build for target arch.
  2. Benchmark
  → Generated code variants, benchmark data
Application Run-Time:
  Matrix, workload from program monitoring, history
  1. Evaluate models (heuristic models)

How OSKI tunes (Overview)
Extensibility: Advanced users may write & dynamically add “code variants” and “heuristic models” to the system.
Library Install-Time (offline):
  1. Build for target arch.
  2. Benchmark
  → Generated code variants, benchmark data
Application Run-Time:
  Matrix, workload from program monitoring, history
  1. Evaluate models (heuristic models)
  2. Select data structure & code
  → To user: matrix handle for kernel calls

OSKI’s place in the tuning framework

Examples of OSKI’s early impact
Ø Early adopter: ClearShape, Inc.
  Ø Core product: lithography simulator
  Ø 2x speedup on full simulation after using OSKI
Ø Proof-of-concept: SLAC T3P accelerator cavity design simulator
  Ø SpMV dominates execution time
  Ø Symmetry, 2x2 block structure
  Ø 2x speedups

OSKI-PETSc Performance: Accel. Cavity

Strengths and limitations of the library approach
Ø Strengths
  Ø Isolates optimization in the library for portable performance
  Ø Exploits domain-specific information aggressively
  Ø Handles run-time tuning naturally
Ø Limitations
  Ø “Generation Me”: What about my application and its abstractions?
  Ø Run-time tuning: run-time overheads
  Ø Limited context for optimization (without delayed evaluation)
  Ø Limited extensibility (fixed interfaces)

Outline
Ø Motivation
Ø OSKI: An autotuned sparse kernel library
Ø Application-specific optimization “in the wild”
Ø Toward end-to-end application autotuning
Ø Summary and future work

Tour of application-specific optimizations
Ø Five case studies
Ø Common characteristics
  Ø Complex code
    Ø Heavy use of abstraction
    Ø Use generated code (e.g., SWIG C++/Python bindings)
  Ø Benefit from extensive code and data restructuring
  Ø Multiple bottlenecks

[1] Loop transformations for SMG2000
Ø SMG2000 implements semi-coarsening multigrid on structured grids (ASC Purple benchmark)
Ø Residual computation has an SpMV bottleneck
Ø Loop below looks simple but is non-trivial to extract

for (si = 0; si < NS; ++si)
  for (k = 0; k < NZ; ++k)
    for (j = 0; j < NY; ++j)
      for (i = 0; i < NX; ++i)
        r[i + j*JR + k*KR] -= A[i + j*JA + k*KA + SA[si]]
                            * x[i + j*JX + k*KX + Sx[si]]

[1] SMG2000 demo

[1] Before transformation

for (si = 0; si < NS; si++) {           /* Loop 1 */
  for (kk = 0; kk < NZ; kk++) {         /* Loop 2 */
    for (jj = 0; jj < NY; jj++) {       /* Loop 3 */
      for (ii = 0; ii < NX; ii++) {     /* Loop 4 */
        r[ii + jj*Jr + kk*Kr] -=
            A[ii + jj*JA + kk*KA + SA[si]]
          * x[ii + jj*Jx + kk*Kx + Sx[si]];
      } /* Loop 4 */
    } /* Loop 3 */
  } /* Loop 2 */
} /* Loop 1 */

[1] After transformation, including interchange, unrolling, and prefetching

for (kk = 0; kk < NZ; kk++) {           /* Loop 2 */
  for (jj = 0; jj < NY; jj++) {         /* Loop 3 */
    for (si = 0; si < NS; si++) {       /* Loop 1 */
      double* rp = r + kk*Kr + jj*Jr;
      const double* Ap = A + kk*KA + jj*JA + SA[si];
      const double* xp = x + kk*Kx + jj*Jx + Sx[si];
      for (ii = 0; ii <= NX-3; ii += 3) {   /* core Loop 4 */
        _mm_prefetch (Ap + PFD_A, _MM_HINT_NTA);
        _mm_prefetch (xp + PFD_X, _MM_HINT_NTA);
        rp[0] -= Ap[0] * xp[0];
        rp[1] -= Ap[1] * xp[1];
        rp[2] -= Ap[2] * xp[2];
        rp += 3; Ap += 3; xp += 3;
      } /* core Loop 4 */
      for ( ; ii < NX; ii++) {              /* fringe Loop 4 */
        rp[0] -= Ap[0] * xp[0];
        rp++; Ap++; xp++;
      } /* fringe Loop 4 */
    } /* Loop 1 */
  } /* Loop 3 */
} /* Loop 2 */

[1] Loop transformations for SMG2000
Ø 2x speedup on kernel from specialization, loop interchange, unrolling, prefetching
  Ø But only 1.25x overall (multiple bottlenecks)
Ø Lesson: Need complex sequences of transformations
  Ø Use profiling to guide
  Ø Inspect run-time data for specialization
  Ø Transformations are automatable
Ø Research topic: Automated specialization of hypre?

[2] Slicing and dicing W3P
Ø Accelerator design code from SLAC
  Ø calcBasis() very expensive
  Ø Scaling problems as |Eigensystem| grows
  Ø In principle, loop interchange or precomputation via slicing possible

/* Post-processing phase */
foreach mode in Eigensystem
  foreach elem in Mesh
    b = calcBasis (elem)
    f = calcField (b, mode)

[2] Slicing and dicing W3P
Ø Accelerator design code
  Ø calcBasis() very expensive
  Ø Scaling problems as |Eigensystem| grows
  Ø In principle, loop interchange or precomputation via slicing possible
Ø Challenges in practice
  Ø “Loop nest” ~500+ LOC
  Ø 150+ LOC to calcBasis() in 6-deep call chain, 4-deep loop nest, 2 conditionals
  Ø File I/O
  Ø Changes must be unobtrusive

/* Post-processing phase */
foreach mode in Eigensystem
  foreach elem in Mesh
    // { …
    b = calcBasis (elem)
    // }
    f = calcField (b, mode)
writeDataToFiles (…);

[2] W3P: Impact and lessons
Ø 4-5x speedup for post-processing step; 1.5x overall
Ø Changes “checked in”
Ø Lesson: Need clean source-level transformations
  Ø To automate, need robust program analysis and developer guidance
Ø Research: Annotation framework for developers [w/ Quinlan, Schordan, Yi: POHLL’06]

[3] Structure splitting
Ø Convert (array of structs) into (struct of arrays)
  Ø Improve spatial locality through increased stride-1 accesses
  Ø Make code hardware-prefetch and vector/SIMD unit “friendly”

struct Type {
  double p;
  double x, y, z;
  double E;
  int k;
} X[N], Y[N];

for (i = 0; i < N; i++)
  Y[i].E += Y[X[i].k].p;

double Xp[N];
double Xx[N], Xy[N], Xz[N];
double XE[N];
int Xk[N];
// … same for Y …

for (i = 0; i < N; i++)
  YE[i] += Yp[Xk[i]];
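The before/after fragments above can be made into a self-contained comparison. This is a sketch with illustrative field and function names (not from KULL), showing that the split form computes the same result while turning the E-update into stride-1 traffic:

```c
#define N 4

/* Array of structs: updating E drags whole structs through cache. */
struct Particle { double p, x, y, z, E; int k; };

/* Struct of arrays after splitting: E and p live in dense arrays. */
struct Particles { double p[N], x[N], y[N], z[N], E[N]; int k[N]; };

static void update_aos(struct Particle *Y, const struct Particle *X)
{
    for (int i = 0; i < N; i++)
        Y[i].E += Y[X[i].k].p;
}

static void update_soa(struct Particles *Y, const struct Particles *X)
{
    for (int i = 0; i < N; i++)
        Y->E[i] += Y->p[X->k[i]];   /* stride-1 on E, gather on p */
}
```

The slide's caveats apply directly: the split doubles the declarations, and any code that passed `struct Type*` around must change or pay for copies.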

[3] Structure splitting: Impact and challenges
Ø 2x speedup on a KULL benchmark (suggested by Brian Miller)
Ø Implementation challenges
  Ø Potentially affects entire code
  Ø Can apply only locally, at a cost
    Ø Extra storage
    Ø Overhead of copying
  Ø Tedious to do by hand
Ø Lesson: Extensive data restructuring may be necessary
Ø Research: When and how best to split?

[4] Finding a loop-fusion needle in a haystack
Ø Interprocedural loop-fusion finder [w/ B. White: Cornell U.]
  Ø Known example had 2x speedup on benchmark (Miller)
  Ø Built “abstraction-aware” analyzer using ROSE
    Ø First pass: Associate “loop signatures” with each function
    Ø Second pass: Propagate signatures through call chains

for (Zone::iterator z = zones.begin (); z != zones.end (); ++z)
  for (Corner::iterator c = (*z).corners().begin (); …)
    for (int s = 0; s < c->sides().size(); s++)
      …

[4] Finding a loop-fusion needle in a haystack
Ø Found 6 examples of 3- and 4-deep nested loops
  Ø “Analysis-only” tool
  Ø Finds, though does not verify/transform
Ø Lesson: “Classical” optimizations relevant to abstraction use
Ø Research
  Ø Recognizing and optimizing abstractions [White’s thesis, ongoing]
  Ø Extending traditional optimizations to abstraction use

[5] Aggregating messages (ongoing)
Ø Idea: Merge sends (suggested by Miller)

Before:
DataType A;
// … operations on A …
A.allToAll();
// …
DataType B;
// … operations on B …
B.allToAll();

After:
DataType A;
// … operations on A …
// …
DataType B;
// … operations on B …
bulkAllToAll(A, B);

Ø Implementing a fully automated translator to find and transform
Ø Research: When and how best to aggregate?

Summary of application-specific optimizations
Ø Like the library-based approach, exploit knowledge for big gains
  Ø Guidance from developer
  Ø Use run-time information
Ø Would benefit from automated transformation tools
  Ø Real code is hard to process
  Ø Changes may become part of software re-engineering
  Ø Need robust analysis and transformation infrastructure
  Ø Range of tools possible: analysis and/or transformation
Ø No silver bullets or magic compilers

Outline
Ø Motivation
Ø OSKI: An autotuned sparse kernel library
Ø “Real world” optimization
Ø Toward end-to-end application autotuning
Ø Summary and future work

A framework for performance tuning
Source: SciDAC Performance Engineering Research Institute (PERI)

OSKI’s place in the tuning framework

An empirical tuning framework using ROSE
(Components: gprof, HPCtoolkit, Open SpeedShop; POET; search engine)

An end-to-end autotuning framework using ROSE
Ø Guiding philosophy
  Ø Leverage external stand-alone components
  Ø Provide open components and tools for community
Ø User or “system” profiles to collect data and/or analyses
Ø In ROSE
  Ø Mark up AST with data/analysis to identify optimizable target(s)
  Ø Outline target into stand-alone dynamically loadable library routine
  Ø Make “benchmark” by inserting checkpoint library calls into app
  Ø Generate parameterized representation of target
  Ø Independent search engine performs search

Interfaces to performance tools
Ø Mark up AST with data, analysis, to identify optimizable target(s)
  Ø gprof
  Ø HPCToolkit [Mellor-Crummey: Rice]
  Ø VizzAnalyzer / Vizz3D [Panas: LLNL]
  Ø In progress: Open SpeedShop [Schulz: LLNL]
Ø Needed: Analysis to identify targets

Outlining
Ø Outline target into dynamically loadable library routine
  Ø Extends initial implementations by Liao [U. Houston], Jula [TAMU]
Ø Handles many details of C & C++
  Ø Wraps up variables, inserts declarations, generates call
  Ø Produces suitable interfaces for dynamic loading
  Ø Handles non-local control flow

void OUT_38725__ (double* r, int JR, int KR, const double* A, …)
{
  int si, j, k, i;
  for (si = 0; si < NS; si++)
    …
      r[i + j*JR + k*KR] -= A[i + …

Making a benchmark
Ø Make “benchmark” by inserting checkpoint library calls
  Ø Measure application behavior “in context”
  Ø Use ckpt (user-level) [Zander: U. Wisc.]
  Ø Insert timing code (cycle counter)
  Ø May insert arbitrary code to distinguish calling contexts
Ø Reasonably fast in practice
  Ø Checkpoint read/write bandwidth: 500 MB/s on my Pentium M
  Ø For SMG2000: problem consuming ~500 MB footprint takes ~30 s to run
Ø Needed
  Ø Best procedure to get accurate and fair comparisons?
    Ø Do restarts resume in comparable states?
  Ø More portable checkpoint library

Example of “benchmark” (pseudo)code

static int num_calls = 0;         // no. of invocations of outlined code
if (!num_calls) {
  ckpt ();                        // Checkpoint/resume
  OUT_38725__ = dlsym (…);        // Load an implementation
  startTimer ();
}
OUT_38725__ (…);                  // outlined call site
if (++num_calls == CALL_LIMIT) {  // Measured CALL_LIMIT calls
  stopTimer ();
  outputTime ();
  exit (0);
}

Generating parameterized representations
Ø Generate parameterized representation of target
  Ø POET: Embedded scripting language for expressing parameterized code variations [see POHLL’07]
  Ø Loop optimizer will generate POET for each target
  Ø New parameterization for loop fusion [Zhao, Kennedy: Rice; Yi: UTSA]
Ø Hand-coded POET for SMG2000
  Ø Interchange
  Ø Machine-specific: Unrolling, prefetching
  Ø Source-specific: register & restrict keywords, C pointer idiom

SMG2000 kernel POET instantiation

for (kk = 0; kk < NZ; kk++) {           /* L4 */
  for (jj = 0; jj < NY; jj++) {         /* L3 */
    for (si = 0; si < NS; si++) {       /* L1 */
      double* rp = r + kk*Kr + jj*Jr;
      const double* Ap = A + kk*KA + jj*JA + SA[si];
      const double* xp = x + kk*Kx + jj*Jx + Sx[si];
      for (ii = 0; ii <= NX-3; ii += 3) {   /* core L2 */
        _mm_prefetch (Ap + PFD_A, _MM_HINT_NTA);
        _mm_prefetch (xp + PFD_X, _MM_HINT_NTA);
        rp[0] -= Ap[0] * xp[0];
        rp[1] -= Ap[1] * xp[1];
        rp[2] -= Ap[2] * xp[2];
        rp += 3; Ap += 3; xp += 3;
      } /* core L2 */
      for ( ; ii < NX; ii++) {              /* fringe L2 */
        rp[0] -= Ap[0] * xp[0];
        rp++; Ap++; xp++;
      } /* fringe L2 */
    } /* L1 */
  } /* L3 */
} /* L4 */

Search
Ø We are search-engine agnostic
Ø Many possible hybrid modeling/search techniques

Summary of autotuning compiler approach
Ø End-to-end framework leverages existing work
  Ø ROSE provides a heavy-duty (robust) source-level infrastructure
  Ø Assemble stand-alone components
Ø Current and future work
  Ø Assembling a more complete end-to-end example
  Ø Interfaces between components?
  Ø Extending basic ROSE infrastructure, particularly program analysis

Current and future research directions
Ø Autotuning
  Ø End-to-end autotuning compiler framework
  Ø Tuning for novel architectures (e.g., multicore)
  Ø Tools for generating domain-specific libraries
Ø Performance modeling
  Ø Kernel- and machine-specific analytical and statistical models
  Ø Hybrid symbolic/empirical modeling
  Ø Implications for applications and architectures?
Ø Tools for debugging massively parallel applications
  Ø JitterBug [w/ Schulz, Quinlan, de Supinski, Saebjoernsen]
  Ø Static/dynamic analyses for debugging MPI

End

What is ROSE?
Ø Research: Develop techniques to optimize applications that rely heavily on high-level abstractions
  Ø Target scientific computing apps relevant to DOE/LLNL
  Ø Domain-specific analysis and optimization
  Ø Optimize use of object-oriented abstractions
  Ø Performance portability via empirical tuning
Ø Infrastructure: Tool for building source-to-source optimizers
  Ø Full compiler: basic program analysis, loop optimizer, OpenMP [UH]
  Ø Support for C, C++; Fortran 90 in progress
  Ø Target “non-compiler audience”
  Ø Open source

Bug hunting in MPI programs
Ø Motivation: MPI is a large, complex API
Ø Bug pattern detectors
  Ø Check basic API usage
  Ø Adapt existing tools: MPI-CHECK; FindBugs; Farchi, et al. VC’05
Ø Tasks requiring deeper program analysis
  Ø Properly matched sends/receives, barriers, collectives
  Ø Buffer errors, e.g., overruns, read before non-blocking op completes
  Ø Temporal usage properties
  Ø See error survey by DeSouza, Kuhn, & de Supinski ’05
  Ø Extend existing analyses by Shires, et al., PDPTA’99; Strout, et al., ICPP’06

Compiler-based testing tools
Ø Instrumentation and dynamic analysis to measure coverage [IBM]
Ø Measurement-unit validation via Osprey [Jiang and Su, UC Davis]
Ø Numerical interval/bounds analysis [Sun]
Ø Interface to MOPS model checker [Collingbourne, Imperial College]
Ø Interactive program visualization via VizzAnalyzer [Panas, LLNL]

Trends in uniprocessor SpMV performance (absolute Mflop/s)

Trends in uniprocessor SpMV performance (fraction of peak)

Motivation: The Difficulty of Tuning SpMV

// y <-- y + A*x
for all A(i, j):
  y(i) += A(i, j) * x(j)

Motivation: The Difficulty of Tuning SpMV

// y <-- y + A*x
for all A(i, j):
  y(i) += A(i, j) * x(j)

// Compressed sparse row (CSR)
for each row i:
  t = 0
  for k = ptr[i] to ptr[i+1]-1:
    t += A[k] * x[J[k]]
  y[i] = t

Motivation: The Difficulty of Tuning SpMV

// y <-- y + A*x
for all A(i, j):
  y(i) += A(i, j) * x(j)

// Compressed sparse row (CSR)
for each row i:
  t = 0
  for k = ptr[i] to ptr[i+1]-1:
    t += A[k] * x[J[k]]
  y[i] = t

• Exploit 8x8 dense blocks

Speedups on Itanium 2: The Need for Search
Best Mflop/s (31.1% of peak); Reference Mflop/s (7.6% of peak)

Speedups on Itanium 2: The Need for Search
Best Mflop/s (31.1% of peak), at block size 4x2; Reference Mflop/s (7.6% of peak)

SpMV Performance—raefsky3

Better, worse, or about the same? Pentium 4, 1.5 GHz → Xeon, 3.2 GHz

Better, worse, or about the same? Pentium 4, 1.5 GHz → Xeon, 3.2 GHz
* Faster, but relative improvement increases (20% to ~50%) *

Problem-Specific Performance Tuning

Problem-Specific Optimization Techniques
Ø Optimizations for SpMV
  Ø Register blocking (RB): up to 4x over CSR
  Ø Variable block splitting: 2.1x over CSR, 1.8x over RB
  Ø Diagonals: 2x over CSR
  Ø Reordering to create dense structure + splitting: 2x over CSR
  Ø Symmetry: 2.8x over CSR, 2.6x over RB
  Ø Cache blocking: 3x over CSR
  Ø Multiple vectors (SpMM): 7x over CSR
  Ø And combinations…
Ø Sparse triangular solve
  Ø Hybrid sparse/dense data structure: 1.8x over CSR
Ø Higher-level kernels
  Ø AAT*x, ATA*x: 4x over CSR, 1.8x over RB
  Ø Ak*x: 2x over CSR, 1.5x over RB

BCSR Captures Regularly Aligned Blocks
Ø n = 21216
Ø nnz = 1.5 M
Ø Source: NASA structural analysis problem
Ø 8x8 dense substructure
Ø Reduces storage

Problem: Forced Alignment
Ø BCSR(2x2)
  Ø Stored / true nz = 1.24

Problem: Forced Alignment
Ø BCSR(2x2)
  Ø Stored / true nz = 1.24
Ø BCSR(3x3)
  Ø Stored / true nz = 1.46

Problem: Forced Alignment Implies UBCSR
Ø BCSR(2x2)
  Ø Stored / true nz = 1.24
Ø BCSR(3x3)
  Ø Stored / true nz = 1.46
  Ø Forces i mod 3 = j mod 3 = 0
Ø Unaligned BCSR format:
  Ø Store row indices
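The "store row indices" idea can be sketched concretely. In this illustrative UBCSR SpMV (array names are mine, not the paper's), each block row carries an explicit starting row, and block columns are plain element indices, so a 2x2 block may begin at any (i, j) rather than only at even offsets:

```c
/* Illustrative unaligned BCSR (UBCSR) SpMV for 2x2 blocks: y += A*x.
   brow[I] is the explicit starting row of block row I (need not be a
   multiple of 2); bind[k] is the starting column of block k. */
static void spmv_ubcsr2x2(int nblockrows, const int *brow,
                          const int *bptr, const int *bind,
                          const double *val, const double *x, double *y)
{
    for (int I = 0; I < nblockrows; I++) {
        int i = brow[I];                     /* unaligned row start */
        double y0 = 0.0, y1 = 0.0;
        for (int k = bptr[I]; k < bptr[I + 1]; k++) {
            const double *b  = val + 4 * k;
            const double *xp = x + bind[k];  /* unaligned column start */
            y0 += b[0] * xp[0] + b[1] * xp[1];
            y1 += b[2] * xp[0] + b[3] * xp[1];
        }
        y[i]     += y0;
        y[i + 1] += y1;
    }
}
```

The extra row-index array is the price paid to avoid the fill that aligned BCSR forces on blocks straddling the r x c grid.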

The Speedup Gap: BCSR vs. CSR
(Figure: speedup BCSR/CSR per machine; a 1.1-1.5x gap remains)

Approach: Splitting + Relaxed Block Alignment
Ø Goal: Close the gap between FEM classes
Ø Our approach: Capture actual structure more precisely
  Ø Split: A = A1 + A2 + … + As
  Ø Store each Ai in unaligned BCSR (UBCSR) format
    Ø Relax both row and column alignment
    Ø Buttari, et al. (2005) show improvements from relaxed column alignment
  Ø 2.1x over no blocking, 1.8x over blocking
  Ø When not faster than BCSR, may still reduce storage

Variable Block Row (VBR) Analysis
Ø Partition by grouping consecutive rows/columns having the same pattern

From VBR, Identify Multiple Natural Block Sizes

VBR with Fill
Ø Can also pad by matching rows/columns with nearly similar patterns
Ø Define VBR(q) = VBR where consecutive rows are grouped when “similarity” ≥ q, with 0 ≤ q ≤ 1
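The VBR(q) rule can be sketched as follows. The slide does not pin down the similarity measure, so this sketch assumes one plausible choice: the fraction of column indices two consecutive rows share, relative to the longer row:

```c
/* Sketch of the VBR(q) grouping test. cols_a/cols_b are the sorted
   column indices of two consecutive rows; similarity is defined here
   (as an assumption, not from the talk) as |intersection| / max(|a|,|b|). */
static double row_similarity(const int *cols_a, int na,
                             const int *cols_b, int nb)
{
    int ia = 0, ib = 0, common = 0;
    while (ia < na && ib < nb) {        /* sorted-list intersection */
        if (cols_a[ia] == cols_b[ib]) { common++; ia++; ib++; }
        else if (cols_a[ia] < cols_b[ib]) ia++;
        else ib++;
    }
    int max_n = na > nb ? na : nb;
    return max_n ? (double) common / max_n : 1.0;
}

/* Group row b into row a's block row when similarity >= q. */
static int should_group(const int *ca, int na,
                        const int *cb, int nb, double q)
{
    return row_similarity(ca, na, cb, nb) >= q;
}
```

With q = 1 only identical patterns merge (no fill); lowering q admits near-matches at the cost of explicit zero padding, which is the 1% fill the next slide shows.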

VBR with Fill: q = 1 vs. q = 0.7 (fill of 1%)

A Complex Tuning Problem
Ø Many parameters need “tuning”
  Ø Fill threshold, 0.5 ≤ q ≤ 1
  Ø Number of splittings, 2 ≤ s ≤ 4
  Ø Ordering of block sizes, ri x ci, with rs x cs = 1x1
Ø See paper in HPCC 2005 for proof-of-concept experiments based on a semi-exhaustive search
  Ø Heuristic in progress (uses Buttari, et al. (2005) work)

FEM matrices

Matrix                          | Dimension | # non-zeros | Dominant blocks
10-ct20stif (engine block)      | 52k       | 2.7M        | 6x6 (39%), 3x3 (15%)
12-raefsky4 (buckling)          | 20k       | 1.3M        | 3x3 (96%)
13-ex11 (fluid flow)            | 16k       | 1.1M        | 1x1 (38%), 3x3 (23%)
15-Vavasis3 (2D PDE)            | 41k       | 1.7M        | 2x1 (81%), 2x2 (19%)
17-rim (fluid flow)             | 23k       | 1.0M        | 1x1 (75%), 3x1 (12%)
A-bmw7st_1 (car chassis)        | 141k      | 7.3M        | 6x6 (82%)
B-cop20k_m (accel. cavity)      | 121k      | 4.8M        | 2x1 (26%), 1x2 (26%), 1x1 (26%), 2x2 (22%)
C-pwtk (wind tunnel)            | 218k      | 11.6M       | 6x6 (94%)
D-rma10 (Charleston Harbor)     | 47k       | 2.4M        | 2x2 (17%), 3x2 (15%), 2x3 (15%), 4x2 (9%), 2x4 (9%)
E-s3dkqm4 (cylindrical shell)   | 90k       | 4.8M        | 6x6 (99%)

Power4 Performance

Storage Savings

Traveling Salesman Problem-based Reordering
Ø Application: Stanford accelerator design problem (Omega3P)
Ø Reorder by approximately solving a TSP [Pinar & Heath ’97]
  Ø Nodes = columns of A
  Ø Weights(u, v) = no. of nonzeros u, v have in common
  Ø Tour = ordering of columns
  Ø Choose maximum-weight tour
Ø Also: symmetric storage, register blocking
  Ø Manually selected optimizations
Ø Problem: High cost of computing an approximate solution to the TSP
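The edge weight in the TSP formulation above is easy to state in code. A small sketch (names are illustrative): each column is represented by the sorted list of rows in which it has a nonzero, and the weight of edge (u, v) counts the rows the two columns share, so a maximum-weight tour places columns with overlapping patterns next to each other, creating dense blocks after reordering:

```c
/* TSP edge weight from the slide: number of rows in which both
   columns u and v have a nonzero. rows_u/rows_v are sorted lists
   of the row indices of each column's nonzeros. */
static int tsp_weight(const int *rows_u, int nu,
                      const int *rows_v, int nv)
{
    int iu = 0, iv = 0, w = 0;
    while (iu < nu && iv < nv) {       /* sorted-list intersection */
        if (rows_u[iu] == rows_v[iv]) { w++; iu++; iv++; }
        else if (rows_u[iu] < rows_v[iv]) iu++;
        else iv++;
    }
    return w;
}
```

Computing all pairwise weights is O(n²) intersections, which is one face of the high cost the slide flags.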

100x100 Submatrix Along Diagonal

“Microscopic” Effect of Combined RCM+TSP Reordering
Before: Green + Red; After: Green + Blue

Inter-Iteration Sparse Tiling (1/3)
Ø Idea: Strout, et al., ICCS 2001
Ø Let A be 5x5 tridiagonal
Ø Consider y = A²x
  Ø t = Ax, y = At
Ø Nodes: vector elements; edges: matrix elements aij
(Figure: dependency graph over nodes x1…x5, t1…t5, y1…y5)

Inter-Iteration Sparse Tiling (2/3)
Ø Idea: Strout, et al., ICCS 2001
Ø Let A be 5x5 tridiagonal
Ø Consider y = A²x
  Ø t = Ax, y = At
Ø Nodes: vector elements; edges: matrix elements aij
Ø Orange = everything needed to compute y1
  Ø Reuse a11, a12
(Figure: dependency graph over nodes x1…x5, t1…t5, y1…y5)

Inter-Iteration Sparse Tiling (3/3)

Ø Idea: Strout, et al., ICCS 2001
Ø Let A be 5x5 tridiagonal; consider y = A²x
  Ø t = Ax, y = At
Ø Nodes: vector elements (x1–x5, t1–t5, y1–y5)
Ø Edges: matrix elements aij
Ø Orange = everything needed to compute y1
  Ø Reuse a11, a12
Ø Grey = everything needed to compute y2, y3
  Ø Reuse a23, a33, a43
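The reuse these tiles expose can be seen in a serial sketch for the tridiagonal case (a plain-C illustration of the idea, not the actual generated tiled code): y[i-1] is finalized as soon as t[i] is produced, so each t value and matrix row is consumed while still in cache instead of in a second full sweep.

```c
/* y = A*(A*x) for an n x n tridiagonal A (n >= 2), stored by diagonals:
 * lo[i] = a(i,i-1), di[i] = a(i,i), up[i] = a(i,i+1).
 * The two products t = A*x and y = A*t are fused into one sweep. */
void fused_sq_spmv(int n, const double *lo, const double *di,
                   const double *up, const double *x,
                   double *t, double *y)
{
    for (int i = 0; i < n; i++) {
        /* Produce t[i] = (A*x)[i]. */
        t[i] = di[i] * x[i];
        if (i > 0)     t[i] += lo[i] * x[i - 1];
        if (i < n - 1) t[i] += up[i] * x[i + 1];

        /* t[i] is the last input y[i-1] depends on: finalize it now,
         * while t[i-2..i] are still hot. */
        if (i >= 1) {
            y[i - 1] = di[i - 1] * t[i - 1] + up[i - 1] * t[i];
            if (i >= 2)
                y[i - 1] += lo[i - 1] * t[i - 2];
        }
    }
    y[n - 1] = lo[n - 1] * t[n - 2] + di[n - 1] * t[n - 1];
}
```

For A = tridiag(1, 2, 1) and x = (1,…,1), this yields the same y as two separate mat-vecs, but with one pass over t.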

Serial Sparse Tiling Performance (Itanium 2)

OSKI Software Architecture and API

Empirical Model Evaluation

Ø Tuning loop
  Ø Compute a “tuning time budget” based on workload
  Ø While (time remains and no tuning chosen): try a heuristic
Ø Heuristic for blocked SpMV: choose r x c to minimize estimated run time
Ø Tuning for workloads
  Ø Weighted sums of empirical models
  Ø Dynamic programming for alternatives
  Ø Example: combined y = AᵀAx vs. separate (w = Ax, y = Aᵀw)
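In the OSKI papers the blocked-SpMV heuristic minimizes an estimate of the form fill(r, c) / perf(r, c), where perf comes from an offline machine benchmark of dense-blocked SpMV and fill from a runtime sample of A. The sketch below implements that argmin over a small candidate grid; the function name, grid, and all numbers in the test are made up for illustration.

```c
#define NSIZES 3
static const int sizes[NSIZES] = {1, 2, 4};   /* candidate r and c values */

/* Estimated time for r x c blocking is proportional to the fill ratio
 * (extra explicitly stored zeros) divided by the measured dense-blocked
 * Mflop/s rate for that block shape. Pick the (r, c) with the minimum. */
void choose_block_size(double fill[NSIZES][NSIZES],
                       double mflops[NSIZES][NSIZES],
                       int *r_out, int *c_out)
{
    double best = fill[0][0] / mflops[0][0];
    *r_out = sizes[0];
    *c_out = sizes[0];
    for (int i = 0; i < NSIZES; i++)
        for (int j = 0; j < NSIZES; j++) {
            double est = fill[i][j] / mflops[i][j];
            if (est < best) {
                best = est;
                *r_out = sizes[i];
                *c_out = sizes[j];
            }
        }
}
```

This is why no profitable block size is ever "obvious" from the matrix alone: a shape with higher fill can still win if the machine runs it disproportionately faster.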

Cost of Tuning

Ø Non-trivial run-time cost: up to ~40 mat-vecs
  Ø Dominated by conversion time (~80%)
Ø Design point: user calls “tune” routine explicitly
  Ø Exposes cost
  Ø Tuning time limited using estimated workload
    Ø Provided by user or inferred by library
Ø User may save tuning results
  Ø To apply on future runs with similar matrix
  Ø Stored in “human-readable” format

Interface supports legacy app migration

int* ptr = …, *ind = …;
double* val = …;             /* Matrix A, in CSR format */
double* x = …, *y = …;       /* Vectors */

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
    my_matmult( ptr, ind, val, α, x, β, y );
r = ddot (x, y);             /* Some dense BLAS op on vectors */

Interface supports legacy app migration

int* ptr = …, *ind = …;
double* val = …;             /* Matrix A, in CSR format */
double* x = …, *y = …;       /* Vectors */

/* Step 1: Create OSKI wrappers */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val,
    num_rows, num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
    my_matmult( ptr, ind, val, α, x, β, y );
r = ddot (x, y);

Interface supports legacy app migration

int* ptr = …, *ind = …;
double* val = …;             /* Matrix A, in CSR format */
double* x = …, *y = …;       /* Vectors */

/* Step 1: Create OSKI wrappers */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val,
    num_rows, num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

/* Step 2: Call tune (with optional hints) */
oski_SetHintMatMult(A_tunable, …, 500);
oski_TuneMat(A_tunable);

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
    my_matmult( ptr, ind, val, α, x, β, y );
r = ddot (x, y);

Interface supports legacy app migration

int* ptr = …, *ind = …;
double* val = …;             /* Matrix A, in CSR format */
double* x = …, *y = …;       /* Vectors */

/* Step 1: Create OSKI wrappers */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val,
    num_rows, num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

/* Step 2: Call tune (with optional hints) */
oski_SetHintMatMult(A_tunable, …, 500);
oski_TuneMat(A_tunable);

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
    oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view);  // Step 3
r = ddot (x, y);

Quick-and-dirty Parallelism: OSKI-PETSc

Ø Extend PETSc’s distributed memory SpMV (MATMPIAIJ)
Ø PETSc
  Ø Each process stores diag (all-local) and off-diag submatrices
Ø OSKI-PETSc:
  Ø Add OSKI wrappers
  Ø Each submatrix tuned independently
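What each process computes under the MATMPIAIJ split can be sketched serially (not actual PETSc code; ghost entries of x are assumed already gathered by communication): the local result is the diag submatrix times the local part of x plus the off-diag submatrix times the ghost values. That separation is exactly why the two submatrices can be wrapped and tuned independently.

```c
/* y += A*x for a CSR matrix with nrows rows. */
static void spmv_csr_acc(int nrows, const int *ptr, const int *ind,
                         const double *val, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++)
        for (int k = ptr[i]; k < ptr[i + 1]; k++)
            y[i] += val[k] * x[ind[k]];
}

/* Per-process SpMV: y = A_diag * x_local + A_offd * x_ghost.
 * In OSKI-PETSc, each of the two submatrices gets its own OSKI
 * wrapper and is tuned on its own. */
void mpiaij_spmv(int nrows,
                 const int *dptr, const int *dind, const double *dval,
                 const int *optr, const int *oind, const double *oval,
                 const double *x_local, const double *x_ghost, double *y)
{
    for (int i = 0; i < nrows; i++)
        y[i] = 0.0;
    spmv_csr_acc(nrows, dptr, dind, dval, x_local, y);  /* local part  */
    spmv_csr_acc(nrows, optr, oind, oval, x_ghost, y);  /* ghost part  */
}
```

The off-diag piece is typically much sparser and less structured than the diag piece, so the two can end up with different tuned formats.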

OSKI-PETSc Proof-of-Concept Results

Ø Matrix 1: Accelerator cavity design (R. Lee @ SLAC)
  Ø N ~ 1M, ~40M non-zeros
  Ø 2x2 dense block substructure
  Ø Symmetric
Ø Matrix 2: Linear programming (Italian Railways)
  Ø Short-and-fat: 4k x 1M, ~11M non-zeros
  Ø Highly unstructured
  Ø Big speedup from cache-blocking: no native PETSc format
Ø Evaluation machine: Xeon cluster
  Ø Peak: 4.8 Gflop/s per node

Accelerator cavity matrix from SLAC’s T3P code

Additional Features: OSKI-Lua

Ø Embedded scripting language for selecting customized, complex transformations
Ø Mechanism to save/restore transformations

/* In “my_app.c” */
fp = fopen(“my_xform.txt”, “rt”);
fgets(buffer, BUFSIZE, fp);
oski_ApplyMatTransform(A_tunable, buffer);
oski_MatMult(A_tunable, …);

# In file, “my_xform.txt”
# Compute Afast = P*A*Pᵀ using Pinar’s reordering algorithm
A_fast, P = reorder_TSP(InputMat);
# Split Afast = A1 + A2, where A1 in 2x2 block format, A2 in CSR
A1, A2 = A_fast.extract_blocks(2, 2);
return transpose(P)*(A1+A2)*P;

Current Work and Future Directions

Current and Future Work on OSKI

Ø OSKI 1.0.1 at bebop.cs.berkeley.edu/oski
  Ø “Pre-alpha” version of OSKI-PETSc available; “beta” for Kokkos (Trilinos)
Ø Future work
  Ø Evaluation on full solves/apps
    Ø Bay Area lithography shop: 2x speedup in full solve
  Ø Code generators
  Ø Studying use of higher-level OSKI kernels
  Ø Port to additional architectures (e.g., vectors, SMPs)
  Ø Additional heuristics [Buttari, et al. (2005)]
Ø Many BeBOP projects on-going
  Ø SpMV benchmark for HPC-Challenge [Gavari & Hoemmen]
  Ø Evaluation of Cell [Williams]
  Ø Higher-level kernels, solvers [Hoemmen, Nishtala]
  Ø Tuning collective communications [Nishtala]
  Ø Cache-oblivious stencils [Kamil]

ROSE: A Compiler-Based Approach to Tuning General Applications

Ø ROSE: tool for building customized source-to-source tools (Quinlan, et al.)
  Ø Full support for C and C++; Fortran 90 in development
  Ø Targets users with little or no compiler background
Ø Focus on performance optimization for scientific computing
  Ø Domain-specific analysis and optimizations
  Ø Object-oriented abstraction recognition
  Ø Rich loop-transformation support
  Ø Annotation language support
Ø Additional infrastructure support for s/w assurance, testing, and debugging
Ø Toward an end-to-end empirical tuning compiler
  Ø Combines profiling, checkpointing, analysis, parameterized code generation, search
  Ø Joint work with Qing Yi (University of Texas at San Antonio)
  Ø Sponsored by the U.S. Department of Energy

ROSE Architecture

Application + annotations → Front-end (EDG-based) → AST → Mid-end → Back-end → Transformed application source

Ø Mid-end passes: abstraction recognition, abstraction-aware analysis, abstraction elimination, extended traditional optimizations, source+AST transformations
Ø Library interfaces contribute source fragments and AST fragments
Ø AST tools operate on the AST between the front-end and mid-end