f193f2512b490cf5580748839a4ff425.ppt

- Количество слайдов: 59

The Future of LAPACK and Sca. LAPACK www. netlib. org/lapack-dev Jim Demmel UC Berkeley 28 Sept 2005

Outline • • • Motivation Participants Goals 1. Better numerics (faster and more accurate algorithms) 2. Expand contents (more functions, more parallel implementations) 3. Automate performance tuning 4. Improve ease of use 5. Better maintenance and support 6. Increase community involvement (this means you!) Questions for the audience Selected Highlights Concluding poem

Motivation • LAPACK and Sca. LAPACK are widely used – Adopted by Cray, Fujitsu, HP, IBM, IMSL, Math. Works, NAG, NEC, SGI, … – >52 M web hits @ Netlib (incl. CLAPACK, LAPACK 95) • Many ways to improve them, based on – – Own algorithmic research Enthusiastic participation of research community On-going user/vendor survey (url below) Opportunities and demands of new architectures, programming languages • New releases planned (NSF support) • Your feedback desired – www. netlib. org/lapack-dev

Participants • UC Berkeley: – Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Voemel, David Bindel, Yozo Hida, Jason Riedy, Jianlin Xia, Jiang Zhu, undergrads… • U Tennessee, Knoxville – Jack Dongarra, Victor Eijkhout, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov • Other Academic Institutions – UT Austin, UC Davis, Florida IT, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara – TU Berlin, FU Hagen, U Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb • Research Institutions – CERFACS, LBL • Industrial Partners – Cray, HP, Intel, Math. Works, NAG, SGI

Goal 1 – Better Numerics • Fastest algorithm providing “standard” backward stability – MRRR algorithm for symmetric eigenproblem / SVD: Parlett / Dhillon / Voemel / Marques / Willems – Up to 10 x faster HQR: Byers / Mathias / Braman – Extensions to QZ: Kågström / Kressner – Faster Hessenberg, tridiagonal, bidiagonal reductions: van de Geijn, Bischof / Lang , Howell / Fulton – Recursive blocked layouts for packed formats: Gustavson / Kågström / Elmroth / Jonsson/

Goal 1 – Better Numerics • Fastest algorithm providing “standard” backward stability – MRRR algorithm for symmetric eigenproblem / SVD: Parlett / Dhillon / Voemel / Marques / Willems – Up to 10 x faster HQR: Byers / Mathias / Braman – Extensions to QZ: Kågström / Kressner – Faster Hessenberg, tridiagonal, bidiagonal reductions: van de Geijn, Bischof / Lang , Howell / Fulton – Recursive blocked layouts for packed formats: Gustavson / Kågström / Elmroth / Jonsson/ • New: Most accurate algorithm providing “standard” speed – Iterative refinement for Ax=b, least squares • Assume availability of Extra Precise BLAS (Li/Hida/…) • www. netlib. org/blast-forum/ – Jacobi SVD: Drmaĉ/Veselić

Goal 1 – Better Numerics • Fastest algorithm providing “standard” backward stability – MRRR algorithm for symmetric eigenproblem / SVD: Parlett / Dhillon / Voemel / Marques / Willems – Up to 10 x faster HQR: Byers / Mathias / Braman – Extensions to QZ: Kågström / Kressner – Faster Hessenberg, tridiagonal, bidiagonal reductions: van de Geijn, Bischof / Lang , Howell / Fulton – Recursive blocked layouts for packed formats, Gustavson / Kågström / Elmroth / Jonsson/ • New: Most accurate algorithm providing “standard” speed – Iterative refinement for Ax=b, least squares • Assume availability of Extra Precise BLAS (Li/Hida/…) • www. netlib. org/blast-forum/ – Jacobi SVD: Drmaĉ/Veselić • Condition estimates for (almost) everything (ongoing)

Goal 1 – Better Numerics • Fastest algorithm providing “standard” backward stability – MRRR algorithm for symmetric eigenproblem / SVD: Parlett / Dhillon / Voemel / Marques / Willems – Up to 10 x faster HQR: Byers / Mathias / Braman – Extensions to QZ: Kågström / Kressner – Faster Hessenberg, tridiagonal, bidiagonal reductions: van de Geijn, Bischof / Lang , Howell / Fulton – Recursive blocked layouts for packed formats, Gustavson / Kågström / Elmroth / Jonsson/ • New: Most accurate algorithm providing “standard” speed – Iterative refinement for Ax=b, least squares • Assume availability of Extra Precise BLAS (Li/Hida/…) • www. netlib. org/blast-forum/ – Jacobi SVD: Drmaĉ/Veselić • Condition estimates for (almost) everything (ongoing) • What is not fast or accurate enough?

What goes into Sca/LAPACK? For all linear algebra problems For all matrix structures For all data types For all architectures and networks For all programming interfaces Produce best algorithm(s) w. r. t. performance and accuracy (including condition estimates, etc) Need to automate and prioritize!

Goal 2 – Expanded Content • Make content of Sca. LAPACK mirror LAPACK as much as possible – Full Automation would be nice, not yet robust, general enough • Telescoping languages, Bernoulli, Rose, FLAME, … • New functions (examples) – Updating / downdating of factorizations: Stewart, Langou – More generalized SVDs: Bai , Wang – More generalized Sylvester/Lyapunov eqns: Kågström, Jonsson, Granat – Structured eigenproblems • O(n 2) version of roots(p) – Gu, Chandrasekaran, Zhu et al • Selected matrix polynomials: Mehrmann • How should we prioritize missing functions?

Goal 3 – Automate Performance Tuning • Not just BLAS • 1300 calls to ILAENV() to get block sizes, etc. – Never been systematically tuned • Extend automatic tuning techniques of ATLAS, etc. to these other parameters – Automation important as architectures evolve • Convert Sca. LAPACK data layouts on the fly • How important is peak performance?

Goal 4: Improved Ease of Use • Which do you prefer? A B CALL PDGESV( N , NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, INFO) CALL PDGESVX( FACT, TRANS, N , NRHS, A, IA, JA, DESCA, AF, IAF, JAF, DESCAF, IPIV, EQUED, R, C, B, IB, JB, DESCB, X, IX, JX, DESCX, RCOND, FERR, BERR, WORK, LWORK, IWORK, LIWORK, INFO)

Goal 4: Improved Ease of Use, Software Engineering (1) • Development versus Research – Development: practical approach to produce useful code – Research: If we could start over and do it right, how would we? • Life after F 77? – Fortran 95, C, C++, Java, Matlab, Python, … • Easy interfaces vs access to details – No universal agreement across systems on “easiest interface” – Leave decision to higher level packages – Keep expert driver / simple driver / computational routines • Conclusion: Subset of F 95 for core + wrappers for drivers – What subset? • Recursion, for new data structures • Modules, to produce multiple precision versions • Environmental enquiries, to replace x. LAMCH – Wrappers for Fortran 95, Java, Matlab, Python, … even for CLAPACK • Automatic memory allocation of workspace

Goal 4: Improved Ease of Use, Software Engineering (2) • Why not full F 95 for core? – Would make interfacing to other languages and packages harder – Some users want control over memory allocation • Why not C for core? – High cost/benefit ratio for full rewrite – Performance • Automation would be nice – Use Babel/SIDL to produce native looking interfaces when possible

Precisions beyond double (1) • Range of designs possible – Just run in quad (or some other fixed precision) – Support codes like #bits = 32 Repeat #bits = 2* #bits Solve(A, b, x, error_bound, #bits) Until error_bound < tol

Precisions beyond double (2) • Easiest approach – fixed precision – Use F 95 modules to produce any precision on request – Could use QD, ARPREC, GMP, … – Keep current memory allocation (twiddle GMP…) • Next easier approach – maximum precision – Build maximum allowable precision on request – Pass in precision parameter, up to this amount • More flexible (and difficult) approach – Choose any precision at run time – Dynamically allocate all variables • Most aggressive approach – New algorithms that minimize work to get desired prec. • What do users want? – Compatibility with symbolic manipulation systems?

Goal 4: Improved Ease of Use, Software Engineering (3) • Research Issues – May or may not impact development • How to map multiple software layers to emerging architectural layers? • Are emerging HPCS languages better? • How much can we automate? Do we keep having to write Gaussian elimination over and over again? • Statistical modeling to limit performance tuning costs, improve use of shared clusters

Goal 5: Better Maintenance and Support • Website for user feedback and requests • New developer and discussion forums • URL: www. netlib. org/lapack-dev – Includes NSF proposal • • • Version control and bug tracking system Automatic Build and Test environment Wide variety of supported platforms Cooperation with vendors Anything else desired?

Goal 6: Involve the Community • To help identify priorities – More interesting tasks than we are funded to do – See www. netlib. org/lapack-dev for list • To help identify promising algorithms – What have we missed? • To help do the work – – Bug reports, provide fixes Again, more tasks than we are funded to do Already happening: thank you! We retain final decisions on content • Anything else?

Some Highlights • • • Putting more of LAPACK into Sca. LAPACK performance on 1 D vs 2 D grids MRRR “Holy Grail” algorithm for symmetric EVD Iterative Refinement for Ax=b O(n 2) polynomial root finder Generalized SVD

Missing Drivers in Sca/LAPACK Sca. LAPACK Linear Equations LU Cholesky LDLT x. GESV x. POSV x. SYSV Px. GESV Px. POSV missing Least Squares (LS) QR QR+pivot SVD/QR SVD/D&C SVD/MRRR QR + iterative refine. x. GELSY x. GELSS x. GELSD missing Px. GELS missing driver missing (intent) missing Generalized LS LS + equality constr. Generalized LM Above + Iterative ref. x. GGLSE missing x. GGGLM missing

More missing drivers LAPACK Sca. LAPACK Symmetric EVD QR / Bisection+Invit D&C MRRR x. SYEV / X x. SYEVD x. SYEVR Px. SYEV / X missing (intent) missing Nonsymmetric EVD Schur form Vectors too x. GEES / X x. GEEV /X missing driver SVD QR D&C MRRR Jacobi x. GESVD x. GESDD missing Px. GESVD missing (intent) missing Generalized Symmetric EVD QR / Bisection+Invit D&C MRRR x. SYGV / X x. SYGVD missing Px. SYGV / X missing (intent) missing Generalized Nonsymmetric EVD Schur form Vectors too x. GGES / X x. GGEV / X missing Generalized SVD Kogbetliantz MRRR x. GGSVD missing (intent) missing

Missing matrix types in Sca. LAPACK • Symmetric, Hermitian, triangular – Band, Packed • Positive Definite – Packed • Orthogonal, Unitary – Packed

Speedups for using 2 D processor grid range from 2 x to 8 x Times obtained on: 60 processors, Dual AMD Opteron 1. 4 GHz Cluster w/Myrinet Interconnect 2 GB Memory

Cost of redistributing matrix to optimal layout is small Times obtained on: 60 processors, Dual AMD Opteron 1. 4 GHz Cluster w/Myrinet Interconnect 2 GB Memory

MRRR Algorithm for eig(tridiagonal) and svd(bidiagonal) • “Multiple Relatively Robust Representation” • 1999 Householder Award honorable mention for Dhillon • O(nk) flops to find k eigenvalues/vectors of nxn tridiagonal matrix (similar for SVD) – Minimum possible! • Naturally parallelizable • Accurate – Small residuals || Txi – li xi || = O(n e) – Orthogonal eigenvectors || xi. Txj || = O(n e) • Hence nickname: “Holy Grail” • 2 versions – LAPACK 3. 0: large error on “hard” cases – Next release: fixed! • How should we tradeoff speed and accuracy?

Timing of Eigensolvers (1. 2 GHz Athlon, only matrices where time >. 1 sec)

Timing of Eigensolvers (1. 2 GHz Athlon, only matrices where time >. 1 sec)

Timing of Eigensolvers (1. 2 GHz Athlon, only matrices where time >. 1 sec)

Timing of Eigensolvers (only matrices where time >. 1 sec)

Accuracy Results (old vs new Grail) maxi ||Tqi – li qi || / ( n e ) || QQT – I || / (n e )

Accuracy Results (Grail vs QR vs DC) maxi ||Tqi – li qi || / ( n e ) || QQT – I || / (n e )

More Accurate: Solve Ax=b Conventional Gaussian Elimination 1/e e e = n 1/2 2 -24 With extra precise iterative refinement

What’s new? • Need extra precision (beyond double) – Part of new BLAS standard – Cost = O(n 2) extra per right-hand-side, vs O(n 3) to factor • Get tiny componentwise bounds too – Error in xi small compared to |xi|, not just maxj |xj| • “Guarantees” based on condition number estimates – – No bad bounds in 6. 2 M tests Different condition number for componentwise bounds Traditional iterative refinement can “fail” Only get “matrix close to singular” message when answer wrong? • Extends to least squares • Demmel, Kahan, Hida, Riedy, X. Li, Sarkisyan, … • LAPACK Working Note # 165

Can condition estimators lie? • Yes, but rarely, unless they cost as much as matrix multiply = cost of LU factorization – Demmel/Diament/Malajovich (FCM 2001) • But what if matrix multiply costs O(n 2)? – Cohn/Umans/Kleinberg (FOCS 2003/5)

New algorithm for roots(p) -p 1 -p 2 1 0 – Roots(p) does eig(C(p)) C(p)= 0 1 – Costs O(n 3), stable, reliable …… • O(n 2) Alternatives 0 … – Newton, Jenkins-Traub, Laguerre, … • To find roots of polynomial p … -pd … 0 … … 1 0 – Stable? Reliable? • New: Exploit “semiseparable” structure of C(p) – Low rank of any submatrix of upper triangle of C(p) preserved under QR iteration – Complexity drops from O(n 3) to O(n 2) • Related work: Gemignani, Bini, Pan, et al • Ming Gu, Shiv Chandrasekaran, Jiang Zhu, Jianlin Xia, David Bindel, David Garmire, Jim Demmel

Properties of new roots(p) • First (? ) algorithm that – Is O(n 2) – Is backward stable, in the matrix sense – Is backward stable, in the sense that the computed roots are the exact roots of a slightly perturbed input polynomial • Depends on balancing = scaling roots by a constant • Still need to automate choice of • Byers, Mathias, Braman • Tisseur, Higham, Mackey …

New GSVD Algorithm: Timing Comparisons 1. 0 GHz Itanium– 2 (2 GB RAM); Intel's Math Kernel Library 7. 2. 1 (incl. BLAS, LAPACK) Bai et al, UC Davis PSVD, CSD on the way

Conclusions • Lots to do in Dense Linear Algebra – New numerical algorithms – Continuing architectural challenges • Parallelism, performance tuning • Grant support, but success depends on contributions from community

Extra Slides

Downa Dating With apologies to Robert Burns & Nick Higham Should some equations be forgot when overdetermīned? Should some equations be forgot using hyperbolic sines? With hyperbolic sines, my dear with hyperbolic sines, We’ll hope to get some boundedness with hyperbolic sines.

New GSVD Algorithm (XGGQSV) • UT A Z =∑a( 0 R ), VT B Z =∑b( 0 R ); A: M x N, B: P x. N • Modified Van Loan's method (1) Pre-processing: reveal rank of ( AT ; BT )T (2) Split QR: reduce two upper triangular into one (3) CSD: Cosine-Sine Decomposition (4) Post-processing: assemble resulted matrices • Workspace = (max(M, N, P)) + 5 L 2 where L = rank(B) • Outperforms current XGGSVD in LAPACK

Profile of SGGQSV 1. 0 GHz Itanium– 2 (2 GB RAM); Intel's Math Kernel Library 7. 2. 1 (incl. BLAS, LAPACK)

Related New Routines • CSD (XORCSD) – UTQ 1 Z =∑a, VTQ 2 Z =∑b – Based on Von Loan's method (1985) – Dominated by the cost of SVD – Workspace = (Max(P+M, N)) • PSVD (XGGPSV) – UTABV = ∑ – Based on work by Golub, Solna, Van Dooren (2000) – Workspace = (Max(M, P, N))

Benchmark Details • AMD 1. 2 GHz Athlon, 2 GB mem, Redhat + Intel compiler • Compute all eigenvalues and eigenvectors of a symmetric tridiagonal matrix T • Codes compared: – – qr: QR iteration from LAPACK: dsteqr dc: Cuppen’s Divide&Conquer from LAPACK: dstedc gr: New implementation of MRRR algorithm (“Grail”) ogr: MRRR from LAPACK 3. 0: dstegr (“old Grail”)

Timing of Eigensolvers (1. 2 GHz Athlon, only matrices where time >. 1 sec

Timing of Eigensolvers (1. 2 GHz Athlon, only matrices where time >. 1 sec)

More Accurate: Solve Ax=b • Old idea: Use Newton’s method on f(x) = Ax-b – On a linear system? – Roundoff in Ax-b makes it interesting (“nonlinear”) – Iterative refinement • Snyder, Wilkinson, Moler, Skeel, … Repeat r = Ax-b … compute with extra precision Solve Ad = r … using LU factorization of A Update x = x – d Until “accurate enough” or no progress

Speedups from 1. 33 x to 11. 5 x Times obtained on: 60 processors, Dual AMD Opteron 1. 4 GHz Cluster w/Myrinet Interconnect 2 GB Memory

Times obtained on: 60 processors, Dual Opteron 1. 4 GHz, 64 -bit machine 2 GB Memory, Myrinet Interconnect

Speedups from 1. 8 x to 7. 5 x Times obtained on: 60 processors, Dual Intel Xeon 2. 4 GHz, IA-32 w/Gigabit Ethernet Interconnect 2 GB Memory

Times obtained on: 60 processors, Dual Intel Xeon 2. 4 GHz, IA-32 w/Gigabit Ethernet Interconnect 2 GB Memory

Speedups from 2. 2 x to 6 x Times obtained on: 60 processors, Dual Intel Xeon 2. 4 GHz, IA-32 w/Gigabit Ethernet Interconnect 2 GB Memory

Times obtained on: 60 processors, Dual Intel Xeon 2. 4 GHz, IA-32 w/Gigabit Ethernet Interconnect 2 GB Memory

Speedups from 2. 2 x to 2. 8 x Times obtained on: 60 processors, IBM SP RS/6000, 16 shared memory (16 Gbytes) processors per node, 4 nodes connected with high-bandwidth low-latency switching network Block size: 64

Times obtained on: 60 processors, IBM SP RS/6000, 16 shared memory (16 Gbytes) processors per node, 4 nodes connected with high-bandwidth low-latency switching network Block size: 64

The Future of LAPACK and Sca. LAPACK www. cs. berkeley. edu/~demmel/Sca-LAPACK-Proposal. pdf Jim Demmel UC Berkeley Householder XVI

OO-like interfaces (VE) • Aim: native looking interfaces in modern languages (C++, F 90, Java, Python…) • Overloading to have identical interfaces for all data types (precision/storage) • Overloading to omit certain parameters (stride, transpose, …) • Expose data only when necessary (automatic allocation of temporaries, pivoting permutation, …. ) • Mechanism: Use Babel/SIDL to interface to existing F 77 code base

Example of HLL interface (VE) // C++ example Lapack: : PDTridiagonal. Matrix A = Lapack: : PDTridiagonal. Matrix: : _create(); x = (double*) malloc(SIZE*sizeof(double)); y = (double*) malloc(SIZE*sizeof(double)); A. Mat. Vec(x, y); // or A. Mat. Vec(Mat. Transpose, x, y); A. Factor(); A. Solve(y); ! F 90 example ! Create matrix from user array allocate(data_array(20, 25)) ! Fill in the array…. call Create. With. Array(A, data_array) ! Retrieve the array of a (factored) matrix call Get. Array(A, mat) mat_array => mat%d_data print *, 'pivots', mat_array(0, 0), mat_array(1, 1) Lots of details omitted here!