

Automatically Adapting Programs for Mixed-Precision Floating-Point Computation
Mike Lam and Jeff Hollingsworth, University of Maryland, College Park
Bronis de Supinski and Matt LeGendre, Lawrence Livermore National Lab

Background
• Floating point represents real numbers as ± sgnf × 2^exp
  o Sign bit
  o Exponent
  o Significand (“mantissa” or “fraction”)
[Figure: bit layouts. IEEE single: exponent (8 bits), significand (23 bits). IEEE double: exponent (11 bits), significand (52 bits).]
• Finite precision
  o Single precision: 24-bit significand, counting the implicit leading bit (~7 decimal digits)
  o Double precision: 53-bit significand, counting the implicit leading bit (~16 decimal digits)
  o Introduces rounding error
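The field layout above can be checked directly. The following is a minimal sketch using only Python's standard `struct` module; the helper names (`to_single`, `fields_single`, `fields_double`) are made up for illustration and do not come from the slides.

```python
import struct


def to_single(x: float) -> float:
    """Round a Python float (an IEEE double) to the nearest IEEE single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fields_double(x: float):
    """Return (sign, biased exponent, stored fraction) of x as a double."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


def fields_single(x: float):
    """Return (sign, biased exponent, stored fraction) of x as a single."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & ((1 << 23) - 1)


if __name__ == "__main__":
    x = 0.1
    print("double fields:", fields_double(x))
    print("single fields:", fields_single(x))
    # Rounding error introduced by storing 0.1 in single precision.
    print("single rounding error:", abs(to_single(x) - x))
```

The stored fraction excludes the implicit leading 1, which is why the figure shows 23 and 52 bits while the precision is quoted as 24 and 53 bits.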

Motivation
• Double precision is ubiquitous
  o Necessary for some computations
  o Lack of easy-to-use techniques for reasoning about precision
• Single precision is preferable
  o Faster computation
      o Tesla K20X: 2.95 TFlops (singles) vs. 1.31 TFlops (doubles)
      o Intel Xeon Phi: 2.15 GFlops (singles) vs. 1.07 GFlops (doubles)
      o Standard CPUs: 2x operations with SSE vector instructions
  o Reduced memory pressure
      o Up to 50% footprint reduction
      o Data movement is a bottleneck for some domains
Desire: Balance speed (singles) with accuracy (doubles)

Mixed Precision
• Use double precision where necessary
• Use single precision where possible
• Nearly 2x speedups [Baboulin 2008]

     1: LU ← PA
     2: solve Ly = Pb
     3: solve U x_0 = y
     4: for k = 1, 2, ... do
     5:     r_k ← b − A x_{k-1}
     6:     solve Ly = P r_k
     7:     solve U z_k = y
     8:     x_k ← x_{k-1} + z_k
     9:     check for convergence
    10: end for

Mixed-precision linear solver algorithm. The residual computation (step 5) and the solution update (step 8) are performed in double precision; all other steps are single precision.
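The algorithm above can be tried out without special hardware. The sketch below is an illustration in pure Python, not the implementation from [Baboulin 2008]: single precision is emulated by rounding every intermediate result through `struct`, and all function names and the convergence tolerance are assumptions chosen for the example.

```python
import struct


def f32(x: float) -> float:
    """Round to the nearest IEEE single (emulates single-precision storage)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def lu_factor_single(A):
    """Step 1: LU <- PA with partial pivoting, every result rounded to single."""
    n = len(A)
    LU = [[f32(v) for v in row] for row in A]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] = f32(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = f32(LU[i][j] - f32(LU[i][k] * LU[k][j]))
    return LU, perm


def lu_solve_single(LU, perm, b):
    """Solve Ly = Pb, then Ux = y, in emulated single precision."""
    n = len(LU)
    y = [f32(b[perm[i]]) for i in range(n)]
    for i in range(n):                      # forward substitution (unit L)
        for j in range(i):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
    for i in reversed(range(n)):            # back substitution
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
        y[i] = f32(y[i] / LU[i][i])
    return y


def refine(A, b, tol=1e-12, max_iter=20):
    """Mixed-precision iterative refinement; tol and max_iter are assumptions."""
    LU, perm = lu_factor_single(A)
    x = lu_solve_single(LU, perm, b)        # steps 2-3: initial solution x_0
    for _ in range(max_iter):
        # Step 5: residual in double precision (plain Python floats).
        r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
        if max(abs(v) for v in r) <= tol:   # step 9: convergence check
            break
        z = lu_solve_single(LU, perm, r)    # steps 6-7: correction in single
        x = [xi + zi for xi, zi in zip(x, z)]  # step 8: update in double
    return x
```

Only the O(n²) residual and update run in double precision; the O(n³) factorization stays in single, which is where the speedup reported on the slide would come from on real hardware.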

Our Goal
Use automated analysis techniques to prototype mixed-precision variants and provide insight about a program’s precision level requirements.

Framework
CRAFT: Configurable Runtime Analysis for Floating-point Tuning
• Static binary instrumentation
  o Parse binary on disk
  o Replace or augment floating-point instructions with new code
  o Rewrite modified binary
• Dynamic analysis
  o Run modified program on representative data set
  o Produce results and recommendations

Previous Work
• Cancellation detection [WHIST’11]
  o Reports loss of precision due to subtraction
  o Provides insight regarding numerical behavior
• Range tracking
  o Reports per-instruction min/max values
  o Provides insight regarding low dynamic ranges
• Mixed-precision variants
  o Replaces double-precision instructions and operands
  o Provides insight regarding precision-level sensitivity

Implementation
• In-place replacement
  o Narrowed focus: doubles → singles
  o In-place downcast conversion
  o Flag in the high bits to indicate replacement
[Figure: downcast conversion of a 64-bit double into a “replaced double”. The high 32 bits hold the flag 0x7FF4DEAD (labeled “Non-signalling NaN”); the low 32 bits hold the single-precision value.]
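The replaced-double layout can be illustrated on raw bit patterns. This is a hedged sketch of the encoding described on the slide, not CRAFT's code; the function names are hypothetical, and only the flag value 0x7FF4DEAD is taken from the slide.

```python
import math
import struct

FLAG = 0x7FF4DEAD  # flag from the slide, stored in the high 32 bits


def encode_replaced(value: float) -> int:
    """Return the 64-bit pattern holding `value` as a single plus the flag."""
    single_bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return (FLAG << 32) | single_bits


def is_replaced(bits: int) -> bool:
    """True if the high 32 bits carry the replacement flag."""
    return (bits >> 32) == FLAG


def decode_replaced(bits: int) -> float:
    """Read the single-precision value back from the low 32 bits."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


if __name__ == "__main__":
    bits = encode_replaced(1.5)
    print(hex(bits), is_replaced(bits), decode_replaced(bits))
    # Reinterpreted as a double, the pattern has an all-ones exponent and a
    # non-zero fraction, so it reads as a NaN.
    as_double = struct.unpack("<d", struct.pack("<Q", bits))[0]
    print(math.isnan(as_double))
```

Because the flagged pattern is a NaN when read as a double, a replaced value that reaches uninstrumented double-precision code does not silently look like an ordinary number.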

Example
gvec[i,j] = gvec[i,j] * lvec[3] + gvar

    movsd  0x601e38(%rax,%rbx,8) → %xmm0
    mulsd  -0x78(%rsp) * %xmm0 → %xmm0
    addsd  -0x4f02(%rip) + %xmm0 → %xmm0
    movsd  %xmm0 → 0x601e38(%rax,%rbx,8)

Example
gvec[i,j] = gvec[i,j] * lvec[3] + gvar

    movsd  0x601e38(%rax,%rbx,8) → %xmm0
    mulss  -0x78(%rsp) * %xmm0 → %xmm0
    addss  -0x4f02(%rip) + %xmm0 → %xmm0
    movsd  %xmm0 → 0x601e38(%rax,%rbx,8)

Example
gvec[i,j] = gvec[i,j] * lvec[3] + gvar

    movsd  0x601e38(%rax,%rbx,8) → %xmm0
    check/replace -0x78(%rsp) and %xmm0
    mulss  -0x78(%rsp) * %xmm0 → %xmm0
    check/replace -0x4f02(%rip) and %xmm0
    addss  -0x4f02(%rip) + %xmm0 → %xmm0
    movsd  %xmm0 → 0x601e38(%rax,%rbx,8)

Replacement Code

    push %rax
    push %rbx
    <for each input operand>
      <copy input into %rax>
      mov %rbx, 0xffffffff00000000
      and %rax, %rbx                  # extract high word
      mov %rbx, 0x7ff4dead00000000
      test %rax, %rbx                 # check for flag
      je next                         # skip if replaced
      cvtsd2ss %rax, %rax             # down-cast value
      or %rax, %rbx                   # set flag
    next:
    pop %rbx
    pop %rax
                                      # e.g. addsd => addss
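The check/replace logic of the snippet can be mirrored in a higher-level form. The sketch below is an emulation under stated assumptions, not the generated machine code: operands are modeled as 64-bit integers, `check_replace` and `add_single` are hypothetical names, and the mask and flag constants are the 64-bit forms of the values in the snippet.

```python
import struct

HIGH_MASK = 0xFFFFFFFF00000000  # selects the high word of a 64-bit operand
FLAG_HIGH = 0x7FF4DEAD00000000  # replacement flag positioned in the high word


def _single_bits(x: float) -> int:
    """Bit pattern of x rounded to an IEEE single."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _single_value(bits: int) -> float:
    """Single-precision value stored in the low 32 bits."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def check_replace(operand: int) -> int:
    """Downcast an operand in place unless it already carries the flag."""
    if (operand & HIGH_MASK) == FLAG_HIGH:   # extract high word, check flag
        return operand                        # skip if already replaced
    as_double = struct.unpack("<d", struct.pack("<Q", operand))[0]
    return FLAG_HIGH | _single_bits(as_double)  # down-cast value, set flag


def add_single(a: int, b: int) -> int:
    """Emulate the rewritten instruction: addsd becomes addss on replaced operands."""
    a, b = check_replace(a), check_replace(b)
    # Adding two singles in double and rounding once more to single gives the
    # same result as a native single-precision add.
    total = _single_value(a) + _single_value(b)
    return FLAG_HIGH | _single_bits(total)
```

For example, `add_single(0x3FF8000000000000, 0x4002000000000000)` takes the double patterns for 1.5 and 2.25, downcasts both, and returns a flagged pattern whose low word is the single 3.75.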

Dyninst
• Binary analysis framework
  o Parses executable files (InstructionAPI & ParseAPI)
  o Inserts instrumentation (DyninstAPI)
  o Supports full binary modification (PatchAPI)
  o Rewrites binary executable files (SymtabAPI)
dyninst.org

Block Editing
[Figure labels: original instruction in block; block splits; initialization; check/replace; double → single conversion]

Overhead

    Benchmark (name.CLASS)   Average Overhead
    bt.A                     50.6X
    cg.A                      6.1X
    ep.A                     13.8X
    ft.A                     10.1X
    lu.A                     28.5X
    mg.A                     14.0X
    sp.A                     19.5X

Binary Editing
[Figure labels: Original Binary (“mutatee”), double precision; Configuration (parser & GUI); CRAFT (“mutator”); Modified Binary, mixed precision; Mixed Config]

Configuration

Automated Search
• Manual mixed-precision replacement
  o Hard to use without intuition regarding potential replacements
• Automatic mixed-precision analysis
  o Try lots of configurations (empirical auto-tuning)
  o Test with user-defined verification routine and data set
  o Exploit program control structure: replace larger structures (modules, functions) first
  o If coarse-grained replacements fail, try finer-grained subcomponent replacements
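The coarse-to-fine strategy described above can be sketched as a breadth-first walk over the program structure. This is an illustration only, assuming a simple tree of modules, functions, and instructions; the `Node` class, the `search` function, and the `passes` callback (standing in for the user-defined verification routine) are hypothetical and are not CRAFT's API.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Node:
    """A program structure: module, function, or (as a leaf) one instruction."""
    name: str
    children: List["Node"] = field(default_factory=list)

    def instructions(self) -> Set[str]:
        """All leaf instructions contained in this structure."""
        if not self.children:
            return {self.name}
        found: Set[str] = set()
        for child in self.children:
            found |= child.instructions()
        return found


def search(root: Node, passes: Callable[[Set[str]], bool]):
    """Try replacing whole structures first; descend only where that fails.

    Returns the set of instructions accepted for single precision and the
    number of configurations tested.
    """
    accepted: Set[str] = set()
    tested = 0
    queue = deque([root])
    while queue:
        node = queue.popleft()
        candidate = node.instructions()
        tested += 1
        if passes(candidate):
            accepted |= candidate        # whole structure can be replaced
        else:
            queue.extend(node.children)  # fall back to finer granularity
    # Each candidate was verified in isolation; the union of accepted sets
    # would still need one final verification run.
    return accepted, tested
```

With a tree of two functions holding two instructions each, and a verification routine that rejects any configuration containing one sensitive instruction, the search tests five configurations and accepts the other three instructions.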

System Overview

Example Results

Example Results

NAS Results

    Benchmark      Candidate      Configurations   Instructions Replaced
    (name.CLASS)   Instructions   Tested           % Static    % Dynamic
    bt.W           6,647          3,854            76.2        85.7
    bt.A           6,682          3,832            75.9        81.6
    cg.W             940            270            93.7         6.4
    cg.A             934            229            94.7         5.3
    ep.W             397            112            93.7        30.7
    ep.A             397            113            93.1        23.9
    ft.W             422             72            84.4         0.3
    ft.A             422             73            93.6         0.2
    lu.W           5,957          3,769            73.7        65.5
    lu.A           5,929          2,814            80.4        69.4
    mg.W           1,351            458            84.4        28.0
    mg.A           1,351            456            84.1        24.4
    sp.W           4,772          5,729            36.9        45.8
    sp.A           4,821          5,044            51.9        43.0

AMGmk Results
• Algebraic MultiGrid microkernel
  o Multigrid method is iterative and highly adaptive
  o Good candidate for replacement
• Automatic search
  o Complete conversion (100% replacement)
• Manually-rewritten version
  o Speedup: 175 sec to 95 sec (1.8X)
  o Conventional x86_64 hardware

SuperLU Results
• Package for LU decomposition and linear solves
  o Reports final error residual (useful for thresholding)
  o Both single- and double-precision versions
• Verified manual conversion via automatic search
  o Used error from provided single-precision version as threshold
  o Final config matched single-precision profile (99.9% replacement)

    Threshold   Instructions Replaced     Final Error
                % Static    % Dynamic
    1.0e-03     99.1        99.9          1.59e-04
    1.0e-04     94.1        87.3          4.42e-05
    7.5e-05     91.3        52.5          4.40e-05
    5.0e-05     87.9        45.2          3.00e-05
    2.5e-05     80.3        26.6          1.69e-05
    1.0e-05     75.4         1.6          7.15e-07
    1.0e-06     72.6         1.6          4.77e-07

Future Work
• Memory-based analysis
• Case studies
• Search optimization

Conclusion
Automated binary modification can build prototype mixed-precision program variants.
Automated search can provide insight to focus mixed-precision implementation efforts.

Thank you!
sf.net/p/crafthpc