On-line Automated Performance Diagnosis on Thousands of Processors

Philip C. Roth
Future Technologies Group, Computer Science and Mathematics Division, Oak Ridge National Laboratory
Paradyn Research Group, Computer Sciences Department, University of Wisconsin-Madison

OAK RIDGE NATIONAL LABORATORY, U.S. DEPARTMENT OF ENERGY

High Performance Computing Today
· Large parallel computing resources
  - Tightly coupled systems (Earth Simulator, BlueGene/L, XT3)
  - Clusters (LANL Lightning, LLNL Thunder)
  - Grid
· Large, complex applications
  - ASCI Blue Mountain job sizes (2001)
    · 512 CPUs: 17.8%
    · 1024 CPUs: 34.9%
    · 2048 CPUs: 19.9%
· Small fraction of peak performance is the rule

Achieving Good Performance
· Need to know what and where to tune
· Diagnosis and tuning tools are critical for realizing the potential of large-scale systems
· On-line automated tools are especially desirable
  - Manual tuning is difficult
    · Finding interesting data in a large data volume
    · Understanding application, OS, and hardware interactions
  - Automated tools require minimal user involvement; expertise is built into the tool
  - On-line automated tools can adapt dynamically
    · Dynamic control over data volume
    · Useful results from a single run
· But: tools that work well in small-scale environments often don't scale

Barriers to Large-Scale Performance Diagnosis
• Managing performance data volume
• Communicating efficiently between distributed tool components
• Making scalable presentation of data and analysis results
[Diagram: Tool Front End; Tool Daemons d0, d1, d2, d3, ..., dP-4, dP-3, dP-2, dP-1; App Processes a0, a1, a2, a3, ..., aP-4, aP-3, aP-2, aP-1]

Our Approach for Addressing These Scalability Barriers
· MRNet: multicast/reduction infrastructure for scalable tools
· Distributed Performance Consultant: strategy for efficiently finding performance bottlenecks in large-scale applications
· Sub-Graph Folding Algorithm: algorithm for effectively presenting bottleneck diagnosis results for large-scale applications

Outline
· Performance Consultant
· MRNet
· Distributed Performance Consultant
· Sub-Graph Folding Algorithm
· Evaluation
· Summary

Performance Consultant
· Automated performance diagnosis
· Search for application performance problems
  - Start with global, general experiments (e.g., test CPUbound across all processes)
  - Collect performance data using dynamic instrumentation
    · Collect only the data desired
    · Remove the instrumentation when no longer needed
  - Make decisions about the truth of each experiment
  - Refine search: create more specific experiments based on "true" experiments (those whose data is above a user-configurable threshold)
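The search loop described on this slide can be sketched as follows. This is a minimal illustration, not Paradyn's implementation: the function names (`search`, `measure`, `refine`), the 0.2 threshold, and the toy measurements are all assumptions, and `measure` merely stands in for inserting dynamic instrumentation, collecting data, and removing the instrumentation again.

```python
from collections import deque

def search(root, measure, refine, threshold=0.2):
    """Breadth-first refinement: test an experiment, and only if it tests
    true (value above threshold) create its more specific children."""
    true_experiments = []
    pending = deque([root])
    while pending:
        experiment = pending.popleft()
        value = measure(experiment)  # hypothetical: instrument, sample, remove
        if value > threshold:
            true_experiments.append(experiment)
            pending.extend(refine(experiment))
    return true_experiments

# Toy data (invented): fraction of time attributed to each (hypothesis, focus).
DATA = {
    ("CPUbound", "whole program"): 0.60,
    ("CPUbound", "main"): 0.55,
    ("CPUbound", "Do_row"): 0.40,
    ("CPUbound", "Do_col"): 0.05,
    ("CPUbound", "Do_mult"): 0.10,
}
CHILDREN = {
    "whole program": ["main"],
    "main": ["Do_row", "Do_col", "Do_mult"],
}

def measure(experiment):
    return DATA.get(experiment, 0.0)

def refine(experiment):
    hypothesis, focus = experiment
    return [(hypothesis, child) for child in CHILDREN.get(focus, [])]

result = search(("CPUbound", "whole program"), measure, refine)
```

Only experiments that test true are refined, so instrumentation for `Do_col` and `Do_mult` is never extended further in this toy run.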

Performance Consultant
[Diagram: hosts c001.cs.wisc.edu, c002.cs.wisc.edu, c128.cs.wisc.edu; processes myapp 367, myapp 4287, myapp 27549]

Performance Consultant
[Diagram: search graph over the hosts c001.cs.wisc.edu, c002.cs.wisc.edu, ..., c128.cs.wisc.edu and processes myapp{367}, myapp{4287}, myapp{27549}; node labels CPUbound, main, Do_row, Do_col, Do_mult]

Performance Consultant
[Diagram: the same search graph with the host cham.cs.wisc.edu added; node labels CPUbound, main, Do_row, Do_col, Do_mult; hosts c001.cs.wisc.edu, c002.cs.wisc.edu, ..., c128.cs.wisc.edu; processes myapp{367}, myapp{4287}, myapp{27549}]

MRNet: Multicast/Reduction Overlay Network
· Parallel tool infrastructure providing:
  - Scalable multicast
  - Scalable data synchronization and transformation
· Network of processes between tool front-end and back-ends
· Useful for parallelizing and distributing tool activities
  - Reduce latency
  - Reduce computation and communication load at tool front-end
· Joint work with Dorian Arnold (University of Wisconsin-Madison)

Typical Parallel Tool Organization
[Diagram: Tool Front End; Tool Daemons d0, d1, d2, d3, ..., dP-4, dP-3, dP-2, dP-1; App Processes a0, a1, a2, a3, ..., aP-4, aP-3, aP-2, aP-1]

MRNet-based Parallel Tool Organization
[Diagram: Tool Front End; Multicast/Reduction Network of Internal Processes, each with a Filter; Tool Daemons d0, d1, d2, d3, ..., dP-4, dP-3, dP-2, dP-1; App Processes a0, a1, a2, a3, ..., aP-4, aP-3, aP-2, aP-1]
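To make the idea of a reduction network concrete, the sketch below builds a tree over daemon outputs and applies a filter at every internal node, so that no single process combines more than `fanout` inputs. It is plain Python and does not use the MRNet API; `build_tree`, `reduce_tree`, `sum_filter`, the fan-out of 4, and the utilization values are all assumptions made for illustration.

```python
def build_tree(leaves, fanout):
    """Group leaf values into nested lists with at most `fanout` children per node."""
    level = list(leaves)
    while len(level) > 1:
        level = [level[i:i + fanout] for i in range(0, len(level), fanout)]
    return level[0]

def reduce_tree(node, filter_fn, stats):
    """Apply filter_fn bottom-up; record the largest input count seen at any node."""
    if not isinstance(node, list):  # a leaf: one daemon's data
        return node
    inputs = [reduce_tree(child, filter_fn, stats) for child in node]
    stats["max_inputs"] = max(stats["max_inputs"], len(inputs))
    return filter_fn(inputs)

def sum_filter(inputs):
    """Combine (sum, count) pairs so the front end can compute a global average."""
    return (sum(s for s, _ in inputs), sum(n for _, n in inputs))

# Invented per-daemon CPU utilization (percent), one (value, count) pair each.
daemon_data = [(50 + i, 1) for i in range(16)]

stats = {"max_inputs": 0}
total, count = reduce_tree(build_tree(daemon_data, fanout=4), sum_filter, stats)
average = total / count
```

With 16 daemons and a fan-out of 4, the front end combines 4 partial results instead of 16 raw values; the same shape of computation applies to other associative reductions.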

Performance Consultant: Scalability Barriers
· MRNet can alleviate the scalability problem for global performance data (e.g., CPU utilization across all processes)
· But the front-end still processes local performance data (e.g., utilization of process 5247 on host mcr398.llnl.gov)

Performance Consultant
[Diagram: search graph with the host cham.cs.wisc.edu; node labels CPUbound, main, Do_row, Do_col, Do_mult; hosts c001.cs.wisc.edu, c002.cs.wisc.edu, ..., c128.cs.wisc.edu; processes myapp{367}, myapp{4287}, myapp{27549}]

Distributed Performance Consultant
[Diagram: search graph with the host cham.cs.wisc.edu; node labels CPUbound, main, Do_row, Do_col, Do_mult; hosts c001.cs.wisc.edu, c002.cs.wisc.edu, ..., c128.cs.wisc.edu; processes myapp{367}, myapp{4287}, myapp{27549}]

Distributed Performance Consultant: Variants
· Natural steps from the traditional centralized approach (CA)
· Partially Distributed Approach (PDA)
  - Distributed local searches, centralized global search
  - Requires complex instrumentation management
· Truly Distributed Approach (TDA)
  - Distributed local searches only
  - Insight into global behavior from combining local search results (e.g., using the Sub-Graph Folding Algorithm)
  - Simpler tool design than PDA

Distributed Performance Consultant: PDA
[Diagram: search graph with the host cham.cs.wisc.edu; node labels CPUbound, main, Do_row, Do_col, Do_mult; hosts c001.cs.wisc.edu, c002.cs.wisc.edu, ..., c128.cs.wisc.edu; processes myapp{367}, myapp{4287}, myapp{27549}]

Distributed Performance Consultant: TDA
[Diagram: host cham.cs.wisc.edu; hosts c001.cs.wisc.edu, c002.cs.wisc.edu, ..., c128.cs.wisc.edu; processes myapp{367}, myapp{4287}, myapp{27549}; per-process node labels main, Do_row, Do_col, Do_mult]

Distributed Performance Consultant: TDA
[Diagram: as on the previous slide, with a "Sub-Graph Folding Algorithm" label added; host cham.cs.wisc.edu; hosts c001.cs.wisc.edu, c002.cs.wisc.edu, ..., c128.cs.wisc.edu; processes myapp{367}, myapp{4287}, myapp{27549}; per-process node labels main, Do_row, Do_col, Do_mult]

Search History Graph Example
[Diagram: search history graph rooted at CPUbound; hosts c33.cs.wisc.edu and c34.cs.wisc.edu; processes myapp{1272}, myapp{1273}, myapp{7624}, myapp{7625}; code nodes main, A, B, C, D, E, with one sub-graph per process]

Search History Graphs
· Search History Graph is effective for presenting search-based performance diagnosis results…
· …but it does not scale to a large number of processes because it shows one sub-graph per process

Sub-Graph Folding Algorithm
· Combines host-specific sub-graphs into composite sub-graphs
· Each composite sub-graph represents a behavioral category among application processes
· Dynamic clustering of processes by qualitative behavior
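The clustering idea above can be illustrated with a deliberately simplified sketch: processes whose search sub-graphs have identical structure are placed in the same category. This is an assumption made for illustration and is not the published Sub-Graph Folding Algorithm; the nested-dict representation, the `signature` and `fold` helpers, and the sample sub-graphs are hypothetical.

```python
from collections import defaultdict

def signature(subgraph):
    """Canonical, hashable form of a nested dict {node_name: children}."""
    return tuple(sorted((name, signature(kids)) for name, kids in subgraph.items()))

def fold(per_process):
    """Group processes whose sub-graphs are structurally identical.
    Returns a list of (sorted process names, shared signature) pairs."""
    groups = defaultdict(list)
    for process, subgraph in per_process.items():
        groups[signature(subgraph)].append(process)
    return [(sorted(processes), sig) for sig, processes in groups.items()]

# Invented search results: which functions tested true in each process.
per_process = {
    "c33:myapp{1272}": {"main": {"A": {}, "B": {}}},
    "c33:myapp{1273}": {"main": {"B": {}, "A": {}}},
    "c34:myapp{7624}": {"main": {"C": {"D": {}}}},
    "c34:myapp{7625}": {"main": {"A": {}, "B": {}}},
}

folded = fold(per_process)
```

Here four per-process sub-graphs collapse into two composite categories, so the display grows with the number of distinct behaviors rather than with the number of processes.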

SGFA: Example
[Diagram: the search history graph from the earlier example (CPUbound; hosts c33.cs.wisc.edu, c34.cs.wisc.edu; processes myapp{1272}, myapp{1273}, myapp{7624}, myapp{7625}; code nodes main, A, B, C, D, E) alongside its folded form, in which hosts appear as c*.cs.wisc.edu and processes as myapp{*}]

SGFA: Implementation
· Custom MRNet filter
· Filter in each MRNet process keeps folded graph of search results from all reachable daemons
· Updates periodically sent upstream
· By induction, filter in front-end holds entire folded graph
· Optimization for unchanged graphs
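A rough sketch of the per-node filter behavior listed above: each node remembers the latest folded summary from each child, merges them, and forwards the merge upstream only when it has changed. The `FoldFilter` class, its `update` method, and the string category keys are hypothetical and do not reflect the MRNet filter interface; the unchanged-graph check is one plausible reading of the optimization named on the slide.

```python
class FoldFilter:
    """State kept by one internal node of the reduction tree (illustrative)."""

    def __init__(self):
        self.from_child = {}   # child id -> {category: frozenset of processes}
        self.last_sent = None  # last merged summary forwarded upstream

    def update(self, child_id, folded):
        """Record a child's summary; return the merged summary to send
        upstream, or None if nothing changed since the last send."""
        self.from_child[child_id] = folded
        merged = {}
        for summary in self.from_child.values():
            for category, processes in summary.items():
                merged[category] = merged.get(category, frozenset()) | processes
        if merged == self.last_sent:
            return None  # unchanged graph: suppress the upstream update
        self.last_sent = merged
        return merged

node = FoldFilter()
first = node.update("d0", {"main/A": frozenset({"p0"})})
second = node.update("d1", {"main/A": frozenset({"p1"}), "main/C": frozenset({"p2"})})
repeat = node.update("d1", {"main/A": frozenset({"p1"}), "main/C": frozenset({"p2"})})
```

Because every node applies the same merge to the summaries of its children, the node at the root ends up holding the summary for all daemons below it.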

DPC + SGFA: Evaluation
· Modified Paradyn to perform bottleneck searches using CA, PDA, or TDA approach
· Modified instrumentation cost tracking to support PDA
  - Track global, per-process instrumentation cost separately
  - Simple fixed-partition policy for scheduling global and local instrumentation
· Implemented Sub-Graph Folding Algorithm as custom MRNet filter to support TDA (used by all)
· Instrumented front-end, daemons, and MRNet internal processes to collect CPU, I/O load information
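The slide does not spell out the fixed-partition policy, so the following is only one plausible reading: the allowed instrumentation cost is split into a fixed global share and a fixed local share, and a request is admitted only if its own partition has room. The `PartitionedBudget` class, its methods, and the cost units are assumptions, not Paradyn code.

```python
class PartitionedBudget:
    """Fixed split of an instrumentation cost budget (illustrative units)."""

    def __init__(self, global_limit, local_limit):
        self.limits = {"global": global_limit, "local": local_limit}
        self.used = {"global": 0, "local": 0}

    def try_admit(self, kind, cost):
        """Admit a request only if its own partition can absorb the cost."""
        if self.used[kind] + cost > self.limits[kind]:
            return False
        self.used[kind] += cost
        return True

    def release(self, kind, cost):
        """Return cost to a partition when instrumentation is removed."""
        self.used[kind] = max(0, self.used[kind] - cost)

budget = PartitionedBudget(global_limit=5, local_limit=5)
a = budget.try_admit("global", 3)   # fits in the global partition
b = budget.try_admit("global", 3)   # would exceed the global partition
c = budget.try_admit("local", 4)    # local partition is tracked separately
budget.release("global", 3)
d = budget.try_admit("global", 3)   # fits again after the release
```

Under a policy like this, a burst of global experiments cannot starve local searches, at the price of leaving one partition idle while the other is full.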

DPC + SGFA: Evaluation
· su3_rmd
  - QCD pure lattice gauge theory code
  - C, MPI
  - Weak scaling scalability study
· LLNL MCR cluster
  - 1152 nodes (1048 compute nodes)
  - Two 2.4 GHz Intel Xeons per node
  - 4 GB memory per node
  - Quadrics Elan3 interconnect (fat tree)
  - Lustre parallel file system

DPC + SGFA: Evaluation
· PDA and TDA: bottleneck searches with up to 1024 processes so far, limited by partition size
· CA: scalability limit at less than 64 processes
· Similar qualitative results from all approaches

DPC: Evaluation
[Slides 33-41: figures only; no text survives in the extraction]

SGFA: Evaluation
[Slide 42: figure only; no text survives in the extraction]

Summary
· Tool scalability is critical for effective use of large-scale computing resources
· On-line automated performance tools are especially important at large scale
· Our approach:
  - MRNet
  - Distributed Performance Consultant (TDA) plus Sub-Graph Folding Algorithm

References
· P. C. Roth, D. C. Arnold, and B. P. Miller, "MRNet: a Software-Based Multicast/Reduction Network for Scalable Tools," SC2003, Phoenix, Arizona, November 2003
· P. C. Roth and B. P. Miller, "The Distributed Performance Consultant and the Sub-Graph Folding Algorithm: On-line Automated Performance Diagnosis on Thousands of Processes," in submission
· Publications available from http://www.paradyn.org
· MRNet software available from http://www.paradyn.org/mrnet