Performance Instrumentation and Measurement for Terascale Systems
Jack Dongarra, Shirley Moore, Philip Mucci (University of Tennessee)
Sameer Shende and Allen Malony (University of Oregon)
June 2, 2003, ICCS 2003

Requirements for Terascale Systems
• Performance framework must support a wide range of
  – Performance problems (e.g., single-node performance, synchronization and communication overhead, load balancing)
  – Performance evaluation methods (e.g., parameter-based modeling, bottleneck detection and diagnosis)
  – Programming environments (e.g., multiprocess and/or multithreaded, parallel and distributed, large-scale)
• Need for a flexible and extensible performance observation framework

Research Problems
• Appropriate level and location for implementing instrumentation and measurement
• How to make the framework modular and extensible
• Appropriate compromise between level of detail/accuracy and instrumentation cost

Instrumentation Strategies
• Source code instrumentation
  – Manual or using a preprocessor
• Library-level instrumentation
  – e.g., MPI and OpenMP profiling interfaces
• Binary rewriting
  – e.g., Pixie, ATOM, EEL, PAT
• Dynamic instrumentation
  – DyninstAPI

Types of Measurements
• Profiling
• Tracing
• Real-time analysis

Profiling
• Recording of summary information during execution
  – inclusive and exclusive time, number of calls, hardware statistics, …
• Reflects performance behavior of program entities
  – functions, loops, basic blocks
  – user-defined “semantic” entities
• Very good for low-cost performance assessment
• Helps to expose performance bottlenecks and hotspots
• Implemented through
  – sampling: periodic OS interrupts or hardware counter traps
  – instrumentation: direct insertion of measurement code
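The difference between inclusive and exclusive time is easiest to see in a small instrumentation-based profiler. The following is a minimal sketch, not TAU's implementation: the `Profiler` class, its method names, and the injectable `clock` are all hypothetical, and the fake tick clock is used only so the numbers are reproducible.

```python
import time
from collections import defaultdict


class Profiler:
    """Accumulates call counts plus inclusive and exclusive time per region."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock              # injectable so a run can be made deterministic
        self.calls = defaultdict(int)
        self.inclusive = defaultdict(float)
        self.exclusive = defaultdict(float)
        self._stack = []                # entries: [name, start_time, time_in_children]

    def enter(self, name):
        self._stack.append([name, self.clock(), 0.0])

    def exit(self):
        name, start, child_time = self._stack.pop()
        elapsed = self.clock() - start
        self.calls[name] += 1
        self.inclusive[name] += elapsed
        self.exclusive[name] += elapsed - child_time
        if self._stack:                 # charge this region's time to its caller
            self._stack[-1][2] += elapsed

    def instrument(self, func):
        """Decorator that inserts measurement code around a function."""
        def wrapper(*args, **kwargs):
            self.enter(func.__name__)
            try:
                return func(*args, **kwargs)
            finally:
                self.exit()
        return wrapper


# Demo with a fake clock that advances one tick per reading.
ticks = iter(range(100))
prof = Profiler(clock=lambda: next(ticks))


@prof.instrument
def inner():
    pass


@prof.instrument
def outer():
    inner()
    inner()


outer()
for name in ("outer", "inner"):
    print(name, prof.calls[name], prof.inclusive[name], prof.exclusive[name])
```

With the fake clock, `outer` is charged its full span as inclusive time, while its exclusive time omits the two `inner` calls. Recursive functions would have their inclusive time counted more than once in this sketch; a real profiler needs extra bookkeeping for that case.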

Tracing
• Recording of information about significant points (events) during program execution
  – entering/exiting a code region (function, loop, block, …)
  – thread/process interactions (e.g., send/receive message)
• Save information in an event record
  – timestamp
  – CPU identifier, thread identifier
  – event type and event-specific information
• Event trace is a time-sequenced stream of event records
• Can be used to reconstruct dynamic program behavior
• Typically requires code instrumentation
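The event record and the idea of reconstructing behavior from a trace can be sketched as follows. This is an illustration under assumptions: the field names, the `EventType` values, and the `region_durations` helper are hypothetical and do not correspond to any particular trace format.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    ENTER = "enter"
    EXIT = "exit"
    SEND = "send"
    RECV = "recv"


@dataclass(frozen=True)
class EventRecord:
    timestamp: float
    cpu: int
    thread: int
    kind: EventType
    info: str          # event-specific payload: region name, message tag, ...


def region_durations(trace):
    """Replay ENTER/EXIT events to recover per-thread region intervals."""
    stacks = {}        # (cpu, thread) -> stack of open (region, enter_time)
    intervals = []
    for ev in sorted(trace, key=lambda e: e.timestamp):
        key = (ev.cpu, ev.thread)
        if ev.kind is EventType.ENTER:
            stacks.setdefault(key, []).append((ev.info, ev.timestamp))
        elif ev.kind is EventType.EXIT:
            region, start = stacks[key].pop()
            intervals.append((key, region, start, ev.timestamp))
    return intervals


trace = [
    EventRecord(0.0, 0, 0, EventType.ENTER, "main"),
    EventRecord(1.0, 0, 0, EventType.ENTER, "solve"),
    EventRecord(1.5, 0, 0, EventType.SEND, "tag=7 dest=1"),
    EventRecord(4.0, 0, 0, EventType.EXIT, "solve"),
    EventRecord(5.0, 0, 0, EventType.EXIT, "main"),
]
intervals = region_durations(trace)
for key, region, start, end in intervals:
    print(key, region, start, end)
```

Unlike a profile, the trace keeps every event, so the same records could also be replayed to match sends with receives or to draw a timeline.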

Real-time Analysis
• Allows evaluation of program performance during execution
• Examples
  – Paradyn
  – Autopilot
  – Perfometer

TAU Performance System Architecture
(Figure: TAU architecture diagram; labels surviving in the extraction include Paraver and EPILOG.)

TAU Instrumentation
• Manually, using the TAU instrumentation API
• Automatically, using
  – Program Database Toolkit (PDT)
  – MPI profiling library
  – Opari OpenMP rewriting tool
• Uses PAPI to access hardware counter data

Program Database Toolkit (PDT)
• Program code analysis framework for developing source-based tools
• High-level interface to source code information
• Integrated toolkit for source code parsing, database creation, and database query
  – commercial-grade front-end parsers
  – portable IL analyzer, database format, and access API
  – open software approach for tool development
• Targets and integrates multiple source languages
• Used in TAU to build automated performance instrumentation tools

PDT Components
• Language front end
  – Edison Design Group (EDG): C, C++
  – Mutek Solutions Ltd.: F77, F90
  – creates an intermediate-language (IL) tree
• IL Analyzer
  – processes the intermediate-language (IL) tree
  – creates a “program database” (PDB) formatted file
• DUCTAPE (Bernd Mohr, ZAM, Germany)
  – C++ program Database Utilities and Conversion Tools APplication Environment
  – processes and merges PDB files
  – C++ library to access the PDB for PDT applications

TAU Analysis
• Profile analysis
  – pprof: parallel profiler with text-based display
  – Racy / jRacy: graphical interface to pprof (Tcl/Tk); jRacy is a Java implementation of Racy
  – ParaProf: next-generation parallel profile analysis and display
• Trace analysis and visualization
  – Trace merging and clock adjustment (if necessary)
  – Trace format conversion (ALOG, SDDF, Vampir)
  – Vampir (Pallas) trace visualization
  – Paraver (CEPBA) trace visualization
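Trace merging with clock adjustment can be sketched as below. The slides do not say how TAU corrects clocks, so this example assumes the simplest possible model: each process's clock differs from the reference clock by a known constant offset. The tuple layout and the helper names `adjust` and `merge_traces` are hypothetical.

```python
import heapq


def adjust(trace, offset):
    """Shift local timestamps onto the reference clock: t_ref = t_local - offset."""
    return [(t - offset, rank, event) for (t, rank, event) in trace]


def merge_traces(traces, offsets):
    """Merge per-process traces (each already time-ordered) into one global stream."""
    adjusted = [adjust(tr, off) for tr, off in zip(traces, offsets)]
    return list(heapq.merge(*adjusted))


# Rank 1's clock is assumed to run 10 time units ahead of rank 0's.
rank0 = [(0.0, 0, "enter main"), (2.0, 0, "send to 1"), (6.0, 0, "exit main")]
rank1 = [(10.5, 1, "enter main"), (13.0, 1, "recv from 0"), (15.0, 1, "exit main")]

merged = merge_traces([rank0, rank1], offsets=[0.0, 10.0])
for record in merged:
    print(record)
```

After adjustment the receive on rank 1 follows the matching send on rank 0, which is the ordering a timeline display needs. Clocks that drift relative to each other would need a time-dependent correction rather than a single offset.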

TAU Pprof Display
(Figure: screenshot of the pprof text display.)

jRacy (NAS Parallel Benchmark – LU)
(Figure: jRacy screenshots. Panel labels: global profiles; routine profile across all nodes; individual profile. Legend: n = node, c = context, t = thread.)

ParaProf Scalable Profiler
• Re-implementation of the jRacy tool
• Target flexibility in profile input source
  – Profiles, performance database, online
• Target scalability in profile size and display
  – Will include three-dimensional display support
• Provide more robust analysis and extension
  – Derived performance statistics

ParaProf Architecture
(Figure: ParaProf architecture diagram.)

512-Processor Profile (SAMRAI)
(Figure: profile display for a 512-processor SAMRAI run.)

Three-dimensional Profile Displays
(Figure: three-dimensional profile of a 500-processor Uintah execution, University of Utah.)

Overview of PAPI
• Performance Application Programming Interface
• The purpose of the PAPI project is to design, standardize, and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
• Parallel Tools Consortium project
• Reference implementations for all major HPC platforms
• Installed and in use at major government labs and academic sites
• Becoming the de facto industry standard
• Incorporated into many performance analysis tools
  – e.g., HPCView, SvPablo, TAU, Vampir, Vprof

PAPI Counter Interfaces
• PAPI provides three interfaces to the underlying counter hardware:
  1. The low-level interface provides functions for setting options, accessing native events, callback on counter overflow, etc.
  2. The high-level interface simply provides the ability to start, stop, and read the counters for a specified list of events.
  3. Graphical tools to visualize information.

PAPI Implementation
(Figure: layered architecture diagram. Portable layer: Tools, PAPI High Level, PAPI Low Level. Machine-specific layer: PAPI Machine Dependent Substrate, Kernel Extension, Operating System, Hardware Performance Counter.)

PAPI Preset Events
• Proposed standard set of events deemed most relevant for application performance tuning
• Defined in papiStdEventDefs.h
• Mapped to native events on a given platform
  – Run tests/avail to see the list of PAPI preset events available on a platform

Scalability of PAPI Instrumentation
• Overhead of library calls to read counters can be excessive.
• Statistical sampling can reduce overhead.
• PAPI substrate for Alpha Tru64 UNIX
  – Built on top of DADD/DCPI (Dynamic Access to DCPI Data / Digital Continuous Profiling Infrastructure)
  – Sampling approach supported in hardware
  – 1-2% overhead, compared to 30% on other platforms
• Using sampling and hardware profiling support on Itanium/Itanium 2
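A back-of-the-envelope model shows why sampling scales better than reading counters at every instrumented call: the cost of direct reads grows with the number of calls, while the cost of sampling depends only on the sampling rate. The sketch below is an assumption-laden illustration. The function names and every input value are hypothetical, chosen only to land in the same range as the figures quoted above; they are not measurements.

```python
def instrumentation_overhead(calls, reads_per_call, read_cost_s, base_runtime_s):
    """Fraction of extra runtime when counters are read at every instrumented call."""
    return calls * reads_per_call * read_cost_s / base_runtime_s


def sampling_overhead(samples_per_second, sample_cost_s):
    """Fraction of extra runtime when samples are taken at a fixed rate."""
    return samples_per_second * sample_cost_s


# Hypothetical inputs, for illustration only (not measured values).
direct = instrumentation_overhead(calls=10_000_000, reads_per_call=2,
                                  read_cost_s=1.5e-6, base_runtime_s=100.0)
sampled = sampling_overhead(samples_per_second=1000, sample_cost_s=10e-6)
print(f"direct reads: {direct:.1%}, sampling: {sampled:.1%}")
```

Doubling the call count doubles the first figure and leaves the second unchanged, which is the property that matters when many short routines are instrumented.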

Vampir v3.x: Hardware Counter Data
• Counter Timeline Display

What is DynaProf?
• A portable tool to instrument a running executable with probes that monitor application performance
• Simple command-line interface
• Open source software
• A work in progress…

DynaProf Methodology
• Make collection of run-time performance data easy by:
  – Avoiding instrumentation and recompilation
  – Using the same tool with different probes
  – Providing useful and meaningful probe data
  – Providing different kinds of probes
  – Allowing custom probes

Why the “Dyna”?
• Instrumentation is selectively inserted directly into the program’s address space.
• Why is this a better way?
  – No perturbation of compiler optimizations
  – Complete language independence
  – Multiple insert/remove instrumentation cycles

DynaProf Design
• GUI, command-line, and script-driven user interface
• Uses GNU readline for command-line editing and command completion
• Instrumentation is done using:
  – Dyninst on Linux, Solaris, and IRIX
  – DPCL on AIX

DynaProf Commands
  load <executable>
  list [module pattern]
  use <probe> [probe args]
  instr module [probe args]
  instr function [probe args]
  stop
  continue
  run [args]
  Info
  unload

DynaProf Probe Design
• Probes provided with the distribution
  – Wallclock probe
  – PAPI probe
  – Perfometer probe
• Can be written in any compiled language
• Probes export 3 functions with a standardized interface
• Easy to roll your own (<1 day)
• Supports separate probes for MPI/OpenMP/Pthreads

Future Development
• GUI development
• Additional probes
  – Perfex probe
  – Vprof probe
  – TAU probe
• Better support for parallel applications

Perfometer
• Application is instrumented with PAPI
  – call perfometer()
  – call mark_perfometer(int color, char *label)
• Application is started. At the call to perfometer(), a signal handler and a timer are set up to collect the information and send it to a Java applet containing the graphical view.
• Sections of code that are of interest can be designated with specific colors
• Real-time display or trace file

Perfometer Display
(Figure: Perfometer screenshot. Labeled areas: machine info; Flop/s rate; Flop/s min/max; process and real time.)

Perfometer Parallel Interface
(Figure: screenshot of the Perfometer parallel interface.)

Conclusions
• The TAU and PAPI projects are addressing important research problems involved in constructing a flexible and extensible performance observation framework.
• Widespread adoption of PAPI demonstrates the value of a portable interface to low-level, architecture-specific performance monitoring hardware.
• The TAU framework provides flexible mechanisms for instrumentation and measurement.

Conclusions (cont.)
• Terascale systems require scalable, low-overhead means of collecting performance data.
  – Statistical sampling support in PAPI
  – TAU filtering and feedback schemes for focusing instrumentation
  – Real-time monitoring capabilities (DynaProf, Perfometer)
• PAPI and TAU infrastructure is designed for interoperability, flexibility, and extensibility.

More Information
• TAU (http://www.acl.lanl.gov/tau)
• PDT (http://www.acl.lanl.gov/pdtoolkit)
• PAPI (http://icl.cs.utk.edu/papi/)
• OPARI (http://www.fz-juelich.de/zam