

  • Number of slides: 19

PAL CCS-3
A Performance Model of non-Deterministic Particle Transport on Large-Scale Systems
Mark Mathis, Texas A&M University
Darren Kerbyson and Adolfy Hoisie, Performance and Architectures Laboratory (PAL), Los Alamos National Laboratory
Presented by: Kei Davis


Performance & Architectures Lab
• Performance analysis portfolio at Los Alamos:
  – Benchmarking near-to-market advanced systems
  – Large-scale simulation: Parsims
  – Design of advanced systems
  – Application-centric modeling
• Developed models of many applications:
  – Deterministic transport (Sweep3D, Tycho)
  – Hydro code (SAGE)
  – Ocean simulation (POP)
  – MCNP (described here)
• Models are being used in many ways:
  – Predicting performance prior to availability
  – Comparison of large-scale systems (e.g. ASCI Q vs. the Earth Simulator)
  – In procurement of ASCI Purple (expected to be a 100 T system in 2004/5)
  – During installation of ASCI Q (just completed, 20 T Alpha system)

Why Model Performance?
• Performance analysis is necessary to evaluate the impact of architectural evolution and innovation.
  – Application modeling provides insight into the achievable performance on current systems, and
  – allows exploration of the performance improvements that can be expected on future systems.

Need to have an expectation
• Complex machines and software: single processors, interactions within nodes, interaction between nodes (communication networks), I/O
• Large cost for development, deployment and maintenance
• Need to know in advance what performance will be.
[Diagram: system life cycle (Design, Implementation, Procurement, Installation, Maintenance), annotated with "Lots of system choices", "Some measurement possible (small scale)", "What should we buy? (ASCI Purple)", "Verification of ASCI Q Performance", and "Update of SW and/or HW"]

MCNP (Monte-Carlo N-Particle)
• General-purpose code that can be used for neutron, photon, electron, or coupled transport.
  – Simulates individual particles (histories) and records aspects (tallies) of their average behavior.
  – Sequentially, MCNP simulates the requested number of histories on a given input geometry and reports the requested output tallies.
• In parallel, MCNP copies the entire input geometry from a master to two or more slaves.
  – Each slave simulates a different set of particles.
  – In each iteration, or cycle, the master merges tallies from all slaves during a rendezvous.
  – The complexity of the problem is constrained by the memory available at a single node, because the input geometry is copied to each PE.
  – Hence, parallelism is used to solve the problem faster, rather than to solve a more complex problem in the same amount of time (strong scaling).
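The master/slave division of work described above can be sketched roughly as follows. This is an illustrative toy, not MCNP code: the function names `split_histories` and `merge_tallies`, the even split of histories, and the dictionary representation of tallies are all assumptions made for the example.

```python
def split_histories(n_histories, n_pes):
    """Divide histories 0..n_histories-1 among the slaves (PEs 1..n_pes-1).

    Assumption: PE 0 is the master and does no history work itself.
    Returns one half-open (start, end) range per slave.
    """
    n_slaves = n_pes - 1
    if n_slaves < 1:
        raise ValueError("need at least one slave PE")
    base, extra = divmod(n_histories, n_slaves)
    ranges, start = [], 0
    for slave in range(n_slaves):
        # The first `extra` slaves take one additional history each.
        count = base + (1 if slave < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


def merge_tallies(per_slave_tallies):
    """Master-side merge at the rendezvous: sum each tally over all slaves."""
    merged = {}
    for tallies in per_slave_tallies:
        for name, value in tallies.items():
            merged[name] = merged.get(name, 0.0) + value
    return merged
```

Because every slave holds the full geometry, adding PEs only shortens each slave's history range; it does not allow a larger geometry.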

Example Experiment - Criticality
• A “critical” system is one where, on average, exactly one of the neutrons produced in a fission reaction continues the chain reaction.
  – Such a system has a neutron multiplication of one, or k_eff = 1.
  – In a “subcritical” system, k_eff < 1, and the chain reaction will die away.
  – If k_eff > 1, the system is “supercritical”. Such a system will produce large amounts of radiation and persistent radioactive contamination.
• MCNP can be used to simulate the neutron interactions for a given input geometry and calculate k_eff.
  – An example input geometry consists of an insulating cylinder with rods of various types arranged in the middle.
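As a rough illustration of what the multiplication factor means, the sketch below estimates k_eff as the average ratio of neutron counts in successive generations and classifies the result. It is a toy under stated assumptions, not the estimator MCNP uses; the function names, the generation counts, and the tolerance are hypothetical.

```python
def estimate_keff(generation_sizes):
    """Average ratio of neutron counts between successive generations."""
    if len(generation_sizes) < 2:
        raise ValueError("need at least two generations")
    ratios = [nxt / prev
              for prev, nxt in zip(generation_sizes, generation_sizes[1:])]
    return sum(ratios) / len(ratios)


def classify(keff, tolerance=0.005):
    """Label a system by its multiplication factor (tolerance is arbitrary)."""
    if abs(keff - 1.0) <= tolerance:
        return "critical"
    return "subcritical" if keff < 1.0 else "supercritical"
```

For example, a population that shrinks by 10% per generation gives a k_eff of about 0.9 and is classified as subcritical.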

Example Input Geometry
[Figure: vertical and horizontal cross-sections of the example input geometry]

Parallel Activity in MCNP
[Figure: timeline of one MCNP cycle for the master and slaves 1 to P-1, showing stages 1 to 8 grouped into a scatter phase, a work phase, and a gather phase]
• To develop a model, an understanding of the key processing operations and their scaling behavior is required.
  – The parallel activity for one cycle of MCNP is shown above.
• An analytical model is obtained from this type of analysis.

Analysis of Parallel Activity
Stage | Source | Action | Quantity          | Description
1     | Master | bcast  | P*8               | particle range to be computed by slaves
2     | Master | bcast  | 229240            | update current history
3     | Slave  | work   | Thist*Nph/(P-1)   | Thist times the number of particle histories
4     | Slave  | pt2pt  | 5512              | task common
5     | Slave  | pt2pt  | 320               | tally data
6     | Slave  | pt2pt  | 204920            | task array 1
7     | Slave  | pt2pt  | 48*Nph/(P-1)      | task array 2
8     | Slave  | pt2pt  | 32                | timing data
• Only main activities shown
  – Stages 1 and 2 correspond to the scatter phase
  – Stage 3 is the work phase
  – Stages 4-8 correspond to the gather phase
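One plausible way to turn the stage table into a per-cycle time estimate is sketched below. This is not the authors' model equation, which is not given in the text. It assumes that the communication quantities are message sizes in bytes, that the master receives the gather messages from the P-1 slaves one after another, and that broadcast and point-to-point costs are supplied as functions of message size. The names `linear_cost` and `cycle_time` are hypothetical.

```python
def linear_cost(latency_s, seconds_per_byte):
    """Build a simple cost function: time = latency + per-byte cost * size."""
    def cost(size_bytes):
        return latency_s + seconds_per_byte * size_bytes
    return cost


def cycle_time(n_pes, n_histories, t_hist, t_bcast, t_pt2pt):
    """Estimate the time of one cycle from the eight stages in the table.

    t_hist   : time per particle history in seconds (Thist)
    t_bcast  : function mapping message size (bytes) to broadcast time
    t_pt2pt  : function mapping message size (bytes) to point-to-point time
    """
    slaves = n_pes - 1
    # Stages 1-2: scatter phase (two broadcasts from the master).
    scatter = t_bcast(8 * n_pes) + t_bcast(229240)
    # Stage 3: work phase, histories divided evenly among the slaves.
    work = t_hist * n_histories / slaves
    # Stages 4-8: gather phase, five messages per slave.
    gather_sizes = [5512, 320, 204920, 48 * n_histories / slaves, 32]
    # Assumption: the master handles the slaves' messages serially.
    gather = slaves * sum(t_pt2pt(size) for size in gather_sizes)
    return scatter + work + gather
```

With zero-cost communication the estimate reduces to the work term Thist*Nph/(P-1), which makes it easy to see how much of a predicted cycle is communication overhead.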

Performance Model (Overview)
• Performance is described by analytical expressions. Top level: [equation not preserved in the source]
• Elements represent the main processing stages. For example: [equation not preserved in the source]
• Parameters in the model enable scalability studies, e.g.:
  – P (# PEs), Nph (# histories),

System Model
• The system model encapsulates key system characteristics, including:
  – Communication (e.g. latency and bandwidth)
  – Computational performance (e.g. processor speed).
• For example, point-to-point communication time Tpt2pt(S) can be modeled as a piecewise-linear curve in the message size S (in bytes), with separate segments for 0 ≤ S ≤ 32, 64 ≤ S ≤ 1024, and S > 1024.
  – Fitted segments shown on the slide: T ≈ 5 µs + 15 ns/byte and T ≈ 10 µs + 3.4 ns/byte.
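A piecewise-linear communication model of this kind can be sketched as a lookup over size ranges. The class name `PiecewiseLinear` is hypothetical, and the assignment of the two fitted lines to particular size ranges in the usage note is an assumption for illustration, since the slide text does not make the mapping explicit.

```python
import bisect


class PiecewiseLinear:
    """Communication time as latency + per-byte cost, chosen by message size."""

    def __init__(self, segments):
        # segments: (min_size_bytes, latency_s, seconds_per_byte),
        # sorted by min_size_bytes, with the first segment starting at 0.
        self.segments = sorted(segments)
        self.starts = [seg[0] for seg in self.segments]

    def __call__(self, size_bytes):
        if size_bytes < self.starts[0]:
            raise ValueError("message size below the first segment")
        # Find the last segment whose lower bound is <= size_bytes.
        index = bisect.bisect_right(self.starts, size_bytes) - 1
        _, latency_s, seconds_per_byte = self.segments[index]
        return latency_s + seconds_per_byte * size_bytes


# Illustrative parameters only: which fit applies to which range is assumed.
example_model = PiecewiseLinear([
    (0, 5e-6, 15e-9),       # assumed: small and medium messages
    (1025, 10e-6, 3.4e-9),  # assumed: messages larger than 1024 bytes
])
```

Calling `example_model(2048)` evaluates the large-message segment, 10 µs plus 3.4 ns for each of the 2048 bytes.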

Single-Processor Performance
• The single-processor performance can be modeled or measured.
• A measured value has several advantages…
  – Avoids the need to model compiler optimizations (which are complex!)
  – Eliminates the need to model the memory hierarchy.
• and disadvantages…
  – Requires preliminary benchmarking experiments (and access to the system).
  – Values are needed for all systems in a comparison.

Experimental Test-bed
• Compaq AlphaServer ES40:
  – 32 nodes, each with 4 PEs
  – 833 MHz EV68 Alpha processors
  – 64 K L1, 8 MB L2
  – 8 GB memory per node
• Quadrics QsNet interconnect
  – Fat-tree topology
  – Low latency (typically 6 µs), high bandwidth (~300 MB/s)

MCNP Model Parameters
Type        | Parameter    | Values
System      | Lc(S), Bc(S) | 5.05 µs, 0.0 MB/s (S < 64); 5.47 µs, 78 MB/s (64 … 512)
System      | Tpack(S)     | 0.12 ns (S < 32 K); 0.16 ns (32 K < S < 4 M); 0.67 ns (S > 4 M)
Application | Nph          | 100, 500, 1000, 5000, 10000, 50000, 100000
Application | Thist        | 798 µs
S in bytes
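If Lc(S) and Bc(S) are read as a latency and a bandwidth, a message time can be estimated as latency plus size divided by bandwidth. The sketch below assumes that reading, assumes 1 MB = 10^6 bytes, and treats the listed bandwidth of 0.0 MB/s for small messages as "latency only". None of these points is stated in the text, and the function name `comm_time` is hypothetical.

```python
def comm_time(size_bytes, latency_s, bandwidth_mb_per_s):
    """Estimate message time as latency + size / bandwidth.

    Assumption: a bandwidth of 0.0 means the size-dependent term is
    ignored, so only the latency is charged.
    """
    if bandwidth_mb_per_s == 0.0:
        return latency_s
    bytes_per_second = bandwidth_mb_per_s * 1e6  # assumes 1 MB = 10**6 bytes
    return latency_s + size_bytes / bytes_per_second
```

Under these assumptions a 32-byte message costs 5.05 µs, and a 256-byte message costs 5.47 µs plus 256 bytes at 78 MB/s.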

Model Validation
• The model predicts well.
  – The predicted time is often within 10% of the measured time.
• Accuracy generally decreases as the number of PEs grows.
• Some work remains to increase model accuracy.
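A validation check of this kind can be expressed as a relative error between predicted and measured times. The sketch below is a minimal illustration with hypothetical function names; any numbers passed to it would be the reader's own measurements, since no measured data appears in the text.

```python
def relative_error(predicted_s, measured_s):
    """Relative prediction error with respect to the measured time."""
    return abs(predicted_s - measured_s) / measured_s


def fraction_within(pairs, tolerance=0.10):
    """Fraction of (predicted, measured) pairs within the given tolerance."""
    hits = sum(1 for predicted, measured in pairs
               if relative_error(predicted, measured) <= tolerance)
    return hits / len(pairs)
```

Grouping the pairs by PE count before calling `fraction_within` would show whether accuracy falls off as the number of PEs grows.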

Exploring Performance
• Once validated, the model can be used to predict performance.
• E.g. new scenarios on the current architecture:
  – What if we processed a larger problem?
  – What if we used more processors?
[Figures: predicted performance under strong scaling and weak scaling]
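The two scaling modes can be illustrated with the work-phase term alone, Thist*Nph/(P-1). The sketch below ignores communication entirely, which is a simplification, and uses the measured Thist of 798 µs as a default. The function names and the choice of PE counts are hypothetical.

```python
T_HIST_S = 798e-6  # measured time per history, from the parameter table


def work_time(n_histories, n_pes, t_hist=T_HIST_S):
    """Work-phase time when histories are split among the P-1 slaves."""
    return t_hist * n_histories / (n_pes - 1)


def strong_scaling(n_histories, pe_counts):
    """Fixed total problem size: work time shrinks as PEs are added."""
    return {p: work_time(n_histories, p) for p in pe_counts}


def weak_scaling(histories_per_slave, pe_counts):
    """Fixed work per slave: total histories grow with the slave count."""
    return {p: work_time(histories_per_slave * (p - 1), p) for p in pe_counts}
```

In the strong-scaling case the work term falls as 1/(P-1), so communication costs eventually dominate; in the weak-scaling case the work term stays constant and any growth in cycle time comes from communication.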

Exploring Performance (2)
• Can explore performance on possible future architectures:
  – What if the network was faster?
  – What if the processors were faster?
  – What if message packing was faster?
• Can predict the performance for possible code modifications:
  – What if the “gather” phase was re-implemented using reductions?
[Figures: predicted performance under strong scaling and weak scaling]
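What-if questions like these amount to rescaling individual terms of the model. The sketch below is a deliberately coarse illustration that splits a cycle into a work term and a communication term and applies hypothetical speedup factors to each. It is not the authors' model, and the function name and any numbers passed to it are made up for the example.

```python
def predicted_time(work_s, comm_s, cpu_speedup=1.0, net_speedup=1.0):
    """Rescale the work and communication terms by assumed speedup factors."""
    return work_s / cpu_speedup + comm_s / net_speedup


def best_upgrade(work_s, comm_s, factor=2.0):
    """Report which single upgrade (CPU or network) gives the lower time."""
    cpu_time = predicted_time(work_s, comm_s, cpu_speedup=factor)
    net_time = predicted_time(work_s, comm_s, net_speedup=factor)
    return "processor" if cpu_time < net_time else "network"
```

When the work term is much larger than the communication term, doubling processor speed helps far more than doubling network speed, which is the kind of comparison these what-if studies make.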

Conclusions
• Developed an analytical performance model for MCNP.
• Validated the model on a large-scale AlphaServer testbed.
  – The predicted time is often within 10% of the measured time.
• Used the model to explore a number of scenarios.
  – Studied strong and weak scaling modes for small and large inputs.
  – Predicted performance for improved systems and code.
  – Showed that most performance gain will come from increased processor speed.
• Illustrated the benefits of developing a performance model of an application.
• Part of an on-going effort to model large-scale systems
http://www.c3.lanl.gov/par_arch