

  • Number of slides: 13

NCSA Terascale Clusters
Dan Reed
Director, NCSA and the Alliance
Chief Architect, NSF ETF TeraGrid
Principal Investigator, NSF NEESgrid
William and Jane Marr Gutgsell Professor, University of Illinois
[email protected] uiuc.edu

A Blast From the Past …
"Everybody who has analyzed the logical theory of computers has come to the conclusion that the possibilities of computers are very interesting – if they could be made to be more complicated by several orders of magnitude." – December 29, 1959
Richard Feynman would be proud!

NCSA Terascale Linux Clusters
• 1 TF IA-32 Pentium III cluster (Platinum)
  – 512 1 GHz dual-processor nodes
  – Myrinet 2000 interconnect
  – 5 TB of RAID storage
  – 594 GF (Linpack), production July 2001 (see the efficiency sketch below)
• 1 TF IA-64 Itanium cluster (Titan)
  – 164 800 MHz dual-processor nodes
  – Myrinet 2000 interconnect
  – 678 GF (Linpack), production March 2002
• Large-scale calculations on both
  – molecular dynamics (Schulten) – first nanosecond/day calculations
  – gas dynamics (Woodward)
  – others underway via NRAC allocations
• Software packaging for communities
  – NMI GRIDS Center, Alliance "In a Box" …
• NCSA machine room
• Lessons for TeraGrid
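
As a rough sanity check on the "1 TF" peak figures, the sketch below recomputes theoretical peak from the node counts and clock rates on the slide and compares it with the quoted Linpack results. The flops-per-cycle values are my own assumptions (about 1 double-precision flop/cycle for the 1 GHz Pentium III and about 4 for the 800 MHz Itanium), not numbers from the slides.

```c
/* Back-of-the-envelope peak vs. Linpack efficiency for Platinum and Titan.
 * Assumed flops/cycle values are estimates, not figures from the presentation. */
#include <stdio.h>

static void report(const char *name, int nodes, int cpus_per_node,
                   double ghz, double flops_per_cycle, double linpack_gf)
{
    double peak_gf = nodes * cpus_per_node * ghz * flops_per_cycle;
    printf("%-8s peak ~%4.0f GF, Linpack %3.0f GF, efficiency ~%2.0f%%\n",
           name, peak_gf, linpack_gf, 100.0 * linpack_gf / peak_gf);
}

int main(void)
{
    report("Platinum", 512, 2, 1.0, 1.0, 594.0);  /* IA-32 cluster */
    report("Titan",    164, 2, 0.8, 4.0, 678.0);  /* IA-64 cluster */
    return 0;
}
```

With these assumptions both machines land near 1 TF peak, and the measured Linpack numbers correspond to roughly 58% (Platinum) and 65% (Titan) of peak.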

Platinum Software Configuration
• Linux – Red Hat 6.2 and Linux 2.2.19 SMP kernel
• OpenPBS – resource management and job control
• Maui Scheduler – advanced scheduling
• Argonne MPICH – parallel programming API (see the example below)
• NCSA VMI – communication middleware
  – MPICH and Myrinet
• Myricom GM – Myrinet communication layer
• NCSA cluster monitor
• IBM GPFS
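
Since Argonne MPICH (layered over NCSA VMI and Myricom GM on the Myrinet fabric) was the parallel programming interface on Platinum, a minimal MPI program of the kind a user would build with MPICH and run through OpenPBS might look like the following sketch; it is a generic illustration, not code taken from the NCSA configuration.

```c
/* Minimal MPI job of the sort run on Platinum with MPICH (illustrative only). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of MPI processes */
    MPI_Get_processor_name(name, &name_len);

    printf("rank %d of %d on node %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
```

In practice such a program would be compiled with MPICH's mpicc wrapper and submitted as a batch job through OpenPBS, with the Maui scheduler deciding when the requested nodes become available.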

Session Questions
• Cluster performance and expectations
  – generally met, though with the usual hiccups
• MTBI and failure modes
  – node and disk loss (stay tuned for my next talk …)
  – copper Myrinet (fiber much more reliable)
  – avoid open house demonstrations
• System utilization
  – heavily oversubscribed (see queue delays below)
• Primary complaints
  – long batch queue delays
  – capacity vs. capability balance
  – ISV code availability
  – software tools – debuggers and performance tools
  – I/O and parallel file system performance

NCSA IA-32 Cluster Timeline
• Jan 2001 – Order placed with IBM for a 512-compute-node cluster
• 2/23 – First four racks of IBM hardware arrive
• 3/1 – Head nodes operational
• 3/10 – First 126-processor Myrinet test jobs
• 3/13 – Final IBM hardware shipment
• 3/22 – First application on compute nodes (CMS/Koranda/Litvin)
• 3/26 – Initial Globus installation
• 3/26 – Final Myrinet hardware arrives
• 3/26 – First 512-processor MILC and NAMD runs
• 4/5 – Myrinet static mapping in place
• 4/7 – CMS runs successfully
• 4/11 – 400-processor HPL runs completing
• 4/12 – Myricom engineering assistance
• 5/8 – 1000-processor MP Linpack runs
• 5/11 – 1008-processor Top 500 run @ 594 GF
• 5/14 – 2.4 kernel testing
• 5/28 – Red Hat 7.1 testing
• 6/1 – Friendly user period begins
• July 2001 – Production service

NCSA Resource Usage (usage chart)

Alliance HPC Usage (chart: normalized CPU hours (NU), NCSA Total vs. Alliance Partner Total, FY 98 through FY 02, with clusters entering production; source: PACI Usage Database)

Hero Cluster Jobs (chart: CPU hours on Platinum and Titan)

Storm Scale Prediction
• Sample four-hour forecast
  – Center for Analysis and Prediction of Storms
  – Advanced Regional Prediction System – full-physics mesoscale prediction system
• Execution environment
  – NCSA Itanium Linux cluster
  – 240 processors, 4 hours per night for 46 days
  – four-hour prediction, 3 km grid
  – initial state includes assimilation of WSR-88D reflectivity and radial velocity data, plus surface and upper air data, satellite, and wind
• On-demand computing required
• Fort Worth forecast, 2 hr (radar vs. forecast-with-radar panels)
Source: Kelvin Droegemeier

NCSA Multiphase Strategy
• Multiple user classes
  – ISV software, hero calculations
  – distributed resource sharing, parameter studies
• Four hardware approaches
  – shared memory multiprocessors – 12 32-way IBM p690 systems (2 TF peak), large memory and ISV support
  – TeraGrid IPF clusters – 64-bit Itanium 2/Madison (10 TF peak), coupled with SDSC, ANL, Caltech and PSC
  – Xeon clusters – 32-bit systems for hero calculations, dedicated sub-clusters (2-3 TF each) allocated for weeks
  – Condor resource pools – parameter studies and load sharing

Extensible TeraGrid Facility (ETF) (site diagram)
• Caltech: data collection analysis – 0.4 TF IA-64, IA-32 Datawulf, 80 TB storage
• ANL: visualization – 1.25 TF IA-64, 96 visualization nodes, 20 TB storage
• SDSC: data intensive – 4 TF IA-64, DB2 and Oracle servers, 500 TB disk storage, 6 PB tape storage, 1.1 TF Power4
• NCSA: compute intensive – 10 TF IA-64, 128 large-memory nodes, 230 TB disk storage, 3 PB tape storage, GPFS and data mining
• PSC: compute intensive – 6 TF EV68, 71 TB storage, 0.3 TF EV7 shared memory, 150 TB storage server
• Extensible backplane network – LA and Chicago hubs, 30-40 Gb/s links

NCSA TeraGrid: 10 TF IPF and 230 TB (system diagram)
• 2 TF Itanium 2 – 256 nodes, 2-processor 1 GHz, 4 or 12 GB memory, 73 GB scratch
• ~700 Madison nodes – 2-processor, 4 GB memory, 73 GB scratch (being installed now)
• Interactive + spare nodes – login and FTP: 10 2-processor Itanium 2 nodes, 10 2-processor Madison nodes
• Network – TeraGrid network over a GbE fabric; Myrinet fabric; storage I/O over Myrinet and/or GbE
• Storage – Brocade 12000 switches, 256 2x FC, 230 TB