
Overview of the Blue Gene Supercomputers
Dr. Dong Chen, IBM T. J. Watson Research Center, Yorktown Heights, NY
Outline
• Supercomputer trends
• Blue Gene/L and Blue Gene/P architecture
• Blue Gene applications
Terminology
• FLOPS = Floating Point Operations Per Second; Giga = 10^9, Tera = 10^12, Peta = 10^15, Exa = 10^18
• Peak speed vs. sustained speed
• Top 500 list (top500.org): based on the Linpack benchmark: solve the dense linear system A x = b, where A is an N x N dense matrix; total FP operations ~ 2/3 N^3 + 2 N^2
• Green 500 list (green500.org): rates the Top 500 supercomputers in FLOPS/Watt
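As a back-of-the-envelope illustration of the terminology above, the short C sketch below computes the Linpack operation count 2/3 N^3 + 2 N^2 for a given matrix size, plus the resulting sustained FLOPS and FLOPS/Watt figures. The matrix size, run time, and power draw used here are made-up example values, not measurements from the slides.

```c
#include <stdio.h>

/* Linpack bookkeeping: solving Ax = b with an N x N dense matrix takes
 * roughly 2/3 N^3 + 2 N^2 floating-point operations. */
static double linpack_flops(double n)
{
    return (2.0 / 3.0) * n * n * n + 2.0 * n * n;
}

int main(void)
{
    double n       = 1.0e6;   /* example matrix dimension (assumed)    */
    double seconds = 3600.0;  /* example wall-clock time for the solve */
    double watts   = 2.0e6;   /* example average system power draw     */

    double flops   = linpack_flops(n);
    double flops_s = flops / seconds;   /* sustained FLOPS            */
    double flops_w = flops_s / watts;   /* Green 500-style metric     */

    printf("total FP ops    : %.3e\n", flops);
    printf("sustained FLOPS : %.3e (%.2f TFLOPS)\n", flops_s, flops_s / 1e12);
    printf("FLOPS per watt  : %.3e (%.1f MFLOPS/W)\n", flops_w, flops_w / 1e6);
    return 0;
}
```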
CMOS Scaling in the Petaflop Era
• Three decades of exponential clock rate (and electrical power!) growth has ended
• Instruction Level Parallelism (ILP) growth has ended
• Single-threaded performance improvement is dead (Bill Dally)
• Yet Moore's Law continues in transistor count
• Industry response: multi-core, i.e. double the number of cores every 18 months instead of the clock frequency (and power!)
Source: "The Landscape of Computer Architecture," John Shalf, NERSC/LBNL, presented at ISC 07, Dresden, June 25, 2007
TOP 500 Performance Trend
Over the long haul IBM has demonstrated continued leadership in various TOP 500 metrics, even as performance continues its relentless growth.
• IBM has had the most aggregate performance for the last 22 lists
• IBM has had the #1 system for 10 of the last 12 lists (13 in total)
• IBM has had the most systems in the Top 10 for the last 14 lists
• IBM has had the most systems for 14 of the last 22 lists
[Chart: TOP 500 performance trend; total aggregate 32.43 PF, #1 at 1.759 PF, #10 at 433.2 TF, #500 at 24.67 TF; blue square markers indicate IBM leadership]
Source: www.top500.org
President Obama Honors IBM's Blue Gene Supercomputer With National Medal Of Technology And Innovation
Ninth time IBM has received the nation's most prestigious tech award; Blue Gene has led to breakthroughs in science, energy efficiency and analytics.
WASHINGTON, D.C. - 18 Sep 2009: President Obama recognized IBM (NYSE: IBM) and its Blue Gene family of supercomputers with the National Medal of Technology and Innovation, the country's most prestigious award given to leading innovators for technological achievement. President Obama will personally bestow the award at a special White House ceremony on October 7. IBM, which earned the National Medal of Technology and Innovation on eight other occasions, is the only company recognized with the award this year.
Blue Gene's speed and expandability have enabled business and science to address a wide range of complex problems and make more informed decisions -- not just in the life sciences, but also in astronomy, climate, simulations, modeling and many other areas. Blue Gene systems have helped map the human genome, investigated medical therapies, safeguarded nuclear arsenals, simulated radioactive decay, replicated brain power, flown airplanes, pinpointed tumors, predicted climate trends, and identified fossil fuels, all without the time and money that would have been required to physically complete these tasks.
The system also reflects breakthroughs in energy efficiency. With the creation of Blue Gene, IBM dramatically shrank the physical size and energy needs of a computing system whose processing speed would otherwise have required a dedicated power plant capable of powering thousands of homes. The influence of the Blue Gene supercomputer's energy-efficient design and computing model can be seen today across the Information Technology industry. Today, 18 of the top 20 most energy-efficient supercomputers in the world are built on IBM high performance computing technology, according to the latest Supercomputing 'Green 500 List' announced by Green500.org in July 2009.
Blue Gene Roadmap
• BG/L (5.7 TF/rack), 130 nm ASIC (1999–2004 GA)
  – Up to 104 racks, 212,992 cores, 596 TF/s, 210 MF/W
  – Dual-core system-on-chip, 0.5/1 GB per node
• BG/P (13.9 TF/rack), 90 nm ASIC (2004–2007 GA)
  – Up to 72 racks, 294,912 cores, 1 PF/s, 357 MF/W
  – Quad-core SOC with DMA, 2/4 GB per node
  – SMP support, OpenMP, MPI
• BG/Q (209 TF/rack)
  – 20 PF/s
IBM Blue Gene/P Solution: Expanding the Limits of Breakthrough Science
Blue Gene Technology Roadmap (performance over time)
• Blue Gene/L (PPC 440 @ 700 MHz): scalable to 595 TFlops (2004)
• Blue Gene/P (PPC 450 @ 850 MHz): scalable to 3.56 PF (2007)
• Blue Gene/Q: Power multi-core, scalable to 100 PF (2010)
Note: All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
BlueGene/L System Buildup
• Chip: 2 processors, 2.8/5.6 GF/s, 4 MB
• Compute Card: 2 chips (1 x 2 x 1), 5.6/11.2 GF/s, 2.0 GB
• Node Card: 16 compute cards, 0–2 I/O cards (32 chips, 4 x 4 x 2), 90/180 GF/s, 32 GB
• Rack: 32 node cards, 2.8/5.6 TF/s, 1 TB
• System: 64 racks (64 x 32 x 32), 180/360 TF/s, 64 TB
BlueGene/L Compute ASIC
Double Floating-Point Unit (BlueGene/L)
• Two replicas of a standard single-pipe PowerPC FPU
• 2 x 32 64-bit registers (primary FPR P0–P31, secondary FPR S0–S31)
• Attached to the PPC 440 core using the APU interface
  – Issues instructions across the APU interface
  – Instruction decode performed in the Double FPU
  – Separate APU interface from the LSU provides up to 16 B of data for loads and stores
  – Datapath width is 16 bytes (quadword load/store data), feeding the two FPUs with 8 bytes each every cycle
• Two FP multiply-add operations per cycle
• 2.8 GF/s peak
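A minimal sketch of the arithmetic behind the 2.8 GF/s peak figure, assuming the 700 MHz PPC 440 clock quoted on the roadmap slide: two fused multiply-adds per cycle, each counted as two FLOPs.

```c
#include <stdio.h>

int main(void)
{
    double clock_hz      = 700.0e6; /* PPC 440 core clock on BG/L (from roadmap slide) */
    double fma_per_cycle = 2.0;     /* two FP multiply-adds per cycle (Double FPU)     */
    double flops_per_fma = 2.0;     /* a fused multiply-add counts as 2 FLOPs          */

    double peak = clock_hz * fma_per_cycle * flops_per_fma;
    printf("Double FPU peak: %.1f GF/s per core\n", peak / 1e9); /* 2.8 GF/s */
    return 0;
}
```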
Blue Gene/L Memory Characteristics
Memory (per node; system = 64K nodes):
• L1: 32 KB/32 KB per processor
• L2/SRAM: 16 KB
• L3: 4 MB (ECC)
• Main store: 512 MB (ECC) per node, 32 TB system total
Bandwidth:
• L1 to registers: 11.2 GB/s, independent R/W and instruction
• L2 to L1: 5.3 GB/s, independent R/W and instruction
• L3 to L2: 11.2 GB/s
• Main (DDR): 5.3 GB/s
Latency:
• L1 miss, L2 hit: 13 processor cycles (pclks)
• L2 miss, L3 hit: 28 pclks (EDRAM page hit/EDRAM page miss)
• L2 miss (main store): 75 pclks for DDR closed-page access (L3 disabled/enabled)
Blue Gene/L Interconnection Networks
3-Dimensional Torus
– Interconnects all compute nodes (65,536)
– Virtual cut-through hardware routing
– 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
– Communications backbone for computations
– 0.7/1.4 TB/s bisection bandwidth, 67 TB/s total bandwidth
Global Collective Network
– One-to-all broadcast functionality
– Reduction operations functionality
– 2.8 Gb/s of bandwidth per link; latency of tree traversal 2.5 µs
– ~23 TB/s total binary tree bandwidth (64K machine)
– Interconnects all compute and I/O nodes (1024 I/O nodes)
Low-Latency Global Barrier and Interrupt
– Round-trip latency 1.3 µs
Control Network
– Boot, monitoring and diagnostics
Ethernet
– Incorporated into every node ASIC
– Active in the I/O nodes (1:64)
– All external communication (file I/O, control, user interaction, etc.)
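The 3-D torus maps naturally onto an MPI Cartesian communicator. The sketch below is generic MPI, not Blue Gene-specific code: it builds a periodic 3-D process grid and exchanges one value with each of the six nearest neighbors, the communication pattern the torus backbone is designed for.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Factor the job into a 3-D grid, periodic in every dimension (a torus),
     * and allow rank reordering for a better network mapping. */
    int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
    MPI_Dims_create(nprocs, 3, dims);

    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &torus);

    double send = (double)rank, recv[6];
    MPI_Request reqs[12];
    int nreq = 0;

    /* Exchange one value with the +/- neighbor in each of the 3 dimensions. */
    for (int dim = 0; dim < 3; dim++) {
        int lo, hi;
        MPI_Cart_shift(torus, dim, 1, &lo, &hi);
        MPI_Irecv(&recv[2 * dim],     1, MPI_DOUBLE, lo, 0, torus, &reqs[nreq++]);
        MPI_Irecv(&recv[2 * dim + 1], 1, MPI_DOUBLE, hi, 0, torus, &reqs[nreq++]);
        MPI_Isend(&send, 1, MPI_DOUBLE, lo, 0, torus, &reqs[nreq++]);
        MPI_Isend(&send, 1, MPI_DOUBLE, hi, 0, torus, &reqs[nreq++]);
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);

    if (rank == 0)
        printf("3-D torus: %d x %d x %d processes\n", dims[0], dims[1], dims[2]);

    MPI_Comm_free(&torus);
    MPI_Finalize();
    return 0;
}
```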
BlueGene/P System Buildup
• Chip: 4 processors, 13.6 GF/s, 8 MB EDRAM
• Compute Card: 1 chip, 20 DRAMs, 13.6 GF/s, 2.0 GB DDR2 (4.0 GB as of 6/30/08)
• Node Card: 32 compute cards, 0–1 I/O cards (32 chips, 4 x 4 x 2), 435 GF/s, 64 (128) GB
• Rack: 32 node cards, 13.9 TF/s, 2 (4) TB
• System: 72 racks, cabled 8 x 8 x 16, 1 PF/s, 144 (288) TB
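A quick consistency check on the buildup figures, assuming the 850 MHz PPC 450 clock from the roadmap slide: four cores, each doing two fused multiply-adds per cycle, give 13.6 GF/s per chip; 32 node cards of 32 chips give roughly 13.9 TF/s per rack, and 72 racks give about 1 PF/s.

```c
#include <stdio.h>

int main(void)
{
    double clock_hz   = 850.0e6;   /* PPC 450 clock on BG/P                   */
    double cores      = 4.0;
    double flops_core = 2.0 * 2.0; /* 2 FMA/cycle x 2 FLOPs per FMA           */

    double chip = clock_hz * cores * flops_core;  /* per compute chip          */
    double rack = chip * 32 * 32;                 /* 32 node cards x 32 chips  */
    double sys  = rack * 72;                      /* 72-rack system            */

    printf("chip   : %6.1f GF/s\n", chip / 1e9);  /* 13.6 GF/s  */
    printf("rack   : %6.1f TF/s\n", rack / 1e12); /* ~13.9 TF/s */
    printf("system : %6.2f PF/s\n", sys / 1e15);  /* ~1.0 PF/s  */
    return 0;
}
```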
BlueGene/P Compute ASIC (block diagram)
• Four PPC 450 cores, each with 32 KB I1 / 32 KB D1 caches, a Double FPU, a snoop filter, and a private L2
• Multiplexing switches connect the cores to two shared L3 banks: 4 MB eDRAM each (512 b data + 72 b ECC), with shared L3 directories
• Shared SRAM, DMA engine, hybrid PMU with SRAM (256 x 64 b), JTAG access
• DDR-2 controller with ECC; 13.6 GB/s DDR-2 DRAM bus
• Network interfaces: torus (6 links, 3.4 Gb/s bidirectional), collective (3 links, 6.8 Gb/s bidirectional), global barrier (4 global barriers or interrupts), 10 Gb Ethernet, JTAG
Blue Gene/P Memory Characteristics
Memory (per node):
• L1: 32 KB/32 KB per processor
• L3: 8 MB (ECC)
• Main store: 2–4 GB (ECC)
Bandwidth:
• L1 to registers: 6.8 GB/s instruction read, 6.8 GB/s data read, 6.8 GB/s write
• L2 to L1: 5.3 GB/s, independent R/W and instruction
• Main (DDR): 13.6 GB/s
Latency:
• L1 hit: 3 processor cycles (pclks)
• L1 miss, L2 hit: 13 pclks
• L2 miss, L3 hit: 46 pclks (EDRAM page hit/EDRAM page miss)
• L2 miss (main store): 104 pclks for DDR closed-page access (L3 disabled/enabled)
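The latencies above are quoted in processor cycles; converting them to wall-clock time is a one-line calculation, assuming the 850 MHz PPC 450 clock given earlier in the deck.

```c
#include <stdio.h>

int main(void)
{
    double clock_hz    = 850.0e6;            /* assumed PPC 450 clock on BG/P */
    double ns_per_pclk = 1.0e9 / clock_hz;   /* ~1.18 ns per cycle            */

    int latencies[]    = {3, 13, 46, 104};   /* pclk counts from the table    */
    const char *what[] = {"L1 hit", "L1 miss/L2 hit", "L2 miss/L3 hit",
                          "L2 miss (main store)"};

    for (int i = 0; i < 4; i++)
        printf("%-22s %4d pclks = %6.1f ns\n",
               what[i], latencies[i], latencies[i] * ns_per_pclk);
    return 0;
}
```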
BlueGene/P Interconnection Networks
3-Dimensional Torus
– Interconnects all compute nodes (73,728)
– Virtual cut-through hardware routing
– 3.4 Gb/s on all 12 node links (5.1 GB/s per node)
– 0.5 µs latency between nearest neighbors, 5 µs to the farthest
– MPI: 3 µs latency for one hop, 10 µs to the farthest
– Communications backbone for computations
– 1.7/3.9 TB/s bisection bandwidth, 188 TB/s total bandwidth
Collective Network
– One-to-all broadcast functionality
– Reduction operations functionality
– 6.8 Gb/s of bandwidth per link per direction
– Latency of one-way tree traversal 1.3 µs, MPI 5 µs
– ~62 TB/s total binary tree bandwidth (72K machine)
– Interconnects all compute and I/O nodes (1152 I/O nodes)
Low-Latency Global Barrier and Interrupt
– Latency of one way to reach all 72K nodes 0.65 µs, MPI 1.6 µs
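Reductions and broadcasts are exactly what the collective network accelerates, and the barrier network backs global synchronization. The fragment below is a generic MPI sketch of those operations, not a Blue Gene-specific API.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank contributes a partial sum; on a full partition the combine
     * is performed in hardware as the data flows up and down the tree. */
    double local = 1.0 + rank, global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Global synchronization, backed by the low-latency barrier network. */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %.1f\n", global);

    MPI_Finalize();
    return 0;
}
```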
Linpack GFLOPS/W, November 2007 Green 500
[Chart; marked values 0.09, 0.05, and 0.02 GFLOPS/W]
Relative power, space and cooling efficiencies (published specs per peak performance)
[Chart comparing IBM BG/P against other systems]
Linpack GF/Watt System Power Efficiency. Source: www.top500.org
HPCC 2009
IBM BG/P, 0.557 PF peak (40 racks):
• Class 1: Number 1 on G-RandomAccess (117 GUPS)
• Class 2: Number 1
Cray XT5, 2.331 PF peak:
• Class 1: Number 1 on G-HPL (1533 TF/s)
• Class 1: Number 1 on EP-Stream (398 TB/s)
• Number 1 on G-FFT (11 TF/s)
Source: www.top500.org
Main Memory Capacity per Rack
Peak Memory Bandwidth per node (byte/flop)
Main Memory Bandwidth per Rack
Interprocessor Peak Bandwidth per node (byte/flop)
Failures per Month per TF. From: http://acts.nersc.gov/events/Workshop2006/slides/Simon.pdf
Execution Modes in BG/P per Node
• Quad Mode (VNM): 4 processes, 1 thread per process
• Dual Mode: 2 processes, 1–2 threads per process
• SMP Mode: 1 process, 1–4 threads per process
Next-generation HPC node: many-core, expensive memory, two-tiered programming model.
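In SMP mode a single MPI process per node spans the four cores with threads. A minimal hybrid MPI + OpenMP sketch, generic rather than Blue Gene-specific, with four threads standing in for the four PPC 450 cores:

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;

    /* One process per node (SMP mode); up to 4 OpenMP threads per process. */
    #pragma omp parallel for reduction(+ : local) num_threads(4)
    for (int i = 0; i < 1000000; i++)
        local += 1.0 / (1.0 + i);   /* stand-in for per-node work */

    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("result = %f\n", global);

    MPI_Finalize();
    return 0;
}
```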
Blue Gene Software Hierarchical Organization
• Compute nodes are dedicated to running the user application and almost nothing else: a simple Compute Node Kernel (CNK)
• I/O nodes run Linux and provide a more complete range of OS services: files, sockets, process launch, signaling, debugging, and termination
• The service node performs system management services (e.g., partitioning, heartbeating, error monitoring), transparent to application software
• Front-end nodes and the file system attach over 10 Gb and 1 Gb Ethernet
Noise measurements (from Adolfy Hoisie)
Blue Gene/P System Architecture
• Service node: runs MMCS and DB2; low-level control over the 1 Gb control Ethernet (JTAG, I2C via FPGA) to the racks
• Front-end nodes, file servers and the system console attach over the 10 Gb functional Ethernet; job launch via LoadLeveler
• I/O nodes: run Linux with ciod and a file-system client, bridging the compute nodes to the functional Ethernet
• Compute nodes (C-node 0 through C-node n): run CNK and the application; connected to the I/O nodes over the collective (tree) network and to each other over the torus
BG/P Software Stack and Source Availability
• Compute and I/O node software: MPI, GPSHMEM, MPI-IO, messaging and node SPIs, XL and open toolchain runtimes, ESSL, GPFS (1), CIOD, CNK, compute-node Linux kernel, totalviewd, common node services (hardware init, RAS, recovery, mailbox), bootloader, diags
• Service node / front-end node software: high-level control system (MMCS: partitioning, job management and monitoring, RAS, administrator interfaces, CIODB), low-level control system (power on/off, hardware probe, hardware init, parallel monitoring, parallel boot, mailbox), DB2, mpirun, Bridge API, LoadLeveler, CSM, HPC Toolkit, BG Nav, Perf. Mon, ISV schedulers and debuggers
• Hardware: compute nodes, I/O nodes, node cards, link cards, service cards, service node (SN), front-end nodes (FEN)
Source availability key:
– Closed: no source provided, not buildable
– Closed: buildable source; no redistribution of derivative works allowed under license
– New open source reference implementation licensed under CPL
– New open source community under CPL license; active IBM participation
– Existing open source communities under various licenses; BG code will be contributed and/or a new sub-community started
(1) GPFS has an open build license available which customers may utilize.
Areas Where BG is Used
• Weather/climate modeling (GOVERNMENT / INDUSTRY / UNIVERSITIES)
• Computational fluid dynamics: airplane and jet engine design, chemical flows, turbulence (ENGINEERING / AEROSPACE)
• Seismic processing (PETROLEUM, nuclear industry)
• Particle physics: LATTICE gauge QCD
• Systems biology: classical and quantum molecular dynamics (PHARMA / MED INSURANCE / HOSPITALS / UNIV)
• Modeling complex systems (PHARMA / BUSINESS / GOVERNMENT / UNIVERSITIES)
• Large database search
• Nuclear industry
• Astronomy (UNIVERSITIES)
• Portfolio analysis via Monte Carlo (BANKING / FINANCE / INSURANCE)
LLNL Applications
IDC Technical Computing Systems Forecast (segments)
• Bio Sci: genomics, proteomics, pharmacogenomics, pharma research, bioinformatics, drug discovery
• Chem Eng: chemical engineering: molecular modeling, computational chemistry, process design
• CAD: mechanical CAD, 3D wireframe (mostly graphics)
• CAE: computer-aided engineering: finite element modeling, CFD, crash, solid modeling (cars, aircraft, …)
• DCC&D: digital content creation and distribution
• Econ Fin: economic and financial modeling, econometric modeling, portfolio management, stock market modeling
• EDA: electronic design and analysis: schematic capture, logic synthesis, circuit simulation, system modeling
• Geo: geo sciences and geo engineering: seismic analysis, oil services, reservoir modeling
• Govt Lab: government labs and research centers: government-funded R&D
• Defense: surveillance, signal processing, encryption, command, control, communications, intelligence, geospatial image management, weapon design
• Software Engineering: development and testing of technical applications
• Technical Management: product data management, maintenance records management, revision control, configuration management
• Academic: university-based R&D
• Weather: atmospheric modeling, meteorology, weather forecasting
What is driving the need for more HPC cycles?
Genome sequencing, biological modeling, pandemic research, materials science, fluid dynamics, drug discovery, financial modeling, climate modeling, geophysical data processing.
HPC Use Cases
Capability
– Calculations not possible on small machines
– Usually these calculations involve systems where many disparate scales are modeled: one scale defines the required work per computation step, a different scale determines the total time to solution
– Useful as proofs of concept
– Examples: protein folding (10^-15 s to 1 s); refined grids in weather forecasting (10 km today, 1 km in a few years); full simulation of the human brain
Complexity
– Calculations which seek to combine multiple components to produce an integrated model of a complex system
– Individual components can have significant computational requirements
– Coupling between components requires that all components be modeled simultaneously; as components are modeled, changes at the interfaces are constantly transferred between them
– Critical to manage multiple scales in physical systems
– Examples: water cycle modeling in climate/environment; geophysical modeling for oil recovery; virtual fab; multisystem / coupled systems modeling
Understanding
– Repetition of a basic calculation many times with different model parameters, inputs and boundary conditions
– Goal is to develop a clear understanding of the behavior, dependencies, and sensitivities of the solution over a range of parameters
– Essential for developing parameter understanding and sensitivity analysis
– Examples: multiple independent simulations of hurricane paths to develop probability estimates of possible paths and strengths; thermodynamics of protein/drug interactions; sensitivity analysis in oil reservoir modeling; optimization of aircraft wing design
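The basic pattern behind the "Understanding" use case is an ensemble of independent runs over a parameter range. A hedged sketch of distributing such a sweep over MPI ranks; run_model() and the ensemble size are placeholders, not anything from the slides.

```c
#include <mpi.h>
#include <stdio.h>

/* Placeholder for one independent simulation with a given parameter value. */
static double run_model(double param)
{
    return param * param;   /* stand-in for a real model run */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int nruns = 1024;          /* size of the parameter ensemble */
    double local_sum = 0.0;

    /* Round-robin the ensemble members over the available ranks. */
    for (int i = rank; i < nruns; i += nprocs)
        local_sum += run_model(i / (double)nruns);

    double sum = 0.0;
    MPI_Reduce(&local_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ensemble mean = %f\n", sum / nruns);

    MPI_Finalize();
    return 0;
}
```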
Capability
Complexity: Modern Integrated Water Management
Partner ecosystem: climatologists, environmental observation systems companies, sensor companies, environmental sciences consultants, engineering services companies, subject matter experts, universities.
Advanced Water Management Reference IT Architecture:
– Physical models: climate, hydrological, meteorological, ecological; physical, chemical, biological
– Sensors: environmental, in-situ, remotely sensed; planning and placement
– Model strategy: selection, integration and coupling, validation, temporal/spatial scales
– Analyses: stochastic models and statistics, machine learning, optimization; historical, present, near future, seasonal, long term, far future
– Enabling IT: HPC, visualization, data management
Overall Efficiencies of BG Applications and Major Scientific Advances
Applications: Qbox (DFT, LLNL); CPMD (IBM); MGDC; ddcMD (classical MD, LLNL); new ddcMD (LLNL); MDCASK (LLNL); SPaSM (LANL); LAMMPS (SNL); RXFF; GMD; Rosetta (UW); AMBER; quantum chromodynamics (CPS, MILC, Chroma); sPPM (CFD, LLNL); Miranda, Raptor (LLNL); DNS3D; NEK5 (thermal hydraulics, ANL); HYPO4D, PLB (lattice Boltzmann); ParaDis (dislocation dynamics, LLNL); WRF (weather, NCAR); POP (oceanography); HOMME (climate, NCAR); GTC (plasma physics, PPPL); Nimrod (GA); FLASH (Type Ia supernova); Cactus (general relativity); DOCK5, DOCK6; Argonne v18 nuclear potential; "Cat" brain simulation.
[Table columns per application: sustained fraction of peak (values including 56.5%, 30%, 27.6%, 17.4%, 18%, 22%, 10%, 12%, 7%, 17%, 16%); awards (2005, 2006 and 2007 Gordon Bell Awards, 2006 and 2009 GB Special Awards, 2010 Bonner Prize, several highest-scaling entries); machine scale in BG/L and BG/P racks (e.g. 4 L, 20 L, 64 L, 104 L, 16 P, 32 P, 40 P)]
High Performance Computing Trends
Three distinct phases:
– Past: exponential growth in processor performance, mostly through CMOS technology advances
– Near term: exponential (or faster) growth in the level of parallelism (1 PF in 2008, 10 PF in 2011)
– Long term: power cost = system cost; invention required
The curve is indicative not only of peak performance but also of performance/$.