2305072faf01737ebd522d5789129064.ppt
- Number of slides: 19
HPC and the ROMS BENCHMARK Program Kate Hedstrom August 2003 1
Outline • New ARSC systems • Experience with ROMS benchmark problem • Other computer news 2
New ARSC Systems • Cray X1 • 128 MSP (1.5 TFLOPS) • 4 GB/MSP • Water cooled • IBM p690+ and p655+ • 5 TFLOPS total • At least 2 GB/cpu • Air cooled • Arriving in September, switch later 3
Cray X1 (klondike) 4
5
Cray • Cray X1 Node • Node is a 4-way SMP • 16 GB/node • Each MSP has four vector/scalar processors • Processors in MSP share cache • Node usable as 4 MSPs or 16 SSPs • IEEE floating point hardware 6
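As a quick consistency check, the node-level figures can be multiplied out. This is only a sketch: it takes the 128-MSP system size and 4 GB/MSP from the earlier "New ARSC Systems" slide and the per-node figures from this one. The variable names are made up for illustration.

```python
# Figures quoted on the slides (not measured here).
MSPS_TOTAL = 128      # MSPs in the whole system
MSPS_PER_NODE = 4     # "Node is a 4-way SMP"
SSPS_PER_MSP = 4      # "Each MSP has four vector/scalar processors"
GB_PER_MSP = 4        # "4 GB/MSP"

nodes = MSPS_TOTAL // MSPS_PER_NODE           # number of nodes
gb_per_node = MSPS_PER_NODE * GB_PER_MSP      # should match "16 GB/node"
ssps_per_node = MSPS_PER_NODE * SSPS_PER_MSP  # should match "16 SSPs"
ssps_total = MSPS_TOTAL * SSPS_PER_MSP        # SSPs in the whole system
gb_total = MSPS_TOTAL * GB_PER_MSP            # total memory in GB

print(f"nodes: {nodes}")
print(f"memory per node: {gb_per_node} GB")
print(f"SSPs per node: {ssps_per_node}")
print(f"SSPs in system: {ssps_total}")
print(f"total memory: {gb_total} GB")
```

The per-node results agree with the 16 GB/node and 16 SSPs/node stated on the slide.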
Cray • Programming Environment • Fortran, C, C++ • Support for • MPI • SHMEM • Co-Array Fortran • UPC • OpenMP (Fall 2003) • Compilation runs on the CPES (a Sun V480), invisibly to the user 7
8
IBM • Two p690+ • Like our Regatta, but faster, more memory (8 GB/cpu) • Shared memory between 32 cpus • For big OpenMP jobs • Six p655+ towers • Like our SP, but faster, more memory (2 GB/cpu) • Shared memory on each 8-cpu node, 92 nodes in all • For big MPI jobs and small OpenMP jobs 9
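The split between "big OpenMP jobs" and "small OpenMP jobs" comes down to how many cpus and how much memory sit behind one shared-memory image. The following is a minimal sketch of that reasoning, assuming one thread per cpu and that a job may use all of a node's memory. The class `SharedMemoryNode` and the helper `hosts_for_openmp` are hypothetical, not part of any ARSC tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedMemoryNode:
    """One shared-memory image, using the figures quoted on the slide."""
    name: str
    cpus: int        # cpus sharing memory
    gb_per_cpu: int  # memory per cpu in GB
    count: int       # how many such nodes exist

    @property
    def gb(self) -> int:
        """Total memory visible to a single OpenMP job on this node."""
        return self.cpus * self.gb_per_cpu


P690 = SharedMemoryNode("p690+", cpus=32, gb_per_cpu=8, count=2)
P655 = SharedMemoryNode("p655+ node", cpus=8, gb_per_cpu=2, count=92)


def hosts_for_openmp(threads, gb_needed, nodes=(P690, P655)):
    """Names of node types that can hold one OpenMP job (hypothetical helper)."""
    return [n.name for n in nodes if threads <= n.cpus and gb_needed <= n.gb]


for node in (P690, P655):
    print(f"{node.name}: {node.cpus} cpus, {node.gb} GB shared, "
          f"{node.cpus * node.count} cpus in all")

print(hosts_for_openmp(threads=8, gb_needed=10))   # fits on either node type
print(hosts_for_openmp(threads=16, gb_needed=10))  # needs more than 8 cpus
```

Under these assumptions, an OpenMP job with more than 8 threads or more than 16 GB only fits on a p690+, while larger runs on the p655+ towers have to span nodes with MPI.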
10
Benchmark Problem • No external files to read • Three different resolutions • Periodic channel representing the Antarctic Circumpolar Current (ACC) • Steep bathymetry • Idealized winds, clouds, etc., but full computation of atmospheric boundary layer • KPP vertical mixing 11
12
13
IBM and SX-6 Notes • SX-6 is 8 GFLOPS, Power4 is 5.2 GFLOPS peak • Both sustain less than 10% of peak • IBM scales better, Cray person says SX-6 is even worse for more than one node • SX-6 best for 1xN tiling, IBM better closer to MxM even though this problem is 512 x 64 14
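The tiling remark can be made concrete by listing how a 512 x 64 grid splits for a fixed number of tiles. The sketch below is illustrative only and is not ROMS code: the names `tilings` and `tile_stats` are made up, the first factor is assumed to count tiles along the 512-point direction, that direction is assumed to be the innermost loop, and tile perimeter is used as a rough stand-in for halo-exchange cost.

```python
def tilings(n_tiles):
    """All (tiles_i, tiles_j) factor pairs whose product is n_tiles."""
    return [(i, n_tiles // i) for i in range(1, n_tiles + 1) if n_tiles % i == 0]


def tile_stats(nx, ny, tiles_i, tiles_j):
    """Tile size, inner-loop length and perimeter for an evenly divided grid."""
    if nx % tiles_i or ny % tiles_j:
        raise ValueError("grid does not divide evenly into these tiles")
    tx, ty = nx // tiles_i, ny // tiles_j
    return {
        "tile": (tx, ty),             # points per tile in each direction
        "inner_loop": tx,             # assumed vector length (i is innermost)
        "perimeter": 2 * (tx + ty),   # rough proxy for halo-exchange volume
    }


NX, NY = 512, 64  # grid size quoted on the slide

for ti, tj in tilings(16):
    s = tile_stats(NX, NY, ti, tj)
    print(f"{ti:2d} x {tj:2d}: tile {s['tile'][0]:3d} x {s['tile'][1]:2d}, "
          f"inner loop {s['inner_loop']:3d}, perimeter {s['perimeter']}")
```

For 16 tiles, the 1 x 16 layout keeps the full 512-point rows, which suits a vector machine, but has the largest perimeter. The 4 x 4 layout shortens the inner loop to 128 points and cuts the perimeter substantially, which is the trade-off a cache-based, communication-sensitive system would care about.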
Cray X1 Notes • Have choice of MSP or SSP mode • Four SSPs faster than one MSP • Sixteen MSPs much faster than 64 SSPs • On one MSP, vanilla ROMS spends: • 66% in bulk_flux • 28% in LMD • 2% in 2-D engine • Slower than either Power4 or SX-6 • Can inline lmd_wscale and vastly speed up LMD with compiler option; John Levesque has offered to rewrite bulk_flux - aim for 6-8 times faster than Power4 for CCSM 15
Clusters • Can buy rack mounted turnkey systems running Linux • Need to spend money on: • Memory • Processors - single cpu nodes may be best • Switch - low latency, high bandwidth • Disk storage 16
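The profile explains why bulk_flux and LMD are the targets: together they account for 94% of the run time on one MSP. An Amdahl's-law estimate shows what tuning them could buy overall. This is a sketch only: the fractions are the ones quoted on the slide, but the 10x speedup factors are hypothetical placeholders, not measured or promised results, and `overall_speedup` is an illustrative helper.

```python
def overall_speedup(profile, speedups):
    """Amdahl's-law estimate of whole-program speedup.

    profile  -- fraction of run time per routine (need not sum to 1;
                the remainder is treated as code that is not sped up)
    speedups -- speedup factor per routine; missing routines get 1.0
    """
    total = sum(profile.values())
    if total > 1.0 + 1e-9:
        raise ValueError("profile fractions sum to more than 1")
    other = max(0.0, 1.0 - total)  # time outside the listed routines
    new_time = other + sum(
        frac / speedups.get(name, 1.0) for name, frac in profile.items()
    )
    return 1.0 / new_time


# Fractions quoted on the slide for vanilla ROMS on one MSP.
profile = {"bulk_flux": 0.66, "LMD": 0.28, "2-D engine": 0.02}

# Hypothetical tuning outcomes, chosen only to illustrate the arithmetic.
tuned = {"bulk_flux": 10.0, "LMD": 10.0}

print(f"both routines 10x faster: {overall_speedup(profile, tuned):.2f}x overall")
print(f"only LMD 10x faster:      {overall_speedup(profile, {'LMD': 10.0}):.2f}x overall")
```

With these placeholder factors the whole program runs roughly 6.5 times faster when both routines are tuned, but well under 1.5 times faster when only LMD is, because bulk_flux still dominates. Even with both routines made arbitrarily fast, the remaining 6% of the run time caps the gain at about 17x.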
Don Morton’s Experience • No such thing as turnkey Beowulf • Need someone to take care of it: • Configure queuing system to make it useful for more than one user • Security updates • Backups 17
DARPA Petaflops award • Sun, IBM, Cray each awarded ~$50 million for phase-two development • Two will be awarded phase 3 in 2006 • Goal is to achieve petaflops by about 2010, also easier to program, more robust operating environment • Sun - new switch between cpus, memory • IBM - huge cache on chip • Cray - heavyweight, lightweight cpus 18
Conclusions • Things are still exciting in the computer industry • The only thing you can count on is change 19