
Does HPC really fit GPUs?
Davide Rossetti, APE group, INFN – Roma
davide.rossetti@roma1.infn.it
Incontro di lavoro della CCR, Napoli, 26-27 January 2010
A case study: Lattice QCD
(source: V. Lubicz, CSN4 talk, September 2009)
A case study: Lattice QCD
● Most valuable product is the gauge configuration
  ● Different types: Nf, schemes
  ● Different sizes
● A grid-enabled community (www.ildg.org)
  ● Production sites
  ● Storage sites
  ● Analysis sites
● Gauge configuration production really expensive!!!
HPC in INFN
● Focus on compute-intensive Physics (excluding LHC stuff): LQCD, Astro, Nuclear, Medical
● Needs for 2010-2015:
  ● ~0.01-1 Pflops for a single research group
  ● ~0.1-10 Pflops nationwide
● Translates to:
  ● Big infrastructure (cooling, power, …)
  ● High procurement costs (€/Gflops)
  ● High maintenance costs (W/Gflops)
LQCD on GPU?
● Story begins with video games (Egri, Fodor et al., 2006)
● Wilson-Dirac operator at 120 Gflops (K. Ogawa, 2009)
● Domain Wall fermions (Tsukuba/Taiwan, 2009)
● Definitive work: the QUDA library (M. A. Clark et al., 2009):
  o double, single and half precision
  o half-precision solver with reliable updates: > 100 Gflops
  o MIT/X11 Open Source License
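A minimal toy sketch of the mixed-precision "reliable update" idea behind QUDA's half-precision solvers: iterate cheaply in low precision, but periodically accumulate the correction and recompute the true residual in double precision so that rounding errors cannot build up. The diagonal operator and the plain Richardson inner iteration below are stand-ins, not QUDA's Wilson-Dirac operator or its CG/BiCGstab solvers.

```c
/* Toy mixed-precision solve with "reliable updates":
 * inner cycle in float, true residual recomputed in double. */
#include <stdio.h>
#include <math.h>

#define N 1024

/* toy high-precision operator: y = A x with A = diag(1 + i/1000) */
static void apply_A_double(const double *x, double *y) {
    for (int i = 0; i < N; i++) y[i] = (1.0 + i * 1e-3) * x[i];
}
/* the same operator in low (float) precision */
static void apply_A_float(const float *x, float *y) {
    for (int i = 0; i < N; i++) y[i] = (1.0f + i * 1e-3f) * x[i];
}
static double norm2(const double *v) {
    double s = 0.0;
    for (int i = 0; i < N; i++) s += v[i] * v[i];
    return sqrt(s);
}

int main(void) {
    static double b[N], x[N], r[N], Ax[N];
    static float  rf[N], df[N], tf[N];

    for (int i = 0; i < N; i++) { b[i] = 1.0; x[i] = 0.0; r[i] = b[i]; }
    const double tol = 1e-10 * norm2(b);

    for (int outer = 0; outer < 100; outer++) {
        /* low-precision inner cycle: solve A d = r approximately in float */
        for (int i = 0; i < N; i++) { rf[i] = (float)r[i]; df[i] = 0.0f; }
        for (int k = 0; k < 20; k++) {              /* Richardson sweeps */
            apply_A_float(df, tf);
            for (int i = 0; i < N; i++) df[i] += 0.4f * (rf[i] - tf[i]);
        }
        /* reliable update: accumulate the correction and recompute the
           true residual entirely in double precision */
        for (int i = 0; i < N; i++) x[i] += (double)df[i];
        apply_A_double(x, Ax);
        for (int i = 0; i < N; i++) r[i] = b[i] - Ax[i];

        const double rn = norm2(r);
        printf("outer %2d  |r| = %.3e\n", outer, rn);
        if (rn < tol) break;
    }
    return 0;
}
```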
INFN on GPUs
§ 2D spin models (Di Renzo et al., 2008)
§ LQCD staggered fermions on Chroma (Cossu, D'Elia et al., Ge+Pi, 2009)
§ Bio-computing on GPU (Salina, Rossi et al., ToV, 2010?)
§ Gravitational wave analysis (Bosi, Pg, 2010?)
§ Geant4 on GPU (Caccia, Rm, 2010?)
How many GPUs?
Raw estimate of the memory footprint:
• Full solver in GPU
• Gauge field + 15 fermion fields
• No symmetry tricks
• No half-precision tricks

Lattice size | Single prec. memory (GiB) | Double prec. memory (GiB) | # GTX 280 | # Tesla C1060 | # Tesla C2070
24³×48       |   1                       |   2.1                     |   3       |   1           |   1
32³×64       |   3.3                     |   6.7                     |   4-8     |   2           |   1-2
48³×96       |  17                       |  34                       |  17-35    |   5-9         |   3-6
64³×128      |  54                       | 108                       |  55-110   |  14-28        |   9-18
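The memory columns of the table can be reproduced with a back-of-the-envelope count, assuming one gauge field stored as 4 full SU(3) links per site (4 × 9 complex = 72 reals) plus 15 full spinor fields (4 spin × 3 color complex = 24 reals each), with no compression or even/odd tricks:

```c
/* Back-of-the-envelope check of the memory table above. */
#include <stdio.h>

int main(void) {
    const int L[4][4] = { {24,24,24,48}, {32,32,32,64},
                          {48,48,48,96}, {64,64,64,128} };
    /* gauge field (72 reals/site) + 15 fermion fields (24 reals/site each) */
    const double reals_per_site = 4*9*2 + 15*4*3*2;

    for (int i = 0; i < 4; i++) {
        double sites = (double)L[i][0] * L[i][1] * L[i][2] * L[i][3];
        double single_gib = sites * reals_per_site * 4 / (1024.0*1024*1024);
        double double_gib = sites * reals_per_site * 8 / (1024.0*1024*1024);
        printf("%2dx%2dx%2dx%3d : %6.1f GiB (single)  %6.1f GiB (double)\n",
               L[i][0], L[i][1], L[i][2], L[i][3], single_gib, double_gib);
    }
    return 0;
}
```

Dividing these totals by the card memory (1 GiB on the GTX 280, 4 GiB on the Tesla C1060, 6 GiB on the Tesla C2070) reproduces the GPU counts, with the ranges spanning single- to double-precision storage.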
If one GPU is not enough
Multi-GPU, the Fastra II* approach:
● Stick 13 GPUs together
● 12 TFLOPS @ 2 kW
● CPU threads feed GPU kernels (see the sketch below)
● Embarrassingly parallel → great!!!
● Full problem fits → good!
● Enjoy the warm weather
* University of Antwerp, Belgium
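A sketch of the "one host CPU thread per GPU" pattern of the Fastra II approach: each pthread binds to its own device with cudaSetDevice() (as the CUDA runtime of the time requires) and drives an independent, embarrassingly parallel job. Only standard CUDA runtime C API calls are used; the actual kernel launch is indicated by a comment.

```c
/* One host thread per GPU; compile with nvcc or gcc -lcudart -lpthread. */
#include <stdio.h>
#include <pthread.h>
#include <cuda_runtime_api.h>

static void *worker(void *arg) {
    int dev = (int)(long)arg;
    cudaSetDevice(dev);                      /* bind this thread to one GPU  */

    float *d_buf = NULL;
    size_t bytes = 256 << 20;                /* 256 MiB of per-GPU work data */
    if (cudaMalloc((void **)&d_buf, bytes) != cudaSuccess) {
        fprintf(stderr, "GPU %d: allocation failed\n", dev);
        return NULL;
    }
    /* ... copy input with cudaMemcpy(), launch this GPU's kernel on its
     *     independent share of the work, copy the result back ...        */
    cudaFree(d_buf);
    printf("GPU %d done\n", dev);
    return NULL;
}

int main(void) {
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);

    pthread_t tid[16];
    for (int i = 0; i < ngpu && i < 16; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(long)i);
    for (int i = 0; i < ngpu && i < 16; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```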
Multi-GPUs need scaling!
Seems easy:
1. 1-2-4 GPUs in a 1-2U system (or buy a Tesla M1060)
2. Stack many
3. Add an interconnect (IB, Myrinet 10G, custom) & plug accurately :)
4. Simply write your program in C+MPI+CUDA/OpenCL(+threads) (see the sketch below)
(Diagram: three layers to manage: multi-node parallelism / multi-GPU mgmt / single-GPU kernel)
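A sketch of step 4 above, C+MPI+CUDA with one MPI rank per GPU: the halo (boundary data) is staged through host memory on its way to the network, which is exactly the GPU → network latency the conclusions warn about. The buffer size and the 1-D ring decomposition are illustrative only, not a real LQCD domain decomposition.

```c
/* One MPI rank per GPU; halo staged through host memory before MPI. */
#include <mpi.h>
#include <stdlib.h>
#include <cuda_runtime_api.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int ngpu = 1;
    cudaGetDeviceCount(&ngpu);
    cudaSetDevice(rank % ngpu);              /* one rank drives one GPU       */

    const int halo = 1 << 20;                /* toy halo size: 1 M floats     */
    float *d_send, *d_recv;
    cudaMalloc((void **)&d_send, halo * sizeof(float));
    cudaMalloc((void **)&d_recv, halo * sizeof(float));
    float *h_send = malloc(halo * sizeof(float));
    float *h_recv = malloc(halo * sizeof(float));

    int up = (rank + 1) % size, down = (rank + size - 1) % size;

    /* the costly path: GPU -> host -> network -> host -> GPU */
    cudaMemcpy(h_send, d_send, halo * sizeof(float), cudaMemcpyDeviceToHost);
    MPI_Sendrecv(h_send, halo, MPI_FLOAT, up,   0,
                 h_recv, halo, MPI_FLOAT, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(d_recv, h_recv, halo * sizeof(float), cudaMemcpyHostToDevice);

    /* ... launch the local kernel on the now-complete field ... */

    cudaFree(d_send); cudaFree(d_recv);
    free(h_send); free(h_recv);
    MPI_Finalize();
    return 0;
}
```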
Some near-term solutions for LQCD
Two INFN-approved projects:
● QUonG: cluster of GPUs with the custom 3D torus network APEnet+ (talk by R. Ammendola)
● Aurora: dual Xeon 5500 custom blade with IB & a 3D first-neighbor network
INFN assets
● 20 years of experience in high-speed 3D torus interconnects (APE100, APEmille, apeNEXT, APEnet)
● 20 years writing parallel codes
● Control over HW architecture vs. algorithms
Wish list for multi-GPU computing
Open the GPU to the world:
• Provide APIs to hook inside your drivers
• Allow PCIe-to-PCIe DMAs, or better…
• …add some high-speed data I/O port toward an external device (FPGA, custom ASIC)
• Promote the GPU from simple accelerator to main computing engine status!!
(Diagram: GPU and its DRAM attached to main memory over PCI Express)
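A purely hypothetical illustration of the wish list: neither gpu_expose_buffer() nor apenet_send_from_bus_addr() exists in any driver or in the APEnet+ software stack; the made-up names only show what a PCIe-to-PCIe path from GPU memory to the network device could look like, compared with the host-staged exchange sketched earlier.

```c
/* Hypothetical direct GPU -> network path; both functions are stubs. */
#include <stddef.h>
#include <stdio.h>

/* hypothetical vendor hook: pin a GPU buffer and return its PCIe bus
 * address so another PCIe device can target it directly (stubbed out) */
static int gpu_expose_buffer(void *d_buf, size_t bytes,
                             unsigned long long *pci_bus_addr) {
    (void)d_buf; (void)bytes; (void)pci_bus_addr;
    return -1;                        /* no such hook exists today */
}

/* hypothetical APEnet+ call: DMA 'bytes' straight from that bus address
 * into the 3D torus, with no host copy in between (stubbed out) */
static int apenet_send_from_bus_addr(int dest_node,
                                     unsigned long long pci_bus_addr,
                                     size_t bytes) {
    (void)dest_node; (void)pci_bus_addr; (void)bytes;
    return -1;
}

/* what a halo send could look like if the wish list were granted */
void send_halo_direct(void *d_halo, size_t bytes, int neighbour) {
    unsigned long long bus_addr;
    if (gpu_expose_buffer(d_halo, bytes, &bus_addr) == 0 &&
        apenet_send_from_bus_addr(neighbour, bus_addr, bytes) == 0)
        return;                       /* direct PCIe-to-PCIe transfer */
    fprintf(stderr, "falling back to host-memory staging\n");
}
```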
In conclusion
● GPUs are good at small scales
● Scaling from single GPU to multi-GPU to multi-node, the hierarchy deepens
● Programming complexity increases
● Watch the GPU → network latency
● Please, help us link your GPU to our 3D network!!!
Game over :)