SKA - The world's largest Radio Telescope streaming data processor • Dr Paul Calleja, Director, Cambridge HPC Service Cape Town 2013
Overview • Introduction to Cambridge HPCS • Overview of the SKA project • SKA streaming data processing challenge • The SKA SDP consortium Cape Town 2013
Cambridge University • The University of Cambridge is a world-leading teaching & research institution, consistently ranked within the top 3 universities worldwide • Annual income of £1,200 M - 40% is research related - one of the largest R&D budgets within the UK HE sector • 17,000 students, 9,000 staff • Cambridge is a major technology centre – 1,535 technology companies in surrounding science parks – £12 B annual revenue – 53,000 staff • The HPCS has a mandate to provide HPC services to both the University and the wider technology company community Cape Town 2013
Four domains of activity • Driving Discovery (Cambridge HPC Service) • Promoting uptake of HPC by UK Industry (Industrial HPC Service) • HPC R&D (Dell HPC Solution Centre) • Advancing development and application of HPC (Commodity HPC Centre of Excellence) Cape Town 2013
Cambridge HPC vital statistics • 750 registered users from 31 departments • 856 Dell servers - 450 TF sustained DP performance • 128-node Westmere cluster (1,536 cores, 16 TF) • 600-node (9,600-core) full non-blocking Mellanox FDR IB 2.6 GHz Sandy Bridge cluster (200 TF) - one of the fastest Intel clusters in the UK • SKA GPU test bed - 128 nodes with 256 NVIDIA K20 GPUs • Fastest GPU system in the UK - 250 TF • Designed for maximum I/O throughput and message rate • Full non-blocking dual-rail Mellanox FDR Connect-IB • Designed for maximum energy efficiency • #2 in the Green 500 • Most efficient air-cooled supercomputer in the world • 4 PB storage - Lustre parallel file system, 50 GB/s • Run as a cost centre - charges our users - 20% of income from industry Cape Town 2013
CORE – Industrial HPC service & consultancy Cape Town 2013
Dell | Cambridge HPC Solution Centre • The Solution Centre is a Dell and Cambridge jointly funded HPC centre of excellence, providing leading-edge commodity open-source HPC solutions. Cape Town 2013
SA CHPC collaboration • HPCS has a long-term strategic partnership with CHPC • HPCS has been working closely with CHPC for the last 6 years • Technology strategy, system design and procurement • HPC system stack development • SKA platform development Cape Town 2013
Square Kilometre Array - SKA • Next-generation radio telescope • Large multi-national project • 100× more sensitive • 1,000,000× faster • 5 square km of dish over 3,000 km • The next big science project • Currently the world's most ambitious IT project • First real exascale-ready application • Largest global big-data challenge Cape Town 2013
SKA location • A continental-sized radio telescope • Needs a radio-quiet site • Very low population density • Large amount of space • Two sites: Western Australia and the Karoo Desert, RSA Cape Town 2013
SKA phase 1 implementation • SKA1_Mid (incl. MeerKAT): Dish Array, located in RSA • SKA1_Low: Low Frequency Aperture Array, located in ANZ • SKA1_AIP_Survey (incl. ASKAP): Survey Instrument, located in ANZ Cape Town 2013
SKA phase 2 implementation • SKA2_Low: Low Frequency Aperture Array, located in ANZ • SKA2_Mid_Dish: Mid Frequency Dish Array, located in RSA • SKA2_Mid_AA: Mid Frequency Aperture Array, located in RSA Cape Town 2013
What is radio astronomy? (signal-chain figure) • Astronomical signal (EM wave) from the sky → Detect & amplify → Digitise & delay → Correlate → Integrate → Process (calibrate, grid, FFT) → Sky image Cape Town 2013
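To make the final grid-and-FFT steps of the chain above concrete, here is a minimal, self-contained Python/NumPy sketch that grids synthetic visibilities onto a UV plane and Fourier-inverts them into a dirty image. The grid size, visibility count and nearest-cell gridding are illustrative assumptions, not SKA parameters; real imagers use calibration and convolutional gridding.

```python
# Minimal illustrative sketch of imaging from visibilities (not SKA code):
# accumulate synthetic visibilities onto a regular UV grid, then inverse-FFT
# the grid to form a "dirty" sky image.
import numpy as np

N = 256                                       # UV grid / image size in pixels (assumed)
rng = np.random.default_rng(0)

# Synthetic visibility samples: integer (u, v) grid coordinates plus complex values.
u = rng.integers(-N // 4, N // 4, size=1000)
v = rng.integers(-N // 4, N // 4, size=1000)
vis = np.exp(2j * np.pi * rng.random(1000))   # unit-amplitude complex visibilities

# "Gridding" by nearest-cell accumulation (real pipelines use convolutional gridding).
uv_grid = np.zeros((N, N), dtype=complex)
np.add.at(uv_grid, (u % N, v % N), vis)

# The inverse 2-D FFT of the gridded visibilities gives the dirty image.
dirty_image = np.fft.fftshift(np.abs(np.fft.ifft2(uv_grid)))
print(dirty_image.shape, dirty_image.max())
```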
SKA – Key scientific drivers • Evolution of galaxies • Exploring the dark ages • Pulsar surveys and gravitational waves • Cosmic magnetism • Cradle of life Cape Town 2013
SKA is a cosmic time machine Cape Town 2013
But… most importantly, the SKA will investigate phenomena we have not even imagined yet Cape Town 2013
SKA timeline • 1995-2000: Preliminary ideas and R&D • 2000-2007: Initial concepts stage • 2008-2012: System design and refinement of specification • 2012: Site selection • 2012-2016: Pre-construction - 1 yr detailed design, 3 yr production readiness (€90 M PEP) • 2017-2022: 10% SKA construction, SKA1 (€650 M) • 2022: Operations SKA1 • 2023-2027: Construction of full SKA, SKA2 (€2 B) • 2024: Operations SKA2 Cape Town 2013
SKA project structure • SKA Board, advised by Advisory Committees (Science, Engineering, Finance, Funding …) • Director General and Project Office (OSKAO) • Work Package Consortium 1 … Work Package Consortium n - locally funded Cape Town 2013
Work package breakdown • 1. System (SPO) • 2. Science • 3. Maintenance and support / Operations Plan • 4. Site preparation • 5. Dishes • 6. Aperture arrays • 7. Signal transport • 8. Data networks • 9. Signal processing • 10. Science Data Processor: UK (lead), AU (CSIRO…), NL (ASTRON…), South Africa SKA, Industry (Intel, IBM…) • 11. Monitor and Control • 12. Power Cape Town 2013
SKA = Streaming data processor challenge • The SDP consortium is led by Paul Alexander, University of Cambridge • The 3-year design phase has now started (as of November 2013) • To deliver the SKA ICT infrastructure we need a strong multi-disciplinary team: • Radio astronomy expertise • HPC expertise (scalable software implementations; management) • HPC hardware (heterogeneous processors; interconnects; storage) • Delivery of data to users (cloud; UI …) • Building a broad global consortium: • 11 countries: UK, USA, AUS, NZ, Canada, NL, Germany, China, France, Spain, South Korea • Radio astronomy observatories; HPC centres; multi-national ICT companies; sub-contractors Cape Town 2013
SDP consortium members (management groupings) • University of Cambridge (Astrophysics & HPCS) • Netherlands Institute for Radio Astronomy • International Centre for Radio Astronomy Research • SKA South Africa / CHPC • STFC Laboratories • Non-Imaging Processing Team: University of Manchester, Max-Planck-Institut für Radioastronomie, University of Oxford (Physics) • University of Oxford (OeRC) • Chinese Universities Collaboration • New Zealand Universities Collaboration • Canadian Collaboration • Forschungszentrum Jülich • Centre for High Performance Computing South Africa • iVEC Australia (Pawsey) • Centro Nacional de Supercomputación • Fundación Centro de Supercomputación de Castilla y León • Instituto de Telecomunicações • University of Southampton • University College London • University of Melbourne • French Universities Collaboration • Universidad de Chile • Workshare (%): 9.15, 9.25, 8.35, 8.15, 4.05, 6.95, 4.85, 5.85, 3.55, 13.65, 2.95, 3.95, 1.85, 2.25, 1.85, 3.95, 2.35, 1.85 Cape Town 2013
SDP - strong industrial partnership • Discussions under way with: • Dell, NVIDIA, Intel, HP, IBM, SGI, ARM, Microsoft Research • Xyratex, Mellanox, Cray, DDN • NAG, Cambridge Consultants, Parallel Scientific • Amazon, Bull, AMD, Altera, Solarflare, Geomerics, Samsung, CISCO • Apologies to those I've forgotten to list Cape Town 2013
SDP work packages Cape Town 2013
SKA data rates (figure) • Headline link rates across the signal chain range from 20 Gb/s up to 4 Pb/s (figures shown: 16 Tb/s, 4 Pb/s, 1000 Tb/s, 20 Gb/s, 24 Tb/s, 20 Gb/s) Cape Town 2013
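As a rough illustration of what link rates of this order imply, the hedged sketch below converts a few of the Tb/s figures from the slide into daily data volumes; the mapping of numbers to specific links is not reconstructed here, only the unit arithmetic.

```python
# Back-of-envelope conversion of headline rates (Tb/s) into daily volumes (PB/day).
SECONDS_PER_DAY = 86_400

def pb_per_day(rate_tbps: float) -> float:
    """Terabits per second -> petabytes per day."""
    return rate_tbps * SECONDS_PER_DAY / 8 / 1000

for rate in (16, 24, 1000):                       # Tb/s figures quoted on the slide
    print(f"{rate:>5} Tb/s ≈ {pb_per_day(rate):>8,.0f} PB/day")
# 1000 Tb/s is roughly 10,800 PB per day -- far beyond anything that can be stored,
# which is why the SDP must reduce data on the fly.
```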
SKA conceptual data flow Cape Town 2013
Science data processor pipeline (figure) • Incoming data from the collectors pass through corner turning, coarse delays and the correlator/beamformer • The imaging path runs visibility steering, UV processors and buffer store, then gridding, FFT imaging and image storage • The non-imaging path runs beamforming/de-dispersion, a time-series observation buffer, searching and search analysis, with object/timing storage • Indicative SKA1/SKA2 figures span data rates of ~10-1,000 Tb/s, buffer stores of ~50 PB, HPC science processing of ~10 Pflop up to ~1-10 Eflop, and archive growth of ~1-10 EB/y • Software complexity grows along the pipeline Cape Town 2013
SDP processing rack - feasibility model (figure) • A 42U rack holds 20 processing blades plus two 56 Gb/s leaf switches connecting to the rack switches • Each processing blade pairs a capable host processor (multi-core x86, e.g. dual Xeon, programmable, significant RAM) with many-core accelerators (GPGPU, MIC, …?) delivering >10 TFLOP/s each, four disks of ≥1 TB on the PCI bus, and dual 56 Gb/s links • Blade specification: 20 TFlop, 2× 56 Gb/s comms, 4 TB storage, <1 kW power Cape Town 2013
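A small sketch, using only the per-blade figures quoted in this feasibility model (20 TFlop, 2×56 Gb/s, 4 TB, <1 kW) and the 20 blades shown per 42U rack, to show the rack-level totals they imply.

```python
# Rack-level aggregates implied by the blade feasibility figures above.
BLADES_PER_RACK = 20          # processing blades 1-20 in the 42U rack
BLADE_TFLOP = 20
BLADE_STORAGE_TB = 4
BLADE_POWER_KW = 1.0          # upper bound ("<1 kW") from the slide
BLADE_COMMS_GBPS = 2 * 56     # dual 56 Gb/s links

print("Per-rack compute :", BLADES_PER_RACK * BLADE_TFLOP, "TFlop")      # 400 TFlop
print("Per-rack storage :", BLADES_PER_RACK * BLADE_STORAGE_TB, "TB")    # 80 TB
print("Per-rack power   : <", BLADES_PER_RACK * BLADE_POWER_KW, "kW")    # < 20 kW
print("Per-rack comms   :", BLADES_PER_RACK * BLADE_COMMS_GBPS, "Gb/s")  # 2240 Gb/s
```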
SKA feasibility model (figure) • Three AA-low data streams (280 links each) and 250 dishes feed corner-turner switches over 56 Gb/s links • These feed the correlator / UV processors and, via a switch, the bulk store and N imaging processors in the HPC facility, with further UV processors downstream Cape Town 2013
SKA conceptual software stack Cape Town 2013
SKA Open Architecture Lab • HPC development and prototyping lab for SKA • Coordinated out of Cambridge and run jointly by HPCS and CHPC • Will work closely with COMP to test and design various potential compute, networking, storage and HPC system / application software components • Rigorous system engineering approach, which describes a formalised design and prototyping loop • Provides a managed, global lab for the whole of the SDP consortium • Provides a touchstone and practical place of work for interaction with vendors • The first major test bed, a Dell / Mellanox / NVIDIA GPU cluster, was deployed in the lab last month and will be used by the consortium to drive design R&D Cape Town 2013
SKA Exascale computing in the desert • The SKA SDP compute facility will, at the time of deployment, be one of the largest HPC systems in existence • Operational management of large HPC systems is challenging at the best of times - even when HPC systems are housed in well-established research centres with good IT logistics and experienced Linux HPC staff • The SKA SDP could be housed in a desert location with little surrounding IT infrastructure, poor IT logistics and little prior HPC history at the site • Potential SKA SDP exascale systems are likely to consist of ~100,000 nodes, occupy 800 cabinets and consume 30 MW (a quick consistency check follows below) - around five times the size of today's largest supercomputer, the Cray Titan at Oak Ridge National Laboratory • SKA SDP HPC operations will be very challenging Cape Town 2013
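A quick consistency check of the quoted system shape (100,000 nodes, 800 cabinets, 30 MW); the per-cabinet and per-node figures below are simple divisions, not design targets.

```python
# Sanity check of the quoted exascale system shape.
nodes, cabinets, power_mw = 100_000, 800, 30

print("Nodes per cabinet:", nodes / cabinets)              # 125 nodes per cabinet
print("Power per node   :", power_mw * 1e6 / nodes, "W")   # 300 W per node on average
```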
The challenge is tractable • Although the operational aspects of the SKA SDP exascale facility are challenging, they are tractable if dealt with systematically and in collaboration with the HPC community. Cape Town 2013
SKA HPC operations – functional elements • We can describe the operational aspects by functional element: • Machine room requirements ** • SDP data connectivity requirements • SDP workflow requirements • System service level requirements • System management software requirements ** • Commissioning & acceptance test procedures • System administration procedures • User access procedures • Security procedures • Maintenance & logistical procedures ** • Refresh procedures • System staffing & training procedures ** Cape Town 2013
Machine room requirements • Machine room infrastructure for exascale HPC facilities is challenging • 800 racks, 1,600 m² • 30 MW IT load • ~40 kW of heat per rack • Cooling efficiency and heat density management is vital • Machine room infrastructure at this scale is both costly and time consuming • The power cost alone at today's prices is £30 M per year (a sanity check follows below) • The desert location presents particular problems for the data centre: • Hot ambient temperature and lack of water - difficult for compressor-less cooling • Very dry air - difficult for humidification • Remote location - difficult for DC maintenance Cape Town 2013
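The £30 M/year power figure is consistent with a tariff of roughly £0.11 per kWh; the tariff is an assumption for this sketch rather than a number from the slide.

```python
# Sanity check of the annual power bill for a 30 MW IT load.
IT_LOAD_MW = 30
HOURS_PER_YEAR = 8_760
PRICE_GBP_PER_KWH = 0.11      # assumed 2013-era tariff, not from the slide

annual_kwh = IT_LOAD_MW * 1_000 * HOURS_PER_YEAR          # 262,800,000 kWh
annual_cost_m = annual_kwh * PRICE_GBP_PER_KWH / 1e6      # ~£29 M per year
print(f"{annual_kwh:,.0f} kWh/year ≈ £{annual_cost_m:.0f} M/year")
```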
System management software • System management software is the vital element in HPC operations • System management software today does not scale to exascale • A worldwide coordinated effort is under way to develop system management software for exascale • Elements of the system management software stack: power management, network management, storage management, workflow management, OS, runtime environment, security management, system resilience, system monitoring, system data analytics, development tools Cape Town 2013
Maintenance logistics • Current HPC technology MTBF for hardware and system software results in failure rates of ~2 nodes per week on a cluster of ~600 nodes • It is expected that SKA exascale systems could contain ~100,000 nodes • Thus expected failure rates of ~300 nodes per week are realistic (see the sketch below) • During system commissioning this will be 3-4× higher • Fixing nodes quickly is vital, otherwise the system will soon degrade into a non-functional state • The manual engineering processes for fault detection and diagnosis on 600 nodes will not scale to 100,000 nodes; this needs to be automated by the system software layer • Vendor hardware replacement logistics need to cope with high turnaround rates Cape Town 2013
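Scaling the observed failure rate linearly from the 600-node case to a 100,000-node system, as the bullet above does, gives the ~300 nodes/week figure; a minimal sketch of that arithmetic:

```python
# Linear scaling of node failure rates from today's clusters to SKA scale.
observed_failures_per_week = 2
observed_nodes = 600
ska_nodes = 100_000

rate_per_node = observed_failures_per_week / observed_nodes       # ~0.0033 failures/node/week
ska_failures = rate_per_node * ska_nodes                          # ~333 failures/week
print(f"Expected steady-state failures: ~{ska_failures:.0f} nodes/week")
print(f"During commissioning (3-4x)   : ~{3*ska_failures:.0f}-{4*ska_failures:.0f} nodes/week")
```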
Staffing levels and training • Providing functional staffing levels and experience at a remote desert location will be challenging • It is hard enough finding good HPC staff to run small-scale HPC systems in Cambridge; finding orders of magnitude more staff to run much more complicated systems in a remote desert location will be very challenging • Operational procedures using a combination of remote system administration staff and DC smart hands will be needed • HPC training programmes need to be implemented to skill up staff well in advance Cape Town 2013
Early Cambridge SKA solution - EDSAC 1 Maurice Wilkes Cape Town 2013