353b1ed6f0cb6361d88a556c1d4808ba.ppt
- Number of slides: 28
Enabling Grids for E-sciencE
Global Grid Operations and their Tools
Hélène Cordier, EGEE/WLCG Operations
IN2P3 Computing Centre, Lyon (France) - helene.cordier@in2p3.fr
NAREGI, Lyon, 04/07/2007
Contents
• Grid Operations Issues
• EGEE/WLCG ways of solving these issues
• Use case and daily operations
• WLCG specifics and current developments in Operations
• Future Work
CPU, countries, sites
[Figure: three pie charts giving the breakdown per Regional Operations Centre (ROC): CPU / ROC, Countries / ROC, Sites / ROC]
• 35000 CPU
• 45 countries (31 partner countries)
• 237 sites (131 partner sites)
Source: Ian Bird - OGF/EGEE User Forum - May 9th 2007
Workload
• 98000 jobs/day
• 13000 jobs/day
Source: Ian Bird - OGF/EGEE User Forum - May 9th 2007
Operating a grid
• Middleware deployment - Availability, Functionality and Knowledge Management
• Management of Pre-production - Functionality
• Security issues - Reliability
• Monitoring sites and services - Availability, then Evaluation
• Accounting - Assessment
• Support for end users and sites - Support
• Interoperability: m/w, operations
• Supervising Production: operations responsibility - metrics - Dependability and Sustainability, putting it all together, communications between various actors - Knowledge Management
Middleware and Certification
• The goal is to produce a middleware distribution that can be deployed widely
• Certification testing includes:
– Installation and configuration
– Component (service) functionality
– System testing (trying to emulate real workloads and stress testing)
[Diagram labels: Middleware providers (OMII-Europe, VDT/OSG, …); Integration; Testing & Certification; Certification activities; Pre-production service; Production service; Support, analysis, debugging; Operations]
Pre-production service
• The pre-production service now spans ~27 sites in 16 countries
• Provides access to some 3000 CPU
– Some sites allow access to their full production batch systems for scale tests
• Sites install and test different configurations and sets of services
• Services may be initially demonstrated in this environment
– Before further development
• New VOs: adapt their applications and gain experience
– (e.g. DILIGENT)
Use of the infrastructure
• >20k jobs running simultaneously
Source: SA1 - Ian Bird - EGEE-II 1st EU Review - 15-16 May 2007
Dashboard concept
[Diagram, two panels]
• MANY ENTRY POINTS: the operator works separately with the sites info, monitoring tool #1, monitoring tool #2, … monitoring tool #n, a mail client and the ticketing system
• SINGLE ENTRY POINT: the operator works with one dashboard, which sits in front of the sites info, monitoring tools #1 to #n, a mail sender and the ticketing system
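The single-entry-point idea can be sketched as a small aggregator. This is only an illustration under stated assumptions: `Alarm`, `Dashboard`, `tool_1`, `tool_2` and the site names are hypothetical and do not come from the Operations Portal code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Alarm:
    site: str      # site the alarm refers to
    source: str    # monitoring tool that raised it
    message: str   # human-readable description


class Dashboard:
    """Hypothetical single entry point in front of several monitoring tools."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[[], List[Alarm]]] = {}

    def register(self, name: str, fetch: Callable[[], List[Alarm]]) -> None:
        # Each tool is represented by a function returning its current alarms.
        self._tools[name] = fetch

    def alarms_by_site(self) -> Dict[str, List[Alarm]]:
        # Merge the alarms of all registered tools into one per-site view.
        grouped: Dict[str, List[Alarm]] = {}
        for fetch in self._tools.values():
            for alarm in fetch():
                grouped.setdefault(alarm.site, []).append(alarm)
        return grouped


# Two stand-in monitoring tools (assumed data, for illustration only).
def tool_1() -> List[Alarm]:
    return [Alarm("SITE-A", "tool #1", "job submission failed")]


def tool_2() -> List[Alarm]:
    return [
        Alarm("SITE-A", "tool #2", "published information inconsistent"),
        Alarm("SITE-B", "tool #2", "storage element unreachable"),
    ]


dashboard = Dashboard()
dashboard.register("tool #1", tool_1)
dashboard.register("tool #2", tool_2)
view = dashboard.alarms_by_site()
```

The operator then reads `view` instead of opening each tool in turn; adding monitoring tool #n is one more `register` call.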
Daily operations
[Diagram: the Operations Portal (cic.gridops.org) brings together, through its integration tools, the CIC DB (information on VOs), the GOC DB (information on sites), GGUS (user support and ticketing system), the monitoring tools and the communication tools (broadcast); the actors shown are User, Site, Operator and Regional Center; the labels "JS" and "IS" also appear in the diagram]
Repository for site information
• Keep a central repository of information on the components of the grid
– Site registry (name, location, contact information, administrator contact, security contact, …)
– Site status (candidate, uncertified, production, suspended, …)
– History of scheduled unavailability of the site
– Grid services operated by the site: computing elements, storage elements, file catalogue services, virtual organization management services, resource brokers, etc.
– Services that sites want to be monitored by the grid operators
• Updating this information is a shared responsibility between the site operator and the federation manager
• WLCG/EGEE
– Central repository of site information (a.k.a. Grid Operations Centre) developed and operated by Rutherford Appleton Laboratory (RAL), UK
– http://goc.grid-support.ac.uk/gridsite/gocdb
• This repository is used by the grid monitoring services (more on this later)
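A minimal sketch of one record in such a repository, following the fields listed on this slide. The class and field names are assumptions made for illustration, not the actual GOC DB schema, and the rule in `expected_available` is likewise an assumed simplification.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Tuple


class SiteStatus(Enum):
    CANDIDATE = "candidate"
    UNCERTIFIED = "uncertified"
    PRODUCTION = "production"
    SUSPENDED = "suspended"


@dataclass
class SiteRecord:
    # Site registry
    name: str
    location: str
    admin_contact: str
    security_contact: str
    # Site status
    status: SiteStatus = SiteStatus.CANDIDATE
    # Grid services operated by the site, and the subset to be monitored
    services: List[str] = field(default_factory=list)
    monitored_services: List[str] = field(default_factory=list)
    # History of scheduled unavailability as (start, end) pairs
    downtimes: List[Tuple[datetime, datetime]] = field(default_factory=list)

    def in_scheduled_downtime(self, when: datetime) -> bool:
        return any(start <= when < end for start, end in self.downtimes)

    def expected_available(self, when: datetime) -> bool:
        # Assumption: only production sites outside a scheduled downtime
        # are expected to be available.
        return (self.status is SiteStatus.PRODUCTION
                and not self.in_scheduled_downtime(when))


# Example record with made-up values.
site = SiteRecord(
    name="EXAMPLE-SITE",
    location="Example City",
    admin_contact="admin@example.org",
    security_contact="security@example.org",
    status=SiteStatus.PRODUCTION,
    services=["computing element", "storage element"],
    monitored_services=["computing element"],
    downtimes=[(datetime(2007, 7, 4, 8), datetime(2007, 7, 4, 12))],
)
```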
Monitoring
• Grid operators need to have a global view of the status of the infrastructure
– Grid information is highly dynamic
• Tools required to collect information on the grid component state
– Availability of resources and services, based on the static information stored in the central site repository
– Collection of metrics on availability of resources and services
• WLCG/EGEE
– Service of probes sent to every site to check it on a regular basis
– Service for regularly testing the consistency of the dynamic information published by the site in the grid information system
– Information on the result of those tests is available to grid operators, site managers and end users
– Virtual Organization managers can use this information to select a set of sites they intend to use
– Monitoring services developed and operated by CERN, Academia Sinica (Taiwan), GridPP (UK) and INFN (Italy)
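As an illustration of collecting an availability metric from regular probes, the sketch below computes the fraction of passed probes for one site. It is an assumed simplification, not the algorithm used by the WLCG/EGEE monitoring services; the function name and the convention that `None` marks a probe skipped during scheduled downtime are hypothetical.

```python
from typing import Iterable, Optional


def availability(results: Iterable[Optional[bool]]) -> Optional[float]:
    """Fraction of probes that passed.

    True = probe passed, False = probe failed,
    None = probe not counted (for example, scheduled downtime).
    Returns None when no probe can be counted.
    """
    counted = [r for r in results if r is not None]
    if not counted:
        return None
    return sum(counted) / len(counted)


# Five probe slots for one site: three passes, one failure, one skipped.
site_availability = availability([True, True, False, None, True])
```

Here three of the four counted probes passed, so `site_availability` is 0.75.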
Tickets workflow
[Diagram: problems are detected and reported by end users or by the operator on duty; tickets pass between the Operations Portal dashboard (IN2P3-CC, Lyon, France) and GGUS (FZK, Karlsruhe, Germany) through WSDL interfaces; ticket follow-up is done by the Regional Support Units (UK, FR, GER, IT, …)]
Operations – Global Grid User Support
Tracking incidents
• Incident tracking model
– Unique channel for opening tickets
  End users: e.g. job submission failures, failed data transfers
  Operators: e.g. job submission failures
– Classification and first assignment done by the ticket process manager
– Tickets are assigned to support units, one per domain of expertise
  Grid operators, applications, federations, m/w experts, …
• WLCG/EGEE
– Central incident tracking tool developed/operated by Forschungszentrum Karlsruhe (DE)
  https://gus.fzk.de/
– Same tool used by grid operators and end users
  E-mail and web interface
– Sites failing the tests are assigned a ticket
  Escalation procedure for solving site-related problems
  Involves the regional operator and the site operator
• Interface with ticket handling tools used by sites/federations (if needed)
• Tools for collecting metrics on the responsiveness of support units
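The classification and first-assignment step can be sketched as a lookup from domain of expertise to support unit. The `Ticket` fields, the domain keys and `classify_and_assign` are hypothetical; the slide names the support units but does not describe how GGUS represents them.

```python
from dataclasses import dataclass
from typing import Optional

# One support unit per domain of expertise (domain keys are assumed).
SUPPORT_UNITS = {
    "operations": "Grid operators",
    "application": "Applications",
    "site": "Federations",
    "middleware": "M/W experts",
}


@dataclass
class Ticket:
    ticket_id: int
    submitter: str                     # "end user" or "operator"
    description: str
    domain: Optional[str] = None       # set by the ticket process manager
    assigned_to: Optional[str] = None  # support unit handling the ticket


def classify_and_assign(ticket: Ticket, domain: str) -> Ticket:
    """Record the classification and make the first assignment."""
    if domain not in SUPPORT_UNITS:
        raise ValueError(f"unknown domain: {domain!r}")
    ticket.domain = domain
    ticket.assigned_to = SUPPORT_UNITS[domain]
    return ticket


# Both end users and operators open tickets through the same channel.
ticket = classify_and_assign(
    Ticket(1, "end user", "job submission failure"), "middleware"
)
```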
Putting it all together
• Web portal for integrating all the tools and sources of operations-related information into one single place
• Developed and operated by CC-IN2P3, failover instance at CNAF
– http://cic.gridops.org/
– Provides and maintains an integrated operations dashboard for the grid operator on duty
– Provides mechanisms for keeping the information needed for an appropriate handover between operators on duty
– Easy access to appropriate contact information on every actor involved in the operations of the grid
– Provides communication tools
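The slide mentions a failover instance but not how the switch is made. One possible reading, shown here purely as an assumption, is an ordered fallback: try the primary instance and move to the next one when it cannot be reached. `first_available`, `primary` and `backup` are hypothetical names, and the stand-in functions make no network calls.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def first_available(instances: Sequence[Callable[[], T]]) -> T:
    """Return the result of the first instance that answers."""
    last_error: Optional[Exception] = None
    for call in instances:
        try:
            return call()
        except ConnectionError as exc:
            # Instance unreachable: remember why and try the next one.
            last_error = exc
    raise RuntimeError("no instance available") from last_error


# Stand-ins for a primary and a failover instance.
def primary() -> str:
    raise ConnectionError("primary instance unreachable")


def backup() -> str:
    return "served by failover instance"


answer = first_available([primary, backup])
```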
Alarms Dashboard
Alarm Details
Service Interruptions
Tracking incidents
Opening tickets
GGUS Ticket 1/2
GGUS Ticket 2/2
Operations support model: operators' escalation process
• Monitoring shows a problem
• 1st level support: the operator on duty submits a GGUS ticket against the site's federation and CCs the site (when known)
• 2nd level support: the federation and the site work to resolve the problem
• 3rd level support (experts): if the federation and the site cannot resolve the problem, the Tier 1/ROC contacts the relevant Support Unit for assistance
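The three support levels can be sketched as a small escalation rule. The enum and function names are hypothetical, and the mapping of actors to levels follows one reading of the diagram rather than a documented procedure.

```python
from enum import IntEnum
from typing import Optional


class SupportLevel(IntEnum):
    OPERATOR_ON_DUTY = 1      # submits the GGUS ticket
    FEDERATION_AND_SITE = 2   # work together on the problem
    EXPERT_SUPPORT_UNIT = 3   # contacted by the Tier 1/ROC


def next_level(current: SupportLevel, resolved: bool) -> Optional[SupportLevel]:
    """Return the level that handles the ticket next, or None once resolved."""
    if resolved:
        return None
    if current is SupportLevel.EXPERT_SUPPORT_UNIT:
        # There is no level above the experts: the ticket stays with them.
        return current
    return SupportLevel(current + 1)


# An unresolved ticket moves from the operator to the federation and site.
after_operator = next_level(SupportLevel.OPERATOR_ON_DUTY, resolved=False)
```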
Operations tickets vs. all GGUS tickets
• 25% of all GGUS tickets over almost 2 years
• Average of 200 tickets/month
• ENOC tickets since August 2006
ROC average solution time for GGUS tickets
• ROCs are attentive to operational tickets
Current Work and Summary
• Achieve a real 24x7 production-quality service: failover mechanisms
• Increase the automation of daily monitoring tools and alarm handling
• Achieve a sustainable structure through WLCG production
• Achieve a scalable structure despite a constant increase in the number of sites and the diversity of users
• Diverse monitoring tools are developed throughout the federations because a grid cannot stand on its own; causes of failure are numerous
• Site administrators and end users need to be able to verify that services are available and reliable
Credits and References
• Gstat – http://goc.grid.sinica.edu.tw/gstat/
• GGUS – http://gus.fzk.de/
• GOC-DB – http://goc.grid-support.ac.uk/
• SAM – http://goc.grid.sinica.edu.tw/gocwiki/Service_Availability_Monitoring_Environment – https://WLCG-sam.cern.ch:8443/sam.cgi
• CMS DASHBOARD – http://arda-dashboard.cern.ch/cms
• GridIce – http://grid.infn.it/gridice
• Lavoisier – http://grid.in2p3.fr/lavoisier
• Operations Portal – http://cic.gridops.org
• EGEE – http://www.eu-egee.org
• WLCG – http://www.cern.ch/WLCG
Numerous slides from:
– Ian Bird - OGF/EGEE User Forum - May 9th 2007
– Rob Quick, Workshop on Grid services Monitoring, HPDC'07 – June 27th 2007