- Number of slides: 16
Monitoring BOF
James Casey, CERN IT-GD
WLCG Workshop, 1st September 2007
Welcome
- Scope of session
  - Grid service monitoring
    - from the viewpoint of a site and a VO
  - Service availability calculation
- Focus of session
  - Progress since last workshop
  - Multi-infrastructure issues
- Out of scope (at least not prepared :)
  - Accounting, information system, auditing
Progress since last workshop
- Focus in January was on the newly created WLCG Monitoring WGs
- Highlights
  - System Management group
    - http://www.sysadmin.hep.ac.uk/
  - System Analysis group
    - New tools for experiments (dashboards)
    - Working on getting usage patterns and requirements from the experiments
  - Grid Service Monitoring
    - Worked on system architecture
    - Created probe and exchange specifications
    - Created first prototype of site monitoring
Grid Service Monitoring WG
- Very active participation from many groups
  - SRCE (Emir Imamagic) contributed a Nagios-based prototype
    - Based on work done for EGEE CE ROC
  - OSG provided significant input on probe specifications (Arvind Gopu, Rob Quick)
  - EDS Openlab collaboration (Max Böhm) has worked on architecture and analysis
  - GridICE, Gridview, SAM, R-GMA teams were regular contributors at phone conferences
“The Nagios-based Prototype”
- Simple monitoring of grid services based on
  - Currently available remote data (SAM, Network)
  - Existing probes from EGEE CE region
  - New probes written according to specifications provided by component developers
- Initially implemented using one fabric monitoring system: Nagios
  - … but architecture checked with LEMON developers
- OSG actively involved in design process
  - Parallel work done using Gratia for data collection
  - Same probes can be used in both systems
- Some simple plotting using Ganglia
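To make the idea of a probe concrete, here is a minimal sketch of one written against the standard Nagios plugin convention (exit codes 0 to 3, one status line, optional performance data after `|`). It is not taken from the WG's probe specification: the names `run_probe` and `tcp_connect`, the thresholds, and the host and port in the usage comment are assumptions made for illustration.

```python
import socket
import time

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def tcp_connect(host, port, timeout_s=5.0):
    """Hypothetical check: succeeds if a TCP connection can be opened."""
    with socket.create_connection((host, port), timeout=timeout_s):
        pass


def run_probe(name, check, warn_s=1.0, crit_s=5.0):
    """Run `check` and map the outcome to (exit_code, status_line).

    An exception means CRITICAL; otherwise the elapsed time is compared
    against the warning and critical thresholds (assumed values).
    """
    start = time.monotonic()
    try:
        check()
    except Exception as exc:
        return CRITICAL, f"{name} CRITICAL - {exc}"
    elapsed = time.monotonic() - start
    if elapsed >= crit_s:
        status = CRITICAL
    elif elapsed >= warn_s:
        status = WARNING
    else:
        status = OK
    # Text after '|' is performance data that a plotting tool could graph.
    return status, f"{name} {LABELS[status]} - {elapsed:.3f}s | time={elapsed:.3f}s"


# Usage as a plugin (host and port are placeholders):
#   code, line = run_probe("SRM", lambda: tcp_connect("se.example.org", 8443))
#   print(line); sys.exit(code)
```

Separating the check from the status mapping is one way the same probe logic could be reused by more than one collection system, since only the thin wrapper that prints and exits is specific to Nagios.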
Nagios Display
Ganglia display
Prototype delivery timescale
- Stage I – ‘gather_sam’ (DONE)
  - Operations Workshop, mid-June 2007
- Stage II – ‘check_wlcg’ (DONE)
  - End mid-July 2007
- Stage III – Local probes (end September)
  - CHEP, September 2007?
- Expect to rapidly iterate, so perhaps only a few “early adopter” sites in June/July
  - CERN PPS, PIC, NIKHEF, SRCE, … (?)
  - Will ask for volunteers at Operations meeting
Futures
- Prototype deployed more widely
  - Probably as part of gLite release in ~1-2 months
- Added sensors running on the actual service nodes
  - Checking logs, daemon status, …
- Integrate OSG, EGEE, (NDGF) data in a single SAM/Gridview display
  - Also some new visualisation tools aimed at giving a better “view of the grid”
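A sensor running on the service node itself could be as simple as a log scan that reports a Nagios-style status. The sketch below is only an illustration of that idea: the function name `check_log` and both regular expressions are assumptions, and real services would need patterns matched to their own log formats.

```python
import re

# Subset of the Nagios plugin exit codes used here.
OK, WARNING, CRITICAL = 0, 1, 2


def check_log(lines, warn_re=r"\bWARN(ING)?\b", crit_re=r"\b(ERROR|FATAL)\b"):
    """Scan log lines and return (exit_code, message).

    Each line is counted once: as an error if it matches `crit_re`,
    otherwise as a warning if it matches `warn_re`.
    """
    warn = re.compile(warn_re)
    crit = re.compile(crit_re)
    n_warn = n_crit = 0
    for line in lines:
        if crit.search(line):
            n_crit += 1
        elif warn.search(line):
            n_warn += 1
    if n_crit:
        return CRITICAL, f"CRITICAL - {n_crit} error line(s), {n_warn} warning line(s)"
    if n_warn:
        return WARNING, f"WARNING - {n_warn} warning line(s)"
    return OK, "OK - no problems found"
```

Because the function takes any iterable of lines, it can be fed an open file, the tail of a log, or test data.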
- Discussion
  - Progress of WLCG Monitoring WG since last workshop
  - Demonstration of Nagios-based prototype
  - SAM availability calculation, including equivalence of components across multiple grid infrastructures
  - Site-local vs. central tests: what is a good balance?
  - Various job submission methods and job monitoring, including monitoring of jobs submitted via condor_g
WLCG Monitoring WG
WLCG Grid Monitoring Landscape
- Domain: Grid Applications
  - Monitoring: Application monitoring
  - Tools in use: Experiment Dashboards, ...
- Domain: Grid Middleware (central services, site services)
  - Monitoring: Grid Services monitoring
  - Tools in use: GStat, SAM/GridView, GridICE, GridPP Real Time Monitor, ...
- Domain: local resources (site)
  - Monitoring: Local monitoring
  - Tools in use: Lemon/SLS, Nagios, Ganglia, ...
3 WLCG Monitoring Working Groups
Slide by Max Böhm, EDS
Aims of Grid Services WG
- Create set of ‘standard’ WLCG probes
  - And define how to calculate availability based on the metrics produced
- Improve quality by providing technical guidance
  - Documenting best practices
  - Providing example components
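As an illustration of how availability might be derived from probe metrics, the sketch below uses a simple assumed rule, not one confirmed by these slides: a service instance is up when all of its critical tests pass, a site is up when every service type has at least one instance up, and availability is the fraction of sampling intervals in which the site was up. All function names and test names are hypothetical.

```python
def instance_ok(results, critical_tests):
    """An instance is up only if every critical test reported 'ok'."""
    return all(results.get(test) == "ok" for test in critical_tests)


def site_ok(instances_by_type):
    """AND across service types, OR across instances of one type.

    A service type with no instances counts as down.
    """
    return all(any(states) for states in instances_by_type.values())


def availability(samples):
    """Fraction of sampling intervals in which the site was up."""
    if not samples:
        raise ValueError("no samples")
    return sum(1 for up in samples if up) / len(samples)


# Example with hypothetical test names: two CEs (one failing) and one SE.
ce_tests = ["job-submit", "ca-certs"]
interval = {
    "CE": [
        instance_ok({"job-submit": "ok", "ca-certs": "ok"}, ce_tests),
        instance_ok({"job-submit": "error", "ca-certs": "ok"}, ce_tests),
    ],
    "SE": [True],
}
site_up_now = site_ok(interval)  # True: one CE and the SE are up
```

Under this rule, comparing sites across infrastructures would first require mapping each infrastructure's service types onto the same keys (here `"CE"` and `"SE"`), which is where the equivalence of components matters.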
Direction
- Focus on the interaction points between the different systems
- “Specifications, not Standards”
  - Timescales mean we can’t get involved in long and heavyweight standards activities
  - Take best practices from existing systems, and document them
- Get something out to the stakeholders
  - Close feedback loop is the key to adoption
  - Plan for a “standards based” solution in the future
High-level Model
See https://twiki.cern.ch/twiki/pub/LCG/GridServiceMonitoringInfo/0702 WLCG_Monitoring_for_Managers.pdf for details
Example Site Component View