Advances in Monitoring of Grid Services in WLCG
James Casey, CERN IT-GD
CHEP, 5th September 2007

Why Monitoring?
o You can’t manage what you don’t measure...
[Diagram labels, in slide order: accuracy and credibility; appropriate metrics (directly relevant to user experience, clearly defined and understood); measurement instrumentation; data collection points (active, passive, collection intervals, alarms); system, element, service; real-time, historical; Sensors/Agents, Transport, Repositories, Views; Monitoring, Presentation; Grid; automated decision making, manual decision making; Manage & Control]
Slide by Max Böhm, EDS

WLCG Grid Monitoring Landscape
Domain: Grid Applications
  Monitoring tools in use (Application monitoring): Experiment Dashboards, ...
Domain: Grid Middleware (central services, site services)
  Monitoring tools in use (Grid Services monitoring): Gstat, SAM/GridView, GridICE, GridPP Real Time Monitor, ...
Domain: local resources (site)
  Monitoring tools in use (Local monitoring): Lemon/SLS, Nagios, Ganglia, ...
3 WLCG Monitoring Working Groups
Slide by Max Böhm, EDS

WLCG Monitoring Working Groups
o 3 groups proposed by Ian Bird to the LCG Management Board, Oct 06
  - Goal: to improve the reliability of the WLCG grid
o System Management: Fabric management, Best Practices, Security, ...
o Grid Services: Grid sensors, Transport, Repositories, Views, ...
o System Analysis: Application monitoring, ...

Grid Services Monitoring WG Mandate
  - “…. to help improve the reliability of the grid infrastructure…. ”
  - “…. provide stakeholders with views of the infrastructure allowing them to understand the current and historical status of the service. …”
  - “… stakeholder are site administrators, grid service managers and operations, VOs, Grid Project management”
https://twiki.cern.ch/twiki/bin/view/LCG/GridServiceMonitoringWGMandate

Aims of grid services WG
o Not to provide yet another complete technical solution
But,
o Improve reliability of WLCG
o Consolidate existing solutions
  - Improve communication
  - Reduce overlap
  - Increase sharing

How?
o Engage with stakeholders
  - Operations meetings
  - WLCG Workshops
  - Questionnaires to site managers
  - Grid Infrastructures (EGEE, OSG, NDGF)
  - Grid Middleware providers (gLite, VDT)
  - Monitoring software providers (SAM, GridICE, MonAMI, GridView, LEMON, Nagios, …)
  - External experts (openlab EDS collaboration)
  - Other Working Groups

Tasks of grid services WG
o Collect descriptions of current grid services
  - So that probes can be written
  - Input from developers, deployment team, site admins
o Create ‘standard’ solutions as strawman
  - Standard WLCG probes
    - And how to calculate availability based on the measurements produced
  - Prototype site monitoring system
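The slides do not say how availability is derived from probe measurements. As one possible reading, the sketch below treats availability as the fraction of a time window during which the most recent measurement reported an "up" status. The status labels, the `Measurement` type, and the choice to count WARNING as "up" are all assumptions made for illustration, not the working group's definition.

```python
from dataclasses import dataclass

# Hypothetical status labels; the slides do not define the real set.
OK, WARNING, CRITICAL, UNKNOWN = "OK", "WARNING", "CRITICAL", "UNKNOWN"


@dataclass(frozen=True)
class Measurement:
    timestamp: float  # seconds on any consistent clock
    status: str       # one of the labels above


def availability(measurements, window_start, window_end, up_states=(OK, WARNING)):
    """Return the fraction of [window_start, window_end) spent in an 'up' state.

    Assumption: each measurement's status holds until the next measurement.
    Time before the first measurement counts as not available.
    """
    if window_end <= window_start:
        raise ValueError("window_end must be later than window_start")
    ordered = sorted(measurements, key=lambda m: m.timestamp)
    up_seconds = 0.0
    for index, current in enumerate(ordered):
        # The interval covered by this measurement, clipped to the window.
        start = max(current.timestamp, window_start)
        if index + 1 < len(ordered):
            end = min(ordered[index + 1].timestamp, window_end)
        else:
            end = window_end
        if end > start and current.status in up_states:
            up_seconds += end - start
    return up_seconds / (window_end - window_start)


samples = [
    Measurement(0, OK),
    Measurement(600, CRITICAL),
    Measurement(900, OK),
]
print(availability(samples, 0, 1200))
```

Other definitions are equally plausible, for example counting the share of successful test runs rather than elapsed time; which one applies is exactly what a specification would need to pin down.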

Direction
o Focus on the interaction points between the different systems
  - Allow for diversity across different grid infrastructures
o “Specifications, not Standards”
  - Timescales mean we can’t get involved in long and heavyweight standards activities
  - Take best practices from existing systems, and document them
o Implement simple prototypes
  - And mature the bits that work!
o Get something out to the stakeholders
  - Close feedback loop is the key to adoption
  - Plan for a “standards based” solution in the future

High-level Model
See https://twiki.cern.ch/twiki/pub/LCG/GridServiceMonitoringInfo/0702 WLCG_Monitoring_for_Managers.pdf for details

Site monitoring
o We can’t/won’t impose a solution on sites
  - They might/should have something already
o Specification based approach
  - Allows our probes to fit into any fabric monitoring system
o Isolate differences in site systems by a standard Data Exchange format
  - For both publishing and consuming
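The slides do not reproduce the Data Exchange format itself. To illustrate the idea of one record layout shared by publishers and consumers, here is a minimal sketch using a line-based "key: value" record. The field names (`service_uri`, `metric_name`, `metric_status`, `timestamp`) and the layout are hypothetical and are not taken from the WLCG specification.

```python
# Hypothetical required fields; the real specification may differ.
REQUIRED_FIELDS = ("service_uri", "metric_name", "metric_status", "timestamp")


def _check_required(record):
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))


def publish(record):
    """Serialise a probe result as one 'key: value' line per field."""
    _check_required(record)
    lines = []
    for key, value in record.items():
        text = str(value)
        if ":" in key or "\n" in key or "\n" in text:
            raise ValueError("field %r cannot be written on a single line" % key)
        lines.append("%s: %s" % (key, text))
    return "\n".join(lines) + "\n"


def consume(text):
    """Parse text produced by publish() back into a dict of strings."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first ': ' only, so values may contain colons.
        key, separator, value = line.partition(": ")
        if not separator:
            raise ValueError("malformed line: %r" % line)
        record[key] = value
    _check_required(record)
    return record


example = {
    "service_uri": "srm.example.org:8443",
    "metric_name": "example-srm-put",
    "metric_status": "OK",
    "timestamp": "2007-09-05T10:00:00Z",
}
print(publish(example), end="")
```

Because a probe only has to emit such a record and a site system only has to read it, the same probe could in principle feed Nagios, LEMON, or any other fabric monitoring system without modification.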

“The Nagios-based prototype”
o Initially implement using one fabric monitoring system - Nagios
  - … but architecture checked with LEMON developers
o Implement some of the site components
  - Configuration Generation
  - Certificate handling
  - Grid service probes/sensors
  - Service Status Calculator
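For readers unfamiliar with Nagios, a probe is a plugin that prints a one-line status and signals the result through its exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN. The sketch below is a generic TCP reachability check following that convention. It is not one of the WLCG probes; the function names, thresholds, and the injectable `connect` parameter (there only to make the check easy to test) are assumptions.

```python
import socket
import time

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def probe_tcp(host, port, warn_seconds=1.0, timeout=5.0,
              connect=socket.create_connection):
    """Try to open a TCP connection and return (exit_code, message)."""
    started = time.monotonic()
    try:
        with connect((host, port), timeout=timeout):
            elapsed = time.monotonic() - started
    except OSError as error:
        # Refused connections, timeouts and DNS failures are all OSError.
        return CRITICAL, "cannot connect to %s:%d (%s)" % (host, port, error)
    message = "connected to %s:%d in %.3fs" % (host, port, elapsed)
    if elapsed > warn_seconds:
        return WARNING, message
    return OK, message


def plugin_line(code, message):
    """Format the single status line a plugin prints on standard output."""
    return "%s - %s" % (LABELS[code], message)
```

A wrapper script would call `probe_tcp`, print `plugin_line(code, message)`, and exit with `code` via `sys.exit(code)` so that Nagios can read the state.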

Prototype site implementation

Nagios display

Ganglia display

Current status
o Prototype tested against
  - CERN PPS
  - egee.srce.hr site
  - Early adopter sites - PIC, NIKHEF
o Complete Packaging, Installation and configuration instructions exist
  - Troubleshooting guide developed as problems arise
  - Will integrate into a gLite release for EGEE (MON BOX)
o Several probe sets integrated
  - SRCE (EGEE CE ROC), CERN LFC/DPM, OSG, RGMA
  - Can report on remote SAM and NPM tests
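The slides do not describe how remote test results reach the site's Nagios instance. One standard Nagios mechanism that could serve this purpose is the passive check: an external process writes a `PROCESS_SERVICE_CHECK_RESULT` line to the Nagios external command file. The sketch below only formats such a line. The host and service names are hypothetical, and nothing here should be read as the prototype's actual transport.

```python
VALID_RETURN_CODES = (0, 1, 2, 3)  # OK, WARNING, CRITICAL, UNKNOWN


def passive_result_line(host, service, return_code, output, timestamp):
    """Format one Nagios external command submitting a passive service result.

    Layout: [time] PROCESS_SERVICE_CHECK_RESULT;host;service;code;output
    """
    if return_code not in VALID_RETURN_CODES:
        raise ValueError("return_code must be 0, 1, 2 or 3")
    for field in (host, service):
        # ';' separates fields, so it cannot appear inside a name.
        if ";" in field or "\n" in field:
            raise ValueError("invalid character in %r" % field)
    single_line_output = output.replace("\n", " ")
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        int(timestamp), host, service, return_code, single_line_output)


# Hypothetical remote test result for a hypothetical host and service.
line = passive_result_line(
    "ce01.example.org", "remote-job-submit", 2,
    "job submission failed", 1188979200)
print(line, end="")
```

To deliver the result, the line would be appended to the file named by the `command_file` setting in the site's Nagios configuration, and the target service would need passive checks enabled.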

Who should try this?
o Site admins who already use Nagios and want to integrate SAM results into their site
o Site admins who have no monitoring yet and are thinking of trying Nagios
  - Can have both SAM results and local checks
  - Not for the faint hearted - you’ll be a very early adopter!
o But sites doing it are already seeing benefits (NIKHEF)
o RPMs in an apt repository
  - Provided by System Management group
    - http://www.sysadmin.hep.ac.uk/
o Mailing list for community support of sites
  - wlcg-monitoring-discuss@cern.ch

Futures and other work
o We focus here on the prototype
  - Since this is what we are delivering now
o Also working on
  - Specifications and example components
  - Security architecture
o Future work includes
  - Probe description database
  - Topology database
  - Messaging architecture for transport layer

Summary
o Effort invested to understand the current monitoring landscape
o Approach for improvement based on specifications of interfaces between components
o Prototype has been developed and tested on a small scale
o Now looking for more adopters to get feedback
Who wants to volunteer?
https://twiki.cern.ch/twiki/bin/view/LCG/GridServiceMonitoringInfo

Thank you.