
4f7665f22f8ba1f3ea69a713c16fddd4.ppt
- Number of slides: 29
Enabling Grids for E-sciencE
New WLCG Grid Service Monitoring Displays
James Casey, CERN IT-GD
HEPIX, November 2007
www.eu-egee.org
EGEE-II INFSO-RI-031688
EGEE and gLite are registered trademarks

Overview
• Service Monitoring in WLCG
• Site Service Monitoring – Nagios
• Central Monitoring – GridMap
• Future work
Nov 8th 2007 / HEPIX / New WLCG Grid Service Monitoring Displays

WLCG Monitoring Working Groups
• 3 groups created by Ian Bird, Oct '06
  – "… to help improve the reliability of the grid infrastructure …"
  – "… provide stakeholders with views of the infrastructure allowing them to understand the current and historical status of the service …"
  – "… stakeholders are site administrators, grid service managers and operations, VOs, Grid Project management"
• System Management: Fabric management, Best Practices, Security, …
• Grid Services: Grid sensors, Transport, Metric Repositories, Views, …
• System Analysis: Application monitoring, …

Monitoring
You can't manage what you don't measure …
accuracy and credibility
appropriate metrics
- directly relevant to user experience
- clearly defined and understood
measurement instrumentation
data collection points
- active, passive, collection intervals, alarms
- system, element, service
real-time, historical
[Diagram labels: Monitoring: Sensors/Agents, Transport, Repositories, Views, Presentation; Control: automated, manual decision making; Grid]
Slide by Max Böhm, EDS

WLCG Grid Monitoring Landscape
Domain: Grid Applications, central Grid services, Middleware, site services, local resources, site
Monitoring tools in use:
• Application monitoring: Experiment Dashboards, …
• Grid Services monitoring: GStat, SAM/GridView, GridICE, GridPP Real Time Monitor, …
• Local monitoring: Lemon/SLS, Nagios, Ganglia, …
Slide by Max Böhm, EDS

Grid Monitoring Landscape View
[Diagram: data flows between the monitoring tools (Experiment Dashboards, one per experiment/VO; RTM; GStat; SAM; GridView; GridICE; LEMON), the site registry GOCDB, the central Grid services (LFC, RB, R-GMA, BDII, LB, FTS) and the site services and fabric resources (CE, SE, batch, CPUs, TBs). Transports shown: HTTP/XML push and pull, HTTP/SOAP, LDAP, R-GMA, MonALISA, direct DB access. Data shown: VO jobs, data, site reliability; real-time 3D job view; data transfer, job status, service availability; site status and graphs; fabric and job information.]

High-level Model
[Diagram labels: LEMON, Nagios, GridView, R-GMA, GOCDB, HTTP, LDAP, GridICE, SAME, Experiment Dashboard, GridMap]
See https://twiki.cern.ch/twiki/pub/LCG/GridServiceMonitoringInfo/0702-WLCG_Monitoring_for_Managers.pdf for details

Grid Site Monitoring Principles
• Provide an easily extensible site monitoring system
  – Or be able to plug grid features into existing site monitoring
• Should be able to provide (or augment) alarms at the site for the grid services
• Don't force a solution on the site administrators
  – Should work with any fabric monitoring system that provides basic functionality
• Provide the specific plugins to deal with the Grid
  – Probes that work for Grid Services
• Enable export of the data from the site into standard grid monitoring systems, e.g. SAM, GridView, GridICE, …
  – Avoid duplicate running of probes

Purpose
• Bring data from existing monitoring systems into the site monitoring tools
  – Service Availability Monitoring (SAM)
  – Network performance monitoring (NPM)
  – Experiment site blacklists (FCR tool)
  – Experiment dashboards, …
• Decided to create a prototype based on Nagios
  – Due to existing take-up of Nagios in the community
• Second stage will be to integrate with LEMON
  – As next most common solution
  – Based on questionnaire to community

Nagios
• Open source monitoring system
• Widely used & actively developed
• Host and service problem detection and recovery
• Provides set of basic plugins (sensors)
  – easy to develop custom sensors
• No components required on monitored entities

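Note on custom sensors: a Nagios plugin is an executable that prints one line of status text and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). A minimal sketch in Python is shown here. The certificate-lifetime check, its thresholds and the function names are illustrative assumptions, not one of the probes presented in this talk.

```python
# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_days_left(days_left, warn_days=14, crit_days=3):
    """Map remaining certificate lifetime to (exit code, message).

    The thresholds are hypothetical defaults chosen for illustration.
    """
    if days_left is None:
        return UNKNOWN, "could not determine certificate lifetime"
    if days_left <= crit_days:
        return CRITICAL, f"certificate expires in {days_left} day(s)"
    if days_left <= warn_days:
        return WARNING, f"certificate expires in {days_left} day(s)"
    return OK, f"certificate valid for {days_left} more day(s)"


def main(argv):
    """Print one status line and return the exit code Nagios should see."""
    try:
        days_left = int(argv[0])
    except (IndexError, ValueError):
        days_left = None  # missing or malformed input is reported as UNKNOWN
    code, message = check_days_left(days_left)
    print(f"{LABELS[code]}: {message}")
    return code

# A real plugin script would end with:
#   import sys; sys.exit(main(sys.argv[1:]))
```

Because the check logic returns a code and a message instead of exiting directly, the same function can be called from a test or from another monitoring framework.
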
Architecture
[Diagram labels: Site admins; Monitoring server; MyProxy; Site BDII; Site nodes (CE, SE, LFC, …). Arrow labels: Get remote results; Get site's & nodes information; Issue alarms; Get Nagios results; Get site status; Get VOMS proxy; Refresh proxy; Probe descriptions; Live node checks; Get nodes information; Service checks]

Grid Extensions
• Standard probes – provided by SRCE, CERN, OSG
  – Security facilities & services
    § CA distribution, Certificate lifetime, MyProxy
  – Monitoring & information services
    § R-GMA, BDII, MDS, GridICE
  – Job management services
    § Globus Gatekeeper, RB, WMS, WMProxy, Job matching
  – Data management services
    § GridFTP, SRM, DPNS, LFC, FTS
• Remote gatherers – SAM & NPM
• Nagios Config Generator (NCG), Publisher, Credential management

Standard Components
• Probe wrapper – enables integration of standardized probes
    § One probe can run in Nagios, LEMON, SAM, …
  – Grid Monitoring Probes Specification
  – https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringProbeSpecification
• Publisher & remote gatherers – integration with other tools
    § Existing tools can just consume the data, e.g. SAM, GridView, Dashboards, …
  – Grid Monitoring Data Exchange Standard
  – https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringDataExchangeStandard
Comments, contributions & probes welcome!

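Note on the probe wrapper idea: one way to let a single probe serve several tools is to run it once and turn its exit code and first output line into a tool-neutral record that publishers can forward. The sketch below assumes a Nagios-style probe (exit codes 0 to 3). The record field names and the `run_probe` helper are hypothetical; the actual formats are defined in the specifications linked on this slide and are not reproduced here.

```python
import subprocess
import sys
import time

# Nagios-style exit codes; anything else is treated as UNKNOWN.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def run_probe(command, service, timeout=30):
    """Run a probe command and return a tool-neutral result record.

    `command` is an argument list, `service` a free-form label.
    The dictionary keys are illustrative, not a published schema.
    """
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
        code = proc.returncode
        lines = proc.stdout.strip().splitlines()
        summary = lines[0] if lines else ""
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Probe missing, not executable, or hung: report UNKNOWN.
        code, summary = 3, f"probe could not be run: {exc}"
    return {
        "service": service,
        "status": STATUS_NAMES.get(code, "UNKNOWN"),
        "summary": summary,
        "timestamp": int(time.time()),
    }


# Stand-in probe for demonstration: prints a warning and exits with 1.
fake_probe = [
    sys.executable,
    "-c",
    "import sys; print('WARNING: disk 85% full'); sys.exit(1)",
]
record = run_probe(fake_probe, service="example-SE")
```

The returned record could then be handed to whichever publisher a site uses, so the probe itself does not need to run twice.
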
[Figure labels: SAM, Standard probes, NPM]

Current Status
• Three sets of standard probes integrated
  – SRCE, CERN, OSG
• RPMs in apt and yum repository
  – http://www.sysadmin.hep.ac.uk
• Installation documentation on twiki
  – https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringNagiosInstall
• Mailing list for community support of sites
  – wlcg-monitoring-discuss@cern.ch
• Will appear in upcoming gLite releases as packaged software
• Will be bundled with "follow-up" documentation to help site admins understand what went wrong on probe failure
New (early-access) volunteers welcome!

New visualizations for the Grid?
• Grid monitoring data is complex!
  – And there are many sites…
• Current tools visualize data by sorted tables, bar charts, etc.
• Difficult to present an easy-to-understand top-level view which
  – provides quick, action-oriented oversight and insight
  – helps to understand job failures and availability patterns
Can new visualizations help?

GridMap Visualization
• Idea – visualize the Grid by using Treemaps (Grid + Treemap = GridMap)
• Example GridMap [figure labels: regions, site]
Size of rectangle is e.g.
- size of site (#CPUs)
- #running jobs
- …

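Note on the treemap idea: the slides do not describe the actual GridMap layout algorithm, so the sketch below uses the simplest treemap variant (slice-and-dice) as a stand-in. Regions split the canvas in proportion to their total CPU count, and sites split their region the same way. The `Rect` type, function names and the example site names are all hypothetical.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x: float
    y: float
    w: float
    h: float


def slice_layout(items, rect):
    """Split `rect` among (name, weight) pairs along its longer side.

    Each item gets an area proportional to its weight.
    """
    total = sum(weight for _, weight in items)
    horizontal = rect.w >= rect.h
    placed = {}
    offset = 0.0
    for name, weight in items:
        share = weight / total if total else 0.0
        if horizontal:
            placed[name] = Rect(rect.x + offset, rect.y, rect.w * share, rect.h)
            offset += rect.w * share
        else:
            placed[name] = Rect(rect.x, rect.y + offset, rect.w, rect.h * share)
            offset += rect.h * share
    return placed


def gridmap_layout(regions, canvas):
    """Two-level layout: regions first, then the sites inside each region.

    `regions` maps region name -> {site name: size metric, e.g. #CPUs}.
    Returns region name -> {site name: Rect}.
    """
    region_weights = [(name, sum(sites.values())) for name, sites in regions.items()]
    region_rects = slice_layout(region_weights, canvas)
    return {
        name: slice_layout(list(sites.items()), region_rects[name])
        for name, sites in regions.items()
    }
```

For example, `gridmap_layout({"RegionA": {"site1": 300, "site2": 100}, "RegionB": {"site3": 400}}, Rect(0, 0, 800, 400))` gives each region half of the canvas, and `site1` three times the area of `site2`. Production treemaps usually prefer a squarified layout to avoid thin strips; that refinement is omitted here.
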
GridMap Visualization
• Idea – visualize the Grid by using Treemaps (Grid + Treemap = GridMap)
• Example GridMap [figure legend: ok, degraded, down]
Colour of rectangle is e.g.
- SAM status of site / service
- Availability of site / service
- …

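Note on the colour metric: mapping a numeric availability to the three legend states can be sketched as a simple threshold function. The thresholds (0.9 and 0.5), the colour names and the function name are assumptions made for illustration; the slides do not state the values GridMap uses.

```python
def status_colour(availability, degraded_below=0.9, down_below=0.5):
    """Return a colour name for an availability fraction in [0, 1].

    None means "no value for this site", so the rectangle stays uncoloured.
    """
    if availability is None:
        return None
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if availability < down_below:
        return "red"      # down
    if availability < degraded_below:
        return "orange"   # degraded
    return "green"        # ok
```

Keeping the thresholds as parameters lets the same function serve different metrics, for example a stricter cut for central services than for individual sites.
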
Multiple Views
• GridMaps can be used for top-level, geographical and VO views
[Diagram labels: Top-level View (Global GridMap); VO Views, cross-location (Application Domain GridMap); Geographical Views: Federation, Partner, Site, etc. (Local GridMap); Next level of GridMaps; Large-scale Federated Grid Services Infrastructure; Alert; Corrective action; effect]

Trends
• Trends can be understood by looking at a sequence of GridMaps
Site Availability over time: [GridMaps for 20, 21, 22, 23, 24 and 25 Sep 2007]

More Views
• Correlations of metrics can be discovered by switching between different views
Site Availability from different VO perspectives: OPS, Alice, Atlas, CMS, LHCb (sites without colour do not support the VO)
Status of different Site Services: Overall Site, CE, SE, SRM, site BDII

GridMap Prototype Architecture
GridMap Server
- provides client side code and client supporting services
- retrieves and caches data from existing monitoring systems
- POC implementation is based on Apache / Python
GridMap View (Web Browser)
- Browser based Web 2.0 type client component
- implements GridMap Layout Algorithm
- single interactive and responsive web page (no page reloads required, data is retrieved in the background)
- fast switching between views possible
- details of the site/service statuses are shown as a context sensitive Tooltip
- POC implementation is based on HTML, lightweight JavaScript libraries, AJAX type communication pattern
[Diagram labels: Grid sites, existing monitoring system(s), GridMap Server, view 1, view 2, view 3, Title]

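Note on "retrieves and caches data": the slides do not show how the server caches, so the following is only a sketch of one common approach, a time-to-live cache placed in front of a slow upstream query. The class name, the 300-second default and the injectable `fetch` and `clock` callables are assumptions.

```python
import time


class CachedSource:
    """Serve a cached value and refetch only when it is older than the TTL."""

    def __init__(self, fetch, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch          # callable that queries the monitoring system
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

With one `CachedSource` per upstream system, many browser requests within the TTL cause a single upstream query, which fits the goal of fast switching between views.
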
GridMap Prototype View Component
Link: http://gridmap.cern.ch
[Screenshot callouts:]
- Drilldown into region by clicking on the title
- Grid topology view (grouping)
- Metric selection for size of rectangles
- Metric selection for colour of rectangles
- VO selection
- Overall Site or Site Service selection
- Show SAM status
- Show GridView availability data
- Context sensitive information
- Description of current view
- Colour Key

GridMap Prototype: Link to Existing Tools
• Clicking on a site opens a page with details in GridView/SAM
[Screenshot labels: Site Detail, Availability, SAM Test Results]

Conclusions
• To improve reliability we need to:
1. Provide more information to site administrators
  – That relates to what users actually see when using their site
    § A lot of data is already gathered, so if possible don't gather it again
  – Need to get it into the fabric monitoring system already used at a site
  – Nagios-based prototype validating the approach
    § Good feedback from early adopters
2. Improve the visualization
  1. Too much data, especially for central monitoring (~250 sites)
  2. New techniques help to compress information and bring useful information into view
http://gridmap.cern.ch
http://nagios-test.cern.ch/nagios (guest: guest)