- Number of slides: 49
Enabling Grids for E-sciencE
Site Monitoring for Grid Services
WLCG Grid Services Monitoring Working Group
Ian Neilson – CERN
HEPSYSMAN, Imperial College, London
Material from Emir Imamagic (SRCE), Ronald Starink (NIKHEF), Max Boehm (EDS/Openlab)
www.eu-egee.org
EGEE-II INFSO-RI-031688
EGEE and gLite are registered trademarks
Overview
• Intro: WLCG Grid Services Monitoring WG
  – Mandate and approach
• Nagios prototype framework
  – Nagios-based grid monitoring
    § Architecture
    § Grid extensions
    § Standard components
  – Demo/Pics
  – Current status & future work
• Some real early-adopter feedback
  – Experience and wishes
• A little bit of Ganglia
• Aside (if there is time): GridMap visualisation
• All material extracted from presentations at EGEE'07
  – http://indico.cern.ch/sessionDisplay.py?contribId=297&sessionId=37&confId=18714
  – http://tinyurl.com/2gop7v
WLCG Monitoring Working Groups
• 3 groups created by Ian Bird, Oct '06
  – “… to help improve the reliability of the grid infrastructure …”
  – “… provide stakeholders with views of the infrastructure allowing them to understand the current and historical status of the service …”
  – “… stakeholders are site administrators, grid service managers and operations, VOs, Grid Project management”
• System Management: fabric management, best practices, security, …
• Grid Services: grid sensors, transport, metric repositories, views, …
• System Analysis: application monitoring, …
WLCG Grid Monitoring Landscape
Domain: monitoring tools in use
• Grid Applications: application monitoring (Experiment Dashboards, ...)
• Grid Middleware (central services, site services): Grid Services monitoring (GStat, SAM/GridView, GridICE, GridPP Real Time Monitor, ...)
• Local Resources (site): local monitoring (Lemon/SLS, Nagios, Ganglia, ...)
The three domains map onto the 3 WLCG Monitoring Working Groups.
Slide by Max Böhm, EDS
WLCG Grid Monitoring Landscape
[Diagram: data flows between the monitoring tools and the grid layers. Layers: App Layer (VO jobs, Ganga/Panda, AtlasProdDB, File Catalog), Central Grid Services (Resource Broker, LFC, Info System BDII, RGMA, LB, FTS), Site Services (CE, SE, batch), Fabric Resources (CPUs, TBs). Tools: GOCDB (site registry), Experiment Dashboards (one per experiment/VO, e.g. ATLAS), MonALISA, RTM (real time 3D job view), GStat, GridView, GridICE, SAM, Nagios. Transports shown: HTTP/XML push and pull, HTTP/SOAP push, LDAP, DB access.]
High-level View of Monitoring
• Work in 4 areas
• Initial focus to help sites: this session
• Also visualization: GridMap
Aims of Grid Services WG
• Beginning to look at:
  – more integration with external monitoring (dashboards, …)
  – messaging systems for reliable transport
  – management/operations visualisation & reporting requirements
BUT
• The aim is always:
  – NOT to provide yet another complete technical solution
  – to incrementally improve service reliability by consolidating existing solutions where possible
Please see the twiki for all the information and the links to the other WGs:
https://twiki.cern.ch/twiki/bin/view/LCG/GridServiceMonitoringInfo
Site Grid Services Monitoring
• The rest of this session concentrates on Nagios
BUT
• Nagios is one (good) choice
• You use a tool suited for your site and fabric
BUT
• The standardized probe set should be reusable
• The data exchange specification allows standardized access to metrics
• Work on configuration building should be reusable
AND
• We want to help sites to deploy monitoring to:
  – improve their reliability
  – make their life easier
Nagios Prototype
Nagios Framework
• Open source monitoring framework
  – widely used & actively developed
• Detection of, and recovery from, host and service problems
• Provides a wide set of basic sensors
  – easy to develop custom sensors
• Centralized vs. distributed deployment
• High configurability
  – service dependencies, fine-grained notification options
• Web interface
  – status view, administration
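To illustrate why custom sensors are easy to develop, here is a minimal sketch of the logic behind a Nagios plugin. It relies only on the standard plugin convention (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, plus one line of status text). The function names, the "QUEUE" label and the thresholds are illustrative assumptions, not part of the prototype.

```python
# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a status code; higher values are worse."""
    if value is None:
        return UNKNOWN  # the measurement could not be obtained
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def run_check(label, value, warn, crit):
    """Return (exit_code, status_line) for one measurement."""
    status = evaluate(value, warn, crit)
    return status, f"{label} {STATUS_NAMES[status]} - value={value}"


# Hypothetical example: 25 queued jobs against warn=10, crit=20.
code, line = run_check("QUEUE", 25, 10, 20)
print(line)
```

A real plugin would print the status line and then terminate with the code (for example via `sys.exit(code)`), which is how Nagios learns the result.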
Nagios-based Grid Monitoring
• Monitoring CRO-GRID Infrastructure (2004-2006)
  – Globus Toolkit Pre-WS & WS, UNICORE, other services
  – active recovery of services
  – http://www.cro-ngi.hr
• Monitoring EGEE resources in Central Europe (CE)
  – core services since mid 2006
  – all CE sites for 1st line support since September 2006
  – http://nagios.ce-egee.org
• Grid Services Monitoring (GSM) WG
  – site monitoring prototype, mid 2007
  – http://crnjak.srce.hr/nagios (egee.srce.hr)
  – https://pps-monitoring.cern.ch/nagios (CERN-PPS)
Architecture
[Diagram: a monitoring server runs live node checks and service checks against the site nodes (CE, SE, LFC, …). Labelled flows: get nodes information (site BDII), get site's & nodes information, get site status, get remote results, get VOMS proxy and refresh proxy (MyProxy), probe descriptions, issue alarms and get Nagios results (site admins).]
Grid Extensions
• Standard probes – provided by SRCE, CERN, OSG
  – Security facilities & services
    § CA distribution, certificate lifetime, MyProxy
  – Monitoring & information services
    § R-GMA, BDII, MDS, GridICE
  – Job management services
    § Globus Gatekeeper, RB, WMS, WMProxy, job matching
  – File management services
    § GridFTP, SRM, DPNS, LFC, FTS
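As a rough idea of what a certificate lifetime probe decides, the sketch below maps the remaining validity of a credential to a Nagios status. It does not parse real certificates; the expiry time is passed in. The function name and the 14-day and 3-day thresholds are assumptions for illustration, not values taken from the standard probes.

```python
from datetime import datetime, timedelta, timezone

# Nagios status codes used by this sketch.
OK, WARNING, CRITICAL = 0, 1, 2


def lifetime_status(not_after, now, warn=timedelta(days=14), crit=timedelta(days=3)):
    """Return (status, remaining) for a credential that expires at not_after.

    The thresholds are illustrative assumptions; an already expired
    credential has a negative remaining lifetime and is CRITICAL.
    """
    remaining = not_after - now
    if remaining <= crit:
        return CRITICAL, remaining
    if remaining <= warn:
        return WARNING, remaining
    return OK, remaining


# Hypothetical example: a certificate that expires in 10 days.
now = datetime(2007, 10, 1, tzinfo=timezone.utc)
status, remaining = lifetime_status(now + timedelta(days=10), now)
print(status, remaining.days)
```

The same logic would apply to a VOMS proxy, where lifetimes are hours rather than days, so the thresholds would need to be chosen accordingly.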
Grid Extensions
• Probe description database
  – probe dependencies
• Remote gatherers
  – SAM & NPM
• Certificate-based authentication for the web interface
  – enables authorization
• Nagios Config Generator (NCG), Publisher, Credential management
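The idea behind a configuration generator can be sketched as follows: take a list of site nodes and their service types, and emit Nagios host and service object definitions. This is not the NCG code. The check command names, the `generic-host` and `generic-service` templates, and the node list are assumptions; only the `define host` / `define service` syntax and the directive names come from Nagios itself.

```python
# Hypothetical mapping from grid service type to a Nagios check command name.
CHECKS = {"CE": "check_ce", "SE": "check_se", "LFC": "check_lfc"}


def render_object(kind, directives):
    """Render one Nagios object definition block."""
    lines = [f"define {kind} {{"]
    lines += [f"    {key:<22}{value}" for key, value in directives]
    lines.append("}")
    return "\n".join(lines)


def generate_config(nodes):
    """nodes: iterable of (hostname, service_type) pairs, e.g. read from a site BDII."""
    blocks, seen_hosts = [], set()
    for hostname, service_type in nodes:
        if hostname not in seen_hosts:
            # Define each host once, even if it runs several services.
            seen_hosts.add(hostname)
            blocks.append(render_object("host", [
                ("use", "generic-host"),  # assumed template name
                ("host_name", hostname),
                ("address", hostname),
            ]))
        command = CHECKS.get(service_type)
        if command is None:
            continue  # no probe known for this service type
        blocks.append(render_object("service", [
            ("use", "generic-service"),  # assumed template name
            ("host_name", hostname),
            ("service_description", f"{service_type}-probe"),
            ("check_command", command),
        ]))
    return "\n\n".join(blocks) + "\n"


# Hypothetical node list for illustration.
print(generate_config([("ce.example.org", "CE"), ("se.example.org", "SE")]))
```

Generating the configuration from the information system, rather than writing it by hand, keeps the monitored set in step with what the site actually publishes.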
Standard Components
• Probe wrapper
  – enables integration of standardized probes
  – Grid Monitoring Probes Specification
  – https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringProbeSpecification
• Publisher & remote gatherers
  – integration with other tools
  – Grid Monitoring Data Exchange Standard
  – https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringDataExchangeStandard
Comments, contributions & probes welcome!
[Screenshot labels: SAM, Standard probes, NPM]
Current Status
• Three sets of standard probes integrated
  – SRCE, CERN, OSG
• RPMs in apt and yum repository
  – http://www.sysadmin.hep.ac.uk
• Mailing list for community support of sites
  – wlcg-monitoring-discuss@cern.ch
• Deployments
  – CERN-PPS, SRCE, NIKHEF, PIC
New volunteers welcome!
Nagios Prototype
NIKHEF Experience as early adopter
NIKHEF Experience
• NIKHEF is part of the Dutch T1
  – ~400 cores, 90 TB storage, ~150 hosts
  – Will grow ~10x
  – Site without active monitoring, only Ganglia
• Motivation
  – Recognized need for site monitoring
  – Planned to look at Nagios
  – Opportunity to get started!
  – Hesitation: investment of time on short notice
• Expectation
  – Let's see what it does...
  – Gain experience with Nagios
NIKHEF Experience
• Initial setup:
  – Dedicated host for monitoring
  – Remote probes only (SAM)
  – No gLite
  – Only secure web server
• Extended setup:
  – New VM as gLite 3.0 UI
  – Remote probes
  – Local probes: CE, RB, BDII, site BDII, LFC, SE DPM, classic SE, MON
  – Only secure web server
• Future:
  – Add local probes to dedicated host
  – Use notifications, perhaps event handlers
NIKHEF Experience
• Not difficult to set up
  – Some manual actions
  – Early adopter: some small issues
  – Configuration script complex
• Very useful!
  – Almost immediate feedback when services are failing
  – Used on a daily basis
  – Lots of tests (SAM and local)
  – Not always clear what they test. Documentation?
• Some issues
  – Permanently failing tests: SAM + local probes
  – Occasionally failing tests, spontaneous recovery
  – Proxy nearly expired: many tests failing?
NIKHEF Experience - Conclusions
• Not difficult to set up
  – Manual configuration
  – Help from mailing list
• Very useful!
  – Good overview of service status
  – Fast feedback
  – Extensible
  – Jeff: “For me it's already worth the investment”
• Not yet “production quality”, but close
  – Some permanently failing tests
  – Does not yet feel 100% stable
• Documentation on tests
  – What is tested?
  – What does a failure mean?
• Future
  – Part of infrastructure
  – Add more tests, perhaps contribute tests
  – Use alarms
Nagios Prototype
Future and Conclusions
Future Work
• NCG modularization
  – enables reuse for other monitoring tools (e.g. Lemon)
• Enabling “on-host” checks via NRPE
  – processes, logs, ports, files, etc.
• Simplify local probe execution
  – executing local probes on existing gLite-UI nodes
  – executing local probes without dteam membership
• Definition of probe description & site topology databases
• Migration of credential management to robot certificates
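An "on-host" check of the kind NRPE would run on a monitored node can be very small. The sketch below tests whether a TCP port accepts connections and reports the result in Nagios terms. It is a generic illustration using Python's standard `socket` module, not a probe from the prototype; the function name, host and port are assumptions.

```python
import socket

# Nagios status codes used by this sketch.
OK, CRITICAL = 0, 2


def check_port(host, port, timeout=2.0):
    """Return (status, message) depending on whether a TCP connect succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"PORT OK - {host}:{port} accepts connections"
    except OSError as err:
        # Covers refused connections, timeouts and name resolution failures.
        return CRITICAL, f"PORT CRITICAL - {host}:{port} unreachable ({err})"
```

A hypothetical call such as `check_port("localhost", 2811)` would return the status and message; a wrapper script would then print the message and exit with the status code.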
Conclusions
• Nagios
  – highly configurable monitoring framework with notifications, service dependencies, …
  – widely used by site admins
• Grid extensions
  – integration with existing infrastructure (user certificates, VOMS, GOCDB, SAM)
  – probes for key grid services
• Implementation of GSM WG specifications
  – probe wrapper, publisher & remote gatherers
  – easy integration with existing probes and monitoring systems
Ganglia
• Using the publisher interface to populate Ganglia
Contact Us
Special thanks to the original authors
Questions?
wlcg-monitoring-discuss@cern.ch
https://twiki.cern.ch/twiki/bin/view/LCG/GridServiceMonitoringInfo
Visualizing the State of the Grid with GridMaps
Max Böhm, Rolf Kubli
CERN openlab / EDS
EGEE'07 Conference, 1-5 Oct 2007
Outline
– Collaboration CERN openlab / EDS
– Motivation
– GridMap Visualization
– Prototype
– Conclusions
Collaboration CERN openlab / EDS
• EDS is a contributor member of the CERN openlab
• The purpose of the joint project between CERN and EDS is to carry out research and development in the field of monitoring, management and operation of Grid services.
• The CERN openlab is a framework for evaluating and integrating cutting-edge IT technologies or services in partnership with industry
[Photos: Mont Blanc (4810 m); downtown Geneva]
Motivation
• Better understanding the state of the Grid helps improve the reliability of Grid services
• "You can't manage what you don't measure"
• Many Grid monitoring tools are in use
  – Service Availability Monitoring (SAM)
  – GridView
  – GStat
  – Experiment Dashboard
  – GridICE
  – ...
Motivation
• But...
• The Grid is a large distributed infrastructure
• Grid monitoring data are complex!
• Current tools visualize data by sorted tables, bar charts, etc.
• It is difficult to present an easy-to-understand top-level view which
  - provides quick, action-oriented oversight and insight
  - helps understand job failures and availability patterns
• Can new visualizations help?
GridMap Visualization
GridMap Visualization
• Idea
  – visualize the Grid by using Treemaps (Grid + Treemap = GridMap)
• Example GridMap
  – rectangles are sites, grouped into regions
  – size of rectangle is e.g.
    - size of site (#CPUs)
    - #running jobs
    - ...
GridMap Visualization
• Idea
  – visualize the Grid by using Treemaps (Grid + Treemap = GridMap)
• Example GridMap
  – colour of rectangle is e.g.
    - SAM status of site / service
    - availability of site / service
    - ...
  – colour key: ok, degraded, down
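The size and colour mapping described on the two slides above can be sketched with a simple "slice-and-dice" treemap: the canvas is split horizontally among regions and each region vertically among its sites, with area proportional to the number of CPUs. The slides do not say which layout algorithm GridMap uses, so slice-and-dice is a stand-in chosen for brevity; the hex colours, names and sample data are assumptions.

```python
from dataclasses import dataclass

# Colour key from the slide (ok / degraded / down); the hex values are assumptions.
STATUS_COLOURS = {"ok": "#00aa00", "degraded": "#ffaa00", "down": "#cc0000"}


@dataclass
class Rect:
    name: str
    x: float
    y: float
    width: float
    height: float
    colour: str


def slice_layout(sizes, x, y, width, height, horizontal):
    """Split a rectangle into strips whose areas are proportional to sizes."""
    total = sum(size for _, size in sizes)
    strips, offset = [], 0.0
    for name, size in sizes:
        share = size / total if total else 0.0
        if horizontal:
            strips.append((name, x + offset, y, width * share, height))
            offset += width * share
        else:
            strips.append((name, x, y + offset, width, height * share))
            offset += height * share
    return strips


def gridmap(regions, width=100.0, height=100.0):
    """regions: {region: [(site, cpus, status), ...]} -> list of site Rects."""
    region_sizes = [(name, sum(cpus for _, cpus, _ in sites))
                    for name, sites in regions.items()]
    rects = []
    for region, rx, ry, rw, rh in slice_layout(region_sizes, 0.0, 0.0, width, height, True):
        sites = regions[region]
        status_of = {site: status for site, _, status in sites}
        site_sizes = [(site, cpus) for site, cpus, _ in sites]
        for site, sx, sy, sw, sh in slice_layout(site_sizes, rx, ry, rw, rh, False):
            # Sites with an unknown status stay uncoloured (white).
            colour = STATUS_COLOURS.get(status_of[site], "#ffffff")
            rects.append(Rect(site, sx, sy, sw, sh, colour))
    return rects


# Hypothetical sample data: two regions, three sites.
sample = {
    "North": [("A", 300, "ok"), ("B", 100, "down")],
    "South": [("C", 400, "degraded")],
}
for rect in gridmap(sample):
    print(rect)
```

Swapping the size metric (for example running jobs instead of CPUs) only changes the numbers passed in, which is what makes switching between views cheap.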
Multiple Views
• GridMaps can be used for top-level, geographical and VO views
[Diagram: over a large-scale federated Grid services infrastructure, a Global GridMap gives the top-level view; Application Domain GridMaps give cross-location VO views; Local GridMaps give geographical views (federation, partner, site, etc.) and lead to the next level of GridMaps. Labelled cycle: alert, corrective action, effect.]
Trends
• Trends can be understood by looking at a sequence of GridMaps
[Figure: site availability over time, one GridMap per day for 20, 21, 22, 23, 24 and 25 Sep 2007]
More Views
• Correlations of metrics can be discovered by switching between different views
[Figure: site availability from different VO perspectives: OPS, Alice, Atlas, CMS, LHCb; sites without colour do not support the VO]
[Figure: status of different site services: Overall Site, CE, SE, SRM, site BDII]
Prototype
GridMap Prototype Architecture
[Diagram: Grid sites → existing monitoring system(s) → GridMap Server → GridMap View in a web browser (view 1, view 2, view 3)]
• GridMap Server
  - provides client side code and client supporting services
  - retrieves and caches data from existing monitoring systems
  - POC implementation is based on Apache / Python
• GridMap View (web browser)
  - browser-based Web 2.0 type client component
  - implements the GridMap layout algorithm
  - single interactive and responsive web page (no page reloads required, data is retrieved in the background)
  - fast switching between views possible
  - details of the site/service statuses are shown as a context-sensitive tooltip
  - POC implementation is based on HTML, lightweight JavaScript libraries, AJAX type communication pattern
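The "retrieves and caches data" role of the server can be illustrated with a small time-based cache, so that many browser requests do not each trigger a query to the underlying monitoring system. This is a sketch under assumptions, not the prototype's code: the class name, the TTL value and the `fetch` callable are hypothetical.

```python
import time


class CachedSource:
    """Cache the result of a slow fetch for a fixed time-to-live."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # callable that queries the monitoring system
        self._ttl = ttl_seconds
        self._clock = clock          # injectable for testing
        self._value = None
        self._expires = None

    def get(self):
        now = self._clock()
        if self._expires is None or now >= self._expires:
            # Cache is empty or stale: refresh from the source.
            self._value = self._fetch()
            self._expires = now + self._ttl
        return self._value


# Hypothetical usage: refresh site statuses at most every 5 minutes.
source = CachedSource(lambda: {"SITE-A": "ok"}, ttl_seconds=300)
print(source.get())
```

The TTL trades freshness against load on the monitoring systems; a short TTL suits status views, while availability data that changes daily could be cached for much longer.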
GridMap Prototype View Component
Link: http://gridmap.cern.ch
[Annotated screenshot; callouts:]
- Description of current view
- Grid topology view (grouping)
- Metric selection for size of rectangles
- Metric selection for colour of rectangles
- VO selection
- Overall Site or Site Service selection
- Show SAM status
- Show GridView availability data
- Drilldown into region by clicking on the title
- Context-sensitive information
- Colour key
GridMap Prototype: Link to Existing Tools
• Clicking on a site opens a page with details in GridView/SAM
[Screenshots: site detail, availability, SAM test results]
Conclusions
Conclusions
– GridMaps are a new approach to visualizing complex monitoring data of the Grid
– The same type of visualization can be used for top-level, regional, and VO-specific views
– GridMaps can identify correlations and availability patterns
– A prototype for visualizing SAM data has been implemented
– Can be used for visualizing other data, e.g. of experiments, alarms
– GridMap web component can be embedded into other tools, e.g. Dashboards (if you are interested, please contact us)
– GridMaps are a result of the CERN openlab / EDS collaboration which takes place within the CERN-IT Grid Deployment group
Contacts:
Dr. Max Böhm, EDS / CERN openlab, max.boehm@eds.com, max.boehm@cern.ch
Dr. Rolf Kubli, EDS Switzerland, rolf.kubli@eds.com
EDS and the EDS logo are registered trademarks of Electronic Data Systems Corporation. EDS is an equal opportunity employer and values the diversity of its people. © 2007 Electronic Data Systems Corporation. All rights reserved.