  • Number of slides: 40

Grid Monitoring and Diagnostic Tools: GridICE, GSTAT, SAM
Giuseppe Misurelli, INFN-CNAF, giuseppe.misurelli<at>cnaf.infn.it
I Corso di formazione INFN per amministratori di siti Grid (first INFN training course for Grid site administrators), Martina Franca, 5-9 November 2007
www.ccr.infn.it http://grid.infn.it/

Disclaimer
• This presentation is based on materials provided and authorized by the EGEE project and is freely available to download and use according to the terms of the following license: http://creativecommons.org/licenses/by-nc-sa/2.5/

Outline
• Monitoring goals
• Monitoring Procedure
• Fabric Monitoring
• INFNGrid Monitoring tools

Grid Monitoring
Grid monitoring has to provide:
• Knowledge of the type, state and features of the resources constituting the Grid, by means of:
– Grid Resources Inventory
– Grid Resources Behavior
– Grid Resources Availability

Grid Resources Inventory
• An instantaneous picture of the resources constituting the Grid, showing how Grid resources are shared among sites:
– Number of Computing Elements (CE), Worker Nodes (WN) and Storage Elements (SE)
– Number of jobs running and waiting across the whole Grid, per VO

Grid Resources Behavior
• Measuring a set of evolving data to investigate historical/statistical aspects of a Grid:
– Percentage of jobs aborted at a site for a particular Virtual Organization (VO) in a certain period of time
– Duration of a fault situation for a particular service or Grid process
– Percentage of CPU/RAM usage during Grid activity
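
The first behavior metric above can be illustrated with a minimal sketch. The record layout (`site`, `vo`, `status`, `finished`), the status strings and the site and VO names are all assumptions made for this example; they are not the schema used by GridICE or any other tool in this talk.

```python
from datetime import datetime


def aborted_percentage(jobs, site, vo, start, end):
    """Return the percentage of aborted jobs for one site and VO in [start, end).

    Returns None when no job matches, so "no data" is not confused with 0%.
    """
    selected = [
        job for job in jobs
        if job["site"] == site and job["vo"] == vo and start <= job["finished"] < end
    ]
    if not selected:
        return None
    aborted = sum(1 for job in selected if job["status"] == "Aborted")
    return 100.0 * aborted / len(selected)


# Hypothetical job records; site names, VO names and statuses are made up.
sample_jobs = [
    {"site": "SITE-A", "vo": "vo1", "status": "Done", "finished": datetime(2007, 11, 5, 10, 0)},
    {"site": "SITE-A", "vo": "vo1", "status": "Aborted", "finished": datetime(2007, 11, 5, 11, 0)},
    {"site": "SITE-A", "vo": "vo2", "status": "Aborted", "finished": datetime(2007, 11, 5, 12, 0)},
    {"site": "SITE-B", "vo": "vo1", "status": "Done", "finished": datetime(2007, 11, 6, 9, 0)},
]

window_start = datetime(2007, 11, 5)
window_end = datetime(2007, 11, 6)
print(aborted_percentage(sample_jobs, "SITE-A", "vo1", window_start, window_end))  # 50.0
```

Returning `None` for an empty selection is a design choice of this sketch: a site that ran no jobs for the VO in the period should not be reported as having a 0% abort rate.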

Grid Resources Availability
• Evaluating the accessibility of the main Grid services at Regional, Site and VO level, to improve Grid usage:
– Grid services currently down (e.g. CE, WN, SE)
– Grid site components currently not working properly (e.g. authentication and authorization, job submission, data management)
– Current job load at a given site
– Current minimum/maximum number of free slots to which jobs can be submitted

Day by Day Operations /1
• INFNGrid must be monitored daily by both the ROC team and the Site Managers to test its functionality
– Service Level Agreement: according to the Memorandum of Understanding, each site must provide a production-level Grid service

Day by Day Operations /2
• The monitoring procedure is based on:
– Problem Detection and Diagnosis: use of monitoring tools, low-level checks on site
– Problem Tracking (see next talk on Support Systems): use of the helpdesk ticketing system

Grid Site Monitoring: General Requirements
• Scale efficiently as the number of monitored nodes increases
• Use lightweight sensors
– Avoid overloading the computers
• Publish reliable data
– A hard task in a Grid environment
• Send notifications on daemon/machine problems
• Take action in case of problems on services
• Allow metrics to be added easily
– New parameters of interest must be added without too much work
• Be “Grid Aware”

Monitoring Cluster Systems
• Use of systems that spot outages and notify system administrators via email, pager or other alarms
• Top systems used in Grid sites:
– Ganglia http://ganglia.sourceforge.net/
– Lemon http://lemon.web.cern.ch/lemon/doc/howto/lemonization_howto.shtml
– Monit http://www.tildeslash.com/monit/

Ganglia
PRO:
• Open source project developed at the University of California, Berkeley
• Adopted by many sites
• Easy to install and manage
• Useful charts
– Spikes are easy to detect, since the update time can be configured
• Easy to add new metrics
CONS:
• Alarms and reactions on failures are not available
• Problems in scaling to hundreds or thousands of nodes with high-frequency sampling
• It is not aware of gLite grid services
• Data can be stored only in an RRD “DB”
– No detailed historical data are available

Lemon /1
PRO:
• Open source project developed by CERN
• Its goal is to provide a monitoring system that can scale to thousands of nodes without problems
• A detailed history can be kept by using an Oracle DB as RDBMS
• Many advanced parameters can be monitored using standard sensors
Less PRO:
• It is also possible to install LEMON without a DB back-end
– With less functionality
• It has alarms and reactions on failure
– The complete set of functions is available only with a DB back-end installation
• Configuration is already available for some grid services
– It must be customized according to the site

Lemon /2
CONS:
• It is not so easy to install and manage
• It is not so simple to add metrics or checks
• A more “friendly” DB back-end is not available yet
• It does not have an hourly graph, which can make spikes hard to detect

Monit
PRO:
• Public open source project
• It has a good base of standard checks for well-known services
• Lightweight, easy to install, configure and manage
• A simple built-in HTTP server to check the status of each machine
CONS:
• It is not really a “monitoring system” but an “alert system”
• A single web page with the status of all monitored machines is not available yet
• No charts available yet

Monitoring Grid Systems
• The INFNGrid project adopts three main Grid monitoring tools to check whether its Grid resources and services work as expected:
– GridICE http://gridice4.cnaf.infn.it:50080/gridice
– GSTAT http://gstat2.gridops.org/gstat/Italy.html
– SAM https://lcg-sam.cern.ch:8443/sam.py

GridICE: Overview
• Based on the gLite Information System
– Daily discovery of new GRISes
– Periodic queries to the discovered GRISes (every 10-30 min)
  CE, Site BDII
  • Standard Glue info published
  Extended GRIS (EX GRIS)
  • Host info (e.g. daemon monitoring)
  • Job monitoring
  • Computing info gathered from the site Local Resource Management System
– Information is collected in a central RDBMS and published on the Web

GridICE: Geo View

GridICE: Site View Standard Parameters /1
• Downtime status (from GOC DB)
• Country information (from the GridICE detection mechanism)
• Administrative information (from GOC DB)

GridICE: Site View Extended Parameters
• Site job load, as a measure of how busy the site is: ((CPU# - CPUFree) / CPU#) * 100
• Power estimation, calculated by adding the power value (SpecInt) of each CPU of the site
• WN and CPU number
• CPULoad, computed by considering the 1-minute load as reported by the LRMS for all the WNs
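
As a worked illustration of the first two extended parameters, the sketch below applies the job load formula from the slide and sums per-CPU SpecInt values. The function and argument names are chosen for this example only, and the input numbers are made up; this is not GridICE code.

```python
def site_job_load(total_cpus, free_cpus):
    """Site job load in percent: ((CPU# - CPUFree) / CPU#) * 100."""
    if total_cpus <= 0:
        raise ValueError("total_cpus must be positive")
    if not 0 <= free_cpus <= total_cpus:
        raise ValueError("free_cpus must be between 0 and total_cpus")
    return (total_cpus - free_cpus) / total_cpus * 100


def power_estimation(specint_per_cpu):
    """Site power estimation: the sum of the SpecInt value of each CPU."""
    return sum(specint_per_cpu)


# Hypothetical site: 200 CPUs, 50 of them free.
print(site_job_load(200, 50))                 # 75.0
# Hypothetical SpecInt values for three CPUs.
print(power_estimation([1000, 1000, 1500]))   # 3500
```

The range check on `free_cpus` is an addition of this sketch: published values can be inconsistent, and a free-CPU count larger than the total would otherwise produce a negative load.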

GridICE: Site View Standard Parameters /2
• Number of available gatekeepers (CE)
• Number of configured queues on the CE
• Running and waiting jobs

GridICE: Site View Standard Storage Parameters
• Available space, total space and percentage used on the storage element of the site

GridICE: Site View Monitored Hosts
• Number of monitored hosts per site

GridICE: Host View General
Use Case 2: Grid operator – Site administrator
Detecting Resource Brokers with problems

GridICE: Host View Details

GridICE: GRIS View General
Use Case 3: Grid operator – Site administrator
Detecting the status of a GRIS

GridICE: GRIS View Detail

Job View
• Job section to track VO user activity in order to:
– Search among a huge number of jobs
– Inspect job resource consumption
– Aggregate job info based on VOMS attributes (next release)
Info is selected according to the consumer ID (group/role)

Chart View: Site manager viewpoint

SAM: CE Functionality Tests
• You can customize your personal SAM interface with the desired tests, chosen from a list of available ones:
– Job submission
– CA certificate version installed on WN
– Middleware version installed on WN
– Host certificate validity
– Replica management tests using lcg-utils
– Accessibility of the experiments' software directory
– Accessibility of VO management tools

SAM: SE and LFC Functionality Tests
• SE functionality tests
– File copy & register from the UI using lcg-cr
– File retrieval to the UI using lcg-cp
– File delete using lcg-del
• LFC functionality tests
– Directory listing using lfc-ls
– File entry creation
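
The three SE tests above map onto a copy, retrieve and delete sequence. The sketch below only builds and prints the command lines; it does not run them and it is not the SAM test code. The options shown (`--vo`, `-d`, `-l`, `-a`) follow common lcg-utils usage and should be checked against the man pages of the installed version. The VO name, SE host name, LFN and local paths are hypothetical placeholders.

```python
import shlex


def se_test_commands(vo, se_host, lfn, local_src, local_dst):
    """Build the command lines for a copy-and-register, retrieve, delete cycle."""
    return [
        # Copy a local file to the SE and register it in the catalogue.
        ["lcg-cr", "--vo", vo, "-d", se_host, "-l", f"lfn:{lfn}", f"file:{local_src}"],
        # Retrieve the registered file back to the UI.
        ["lcg-cp", "--vo", vo, f"lfn:{lfn}", f"file:{local_dst}"],
        # Delete all replicas and the catalogue entry.
        ["lcg-del", "--vo", vo, "-a", f"lfn:{lfn}"],
    ]


# Hypothetical values, for illustration only.
commands = se_test_commands(
    vo="myvo",
    se_host="se.example.org",
    lfn="/grid/myvo/sam-test/file.txt",
    local_src="/tmp/sam-in.txt",
    local_dst="/tmp/sam-out.txt",
)
for command in commands:
    print(shlex.join(command))
```

Building the commands as argument lists rather than shell strings keeps quoting explicit, so the same lists could later be passed to `subprocess.run` on a UI where the tools are installed.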

SAM: Error Investigation

GSTAT: Overview
• Based on the gLite Information System
• Uses scripts to generate web-accessible reports
• Scripts are executed periodically (every 15 minutes) to query and collect the information published by each site
• The retrieved information is processed by an analysis framework that checks for failures and errors

GSTAT: General View

GSTAT: Site Details

References
• GridICE - Web site
– http://gridice.forge.cnaf.infn.it/
• GSTAT - Web doc
– http://gstat2.gridops.org/gstat/filter_help.html
• SAM - Article
– Global Grid Monitoring: the EGEE/WLCG case. High Performance Distributed Computing, Proceedings of the 2007 Workshop on Grid Monitoring
• Overview of Grid Monitoring Tools - Article
– A taxonomy of grid monitoring systems. Future Generation Computer Systems, Volume 21, Issue 1, 1 January 2005, Pages 163-188