Number of slides: 35

Monitoring in CMS
Daniele Bonacorsi, Andrea Sciabà
Experiment Support (ES), CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it
19/5/2011, Workshop CCR INFN Grid 2011

Outline
• Introduction
• Service monitoring
• Site monitoring
• Dashboard monitoring
• Transfer monitoring
• Data popularity
• Other monitoring
• Conclusions

Introduction
• CMS uses a large variety of monitoring sources and tools
• Main providers are
  – WLCG (SAM/Nagios, Gridview, etc.)
  – CERN IT (Lemon, SLS, Hammercloud, Dashboard, Data popularity, etc.)
  – Caltech (MonALISA)
  – CMS (PhEDEx monitoring, Overview, etc.)
  – KIT+DESY+… (HappyFaces)
• This is not meant to be an exhaustive review of every monitoring system!
  – Mainly focused on computing rather than data quality, software release validation, etc.

Service monitoring
• Most CMS services (and IT services used by CMS) are monitored using Lemon and SLS
  – Lemon: node-centric, standard + custom metrics, provides alarms and actuators
  – SLS: service-centric, produces one estimator (“availability”) plus arbitrary metrics, provides alarms
• Both are widely used in CERN IT and the LHC experiments
• The Critical Service map (developed by the Dashboard team) gives an overview of the status of CMS services

[Figure: SLS]

[Figure: Critical service map]

Site monitoring
• SAM/Nagios framework
  – Functional tests run on remote services (computing and storage elements)
  – Used in WLCG and EGEE/EGI for several years
  – CMS-specific tests are run with a CMS certificate
• CMS Job Robot
  – “Fake” analysis jobs automatically sent to all sites
  – Read a dataset replicated everywhere
  – Job success rate measured
• Transfer link quality
  – Count how many “good” links the site has, looking at the rate of transfer failures
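
The transfer link quality count described above can be sketched as follows. This is only an illustration, not the CMS implementation: the `LinkStats` record, the function names, the site names in the usage and the 50% failure-rate threshold are all assumptions made for the example, since the slides do not define when a link counts as “good”.

```python
from dataclasses import dataclass


@dataclass
class LinkStats:
    """Transfer attempts on one directed link (hypothetical record layout)."""
    source: str
    destination: str
    succeeded: int
    failed: int


def failure_rate(link):
    """Fraction of failed attempts, or None if the link saw no traffic."""
    attempts = link.succeeded + link.failed
    return link.failed / attempts if attempts else None


def count_good_links(links, site, max_failure_rate=0.5):
    """Count links touching `site` whose failure rate is within the threshold.

    The threshold is an assumed value; links with no attempts are not counted.
    """
    good = 0
    for link in links:
        if site not in (link.source, link.destination):
            continue
        rate = failure_rate(link)
        if rate is not None and rate <= max_failure_rate:
            good += 1
    return good
```

A site would then be judged on `count_good_links(links, "T2_X")` rather than on any single transfer, which keeps one bad partner site from dominating the result.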

[Figure: Site monitoring plots. Panels: SAM, JR summaries, JR errors, transfer quality]

Gridview?
• The portal is not actively used in CMS
• Site availability is calculated by the Dashboard using more critical tests than those considered by Gridview
• LCG-CE and CREAM-CE are already “ORed” in the Dashboard
  – Soon possible also in Gridview
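
The effect of “ORing” the two CE flavours can be sketched as below. This is an assumed model, not the Dashboard algorithm: each flavour is represented as a list of per-time-bin booleans, and availability is taken as the fraction of bins in which the combined status is OK.

```python
def ored_ce_status(lcg_ce_ok, cream_ce_ok):
    """Combine per-time-bin test results: a bin is OK if either CE flavour is OK."""
    if len(lcg_ce_ok) != len(cream_ce_ok):
        raise ValueError("both series must cover the same time bins")
    return [a or b for a, b in zip(lcg_ce_ok, cream_ce_ok)]


def availability(status):
    """Fraction of time bins in which the status was OK."""
    return sum(status) / len(status) if status else 0.0
```

With an “AND” combination, a site running only one CE flavour well would be penalised for the other; the “OR” avoids that.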

CMS Site Readiness
• An aggregator of site monitoring information to express whether a site is “working” or not
• READY / NOT-READY / WARNING / SCHEDULED DOWNTIME
  – Uses the recent history of the tests rather than a simple “AND” combination of the latest results (e.g. READY if all metrics are OK for ≥ 5 of the last 7 days)
• Combines SAM/Nagios, JR and link quality to answer questions like
  – Do jobs run? Is CMS software properly installed? Can jobs read local data? Can they copy output to local storage? Can data be remotely read and written? Can the site transfer to/from other sites?
• Uses GOCDB to find downtimes
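
A minimal sketch of the history-based rule follows. Only the “READY if all metrics OK for ≥ 5/7 days” example comes from the slide; the WARNING threshold (`warning_days=3`), the function name and the metric names used in the usage are assumptions for illustration, and the real Site Readiness rules may differ.

```python
def site_readiness(daily_metrics, in_scheduled_downtime=False,
                   window=7, ready_days=5, warning_days=3):
    """Classify a site from its recent daily test history.

    daily_metrics: list of dicts, oldest first, mapping metric name -> bool.
    A day is "good" only if every metric for that day is OK.
    warning_days is a hypothetical threshold, not taken from the slides.
    """
    if in_scheduled_downtime:  # in CMS this information comes from GOCDB
        return "SCHEDULED DOWNTIME"
    recent = daily_metrics[-window:]
    good_days = sum(1 for day in recent if day and all(day.values()))
    if good_days >= ready_days:
        return "READY"
    if good_days >= warning_days:
        return "WARNING"
    return "NOT-READY"
```

Note that a site with five good days followed by two bad ones is still READY here, which is exactly what distinguishes this from an “AND” of the latest results.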

[Figure: Site Readiness example]

SR metrics in the Site Status Board
• Using the Dashboard Site Status Board to display arbitrary site information

Historical trends
[Figure: historical trends. Annotations: ~6 out of 7 T1 sites good; ~40 out of 50 T2 sites good]

Hammercloud
• A distributed analysis testing system used in ATLAS, CMS and LHCb, serving two use cases:
  – Robot-like functional testing: frequent “ping” jobs to all sites to perform basic site validation
  – Stress testing: on-demand large-scale stress tests using real analysis jobs to test one or many sites, in order to:
    • Help commission new sites
    • Evaluate changes to site infrastructure
    • Evaluate SW changes
    • Compare site performances

[Figure: Hammercloud statistics, Italian Tier-2 sites]

The Dashboard framework
• Initially developed in IT for CMS, later extended to the other LHC experiments
• Covers job monitoring and site/service status monitoring
• Provides user/VO monitoring views
• Information is sent by job submission tools and by running jobs to a MonALISA server as UDP messages
  – Planned to start using the WLCG MSG system
• Information is stored in an Oracle database
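
To illustrate the fire-and-forget nature of UDP reporting, here is a sketch using only the Python standard library. It does not use the MonALISA client library or its wire format: the JSON payload, the field names and both function names are assumptions made for this example.

```python
import json
import socket


def encode_job_status(job_id, site, status):
    """Serialise a job status report (hypothetical JSON layout, not MonALISA's)."""
    report = {"job_id": job_id, "site": site, "status": status}
    return json.dumps(report, sort_keys=True).encode("utf-8")


def send_job_status(host, port, job_id, site, status):
    """Send one report as a single UDP datagram; returns the number of bytes sent.

    UDP gives no delivery guarantee, so a lost report never blocks the job.
    """
    payload = encode_job_status(job_id, site, status)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))
```

The design trade-off is that the sender needs no connection and never waits for the collector, at the cost of occasionally missing messages.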

[Figure: Dashboard architecture]

CMS job monitoring (and more)
• Interactive view
  – To see the status of current and recent jobs, how they are distributed, how they failed, etc.
• Historical view
  – To see, as a function of time, the number of running jobs, their success rate, their CPU efficiency, etc.
• Task monitoring
  – See the status of user analysis tasks
• Other monitoring
  – Visualization of SAM/Nagios tests, analysis weekly reports, critical service map, etc.

[Figure: Interactive view]

[Figure: Historical view]

[Figure: Task monitoring]

[Figure: Nagios portal in Dashboard]

Transfer monitoring
• Extensive current and historical information is available from the PhEDEx monitoring

FTS monitoring
• In WLCG all site-to-site transfers proceed via the File Transfer Service (or via xrootd)
• Troubleshooting transfer problems requires looking directly at the FTS monitoring
  – Channel configuration
  – Details on failed transfer attempts

FTS monitor parser
• Being developed in CMS but usable by anybody
• Full statistics about successful transfers from FTS monitors worldwide
  – Average transfer rates per file/stream and their historical evolution
• Useful to
  – Find general issues with endpoints and links
  – Check network performance (e.g. for LHCONE)
  – Optimize FTS channels
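
The per-file and per-stream averages mentioned above could be computed along these lines. This is a sketch under stated assumptions: the record fields (`source`, `destination`, `bytes`, `duration_s`, `streams`) are hypothetical and do not reflect the actual FTS monitor output or the parser's code.

```python
from collections import defaultdict
from statistics import mean


def average_rates(transfers):
    """Average MB/s per file and per stream for each (source, destination) link.

    transfers: iterable of dicts describing successful transfers
    (hypothetical field names). Records with non-positive duration or
    stream count are skipped.
    """
    per_link = defaultdict(lambda: {"file": [], "stream": []})
    for t in transfers:
        if t["duration_s"] <= 0 or t["streams"] <= 0:
            continue
        file_rate = t["bytes"] / t["duration_s"] / 1e6  # MB/s for the whole file
        link = (t["source"], t["destination"])
        per_link[link]["file"].append(file_rate)
        per_link[link]["stream"].append(file_rate / t["streams"])
    return {
        link: {
            "mb_per_s_per_file": mean(rates["file"]),
            "mb_per_s_per_stream": mean(rates["stream"]),
        }
        for link, rates in per_link.items()
    }
```

Comparing the per-stream figure across links is what makes channel tuning possible: a link whose per-file rate grows with the number of streams while the per-stream rate stays flat is limited per stream rather than by the endpoint.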

[Figure: FTS monitor parser example]

WLCG transfer monitoring
• A VO-agnostic project to provide global transfer monitoring
• Concept
  – FTS instances (and other transfer systems) publish transfer events and queue status to the WLCG MSG (ActiveMQ)
  – A global transfer dashboard stores recent data (~3 months) and produces plots and statistics
  – Raw event data can be consumed via MSG by any application using an API
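
The “store recent data (~3 months)” part of the concept can be sketched as a rolling window over incoming events. Everything here is assumed for illustration: the `RecentTransferStore` class, the event fields (`timestamp`, `ok`) and the 90-day retention stand in for a design that the slides only outline, and no messaging-system calls are shown.

```python
from collections import deque
from datetime import datetime, timedelta


class RecentTransferStore:
    """Keep only transfer events newer than the retention window.

    Assumes events arrive roughly in time order, as they would when
    consumed from a message queue.
    """

    def __init__(self, retention=timedelta(days=90)):
        self.retention = retention
        self.events = deque()

    def add(self, event, now):
        """Store an event dict with 'timestamp' and 'ok' keys, then prune."""
        self.events.append(event)
        self.prune(now)

    def prune(self, now):
        """Drop events older than the retention window."""
        cutoff = now - self.retention
        while self.events and self.events[0]["timestamp"] < cutoff:
            self.events.popleft()

    def failure_fraction(self):
        """Fraction of stored events that failed, or None if the store is empty."""
        if not self.events:
            return None
        failed = sum(1 for e in self.events if not e["ok"])
        return failed / len(self.events)
```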

[Figure: WLCG transfer monitoring architecture]

Advantages and plans
• Advantages
  – Decouple from local FTS monitoring
  – Cross-technology interface (FTS, xrootd, etc.)
  – More details on transfers
  – Correlations among VOs
• Plans
  – Defined message format
  – Implemented prototype of FTS publisher (IT-GT)
  – Web interface development starts this summer

Data popularity
• A framework developed by CERN IT-ES to provide
  – Usage statistics vs time for CMS files and datasets by analysis jobs: file access success/failure, CPU time, users, …
  – A data service for future applications
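
The kind of aggregation behind such usage statistics might look like the sketch below. The input record layout (`dataset`, `day`, `user`, `success`, `cpu_seconds`) and the function name are assumptions for illustration; the actual framework and its schema are not described in the slides.

```python
from collections import defaultdict


def dataset_popularity(access_records):
    """Aggregate file-access records into per-(dataset, day) usage statistics.

    access_records: iterable of dicts with hypothetical keys
    'dataset', 'day', 'user', 'success' (bool) and 'cpu_seconds'.
    """
    stats = defaultdict(
        lambda: {"accesses": 0, "failures": 0, "cpu_seconds": 0.0, "users": set()}
    )
    for rec in access_records:
        entry = stats[(rec["dataset"], rec["day"])]
        entry["accesses"] += 1
        if not rec["success"]:
            entry["failures"] += 1
        entry["cpu_seconds"] += rec["cpu_seconds"]
        entry["users"].add(rec["user"])
    return {
        key: {
            "accesses": e["accesses"],
            "failure_fraction": e["failures"] / e["accesses"],
            "cpu_seconds": e["cpu_seconds"],
            "users": len(e["users"]),
        }
        for key, e in stats.items()
    }
```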

Data popularity: first results
[Figure panels: accesses by dataset; accesses by site of the most popular dataset; fraction of open failures by site; fraction of open failures by file. Axis label: # accesses]

MonALISA
• Used “behind the curtains”:
  – Dashboard
  – CRAB server monitoring
  – Xrootd monitoring
• Very stable and reliable

Other monitoring
• Storage accounting
  – Using the Site Status Board to publish the amount of used and free space on sites
  – Needs work, as BDII information is not reliable
• xrootd monitoring
  – Developed for the CMS xrootd global redirector project
• Data operations monitoring
  – T0 operations: detailed info on T0 workflows
  – Local Tier-1 batch monitoring via HappyFaces
    • Information sent as standard XML files

[Figure: HappyFaces]

Main goals for the future
• Move towards a coherent framework for alarms and notifications
• Reorganize views to make them more convenient and converge on fewer aggregator technologies
• Provide more powerful monitoring for operators of the workflow management tools, both for production and analysis
• “Clean up” and further improve the performance of the available tools