- Number of slides: 14
Accounting and metrics Gratia Status Philippe Canal (FNAL) 3/6/2007 Philippe Canal, OSG Consortium All Hands Meeting
Overview
[Architecture diagram. Three levels are shown: Resource Provider Site, VO Center, and Grid Operation Center. The component labels repeated for each level are Probe, Collector, Repository of Accounting Records, Data Store Access Layer, Statistical Analyzer, and Web Presenter. One further label appears to read "WS API".]
Gratia Probes
• Included in OSG 0.6.0
  – DCache
  – PBS, LSF, Sun Grid Engine, Condor 6.8 (non WS-GRAM)
  – Also available
    • RawCPU (psacct)
• Already deployed at 28 production sites
  – Will contact more site administrators (in particular ATLAS Tier 2) later this week.
• Next:
  – Disk storage
    • The main question will be "What are we measuring?"
  – Probe for Condor 6.8 with WS-GRAM
    • The question is "Where are the user log files?"
  – Packaging of the probe for Condor 6.9, and then improvement (to be able to separate 'used' CPU vs. 'lost' CPU due to evictions, etc.)
  – Display for DCache information
Gratia Collector
• Currently deployed only at Fermilab.
• Wider deployment is waiting on
  – Writing proper installation and usage documentation
  – Implementation of the VOMS-based role authentication.
    • Also need some encryption of the DN …
• Need to find the DN of users for PBS/LSF jobs.
  – Need the help of GRAM for that.
Graphs
• Job Count per
  – Site
  – VO
  – User
• CPU Used (WallClock or CPU time) per
  – Site
  – VO
  – User
• Known issue: the per-VO reports are slow. This should be fixed this week.
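The per-site, per-VO, and per-user breakdowns listed above amount to grouping job records by one key and summing. A minimal sketch of that grouping follows; the record layout and the field names (`site`, `vo`, `user`, `wall_seconds`, `cpu_seconds`) are assumptions made for illustration and are not the Gratia schema.

```python
from collections import defaultdict

# Hypothetical job records; field names and values are illustrative only.
jobs = [
    {"site": "SiteA", "vo": "cms", "user": "alice", "wall_seconds": 3600, "cpu_seconds": 3000},
    {"site": "SiteA", "vo": "atlas", "user": "bob", "wall_seconds": 7200, "cpu_seconds": 7000},
    {"site": "SiteB", "vo": "cms", "user": "alice", "wall_seconds": 1800, "cpu_seconds": 1500},
]

def summarize(records, key, metric="wall_seconds"):
    """Group records by `key`; return {key_value: (job_count, summed_metric)}."""
    totals = defaultdict(lambda: [0, 0])
    for rec in records:
        entry = totals[rec[key]]
        entry[0] += 1            # job count
        entry[1] += rec[metric]  # WallClock or CPU seconds, depending on `metric`
    return {k: tuple(v) for k, v in totals.items()}

per_site_wall = summarize(jobs, "site")
per_vo_cpu = summarize(jobs, "vo", metric="cpu_seconds")
```

Switching `key` between `site`, `vo`, and `user`, and `metric` between the two time fields, covers each graph in the list with the same function.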
Date range: 2007-02-26 00:00 GMT to 2007-03-05 23:59 GMT
Date range: 2007-02-26 00:00 GMT to 2007-03-05 23:59 GMT
Daily Reports
• Report from the job-level Gratia db
  – Main report, includes # of jobs and Wall Duration
  – Compare with the previous day
• Report from the daily summary Gratia db
  – Report on 'legacy' sites (including Panda)
  – Compare with the previous day
• Job Success Rate
  – Has been between 75% and 95% overall
• Fraction of resource used by owner of resource
  – Many issues: Who owns what? How are they related to the VO?
  – How to deal with Fermilab's subgroups?
    • Does Minos 'own' any of the Fermilab worker nodes (for the purpose of this report)?
  – No good source of information on the (shared) ownership of the sites
    • The closest I have so far is the name of the Support Center.
  – This is trying to answer the metric: Do VOs utilize more resources than would be available to them without OSG?
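The two calculations the daily reports rely on, a job success rate and a comparison with the previous day, can be sketched as below. This is an illustration under stated assumptions, not the report's actual code: the function names and the sample counts are hypothetical.

```python
def success_rate(succeeded, total):
    """Fraction of jobs that succeeded; None when no jobs ran."""
    return succeeded / total if total else None

def day_over_day(today, yesterday):
    """Relative change versus the previous day; None when yesterday was zero."""
    if yesterday == 0:
        return None
    return (today - yesterday) / yesterday

# Hypothetical counts, for illustration only.
rate = success_rate(succeeded=850, total=1000)      # fraction of successful jobs
change = day_over_day(today=1200, yesterday=1000)   # relative change in job count
```

Returning `None` for an empty day keeps a site that reported nothing from showing up as a 0% success rate or an infinite change.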
WLCG Reporting
• Will start reporting monthly usage.
  – The script required for the upload has been written.
  – Asked LCG where to send the information.
  – The normalization factor is currently estimated.
• Which sites and/or VOs should report to LCG?
  – CMS Tier 1 and Tier 2
  – ATLAS Tier 1 and Tier 2
    • Most are not yet reporting to Gratia.
Upcoming Challenges
• Data Quality
  – Verify and understand the discrepancies between the numbers reported by Condor and the numbers reported by the RawCPU probe (psacct)
  – So far only anecdotal evidence of a problem …
    • often obscured by other issues (failure of GRAM-based collection, failure of psacct collection, odd overlaps).
  – No clear reproducible pattern detected yet.
• Implement a better estimate of normalized CPU used
  – Requires a notion of the 'power' of the worker node. This could be either:
    • a performance index passed along with the usage record
    • a description of the CPU (better, since we can then change the index being used, e.g. from SpecInt 2000 to SpecInt 2006)
  – Could/should come from the GLUE schema.
    • We already have the hostname of the worker node
  – The Probe (near the batch system) or the Collector (central place) needs to acquire the information
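One way to read the normalization described above is to scale raw CPU seconds by the worker node's performance index relative to a reference value, looking the index up by hostname. The sketch below assumes exactly that; the reference value, the hostnames, the index numbers, and the lookup table are all hypothetical, and the slide does not specify the formula.

```python
# Hypothetical reference score that normalized CPU time is expressed against.
REFERENCE_INDEX = 1000.0

# Hypothetical hostname -> performance index table (in practice this might be
# filled from a CPU description, so the benchmark can be swapped later).
HOST_INDEX = {
    "wn001.example.org": 1500.0,
    "wn002.example.org": 800.0,
}

def normalized_cpu_seconds(cpu_seconds, hostname,
                           host_index=HOST_INDEX, reference=REFERENCE_INDEX):
    """Scale raw CPU seconds by the node's index; None if the node is unknown."""
    index = host_index.get(hostname)
    if index is None:
        return None  # no 'power' information for this worker node
    return cpu_seconds * index / reference

fast_node = normalized_cpu_seconds(3600, "wn001.example.org")
slow_node = normalized_cpu_seconds(1000, "wn002.example.org")
```

Keeping the index in a table keyed by hostname, rather than in each usage record, matches the second option in the list: changing the benchmark means rebuilding the table, not rewriting stored records.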
Monitoring the Accounting
• Site Status of the Accounting Probes.
  – Site administrators / GOC can start taking advantage of the Site Status web page to ensure their sites are reporting as expected: http://gratia-osg.fnal.gov:8880/gratia-administration/monitorstatus.html?probename=condor:cmslcgce.fnal.gov
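The status address on this slide carries the probe name in a `probename` query parameter. A small sketch for building that address for a given probe follows; it assumes, without confirmation from the slide, that other probes follow the same `<batch system>:<host>` naming pattern as the one shown.

```python
from urllib.parse import urlencode

# Page address as given on the slide.
STATUS_PAGE = "http://gratia-osg.fnal.gov:8880/gratia-administration/monitorstatus.html"

def status_url(probe_name):
    """Build the Site Status URL for one probe, e.g. 'condor:cmslcgce.fnal.gov'."""
    # safe=":" leaves the colon in the probe name unescaped, as in the slide's URL.
    return STATUS_PAGE + "?" + urlencode({"probename": probe_name}, safe=":")

url = status_url("condor:cmslcgce.fnal.gov")
```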
Accounting Project and Metrics
• Extension of our charge to provide some of the OSG Metrics.
• Metrics include, but are more than, Usage Accounting (see Ruth's presentation). Other metrics will come from Operations, Users, etc.
• With the OSG 0.6 release the Accounting Project will start to collect data and provide information to help answer some of these questions.
  – Site resources provided: from GIP/GLUE
  – CPU utilization: by site, by VO
  – Data transport from SRM/dCache-based Storage Elements (SEs).
    • (Plan to add information from GridFTP-based SEs)
  – Current data is incomplete due to lack of deployment of probes. With the 0.6 release all sites MUST report accounting information.
  – In the next few months the validity of the data will be verified and the accuracy improved.
Metrics Accounting can provide
• We can obtain some idea of how efficiently OSG is using facilities from
  – GRATIA accounting data.
    • GRATIA provides information about utilization.
  – GIP/GLUE provides a description of the facility and provides basic monitoring.
  – From these we can answer questions similar to the following:
    • Facility capability.
      – How much storage is available?
      – Total computing power?
      – Job slots available?
    • What is the availability of sites?
    • What part of a site's facilities is available to OSG?
    • How many jobs were processed?
    • How many jobs vs. slots available?
    • Do VOs utilize more resources than would be available to them without OSG?
    • How big are the jobs being submitted? Average size? Maximum?
    • What % of jobs fail? Due to user error? Due to Grid failure?
OSG Effectiveness
• Complete, accurate (trustworthy) information could be easier to find.
• Issues include:
  – Too many places to find the same information
  – Inconsistencies between the various sets
  – Missing data
• My concrete example:
  – I wanted to get the list of OSG sites that are in production
  – Got 3 different lists with different names
  – I could not find (except by sending email) a way to get the contact information for the 'administrators' of the sites.


