Accounting and metrics: Gratia Status (Philippe Canal, FNAL)


  • Number of slides: 14

Accounting and metrics: Gratia Status
Philippe Canal (FNAL)
3/6/2007
Philippe Canal, OSG Consortium All Hands Meeting

Overview
[Figure: Gratia architecture diagram. Components shown: Probes, Collector, Data Store Access Layer, Repository of Accounting Records, Statistical Analyzer, Web Presenter, and a WS API. Instances appear at a VO Center, the Grid Operation Center, and a Resource Provider Site.]
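The data flow in the overview diagram (probe, then collector, then repository, access layer and analyzer) can be sketched as a toy in-memory pipeline. This is not Gratia's implementation. `UsageRecord`, `probe_parse`, `Collector`, `total_wall_hours` and the log-line format are hypothetical names chosen for illustration. Real probes send much richer usage records to a remote collector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    # Hypothetical, heavily simplified accounting record.
    probe: str
    site: str
    vo: str
    user: str
    wall_seconds: float


def probe_parse(probe: str, site: str, line: str) -> UsageRecord:
    """Probe side: turn one batch-log line into a usage record.

    Assumes a made-up log format: "<user> <vo> <wall_seconds>".
    """
    user, vo, wall = line.split()
    return UsageRecord(probe, site, vo, user, float(wall))


class Collector:
    """Central side: receives records from probes and stores them."""

    def __init__(self) -> None:
        # Stands in for the Repository of Accounting Records.
        self._repository: list[UsageRecord] = []

    def receive(self, record: UsageRecord) -> None:
        self._repository.append(record)

    def query(self, **filters: str) -> list[UsageRecord]:
        # Stands in for the Data Store Access Layer: filter by field equality.
        return [r for r in self._repository
                if all(getattr(r, k) == v for k, v in filters.items())]


def total_wall_hours(records: list[UsageRecord]) -> float:
    # Stands in for the Statistical Analyzer: one summary statistic.
    return sum(r.wall_seconds for r in records) / 3600.0
```

For example, two probe lines ("alice cms 3600", "bob atlas 7200") fed through `probe_parse` into a `Collector` give `total_wall_hours(collector.query(vo="cms")) == 1.0`.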

Gratia Probes
• Included in OSG 0.6.0
  – dCache
  – PBS, LSF, Sun Grid Engine, Condor 6.8 (non-WS-GRAM)
  – Also available:
    • RawCPU (psacct)
• Already deployed at 28 production sites
  – Will contact more site administrators (in particular ATLAS Tier 2) later this week.
• Next:
  – Disk storage
    • The main question will be "What are we measuring?"
  – Probe for Condor 6.8 with WS-GRAM
    • The question is "Where are the user log files?"
  – Packaging of the probe for Condor 6.9, then improvements (to be able to separate 'used' CPU from CPU 'lost' to evictions, etc.)
  – Display for dCache information
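The planned split of 'used' and 'lost' CPU can be illustrated with a minimal sketch. It assumes that a job's history is available as a list of execution attempts, each flagged as evicted or not. It also assumes there is no checkpointing, so CPU spent in an evicted attempt is simply counted as lost. `Attempt` and `split_cpu` are hypothetical names, not part of the Condor probe.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    # One execution attempt of a job on a worker node (hypothetical model).
    cpu_seconds: float
    evicted: bool  # True if the attempt ended because the job was evicted


def split_cpu(attempts: list[Attempt]) -> tuple[float, float]:
    """Return (used, lost) CPU seconds for one job.

    Assumes no checkpointing: all CPU spent in an evicted attempt is lost.
    """
    used = sum(a.cpu_seconds for a in attempts if not a.evicted)
    lost = sum(a.cpu_seconds for a in attempts if a.evicted)
    return used, lost
```

With checkpointing, part of an evicted attempt's CPU would carry over, so a real probe would need more information than this boolean flag.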

Gratia Collector
• Currently deployed only at Fermilab.
• Wider deployment is waiting on:
  – Writing proper installation and usage documentation
  – Implementation of VOMS-based role authentication
• Also need some encryption of the DN …
• Need to find the DN of users for PBS/LSF jobs.
  – Need help from GRAM for that.
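The slide does not say how the DN would be protected. One possible approach, shown here as a substitute rather than the planned design, is keyed one-way hashing (HMAC-SHA256) instead of reversible encryption. This still lets reports group usage by user without storing the DN in clear. `pseudonymize_dn`, the key, and the example DN are hypothetical.

```python
import hashlib
import hmac


def pseudonymize_dn(dn: str, key: bytes) -> str:
    """Return a stable, non-reversible token for a certificate DN.

    Same DN + same key -> same token, so per-user reports still work.
    Note: this is hashing, not encryption; the DN cannot be recovered.
    """
    return hmac.new(key, dn.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical example DN and key.
token = pseudonymize_dn("/DC=org/DC=example/OU=People/CN=Jane Doe", b"site-secret")
```

If administrators must be able to recover the DN later, reversible encryption with controlled key access would be needed instead.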

Graphs
• Job count per
  – Site
  – VO
  – User
• CPU used (wall-clock or CPU time) per
  – Site
  – VO
  – User
• Known issue: the per-VO reports are slow. This should be fixed this week.
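The graphs listed above reduce to two group-by aggregations: a count of jobs and a sum of wall-clock or CPU time, each keyed by site, VO or user. The sketch below assumes each job record is a plain dict with the hypothetical keys `site`, `vo`, `user`, `wall_hours` and `cpu_hours`. Gratia's own reports run against its database, not code like this.

```python
from collections import Counter, defaultdict


def job_count_per(records, key):
    """Number of jobs per value of `key` ("site", "vo" or "user")."""
    return Counter(r[key] for r in records)


def cpu_used_per(records, key, measure="wall_hours"):
    """Sum of `measure` ("wall_hours" or "cpu_hours") per value of `key`."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r[measure]
    return dict(totals)


# Hypothetical sample records.
records = [
    {"site": "SiteA", "vo": "cms", "user": "alice", "wall_hours": 2.0, "cpu_hours": 1.5},
    {"site": "SiteA", "vo": "atlas", "user": "bob", "wall_hours": 3.0, "cpu_hours": 2.5},
    {"site": "SiteB", "vo": "cms", "user": "alice", "wall_hours": 1.0, "cpu_hours": 1.0},
]
```

The `measure` parameter reflects the slide's "wall-clock or CPU time" choice: the same grouping serves both graphs.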

[Figure: accounting graph. Date range: 2007-02-26 00:00 GMT to 2007-03-05 23:59 GMT]

[Figure: accounting graph. Date range: 2007-02-26 00:00 GMT to 2007-03-05 23:59 GMT]

Daily Reports
• Report from the job-level Gratia DB
  – Main report, includes # of jobs and wall duration
  – Compare with the previous day
• Report from the daily summary Gratia DB
  – Report on 'legacy' sites (including Panda)
  – Compare with the previous day
• Job success rate
  – Has been between 75% and 95% overall
• Fraction of a resource used by the owner of that resource
  – Many issues: Who owns what? How are owners related to VOs?
  – How to deal with Fermilab's subgroups?
    • Does Minos 'own' any of the Fermilab worker nodes (for the purpose of this report)?
  – No good source of information on the (shared) ownership of the sites
    • The closest I have so far is the name of the Support Center.
  – This is trying to answer the metric: Do VOs utilize more resources than would be available to them without OSG?
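The daily-report quantities (jobs, wall duration, change from the previous day, success rate, and the owner's share of a resource) are simple ratios and differences. The sketch below uses hypothetical names (`DaySummary`, `success_rate`, `compare_with_previous`, `owner_share`). It assumes that the owner VO of a site is already known, which, as the slide notes, there is currently no good source for.

```python
from dataclasses import dataclass


@dataclass
class DaySummary:
    # Hypothetical per-day totals from the job-level database.
    jobs: int
    wall_hours: float
    succeeded: int


def success_rate(day: DaySummary) -> float:
    """Fraction of the day's jobs that succeeded (0.0 if there were no jobs)."""
    return day.succeeded / day.jobs if day.jobs else 0.0


def compare_with_previous(today: DaySummary, yesterday: DaySummary) -> dict:
    """Day-over-day change in job count and wall duration."""
    return {"jobs": today.jobs - yesterday.jobs,
            "wall_hours": today.wall_hours - yesterday.wall_hours}


def owner_share(usage_by_vo: dict, owner_vo: str) -> float:
    """Fraction of one site's usage consumed by its (assumed) owner VO."""
    total = sum(usage_by_vo.values())
    return usage_by_vo.get(owner_vo, 0.0) / total if total else 0.0
```

An owner share well below 1.0 would indicate that other VOs are using that site, which bears on the metric quoted in the slide.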

WLCG Reporting
• Will start reporting monthly usage.
  – The script required for the upload has been written.
  – Asked LCG where to send the information.
  – The normalization factor is currently estimated.
• Which sites and/or VOs should report to LCG?
  – CMS Tier 1 and Tier 2
  – ATLAS Tier 1 and Tier 2
    • Most are not yet reporting to Gratia.
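A monthly WLCG summary amounts to grouping usage by site, VO and calendar month, then scaling CPU time by a per-site normalization factor. The slide says this factor is currently estimated. The sketch below assumes dict records with hypothetical keys (`site`, `vo`, `end_date`, `cpu_hours`). It also assumes a factor expressed in a benchmark unit such as kSI2K per CPU, so the result is in kSI2K-hours. It does not reproduce the actual upload script.

```python
from collections import defaultdict
from datetime import date


def monthly_normalized_usage(records, factor_by_site):
    """Sum normalized CPU hours per (site, VO, "YYYY-MM").

    factor_by_site: site -> estimated normalization factor (e.g. kSI2K per CPU).
    """
    totals = defaultdict(float)
    for r in records:
        month = r["end_date"].strftime("%Y-%m")
        factor = factor_by_site[r["site"]]
        totals[(r["site"], r["vo"], month)] += r["cpu_hours"] * factor
    return dict(totals)


# Hypothetical sample data.
records = [
    {"site": "SiteA", "vo": "cms", "end_date": date(2007, 2, 10), "cpu_hours": 10.0},
    {"site": "SiteA", "vo": "cms", "end_date": date(2007, 2, 20), "cpu_hours": 4.0},
    {"site": "SiteA", "vo": "cms", "end_date": date(2007, 3, 1), "cpu_hours": 2.0},
]
```

Keeping the factor outside the records means an improved estimate can be applied later without re-collecting data.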

Upcoming Challenges
• Data quality
  – Verify and understand the discrepancies between the numbers reported by Condor and the numbers reported by the RawCPU probe (psacct)
  – So far, only anecdotal evidence of a problem …
    • often obscured by other issues (failures in GRAM-based collection, failures in psacct collection, weird overlaps).
  – No clear reproducible pattern detected yet.
• Implement a better estimate of normalized CPU used
  – Requires a notion of the 'power' of the worker node. This could be either:
    • a performance index passed along with the usage record, or
    • a description of the CPU (better, since we can then change the index being used, e.g. from SpecInt 2000 to SpecInt 2006)
  – Could/should come from the GLUE schema.
    • We already have the hostname of the worker node.
  – The probe (near the batch system) or the Collector (central place) needs to acquire the information.
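The slide prefers recording a CPU description over a fixed performance index. The reason is that normalization then becomes a lookup that can be redone with a different benchmark table. The sketch below assumes two hypothetical lookup tables: hostname to CPU description (which the slide suggests could come from GLUE) and CPU description to benchmark score. All hostnames, model names and scores are made up.

```python
def normalized_cpu_hours(cpu_hours, hostname, cpu_model_by_host,
                         index_by_model, reference=1000.0):
    """Scale raw CPU hours by the worker node's benchmark score.

    cpu_model_by_host: hostname -> CPU description (hypothetical source, e.g. GLUE)
    index_by_model:    CPU description -> benchmark score (e.g. SpecInt 2000)
    reference:         score of the reference machine; 1000.0 with SpecInt 2000
                       scores gives kSI2K-hours
    """
    model = cpu_model_by_host[hostname]
    return cpu_hours * index_by_model[model] / reference


# Hypothetical tables; only the benchmark table changes when switching benchmarks.
hosts = {"wn001.example.edu": "cpu-model-x"}
si2000_scores = {"cpu-model-x": 1500.0}
```

Switching from SpecInt 2000 to SpecInt 2006 would then mean supplying a new `index_by_model` table and `reference`, while the stored usage records stay unchanged.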

Monitoring the Accounting
• Site status of the accounting probes
  – Site administrators and the GOC can start using the Site Status web page to ensure their sites are reporting as expected:
    http://gratia-osg.fnal.gov:8880/gratia-administration/monitorstatus.html?probename=condor:cmslcgce.fnal.gov
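An administrator checking several probes could build the status-page link for each one from the URL shown on the slide. The `<batch system>:<host>` form of the probe name is inferred from that single example. The staleness check is a hypothetical addition with an arbitrary 24-hour threshold; it is not a feature of the status page.

```python
from datetime import datetime, timedelta
from urllib.parse import quote

STATUS_PAGE = ("http://gratia-osg.fnal.gov:8880/"
               "gratia-administration/monitorstatus.html")


def status_url(probe_name: str) -> str:
    # Keep ':' unescaped so the link matches the form shown on the slide.
    return f"{STATUS_PAGE}?probename={quote(probe_name, safe=':')}"


def is_stale(last_report: datetime, now: datetime,
             max_age: timedelta = timedelta(hours=24)) -> bool:
    """Hypothetical check: flag a probe that has been silent longer than max_age."""
    return now - last_report > max_age
```

For example, `status_url("condor:cmslcgce.fnal.gov")` reproduces the link above.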

Accounting Project and Metrics
• Extension of our charge to provide some of the OSG metrics.
• Metrics include, but are more than, usage accounting (see Ruth's presentation). Other metrics will come from Operations, Users, etc.
• With the OSG 0.6 release, the Accounting Project will start to collect data from, and provide information to, enable answering some of these questions.
  – Site resources provided: from GIP/GLUE
  – CPU utilization: by site, by VO
  – Data transport from SRM/dCache-based Storage Elements (SEs)
    • (Plan to add information from GridFTP-based SEs)
  – Current data is incomplete because probes are not yet deployed everywhere. With the 0.6 release, all sites MUST report accounting information.
  – In the next few months, the validity of the data will be verified and its accuracy improved.

Metrics Accounting can provide
• We can obtain some idea of how efficiently OSG is using facilities from
  – GRATIA accounting data.
    • GRATIA provides information about utilization.
  – GIP/GLUE provides a description of the facility and provides basic monitoring.
  – From these we can answer questions similar to the following:
    • Facility capability.
      – How much storage is available?
      – Total computing power?
      – Job slots available?
    • What is the availability of sites?
    • What part of a site's facilities is available to OSG?
    • How many jobs were processed?
    • How many jobs vs. slots available?
    • Do VOs utilize more resources than would be available to them without OSG?
    • How big are the jobs being submitted? Average size? Maximum?
    • What % of jobs fail? Due to user error? Due to Grid failure?
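Several of the questions above (jobs vs. slots, average and maximum job size, failure percentages) can be computed once job records and a slot count are available. The sketch below uses wall hours as the measure of job "size" and assumes each job already carries a hypothetical `status` label (`"ok"`, `"user_error"`, `"grid_error"`). Attributing a failure to user error or to the Grid is exactly the hard part the slide raises, so that label is an assumption, not something Gratia is known to provide.

```python
from statistics import mean


def job_metrics(jobs, slots_available):
    """Summary metrics for a list of job dicts with "wall_hours" and "status"."""
    if not jobs:
        raise ValueError("no jobs to summarize")
    n = len(jobs)
    walls = [j["wall_hours"] for j in jobs]
    failed_user = sum(1 for j in jobs if j["status"] == "user_error")
    failed_grid = sum(1 for j in jobs if j["status"] == "grid_error")
    return {
        "jobs": n,
        "jobs_per_slot": n / slots_available,
        "avg_wall_hours": mean(walls),
        "max_wall_hours": max(walls),
        "pct_failed_user": 100.0 * failed_user / n,
        "pct_failed_grid": 100.0 * failed_grid / n,
    }
```

The slot count would come from GIP/GLUE and the job records from Gratia, matching the two data sources listed on the slide.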

OSG Effectiveness
• Complete, accurate (trustworthy) information could be easier to find.
• Issues include:
  – Too many places to find the same information
  – Inconsistencies between the various sets
  – Missing data
• My concrete example:
  – I wanted to get the list of OSG sites that are in production.
  – Got 3 different lists with different names.
  – I could not find a way (except by sending email) to get the contact information for the 'administrators' of the sites.