DØ Computing Model & Monte Carlo & Data Reprocessing
Gavin Davies, Imperial College London
DOSAR Workshop, Sao Paulo, September 2005

Outline
  • Operational status
      • Globally – continue to do well
      • View shared by recent Run II Computing Review
  • DØ Computing model
      • Ongoing, ‘long’ established plan
  • Production Computing
      • Monte Carlo
      • Reprocessing of Run II data
          • 10^9 events reprocessed on the grid – largest HEP grid effort
  • Looking forward
  • Conclusions

Snapshot of Current Status
  • Reconstruction keeping up with data taking
  • Data handling is performing well
  • Production computing is off-site and grid based. It continues to grow & work well
  • Over 75 million Monte Carlo events produced in last year
  • Run IIa data set being reprocessed on the grid – 10^9 events
  • Analysis cpu power has been expanded
  • Globally doing well
      • View shared by recent Run II Computing Review

Computing Model
  • Started with distributed computing with evolution to automated use of common tools/solutions on the grid (SAM-Grid) for all tasks
      • Scalable
      • Not alone – joint effort with others at FNAL and elsewhere, LHC…
  • 1997 – Original plan
      • All Monte Carlo to be produced off-site
      • SAM to be used for all data handling, provides a ‘data-grid’
  • Now: Monte Carlo and data reprocessing with SAM-Grid
  • Next: other production tasks, e.g. fixing, and then user analysis
  • Use concept of Regional Centres
      • DOSAR one of the pioneers
      • Builds local expertise

Reconstruction Release
  • Periodically update version of reconstruction code
      • As new / more refined algorithms are developed
      • As understanding of the detector improves
  • Frequency of releases decreases with time
      • One major release in last year – p17
      • Basis for current Monte Carlo (MC) & data reprocessing
  • Benefits of p17
      • Reco speed-up
      • Full calorimeter calibration
      • Fuller description of detector material
      • Use of zero-bias overlay for MC
  (More details: http://cdinternal.fnal.gov/RUNIIRev/runIIMP05.asp)

Data Handling - SAM
  • SAM continues to perform well, providing a data-grid (http://d0db-prd.fnal.gov/sm_local/SamAtAGlance/)
      • 50 SAM sites worldwide
      • Over 2.5 PB (50 B events) consumed in the last year
      • Up to 300 TB moved per month
      • Larger SAM cache solved tape access issues
      • Continued success of SAM shifters
          • Often remote collaborators
          • Form 1st line of defense
      • SAMTV monitors SAM & SAM stations
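As a rough cross-check of the SAM figures on this slide, the quoted volumes imply an average event size and a sustained transfer rate. The sketch below assumes decimal units (1 PB = 10^15 bytes) and a 30-day month, neither of which is stated on the slide. The function names are illustrative only.

```python
# Back-of-envelope check of the SAM throughput figures quoted on the slide.
# Assumptions (not from the slide): decimal units and a 30-day month.

PB = 10**15  # bytes in a decimal petabyte
TB = 10**12  # bytes in a decimal terabyte


def avg_event_size_bytes(total_bytes, n_events):
    """Average size of one consumed event, in bytes."""
    return total_bytes / n_events


def sustained_rate_mb_per_s(bytes_per_month, days_per_month=30):
    """Average transfer rate (MB/s) needed to move a monthly volume."""
    seconds = days_per_month * 86400
    return bytes_per_month / seconds / 10**6


# 2.5 PB consumed as 50 billion events; up to 300 TB moved in a month.
event_size = avg_event_size_bytes(2.5 * PB, 50e9)
peak_rate = sustained_rate_mb_per_s(300 * TB)

print(f"average consumed event size: {event_size / 1000:.0f} kB")
print(f"peak month, averaged: {peak_rate:.0f} MB/s")
```

Under these assumptions the consumed data average about 50 kB per event, and the busiest month corresponds to a little over 100 MB/s sustained across all sites.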

SAMGrid
  • More than 10 DØ execution sites (http://samgrid.fnal.gov:8080/)
  • SAM – data handling
  • JIM – job submission & monitoring
  • SAM + JIM → SAM-Grid
  • http://samgrid.fnal.gov:8080/list_of_schedulers.php
  • http://samgrid.fnal.gov:8080/list_of_resources.php

Remote Production Activities – Monte Carlo - I
  • Over 75 M events produced in last year, at more than 10 sites
      • More than double the previous year’s production
  • Vast majority on shared sites
      • DOSAR major part of this
  • SAM-Grid introduced in spring 04, becoming the default
      • Based on request system and jobmanager-mc_runjob
      • MC software package retrieved via SAM, inc central farm
      • Average production efficiency ~90%
      • Average inefficiency due to grid infrastructure ~1-5% (http://www-d0.fnal.gov/computing/grid/deployment-issues.html)
  • Continued move to common tools
      • DOSAR sites continue move to SAMGrid from McFarm

Remote Production Activities – Monte Carlo - II
  • Beyond just ‘shared’ resources
      • More than 17 M events produced ‘directly’ on LCG via submission from Nikhef
      • Good example of remote site driving the ‘development’
  • Similar momentum building on/for OSG
      • Two good site examples within p17 reprocessing

Remote Production Activities – Reprocessing - I
  • After significant improvements to reconstruction, reprocess old data
  • P14, Winter 2003/04
      • 500 M events, 100 M remotely, from DST
      • Based around mc_runjob
      • Distributed computing rather than Grid
  • P17, end March to ~Oct
      • x10 larger, i.e. 1000 M events, 250 TB
      • Basically all remote
      • From raw, i.e. use of db proxy servers
      • SAM-Grid as default (using mc_runjob)
      • 3200 1 GHz PIIIs for 6 months
      • Massive activity - largest grid activity in HEP
  • http://www-d0.fnal.gov/computing/reprocessing/p17/
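The P17 numbers on this slide can be turned into a per-event cost. The sketch below is only an estimate under stated assumptions: 6 months is taken as 182.5 days, every CPU is assumed busy for the whole period (so the CPU time is an upper bound), and terabytes are decimal. None of these assumptions comes from the slide.

```python
# Rough scale of the p17 reprocessing, from the numbers on the slide.
# Assumptions (not from the slide): 6 months = 182.5 days, 100% CPU
# utilisation for the whole period, decimal terabytes.

SECONDS_PER_DAY = 86400


def cpu_seconds_per_event(n_cpus, days, n_events):
    """Upper bound on CPU time per event if all CPUs run continuously."""
    return n_cpus * days * SECONDS_PER_DAY / n_events


def bytes_per_event(total_bytes, n_events):
    """Average data volume per event."""
    return total_bytes / n_events


N_EVENTS = 1000e6  # 1000 M events

cpu_s = cpu_seconds_per_event(3200, 182.5, N_EVENTS)
size_kb = bytes_per_event(250e12, N_EVENTS) / 1e3

print(f"at most {cpu_s:.1f} CPU-seconds per event on a 1 GHz PIII")
print(f"about {size_kb:.0f} kB per event")
```

With these inputs the budget comes to roughly 50 CPU-seconds and 250 kB per event, which gives a feel for why the task was spread over many remote sites.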

Reprocessing - II
  • A grid job spawns many batch jobs
  • [Figure: job flow with stages labelled “Production” and “Merging”]
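The fan-out from one grid job to many batch jobs, followed by a merging stage, can be pictured with a small planning sketch. This is not SAM-Grid code: the chunk sizes, the file names, and the rule that each production job writes exactly one unmerged output are all assumptions made for illustration.

```python
# Illustrative only: one "grid job" over a list of input files is split into
# production batch jobs, whose outputs are then grouped into merge jobs.
# Chunk sizes and file names are made up for the example.


def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def plan_grid_job(input_files, files_per_batch_job, outputs_per_merge):
    """Return (production jobs, merge jobs) as lists of file-name lists."""
    production = chunk(input_files, files_per_batch_job)
    # Assumption: each production batch job writes one unmerged output file.
    unmerged = [f"unmerged_{i:04d}" for i in range(len(production))]
    merging = chunk(unmerged, outputs_per_merge)
    return production, merging


raw_files = [f"raw_{i:04d}" for i in range(10)]
production_jobs, merge_jobs = plan_grid_job(raw_files, 2, 3)

print(len(production_jobs), "production batch jobs")
print(len(merge_jobs), "merge batch jobs")
```

Ten input files at two per batch job give five production jobs, and their five outputs at three per merge give two merge jobs, the last one smaller than the rest.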

Reprocessing - III
  • SAMGrid provides
      • Common environment & operation scripts at each site
      • Effective book-keeping
          • SAM avoids data duplication + defines recovery jobs
          • JIM’s XML-DB used to ease bug tracing
      • Tough to deploy a product that is still evolving to new sites with limited manpower (we are a running experiment)
      • Very significant improvements in JIM (scalability) during this period
  • Certification of sites - need to check
      • SAMGrid vs usual production
      • Remote sites vs central site
      • Merged vs unmerged files
  • [Figure: certification comparison, FNAL vs SPRACE]
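The book-keeping idea on this slide, avoiding duplicated data and defining recovery jobs, comes down to set arithmetic over file lists. The sketch below is not the SAM API. The function names and file names are hypothetical and only illustrate the principle.

```python
# Sketch of book-keeping for recovery and de-duplication.
# Not the SAM API: plain set arithmetic on hypothetical file names.


def recovery_files(requested, processed):
    """Files still to be processed, kept in the original request order."""
    done = set(processed)
    return [f for f in requested if f not in done]


def new_outputs(candidate_outputs, already_stored):
    """Drop outputs that are already stored, to avoid duplicates."""
    stored = set(already_stored)
    return [f for f in candidate_outputs if f not in stored]


requested = ["raw_001", "raw_002", "raw_003", "raw_004"]
processed = ["raw_001", "raw_003"]  # e.g. two batch jobs failed

to_recover = recovery_files(requested, processed)
to_store = new_outputs(["tmb_001", "tmb_003"], ["tmb_001"])

print("recovery job input:", to_recover)
print("outputs to store:", to_store)
```

A recovery job is then simply a new job whose input is the remainder, so a failed batch job never forces the whole dataset to be rerun.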

Reprocessing - IV
  • Monitoring (illustration)
      • http://samgrid.fnal.gov:8080/cgi-bin/plot_efficiency.cgi
      • Overall efficiency and speed, or by site
      • Efficiency = number of batch jobs completing successfully
      • Speed = production speed in M events / day
  • Status – into the “end-game”
      • ~855 M events done
      • Data sets allocated, moving to ‘cleaning-up’
      • Must now push on the Monte Carlo
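The two monitored quantities can be written down as simple ratios. In the sketch below, efficiency is taken as the fraction of submitted batch jobs that complete successfully, which is an interpretation of the slide's wording. The 5 M events/day speed in the example is a hypothetical value, not a measured one.

```python
# Minimal definitions of the monitored quantities, as interpreted here.
# The example speed of 5 M events/day is hypothetical.


def job_efficiency(n_successful, n_submitted):
    """Fraction of submitted batch jobs that completed successfully."""
    if n_submitted == 0:
        return 0.0
    return n_successful / n_submitted


def production_speed(n_events, n_days):
    """Production speed in millions of events per day."""
    return n_events / n_days / 1e6


def days_remaining(total_events, done_events, speed_mevents_per_day):
    """Days needed to finish at a constant production speed."""
    return (total_events - done_events) / (speed_mevents_per_day * 1e6)


# ~855 M of 1000 M events done; assume a hypothetical 5 M events/day.
remaining = days_remaining(1000e6, 855e6, 5.0)
print(f"about {remaining:.0f} days left at 5 M events/day")
```

Tracking both numbers per site, and not only overall, is what lets a slow or failing site be spotted before it holds up the remaining data sets.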

SAM-Grid Interoperability
  • Need access to greater resources as data sets grow
  • Ongoing programme on LCG and OSG interoperability
  • Step 1 (co-existence) – use shared resources with SAM-Grid head-node
      • Widely done for both reprocessing and MC
          • OSG co-existence shown for data reprocessing
  • Step 2 – SAMGrid-LCG interface
      • SAM does data handling & JIM job submission
      • Basically forwarding mechanism
      • Prototype established at IN2P3/Wuppertal
      • Extending to production level
  • OSG activity increasing – build on LCG experience
  • Team work between core developers / sites

Looking Forward
  • Increased data sets require increased resources for MC, repro etc.
  • Route to these is increased use of grid and common tools
  • Have an ongoing joint program, but work to do..
      • Continue development of SAM-Grid
          • Automated production job submission by shifters
      • Deployment team
          • Bring in new sites in manpower-efficient manner
          • ‘Benefit’ of a new site goes well beyond a ‘cpu’ count – we appreciate / value this
      • Full interoperability
          • Ability to access efficiently all shared resources
  • Additional resources for above recommended by Taskforce

Conclusions
  • Computing model continues to be successful
      • Based around grid-like computing, using common tools
      • Key part of this is the production computing – MC and reprocessing
  • Significant advances this year:
      • Continued migration to common tools
      • Progress on interoperability, both LCG and OSG
          • Two reprocessing sites operating under OSG
      • P17 reprocessing – a tremendous success
          • Strongly praised by Review Committee
  • DOSAR major part of this
      • More ‘general’ contribution also strongly acknowledged
      • Thank you
  • Let’s all keep up the good work

Back-up

Terms
  • Tevatron
      • Approx equiv challenge to LHC in “today’s” money
      • Running experiments
  • SAM (Sequential Access via Metadata)
      • Well developed metadata and distributed data replication system
      • Originally developed by DØ & FNAL-CD
  • JIM (Job and Information Management)
      • Handles job submission and monitoring (all but data handling)
      • SAM + JIM → SAM-Grid – computational grid
  • Tools
      • Runjob - handles job workflow management
      • dØtools - user interface for job submission
      • dØrte - specification of runtime needs

Reminder of Data Flow
  • Data acquisition (raw data in evpack format)
      • Currently limited to 50 Hz Level-3 accept rate
      • Request increase to 100 Hz, as planned for Run IIb – see later
  • Reconstruction (tmb/DST in evpack format)
      • Additional information in tmb → tmb++ (DST format stopped)
      • Sufficient for ‘complex’ corrections, inc track fitting
  • Fixing (tmb in evpack format)
      • Improvements / corrections coming after cut of production release
      • Centrally performed
  • Skimming (tmb in evpack format)
      • Centralised event streaming based on reconstructed physics objects
      • Selection procedures regularly improved
  • Analysis (out: root histogram)
      • Common root-based Analysis Format (CAF) introduced in last year
      • tmb format remains

Remote Production Activities – Monte Carlo

The Good and Bad of the Grid
  • Only viable way to go…
  • Increase in resources (cpu and potentially manpower)
      • Work with, not against, LHC
      • Still limited
BUT
  • Need to conform to standards – dependence on others..
  • Long term solutions must be favoured over short term idiosyncratic convenience
      • Or won’t be able to maintain adequate resources
  • Must maintain production level service (papers), while increasing functionality
      • As transparent as possible to non-expert