
DZero On OSG: Site And Application Validation
Parag Mhashilkar, Fermi National Accelerator Laboratory
July 26, 2007

Overview
- DZero & Samgrid
- Samgrid – OSG Job Forwarding
- DZero P20 Reprocessing
- Steps involved in starting Production On A Site
- Problems faced
  - New sites
  - Sites in steady state operation
  - Shortcomings of the infrastructure
- Improving efficiency of running jobs on OSG
- Using OSG resources beyond P20 reprocessing

DZero & Samgrid
- SAMGrid (JIM + SAM)
  - DZero's way of using computing resources on the grid.
  - Job Handling: Job and Information Management (JIM)
  - Data Handling: Sequential Access via Metadata (SAM)
- Applications supported over the Grid
  - Monte Carlo
  - Reprocessing
  - Refixing
  - Skimming (Beta Testing)
- Computing Elements
  - Native Samgrid execution sites
  - OSG forwarding node(s)
  - LCG forwarding node(s)
- Storage Elements
  - SAM SE
  - SE with SRM interfaces

Samgrid – OSG Job Forwarding
[Diagram: flow of Samgrid jobs and flow of local jobs between the components listed below; a third legend entry marks "Offers Services".]
- Samgrid client, submission, broker: d0mino0x.fnal.gov
- SAM-Grid / OSG Forwarding Node (job forwarding): d0srv015.fnal.gov, d0srv047.fnal.gov, d0srv066.fnal.gov
- OSG Sites: Fermilab, USCMS Farm, Oklahoma University, Indiana University, University of Nebraska, …
- SAM Services
- OSG Station: osg-ouhep on d0srv047.fnal.gov
- Storage Elements
  - SAM SE: ouhep00.nhn.ou.edu, d0srv015.fnal.gov, d0rsam01.fnal.gov, d0srv071.fnal.gov
  - SAM-SRM SE: UNL, SPRACE, UW Madison
- Durable Location: ouhep00.nhn.ou.edu, d0srv063.fnal.gov, d0srv065.fnal.gov

DZero P20 Reprocessing
- Production on OSG for the first time on such a large scale.
- Process 75 TB (500 million events) of raw data in ~4 months.
- 40 TB of output stored in SAM SE at FNAL.
- Computing Resources:
  - 12 OSG sites (FNGP Farm used for merging and not listed in the graph)
  - 2 Samgrid sites (CCIN2P3, WESTGRID)
  - 3 LCG sites (MANCHSTR, LANCASTR, CLERMONT)
- Resource Utilization: ~1200 jobs running with ~1500 idle on OSG sites
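The reprocessing figures on this slide imply average rates that are easy to work out. The sketch below is only a back-of-the-envelope calculation; the conversions (1 TB = 10**12 bytes, 4 months taken as 120 days) are assumptions, not numbers from the talk.

```python
# Back-of-the-envelope rates from the slide's figures.
# Assumptions (not from the source): 1 TB = 10**12 bytes, 4 months ~ 120 days.
RAW_BYTES = 75 * 10**12      # 75 TB of raw data
EVENTS = 500 * 10**6         # 500 million events
DAYS = 120                   # "~4 months", rounded

bytes_per_event = RAW_BYTES / EVENTS
events_per_day = EVENTS / DAYS
tb_per_day = RAW_BYTES / 10**12 / DAYS

print(f"average raw event size: {bytes_per_event / 1000:.0f} kB")
print(f"events per day:         {events_per_day / 1e6:.1f} million")
print(f"raw data per day:       {tb_per_day:.3f} TB")
```

Under these assumptions the average raw event is about 150 kB, and the sustained rate is roughly 4.2 million events (0.625 TB) per day across all sites.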

Starting Production On A Site: Steps
- A site needs to be certified before it is considered for production.
- A site should satisfy the following requirements to run DZero jobs:
  - Worker nodes have outgoing network access.
  - Worker nodes have at least 6-8 GB of local storage.
- Certification
  - Means to verify the quality of data produced at a site.
  - Certification jobs are production jobs run with test options.
  - Results from certification runs are compared with well-known results by DZero experts.
  - Certification could take from a few days to a couple of weeks.
  - Certification jobs are run only once per site for major changes to the experiment binaries.
- Considerations
  - Since the certification is fairly time consuming, it is preferable to have bigger sites rather than smaller sites.
  - Considering the amount of data moved between the sites hosting storage elements and Fermilab, sites with good network connectivity are preferred.
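The two worker-node requirements above (outgoing network access, 6-8 GB of local storage) could be probed before submitting certification jobs. The following is a minimal sketch and not part of the Samgrid tooling; the function names, the default path, and the idea of testing connectivity with a single TCP connection are all assumptions made for illustration.

```python
import shutil
import socket

MIN_LOCAL_GB = 6  # lower end of the 6-8 GB range quoted on the slide


def free_local_gb(path="/tmp"):
    """Free space under `path` in GB (10**9 bytes). The path is a placeholder."""
    return shutil.disk_usage(path).free / 10**9


def has_outgoing_access(host, port, timeout=5.0):
    """True if a TCP connection to (host, port) can be opened from this node."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unmet_requirements(free_gb, outbound_ok, min_gb=MIN_LOCAL_GB):
    """Return a list of unmet requirements; an empty list means the node passes."""
    problems = []
    if not outbound_ok:
        problems.append("no outgoing network access")
    if free_gb < min_gb:
        problems.append(f"only {free_gb:.1f} GB local storage, need {min_gb}")
    return problems
```

A probe job would call `unmet_requirements(free_local_gb(scratch_dir), has_outgoing_access(host, port))` with a scratch directory and an endpoint chosen by the VO. Keeping the decision in a pure function makes it testable without network access.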

Problems: New Site
- New site supports the DZero VO, but …
  - VO users are not authenticated (no mapping).
  - The account that users are mapped to does not exist.
  - The account exists, but the home directory does not.
- Site has enough scratch space on the worker nodes, but this space is not local. Example: NERSC
  - Made changes to the Samgrid infrastructure to support this.
  - The amount of I/O activity affected performance if several jobs started, or were involved in I/O, at the same time.
- Worker nodes do not have good (or do not see effective) bandwidth to Fermilab.
  - Use an on-site SRM SE to supply data to the worker nodes. Example: UNL
  - Use $OSG_APP to hold some pre-uploaded files. Example: LONI (LTU)
- Resolution
  - In the past: talk to the site administrators. Not scalable as the number of sites increases.
  - Now:
    - Open a GOC ticket.
    - Involve the 'Troubleshooting Task Force'.
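Two of the new-site failures above (missing account, missing home directory) can be detected on the site itself with the standard Unix account database. This is only a sketch for Unix-like systems; it does not check the grid certificate mapping, and the function name is made up for illustration.

```python
import os
import pwd  # Unix-only standard library module


def account_problems(username):
    """Return a list of problems with the local account a VO user maps to."""
    try:
        entry = pwd.getpwnam(username)
    except KeyError:
        # Matches "the account that users are mapped to does not exist".
        return [f"account '{username}' does not exist"]
    problems = []
    if not os.path.isdir(entry.pw_dir):
        # Matches "the account exists, but the home directory does not".
        problems.append(f"home directory '{entry.pw_dir}' does not exist")
    return problems
```

An empty list means both checks passed for that account.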

Problems: Steady State Operation
- Random authentication/authorization failures.
  - Authorization policy on the site changes.
  - Service downtime/crashes.
- Reporting site maintenance schedule.
  - No initial notice given to the users.
  - Users find out only after their jobs crash.
- Cleanup of scratch space on the worker nodes
  - VO jobs exiting normally should do the cleanup.
  - If the job is killed because of site policies, who should do the cleanup?
    - VO jobs: the job has already ended and does not have any control.
    - Cleanup tools run by the site: are these tools available through the OSG stack? Any attempts to standardize them?
- Discrepancy in the job status reported
  - Globus reports the job is done, Condor-G reports the job is idle. Affects production.
  - Troubleshooting team is investigating the problem.
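The scratch cleanup question can be partly addressed inside the job wrapper. The sketch below is an assumption about how a wrapper might behave, not the Samgrid implementation: it removes the scratch directory on normal exit and on exceptions, and optionally converts SIGTERM into an exception so the same cleanup runs. A job killed with SIGKILL cannot clean up after itself, which is why site-side tools are still needed.

```python
import os
import shutil
import signal
import tempfile


class JobTerminated(Exception):
    """Raised when the batch system asks the job to stop (SIGTERM)."""


def _on_sigterm(signum, frame):
    raise JobTerminated(f"received signal {signum}")


def install_sigterm_handler():
    """Turn SIGTERM into an exception so `finally` blocks still run.

    Must be called from the main thread. SIGKILL cannot be intercepted.
    """
    signal.signal(signal.SIGTERM, _on_sigterm)


def run_in_scratch(work, base_dir=None):
    """Run work(scratch_dir) in a fresh directory and always remove it afterwards."""
    scratch = tempfile.mkdtemp(prefix="d0job_", dir=base_dir)  # prefix is arbitrary
    try:
        return work(scratch)
    finally:
        shutil.rmtree(scratch, ignore_errors=True)
```

A wrapper would call `install_sigterm_handler()` once at startup and then run the payload through `run_in_scratch`.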

Problems: Shortcomings of the Infrastructure …
These problems are not necessarily OSG-specific problems but challenges we faced during our first massive production run on OSG.
- Ticketing system
  - In the initial phase of this activity, turnaround time from GOC was considerably high. This improved over time. Thanks!
- Supplying data to thousands of worker nodes
  - Resolved by adding more SAM/SRM SEs.
  - Implemented queues for data transfers to categorize transfers based on network type (LAN vs. WAN) and type of data.
  - Use local storage whenever possible and available.
    - Relies on worker nodes configured with domain names. Not all sites have worker nodes with FQDN.
- Resource Selection Service
  - Not all sites advertise to the OSG ReSS, making resource selection difficult.
  - Without ReSS, automation is a challenge. Results in either under-utilization or over-utilization of the resources.
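The dependence on fully qualified domain names can be illustrated with a small sketch of LAN/WAN classification. This is a hypothetical simplification, not the Samgrid queueing code: it treats two hosts as being on the same LAN when everything after the first hostname label matches, and it falls back to WAN when the worker node reports a short name. The worker hostnames in the usage note are invented.

```python
def domain_of(hostname):
    """Everything after the first label, or None for a short (non-FQDN) name."""
    _, sep, domain = hostname.partition(".")
    return domain.lower() if sep and domain else None


def network_type(worker_host, storage_host):
    """Classify a transfer as 'LAN' or 'WAN' by comparing host domains."""
    worker_domain = domain_of(worker_host)
    if worker_domain is None:
        # Without an FQDN the site cannot be identified; assume the slow path.
        return "WAN"
    return "LAN" if worker_domain == domain_of(storage_host) else "WAN"


def queue_name(worker_host, storage_host, data_type):
    """Pick a transfer queue from network type and type of data."""
    return f"{network_type(worker_host, storage_host)}:{data_type}"
```

For example, `queue_name("node12.nhn.ou.edu", "ouhep00.nhn.ou.edu", "raw")` selects a LAN queue, while a worker that only reports `node12` is always routed to a WAN queue, even if a local storage element exists.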

… Problems: Shortcomings of the Infrastructure
- Lack of Monitoring Service
  - Monalisa was very useful in getting a snapshot of the system, such as the number of idle/running jobs at sites. Monalisa is not supported any more.
  - Gratia takes a VO-centric approach to determine the success and failure of the jobs. Job exit status does not fully represent failure or success.
  - DZero jobs are successful if processed data makes it to SAM. OSG does not have good means/services for reporting such VO-specific metrics. Enhancing Gratia to allow a VO-specific plug-in to measure the success of the jobs could be a solution.
  - DZero had to develop a lot of in-house monitoring to overcome this shortage (via DZero-specific XML databases).
- Wider availability of SRM storage on the OSG
  - Sites often offer NFS-like (POSIX) storage, but the lack of built-in load protection makes it dangerous to use.
  - DZero had a good experience with new products on the market (Panasas at LONI). These systems are costly and not widespread solutions.
  - SRM storage would be a valuable alternative.
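The gap between exit status and the DZero definition of success can be sketched as a pluggable check. This is purely illustrative and is not the Gratia API: the record type, the function names, and the `is_stored_in_sam` callable (standing in for a real SAM lookup) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRecord:
    job_id: str
    exit_status: int
    output_file: str


def success_by_exit_status(job):
    """Generic metric: a job succeeded if it exited with status 0."""
    return job.exit_status == 0


def make_sam_success_check(is_stored_in_sam):
    """Build a VO-specific metric: success means the output reached SAM.

    `is_stored_in_sam` is any callable taking a file name and returning bool.
    """
    def check(job):
        return is_stored_in_sam(job.output_file)
    return check


def success_rate(jobs, check):
    """Fraction of jobs that pass `check`; 0.0 for an empty list."""
    jobs = list(jobs)
    if not jobs:
        return 0.0
    return sum(1 for job in jobs if check(job)) / len(jobs)
```

With the same list of jobs, `success_rate(jobs, success_by_exit_status)` and `success_rate(jobs, make_sam_success_check(lookup))` can differ, which is the discrepancy the slide describes.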

Improving the efficiency on OSG
[Plots: "Initial weeks of production" and "Failure analysis after rigorous troubleshooting with the help of 'Troubleshooting Task Force'"]
Success/failure analysis based on the log file size for jobs:

Color   Range       Meaning
blue    0-8k        Worker node incompatibility, lost standard output, OSG no assign
aqua    8k-25k      Forwarding node crash, service failure, could not start bootstrap executable
pink    25k-80k     SAM problem. Could not get RTE, possibly raw files
red     80k-160k    SAM problem. Could not get raw files, possibly RTE
gray    160k-250k   Possible D0 runtime crash
green   >250k       OK
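The log-size heuristic on this slide amounts to a threshold lookup. The sketch below is one way to express it; it assumes "k" means 1024 bytes and that each upper bound is inclusive (so exactly 250k is still "gray"), neither of which the slide states.

```python
from bisect import bisect_left

K = 1024  # assumption: "k" on the slide means 1024 bytes

# Upper bounds of the first five ranges; anything larger is "green".
UPPER_BOUNDS = [8 * K, 25 * K, 80 * K, 160 * K, 250 * K]

CATEGORIES = [
    ("blue", "Worker node incompatibility, lost standard output, OSG no assign"),
    ("aqua", "Forwarding node crash, service failure, could not start bootstrap executable"),
    ("pink", "SAM problem: could not get RTE, possibly raw files"),
    ("red", "SAM problem: could not get raw files, possibly RTE"),
    ("gray", "Possible D0 runtime crash"),
    ("green", "OK"),
]


def classify_log_size(size_bytes):
    """Map a job's log file size to the (color, meaning) pair from the slide."""
    if size_bytes < 0:
        raise ValueError("log size cannot be negative")
    return CATEGORIES[bisect_left(UPPER_BOUNDS, size_bytes)]
```

For a finished job, `classify_log_size(os.path.getsize(log_path))` would give a first guess at the failure class before anyone opens the log.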

Using OSG Resources Beyond P20 Reprocessing
- Monte Carlo on OSG
  - Certification and site validation policy same as that of reprocessing.
  - MC production on OSG sites ramping up. More OSG sites added to the list of certified sites.
  - Overall production of 7.7M MC events in the week (Jul 16 – Jul 22), with OSG production setting a weekly record of 3.1M.
- FNAL Farm is now a part of Fermigrid and will be used to do primary processing. Under testing phase.
- Using OSG resources for running other job types like skimming, CAF tree production, etc.
- Doing analysis on the Grid.
- …

Questions?