

  • Number of slides: 24

ATLAS Grid Computing in the Real World
Taipei, 27th April 2005
Roger Jones, ATLAS International Computing Board Chair, GridPP Applications Co-ordinator

2 Overview
§ The ATLAS Computing Model
  § Distribution of data and jobs, roles of sites
  § Required Resources
§ Testing the Model
  § See Gilbert Poulard's Data Challenge talk next
§ Observations about the Real World (taken from ATLAS DC experiences and those of GridPP)
  § Matching the global view with local expectations
  § Sharing with other user communities
  § Working across several Grid deployments
  § Resources required > resources available?
RWL Jones, Lancaster University

3 Complexity of the Problem
Major challenges associated with:
§ Communication and collaboration at a distance (collaborative tools)
§ Distributed computing resources
§ Remote software development and physics analysis

4 ATLAS is not one experiment
Extra Dimensions, Higgs, Heavy Ion Physics, SUSY, Electroweak, B physics, QCD

5 Computing Resources
§ Computing Model well evolved
  § Externally reviewed in January 2005
§ There are (and will remain for some time) many unknowns
  § Calibration and alignment strategy is still evolving
  § Physics data access patterns may only now start to be exercised
    ¨ Unlikely to know the real patterns until 2007/2008!
  § Still uncertainties on the event sizes
§ If there is a problem with resources, e.g. disk, the model will have to change
§ Lesson from the previous round of experiments at CERN (LEP, 1989-2000)
  § Reviews in 1988 underestimated the computing requirements by an order of magnitude!

7 A Tiered Grid — The LHC Computing Facility
[Diagram: the CERN Tier-0 (with a Tier-1-scale analysis facility, T1AF) feeds ~10 Tier-1 centres (Brookhaven USA, ASCC Taipei, UK, France, Italy, NL, Germany, …), which in turn feed ~30 Tier-2 federations (SouthGrid, LondonGrid, ScotGrid, NorthGrid, …), down to physics-department clusters and desktops.]

8 Processing Roles
§ Tier-0:
  § First pass processing on express/calibration physics stream
  § 24-48 hours later, process full physics data stream with reasonable calibrations
  § Curate the full RAW data and curate the first pass processing
  à Getting good calibrations in a short time is a challenge
  à There will be large data movement from T0 to T1s
§ Tier-1s:
  § Reprocess 1-2 months after arrival with better calibrations
  § Curate and reprocess all resident RAW at year end with improved calibration and software
  § Curate and allow access to all the reprocessed data sets
  à There will be large data movement from T1 to T1 and significant data movement from T1 to T2
§ Tier-2s:
  § Simulation will take place at T2s
  § Simulated data stored at T1s
  à There will be significant data movement from T2s to T1s
  § Need partnerships to plan networking
  § Must have fail-over to other sites

9 Analysis Roles
Analysis model broken into two components
§ Scheduled central production of augmented AOD, tuples & TAG collections from ESD
  à Done at T1s
  à Derived files moved to other T1s and to T2s
§ Chaotic user analysis of (shared) augmented AOD streams, tuples, new selections etc., and individual user simulation and CPU-bound tasks matching the official MC production
  à Done at T2s and local facility
  à Modest job traffic between T2s

10 Required CERN Resources
• Includes Heavy Ions data
• No replacements included

11 Required Non-CERN Resources
The growth is challenging!

12 Networking – CERN to T1s
§ The ATLAS Tier-1s will be: ASCC, CCIN2P3-Lyon, RAL, NIKHEF, FZK Karlsruhe, BNL, PIC, NorduGrid, CNAF, TRIUMF
§ They vary in size!
§ Traffic from T0 to the average Tier-1 is ~75 MB/s raw
§ With LCG headroom, efficiency and recovery factors for service challenges this is 3.5 Gbps
§ Most ATLAS T1s are shared with other experiments, so aggregate bandwidth & contention are larger

13 Networking – T1 to T1
§ Significant traffic of ESD (full reconstructed data) and AOD (analysis summary data) from reprocessing between T1s
§ 125 MB/s raw for the average T1 (pp only)
§ 5-6 Gbps for the average T1 after usual factors

14 Networking and Tier-2s
§ Tier-2 to Tier-1 networking requirements far more uncertain
§ Without job traffic, ~17.5 MB/s for the 'average' T2
§ ~850 Mbps required?
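The bandwidth figures on the last three slides follow the same arithmetic: a raw rate in MB/s converted to Gbps and scaled by the combined headroom/efficiency/recovery allowance. A minimal sketch in Python — the combined factor of ~6 is an assumption chosen to reproduce the quoted totals, not a number stated on the slides:

```python
def provisioned_gbps(raw_mb_per_s, factor=6.0):
    """Convert a raw transfer rate to a provisioned link speed.

    `factor` bundles the LCG headroom, efficiency and recovery
    allowances; ~6 is an assumed value that reproduces the slides.
    """
    raw_gbps = raw_mb_per_s * 8 / 1000.0   # MB/s -> Gbps
    return raw_gbps * factor

print(provisioned_gbps(75.0))    # T0 -> T1: 3.6 Gbps  (quoted ~3.5 Gbps)
print(provisioned_gbps(125.0))   # T1 -> T1: 6.0 Gbps  (quoted 5-6 Gbps)
print(provisioned_gbps(17.5))    # T2:       0.84 Gbps (quoted ~850 Mbps)
```

The same factor reproduces all three quoted numbers to within rounding, which suggests the slides applied one common set of provisioning allowances.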

15 Issues
§ The following issues are mainly, but not exclusively, derived from LCG-2 experience
§ No-one should feel they are being criticised
  § Much help given and appreciated
§ It is acknowledged that new tools are coming that address some of the issues
§ LCG is a major resource for ATLAS, doing serious work

16 Grid Projects
Until deployments provide interoperability, the experiments must provide it themselves
ATLAS must span 3 major Grid deployments

18 ATLAS Production System
A big problem is data management
§ The Data Management System must cope with >= 3 Grid catalogues
§ Demands even greater for analysis
[Diagram: Windmill supervisors communicate via jabber/SOAP with per-Grid executors — Lexor and Dulcinea (LCG), Capone (Grid3), an NG executor (NorduGrid) and an LSF executor — backed by the ProdDB and AMI databases, per-Grid RLS replica catalogues, and the Don Quijote data management system.]

19 Data Management
§ Multiple deployments currently means multiple file catalogues
  § ATLAS builds a meta-catalogue on top to join things up
  § Convergence would really help!
§ Existing data movement tools are not sophisticated in all three deployments
  § Anything beyond a single point-to-point movement has required user-written tools, e.g. Don Quijote, PhEDEx…
§ Data transfer scheduling is vital for large transfers
  § Should also take care of data integrity and retries
  § File transfer service tools are emerging from deployments but…
  § Tools must exist in and/or span all three Grids
§ Other observations
  § Existing DM tools are too fragile: they don't have timeouts (so can hang a node) or retries, and are not resilient against information system glitches
  § They must also handle race conditions if there are multiple writes to a file
  § Catalogue failures and inconsistencies need to be well-handled
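The robustness gaps called out above (no timeouts, no retries, hangs on glitches) can be closed with a thin wrapper around whatever transfer mechanism a site uses. A hypothetical sketch — `with_retries`, the linear back-off and the callable-based interface are all illustrative, not part of any ATLAS tool:

```python
import time

def with_retries(transfer, retries=3, backoff=1.0, sleep=time.sleep):
    """Run `transfer` (a callable returning True on success) with the
    retries and back-off that the slides note existing DM tools lack.
    Exceptions are treated as failed attempts rather than being allowed
    to hang or kill the node."""
    for attempt in range(1, retries + 1):
        try:
            if transfer():
                return True
        except Exception:
            pass                      # e.g. an information-system glitch
        if attempt < retries:
            sleep(backoff * attempt)  # linear back-off between attempts
    return False                      # caller decides: re-queue, flag the replica, …
```

A production version would additionally impose a per-attempt timeout and verify checksums on success, covering the data-integrity point above.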

20 Workload Management
§ Resource brokerage is still an important idea when serving multiple VOs
§ ATLAS has tried to use the Resource Broker 'as intended'
  § However, submission to the Resource Broker in LCG is slow (although much improved thanks to work on both sides)
  § Bulk job submission is needed in cases of shared input
§ Fair usage policies need to be clearly defined and enforced
  § Example: using a dummy job to obtain slots then pulling in a real job is very smart and helps get around RB submission problems
  § But is it reasonable? Pool-table analogy…
§ Ranking is an art, not a science!
  § Default LCG-2 ranking is not very good!
  § Matchmaking with input files is not optimal
    ¨ Would want to add information on replicas in the site ranking
    ¨ Currently, if input files are specified, sites with at least one replica of any of them are always preferred, even if they have much less CPU than sites with no replicas
    ¨ Important to know if a file is on tape or disk
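The replica-trumps-CPU problem described in the last bullets suggests a rank that blends the two signals, so that a single cached file cannot dominate. A hypothetical sketch — the function and its weighting are illustrative, not the LCG-2 matchmaker:

```python
def rank_site(free_cpus, local_replicas, files_needed, locality_bonus=0.5):
    """Score a site for a job needing `files_needed` input files.

    Data locality scales the CPU score up by at most `locality_bonus`,
    so one resident replica can no longer beat a far larger CPU pool,
    while full locality still wins between comparable sites."""
    locality = local_replicas / files_needed if files_needed else 0.0
    return free_cpus * (1.0 + locality_bonus * locality)
```

A fuller version would also down-weight replicas that sit on tape rather than disk, as the final bullet asks, since staging latency can dwarf the transfer cost.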

21 Workload Management (2)
§ At present, we are exploiting LCG resources both with and without using the RB
§ Condor-G job submission interfaced to LCG (see Rod Walker's talk)
  ✓ Submission faster, destination chosen by VO
  ✗ No central logging and bookkeeping services
§ Choosing suitable sites for the BDII is also an art
  § Requiring the test suite to succeed every day is too restrictive (3rd-party failures etc.)
  § White-listing helps, but loses responsiveness to genuine failures
  § Current approach is labour-intensive and not scalable

22 User Interaction
§ Running jobs
  § User needs global query commands for all her submitted jobs
  § Really need to be able to tag jobs into groups
  § Needs access to STDOUT/STDERR while the job is running
§ Information system
  § Should span Grids!
  § Should be accurate (e.g. evaluate free CPUs correctly)
  § Should be robust
  § Should have a better user interface than ldapsearch!
§ Debugging
  § When a job fails, the information is not sufficient
  § Logging info is unfriendly and incomplete
  § Log files on the RB and CE are not accessible to the ordinary user and are difficult to access by the support team
  § There is huge scope for development here!
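The global query and job-group requests above amount to per-user bookkeeping keyed by user-chosen tags. A minimal sketch of the idea — the class and method names are hypothetical, not an existing Grid interface:

```python
class JobBook:
    """Tag submitted jobs into groups and query them globally,
    regardless of which Grid each job actually landed on."""

    def __init__(self):
        self._jobs = {}                       # job_id -> [group, status]

    def submit(self, job_id, group):
        self._jobs[job_id] = [group, "submitted"]

    def update(self, job_id, status):
        self._jobs[job_id][1] = status

    def query(self, group=None):
        """All of my jobs, or only those tagged with `group`."""
        return {jid: status for jid, (g, status) in self._jobs.items()
                if group is None or g == group}
```

The key design point is that the grouping lives with the user's bookkeeping, not with any one Grid's middleware, which is what lets the query span deployments.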

23 Site Issues
§ Sysadmin education and communication
  § Many people need to be trained
    ¨ Example: storage element usage
    ¨ Site defaults to 'permanent', but many sysadmins think storage is effectively a scratch area
  § Need clear lines of communication to VOs and to Grid Operations
  § Unannounced down times! Policies!
§ Site misconfiguration
  § It is the cause of most problems
  § Debugging site problems is very time-consuming
  § Common misconfigurations:
    ¨ Wrong information published to the information system
    ¨ Worker Node disk space becomes full
    ¨ Incorrect firewall settings, GLOBUS_TCP_RANGE, ssh keys not synchronized
    ¨ NFS stale mounts, etc.
    ¨ Incorrect configuration of the experiment software area
    ¨ Bad middleware installation
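Several of the misconfigurations listed (full worker-node disks, broken software areas) are mechanically checkable, which is one way to make site debugging less time-consuming. A hedged sketch of such probes — the function names, thresholds and paths are illustrative, not any deployment's actual test suite:

```python
import os
import shutil

def check_disk(path, min_free_gb=5.0):
    """True if `path` has at least `min_free_gb` GB free -- catches the
    'Worker Node disk space becomes full' failure before jobs hit it."""
    return shutil.disk_usage(path).free / 1e9 >= min_free_gb

def check_software_area(path):
    """True if the experiment software area exists and is readable."""
    return os.path.isdir(path) and os.access(path, os.R_OK)
```

Running such probes from the batch system before accepting jobs would catch two of the common failures above without any manual debugging session.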

24 Reality check
§ The requirement is not completely matched by the current pledges
§ The table presents ATLAS's estimate of the lower bound on the Tier-1 resources available
§ Recent indications suggest the CPU shortage may be met, but the storage remains a problem
  § Recovery strategies reduce access to data
  § The disk shortage is serious

Snapshot of 2008 Tier-1 status (summary over Tier-1s, 2008 ATLAS split):

           CPU (kSI2K)   Disk (TBytes)   Tape (PBytes)
Offered    10076         20397           10.29
Required   15500         27000           10.1
Balance    -35%          -24%            +3%
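The balance row of the table is just offered/required minus one; recomputing it from the offered and required figures confirms the CPU and disk entries (the tape balance comes out nearer +2% than the quoted 3%, presumably a rounding difference in the source). A quick check in Python:

```python
offered  = {"cpu_ksi2k": 10076, "disk_tb": 20397, "tape_pb": 10.29}
required = {"cpu_ksi2k": 15500, "disk_tb": 27000, "tape_pb": 10.1}

# Balance as a percentage of the requirement, as in the slide's table.
balance = {k: round(100.0 * (offered[k] / required[k] - 1.0))
           for k in offered}
print(balance)  # CPU -35%, disk -24%, tape ~+2%
```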

25 Conclusions
§ The Grid is the only practical way to function as a world-wide collaboration
§ First experiences have inevitably had problems, but we have real delivery
  § Slower than desirable
  § Problems of coherence
  § But a PhD student can submit 2500 simulation jobs at once and have them complete within 48 hours with a 95.3% success rate
    § (all failures were /nfs not accessible)
§ Real tests of the computing models continue
  § Analysis on the Grid is still to be seriously demonstrated
  § Calibration and alignment procedures need to be brought in
§ Resources are an issue
  § There are shortfalls with respect to the initial requirements
  § The requirements grow with time
  § This may be difficult for Tier-2s especially

26 A Final Thought
"It is amazing what can be achieved when you do not care who will get the credit" — Harry S. Truman