
ATLAS Computing Model
Tutorial Atlas CC, Lyon, 5-3-2007
Ghita Rahal, CC-IN2P3
5/3/2007 Atlas Tutorial

Guidelines
• Atlas Computing Model
• Atlas Data management
• Atlas tests
• MC Production

ATLAS Tier-1 Data Flow (2008)
[Diagram of the Tier-1 and its links to Tier-0, the other Tier-1s and the Tier-2s cloud. Tier-1 elements shown: mass storage, disk buffer, processor farm, disk storage (ESD, AOD). Annotation: reprocessing(s), a month later and beyond.]
Quoted figures: RAW 1.6 MB/event, 320 MB/s; ESD 1 MB/event, 200 MB/s; AOD 0.1 MB/event, 20 MB/s.
Other flow labels on the diagram: 43 MB/s; ESD 20 MB/s; AOD + ESD 51+20 MB/s.
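The three quoted throughputs are consistent with event size multiplied by one common event rate. The following is a minimal Python sketch of that arithmetic, using only the figures quoted on the slide. The event rate is not stated on the slide; it is inferred here by dividing each throughput by its event size, and the function names are illustrative.

```python
# Event sizes (MB/event) and throughputs (MB/s) as quoted on the slide.
EVENT_SIZE_MB = {"RAW": 1.6, "ESD": 1.0, "AOD": 0.1}
QUOTED_RATE_MB_S = {"RAW": 320.0, "ESD": 200.0, "AOD": 20.0}


def implied_event_rate_hz(fmt: str) -> float:
    """Event rate implied by a quoted throughput and event size (assumption:
    throughput = event size x event rate)."""
    return QUOTED_RATE_MB_S[fmt] / EVENT_SIZE_MB[fmt]


def throughput_mb_s(fmt: str, event_rate_hz: float) -> float:
    """Throughput for a data format at a given event rate."""
    return EVENT_SIZE_MB[fmt] * event_rate_hz


if __name__ == "__main__":
    for fmt in EVENT_SIZE_MB:
        # round() hides floating-point noise in the division
        print(fmt, round(implied_event_rate_hz(fmt)), "Hz")
```

All three formats give the same implied rate, which is why the RAW, ESD and AOD throughputs scale exactly with their event sizes.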

Data Flow: Monte Carlo Production and User's Analysis
[Diagram of the Tier-1 and its links to the other Tier-1s and the Tier-2s cloud. Labels shown: User's Analysis, mass storage, disk buffer, processor farm, disk storage (ESD, AOD), MC HITS, RDO, MC AOD.]

Situation in 2006-2007
• Previous slides: how Atlas is expected to run once LHC data starts flowing.
• Before LHC running, i.e. before Nov 2007:
– Monte-Carlo is processed at T1s and Tier-2s
– All Data Production and Distribution tests are exercised with MC

Guidelines
How to apply the ATLAS Computing Model? Atlas Data management

ATLAS Data Management
• Atlas uses 3 grids: LCG, OSG and NorduGrid, each with its own services. This requires an ATLAS layer over the Grid middleware.
• Atlas model of computing and data distribution:
– Storage capacity spread over the T1 sites: different storage systems with different access technologies.
– Computing power distributed over all Tiers (1, 2, 3) to produce MC and process data
• The tool to distribute the data must:
– Allow high-performance and reliable data movement
– Include information about data location and replication
– Support multiple grid flavours.
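The second requirement, keeping track of where data lives and how it is replicated, can be pictured as a catalog mapping each dataset to the sites that hold a copy. The sketch below is purely illustrative: `ReplicaCatalog`, its methods, and the dataset and site names are hypothetical and do not describe the interface of the ATLAS distribution tool.

```python
from collections import defaultdict


class ReplicaCatalog:
    """Toy catalog (hypothetical): dataset name -> set of sites holding a replica."""

    def __init__(self):
        self._replicas = defaultdict(set)

    def register(self, dataset, site):
        """Record that `site` holds a replica of `dataset`."""
        self._replicas[dataset].add(site)

    def locations(self, dataset):
        """Sorted list of sites known to hold `dataset` (empty if unknown)."""
        return sorted(self._replicas.get(dataset, set()))

    def missing_at(self, dataset, wanted_sites):
        """Sites from `wanted_sites` that do not yet hold `dataset`."""
        have = self._replicas.get(dataset, set())
        return sorted(set(wanted_sites) - have)


if __name__ == "__main__":
    catalog = ReplicaCatalog()
    catalog.register("example.dataset.AOD", "SITE_A")  # hypothetical names
    print(catalog.missing_at("example.dataset.AOD", ["SITE_A", "SITE_B"]))
```

A transfer service built on such a catalog would only need to schedule copies to the sites returned by `missing_at`, which is one simple way to combine location information with replication.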

ATLAS Distribution tool: DDM
Stephane's talk

Guidelines
Exercising the Model and preparing for real Data: Atlas Tests

Tests TIER-1 Lyon
• CSC (Computer System Commissioning): setup of tests and milestones.
• Performance and functional tests of data transfers T0 to T1 and T1 to T2
– June-July 2006, September-October 2006
– Going on in 2007 (March 2007…)
Goal: get a stable and efficient system of data distribution.
• New in 2007: CDR (Computing Dress Rehearsal) to exercise the full Atlas Data Model

Performance tests T0=>T1, July 2006
• Almost reached the goal for a few hours
• Problems from various sides (availability of the sites, of the services, access to the catalogs…)

Atlas Performance tests T1=>T2
• ATLAS: continuous transfer from T1 to T2 sites, initiated by the Tier-1 (July 2006)

Atlas Performance tests T1=>T2, July 2006
• Transfers to 7 sites, T2 and non-T2, simultaneously
• Some bandwidth limitations for simultaneous transfers

Performance tests T0=>T1, October 2006
• Overall weaker throughput due to simultaneous multi-VO tests
• Some drops understood (Castor) but most not

Multi-VO tests
• 2-day tests involving multiple VOs
• Generate data at Tier-0 according to the transfer rate of each experiment
• Transfer to all sites

Multi-VO tests
• Transfer Alice-Atlas-CMS to LYON Tier-1
• Reached nominal transfer rates after a few improvements…

[Plots labelled BAD and OK]

Problems: identified or not
Many improvements during 2006, with an increase in the magnitude of the overall tests and fixes in the last quarter of 2006. But stable running has not yet been achieved, for very different reasons:
• Persistent and transient site failures
• Frequent failures of FTS transfers: a big problem when multiple VOs run
• LFC server hanging and failures: solved
• Hardware upgrade on some sites for the VO BOX: fixed
• Memory leaks and other overflow conditions in the DDM tool when running for long periods of time: fixed
• Throughput per stream per site seems to vary heavily (and some streams are very slow): not understood

Problems but also successes
• Large file sizes always lead to much more stable running: still not totally understood
• Unstable data generation (Castor configuration…): significant downtimes and problems maintaining a constant stream for Tier-1 export
• Monitoring:
– Missing automated alarms
– Missing clear view of errors per site
– Missing overall success metrics per dataset
• Lack of manpower!
• BUT despite this list, many successes at the end of 2006, and a very reactive and concerned attitude from the Lyon T1 and cloud.
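Two of the monitoring gaps listed above, a per-site view of errors and a success metric per dataset, amount to simple aggregations over transfer records. The sketch below assumes a hypothetical record format (a dict with `site`, `dataset` and `ok` keys); it is an illustration of the idea, not the monitoring actually used by the experiment.

```python
from collections import Counter, defaultdict


def errors_per_site(records):
    """Count failed transfers per site. Each record is a dict with the
    (hypothetical) keys 'site', 'dataset' and 'ok'."""
    counts = Counter()
    for rec in records:
        if not rec["ok"]:
            counts[rec["site"]] += 1
    return dict(counts)


def success_rate_per_dataset(records):
    """Fraction of successful transfers for each dataset."""
    totals = defaultdict(lambda: [0, 0])  # dataset -> [succeeded, attempted]
    for rec in records:
        totals[rec["dataset"]][1] += 1
        if rec["ok"]:
            totals[rec["dataset"]][0] += 1
    return {ds: ok / n for ds, (ok, n) in totals.items()}


if __name__ == "__main__":
    sample = [  # made-up records for illustration only
        {"site": "SITE_A", "dataset": "ds1", "ok": True},
        {"site": "SITE_A", "dataset": "ds1", "ok": False},
        {"site": "SITE_B", "dataset": "ds2", "ok": False},
    ]
    print(errors_per_site(sample))
    print(success_rate_per_dataset(sample))
```

An automated alarm, the third missing item, could then be a threshold check on either summary.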

Guidelines
Exercising the Model and preparing for real Data: Monte Carlo production

Monte-Carlo Production in Lyon
• Autumn 2006: executor installed in Lyon to distribute the production jobs within the Lyon Cloud.
• Production shift organization
• Setup of priorities to boost production jobs based on the role in the certificate
Running: 291 (max: 955), queued: 443 (max: 1101), production rate: 81% (max: 100%)
Result: impressive increase in the efficiency of data production in the Lyon Cloud.

AOD Replication: pre-testing
[Matrix of site-to-site transfers; the status of the individual cells is not preserved.]
TO: ASGC, BNL, CERN, CNAF, FZK, LYON, NG, PIC, RAL, SARA, TRIUMF
FROM: ASGC, BNL, CERN, CNAF, FZK, LYON, NDGF, NG, PIC, RAL, SARA, TRIUMF
Legend: Data Transfer tested; Data Transfer failed; Data Transfer not tested; in progress

Monte-Carlo Production in Lyon Cloud
• 16% of LCG for 2006
• 22% for October-November

Monte-Carlo Production in Lyon Cloud
[Plot not preserved]

Monte-Carlo Production in Lyon
• Still a lot of room for improvement in performance
– Failure rate too high at or before the start of jobs, or due to site/middleware issues (no loss of CPU)
– Failures at output: registration problems, SRM, etc.

Summary
• Main baselines of the Atlas Computing Model are established, but work on improvements continues.
• 2006: decisive transition to operation mode, with continuous production of high-statistics MC samples.
• Successful tests of Data Distribution in agreement with CSC (Computer System Commissioning).
• Bottlenecks and problems still lie ahead, but most are identified and work on solutions is under way.
• Improvements expected in reliability, stability and monitoring.
• The Lyon site is very actively progressing towards full readiness for first data at the end of 2007.