ALICE Computing Model
F. Carminati, BNL Seminar, March 21, 2005
Offline framework
- AliRoot in development since 1998
  - Entirely based on ROOT
  - Used since the detector TDRs for all ALICE studies
- Two packages to install (ROOT and AliRoot)
  - Plus the MCs
- Ported to most common architectures
  - Linux IA32, IA64 and AMD; Mac OS X; Digital Tru64; SunOS...
- Distributed development
  - Over 50 developers and a single CVS repository
  - 2/3 of the code developed outside CERN
- Tight integration with DAQ (data recorder) and HLT (same codebase)
- Wide use of abstract interfaces for modularity
- "Restricted" subset of C++ used for maximum portability
AliRoot layout
[Architecture diagram] AliRoot sits on top of ROOT and the Virtual MC. Transport engines: G3, G4, FLUKA. Event generators: PYTHIA6, HIJING, ISAJET, MEVSIM, PDF, EVGEN, HBTP. Core: STEER with AliSimulation, AliReconstruction, AliAnalysis and the ESD. Detector modules: ITS, TPC, TRD, TOF, PHOS, EMCAL, RICH, MUON, FMD, PMD, ZDC, START, CRT, STRUCT. Analysis packages: HBTAN, RALICE. Grid interface: AliEn/gLite.
Software management
- Regular release schedule
  - Major release every six months, minor release (tag) every month
- Emphasis on delivering production code
  - Corrections, protections, code cleaning, geometry
- Nightly produced UML diagrams, code listings, coding rule violations, builds and tests
- Single repository with all the code
  - No version management software (we have only two packages!)
- Advanced code tools under development (collaboration with IRST/Trento)
  - Smell detection (already under testing)
  - Aspect-oriented programming tools
  - Automated genetic testing
ALICE Detector Construction Database (DCDB)
- Specifically designed to aid detector construction in a distributed environment:
  - Sub-detector groups around the world work independently
  - All data are collected in a central repository and used to move components from one sub-detector group to another and during the integration and operation phase at CERN
- Multitude of user interfaces:
  - Web-based for humans
  - LabVIEW, XML for laboratory equipment and other sources
  - ROOT for visualisation
- In production since 2002
- A very ambitious project with important spin-offs
  - Cable Database
  - Calibration Database
The Virtual MC
[Diagram] User code (geometrical modeller, reconstruction, visualisation, generators) talks only to the Virtual MC (VMC) interface; behind it, the concrete engine can be G3 transport, G4 transport or FLUKA transport, interchangeably.
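To illustrate the idea, here is a minimal sketch of how user code stays transport-independent through the VMC: physics configuration goes through gMC (the TVirtualMC singleton) and the engine choice is a single line. The constructor arguments and the particular calls shown are assumptions for illustration, not the actual ALICE configuration macro.

```cpp
// Config-style sketch (illustrative; engine constructor arguments are assumptions).
// The concrete engines come from the geant3 / geant4_vmc / fluka VMC packages.
#include "TVirtualMC.h"
#include "TGeant3TGeo.h"

void Config()
{
   // Pick ONE concrete transport engine; the user code below never changes.
   new TGeant3TGeo("C++ interface to Geant3");
   // new TGeant4(...);   // or Geant4
   // new TFluka(...);    // or FLUKA

   // Physics settings go through the common TVirtualMC interface (gMC),
   // so they are identical for every engine.
   gMC->SetProcess("DCAY", 1);      // enable decays
   gMC->SetCut("CUTGAM", 1.e-3);    // gamma cut, in GeV
}
```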
TGeo modeller
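As a flavour of the TGeo geometry modeller shown on this slide, a minimal, self-contained ROOT macro; the materials and volumes are purely illustrative, not the ALICE geometry.

```cpp
// tgeo_sketch.C -- minimal TGeo example (illustrative volumes, not the ALICE geometry).
void tgeo_sketch()
{
   new TGeoManager("sketch", "Minimal TGeo example");

   TGeoMaterial *matVac = new TGeoMaterial("Vacuum", 0, 0, 0);
   TGeoMedium   *vac    = new TGeoMedium("Vacuum", 1, matVac);

   TGeoVolume *top = gGeoManager->MakeBox("TOP", vac, 300., 300., 300.);
   gGeoManager->SetTopVolume(top);

   // A simple barrel layer: a tube placed inside the top volume.
   TGeoVolume *layer = gGeoManager->MakeTube("LAYER", vac, 3.9, 4.1, 50.);
   top->AddNode(layer, 1);

   gGeoManager->CloseGeometry();
   top->Draw();   // the same geometry serves simulation, reconstruction and visualisation
}
```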
Results
[Figures] Geant3 vs FLUKA comparison: HMPID with 5 GeV pions; 5000 1 GeV/c protons in 60 T field.
ITS - SPD: Cluster Size (PRELIMINARY!)
Reconstruction strategy
- Main challenge: reconstruction in the high-flux environment (occupancy in the TPC up to 40%) requires a new approach to tracking
- Basic principle: maximum information approach
  - Use everything you can, you will get the best result
- Algorithms and data structures optimised for fast access and usage of all relevant information
  - Localise relevant information
  - Keep this information until it is needed
Tracking strategy - primary tracks
- Incremental process
  - Forward propagation towards the vertex: TPC -> ITS
  - Back propagation: ITS -> TPC -> TRD -> TOF
  - Refit inward: TOF -> TRD -> TPC -> ITS
- Continuous seeding
  - Track segment finding in all detectors
- Combinatorial tracking in the ITS
  - Weighted two-track χ² calculated (a generic building block is sketched after this list)
  - Effective probability of cluster sharing
  - Probability not to cross a given layer for secondary particles
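For orientation, the building block of such weighted cluster-to-track assignments is the usual Kalman-filter χ² increment. This is only the generic form, not the exact ALICE two-track weighting:

```latex
\chi^2 = \mathbf{r}^{T}\left(V + H\,C\,H^{T}\right)^{-1}\mathbf{r}
```

where r is the residual between the measured cluster position and the track prediction, V the cluster covariance, C the track covariance and H the projection from track parameters to measurement space.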
Tracking & PID TPC ITS+TPC+TOF+TRD ITS & TPC & TOF l Efficiency PIV 3 GHz – (d. N/dy – 6000) ¡ ¡ TPC tracking - ~ 40 s TPC kink finder ~ 10 s ITS tracking ~ 40 s TRD tracking ~ 200 s Contamination March 21, 2005 ALICE Computing Model 12
Condition and alignment
- Heterogeneous information sources are periodically polled
- ROOT files with condition information are created
- These files are published on the Grid and distributed as needed by the Grid DMS
- Files contain validity information and are identified via DMS metadata
- No need for a distributed DBMS
- Reuse of the existing Grid services
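A hedged sketch of the flow just described: condition data are written as plain ROOT objects into a file whose validity range would also be attached as catalogue metadata. The file-naming scheme and the object used here are assumptions, not the actual ALICE conventions.

```cpp
// cond_sketch.C -- write a condition object into a ROOT file, with the validity
// range encoded in the name (naming scheme and object are illustrative assumptions).
void cond_sketch()
{
   TH1F pedestals("pedestals", "TPC pedestals (illustrative)", 1000, 0., 1000.);

   // One file per validity interval; the run range would also be registered
   // as metadata in the Grid file catalogue so the DMS can serve the right file.
   TFile f("TPC_Pedestals_Run1000_Run2000.root", "RECREATE");
   pedestals.Write();
   f.Close();
}
```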
External relations and DB connectivity
[Diagram: relations between DBs, not final and not all shown] ECS, DAQ, Trigger (calibration procedures), DCS, DCDB and HLT (calibration files) connect through APIs to AliEn/gLite (metadata and file store) and to the calibration classes in AliRoot; physics data flow in as well. From the URs: files, source, volume, granularity, update frequency, access pattern, runtime environment and dependencies. A call for URs was sent to the sub-detectors. (API = Application Program Interface)
Metadata
- Metadata are essential for the selection of events
- We hope to be able to use the Grid file catalogue for one part of the metadata
  - During the Data Challenge we used the AliEn file catalogue for storing part of the metadata
  - However these are file-level metadata
- We will need additional event-level metadata
  - This can simply be the TAG catalogue with externalisable references (as sketched below)
  - We are discussing this subject with STAR
- We will take a decision soon
  - We would prefer that the Grid scenario be clearer
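As an illustration of event-level selection through a TAG-like catalogue, a hedged ROOT sketch; the tag tree name, branch names and selection cuts are assumptions, since the real TAG layout was still being defined.

```cpp
// tag_sketch.C -- event selection on a tag tree (tree/branch names are assumptions).
void tag_sketch()
{
   TChain tags("T");
   tags.Add("tags_*.root");   // tag files located through the Grid file catalogue

   // Select interesting events; each tag entry would carry an externalisable
   // reference (file identifier + event number) back to the full ESD.
   tags.Draw(">>evlist", "fNTracks > 1000 && (fTriggerMask & 0x1)", "entrylist");
   TEntryList *evlist = (TEntryList*)gDirectory->Get("evlist");
   printf("Selected %lld events\n", evlist ? evlist->GetN() : 0LL);
}
```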
ALICE CDCs (Date | MBytes/s | TBytes to MSS | Offline milestone)
- 10/2002 | 200 | 200 | Rootification of raw data; raw data for TPC and ITS
- 9/2003 | 300 | 300 | Integration of single-detector HLT code; partial data replication to remote centres
- 5/2004 -> 3/2005(?) | 450 | | HLT prototype for more detectors; remote reconstruction of partial data streams; raw digits for barrel and MUON
- 5/2005 | 750 | | Prototype of the final remote data replication (raw digits for all detectors)
- 5/2006 | 750 (1250 if possible) | | Final test (final system)
Use of HLT for monitoring in CDCs
[Diagram] AliRoot simulation produces digits and raw data; the raw data pass through the LDCs and alimdc to the event builder (GDC) and are written to CASTOR as ROOT files; the HLT algorithms run on the data stream and produce ESDs and monitoring histograms, accessible through AliEn.
ALICE Physics Data Challenges (Period (milestone) | Fraction of the final capacity (%) | Physics objective)
- 06/01-12/01 | 1% | pp studies, reconstruction of TPC and ITS
- 06/02-12/02 | 5% | First test of the complete chain from simulation to reconstruction for the PPR; simple analysis tools; digits in ROOT format
- 01/04-06/04 | 10% | Complete chain used for trigger studies; prototype of the analysis tools; comparison with parameterised Monte Carlo; simulated raw data
- 05/05-07/05 | TBD | Test of condition infrastructure and FLUKA; to be combined with SDC 3; speed test of distributing data from CERN
- 01/06-06/06 | 20% | Test of the final system for reconstruction and analysis
PDC 04 schema
[Diagram] AliEn job control and data transfer across CERN, Tier 1 and Tier 2 centres: production of RAW, shipment of RAW to CERN, reconstruction of RAW in all T1s, analysis.
Phase 2 principle
[Diagram] A signal-free underlying event is reused and merged with simulated signal to produce mixed-signal events.
Simplified view of the ALICE Grid with AliEn
- ALICE VO, central services: user authentication, file catalogue, workload management, job submission, configuration, job monitoring, central task queue, accounting, storage volume manager
- AliEn site services: computing element, data transfer, storage element (disk and MSS), local scheduler, cluster monitor
- Existing site components are integrated with the ALICE VO site services
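From the user side, the central services are reached through ROOT's TGrid abstraction. A hedged sketch follows; the catalogue path and file pattern are illustrative, and the AliEn/TAlien plugin must be available.

```cpp
// grid_sketch.C -- connect to AliEn from ROOT and query the file catalogue
// (catalogue path and pattern are illustrative assumptions).
void grid_sketch()
{
   TGrid::Connect("alien://");          // authenticate against the ALICE VO
   if (!gGrid) return;

   TGridResult *res = gGrid->Query("/alice/sim", "*/AliESDs.root");
   if (!res) return;
   for (Int_t i = 0; i < res->GetEntries(); ++i)
      printf("%s\n", res->GetKey(i, "turl"));   // transport URL of each replica
}
```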
Site services
- Unobtrusive, runs entirely in user space:
  - Single user account
  - All authentication already assured by the central services
  - Tuned to the existing site configuration: supports various schedulers and storage solutions
- Running on many Linux flavours and platforms (IA32, IA64, Opteron)
- Automatic software installation and updates (both service and application)
- Scalable and modular: different services can be run on different nodes (in front of or behind firewalls) to preserve site security and integrity
  - Only high ports (50k-55k) used for parallel file transport
  - Load-balanced file transfer nodes (on HTAR)
[Diagram] AliEn data transfer goes through the CERN firewall solution for large-volume file transfers; other AliEn services connect to the CERN intranet.
Central services hardware
- HP ProLiant DL380: AliEn proxy server, up to 2500 concurrent client connections
- HP rx2600 server: AliEn job services, 500 k archived jobs
- HP ProLiant DL380: AliEn storage elements volume manager
- HP ProLiant DL580: AliEn file catalogue, 9 million entries, 400 k directories, 10 GB MySQL DB; 4 million entries, 3 GB MySQL DB
- HP ProLiant DL360: log files, application software storage, 1 TB SATA disk server
- AliEn to CASTOR (MSS) interface
Phase 2 job structure (Sep. 2004)
[Diagram] Task: simulate the event reconstruction and remote event storage.
- Central servers: master job submission, job optimizer (N sub-jobs), RB, file catalogue, process monitoring and control, SE...
- Sub-jobs are sent both directly to AliEn CEs and, through the AliEn-LCG interface and the LCG RB, to LCG CEs
- Underlying event input files are read from CERN CASTOR (underlying events)
- Job processing at the CEs produces the output files; a zip archive of the output files is stored with the primary copy on local SEs and a backup copy in CERN CASTOR
- Output is registered in the AliEn FC (for LCG SEs: LCG LFN = AliEn PFN), using edg(lcg) copy&register
Production history
- ALICE repository: history of the entire DC, ~1000 monitored parameters:
  - Running and completed processes
  - Job status and error conditions
  - Network traffic
  - Site status, central services monitoring
  - ...
- 7 GB of data, 24 million records with 1-minute granularity, analysed to improve Grid performance
- Statistics:
  - 400 000 jobs, 6 hours/job, 750 MSI2k hours
  - 9 M entries in the AliEn file catalogue
  - 4 M physical files at 20 AliEn SEs in centres world-wide
  - 30 TB stored at CERN CASTOR
  - 10 TB stored at remote AliEn SEs + 10 TB backup at CERN
  - 200 TB network transfer CERN -> remote computing centres
  - AliEn efficiency observed > 90%
  - LCG observed efficiency 60% (see GAG document)
Job repartition
- Jobs (AliEn/LCG): Phase 1 - 75%/25%, Phase 2 - 89%/11%
- More operational sites were added to the ALICE Grid as the PDC progressed
- 17 permanent sites (33 total) under direct AliEn control, plus additional resources through Grid federation (LCG)
[Pie charts: Phase 1 and Phase 2 job shares per site]
Summary of PDC'04
- Computing resources
  - It took some effort to 'tune' the resources at the remote computing centres
  - The centres' response was very positive: more CPU and storage capacity was made available during the PDC
- Middleware
  - AliEn proved to be fully capable of executing high-complexity jobs and controlling large amounts of resources
  - Functionality for Phase 3 has been demonstrated, but cannot be used
  - LCG MW proved adequate for Phase 1, but not for Phase 2 and in a competitive environment
  - It cannot provide the additional functionality needed for Phase 3
- ALICE computing model validation:
  - AliRoot: all parts of the code successfully tested
  - Computing elements configuration
  - Need for a high-functionality MSS shown
  - The Phase 2 distributed data storage schema proved robust and fast
  - Data analysis could not be tested
Development of analysis
- Analysis Object Data (AOD) designed for efficiency
  - Contain only the data needed for a particular analysis
- Analysis à la PAW
  - ROOT + at most a small library
- Work on the distributed infrastructure has been done by the ARDA project
- Batch analysis infrastructure
  - Prototype published at the end of 2004 with AliEn
- Interactive analysis infrastructure
  - Demonstration performed at the end of 2004 with AliEn gLite
- Physics working groups are just starting now, so the timing is right to receive requirements and feedback
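To make "analysis à la PAW" concrete, a hedged one-histogram sketch over AOD-like files; the tree name and branch name are assumptions for illustration.

```cpp
// analysis_sketch.C -- PAW-style analysis with plain ROOT over AOD-like files
// (tree name "aodTree" and branch "fPt" are illustrative assumptions).
void analysis_sketch()
{
   TChain aod("aodTree");
   aod.Add("AliAODs_*.root");

   TH1F *hpt = new TH1F("hpt", "p_{T} spectrum;p_{T} (GeV/c);entries", 100, 0., 10.);
   aod.Draw("fPt>>hpt", "fPt > 0.2");   // one line: nothing beyond ROOT is needed
   hpt->Draw();
}
```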
[Diagram: PROOF on the Grid] PROOF slave servers (proofd, rootd) run at sites A, B, ... <X> on LCG resources; new elements include an optional site gateway with only outgoing connectivity and a forward proxy, with the slave ports mirrored on the master host. Grid service interfaces cover proofd startup, the slave registration/booking DB, the TGrid UI/queue UI, master setup, PROOF steering, the Grid access control service, Grid/ROOT authentication and the Grid file/metadata catalogue. The PROOF client sends a booking request with logical file names to the PROOF master and retrieves the list of logical files (LFN + MSN); from there a "standard" PROOF session runs, making the PROOF setup Grid-middleware independent.
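From the client side, such a session reduces to a few ROOT calls. A hedged sketch, with the master URL, dataset and selector names as assumptions:

```cpp
// proof_sketch.C -- minimal interactive PROOF session (master, files and selector
// names are illustrative assumptions).
void proof_sketch()
{
   TProof::Open("proof://master.example.org");   // attach to / start a PROOF session

   TChain *esd = new TChain("esdTree");
   esd->Add("AliESDs_*.root");    // in the Grid case the list comes from the file catalogue
   esd->SetProof();               // route Process() through the PROOF slaves
   esd->Process("MySelector.C+"); // the user's TSelector runs in parallel on the slaves
}
```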
Grid situation
- History
  - Jan '04: AliEn developers are hired by EGEE and start working on the new MW
  - May '04: a prototype derived from AliEn is offered to pilot users (ARDA, Biomed...) under the gLite name
  - Dec '04: the four experiments ask for this prototype to be deployed on a larger preproduction service and to be part of the EGEE release
  - Jan '05: this is vetoed at management level; AliEn will not be common software
- Current situation
  - EGEE has vaguely promised to provide the same functionality as the AliEn-derived MW (gLite)
    - But with a delay of at least 2-4 months on top of the one already accumulated
    - But even this will be just the beginning of the story: the different components will have to be field-tested in a real environment; it took four years for AliEn
  - All experiments have their own middleware
  - Ours is not maintained because our developers have been hired by EGEE
  - EGEE has formally vetoed any further work on AliEn or AliEn-derived software
  - LCG has allowed some support for ALICE, but the situation is far from clear
ALICE computing model
- For pp, similar to the other experiments
  - Quasi-online data distribution and first reconstruction at T0
  - Further reconstruction passes at the T1s
- For AA, a different model
  - Calibration, alignment and pilot reconstructions during data taking
  - Data distribution and first reconstruction at T0 during the four months after the AA run (shutdown)
  - Second and third passes distributed at the T1s
- For safety, one copy of RAW at T0 and a second one distributed among all T1s
- T0: first-pass reconstruction, storage of one copy of RAW, calibration data and first-pass ESDs
- T1: subsequent reconstructions and scheduled analysis, storage of the second collective copy of RAW and of one copy of all data to be safely kept (including simulation), disk replicas of ESDs and AODs
- T2: simulation and end-user analysis, disk replicas of ESDs and AODs
- Very difficult to estimate the network load
ALICE requirements on middleware
- One of the main uncertainties of the ALICE computing model comes from the Grid component
  - ALICE developed its computing model assuming that MW with the same quality and functionality that AliEn would have had two years from now will be deployable on the LCG computing infrastructure
  - If not, we will still analyse the data (!), but
    - Less efficiency -> more computers -> more time and money
    - More people for production -> more money
- To elaborate an alternative model we should know what will be
  - The functionality of the MW developed by EGEE
  - The support we can count on from LCG
  - Our "political" margin of manoeuvre
Possible strategy
- If
  a) basic services from the LCG/EGEE MW can be trusted at some level, and
  b) we can get some support to port the "higher functionality" MW onto these services,
  we have a solution
- If a) above is not true, but
  a) we have support for deploying the ARDA-tested AliEn-derived gLite, and
  b) we do not have a political "veto",
  we still have a solution
- Otherwise we are in trouble
ALICE Offline timeline
[Timeline: we are here = March 2005] 2004: PDC04 (CDC04?), PDC04 analysis, design of new components. 2005: Computing TDR, CDC05, PDC05, development of new components, AliRoot ready for PDC05, PDC06 preparation. 2006: PDC06, final development of AliRoot, first data taking preparation.
Main parameters
Processing pattern
Conclusions
- ALICE has made a number of technical choices for the computing framework since 1998 that have been validated by experience
  - The Offline development is on schedule, although contingency is scarce
- Collaboration between physicists and computer scientists is excellent
- Tight integration with ROOT allows a fast prototyping and development cycle
- AliEn goes a long way in providing a Grid solution adapted to HEP needs
  - However its evolution into a common project has been "stopped"
  - This is probably the largest single "risk factor" for ALICE computing
- Some ALICE-developed solutions have a high potential to be adopted by other experiments and indeed are becoming "common solutions"