- Number of slides: 19
GridPP – Making the Grid Work for the Science
Tony Doyle, a.doyle@physics.gla.ac.uk
ATSE e-Science Visit, Edinburgh, 20 April 2004
Tony Doyle - University of Glasgow
Contents
• Context
  1. General (yesterday)
  2. Process (today)
  3. Operations (tomorrow)
• Start where Steve left off yesterday...
• End up where Andrew begins tomorrow...
  – How does the Grid Work?
  – Performance Indicators
  – Why was the “failure rate” ~20%?
  – Software Process
  – External dependencies
  – Managing a distributed project...
  – Is GridPP a Grid?
• What is the Grid anyway? (from PP perspective)
  – Demo...
How Does the Grid Work?
0. Web User Interface… or CLI
1. Authentication: grid-proxy-init
2. Job submission: edg-job-submit
3. Monitoring and control: edg-job-status, edg-job-cancel, edg-job-get-output
4. Data publication and replication: globus-url-copy, RLS
5. Resource scheduling – use of Mass Storage Systems: JDL, sandboxes, storage elements
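The command-line steps above can be sketched as follows. This is only an illustration, not the deck's own material: the JDL attribute names (Executable, StdOutput, StdError, OutputSandbox) and the convention that each command takes a JDL file or a job identifier as its argument are assumptions, and `hello.jdl` and `JOB_ID` are hypothetical placeholders. The sketch only builds the text and the argument lists; it does not run anything.

```python
from typing import List


def make_jdl(executable: str, stdout: str, stderr: str) -> str:
    """Render a minimal job description (attribute names are assumed)."""
    return (
        f'Executable = "{executable}";\n'
        f'StdOutput = "{stdout}";\n'
        f'StdError = "{stderr}";\n'
        # The output "sandbox" lists the files returned to the user.
        f'OutputSandbox = {{"{stdout}", "{stderr}"}};\n'
    )


def workflow(jdl_path: str, job_id: str) -> List[List[str]]:
    """Return the command lines for steps 1 to 3, in order (not executed)."""
    return [
        ["grid-proxy-init"],               # 1. authentication
        ["edg-job-submit", jdl_path],      # 2. job submission
        ["edg-job-status", job_id],        # 3. monitoring
        ["edg-job-get-output", job_id],    # 3. retrieve the output sandbox
    ]


if __name__ == "__main__":
    print(make_jdl("/bin/hostname", "std.out", "std.err"))
    for argv in workflow("hello.jdl", "JOB_ID"):
        print(" ".join(argv))
```

In practice the job identifier would come from the output of the submission step rather than being known in advance.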
Job Submission (behind the scenes)
[Diagram. Components: UI, Resource Broker, Job Submission Service, Compute Element, Storage Element, Replica Catalogue, Information Service, Logging & Book-keeping, Author. & Authen. Arrow labels: grid-proxy-init, JDL, Input “sandbox”, Job Submit Event, Job Query, Job Status, Expanded JDL, Globus RSL, Input “sandbox” + Broker Info, Output “sandbox”, Publish, SE & CE info, DataSets info.]
How do I Authorize?
[Diagram. VO Directory: o=xyz, dc=eu-datagrid, dc=org; ou=People; CN=Homer Simpson, CN=Tony Doyle (each with an Authentication Certificate). “Authorization Directory”: o=testbed, dc=eu-datagrid, dc=org; ou=Testbed1, ou=People, ou=???; CN=Steven Hawking, CN=Tony Doyle (each with an Authentication Certificate). mkgridmap combines these with local users and a ban list to produce the grid-mapfile.]
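The mkgridmap step in the diagram above can be illustrated with a short sketch. It is not the real tool: the function name `build_gridmap`, the certificate subject strings, the account names, and the precedence rules (local entries override VO entries, banned subjects are dropped) are all assumptions made for illustration. The only format assumed is that each grid-mapfile line pairs a quoted certificate subject with a local account.

```python
from typing import Dict, Iterable, List


def build_gridmap(vo_members: Iterable[str], vo_account: str,
                  local_users: Dict[str, str],
                  banned: Iterable[str]) -> List[str]:
    """Merge VO members, local users and a ban list into grid-mapfile lines."""
    banned_set = set(banned)
    mapping: Dict[str, str] = {}
    for subject in vo_members:
        mapping[subject] = vo_account     # every VO member maps to one account
    mapping.update(local_users)           # assumed: local entries take precedence
    return [
        f'"{subject}" {account}'
        for subject, account in sorted(mapping.items())
        if subject not in banned_set      # assumed: banned subjects are dropped
    ]


if __name__ == "__main__":
    lines = build_gridmap(
        vo_members=["/O=Example/CN=Homer Simpson", "/O=Example/CN=Tony Doyle"],
        vo_account="vouser",
        local_users={"/O=Example/CN=Steven Hawking": "shawking"},
        banned=["/O=Example/CN=Homer Simpson"],
    )
    print("\n".join(lines))
```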
UK Certificate Authority and Virtual Organisation membership
1. UK participating in 6 of the 9 EDG Virtual Organisations
2. UK e-Science Certificate Authority now used in application testbed
3. PP “users” engaged from many institutes
Performance indicators (as measured by end users)
Conclusion: prototype performance, but with quality assurance mechanisms built-in
Why was the “failure rate” ~20%?
Layers: I. Experiment Layer; II. Application Middleware; III. Grid Middleware; IV. Facilities and Fabrics
• Component Testing, e.g. RB Stress Tests (LCG)
  – RB never crashed
  – Ran without problems at load for several days in a row: 20 streams with 100 jobs each (typical error rate ~2% still present)
  – RB stress test in a job storm of 50 streams, 20 jobs each:
    – 50% of the streams ran out of connections between UI and RB (configuration parameter, but machine constraints)
    – Remaining 50% of the streams finished normally (2% error rate)
    – Time between job-submit and return of the command (acceptance by the RB) is 3.5 seconds (independent of number of streams)
• PROBLEMS ARE END-TO-END: e.g. site advertisement communicated via class ads to all sites (inc. e.g. CNAF) results in RB sending application jobs (e.g. AliEn for ALICE) to a “black hole”; these are recorded as “failures” (application corrects for these via re-submission)
• OTHER “PROBLEM” IS INCORPORATION OF ADDED FUNCTIONALITY
  – ~Resolved by adherence to software process coupled to testbed structure… improved significantly within LCG (leading to EGEE)
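The stress-test figures above translate into job counts as in the rough sketch below. It assumes, which the slide does not state, that every job in a stream that ran out of connections is lost and that the ~2% error rate applies uniformly to the remaining jobs; the function name `stress_test_summary` is hypothetical.

```python
from typing import Dict


def stress_test_summary(streams: int, jobs_per_stream: int,
                        error_percent: int,
                        lost_stream_percent: int = 0) -> Dict[str, float]:
    """Estimate job counts for one RB stress test (simplifying assumptions)."""
    total_jobs = streams * jobs_per_stream
    # Assumption: a stream that ran out of connections loses all its jobs.
    lost_streams = streams * lost_stream_percent // 100
    lost_jobs = lost_streams * jobs_per_stream
    # Assumption: the residual error rate applies to every surviving job.
    residual_errors = (total_jobs - lost_jobs) * error_percent / 100
    return {
        "total_jobs": total_jobs,
        "lost_jobs": lost_jobs,
        "residual_errors": residual_errors,
    }


if __name__ == "__main__":
    print(stress_test_summary(20, 100, 2))       # sustained-load test
    print(stress_test_summary(50, 20, 2, 50))    # job storm
```

Under these assumptions the sustained-load test submits 2000 jobs with about 40 errors, while the job storm submits 1000 jobs, of which 500 sit in streams that lost their connection and about 10 of the rest fail.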
DataGrid Release Milestones
• Evaluations (2.0.12)
  Features (2.0.12):
  – R-GMA replaced MDS
  – Refactored workload mgt.
  – Interactive, MPI, chkpt. jobs
  – Replica Location Service
  – Web Service SE
• EU Review (2.1.13)
  Features (2.1.13) [0.5 Mloc]:
  – Reasonable stability, reliability
  – VOMS incorporated
  – Bug fixes for all services
• Stabilisation time on application testbed typically a few months
Software Process Infrastructure
[Diagram layers: LCG grid software applications (LHC experiments, projects, etc); LCG Application Area (POOL, SEAL, PI, SIMU); SPI Infrastructure: common services and infrastructure (tools, templates, training; general QA, tests, integration, release); similar ways of working (process)]
– Adopt the same set of tools, standards and procedures
– Adopt commonly used open-source or commercial software when easily available
– Avoid “do it yourself solutions”
– Avoid commercial software, since it may give licensing problems
SPI Services Overview
[Diagram. General Services: CVS service, Collaborative Facilities, External Software, Web Portal, Task Management, Mailing Lists. Software Development: Coding, Quality Assurance, Deployment and Installation, Analysis and Design, Development, Release, Testing, Build systems, Specifications, Documentation.]
• Provide General Services needed by each project
  – CVS repository, Web Site, Software Library
  – Mailing Lists, Bug Reports, Task Management, Collaborative Facilities
• Provide solutions specific to the Software Development phases
  – Tools, Templates, Policies, Support, Documentation, Examples
External Software
• We install software needed by Particle Physics projects
• Open Source and Public Domain software (libraries and tools) like:
  – Compilers (icc, ecc)
  – HEP-made packages
  – Scientific libraries (GSL)
  – General tools (python)
  – Test tools (cppunit, qmtest)
  – Database software (mysql, mysql++)
  – Documentation generators (lxr, doxygen)
  – XML parsers (XercesC)
• There are currently 50 different packages, plus others under evaluation, for more than 300 installations
• The LCG projects propose what to install in agreement with LHC needs
• The platforms are decided by the Architect Forum
  – Linux RedHat 7.3 with the compilers
    • gcc 3.2 (rh73_gcc32)
    • icc 7.1 (rh73_icc71)
    • ecc 7.1 (rh73_ecc71)
  – Windows
    • Visual Studio .NET 7.1 (win32_vc7)
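The Linux platform tags listed above appear to follow a simple pattern: operating system tag, underscore, compiler name, compiler version without the dot. The sketch below encodes that observation only; the rule is inferred from the three Linux examples and is not a documented convention, and the helper name `platform_tag` is hypothetical. The Windows tag (win32_vc7) does not follow the same rule, so it is kept as an explicit entry.

```python
from typing import Dict, Tuple


def platform_tag(os_tag: str, compiler: str, version: str) -> str:
    """Build a tag such as 'rh73_gcc32' (pattern inferred, not documented)."""
    return f"{os_tag}_{compiler}{version.replace('.', '')}"


# Platforms named on the slide: (OS tag, compiler, version) -> tag.
PLATFORMS: Dict[Tuple[str, str, str], str] = {
    ("rh73", "gcc", "3.2"): platform_tag("rh73", "gcc", "3.2"),
    ("rh73", "icc", "7.1"): platform_tag("rh73", "icc", "7.1"),
    ("rh73", "ecc", "7.1"): platform_tag("rh73", "ecc", "7.1"),
    # Exception: the slide gives 'win32_vc7', not 'win32_vc71'.
    ("win32", "vc", "7.1"): "win32_vc7",
}

if __name__ == "__main__":
    for key, tag in PLATFORMS.items():
        print(key, "->", tag)
```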
How Is the process applied?
Middleware Validation: From Testbed to Production
[Diagram. Stages: Unit Test, Build, Integration, Certification, Application Certification, Production. Infrastructure: Build System; Development Testbed (~15 CPU); Certification Testbed (~40 CPU); Application Testbed (~1000 CPU); Production. Actors: WPs, Integration Team, Test Group, Apps. Representatives, Users. Labels: add unit tested code to repository; run nightly build & auto. tests; tagged package; individual WP tests; overall release tests; tagged release candidates; tagged release selected for certification; Grid certification; certified release candidates; certified release selected for deployment; certified public release for use by apps.; 24x7; problem reports; fix problems. Process to: test frameworks, test support, test policies, test documentation, test platforms/compilers.]
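The validation flow in the diagram above can be read as a release moving through a fixed sequence of stages, advancing only when the tests at the current stage pass. The sketch below is one possible reading: the stage order is reconstructed from the diagram labels, and the rule that a failed release goes back to the first stage (the "fix problems" arrow) is an assumption. The names `STAGES` and `promote` are hypothetical.

```python
from typing import List

# Assumed stage order, reconstructed from the diagram labels.
STAGES: List[str] = [
    "unit test",
    "build",
    "integration",                 # development testbed, ~15 CPU
    "certification",               # certification testbed, ~40 CPU
    "application certification",   # application testbed, ~1000 CPU
    "production",
]


def promote(stage: str, tests_passed: bool) -> str:
    """Return the stage a release moves to after testing at `stage`."""
    if not tests_passed:
        return STAGES[0]           # assumed: problems go back to developers
    index = STAGES.index(stage)
    # The final stage has nowhere further to go.
    return STAGES[min(index + 1, len(STAGES) - 1)]


if __name__ == "__main__":
    stage = STAGES[0]
    for outcome in [True, True, False, True]:
        stage = promote(stage, outcome)
        print(stage)
```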
The UK Testbed
e.g. ScotGrid: Glasgow, Edinburgh and Durham
• Glasgow farm: WNs on a private network with outbound NAT in place
• 100,000 jobs completed (900,000 CPU hours)
• Data Management Testbed
• Shared resources (LHC, CDF and Bioinformatics)
• 34 dual blade servers and 5 TB FastT500 being integrated now (next door)
• Edinburgh: 24 TB FastT700 and 8-way server: data storage focus
• Durham: 40 node farm
• All being integrated into LCG-2
[Diagram labels: ScotGRID; EDG 1.4 CE, SE; EDG 2.1 SE; 59 x WN; CE; MON; LHC, CDF, BIO]
Managing a Distributed Project: GridPP1 Project Status?
• 76% of the 190 GridPP1 tasks have been successfully completed
What is “The Grid” Anyway? / Is GridPP a Grid?
http://www-fp.mcs.anl.gov/~foster/Articles/WhatIsTheGrid.pdf
1. Coordinates resources that are not subject to centralized control
   YES. This is why development and maintenance of a UK-EU-US testbed is important
2. … using standard, open, general-purpose protocols and interfaces
   YES... Globus/CondorG/EDG meet this requirement. Common experiment application layers are also important here.
3. … to deliver nontrivial qualities of service
   NO(T YET)… Experiments define whether this is true: currently only ~100,000 jobs submitted via the testbed, c.f. internal component tests of up to 10,000 jobs per day. Next step: LCG-2 deployment outcome… this year
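The comparison behind the third answer can be made concrete with a back-of-envelope calculation. It assumes, purely for illustration, that the peak component-test rate of 10,000 jobs per day could be sustained; the function name `days_equivalent` is hypothetical.

```python
def days_equivalent(total_jobs: int, jobs_per_day: int) -> float:
    """Days needed to run `total_jobs` at a constant `jobs_per_day` rate."""
    return total_jobs / jobs_per_day


if __name__ == "__main__":
    # ~100,000 jobs submitted via the testbed in total,
    # versus component tests of up to 10,000 jobs per day.
    print(days_equivalent(100_000, 10_000))
```

On that assumption, everything submitted via the testbed so far corresponds to roughly ten days of component-test throughput, which is why the answer is "not yet".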
What is The Grid Anyway? From Particle Physics Perspective
The Grid is:
• not hype, but surrounded by it
• a working prototype running on testbed(s)…
• about seamless discovery of PC resources around the world
• using evolving standards for interoperation
• the basis for particle physics computing in the 21st Century
• not (yet) as transparent as end-users want it to be
The Grid: Demonstrations
http://www.gridpp.ac.uk/demos/
Demos used to establish that e.g. the two LHC multi-purpose detector collaborations
• can run jobs on an International Grid
• use common Grid infrastructure with secure Grid access
• But this doesn’t mean that the Grid works in production mode (yet)
• This is however significant