f51096cc842f88daa6b7b6b9664fd1f5.ppt
- Number of slides: 27
Enabling Grids for E-sciencE
Operation and management issues in the EGEE/SWE grid infrastructure
G. Barreira, G. Borges, M. David, N. Dias, J. Gomes, J. P. Martins
LIP: Laboratório de Instrumentação em Física Experimental de Partículas
C. Borrego, M. Delfino, G. Merino, K. Neuffer, A. Pacheco
PIC: Port d’Informació Científica
F. Bernabé, J. Fontán, J. Lopez, P. Rey
CESGA: Fundación Centro Tecnológico de Supercomputación de Galicia
R. Marco
IFCA/CSIC: Instituto de Física de Cantabria / Consejo Superior de Investigaciones Científicas
J. Palacios
IFIC/CSIC: Instituto de Física Corpuscular / Consejo Superior de Investigaciones Científicas
www.eu-egee.org
INFSO-RI-508833
Outline
o The EGEE grid project.
o Main operation activities inside the EGEE South-West grid infrastructure:
  – Resources;
  – Activities coordination:
    § Certification;
      • Sites and middleware certification;
    § Accounting;
      • EGEE View;
      • Participation in the Accounting Enforcement task;
    § Monitoring;
      • Interaction with the Grid Operation Centre (GOC);
      • Participation in COD;
    § Support;
      • Interaction with the Global Grid User Support (GGUS);
    § Authentication and Security;
      • Activities in the EUGridPMA framework;
    § Middleware tests and integration.
INFSO-RI-031688 Operation and management issues in the EGEE/SWE grid infrastructure CGW’06
EGEE project
o The Enabling Grids for E-sciencE project:
  – A European-financed grid project;
  – The largest worldwide grid for multidisciplinary sciences;
    § Integrates several national and regional grids;
    § More than 90 partners distributed over 32 countries;
  – Developed on top of the infrastructures and software built in the EDG and LCG grid projects.
o The LHC Computing Grid project:
  – LHC will be the world's most powerful particle accelerator;
    § Built at CERN and expected to start operating in 2007;
  – LCG aims to build and maintain a data storage and analysis infrastructure for the large LHC physics community:
    § 15 Petabytes of experimental data annually;
    § Available during the 15-year lifetime of the LHC machine;
    § Fully accessible to ~5000 scientists from more than 500 institutes.
EGEE project
o EGEE concentrates on three core areas:
  – Improve and maintain the middleware;
    § Provide a reliable service;
  – Attract new users from industry as well as from science;
    § Ensure they receive a high standard of training and support;
  – Combine national, regional and thematic Grid efforts;
    § For a seamless Grid infrastructure for scientific research and to build a sustainable Grid for business research and industry.
o EGEE has expanded from its original two scientific fields (high energy physics and life sciences) and now integrates applications from other scientific fields:
  – Astrophysics; biomedical and bioinformatics applications;
  – Computational chemistry; Earth sciences;
  – Finance; fusion; geophysics;
  – (...)
o EGEE supports more than 100 virtual organizations.
EGEE Operations: The GOC
o The Grid Operations Centre (GOC) is responsible for coordinating the overall operation of the EGEE Grid:
  – Devises and manages mechanisms and procedures which encourage optimal operation of the Grid;
  – Acts as a central point of operational information such as:
    § Site local and central services;
    § Site resources configuration;
    § Contact details.
  – Monitors the operation of the Grid infrastructure as a whole;
    § The GOC works with the federations' local support groups to assist them in providing the best possible service while their infrastructure is connected to the Grid.
EGEE Operations: The ROCs
o The fulfillment of each federation's key objectives is supervised by its Regional Operation Centre (ROC), which:
  – Operates essential core services;
    § RBs, data management services, information services, VOMS servers;
  – Acts as the interface between VO requests and site resources;
  – Provides monitoring and operational troubleshooting services;
  – Receives, responds to and coordinates the resolution of grid operation problems from the sites' and users' points of view.
o EGEE regions:
  – South-Western Europe
  – France
  – UK/Ireland
  – Northern Europe
  – Germany/Switzerland
  – CERN
  – Italy
  – Central Europe
  – South Eastern Europe
  – Russia
  – Asia/Pacific
South-West federation
o The EGEE South-West federation is part of the European Grid Operation, Support and Management activity (SA1).
o Responsible for maintaining high quality services of the grid infrastructure inside the South-West region:
  – Portugal: LIP;
  – Spain: CESGA, CSIC, PIC, CIEMAT, BIFI;
  – PIC is the “Tier 1” centre of the SWE federation.
o The EGEE SWE ROC is shared among the different institutes:
  – This requires a higher coordination effort;
    § All operations/management questions are reported weekly to the ROC manager during a VRVS meeting;
    § This promotes communication between the different site managers;
    § It also promotes the knowledge exchange necessary for a faster resolution of problems.
South-West federation resources
o The EGEE South-West federation is presently offering…
  – Core services for the production testbed (13/10/2006):
    § 8 Resource Brokers;
    § 8 top BDII machines;
    § 3 LFC central catalogs;
    § 1 FTS service.
  – Local services for the production infrastructure:
    § 18 Computing Elements;
      • 1052 CPUs = 935.2 normalized CPUs
        (Norm = 1000 SpecInt2000 = Pentium IV @ 2.8 GHz).
    § 18 Storage Elements;
      • 35.4 TB of online storage (disk);
      • 1.5 PB of nearline storage (tape backend).
  – These resources are currently shared, according to the federation's internal policies, by more than 20 virtual organizations.
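The normalization quoted on this slide can be illustrated with a short calculation. This is a minimal sketch, assuming that a normalized CPU count is simply the sum of per-CPU SpecInt2000 ratings divided by the 1000 SpecInt2000 reference; the function names are hypothetical and the per-CPU ratings of the actual sites are not given in the slides.

```python
# Reference rating from the slide: 1 normalized CPU = 1000 SpecInt2000 (SI2K).
REFERENCE_SI2K = 1000.0


def normalized_cpus(ratings_si2k):
    """Express a list of per-CPU SI2K ratings in units of the reference CPU.

    Assumption: normalization is a plain sum divided by the reference rating.
    """
    return sum(ratings_si2k) / REFERENCE_SI2K


def average_rating_si2k(physical_cpus, normalized):
    """Average SI2K rating per physical CPU implied by the two totals."""
    return normalized * REFERENCE_SI2K / physical_cpus


# Totals quoted on the slide: 1052 CPUs = 935.2 normalized CPUs.
avg = average_rating_si2k(1052, 935.2)
print(f"average rating: about {avg:.0f} SI2K per CPU")
```

Under this assumption, the federation's average CPU is rated at roughly 89% of the reference machine.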
SWE ROC tasks: Site certification
o The SWE ROC is responsible for certifying that a site fulfills the necessary requirements to join the grid production infrastructure:
  – Performed by LIP in Portugal;
  – Performed by PIC in Spain;
  – The certification process consists of a set of demanding tests:
    § Information system;
    § Site configuration;
    § Interactions with the central core services.
  – The ROC negotiates service level agreements (SLAs):
    § These settle the level of service each Resource Centre (RC) should provide to the infrastructure.
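The certification step can be pictured as a checklist in which a site is accepted only if every test passes. The sketch below is illustrative only: the check functions, the dictionary layout of a site and the field names are all hypothetical and do not reproduce the actual EGEE certification tests.

```python
def certify(site, checks):
    """Run every named check against a site description.

    Returns (certified, results): certified is True only if all checks pass.
    """
    results = {name: bool(check(site)) for name, check in checks.items()}
    return all(results.values()), results


# Hypothetical checks mirroring the three test areas named on the slide.
CHECKS = {
    "information system": lambda s: s.get("publishes_info", False),
    "site configuration": lambda s: s.get("config_valid", False),
    "core service interaction": lambda s: s.get("reaches_core_services", False),
}

candidate = {
    "publishes_info": True,
    "config_valid": True,
    "reaches_core_services": False,
}
certified, report = certify(candidate, CHECKS)
print(certified, report)
```

Returning the per-check results alongside the verdict makes it clear which area a failing site has to fix before it is retested.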
SWE ROC tasks: Accounting
o The EGEE South-West federation was one of the first to widely deploy grid accounting tools;
  – CESGA is the entity responsible for maintaining the accounting portal inside the South-West federation;
  – The most relevant information is compiled monthly and reported to the ROC and federation members.
o Due to its expertise, CESGA was proposed as the entity responsible for the “Accounting enforcement task”…
  – Monitor all the EGEE infrastructure;
  – Check whether all the Resource Centres are publishing correct accounting information, and open tickets if they are not;
  – Help the Resource Centres to deploy the necessary accounting tools;
o … and to take charge of the “EGEE View”:
  – Portal with accounting information from all EGEE sites.
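The enforcement check described above amounts to comparing the list of certified sites with the sites that actually published accounting records. A minimal sketch, assuming a hypothetical mapping from site name to the number of records published in the reporting period (the site names are invented):

```python
def sites_needing_tickets(certified_sites, published_records):
    """Return the certified sites that published no accounting records.

    published_records maps a site name to its record count for the period;
    a missing entry is treated the same as a count of zero.
    """
    publishing = {site for site, count in published_records.items() if count > 0}
    return sorted(set(certified_sites) - publishing)


certified_sites = ["SITE-A", "SITE-B", "SITE-C"]   # hypothetical names
published = {"SITE-A": 1200, "SITE-B": 0}          # SITE-C reported nothing

for site in sites_needing_tickets(certified_sites, published):
    print(f"open accounting ticket for {site}")
```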
SWE ROC tasks: Accounting
Some SWE accounting charts
→ 949658 jobs
→ 3504204 hours
→ 2870184 hours
SWE ROC tasks: Accounting
Some “EGEE View” charts
SWE ROC tasks: Monitoring
o CIC on Duty (COD) is done by Telefonica I+D, helped by PIC;
o CODs are grid expert teams which manage the day-to-day operation of the grid:
  – Active monitoring of the infrastructure;
  – Take appropriate action to protect the grid from the effects of failing components and to recover from operational problems. Example:
    § A Resource Centre is causing problems by generating invalid information;
    § The COD team opens a ticket to the Resource Centre;
    § The COD team contacts the corresponding ROC operations support line;
    § The COD team informs a network operations centre of suspected failures;
    § COD may remove the RC from the grid if the RC is unresponsive, until the problem has been fixed;
  – Many of these support and troubleshooting roles are undertaken in conjunction with the Regional Operation Centres;
    § It is intended that tools will be developed to automate much of this work.
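The escalation example on this slide can be sketched as a small decision function. This is an illustration under stated assumptions, not the COD procedure itself: the `SiteProblem` type, its fields and the action strings are hypothetical, and the real process involves human judgment at each step.

```python
from dataclasses import dataclass


@dataclass
class SiteProblem:
    site: str
    network_suspected: bool = False   # a network failure is suspected
    site_responded: bool = True       # the RC reacted to the ticket


def cod_actions(problem):
    """List the COD actions for a problem, in the order given on the slide."""
    actions = [
        f"open ticket to {problem.site}",
        f"contact ROC operations support line for {problem.site}",
    ]
    if problem.network_suspected:
        actions.append("inform network operations centre")
    if not problem.site_responded:
        # Last resort: the site stays out until the problem has been fixed.
        actions.append(f"remove {problem.site} from the grid until fixed")
    return actions


for step in cod_actions(SiteProblem("RC-EXAMPLE", site_responded=False)):
    print(step)
```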
SWE ROC tasks: Monitoring
o CESGA maintains a GridICE portal for all the SWE RCs.
  – The GridICE server collects information through specific sensors included in the EGEE middleware:
    § Job information, grid service and fabric monitoring data.
  – Based on some plugins for Nagios, which:
    § Collect the data published by the sites;
    § Keep it in a PostgreSQL database;
    § Show it in a web page.
  – GridICE also includes e-mail notifications about changes in the status of the sites (hosts, important processes, etc.).
o CESGA is also responsible for the SWE monitoring alert system based on SFT/SAM results and GStat:
  – Site Availability Monitoring (SAM):
    § Collection of comprehensive tests that are run daily on each certified site;
  – GStat Monitor:
    § A snapshot of the Grid Information System.
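The collect, store and notify cycle described above can be sketched in a few lines. This is not GridICE code: the table layout and function names are hypothetical, and SQLite from the Python standard library stands in for the PostgreSQL database so that the example is self-contained.

```python
import sqlite3


def store_samples(conn, samples):
    """Persist (site, metric, value) tuples collected from the sites."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS status (site TEXT, metric TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO status VALUES (?, ?, ?)", samples)
    conn.commit()


def status_changes(previous, current):
    """Return the sites whose status differs between two snapshots."""
    return sorted(s for s in current if previous.get(s) != current[s])


conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL database
store_samples(conn, [("SITE-A", "host", "up"), ("SITE-B", "host", "down")])
stored = conn.execute("SELECT COUNT(*) FROM status").fetchone()[0]

# A notification would be sent for every site returned here.
changed = status_changes(
    {"SITE-A": "up", "SITE-B": "up"},
    {"SITE-A": "up", "SITE-B": "down"},
)
print(stored, changed)
```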
SWE ROC tasks: Support
o The regional EGEE South-West federation help desk portal is maintained by CSIC-IFIC:
  – Users and administrators from the SWE federation can open tickets;
o The coordination of the user support services inside the federation is handled by LIP:
  – It is LIP's responsibility to follow all tickets assigned to the SWE federation;
  – LIP makes sure that they are routed to the correct RC and solved in time;
  – The SWE ROC is automatically warned (and acts accordingly) when:
    § Tickets are opened by users or COD staff on federation sites;
    § SAM or any other monitoring tool reports failures…
SWE ROC tasks: Support
o The SWE help desk portal interacts with the EGEE Global Grid User Support (GGUS);
o GGUS is a trouble ticketing system:
  – Grid users and administrators can open tickets asking for help;
    § Users can start a ticket using independent regional portals. Local experts can try to solve the problem or assign it to the central GGUS service;
    § A ticket can also be opened directly in GGUS via a web form or e-mail;
  – The first line of support is provided by “Ticket Processing Managers” (TPMs):
    § TPM teams are composed of 3 Grid experts, who change on a weekly basis;
    § TPMs either provide a solution to a given grid operation problem or assign the issue to a more specialized support unit.
  – Support is assured 5 days a week, 9 hours a day;
  – GGUS is used to start COD trouble tickets when the monitoring jobs fail;
o LIP contributes one “Ticket Processing Manager” team for the general GGUS tasks.
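The support window and the weekly TPM rotation can be illustrated as follows. The slide only states "5 days a week, 9 hours a day" and a weekly change of team, so the Monday to Friday, 09:00 to 18:00 window and the rotation by ISO week number used here are assumptions, as are the function and team names.

```python
from datetime import datetime

# Assumed support window; the slides do not say which 9 hours are covered.
SUPPORT_START_HOUR = 9
SUPPORT_END_HOUR = 18


def in_support_hours(moment):
    """True if the moment falls on a weekday inside the assumed window."""
    return moment.weekday() < 5 and SUPPORT_START_HOUR <= moment.hour < SUPPORT_END_HOUR


def tpm_team_on_duty(moment, teams):
    """Pick the team for the ISO week of the given moment (assumed scheme)."""
    week = moment.isocalendar()[1]
    return teams[week % len(teams)]


teams = ["team-1", "team-2", "team-3"]   # hypothetical team names
friday = datetime(2006, 10, 13, 10, 0)   # a Friday morning
saturday = datetime(2006, 10, 14, 10, 0)

print(in_support_hours(friday), in_support_hours(saturday))
print(tpm_team_on_duty(friday, teams))
```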
SWE ROC tasks: Support
Regional SWE help-desk
SWE ROC tasks: Authentication and Security
o The issuance of valid certificates for EGEE in the SWE region is operated by:
  – LIP, through the LIP Certification Authority (LIPCA), in Portugal;
  – CSIC-IFCA and PK-IRISGRID in Spain.
o These CAs are members of the European Policy Management Authority for Grid Authentication in e-Science (EUGridPMA).
  – EUGridPMA coordinates a Public Key Infrastructure (PKI) used in the issuance of X.509 certificates;
o SWE CAs participate in the body of EUGridPMA and in the revision of the CP/CPS (Certificate Policy/Certification Practice Statement).
o LIP (in Portugal) and RED.ES (in Spain) are responsible for security coordination and for handling security incidents.
SWE ROC tasks: Middleware integration
o gLite is the middleware layer developed by EGEE.
  – Extends the use of the grid infrastructure to all fields of science;
  – Follows a Service Oriented Architecture (SOA):
    § Decreases the middleware dependence on the user's applications and interactions with the different services.
o gLite middleware doesn't support all local resource management systems (LRMS):
  – Only the LSF and Torque/Maui batch schedulers are supported by default;
  – LIP and CESGA, together with IC, are involved in an EGEE task force to provide gLite support for the SGE batch system:
    § New jobmanager implementation;
    § New infoprovider scripts;
    § Upgrade of the yaim installation procedure.
SWE pre-production testbed
o In parallel with the EGEE production testbed, some SWE sites also participate in a pre-production testbed:
  – CESGA, CSIC-IFIC, LIP and PIC;
o Objectives of the pre-production testbed:
  – Test new middleware releases;
    § First contact with new services;
    § Test all service interactions/interconnections;
    § Report bugs to the developers;
    § Test bug fixes;
  – Release the middleware packages/patches which were correctly validated to the production testbed;
o The SWE ROC participates in the validation process of middleware components and helps with their deployment in the RCs.
Summary & Conclusions
o We have presented the main EGEE SWE federation activities:
  – Its resources for the production testbed;
  – Its operation and regional management procedures;
  – Its responsibilities in some general EGEE tasks:
    § Certification;
    § Accounting;
    § Support;
    § Monitoring;
    § Authentication;
    § Middleware tests and integration;
  – Further details regarding EGEE SWE federation activities can be obtained by consulting the SWE portal maintained by CSIC-IFCA.
o This presentation aims at a better understanding of the EGEE project and its fundamental organization, and at acknowledging how the different resources work together to deliver high quality services to the users.