eaebddefc9d704a390ee8f4f46babfe1.ppt
- Number of slides: 16
WP4 report
Plans for testbed 2
Olof.Barring@cern.ch
Olof Bärring – WP4 summary - 4/9/2002 - n° 1

Summary
- Reminder on how it all fits together
- What's in R1.2 (deployed and not-deployed but integrated)
- Piled up software from R1.3, R1.4
- Timeline for R2 developments and beyond
- A WP4 problem
- Conclusions

How it all fits together (job management)
[Diagram: job flow between the grid and the local fabric]
- Actors: Grid User, Local User
- WP4 subsystems: Fabric Gridification, Resource Management, Monitoring
- Other WPs: Resource Broker (WP1), Data Mgmt (WP2), Grid Info Services (WP3), Grid Data Storage (WP5) (mass storage, disk pools)
- Farms: Farm A (LSF), Farm B (PBS)
- Flow labels:
    - Submit job
    - Optimized selection of site
    - Authorize
    - Map grid to local credentials
    - Publish resource and accounting information
    - Select an optimal batch queue and submit
    - Return job status and output

How it all fits together (system mgmt)
[Diagram: interaction of the WP4 subsystems for system management]
- WP4 subsystems: Resource Management, Monitoring & Fault Tolerance, Installation & Node Mgmt, Configuration Management
- Farms: Farm A (LSF), Farm B (PBS)
- Arrow labels:
    - Node malfunction detected
    - Node OK detected
    - Remove node from queue
    - Wait for running jobs(?)
    - Put back node in queue
    - Trigger repair
    - Repair (e.g. restart, reboot, reconfigure, …)
    - Update configuration templates
- Other labels: Information, Invocation, Automation

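The repair cycle named on this slide can be sketched in a few lines of Python. This is only an illustration, not WP4 code: the `Node` class, the `handle_malfunction` function and the exact ordering of the steps are assumptions inferred from the diagram labels.

```python
class Node:
    """Hypothetical farm node; not part of any WP4 API."""

    def __init__(self, name, running_jobs=0):
        self.name = name
        self.running_jobs = running_jobs
        self.in_queue = True
        self.healthy = True


def handle_malfunction(node, repair, log):
    """Assumed ordering: drain the node, repair it, then re-enable it.

    `repair` is any callable that takes the node and returns True on success.
    """
    log.append("remove %s from queue" % node.name)
    node.in_queue = False
    # The slide marks "wait for running jobs" with a question mark;
    # this sketch simply pretends the running jobs finish.
    if node.running_jobs:
        log.append("wait for %d running jobs" % node.running_jobs)
        node.running_jobs = 0
    log.append("repair %s" % node.name)
    node.healthy = repair(node)
    if node.healthy:
        # Corresponds to "Node OK detected" followed by "Put back node in queue".
        log.append("put %s back in queue" % node.name)
        node.in_queue = True
    return node.in_queue
```

A node whose repair fails stays out of the queue, which is the behaviour one would presumably want from the "Node OK detected" check.
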
How it all fits together (node autonomy)
[Diagram: central services versus node-local components]
- Central (distributed): Correlation engines, Monitoring Measurement Repository, Configuration Data Base
- On the node: Automation, Node mgmt components, Monitoring Buffer, Cfg cache
- Data labels: Buffer copy, Node profile
- Local recover if possible (e.g. restarting daemons)

What's in R1.2 (and deployed)
- Gridification:
    - Library implementation of LCAS

What's in R1.2 but not used/deployed
- Resource management
    - Information provider for Condor (not fully tested because you need a complete testbed including a Condor cluster)
- Monitoring
    - Agent + first prototype repository server + basic linuxproc sensors
    - No LCFG object, hence not deployed
- Installation mgmt
    - LCFG light exists in R1.2. Please give us feedback on any problems you have with it.

Piled up software from R1.3, R1.4
- Everything mentioned here is ready, unit tested and documented (and rpms are built by autobuild)
    - Gridification
        - LCAS with dynamic plug-ins (already in R1.2.1???)
    - Resource mgmt
        - Complete prototype enterprise level batch system management with proxy for PBS (see next slide). Includes LCFG object.
    - Monitoring
        - New agent. Production quality. Already used on CERN production clusters sampling some 110 metrics/node. Has also been tested on Solaris.
        - LCFG object
    - Installation mgmt
        - Next generation LCFG: LCFGng for RH 6.2 (RH 7.2 almost ready)

Enterprise level batch system mgmt prototype (R1.3)
[Diagram: path of grid jobs through the resource management system]
- Grid: job 1, job 2, … job n
- Local fabric: Gatekeeper (Globus or WP4); job managers JM 1, JM 2, … JM n
- Legend: Globus components, RMS components
- RMS: Scheduler, Runtime Control System
- Queues:
    - user queue 1, user queue 2 (stopped, visible for users)
    - execution queue (started, invisible for users)
- Arrow labels: submit new jobs, scheduled jobs, get job info, move job, exec job, move, queues, resources
- Batch system: PBS, LSF, etc. (PBS-, LSF-Cluster)

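The queue arrangement on this slide (jobs wait in stopped user queues and a scheduler moves them into a started execution queue) can be illustrated with a short Python sketch. The function name, the `free_slots` parameter and the round-robin policy are assumptions made for illustration; the slide does not say which scheduling policy the prototype uses.

```python
from collections import deque


def schedule(user_queues, execution_queue, free_slots):
    """Move waiting jobs from user queues into the execution queue.

    user_queues: dict mapping queue name to a deque of job ids.
    Round-robin over the queue names is an assumed policy, chosen only
    to keep the example small.
    """
    moved = []
    while free_slots > 0 and any(user_queues.values()):
        for name in sorted(user_queues):
            queue = user_queues[name]
            if queue and free_slots > 0:
                job = queue.popleft()
                execution_queue.append(job)  # the "move job" arrow
                moved.append(job)
                free_slots -= 1
    return moved


user_queues = {
    "user queue 1": deque(["job 1", "job 2"]),
    "user queue 2": deque(["job 3"]),
}
execution_queue = deque()
moved = schedule(user_queues, execution_queue, free_slots=2)
```

With two free slots, one job is taken from each user queue and the remaining job stays visible to its owner in the stopped user queue.
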
Timeline for R2 developments
- Configuration management: complete central part of framework
    - High Level Definition Language: 30/9/2002
    - PAN compiler: 30/9/2002
    - Configuration Database (CDB): 31/10/2002
- Installation mgmt
    - LCFGng for RH 7.2: 30/9/2002
- Monitoring: complete final framework
    - TCP transport: 30/9/2002
    - Repository server: 30/9/2002
    - Repository API WSDL: 30/9/2002
    - Oracle DB support: 31/10/2002
    - Alarm display: 30/11/2002
    - Open Source DB (MySQL or PostgreSQL): mid-December 2002

Timeline for R2 developments
- Resource mgmt
    - GLUE info providers: 15/9/2002
    - Maintenance support API (e.g. enable/disable a node in the queue): 30/9/2002
    - Provide accounting information to WP1 accounting group: 30/9/2002
    - Support Maui as scheduler
- Fault tolerance framework
    - Various components already delivered
    - Complete framework by end of November

Beyond release 2
- Conclusion from WP4 workshop, June 2002: LCFG is not the future for EDG (see WP4 quarterly report for 2Q02) because:
    - Inherent LCFG constraints on the configuration schema (per-component config)
    - LCFG is a project of its own and our objectives do not always coincide
    - We have learned a lot from the LCFG architecture and we continue to collaborate with the LCFG team
- EDG future: first release by end-March 2003
    - Proposal for a common schema for all fabric configuration information to be stored in the configuration database, implemented using the HLDL.
    - New configuration client and node management replacing the LCFG client (the server side will already have been delivered in October).
    - New software package management (replacing updaterpms) split into two modules: an OS independent part and an OS dependent part (packager).

Global schema tree
[Diagram: global schema tree]
- Node labels, in the order they appear on the slide: system; hardware; CPU, harddisk, memory, …; sw; hostname, architecture, partitions, services, …; sys_name, interface_type, size, …; hda1, hda2, …; size, type, id; edg_lcas; cluster; packages, known_repositories, edg_lcas, …; version, repositories, …
- Annotation: Component specific configuration
- The population of the global schema is an ongoing activity
- http://edms.cern.ch/document/352656/1

Global schema example
SW repository structure (maintained by repository managers):
/sw/known_repositories/Arep
    /url = (host, protocol, prefix dir)
    /owner =
    /extras =
    /directories
        /dir_name_X
            /path = (asis)
            /platform = (i386_rh61)
            /packages/pck_a
                /name = (kernel)
                /version = (2.4.9)
                /release = 31.1.cern
                /architecture = (i686)
        /dir_name_Y
            /path = (sun_system)
            /platform = (sun4_58)
            /packages/pck_b
                /name = (SUNWcsd)
                /version = 11.7.0
                /release = 1998.09.01.04.16
                /architecture = (?)

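To make the path notation concrete, here is a small Python sketch that stores part of the example as nested dictionaries and resolves a slash-separated path. The nested-dict representation, the `lookup` helper and the exact nesting of `packages` under each directory are assumptions for illustration only; the real schema is defined in the HLDL and kept in the configuration database.

```python
# Part of the slide's example, held as plain nested dictionaries (an assumption).
schema = {
    "sw": {
        "known_repositories": {
            "Arep": {
                "directories": {
                    "dir_name_X": {
                        "path": "asis",
                        "platform": "i386_rh61",
                        "packages": {
                            "pck_a": {
                                "name": "kernel",
                                "version": "2.4.9",
                                "release": "31.1.cern",
                                "architecture": "i686",
                            }
                        },
                    }
                }
            }
        }
    }
}


def lookup(tree, path):
    """Hypothetical helper: walk the tree one path element at a time."""
    node = tree
    for part in path.strip("/").split("/"):
        node = node[part]  # raises KeyError if the path does not exist
    return node


version = lookup(
    schema,
    "/sw/known_repositories/Arep/directories/dir_name_X/packages/pck_a/version",
)
```

Here `version` is the string "2.4.9", the kernel version given in the example.
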
Problem
- Very little of the delivered WP4 software is of any interest to the EDG application WPs, possibly with the exception of producing nice colour plots of the CPU loads when a job was run…
- This is normal, but…
    Site administrators do not grow on trees. Because good system admin tools, like the ones WP4 is trying to develop, are lacking, the configuration, installation and supervision of the testbed installations require a substantial amount of manual work.
- However, thanks to Bob's new priority list, the need for automated configuration and installation has bubbled up the required-features stack to become absolutely vital for assuring good quality.

Summary
- Substantial amount of s/w piled up from R1.3, R1.4 to be deployed now
- R2 also includes two large components:
    - LCFGng: migration is non-trivial, but we already perform as much as possible of the non-trivial part ourselves, so TB integration should be smooth
    - Complete monitoring framework
- Beyond R2: LCFG is not the future for EDG WP4. First version of the new configuration and node management system in March 2003