Large Farm 'Real Life Problems' and their Solutions
Thorsten Kleinwort, CERN IT/FIO
HEPiX II/2004, BNL, 20.10.2004
Outline
Farms at the CERN CC:
• The Tools Framework
• The Working Teams
• Real Life Use Cases
• Collaborations
• Summary
• Useful Links
20 October 2004, Thorsten Kleinwort, IT/FIO/FS
The Tools Framework
• ELFms = Quattor + Lemon + LEAF
• Quattor:
  • Installation (Kickstart + SWREP)
  • Configuration (CDB + NCM)
  • Management (SPMA + NCM)
• Lemon:
  • Monitoring
  • Batch system statistics
• LEAF:
  • State management (SMS)
  • Hardware management (HMS)
The Tools Framework (cont'd)
• The evolution of the ELFms tools is described in various previous presentations:
  • HEPiX II/2003 (Vancouver):
    • 'The new Fabric Management Tools in Production at CERN'
  • HEPiX I/2004 (Edinburgh):
    • 'ELFms, status, deployment' by German Cancio
    • 'Lemon Web Monitoring' by Miroslav Siket
  • CHEP 2004 (Interlaken):
    • 'Current Status of Fabric Management at CERN' by German Cancio
  • This HEPiX:
    • 'Experience in the use of quattor tool suite outside CERN'
=> Progress has been made, improvements are ongoing, and Quattor is used more and more outside CERN
Tools (cont'd)
• Other tools [interfacing CDB]:
  • Script: PrepareInstall.pl:
    • Does all the steps necessary to prepare a machine install
    • Can run with a list of hosts (for mass installs)
    • Gets all the necessary information from CDB
    • Creates a kickstart file for each node
  • Local script: maintenance:
    • Script to run down a node:
      • Drains batch nodes
      • Warns users on interactive nodes
      • Can execute a configurable script at the end, e.g. reboot
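The per-node kickstart generation described above can be illustrated with a minimal Python sketch. This is not the PrepareInstall.pl code: the dictionary `FAKE_CDB` is a hypothetical stand-in for the data the real script reads from CDB, the host names and fields are invented, and the template is only a small illustrative kickstart fragment.

```python
from string import Template

# Hypothetical stand-in for the per-host data the real script reads from CDB.
FAKE_CDB = {
    "node0001": {"os": "linux_a", "cluster": "batch", "disk": "hda"},
    "node0002": {"os": "linux_a", "cluster": "batch", "disk": "sda"},
}

# Illustrative fragment only; the real kickstart files are far more complete.
KICKSTART_TEMPLATE = Template(
    "# kickstart sketch for $host (os $os, cluster $cluster)\n"
    "install\n"
    "clearpart --all --drives=$disk\n"
)


def render_kickstart(host, cdb):
    """Fill the template with the configuration stored for one host."""
    profile = cdb[host]  # raises KeyError if the host is not configured
    return KICKSTART_TEMPLATE.substitute(host=host, **profile)


def prepare_install(hosts, cdb):
    """Mass-install case: return one kickstart text per host in the list."""
    return {host: render_kickstart(host, cdb) for host in hosts}


if __name__ == "__main__":
    for name, text in prepare_install(["node0001", "node0002"], FAKE_CDB).items():
        print(f"--- {name} ---")
        print(text)
```

The point of the design is that the host list is the only input; everything else comes from the central configuration, so a mass install is the same operation repeated per host.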
Tools (cont'd)
• Automated Fabric [LEAF]:
  • State Management System SMS:
    • Other CDB changes are done by SMS:
      • Change OS/Cluster
    • Systems have state:
      • 'production' or 'standby'
  • Hardware Management System HMS:
    • Workflow to track hardware changes [interfaces CDB]:
      • New machine arrival
      • Machine moves
      • Machine interventions (Vendor calls), retirements
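As a rough illustration of the state handling described above, the following Python sketch models a node with the two states named on the slide. It is not the SMS implementation: the `Node` class, the function names, and in particular the rule that a node must be in standby before its cluster is changed are assumptions made for this example.

```python
from dataclasses import dataclass, field

VALID_STATES = ("production", "standby")  # the two states named on the slide


@dataclass
class Node:
    name: str
    cluster: str
    state: str = "standby"
    log: list = field(default_factory=list)


def set_state(node, new_state):
    """Switch a node between the known states and record the change."""
    if new_state not in VALID_STATES:
        raise ValueError(f"unknown state: {new_state!r}")
    node.log.append(f"state {node.state} -> {new_state}")
    node.state = new_state


def change_cluster(node, new_cluster):
    """Reassign a node to another cluster."""
    # Assumed policy, not from the slides: only a standby node may be reassigned.
    if node.state != "standby":
        raise RuntimeError(f"{node.name} must be in standby before reassignment")
    node.log.append(f"cluster {node.cluster} -> {new_cluster}")
    node.cluster = new_cluster


if __name__ == "__main__":
    node = Node("node0001", "batch", state="production")
    set_state(node, "standby")
    change_cluster(node, "other_cluster")  # hypothetical cluster name
    set_state(node, "production")
    print(node.log)
```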
The Working Teams
"Customers":
• Other groups/teams in CERN-IT, like:
  • DB (ORACLE)
  • GD (LCG)
  • GM (EGEE)
• Experiments (Data Challenges)
Service Manager:
• Farm/Cluster resource planning
• Writing/improving the procedures/tools
• Following up on new/changing requirements
• Acting on problems
SysAdmins:
• New team
• Now 7 staff, more to hire
• Running more and more services
• Doing most of the install and maintenance work on farm PCs
• Following up h/w failures ('Vendor calls')
Operator:
• 24/7 in the CC
• Alarm display
• Following procedures on alarms:
  • Open Remedy tickets
  • Email/phone notification
  • Machine reboots
Another Management Tool
• Remedy:
  • The problem tracking tool in CERN IT
  • Used in different workflows, e.g. by:
    • The Operator to open tickets following up on alarms
    • The Service Managers to ask for machine interventions
    • The SysAdmins to follow up on problems/general issues
  • HMS is implemented as a Remedy workflow as well
  • Recently started to get statistics on hardware failures
Real Life Use Cases
• Kernel upgrade (on LXBATCH, ~1500 hosts):
  1. Put the new software into the repository (SWREP, precaching)
  2. Put the new kernel RPM on the nodes: SPMA, with multi-package option (old kernel is still running!)
  3. Configure the new kernel version for the cluster in CDB, and run the GRUB NCM component to configure the node
  4. Drain the nodes by disabling new batch jobs (maintenance)
Real Life Use Cases
• Kernel upgrade (cont'd):
  5. Node reboots when it is drained (could be at any time)
  6. Machine comes up with the new kernel and goes back into production immediately
=> Least downtime for each node. Capacity is always available:
• First reboot is instantaneous, the last one can be several days later
• Everything runs automatically; some cleanup has to be done for a few machines (node doesn't shut down, or h/w failure on startup) => caught by the monitoring/alarm
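The "drain, then reboot" idea behind steps 4 to 6 can be made concrete with a small Python simulation. This is a sketch under stated assumptions, not the maintenance script: the remaining job times, the 15-minute reboot duration, and the function names are all invented for illustration.

```python
def rolling_reboot(remaining_job_hours, reboot_hours=0.25):
    """Return {node: (reboot_start, back_in_production)} in hours from now.

    Each node stops accepting new jobs immediately and reboots as soon as
    its last running job ends, independently of all other nodes.
    """
    schedule = {}
    for node, remaining in remaining_job_hours.items():
        start = remaining
        schedule[node] = (start, start + reboot_hours)
    return schedule


def nodes_rebooting_at(schedule, t):
    """Count the nodes that are in the middle of their reboot at time t."""
    return sum(1 for start, end in schedule.values() if start <= t < end)


if __name__ == "__main__":
    # Hypothetical drain times: an idle node, a short job, a multi-day job.
    jobs = {"node0001": 0, "node0002": 5, "node0003": 72}
    plan = rolling_reboot(jobs)
    for node, (start, end) in plan.items():
        print(f"{node}: reboot at +{start}h, back at +{end}h")
    print("rebooting at +0.1h:", nodes_rebooting_at(plan, 0.1))
```

Because each node is only unavailable for its own reboot window, the number of nodes rebooting at any moment stays small even though the whole upgrade may take days to complete.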
Real Life Use Cases (cont'd)
Configure batch resources (LSF):
• LSF resources are defined depending on availability, power and cluster of machines
• Resources are defined in CDB
• Configured on the node using NCM
• The master file is generated from CDB2SQL in a cron job every day (reconfig takes several minutes)
• Consistency of client/master due to CDB
• Resource assignments are done in CDB on (sub-)cluster level (template structure)
• Reassignments of (sub-)clusters in CDB are done with SMS tools
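Generating the master file from a SQL view of the configuration can be sketched as follows. This is an assumption-laden illustration: an in-memory SQLite database stands in for CDB2SQL, the table and column names are hypothetical, the host and resource names are invented, and the output is a simplified layout that only loosely resembles a host section of an LSF cluster file.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node (hostname TEXT, subcluster TEXT)")
    conn.execute("CREATE TABLE subcluster_resource (subcluster TEXT, resource TEXT)")
    conn.executemany(
        "INSERT INTO node VALUES (?, ?)",
        [("node0001", "sub_a"), ("node0002", "sub_a"), ("node0100", "sub_b")],
    )
    conn.executemany(
        "INSERT INTO subcluster_resource VALUES (?, ?)",
        [("sub_a", "res_x"), ("sub_a", "res_y"), ("sub_b", "res_z")],
    )
    return conn


def render_host_section(conn):
    """Resources are assigned per (sub-)cluster; expand them to one line per host."""
    rows = conn.execute(
        "SELECT n.hostname, r.resource FROM node n "
        "JOIN subcluster_resource r ON r.subcluster = n.subcluster "
        "ORDER BY n.hostname, r.resource"
    ).fetchall()
    per_host = {}
    for hostname, resource in rows:
        per_host.setdefault(hostname, []).append(resource)
    lines = ["Begin Host", "HOSTNAME RESOURCES"]  # simplified column set
    for hostname, resources in per_host.items():
        lines.append(f"{hostname} ({' '.join(resources)})")
    lines.append("End Host")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_host_section(build_demo_db()))
```

Since both the client configuration and this generated file derive from the same source, moving a (sub-)cluster to a different resource set changes one row and every host in it follows on the next regeneration.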
Real Life Use Cases (cont'd)
Emptying the Computer Centre
• For the refurbishment of the CERN Computer Centre all machines had to be moved, either from one side to the other, or downstairs (vault)
• ~2000 machines had to be moved
• Taking the opportunity to add machines to CDB:
  • As quattor and non-quattor nodes
• Batch machines were moved in 'racks = 44 nodes':
  • HMS was used to steer the moves
  • SMS/maintenance to shut down the machines
  • Rename/PrepareInstall to bring machines back
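Moving machines rack by rack amounts to processing the host list in fixed-size batches. The Python sketch below shows only that batching idea; the function names and host names are hypothetical, and the three step labels merely echo the tools named on the slide rather than calling them.

```python
def rack_batches(hosts, rack_size=44):
    """Split a host list into consecutive batches of at most rack_size nodes."""
    if rack_size <= 0:
        raise ValueError("rack_size must be positive")
    return [hosts[i:i + rack_size] for i in range(0, len(hosts), rack_size)]


def move_plan(hosts, rack_size=44):
    """Return (rack number, step, node count) tuples in execution order."""
    steps = (
        "shut down (SMS/maintenance)",
        "move (HMS)",
        "bring back (Rename/PrepareInstall)",
    )
    plan = []
    for number, rack in enumerate(rack_batches(hosts, rack_size), start=1):
        for step in steps:
            plan.append((number, step, len(rack)))
    return plan


if __name__ == "__main__":
    hosts = [f"node{i:04d}" for i in range(100)]  # invented host names
    for entry in move_plan(hosts):
        print(entry)
```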
Real Life Use Cases (cont'd)
New h/w arrival => mass installation
• New machines (~400) arrive at CERN (in bunches of 50 to 100)
• Racks have to be prepared:
  • Network equipment
  • Power supply
  • (Console service)
• Plan machine membership (cluster)
• Put machine into CDB:
  • h/w type
  • Cluster type/OS
Real Life Use Cases
New h/w arrival (cont'd)
• Physical machine installation (HMS):
  • New DNS entry
  • OS installation: PrepareInstall
  • Installation by the SysAdmin
  • Burn-in test (h/w test, several days to weeks)
  • Follow up on h/w problems with Vendor
  • Add the machines to the alarm display (SURE)
• Put machines into production
Collaborations
External 'Customers':
• EGEE, LCG, and other groups at CERN are now using Quattor managed machines:
  • They benefit from standard, manageable, and reproducible machine setups
  • They are able to (and should learn to) do modifications themselves
• External sites using Quattor:
  • IN2P3, NIKHEF, UAM Madrid, … are discussing using Quattor or already use it => see Rafael's talk
• This helps to enhance the tools:
  • Service nodes (for LCG-2)
  • Having a wider usage
  • Generalizing components
Summary
ELFms is deployed in production at CERN:
• Established technology: from prototype to production
• Though enhancements are ongoing
• Fundamental part of our infrastructure
• Merged with our existing environment
Quattor and Lemon are generic software:
• Used by others inside/outside CERN
• Hopefully a fruitful collaboration in the future
Useful Links
• ELFms: http://cern.ch/elfms
• Quattor: http://quattor.org/
• Lemon: http://cern.ch/lemon
• LEAF: http://cern.ch/leaf
Previous presentations:
• HEPiX II/2003 (Vancouver): http://www.triumf.ca/hepix2003
  • 'The new Fabric Management Tools in Production at CERN'
• HEPiX I/2004 (Edinburgh): http://www.nesc.ac.uk/esi/events/291/
  • 'ELFms, status, deployment' by German Cancio
  • 'Lemon Web Monitoring' by Miroslav Siket
• CHEP 2004 (Interlaken): http://chep2004.web.cern.ch/chep2004/
  • 'Current Status of Fabric Management at CERN' by German Cancio
Questions?