- Number of slides: 26

ELFms: status and deployment
Germán Cancio for CERN IT/FIO
HEPiX spring 2004, Edinburgh, 25/5/2004

Outline
- ELFms and its subsystems:
  - Quattor
  - Lemon
  - LEAF
- Deployment status
ELFms – German Cancio - n° 2

ELFms in a nutshell
ELFms stands for ‘Extremely Large Fabric management system’.
Subsystems:
- Quattor: configuration, installation and management of nodes
- Lemon: system / service monitoring
- LEAF: hardware / state management
[Diagram labels: Node Configuration Management, Node Management]
- ELFms manages and controls most of the nodes in the CERN CC
  - ~2100 nodes out of ~2400
  - Multiple functionality and cluster size (batch nodes, disk servers, tape servers, DB, web, …)
  - Heterogeneous hardware (CPU, memory, HD size, ...)
  - Linux (RH) and Solaris (9)

http://quattor.org

Quattor
Quattor takes care of the configuration, installation and management of fabric nodes
- A Configuration Database holds the ‘desired state’ of all fabric elements
  - Node setup (CPU, HD, memory, software RPMs/PKGs, network, system services, location, audit info…)
  - Cluster (name and type, batch system, load balancing info…)
  - Defined in templates arranged in hierarchies – common properties set only once
- Autonomous management agents running on the node for
  - Base installation
  - Service (re-)configuration
  - Software installation and management
Quattor was developed in the scope of EU DataGrid. Development and maintenance are now coordinated by CERN/IT.

Configuration Database
[Architecture diagram. Labels: GUI, CLI and Scripts (SOAP) in front of the CDB; pan templates and XML profiles; RDBMS accessed via SQL by LEAF, LEMON and others; HTTP transport to the Node, with the CCM cache and the Node Management Agents.]

Configuration Database (example)
[Same architecture diagram, annotated with an example template hierarchy:]
- CERN CC: name_srv1: 137.138.16.5, time_srv1: ip-time-1
  - lxbatch: cluster/name: lxbatch, master: lxmaster01, pkg_add (lsf 5.1)
  - lxplus: cluster/name: lxplus, pkg_add (lsf 5.1)
    - Nodes lxplus001, lxplus020, lxplus029, with per-node values such as eth0/ip: 137.138.4.246, eth0/ip: 137.138.4.225 and pkg_add (lsf 6_beta)
  - disk_srv

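The idea of hierarchical templates, where common properties are set once and more specific levels override them, can be sketched in plain Python. This is not pan syntax and not Quattor code: the function name `resolve_profile`, the key `lsf_version`, and the assignment of particular values to lxplus001 are assumptions made for illustration only.

```python
from typing import Any, Dict, List

Profile = Dict[str, Any]


def resolve_profile(layers: List[Profile]) -> Profile:
    """Merge layers from most general to most specific; later layers win."""
    merged: Profile = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Site-wide settings, defined once for every node.
site = {"name_srv1": "137.138.16.5", "time_srv1": "ip-time-1"}

# Cluster-level settings shared by all lxplus nodes.
cluster_lxplus = {"cluster/name": "lxplus", "lsf_version": "5.1"}

# Node-level overrides (hypothetical assignment of values to this node).
node_lxplus001 = {"eth0/ip": "137.138.4.246", "lsf_version": "6_beta"}

profile = resolve_profile([site, cluster_lxplus, node_lxplus001])
print(profile)
```

The node inherits the name and time servers from the site level and only states what differs, here a newer LSF version than the rest of its cluster.
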
Managing (cluster) nodes
[Diagram. Labels:]
- Software servers: SWRep (RPM, PKG packages), served over http / nfs / ftp
- Managed nodes / standard nodes:
  - SW Package Manager (SPMA) with cache, managing installed software (kernel, system, applications...)
  - Node Configuration Manager (NCM) with CCM, configuring system services (AFS, LSF, SSH, accounting...) from CDB
- Install server: Install Manager and vendor system installer (RH73, RHES, Fedora, …), providing the base OS over nfs/http, dhcp, pxe for node (re)install

Node Management Agents
- NCM (Node Configuration Manager): framework system, where service-specific plug-ins called Components make the necessary system changes to bring the node to its CDB desired state
  - Regenerate local config files (eg. /etc/sshd_config), restart/reload services (SysV scripts)
  - Large number of components available (system and Grid services)
- SPMA (Software Package Mgmt Agent) and SWRep: manage all or a subset of packages on the nodes
  - On production nodes: full control; on development nodes: non-intrusive, configurable management of system and security updates
  - Package manager, not only upgrader (roll-back and transactions)
- Portability: generic framework; plug-ins for NCM and SPMA available for RHL (RH7, RHES3) and Solaris 9
- Scalability to O(10K) nodes
  - Automated replication for redundant / load balanced CDB/SWRep servers
  - Use scalable protocols eg. HTTP and replication/proxy/caching technology

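"Package manager, not only upgrader" means the agent compares what is installed with the desired package list and may also remove or roll back packages. A minimal sketch of that comparison follows; the function `plan_package_ops`, the operation names and all package versions are hypothetical and do not reflect the real SPMA interface.

```python
from typing import Dict, List, Optional, Tuple

Operation = Tuple[str, str, str]  # (action, package name, version)


def plan_package_ops(installed: Dict[str, str],
                     desired: Dict[str, str]) -> List[Operation]:
    """Return the operations needed to move from installed to desired."""
    ops: List[Operation] = []
    for name in sorted(set(installed) | set(desired)):
        have: Optional[str] = installed.get(name)
        want: Optional[str] = desired.get(name)
        if have == want:
            continue  # already in the desired state
        if have is None:
            ops.append(("install", name, want))
        elif want is None:
            ops.append(("remove", name, have))
        else:
            # Covers upgrades and roll-backs alike.
            ops.append(("replace", name, want))
    return ops


# Hypothetical package sets for one node.
installed = {"lsf": "6_beta", "openssh": "3.6", "games": "1.0"}
desired = {"lsf": "5.1", "openssh": "3.6", "afs": "1.2"}

for op in plan_package_ops(installed, desired):
    print(op)
```

The sketch only plans the operations. A real agent would additionally apply them as one transaction, so that a failure leaves the node in its previous state.
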
http://cern.ch/lemon

Lemon – LHC Era Monitoring

LEMON
- MSA Agent available since early 2002
  - Continuous functionality improvements, especially in the sensor and repository interface
  - Large number of sensors
  - Ported to and tested on Solaris
- Stable Oracle-backend MR (Monitoring Repository) since Sept 2003
  - Keeps current and historical samples – no aging out of data but archiving
  - Flat-file MR available as well
- The Correlation Engine framework allows plug-in correlations accessing collected metrics and external information (eg. quattor CDB, LSF)
  - Eg. average number of users on LXPLUS, total number of active LCG batch nodes
- An ‘actuator’ sensor is being developed for local fault recovery
  - Eg. cleaning up /tmp if occupancy > x %, restart daemon D if dead
- RRD based status display pages
  - See Miro’s talk (next!) for more details
- As with Quattor, LEMON is an EDG development now maintained by CERN/IT

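The actuator example ("cleaning up /tmp if occupancy > x %") boils down to a threshold check on a sampled metric. The sketch below shows only that decision, using the standard-library `shutil.disk_usage`; the function names and the 90 % threshold are assumptions, and it is not LEMON sensor code. It deliberately deletes nothing.

```python
import shutil
import tempfile


def occupancy_percent(used: int, total: int) -> float:
    """Fraction of the filesystem in use, as a percentage."""
    return 100.0 * used / total


def should_clean(path: str, threshold_percent: float) -> bool:
    """True if the filesystem holding `path` is fuller than the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return occupancy_percent(usage.used, usage.total) > threshold_percent


tmp_dir = tempfile.gettempdir()
if should_clean(tmp_dir, threshold_percent=90.0):  # 90 % is an arbitrary choice
    print(f"{tmp_dir}: above threshold, a recovery action would run here")
else:
    print(f"{tmp_dir}: below threshold, nothing to do")
```

Keeping the decision (`should_clean`) separate from the recovery action makes the threshold easy to test and leaves the action itself as a site-specific plug-in.
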
http://cern.ch/leaf

LEAF – LHC Era Automated Fabric
LEAF (LHC Era Automated Fabric): collection of workflows for automated node hardware and state management
  - HMS: Hardware Management System
  - SMS: State Management System
- HMS and SMS interface to Quattor and LEMON (or rather: sit on top!) for setting/getting node information respectively
- HMS and SMS report desired and current state of the nodes, and progress through the workflows

LEAF: HMS and SMS
- HMS (Hardware Management System):
  - Tracks systems through all steps in their lifecycle, eg. installation, moves, vendor calls, retirement
  - Handles multiple nodes at a time (eg. racks)
  - Automatically issues requests (installs, retirements etc.) to technicians
  - PC finder to locate equipment physically
  - HMS implementation is CERN specific, but concepts and design should be generic
- SMS (State Management System):
  - Automated handling of high-level configuration steps, eg.
    - Reconfigure and reboot all LXPLUS nodes for new kernel
    - Reallocate nodes inside LXBATCH for Data Challenges
    - Drain and reconfigure node X for diagnosis / repair operations
  - Extensible framework – plug-ins for site-specific operations possible
  - Issues all necessary (re)configuration commands on top of quattor CDB and NCM
    - Uses a state transition engine

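A state transition engine of the kind SMS is said to use can be sketched as a table of allowed transitions plus a function that refuses anything not in the table. The state names and transitions below are hypothetical, chosen to mirror the "drain and reconfigure node X for repair" example; they are not taken from the SMS implementation.

```python
from typing import Dict, List, Set

# Hypothetical node states and the transitions permitted between them.
ALLOWED: Dict[str, Set[str]] = {
    "production": {"draining"},
    "draining": {"maintenance", "production"},
    "maintenance": {"production"},
}


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target


def run_workflow(start: str, steps: List[str]) -> str:
    """Apply a sequence of transitions and return the final state."""
    state = start
    for step in steps:
        state = transition(state, step)
    return state


# Drain a node, repair it, then return it to production.
final_state = run_workflow("production", ["draining", "maintenance", "production"])
print(final_state)
```

Encoding the workflow as data means a site can add its own states or transitions without touching the engine, in line with the plug-in approach described on the slide.
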
LEAF screenshots

ELFms status – Quattor (I)
- Manages (almost) all Linux boxes in the computer centre
  - ~2100 nodes, to grow to ~8000 in 2006-8
  - LXPLUS, LXBATCH, LXBUILD, disk and tape servers, Oracle DB servers
  - Solaris clusters, server nodes and desktops to come for Solaris 9
- Starting: head nodes using Apache proxy technology for software and configuration distribution
- Misc developments pending, like
  - Fine-grained ACL protection to templates
  - HTTPS instead of HTTP for CDB profile and SW transport

ELFms status – Quattor (II)
- LCG-2 WN configuration components available
  - Configuration components for RM, EDG/LCG setup, Globus
  - Progressive reconfiguration of LXBATCH nodes as LCG-2 WNs
- Community-driven effort to use quattor for general LCG-2 configuration
  - Coordinated by staff from IN2P3 and NIKHEF
  - Aim is to provide a complete port of EDG-LCFG config components to Quattor
  - CERN and UAM Madrid providing generic installation instructions and site-independent packaging, as well as a Savannah development portal
- EGEE has chosen quattor for managing their integration testbeds
- Tier 1/2 sites as well as LHC experiments are evaluating quattor for managing their own farms

ELFms status – LEMON (I)
- Smooth production running of MSA agent and Oracle-based repository at CERN-CC
  - 150 metrics, sampled at intervals from 30 s to 1 d
  - ~1 GB of monitoring data / day on ~2100 nodes
  - New sensors and metrics, eg. tape robots, temperature, SMART disk info
- GridICE project uses LEMON for data collection
- Gathering experiment requirements and interfacing to grid-wide monitoring systems (MonALISA, GridICE)
  - Good interaction with, and feedback gathered from, CMS DC04
  - Archived raw monitoring data will be used for CMS computing TDR
- Visualization:
  - Operators: test interface to new generation alarm systems (LHC control alarm system)
  - Sys managers: finish status display pages (Miro’s talk)

ELFms status – LEMON (II)
- Work on redundancy solutions for Monitoring Repository (homegrown and/or Oracle Streams)
- Quality of Service indicators, correlations and actuators (in collaboration with BARC India)
  - Eg. “tell LEAF to reassign two more nodes from LXBATCH to LXPLUS since capacity insufficient”
  - Provide batch job mix indicators for improved I/O and CPU load equilibrium

ELFms status – LEAF
- HMS in full production for all nodes in CC
  - HMS heavily used during CC node migration
- SMS in production for LXBATCH
- Next steps:
  - Deploy SMS across more clusters
  - Tighter HMS/SMS integration (automatically put nodes in and out of production during eg. rack moves)
- Developing ‘asset management’ GUI replacing PC finder
  - Client of HMS and SMS
  - Drag&drop nodes to automatically initiate HMS moves
  - Select multiple nodes, then initiate an action, eg. kernel upgrade

Summary
- ELFms is deployed in production at CERN
  - Stabilized results from 3-year developments within EDG and LCG
  - Established technology
  - Providing real added value for day-to-day operations
- Quattor and LEMON are generic software
  - Other projects and sites getting involved
- Site-specific workflows and “glue scripts” can be put on top for smooth integration with existing fabric environments
  - LEAF HMS and SMS
- CERN will help with Quattor (and LEMON) deployment at other sites
  - We provide site-independent software and installation instructions
  - Collaboration for providing missing pieces, eg. configuration components, GUIs, beginner’s user guides?
- More information: http://cern.ch/elfms


