
Monitoring and Fabric Management
The European DataGrid Project Team
http://www.eu-datagrid.org

Contents
- Monitoring and Fabric management overview
- What is being monitored
- R-GMA
- Fabric management
Monitoring & Fabric Mgmt Tutorial - n° 2

Information and Monitoring Services
- EDG information providers
  - Software that provides information about resources and infrastructure
  - Provided by the work packages responsible for the resource
- Globus MDS (Metacomputing Directory Service, now called the Monitoring and Discovery Service)
  - Based on OpenLDAP, a hierarchical database
- R-GMA (Relational Grid Monitoring Architecture)
  - A relational implementation of the Global Grid Forum's GMA
    - Overview
    - Uses within the testbed

LDAP - Directory Information Tree
[Diagram: tree of entries, with these node labels]
- computing element
- storage element
- site information
- status
- storage elements that are close (not necessarily at the same site)
- supported protocols
- network information between this and other sites
- file statistics

Siteinfo
in=siteinfo, Mds-Vo-name=ral-dev, Mds-Vo-name=uk, o=Grid
objectClass: SiteInfo
objectClass: DataGridTop
objectClass: DynamicObject
siteName: RALDEV
sysAdminContact: grid. [email protected] ac. uk
userSupportContact: grid. [email protected] ac. uk
siteSecurityContact: grid. [email protected] ac. uk
dataGridVersion: 1.2
installationDate: 20020704142800Z

Computing Element
ceId=dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs-M, hn=dev01.hepgrid.clrc.ac.uk, Mds-Vo-name=ral-dev, Mds-Vo-name=uk, o=Grid
objectClass: DataGridTop
objectClass: ComputingElement
CEId: dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs-M
GlobusResourceContactString: dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs:/O=Grid/O=UKHEP/CN=dev01.hepgrid.clrc.ac.uk
GRAMVersion: ?
Architecture: intel
OpSys: RH 6.2
MinPhysicalMemory: 258
MinLocalDiskSpace: 2048
TotalCPUs: 1
FreeCPUs: 1
NumSMPs: 0
MinSPUProcessors: 0
MaxSPUProcessors: 0
TotalJobs: 0
RunningJobs: 0
IdleJobs: 0
MaxTotalJobs: 1
MaxRunningJobs: 1
WorstTraversalTime: 108000
EstimatedTraversalTime: 0
Active: TRUE
Priority: 20
MaxCPUTime: 108000
MaxWallClockTime: 432000
AverageSI00: 300
MinSI00: 300
MaxSI00: 300
AuthorizedUser: /O=Grid/O=UKHEP/OU=hepgrid.clrc.ac.uk/CN=Tim Eves
AuthorizedUser: /O=Grid/O=UKHEP/OU=hepgrid.clrc.ac.uk/CN=Tim Folkes
RunTimeEnvironment: RALDEV
AFSAvailable: FALSE
OutboundIP: TRUE
InboundIP: FALSE
QueueName: M
LRMSType: PBS
LRMSVersion: OpenPBS_2.3
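The Computing Element entry is a set of `name: value` pairs in which some names (objectClass, AuthorizedUser) occur more than once. As a rough illustration of how a client might read such an entry, here is a minimal sketch that assumes the entry is already available as plain text in that layout. The helper `parse_entry` and the "free slot" rule are hypothetical and are not part of the EDG software.

```python
from collections import defaultdict


def parse_entry(text):
    """Parse 'name: value' lines into a dict mapping each name to a list of values."""
    attrs = defaultdict(list)
    for line in text.strip().splitlines():
        line = line.strip()
        if ": " not in line:
            continue  # skip blank lines and anything that is not an attribute
        name, value = line.split(": ", 1)
        attrs[name].append(value.strip())
    return dict(attrs)


# A shortened copy of the Computing Element attributes shown on the slide.
SAMPLE = """
objectClass: DataGridTop
objectClass: ComputingElement
CEId: dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs-M
TotalCPUs: 1
FreeCPUs: 1
RunningJobs: 0
MaxRunningJobs: 1
Active: TRUE
LRMSType: PBS
"""

entry = parse_entry(SAMPLE)

# Hypothetical rule: the queue can take a job if it is active, has a free CPU
# and is below its running-job limit.
has_free_slot = (
    entry["Active"] == ["TRUE"]
    and int(entry["FreeCPUs"][0]) > 0
    and int(entry["RunningJobs"][0]) < int(entry["MaxRunningJobs"][0])
)
```

Keeping every attribute as a list avoids special-casing the multi-valued ones.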

Close Storage Element
closeSE=dev02.hepgrid.clrc.ac.uk, ceId=dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs-M, hn=dev01.hepgrid.clrc.ac.uk, Mds-Vo-name=ral-dev, Mds-Vo-name=uk, o=Grid
objectClass: CloseStorageElement
objectClass: DataGridTop
objectClass: DynamicObject
CEId: dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs-M
CloseSE: dev02.hepgrid.clrc.ac.uk
MountPoint: /flatfiles

Storage Element
seId=dev02.hepgrid.clrc.ac.uk, Mds-Vo-name=ral-dev, Mds-Vo-name=uk, o=Grid
objectClass: StorageElement
objectClass: DataGridTop
objectClass: DynamicObject
SEId: dev02.hepgrid.clrc.ac.uk
CloseCE: dev01.hepgrid.clrc.ac.uk:2119/jobmanager-pbs-M
SEtypearchitecture: disk
SEsize: 13177
SEResourceContactString: grid. [email protected] ac. uk
SEvo: wpsix

Storage Element Protocols

seProtocol=gridftp, seId=dev02.hepgrid.clrc.ac.uk, Mds-Vo-name=ral-dev, Mds-Vo-name=uk, o=Grid
objectClass: StorageElementProtocol
objectClass: DataGridTop
objectClass: DynamicObject
SEId: dev02.hepgrid.clrc.ac.uk
SEProtocol: gridftp
Port: 2811

seProtocol=rfio, seId=dev02.hepgrid.clrc.ac.uk, Mds-Vo-name=ral-dev, Mds-Vo-name=uk, o=Grid
objectClass: StorageElementProtocol
objectClass: DataGridTop
objectClass: DynamicObject
SEId: dev02.hepgrid.clrc.ac.uk
SEProtocol: rfio
Port: 3147

seProtocol=file, seId=dev02.hepgrid.clrc.ac.uk, Mds-Vo-name=ral-dev, Mds-Vo-name=uk, o=Grid
objectClass: StorageElementProtocol
objectClass: DataGridTop
objectClass: DynamicObject
SEId: dev02.hepgrid.clrc.ac.uk
SEProtocol: file

Storage Element Status
in=status, seId=dev02.hepgrid.clrc.ac.uk, Mds-Vo-name=ral-dev, Mds-Vo-name=uk, o=Grid
objectClass: StorageElementStatus
objectClass: DataGridTop
objectClass: DynamicObject
SEfreespace: 12031
SEId: dev02.hepgrid.clrc.ac.uk

GRIS/GIIS Hierarchy
[Diagram: tree with Mds-Vo-name=datagrid at the top, Mds-Vo-name=countryA and Mds-Vo-name=countryB below it, and Mds-Vo-name=siteA, siteB, siteC and siteD at the bottom]
- Mds-Vo-name=datagrid, o=grid
- Mds-Vo-name=countryA, Mds-Vo-name=datagrid, o=grid
  - This will look at all the data from countryA
- Mds-Vo-name=countryA, o=grid
  - This will look at all the data from countryA
- Mds-Vo-name=siteB, Mds-Vo-name=countryA, o=grid
  - This will look at all the data from siteB
- Mds-Vo-name=siteB, o=grid
  - This will look at all the data from siteB
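The search bases on this slide all follow one pattern: a chain of Mds-Vo-name components from the most specific level to the least specific, ending in o=grid. A minimal sketch of that pattern follows; `search_base` is a hypothetical helper, not an MDS or LDAP library call. The resulting string would be used as the base of an LDAP search against the GRIS or GIIS.

```python
def search_base(*vo_names, root="o=grid"):
    """Build a base DN from Mds-Vo-name components, most specific first."""
    parts = [f"Mds-Vo-name={name}" for name in vo_names]
    return ", ".join(parts + [root])


# The site-level and country-level bases from the slide.
site_base = search_base("siteB", "countryA")
country_base = search_base("countryA", "datagrid")
top_base = search_base("datagrid")
```

The shorter the chain, the higher in the hierarchy the query starts and the more sites it covers.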

Map Centre - WP7
- Alternatively the information can be viewed using WP7's Map Center
- http://ccwp7.in2p3.fr/mapcenter/

R-GMA
Relational Grid Monitoring Architecture
An Overview

The Consumer Producer Model
[Diagram: Producer, Registry and Consumer, connected by command flow and information flow arrows]
- Use the Grid Monitoring Architecture from Global Grid Forum
- A relational implementation
- Applied to both information and monitoring
- Creates impression that you have one RDBMS per Virtual Organization

Relational Approach
- Not a general distributed RDBMS system, but a way to use the relational model in a distributed environment where ACID properties are not generally important.
- Producers
  - announce: SQL "CREATE TABLE"
  - publish: SQL "INSERT"
- Consumers
  - collect: SQL "SELECT"
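The three verbs (announce, publish, collect) can be tried out against any SQL engine. The sketch below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the "one RDBMS per Virtual Organization" impression. This is an illustration only: R-GMA itself is distributed and is reached through its producer and consumer APIs, which are not shown here. The column names are taken from the CPULoad example in this tutorial.

```python
import sqlite3

# In-memory database standing in for the virtual database (an assumption).
db = sqlite3.connect(":memory:")

# Producer announces the table it will publish to: SQL CREATE TABLE.
db.execute(
    "CREATE TABLE cpuLoad ("
    "country TEXT, site TEXT, facility TEXT, load REAL, timestamp TEXT)"
)

# Producer publishes tuples: SQL INSERT.
db.executemany(
    "INSERT INTO cpuLoad VALUES (?, ?, ?, ?, ?)",
    [
        ("UK", "RAL", "CDF", 0.3, "19055711022002"),
        ("UK", "RAL", "ATLAS", 1.6, "19055611022002"),
    ],
)

# Consumer collects tuples: SQL SELECT.
rows = db.execute(
    "SELECT facility, load FROM cpuLoad WHERE site = ? ORDER BY facility",
    ("RAL",),
).fetchall()
```

Because no transactions span several producers, the sketch needs none of the ACID guarantees that the slide describes as generally unimportant in this setting.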

R-GMA
[Diagram: numbered arrows 1-9 connect Sensor Code, Producer API, Producer Servlet, Registry API, Registry Servlet, Schema API, Schema Servlet, Consumer Servlet, Consumer API and Application Code; legend: command flow, information flow]
- API - Servlet communication
  - http(s) in
  - XML back

Schema & Contributions

CPULoad (Global Schema)
Country  Site  Facility  Load  Timestamp
UK       RAL   CDF       0.3   19055711022002
UK       RAL   ATLAS     1.6   19055611022002
UK       GLA   CDF       0.4   19055811022002
UK       GLA   ALICE     0.5   19055611022002
CH       CERN  ALICE     0.9   19055611022002
CH       CERN  CDF       0.6   19055511022002

CPULoad (Producer 1)
UK       RAL   CDF       0.3   19055711022002
UK       RAL   ATLAS     1.6   19055611022002

CPULoad (Producer 2)
UK       GLA   CDF       0.4   19055811022002
UK       GLA   ALICE     0.5   19055611022002

CPULoad (Producer 3)
CH       CERN  ATLAS     1.6   19055611022002
CH       CERN  CDF       0.6   19055511022002

Contributions are Views

CPULoad (Producer 1)
UK  RAL  CDF    0.3  19055711022002
UK  RAL  ATLAS  1.6  19055611022002
SELECT * FROM cpuLoad WHERE country = 'UK' AND site = 'RAL'

CPULoad (Producer 2)
UK  GLA  CDF    0.4  19055811022002
UK  GLA  ALICE  0.5  19055611022002
SELECT * FROM cpuLoad WHERE country = 'UK' AND site = 'GLA'
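Since each producer's contribution is described by a fixed predicate such as country = 'UK' AND site = 'RAL', a query only needs to reach the producers whose predicate does not contradict it. The following is a minimal sketch of that idea, assuming equality-only predicates held as dictionaries. The `Producer` class and the functions `relevant` and `collect` are hypothetical names for illustration and do not reflect the actual R-GMA registry or mediator interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class Producer:
    predicate: dict                      # fixed column values of this contribution
    rows: list = field(default_factory=list)


def relevant(producer, where):
    """True unless the producer's predicate contradicts an equality constraint."""
    return all(producer.predicate.get(col, val) == val for col, val in where.items())


def collect(producers, where):
    """Gather matching rows from the relevant producers only."""
    result = []
    for producer in producers:
        if relevant(producer, where):
            result.extend(
                row for row in producer.rows
                if all(row[col] == val for col, val in where.items())
            )
    return result


def make_rows(country, site, items):
    return [{"country": country, "site": site, "facility": f, "load": l} for f, l in items]


# Contributions as listed on the slides.
producers = [
    Producer({"country": "UK", "site": "RAL"}, make_rows("UK", "RAL", [("CDF", 0.3), ("ATLAS", 1.6)])),
    Producer({"country": "UK", "site": "GLA"}, make_rows("UK", "GLA", [("CDF", 0.4), ("ALICE", 0.5)])),
    Producer({"country": "CH", "site": "CERN"}, make_rows("CH", "CERN", [("ATLAS", 1.6), ("CDF", 0.6)])),
]

gla_rows = collect(producers, {"site": "GLA"})
cdf_rows = collect(producers, {"facility": "CDF"})
```

A query on site touches a single producer, while a query on facility has to visit all three because no predicate mentions that column.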

Fabric Management

Architecture logical overview
[Diagram: Grid User, Resource Broker, Data Mgmt (WP2), Grid Info Services and other services (other WPs) sit above the fabric management (WP4) subsystems; Local User, Farm A (LSF), Farm B (PBS) and Grid Data Storage (WP5; mass storage, disk pools) sit alongside and below them]

Fabric management subsystems:
- Fabric Gridification
  - Interface between Grid-wide services and local fabric
  - Provides local authentication, authorization and mapping of grid credentials
- Resource Management
  - Provides transparent access (both job and admin) to different cluster batch systems
  - Enhanced capabilities (extended scheduling policies, advanced reservation, local accounting)
- Installation & Node Mgmt
  - Provides the tools to install and manage all software running on the fabric nodes
  - Agent to install, upgrade, remove and configure software packages on the nodes
  - Bootstrap services and software repositories
- Configuration Management
  - Provides central storage and management of all fabric configuration information
  - Central DB and set of protocols and APIs to store and retrieve information
- Monitoring & Fault Tolerance
  - Provides the tools for gathering monitoring information on fabric nodes
  - Central measurement repository stores all monitoring information
  - Fault tolerance correlation engines detect failures and trigger recovery actions

User job management (Grid and local)
[Diagram: Grid User, Resource Broker, Data Mgmt (WP2), Grid Info Services, other services, Fabric Gridification, Resource Management, Monitoring, Local User, Farm A (LSF), Farm B (PBS), Grid Data Storage (mass storage, disk pools); the steps below are highlighted one per slide]
- Submit job
- Publish resource and accounting information
- Optimized selection of site
- Authorize
- Map grid to local credentials
- Select an optimal batch queue and submit
- Return job status and output
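The "map grid to local credentials" step can be pictured as a lookup from a certificate subject to a local account. The sketch below assumes a grid-mapfile style text in which each line holds a quoted subject followed by a local account name. The account names `griduser01` and `griduser02` and the helper `load_gridmap` are made up for illustration; the subjects are the AuthorizedUser values from the Computing Element entry earlier in this tutorial.

```python
import shlex


def load_gridmap(text):
    """Return a dict mapping certificate subjects to local account names."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = shlex.split(line)  # honours the quotes around the subject
        if len(fields) >= 2:
            mapping[fields[0]] = fields[1]
    return mapping


# Hypothetical map file contents.
GRIDMAP = '''
# subject                                              local account
"/O=Grid/O=UKHEP/OU=hepgrid.clrc.ac.uk/CN=Tim Eves"    griduser01
"/O=Grid/O=UKHEP/OU=hepgrid.clrc.ac.uk/CN=Tim Folkes"  griduser02
'''

gridmap = load_gridmap(GRIDMAP)
local_account = gridmap.get("/O=Grid/O=UKHEP/OU=hepgrid.clrc.ac.uk/CN=Tim Eves")
unknown_account = gridmap.get("/O=Grid/O=UKHEP/CN=Someone Else")  # None: no mapping
```

A subject with no entry yields no local account, which is the point at which a job would be refused.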

Automated management of large clusters
[Diagram: WP4 subsystems (Resource Management, Monitoring & Fault Tolerance, Configuration Management, Installation & Node Mgmt), other WPs, Farm A (LSF) and Farm B (PBS), connected by information and invocation arrows; the steps below are highlighted one per slide, and the final slide labels the whole loop "Automation"]
- Node malfunction detected
- Remove node from queue
- Wait for running jobs (?)
- Update configuration templates
- Trigger repair
- Repair (e.g. restart, reboot, reconfigure, ...)
- Node OK detected
- Put back node in queue
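The recovery sequence on these slides reads naturally as a small control loop. The sketch below is one possible rendering, with the monitoring, batch system and repair actions replaced by plain Python callables. The function `recover_node`, its parameters and the `max_attempts` limit are assumptions made for illustration; the slides do not say how many repair attempts are made or how waiting for jobs is implemented.

```python
def recover_node(node, queue, jobs_running, repair, is_ok, max_attempts=3):
    """Walk one node through the recovery sequence and return the actions taken."""
    actions = ["malfunction detected"]

    queue.discard(node)                      # remove node from queue
    actions.append("removed from queue")

    if jobs_running(node):
        # A real system would block here until the jobs finish;
        # this sketch only records the step.
        actions.append("waited for running jobs")

    actions.append("configuration templates updated")

    for attempt in range(1, max_attempts + 1):
        actions.append(f"repair triggered (attempt {attempt})")
        repair(node)                         # e.g. restart, reboot, reconfigure
        if is_ok(node):                      # node OK detected
            queue.add(node)                  # put node back in queue
            actions.append("node OK, put back in queue")
            break
    else:
        actions.append("repair failed, node left out of queue")
    return actions


# Toy usage with a fake node whose first repair succeeds.
health = {"node07": False}
queue = {"node06", "node07"}


def fake_repair(node):
    health[node] = True


actions = recover_node("node07", queue, lambda n: False, fake_repair, lambda n: health[n])
```

Passing the checks and repairs in as callables keeps the loop independent of any particular batch system or monitoring tool.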

LCFG (Local ConFiGuration system)
- Widely used fabric tool, whose purpose is to handle automated installation and configuration in a very diverse and evolving environment
- Mechanism:
  - Abstract configuration parameters are stored in a central repository located in the LCFG server.
  - Scripts on the host machine (LCFG client) read these configuration parameters and either generate traditional configuration files, or directly manipulate various services.
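The mechanism (abstract parameters in, traditional configuration file out) can be illustrated with a few lines of Python. This is only a sketch of the idea: the `component.key` naming, the `key = value` output format, the parameter values and the function `render_config` are all hypothetical and do not reproduce LCFG's real resource syntax or component scripts.

```python
def render_config(params, component):
    """Render 'key = value' lines for one component from 'component.key' parameters."""
    prefix = component + "."
    lines = [
        f"{key[len(prefix):]} = {value}"
        for key, value in sorted(params.items())
        if key.startswith(prefix)
    ]
    return "\n".join(lines) + "\n"


# Abstract parameters as a client might receive them from the central repository
# (made-up names and values).
params = {
    "ntp.server": "time.example.org",
    "ntp.driftfile": "/etc/ntp/drift",
    "syslog.level": "info",
}

ntp_conf = render_config(params, "ntp")
```

Each component reads only its own parameters, so a change in the central repository affects just the files that depend on it.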

Local Authorization: LCAS
- The Local Centre Authorization Service (LCAS) handles authorization requests to the local computing fabric.
- In this release the LCAS is a shared library, which is loaded dynamically by the Globus gatekeeper. The gatekeeper has been slightly modified for this purpose and will from now on be referred to as edg-gatekeeper.
- The authorization decision of the LCAS is based upon the user's certificate and the job specification in RSL (JDL) format. The certificate and RSL are passed to (plug-in) authorization modules, which grant or deny access to the fabric.
- Three standard authorization modules are provided by default:
  - lcas_userallow.mod checks if the user is allowed on the fabric (currently the gridmap file is checked).
  - lcas_userban.mod checks if the user should be banned from the fabric.
  - lcas_timeslots.mod checks if the fabric is open at this time of the day for datagrid jobs.
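The plug-in idea can be sketched as a list of small checks that each grant or deny a request. The code below mirrors the three standard modules by name only. It assumes that access is granted when every module grants it, and it represents a request as a dictionary with the user's subject and the time of day; both are assumptions made for illustration, and the RSL job specification is left out. The real LCAS is a shared library loaded by the gatekeeper, not Python.

```python
from datetime import time


def userallow(allowed):
    """Grant if the subject is in the allowed set."""
    return lambda request: request["dn"] in allowed


def userban(banned):
    """Grant unless the subject is in the banned set."""
    return lambda request: request["dn"] not in banned


def timeslots(open_from, open_until):
    """Grant if the request falls inside the opening hours."""
    return lambda request: open_from <= request["time"] <= open_until


def authorize(request, modules):
    """Grant access only if every module grants it (assumed policy)."""
    return all(module(request) for module in modules)


EVES = "/O=Grid/O=UKHEP/OU=hepgrid.clrc.ac.uk/CN=Tim Eves"
modules = [
    userallow({EVES}),
    userban(set()),
    timeslots(time(8, 0), time(18, 0)),
]

granted = authorize({"dn": EVES, "time": time(10, 30)}, modules)
after_hours = authorize({"dn": EVES, "time": time(23, 0)}, modules)
```

New policies can be added by appending another callable to the list, which is the property the plug-in design is after.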

Authentication control flow
[Diagram labels, in source order: EDG gatekeeper; GLOBUS Gatekeeper; TLS auth; GLOBUS + LCAS Gatekeeper; TLS auth; LCAS (so); assist_gridmap; Jobmanager-*]
* And store in job repository

Further Information
- Information and Monitoring Services
  - http://hepunx.rl.ac.uk/edg/wp3/
- Fabric Management
  - http://cern.ch/hep-proj-grid-fabric/