Grid Monitoring using Nagios and RRDtool
Ian Stokes-Rees
Oxford Particle Physics
HEPSYSMAN Conference 29 April 2003 – Ian Stokes-Rees

In a perfect world …
• Individual node status
  o Is it up?
  o What is its load?
  o What is the memory and swap usage?
  o NFS and network load?
  o Are the partitions full?
  o Are applications and services running properly?
• Amalgamated node status
  o Same info, but across groups of nodes

In a perfect world …
• Historical information
  o Trends
• Notification of service states
  o e.g. Storage down to 100 megs free = Warning
  o Storage down to 10 megs free = Critical
  o sshd no longer running = Failure
  o notify by email, pager, mobile
• Easy access to monitoring information
  o web, email, digest, mobile
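The storage thresholds on this slide can be written as a small classification function. This is a minimal Python sketch: the function name, the megabyte units and the "at or below" boundary handling are assumptions for illustration, not taken from any monitoring tool.

```python
def classify_free_space(free_mb: float,
                        warn_mb: float = 100.0,
                        crit_mb: float = 10.0) -> str:
    """Map free storage (megabytes) to a service state.

    Defaults follow the slide: down to 100 MB free is a warning,
    down to 10 MB free is critical. Treating the boundary value
    itself as a hit is an assumption.
    """
    if free_mb <= crit_mb:
        return "CRITICAL"
    if free_mb <= warn_mb:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    for free in (500.0, 80.0, 5.0):
        print(free, classify_free_space(free))
```

Checking the critical threshold first matters: a value of 5 MB is also below the warning threshold, and the more severe state should win.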

In a perfect world …
• Avoidance of “Too many red flashing lights”
  o “Just the facts, ma’am”: only want root-cause failures to be reported, not a cascade of every downstream failure.
  o also includes avoiding unnecessary checks
  o e.g. HTTP responding, therefore no need to ping
  o e.g. power outage, doesn’t ping, so don’t bother trying anything else
• Other wish list requirements?
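One way to picture "root cause only" reporting is to suppress any failed node whose parent in the network hierarchy has also failed. The Python sketch below assumes a simple child-to-parent map; the host names and the function are hypothetical and do not reproduce how any particular monitoring tool implements this.

```python
from typing import Dict, Optional, Set


def root_cause_failures(parents: Dict[str, Optional[str]],
                        down: Set[str]) -> Set[str]:
    """Return the failed hosts whose parent is not itself failed.

    `parents` maps each host to its parent (None for a top-level host).
    A failed host with a failed parent is treated as a downstream
    symptom and is left out of the report.
    """
    return {host for host in down if parents.get(host) not in down}


if __name__ == "__main__":
    # Hypothetical topology: two nodes behind one switch.
    topology = {"switch": None, "node1": "switch", "node2": "switch"}
    print(root_cause_failures(topology, {"switch", "node1", "node2"}))
    print(root_cause_failures(topology, {"node2"}))
```

Only the immediate parent is inspected, which is enough when every failed ancestor is itself in the `down` set.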

Aspects of Current Grid Monitoring
1. LDAP (Lightweight Directory Access Protocol) is the current foundation for MDS. Designed for frequent reads and infrequent writes.
2. MDS (Monitoring and Discovery Service) uses LDAP for maintaining static and dynamic system details.
3. R-GMA (Relational Grid Monitoring Architecture) is meant to address shortcomings of the LDAP-based MDS system by using a hierarchy of relational databases. Now being deployed.
4. GRIS (Grid Resource Information Service) stores details about the state of “the grid” (at least from the local node).
5. GIIS (Grid Index Information Service) ties together several GRISes.
6. HBM (Heart Beat Monitor) monitors Globus services. It seems to have died a quiet death.

Existing Grid Monitoring Lacks…
• Historical information for trends
• Simple interface for accessing information
• Automated response to changes in system state
Here is where RRDtool and Nagios can contribute

RRDtool
www.rrdtool.com
• Round Robin Database for time series data storage
• Command line based
• From the author of MRTG
• Made to be faster and more flexible
• Includes CGI and Graphing tools, plus APIs
Solves the Historical Trends and Simple Interface problems

Define Data Sources (Inputs)
• DS:speed:COUNTER:600:U:U
• DS:fuel:GAUGE:600:U:U
  o DS = Data Source
  o speed, fuel = “variable” names
  o COUNTER, GAUGE = variable type
  o 600 = heart beat: UNKNOWN returned for interval if nothing received after this amount of time
  o U:U = limits on minimum and maximum variable values (U means unknown and any value is permitted)
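The difference between the two variable types is easiest to see numerically: a GAUGE value is stored as given, while a COUNTER is turned into a rate of change per second. The Python sketch below illustrates that idea together with the heart beat rule. It is a simplification under stated assumptions: the function name is hypothetical, and counter wrap-around and RRDtool's alignment of samples to step boundaries are ignored.

```python
from typing import Optional


def counter_rate(prev_time: int, prev_value: float,
                 time: int, value: float,
                 heartbeat: int = 600) -> Optional[float]:
    """Per-second rate between two COUNTER readings.

    Returns None (standing in for UNKNOWN) when the gap between
    readings exceeds the heart beat or the timestamps do not advance.
    """
    elapsed = time - prev_time
    if elapsed <= 0 or elapsed > heartbeat:
        return None
    return (value - prev_value) / elapsed


if __name__ == "__main__":
    # Odometer readings of 12345 km and 12357 km taken 300 s apart.
    print(counter_rate(920804700, 12345, 920805000, 12357))
    # A 15 minute gap is longer than the 600 s heart beat.
    print(counter_rate(920804700, 12345, 920805600, 12363))
```

With an odometer in kilometres, the resulting rate is in kilometres per second.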

Define Archives (Outputs)
• RRA:AVERAGE:0.5:1:24
• RRA:AVERAGE:0.5:6:10
  o RRA = Round Robin Archive
  o AVERAGE = consolidation function
  o 0.5 = up to 50% of consolidated points may be UNKNOWN
  o 1:24 = this RRA keeps each sample (average over one 5 minute primary sample), 24 times (which is 2 hours worth)
  o 6:10 = one RRA keeps an average over every six 5 minute primary samples (30 minutes), 10 times (which is 5 hours worth)
• Clear as mud!
  o all depends on original step size which defaults to 5 minutes

RRDtool Database Format
RRD file created with --step 300 (5 minute input step size), holding three archives:
• RRA 1:24 = recent data stored once every 5 minutes for the past 2 hours
• RRA 6:10 = medium length data averaged to one entry per half hour for the last 5 hours
• RRA 288:365 = old data averaged to one entry per day for the last 365 days
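The time span covered by each archive is the step size multiplied by the number of primary samples per row and by the number of rows. The short Python sketch below checks that arithmetic for the three archives above; the function name is made up for illustration.

```python
def rra_span_seconds(step_s: int, steps_per_row: int, rows: int) -> int:
    """Seconds of history held by one round robin archive."""
    return step_s * steps_per_row * rows


if __name__ == "__main__":
    STEP = 300  # 5 minute input step size
    for steps_per_row, rows in [(1, 24), (6, 10), (288, 365)]:
        span = rra_span_seconds(STEP, steps_per_row, rows)
        resolution_min = STEP * steps_per_row / 60
        print(f"{steps_per_row}:{rows} -> one row per {resolution_min:g} min, "
              f"{span / 3600:g} hours in total")
```

For 288:365 each row covers 288 × 300 s = 86400 s, which is exactly one day, so 365 rows hold one year.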

RRDtool Example
• Monitoring a car: fuel in the tank plus odometer

    Time    Odometer (KM)   Fuel (L)
    12:05   12345           7.0
    12:10   12357           5.8
    12:15   12363           5.2
    12:20   12363           5.2
    12:25   12363           5.2
    12:30   12373           4.2
    12:35   12383           3.2
    12:40   12393           2.2
    12:45   12399           1.6
    12:50   12405           9.0
    12:55   12411           8.4
    13:00   12415           8.0
    13:05   12420           7.5
    13:10   12422           7.3
    13:15   12423           7.2

Slide annotations: STOP (odometer holds at 12363 from 12:15 to 12:25), RESTART (moving again by 12:30), REFUEL (fuel rises from 1.6 L to 9.0 L at 12:50).

RRDtool Example
• Create an RRD to store distance and fuel

    rrdtool create car.rrd --start 920804400 \
        DS:speed:COUNTER:600:U:U \
        DS:fuel:GAUGE:600:U:U \
        RRA:AVERAGE:0.5:1:24 \
        RRA:AVERAGE:0.5:6:10

• --start defines earliest time RRD accepts
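The --start value is a Unix timestamp, that is, seconds since 1 January 1970 UTC. The Python sketch below converts it to a readable date; the helper name is hypothetical. The first sample, 300 seconds later, comes out as 11:05 UTC, which suggests the 12:05 clock time used in the car example is in a UTC+1 zone.

```python
from datetime import datetime, timezone


def epoch_to_utc(timestamp: int) -> datetime:
    """Convert a Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


if __name__ == "__main__":
    start = 920804400
    print(epoch_to_utc(start).isoformat())        # 1999-03-07T11:00:00+00:00
    print(epoch_to_utc(start + 300).isoformat())  # 1999-03-07T11:05:00+00:00
```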

RRDtool Example
• Input data:

    rrdtool update car.rrd 920804700:12345:7.0 920805000:12357:5.8
    rrdtool update car.rrd 920805300:12363:5.2 920805600:12363:5.2
    rrdtool update car.rrd 920805900:12363:5.2 920806200:12373:4.2
    rrdtool update car.rrd 920806500:12383:3.2 920806800:12393:2.2
    rrdtool update car.rrd 920807100:12399:1.6 920807400:12405:9.0
    rrdtool update car.rrd 920807700:12411:8.4 920808000:12415:8.0
    rrdtool update car.rrd 920808300:12420:7.5 920808600:12422:7.3
    rrdtool update car.rrd 920808900:12423:7.2

RRDtool Graphing
• Now with data in the RRD, RRDtool can generate graphs:

    rrdtool graph speed.gif \
        --start 920804400 --end 920808000 \
        --vertical-label m/s \
        DEF:myspeed=car.rrd:speed:AVERAGE \
        DEF:myfuel=car.rrd:fuel:AVERAGE \
        CDEF:realspeed=myspeed,1000,\* \
        LINE2:realspeed#FF0000 \
        LINE2:myfuel#00FF00
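The CDEF expression is written in reverse Polish notation: `myspeed,1000,*` pushes the speed value, pushes 1000, then multiplies them, turning kilometres per second into the metres per second shown on the axis label. The Python sketch below evaluates such an expression for a single value. It is an illustration only: it handles just the four arithmetic operators, whereas RRDtool supports many more, and the function name is hypothetical.

```python
from typing import Dict, List

_OPERATORS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def eval_rpn(expression: str, variables: Dict[str, float]) -> float:
    """Evaluate a comma-separated RPN expression for one data point."""
    stack: List[float] = []
    for token in (t.strip() for t in expression.split(",")):
        if token in _OPERATORS:
            right = stack.pop()
            left = stack.pop()
            stack.append(_OPERATORS[token](left, right))
        elif token in variables:
            stack.append(variables[token])
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed RPN expression: " + expression)
    return stack[0]


if __name__ == "__main__":
    # 0.04 km/s is 12 km covered in 300 s.
    print(eval_rpn("myspeed,1000,*", {"myspeed": 0.04}))
```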

RRDtool Graphing Output
• Much more interesting graphs possible
• Multiple RRDs may be used as sources for variables
• Auto-interpolation of points
• Functions and calculations can be applied to variables
• Legends, labels, and text can be inserted

RRDtool Graphing Output (figure)

Nagios
www.nagios.org
• Instantaneous service level monitoring
• Web based interface
• Somewhat complicated set of configuration files to manually edit
• Automated notification of change in service level (email, phone, etc.)
• Defines WARNING, CRITICAL, FAILED levels

What Do We Want to Monitor?

    Static            Dynamic               Services
    CPU (SPECint)     Load                  Live
    RAM (swap)        Mem/swap usage        Accessible
    HD capacity       Storage available     Globus
    Network b/w       Network utilisation   SSH
    OS                Users                 Etc.
    Applications      Processes
    Location, Admin   Queues (PBS)

Nagios Host Definitions
• Define details about each node and their hierarchy in the network:

    define host{
        host_name               tbce01
        alias                   Testbed CE
        address                 163.1.243.105
        parents                 edg-testbed
        notifications_enabled   1
        process_perf_data       1
        check_command           check-host-alive
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
    }

Nagios Service Definitions
• Define details about each service:

    define service{
        name                    ping
        check_command           check_ping!100.0,20%!500.0,60%
        contact_groups          linux-admins
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        notification_interval   120
        notification_period     24x7
        notification_options    c,r
    }
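In the check_command value, `!` separates the command name from its arguments. With the conventional check_ping command definition (an assumption about the site configuration, which the slides do not show), the first argument is the warning threshold and the second the critical threshold, each given as round-trip time in milliseconds and packet loss in percent. The Python sketch below splits the string and applies the thresholds; the function names and the "at or above" boundary handling are hypothetical.

```python
from typing import List, Tuple


def split_check_command(value: str) -> Tuple[str, List[str]]:
    """Split 'command!arg1!arg2' into the command name and its arguments."""
    name, *args = value.split("!")
    return name, args


def parse_ping_threshold(arg: str) -> Tuple[float, float]:
    """Parse '100.0,20%' into (round-trip ms, packet loss percent)."""
    rta, loss = arg.split(",")
    return float(rta), float(loss.strip().rstrip("%"))


def ping_state(rta_ms: float, loss_pct: float,
               warn: Tuple[float, float], crit: Tuple[float, float]) -> str:
    """Classify a ping result; either metric can trigger a state."""
    if rta_ms >= crit[0] or loss_pct >= crit[1]:
        return "CRITICAL"
    if rta_ms >= warn[0] or loss_pct >= warn[1]:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    name, args = split_check_command("check_ping!100.0,20%!500.0,60%")
    warn, crit = (parse_ping_threshold(a) for a in args)
    print(name, warn, crit)
    print(ping_state(150.0, 0.0, warn, crit))
```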

Nagios Service and Host Polling
• Pull model, where Nagios server executes command to fetch host or service status
• Requires remote hosts and services to cooperate
  o NRPE installed on clients allows server to execute “plugins” to poll for information
  o Alternatively use existing client reporting mechanisms (ping, wget, http)
• Server responsible for configuration of polling intervals and details to be polled
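In the pull model the server runs a plugin and reads the result from its exit code; by the Nagios plugin convention 0 means OK, 1 WARNING, 2 CRITICAL and 3 UNKNOWN, with a line of status text on standard output. The Python sketch below runs an arbitrary command that way. The `run_check` helper and the stand-in command are hypothetical, and treating a timeout or a missing executable as UNKNOWN is an assumption.

```python
import subprocess
import sys
from typing import List, Tuple

# Exit code convention used by Nagios plugins.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(argv: List[str], timeout_s: float = 10.0) -> Tuple[str, str]:
    """Run a plugin-style command; return (state, first line of output)."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True,
                                timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return "UNKNOWN", ""
    output = result.stdout.strip()
    first_line = output.splitlines()[0] if output else ""
    return STATES.get(result.returncode, "UNKNOWN"), first_line


if __name__ == "__main__":
    # Stand-in for a real plugin: prints a status line and exits with 1.
    fake_plugin = [sys.executable, "-c",
                   "import sys; print('DISK WARNING - 80 MB free'); sys.exit(1)"]
    print(run_check(fake_plugin))
```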

Nagios Service and Host Reporting
• Push model, where services and hosts decide when to report status to Nagios server
  o push data when available/relevant
  o generally full access to node-local data
  o requires configuring every node independently
  o authentication of nodes at server
  o nodes need to know who to send data to
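One mechanism for the push model, which the slides do not name, is submitting passive check results to the Nagios server as external commands of the form `[timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output`. The Python sketch below only formats such a line; delivering it to the server (for example through the external command file) is left out, and the helper name and sample values are hypothetical.

```python
import time
from typing import Optional


def format_passive_result(host: str, service: str, return_code: int,
                          output: str,
                          timestamp: Optional[int] = None) -> str:
    """Build one passive service check result line.

    return_code follows the plugin convention:
    0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
    """
    if timestamp is None:
        timestamp = int(time.time())
    # Semicolons and newlines would break the field structure.
    safe_output = output.replace(";", ",").replace("\n", " ")
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{safe_output}\n")


if __name__ == "__main__":
    print(format_passive_result("tbce01", "ping", 0, "PING OK"), end="")
```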

Host and Service Status (screenshot)

Finally, some other monitors
• NWS (Network Weather Service) attempts to predict network utilisation from historical information
• Ganglia cluster monitoring system, provides aggregate graphs of cluster performance (Globus/EDG tie-ins underway)
• Map Center EDG project to monitor Grid status and services
• ActiveMap, GridPortal, and InfoPortal* appear to be inactive projects