
  • Number of slides: 35

Monitoring best practices & tools for running highly available databases
Miguel Anjo & Dawid Wojcik
DM meeting – 20 May 2008
Internet Services, CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it

Oracle Real Application Clusters

Architecture
[Diagram: RAC1, RAC2, RAC3, RAC4, RAC5, RAC6]

Highly Available databases – Oracle ‘services’
• Resources distributed among Oracle services
  – Applications assigned to a dedicated service
  – On node failure, resources re-distributed
[Table: services CMS_COND, CMS_C2K, CMS_DBS_W, CMS_SSTRACKER and CMS_TRANSFERMGMT, each marked Preferred, A1 or A2 on the cluster nodes]
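
The re-distribution of services on node failure can be illustrated with a small sketch. It assumes each service lists one preferred node followed by ordered fallback nodes. The `place` function, the node names and the preference lists are hypothetical; only the service names come from the slide, and Oracle Clusterware's actual placement logic is not described in this deck.

```python
from typing import Dict, List, Optional, Set


def place(services: Dict[str, List[str]], up: Set[str]) -> Dict[str, Optional[str]]:
    """Map each service to the first node in its preference list that is up.

    The first entry of each list plays the role of the 'preferred' node,
    the remaining entries are fallbacks (hypothetical model).
    """
    placement: Dict[str, Optional[str]] = {}
    for name, nodes in services.items():
        placement[name] = next((n for n in nodes if n in up), None)
    return placement


# Hypothetical preference lists; only the service names come from the slide.
services = {
    "CMS_COND": ["node1", "node2", "node3"],
    "CMS_DBS_W": ["node2", "node1"],
    "CMS_TRANSFERMGMT": ["node3", "node1"],
}

normal = place(services, up={"node1", "node2", "node3"})
after_failure = place(services, up={"node1", "node3"})  # node2 has failed
```

With all nodes up, every service runs on its preferred node; once `node2` is removed from the `up` set, `CMS_DBS_W` moves to its first fallback while the other services stay where they were.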

Highly Available databases – Apps and DB release cycle
• Applications’ release cycle: Development service → Validation service → Production service
• Database software release cycle: Production service version 10.2.0.n → Validation service version 10.2.0.(n+1) → Production service version 10.2.0.(n+1)

Why monitor?
• Monitor (n.) – Computer Science. A program that observes, supervises, or controls the activities of other programs.
• Diagnostics, performance, reporting
• Need to keep all components in a healthy state
• We are prepared for single failures, some double failures
• Commitment to give 24/7 best effort service
• SW misbehavior affecting performance
• Trends might indicate need to grow system
• Security breaches

Monitoring participants

What we monitor
• 25 database clusters
• 124 servers, 450 cores, 150 disk arrays, 2000 disks at Tier 0
• 10 Tier 1 sites for Streams replication
• 150+ Oracle ‘services’ / applications
• 2000+ user schemas
• 1M+ connections/day

PDB-Backup
• 2 node cluster
• Using Oracle Clusterware
• Running:
  – RACMon (monitoring agents)
  – StreamMon (monitoring agents)
  – Backups
  – Scripts repository
• Monitored by Lemon. Set as Critical in Operator procedures

Monitored components
• Servers
  – Accessibility
  – CDB state
  – Tools: Lemon + RACMon + OEM
• Disk arrays
  – Accessibility
  – State given by controller
  – Firmware, disk state, disk size, disk speed
  – Tools: Lemon + RACMon
• Database SW
  – Clusterware state
  – Service accessibility
  – Space available
  – Oracle Streams
  – Tools: RACMon + OEM + StreamMon
• Database usage
  – OS CPU, I/O
  – User sessions, CPU, I/O
  – User quotas, tablespace usage
  – Bad usage (short connections, bind variables)
  – Table fragmentation
  – Tools: RACMon, Reports

Best practices (I)
• No overhead to DB (monitored object)
• Monitor as much as possible
• Presentation layer simple & compact
• Possibility to drill down

Best practices (II)
• Hierarchy of alarms and notifications
• Simplicity → reliability
• Centralized version vs. deployed everywhere
• Independent blocks (monitoring, dashboard, reporting) for HA

Monitoring tools
• Monitoring tools
  – Lemon, SLS
  – Basic Monitoring (in house development)
  – SQL scripts (reactive monitoring)
  – RACMon (in house development, openlab)
  – StreamMon (in house development, openlab)
  – OEM: Oracle Enterprise Manager (Grid Control), openlab
• Service oriented monitoring tools
  – Experiment reports
  – DB Availability & Performance Pages

Basic monitoring
• Probes: SSH, SQL*Plus (select * from dual;)
• Checking every 5 minutes
• Each failure → e-mail with error
• 3 consecutive failures → SMS
• Almost perfect for single instance databases
• Limitations
  – On RAC, the system survives single HW failures
  – Users connect to a ‘service’, not a database instance
  – No monitoring of other components (storage, clusterware)
  – Missing dashboard view
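
The escalation rule on this slide (e-mail on every failure, SMS after 3 consecutive failures) can be sketched as follows. This is a minimal illustration, not the in-house tool: the `BasicMonitor` class and its callbacks are hypothetical, the probe is any callable returning True or False, and sending the SMS only once per outage (exactly at the third failure) is an assumption.

```python
from typing import Callable, List


class BasicMonitor:
    """Tracks consecutive check failures and escalates notifications."""

    def __init__(self, check: Callable[[], bool],
                 send_email: Callable[[str], None],
                 send_sms: Callable[[str], None],
                 sms_threshold: int = 3) -> None:
        self.check = check
        self.send_email = send_email
        self.send_sms = send_sms
        self.sms_threshold = sms_threshold
        self.consecutive_failures = 0

    def run_once(self) -> bool:
        """Run one check; a scheduler would call this every 5 minutes."""
        try:
            ok = self.check()
            error = "check returned failure"
        except Exception as exc:  # a probe that raises counts as a failure
            ok, error = False, str(exc)
        if ok:
            self.consecutive_failures = 0
            return True
        self.consecutive_failures += 1
        self.send_email(error)  # every failure: e-mail with the error
        if self.consecutive_failures == self.sms_threshold:
            self.send_sms(error)  # third failure in a row: SMS
        return False


# Simulated probe: three failures followed by a recovery.
emails: List[str] = []
texts: List[str] = []
results = iter([False, False, False, True])
monitor = BasicMonitor(lambda: next(results), emails.append, texts.append)
for _ in range(4):
    monitor.run_once()
```

In a real deployment the probe would open an SSH or SQL*Plus session and run the `select * from dual;` query; here it is replaced by a canned sequence so the escalation logic can be followed in isolation.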

DBA monitoring
• SQL scripts – reactive (ad-hoc) monitoring
• Pros:
  – Easy to use
  – Fast real time information
• Cons:
  – No global overview
  – Diagnose a single problem
  – Requires expert knowledge

RACMon requirements
• Reliable (24/7)
• Easy to use and configure
• Provides up to date information (frequent runs)
• Centralized – no configuration or deployment on RAC side
• Web interface (RAC monitoring dashboard) – one common place for RACs’ status
• Monitoring of Oracle services (DB and user level) and Oracle clusterware
• Monitoring of ASM instances (diskgroups and failgroups)
• Monitoring of other parts of the infrastructure – backups, storage, … (easy extensibility)
• Notifications sent via e-mail and SMS to DBAs
• Availability numbers (over extended periods of time)
• Disabling monitoring for specific machines or clusters (scheduled and unscheduled intervention logbook)
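
Two of these requirements, availability numbers and disabling monitoring during interventions, interact, and a small sketch may make that concrete. How RACMon actually computes availability is not described in the slides; the `availability` function below is hypothetical and assumes that check samples falling inside a logged intervention window are simply left out of the calculation.

```python
from typing import List, Tuple

Sample = Tuple[float, bool]   # (timestamp in seconds, check succeeded)
Window = Tuple[float, float]  # (start, end) of a logged intervention


def availability(samples: List[Sample], interventions: List[Window]) -> float:
    """Fraction of successful checks, ignoring samples inside interventions."""
    counted = [
        ok for ts, ok in samples
        if not any(start <= ts < end for start, end in interventions)
    ]
    if not counted:
        raise ValueError("no samples outside intervention windows")
    return sum(counted) / len(counted)


# Five-minute samples; the failure at t=600 falls inside a logged intervention,
# the failure at t=900 does not (the window end is exclusive).
samples = [(0, True), (300, True), (600, False), (900, False), (1200, True)]
value = availability(samples, interventions=[(600, 900)])
```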

RACMon Architecture

RACMon - examples

RACMon
• Pros/Features:
  – Customized for our environment
  – Gives an overview of all our HW and RACs
  – Configurable alerts (via email and SMS) and alert levels (production or non-production systems)
  – Drill down details available via multiple links to other types of monitoring software (OEM, Lemon, StreamMon)
• Cons:
  – Requires manpower for development

Oracle Streams
• “Oracle Streams enables the propagation and management of data, transactions and events in a data stream either within a database, or from one database to another.”

StreamMon

StreamMon
• Streams availability and usage monitoring
• Built-in alerting in case of any error in the streams stack
• Pros:
  – Monitoring of all T1 sites in one place (streams monitoring not available in any other tool, including OEM)
  – Convenient and easy to use web interface
  – Advanced plotting utilities
• Cons:
  – Required manpower for development (currently in maintenance only)
  – Uses non-standard libraries, requires customized server

Oracle Enterprise Manager
• Architecture:
  – Agent running on each server uploads information to a central repository; if the repository is not available, it caches data
  – Management Service provides insight into the details of any monitored target
  – Management Service sends e-mails (SMSes) based on configured metrics and policies
  – Proactive monitoring possible (actions based on problem diagnostics)
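
The agent behaviour described above (upload to a central repository, cache locally when it is unavailable) is a store-and-forward pattern. The sketch below illustrates the pattern only; it is not the OEM agent's implementation, and `CachingAgent`, `upload` and the metric dictionaries are hypothetical names.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class CachingAgent:
    """Buffers metrics locally while the central repository is unreachable."""

    def __init__(self, upload: Callable[[Dict], None], max_cached: int = 1000) -> None:
        self.upload = upload
        # Bounded buffer: when full, the oldest metric is dropped first.
        self.cache: Deque[Dict] = deque(maxlen=max_cached)

    def report(self, metric: Dict) -> None:
        self.cache.append(metric)
        self.flush()

    def flush(self) -> None:
        """Upload cached metrics in order; stop at the first failure."""
        while self.cache:
            try:
                self.upload(self.cache[0])
            except ConnectionError:
                return  # repository down: keep the data for a later attempt
            self.cache.popleft()


# Simulated repository that can be switched on and off.
received: List[Dict] = []
state = {"up": False}


def upload(metric: Dict) -> None:
    if not state["up"]:
        raise ConnectionError("repository unavailable")
    received.append(metric)


agent = CachingAgent(upload)
agent.report({"cpu": 0.4})  # cached, repository is down
state["up"] = True
agent.report({"cpu": 0.5})  # both metrics are uploaded, oldest first
```

A metric is removed from the cache only after its upload succeeds, so an outage in the middle of a flush does not lose data that is still within the buffer limit.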

Oracle Enterprise Manager
• Oracle Enterprise Manager Grid Control features

Oracle Enterprise Manager
• Pros:
  – Highly configurable alerts, metrics and notification policies
  – Advanced and easy to use web interface
  – Easy drill down
  – External product – fully supported
• Cons:
  – Universal – requires more navigation
  – No global overview (per target oriented)
  – Customization for many targets requires much work
  – Bugs may be intrusive (e.g. affecting streams, excessive memory/CPU consumption, storage, DB instances)
  – Manpower required for maintenance and configuration
  – Not reliable enough for 24/7 monitoring

Weekly reports
• Targeted to experiment DBAs and Coordinators
• Information about
  – Bookkeeping: application names, contacts
  – Resource usage: sessions, CPU, logical and physical I/O
  – Security: connection errors, expiring passwords, unused schemas
  – Space: consumed, fragmentation, recycle bin
  – Bad usage: short connections, queries missing bind variables
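
As an illustration of the "short connections" item, the sketch below counts very short sessions per schema over a reporting period. The actual reports are PHP scripts whose queries are not shown in the slides; the `short_connections` function, the session tuple layout and both thresholds are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Session = Tuple[str, float, float]  # (schema, logon time, logoff time) in seconds


def short_connections(sessions: Iterable[Session],
                      max_seconds: float = 1.0,
                      min_count: int = 2) -> Dict[str, int]:
    """Count per-schema sessions shorter than max_seconds; keep frequent offenders."""
    counts = Counter(
        schema for schema, logon, logoff in sessions
        if logoff - logon < max_seconds
    )
    return {schema: n for schema, n in counts.items() if n >= min_count}


# Hypothetical session records for two schemas.
sessions = [
    ("APP_A", 0.0, 0.2), ("APP_A", 5.0, 5.3), ("APP_A", 9.0, 9.1),
    ("APP_B", 0.0, 3600.0), ("APP_B", 10.0, 10.5),
]
flagged = short_connections(sessions)
```

Here `APP_A` is flagged because it opened three sub-second sessions, while `APP_B` has only one short session and stays below the `min_count` threshold.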

Weekly reports
• PHP scripts
• Generate report over last 7 days
• Specific to one RAC cluster

Weekly reports

Weekly reports
• Current functionality
  – Simple way to visualize whole DB usage
  – Concentrates on main users (dynamic)
  – Easy to spot problems (color coded)
  – Very good feedback from our users
• Now working on user configurable reports

DB availability and performance page
• PHP, aggregation of other tools
• Requested by experiments
• Dashboard of “current” DB activity
• Almost real time monitoring (up to last hour)
• Application resource usage
• No extra load – uses SLS, RACMon, StreamMon, weekly reports
• Possibility to drill down

DB availability and performance page

Summary
• Many monitoring components developed for our environment
  – Out of the box tools not sufficient
  – Open frameworks – new features easily added
  – Feedback given to Oracle Enterprise Manager development (openlab)
• Very good feedback from T1s and experiments
  – Components included in experiment dashboards, WLCG ServiceMaps, SLS