Monitoring, Diagnosing, and Repairing: Eric Anderson, U.C. Berkeley



Number of slides: 41

Monitoring, Diagnosing, and Repairing
Eric Anderson
U.C. Berkeley

Overview
· What is System Administration?
  – What is the problem?
  – Goals of Dissertation Research
  – Goals of System Administration
· Monitoring, diagnosing, and repairing
· Dissertation Timeline
· Conclusion
19-Mar-18

What is the problem?
· Problems occur in systems, and result in loss of productivity
  – Server failures → denial of service
  – System overload → lower productivity
· Cost is too high
  – Cost of ownership estimated at $5,000-$15,000/year/machine
  – Median salary (~50k) / (median # machines/admin) → $700
· Our goal: Reduce cost by
  – Repairing problems faster (possibly automatically)
  – Handling more problems
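The salary line on this slide can be turned around to see which machines-per-administrator ratio it implies. The sketch below is only a back-of-the-envelope check; it assumes the $700 figure is administrator salary cost per machine per year, which the slide does not state explicitly, and the function names are illustrative.

```python
# Back-of-the-envelope check of the salary figure on the slide.
# Assumption: "$700" is administrator salary cost per machine per year.
MEDIAN_SALARY = 50_000                  # dollars/year ("~50 k" on the slide)
SALARY_COST_PER_MACHINE = 700           # dollars/year/machine (from the slide)
OWNERSHIP_COST_RANGE = (5_000, 15_000)  # dollars/year/machine (from the slide)


def implied_machines_per_admin(salary, cost_per_machine):
    """Solve salary / machines_per_admin = cost_per_machine."""
    return salary / cost_per_machine


def salary_share(cost_per_machine, ownership_cost):
    """Fraction of the total cost of ownership explained by admin salary."""
    return cost_per_machine / ownership_cost


machines = implied_machines_per_admin(MEDIAN_SALARY, SALARY_COST_PER_MACHINE)
shares = [salary_share(SALARY_COST_PER_MACHINE, c) for c in OWNERSHIP_COST_RANGE]
print(f"implied machines per admin: {machines:.0f}")
print(f"salary share of ownership cost: {shares[1]:.1%} to {shares[0]:.1%}")
```

Under that assumption the figures imply roughly 71 machines per administrator, and salary accounts for only about 5% to 14% of the estimated cost of ownership, which is consistent with the slide's point that most of the cost lies in lost productivity rather than in salaries.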

Goals of Dissertation Research
· Describe field of System Administration
· Monitoring, Diagnosing, and Repairing:
  – Approach: Synthesize solutions from other fields of research
    1) Detect previously ignored problems
    2) Automatic repair of some problems
    3) Reduce number of administrators needed
    4) Support users’ understanding of system
· Apply here & distribute software
· Thesis: Through our approach, we can achieve goals 1-4.

Goals of System Administration
Goal: Support cost-effective use of the computer environment
More specifically (some non-technical):
· Environment: uniform, customizable, high performance and available
· Faults & errors: recovery from benign errors, protection from malicious attacks
· Users: training, accounting & planning, legal

Monitoring, Diagnosing, and Repairing (MDR)
· Introductory examples
· Fundamental requirements
· Environmental constraints
· Previous work
· Six key innovations
· Architecture
· Details on innovations
· Evaluation methodology

MDR: Examples (Intro)
· Four examples
  1) Broken component
  2) Resource overload: transient
  3) Resource contention: user program
  4) Resource exhaustion: long term
· Previous Solutions
  – Pay someone to watch
  – Ignore or wait for someone to complain
  – Specialized scripts (not general → vast repeated work)

MDR: Example 1
Web server has crashed/hung
· Gather information: process existence, service uptime, restart times
· Analyze data: process not responding, and hasn’t been recently restarted.
· Automatic repair: restart daemon.
· Notify administrator: had to restart daemon.
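The gather/analyze/repair/notify sequence in this example can be sketched as a small decision function. This is a minimal illustration, not the system's actual restarter: the `DaemonStatus` fields, the `decide` function, and the ten-minute threshold are all hypothetical names and values chosen for the sketch.

```python
from dataclasses import dataclass


@dataclass
class DaemonStatus:
    """Gathered facts about one daemon (field names are hypothetical)."""
    process_exists: bool
    responding: bool
    seconds_since_restart: float


# Hypothetical threshold: a daemon restarted less than 10 minutes ago
# counts as "recently restarted" and is not restarted again automatically.
MIN_RESTART_INTERVAL = 600.0


def decide(status: DaemonStatus) -> str:
    """Analyze the gathered data and pick an action."""
    if status.process_exists and status.responding:
        return "ok"
    if status.seconds_since_restart < MIN_RESTART_INTERVAL:
        # Restarting did not help last time; hand the problem to a human.
        return "notify_only"
    # Not responding and not recently restarted: repair, then report it.
    return "restart_and_notify"


print(decide(DaemonStatus(True, False, 3600.0)))
```

Refusing to restart a daemon that was restarted moments ago avoids a restart loop when the underlying fault is something a restart cannot fix.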

MDR: Example 2
The NOW is “slow.”
· Gather data: load, process info, CPU info
· Analyze data: bounds on expected values
· Notify administrator: fileserver overloaded.
· Visualize data: nfsd’s are overloaded.
· Repair: admin moves data, adds disks, or starts more nfsd’s
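The "bounds on expected values" step can be illustrated with a simple statistical band. The sketch below assumes the bounds are the mean plus or minus a multiple of the standard deviation; the slides do not say how bounds are computed, so the rule, the factor `k`, and the sample load values are all assumptions made for illustration.

```python
from statistics import mean, stdev


def learn_bounds(history, k=3.0):
    """Return (low, high) as mean +/- k sample standard deviations."""
    m = mean(history)
    s = stdev(history)
    return (m - k * s, m + k * s)


def out_of_bounds(value, bounds):
    """True when a new measurement falls outside the expected band."""
    low, high = bounds
    return value < low or value > high


# Made-up load averages for a fileserver during normal operation.
normal_load = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8]
bounds = learn_bounds(normal_load)

for load in (1.1, 6.5):
    status = "OUT OF BOUNDS" if out_of_bounds(load, bounds) else "ok"
    print(f"load {load}: {status}")
```

A fixed multiple of the standard deviation is the simplest choice; a deployed monitor would probably also need to handle statistics that are not roughly symmetric around their mean.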

MDR: Example 3
User running program
· Gather: user statistics, CPU, disk
· Visualize: spending too much time waiting on remote accesses
  (User fixes program; gathering and visualization repeated)
· Analyze: some nodes have less throughput
· Visualize: those have other jobs running on them
· Repair: user is benchmarking so kills all extraneous processes

MDR: Example 4
Web server load increasing beyond capacity
· Gather: CPU, request rate, reply latency
· Analyze: Burst lengths getting longer, latency increasing
· Visualize: Graph of burst lengths & CPU usage over time
· Repair: Order more machines, install load balancer
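One way to make "burst lengths getting longer" concrete is to measure runs of consecutive samples above a threshold. The sketch below is an assumption about what a burst is (the slides do not define it), and the request rates and the threshold of 50 are made-up values.

```python
def burst_lengths(rates, threshold):
    """Lengths of consecutive runs of samples above the threshold."""
    lengths = []
    run = 0
    for rate in rates:
        if rate > threshold:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:  # a burst still in progress at the end of the series
        lengths.append(run)
    return lengths


def strictly_growing(lengths):
    """True when every burst is longer than the one before it."""
    return all(a < b for a, b in zip(lengths, lengths[1:]))


# Made-up requests/second samples, one per measurement interval.
request_rate = [10, 80, 90, 10, 85, 95, 90, 10, 88, 92, 91, 97]
bursts = burst_lengths(request_rate, threshold=50)
print(bursts, "growing" if strictly_growing(bursts) else "not growing")
```

Over the week-to-month time scale that this example targets, a trend fit over many bursts would be more robust than the strict comparison used here.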

MDR: Fundamental Requirements
· Gathering
  – Flexible data gathering, self-describing storage
· Analyzing
  – Calculate statistical measures, identify relevant statistics.
· Notifying
  – Flexible infrequent messages to administrators or users
· Visualizing
  – Maximize information/pixel, support multiple interfaces
· Repairing
  – Automate simple repairs, support group operations

MDR: Environmental Constraints
· Change is inherent
  – Lack of Web/Mbone 5 years ago, now most/many have these.
· Problems on many time-scales
  – Second-Minute transients vs. Week-Month capacity problems
· Must operate under very adverse conditions
  – Often used when system is broken
  – Would like at least post-mortem analysis
· Need to handle hundreds to thousands of nodes
  – Scalability: All sites are getting larger, possibly wide area
  – Our system has 200 (NOW) to 2000 (Soda) nodes

MDR: Previous Systems
· Many previous systems: I’ve looked at about 16.
· Not comprehensive, not extensible.
· Look at a few that did a nice job of a piece:
· [Fink 97]: Run test, notify display engine
  + Easy to add tests
  + Selectivity of notification good
  – Tests are just programs (redo gathering)
  – Central, non-fault tolerant solution
  – Many hard coded constants

MDR: Previous Systems, cont.
· [Hard 92]: buzzerd, a pager notification system
  + Flexible rules for notification
  + External interface for adding notify requests
  – Simplistic gathering
  – Poor fault tolerance
· [Pier 96]: Igor group fixes
  + Flexible operations
  + Nice reporting of success/failure
  – Weak security, runs as root
  – No delegation of responsibility

MDR: Six Key Innovations (1-3)
· Replicated, semi-hierarchical, data storage nodes
  – Rendezvous point for programs
  – Handles scaling and fault-tolerance
· Self describing structures
  – Functions (visualize, summarize) + data go in database (OO)
  – DB has machine and human readable descriptions of data
· End to end notification
  – Detect problems in MDR system
  – Guarantee important messages get to users

MDR: Six Key Innovations (4-6)
· Aggregation and High Resolution Color Displays
  – Reduce information to manageable amounts
  – Maximize information per unit area
· Partially self-configuring
  – Learn averages, deviations, burst sizes
  – Learn which values are relevant to problems
· Secure, user-specified group repairs
  – Don’t enable malicious attacks
  – Automate repairs of many machines

MDR: Architecture
[Diagram: Gather Agent (vmstat thread, ping thread, tcpdump thread); SQL-based Data Repository; Aggregation Engine; Diagnostic Console; Daemon Restarter; E-mail or Phone Notifier; Long-term graphing; Tolerance, Relevance Learner]

MDR-Arch: Derivations
[Diagram: SQL-based Data Repository; Daemon Restarter; Diagnostic Console; E-mail or Phone Notifier; Tolerance, Relevance Learner]

Key: Semi-Hier. DBs.
[Diagram: Top level cache; Mid level cache; Per-node databases]
· Fault tolerance
· Scalability:
  – Caches don’t need to commit to disk; authoritative copy elsewhere.
  – Batching updates over wide area links.
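The cache hierarchy on this slide can be sketched with in-memory objects. The sketch below is an illustration under stated assumptions, not the system's design: `NodeDatabase` and `Cache` are hypothetical classes, the per-node databases are taken to hold the authoritative copy, and a "batch" is simply every row added since the previous pull.

```python
class NodeDatabase:
    """Per-node store holding the authoritative copy of that node's rows."""

    def __init__(self, name):
        self.name = name
        self.rows = []

    def record(self, stat, value):
        self.rows.append((self.name, stat, value))


class Cache:
    """In-memory cache that never commits to disk; it can always be
    rebuilt from the levels below, where the authoritative copy lives."""

    def __init__(self, children):
        self.children = list(children)
        self.rows = []
        self._seen = [0] * len(self.children)  # rows already fetched per child

    def pull(self):
        """Fetch everything new from each child as one batch per child."""
        for i, child in enumerate(self.children):
            if isinstance(child, Cache):
                child.pull()
            batch = child.rows[self._seen[i]:]
            self.rows.extend(batch)
            self._seen[i] += len(batch)

    def rebuild(self):
        """Recover after losing the cache contents (for example, a crash)."""
        self.rows = []
        self._seen = [0] * len(self.children)
        self.pull()


node_a, node_b = NodeDatabase("a"), NodeDatabase("b")
mid = Cache([node_a, node_b])
top = Cache([mid])

node_a.record("load", 0.7)
node_b.record("load", 2.1)
top.pull()
node_a.record("load", 0.9)
top.pull()
print(len(top.rows), "rows visible at the top level")
```

Because only the lowest level is authoritative, losing a cache costs a `rebuild` rather than data, which is the point of the first sub-bullet; the replication that the earlier innovations slide mentions is not modeled here.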

Key: Self-Describing
· De-couple data gathering, data storage, and data use
· Self-Describing for Humans
  – Descriptions of meanings of values stored with tables
  – Description of methods of gathering stored with tables
  – Column names help with self
· Self-Describing for Computers
  – Functions for visualizing or summarizing data
  – Indication of resource selection from resource statistics
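A self-describing table can be sketched with Python's standard `sqlite3` module. The schema below is purely illustrative: the table names, column names, and the idea of storing an SQL aggregate name as the "summarize" function are assumptions, not the repository's actual layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vmstat (node TEXT, taken_at INTEGER, cpu_idle_pct REAL);
CREATE TABLE column_descriptions (
    table_name TEXT, column_name TEXT,
    meaning TEXT,          -- for humans: what the value means
    gathered_by TEXT,      -- for humans: how the value was gathered
    summarize_with TEXT    -- for programs: SQL aggregate to apply
);
""")
conn.execute(
    "INSERT INTO column_descriptions VALUES (?, ?, ?, ?, ?)",
    ("vmstat", "cpu_idle_pct", "Percentage of CPU time spent idle",
     "vmstat thread of the gather agent", "AVG"),
)
conn.executemany("INSERT INTO vmstat VALUES (?, ?, ?)",
                 [("node1", 0, 90.0), ("node1", 60, 70.0)])

ALLOWED_AGGREGATES = {"AVG", "MIN", "MAX"}


def summarize(conn, table, column):
    """Summarize a column using only the description stored beside it."""
    row = conn.execute(
        "SELECT meaning, summarize_with FROM column_descriptions "
        "WHERE table_name = ? AND column_name = ?", (table, column)).fetchone()
    if row is None or row[1] not in ALLOWED_AGGREGATES:
        raise ValueError(f"no usable description for {table}.{column}")
    meaning, aggregate = row
    # Identifiers are interpolated only after matching a stored description.
    value = conn.execute(
        f"SELECT {aggregate}({column}) FROM {table}").fetchone()[0]
    return meaning, value


print(summarize(conn, "vmstat", "cpu_idle_pct"))
```

The consumer never hard-codes what `cpu_idle_pct` means or how to summarize it, which is the de-coupling of gathering, storage, and use that the first bullet asks for.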

Key: End-to-End Notification
· Recall: System must operate under extreme conditions
· Humans must validate that system is still working
  – Standalone display can indicate timestamps, mark out of date data
  – Wireless machine could intermittently contact notification system
  – Pager could be automatically paged every so often
· Problems should be propagated to end users.
  – Flexible notification: connected systems, e-mail, pager.
  – Limit over-notification
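Two of the ideas on this slide, marking out-of-date data and limiting over-notification, can be sketched in a few lines. The `Notifier` class, the `is_stale` helper, and the time values below are hypothetical; delivery by e-mail or pager is replaced by appending to a list so the sketch stays self-contained.

```python
class Notifier:
    """Suppresses repeats of the same problem within min_interval seconds."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_sent = {}   # problem key -> time of last delivered message
        self.sent = []         # stand-in for e-mail or pager delivery

    def notify(self, key, message, now):
        """Deliver the message unless the same key was sent too recently."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.min_interval:
            return False
        self._last_sent[key] = now
        self.sent.append((now, key, message))
        return True


def is_stale(last_update, now, max_age):
    """True when displayed data is older than max_age seconds."""
    return now - last_update > max_age


notifier = Notifier(min_interval=900)
print(notifier.notify("fileserver-overload", "nfsd overloaded", now=0))
print(notifier.notify("fileserver-overload", "nfsd overloaded", now=300))
print(notifier.notify("fileserver-overload", "nfsd overloaded", now=1200))
print(is_stale(last_update=0, now=400, max_age=120))
```

Passing `now` explicitly instead of reading the clock keeps the behavior easy to test; a staleness check like `is_stale` is what would let a standalone display mark data as out of date when the monitoring system itself has stopped updating.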

Key: Aggregation & HiRes
· System target has hundreds to thousands of nodes
· Aggregate by showing out of bounds, relevant values (via automatic tuning)
· Also want overview of system
  – Aggregate across similar statistics; show value (fill) & dispersion (shade)
  – Use color to highlight important values.
  – Aggregate across values (machine utilization = CPU + disk + memory)
  – Maximize data/pixel [Tufte]
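The two kinds of aggregation on this slide can be sketched as follows. Mapping "value" to the mean and "dispersion" to the population standard deviation is an assumption made for this sketch, as are the function names and the sample utilizations; the sum for machine utilization is taken directly from the slide.

```python
from statistics import mean, pstdev


def aggregate_statistic(values):
    """Collapse one statistic across many nodes into two display channels:
    fill (central value) and shade (dispersion)."""
    return {"fill": mean(values), "shade": pstdev(values)}


def machine_utilization(cpu, disk, memory):
    """Combine several statistics for one machine, as on the slide:
    machine utilization = CPU + disk + memory."""
    return cpu + disk + memory


# Made-up CPU utilizations (fraction busy) for four nodes.
cpu_by_node = [0.2, 0.4, 0.6, 0.8]
cell = aggregate_statistic(cpu_by_node)
print(f"fill={cell['fill']:.2f} shade={cell['shade']:.2f}")
print(f"utilization={machine_utilization(0.25, 0.5, 0.125)}")
```

Drawing one cell per group of nodes rather than one per node is what keeps a display of hundreds to thousands of nodes readable, while the shade channel still reveals when a calm-looking average hides a wide spread.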

Key: Agg & HiRes: Snapshot
[Figure: snapshot]

Key: Self-Configuring
· Single statistics
  – Phase 1: Calculate averages, standard deviations, burst sizes
  – Worked in other systems [Jaco 88, Karn 91]
· Identify relevant statistics
  – Give system Boolean examples (variables out of bounds, and system working/not working) → get function.
  – Works for Boolean disjunctions in some cases:
    • With lots of irrelevant variables [Litt 89]
    • With random bad examples [Sloa 89]
    • In some cases, with malicious bad examples [Ande 94]
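Learning which statistics are relevant from Boolean examples can be illustrated with the classical elimination algorithm for monotone disjunctions. This is a swapped-in textbook technique, not necessarily what the cited papers or the system use, and it assumes noise-free examples, so it does not cover the "bad examples" cases on the slide. The variable names and examples are made up.

```python
def learn_disjunction(examples):
    """Elimination algorithm for a monotone disjunction.

    Each example is (flags, broken): flags maps a statistic name to True
    when it was out of bounds; broken is True when the system was not
    working. Start by suspecting every statistic, then drop any statistic
    that was out of bounds while the system was working.
    """
    candidates = set(examples[0][0])
    for flags, broken in examples:
        if not broken:
            candidates -= {name for name, out in flags.items() if out}
    return candidates


def predict_broken(candidates, flags):
    """Predict failure when any remaining statistic is out of bounds."""
    return any(flags[name] for name in candidates)


examples = [
    ({"nfsd_load": True,  "disk_full": False, "cpu_temp": False}, True),
    ({"nfsd_load": False, "disk_full": False, "cpu_temp": True},  False),
    ({"nfsd_load": False, "disk_full": True,  "cpu_temp": True},  True),
    ({"nfsd_load": False, "disk_full": False, "cpu_temp": False}, False),
]
relevant = learn_disjunction(examples)
print(sorted(relevant))
```

A single mislabeled "working" example would wrongly eliminate a relevant statistic, which is why results that tolerate random or malicious bad examples matter for a monitor trained on real operational data.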

Key: Secure Remote Actions
· Security because of malicious attacks, benign errors
· Delegation to remove SA from the loop
· Independence from particular algorithms
  – Building a library
  – Program with principals (hosts, users), and properties (signed, sealed, verifiable)
· Use secure, run-time extensible languages
· Actions report through gathering system
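Programming with principals and properties rather than keys and algorithms can be sketched with a pair of helpers that hide the algorithm behind a "signed" property. HMAC-SHA256 from the Python standard library stands in for whatever mechanism the real library would select; `sign_action`, `is_signed_by`, the action fields, and the demo key are hypothetical.

```python
import hashlib
import hmac
import json


def sign_action(action, key):
    """Give an action the 'signed' property; callers never name the algorithm."""
    payload = json.dumps(action, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def is_signed_by(action, signature, key):
    """Check the 'signed' property using a constant-time comparison."""
    return hmac.compare_digest(sign_action(action, key), signature)


# Hypothetical shared key; key distribution is outside this sketch.
key = b"demo-key-for-illustration-only"
action = {
    "principal": "user:alice",
    "hosts": ["node1", "node2"],
    "command": "restart nfsd",
}
signature = sign_action(action, key)

tampered = dict(action, command="kill all processes")
print(is_signed_by(action, signature, key))
print(is_signed_by(tampered, signature, key))
```

Because callers only ask whether an action is signed, the helper could later switch to public-key signatures without changing calling code, which is the independence from particular algorithms that the slide asks for. Delegation and the "sealed" property are not shown.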

MDR: Testing Methodology
· Fault injection
  – Deliberately make the system slow
  – Break hardware/software components
· Feature comparison
  – Paper comparison with other systems
· Usage in practice
  – Experience important to show system works
  – We have need of administrative tools
· Testimonials
  – Experience at other sites lends credibility

MDR: Demo
· Hierarchical structure working (1 level right now)
· Alternative Interface
· Fault Injection
· Need for Aggregation
· Crufty right now
· Demo

Timeline: Key Pieces
1) (DBs) Replicated, semi-hierarchical, data storage nodes
2) (SDS) Self describing structures
3) (Vis) Aggregation and High Resolution Color Displays
4) (E2EN) End to end notification
5) (ReS) Automatic Restart
6) (Cfg) Partially self-configuring
7) (Rep) Secure, user-specified group repairs

Timeline
Deadlines: LISA 6/97; USENIX 12/97; OSDI 3/98; LISA 6/98; Graduation 12/98; SOSP 3/99
Work items (as listed on the chart):
· Prototype 1, 2, 3 (DBs, SelfD, Vis)
· Prototype 4, 5 (Notify, Restart)
· Prototype 6, 7 (AConfig, Repair)
· Experience with 1-7
· Architecture of Complete System
· Writing
Time axis: June 1997; Dec 1997; June 1998; Dec 1998; Mar 1999

Conclusion
· Description of field shows breadth
· Monitoring, diagnosing, and repairing shows depth
  – Examples show importance of problem
  – Fundamental goals & environmental constraints show understanding of problem
  – Key innovations show differences from previous systems.
  – Architecture and initial prototype show approach to problem
  – Testing methods show ways to validate solution.
· Timeline shows plan & milestones to graduation

Old Slides

Solutions
· Managing stable storage
· Supporting users
· Simplifying security
· Monitoring, diagnosing, and repairing

Managing Stable Storage
· Consistency vs. availability
· Fault tolerance
· Scalability
· Recoverability
· Customization

Supporting Users
· Automated help desk
  – Searchable collection of questions
  – Easy method for addition
· Remote device access
· Site-wide training

Goals: Environment
· Uniform
  – Supports user mobility by eliminating arbitrary changes
  – Increases effectiveness by avoiding need for users to learn multiple interfaces
· Customizable
  – Handles special systems and special needs [firewalls, servers]
  – Obviously reduces uniformity

Goals: Environment, cont.
· High Performance
  – Increases effectiveness of users [HCI/psych]
  – Limited by cost-effectiveness
· Available
  – Effectiveness is 0 if system isn’t working
  – Balanced against expense

Goals: Faults & Errors
· Benign errors:
  – Accidentally deleted files
  – Unnoticed runaway processes
· Malicious attacks:
  – TCP SYN attack
  – Sendmail bugs
  – Data stealing
  – False data injection

Goals: Users
· Training
  – Troubleshooting = one-on-one training
  – Larger sessions = classes
· Accounting
  – Supports management, helps billing
· Capacity Planning
  – Expanding systems takes time
· Legal
  – Sensitive information needs protection

Simplifying Security
USENIX talk says “If cryptography is so great, why isn’t it used more?”
· SA’s worry about security to protect data.
· Goal: Ease development of secure applications
· Write programs using principals & properties rather than keys and algorithms
· Unify various forms of available cryptography (public key, secret-key, PGP, Kerberos)
· My use: protected, transferable rights to allow various actions
  – Modify system configurations (add filesystems, printers)
  – Kill/restart processes (runaway, after configurations modified)
  – Access data (private logs, for backups, etc.)

Conclusion
· System administration as area of research
  – Description of field
  – Areas for future research
    • Managing stable storage
    • Supporting users
· Initial investigation of research area
  – Monitoring, diagnosing, and repairing
    • Broad, draws from many fields