b9096379aa2f7c54ab2de5b870a17751.ppt
- Number of slides: 22
• R. Barrett, P. Maglio, E. Kandogan, J. Bailey, Usable Autonomic Computing Systems: the Administrators' Perspective, ICAC 2004
• A. Brown and J. Hellerstein, Reducing the Cost of IT Operations - Is Automation Always the Answer?, HotOS 2005
• Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang, Automatic Misconfiguration Troubleshooting with PeerPressure, OSDI '04
• R. Barrett, P. Maglio, E. Kandogan, J. Bailey, Usable Autonomic Computing Systems: the Administrators' Perspective, ICAC 2004
• Motivation
  – the problem of administering highly complex systems
  – managing complexity through automation
    • from low-level configuration settings to high-level business-oriented policies
  – the risk of making management harder
    • systems change more rapidly
    • administrator controls affect more systems
    • so administrator controls will be both more powerful and more dangerous
• Goal: inform the design of autonomic computing (AC)
• Methodology: ethnographic field study!
• What do system administrators do?
  – rehearsal and planning
  – maintaining situation awareness
  – managing multitasking, interruptions and diversions
Tools
• command-line based console
  – command-line interfaces (CLIs)
  – multitasking, history, scripting
  – fast and reliable probing of disparate parts of the system
  – easy to customize!
• standalone graphical applications
  – graphical user interfaces (GUIs)
  – good for unfamiliar tasks and novice users
  – depend on graphics support; insufficient support for multitasking
• web-based management tools
  – don't depend on graphics support
  – can be integrated to provide an organized suite
Analysis and Guidelines for AC
• Phases
  – rehearsal and planning
  – maintaining situation awareness
  – managing multitasking, interruptions and diversions
• Rehearsing and Planning
  – necessary for critical systems because of both the chance of human error and the danger of unforeseen consequences
  – AC may increase both of these dangers
    • as the scale and degree of coupling within complex systems increase, new patterns of failure may develop through a series of several smaller failures
    • as autonomic managers automatically reconfigure subsystems, the effects on the overall system may be difficult to predict
  – Guidelines
    • it should be easy to build test systems
    • systems should be designed so that changes can be undone quickly
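The undo guideline can be illustrated with a minimal sketch. It assumes a hypothetical in-memory settings store, `UndoableConfig`, that records the previous value of each setting before overwriting it. The class, method, and setting names are invented for illustration and do not come from the study.

```python
class UndoableConfig:
    """Hypothetical settings store that journals every change so it can be reverted."""

    _MISSING = object()  # marks "the key did not exist before the change"

    def __init__(self, settings=None):
        self.settings = dict(settings or {})
        self._journal = []  # stack of (key, previous value)

    def set(self, key, value):
        """Apply a change and remember what it replaced."""
        previous = self.settings.get(key, self._MISSING)
        self._journal.append((key, previous))
        self.settings[key] = value

    def undo(self):
        """Revert the most recent change; return False if there is nothing to undo."""
        if not self._journal:
            return False
        key, previous = self._journal.pop()
        if previous is self._MISSING:
            del self.settings[key]
        else:
            self.settings[key] = previous
        return True


cfg = UndoableConfig({"max_connections": 100})
cfg.set("max_connections", 500)  # risky change
cfg.set("cache", "on")           # new setting
cfg.undo()                       # removes "cache"
cfg.undo()                       # restores max_connections to 100
print(cfg.settings)
```

Keeping the journal next to the settings means a rollback needs no knowledge of why a change was made, only what it replaced.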
• Situation Awareness
  – Administrators deal with dynamic and complex processes at many different levels of abstraction
  – They need to be aware of systems that are not only complex but also change frequently
  – Each system had its own management interface, so gaining overall situation awareness was very difficult
  – Guidelines
    • Automation has made operators more passive
    • Automated systems typically hide details from operators
      – Consequently, operator workload decreases during normal operating conditions but increases during critical conditions
    • AC systems must provide facilities for rapidly gaining deeper situation awareness when problems arise
• Multitasking, Interruptions, Diversions
  – Conventional systems
    • Working with many components, but each component works relatively independently
  – Guidelines
    • Because each level affects a component's operation, it will be difficult to design a general workflow for debugging
    • Therefore AC interfaces should allow multiple simultaneous views of system components and aggregates to support interaction at multiple levels
• A. Brown and J. Hellerstein, Reducing the Cost of IT Operations - Is Automation Always the Answer?, HotOS 2005
Is Automation Always the Answer? No! Why?
• Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang, Automatic Misconfiguration Troubleshooting with PeerPressure, OSDI '04
Misconfiguration Diagnosis
• Technical support contributes 17% of total cost of ownership (TCO) [Tolly 2000]
• Much application malfunctioning comes from misconfigurations
• Why?
  – Shared configuration data (e.g., Registry) and uncoordinated access and update from different applications
• How about maintaining the golden config state?
  – Very hard [Larsson 2001]
    • Complex software components and compositions
    • Third party applications
    • …
Outline
✓ Motivation
• Goals
• Design
• Prototype
• Evaluation results
• Future work
• Concluding remarks
Goals
• Effectiveness
  – Small set of sick configuration candidates that contain the root-cause entries
• Automation
  – No second party involvement
  – No need to remember or identify what is healthy
Intuition behind PeerPressure
• Assumption
  – Applications function correctly on most machines; malfunctioning is an anomaly
• Succumb to the peer pressure
An Example
Suspects   e1    e2    e3
Mine       0     on    57
P1's       1     on    4
P2's       1     on    0
P3's       1     on    100
P4's       1     off   34
• Is e1 sick? Most likely
• Is e2 sick? Probably not
• Is e3 sick? Maybe not
  – e3 looks like an operational state
• We use Bayesian statistics to estimate the sick probability of a suspect, which is our ranking metric
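The ranking in this example can be reproduced with a small sketch. The slide only says that Bayesian statistics are used, so the estimator below is an assumption rather than something stated in the deck: it scores a suspect as (N + c) / (N + c·t + c·M·(t − 1)), where N is the number of peer samples, c is the number of distinct values observed for the entry (an assumed stand-in for its cardinality), M is the number of peers that share my value, and t is the number of suspects. The function name `sick_probability` is hypothetical.

```python
from fractions import Fraction


def sick_probability(mine, peers, num_suspects):
    """Assumed estimator, not taken from the slides: (N + c) / (N + c*t + c*M*(t - 1))."""
    n = len(peers)                          # N: number of peer samples
    c = len(set(peers) | {mine})            # c: distinct values observed (assumed cardinality)
    m = sum(1 for v in peers if v == mine)  # M: peers that share my value
    t = num_suspects                        # t: number of suspect entries
    return Fraction(n + c, n + c * t + c * m * (t - 1))


# Table from the example slide: my value, then the values on P1..P4.
suspects = {
    "e1": ("0", ["1", "1", "1", "1"]),
    "e2": ("on", ["on", "on", "on", "off"]),
    "e3": ("57", ["4", "0", "100", "34"]),
}

scores = {
    name: sick_probability(mine, peers, len(suspects))
    for name, (mine, peers) in suspects.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
for name in ranking:
    print(name, float(scores[name]))
```

Under these assumptions the scores are 3/5 for e1, 9/19 for e3, and 3/11 for e2, so the ordering agrees with the slide's verdicts (most likely, maybe not, probably not). For a value that no peer shares, the score falls as the number of distinct values grows, which is why an entry like e3 that differs on every machine is treated as operational state rather than as a likely misconfiguration.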
System Overview
[Figure: PeerPressure architecture. Input: run the faulty app. Components: App Tracer, Canonicalizer, Search & Fetch, Statistical Analyzer, Peer-to-Peer Troubleshooting Community, PeerPressure Database.]
Registry Entry Suspects
Entry                  Data
HKLM\Software\Msft...  On
HKLM\System\Setup...   0
HKCU\%\Software...     null
Troubleshooting Result
Entry                  Prob.
HKLM\System\Setup...   0.6
HKCU\%\Software...     0.2
HKLM\Software\Msft...  0.003
Evaluation Data Set
• 87 live Windows XP registry snapshots (in the database)
  – Half of these snapshots are from three diverse organizations within Microsoft: Operations and Technology Group (OTG) Helpdesk in Colorado, MSR-Asia, and MSR-Redmond
  – The other half are from machines across Microsoft that were reported to have potential Registry problems
• 20 real-world troubleshooting cases with known root causes
Response Time
• # of suspects: 8 to 26,308, with a median of 1,171
• 45 seconds on average, with SQL Server hosted on a 2.4 GHz CPU workstation with 1 GB RAM
• Sequential database queries dominate
Troubleshooting Effectiveness
• Metric: root cause ranking
• Results:
  – Rank = 1 for 12 cases
  – Rank = 2 for 3 cases
  – Rank = 3, 9, 12, 16 for 4 cases, respectively
  – Cannot solve one case
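The reported ranks can be summarized as the share of cases in which the root cause appears within the top k suspects. The sketch below uses only the numbers on this slide. The helper name `fraction_within` is hypothetical, and counting the unsolved case in the denominator is a choice made here, not one stated in the deck.

```python
# Rank of the true root cause in each solved case, as reported on the slide.
ranks = [1] * 12 + [2] * 3 + [3, 9, 12, 16]
unsolved = 1
total_cases = len(ranks) + unsolved


def fraction_within(k):
    """Share of all cases whose root cause is ranked k or better."""
    return sum(1 for r in ranks if r <= k) / total_cases


print(total_cases)
for k in (1, 2, 3):
    print(k, fraction_within(k))
```

With these numbers the 20 cases break down as 60% ranked first, 75% within the top two, and 80% within the top three.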
Concluding Remarks
• Automatic misconfiguration diagnosis is possible
  – Use statistics from the mass to automate the manual identification of the healthy configuration
  – Initial results are promising