Intro & Overview of RADS goals
Armando Fox & Dave Patterson
CS 444A / CS 294-6, Stanford / UC Berkeley, Fall 2004

§ Administrivia
§ Course logistics & registration
§ Project expectations and other deliverables
§ Background and motivation for RADS
§ ROC and its relationship to RADS
§ Early case studies
§ Discussion: projects, research directions, etc.

Administrivia/goals
§ Stanford enrollment vs. Axess
§ SLT and CT tutorial VHS/DVDs available to view
§ SLT and CT lab/assignments grading policy
§ Stanford and Berkeley meeting/transportation logistics
§ Format of course

Background & motivation for RADS

RADS in One Slide
§ Philosophy of ROC: focus on lowering MTTR to improve overall availability
§ ROC achievements: two levels of lowering MTTR
  § "Microrecovery": fine-grained generic recovery techniques recover only the failed part(s) of the system, at much lower cost than whole-system recovery
  § Undo: sophisticated tools to help human operators selectively back out destructive actions/changes to a system
§ General approach: use microrecovery as "first line of defense"; when it fails, provide support to human operators to avoid having to "reinstall the world"
§ RADS insight: can combine cheap recovery with statistical anomaly detection techniques

Hence, (at least) 2 parts to RADS
§ Investigating other microrecovery methods
§ Investigating analysis techniques
  § What to capture/represent in a model
§ Addressing fundamental open challenges
  § stability
  § systematic misdiagnosis
  § subversion by attackers
  § etc.
§ General insight: "different is bad"
  § "law of large numbers" arguments support this for large services

Why RADS
§ Motivation
  § 5 9's availability => 5 down-minutes/year => must recover from (or mask) most failures without human intervention
  § a principled way to design "self-*" systems
§ Technology
  § High-traffic large-scale distributed/replicated services => large datasets
  § Analysis is CPU-intensive => a way to trade extra CPU cycles for dependability
  § Large logs/datasets for models => storage is cheap and getting cheaper
§ RADS addresses a clear need while exploiting demonstrated technology trends
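
As a quick sanity check on the five-nines arithmetic: 99.999% availability leaves 10^-5 of the year for downtime, i.e. 0.00001 × 365 × 24 × 60 ≈ 5.3 minutes per year, which is roughly the "5 down-minutes/year" figure on the slide.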

Cheap Recovery

Complex systems of black boxes
§ ". . . our ability to analyze and predict the performance of the enormously complex software systems that lies at the core of our economy is painfully inadequate." (Choudhury & Weikum, 2000 PITAC Report)
§ Networked services too complex and rapidly changing to test exhaustively: "collections of black boxes"
  § Weekly or biweekly code drops not uncommon
  § Market activities lead to integration of whole systems
§ Need to get humans out of the loop for at least some monitoring/recovery loops
  § hence interest in "autonomic" approaches
  § fast detection is often at odds with false alarms

Consequences
§ Complexity breeds increased bug counts and bug impact
§ Heisenbugs, race conditions, environment-dependent and hard-to-reproduce bugs still account for the majority of SW bugs in live systems
  § up to 80% of bugs found in production are those for which a fix is not yet available*
§ some application-level failures result in user-visible bad behavior before they are detected by site monitors
  § Tellme Networks: up to 75% of downtime is "detection" (sometimes by user complaints), followed by localization
  § Amazon, Yahoo: gross metrics track second-order effects of bugs, but lag the actual bug by minutes or tens of minutes
§ Result: downtime and increased management costs
* A. P. Wood, Software reliability from the customer view, IEEE Computer, Aug. 2003

"Always adapting, always recovering"
§ Build statistical models of the "acceptable" operating envelope by measurement & analysis on the live system
  § Control theory, statistical correlation, anomaly detection. . .
§ Detect runtime deviations from the model
  § typical tradeoff is between detection rate & false positive rate
§ Rely on external control, using inexpensive and simple mechanisms that respect the black box, to keep the system within its acceptable operating envelope
  § invariant: attempting recovery won't make things worse
  § makes inevitable false positives tolerable
  § can then reduce false negatives by "tuning" algorithms to be more aggressive and/or deploying multiple detectors
Systems that are "always adapting, always recovering"

Toward recovery management invariants
§ Observation: instrumentation and analysis
  § collect and analyze data from running systems
  § rely on "most systems work most of the time" to automatically derive baseline models
§ Analysis: detect and localize anomalous behavior
§ Action: close the loop automatically with "micro-recovery"
  § "Salubrious": returns some part of the system to a known state
    • Reclaim resources (memory, DB conns, sockets, DHCP lease. . .), throw away corrupt transient state, set up to retry the operation if appropriate
  § Safe: no effect on correctness, minimal effect on performance
  § Localized: parts not being microrecovered aren't affected
§ Fast recovery simplifies failure detection and recovery management.
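
A minimal sketch of the observe/analyze/act loop described above; collect_metrics, is_anomalous, and microreboot are hypothetical hooks standing in for whatever instrumentation, model, and recovery mechanism a real deployment provides:

```python
import time

def monitoring_loop(components, collect_metrics, is_anomalous, microreboot,
                    interval_s=1.0):
    """Observe -> analyze -> act loop sketched from the slide.

    collect_metrics(c) -> dict of per-component stats   (observation)
    is_anomalous(c, m) -> bool, from a baseline model    (analysis)
    microreboot(c)     -> recover just that component    (action)
    The action is assumed safe to invoke spuriously, so false positives
    only cost a little extra work rather than correctness.
    """
    while True:
        for c in components:
            metrics = collect_metrics(c)      # observation
            if is_anomalous(c, metrics):      # analysis vs. baseline model
                microreboot(c)                # localized, "salubrious" action
        time.sleep(interval_s)
```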

Non-goals/complementary work
All of the following are being capably studied by others, and directly compose with our own efforts. . .
§ Byzantine fault tolerance
§ In-place repair of persistent data structures
§ Hard real-time response guarantees
§ Adding checkpointing to legacy non-componentized applications
§ Source code bug finding
§ Advancing the state of the art in SLT (analysis algorithms)

Outline
§ Micro-recoverable systems
  § Concept of microrecovery
  § A microrecoverable application server & session state store
§ Application-generic SLT-based failure detection
  § Path and component analysis and localization for the appserver
  § Simple time-series analyses for the purpose-built state store
§ Combining SLT detection with microrecoverable systems
§ Discussion, related work, implications & conclusions

Microrebooting: one kind of microrecovery
§ 60+% of software failures in the field* are reboot-curable, even if the root cause is unknown. . . why?
  § Rebooting discards bad temporary data (corrupted data structures that can be rebuilt) and (usually) reclaims used resources
  § reestablishes control flow in a predictable way (breaks deadlocks/livelocks, returns thread or process to its start state)
§ To avoid imperiling correctness, we must. . .
  § Separate data recovery from process recovery
  § Safeguard the data
  § Reclaim resources with high confidence
§ Goal: get the same benefits of rebooting but at much finer grain (hence faster and less disruptive) - microrebooting
* D. Oppenheimer et al., Why do Internet services fail and what can be done about it?, USITS 2003

Write example: "Write to Many, Wait for Few"
§ Try to write to W random bricks, W = 4
§ Must wait for WQ bricks to reply, WQ = 2
[diagram: Browser → AppServer (STUB) → Bricks 1-5]

Write example: "Write to Many, Wait for Few" (continued)
§ Try to write to W random bricks, W = 4; must wait for WQ bricks to reply, WQ = 2
§ Cookie holds metadata: which bricks acknowledged the write (here, Bricks 1 and 4)
§ Bricks that didn't reply may be crashed or just slow
[diagram: Browser → AppServer (STUB) → Bricks 1-5]

Read example
§ Try to read from Bricks 1, 4 (the bricks listed in the cookie)
[diagram: Browser (cookie: 1, 4) → AppServer (STUB) → Bricks 1-5]

Read example: Brick 1 crashes
[diagram: Brick 1 unavailable; the read proceeds against the remaining cookie brick]

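A minimal Python sketch of the write-to-many/wait-for-few and cookie-guided read protocol illustrated in the preceding slides; brick_write and brick_read stand in for SSM's real RPC interface and are assumptions, not the actual API:

```python
import random

W, WQ = 4, 2   # write to W random bricks, wait for WQ acknowledgments

def write_session_state(bricks, key, value, brick_write):
    """Write-to-many, wait-for-few: returns 'cookie' metadata naming the
    bricks that acknowledged, which the browser carries on later requests."""
    targets = random.sample(bricks, W)
    acked = []
    for b in targets:                      # a real stub issues these in parallel
        if brick_write(b, key, value):     # did the brick ack within the timeout?
            acked.append(b)
        if len(acked) >= WQ:
            return acked                   # cookie: bricks known to hold the state
    raise RuntimeError("fewer than WQ bricks acknowledged; say no / retry")

def read_session_state(cookie_bricks, key, brick_read):
    """Read from the bricks named in the cookie; any single reply suffices,
    so one crashed or slow brick (e.g. Brick 1) does not block the read."""
    for b in cookie_bricks:
        value = brick_read(b, key)
        if value is not None:
            return value
    raise RuntimeError("no brick in cookie responded")
```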

SSM: Failure and Recovery
§ Failure of a single node
  § No data loss, WQ-1 replicas remain
  § State is available for R/W during the failure
§ Recovery
  § Restart - no special-case recovery code
  § State is available for R/W during brick restart
  § Session state is self-recovering
    • User's access pattern causes data to be rewritten

Backpressure and Admission Control
[diagram: AppServer stubs sending to Bricks 1-5; heavy flow to Brick 3 causes the stub to drop requests]
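
A small sketch of this kind of stub-side admission control, assuming a hypothetical per-brick in-flight counter; the threshold is illustrative, not a value from the slides:

```python
MAX_INFLIGHT = 100   # illustrative threshold

def maybe_send(brick, request, inflight, send):
    """Backpressure at the stub: drop (say "no") rather than pile more work
    onto a brick that is already overloaded."""
    if inflight[brick] >= MAX_INFLIGHT:
        return False                 # caller may retry another brick or fail fast
    inflight[brick] += 1
    send(brick, request)
    return True
```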

Statistical Monitoring
§ Each brick reports statistics to Pinpoint: NumElements, MemoryUsed, InboxSize, NumDropped, NumReads, NumWrites
[diagram: Bricks 1-5 send statistics to Pinpoint]

SSM Monitoring
§ N replicated bricks handle read/write requests
  § Cannot do structural anomaly detection!
  § Alternative features (performance, memory usage, etc.)
§ Activity statistics: how often did a brick do something?
  § Msgs received/sec, dropped/sec, etc.
  § Same across all peers, assuming balanced workload
  § Use anomalies as likely failures
§ State statistics: current state of the system
  § Memory usage, queue length, etc.
  § Similar pattern across peers, but may not be in phase
  § Look for patterns in time series; differences in patterns indicate failure at a node.

Detecting Anomalous Conditions
§ Metrics compared against those of "peer" bricks
§ Basic idea: changes in workload tend to affect all bricks equally
§ Underlying (weak) assumption: "Most bricks are doing mostly the right thing most of the time"
§ Anomaly in 6 or more (out of 9) metrics => reboot brick
§ Use different techniques for different stats
  § "Activity" - absolute median deviation
  § "State" - Tarzan time-series analysis
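
A rough sketch of the peer-comparison vote: flag a metric when a brick's value is far from the peer median (median absolute deviation), and reboot when enough metrics vote "anomalous". The 6-of-9 threshold comes from the slide; the deviation cutoff k is illustrative:

```python
def median(xs):
    s = sorted(xs)
    n = len(s)
    return (s[n // 2] + s[(n - 1) // 2]) / 2.0

def anomalous_metrics(per_brick_metrics, brick, k=3.0):
    """per_brick_metrics: {metric_name: {brick_id: value}} for one time step.
    A metric votes 'anomalous' for this brick if it lies more than k
    median-absolute-deviations from the peer median."""
    votes = []
    for name, values in per_brick_metrics.items():
        peers = list(values.values())
        med = median(peers)
        mad = median([abs(v - med) for v in peers]) or 1e-9
        if abs(values[brick] - med) / mad > k:
            votes.append(name)
    return votes

def should_reboot(per_brick_metrics, brick, threshold=6):
    # reboot the brick if 6 or more of its 9 metrics look anomalous
    return len(anomalous_metrics(per_brick_metrics, brick)) >= threshold
```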

Network Fault – 70% packet loss in SAN
[timeline figure, annotated: network fault injected → fault detected → brick killed → brick restarts]

J2EE as a platform for µRB-based recovery
§ Java 2 Enterprise Edition, a component framework for Internet request-reply style apps
§ App is a collection of components ("EJBs") created by subclassing a managed container class
  § application server provides component creation, thread management, naming/directory services, abstractions for database and HTTP sessions, etc.
  § Web pages with embedded servlets and Java Server Pages invoke EJB methods
  § potential to improve all apps by modifying the appserver
§ J2EE has a strong following, encourages modular programming, and there are open-source appservers

Separating data recovery from process recovery
§ For HTTP workloads, session state ≈ app checkpoint
  § Store session state in a microrebootable session state subsystem (NSDI '04)
  § Recovery == non-state-preserving process restart; redundancy gives probabilistic durability
    • Response-time cost of externalizing session state: ~25%
    • SSM, an N-way RAM-based state replication [NSDI 04], behind the existing J2EE API
§ Microreboot EJBs:
  § destroy all instances of the EJB and associated threads
  § releases appserver-level resources (DB connections, etc.)
  § discards appserver metadata about EJBs
  § session state preserved across µRB

Fault injection: JBoss + µRBs + SSM
§ Injected faults: null refs, deadlocks/infinite loops, corruption of volatile EJB metadata, resource leaks, Java runtime errors/exceptions
§ RUBiS: online auction app (132K items, 1.5M bids, 100K subscribers)
§ 150 simulated users/node, 35-45 req/sec/node
§ Workload mix based on a commercial auction site
§ Client-based failure detection

µRB vs. full RB - action-weighted goodput
§ Example: corrupt JNDI database entry, RuntimeException, Java error; measure G_aw in 1-second buckets
§ Localization is crude: static analysis to associate a failed URL with a set of EJBs, incrementing an EJB's score whenever it's implicated
§ With µRBs, 89% reduction in failed requests and 9% more successful requests compared to full RB, despite 6 false positives

Performance overhead of JAGR
§ 150 clients/node: latency = 38 msec (3 -> 7 nodes)
§ Human-perceptible delay: 100-200 msec
§ Real auction site: 41 req/sec, 33-300 msec latency

Improving availability from the user's point of view
§ µRB improves user-perceived availability vs. full reboot
§ µRB complements failover
  § (a) Initially, excess load on the 2nd node brought it down immediately after failover
  § (b) µRB results in some failed requests (96% fewer) from temporary overload
  § (c, d) Full reboot vs. µRB without failover
§ For small clusters, should always try µRB first

µRB Tolerates Lax Failure Detection
§ Tolerates lag in detection latency (up to 53 s in our microbenchmark) and high false positive rates
  § Our naive detection algorithm had up to 60% false positive rate in terms of what to µRB
  § we injected 97% false positives before the reduction in overall availability equaled the cost of full RB
§ Always safe to use as "first line of defense", even when failover is possible
  § cost(µRB + other recovery) ≈ cost(other recovery)
  § success rate of µRB on reboot-curable failures is comparable to whole-appserver reboot

Performance penalties
§ Baseline workload mix modeled on a commercial site
  § 150 simulated clients per node, ~40-45 reqs/sec per node
  § system at ~70% utilization
§ Throughput ~1% worse due to instrumentation
§ Worst-case response latency increases from 800 to 1200 ms
  § Average case: 45 ms to 80 ms; compare to 35-300 ms for a commercial service
  § Well within "human tolerance" thresholds
  § Entirely due to factoring out of session state
=> Performance penalty is tolerable & worth it

Recovery and maintenance

Microrecovery for Maintenance Operations
§ Capacity discovery in SSM
  § TCP-inspired flow control keeps the system from falling off a cliff
  § "OK to say no" is essential for this backpressure to work
§ Microrejuvenation in JAGR (proactively microreboot to fix localized memory leaks)
§ Splitting/coalescing in DStore
  § Split = failure + reappearance of failed node
  § The same safe/non-disruptive recovery mechanisms are used to lazily repair inconsistencies after a new node appears
  § Consequently, the performance impact is small enough to do this as an online operation

Using microrecovery for maintenance
§ Capacity discovery in SSM
  § the redundancy mechanism used for recovery ("write many, wait few") is also used to "say no" while gracefully degrading performance

Full rejuvenation vs. microrejuvenation
[figure; annotation: 76%]

Splitting/coalescing in DStore
§ Split = failure + reappearance of failed node
§ The same mechanisms are used to lazily repair inconsistencies

Summary: microrecoverable systems
§ Separation of data from process recovery
  § Special-purpose data stores can be made microrecoverable
§ OK to initiate microrecovery anytime for any reason
  § no loss of correctness, tolerable loss of performance
  § likely (but not guaranteed) to fix an important class of transients
  § won't make things worse; can always try "full" recovery afterward
  § inexpensive enough to tolerate "sloppy" fault detection
=> low-cost first line of defense
§ some "maintenance" ops can be cast as microrecovery
  § due to low cost, "proactive" maintenance can be done online
  § can often convert unplanned long downtime into a planned, shorter performance hit

Anomaly detection as failure detection

Example: Anomaly-Finding Techniques
(2x2 taxonomy: finding/preventing bugs vs. detecting anomalies; before runtime* vs. at runtime**)
§ Finding/preventing bugs, before runtime: model checking, Lint-like tools, static analysis (even gcc -Wall), type-safe languages, human factors/processes (extreme programming, etc.), manual "inspeculation", code inspection/reviews, debugging tools [Eisenstadt 97]
§ Finding/preventing bugs, at runtime: runtime-safe languages, sandboxing/isolation, stack guarding, redundancy (TMR, Byzantine, etc.), self-repairing data structures [Demsky & Rinard]
§ Detecting anomalies, before runtime: bugs as anomalous behavior [Chou & Engler 2002]
§ Detecting anomalies, at runtime: dynamic data analysis, heuristic detection of data races [Eraser, Savage et al.], heuristic detection of possible invariant violations [Hangal & Lam 2002], performance anomalies [Richardson et al.], path-based analysis [Chen, Kiciman et al. 2002], deadlock detection and repair
§ Question: does anomaly == bug?
* Includes design time and build time
** Includes both offline (invasive) and online detection techniques

Examples of Badness Inference
§ Sometimes we can detect badness by looking for inconsistencies in runtime behavior
  § We can observe program-specific properties (though using automated methods) as well as program-generic properties
  § Often, we must be able to first observe the program operating "normally"
§ Eraser: detecting data races [Savage et al. 2000]
  § Observe lock/unlock patterns around shared variables
  § If a variable usually protected by a lock/unlock or mutex is observed to have interleaved reads, report a violation
§ DIDUCE: inferring invariants, then detecting violations [Hangal & Lam 2002]
  § Start with a strict invariant ("x is always = 3")
  § Relax it as other values are seen ("x is in [0, 10]")
  § Increase confidence in the invariant as more observations are seen
  § Report violations of invariants that have threshold confidence
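
A toy sketch of the DIDUCE-style relax-then-report idea from this slide; the interval representation and confidence rule are simplifications for illustration, not the actual DIDUCE algorithm:

```python
class RangeInvariant:
    """Tracks a per-variable invariant as an interval, relaxing it when new
    values fall outside; confidence grows with the number of observations
    since the last relaxation."""

    def __init__(self, first_value):
        self.lo = self.hi = first_value   # strictest hypothesis: x is always this value
        self.confidence = 0

    def observe(self, value, report_threshold=1000):
        if self.lo <= value <= self.hi:
            self.confidence += 1
            return None
        violation = None
        if self.confidence >= report_threshold:
            # a well-established invariant was broken: report an anomaly
            violation = (self.lo, self.hi, value)
        # relax the invariant to admit the new value, reset confidence
        self.lo, self.hi = min(self.lo, value), max(self.hi, value)
        self.confidence = 0
        return violation
```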

Generic runtime monitoring techniques
§ What conditions are we monitoring for?
  § Fail-stop vs. fail-silent vs. fail-stutter
  § Byzantine failures
§ Generic methods
  § Heartbeats (what does loss of heartbeat mean? Who monitors them?)
  § Resource monitoring (what is "abnormal"?)
  § Application-specific monitoring: ask a question you know the answer to
§ Fault model enforcement
  § coerce all observed faults to an "expected faults" subset
  § if necessary, take additional actions to completely "induce" the fault
  § Simplifies recovery since there are fewer distinct cases
  § Avoids potential misdiagnosis of faults that have common symptoms
  § Note: may sometimes appear to make things "worse" (coerce a less-severe fault to a more-severe fault)
  § Doesn't exercise all parts of the system

Internet performance failure detection
§ Various approaches, all of which exploit the law of large numbers and (sort of) the Central Limit Theorem (which is?)
§ Establish a "baseline" of the quantity to be monitored
  • Take observations, factor out data from known failures
  • Normalize to workload?
§ Look for "significant" deviations from the baseline
§ What to measure?
  § Coarse-grain: number of reqs/sec
  § Finer-grain: number of TCP connections in Established, Syn_sent, Syn_rcvd state
  § Even finer: additional internal request "milestones"
    • Hard to do in an application-generic way. . . but frameworks can save us

Example 1: Detection & recovery in SSM
§ 9 "state" statistics collected per second from each replica
§ Tarzan time-series analysis* compares relative frequencies of substrings corresponding to discretized time series
  § "anomalous" => at least 6 stats "anomalous"; works for aperiodic or irregular-period signals
  § robust against workload changes that affect all replicas equally and against highly correlated metrics
* Keogh et al., Finding surprising patterns in a time series database in linear time and space, SIGKDD 2002
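
A rough sketch of the discretize-and-compare-substring-frequencies idea: this simplification counts fixed-length substrings and compares their frequencies against a reference series, rather than implementing the actual Tarzan algorithm (which uses suffix trees):

```python
from collections import Counter

def discretize(series, n_bins=4):
    """Map a numeric time series to a string over a small alphabet."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins or 1.0
    return "".join(chr(ord("a") + min(int((x - lo) / width), n_bins - 1))
                   for x in series)

def substring_counts(s, k=3):
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def surprise_score(observed, reference, k=3):
    """Sum of discrepancies between observed and reference substring
    frequencies; a large score marks the observed window as anomalous."""
    obs, ref = substring_counts(observed, k), substring_counts(reference, k)
    total_obs = sum(obs.values()) or 1
    total_ref = sum(ref.values()) or 1
    keys = set(obs) | set(ref)
    return sum(abs(obs[key] / total_obs - ref[key] / total_ref) for key in keys)
```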

What faults does this handle?
§ Essentially 100% availability vs. injected faults:
  § Node crash/hang/timeout/freeze
  § Fail-stutter: network loss (drop up to 70% of packets randomly)
  § Periodic slowdown (e.g. from garbage collection)
  § Persistent slowdown (one node lags the others)
§ Underlying (weak) assumption: "Most bricks are doing mostly the right thing most of the time"
§ All anomalies can be safely "coerced" to crash faults
  § If a reboot doesn't fix it, it didn't cost you much to try
  § Human notified after a threshold number of restarts; the system has no concept of "recovery"
§ Allows SSM to be managed like a farm of stateless servers

Detecting anomalies in application logic
§ Goal: detect failures whose only obvious symptom is a change in the semantics of the application
  § Example: wrong item data displayed; wouldn't be caught by HTML scraping or HTTP logs
  § Typically, the site responds to HTTP pings, etc. under such failures
  § These commonly result from exceptions of the form we injected into RUBiS
§ Insight: manifestation of bugs is the rare case, so capture "normal" behavior of the system under no fault injection
  § Then detect threshold deviations from this baseline
  § Periodically move the baseline to allow for workload evolution

Patterns: Path shape analysis
[diagram: requests traced through HTTP frontends, application components, and databases via the middleware]
§ Model paths as parse trees in a probabilistic CFG
§ Build the grammar under "believed normal" conditions, then mark very unlikely paths as anomalous
§ after classification, build a decision tree to correlate path features (components touched) with anomalous paths
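
A small sketch of the path-shape idea under stated assumptions: treat a path as a sequence of component-to-component calls, estimate call probabilities from believed-normal traces, and flag paths whose probability is very low. This is a flat Markov-chain simplification of the probabilistic-CFG model on the slide:

```python
from collections import Counter, defaultdict

def learn_call_model(normal_paths):
    """normal_paths: list of component-name sequences observed when the
    system is believed healthy. Returns P(next component | current)."""
    counts = defaultdict(Counter)
    for path in normal_paths:
        for parent, child in zip(path, path[1:]):
            counts[parent][child] += 1
    return {p: {c: n / sum(cs.values()) for c, n in cs.items()}
            for p, cs in counts.items()}

def path_probability(model, path, unseen=1e-6):
    prob = 1.0
    for parent, child in zip(path, path[1:]):
        prob *= model.get(parent, {}).get(child, unseen)
    return prob

def is_anomalous_path(model, path, threshold=1e-4):
    # very unlikely paths under the "normal" model are flagged as anomalies
    return path_probability(model, path) < threshold
```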

Patterns: Component Interaction Analysis
[diagram: requests traced through HTTP frontends, application components, and databases via the middleware]
§ Model interactions between a component and its n neighbors in the dynamic call graph as a weighted DAG
§ compare to the observed call graph using a chi-squared goodness-of-fit test
§ can compare either across peers or against historical data
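
A minimal sketch of that comparison, assuming we compare the distribution of one component's calls to its neighbors against a historical baseline; the statistic is a standard chi-squared goodness-of-fit, and the threshold is illustrative:

```python
def chi_squared_interaction_score(observed_calls, baseline_calls):
    """observed_calls / baseline_calls: {neighbor: call count} for one
    component. Returns the chi-squared statistic comparing the observed
    call distribution to what the baseline proportions predict."""
    total_obs = sum(observed_calls.values())
    total_base = sum(baseline_calls.values()) or 1
    score = 0.0
    for neighbor, base_count in baseline_calls.items():
        expected = total_obs * base_count / total_base
        if expected > 0:
            diff = observed_calls.get(neighbor, 0) - expected
            score += diff * diff / expected
    return score

def interaction_anomalous(observed_calls, baseline_calls, threshold=20.0):
    # threshold is illustrative; in practice derive it from the chi-squared
    # distribution for the appropriate degrees of freedom
    return chi_squared_interaction_score(observed_calls, baseline_calls) > threshold
```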

Precision and recall (example)
§ Detection: recall = % of failures actually detected as anomalies
  § Strictly better than HTTP/HTML monitoring (detection recall measured on faults affecting >1% of workload)
§ Localization:
  § recall = % of actually-faulty requests returned
  § precision = % of returned requests that are faulty = 1 - (FP rate)
§ Tradeoff between recall and precision (false positive rate)
  § example localization operating points: [R]=.68, [P]=.14 vs. [R]=.34, [P]=.93
  § Even the low-recall case corresponds to high detection recall (.83)
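
For concreteness, a tiny helper computing the two localization metrics defined above from labeled request sets; the argument names are illustrative:

```python
def precision_recall(returned_requests, truly_faulty_requests):
    """returned_requests: requests the localizer flagged as faulty.
    truly_faulty_requests: requests actually affected by the injected fault."""
    returned, faulty = set(returned_requests), set(truly_faulty_requests)
    true_positives = len(returned & faulty)
    precision = true_positives / len(returned) if returned else 1.0
    recall = true_positives / len(faulty) if faulty else 1.0
    return precision, recall
```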

Pinpoint key results
§ Detect 89-96% of injected failures, compared to 20-79% for HTML scraping and HTTP log monitoring
§ Limited success in detecting injected source bugs
  § Example success: caught a bug that prevented the shopping cart from iterating over its contents to display them, and correctly identified the at-fault component (where the bug was injected)
§ Resilient to "normal" workload changes
  § Because we bin the analysis by request category
§ Resilient to "bug fix release" code changes
§ Currently slow; analysis lags ~20 s behind the application

Combining µRBs and Pinpoint
§ Simple recovery policy:
  § µRB all components whose normalized anomaly score is >1.0
  § if we've already done that, reboot the whole application
§ More sophisticated policies are certainly possible
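
The two-step policy above, written out as a sketch; the component-scoring input and reboot hooks are assumptions standing in for Pinpoint and JAGR:

```python
def recover(anomaly_scores, microreboot, reboot_whole_app, already_tried):
    """anomaly_scores: {component: normalized anomaly score} from the detector.
    already_tried: set of components microrebooted during this incident."""
    suspects = {c for c, score in anomaly_scores.items() if score > 1.0}
    fresh = suspects - already_tried
    if fresh:
        for component in fresh:        # first line of defense: cheap microreboots
            microreboot(component)
        already_tried |= fresh
    elif suspects:
        reboot_whole_app()             # escalate: microreboot didn't clear the anomaly
        already_tried.clear()
```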

Combining µRBs and Pinpoint
§ Example: data structure corruption in the SB_viewItem EJB
  § 350 simulated clients
  § 18.5 s to detect/localize
  § <1 s to repair
  § Note: the returned Web page would be valid but incorrect
§ Robust to typical workload changes & bug patches
§ More comprehensive deployment in progress

Faulty Request Identification
[scatter plot of faulty-request identification; quadrants: failures detected but with a high rate of mis-identified faulty requests (false positives); failures injected but not detected; failures detected with faulty requests identified as such; failures not detected, but low false positives]
§ HTTP monitoring has perfect precision since it's a "ground truth indicator" of a server fault
§ Path-shape analysis pulls more points out of the bottom-left corner

Tolerating false positives in DStore
§ Metrics and algorithm comparable to those used in SSM
§ We inject "fail-stutter" behavior by increasing request latency
§ Bottom case: more aggressive detection also results in 2 "unnecessary" reboots
  § But they don't matter much if there is modest replication
§ Currently some voodoo constants for thresholds in both SSM and DStore
  § Recall that these are "off-the-shelf" algorithms; should be able to do better
Trade-off: earlier detection vs. false positives

Summary of case studies
§ SSM (diskless session state store) [NSDI 04]
  § Instrumentation: state and activity metric "sensors" built into the app
  § Microrecovery: whole-node fast reboot (doesn't preserve state)
  § Statistical monitoring: Tarzan time-series analysis
  § Performance cost: 20-50% request latency; still competitive with a commercial service
§ DStore (persistent hashtable) [ACM Trans. on Storage]
  § Instrumentation: state and activity metric "sensors" built into the app
  § Microrecovery: whole-node reboot (preserves state)
  § Statistical monitoring: median absolute deviation
  § Performance cost: <1% throughput reduction
§ JAGR (J2EE application server) [OSDI 04]
  § Instrumentation: inter-EJB call info monitored by modifying the container (could also use aspects)
  § Microrecovery: microreboots of EJBs
  § Statistical monitoring: anomalous code paths modeled using a PCFG; component interactions modeled by comparing dynamic call graphs
  § Performance cost: ~1% on request latency and throughput
§ Detection and localization good even with "simple" algorithms; fits well with localized recovery
§ Performance penalty is tolerable & worth it
§ Note: microrecovery can also be used for microrejuvenation

Discussion

Discussion: What makes this work?
§ What made it work in our examples specifically?
  § Recovery speed: weaker consistency in SSM and DStore in exchange for fast recovery and predictable work done per request
  § Recovery correctness: J2EE apps are constrained to "checkpoint" by manipulating session state, and this is brought out in the app-writer-visible APIs; good isolation between components and relative lack of shared state
  § Anomaly detection: app behavior alternates short sequences of EJB calls with updates to persistent state, so it can be characterized in terms of those calls
§ Observations
  § Neither diagnosis => recovery nor recovery => diagnosis
  § Localization != diagnosis, but it's an important optimization

Why are statistical methods appealing?
§ Large complex systems tend to exercise a lot of their functionality in a fairly short amount of time
  § Especially Internet services, with high-volume workloads of largely independent requests
§ Even if we don't know what to measure, statistical and data mining techniques can help figure it out
§ Performance problems are often linked with dependability problems (fail-stutter behavior), for either HW or SW reasons
§ Most systems work well most of the time
  § Corollary: in a replica system, replicas should behave "the same" most of the time

When does it not work?
§ When SLT-based monitoring does not apply
  § Base-rate fallacy: monitoring events so rare that the FP rate dominates
  § Gaming the system (deliberately or inadvertently)
§ When failures can't be cured by any kind of micro-recovery
  § Persistent-state corruption (or hardware failure)
  § Corrupted configuration data
  § "a spectrum of undo"
§ When you can't say no
  § Backpressure and the possibility of caller retry are used to improve predictability
  § Promising you will say "yes" may be difficult. . . the question may be whether end-to-end guarantees are needed at lower layers

SSM/DStore as "extreme" design points
§ Goal was to investigate extremes of "no special recovery"
  § Could explore erasure coding (RepStore does this dynamically)
  § Weakened consistency model of DStore vs. 2PC
  § Spread the cost of repair lazily across many operations (rather than bulk recovery)
  § Spread some 2PC state maintenance to the client in the form of a "write in progress" cookie
§ May be that 2PC would be affordable, but we were interested in the extreme design point of "no special restart code"

Role of 3-tier architecture
§ Separation of concerns: really, separation of process recovery (control flow) from data recovery
  § µRB and reboots recover processes; SSM, DStore, and traditional relational databases recover data
§ Not addressed is repair of data

Shouldn't we just make software better?
§ Yes we should (and many people are), but. . .
§ We use commodity HW & SW, despite the fact that they are imperfect, less reliable than "hardened" or purpose-built components, etc. Why?
  § Price/performance follows volume
  § Allows specialization of efforts and composition of reusable building blocks (vs. building a stovepipe system)
  § In short, it allows a much faster overall pace of innovation and deployment, for both technical and economic reasons, even though the components themselves are imperfect
§ We should assume "commodity programmers" too (observation from Brewster Kahle)
  § Give as much generic support to the application as we can

Challenges & open issues
§ Algorithm issues that impinge on systems work
  § Hand-tuned constants/thresholds in algorithms; seems to be an issue in other applications of SLT as well
  § Online vs. offline algorithms
  § Stability of the closed loop
§ Systems issues
  § How do you "know" you've checkpointed all important state, or that something is safe to retry?
  § How do you debug a "moving target"? Traditional methods/tools are confounded by code obfuscation, sudden loss of transient program state (stack & heap), etc. (a great Ph.D. thesis. . .)
  § debugging today's real systems is already hard for these reasons
  § Real apps, faultloads, best practices, etc. are hard to get!

RADS message in a nutshell
§ Statistical techniques can identify "interesting" features and relationships in large datasets, but there is a frequent tradeoff between detection rate (or detection time) and false positives
§ Make "micro-recovery" so inexpensive that occasional false positives don't matter
§ Achievable now on realistic applications & workloads
§ Synergistic with componentized apps & frameworks
§ Specific point of leverage for collaboration with machine learning research; lots of headroom for improvement
  § Even "simple" algorithms show encouraging initial results

Project possibilities

BACKUP SLIDES