4531444eca49f59af2fa9854a52e1af0.ppt
- Number of slides: 44
A Recovery-Friendly, Self-Managing Session State Store
Benjamin Ling, Emre Kiciman, Armando Fox
{bling, emrek, fox}@cs.stanford.edu
Outline
- Motivation: What is Session State?
- SSM:
  - Architecture
  - Algorithm
  - Backpressure and Admission Control
- SSM + Pinpoint
  - Self-recovering, self-monitoring
- Benchmarks
- Next steps: Sun Reference App Server integration
- Conclusion
3/15/2018 © 2004 Benjamin Ling
Proliferation of J2EE and Web Services
- J2EE embraced as industry standard
- Framework
  - Simplifies development
  - Allows for portability of services
  - Standardized interfaces
- However, difficulties remain…
The Pain – Administration and Maintenance
- Administration is difficult and costly
  - $$ -- Database admins cost ~$200K/yr a head
  - Development efficiency negatively impacted
- Failure/Recovery is costly
  - Recovery slow, especially site outages
  - Data loss on crashes
  - Users adversely affected
Not All State is Created Equal
- Various types of state in J2EE…
  - User profile state
  - Persistent shared state
  - Transaction history state
- But usually stored in the same place
  - Stored in DB or FS
- Focus on particular class
- Exploit its properties
- Simplify Administration and Maintenance
Example of Session State
Properties of Session State
- Subcategory of session state
  - Single-user, serial access, semi-persistent data
  - Examples: temporary application data, application workflow
  - Example of usage (e.g. J2EE): [Diagram: Browser and App Server, steps 1-6]
Goal
- Build a session state store that is:
  - Failure-friendly
    - Does not lose data on crash
    - Degrades gracefully
  - Recovery-friendly
    - Recovers fast
  - Self-Managing
Session State Manager (SSM)
- Redundant, in-memory hash table distributed across nodes
- [Diagram: two App Servers, each with a STUB; Bricks 1-5; label: RAM, Network Interface]
- Algorithm: redundancy similar to quorums
  - Write to many random nodes, wait for few (avoid performance coupling)
  - Read one
Write example: “Write to Many, Wait for Few”
- Try to write to W random bricks, W = 4
- Must wait for WQ bricks to reply, WQ = 2
- [Diagram: Browser, App Server (STUB), Bricks 1-5]
Write example: “Write to Many, Wait for Few”
- Try to write to W random bricks, W = 4
- Must wait for WQ bricks to reply, WQ = 2
- Cookie holds metadata
- [Diagram: Browser, App Server (STUB), Bricks 1-5; labels: "1 4", "Crashed? Slow?"]
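The write path on this slide can be sketched roughly as follows. This is a minimal illustration, not the SSM implementation: the `Brick` class, the `write_many_wait_few` function, the thread pool, and the one-second timeout are all assumptions made for the example. Only W = 4 and WQ = 2 come from the slide.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout


class Brick:
    """Hypothetical in-memory brick: a dict plus an 'alive' flag."""

    def __init__(self, brick_id):
        self.brick_id = brick_id
        self.alive = True
        self.table = {}

    def write(self, key, value):
        if not self.alive:
            raise ConnectionError(f"brick {self.brick_id} is down")
        self.table[key] = value
        return self.brick_id


def write_many_wait_few(bricks, key, value, w=4, wq=2, timeout=1.0, rng=random):
    """Send the write to w random bricks and return once wq have replied.

    Returns the ids of the first wq bricks that acknowledged. The caller
    would keep these ids as metadata (for example in the user's cookie).
    """
    targets = rng.sample(bricks, w)
    pool = ThreadPoolExecutor(max_workers=w)
    futures = [pool.submit(b.write, key, value) for b in targets]
    acked = []
    try:
        for fut in as_completed(futures, timeout=timeout):
            if fut.exception() is None:
                acked.append(fut.result())
            if len(acked) >= wq:
                return acked  # do not wait for slow or crashed bricks
    except FuturesTimeout:
        pass
    finally:
        pool.shutdown(wait=False)  # stragglers finish in the background
    raise RuntimeError(f"only {len(acked)} of {wq} required bricks replied")
```

With five bricks and one of them down, a call such as `write_many_wait_few(bricks, "sess-1", {"cart": 3})` still returns two brick ids, because at least three of any four sampled bricks are alive.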
Read example:
- Try to read from Bricks 1, 4
- [Diagram: Browser, App Server (STUB), Bricks 1-5; label: "1 4"]
Read example: [Diagram: Browser, App Server (STUB), Bricks 1-5; label: "1 4"]
Read example:
- Brick 1 crashes
- [Diagram: Browser, App Server (STUB), Bricks 1-5]
Read example: [Diagram: Browser, App Server (STUB), Bricks 2-5]
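The read sequence in these slides (try the bricks named in the cookie, and fall through when one has crashed) might look like the sketch below. The `read_one` function and the dict-based model of a brick are hypothetical stand-ins for illustration, with `None` marking a crashed brick.

```python
def read_one(bricks, cookie_brick_ids, key):
    """Return the value from the first brick in the cookie that still has it.

    bricks maps brick id -> dict of stored state, or None if the brick
    has crashed (an assumption of this sketch, not an SSM data structure).
    """
    for brick_id in cookie_brick_ids:
        table = bricks.get(brick_id)
        if table is None:
            continue  # crashed or unreachable: try the next replica
        if key in table:
            return table[key]
        # A restarted brick comes back empty, so a miss is not fatal.
    raise LookupError(f"no brick in {cookie_brick_ids} holds {key!r}")


# Brick 1 has crashed; Brick 4 still holds the session.
bricks = {1: None, 2: {}, 3: {}, 4: {"sess-1": "cart=3"}, 5: {}}
value = read_one(bricks, [1, 4], "sess-1")
```

Here `value` is `"cart=3"`: the read skips Brick 1 and is served by Brick 4.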
SSM: Failure and Recovery
- Failure of single node
  - No data loss, WQ-1 remain
  - State is available for R/W during failure
- Recovery
  - Restart – No recovery
  - No special case recovery code
  - State is available for R/W during brick restart
  - Session state is self-recovering
    - User’s access pattern causes data to be rewritten
Backpressure and Admission Control
- [Diagram: two App Servers, each with a STUB; Bricks 1-5; labels: "Heavy flow to Brick 3", "Drop Requests"]
Backpressure and Admission Control
- [Diagram: two App Servers, each with a STUB; Bricks 1-5; labels: "Drop Requests", "Reduce Sending", "Reject requests"]
Recovery Philosophy
- [Figure: recovery cost (Cheap, Expensive) vs. detection accuracy (Lax, Accurate, Aggressive); region labels: Undetected Errors, Ideal, Hard, Downtime]
Failure Detection and Recovery
- [Timeline labels: Failure, Detection, Recovery, Recovered]
- SSM: failure masked, instant recovery
False Positives
- [Timeline labels: Normal Operation, False positive triggered, Instant recovery]
Statistical Monitoring
- [Diagram: Bricks 1-5, Statistics, Pinpoint]
- Statistics: NumElements, MemoryUsed, InboxSize, NumDropped, NumReads, NumWrites
Statistical Monitoring
- [Diagram: Bricks 1-5, Statistics, Pinpoint; label: REBOOT]
- Statistics: NumElements, MemoryUsed, InboxSize, NumDropped, NumReads, NumWrites
SSM Monitoring
- N replicated bricks handle read/write requests
  - Cannot do structural anomaly detection!
  - Alternative features (performance, mem usage, etc.)
- Activity statistics: How often did a brick do something?
  - Msgs received/sec, dropped/sec, etc.
  - Same across all peers, assuming balanced workload
  - Use anomalies as likely failures
- State statistics: Current state of system
  - Memory usage, queue length, etc.
  - Similar pattern across peers, but may not be in phase
  - Look for patterns in time-series; differences in patterns indicate failure at a node.
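One way to treat an activity statistic that should be "the same across all peers" is to flag the brick that sits far from the peer median. The slide does not name a specific test, so the sketch below swaps in a median-absolute-deviation check as an assumption. The function name `anomalous_bricks` and the threshold of 3.0 are hypothetical.

```python
from statistics import median


def anomalous_bricks(stats, threshold=3.0):
    """Flag bricks whose activity statistic is far from the peer median.

    stats maps brick id -> one statistic (e.g. messages dropped per second).
    The score is |value - median| / MAD; both the score and the threshold
    are assumptions of this sketch, not values from the slides.
    """
    values = list(stats.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All peers agree exactly: anything different is an outlier.
        return sorted(b for b, v in stats.items() if v != med)
    return sorted(b for b, v in stats.items() if abs(v - med) / mad > threshold)


# Bricks 1-4 handle about 100 msgs/sec; Brick 5 has nearly stopped.
flagged = anomalous_bricks({1: 100, 2: 104, 3: 98, 4: 101, 5: 12})
```

In this example only Brick 5 is flagged, which would make it the candidate for a restart.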
Surprising Patterns in Time-Series
1. Discretize time-series into string. [Keogh]
   [0.2, 0.3, 0.4, 0.6, 0.8, 0.2] -> “aaabba”
2. Calculate the frequencies of short substrings in the string.
   “aa” occurs twice; “ab”, “bb”, and “ba” each occur once.
3. Compare frequencies to normal, look for substrings that occur much less or much more than normal.
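The three steps can be tried out with a small sketch. The two-letter alphabet with a single cut at 0.5 is an assumption chosen only because it reproduces the slide's example string. The function names and the `ratio` and `floor` parameters are likewise hypothetical.

```python
from collections import Counter


def discretize(series, cut=0.5):
    """Step 1: map each sample to 'a' (below cut) or 'b' (at or above cut)."""
    return "".join("a" if x < cut else "b" for x in series)


def substring_counts(s, n=2):
    """Step 2: count every substring of length n."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def relative(counts):
    total = sum(counts.values()) or 1
    return {sub: c / total for sub, c in counts.items()}


def surprising(observed, normal, ratio=3.0, floor=0.01):
    """Step 3: substrings whose relative frequency is off by more than ratio."""
    obs, ref = relative(observed), relative(normal)
    flagged = []
    for sub in sorted(set(obs) | set(ref)):
        o = max(obs.get(sub, 0.0), floor)  # floor avoids division by zero
        r = max(ref.get(sub, 0.0), floor)
        if o / r > ratio or r / o > ratio:
            flagged.append(sub)
    return flagged


text = discretize([0.2, 0.3, 0.4, 0.6, 0.8, 0.2])
counts = substring_counts(text)
```

Here `text` is `"aaabba"` and `counts` holds `aa: 2` with `ab`, `bb`, and `ba` at 1 each. Comparing a brick's counts against those of its peers with `surprising` then lists the substrings that are unusually rare or frequent.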
Microbenchmarks
- UC Berkeley Millennium Cluster
  - Six bricks running
- Candidate Write Set = 3, Write quota = 2
- Candidate Read Set = 2
- State Size = 8K
Induced Fault
- [Graph annotations: one brick killed; SSM unaffected; brick restarted by Pinpoint (PP)]
Memory Fault
- [Graph annotations: memory fault detected in hash; PP restarts brick; SSM unaffected]
Network Fault – 70% packet loss
- [Graph annotations: network fault injected; fault detected; brick killed; PP restarts brick]
Performance Fault
- [Graph annotation: performance fault injected]
Macrobenchmark
- TellMe’s Email-By-Phone Application
- Session state stored in memory
  - Email header information
  - Index information
- Alter application to store session state using
  - Disk
  - SSM
Macrobenchmark
- [Graph annotations: throughput preserved compared to disk; 25% throughput degradation compared to in-memory]
Future Work
- Integrate with Sun’s reference Application Server
  - Enterprise benchmarks
- Statistical Anomaly Detection
  - Too many magic numbers
- Integrated ROC-J2EE application server
Conclusion
SSM: A Recovery-Friendly, Self-Managing Session State Store
Benjamin Ling
bling@cs.stanford.edu
http://swig.stanford.edu/
Existing solutions:
- File System and Databases
  - Poor failure behavior
    - Lose data (FS)
  - Slow recovery (both)
  - Difficult to administer (DB)
  - Difficult to tune (both)
- In-memory replication using primary/secondary:
  - Performance coupling
  - Poor failover (uneven load balancing)
Other implementation details
- Garbage collection
  - Generational hash table
    - Hash table of hash tables
    - Each hash table has an associated time range
    - When time has passed, GC that table
  - No reference counting, scanning, etc.
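A generational hash table of this kind might be sketched as below. The class name, the 60-second generation length, and the idea that the caller supplies each entry's expiry time are assumptions for illustration. Time is passed in explicitly so the behaviour is easy to follow.

```python
class GenerationalTable:
    """Hash table of hash tables, one inner table per expiry time range."""

    def __init__(self, generation_seconds=60.0):
        self.generation_seconds = generation_seconds
        self.generations = {}  # generation index -> {key: value}

    def _index(self, t):
        # Generation g covers [g * T, (g + 1) * T) seconds.
        return int(t // self.generation_seconds)

    def put(self, key, value, expires_at):
        self.generations.setdefault(self._index(expires_at), {})[key] = value

    def get(self, key, expires_at):
        # Assumes the caller knows the expiry (e.g. from cookie metadata).
        return self.generations.get(self._index(expires_at), {}).get(key)

    def collect(self, now):
        """Drop every inner table whose whole time range has passed."""
        current = self._index(now)
        expired = [g for g in self.generations if g < current]
        for g in expired:
            del self.generations[g]  # no per-entry scan or refcount
        return len(expired)


table = GenerationalTable(generation_seconds=60.0)
table.put("sess-a", "x", expires_at=30.0)
table.put("sess-b", "y", expires_at=90.0)
dropped = table.collect(now=61.0)
```

After the call, `dropped` is 1: the table covering 0 to 60 seconds is discarded in one step, while `sess-b` in the next generation is untouched.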
SSM: Self-Managing
- Adaptive: (Self-Tuning)
  - Stub maintains count of maximum allowable in-flight requests to each brick
    - Additive increase on successful request
    - Multiplicative decrease on timeout
  - Stubs discover capacity of each brick
- Admission control (Self-Protecting)
  - Stubs say “no” if insufficient bricks
  - Propagate backpressure from bricks to clients
    - Turn users away under overload
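The stub-side behaviour listed here can be sketched as an additive-increase, multiplicative-decrease window per brick plus a simple admission check. The class and function names, the initial limit, and the step sizes (+1 on success, halving on timeout) are assumptions. The slides give only the general scheme.

```python
class BrickWindow:
    """Per-brick cap on in-flight requests, adjusted AIMD-style."""

    def __init__(self, initial=4.0, increase=1.0, decrease=0.5, minimum=1.0):
        self.limit = initial
        self.increase = increase
        self.decrease = decrease
        self.minimum = minimum
        self.in_flight = 0

    def try_send(self):
        """Reserve a slot if the brick is below its current limit."""
        if self.in_flight >= int(self.limit):
            return False
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight -= 1

    def on_success(self):
        self.release()
        self.limit += self.increase  # additive increase

    def on_timeout(self):
        self.release()
        self.limit = max(self.minimum, self.limit * self.decrease)


def admit_write(windows, w=4, wq=2):
    """Reserve slots on up to w bricks; say "no" if fewer than wq accept."""
    reserved = []
    for brick_id, window in windows.items():
        if len(reserved) == w:
            break
        if window.try_send():
            reserved.append(brick_id)
    if len(reserved) < wq:
        for brick_id in reserved:
            windows[brick_id].release()  # give the slots back
        return None  # backpressure: the caller turns the user away
    return reserved
```

A stub would call `on_success` or `on_timeout` as replies arrive, so each brick's limit drifts toward the load that brick can sustain. A `None` result from `admit_write` is the signal to reject the request.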