09db934b83b23afd1c4077971ace7154.ppt
- Number of slides: 57
Capacity Management for Web Operations John Allspaw Operations Engineering
the book I’m writing
? ? ?
Rules of Thumb Planning/Forecasting Stupid Capacity Tricks (with some Flickr statistics sprinkled in)
Things that can cause downtime bugs (disguised as capacity problems) edge cases (disguised as capacity problems) security incidents real capacity problems* * (should be the last thing you need to worry about)
Capacity != Performance Forget about performance for right now Measure what you have right NOW Don’t count on it getting any better
Thank You HPC Industry! Automated Stuff Scalable Metric Collection/Display a lot of great deployment and management tricks come from them, adopted by web ops
Good Measurement Tools record and store metrics in/out custom metrics easily compare lightweight-ish
Clouds need planning too Makes deployment and procurement easy and quick But clouds are still resources with costs and limits, just like your own stuff Black-boxes: you may need to pay even more attention than before
Metrics System Statistics
Metrics “Application” Level (photos processed per minute) (average processing time per photo) (apache requests) (concurrent busy apache procs)
Metrics App-level meets system-level: here, total CPU = ~1.12 * # busy apache procs
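One way to arrive at a coefficient like the ~1.12 above is a least-squares fit through the origin over paired samples of busy apache procs and total CPU. The sketch below assumes that approach; the sample values are made up for illustration and are not Flickr data.

```python
# Hypothetical paired samples: (busy apache procs, total CPU %).
# These numbers are invented for illustration; substitute your own metrics.
samples = [(20, 22.4), (40, 44.8), (60, 67.2)]

def cpu_per_proc(samples):
    """Least-squares slope through the origin: CPU% ~= k * busy_procs."""
    num = sum(procs * cpu for procs, cpu in samples)
    den = sum(procs * procs for procs, _ in samples)
    return num / den

k = cpu_per_proc(samples)
print(f"total CPU ~= {k:.2f} * busy apache procs")
```

Once the app-level metric is tied to a system-level one this way, a CPU ceiling can be expressed in units the application understands.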
2400 photos per minute being uploaded right NOW (Tuesday
Ceilings: the maximum amount of “work” your resources will allow before degradation or failure
Forget Benchmarking
Find your ceilings what you have left The End
Use real live production data to find ceilings Production: “it’s like a lab, but bigger!”
Like: database ceilings replication lag: bad!
Ceilings waiting on disk: sustained disk I/O wait >40% creates too much slave lag* *for us, YMMV
35,000 photo requests per second on a Tuesday peak
Safety Factors
Safety Factors Ceiling * Factor of Safety = UR LIMITZ
Safety Factors webserver!
Safety Factors what you have left “safe” ceiling @85% CPU 85% total CPU = ~76 busy apache procs
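The arithmetic behind "85% total CPU = ~76 busy apache procs" can be sketched as below, using the 1.12 CPU-per-proc relationship from the metrics slide. The function name is hypothetical, and rounding to the nearest whole proc is an assumption.

```python
CPU_PER_PROC = 1.12   # total CPU % per busy apache proc (from the metrics slide)
SAFE_CPU_PCT = 85.0   # "safe" ceiling chosen on this slide

def safe_proc_limit(safe_cpu_pct, cpu_per_proc):
    """Convert a CPU ceiling into a busy-proc ceiling (rounded to a whole proc)."""
    return round(safe_cpu_pct / cpu_per_proc)

limit = safe_proc_limit(SAFE_CPU_PCT, CPU_PER_PROC)
print(f"safe ceiling: ~{limit} busy apache procs per webserver")
```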
Safety Factors Yahoo Front Page link to Chinese New Year Photos (8% spike) (photo requests/second)
Forecasting
Forecasting Fictional Example: webservers
Forecasting peak of the week Fictional example: 15 webservers. 1 week.
Forecasting . . . bigger sample, 6 weeks. . isolate the peaks. . .
Forecasting not too shabby now . . . ”Add a Trendline” with some decent correlation. . .
Forecasting: where the trendline meets the ceiling tells you when you run out of what you have left. 15 servers @ 76 busy apache proc limit = 1140 total procs
Forecasting (1140 - 726) / 42.751 = 9.68 (week #10, duh)
Forecasting Automation Writing Excel macros is boring All we want is “days remaining”, so all we need is the curve-fit Use http://fityk.sf.net to automate the curve-fit
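The week-#10 answer comes from solving a linear trendline for the ceiling. A minimal sketch follows, assuming 726 is the trendline's intercept and 42.751 its slope in busy procs per week, as the arithmetic on the slide implies. `linear_fit` is a plain least-squares helper included for completeness; it is not what Excel runs internally.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def weeks_until(ceiling, slope, intercept):
    """Week index at which the trendline reaches the ceiling (None if not growing)."""
    if slope <= 0:
        return None
    return (ceiling - intercept) / slope

ceiling = 15 * 76                           # 15 servers @ 76 busy procs each
weeks = weeks_until(ceiling, 42.751, 726)   # slope and intercept from the slide
print(f"ceiling of {ceiling} procs reached at week {weeks:.2f}")
```

Feeding `linear_fit` the isolated weekly peaks gives the slope and intercept that `weeks_until` needs.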
Forecasting Fictional Example: storage consumption
Forecasting Automation this will tell you when this is actual flickr storage consumption from early 2005, in GB (ceiling is fictional)
Forecasting Automation: cmd line script output
[jallspaw:~]$ cfityk ./fit-storage.fit
1> # Fityk script. Fityk version: 0.8.2
2> @0 < '/home/jallspaw/storage-consumption.xy'
15 points. No explicit std. dev. Set as sqrt(y)
3> guess Quadratic
New function %_1 was created.
4> fit
Initial values: lambda=0.001 WSSR=464.564
#1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
#2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
#3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
#4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
Fit converged.
Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
5> info formula in @0
# storage-consumption
14147.4+146.657*x+0.786854*x^2
6> quit
bye...
Forecasting Automation fityk gave: y = 0.786854x^2 + 146.657x + 14147.4 (R^2 = 99.84) Excel gave: y = 0.7675x^2 + 146.96x + 14147.3 (R^2 = 99.84) (SAME)
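With the quadratic in hand, "when do we hit the ceiling" is just the quadratic formula. The sketch below uses the fityk coefficients from the slide; the ceiling value is hypothetical (the deck only says its ceiling is fictional), and x is in whatever unit the data file's x column uses.

```python
import math

# Coefficients from the fityk output: y = A*x^2 + B*x + C (y in GB).
A, B, C = 0.786854, 146.657, 14147.4

def storage_at(x):
    """Fitted storage consumption in GB at position x."""
    return A * x * x + B * x + C

def x_at_ceiling(ceiling_gb):
    """Larger root of A*x^2 + B*x + (C - ceiling) = 0, or None if never reached."""
    disc = B * B - 4 * A * (C - ceiling_gb)
    if disc < 0:
        return None
    return (-B + math.sqrt(disc)) / (2 * A)

HYPOTHETICAL_CEILING_GB = 20000  # assumption for illustration only
x_hit = x_at_ceiling(HYPOTHETICAL_CEILING_GB)
print(f"curve reaches {HYPOTHETICAL_CEILING_GB} GB at x = {x_hit:.1f}")
```

Subtracting the current x from `x_hit` gives the "days remaining" number, provided x is measured in days.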
Capacity Health 12,629 nagios checks 1314 hosts 6 datacenters 4 photo “farms” farm = 2 DCs (east/west)
High and Low Water Marks alert if higher alert if lower Per server, squid requests per second
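A high/low water mark check is only a pair of comparisons. Here is a minimal sketch; the threshold values are hypothetical, and this is not a nagios plugin.

```python
def check_water_marks(value, low, high):
    """Classify a metric sample against low and high water marks."""
    if value > high:
        return "HIGH"   # running hot: capacity problem or traffic spike
    if value < low:
        return "LOW"    # suspiciously idle: server may have dropped out of rotation
    return "OK"

# Hypothetical per-server squid thresholds, in requests per second.
LOW_MARK, HIGH_MARK = 100, 950

for sample in (40, 600, 1200):
    print(sample, check_water_marks(sample, LOW_MARK, HIGH_MARK))
```

The low mark matters as much as the high one: a server doing far less work than its peers is often broken rather than healthy.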
A good dashboard looks something like... (yes, fictional numbers)

type      #   limit/box  ceiling units  limit (total)  current (peak)  % peak   est. days left
www       20  80         busy procs     1600           1000            62.50%   36
shard db  20  40         I/O wait       800            220             27.50%   120
squid     18  950        req/sec        17,100         11,400          66.67%   48
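The derived dashboard columns follow from three inputs per cluster: box count, per-box limit, and current peak. A small sketch using the fictional numbers from the slide (the function and field names are hypothetical):

```python
def dashboard_row(count, limit_per_box, current_peak):
    """Return (total limit, percent of limit in use at peak) for one cluster."""
    total = count * limit_per_box
    pct_peak = 100.0 * current_peak / total
    return total, pct_peak

# (type, number of boxes, limit per box, current peak) from the slide.
clusters = [
    ("www", 20, 80, 1000),
    ("shard db", 20, 40, 220),
    ("squid", 18, 950, 11400),
]

for name, count, limit, peak in clusters:
    total, pct = dashboard_row(count, limit, peak)
    print(f"{name:<9} limit={total:>6} peak={peak:>6} {pct:6.2f}%")
```

The "est. days left" column is not derivable from these inputs alone; it would come from a curve-fit over historical peaks.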
Diagonal Scaling vertically scaling your already horizontal nodes Image processing machines Replace Dell PE860s with HP DL140 G3s
Diagonal Scaling example: image processing 4 cores 8 cores (about the same CPU “usage” per box)
Diagonal Scaling example: image processing throughput ~45 images/min @ peak ~140 images/min @ peak (same CPU usage, but ~3x more work) “processing” means making 4 sizes from originals
Diagonal Scaling example: image processing

went from: 23 Dell PE860s     3008.4 Watts   1035 photos/min   23U rack
to:         8 HP DL140 G3s    1036.8 Watts   1120 photos/min    8U rack
!!! (75% faster, even)
3.52 terabytes will be consumed today (on a
2nd Order Effects (beware the wandering bottleneck) running hot, so add more
2nd Order Effects (beware the wandering bottleneck) now these run hot running great now, so more traffic!
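The before/after figures reduce to two efficiency ratios: photos per minute per watt and per rack unit. A quick sketch with the slide's numbers (the dictionary layout is just for illustration):

```python
# Fleet totals from the slide: box count, watts, photos/min, rack units.
before = {"boxes": 23, "watts": 3008.4, "photos_per_min": 1035, "rack_u": 23}
after = {"boxes": 8, "watts": 1036.8, "photos_per_min": 1120, "rack_u": 8}

def efficiency(fleet):
    """Return (photos/min per watt, photos/min per rack unit)."""
    per_watt = fleet["photos_per_min"] / fleet["watts"]
    per_u = fleet["photos_per_min"] / fleet["rack_u"]
    return per_watt, per_u

before_watt, before_u = efficiency(before)
after_watt, after_u = efficiency(after)
print(f"photos/min per watt: {before_watt:.3f} -> {after_watt:.3f}")
print(f"photos/min per U:    {before_u:.0f} -> {after_u:.0f}")
print(f"power efficiency gain: {after_watt / before_watt:.2f}x")
```

Roughly a third of the power and rack space for slightly more throughput is the whole argument for diagonal scaling.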
Stupid Capacity Tricks
Stupid Capacity Tricks quick and dirty management DSH http://freshmeat.net/projects/dsh
[root@netmon101 ~]# cat group.of.servers
www100
www118
dbcontacts3
admin1
admin2
Stupid Capacity Tricks quick and dirty management
[root@netmon101 ~]# dsh -N group.of.servers
dsh> date
executing 'date'
www100: Mon Jun 23 14:53 UTC 2008
www118: Mon Jun 23 14:53 UTC 2008
dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
admin1: Mon Jun 23 14:53 UTC 2008
admin2: Mon Jun 23 14:53 UTC 2008
dsh>
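The same "run one command everywhere" idea can be sketched in a few lines of Python. This is not how dsh is implemented; it assumes passwordless ssh to each host, and every function name here is hypothetical. The runner is injectable so the fan-out logic can be exercised without touching a network.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def parse_group(text):
    """One hostname per line, as in the group.of.servers file; skip blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def ssh_runner(host, command):
    """Run a command on one host over ssh (assumes key-based auth is set up)."""
    result = subprocess.run(
        ["ssh", host, command], capture_output=True, text=True, timeout=30
    )
    return result.stdout.strip()

def run_on_hosts(hosts, command, runner=ssh_runner, workers=10):
    """Run a command on all hosts in parallel; returns [(host, output), ...]."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = pool.map(lambda host: runner(host, command), hosts)
        return list(zip(hosts, outputs))

# Example (not executed here, since it needs real hosts):
#   hosts = parse_group(open("group.of.servers").read())
#   for host, out in run_on_hosts(hosts, "date"):
#       print(f"{host}: {out}")
```

Output like the dbcontacts3 line above, in PDT while everything else is in UTC, is exactly the kind of drift this trick surfaces quickly.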
Stupid Capacity Tricks Turn Stuff OFF Disable heavy-ish features of the site (on/off switches) We have 195 different things to disable in case of emergency.
Stupid Capacity Tricks Turn Stuff OFF uploads (photo) uploads (video) uploads by email various API things various mobile things various search things etc. , etc.
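An on/off switch can be as simple as a set of disabled feature names consulted before doing the expensive work. A minimal sketch follows, assuming an in-memory store; the class and the flag names are hypothetical and not Flickr's actual implementation.

```python
class KillSwitches:
    """Tracks which site features are currently turned off."""

    def __init__(self, known_features):
        self._known = set(known_features)
        self._disabled = set()

    def _check(self, name):
        if name not in self._known:
            raise KeyError(f"unknown feature: {name}")

    def disable(self, name):
        self._check(name)
        self._disabled.add(name)

    def enable(self, name):
        self._check(name)
        self._disabled.discard(name)

    def is_enabled(self, name):
        self._check(name)
        return name not in self._disabled

# Hypothetical flag names modeled on the list above.
switches = KillSwitches(["photo_uploads", "video_uploads", "uploads_by_email"])
switches.disable("video_uploads")

if not switches.is_enabled("video_uploads"):
    print("video uploads are temporarily unavailable")
```

In production the switch state would live somewhere every webserver can read cheaply, so flipping a switch takes effect without a deploy.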
Stupid Capacity Tricks Outages Happen Host your outage/status/blog page in more than one datacenter. Tell your users WTF is going on, they’ll appreciate it.
Stupid Capacity Tricks Hit the Pause Button Bake the dynamic into static Some Y! properties have a big red button to instantly bake (and unbake) at will
thanks
http://flickr.com/photos/bondidwhat/402089763/
http://flickr.com/photos/74876632@N00/2394833962/
http://flickr.com/photos/42311564@N00/220394633/
http://flickr.com/photos/unloveable/2422483859/
http://flickr.com/photos/absolutwade/149702085/
http://flickr.com/photos/krawiec/521836276/
http://flickr.com/photos/eschipul/1560875648/
http://flickr.com/photos/library_of_congress/2179060841/
http://flickr.com/photos/jekkyl/511187885/
http://flickr.com/photos/ab8wn/368021672/
http://flickr.com/photos/jaxxon/165559708/
http://flickr.com/photos/sparktography/75499095/
We’re Hiring! flickr.com/jobs Come see me!
questions?