
System Tomography
Gradient-based models of multitier systems
Kaustubh Joshi (AT&T Labs Research)
Collaborators:
Shuyi Chen (University of Illinois)
Matti Hiltunen (AT&T Labs Research)
Rick Schlichting (AT&T Labs Research)
William Sanders (University of Illinois)
© 2008 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.

System Tomography?
Performability not just for safety-critical systems
• Enterprise systems: ERM, CRM, web services, e-commerce
• Consumers: banking, shopping, portals, communications
• Predicting impact of topology (application–environment)
• Predicting impact of growth (application–workload)
• Predicting impact of outsourcing (application–application)
Queuing Models. Petri Nets. But …
• Organic growth and constant evolution
• Organizational challenges
• Cost and time to market
• Complex hidden factors (firewalls, load balancers, caches, proxies, content accelerators, network file systems)
Page 2

System Tomography!
Use online measurements
• Minimal application instrumentation
• No isolated profiling environment
• Inject small runtime perturbations
Construct simple predictive models, e.g., gradients
• Simple: linear, limited domain
• Cheap enough to allow online retraining
Automatically generate models with predictive capability
• And do so for black-box applications
Page 3

Gradients: Definition and Use
Defined for end-to-end “metrics” m and “knobs” k as
• The partial derivative of the end-to-end metric m for transaction t with respect to the knob vector k
Use to predict end-to-end metric in a new configuration
• As a function of change in the knob
• Simple linear model
Page 4
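The simple linear model on this slide can be sketched in a few lines of Python. This is only an illustration of a first-order prediction, not the authors' tool: the function name `predict_metric`, the two link-latency knobs, and all numbers are hypothetical.

```python
def predict_metric(m0, gradient, k0, k_new):
    """First-order prediction: m(k_new) ~= m(k0) + sum_i g_i * (k_new_i - k0_i).

    m0       -- metric measured at the current operating point
    gradient -- one partial derivative per knob, measured at k0
    k0       -- current knob values
    k_new    -- knob values of the configuration to predict
    """
    return m0 + sum(g * (kn - ko) for g, kn, ko in zip(gradient, k_new, k0))


# Hypothetical example: 120 ms response time at the operating point and
# two knobs, each the latency (in ms) of one network link.
baseline_ms = 120.0
gradient = [2.0, 0.5]   # ms of response time per ms of link latency (made up)
k0 = [1.0, 1.0]         # current link latencies
k_new = [6.0, 1.0]      # the first link becomes 5 ms slower

predicted = predict_metric(baseline_ms, gradient, k0, k_new)
print(f"predicted response time: {predicted:.1f} ms")
```

Because the model is linear, it should only be trusted for small changes around the operating point where the gradient was measured.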

Gradient Measurement
[Figure: response time versus workload, with the gradient marked at the operating point]
Page 5

Gradient Measurement
• Inject perturbations, measure effect on response time
• Noise is a problem in production environments
  – 95% Conf. Int.: 3.17 msec; need change of 63.4 msec for up to 10% error
  – 95% Conf. Int.: 0.07 msec; need change of 1.4 msec for up to 10% error
  – More than order of magnitude sensitivity improvement
• But, noise is often not periodic
• Frequency domain analysis using Fourier Transforms
Page 6
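One way to read the frequency-domain idea is this: inject a periodic perturbation at a known frequency, then compare the knob and the metric only at that frequency, where aperiodic noise contributes little. The Python sketch below illustrates that reading on synthetic data. It is not the authors' implementation; the sinusoidal perturbation, the lag-free response, the function names, and all numbers are assumptions.

```python
import cmath
import math
import random


def dft_bin(samples, cycles):
    """Complex DFT coefficient of `samples` at `cycles` periods per window."""
    n = len(samples)
    return sum(x * cmath.exp(-2j * math.pi * cycles * i / n)
               for i, x in enumerate(samples)) / n


def estimate_gradient(knob, metric, cycles):
    """Ratio of the metric and knob components at the perturbation frequency.

    Assumes the metric follows the knob without lag, so the ratio is
    (nearly) real and its real part is the gradient estimate.
    """
    return (dft_bin(metric, cycles) / dft_bin(knob, cycles)).real


# Synthetic data; every number here is made up for illustration.
random.seed(0)
n, cycles = 20000, 50      # samples in the window, perturbation periods
true_gradient = 2.0        # ms of response time per ms of injected delay
knob = [math.sin(2 * math.pi * cycles * i / n) for i in range(n)]  # +/- 1 ms
metric = [100.0 + true_gradient * k + random.gauss(0.0, 5.0) for k in knob]

estimate = estimate_gradient(knob, metric, cycles)
print(f"estimated gradient: {estimate:.2f}")
```

In this synthetic run the per-sample noise (5 ms standard deviation) is larger than the 2 ms effect of the perturbation, yet the estimate at the perturbation frequency stays close to the true gradient because the noise is spread across all frequencies.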

Gradient Measurement
[Figure: response time versus workload, with the gradient marked at the operating point]
Page 7

Measurement Tool
[Figure: a central coordinator, per-node daemons, and logs across the Apache, TomcatA, TomcatB, and MySQL nodes; numbered steps 1 to 4 end in “Calculate Gradient”]
Page 8

Link Gradients
Definition
• Rate of change of end-to-end response time with respect to change in network link latency
Perturbation Mechanisms
• Inject packet delay using Linux ipqueue, TUN/TAP
• In-network ARP/Route redirection
Applications
• Determining impact of deployment decisions
• Application CDNs
• Estimating impact of network changes on applications
• Optimizing placement of shared components
• Runtime server migration depending on workload-mix, user-load
Page 9

And the predictions match … Page 10

Frequency Gradients
Definition
• Rate of change of end-to-end response time with respect to change in CPU frequency of servers
Perturbation Mechanisms
• Use DVFS. Change processor p-states.
Uses
• Energy conservation
• Performance aware CPU scaling
• Machine upgrades
Problems
• Nonlinearity much more severe
Page 11

Nonlinearity via basis functions
• Recast the gradient using nonlinear “basis functions”
  – The response time is linear in terms of the basis functions, instead of in the knob values [equations not preserved]
• Nonlinearity is primarily due to queuing effects
  – i.e., M/G/1 PS queue
• Can use other basis functions
Page 12
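The slide's own equations did not survive extraction, so the sketch below only illustrates the general idea: pick a nonlinear basis function of the knob and fit a straight line in that basis. The basis used here is an assumption built from the standard M/G/1 processor-sharing result E[T] = E[S] / (1 - ρ), with the further assumption that service time scales as demand divided by CPU frequency. The names `ps_basis` and `fit_line` and all numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def ps_basis(freq_ghz, demand_gcycles=0.01, arrival_rate=50.0):
    """Assumed basis: M/G/1 PS mean response time at one node, in seconds.

    Service time is modeled as demand / frequency (an assumption), and the
    utilization rho = arrival_rate * service must stay below 1.
    """
    service = demand_gcycles / freq_ghz
    rho = arrival_rate * service
    if rho >= 1.0:
        raise ValueError("node is saturated at this frequency")
    return service / (1.0 - rho)


# Synthetic "measurements": 50 ms of fixed delay plus three visits to the node.
freqs = [1.0, 1.5, 2.0, 2.5]
observed = [0.05 + 3.0 * ps_basis(f) for f in freqs]

# The response time is nonlinear in frequency but linear in the basis.
a, b = fit_line([ps_basis(f) for f in freqs], observed)
predicted_at_3ghz = a + b * ps_basis(3.0)
print(f"intercept={a:.3f} s, slope={b:.2f}, prediction at 3 GHz={predicted_at_3ghz:.4f} s")
```

The slope `b` plays the role of the gradient with respect to the basis function, so a prediction at a new frequency needs only the basis value there.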

Prediction Accuracy Page 13


VM Gradients
Definition
• Rate at which end-to-end response time changes with respect to the fraction of CPU allocated to individual node VMs
Perturbation Mechanisms
• Xen hypervisor scheduler parameters: cap VM CPU usage
Uses
• Cloud: resource sharing using statistical muxing
• Performance aware server consolidation
• Impact of adding/removing servers
Page 14

Linearity with respect to Basis Function Page 15

In conclusion
• We have a tool for gradient computation
  – Works for link, frequency gradients
  – VM, capacity gradient validation ongoing
• Future Directions
  – Additional gradients: bandwidth, loss (VoIP)
  – Use models with policy generation framework to generate black-box application management capability
Page 16

Extra Slides Page 17

Future Capacity Gradients
When basis functions aren’t enough
• Unpredictable nonlinearity
Problem
• When and what resource of a system will first become a bottleneck?
• i.e., compute gradients at future workloads
Gradient
• Rate at which throughput changes with respect to change in resource capacity at a different (higher) workload
Applications
• Planning for upgrades
• Detecting current bottlenecks
Page 18

Amp Modulation: changing operating point
[Figure: throughput versus workload magnitude, showing the current operating point, the new operating point, and the bottleneck gradient]
Page 19

Workload Spikes
• Buffer requests to preserve mean request rate
• Produce short (few milliseconds) workload spikes
Page 20
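The buffering idea can be illustrated with a small scheduling sketch: hold each request until the end of a fixed window, then release everything held so far as one burst. This is one possible reading of the slide, not the authors' mechanism; the function name `spike_release_times`, the fixed-window policy, and the numbers are assumptions.

```python
def spike_release_times(arrivals_ms, window_ms):
    """Hold each request until the end of its window, then release it.

    All requests that arrive in the same window leave together as one burst.
    No request is dropped, so the long-run mean request rate is unchanged.
    Times are integer milliseconds to avoid floating-point rounding.
    """
    return [(t // window_ms + 1) * window_ms for t in arrivals_ms]


# Hypothetical steady workload: one request every 10 ms for one second.
arrivals_ms = list(range(0, 1000, 10))
releases_ms = spike_release_times(arrivals_ms, window_ms=100)

bursts = sorted(set(releases_ms))
print(f"{len(releases_ms)} requests released in {len(bursts)} bursts "
      f"of {releases_ms.count(bursts[0])} requests each")
```

The server briefly sees a much higher instantaneous load during each burst, which is what lets a measurement probe a higher operating point without raising the average load.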

Delay Injection
• Currently host-based
  – Using iptables to construct a redirecting “firewall”
  – Using a virtual network tun/tap device
• Completely in-network injection possible
  – Using ARP poisoning-based redirection
  – In router rules
Page 21

Link Gradients
[Figure: incoming transactions (browse, buy, sell, search) flow through WS, Srv, and DB nodes spread across Site 1, Site 2, and Site 3]
• Upgrade network link
• Move server to another site
Page 22

Why CPU Gradient?
Energy consumption of IT activity increasingly serious issue
• Server farms
• In 2006, enterprise data centers accounted for 1.5% of total US electricity consumption (61 billion kWh). Oops!
• And it’s growing …
• 60% is consumed by low cost, commodity “volume servers”
• Multi-tier services are a major tenant
CPU frequency scaling can save power
• But, applications have responsiveness SLAs
• Scaling at different nodes can affect system differently
• Scaling, response time, energy saving relationship complex
Page 23