Energy Optimization and Stability in Green Data Centers

Tarek Abdelzaher
Dept. of Computer Science, University of Illinois at Urbana-Champaign, USA
On sabbatical at the Department of Automatic Control, Lund University, Sweden

Energy Management in Data Centers
- Total consumption: 2% of energy spent in US (EPA estimate)
- Energy bill is 20-50% of total profit
- Energy expended on:
  - Computing (powering up racks of machines)
    - Sensors: utilization, delay, throughput, …
    - Actuators: DVS (dynamic voltage scaling), turning machines on/off
  - Cooling
    - Sensors: temperature, air flow, …
    - Actuators: air-conditioning units, fans, …

Current Status
- Increased emphasis on energy control
- More “manipulation knobs” are introduced to manage energy and performance
- Challenge
  - Knobs may interact in unexpected ways
  - Different performance and energy management policies may interfere with one another
  - Uncoordinated interference of multiple knobs can lead to instability or poor efficiency

A Tale of Two Policies
[Figure: energy saving for DVS + On/Off, DVS alone, and On/Off alone. Empirical measurements from a 30-machine, 3-tier testbed of a shopping site.]
- DVS + On/Off: more energy consumption than DVS or On/Off alone!

Three Performance Management Challenges
- Avoid the “avoidable” (bad) interactions
- Manage the “unavoidable” interactions (so they do not lead to instability)
- Troubleshoot remaining interaction problems

Response Time Control Problem in VMs
[Figure: response time of VM 1 and VM 2]
- Goal: dynamically change the CPU shares of VMs to meet a response time (RT) constraint
- CPU allocation has been a popular knob for controlling response time
- With only CPU control, the response time constraint is severely violated. Why?

Memory Utilization, Disk I/O, and CPU Consumption
[Figure: number of page faults as a function of memory utilization]
[Figure: CPU consumption as a function of memory utilization]
- Page faults increase drastically after a certain threshold
- Significant CPU overhead after the threshold
  - The increase in CPU usage is mainly caused by extra paging activity

Response Time and Memory Utilization
- Sharp increase in response time after a certain memory utilization threshold, say 90%
- To achieve the desired performance, we need to avoid the “bad” region

CPU and Memory Control
[Figure: control architecture. Application SLOs feed a CPU controller and a memory controller. The controllers set the CPU allocation and memory allocation in the VMM (CPU scheduler, memory manager), which hosts VM 1 (App 1) through VM n (App n). Per-VM sensors report resource usage and application-level performance back to the controllers.]
- The CPU controller controls response time
- The memory controller makes sure memory utilization does not go over 90%
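The division of labor between the two controllers can be sketched as one control period for a single VM. This is only an illustration under stated assumptions: the proportional CPU update rule, the gain, the share bounds, and the names `VmState` and `control_step` are hypothetical and are not taken from the slides. Only the 90% memory utilization limit comes from the text above.

```python
from dataclasses import dataclass

MEM_UTIL_LIMIT = 0.90  # memory utilization threshold quoted on the slides


@dataclass
class VmState:
    cpu_share: float     # fraction of a CPU allocated to the VM
    mem_alloc_mb: float  # memory currently allocated to the VM


def control_step(state, response_time, rt_target, mem_used_mb,
                 cpu_gain=0.2, cpu_min=0.05, cpu_max=1.0):
    """One control period for one VM (assumed update law, for illustration)."""
    # CPU controller: raise the share when response time exceeds the target,
    # lower it when there is slack.
    rel_error = (response_time - rt_target) / rt_target
    cpu = state.cpu_share * (1.0 + cpu_gain * rel_error)
    cpu = min(cpu_max, max(cpu_min, cpu))

    # Memory controller: allocate just enough memory that utilization
    # (used / allocated) sits at the limit, staying out of the "bad" region.
    mem = mem_used_mb / MEM_UTIL_LIMIT

    return VmState(cpu_share=cpu, mem_alloc_mb=mem)


if __name__ == "__main__":
    vm = VmState(cpu_share=0.4, mem_alloc_mb=1000.0)
    vm = control_step(vm, response_time=1.5, rt_target=1.0, mem_used_mb=950.0)
    print(vm)
```

In this sketch the memory controller both grows and shrinks the allocation, which is one way to read "just enough memory"; a real controller would also have to arbitrate between VMs when physical memory is scarce.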

Performance of Joint Controllers with Synthetic Workload (Cont.)
[Figure: results for VM 1 and VM 2]
- Without dynamic memory control, VMs cannot get enough memory when memory gets scarce
- The joint controller gives each VM just enough memory not to fall into the bad region
- Physical memory is utilized efficiently

Challenge 2: Manage the “unavoidable” interactions (so they do not lead to instability)


DVS and On/Off Interactions in Energy Minimization
[Figure: DVS + On/Off, DVS alone, On/Off alone]
- The DVS and On/Off “knobs” must be controlled holistically, in a coordinated manner, as the solution to an optimization problem
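A minimal sketch of what "solution to an optimization problem" could look like: pick the number of powered-on machines and a common DVS frequency together, minimizing total power subject to serving the load. The cubic power model, the constants `P_IDLE` and `P_DYN`, the frequency levels, and the exhaustive search are all assumptions made for illustration; the slides do not specify the model or the solver.

```python
FREQS = [0.4, 0.6, 0.8, 1.0]  # normalized DVS frequency levels (assumed)
P_IDLE = 100.0                # watts drawn by any powered-on machine (assumed)
P_DYN = 150.0                 # extra watts at full frequency (assumed)


def power(machines, freq):
    """Total power under an assumed cubic dynamic-power model."""
    return machines * (P_IDLE + P_DYN * freq ** 3)


def best_config(load, total_machines, freqs=FREQS):
    """Jointly choose (machines, freq) with the least power that serves `load`.

    `load` is measured in units of one machine running at full frequency.
    Returns (machines, freq, watts), or None if the load cannot be served.
    """
    best = None
    for m in range(1, total_machines + 1):
        for f in freqs:
            if m * f + 1e-9 < load:
                continue  # this configuration lacks capacity
            p = power(m, f)
            if best is None or p < best[2]:
                best = (m, f, p)
    return best


if __name__ == "__main__":
    print("joint choice:", best_config(4.0, 10))
    print("DVS alone (all 10 on):", power(10, 0.4))
    print("On/Off alone (full speed):", power(4, 1.0))
```

Under these assumed numbers the joint choice (5 machines at frequency 0.8) draws less power than either slowing down all 10 machines or running 4 machines at full speed. The numbers are illustrative only and are not the testbed measurements.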

Results
[Figure: DVS + On/Off, DVS alone, On/Off alone, and Optimal]

Energy Saving Measurements from a Machine Room
[Figure: energy saving for the Even, Bottom-Up, Optimal, and Optimal + Off policies, grouped under: fixed cooling set point, fixed number of machines, holistic optimization]

Challenge 3: Troubleshoot remaining interaction problems

Help the Admin: Administrative Cost Is Skyrocketing!

Diagnostics
- In software systems, key variables in adaptive actions are correlated
- In mechanical systems, components are connected and correlated
- Monitor changes in correlations to diagnose performance problems
- If correlations are broken, the system may not perform as expected

Diagnostics
[Figure: a learned (estimated) adaptation graph over nodes AC, D, R, and U with signed edges, translated into causality assumptions. An automated-detection module monitors the target system (system workload, control knob settings, sensors, output, target performance reference), detects assumption violations, and selects between the regulation policy and a backup policy.]
1. Learning phase: learn the adaptation graph by calculating correlation coefficients
2. At run-time: periodically recalculate the sign of the edges in the adaptation graph
3. Check the signs to detect assumption violations
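The learn-then-check procedure above can be sketched in a few lines. This is a simplified illustration, not the author's tool: the helper names, the `dead_zone` threshold for calling a correlation too weak to have a sign, and the dictionary-based graph representation are all assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # constant series: treat as uncorrelated
    return cov / sqrt(vx * vy)


def sign(r, dead_zone=0.3):
    """Map a coefficient to +1, -1, or 0 (too weak to call); threshold assumed."""
    if r > dead_zone:
        return 1
    if r < -dead_zone:
        return -1
    return 0


def learn_graph(samples, edges):
    """Learning phase: record the correlation sign of each edge (src, dst)."""
    return {(a, b): sign(pearson(samples[a], samples[b])) for a, b in edges}


def violated_edges(graph, window):
    """Run-time check: edges whose sign over the recent window has changed."""
    return [edge for edge, learned in graph.items()
            if learned != 0
            and sign(pearson(window[edge[0]], window[edge[1]])) != learned]


if __name__ == "__main__":
    training = {"Req": [1, 2, 3, 4, 5], "Util": [10, 20, 30, 40, 50]}
    graph = learn_graph(training, [("Req", "Util")])
    recent = {"Req": [5, 6, 7, 8, 9], "Util": [50, 40, 30, 20, 10]}
    print(violated_edges(graph, recent))
```

Here an edge counts as violated when its sign flips or when the correlation weakens into the dead zone; whether a weakened correlation should trigger the backup policy is a design choice the slides leave open.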

Diagnostics
[Figure: the same detection loop. The adaptation graph over AC, D, R, and U is translated into causality assumptions; the automated-detection module monitors the target system, detects an assumption violation, and switches from the regulation policy to the backup policy.]
- Stop the component causing the sign problem
- Execute the backup action: an open-loop action
- Try several times

Example
- Unintended interaction between a utilization controller in a Web server and the kernel anti-livelock mechanism:
  - Admission control (AC) is based on utilization
  - It drops lower-priority requests first
[Figure: adaptation graphs over AC, Util, Pd, and Req with signed edges]
- Increased workload causes a switch from interrupt handling to polling, so utilization drops
- The controller tries to accept more requests, which aggravates the situation
- Most new requests are dropped by the kernel
- No prioritization is enforced

Diagnostics Example
1. Network processing is overloaded: switching from interrupt handling to polling
2. Utilization drops sharply due to the decrease in the number of interrupts
3. The admission control policy tries to accept more requests, aggravating the situation
[Figure: CPU utilization and number of network interrupts over time, in two phases: 1. closed loop, 2. closed loop with violation]
- The correlation between Req and Util becomes broken

More on Diagnostics
- Correlations between continuous variables do not uncover problems due to sequences of discrete events
- Focus on runtime events related to performance
  - E.g., turn on machines, decrease DVS, send a packet, etc.
- Find a (cyclic) sequence of events that discriminates “good” and “bad” performance cases
- Data mining technique: discriminative sequence analysis

Main Idea
- Log different events during runtime
  - Most of the time the system works
  - Occasionally it performs poorly
- Generate the frequent sequences of events that occur when the system works correctly
- Generate the frequent sequences of events that occur when the system exhibits undesirable behavior
- Identify the “culprit” sequences of events that are found only in the latter case but not the former
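The three steps above can be illustrated with a deliberately simple stand-in for discriminative sequence mining: count contiguous event sequences of a fixed length in "good" and "bad" logs, keep the frequent ones in each set, and report those frequent only in the bad logs. The function names, the fixed sequence length, the support threshold, and the event names in the usage example are all hypothetical; real discriminative sequence mining handles variable-length and non-contiguous patterns.

```python
from collections import Counter


def frequent_sequences(logs, length, min_support):
    """Contiguous event sequences of `length` present in at least
    `min_support` (a fraction) of the logs."""
    counts = Counter()
    for log in logs:
        # Use a set so each sequence is counted at most once per log.
        seen = {tuple(log[i:i + length])
                for i in range(len(log) - length + 1)}
        counts.update(seen)
    needed = min_support * len(logs)
    return {seq for seq, c in counts.items() if c >= needed}


def culprit_sequences(good_logs, bad_logs, length=2, min_support=0.5):
    """Sequences frequent when the system misbehaves but not when it works."""
    good = frequent_sequences(good_logs, length, min_support)
    bad = frequent_sequences(bad_logs, length, min_support)
    return bad - good


if __name__ == "__main__":
    good_logs = [["Assign", "Run", "Done"], ["Assign", "Run", "Done"]]
    bad_logs = [["Assign", "Sleep", "Wake", "Sleep", "Wake"],
                ["Sleep", "Wake", "Sleep", "Assign"]]
    print(culprit_sequences(good_logs, bad_logs, length=2, min_support=1.0))
```

A reported pair such as ("Sleep", "Wake") together with ("Wake", "Sleep") hints at a cycle, which is the kind of pattern the case study that follows reports.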

A Case Study on a “Hot” Day: Throughput of a Server Farm
[Figure: throughput of the server farm over time, with a period annotated “Low Throughput”]

Three Performance Control Policies
- Thermal Management Policy
  - Puts a machine to sleep if the machine is overheated
- Energy-Aware Load Balancer
  - Distributes load based on average CPU utilization
  - Attempts to minimize the number of machines in use
- Machine On/Off Policy
  - Turns off idle machines to save energy

Regular Operating Conditions
[Figure annotation: maximum temperature is well below 60 degrees]


Anomalous Condition
[Figure annotation: maximum temperature is above 60 degrees]
[Figure annotation: eventually, only the overheated machine remained on!]


Diagnostics Output: Reported Culprit Event Sequences
- Cycle: SleepEvent, WakeUpEvent
- Cycle: Temp: 65-70, Temp: 60-65
- Oops: utilization is computed based on a recent time average (including “sleep” time), so it is artificially low if the machine sleeps
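A small numeric sketch shows why averaging over sleep time understates load. The sample format (a busy fraction plus an asleep flag per interval) and both function names are assumptions made for illustration. The awake-only average is shown purely for contrast and is not presented in the slides as the fix.

```python
def windowed_utilization(samples):
    """Average utilization over a window of (busy_fraction, asleep) samples.

    Sleep intervals count as zero utilization, as in the policy described above.
    """
    total = sum(0.0 if asleep else busy for busy, asleep in samples)
    return total / len(samples)


def awake_utilization(samples):
    """Average over only the intervals in which the machine was awake."""
    awake = [busy for busy, asleep in samples if not asleep]
    return sum(awake) / len(awake) if awake else 0.0


if __name__ == "__main__":
    # A machine that is fully busy whenever awake but sleeps every other interval.
    window = [(1.0, False), (0.0, True)] * 5
    print(windowed_utilization(window))  # 0.5
    print(awake_utilization(window))     # 1.0
```

The saturated machine reports 50% utilization, so a load balancer that compares this figure against a threshold sees spare capacity that does not exist.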

What was going on?
- No matter how many tasks are assigned to the overheated machine, its utilization remains well below the threshold due to periodic sleeping
- The load balancer keeps assigning more and more tasks to the overheated machine
- The On/Off policy keeps turning off the other machines

Conclusions (the needs)
- Must identify the right knobs to manipulate (e.g., the example with virtual machine memory allocation)
- Must manage them in a jointly optimal manner to avoid instability or poor performance
- Must develop automated self-diagnostic techniques to reduce administrator effort

Conclusions (the tools)
- Control theory of positive systems offers interesting insights into distributing the holistic management of interacting feedback control knobs in data centers
- Advances in event-based control offer opportunities to significantly reduce actuation overhead (e.g., the number of times machines are turned on/off) without degrading performance
- Advances in discriminative sequence mining offer opportunities for improving self-diagnostic capabilities in complex systems