QoS-Aware Resource Management in Distributed Systems (ECE 7610)

QoS-Aware Resource Management
- Physical environment
  - Job scheduling
  - Load balancing
  - Data locality
  - Application deployment
  - Server/resource allocation
- Virtualized environment (cloud computing)
  - Similar issues as in the physical environment
  - Interference-aware scheduling
  - VM deployment
  - VM migration
  - Virtual resource allocation

Physical Resource Management
- Typical systems in practice
  - Hadoop cluster
    - Resource-aware scheduling
    - Data locality-aware scheduling
    - Resource management framework (YARN)
  - Grid computing
    - QoS-aware resource management
  - Multi-tier web system
    - Dynamic application placement
    - Dynamic server allocation
    - Dynamic resource provisioning

Hadoop Resource-Aware Scheduling
- Fair Scheduler (Facebook)
  - A Hadoop cluster is shared by multiple users running multiple jobs.
  - Cluster capacity is assigned to jobs so that all jobs get an equal share of the resources.
  - Job priorities are supported: they act as weights that determine the fraction of total compute time each job gets.
  - Minimum shares can be guaranteed to resource pools or jobs.
  - The job queue is kept sorted by fairness; the job farthest below its fair share is scheduled first.
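The rule "the job farthest below its fair share is scheduled first" can be sketched in a few lines. This is a minimal illustration, not the Hadoop Fair Scheduler implementation: the `Job` fields, `fair_shares`, and `next_job` are hypothetical names, shares are counted in task slots, and raising a share to the job's minimum is a simplification that may push the sum of shares above the cluster capacity.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    weight: float        # priority used as a weight (hypothetical field)
    running_tasks: int   # slots the job currently holds
    min_share: int = 0   # guaranteed minimum number of slots


def fair_shares(jobs, total_slots):
    """Weighted share of the slots per job, never below the job's minimum share."""
    total_weight = sum(job.weight for job in jobs)
    return {
        job.name: max(job.min_share, total_slots * job.weight / total_weight)
        for job in jobs
    }


def next_job(jobs, total_slots):
    """Return the job with the largest deficit (fair share minus running tasks)."""
    shares = fair_shares(jobs, total_slots)
    return max(jobs, key=lambda job: shares[job.name] - job.running_tasks)


jobs = [Job("a", 1.0, 6), Job("b", 1.0, 2), Job("c", 2.0, 4)]
print(next_job(jobs, total_slots=12).name)
```

With 12 slots the weighted shares are 3, 3, and 6, so job `c` (holding 4 of its 6) has the largest deficit and is picked next.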

Hadoop Resource-Aware Scheduling
- Capacity Scheduler (Yahoo)
  - Jobs fairly share the capacity of the cluster.
  - Jobs are submitted into queues.
  - Each queue is allocated a fraction of the total resource capacity.
  - Free resources can be allocated to queues beyond their configured capacity.
  - Within a queue, a job with a high priority gets access to the queue's resources first.
  - There is no preemption once a job is running.

Hadoop Locality-Aware Scheduling
- Delay Scheduling (Facebook)
  - Try to assign each task as close as possible to its input data.
  - Local data access is much more efficient than remote data access.
  - Locality levels: node locality, rack locality, and off-rack.
  - The scheduling order is based on fairness; a strict fairness policy may hurt data locality.
  - Delay some jobs to achieve high data locality, compromising fairness slightly.

Hadoop Locality-Aware Scheduling
[Figure: a master holding the scheduling order for the tasks of Job 1 and Job 2, and three slaves storing the block replicas of File 1 and File 2.]

Hadoop Locality-Aware Scheduling
[Figure: master scheduling order for Job 2 and Job 1 over three slaves storing the block replicas of File 1 and File 2.]
Problem: a strictly fair decision hurts locality. This is especially bad for jobs with small input files.

Hadoop Locality-Aware Scheduling
[Figure: master scheduling order for Job 2 and Job 1 with a "Wait" step, over three slaves storing the block replicas of File 1 and File 2.]
Idea: wait a short time to get data-local scheduling opportunities.
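The waiting idea can be sketched as follows. This is a simplified illustration under stated assumptions: only node-level locality is modeled, waiting is counted in skipped scheduling opportunities (`max_skips`) rather than in wall-clock time, and the `Job` class and `assign_task` function are hypothetical names rather than Hadoop APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    # Pending task id -> set of nodes that hold the task's input block.
    pending: dict = field(default_factory=dict)
    skip_count: int = 0


def assign_task(jobs_in_fair_order, node, max_skips):
    """Pick (job name, task, is_local) for a free slot on `node`, or None."""
    for job in jobs_in_fair_order:
        if not job.pending:
            continue
        local = [task for task, nodes in job.pending.items() if node in nodes]
        if local:
            # A data-local task exists: launch it and reset the wait counter.
            task = local[0]
            del job.pending[task]
            job.skip_count = 0
            return job.name, task, True
        if job.skip_count >= max_skips:
            # The job has waited long enough: give up on locality.
            task = next(iter(job.pending))
            del job.pending[task]
            return job.name, task, False
        # Skip this job for now and let the next job in fair order try.
        job.skip_count += 1
    return None


job1 = Job("job1", {"t1": {"n2"}})
job2 = Job("job2", {"t2": {"n1"}})
print(assign_task([job1, job2], "n1", max_skips=2))
```

In the example, `job1` is first in fair order but has no data on `n1`, so it is skipped once and `job2` launches its local task instead.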

Hadoop Resource Manager
- Hadoop NextGen MapReduce (YARN)
  - Splits resource management and job scheduling/monitoring into separate daemons.
  - Has a global ResourceManager (RM), multiple NodeManagers (NM), and an application-specific ApplicationMaster (AM).
  - The RM is the authority that allocates resources among all the applications in the system.
  - NMs periodically report node status.

Resource Management in Grid
- Grid computing
  - Combines a large amount of resources from multiple locations to reach a common goal.
  - Usually considered a distributed system with non-interactive workloads that involve a large number of files.
  - Tends to be loosely coupled, heterogeneous, and geographically dispersed.
- Resource management challenges in Grid
  - Satisfactory end-to-end performance
  - Availability of computational resources
  - Handling conflicting resource demands
  - Fault tolerance
  - Common critical resources: computing power, disk space, memory, network bandwidth, etc.

Resource Management in Grid
- Stages of resource management
  - Resource discovery
    - Find the available resources
  - System selection
    - Allocate the resources
  - Job execution
    - Run the job
    - Log resource usage
    - Release resources
- Target
  - Guarantee quality of service
  - Rapid and cost-effective access to large amounts of resources
  - Scheduling resources regardless of network topology

Key Issues in RMS
- RMS organization
  - Flat / cells / hierarchical
- Job resource demand estimation
  - Predictive
    - Heuristic prediction / statistical modeling / machine learning
  - Non-predictive
    - Heuristics / probability distribution
- Scheduling policy
  - Fixed
    - System oriented / application oriented
  - Extensible
    - Ad hoc / structured

Grid RMS Examples

Multi-tier Web Systems
- Typical architecture
  - Web server tier (presentation tier)
  - Application server tier (logic tier)
  - Database server tier (data access tier)
- Resource management challenges
  - Interactive, time-sensitive jobs
  - Heterogeneous applications with different demands
  - Dynamic workload
- Resource management issues
  - Dynamic application placement
  - Dynamic resource allocation
  - Dynamic server allocation

Dynamic Application Placement
- Problem
  - Given a set of servers with constrained resources and a set of applications with dynamic demands, how many instances of each application should run, and where should they be placed?
- Objective
  - Maximize the total satisfied application demand
  - Minimize placement overhead
  - Balance the workload
  - Be highly scalable
Source: A Scalable Application Placement Controller for Enterprise Data Centers, WWW '07

Dynamic Application Placement
Source: A Scalable Application Placement Controller for Enterprise Data Centers, WWW '07

Dynamic Application Placement
- Approaches
  - The problem is NP-hard, a variant of the Class Constrained Multiple-Knapsack Problem; traditional approaches are not scalable.
  - Compute the maximum total application demand that can be satisfied by the current placement solution.
  - First shift the workload among instances of the same application.
    - Max-flow and min-cost max-flow problem
    - At most one underutilized instance
    - Residual memory and CPU co-located
  - Then perform application placement.
    - Outermost loop: rank the applications in increasing load-memory ratio and the machines in decreasing CPU-memory ratio
    - Intermediate loop: test all the applications
    - Innermost loop: find appropriate applications
Source: A Scalable Application Placement Controller for Enterprise Data Centers, WWW '07

Dynamic Resource Allocation
- Problem
  - How to guarantee web service quality with limited resources under dynamic user demand
  - How to evaluate and monitor the service quality
- Objective
  - Guarantee client-perceived QoS by dynamically adjusting resource allocation
  - Consider the response time of whole pages instead of a single packet
- Approach
  - Model-independent two-level self-tuning fuzzy controller for resource allocation
  - A framework to guarantee client-perceived end-to-end QoS
Source: eQoS: Provisioning of Client-Perceived End-to-End QoS Guarantees in Web Servers, IEEE Trans. Computers, 2006

Client-Perceived QoS
[Figure: client-server timeline contrasting request-based QoS with client-perceived pageview QoS, covering connection setup, the base page, object 1, object 2, the last object, connection close, and waiting for new requests.]
[Figure: monitoring pipeline with the labels HTTPS traffic, mirrored HTTPS traffic, TCP packets, HTTPS transactions, packet capture, packet analyzer, and performance analyzer.]
Source: Wei and Xu, sMonitor for Measurement of User-Perceived Latency, USENIX 2006

Dynamic Resource Allocation
- Architecture
  - The QoS controller makes resource allocation decisions.
  - The resource manager manages requests.
  - The QoS monitor measures the client-perceived page-view response time.
- QoS controller
  - Resource controller with fuzzy rules
  - Scaling factor controller
Source: eQoS: Provisioning of Client-Perceived End-to-End QoS Guarantees in Web Servers, IEEE Trans. Computers, 2006

Dynamic Server Allocation
- Objective
  - Automatically allocate computing resources (coarse-grained: number of servers) to each application in a data center to maximize performance.
- Approach
  - Machine learning algorithm
Source: Online Resource Allocation Using Decompositional Reinforcement Learning, AAAI 2005

QoS-Aware Resource Management
- Physical environment
  - Job scheduling
  - Load balancing
  - Data locality
  - Server/resource allocation
  - Application deployment
- Virtualized environment (cloud computing)
  - Similar issues as in the physical environment
  - Interference-aware scheduling
  - Virtual resource allocation
  - VM deployment
  - VM migration

Interference-Aware Task Scheduling
- Co-hosted VMs share hardware and software.
- Interference slows down the tasks dramatically.

Interference-Aware Task Scheduling
System architecture
Sources: TRACON: Interference-Aware Scheduling for Data-Intensive Applications in Virtualized Environments, SC '11; Interference and Locality-Aware Task Scheduling for MapReduce Applications in Virtual Clusters, HPDC '13

Interference Prediction Model
- Quantify the impact of interference on system performance
- Different models
  - Linear model
  - Quadratic model
  - Exponential model

Model        I/O-bound  CPU-bound  Overall
Linear       0.676      0.611      0.657
Quadratic    0.722      0.672      0.714
Exponential  0.895      0.879      0.887

Sources: TRACON: Interference-Aware Scheduling for Data-Intensive Applications in Virtualized Environments, SC '11; Interference and Locality-Aware Task Scheduling for MapReduce Applications in Virtual Clusters, HPDC '13

Interference-Aware Task Scheduling

Least Interference Scheduling
  Given an available node
  Predict the slowdown S for all jobs
  Sort jobs
  Accept the job with least interference

Dynamic Threshold Scheduling
  Given a job and an available node
  Given an initial threshold H
  Predict the slowdown rate S
  If S < H Then accept this job
  Else reject this job

  // Lr: number of working slots
  // Hd: dynamic threshold
  Set Hd = H
  If (Lr+1)/S > Lr/Hd Then
    accept the job
    Update Hd = S
  Else reject this job

Sources: TRACON: Interference-Aware Scheduling for Data-Intensive Applications in Virtualized Environments, SC '11; Interference and Locality-Aware Task Scheduling for MapReduce Applications in Virtual Clusters, HPDC '13
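The two policies on this slide can be sketched in Python. This is an illustrative sketch, not code from the cited papers: `predict_slowdown` stands in for whichever interference prediction model is used and is a hypothetical callable, and the class and function names are assumptions. Slowdown values are assumed to be positive.

```python
def least_interference(jobs, node, predict_slowdown):
    """Return the queued job with the smallest predicted slowdown on `node`."""
    if not jobs:
        return None
    return min(jobs, key=lambda job: predict_slowdown(job, node))


def static_threshold_accept(slowdown, threshold):
    """Fixed-threshold rule: accept the job only if S < H."""
    return slowdown < threshold


class DynamicThreshold:
    """Dynamic rule: accept when (Lr + 1) / S > Lr / Hd, then set Hd = S."""

    def __init__(self, initial_threshold):
        self.hd = initial_threshold  # Hd starts at the initial threshold H

    def accept(self, slowdown, working_slots):
        if (working_slots + 1) / slowdown > working_slots / self.hd:
            self.hd = slowdown  # the accepted slowdown becomes the new threshold
            return True
        return False


predicted = {"grep": 1.2, "sort": 2.5, "scan": 1.8}  # made-up slowdown values
print(least_interference(list(predicted), "node-1", lambda job, node: predicted[job]))

policy = DynamicThreshold(initial_threshold=1.5)
print(policy.accept(slowdown=2.0, working_slots=1))
print(policy.accept(slowdown=4.0, working_slots=2))
```

The dynamic rule accepts a job whose slowdown exceeds the current threshold as long as using one more slot still raises the ratio of busy slots to slowdown, which is why the first call is accepted and the second is rejected.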

Dynamic Virtual Resource Allocation
[Figure: resource allocation against expected demand, contrasting over-provisioning (resource waste), under-provisioning (SLA violation), and dynamic provisioning.]
1. When to allocate resources?
2. How much resource to allocate?

Dynamic Virtual Resource Allocation
- Fine-grained resource management
  - Dynamically adjust VM capacity
  - Virtual CPU / memory / disk I/O bandwidth
- Challenges
  - Heterogeneous applications with different characteristics consolidated in a single machine
  - Dynamic workloads
  - Interference between co-hosted applications/VMs
  - Interplay with related application components
  - Scalability and adaptability
- Objective
  - Guarantee SLA and QoS for each application
  - Maximize resource utilization
  - Maximize system throughput

Dynamic Virtual Resource Allocation
Multi-Input, Multi-Output (MIMO) controller: allocates multiple types of resources to multiple enterprise applications.
- A set of application controllers determines the amount of resources each application needs.
- A set of node controllers detects resource bottlenecks and allocates the "actual" resources of multiple types to individual applications.
Source: Automated Control of Multiple Virtualized Resources, EuroSys '09

Approaches
- Application controller design
  - Model estimator: auto-regressive moving-average (ARMA) model
  - Optimizer: minimizes a cost function made up of a performance cost and a control cost
Source: Automated Control of Multiple Virtualized Resources, EuroSys '09
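One common way to write a cost with a performance term and a control term is the quadratic form below. It is a generic sketch, not necessarily the exact function from the cited paper; the symbols are assumptions introduced here: \hat{y}(k) is the performance predicted by the model for interval k, y_ref is the target, u(k) is the vector of resource allocations, and W and Q are weighting matrices.

```latex
J(k) =
  \underbrace{\bigl\lVert W \,\bigl(\hat{y}(k) - y_{\mathrm{ref}}\bigr) \bigr\rVert^{2}}_{\text{performance cost}}
  +
  \underbrace{\bigl\lVert Q \,\bigl(u(k) - u(k-1)\bigr) \bigr\rVert^{2}}_{\text{control cost}}
```

The first term penalizes deviation from the performance target; the second penalizes large changes in allocation between consecutive intervals, which damps oscillation.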

Approaches
- Node controller design
  - Allocates resources based on the resources requested by the application controllers and the resources available at the node
- Scenarios
  - Adequate CPU and disk resources
  - Adequate disk but inadequate CPU resources
  - Adequate CPU but inadequate disk resources
  - Inadequate CPU and disk resources
Source: Automated Control of Multiple Virtualized Resources, EuroSys '09
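The four scenarios reduce to checking each resource type independently against the node's capacity. The sketch below grants requests in full when a resource is adequate and scales them down proportionally when it is not. Proportional scaling is an assumption made for illustration, not the policy of the cited paper, and `allocate` and its dictionary layout are hypothetical.

```python
def allocate(requests, capacity):
    """Grant per-application resources on one node.

    requests: {app: {resource: requested amount}}
    capacity: {resource: amount available on the node}
    """
    grants = {app: {} for app in requests}
    for resource, available in capacity.items():
        total = sum(req.get(resource, 0.0) for req in requests.values())
        # Adequate resource: grant in full. Inadequate: scale every request down.
        scale = 1.0 if total <= available else available / total
        for app, req in requests.items():
            grants[app][resource] = req.get(resource, 0.0) * scale
    return grants


# Adequate disk but inadequate CPU (amounts are made-up percentages).
requests = {"app1": {"cpu": 60.0, "disk": 30.0}, "app2": {"cpu": 60.0, "disk": 40.0}}
print(allocate(requests, {"cpu": 100.0, "disk": 100.0}))
```

In the example the disk requests are granted as asked, while the two CPU requests are each cut to about 50 so that together they fit the node.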

Why is modeling hard?
- Cloud resources are not uniform.
- Resource management needs to be model-free, adaptive, and scalable.

Reinforcement Learning Method
- A learning process through interactions with the environment
  - Model-free
    - Optimal control, feedback control
    - Statistical modeling
  - Optimizes long-term reward
    - The current decision may have delayed consequences on both future reward and future state.
    - Avoids local optima: mathematical optimization
- Evaluating a decision: (S_1, Act_1) = r_1 + r_2 + r_3 + … + r_{n-1}
[Figure: agent and system in a loop, with state feedback going to the agent and resource adjustments going to the system toward a goal; a trajectory of states S_1, S_2, S_3, actions Act_1 to Act_{n-1}, and rewards r_1 to r_{n-1}.]
Sources: VCONF: A Reinforcement Learning Approach to Virtual Machines Auto-configuration, ICAC '09; A Distributed Self-learning Approach for Elastic Provisioning of Virtualized Cloud Resources, MASCOTS '11

Q-Learning
- Q-value Q(s, a): estimates the future
  - Estimated accumulated reward
  - Evaluates the "goodness" of an action at a state
  - Continuously updated using the temporal difference method
- Policy
  - Exploitation
    - Select the best action
  - Exploration
    - Try a random action
[Figure: table of Q(s, a) over states and actions, with exploration and exploitation choices and values ranging from bad (negative) to good (positive).]
Sources: VCONF: A Reinforcement Learning Approach to Virtual Machines Auto-configuration, ICAC '09; A Distributed Self-learning Approach for Elastic Provisioning of Virtualized Cloud Resources, MASCOTS '11
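A minimal tabular sketch of the ideas on this slide: the standard temporal difference update for Q(s, a) and an epsilon-greedy policy that mixes exploration and exploitation. The class name, the hyperparameter values, and the state and action labels are illustrative assumptions; the cited systems use their own state encodings and reward definitions.

```python
import random
from collections import defaultdict


class QLearner:
    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.alpha = alpha        # learning rate
        self.gamma = gamma        # discount factor for future reward
        self.epsilon = epsilon    # probability of exploring
        self.q = defaultdict(float)   # Q(s, a), defaulting to 0.0
        self.rng = random.Random(seed)

    def choose(self, state):
        """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """Temporal difference update toward reward + gamma * max_a' Q(s', a')."""
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_error = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.alpha * td_error


# Hypothetical states and actions for a VM capacity knob.
learner = QLearner(["increase", "decrease", "keep"], alpha=0.5, epsilon=0.0)
learner.update("low_cpu", "increase", reward=1.0, next_state="ok_cpu")
print(learner.choose("low_cpu"))
```

Setting `epsilon=0.0` in the usage lines makes the policy purely exploitative so the outcome is deterministic; a positive value is what allows the agent to try actions whose value it has not yet learned.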

VM Resource Management as an RL Task
- Goal (host-wide)
  - Maximize performance
  - Minimize resource cost
- State
  - Resource allocations
- Action
  - Resource adjustment
- Reward
  - System performance
Centralized resource management
Sources: VCONF: A Reinforcement Learning Approach to Virtual Machines Auto-configuration, ICAC '09; A Distributed Self-learning Approach for Elastic Provisioning of Virtualized Cloud Resources, MASCOTS '11

VM Resource Management as an RL Task
Distributed resource management
Sources: VCONF: A Reinforcement Learning Approach to Virtual Machines Auto-configuration, ICAC '09; A Distributed Self-learning Approach for Elastic Provisioning of Virtualized Cloud Resources, MASCOTS '11

VM Deployment and Migration
- Dynamic VM deployment
  - Adjust resource allocation according to demand in order to satisfy the SLA
  - Minimize the number of working nodes
  - Minimize power consumption
  - Minimize reconfiguration cost
- VM live migration
  - Moves a running VM between physical servers
  - Supports dynamic deployment
  - Dynamically balances the workload

Data and VM Placement for Hadoop
- Job-specific awareness
  - Map-input heavy: grep
  - Map-and-reduce-input heavy: sort
  - Reduce-input heavy: generator
Source: Purlieus: Locality-aware Resource Allocation for MapReduce in a Cloud, SC '11

Reduce Task Locality
Source: Purlieus: Locality-aware Resource Allocation for MapReduce in a Cloud, SC '11

Data and VM Placement for Hadoop
- Load awareness
  - Computation load
  - Storage load
  - Network load
[Figure panels: expected-load-unaware data placement; expected-load-aware data placement.]
Source: Purlieus: Locality-aware Resource Allocation for MapReduce in a Cloud, SC '11

Placement Techniques
- Minimizing cost functions
Source: Purlieus: Locality-aware Resource Allocation for MapReduce in a Cloud, SC '11

Placement Techniques
- Map-input heavy jobs
  - Data placement: load balancing
  - VM placement: on the physical machine that holds the data, or close to it
- Map-and-reduce-input heavy jobs
  - Data placement: load balancing / reduce locality
  - VM placement: on the physical machine that holds the data, or close to it
- Reduce-input heavy jobs
  - Data placement: anywhere
  - VM placement: close to each other
Source: Purlieus: Locality-aware Resource Allocation for MapReduce in a Cloud, SC '11

Data and VM Placement for Hadoop
[Figure: reduce phase of a map-and-reduce-input heavy job.]
Source: Purlieus: Locality-aware Resource Allocation for MapReduce in a Cloud, SC '11

QoS-Aware Resource Management
- Physical environment
  - Job scheduling
  - Load balancing
  - Data locality
  - Application deployment
  - Server/resource allocation
- Virtualized environment (cloud computing)
  - Similar issues as in the physical environment
  - Interference-aware scheduling
  - VM deployment
  - VM migration
  - Virtual resource allocation