Cloud Service Infrastructure ECE 7650

Key Components
- Datacenter Architecture
  - Cluster: compute, storage
  - Network for DC
  - Storage
  - Virtualization
- Cloud Management SW
  - Resource Usage Metering
  - Automated Systems Management
  - Privacy and Security Measures
- Cloud Programming Env

What's a Cluster?
- Broadly, a group of networked autonomous computers that work together to form a single machine in many respects:
  - To improve performance (speed)
  - To improve throughput
  - To improve service availability (high-availability clusters)
- Based on commercial off-the-shelf components, the system is often more cost-effective than a single machine with comparable speed or availability

Highly Scalable Clusters
- High Performance Cluster (aka Compute Cluster)
  - A form of parallel computer, which aims to solve problems faster by using multiple compute nodes
  - For parallel efficiency, the nodes are often closely coupled in a high-throughput, low-latency network
- Server Cluster and Datacenter
  - Aims to improve the system's throughput, service availability, power consumption, etc. by using multiple nodes

Top 500 Installations of Supercomputers (Top500.com)

Clusters in Top 500

An Example of a Top 500 Submission (Fall 2008)
- Location: Tukwila, WA
- Hardware, machines: 256 dual-CPU, quad-core Intel 5320 Clovertown, 1.86 GHz CPUs and 8 GB RAM per machine
- Hardware, networking: private & public on Broadcom GigE; MPI on Cisco InfiniBand SDR, 34 IB switches in leaf/node configuration
- Number of compute nodes: 256
- Total number of cores: 2048
- Total memory: 2 TB of RAM
Particulars of the current Linpack runs:
- Best Linpack result: 11.75 TFLOPS
- Best cluster efficiency: 77.1%
For comparison:
- Linpack rating from the June 2007 Top 500 run (#106) on the same hardware: 8.99 TFLOPS
- Cluster efficiency from the June 2007 Top 500 run (#106) on the same hardware: 59%
- Typical Top 500 efficiency for Clovertown motherboards w/ IB regardless of operating system: 65-77% (2 instances of 79%)
- 30% improvement in efficiency on the same hardware; about one hour to deploy
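
The 77.1% efficiency can be sanity-checked against the theoretical peak. A minimal sketch, assuming (this is not stated on the slide) 4 double-precision FLOPs per core per cycle for Clovertown:

```python
# Sanity check of the reported Linpack efficiency (assumption: 4 DP FLOPs/cycle/core).
cores = 2048
clock_hz = 1.86e9
flops_per_cycle = 4                             # assumed for Intel 5320 "Clovertown"

rpeak = cores * clock_hz * flops_per_cycle      # theoretical peak, FLOP/s
rmax = 11.75e12                                 # best Linpack result from the slide

print(f"Rpeak = {rpeak / 1e12:.2f} TFLOPS")     # ~15.24 TFLOPS
print(f"efficiency = {rmax / rpeak:.1%}")       # ~77.1%, matching the slide
```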

Beowulf Cluster
- A cluster of inexpensive PCs for low-cost personal supercomputing
- Based on commodity off-the-shelf components:
  - PC computers running a Unix-like OS (BSD, Linux, or OpenSolaris)
  - Interconnected by an Ethernet LAN
- Head node, plus a group of compute nodes
  - The head node controls the cluster and serves files to the compute nodes
- Standard, free and open-source software
  - Programming in MPI (see the sketch below)
  - MapReduce
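
As a concrete illustration of the head-node/compute-node programming model, here is a minimal MPI sketch in Python using mpi4py; the package choice, file name, and 4-process run are assumptions, since the slide only says "Programming in MPI":

```python
# Minimal MPI example: every rank computes a partial sum of its own slice of 1..N,
# and the results are reduced back to rank 0 (think of rank 0 as the head node).
# Run with e.g.:  mpiexec -n 4 python partial_sum.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

N = 1_000_000
local = sum(range(rank + 1, N + 1, size))        # this rank's slice of 1..N

total = comm.reduce(local, op=MPI.SUM, root=0)   # combine partial sums at rank 0
if rank == 0:
    print(f"sum(1..{N}) = {total}, computed on {size} ranks")
```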

Why Clustering Today
- Powerful nodes (CPU, memory, storage)
  - Today's PC is yesterday's supercomputer
  - Multi-core processors
- High-speed networks
  - Gigabit Ethernet (56% of the Top 500 as of Nov 2008)
  - InfiniBand System Area Network (SAN) (24.6%)
- Standard tools for parallel/distributed computing and their growing popularity
  - MPI, PBS, etc.
  - MapReduce for data-intensive computing

Major Issues in Cluster Design
- Programmability
  - Sequential vs parallel programming
  - MPI, DSM, DSA: hybrid of multithreading and MPI
  - MapReduce
- Cluster-aware resource management
  - Job scheduling (e.g., PBS)
  - Load balancing, data locality, communication optimization, etc.
- System management
  - Remote installation, monitoring, diagnosis
  - Failure management, power management, etc.

Cluster Architecture
- Multi-core node architecture
- Cluster interconnect

Single-core computer

Single-core CPU chip: the single core

Multicore Architecture
- Combines 2 or more independent cores (normally CPUs) into a single package
- Supports multitasking and multithreading in a single physical package

Multicore is Everywhere
- Dual-core commonplace in laptops
- Quad-core in desktops
- Dual quad-core in servers
- All major chip manufacturers produce multicore CPUs
  - Sun Niagara (8 cores, 64 concurrent threads)
  - Intel Xeon (6 cores)
  - AMD Opteron (4 cores)

Multithreading on multi-core (David Geer, IEEE Computer, 2007)

Interaction with the OS
- The OS perceives each core as a separate processor
- The OS scheduler maps threads/processes to different cores
- Most major OSes support multi-core today: Windows, Linux, Mac OS X, ...

Cluster Interconnect
- Network fabric connecting the compute nodes
- Objective is to strike a balance between
  - Processing power of the compute nodes
  - Communication ability of the interconnect
- A more specialized LAN, providing many opportunities for performance optimization
  - Switch in the core
  - Latency vs bandwidth

Goal: Bandwidth and Latency

Ethernet Switch: allows multiple simultaneous transmissions
- Hosts have a dedicated, direct connection to the switch
- Switches buffer packets
- Ethernet protocol used on each incoming link, but no collisions; full duplex
  - Each link is its own collision domain
- Switching: A-to-A' and B-to-B' simultaneously, without collisions
  - Not possible with a dumb hub
(Figure: a switch with six interfaces, 1-6, connecting hosts A, B, C, A', B', C')

Switch Table
- Q: how does the switch know that A' is reachable via interface 4 and B' via interface 5?
- A: each switch has a switch table; each entry is (MAC address of host, interface to reach host, time stamp)
  - Looks like a routing table!
- Q: how are entries created and maintained in the switch table? Something like a routing protocol?
(Figure: the same six-interface switch)

Switch: Self-Learning
- The switch learns which hosts can be reached through which interfaces
  - When a frame is received, the switch "learns" the location of the sender: the incoming LAN segment/interface
  - It records the sender's (MAC address, interface, TTL) entry in its switch table, e.g. (A, 1, 60)
(Figure: frame with Source A, Dest A' arrives on interface 1; switch table initially empty)

Self-Learning, Forwarding: Example
- Frame destination unknown: flood
- Destination location known: selective send
(Figure: frame with Source A, Dest A' arrives on interface 1; the switch table, initially empty, gains entries (A, 1, 60) and (A', 4, 60))
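
The self-learning and forwarding behavior of these two slides fits in a few lines of code. A minimal sketch (the class and method names are mine, not from the slides):

```python
import time

class LearningSwitch:
    """Toy model of a self-learning Ethernet switch."""
    def __init__(self, num_interfaces, ttl=60):
        self.interfaces = list(range(1, num_interfaces + 1))
        self.table = {}              # MAC address -> (interface, expiry time)
        self.ttl = ttl

    def receive(self, src_mac, dst_mac, in_iface):
        # Learn: record which interface the sender lives behind.
        self.table[src_mac] = (in_iface, time.time() + self.ttl)
        # Forward: selective send if the destination is known, otherwise flood.
        entry = self.table.get(dst_mac)
        if entry and entry[1] > time.time():
            return [entry[0]] if entry[0] != in_iface else []   # drop if same segment
        return [i for i in self.interfaces if i != in_iface]    # flood

# A (interface 1) sends to A' before and after A' has been learned:
sw = LearningSwitch(6)
print(sw.receive("A", "A'", 1))   # flood: [2, 3, 4, 5, 6]
print(sw.receive("A'", "A", 4))   # selective send back to A: [1]
```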

Interconnecting Switches
- Switches can be connected together
- Q: sending from A to G, how does S1 know to forward a frame destined to G via S4 and S3?
- A: self-learning! (works exactly the same as in the single-switch case)
- Q: what about latency and bandwidth for a large-scale network?
(Figure: four interconnected switches S1-S4 with hosts A-I attached)

What Characterizes a Network?
- Topology (what)
  - Physical interconnection structure of the network graph
  - Regular vs irregular
- Routing algorithm (which)
  - Restricts the set of paths that messages may follow
  - Table-driven, or routing-algorithm based
- Switching strategy (how)
  - How data in a message traverses a route
  - Store-and-forward vs cut-through
- Flow control mechanism (when)
  - When a message, or portions of it, traverses a route
  - What happens when traffic is encountered?
- The interplay of all of these determines performance

Tree: An Example
- Diameter and average distance are logarithmic
  - k-ary tree, height d = log_k N
  - Address specified as a d-vector of radix-k coordinates describing the path down from the root
- Fixed degree
- Route up to the common ancestor and then down (binary case, see the sketch below):
  - R = B xor A
  - Let i be the position of the most significant 1 in R; route up i+1 levels
  - Go down in the direction given by the low i+1 bits of B
- Bandwidth and bisection BW?
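
For the binary (k = 2) case, the up/down routing computation is a few bit operations. A minimal sketch (function name and example addresses are assumptions):

```python
def tree_route(a: int, b: int):
    """Route between leaves a and b of a binary tree whose leaf addresses
    encode the path down from the root, as on the slide."""
    r = a ^ b                      # R = B xor A
    if r == 0:
        return 0, []               # same leaf
    i = r.bit_length() - 1         # position of the most significant 1 in R
    up = i + 1                     # climb i+1 levels to the common ancestor
    # then go down, steered by the low i+1 bits of B (most significant first)
    down = [(b >> k) & 1 for k in range(i, -1, -1)]
    return up, down

print(tree_route(0b0110, 0b0101))  # (2, [0, 1]): up 2 levels, then left, then right
```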

Bandwidth
- Point-to-point bandwidth
- Bisection bandwidth of the interconnect fabric: the rate of data that can be sent across an imaginary line dividing the cluster into two halves, each with an equal number of nodes
- For a switch with N ports:
  - If it is non-blocking, the bisection bandwidth = N * the point-to-point bandwidth
  - An oversubscribed switch delivers less bisection bandwidth than a non-blocking one, but is cost-effective. It scales the bandwidth per node up to a point, after which increasing the number of nodes decreases the available bandwidth per node
  - Oversubscription is the ratio of the worst-case achievable aggregate bandwidth among the end hosts to the total bisection bandwidth

How to Maintain Constant BW per Node?
- Limited ports in a single switch
  - Use multiple switches
- The link between a pair of switches can become a bottleneck
  - Fast uplinks
- How to organize multiple switches
  - Irregular topology
  - Regular topologies: ease of management

Scalable Interconnect: Examples
(Figures: a fat tree and its building block; a 16-node butterfly)

Multidimensional Meshes and Tori (2D mesh, 2D torus, 3D cube)
- d-dimensional array
  - n = k_{d-1} x ... x k_0 nodes
  - Described by a d-vector of coordinates (i_{d-1}, ..., i_0)
- d-dimensional k-ary mesh: N = k^d
  - k = N^(1/d), the d-th root of N
  - Described by a d-vector of radix-k coordinates
- d-dimensional k-ary torus (or k-ary d-cube)?
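
A small sketch of the node-count and diameter arithmetic for k-ary d-dimensional meshes and tori. The slide only gives N = k^d; the diameter formulas d(k-1) and d*floor(k/2) are the standard ones and are assumed here:

```python
def mesh_stats(k: int, d: int):
    """Node count and diameter of a k-ary d-dimensional mesh and torus."""
    n = k ** d
    mesh_diameter = d * (k - 1)       # corner to opposite corner
    torus_diameter = d * (k // 2)     # wraparound links halve each dimension
    return n, mesh_diameter, torus_diameter

print(mesh_stats(8, 2))   # 8x8 2D mesh/torus: (64, 14, 8)
print(mesh_stats(8, 3))   # 8-ary 3D cube:     (512, 21, 12)
```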

Packet Switching Strategies
- Store and Forward (SF)
  - Move the entire packet one hop toward the destination
  - Buffer until the next hop is permitted
- Virtual Cut-Through and Wormhole
  - Pipeline the hops: the switch examines the header, decides where to send the message, and starts forwarding it immediately
  - Virtual Cut-Through: buffer on blockage
  - Wormhole: leave the message spread through the network on blockage

SF vs WH (VCT) Switching
- Unloaded latency: h(n/b + D) vs n/b + hD (see the numeric sketch below)
  - h: distance (hops)
  - n: size of message
  - b: bandwidth
  - D: additional routing delay per hop
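
A minimal sketch of the two latency formulas; the message size, link bandwidth, and per-hop delay below are illustrative assumptions, not from the slide:

```python
def sf_latency(h, n, b, D):
    """Store-and-forward: the whole message is retransmitted at every hop."""
    return h * (n / b + D)

def wh_latency(h, n, b, D):
    """Wormhole / virtual cut-through: transmission is pipelined across hops."""
    return n / b + h * D

h, n, b, D = 5, 1024, 1e9, 100e-9   # 5 hops, 1 KB message, 1 GB/s links, 100 ns/hop
print(f"SF: {sf_latency(h, n, b, D) * 1e6:.2f} us")   # ~5.62 us
print(f"WH: {wh_latency(h, n, b, D) * 1e6:.2f} us")   # ~1.52 us
```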

Conventional Datacenter Network

Problems with the Cluster Arch
- Resource fragmentation
  - If an application grows and requires more servers, it cannot use available servers in other layer-2 domains, resulting in fragmentation and underutilization of resources
- Poor server-to-server connectivity
  - Servers in different layer-2 domains have to communicate through the layer-3 portion of the network
- See the papers in the reading list on Datacenter Network Design for proposed approaches

Datacenter as a Computer
1. Overview
2. Workloads and SW infrastructure
3. HW building blocks
4. Datacenter basics
5. Energy and power efficiency
6. Dealing with failures and repairs

Datacenter
- Datacenters are buildings where servers and communication gear are co-located because of their common environmental requirements and physical security needs, and for ease of maintenance
- Traditional DCs typically host a large number of relatively small- or medium-sized applications, each running on a dedicated HW infrastructure that is decoupled and protected from other systems in the same facility

Recent Data Center Projects

Advances in Deployment of DC
- Conquering complexity
  - Building racks of servers and complex cooling systems all separately is not efficient
  - Package and deploy into bigger units: Microsoft Generation 4 Data Center (http://www.youtube.com/watch?v=PPnoKb9fTkA)

Microsoft Generation 4 Datacenter

Typical Elements of a DC
- Often low-end servers are used
- Connected in multi-tier networks

Key Aspects
- Storage
- Network fabric
- Storage hierarchy
- Quantifying latency, BW, and capacity
- Power usage
- Handling failures

Storage
- Global distributed file systems (e.g., Google's GFS)
  - Harder to implement at the cluster level, but lower HW costs and better networking fabric utilization
  - GFS implements replication across different machines (for fault tolerance); Google deploys desktop-class disk drives instead of enterprise-grade disks
- Network Attached Storage (NAS), directly connected to the cluster-level switching fabric
  - Simple to deploy, because it pushes the responsibility for data management and integrity to the NAS appliance

Amazon's Simple Storage Service (S3)
- Write, read, and delete objects containing from 1 byte to 5 terabytes of data each. The number of objects you can store is unlimited.
- Each object is stored in a bucket and retrieved via a unique, developer-assigned key (see the API sketch below).
- A bucket can be stored in one of several Regions: the US Standard, EU (Ireland), US West (Northern California), and Asia Pacific (Singapore) Regions.
- Built to be flexible so that protocol or functional layers can easily be added. The default download protocol is HTTP. A BitTorrent protocol interface is provided to lower costs for high-scale distribution.
- Designed to provide 99.99999% durability and 99.99% availability of objects over a given year; designed to sustain the concurrent loss of data in two facilities.
- Alternatively, designed to provide 99.99% durability and 99.99% availability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.01% of objects; designed to sustain the loss of data in a single facility.
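
The write/read/delete object model maps directly onto a few API calls. A hedged sketch using the boto3 Python SDK (the bucket name and key are hypothetical, and boto3 itself post-dates these slides):

```python
import boto3

s3 = boto3.client("s3", region_name="us-west-1")   # e.g., US West (N. California)

bucket = "example-lecture-bucket"    # hypothetical, developer-created bucket
key = "notes/lecture-01.txt"         # developer-assigned key

# Write an object.
s3.put_object(Bucket=bucket, Key=key, Body=b"cloud service infrastructure")

# Read it back (served over HTTP).
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
print(body)

# Delete it.
s3.delete_object(Bucket=bucket, Key=key)
```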

Amazon's Elastic Block Storage (EBS)
- EBS provides block-level storage volumes for use with EC2 instances. It allows you to create storage volumes from 1 GB to 1 TB that can be mounted as devices by Amazon EC2 instances.
- Each storage volume is automatically replicated within the same Availability Zone. This prevents data loss due to the failure of any single hardware component.
- The latency and throughput of Amazon EBS volumes are designed to be significantly better than those of the Amazon EC2 instance stores in nearly all cases.

Network Fabric
- Tradeoff between speed, scale, and cost
- Two-level hierarchy (rack and cluster levels)
  - A switch with 10 times the bisection BW costs about 100 times as much
  - A rack with 40 servers, each with a 1-Gbps port, may have between 4 and 8 1-Gbps uplinks to the cluster switch: an oversubscription factor between 5 and 10 for communication across racks (see the sketch below)
- "Fat-tree" networks built of lower-cost commodity Ethernet switches
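
The 5-10x oversubscription factor follows directly from the rack numbers on the slide. A minimal sketch:

```python
def rack_oversubscription(servers, server_gbps, uplinks, uplink_gbps):
    """Ratio of aggregate server bandwidth to uplink bandwidth out of the rack."""
    return (servers * server_gbps) / (uplinks * uplink_gbps)

# 40 servers with 1-Gbps ports, 4 to 8 1-Gbps uplinks (numbers from the slide):
print(rack_oversubscription(40, 1, 8, 1))   # 5.0
print(rack_oversubscription(40, 1, 4, 1))   # 10.0
```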

Next Generation of Network Fabric
- Monsoon
  - Work by Albert Greenberg, Parantap Lahiri, David A. Maltz, Parveen Patel, Sudipta Sengupta
  - Designed to scale to data centers of 100K+ servers
  - Flat server address space instead of dozens of VLANs
  - Valiant Load Balancing
  - Allows a mix of apps and dynamic scaling
  - Strong fault tolerance characteristics

Storage Hierarchy

Latency, BW, and Capacity
- Assume: 2000 servers (8 GB memory and 1 TB disk each), 40 per rack, connected by a 48-port 1-Gbps switch (8 uplinks); see the capacity sketch below
- Architecture: bridge the gap in a cost-efficient manner
- SW: hide the complexity, exploit data locality
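
Plugging the assumed cluster into the storage hierarchy of the previous slide gives the capacity visible at each level. A minimal sketch; the per-level aggregation is mine, only the raw numbers come from the slide:

```python
servers, per_rack = 2000, 40
dram_gb, disk_tb = 8, 1

levels = {
    "local":   1,          # one server
    "rack":    per_rack,   # servers reachable through the rack switch
    "cluster": servers,    # servers reachable through the cluster switch
}
for name, n in levels.items():
    print(f"{name:8s} {n * dram_gb:>6d} GB DRAM   {n * disk_tb:>5d} TB disk")
# local        8 GB DRAM       1 TB disk
# rack       320 GB DRAM      40 TB disk
# cluster  16000 GB DRAM    2000 TB disk
```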

DC vs Supercomputers
- Scale
  - Blue Waters = 40K 8-core "servers"
  - Roadrunner = 13K Cell + 6K AMD servers
  - MS Chicago Data Center = 50 containers = 100K 8-core servers, fat-tree network
- Network architecture
  - Supercomputers: Clos "fat tree", InfiniBand, low-latency / high-bandwidth protocols
  - Data center: IP-based network, standard data center gear, optimized for Internet access
- Data storage
  - Supers: separate data farm, GPFS or other parallel file system
  - DCs: use disk on node + memcache

Power Usage
- Distribution of peak power usage of a Google DC (circa 2007)

Workload and SW Infrastructure
- Platform-level SW: present on all individual servers, providing basic server-level services
- Cluster-level infrastructure: collection of distributed-systems SW that manages resources and provides services at the cluster level
  - MapReduce, Hadoop, etc.
- Application-level SW: implements a specific service
  - Online services like web search, Gmail
  - Offline computations, e.g., data analysis or generating data used by online services, such as building the index

Examples of Application SW
- Web 2.0 applications
  - Provide a rich user experience including real-time global collaboration
  - Enable rapid software development
- Software to scan voluminous Wikipedia edits to identify spam
- Organizing global news articles by geographic location
- Data-intensive workloads based on scalable architectures, such as Google's MapReduce framework
  - Financial modeling, real-time speech translation, Web search
  - Next-generation rich media, such as virtual worlds, streaming video, Web conferencing, etc.

Characteristics
- Ample parallelism at both the data and request levels
  - The key is not to find parallelism, but to manage and efficiently harness the explicit parallelism
  - Data parallelism arises from the large data sets of relatively independent records to be processed
  - Request-level parallelism comes from the hundreds or thousands of requests per second to be served; the requests rarely involve read-after-write sharing of data or synchronization

Characteristics (cont'd)
- Workload churn: isolation from the users of Internet services makes it easy to deploy new SW quickly
  - Google's front-end web server binaries are released on a weekly cycle
  - The core of its search services has been reimplemented from scratch every 2~3 years!
  - New products and services frequently emerge, and their success with users directly affects the resulting workload mix in the DC
  - Hard to develop a meaningful benchmark
    - Not too much for HW architects to do
    - Count on SW rewrites to take advantage of new HW capabilities?!
- Fault-free operation is challenging, but possible

Basic Programming Concepts
- Data replication, for both performance and availability
- Data partitioning (sharding), for both performance and availability
- Load balancing
  - Sharded vs replicated services
- Health checking and watchdog timers, for availability
  - No operation can rely on a given server responding in order to make forward progress
- Integrity checks, for availability
- Application-specific compression
- Eventual consistency w.r.t. replicated data
  - When no updates occur for a long period of time, eventually all updates will propagate through the system and all replicas will be consistent

Cluster-level SW
- Resource management
  - Map user tasks to HW resources, enforce priorities and quotas, provide basic task management services
  - Simple allocation; or automated allocation of resources, fair sharing of resources at a finer level of granularity, power/energy considerations
  - Related work at Wayne State
    - Wei et al., Resource management for end-to-end quality of service assurance [Wei's Ph.D. dissertation, 2006]
    - Zhou et al., Proportional resource allocation in web servers, streaming servers, and e-commerce servers; see cic.eng.wayne.edu for related publications (2002-2005)
- HW abstraction and basic services
  - E.g., reliable distributed storage, message passing, cluster-level synchronization (GFS, Dynamo)

Cluster-level SW (cont'd)
- Deployment and maintenance
  - SW image distribution, configuration management, monitoring of service performance and quality, alarm triggers for operators in emergency situations, etc.
  - E.g., Microsoft's Autopilot, Google's Health Infrastructure
  - Related work at Wayne State
    - Jia et al., Measuring machine capacity, ICDCS'08
    - Jia et al., Autonomic VM configuration, ICAC'09
    - Bu et al., Autonomic Web apps configuration, ICDCS'09
- Programming frameworks
  - Tools like MapReduce improve programmer productivity by automatically handling data partitioning, distribution, and fault tolerance (see the toy sketch below)
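
To make the programming-frameworks point concrete, here is a toy MapReduce-style word count in plain Python; a real framework (Hadoop, Google's MapReduce) would additionally handle the partitioning, distribution, and fault tolerance mentioned above:

```python
from collections import defaultdict

def map_phase(doc):
    # Emit (word, 1) pairs, as a MapReduce mapper would.
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    # Group by key and sum the counts, as a reducer would.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the cluster runs the job", "the job maps and reduces"]
pairs = [p for d in docs for p in map_phase(d)]
print(reduce_phase(pairs))   # {'the': 3, 'cluster': 1, 'runs': 1, 'job': 2, ...}
```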

Monitoring Infrastructure: An Example
- Service-level dashboards
  - Keep track of service quality (w.r.t. a target level)
  - Information must be fresh so that corrective actions can be taken and significant disruption avoided, within seconds rather than minutes
  - E.g., how to measure user-perceived pageview response time? (multiple objects, end-to-end; the example page in the figure has 18 objects)

Client-Experienced QoS
(Figure: timeline of a client-server connection over the Internet — connection setup, base page, object 1, object 2, ..., last object, connection close, waiting for new requests — contrasting request-based QoS measured at the server with client-perceived pageview QoS; mirrored HTTPS traffic and TCP packets are fed through a packet capture into an HTTPS transaction performance analyzer.)
Wei/Xu, sMonitor for Measurement of User-Perceived Latency, USENIX 2006

Performance Debugging Tools
- Help operators and service designers develop an understanding of the complex interactions between programs, often running on hundreds of servers, so as to determine the root cause of performance anomalies and identify bottlenecks
- Black-box monitoring: observing network traffic among system components and inferring causal relationships through statistical inference methods, assuming no knowledge of or assistance from applications or SW
  - But the inferred information may not be accurate
- Application/middleware instrumentation systems, like Google's Dapper, require modifying applications or middleware libraries to pass tracing information across machines and across module boundaries. The annotated modules log tracing info to local disks for subsequent analysis

HW Building Blocks
- Cost effectiveness of low-end servers

Performance of Parallel Apps
- Under a model of fixed local computation time, plus the latency penalty of accessing a global data structure

Parallel Apps Performance
- Performance advantage of a cluster of high-end nodes (128 cores) over a cluster with the same number of cores built with low-end servers (4 cores) [4 to 20 times difference in price]

How Small Should a Cluster Node Be?
- Other factors need to be considered
  - Amount of parallelism
  - Network requirements
  - Smaller servers may lead to low utilization
  - etc.

Datacenter as a Computer
1. Overview
2. Workloads and SW infrastructure
3. HW building blocks
4. Datacenter basics
5. Energy and power efficiency
6. Dealing with failures and repairs

Power/Energy Issues in DC
- Main components

UPS Systems
- A transfer switch that chooses the active power input: utility power or generator power
  - Typically, a generator takes 10-15 s to start and assume the full rated load
- Batteries to bridge the time of a utility power blackout
  - AC-DC-AC double conversion
  - When utility power fails, the UPS loses input AC power but retains internal DC power, and thus the AC output power
  - Removes voltage spikes and harmonic distortions in the AC feed
- Sized from 100s of kW up to 2 MW

Power Distribution Units
- Take the UPS output (typically 200~480 V) and break it up into many 110 or 220 V circuits that feed the actual servers on the floor
  - Each circuit is protected by its own breaker
- A typical PDU handles 75~225 kW of load, whereas a typical circuit handles 20 or 30 A at 110-220 V (a maximum of about 6 kW)
- PDUs provide additional (circuit-level) redundancy

Datacenter Cooling Systems
- CRAC units: computer room air conditioning
- Water-based free cooling

Energy Efficiency
- DCPE (DC performance efficiency): ratio of the amount of computational work to the total energy consumed
- Total energy consumed:
  - Power usage effectiveness (PUE): ratio of building power to IT power (currently 1.5 to 2.0)

Breakdown of DC Energy Overhead

Energy Efficiency (cont'd)
- Power usage effectiveness (PUE)
- Server PUE (SPUE): ratio of total server input power to its useful power (consumed by components like the motherboard, disks, CPUs, DRAM, I/O boards, etc.)
  - Useful power excludes losses in power supplies, fans, etc.
- Currently, SPUE is between 1.6 and 1.8
  - With better VRMs (voltage regulator modules), SPUE can be reduced to 1.2
(See the combined-efficiency sketch below.)
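
PUE and SPUE compose multiplicatively: only 1 / (PUE x SPUE) of the energy entering the building reaches the components doing useful work. A minimal sketch using mid-range values from these slides:

```python
def useful_energy_fraction(pue: float, spue: float) -> float:
    """Fraction of building input energy delivered to useful server components."""
    return 1.0 / (pue * spue)

print(f"{useful_energy_fraction(1.7, 1.7):.1%}")   # ~34.6% with typical values
print(f"{useful_energy_fraction(1.5, 1.2):.1%}")   # ~55.6% with a better facility and VRMs
```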

Measuring Power Efficiency
- Benchmarks
  - Green500 in high-performance computing
  - JouleSort: measures the total system energy to perform an out-of-core sort
  - SPECpower_ssj2008: computes the performance-to-power ratio of a system running a typical business application on an enterprise Java platform

Power Efficiency: An Example
- SPECpower_ssj2008 on a server with a single-chip 2.83 GHz quad-core Intel Xeon, 4 GB memory, and one 7.2k RPM 3.5'' SATA disk

Activity Profile of Google Servers
- A sample of 5000 servers over a period of 6 months
  - Most of the time, servers run at 10-50% utilization

Energy-Proportional Computing
- Humans at rest consume as little as 70 W, while being able to sustain peaks of 1 kW+ for tens of minutes
- (Figure: energy consumption profile for an adult male)

Causes of Poor Energy Proportionality
- The CPU used to be the dominant factor (60%); currently it is slightly lower than 50% at peak, dropping to 30% at low activity levels

Energy-Proportional Computing: How
- Hardware components, for example:
  - CPU: dynamic voltage scaling
  - High-speed disk drives spend 70% of their total power simply keeping the platters spinning
    - Need smaller rotational speeds, smaller platters, etc.
- Power distribution and cooling

SW Role: Management
- Smart use of power management features in existing HW, low-overhead inactive or active low-power modes, as well as power-friendly scheduling of tasks to enhance the energy proportionality of HW systems
- Two challenges:
  - Encapsulation in lower-level modules to hide additional infrastructure complexity
  - Performance robustness: minimizing performance variability caused by power management tools
- Related work at Wayne State
  - Zhong et al., "System-wide energy minimization for hard real-time tasks," TECS'08
  - Zhong et al., "Energy-aware modeling and scheduling for dynamic voltage scaling with statistical real-time guarantee," TC'07; Zhong's Ph.D. dissertation, 2007
  - Gong, Power/Performance Optimization (ongoing)

Datacenter as a Computer
1. Overview
2. Workloads and SW infrastructure
3. HW building blocks
4. Datacenter basics
5. Energy and power efficiency
6. Dealing with failures and repairs

Basic Concepts
- Failure: a system failure occurs when the delivered service deviates from the specified service, where the service specification is an agreed description of the expected service. [Avizienis & Laprie 1986]
- Fault: the root cause of failures, defined as a defective state in materials, design, or implementation. Faults may remain undetected for some time; once a fault becomes visible, it is called an error.
  - Faults are unobserved defective states
  - An error is the "manifestation" of a fault (source: Salfner'08)

Challenges of High Service Availability
- High service availability expectations translate into a high-reliability requirement for the DC
  - Faults in HW, SW, and operation are inevitable
  - At Google, about 45% of servers need to reboot at least once over a 6-month window; 95%+ reboot less often than once a month, but the tail is relatively long
  - The average downtime is ~3 hours, implying 99.85% availability
- Determining the appropriate level of reliability is fundamentally a trade-off between the cost of failures and the cost of preventing them

Availability and Reliability
- Availability: a measure of the time that a system was actually usable, as a fraction of the time that it was intended to be usable ("x nines")
  - Yield: the ratio of requests satisfied by the service to the total number of requests
- Reliability metrics (see the sketch below):
  - Time to failure (TTF)
  - Time to repair (TTR)
  - Mean time to failure (MTTF)
  - Mean time to repair (MTTR)
  - Mean time between failures (MTBF)
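
These metrics relate through standard identities (MTBF = MTTF + MTTR, availability = MTTF / MTBF). A minimal sketch with assumed example numbers:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure and mean time to repair."""
    mtbf = mttf_hours + mttr_hours       # mean time between failures
    return mttf_hours / mtbf

# Assumed example: a server that fails about twice a year and takes 3 hours to repair.
print(f"{availability(mttf_hours=4380, mttr_hours=3):.4%}")   # ~99.93%
```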

Interplay of HW, SW, and Operation
- Faults in HW and SW are inevitable, but the endeavor to mask them never halts: e.g., RAID disk drives, ECC memory
- A fault-tolerant SW infrastructure helps hide much of the failure complexity from application-level SW
  - E.g., a SW-based RAID system across disk drives residing in multiple machines (as in GFS); MapReduce
  - Flexibility in choosing the level of HW reliability that maximizes overall system cost efficiency (e.g., inexpensive PC-class HW)
  - Simplification of common operational procedures (e.g., HW/SW upgrades)
- With FT SW, it is not necessary to keep a server running at all costs. This changes every aspect of the system, from design to operation, and opens opportunities for optimization
  - As long as HW faults can be detected and reported to SW in a timely manner

Fault Characterization
- FT SW should be based on fault sources, their statistical characteristics, and the corresponding recovery behavior
- Fault severity
  - Corrupted: committed data are lost or corrupted
  - Unreachable: service is down
    - A Google service is no better than 99.9% available when its servers are one of the end points
    - Amazon's Service Level Agreement is 99.95%
  - Degraded: service is available but in degraded mode
  - Masked: faults occur but are masked from users by FT HW/SW mechanisms

Causes of Service-Level Failures
- Field data study I, on Internet services: operator-caused or misconfiguration errors are the largest contributors; HW-related faults (server or networking) account for about 10-25% [Oppenheimer 2003]
- Field data study II, on early Tandem systems: HW faults (<10%), SW faults (~60%), operations/maintenance (~20%) [Gray 1990]
- Google's observations over a period of 6 weeks

Observations
- SW/HW-based FT techniques seem to do well for independent faults
- SW-, operator-, and maintenance-induced faults have a higher impact on outages, possibly because they are more likely to affect multiple systems at once, creating a correlated failure scenario that is hard to overcome

Proactive Failure Management?
- Failure prediction?
  - Predict future machine failures with low false-positive rates over a short time horizon
- Develop models with a good trade-off between accuracy (both in false-positive rates and time horizon) and the penalties involved in failure occurrence and recovery
  - In a DC, the penalty is low, so the prediction model must be highly accurate to be economically competitive
  - In systems where a crash is disruptive to operations, less accurate prediction models would still be beneficial
- Related work at Wayne State
  - Fu/Xu, Exploring spatial/temporal event correlation for failure prediction, SC'07
  - Fu's Ph.D. dissertation, 2008

In Summary
- Hardware
  - Building blocks are commodity server-class machines, consumer- or enterprise-grade disk drives, and Ethernet-based networking fabrics
  - The performance of the network fabric and storage subsystems may be more relevant than that of the CPU and memory
- Software
  - FT SW for high service availability (99.99%)
  - Programmability, parallel efficiency, manageability
- Economics: cost effectiveness
  - Power and energy factors
  - Utilization characteristics require systems and components to be energy efficient across a wide load spectrum, particularly at low utilization levels

In Summary: Key Challenges
- Rapidly changing workloads
  - New applications with a large variety of computational characteristics emerge at a fast pace
  - Need creative solutions from both HW and SW, but few benchmarks are available
- Building balanced systems from imbalanced components
  - Processors have outpaced memory and magnetic storage in performance and power efficiency; more research should be shifted onto the non-CPU subsystems
- Curbing energy usage
  - Power becomes a first-order resource, like speed
  - Performance under a power/energy budget

Key Challenges (cont'd)
- Amdahl's cruel law
  - Speedup = 1 / (f_seq + f_par/n) on an n-node parallel system. The sequential part f_seq limits parallel efficiency, no matter how large n is (see the sketch below)
  - Future performance gains will continue to be delivered mostly by more cores or threads, not so much by faster CPUs
  - Is data-level or request-level parallelism enough? Parallel computing beyond MapReduce!
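
A minimal sketch of the speedup formula, showing how even a small sequential fraction caps the benefit of adding nodes (the 5% figure is an assumed example):

```python
def amdahl_speedup(f_seq: float, n: int) -> float:
    """Speedup on an n-node system when a fraction f_seq of the work is sequential."""
    return 1.0 / (f_seq + (1.0 - f_seq) / n)

for n in (16, 256, 4096):
    print(n, round(amdahl_speedup(0.05, n), 1))   # 9.1, 18.6, 19.9 -> limit is 1/0.05 = 20
```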

Reading List
- Barroso and Hölzle, "The Datacenter as a Computer," Morgan & Claypool, 2009.