
  • Number of slides: 33

Designing High Availability Networks, Systems, and Software for the University Environment
Deke Kassabian and Shumon Huque
The University of Pennsylvania
January 14, 2004

Copyright D. Kassabian and S. Huque [2004]. This work is the intellectual property of the authors. Permission is granted for this material to be shared for non-commercial, educational purposes, provided that this copyright statement appears on the reproduced materials and notice is given that the copying is by permission of the author. To disseminate otherwise or to republish requires written permission from the authors.

About Penn
  • The University of Pennsylvania was founded by Ben Franklin in 1740
  • Penn is part of the Ivy League
  • Located in western Philadelphia
  • Community of more than 35,000 people

General Goals
  • Networked services available as expected by our users
  • Minimized time to repair (TTR) when outages do occur
  • Ability to perform maintenance and upgrades (planned downtime) non-disruptively
  • Cost effectiveness in meeting these goals

Definitions
  • Availability
  • High Availability (HA)
  • Rapid Recovery (RR)
  • Disaster Recovery (DR)
  • Basic Systems

Definitions
  • Disaster Recovery (DR): the process of restoring a service to full operation after an interruption in service

Definitions
  • Basic System: a {Network, System, Service} with only the most basic of protections against outages
  • Examples:
      • A network recoverable using spare parts
      • A single computer system with RAID disk
      • A service recoverable from tape backups

Definitions
  • Availability: the percentage of total time that a {Network, System, Service} is available for use
  • Related points:
      • Advertised periods of availability
      • Availability as advertised
      • Absolute availability

Definitions
  • High Availability (HA): a {Network, System, Service} with specific design elements intended to keep availability above a high threshold (e.g., 99.99%)
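To make an availability threshold concrete, it helps to convert the percentage into a yearly downtime budget. The sketch below is illustrative only: the function name and the 365-day year are assumptions, not part of the original slides.

```python
# Hypothetical helper: converts an availability target into a downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime, in minutes per year, permitted by a target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% availability allows {minutes:.2f} minutes of downtime per year")
```

Under these assumptions a 99.99% target leaves roughly 53 minutes of total downtime per year, which is why such targets generally require redundancy rather than fast manual repair.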

Definitions
  • Rapid Recovery (RR): a {Network, System, Service} with specific design elements intended to recover from downtime very quickly (e.g., 15 minutes)

Metrics
  • Economics of high availability (the costs of unavailability)
  • Calculating availability
  • How availability measurements are performed

Economics of high availability
  • What is the cost of an outage in your:
      • Student courseware systems and student record systems
      • Financial systems
      • Primary campus web site and email servers
      • DNS, DHCP and AuthN systems
      • Internet connection(s)
      • Development / Gifts systems
  • How much should you be willing to spend to minimize downtime of any or all of these?

Calculating availability
  • Availability can be measured directly through periodic polling (e.g., SNMP, Mon, Nagios)
  • A formula for predicting availability of a single component:

    $A = \dfrac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{TTR}}$ or, equivalently, $A = 1 - \dfrac{\mathrm{TTR}}{\mathrm{MTBF} + \mathrm{TTR}}$
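Both approaches on this slide (prediction from MTBF and TTR, and direct measurement by polling) reduce to a few lines of arithmetic. The following is a minimal sketch; the function names are hypothetical, and the poll results are assumed to be a plain sequence of up/down booleans rather than output from any particular monitoring tool.

```python
from typing import Iterable


def predicted_availability(mtbf_hours: float, ttr_hours: float) -> float:
    """Predict single-component availability as MTBF / (MTBF + TTR)."""
    if mtbf_hours <= 0 or ttr_hours < 0:
        raise ValueError("MTBF must be positive and TTR must be non-negative")
    return mtbf_hours / (mtbf_hours + ttr_hours)


def measured_availability(poll_results: Iterable[bool]) -> float:
    """Estimate availability as the fraction of periodic polls that succeeded."""
    results = list(poll_results)
    if not results:
        raise ValueError("at least one poll result is required")
    return sum(1 for ok in results if ok) / len(results)


if __name__ == "__main__":
    # A component that fails about once every 9,999 hours and takes 1 hour to repair.
    print(predicted_availability(9999, 1))
    # 100 polls, of which one found the service down.
    print(measured_availability([True] * 99 + [False]))
```

Note that polling only samples the service: an outage shorter than the polling interval can be missed entirely, so the measured figure is an estimate whose resolution depends on how often the poll runs.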

Design Principles
  • Towards HA:
      • Minimize points of catastrophic failure
      • Maximize redundancy
      • Minimize fault zones
      • Minimize complexity and cost
  • Applying the above principles to:
      • Networks
      • Systems
      • Services

Specific examples at Penn
  • High Availability Services
  • Rapid Recovery Services

High Availability Design
  • Strategies employed to achieve HA:
      • Server redundancy
      • Hardware component redundancy
      • Storage redundancy (RAID)
      • Network redundancy
      • Redundant power, A/C, cooling, etc.
      • Application protocols that can transparently fail over to alternate servers
      • Secondary offsite hosting (of some services like DNS)
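The benefit of server redundancy can be estimated with the standard formula for components in parallel, 1 - (1 - a)^n. This is a rough sketch under two strong assumptions that are not from the slides: replicas fail independently, and clients fail over to a surviving replica perfectly. The function name is hypothetical.

```python
def redundant_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` interchangeable servers in parallel.

    Assumes independent failures and perfect failover, so the service is
    down only when every replica is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single-server availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("at least one replica is required")
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, redundant_availability(0.99, n))
```

In practice replicas often share a dependency such as power, a router, or a machine room, which makes failures correlated and the real figure lower than this estimate. That is one reason to place replicas in separate locations on separate networks.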

Rapid Recovery Design
  • Strategies employed to achieve RR:
      • Standby servers and storage
      • Some HA design elements: hardware redundancy, storage redundancy, network redundancy, power and A/C redundancy, etc.
  • Note: services deployed in the RR model typically don't have an easy way to transparently fail over to alternate servers (e.g., e-mail, Web)

Network Aggregation Point
  • Abbreviation: NAP
  • Machine rooms in separate campus locations that house critical network electronics and servers
  • Good environmentals and connectivity to campus fiber-optic cable plant
  • Both HA and RR services utilize multiple NAPs

Central Infra. Networks
  • AKA "NOC Networks" (historical name)
  • 3 highly redundant IP networks that house systems providing critical infrastructure services
  • Each network is triply connected to campus routing core via distinct NAP locations
  • Use of router redundancy protocols (VRRP) and Layer-2 path redundancy (802.1D) for high availability

HA Server Platforms
  • Two sets of three replicated servers:
      • 3 KDC servers: central authentication
      • 3 NOC servers: everything else
  • Kerberos runs on separate systems mainly for security reasons

High Availability: KDCs
  • KDCs (3):
      • 3 distinct machines (kdc1, kdc2, kdc3)
      • Each located in a different campus machine room
      • Each connected to a distinct IP network
          • Via a distinct IP core router
          • Additionally, each network is triply connected to the campus routing core via 3 NAPs

High Availability: NOCs
  • 3 "NOC" systems (a historical name)
      • Provide: DNS, DHCP, NTP, RADIUS plus a few homegrown services
      • Same physical and network connectivity as the KDCs
      • In addition: some servers have a secondary interface on a different NOC network (for reasons to be explained later)

HA Application Failover
  • Kerberos
  • DNS
  • RADIUS
  • NTP
  • DHCP
      • Current spec supports only 2 failover systems
  • Non-HA homegrown services: PennNames
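Protocols such as Kerberos, DNS, RADIUS and NTP achieve transparent failover because the client is configured with several servers and moves on to the next one when a server does not respond. The sketch below illustrates that general pattern only; it is not how any of these protocols is actually implemented. The TCP connection test stands in for a real protocol exchange, and all names are hypothetical.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Address = Tuple[str, int]


def tcp_probe(address: Address, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to `address` succeeds within `timeout`."""
    try:
        with socket.create_connection(address, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def first_reachable(
    servers: Iterable[Address],
    probe: Callable[[Address], bool] = tcp_probe,
) -> Optional[Address]:
    """Try each configured server in order and return the first that responds."""
    for address in servers:
        if probe(address):
            return address
    return None  # every configured server failed
```

A caller would pass its list of replicas, for example `first_reachable([("kdc1.example.edu", 88), ("kdc2.example.edu", 88)])`, where the hostnames are placeholders. Taking `probe` as a parameter keeps the failover logic testable without a network.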

Rapid Recovery service
  • Example: e-mail and Web service
  • A set of servers and storage is replicated at two sites: primary and standby
  • Primary site: active servers and storage
  • Secondary site: standby servers and replicated storage
  • Data from the 1st site is synchronously replicated to the 2nd
  • Two separate Fibre Channel networks interconnect systems and storage at both sites
  • Catastrophic failure event: system can be manually reconfigured to use the standby servers and/or secondary storage (~30 minutes)
  • Servers are located on the HA primary infrastructure network

Experiences at Penn
  • Where these approaches have been helpful
      • Higher availability, non-disruptive maintenance
  • Where they have not
      • Complexity can be hard to manage!
  • Where cost has been high
      • Replicated systems and networks, high-end storage solutions
  • Real availability experience
      • DNS, a critical service, went from 99.0% to 99.999% availability!

Future Enhancements
  • Making RR services highly available:
      • "Clustering", IETF rserpool, etc.
      • Metropolitan area DR (or better)
  • Others:
      • IP Multipathing
      • Trunking links to servers
          • 802.3ad, SMLT, DMLT or similar
      • Rapid Spanning Tree (IEEE 802.1w)
      • Multi-master KADM service
      • Improved management and monitoring infrastructure

Feedback
  • Questions, comments
  • Your designs, experiences, successes
  • Contact Info: deke@isc.upenn.edu