- Number of slides: 27
Achieving Continuity of Operations (COOP)
Disaster Recovery of Technology Services: Issues, Strategies, Directions
Presented by Dave Purdy, 6-23-2005
© 2005 EMC Corporation. All rights reserved.
Ever-increasing need for COOP: People, Data, and Services Availability
• Drivers/trends for improved recoverability and/or availability of services:
– Current measures increasingly deemed inadequate
– Physical vs. electronic transport of data
– Melding of “DR” and “Operational Availability”
– Self-insurance for DR
– Public safety/service availability vs. cost
• Maturity in understanding COOP issues:
– Recovery vs. restart
– Identification of app/DB inter-dependencies
– DR vs. operational availability (HA)
• Breaking down the problem:
– Information availability
– Application availability
Production Availability and Disaster Recovery: Converging?
[Diagram labels: “DR”, Insurance, “CA”, “HA”, ROI]
• Disaster: natural or man-made (<1% of occurrences)
– Flood, fire, earthquake
– Contaminated building
• Unplanned occurrences: failure (13% of occurrences)
– Database corruption
– Component failure
– Human error
• Planned occurrences: competing workloads (87% of occurrences)
– Backup, reporting
– Data warehouse extracts
– Application and data restore
Creating a context: Government moving up the Continuum
Differing levels of IT architectural dependency with regard to availability strategies
[Diagram: a continuum running from 100% procedural (0% IT architectural redundancy) to 100% automatic (100% IT architectural redundancy). Labels along the continuum, in extracted order: low security; manual resources; consumer goods manufacturing; food manufacturer; banks; financial services; essential telecommunications services; government, airlines, hospital; high volume; low failsafe; low volume; non-critical business; small industries; 24 hrs x 7 days; high failsafe; manufacturing; retail; transportation; logistics; transparent failsafe; high security.]
Availability Drivers
• Increased realization that critical services depend on IT availability
• Pervasive requirements to protect people and data
• Increasingly real-time nature of “transactions”
• “Lost” transactions cannot be re-created
• Increased recognition that traditional recovery from tape is no longer viable
• New vision: merger of production and DR disciplines to focus on continuous availability
• Public service, safety, and inter-agency dependencies driving criticality of COOP
Traditional Disaster Recovery: Tape
[Timeline diagram: an RPO axis (weeks, days, hours, minutes, seconds) covering tape backup and offsite storage, and an RTO axis (seconds, minutes, hours, days, weeks) covering retrieve tape, set up systems, and restore from tape]
Tape Backup with Offsite Tape Storage
• RPO = 24+ hours, or time of last backup stored offsite
• RTO = 24-96 hours, or time required to restart operations
• Transport tapes to recovery site
• Set up systems to receive data
• Restore from tape
• Synchronize systems and DB for resumption
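The tape slide's two measures can be worked through as simple arithmetic. The sketch below assumes RPO is the time elapsed since the last backup that reached offsite storage, and RTO is the sum of the recovery steps run one after another. The dates and step durations are hypothetical figures chosen only to fall inside the ranges on the slide, and `tape_rpo` and `tape_rto` are illustrative names.

```python
from datetime import datetime, timedelta


def tape_rpo(last_offsite_backup: datetime, outage: datetime) -> timedelta:
    """Data-loss window: everything written after the last tape went offsite."""
    return outage - last_offsite_backup


def tape_rto(step_durations: dict[str, timedelta]) -> timedelta:
    """Downtime: the recovery steps run in sequence, so their durations add up."""
    return sum(step_durations.values(), timedelta())


# Hypothetical figures for illustration only (not from the presentation).
last_backup = datetime(2005, 6, 21, 22, 0)
outage = datetime(2005, 6, 23, 9, 0)
steps = {
    "transport tapes to recovery site": timedelta(hours=8),
    "set up systems to receive data": timedelta(hours=12),
    "restore from tape": timedelta(hours=24),
    "synchronize systems and DB": timedelta(hours=6),
}

rpo = tape_rpo(last_backup, outage)
rto = tape_rto(steps)
print(f"RPO: {rpo.total_seconds() / 3600:.0f} h, RTO: {rto.total_seconds() / 3600:.0f} h")
```

With these assumed inputs the data-loss window is 35 hours and the restart time is 50 hours, both within the ranges the slide quotes for tape.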
Consistency=Usability: This is not a platform or application issue…
Getting all the data at the same time, across databases, applications, and platforms…
[Diagram: consistency groups spanning mainframe, UNIX, and Windows systems]
Patterns of DR Program Evolution
[Diagram: stages of evolution along an axis from Local to Remote]
• QuickShip, offsite vital records
• Commercial hotsite with electronic vaulting or replication
• Insourced DR & HA to 2nd site (passive, active)
• Insourced “CA” to 2/3 sites (active, triangulate)
Key Learnings:
• Restore is very different from restart
• Testing effectiveness and control: subset vs. full / hotsite vs. internal
• Application/agency inter-dependencies
• Traditional recovery and restore techniques being deemed inadequate
• Increased complexity (and benefit) in justifying “DR” versus “DR + HA” as the 2nd site becomes more integrated with the primary site
A Practical Approach to Unifying Requirements and IT Capabilities for Mutual Agreement…
[Chart: the customer problem area plotted by maximum acceptable data loss (RPO) and maximum acceptable downtime (RTO), on a scale of zero, seconds, minutes, hours, > 24 hrs. Technology labels, each shown from local to remote: tape backup & recovery; disk data replication; server clustering & virtualization. A rotated label reads “Market Requirements”.]
Availability Strategies: Disaster Recovery (DR), High Availability (HA), “Continuous Availability” (CA)
[Network diagram. In-region: primary and secondary sites, synchronous link. Out-region: secondary -or- tertiary site, asynchronous link. Commercial hotsite (SunGard, IBM BRCS), asynchronous link.]
Remote Replication Capability Continuum Summary
• Synchronous (source to target, limited distance)
– No data exposure
– Limited distance
• Asynchronous (source to target, unlimited distance)
– Seconds of data exposure
– No performance impact
– Unlimited distance
• Asynchronous Point-in-Time (source/prod to target, unlimited distance)
– Hours of data exposure
– No performance impact
– Unlimited distance
• Triangulated Synch & Asynch (primary site, 2nd site at limited distance, long-distance site at unlimited distance)
– Simultaneous synchronous and asynchronous
– Three-site awareness
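The continuum above can be read as a decision rule: tolerable data exposure and required distance together narrow the choice of replication mode. The sketch below is one hedged way to encode that reading. The function name `choose_replication` and the one-hour cut-off between "seconds" and "hours" of exposure are assumptions for illustration, not figures from the presentation.

```python
def choose_replication(max_data_loss_s: float, long_distance: bool) -> str:
    """Pick a replication mode from the continuum for a given requirement.

    max_data_loss_s: tolerable data exposure (RPO) in seconds.
    long_distance:   True if a copy must sit beyond synchronous range.
    """
    if max_data_loss_s <= 0:
        # Zero exposure needs a synchronous copy, which is distance-limited.
        # If a far copy is also required, pair it with an asynchronous leg.
        if long_distance:
            return "triangulated synch & asynch"
        return "synchronous"
    # Assumed cut-off: under an hour of tolerance calls for continuous
    # asynchronous replication; beyond that, point-in-time copies suffice.
    if max_data_loss_s < 3600:
        return "asynchronous"
    return "asynchronous point-in-time"


for rpo_s, far in [(0, False), (0, True), (30, True), (4 * 3600, True)]:
    print(rpo_s, far, "->", choose_replication(rpo_s, far))
```

In the triangulated case only the nearby synchronous copy has no data exposure; the long-distance copy is still asynchronous.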
Best Practices for Achieving Business Continuity
• Determine requirements / service levels
– System / application mapping
• Validate ability to achieve service-level agreements
– Evaluate costs / tradeoffs of technologies to meet service levels
• Create the right level of protection for your agency's (or inter-agency?) specific business and application requirements
• Integrate it
– Across information storage platforms
– Across processing infrastructure (servers, networks, applications)
– Across data centers and geographic locations
– Integrate with change management
Business Continuity Planning: Lessons from the Nation’s Capital in the Post-9/11 World
Mary Kaye Vavasour
eGov Services
Office of the Chief Technology Officer
District of Columbia
Recent History as Context
• 1996-1999
– Y2K made continuity a priority
– Internet made networks a focus and eGovernment a reality
• 2001
– 9-11: the unthinkable happened; security of data, network, and infrastructure became key to recovery
– Federal Patriot Act made continuity planning a legal mandate
• 2002
– Sarbanes-Oxley Act added more regulatory requirements
• 2003
– Hurricane Isabel caused regional power outages that lasted 4-7 days
Key Elements of Business Continuity Strategy
• High availability platform and procedures
• Proven Emergency Operations Process
– Detailed, service-based procedures
– Dedicated staff
– Regional coordination
– Frequent practice with planned events
• Focus on Continuity of Communications
– Public safety wireless network
– Public portal resiliency with specialized content
– High availability messaging platform
High Availability Platform; Centralized Process
• In-sourced, high availability Disaster Recovery
• Active-Active for availability
– Multiple servers behind hardware load balancers (millisecond failover)
– Separate web application and database tiers
– 95% of public web services covered (104 sites + main portal)
• Active-Passive for Disaster Recovery
– Two data centers
– Multiple types of replication
  • Cluster synch for dynamic portal content
  • MS/CRS for legacy applications and static pages
  • Database tier uses SQL and Oracle replication
• Future: tertiary site for continuous availability of portal
• Centralized failure recovery process run by senior staff
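The Active-Active arrangement, with several servers behind a load balancer that stops sending traffic to a failed node, can be illustrated in software. The District's platform uses hardware load balancers, so the sketch below is only a toy model of the idea: a round-robin pool that skips any server whose last heartbeat failed. The class name `Pool` and the server names are hypothetical.

```python
import itertools


class Pool:
    """Round-robin pool that routes only to servers with a healthy heartbeat."""

    def __init__(self, servers):
        self.healthy = {name: True for name in servers}
        self._ring = itertools.cycle(servers)

    def heartbeat(self, name, alive):
        """Record the latest heartbeat result for one server."""
        self.healthy[name] = alive

    def route(self):
        """Return the next healthy server, trying each server at most once."""
        for _ in range(len(self.healthy)):
            name = next(self._ring)
            if self.healthy[name]:
                return name
        raise RuntimeError("no healthy servers")


pool = Pool(["web-a", "web-b", "web-c"])
first = [pool.route() for _ in range(3)]   # all three servers take traffic
pool.heartbeat("web-b", False)             # web-b misses its heartbeat
after = [pool.route() for _ in range(4)]   # traffic continues on the others
print(first, after)
```

Because every node is already serving requests, losing one changes only the share of traffic each survivor takes; no standby has to be started.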
Comprehensive Emergency Operations Process
• Started with Y2K; focus on manual processes to back up automated systems
• Post 9/11: focus on continuity of services
• Dedicated staff = DC EMA + agency representatives + key service providers (utilities, suppliers, Federal public safety, regional emergency agency staff, etc.)
• Hardened site
• 14 Emergency Liaison Officers for key services
• Two-tiered operational structure (EOC and JIC)
• Clearly defined decision-making process and lines of authority
• Redundant communication channels with all levels of responders and the public at large
• Frequent practice, using planned events
Specialized Content for Public Communication
• Public portal’s Emergency Center provides detailed content for emergency response plans
• Extensive use of GIS-based content
– Content tailored to individual’s location
– Facilitates location of shelters, evacuation routes, and major transportation services
• Specialized “Emergency Mode” will take over entire portal during catastrophic events
Focus on Continuity of Communications
• Public safety wireless network for voice and data
• Federal and regional voice interoperability
• 99% of District geography is covered
• Dedicated transmission towers and mobile repeater systems
• Signals can penetrate thick building walls, metro system tunnels, underground locations
District of Columbia Office of the Chief Technology Officer – 1
Coverage Improvement With New Network
• Coverage Improvement With New MPD Network
Public Portal Resiliency
• Active-Active failover, with load balancing on heartbeat for high availability
• Actual = 99.99999%
• Active-Passive disaster recovery between two local data centers
• Future: tertiary site for continuous availability
GOAL = Never Go Dark
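To put the availability figure in perspective, a percentage can be converted into allowed downtime per year. The short sketch below does that arithmetic, assuming a 365-day year; the function name `downtime_seconds_per_year` and the comparison percentages 99.9 and 99.99 are illustrative additions.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; leap years ignored


def downtime_seconds_per_year(availability_pct: float) -> float:
    """Seconds of downtime per year implied by an availability percentage."""
    return (100.0 - availability_pct) / 100.0 * SECONDS_PER_YEAR


for pct in (99.9, 99.99, 99.99999):
    print(f"{pct}% -> {downtime_seconds_per_year(pct):,.1f} s/year")
```

Under that assumption, 99.99999% corresponds to roughly 3.2 seconds of downtime per year, compared with nearly nine hours at 99.9%.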
High Availability Messaging Platform
• Completely fault-tolerant email system enables government officials to communicate and share data during significant outages
• High-volume synchronous data replication between primary and secondary data centers, using EMC’s CLARiiON MirrorView
• Homeland Security funding ($900k) made public safety agencies the priority focus during implementation:
– MPD
– FEMS
– DMH
– CFSA
– DOC
• Can fail over email accounts, and the most recent data from 4 hours prior to the outage
Key Success Factors
• People
• Process
• Practice
Mary Kaye Vavasour
Program Manager, eGovernment Services
Office of the Chief Technology Officer
District of Columbia
marykaye.vavasour@dc.gov