- Number of slides: 27
No Fallen ANGELs! Redundancy, Backup, Recovery
- Andrea Chappell: University of Waterloo
- Adam Hauerwas: Providence College
- Ruomiao Wang & Jie Li: Kelley Direct, Indiana University
- Terry O'Heron & Crystal Foust: Penn State
Agenda
- How do you back up and archive courses?
  - What policies and procedures guide your response to requests to recover a course, a file, an internal ANGEL page, or a student upload file?
- How do you protect your system from various failures, and how quickly do you “promise” to have it back online?
University of Waterloo (Andrea)
- ANGEL has been the centrally supported LMS since summer 2004.
- Core to university business.
- Need to configure against various types of failures, e.g.:
  - Disaster (fire, flooding, etc.)
  - Partial system failure (ANGEL/IIS or SQL Server systems, disks, etc.)
Constraints (what we can’t change)
- Support coverage is not 24x7: Central IT (IST) provides extended support for critical systems but not 24x7 support.
- Cannot survive lengthy power outages.
- Cannot survive some network outages.
  - Network support is also not 24x7.
Backup Processes
- System data backup
  - Database (dump of db file), transaction logs (cut once per day), and upload files are backed up nightly by the campus backup service.
- Course archives
  - Long term: Archive courses at end of term.
  - Shorter term: Remove from system after 4 terms. (Note: to offer a course again, copy the course rather than reuse the same instance.)
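The "remove from system after 4 terms" rule can be sketched as a small filter. This is only an illustration: the `Course` record, the consecutive `term_index` numbering, and the reading of "after 4 terms" as "at least 4 terms old" are all assumptions, not part of ANGEL or of Waterloo's actual process.

```python
from dataclasses import dataclass
from typing import List

RETENTION_TERMS = 4  # from the slide: remove from system after 4 terms


@dataclass(frozen=True)
class Course:
    course_id: str   # hypothetical identifier
    term_index: int  # hypothetical: terms numbered consecutively (0, 1, 2, ...)


def courses_to_remove(courses: List[Course], current_term: int,
                      retention_terms: int = RETENTION_TERMS) -> List[Course]:
    """Return courses that are at least `retention_terms` terms old.

    Assumes each course was already archived at the end of its own term,
    so removing it from the live system loses nothing.
    """
    return [c for c in courses if current_term - c.term_index >= retention_terms]
```

A course from term 0 would be selected once the current term reaches 4; courses from terms 1 through 4 would stay on the system.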
Recovery Process
- Recover data to dev system and copy lost data to production.
  - This can be very complex if the missing data is a quiz that was run, a bulletin board, etc.!
- Currently no policies on what to recover, and no promised time to recovery. Requests are considered on an individual basis.
Protecting against failures
- Current strategy: Buy robust equipment, configure to minimize points of failure.
- Production systems: ANGEL/IIS (Dell server) and SQL Server (Dell server)
  - Dual RAID disks
  - Dual power supply
  - 7x24 4-hour hardware support (from vendor)
  - Housed in access-controlled machine room
  - Uninterruptible power supply
- Development system: ANGEL/IIS and SQL Server
Vulnerabilities in Current Strategy
- ANGEL/IIS or SQL Server hardware failure, e.g., system motherboard failure
  - Don’t have a ready backup machine.
    - Could temporarily use development system.
  - Likely a minimum of half a day of downtime.
- Machine room “fire”
  - All hardware lost.
  - Up to one day of lost data (if 24 hours from last backup).
  - Days of downtime!
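The "up to one day of lost data" figure follows directly from the nightly backup cycle: whatever was written after the last completed backup is lost with the hardware. A minimal sketch of that arithmetic, where the 01:00 backup time is a made-up example value and not Waterloo's actual schedule:

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Time span of work lost if everything after `last_backup` is destroyed."""
    if failure < last_backup:
        raise ValueError("failure cannot precede the last backup")
    return failure - last_backup


# Hypothetical example: nightly backup finished at 01:00, fire at 23:30 the same day.
lost = data_loss_window(datetime(2006, 5, 10, 1, 0), datetime(2006, 5, 10, 23, 30))
print(lost)  # 22:30:00
```

With one backup per day the window approaches 24 hours just before the next backup runs, which is the worst case quoted on the slide.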
Configurations under Investigation
Looking for faster recovery time and less potential data loss through increased redundancy.
- Config 1: Identical production and development systems, different locations.
- Config 2: Identical production and dev systems, shared data (data filer), load balancer (Cisco), different locations.
Config 1
- Identical production and development systems, different locations.
- Systems: ANGEL/IIS (Dell server), SQL Server (Dell server)
- Gains:
  - In system failure:
    - If possible, move disks to duplicate system: 4 working hours.
    - Or, recover data to duplicate systems: perhaps 8 working hours.
- Issues:
  - People intervention still required.
- Cost:
  - Two new systems.
Config 2
- Identical prod and dev systems, shared data, load balancer, different locations.
- Components: Load Balancer, ANGEL/IIS (Dell server), SQL Server (Dell server), Data Filer
- Gains:
  - Failure of one ANGEL/IIS system: instantaneous failover to the remaining system.
  - Failure of SQL Server: reconfigure dev system to point to data filer.
- Issues:
  - Single point of failure unless filer clustered.
  - Greater complexity may cause downtime.
- Cost:
  - 3 new systems, plus filer (~$30 USD)
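The failover gain in Config 2 comes from the load balancer routing requests only to a web server that passes a health check. The sketch below shows that selection logic in its simplest form; it is not how the Cisco load balancer is implemented, and the host names and the `is_healthy` callback are hypothetical.

```python
from typing import Callable, Optional, Sequence


def pick_backend(backends: Sequence[str],
                 is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first backend that passes its health check, or None if all are down."""
    for backend in backends:
        if is_healthy(backend):
            return backend
    return None


# Hypothetical hosts: the first ANGEL/IIS server has failed, the second is up.
status = {"angel-a.example.edu": False, "angel-b.example.edu": True}
print(pick_backend(list(status), lambda host: status[host]))  # angel-b.example.edu
```

Note that this covers only the ANGEL/IIS tier. As the slide points out, the data filer stays a single point of failure unless it is clustered.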
Providence College (Adam)
- Like Waterloo, we use ANGEL as our LMS (since Fall 2001).
- Support coverage is not 24x7.
- Cannot survive lengthy power outages or network outages.
PC Backup and Recovery
- System data backup
  - Back up database and logs to files once per day.
  - Use Tivoli to back up both DB and file system nightly.
  - Creates “backup of a backup.”
- Course archives
  - Short term: Archive courses 90 days after term end.
  - Long term: Store archives to DVD.
- Recovery
  - Like Waterloo, recover Production database in Development environment.
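The 90-day archive rule is simple date arithmetic. A minimal sketch, assuming the term end date is known per course; the function names and the example date are illustrative and not part of ANGEL or Tivoli:

```python
from datetime import date, timedelta

ARCHIVE_DELAY = timedelta(days=90)  # from the slide: archive 90 days after term end


def archive_date(term_end: date) -> date:
    """First day on which a course from this term may be archived."""
    return term_end + ARCHIVE_DELAY


def is_due_for_archive(term_end: date, today: date) -> bool:
    """True once the 90-day waiting period has passed."""
    return today >= archive_date(term_end)


# Hypothetical term ending 15 May 2006.
print(archive_date(date(2006, 5, 15)))  # 2006-08-13
```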
PC’s Redundancy
- Today: Robust Production Server
- Production system: ANGEL IIS/SQL (HP DL380)
  - Multiple RAID disks (System, DB, Data)
  - Dual power supplies and NICs
  - Access-controlled machine room
  - UPS
- Development system: ANGEL IIS/SQL (Desktop)
PC’s Future Architecture
- This summer: new server and SAN
  - Production system: ANGEL IIS/SQL (new HP)
  - Development system: ANGEL IIS/SQL (old HP)
  - Shared storage: IBM Storage Area Network
- Purchase new server and install O/S and SQL Server on local RAID.
- Store database and web files on SAN disk.
- In the event of Production hardware failure, connect Production disk to Development server with little downtime.
Kelley Direct On-Line Programs, Indiana University (Ruomiao)
- Road to ANGEL
  - Piloted ANGEL as LMS in Fall 2003
  - Spring 2004: all courses delivered via ANGEL
  - Critical learning platform that connects KD to the students
Kelley Direct On-Line Programs, Indiana University
- Current Data Protection Measures
  - System backups
    - Full backups once a week, starting Friday night
    - Differential backups every night around 11 PM
  - Database backups
    - Full ANGEL SQL database backup every night at 10 PM. The database backup output files are then backed up by the system tape backups for that night.
    - Transaction log backups every six hours.
  - The backup tapes are then taken to an offsite location.
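With a nightly full database backup plus transaction log backups every six hours, a restore replays the latest full backup and then every log backup taken after it. The sketch below works out that chain and the resulting data loss for a given failure time. The log clock times (00:00, 06:00, 12:00, 18:00) are an assumption, since the slide only says "every six hours", and backups are treated as instantaneous for simplicity.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

FULL_BACKUP_HOUR = 22              # from the slide: full database backup at 10 PM
LOG_BACKUP_HOURS = (0, 6, 12, 18)  # assumption: the slide gives no clock times


def restore_plan(failure: datetime) -> Tuple[datetime, List[datetime], timedelta]:
    """Return (full backup to restore, log backups to replay, data lost)."""
    # Most recent 10 PM full backup at or before the failure.
    full = failure.replace(hour=FULL_BACKUP_HOUR, minute=0, second=0, microsecond=0)
    if full > failure:
        full -= timedelta(days=1)

    # Log backups taken after that full backup and no later than the failure.
    logs = []
    day = full.replace(hour=0)
    while day <= failure:
        for hour in LOG_BACKUP_HOURS:
            taken = day.replace(hour=hour)
            if full < taken <= failure:
                logs.append(taken)
        day += timedelta(days=1)

    last_recoverable = logs[-1] if logs else full
    return full, logs, failure - last_recoverable
```

For a hypothetical failure at 15:30, this picks the previous night's 10 PM full backup, replays the 00:00, 06:00, and 12:00 logs, and reports 3.5 hours of lost data. Under these assumptions the loss never reaches six hours.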
Kelley Direct On-Line Programs, Indiana University
- Current System Protection Measures
  - Disk
    - Configured with RAID 5 with a spare disk
  - Dual power connections
  - UPS system connection (30 min.)
  - Spare chassis
    - Test server has identical hardware and serves as a spare chassis
Kelley Direct On-Line Programs, Indiana University
- Current Recovery Practices
  - File or database restore
    - Restore from disk, tape backups, or individual developers’ machines.
  - System component failure
    - Replace the faulty component(s) from the spare chassis (test server), or move the entire disk array from production to the test server.
  - Total system failure or disk array failure
    - Rebuild the entire system, possibly on alternate hardware.
    - All the ANGEL components will need to be either installed from scratch or restored from backup tapes.
    - Some system components have to be reconfigured manually.
Kelley Direct On-Line Programs, Indiana University
- Challenges for KD ANGEL Environment
  - Security
    - ANGEL web server resides on the same physical machine that hosts the ANGEL databases
  - Scalability
    - Limited capability to scale performance based on volume
  - Availability
    - No redundancy built in. Single server design. Any component failure means downtime.
    - Shrinking maintenance window (or do we still have one?)
  - (continued on next slide)
Kelley Direct On-Line Programs, Indiana University
- Challenges for KD ANGEL Environment (continued)
  - Storage capacity
    - Limited expansion capability
  - Recoverability
    - Single copy of production data on disk. Tape restoration is time consuming and means data loss.
  - Availability
    - No redundancy built in. Single server design. Any component failure means downtime.
  - Growth
    - Significant enrollment growth is expected for the programs in the next three years.
  - Development environment
    - Developers are coding on their own machines. Configurations differ from the production environment. Less efficient.
Kelley Direct On-Line Programs, Indiana University
- Some Questions
  - How can backend infrastructure better support the vision of the on-line programs?
  - How to plan system capacity when the program changes (such as enrollment growth)?
  - How to better protect student data?
  - What are the available options for long-term data retention?
  - How to better meet the requirements for less service interruption?
  - What should we do to ensure a faster ANGEL systems recovery?
Penn State Environment (Terry, Crystal)
- Support coverage is 24x7
- Backup power (generator)
- Redundant network connectivity
- Failover capability
- Mirrored storage
- Daily backups/off-site storage
- Daily maintenance (5-7 am)
- Archive (courses, inactive groups)
Constraints
- Backup
  - SQL: 3 hours
  - File: 3-4 days
- Restoration
  - SQL: 1.5 hours
  - File: 2 min. - ??