- Number of slides: 112
FT NT: A Tutorial on Microsoft Cluster Server™ (formerly “Wolfpack”)
Joe Barrera, Jim Gray
Microsoft Research
{joebar, gray}@microsoft.com
http://research.microsoft.com/barc
© 1996, 1997 Microsoft Corp.

Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A

DEPENDABILITY: The 3 ITIES
- RELIABILITY / INTEGRITY: Does the right thing. (also large MTTF)
- AVAILABILITY: Does it now. (also small MTTR / (MTTF + MTTR))
- System Availability: if 90% of terminals are up and 99% of the DB is up, then about 89% of transactions are serviced on time.
- Holistic vs. Reductionist view
[Diagram labels: Integrity, Security, Reliability, Availability]

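The arithmetic on this slide can be checked with a minimal sketch. It assumes component failures are independent, so availabilities multiply; the function names are illustrative and not part of the original deck.

```python
def availability(mttf_hours, mttr_hours):
    """Fraction of time a component is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def series_availability(*parts):
    """Availability of a chain in which every part must be up.

    Assumes the parts fail independently, so the fractions multiply.
    """
    result = 1.0
    for part in parts:
        result *= part
    return result


# Terminals 90% up and DB 99% up: about 89% of transactions are serviced.
print(round(series_availability(0.90, 0.99), 3))
# A 10-week MTTF (1680 hours) with a 90-minute repair time.
print(round(availability(1680, 1.5), 5))
```

Unavailability is the complement, MTTR / (MTTF + MTTR), which is why shortening repair time helps as much as lengthening MTTF.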
Case Study - Japan
"Survey on Computer Security", Japan Info Dev Corp., March 1986. (trans: Eiichi Watanabe).

Cause                            Share of outages   MTTF
Vendor (hardware and software)   42%                5 months
Application software             25%                9 months
Communications lines             12%                1.5 years
Operations                       9.3%               2 years
Environment                      11.2%              2 years

- 1,383 institutions reported (6/84 - 7/85)
- 7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES
- To get a 10-year MTTF, must attack all these areas

Case Studies - Tandem Trends
- MTTF improved
- Shift from Hardware & Maintenance (from 50% to 10%) to Software (62%) & Operations (15%)
- NOTE: Systematic under-reporting of Environment, Operations errors, Application Software

Summary of FT Studies
- Current Situation: ~4-year MTTF => Fault Tolerance Works.
- Hardware is GREAT (maintenance and MTTF).
- Software masks most hardware faults.
- Many hidden software outages in operations:
  - New Software.
  - Utilities.
- Must make all software ONLINE.
- Software seems to define a 30-year MTTF ceiling.
- Reasonable Goal: 100-year MTTF. Class 4 today => class 6 tomorrow.

Fault Tolerance vs Disaster Tolerance
- Fault Tolerance: masks local faults
  - RAID disks
  - Uninterruptible Power Supplies
  - Cluster Failover
- Disaster Tolerance: masks site failures
  - Protects against fire, flood, sabotage, ...
  - Redundant system and service at remote site.

The Microsoft “Vision”: Plug & Play Dependability
- Transactions: for reliability
- Clusters: for availability
- Security
- All built into the OS
[Diagram labels: Integrity / Reliability, Security, Availability]

Cluster Goals
- Manageability
  - Manage nodes as a single system
  - Perform server maintenance without affecting users
  - Mask faults, so repair is non-disruptive
- Availability
  - Restart failed applications & servers
    - unavailability ~ MTTR / MTBF, so quick repair matters.
  - Detect/warn administrators of failures
- Scalability
  - Add nodes for incremental processing, storage, and bandwidth

Fault Model
- Failures are independent, so single fault tolerance is a big win
- Hardware fails fast (blue-screen)
- Software fails fast (or goes to sleep)
- Software often repaired by reboot:
  - Heisenbugs
- Operations tasks: major source of outage
  - Utility operations
  - Software upgrades

Cluster: Servers Combined to Improve Availability & Scalability
- Cluster: A group of independent systems working together as a single system. Clients see scalable & FT services (single system image).
- Node: A server in a cluster. May be an SMP server.
- Interconnect: Communications link used for intra-cluster status info such as “heartbeats”. Can be Ethernet.
[Diagram: Client PCs and Printers; Server A and Server B joined by the Interconnect; Disk array A and Disk array B]

Microsoft Cluster Server™
- 2-node availability Summer 97 (20,000 Beta Testers now)
  - Commoditize fault-tolerance (high availability)
  - Commodity hardware (no special hardware)
  - Easy to set up and manage
  - Lots of applications work out of the box.
- 16-node scalability later (next year?)

Failover Example
[Diagram: Browser; Server 1 and Server 2, each able to host the Web site and the Database; shared Web site files and Database files]

MS Press Failover Demo
- Client/Server
- Software failure
- Admin shutdown
- Server failure
[Legend: Resource States: Pending, Partial, Failed, Offline]

Demo Configuration
- Server “Alice” and Server “Betty”, each with:
  - SMP Pentium® Processors
  - Windows NT Server with Wolfpack
  - Microsoft Internet Information Server
  - Microsoft SQL Server
  - Local Disks
- Shared Disks in a SCSI Disk Cabinet
- Interconnect: standard Ethernet
- Administrator: Windows NT Workstation, Cluster Admin, SQL Enterprise Mgr
- Client: Windows NT Workstation, Internet Explorer, MS Press OLTP app
[Diagram label: Windows NT Server Cluster]

Demo Administration
- Cluster Admin Console
  - Windows GUI
  - Shows cluster resource status
  - Replicates status to all servers
  - Define apps & related
- SQL Enterprise Mgr
  - Windows GUI
  - Shows server status
  - Manages many servers
[Diagram labels: Server “Alice”, Server “Betty”; Runs SQL Trace; Runs Globe; Local Disks; Shared Disks in SCSI Disk Cabinet; Windows NT Server Cluster; Client]

Generic Stateless Application: Rotating Globe
- Mplay32 is a generic app.
- Registered with MSCS
- MSCS restarts it on failure
- Move/restart ~ 2 seconds
- Fail-over if
  - 4 failures (= process exits)
  - in 3 minutes
  - settable default

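The "4 failures in 3 minutes" rule can be sketched as a sliding-window counter. This is an illustration only: the class name is hypothetical, and it assumes that the fourth failure inside the window is the one that triggers failover, which the slide does not spell out.

```python
from collections import deque


class RestartPolicy:
    """Restart locally until too many failures land inside the time window."""

    def __init__(self, threshold=4, period_s=180.0):
        self.threshold = threshold      # failures that trigger failover
        self.period_s = period_s        # window length in seconds
        self._failures = deque()        # timestamps of recent failures

    def on_failure(self, now_s):
        """Record a failure at time now_s and return 'restart' or 'failover'."""
        self._failures.append(now_s)
        # Forget failures that have aged out of the window.
        while now_s - self._failures[0] > self.period_s:
            self._failures.popleft()
        if len(self._failures) >= self.threshold:
            self._failures.clear()
            return "failover"
        return "restart"


policy = RestartPolicy()
print([policy.on_failure(t) for t in (0, 10, 20, 30)])
```

Failures spaced further apart than the window never accumulate, so an occasional crash only causes a local restart.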
Demo: Moving or Failing Over An Application
- Alice fails or operator requests move
[Diagram: AVI Application and Local Disks on each node; Shared Disks in SCSI Disk Cabinet; Windows NT Server Cluster]

Generic Stateful Application: NotePad
- Notepad saves state on shared disk
- Failure before save => lost changes
- Failover or move (disk & state move)

Demo Step 1: Alice Delivering Service
Demo Step 2: Request Move to Betty
Demo Step 3: Betty Delivering Service
Demo Step 4: Power Fail Betty, Alice Takeover
Demo Step 5: Alice Delivering Service
Demo Step 6: Reboot Betty, now can takeover
[Diagrams for steps 1-6: the two servers, each with IIS, SQL (via ODBC), IP/HTTP, and Local Disks; Shared Disks in SCSI Disk Cabinet; Windows NT Server Cluster. Labels mark which server shows “SQL Activity” and which shows “No SQL Activity”.]

Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A

Cluster and NT Abstractions
[Diagram: Cluster Abstractions (Cluster, Group, Resource) alongside NT Abstractions (Domain, Node, Service)]

Basic NT Abstractions
- Service: program or device managed by a node
  - e.g., file service, print service, database server
  - can depend on other services (startup ordering)
  - can be started, stopped, paused, failed
- Node: a single (tightly-coupled) NT system
  - Node hosts services; belongs to a domain
  - services on node always remain co-located
  - unit of service co-location; involved in naming services
- Domain: a collection of nodes
  - cooperation for authentication, administration, naming

Cluster Abstractions
- Resource: program or device managed by a cluster
  - e.g., file service, print service, database server
  - can depend on other resources (startup ordering)
  - can be online, offline, paused, failed
- Resource Group: a collection of related resources
  - Resource Group hosts resources; belongs to a cluster
  - unit of co-location; involved in naming resources
- Cluster: a collection of nodes, resources, and groups
  - cooperation for authentication, administration, naming

Resources
Resources have...
- Type: what it does (file, DB, print, web…)
- An operational state (online/offline/failed)
- Current and possible nodes
- Containing Resource Group
- Dependencies on other resources
- Restart parameters (in case of resource failure)

Resource Types
- Built-in types
  - Generic Application
  - Generic Service
  - Internet Information Server (IIS) Virtual Root
  - Network Name
  - TCP/IP Address
  - Physical Disk
  - FT Disk (Software RAID)
  - Print Spooler
  - File Share
- Added by others
  - Microsoft SQL Server
  - Message Queues
  - Exchange Mail Server
  - Oracle
  - SAP R/3
  - Your application? (use developer kit wizard)

Physical Disk
TCP/IP Address
Network Name
File Share
IIS (WWW/FTP) Server
Print Spooler

Resource States
- Resource states:
  - Offline: exists, not offering service
  - Online: offering service
  - Failed: not able to offer service
- Resource failure may cause:
  - local restart
  - other resources to go offline
  - resource group to move
  - (all subject to group and resource parameters)
- Resource failure detected by:
  - Polling failure
  - Node failure
[State diagram: Offline, Online Pending, Online, Offline Pending, Failed; messages “Go Online!”, “I’m Online!”, “Go Off-line!”, “I’m Off-line!”, “I’m here!”]

Resource Dependencies
- Similar to NT Service Dependencies
- Orderly startup & shutdown
  - A resource is brought online after any resources it depends on are online.
  - A resource is taken offline before any resources it depends on.
- Interdependent resources
  - Form dependency trees
  - move among nodes together
  - failover together as per resource group
[Diagram: IIS Virtual Root, File Share, Network Name, IP Address, Resource DLL]

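The ordering rule above is a topological sort of the dependency tree. The sketch below uses Python's standard `graphlib`; the specific edges between the resources are assumed for illustration, since the slide's diagram does not survive in text form.

```python
from graphlib import TopologicalSorter

# Maps each resource to the resources it depends on (edges are assumed).
depends_on = {
    "IIS Virtual Root": {"File Share", "Network Name"},
    "File Share": {"Physical Disk"},
    "Network Name": {"IP Address"},
    "IP Address": set(),
    "Physical Disk": set(),
}


def online_order(deps):
    """Dependencies first: a resource comes online after what it depends on."""
    return list(TopologicalSorter(deps).static_order())


def offline_order(deps):
    """Reverse order: a resource goes offline before what it depends on."""
    return list(reversed(online_order(deps)))


print(online_order(depends_on))
print(offline_order(depends_on))
```

`TopologicalSorter` raises `CycleError` on a circular dependency, which matches the requirement that dependencies form trees.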
Dependencies Tab

NT Registry
- Stores all configuration information
  - Software
  - Hardware
- Hierarchical (name, value) map
- Has an open, documented interface
- Is secure
- Is visible across the net (RPC interface)
- Typical Entry: \Software\Microsoft\MSSQLServer
  - DefaultLogin = “GUEST”
  - DefaultDomain = “REDMOND”

Cluster Registry
- Separate from local NT Registry
- Replicated at each node
  - Algorithms explained later
- Maintains configuration information:
  - Cluster members
  - Cluster resources
  - Resource and group parameters (e.g. restart)
- Stable storage
- Refreshed from “master” copy when node joins cluster

Other Resource Properties
- Name
- Restart policy (restart N times, failover…)
- Startup parameters
- Private configuration info (resource type specific)
  - Per-node as well, if necessary
- Poll Intervals (LooksAlive, IsAlive, Timeout)
- These properties are all kept in Cluster Registry

General Resource Tab
Advanced Resource Tab

Resource Groups
- Every resource belongs to a resource group.
- Resource groups move (failover) as a unit
- Dependencies NEVER cross groups. (Dependency trees are contained within groups.)
- Group may contain a forest of dependency trees
[Diagram: Payroll Group containing Web Server, IP Address, Drive E:, SQL Server, Drive F:]

Moving a Resource Group

Group Properties
- CurrentState: Online, Partially Online, Offline
- Members: resources that belong to group
  - members determine which nodes can host group.
- Preferred Owners: ordered list of host nodes
- FailoverThreshold: How many faults cause failover
- FailoverPeriod: Time window for failover threshold
- FailbackWindowStart: When can failback happen?
- FailbackWindowEnd: When can failback happen?
- Everything (except CurrentState) is stored in registry

Failover and Failback
- Failover parameters
  - timeout on LooksAlive, IsAlive
  - # local restarts in failure window
  - after this, offline.
- Failback to preferred node
  - (during failback window)
- Do resource failures affect group?
[Diagram: Node \\Betty and Node \\Alice; Failover and Failback arrows; Cluster Service, IPaddr, name]

Cluster Concepts
[Diagram: Cluster, Group, Resource]

Cluster Properties
- Defined Members: nodes that can join the cluster
- Active Members: nodes currently joined to cluster
- Resource Groups: groups in a cluster
- Quorum Resource:
  - Stores copy of cluster registry.
  - Used to form quorum.
- Network: Which network used for communication
- All properties kept in Cluster Registry

Cluster API Functions (operations on nodes & groups)
- Find and communicate with Cluster
- Query/Set Cluster properties
- Enumerate Cluster objects
  - Nodes
  - Groups
  - Resources and Resource Types
- Cluster Event Notifications
  - Node state and property changes
  - Group state and property changes
  - Resource state and property changes

Cluster Management

Demo
- Server startup and shutdown
- Installing applications
- Changing status
- Failing over
- Transferring ownership of groups or resources
- Deleting Groups and Resources

Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A

Architecture
- Top tier provides cluster abstractions
- Middle tier provides distributed operations
- Bottom tier is NT and drivers
[Architecture stack diagram, top to bottom: Failover Manager; Resource Monitor; Cluster Registry; Global Update; Quorum; Membership; Windows NT Server; Cluster Disk Driver; Cluster Net Drivers]

Membership and Regroup
- Membership:
  - Used for orderly addition and removal from { active nodes }
- Regroup:
  - Used for failure detection (via heartbeat messages)
  - Forceful eviction from { active nodes }
[Architecture stack diagram]

Membership
- Defined cluster = all nodes
- Active cluster:
  - Subset of defined cluster
  - Includes Quorum Resource
  - Stable (no regroup in progress)
[Architecture stack diagram]

Quorum Resource
- Usually (but not necessarily) a SCSI disk
- Requirements:
  - Arbitrates for a resource by supporting the challenge/defense protocol
  - Capable of storing cluster registry and logs
- Configuration Change Logs
  - Tracks changes to configuration database when any defined member is missing (not active)
  - Prevents configuration partitions in time

Challenge/Defense Protocol
- SCSI-2 has reserve/release verbs
  - Semaphore on disk controller
- Owner gets lease on semaphore
- Renews lease once every 3 seconds
- To preempt ownership:
  - Challenger clears semaphore (SCSI bus reset)
  - Waits 10 seconds
    - 3 seconds for renewal + 2 seconds bus settle time
    - x 2 to give owner two chances to renew
  - If still clear, then former owner loses lease
  - Challenger issues reserve to acquire semaphore

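The timing argument can be traced with a small discrete-time simulation. This is a toy model under stated assumptions: one tick per second, the reservation is a single variable, and the function and parameter names are invented for the sketch. Real SCSI reserve/release traffic is not modelled.

```python
def challenge(defender_alive, reset_at=7, renew_every=3, bus_settle=2):
    """Return who holds the reservation after a challenge.

    The challenger resets the bus at second reset_at, then waits
    2 * (renew_every + bus_settle) seconds before checking the semaphore.
    """
    wait = 2 * (renew_every + bus_settle)   # 10 seconds with the defaults
    reserved_by = "defender"
    for t in range(reset_at + wait + 1):
        if t == reset_at:
            reserved_by = None               # bus reset clears the semaphore
        elif defender_alive and t % renew_every == 0:
            reserved_by = "defender"         # live owner renews its lease
    if reserved_by is None:
        reserved_by = "challenger"           # still clear: challenger reserves
    return reserved_by


print(challenge(defender_alive=True))    # successful defense
print(challenge(defender_alive=False))   # successful challenge
```

Because the wait covers two renewal periods plus settle time, a live owner always gets at least one chance to re-reserve before the challenger looks again.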
Challenge/Defense Protocol: Successful Defense
[Timeline, seconds 0 to 16: the Defender Node issues Reserve periodically; the Challenger Node issues a Bus Reset; the Defender reserves again; the Challenger detects the reservation.]

Challenge/Defense Protocol: Successful Challenge
[Timeline, seconds 0 to 16: the Defender Node issues Reserve once; the Challenger Node issues a Bus Reset; no reservation is detected; the Challenger issues Reserve.]

Regroup
- Invariant: All members agree on { members }
- Regroup re-computes { members }
- Each node sends heartbeat message to a peer (default is one per second)
- Regroup if two lost heartbeat messages
  - suspicion that sender is dead
  - failure detection in bounded time
- Uses a 5-round protocol to agree.
  - Checks communication among nodes.
  - Suspected missing node may survive.
- Upper levels (global update, etc.) informed of regroup event.
[Architecture stack diagram]

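The heartbeat trigger reduces to counting missed intervals. A minimal sketch, assuming one heartbeat per second and a limit of two misses as stated on the slide; the function name and the rounding choice are illustrative.

```python
def should_regroup(last_heartbeat_s, now_s, interval_s=1.0, missed_limit=2):
    """True once at least missed_limit heartbeat intervals have elapsed
    since the last heartbeat was received from the peer."""
    missed = int((now_s - last_heartbeat_s) // interval_s)
    return missed >= missed_limit


print(should_regroup(last_heartbeat_s=0.0, now_s=1.5))  # one interval missed
print(should_regroup(last_heartbeat_s=0.0, now_s=2.5))  # two intervals missed
```

This only raises suspicion; as the slide notes, the regroup protocol itself decides whether the suspected node is really gone.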
Membership State Machine
[State diagram. States: Initialize, Sleeping, Member Search, Quorum Disk Search, Joining, Forming, Online Member, Regroup. Transition labels: Start Cluster; Search Fails; Found Online Member; Join Succeeds; Acquire (reserve) Quorum Disk; Search or Reserve Fails; Synchronize Succeeds; Lost Heartbeat; Minority or no Quorum; Non-Minority and Quorum]

Joining a Cluster
- When a node starts up, it mounts and configures only local, non-cluster devices
- Starts Cluster Service which
  - looks in local (stale) registry for members
  - Asks each member in turn to sponsor new node’s membership. (Stop when sponsor found.)
- Sponsor (any active member)
  - Sponsor authenticates applicant
  - Broadcasts applicant to cluster members
  - Sponsor sends updated registry to applicant
  - Applicant becomes a cluster member

Forming a Cluster (when Joining fails)
- Use registry to find quorum resource
- Attach to (arbitrate for) quorum resource
- Update cluster registry from quorum resource
  - e.g. if we were down when it was in use
- Form new one-node cluster
- Bring other cluster resources online
- Let others join your cluster

Leaving A Cluster (Gracefully)
- Pause:
  - Move all groups off this member.
  - Change to paused state (remains a cluster member)
- Offline:
  - Move all groups off this member.
  - Send ClusterExit message to all cluster members
    - Prevents regroup
    - Prevents stalls during departure transitions
  - Close Cluster connections (now not an active cluster member)
  - Cluster service stops on node
- Evict: remove node from defined member list

Leaving a Cluster (Node Failure)
- Node (or communication) failure triggers Regroup
- If after regroup:
  - Minority group OR no quorum device: group does NOT survive
  - Non-minority group AND quorum device: group DOES survive
- Non-Minority rule:
  - Number of new members >= 1/2 old active cluster
  - Prevents minority from seizing quorum device at the expense of a larger potentially surviving cluster
- Quorum guarantees correctness
  - Prevents “split-brain”, e.g. with newly forming cluster containing a single node

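The survival rule combines the two conditions above. A minimal sketch with illustrative names; it encodes only what the slide states (at least half of the old active cluster, and ownership of the quorum device).

```python
def group_survives(new_members, old_active_members, holds_quorum_device):
    """A post-regroup group survives only if it is a non-minority of the
    old active cluster AND it holds the quorum device."""
    non_minority = 2 * new_members >= old_active_members
    return non_minority and holds_quorum_device


# Two-node cluster splits 1/1: both halves are non-minority,
# so the quorum device alone decides which one lives.
print(group_survives(1, 2, holds_quorum_device=True))
print(group_survives(1, 2, holds_quorum_device=False))
# One node out of four is a minority even if it grabs the quorum device.
print(group_survives(1, 4, holds_quorum_device=True))
```

The two-node case shows why the quorum device is needed: the half rule alone cannot break a tie.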
Global Update
- Propagates updates to all nodes in cluster
- Used to maintain replicated cluster registry
- Updates are atomic and totally ordered
- Tolerates all benign failures.
- Depends on membership
  - all are up
  - all can communicate
- R. Carr, Tandem Systems Review, V1.2, 1985, sketches regroup and global update protocol.
[Architecture stack diagram]

Global Update Algorithm
- Cluster has locker node that regulates updates.
  - Oldest active node in cluster
- Send Update to locker node
- Update other (active) nodes
  - in seniority order (e.g. locker first)
  - this includes the updating node
- Failure of all updated nodes:
  - Update never happened
  - Updated nodes will roll back on recovery
- Survival of any updated nodes:
  - New locker is oldest and so has update if any do.
  - New locker restarts update
[Diagram labels: L (locker), S (sender), “X=100!”, ack]

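Why seniority order makes recovery simple can be seen in a toy model. The sketch below is an illustration under assumptions, not the MSCS implementation: nodes are plain dictionaries listed oldest first, and a partially applied update always covers a prefix of that list.

```python
def make_nodes(count):
    """Nodes ordered oldest first; nodes[0] is the locker."""
    return [{"id": i, "alive": True, "data": {}} for i in range(count)]


def apply_prefix(nodes, key, value, count):
    """Apply the update in seniority order to the first `count` nodes,
    as if the sender stopped part-way through."""
    for node in nodes[:count]:
        node["data"][key] = value


def recover(nodes, key):
    """The oldest survivor becomes locker and restarts the update if it has it."""
    survivors = [n for n in nodes if n["alive"]]
    locker = survivors[0]
    if key in locker["data"]:
        for node in survivors[1:]:
            node["data"][key] = locker["data"][key]
    return survivors


nodes = make_nodes(3)
apply_prefix(nodes, "X", 100, count=2)   # nodes 0 and 1 updated, node 2 not
nodes[0]["alive"] = False                # the old locker fails
print([n["data"] for n in recover(nodes, "X")])
```

Since updates reach nodes oldest first, any surviving copy of the update implies the oldest survivor has it too, so the new locker can always finish the job.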
Cluster Registry
- Separate from local NT Registry
- Maintains cluster configuration
  - members, resources, restart parameters, etc.
- Stable storage
- Replicated at each member
  - Global Update protocol
  - NT Registry keeps local copy
[Architecture stack diagram]

Cluster Registry Bootstrapping
- Membership uses Cluster Registry for list of nodes
  - …Circular dependency
- Solution:
  - Membership uses stale local cluster registry
  - Refresh after joining or forming cluster
  - Master is either
    - quorum device, or
    - active members
[Architecture stack diagram]

Resource Monitor
- Polls resources:
  - IsAlive and LooksAlive
- Detects failures
  - polling failure
  - failure event from resource
- Higher levels tell it
  - Online, Offline
  - Restart
[Architecture stack diagram]

Failover Manager
- Assigns groups to nodes based on
  - Failover parameters
  - Possible nodes for each resource in group
  - Preferred nodes for resource group
[Architecture stack diagram]

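One way to read the placement inputs above is as a set intersection followed by a preference scan. The sketch is hypothetical: the function name, the tie-break for nodes outside the preferred list, and the omission of the failover parameters are all assumptions.

```python
def choose_owner(preferred_owners, possible_nodes_per_resource, active_nodes):
    """Pick a host for a group.

    A node qualifies only if it is active and is a possible node for every
    resource in the group; the ordered preferred-owner list breaks ties.
    """
    candidates = set(active_nodes)
    for possible in possible_nodes_per_resource:
        candidates &= set(possible)
    for node in preferred_owners:
        if node in candidates:
            return node
    # Assumed fallback: any remaining candidate, chosen deterministically.
    return min(candidates) if candidates else None


possible = [{"Alice", "Betty"}, {"Alice", "Betty"}]
print(choose_owner(["Alice", "Betty"], possible, {"Alice", "Betty"}))
print(choose_owner(["Alice", "Betty"], possible, {"Betty"}))
```

Intersecting the per-resource node sets reflects the earlier statement that a group's members determine which nodes can host it.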
Failover (Resource Goes Offline)
[Flowchart.
Steps: Resource Manager detects resource error; Attempt to restart resource; Notify Failover Manager; Failover Manager checks Failover Window and Failover Threshold; Switch resource (and dependents) Offline; Notify Failover Manager on the new system to bring resource Online.
Decisions (Yes/No): Has the Resource Retry limit been exceeded? Are Failover conditions within constraints? Can another owner be found? (Arbitration)
Other outcomes: Leave Group in partially Online state; Wait for Failback Window.]

Pushing a Group (Resource Failure)
1. Resource Monitor notifies Resource Manager of resource failure.
2. Resource Manager enumerates all objects in the Dependency Tree of the failed resource.
3. Resource Manager takes each depending resource Offline.
4. Does any resource have “Affect the Group” set to True?
   - No: Leave Group in partially Online state.
   - Yes: Resource Manager notifies Failover Manager that the Dependency Tree is Offline and needs to fail over.
5. Failover Manager performs Arbitration to locate a new owner for the group.
6. Failover Manager on the new owner node brings the resources Online.

Pulling a Group (Node Failure)
1. Cluster Service notifies Failover Manager of node failure.
2. Failover Manager determines which groups were owned by the failed node.
3. Resource Manager notifies Failover Manager that the node is Offline and the groups it owned need to fail over.
4. Failover Manager performs Arbitration to locate a new owner for the groups.
5. Failover Manager on the new owner(s) brings the resources Online in dependency order.

Failback to Preferred Owner Node
- Group may have a Preferred Owner
- Preferred Owner comes back online
- Will only occur during the Failback Window (time slot, e.g. at night)
1. Preferred owner comes back Online.
2. Is the time within the Failback Window?
3. Resource Manager takes each resource on the current owner Offline.
4. Resource Manager notifies Failover Manager that the Group is Offline and needs to fail over to the Preferred Owner.
5. Failover Manager performs Arbitration to locate the Preferred Owner of the group.
6. Failover Manager on the Preferred Owner brings the resources Online.

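The window test in step 2 needs care when the window spans midnight. A minimal sketch, assuming the window is given as start and end hours (0 to 23) with an exclusive end; the hour-based representation and the function name are assumptions for illustration.

```python
def in_failback_window(hour, start_hour, end_hour):
    """True if `hour` falls inside [start_hour, end_hour), wrapping at midnight."""
    if start_hour <= end_hour:
        return start_hour <= hour < end_hour
    # Window wraps past midnight, e.g. 22:00 to 04:00.
    return hour >= start_hour or hour < end_hour


print(in_failback_window(23, start_hour=22, end_hour=4))   # night window, inside
print(in_failback_window(12, start_hour=22, end_hour=4))   # midday, outside
```

Restricting failback to a quiet period avoids a second service interruption during busy hours.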
Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A

Process Structure
- Cluster Service
  - Failover Manager
  - Cluster Registry
  - Global Update
  - Quorum
  - Membership
- Resource Monitor
  - Resource DLLs
- Resources
  - Services
  - Applications
[Diagram: A Node containing the Cluster Service, Resource Monitor, Resource DLL, and Resource, connected by private calls]

Resource Control
- Commands
  - CreateResource()
  - OnlineResource()
  - OfflineResource()
  - TerminateResource()
  - CloseResource()
  - ShutdownProcess()
- And resource events
[Diagram: A Node containing the Cluster Service, Resource Monitor, Resource DLL, and Resource, connected by private calls]

Resource DLLs
- Calls to Resource DLL
  - Open: get handle
  - Online: start offering service
  - Offline: stop offering service
    - as a standby or
    - pair-is offline
  - LooksAlive: Quick check
  - IsAlive: Thorough check
  - Terminate: Forceful Offline
  - Close: release handle
[State diagram: Online Pending, Offline Pending, Failed; messages “Go Online!”, “I’m Online!”, “Go Off-line!”, “I’m Off-line!”, “I’m here!”. Resource Monitor uses standard calls to the DLL; the DLL uses private calls to the Resource.]

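The division of labour between the entry points and the monitor's polling can be sketched with a toy resource. This is not the Resource DLL API (which is a C interface); the class, the method names in Python style, and the ratio of cheap to thorough checks are assumptions for illustration.

```python
class ToyResource:
    """Stand-in for a resource; 'Online' is just a state string here."""

    def __init__(self, name):
        self.name = name
        self.state = "Offline"

    def online(self):
        self.state = "Online"       # start offering service

    def offline(self):
        self.state = "Offline"      # stop offering service gracefully

    def terminate(self):
        self.state = "Offline"      # forceful offline

    def looks_alive(self):
        return self.state == "Online"   # quick check

    def is_alive(self):
        return self.state == "Online"   # thorough check (same in this toy)


def poll(resource, tick, is_alive_every=5):
    """One monitor poll: quick check each tick, thorough check every Nth tick."""
    thorough = tick % is_alive_every == 0
    check = resource.is_alive if thorough else resource.looks_alive
    if check():
        return "ok"
    resource.state = "Failed"
    return "failed"


globe = ToyResource("Rotating Globe")
globe.online()
print(poll(globe, tick=1))
globe.state = "Offline"             # simulate the process exiting
print(poll(globe, tick=2), globe.state)
```

Keeping the quick check cheap lets the monitor poll often, while the thorough check catches failures the quick one misses.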
Cluster Communications
- Most communication via DCOM/RPC
- UDP used for membership heartbeat messages
- Standard (e.g. Ethernet) interconnects
[Diagram: Management apps reach the Cluster Service via DCOM; Cluster Services talk to each other via DCOM/RPC (admin) and UDP (heartbeat); each Cluster Service reaches its Resource Monitors via DCOM/RPC]

Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A

Application Support
- Virtual Servers
- Generic Resource DLLs
- Resource DLL VC++ Wizard
- Cluster API

Virtual Servers
- Problem:
  - Client and Server Applications do not want node name to change when server app moves to another node.
- A Virtual Server simulates an NT Node
  - Resource Group (name, disks, databases, …)
  - NetName and IP address (node \\a keeps name and IP address as it moves)
  - Virtual Registry (registry “moves” (is replicated))
  - Virtual Service Control
  - Virtual RPC service
- Challenges:
  - Limit app to virtual server’s devices and services.
  - Client reconnect on failover (easy if connectionless, e.g. web clients)
[Diagram: Virtual Server \\a: 1.2.3.4]

Virtual Servers (before failover)
- Nodes \\Y and \\Z support virtual servers \\A and \\B
- Things that need to fail over transparently
  - Client connection
  - Server dependencies
  - Service names
  - Binding to local resources
  - Binding to local servers
[Diagram: nodes \\Y and \\Z, each running SAP and SQL; disks S: and T:; virtual servers \\A (“SAP on A”) and \\B (“SAP on B”)]

Virtual Servers (just after failover)
- \\Y resources and groups (i.e. Virtual Server \\A) moved to \\Z
- A resources bind to each other and to local resources (e.g., local file system)
  - Registry
  - Physical resource
  - Security domain
  - Time
- Transactions used to make DB state consistent.
- To “work”, local resources on \\Y and \\Z have to be similar
  - E.g. time must remain monotonic after failover
[Diagram: node \\Y; node \\Z running SAP and SQL with disks S: and T:; virtual servers \\A (“SAP on A”) and \\B (“SAP on B”)]

Address Failover and Client Reconnection
- Name and Address rebind to new node
  - Details later
- Clients reconnect
  - Failure not transparent
  - Must log on again
  - Client context lost (encourages connectionless)
  - Applications could maintain context
[Diagram: nodes \\Y and \\Z with SAP and SQL; disks S: and T:; virtual servers \\A (“SAP on A”) and \\B (“SAP on B”)]

Mapping Local References to Group-Relative References
- Send client requests to correct server
  - \\A\SAP refers to \\.\SQL
  - \\B\SAP refers to \\.\SQL
- Must remap references:
  - \\A\SAP to \\.\SQL$A
  - \\B\SAP to \\.\SQL$B
- Also handles namespace collision
- Done via
  - modifying server apps, or
  - DLLs to transparently rename
[Diagram: nodes \\Y and \\Z with SAP and SQL; disks S: and T:; virtual servers \\A (“SAP on A”) and \\B (“SAP on B”)]

Naming and Binding and Failover
- Services rely on the NT node name and/or IP address to advertise Shares, Printers, and Services.
  - Applications register names to advertise services
  - Example: \\Alice\SQL (i.e.

Client to Cluster Communications
- IP address mobility based on MAC rebinding
- IP rebinds to failover MAC addr
- Transparent to client or server
- Low-level ARP (address resolution protocol) rebinds IP addr to new MAC addr.
- Cluster Clients
  - Must use IP (TCP, UDP, NBT, ...)
  - Must Reconnect or Retry after failure
- Cluster Servers
  - All cluster nodes must be on same LAN segment
[Diagram: Client, WAN, Router, Local Network.
Client name table: Alice <-> 200.110.12.4; Virtual Alice <-> 200.110.12.5; Betty <-> 200.110.12.6; Virtual Betty <-> 200.110.12.7
Local network name table: Alice <-> 200.110.120.4; Virtual Alice <-> 200.110.120.5; Betty <-> 200.110.120.6; Virtual Betty <-> 200.110.120.7
Router ARP table: 200.110.120.4 -> AliceMAC; 200.110.120.5 -> AliceMAC; 200.110.120.6 -> BettyMAC; 200.110.120.7 -> BettyMAC]

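The rebinding idea can be shown with the router's table from the diagram modelled as a dictionary. This is only a picture of the mapping change; it does not send ARP traffic, and the function name is invented for the sketch.

```python
# Router's view before failover, copied from the slide's table.
arp_table = {
    "200.110.120.4": "AliceMAC",   # Alice
    "200.110.120.5": "AliceMAC",   # Virtual Alice
    "200.110.120.6": "BettyMAC",   # Betty
    "200.110.120.7": "BettyMAC",   # Virtual Betty
}


def rebind(table, ip, new_mac):
    """Return a copy of the table with one IP address bound to a new MAC."""
    updated = dict(table)
    updated[ip] = new_mac
    return updated


# Virtual Alice fails over to Betty's hardware; the IP address is unchanged.
after = rebind(arp_table, "200.110.120.5", "BettyMAC")
print(after["200.110.120.5"], after["200.110.120.4"])
```

Clients keep using the same virtual IP address, which is why they only need to reconnect or retry rather than look up a new server.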
Time
- Time must increase monotonically
  - Otherwise applications get confused
  - e.g. make/nmake/build
- Time is maintained within failover resolution
  - Not hard, since failover is on the order of seconds
- Time is a resource, so one node owns the time resource
- Other nodes periodically correct drift from owner’s time

Application Local NT Registry Checkpointing
- Resources can request that local NT registry subtrees be replicated
- Changes written out to quorum device
  - Uses registry change notification interface
- Changes read and applied on fail-over
[Diagram: registry of \\A on \\X is written to the Quorum Device on each update; after failover it is read into the registry of \\A on \\B]

Registry Replication

Application Support
- Virtual Servers
- Generic Resource DLLs
- Resource DLL VC++ Wizard
- Cluster API

Generic Resource DLLs
- Generic Application DLL
  - Simplest: just starts, stops application, and makes sure process is alive
- Generic Service DLL
  - Translates DLL calls into equivalent NT Server calls
    - Online => Service Start
    - Offline => Service Stop
    - Looks/IsAlive => Service Status
[Diagram: Resource Monitor makes standard calls to the DLL; the DLL makes private calls to the Resource]

Generic Application
Generic Service

Application Support
- Virtual Servers
- Generic Resource DLLs
- Resource DLL VC++ Wizard
- Cluster API

Resource DLL VC++ Wizard
- Asks for resource type name
- Asks for optional service to control
- Asks for other parameters (and associated types)
- Generates DLL source code
- Source can be modified as necessary
  - E.g. additional checks for Looks/IsAlive

Creating a New Workspace
Specifying Resource Type Name
Specifying Resource Parameters
Automatic Code Generation
Customizing The Code

Application Support
- Virtual Servers
- Generic Resource DLLs
- Resource DLL VC++ Wizard
- Cluster API

Cluster API
- Allows resources to:
  - Examine dependencies
  - Manage per-resource data
  - Change parameters (e.g. failover)
  - Listen for cluster events
  - etc.
- Specs & API became public Sept 1996
- On all MSDN Level 3
- On web site:
  - http://www.microsoft.com/clustering.htm

Cluster API Documentation

Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A

Research Topics?
- Even easier to manage
- Transparent failover
- Instant failover
- Geographic distribution (disaster tolerance)
- Server pools (load-balanced pool of processes)
- Process pair (active/backup process)
- 10,000 nodes?
- Better algorithms
- Shared memory or shared disk among nodes
  - a truly bad idea?

References
- Microsoft NT site: http://www.microsoft.com/ntserver/
- BARC site (e.g. these slides): http://research.microsoft.com/~joebar/wolfpack
- Inside Windows NT, H. Custer, Microsoft Press, ISBN: 155615481.
- Tandem Global Update Protocol, R. Carr, Tandem Systems Review, V1.2, 1985. Sketches regroup and global update protocol.
- VAXclusters: a Closely Coupled Distributed System, Kronenberg, N., Levy, H., Strecker, W., ACM TOCS, V4.2, 1986. A (the) shared disk cluster.
- In Search of Clusters: The Coming Battle in Lowly Parallel Computing, Gregory F. Pfister, Prentice Hall, 1995, ISBN: 0134376250. Argues for shared nothing.
- Transaction Processing Concepts and Techniques, Gray, J., Reuter, A., Morgan Kaufmann, 1994, ISBN 1558601902. Survey of outages, transaction techniques.