- Number of slides: 30
Fault-Tolerance Issues for Communicating Mobile Agents
Keith Marzullo
University of California, San Diego, Department of Computer Science and Engineering
… and the TACOMA group
6 October 1999
Fault-Tolerance
Fault-tolerance can mean different things:
• Ensuring that a failure will not be visible to the application: masking.
• Detecting when a failure has occurred: detection and recovery.
• Ensuring that a failure will not cause an inconsistent application state to arise: atomic transactions.
Fault-Tolerant Considerations for Communicating Mobile Agents -2 -
Roadmap
1. Review some ideas from fault-tolerance.
2. Present some issues associated with masking and active replication in mobile agent computations.
3. Present a protocol for detection and recovery in mobile agent computations based on primary-backup.
4. Discuss some issues associated with transactional support.
5. Mention a programming issue with detection and recovery.
Masking
• Uses sufficient replication and voting so that (independent) failures of components do not result in an incorrect state.
• It can be supplied as a wrapper that hides the replication of the service from the clients.
• Different approaches are appropriate for different failure models, performance requirements, and underlying communication systems.
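The voting step behind masking can be sketched as follows. This is only an illustration, not code from the talk: the function name `vote` and the rule "accept a value reported by at least f + 1 replicas" are assumptions, chosen because with at most f faulty replicas any value with f + 1 matching reports is vouched for by at least one correct replica.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote(replies: Sequence[Hashable], f: int) -> Optional[Hashable]:
    """Return the most common reply if at least f + 1 replicas reported it.

    Hypothetical helper: `replies` holds one reply per replica, and `f` is
    the assumed bound on the number of faulty replicas. Returns None when
    no value gathers enough matching reports.
    """
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= f + 1 else None


# Three replicas, at most one faulty: the two matching replies win.
result = vote(["ok", "ok", "corrupt"], f=1)
```

A wrapper that hides replication from clients would call something like `vote` on the collected replies and hand only the winning value back to the caller.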
State Machine Approach
[figure: replicated service]
Primary-Backup Approach
[figure: replicated service]
Detection and Recovery
Detection requires less replication than masking:
• 1 vs. f+1 replicas for detecting vs. masking f fail-stop crashes
• f+1 vs. 2f+1 replicas for detecting vs. masking f arbitrary failures
Recovery can be rollback, roll-forward, or a more specific approach.
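The replication counts on this slide can be written out as a small lookup. The function name and its boolean parameters are hypothetical; only the four counts come from the slide.

```python
def replicas_needed(f: int, arbitrary: bool, masking: bool) -> int:
    """Replicas needed to detect or mask f failures, per the slide.

    arbitrary=False means fail-stop crashes; arbitrary=True means
    arbitrary failures. masking=False means detection only.
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    if arbitrary:
        return 2 * f + 1 if masking else f + 1
    return f + 1 if masking else 1


# For f = 2: detect crashes, mask crashes, detect arbitrary, mask arbitrary.
counts = [
    replicas_needed(2, arbitrary=False, masking=False),
    replicas_needed(2, arbitrary=False, masking=True),
    replicas_needed(2, arbitrary=True, masking=False),
    replicas_needed(2, arbitrary=True, masking=True),
]
```

For f = 2 this gives 1, 3, 3, and 5 replicas respectively, which shows why detection is the cheaper option in both failure models.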
Replicated Agents with Voting
[figure; labels: landing pad, S, D]
Replicated Agents with Voting (2)
[figure; labels: S, stage 1, stage 2, stage 3, D, electorate for stage 1, electorate for D]
Replicated Agents with Voting (3)
[figure; labels: S, D]
Replicated Agents with Voting (4)
[[(s:S) > (q1:P1)]_S; (q1:P1) > (r1:P2)]_q1
[(s:S) > (q1:P1)]_S
[[(s:S) > (q1:P1)]_S; (q2:P1) > (r1:P2)]_q2
[(s:S) > (q2:P1)]_S
[[(s:S) > (q1:P1)]_S; (q3:P1) > (r1:P2)]_q3
[(s:S) > (q3:P1)]_S
Replicated Agents with Voting (5)
• Implements an architecture that can tolerate maliciously faulty landing pads.
• Rather complex and expensive.
• Perhaps best solved by the landing pad.
Primary-Backup by Application
• Places can crash, causing local agents to become lost.
• Agent code can be faulty, causing an agent to repeatedly fail.
• Communications can break, causing an agent’s plan to be unattainable.
Norwegian Army Protocol
• Uses the places an agent has visited as a set of potential places for recovery code to execute.
• The linear structure of a trajectory defines a monitoring strategy.
[figure: current agent at version 4; rear guards at version 3 (youngest), version 2, and version 1 (oldest)]
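One way to picture the rear guards is as a list ordered by version, from which a survivor is chosen to run recovery code. The sketch below is an assumption-laden illustration, not the protocol itself: the `Guard` class, the `alive` flag, and the rule "the youngest surviving rear guard recovers" are all hypothetical. The slide only states that visited places are candidate recovery sites and that the linear trajectory defines the monitoring strategy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Guard:
    """Hypothetical record of a rear guard left at a visited place."""
    place: str
    version: int       # higher version = more recently visited (younger)
    alive: bool = True


def pick_recoverer(rear_guards: List[Guard]) -> Optional[Guard]:
    """Return the youngest surviving rear guard, or None if none survive.

    Assumption: recovery is attempted at the most recent surviving place.
    """
    live = [g for g in rear_guards if g.alive]
    return max(live, key=lambda g: g.version, default=None)


# Versions 3, 2, 1 as in the figure; the place holding version 3 has crashed.
guards = [
    Guard("place-c", 3, alive=False),
    Guard("place-b", 2),
    Guard("place-a", 1),
]
chosen = pick_recoverer(guards)
```

Here `chosen` is the guard at version 2, since the version 3 guard is marked as crashed.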
Application Interaction
An agent executes a fault-tolerant action at a place.
An action completes with a move or exit.
• Regular actions have an attribute failure.
• Failure actions have attributes failed.Code and failed.At.
If a regular action r fails, then there is exactly one completed failure action f such that:
f.code = r.failure
f.failed.Code = r.code
f.failed.At = r.place
f.bc = r.bc
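The relation between a failed regular action r and its failure action f can be expressed as plain data. This is a minimal sketch: the class names, the Python attribute spellings `failed_code` and `failed_at`, and the use of a dict for `bc` are assumptions made for illustration; only the four equalities come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegularAction:
    code: str      # code the action runs
    place: str     # place where it runs
    bc: dict       # state carried by the agent (the slide's "bc")
    failure: str   # code to run if this action fails


@dataclass(frozen=True)
class FailureAction:
    code: str
    failed_code: str   # stands for the slide's failed.Code
    failed_at: str     # stands for the slide's failed.At
    bc: dict


def failure_action_for(r: RegularAction) -> FailureAction:
    """Build the failure action that the slide's four equalities describe."""
    return FailureAction(code=r.failure, failed_code=r.code,
                         failed_at=r.place, bc=r.bc)


def satisfies_spec(r: RegularAction, f: FailureAction) -> bool:
    """Check the four equalities relating f to the failed action r."""
    return (f.code == r.failure and f.failed_code == r.code
            and f.failed_at == r.place and f.bc == r.bc)


r = RegularAction(code="install_v2", place="host-7",
                  bc={"pkg": "demo"}, failure="rollback_v2")
f = failure_action_for(r)
```

The "exactly one completed failure action" part of the specification is a property of the protocol's execution and is not captured by this data-only sketch.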
Fail-Stop Reliable Broadcast
Failure-Free Execution
[figure: message diagram among nodes numbered 1 to 6; messages: move(bc), move-complete(bc), update(bc), update(bc’), ack(bc)]
Failure Execution
[figure: message diagram among nodes numbered 1 to 5; messages: move(bc), update(bc), move-complete]
NAP Details...
• spawn and checkpoint operations also terminate a fault-tolerant action.
• Additional complexity arises from a mobile computation visiting the same place multiple times.
• Support for NAP can be carried along with the mobile agent (scalability with respect to administrative domains).
Transactions
• Atomicity based on an atomic commit protocol and stable storage associated with each landing pad.
• Appears to be simple.
• Additional power comes from code mobility.
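The slide does not say which atomic commit protocol is meant. As an illustration only, the sketch below uses two-phase commit, one well-known atomic commit protocol, with an in-memory list standing in for each landing pad's stable storage. The `LandingPad` class, its methods, and the log format are all hypothetical, and failure handling (timeouts, coordinator crashes) is deliberately left out.

```python
from typing import List


class LandingPad:
    """Hypothetical participant; `log` stands in for stable storage."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.log: List[str] = []

    def prepare(self, txid: str) -> bool:
        # Phase 1: record the vote before replying to the coordinator.
        vote = "yes" if self.can_commit else "no"
        self.log.append(f"{txid}:vote-{vote}")
        return self.can_commit

    def finish(self, txid: str, commit: bool) -> None:
        # Phase 2: record the coordinator's decision.
        self.log.append(f"{txid}:{'commit' if commit else 'abort'}")


def two_phase_commit(txid: str, pads: List[LandingPad]) -> bool:
    """Commit only if every landing pad votes yes; otherwise abort at all."""
    votes = [pad.prepare(txid) for pad in pads]  # collect every vote first
    decision = all(votes)
    for pad in pads:
        pad.finish(txid, decision)
    return decision


pads_ok = [LandingPad("pad-1"), LandingPad("pad-2")]
committed = two_phase_commit("t1", pads_ok)

pads_bad = [LandingPad("pad-1"), LandingPad("pad-2", can_commit=False)]
aborted = two_phase_commit("t2", pads_bad)
```

In the first run both pads vote yes and the transaction commits everywhere; in the second, a single no vote forces every pad to abort, which is the all-or-nothing behavior atomicity requires.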
Transactions and Code Mobility
[figure; labels: store 1, store 2, store 3; lock $200, lock $100, lock $160; buy X; lock $460; $300; $100; account]
Transactions and Code Mobility (2)
[figure; labels: store 1, store 2, store 3; lock $200, lock $100, lock $160; buy X; $100; lock $200; account]
Programming for Fault-Tolerance
The kinds of problems we have been considering (so far) for NAP have to do with software installation and system maintenance:
• Synchronized installation of a new version of a package.
• Software license checking and upgrade.
• Specialized tool installation for distributed monitoring and testing.
All are built around some variation of agreement or reliable broadcast.
Programming for Fault-Tolerance (2)
A plus seems to be the separation of mobility from function. Trajectory, synchronization, security, and authentication are handled by mobility.
But writing fault-tolerant actions to implement the particular version of agreement/reliable broadcast is awkward.
… this seems to be a good place to use a higher-level programming language, e.g., Sage.
Observations
• It’s hard to do fault-tolerance without knowing the failure model!
• Detection and recovery is more appropriate for mobile agent computations than masking.
• The fault-tolerance community needs to do more work on detection and recovery for arbitrary failures.
• System management and maintenance seem to be a very rich field for problems involving fault-tolerant mobile agent computations.
Bibliography
1. F. B. Schneider. Towards fault-tolerant and secure agentry. In 11th International Workshop, WDAG '97, Saarbrucken, Germany, 24-26 Sept. 1997, pp. 1-14.
2. Dag Johansen et al. NAP: practical fault-tolerance for itinerant computations. In Proceedings, 19th IEEE International Conference on Distributed Computing Systems, Austin, TX, USA, 31 May-4 June 1999, pp. 180-189.
3. M. Strasser and K. Rothermel. Reliability concepts for mobile agents. International Journal of Cooperative Information Systems, Dec. 1998, 7(4): 355-382.
4. A. Ricciardi. The Sage Project: Software Engineering for Distributed Applications. The University of Texas Department of Electrical and Computer Engineering TR 1996-007, available at http://www.belllabs.com/user/aleta/TR-PDS-1996-007.ps.gz.