
Adding High Availability to the Condor Central Manager
Artyom Sharov, Technion – Israel Institute of Technology, Haifa
Condor Week – April 2006
Condor Pool without High Availability
[Diagram: four Startd/Schedd nodes connected to a single Central Manager running the Collector and Negotiator]
Why a Highly Available CM?
- The Central Manager is a single point of failure:
  - Condor tools do not work
  - no additional matches are possible
  - unfair resource sharing and user priorities
- Our goal: continuous pool functioning in case of failure
Highly Available Condor Pool
[Diagram: four Startd/Schedd nodes connected to a Highly Available Central Manager]
Solution Requirements
- Automatic failure detection
- Transparent failover
- "Split brain" reconciliation
- Persistency of CM state
- No changes to CM code
Condor Pool with HA
[Diagram: Highly Available Central Manager consisting of multiple machines, each running Replicator, HAD, and Collector daemons; the active machine also runs the Negotiator]
HA – Election + Main
[Diagram: Backup 1, Backup 2, and Backup 3 exchange election messages (#1); the winner ("I win") raises the Negotiator and becomes Active (#2), periodically sending "I am alive" messages; the losers ("I lose") remain backups]
HA – Crash
[Diagram: the Active CM crashes; Backup 1 and Backup 2 exchange election messages (#3); the winner ("I win") raises the Negotiator and becomes the new Active (#4), sending "I am alive" messages]
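The two election slides above can be condensed into a small sketch. This is a hypothetical simplified model, not the real HAD implementation: the actual daemons exchange election messages over the network, while here the set of alive IDs is just passed in and the highest ID wins.

```python
# Sketch of a highest-ID-wins election among HAD instances
# (hypothetical simplification; real HADs exchange election messages).

def elect(alive_ids):
    """Each HAD advertises its ID; the one with the highest ID wins,
    raises the Negotiator, and starts sending 'I am alive' messages.
    All others remain backups."""
    if not alive_ids:
        raise ValueError("no HAD instances alive")
    return max(alive_ids)

# Initial election among three backups: the highest ID wins.
leader = elect([1, 2, 3])      # -> 3
# The active CM crashes: the survivors re-run the election.
new_leader = elect([1, 2])     # -> 2
```

The same procedure covers both slides: the initial election and the re-election after a crash differ only in which IDs are still alive.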
Replication – Main + Joining
[Diagram: the Active sends a state update to the Backup (#1); a Joining replica solicits versions, receives the replies, and issues a downloading request (#2); the Active sends it the state (#3)]
Replication – Crash
[Diagram: the Active sends state updates to Backup 1 and Backup 2 (#4); after the Active crashes, the newly elected Active continues sending state updates (#5)]
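A toy model of the join protocol on the slides above, with hypothetical names (`Replica`, `join`, `state_update` are illustrative; the real Replicator daemon ships the accountant state file):

```python
# Toy model of the replication join protocol (hypothetical names).

class Replica:
    def __init__(self, name, version=0, state=None):
        self.name, self.version, self.state = name, version, state

def join(active, joining):
    """#2: the joining replica solicits versions, learns the active's
    newer version, and downloads its state before accepting updates."""
    if active.version > joining.version:
        joining.version, joining.state = active.version, active.state
    return joining

def state_update(active, backups):
    """#1/#4: the active periodically pushes its state to every backup,
    so any backup can take over with current state after a crash."""
    for b in backups:
        b.version, b.state = active.version, active.state

active = Replica("cm1", version=7, state="accountant-log-v7")
newcomer = join(active, Replica("cm2"))   # newcomer.version is now 7
```

Because every backup holds the latest pushed state, the crash case on this slide reduces to the new Active simply continuing `state_update` from where the old one left off.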
Configuration
- Stabilization time:
  - depends on the number of CMs and network performance
  - HAD_CONNECT_TIMEOUT – upper bound on the time to establish a TCP connection
  - example: with HAD_CONNECT_TIMEOUT = 2 and 2 CMs, a new Negotiator is guaranteed to be up and running after 48 seconds
- Replication frequency:
  - REPLICATION_INTERVAL
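The 48-second figure in the example is consistent with a worst-case bound of 12 × HAD_CONNECT_TIMEOUT × (number of CMs); that factor of 12 is an assumption inferred from the slide's numbers, not a formula stated here, so treat the sketch accordingly:

```python
# Assumed stabilization-time bound: 12 * HAD_CONNECT_TIMEOUT * num_CMs.
# The factor of 12 is inferred from the slide's example (2 * 2 -> 48 s).

def stabilization_time(had_connect_timeout, num_central_managers):
    """Worst-case time (seconds) until a new Negotiator is guaranteed
    to be up and running, under the assumed 12x factor."""
    return 12 * had_connect_timeout * num_central_managers

stabilization_time(2, 2)   # -> 48, matching the slide's example
```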
Testing
- Automatic distributed testing framework: simulation of node crashes, network disconnections, network partitions and merges
- Extensive testing:
  - distributed testing on 5 machines at the Technion
  - interactive distributed testing in the Wisconsin pool
  - automatic testing with the NMI framework
HA in Production
- Already deployed and fully functioning for more than a year at:
  - Technion
  - GLOW, UW
  - California Department of Water Resources, Delta Modeling Section, Sacramento, CA
  - Hartford Life
  - Cycle Computing
  - additional commercial users
Usability and Administration
- HAD monitoring system
- Configuration/administration utilities
- Detailed manual section
- Full support by the Technion team
Future Work
- HA in WAN
- HAIFA – High Availability Is For Anyone:
  - more consistency schemes and HA semantics
  - dynamic registration of services requiring HA
  - HA for any Condor service (e.g., HA for the schedd)
  - dynamic addition/removal of replicas
- More details in "Materializing Highly Available Grids" – hot-topic paper, to appear in HPDC 2006
Collaboration with the Condor Team
- Ongoing collaboration for 3 years
- Compliance with Condor coding standards
- Peer-reviewed code
- Integration with the NMI framework
- Automation of testing
- Open-minded attitude of the Condor team to numerous requests and questions
- Unique experience of working with a large peer-managed group of talented programmers
Collaboration with the Condor Team
This work was a collaborative effort of:
- Distributed Systems Laboratory at the Technion:
  - Prof. Assaf Schuster, Gabi Kliot, Mark Zilberstein, Artyom Sharov
- The Condor team:
  - Prof. Miron Livny, Nick, Todd, Derek, Greg, Anatoly, Peter, Becky, Bill, Tim
You Should Definitely Try It
- Part of the official 6.7.18 development release
- Will soon appear in the stable 6.8 release
- More information:
  - http://dsl.cs.technion.ac.il/projects/gozal/project_pages/ha/ha.html
  - http://dsl.cs.technion.ac.il/projects/gozal/project_pages/replication.html
  - more details + configuration in my tutorial
- Contact:
  - {gabik, marks, sharov}@cs.technion.ac.il
  - condor-users@cs.wisc.edu
In Case of Time (Backup Slides)
Replication – "Split Brain"
[Diagram: after a network merge, Active 1 and Active 2 each receive the other's "I am alive" message through their HAD/Replication daemons; Active 1 decides "my ID > 'Active 2' ID, I am the leader", while Active 2 decides "my ID < 'Active 1' ID, give up"]
Replication – "Split Brain": Merging Versions from Two Pools
[Diagram: the demoted side's Replication daemon tells the winner "you're the leader" and sends 'Active 2''s last version before the merge; the remaining Active then sends a state update to its new Backup]
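The two split-brain slides reduce to one deterministic rule, sketched here as a hypothetical simplification (the real daemons exchange "I am alive" messages and then merge state through the Replicator):

```python
# Sketch of split-brain reconciliation after a network merge
# (hypothetical simplification of the HAD/Replicator handshake).

def reconcile(active1_id, active2_id):
    """Each active compares its own ID against the peer's 'I am alive'
    message: the higher ID stays leader, the lower one gives up and
    demotes itself to backup. The demoted side then ships its last
    pre-merge version to the leader so neither pool's state is lost."""
    leader = max(active1_id, active2_id)
    backup = min(active1_id, active2_id)
    return leader, backup

leader, backup = reconcile(10, 7)   # -> (10, 7)
```

Because the rule is a pure function of the two IDs, both sides reach the same decision independently, with no extra coordination round needed.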
HAD State Diagram
[State diagram of the HAD daemon]
RD State Diagram
[State diagram of the Replication daemon (RD)]