An unattended fault-tolerant approach for the execution of


An unattended, fault-tolerant approach for the execution of distributed applications
Manuel Rodríguez-Pascual, Rafael Mayo-García
CIEMAT, Madrid, Spain

Outline
• Problem
• Solution
  • Architecture
  • Implementation
  • Examples

Application porting for distributed platforms

Problem
• Applications should be highly portable
• Grid:
  • Schedulers: WMS, GridWay, pilot jobs, ...
  • Libraries: DRMAA, SAGA, ...
• Cluster:
  • Schedulers: SGE, PBS/Torque, ...
  • Libraries: DRMAA, MPI

A new standard? (author: xkcd.com)

Solution: distributedToolbox

High-level description
• Distributed tasks are defined with a reduced set of parameters and exported as XML files
  • Executable, arguments, input/output/error files
• XML files are parsed and the tasks are executed on the distributed infrastructures
  • Depending on the infrastructure, this can be done in very different ways
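The task definition described above could be sketched as follows. This is only an illustration: the slides do not show the actual schema, so the tag names (`task`, `executable`, `arguments`, `argument`, `input`, `output`, `error`) and both function names are assumptions.

```python
import xml.etree.ElementTree as ET


def task_to_xml(executable, arguments, stdin=None, stdout=None, stderr=None):
    """Build a task definition element (all tag names are hypothetical)."""
    task = ET.Element("task")
    ET.SubElement(task, "executable").text = executable
    args = ET.SubElement(task, "arguments")
    for value in arguments:
        ET.SubElement(args, "argument").text = value
    # Input/output/error files are only written when they are provided.
    for tag, path in (("input", stdin), ("output", stdout), ("error", stderr)):
        if path is not None:
            ET.SubElement(task, tag).text = path
    return task


def write_task(task, path):
    """Export the task definition as an XML file."""
    ET.ElementTree(task).write(path, encoding="utf-8", xml_declaration=True)
```

A call such as `write_task(task_to_xml("/bin/echo", ["hello"], stdout="out.txt"), "task.xml")` would produce one file per task, which an infrastructure-specific executor could then pick up.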

High-level description (2)
• The basic idea is NOT to define a new standard, library, or API...
• BUT to create a simple specification that anyone can implement according to their specific needs
  • The implementation can be extremely simple or rather complex!

Application developer's point of view
• Java and Python APIs are included to create distributed task definition files (XML) and to load information from them
• If needed, APIs for other languages can be implemented easily
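Loading information from a task definition might look like the sketch below. It is not the toolbox's Python API, which the slides do not show; the `Task` class, the `load_task` function, and the XML layout (`<task>`, `<executable>`, `<arguments>/<argument>`, `<output>`) are all hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    """In-memory view of one task definition (field names are assumptions)."""
    executable: str
    arguments: List[str] = field(default_factory=list)
    output: Optional[str] = None


def load_task(xml_text: str) -> Task:
    """Parse a task definition from an XML string."""
    root = ET.fromstring(xml_text)
    executable = root.findtext("executable")
    if not executable:
        raise ValueError("task definition has no executable")
    # Empty <argument/> elements are kept as empty strings.
    arguments = [a.text or "" for a in root.findall("arguments/argument")]
    return Task(executable, arguments, root.findtext("output"))
```

Keeping the loader this small reflects the point of the slide: any language with an XML parser can provide an equivalent.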

DistributedToolbox
• Set of tools to execute distributed tasks
• Implementations for Cluster & Grid
• Can be modified or adapted to new platforms in a very simple way

Proposed solution for clusters

Proposed solution for the Grid

Execution workflow
• The local application creates task XMLs
• TaskLoader reads these files and stores them in a database
• GridController reads this database and executes the tasks using GridWay
  • A task is considered finished when the desired output files exist and are non-empty
• The local application loads the results and finishes its execution
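The completion criterion in the workflow above can be expressed in a few lines. This is a minimal sketch that reads "not null" as "non-empty"; the function name `is_finished` is hypothetical.

```python
import os


def is_finished(output_files):
    """Return True when every expected output file exists and is non-empty.

    Note: with an empty list of expected files this returns True.
    """
    return all(
        os.path.isfile(path) and os.path.getsize(path) > 0
        for path in output_files
    )
```

Checking the output files rather than the status reported by the scheduler means a task only counts as done once its results have actually reached the local host.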

Robustness
• Certification problems.
  • If the user cannot authenticate with a valid Grid certificate, GridWay detects it and aborts the task submission, reporting the problem.
• Communication failures.
  • If any problem occurs while transmitting the input data or the task executable, GridWay detects it on the remote site and the task is cancelled.
  • If any problem occurs while transmitting the output data and the data is not returned to the local host, the task is considered to have failed.

Robustness (2)
• Local resource failures.
  • If the specified input files are not present on the system, the job is considered finished.
  • If communication with GridWay is broken, task submission is stopped. When communication is restored, the status of the running tasks is checked.
  • If GridController fails, no information is lost, thanks to the use of a database for persistence. When it is restarted, the previous state is recovered and the status of the tasks that were running is checked.
  • If the database fails, continuing to run GridController is considered unsafe and it stops automatically.
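The database-backed recovery described above could be sketched with SQLite as follows. The slides do not say which database or schema is used, so the table layout, the state names (`PENDING`, `RUNNING`), and all function names here are assumptions.

```python
import sqlite3


def open_store(path):
    """Open (or create) the task database; the table layout is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id INTEGER PRIMARY KEY, xml_path TEXT NOT NULL, state TEXT NOT NULL)"
    )
    db.commit()
    return db


def add_task(db, xml_path):
    """Register a new task in the PENDING state and return its id."""
    with db:  # commits on success, rolls back on error
        cur = db.execute(
            "INSERT INTO tasks (xml_path, state) VALUES (?, 'PENDING')",
            (xml_path,),
        )
    return cur.lastrowid


def set_state(db, task_id, state):
    """Commit each transition so the state survives a controller crash."""
    with db:
        db.execute("UPDATE tasks SET state = ? WHERE id = ?", (state, task_id))


def tasks_to_recheck(db):
    """After a restart, return the ids of tasks that were running before."""
    rows = db.execute("SELECT id FROM tasks WHERE state = 'RUNNING' ORDER BY id")
    return [row[0] for row in rows]
```

On restart, a controller built this way would call `tasks_to_recheck` and query the scheduler for each returned id before submitting anything new.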

Robustness (3)
• Remote resource failures.
  • If the remote task does not start, GridWay detects it.
  • If the remote task remains queued for longer than a given threshold, it is resubmitted.
  • If there is any problem with the Grid certificates on the remote site, GridWay detects it.
  • Some failures at remote sites lead to a state where the master node believes the task is still running even though it has finished on the worker node. To detect this, tasks with an extremely long execution time are considered to have failed.
  • To avoid performance slowdowns, a small replication factor is applied to every group of tasks.
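The two time-based rules above (resubmit after too long in the queue, fail after an extremely long execution) can be sketched as a single decision function. The state names, the `Action` enum, and the default thresholds are hypothetical; the slides give no concrete values.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"
    RESUBMIT = "resubmit"
    MARK_FAILED = "mark_failed"


def next_action(state, seconds_in_state,
                max_queue_s=3600, max_run_s=7 * 24 * 3600):
    """Decide what to do with a remote task based on how long it has
    stayed in its current state (thresholds are illustrative only)."""
    if state == "QUEUED" and seconds_in_state > max_queue_s:
        return Action.RESUBMIT
    if state == "RUNNING" and seconds_in_state > max_run_s:
        # Covers the case where the master node still reports "running"
        # although the task already ended on the worker node.
        return Action.MARK_FAILED
    return Action.WAIT
```

A controller loop would evaluate this for every active task on each polling cycle and act on anything other than `Action.WAIT`.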

Use Cases

ProtTest 3 [1] & jModelTest 2 [2]
• Java applications, designed to run on local workstations
• Wrappers around a serial application, PhyML, which accounts for 99% of the computational effort
• Large cases take days to weeks
• Porting to HPC & Grid is necessary to improve throughput
[1] D. Darriba, G. L. Taboada, R. Doallo, and D. Posada. ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics, 2011.
[2] D. Darriba, G. L. Taboada, R. Doallo, and D. Posada. jModelTest 2: more models, new heuristics and parallel computing. Nature Methods, 9(8):772, July 2012.

Architecture of the solution

Results: reliability tests
• Tests for certificate management:
  • Submit jobs with no certificate
  • Submit jobs with a certificate of a different VO
  • Submit jobs that finish after the certificate expires
  • Manually destroy the certificate

Results: reliability tests (2)
• Tests on the local resource:
  • Kill GridWay
  • ... or any number of GridWay tasks
  • Kill GridController
  • Kill the database
  • Shut down the machine, both controlled and "hard reset"

Results: reliability tests (3)
• Tests on remote sites:
  • Jobs not creating the desired output data
  • Many tasks submitted to the fusion and Biomed VOs to test the proposal in production environments

Results
• Tasks executed:
  • Cluster: about 10,000
  • Grid: more than 100,000
  • Not a single task was lost or executed incorrectly

Thanks for your attention. Questions?