
Condor DAGMan: Introduction & Update
Peter Couvares
Computer Sciences Department
University of Wisconsin-Madison
pfc@cs.wisc.edu
http://www.cs.wisc.edu/condor

DAGMan
› Directed Acyclic Graph Manager
› DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.
› (e.g., “Don’t run job ‘B’ until job ‘A’ has completed successfully.”)

Why is This Important?
› Most real science involves complex sequences of tasks, on many resources at many sites.
  - E.g., move data, compute, check, move back, etc.
› … and many types of jobs working together
  - Condor, Grid (Condor-G), MPI, shell scripts, etc.
› Failures are a certainty, so recoverability of the sequence, not just the jobs, is crucial.

What is a DAG?
› A DAG is the data structure used by DAGMan to represent these dependencies.
› Each job is a “node” in the DAG.
› Each node can have any number of “parent” or “children” nodes, as long as there are no loops!
[Diagram: a DAG with nodes Job A, Job B, Job C, Job D]
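
The "no loops" condition above can be checked mechanically. The following is a minimal sketch in Python, not DAGMan's own code: the dictionary layout and the function name `is_acyclic` are assumptions made for illustration, and the check uses Kahn's algorithm (repeatedly removing nodes that have no remaining parents).

```python
from collections import deque

def is_acyclic(children):
    """Return True if the graph (node -> list of child nodes) has no loops."""
    # Count incoming edges (parents) for every node.
    indegree = {node: 0 for node in children}
    for kids in children.values():
        for kid in kids:
            indegree[kid] = indegree.get(kid, 0) + 1
    # Kahn's algorithm: repeatedly remove nodes with no remaining parents.
    ready = deque(n for n, d in indegree.items() if d == 0)
    removed = 0
    while ready:
        node = ready.popleft()
        removed += 1
        for kid in children.get(node, []):
            indegree[kid] -= 1
            if indegree[kid] == 0:
                ready.append(kid)
    # Nodes that sit on a loop never reach indegree 0, so they are never removed.
    return removed == len(indegree)

# Hypothetical example graphs: a four-job diamond and a two-node loop.
diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
looped = {"A": ["B"], "B": ["A"]}
```

Here `is_acyclic(diamond)` holds, while `is_acyclic(looped)` does not, because neither node of the loop ever becomes parent-free.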

Defining a DAG
› A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
[Diagram: diamond DAG with Job A at the top, Job B and Job C in the middle, Job D at the bottom]
› Each node will run the Condor or Grid job specified by its accompanying Condor submit file.
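
To make the file format concrete, here is a rough Python sketch that reads the two kinds of lines shown on this slide (`Job` and `Parent ... Child ...`) into dictionaries. It is an illustration only: `parse_dag` is a hypothetical helper, not part of Condor, it ignores every other keyword a real .dag file may contain, and treating keywords as case-insensitive is an assumption.

```python
def parse_dag(text):
    """Parse the 'Job' and 'Parent ... Child ...' lines shown on the slide."""
    submit_files = {}   # node name -> submit file
    children = {}       # node name -> list of child node names
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        words = line.split()
        keyword = words[0].lower()
        if keyword == "job":
            name, submit_file = words[1], words[2]
            submit_files[name] = submit_file
            children.setdefault(name, [])
        elif keyword == "parent":
            # Everything before 'Child' is a parent, everything after is a child.
            split_at = [w.lower() for w in words].index("child")
            parents, kids = words[1:split_at], words[split_at + 1:]
            for parent in parents:
                children.setdefault(parent, []).extend(kids)
    return submit_files, children

DIAMOND = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""
submit_files, children = parse_dag(DIAMOND)
```

After the call, `children` maps A to B and C, and both B and C to D, which is the diamond shape described by the file.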

Submitting a DAG
› To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:
    % condor_submit_dag diamond.dag
› condor_submit_dag submits a Scheduler Universe job to run DAGMan under Condor, so DAGMan itself will be robust in case of failure, machine reboots, etc.

Running a DAG
› DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.
[Diagram: DAGMan reads the .dag file (nodes A, B, C, D); job A is in the Condor job queue]

Running a DAG (cont’d)
› DAGMan holds & submits jobs to the Condor queue at the appropriate times.
[Diagram: DAG nodes A, B, C, D; jobs B and C are in the Condor job queue]
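
One way to picture "the appropriate times" is to group nodes into waves, where a node becomes eligible once all of its parents have finished. The sketch below is a simplification under stated assumptions: every job succeeds, jobs in a wave finish together, and the name `submission_waves` is hypothetical. Real jobs finish at different times, so DAGMan would not work in lockstep like this.

```python
def submission_waves(parents):
    """Group nodes into waves; a node is eligible once all its parents are done."""
    done = set()
    waves = []
    while len(done) < len(parents):
        # A node is ready when it has not run yet and every parent has finished.
        wave = sorted(n for n, ps in parents.items()
                      if n not in done and all(p in done for p in ps))
        if not wave:
            raise ValueError("no runnable node left: the graph has a loop")
        waves.append(wave)
        done.update(wave)
    return waves

# The diamond DAG, written as node -> list of parents.
diamond_parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
waves = submission_waves(diamond_parents)
```

For the diamond this yields three waves: A alone, then B and C together, then D.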

Running a DAG (cont’d)
› In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.
[Diagram: DAG nodes A, B, X (the failed node), D; the Condor job queue is empty; DAGMan writes a rescue file]

Recovering a DAG
› Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
[Diagram: DAGMan reads the rescue file; job C is in the Condor job queue]
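
The failure and recovery behaviour described on the last two slides can be simulated in a few lines. This is a sketch, not DAGMan's implementation: the slides do not show the rescue file format, so a plain Python set of completed nodes stands in for it, and `run_dag` and `run_job` are hypothetical names.

```python
def run_dag(parents, run_job, already_done=()):
    """Run every node whose parents succeeded; stop when no progress is possible.

    Returns (done, failed). The 'done' set plays the role of the rescue state.
    """
    done = set(already_done)
    failed = set()
    progress = True
    while progress:
        progress = False
        for node in sorted(parents):
            if node in done or node in failed:
                continue
            if all(p in done for p in parents[node]):
                if run_job(node):
                    done.add(node)
                else:
                    failed.add(node)
                progress = True
    return done, failed

diamond_parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

# First run: job C fails, so B still runs but D can never start.
done, failed = run_dag(diamond_parents, lambda node: node != "C")

# Second run: C has been fixed; nodes recorded as done are not run again.
rerun = []
def fixed_job(node):
    rerun.append(node)
    return True

done2, failed2 = run_dag(diamond_parents, fixed_job, already_done=done)
```

In the first run A and B complete and C fails; in the second run only C and D are executed, which is the point of restoring the prior state instead of starting over.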

Finishing a DAG
› Once the DAG is complete, the DAGMan job itself is finished, and exits.
[Diagram: DAG nodes A, B, C, D; the Condor job queue is empty]

Additional DAGMan Features
› Provides other knobs handy for job management…
  - nodes can have PRE & POST scripts
  - job submission can be “throttled”
  - NEW: failed nodes can be automatically re-tried a configurable number of times

PRE & POST Scripts
› Execute locally on the submit host before or after job submission…
Example:
    # diamond.dag
    PRE A prepare-A.sh
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    POST D double-check.sh
    Parent A Child B C
    Parent B C Child D
[Diagram: diamond DAG with a PRE script attached to Job A and a POST script attached to Job D]
› PRE/POST scripts are part of the node.

DAG “Throttling”
› You can tell DAGMan to limit the maximum number of jobs it submits at any one time
  - condor_submit_dag -maxjobs N
  - useful for managing resource limitations (e.g., licenses)
› You can also limit the number of simultaneous PRE or POST scripts.
  - Added after Vladimir Litvin’s 7000-node DAG started 7000 PRE scripts on his machine!
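
The idea behind throttling is simply to cap how many jobs are outstanding at once. The following Python sketch simulates that cap; it is not how DAGMan is implemented, `throttled_submission` is a hypothetical function, and its `max_jobs` parameter is only modelled on the `-maxjobs N` option above. Jobs are assumed to complete in the order they were submitted.

```python
from collections import deque

def throttled_submission(ready_jobs, max_jobs):
    """Simulate submitting jobs while keeping at most max_jobs in the queue."""
    waiting = deque(ready_jobs)
    in_queue = []
    finished = []
    peak = 0
    while waiting or in_queue:
        # Submit until the limit is reached.
        while waiting and len(in_queue) < max_jobs:
            in_queue.append(waiting.popleft())
        peak = max(peak, len(in_queue))
        # Pretend the oldest queued job completes, freeing one slot.
        finished.append(in_queue.pop(0))
    return finished, peak

jobs = [f"node{i}" for i in range(10)]
finished, peak = throttled_submission(jobs, max_jobs=3)
```

All ten jobs still finish, but the queue never holds more than three at a time, which is the property that protects a limited resource such as a license pool.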

Node RETRY
› Tells DAGMan to re-run a node multiple times if necessary…
Example:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    RETRY B 5
    Job C c.sub
    RETRY C 5
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
[Diagram: diamond DAG with Job A, Job B, Job C, Job D]
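
A retry policy of this kind amounts to a small loop around the job. The sketch below is illustrative only: `run_with_retry` is a hypothetical helper, and it assumes that `RETRY B 5` means up to five re-runs after the first attempt (six attempts in total); the slide itself does not spell out that detail.

```python
def run_with_retry(run_job, retries):
    """Run a job once, then re-run it up to `retries` more times on failure.

    Returns (succeeded, attempts_made).
    """
    attempts = 0
    while True:
        attempts += 1
        if run_job():
            return True, attempts
        if attempts > retries:
            # The first attempt plus all allowed retries have failed.
            return False, attempts

# A job that fails twice and then succeeds.
outcomes = iter([False, False, True])
ok, attempts = run_with_retry(lambda: next(outcomes), retries=5)

# A job that never succeeds uses the first attempt plus all five retries.
never_ok, total = run_with_retry(lambda: False, retries=5)
```

The first job succeeds on its third attempt; the second gives up after six attempts, at which point the node would count as failed.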

DAGMan Progress
› Testing… lots of testing.
  - 10,000+ node DAGs run smoothly
  - Developed automated DAG testing tools to generate random DAGs and test for correct execution (Ning Lin & Will McDonald)
  - Lots of bugs fixed

DAGMan Progress (cont’d)
› New features
  - Improved logging (timestamps, etc.)
  - More efficient recovery
  - Node RETRY capability
  - DAG info in condor_q (with -dag flag)
  - Robust in more failure cases
  - Recursive DAGs for conditional execution
› DAGMan for Windows (Ray Pingree)

DAGMan Success
› DAGMan is becoming part of the common framework for running on the grid.
  - Particle Physics Data Grid (PPDG)
  - Grid Physics Network (GriPhyN)
  - Many Super Computing 2001 demos
  - more…

DAGMan in the GriPhyN Application Architecture
[Diagram by Ian Foster (Argonne): architecture boxes labelled DAG, Planner, Executor, Catalog Services, Info Services, Policy/Security, Monitoring, Repl. Mgmt., Reliable Transfer Service, Compute Resource, and Storage Resource; technologies named in the diagram include DAGMan, Kangaroo, MCAT, GriPhyN catalogs, MDS, GDMP, GSI, CAS, Globus, GRAM, GridFTP, and SRM]

DAGMan in PPDG Tools
[Diagram by Jim Amundson (Fermilab)]

What’s Next?
› More flexible control of node execution
  - Currently implicit: “all my parents returned 0”.
  - Why not, “all parents returned 0 AND ran for more than two hours” or “parent A returned 0 and parent B returned 42”?
› 1st step: represent DAG nodes internally as ClassAds
  - Allows DAGMan to decide when to run nodes based on arbitrary requirements
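
The three readiness rules quoted above can be written as interchangeable predicates over the parents' results. This is only a sketch of the idea: plain Python functions stand in for ClassAd expressions (no ClassAd syntax is used), and the field names `exit_code` and `runtime_hours` are made up for the example.

```python
def all_parents_ok(results):
    """The current implicit rule: every parent returned 0."""
    return all(r["exit_code"] == 0 for r in results.values())

def all_ok_and_long_running(results):
    """All parents returned 0 AND ran for more than two hours."""
    return all(r["exit_code"] == 0 and r["runtime_hours"] > 2
               for r in results.values())

def a_zero_and_b_42(results):
    """Parent A returned 0 and parent B returned 42."""
    return results["A"]["exit_code"] == 0 and results["B"]["exit_code"] == 42

# Hypothetical results reported by two parent nodes.
results = {
    "A": {"exit_code": 0, "runtime_hours": 3.0},
    "B": {"exit_code": 42, "runtime_hours": 0.5},
}
ready_by_rule = {
    rule.__name__: rule(results)
    for rule in (all_parents_ok, all_ok_and_long_running, a_zero_and_b_42)
}
```

With these results only the third rule lets the child node run, which shows how the same DAG could behave differently once the requirement is made explicit per node.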

What’s Next? (cont’d)
› Extend DAGMan to utilize DaP Scheduler (DaP?) to intelligently schedule data transfers along with Condor and Condor-G jobs.
[Diagram: DAGMan together with Condor, Condor-G, and the DaP Scheduler]

Thank You!
› Interested in seeing more?
  - Come to the DAGMan BoF
    • Wednesday 9 am - noon
    • Room 3393, Computer Sciences (1210 W. Dayton St.)
  - Email us:
    • [email protected] wisc.edu
  - Try it!
    • http://www.cs.wisc.edu/condor