NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER
Using the Batch System at NERSC
Mark Durst, NERSC/USG
ERSUG Training, Argonne, IL
28 April 1999

Outline
• Quick example
• How batch processing works
• Batch and pipe queues
• How to submit jobs
• Monitoring jobs
• Reminders and pointers

    #!/bin/csh
    #
    # file: simple1
    #
    #QSUB -q serial
    #QSUB -J y      # keep job log
    set myname=`whoami`
    set now=`date`
    set mylocn=`pwd`
    echo ""
    echo "Hello $myname, this is your shell script $0,"
    echo "running at $now."
    echo ""
    echo "Your current directory is $mylocn, which should"
    echo "be the same as $HOME."
    echo ""
    echo "I'm going to sleep now."
    echo ""
    sleep 90
    exit

    % cqsub simple1
    Task id t51847 inserted into database nqedb.
    % cqstatl t51847
    -------------- NQE 3.3.0.9 Database Task Summary --------------
    IDENTIFIER             NAME     SYSTEM-OWNER    OWNER    LOCATION      ST
    ---------------------- -------- --------------- -------- ------------- -----
    t51847                 simple1  scheduler.main  mjdurst  NQE Database  NNew
    % cqstatl t51847
    -------------- NQE 3.3.0.9 Database Task Summary --------------
    IDENTIFIER             NAME     SYSTEM-OWNER    OWNER    LOCATION      ST
    ---------------------- -------- --------------- -------- ------------- -----
    t51847                 simple1  scheduler.main  mjdurst  NQE Database  NPend
    % cqstatl t51847
    -------------- NQE 3.3.0.9 Database Task Summary --------------
    IDENTIFIER             NAME     SYSTEM-OWNER    OWNER    LOCATION      ST
    ---------------------- -------- --------------- -------- ------------- -----
    t51847                 simple1  lws.mcurie      mjdurst  NQE Database  NSche
    % cqstatl t51847
    -------------- NQE 3.3.0.9 Database Task Summary --------------
    IDENTIFIER             NAME     SYSTEM-OWNER    OWNER    LOCATION      ST
    ---------------------- -------- --------------- -------- ------------- -----
    t51847 (49939.mcurie)  simple1  lws.mcurie      mjdurst  nqs@mcurie    NSubm

    % qstat 49939
    ---------------- NQS 3.3.0.9 BATCH REQUEST SUMMARY ----------------
    IDENTIFIER    NAME       USER      LOCATION/QUEUE       JID   PRTY  REQMEM  REQTIM  ST
    ------------- ---------- --------- -------------------- ----- ----- ------- ------- ---
    49939.mcurie  simple1    mjdurst   serial_short@mcurie  3753  25    364     1800    R03
    % qstat 49939
    nqs-100 qstat: CAUTION
    Request <49939>: not found.
    % cqstatl t51847
    -------------- NQE 3.3.0.9 Database Task Summary --------------
    IDENTIFIER             NAME     SYSTEM-OWNER    OWNER    LOCATION      ST
    ---------------------- -------- --------------- -------- ------------- -----
    t51847 (49939.mcurie)  simple1  monitor.main    mjdurst  NQE Database  NComp
    % ls -l
    total 12
    [four entries owned by mjdurst, group mpccc, dated Jan 15 between 10:47 and 10:50:
     simple1* (365 bytes), simple1.e51847 (0 bytes), simple1.l51847 (1285 bytes),
     simple1.o51847 (2638 bytes)]

    % cat simple1.l51847
    01/15 10:48:13 Submitting to queue <serial> by
    01/15 10:48:13 Command line options: <-e /u1/mjdurst/tests/bat.simple/simple1.e51847 -J y
                   -j /u1/mjdurst/tests/bat.simple/simple1.l51847 -lM 28mw -lT 1800
                   -mu mjdurst@mcurie -o /u1/mjdurst/tests/bat.simple/simple1.o51847
                   -r simple1 -x -q serial>.
    01/15 10:48:13 Script file options: <-q serial -J y # keep job log>.
    01/15 10:48:15 Arrived in from .
    01/15 10:48:15 Request-id is <49939.mcurie>, Request name=.
    01/15 10:48:15 NQE Task ID is .
    01/15 10:48:15 Origin uid=<12113>, Target username=.
    01/15 10:48:15 Account/Project name=, Account/Project ID=<105>.
    01/15 10:48:15 Submission security level=<0>, compartments=<0>.
    01/15 10:48:17 Account/Project name=, Account/Project ID=<105>.
    01/15 10:48:17 Arrived in from .
    01/15 10:48:20 Submission security level=<0>, compartments=<0>.
    01/15 10:48:20 Execution security level=<0>, compartments=<0>.
    01/15 10:48:23 Started, pid=<36967>, jid=<3753>, shell=, umask=<18>.
    01/15 10:48:23 Running in queue .
    01/15 10:50:02 Finished.
    01/15 10:50:02 Returning stderr output file.
    01/15 10:50:03 Returning stdout output file.

    % cat simple1.o51847
    mcurie.nersc.gov, a Cray T3E-900 running UNICOS/mk 2.0.3.32
    --------------- Contact Information ---------------
    NERSC Web    http://www.nersc.gov/
    ESnet Web    http://www.es.net/
    ESCHER Web   http://www.nersc.gov/hardware/servers/vis-server.html

    CFS CONVERSION
    CFS to HPSS conversion was successfully completed on January 7, 1999.
    Users can access all of their CFS files on the new HPSS system, "archive".
    The cfs command on the NERSC Crays now points to the new HPSS interface, hsi.
    For more info on using hsi reference this URL:
    http://www.nersc.gov/hardware/storage/hsi.ch1.html.
    If your HPSS password fails or you don't have an HPSS account, contact the
    Account Support group at 1-800-66-NERSC, option 2, or (510) 486-8612
    ---------------------------------------
    Your current working directory is /u/mpccc/mjdurst.

    Hello mjdurst, this is your shell script /usr/spool/nqe/spool/scripts/++BBI+++++0+++,
    running at Fri Jan 15 10:48:31 PST 1999.

    Your current directory is /u1/mjdurst, which should
    be the same as /u/mpccc/mjdurst.

    I'm going to sleep now.

    logout

Why Batch Processing?
• Batch queues are necessary:
  – On systems with many jobs
  – When scheduling is difficult
  – To assure greater throughput
• Interactive jobs are limited
  – J90: 10 hrs.
  – T3E: < 64 PEs, < 30 minutes parallel (1 hr serial)
• Some machines/processors batch-only
  – J90: all batch machines
  – T3E: many APP PEs (at night, almost all)

The Batch Process
• User creates shell script myscript
• Submits to NQE with cqsub myscript
  – Returns NQE task id (e.g., t4913)
• NQE forwards to NQS
  – J90: selects a machine (J90 wait time here)
• NQS runs the job
  – Assign NQS job id (e.g., 6859.mcurie)
  – Select a batch queue
  – Place the job there (T3E wait time here)
  – Run it when appropriate
• NQS/NQE returns job logs at completion

Pipe Queues
• Groups of batch queues
  – Direct to a pipe with #QSUB -q serial
  – Default is production
• To see them: qstat -p
• T3E:
  – serial, debug, production, long
• J90:
  – production
  – batchk (for evening, weekend killeen queues)
  – batch{b, f, s, c, j} (not recommended)

Preparing for Batch Submission
• Write your shell script
  – C shell or Bourne/Korn shell
  – Starts in user’s home directory
• Debug interactively (if possible)
• Decide on needed resources
  – J90: CPU time, memory
  – T3E: amount of parallel, serial time; number of PEs
• Select other #QSUB options
• Check for appropriate queue and submit

Essential options to cqsub (#QSUB directives)
• J90:
  – -lM
  – -lT

Other cqsub options
• -J y: save job log (recommended)
• -j: save it in file
• -mb: send mail when job starts (-me: ends)
• -a

Job Submission
• cqsub <file>
• Can give options at submission time
  – Override file options
  – Less dependable
• If no file name, expects commands from terminal
  – Useful behavior in automated script generation & submission
• Response: Task id t16839 inserted into database nqedb.
  – Task id useful for tracking with cqstatl.
• Don’t break (Ctrl-C) out of cqsub!
  – Instead, allow to finish, then use cqdel
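Because the task id is what cqstatl needs later, a wrapper script may want to capture it from the cqsub response. The following Bourne-shell sketch assumes the response always has the form shown on the slide; the function name task_id_from_response is hypothetical and not part of NQE. It only parses a sample string and does not submit anything.

```shell
#!/bin/sh
# Hypothetical helper (not an NQE command): extract the NQE task id
# from a cqsub response line of the assumed form
#   "Task id t16839 inserted into database nqedb."
# Prints nothing if the line does not match that form.
task_id_from_response() {
    printf '%s\n' "$1" |
        sed -n 's/^Task id \(t[0-9][0-9]*\) inserted into database .*$/\1/p'
}

# Demonstration with the response text quoted on the slide.
response='Task id t16839 inserted into database nqedb.'
task_id=$(task_id_from_response "$response")
echo "$task_id"
```

In a real session the response string would come from cqsub itself, and the captured id could then be handed to cqstatl. An empty result would suggest that no task id was returned.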

Monitoring Jobs
• cqstatl <taskid>
  – cqstatl -a | grep (if no )
• ST column (“status”) indicates progress
  – NNew, NPend, NSche: still in NQE
  – NSubm: submitted to NQS
  – NComp: done
  – NTerm: killed
  – NFail: job failed (user or system error)
• IDENTIFIER column holds NQS job id (once submitted)
• cqstatl -f : details for your job
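The ST codes listed on this slide can be kept in a small lookup function for use in monitoring scripts. This is a minimal Bourne-shell sketch; the function name nqe_status_meaning is hypothetical, and the table covers only the codes named on the slide.

```shell
#!/bin/sh
# Hypothetical helper: translate an NQE ST code into the meaning
# given on the slide. Codes not listed there map to "unknown status".
nqe_status_meaning() {
    case "$1" in
        NNew|NPend|NSche) echo "still in NQE" ;;
        NSubm)            echo "submitted to NQS" ;;
        NComp)            echo "done" ;;
        NTerm)            echo "killed" ;;
        NFail)            echo "job failed (user or system error)" ;;
        *)                echo "unknown status" ;;
    esac
}

# Demonstration on a few codes seen in the quick example.
for st in NNew NSubm NComp; do
    printf '%s: %s\n' "$st" "$(nqe_status_meaning "$st")"
done
```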

Monitoring Jobs (cont’d)
• T3E: qstat once your job reaches NQS
  – cqstatl -d nqs = qstat
  – qstat -au (if no )
• J90: qstat -h
  – Find hostname from NQS id (from cqstatl)
  – e.g., 2861.seymour
• ST column (“status”) now indicates
  – RNN: Running (with NN processes)
  – Qxy: waiting in the queue (xy encodes reason)
      • man qstat to decode
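Finding the hostname from an NQS id amounts to taking the part after the first dot. A minimal Bourne-shell sketch follows; the function name nqs_host is hypothetical, and it assumes ids always look like the slide's example, number.hostname.

```shell
#!/bin/sh
# Hypothetical helper: print the host part of an NQS request id such as
# "2861.seymour". Returns non-zero (and prints nothing) if there is no dot.
nqs_host() {
    case "$1" in
        *.*) printf '%s\n' "${1#*.}" ;;   # strip everything up to the first dot
        *)   return 1 ;;
    esac
}

# Demonstration with the id from the slide.
nqs_host 2861.seymour
```

The printed name is the J90 host on which the request can be queried.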

    % cqstatl -a
    -------------- NQE 3.3.0.9 Database Task Summary --------------
    IDENTIFIER             NAME     SYSTEM-OWNER    OWNER    LOCATION      ST
    ---------------------- -------- --------------- -------- ------------- -----
    t48217 (46356.mcurie)  PCM      lws.mcurie      alewife  nqs@mcurie    NSubm
    t48713 (46848.mcurie)  third    lws.mcurie      u6670    nqs@mcurie    NSubm
    t49200 (47518.mcurie)  int566A  lws.mcurie      u61176   nqs@mcurie    NSubm
    t49245 (47368.mcurie)  xqcd_ho  lws.mcurie      snm      nqs@mcurie    NSubm
    t50349 (48480.mcurie)  int650   lws.mcurie      u61176   nqs@mcurie    NSubm
    t50881 (49338.mcurie)  lte34-0  lws.mcurie      lungfish nqs@mcurie    NSubm
    t51870                 case17c  scheduler.main  salmon   NQE Database
    t51871                 case1c9  scheduler.main  salmon   NQE Database
    t51872                 case16c  scheduler.main  salmon   NQE Database
    t51873 (49967.mcurie)  q_lsms   lws.mcurie      marlin   nqs@mcurie
    t51875                 case11c  scheduler.main  salmon   NQE Database
    t51877 (49970.mcurie)  G08      lws.mcurie      u66870   nqs@mcurie
    t51878 (49971.mcurie)  q.Hsig.3 lws.mcurie      bass     nqs@mcurie
    t51881 (49975.mcurie)  Jobge_b  lws.mcurie      carp     nqs@mcurie
    t51884 (49979.mcurie)  job16.a  lws.mcurie      adt      nqs@mcurie
    t51885 (49980.mcurie)  run_dyn  lws.mcurie      flounder nqs@mcurie
    t51886 (49981.mcurie)  jupiter  lws.mcurie      grouper  nqs@mcurie
    t51887 (49983.mcurie)  Job.CZ.b lws.mcurie      tarpon   nqs@mcurie
    (output greatly abridged; ST values listed on the slide for the last
     twelve tasks: NTerm, NFail, NPend, NSubm, NSubm, NComp)

    % qstat -a
    ---------------- NQS 3.3.0.9 BATCH REQUEST SUMMARY ----------------
    IDENTIFIER    NAME       USER      LOCATION/QUEUE       JID   PRTY  REQMEM  REQTIM  ST
    ------------- ---------- --------- -------------------- ----- ----- ------- ------- ---
    49979.mcurie  job16.ag   adt       pe32@mcurie          4164  25    255     1520    R03
    49936.mcurie  akr520     u6677     pe32@mcurie          3732  25    323     1800    R03
    49964.mcurie  case14c9   salmon    pe32@mcurie          3944  25    255     1795    R03
    49967.mcurie  q_lsms     marlin    pe32@mcurie                999   28672   1800    Cge
    49983.mcurie  Job.CZ.bb  tarpon    pe32@mcurie                317   28672   1800    Qge
    49984.mcurie  bitgc11    u62098    pe32@mcurie                244   28672   1800    Qge
    49985.mcurie  bitgc11    u62098    pe32@mcurie                242   28672   1800    Qge
    49362.mcurie  Job_a2     carp      pe128@mcurie         5308  25    323     1800    R03
    49335.mcurie  script.2   sturgeon  pe256@mcurie               999   28672   1800    Qqs
    49033.mcurie  uo2_3h2o   dorado    gc128@mcurie               ---   28672   7200    Hop
    49255.mcurie  run010_A   bluegill  long128@mcurie       4617  25    255     1800    R03
    49276.mcurie  sg3D10     aku       long128@mcurie             999   4096    1800    Qce
    49277.mcurie  sg3D10     aku       long128@mcurie             999   4096    1800    Qqu
    49867.mcurie  run_t4     flounder  long128@mcurie             70    28672   1800    Cgg
    no pipe queue entries
    (output greatly abridged)

    % qstat -f pe32
    ------------------ NQS 3.3.0.9 BATCH QUEUE: pe32@mcurie ------------------
    Status: ENABLED/RUNNING
    Priority: 15
    Total: 17
    Running: 5    Queued: 12    Waiting: 0
    Holding: 0    Arriving: 0   Exiting: 0
    Miser Queue: unspecified
    [resource table, continued over two slides, with LIMIT/ALLOCATED and
     PER-PROCESS/PER-REQUEST columns for: Memory Size, Quick File Space,
     MPP Processor Elements, type a through h Tape Drives, Core File Size,
     Data Size, Permanent File Space, Nice Increment, Stack Size,
     CPU Time Limit, Temporary File Space, Working Set Limit, MPP Time Limit,
     Shared Memory Segments, MPP Memory Size]
    Route: Pipe Only
    Users: Unrestricted
    System Time: 3563114615067464.00 secs
    User Time: 281421545294442428.00 secs

Troubleshooting
• No task id returned
  – Typically means NQE down
  – message like “Can’t connect”
• Job doesn’t make it to NQS: try cqstatl
  – NFail usually indicates submission error
  – Nabort could be a system problem
  – No listing if many days old (NQE database is purged frequently)
• Stuck in NPend status
  – J90: Many jobs ahead of you?
  – T3E: over pipe queue limit?

Troubleshooting (cont’d)
• Stuck in NSubm: use qstat
  – Q: normal on T3E, rare on J90
  – T3E:
      • Hop can be allocation problem
      • C (“checkpointed”) may be daily shuffling
      • May need both pslist and qstat -m to sort it all out
• Job crashes
  – Read job log, stdout, stderr
  – ... limit exceeded: ran out of time (or memory, or…)
• Job vanishes
  – Did machine(s) crash? If not, collect info and contact Consultants
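When sorting out a job stuck in NSubm, it can help to turn the qstat ST code into a plain description. The sketch below is a Bourne-shell illustration that uses only the readings given in these slides (RNN, Qxy, Hop, C); the function name qstat_state is hypothetical, and any other code is reported as unrecognized rather than guessed at.

```shell
#!/bin/sh
# Hypothetical helper: describe a qstat ST code using the readings
# from the slides. "man qstat" remains the reference for reason codes.
qstat_state() {
    case "$1" in
        Hop)         echo "possible allocation problem" ;;
        R[0-9][0-9]) echo "running with ${1#R} processes" ;;
        Q??)         echo "waiting in the queue (reason code ${1#Q})" ;;
        C??)         echo "checkpointed (may be daily shuffling)" ;;
        *)           echo "unrecognized" ;;
    esac
}

# Demonstration on codes that appear in the sample qstat -a output.
for st in R03 Qge Cge Hop; do
    printf '%s: %s\n' "$st" "$(qstat_state "$st")"
done
```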

Pointers
• Batch job is like a login session
  – Starts in your home directory
  – Uses your startup files
  – But doesn’t inherit environment (unless you use -x)
• Environment variable ENVIRONMENT
  – Not set in interactive work, set to BATCH in batch jobs
  – Can exclude parts of startup files
• /usr/tmp faster than home directory
  – $TMPDIR vanishes (avoids littering)
  – Just one quota for $TMPDIR, rest of /usr/tmp/
  – Can’t monitor batch J90 temp file systems
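One way to exclude parts of a startup file is to test ENVIRONMENT before any interactive-only setup. The following is a minimal sketch for a Bourne/Korn startup file; the function name is_batch_job and the prompt setting are illustrative assumptions, and a C shell startup file would need the equivalent csh syntax instead.

```shell
#!/bin/sh
# Hypothetical helper: true only when ENVIRONMENT is set to BATCH,
# which the slides say happens in batch jobs and not in interactive work.
is_batch_job() {
    [ "${ENVIRONMENT:-}" = "BATCH" ]
}

if is_batch_job; then
    : # batch job: skip terminal-only setup
else
    # Interactive-only setup goes here (illustrative example).
    PS1='% '
fi
```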

Pointers (cont’d)
• Don’t submit blindly
  – Debug executables, scripts first
  – Don’t trust inherited shell scripts
  – Spend time with man pages
• J90: large memory jobs should/must multitask
• T3E: reduce serial time in parallel jobs
  – “Stage” HPSS retrievals (dmget)
  – Submit follow-on serial jobs within your job