- Number of slides: 24
NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER
Using the Batch System at NERSC
Mark Durst, NERSC/USG
ERSUG Training, Argonne, IL
28 April 1999
Outline
• Quick example
• How batch processing works
• Batch and pipe queues
• How to submit jobs
• Monitoring jobs
• Reminders and Pointers
#!/bin/csh
#
# file: simple1
#
#QSUB -q serial
#QSUB -J y      # keep job log

set myname=`whoami`
set now=`date`
set mylocn=`pwd`

echo ""
echo "Hello $myname, this is your shell script $0,"
echo "running at $now."
echo ""
echo "Your current directory is $mylocn, which should"
echo "be the same as $HOME."
echo ""
echo "I'm going to sleep now."
echo ""
sleep 90
exit
% cqsub simple1
Task id t51847 inserted into database nqedb.
% cqstatl t51847
---------------------------------
NQE 3.3.0.9 Database Task Summary
---------------------------------
IDENTIFIER             NAME     SYSTEM-OWNER            LOCATION      ST
---------------------  -------  ----------------------  ------------  -----
t51847                 simple1  scheduler.main mjdurst  NQE Database  NNew
% cqstatl t51847
---------------------------------
NQE 3.3.0.9 Database Task Summary
---------------------------------
IDENTIFIER             NAME     SYSTEM-OWNER            LOCATION      ST
---------------------  -------  ----------------------  ------------  -----
t51847                 simple1  scheduler.main mjdurst  NQE Database  NPend
% cqstatl t51847
---------------------------------
NQE 3.3.0.9 Database Task Summary
---------------------------------
IDENTIFIER             NAME     SYSTEM-OWNER            LOCATION      ST
---------------------  -------  ----------------------  ------------  -----
t51847                 simple1  lws.mcurie mjdurst      NQE Database  NSche
% cqstatl t51847
---------------------------------
NQE 3.3.0.9 Database Task Summary
---------------------------------
IDENTIFIER             NAME     SYSTEM-OWNER            LOCATION      ST
---------------------  -------  ----------------------  ------------  -----
t51847 (49939.mcurie)  simple1  lws.mcurie mjdurst      nqs@mcurie    NSubm
% qstat 49939
---------------------------------
NQS 3.3.0.9 BATCH REQUEST SUMMARY
---------------------------------
IDENTIFIER    NAME     USER     LOCATION/QUEUE       JID   PRTY  REQMEM  REQTIM  ST
------------  -------  -------  -------------------  ----  ----  ------  ------  ---
49939.mcurie  simple1  mjdurst  serial_short@mcurie  3753  25    364     1800    R03
% qstat 49939
nqs-100 qstat: CAUTION
  Request <49939>: not found.
% cqstatl t51847
---------------------------------
NQE 3.3.0.9 Database Task Summary
---------------------------------
IDENTIFIER             NAME     SYSTEM-OWNER            LOCATION      ST
---------------------  -------  ----------------------  ------------  -----
t51847 (49939.mcurie)  simple1  monitor.main mjdurst    NQE Database  NComp
% ls -l
total 12
[long listing: four files owned by mjdurst, group mpccc, dated Jan 15 10:47 to 10:50; permission and timestamp columns not recoverable per file]
simple1*           365
simple1.e51847       0
simple1.l51847    1285
simple1.o51847    2638
% cat simple1.l51847
01/15 10:48:13 Submitting to queue
% cat simple1.o51847
mcurie.nersc.gov, a Cray T3E-900 running UNICOS/mk 2.0.3.32
---------------Contact Information---------------
NERSC Web    http://www.nersc.gov/
ESnet Web    http://www.es.net/
ESCHER Web   http://www.nersc.gov/hardware/servers/vis-server.html
Why Batch Processing?
• Batch queues are necessary:
  – On systems with many jobs
  – When scheduling is difficult
  – To assure greater throughput
• Interactive jobs are limited
  – J90: 10 hrs.
  – T3E: < 64 PEs, < 30 minutes parallel (1 hr serial)
• Some machines/processors batch-only
  – J90: all batch machines
  – T3E: many APP PEs (at night, almost all)
The Batch Process
• User creates shell script myscript
• Submits to NQE with cqsub myscript
  – Returns NQE task id (e.g., t4913)
• NQE forwards to NQS
  – J90: selects a machine (J90 wait time here)
• NQS runs the job
  – Assign NQS job id (e.g., 6859.mcurie)
  – Select a batch queue
  – Place the job there (T3E wait time here)
  – Run it when appropriate
• NQS/NQE returns job logs at completion
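
The two identifiers in this process can be tied together from a cqstatl task line: in the quick example, the NQS job id appears in parentheses after the NQE task id once NQS has accepted the job. The following is a minimal Bourne-shell sketch that assumes that line layout. The function name nqs_id_from_task_line is hypothetical and is not part of NQE.

```shell
#!/bin/sh
# Hypothetical helper (not part of NQE): print the NQS job id found in
# parentheses on a cqstatl task line, or nothing if the task has not
# yet been handed to NQS.
nqs_id_from_task_line() {
    # Keep only the text inside the first "(...)"; print nothing otherwise.
    printf '%s\n' "$1" | sed -n 's/^[^(]*(\([^)]*\)).*$/\1/p'
}

# Sample task lines copied from the quick example.
nqs_id_from_task_line "t51847 (49939.mcurie) simple1 lws.mcurie mjdurst nqs@mcurie NSubm"
nqs_id_from_task_line "t51847 simple1 scheduler.main mjdurst NQE Database NNew"
```

The id printed for the first line is the one to pass to qstat; the second line produces no output because the task is still inside NQE.
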
Pipe Queues
• Groups of batch queues
  – Direct to a pipe with #QSUB -q serial
  – Default is production
• To see them: qstat -p
• T3E:
  – serial, debug, production, long
• J90:
  – production
  – batchk (for evening, weekend killeen queues)
  – batch{b,f,s,c,j} (not recommended)
Preparing for Batch Submission
• Write your shell script
  – C shell or Bourne/Korn shell
  – Starts in user’s home directory
• Debug interactively (if possible)
• Decide on needed resources
  – J90: CPU time, memory
  – T3E: amount of parallel, serial time; number of PEs
• Select other #QSUB options
• Check for appropriate queue and submit
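
As a starting point for the first step, here is a hedged sketch of a Bourne-shell job script. It uses only the two #QSUB directives that appear in the quick example (-q serial and -J y); resource-limit options are deliberately left out. WORKDIR is a hypothetical variable introduced for illustration, and it falls back to $HOME so the script also runs unchanged during interactive debugging.

```shell
#!/bin/sh
# Sketch of a Bourne-shell batch script. The #QSUB lines are ordinary
# comments to the shell and are read as directives at submission.
#QSUB -q serial
#QSUB -J y

# A batch job starts in the home directory, so move to the directory
# that holds the input files first. WORKDIR is a hypothetical name;
# it defaults to $HOME when unset.
workdir=${WORKDIR:-$HOME}
cd "$workdir" || { echo "cannot cd to $workdir" >&2; exit 1; }

echo "job started in `pwd` at `date`"
# ... run the real program here ...
echo "job finished at `date`"
```

Running the script directly at the prompt before submitting it is one way to follow the "debug interactively" advice.
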
Essential options to cqsub (#QSUB directives)
• J90:
  – – -l. M
Other cqsub options
• -J y : save job log (recommended)
• -j
Job Submission
• cqsub
Monitoring Jobs
• cqstatl
Monitoring Jobs (cont’d)
• T3E: qstat
% cqstatl -a
---------------------------------
NQE 3.3.0.9 Database Task Summary
---------------------------------
IDENTIFIER             NAME     SYSTEM-OWNER         LOCATION    ST
---------------------  -------  -------------------  ----------  -----
t48217 (46356.mcurie)  PCM      lws.mcurie alewife   nqs@mcurie  NSubm
t48713 (46848.mcurie)  third    lws.mcurie u6670     nqs@mcurie  NSubm
t49200 (47518.mcurie)  int566A  lws.mcurie u61176    nqs@mcurie  NSubm
t49245 (47368.mcurie)  xqcd_ho  lws.mcurie snm       nqs@mcurie  NSubm
t50349 (48480.mcurie)  int650   lws.mcurie u61176    nqs@mcurie  NSubm
t50881 (49338.mcurie)  lte34-0  lws.mcurie lungfish  nqs@mcurie  NSubm
% qstat -a
---------------------------------
NQS 3.3.0.9 BATCH REQUEST SUMMARY
---------------------------------
IDENTIFIER    NAME       USER      LOCATION/QUEUE  JID   PRTY  REQMEM  REQTIM  ST
------------  ---------  --------  --------------  ----  ----  ------  ------  ---
49979.mcurie  job16.ag   adt       pe32@mcurie     4164  25    255     1520    R03
49936.mcurie  akr520     u6677     pe32@mcurie     3732  25    323     1800    R03
49964.mcurie  case14c9   salmon    pe32@mcurie     3944  25    255     1795    R03
49967.mcurie  q_lsms     marlin    pe32@mcurie           999   28672   1800    Cge
49983.mcurie  Job.CZ.bb  tarpon    pe32@mcurie           317   28672   1800    Qge
49984.mcurie  bitgc11    u62098    pe32@mcurie           244   28672   1800    Qge
49985.mcurie  bitgc11    u62098    pe32@mcurie           242   28672   1800    Qge
49362.mcurie  Job_a2     carp      pe128@mcurie    5308  25    323     1800    R03
49335.mcurie  script.2   sturgeon  pe256@mcurie          999   28672   1800    Qqs
49033.mcurie  uo2_3h2o   dorado    gc128@mcurie          ---   28672   7200    Hop
49255.mcurie  run010_A   bluegill  long128@mcurie  4617  25    255     1800    R03
49276.mcurie  sg3D10     aku       long128@mcurie        999   4096    1800    Qce
49277.mcurie  sg3D10     aku       long128@mcurie        999   4096    1800    Qqu
49867.mcurie  run_t4     flounder  long128@mcurie        70    28672   1800    Cgg
no pipe queue entries
(output greatly abridged)
% qstat -f pe32
------------------
NQS 3.3.0.9 BATCH QUEUE: pe32@mcurie
------------------
Status: ENABLED/RUNNING
Priority:
[qstat -f resource limit table; the value columns were interleaved and cannot be matched to rows]
Limit names:
  type e Tape Drives
  type f Tape Drives
  type g Tape Drives
  type h Tape Drives
  Core File Size
  Data Size
  Permanent File Space
  Memory Size
  Nice Increment
  Quick File Space
  Stack Size
  CPU Time Limit
  Temporary File Space
  Working Set Limit
  MPP Processor Elements
  MPP Time Limit
  Shared Memory Segments
  MPP Memory Size
Values, in extracted order:
  unspecified unspecified 20gb 28mw 5 unspecified 3600sec unspecified (0) (0) (256mw) 25gb 29mw (0b) (256mw) 0b 7200sec unspecified (0b) (256mw) 32 15000sec unspecified (0mw) unspecified (0) unlimited 15000sec unspecified (256mw)
Troubleshooting
• No task id returned
  – Typically means NQE down
  – Message like “Can’t connect”
• Job doesn’t make it to NQS: try cqstatl
Troubleshooting (cont’d)
• Stuck in NSubm: use qstat
  – Q: normal on T3E, rare on J90
  – T3E:
    • Hop can be allocation problem
    • C (“checkpointed”) may be daily shuffling
    • May need both pslist and qstat -m to sort it all out
• Job crashes
  – Read job log, stdout, stderr
  – ... limit exceeded: ran out of time (or memory, or…)
• Job vanishes
  – Did machine(s) crash? If not, collect info and contact Consultants
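
The status hints on this slide can be collected into a small lookup keyed on the first letter of the qstat ST column. This is a sketch only: the function name nqs_status_hint is hypothetical, and reading R as running, Q as queued and H as held is an assumption based on common NQS usage, since the slide itself spells out only Q, Hop and C.

```shell
#!/bin/sh
# Hypothetical helper: map the leading letter of a qstat ST code
# (for example R03, Qge, Hop, Cgg) to the troubleshooting hint above.
# The R/Q/H readings are assumptions; only Q, Hop and C are in the notes.
nqs_status_hint() {
    case "$1" in
        R*) echo "running" ;;
        Q*) echo "queued: normal on T3E, rare on J90" ;;
        H*) echo "held: Hop can be an allocation problem" ;;
        C*) echo "checkpointed: may be daily shuffling" ;;
        *)  echo "unrecognized status: check qstat output" ;;
    esac
}

# Status codes taken from the qstat -a listing shown earlier.
nqs_status_hint R03
nqs_status_hint Qge
nqs_status_hint Hop
nqs_status_hint Cgg
```
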
Pointers
• Batch job is like a login session
  – Starts in your home directory
  – Uses your startup files
  – But doesn’t inherit environment (unless you use -x)
• Environment variable ENVIRONMENT
  – Not set in interactive work, set to BATCH in batch jobs
  – Can exclude parts of startup files
• /usr/tmp faster than home directory
  – $TMPDIR vanishes (avoids littering)
  – Just one quota for $TMPDIR, rest of /usr/tmp/
  – Can’t monitor batch J90 temp file systems
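
The ENVIRONMENT pointer can be sketched as a startup-file fragment. This version is for Bourne/Korn-shell users and assumes a startup file such as ~/.profile; the helper name session_kind is hypothetical. It relies only on the behavior stated above: ENVIRONMENT is unset in interactive work and set to BATCH in batch jobs.

```shell
#!/bin/sh
# Fragment for a Bourne/Korn-shell startup file (for example ~/.profile).
# session_kind is a hypothetical helper name.
session_kind() {
    # ENVIRONMENT is unset interactively and BATCH inside a batch job.
    if [ "${ENVIRONMENT:-}" = "BATCH" ]; then
        echo batch
    else
        echo interactive
    fi
}

if [ "`session_kind`" = "interactive" ]; then
    # Interactive-only setup (terminal settings, prompts, banners)
    # goes here, so that batch jobs skip it.
    :
fi
```

Keeping terminal-related commands inside the interactive branch is one way to stop them from cluttering batch job output.
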
Pointers (cont’d)
• Don’t submit blindly
  – Debug executables, scripts first
  – Don’t trust inherited shell scripts
  – Spend time with man pages
• J90: large memory jobs should/must multitask
• T3E: reduce serial time in parallel jobs
  – “Stage” HPSS retrievals (dmget)
  – Submit follow-on serial jobs within your job