Monitoring Your Job

Job Monitoring

If you want to see the status of the jobs that you have submitted then you can use the qstat command. For example

qstat -u $USER

will show you the status of all of your jobs. If you type

qstat

then you will see a list of every job running on the cluster. Once you have a Job ID you can then use the command

qstat -f JOBID

to get details on how your job is going, including elapsed time, CPU time and maximum memory usage. To see the full list of available options, type man qstat.
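As a rough illustration, the output includes Torque's resources_used fields; the job ID and values below are made up for the example:

[z1234567@katana ~]$ qstat -f 1234567
Job Id: 1234567.katana
    Job_Name = hard_maths
    job_state = R
    resources_used.cput = 07:50:12
    resources_used.mem = 1043256kb
    resources_used.vmem = 1292480kb
    resources_used.walltime = 07:54:03
    ...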

If your job is sitting in the queue and not starting, you can see what is stopping it from running using the checkjob command. Type

checkjob JOBID

to see a summary of the status of a particular job. If something is stopping your job from running, there will be a message near the bottom of the output explaining why.
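The exact wording of these messages depends on the scheduler version, but for a job waiting on free cores you might see something along these lines (illustrative only):

[z1234567@katana ~]$ checkjob 1234567
...
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 8 procs found)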

If you want even more detail you can use the tracejob command by running

tracejob JOBID

which will then give you information on the history of the job from the scheduler's point of view.
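Each line of tracejob output is a timestamped entry pulled from the Torque logs, flagged with its source (S for the server, M for the MOM daemon on the compute node, A for accounting). A shortened, made-up example:

[z1234567@katana ~]$ tracejob 1234567

Job: 1234567.katana

01/10/2016 04:28:55  S    enqueuing into maths12, state 1 hop 1
01/10/2016 04:28:55  S    Job Queued at request of z1234567@katana, job name = hard_maths, queue = maths12
01/10/2016 04:29:10  S    Job Run at request of maui@katana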

When your job is running, STDOUT (what you would normally see on the screen if your job was run from the command line) and STDERR (any error messages that are generated) are captured, and the resulting files are copied to your home directory on the cluster when the job completes.
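By default Torque names these files <jobname>.o<jobid> for STDOUT and <jobname>.e<jobid> for STDERR, so for a job called hard_maths with a (made-up) job ID of 1234567 you would expect to find something like:

[z1234567@katana ~]$ ls hard_maths.*
hard_maths.e1234567  hard_maths.o1234567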

The qpeek command allows you to see those files whilst the job is running. Simply type

qpeek -f JOBID

for STDOUT and

qpeek -e JOBID

to see the STDERR.
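Because qpeek writes the file contents to the screen, you can pipe it through standard Linux tools; for example, to look at just the last 20 lines of STDOUT for a (made-up) job ID:

[z1234567@katana ~]$ qpeek -f 1234567 | tail -20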

To see a complete list of jobs running and queued you can use the

showq

command. Typing it will show you jobs in three groups:

  • active: currently running on a compute node.
  • idle: waiting for the required resources to become available on a node that the job can use.
  • blocked: prevented from running because the number of cores or amount of memory available to the individual or research group is already fully in use.

For more information have a look at the job scheduling and queues page.
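To give a feel for the layout, here is a heavily abbreviated, made-up sketch of the Maui-style output, which lists each group in turn and ends with a summary:

[z1234567@katana ~]$ showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME
3721743[36]        z1234567    Running     8     5:51:34  Mon Jan 10 04:29:10
...
Total Jobs: 120   Active Jobs: 90   Idle Jobs: 25   Blocked Jobs: 5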

The

pestat

command allows you to list all the nodes of the cluster along with the jobs running on each node, node memory usage, node load and whether a node is able to accept more jobs.
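The exact columns vary between pestat versions, but the output is one line per node, roughly like this (entirely illustrative):

[z1234567@katana ~]$ pestat | head -3
node     state  load    pmem  ncpu  mem     resi  usrs  tasks  jobids/users
kc05b15  busy   8.02   64000     8  64000  12040   1/1      1  3721743[36] z1234567
kc05b16  free   0.00   64000     8  64000    420   0/0      0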

Monitoring Jobs Manually (Useful for Array Jobs)

There are times when a different approach is desired or required. For example:

  • Some job management commands won't work properly if you are running an array job.
  • It can be useful to monitor exactly what resources your job is using at a specific time.
  • You can see exactly what your job is doing.

In these situations the answer is to log in to the compute node running your job and look at things there.

The steps in the process are:

  1. List your current jobs using the qstat -u $USER command.
  2. Show what node(s) the running job is using via the showres -n command. The node(s) will have a name that looks like kcXXbYY where XX and YY are numbers.
  3. Log on to the compute node using ssh.
  4. Use the tail command, or other Linux commands, to look at the output.

Here is an example of looking at the most recent output of an array job.

[z1234567@katana ~]$ qstat -u $USER -t | head -20
 
katana.local: 
                                                                         Req'd  Req'd   Elap
Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
3721743[34].kata     z1234567    maths12   hard_maths          14896     2      8    --  12:00 R 07:54
3721743[35].kata     z1234567    maths12   hard_maths          13496     2      8    --  12:00 R 06:58
3721743[36].kata     z1234567    maths12   hard_maths          21601     2      8    --  12:00 R 06:08
3721743[37].kata     z1234567    maths12   hard_maths          31726     2      8    --  12:00 R 06:09
3721743[38].kata     z1234567    maths12   hard_maths          14921     2      8    --  12:00 R 06:09
3721743[41].kata     z1234567    maths12   hard_maths          40478     2      8    --  12:00 R 05:16
3721743[42].kata     z1234567    maths12   hard_maths          40733     2      8    --  12:00 R 05:15
3721743[43].kata     z1234567    maths12   hard_maths          41087     2      8    --  12:00 R 04:56
3721743[44].kata     z1234567    maths12   hard_maths          45222     2      8    --  12:00 R 04:21
3721743[45].kata     z1234567    maths12   hard_maths          7433     2      8    --  12:00 R 04:18
3721743[46].kata     z1234567    maths12   hard_maths          2965     2      8    --  12:00 R 03:50
3721743[47].kata     z1234567    maths12   hard_maths          2967     2      8    --  12:00 R 03:48
3721743[48].kata     z1234567    maths12   hard_maths          9425     2      8    --  12:00 R 03:23
3721743[49].kata     z1234567    maths12   hard_maths          9755     2      8    --  12:00 R 03:20
3721743[50].kata     z1234567    maths12   hard_maths          388     2      8    --  12:00 R 03:17
[z1234567@katana ~]$ showres -n | grep '3721743\[36\]'
             kc05b15        Job        3721743[36]    Running    4    -6:07:32    12:00:00  Mon Jan  10 04:29:10
[z1234567@katana ~]$ ssh kc05b15
Warning: No xauth data; using fake authentication data for X11 forwarding.
Last login: Mon Jan  10 10:36:31 2016 from katana.local
Rocks Compute Node
Rocks 6.1 (Emerald Boa)
Profile built 16:35 01-Sep-2015
 
Kickstarted 16:40 01-Sep-2015
[z1234567@kc05b15 ~]$ tail /var/spool/torque/spool/3721743-36.katana.science.unsw.edu.au.OU 
    230194 (99.88%) aligned 0 times
    284 (0.12%) aligned exactly 1 time
    2 (0.00%) aligned >1 times
0.13% overall alignment rate
Converting SAM to BAM \n
[samopen] SAM header is present: 21 sequences.
Sorting BAM file \n
[bam_sort_core] merging from 40 files...
Removing intermediate SAM and BAM files \n Generating simple stats from sorted BAM \n
[z1234567@kc05b15 ~]$ 

See Exactly What is Going On

Once you have logged on you can also use the command

top

or even

htop

to see what your job is currently doing. In the example below z1234567 has 16 Python processes running, all at full CPU utilisation rather than spending time waiting for other things (such as I/O) to happen.

[root@kc05b10 ~]# top
top - 12:25:40 up 694 days, 21:59,  1 user,  load average: 16.00, 16.00, 16.00
Tasks: 441 total,  17 running, 424 sleeping,   0 stopped,   0 zombie
Cpu(s):100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132250816k total, 11809388k used, 120441428k free,   438656k buffers
Swap:  3145720k total,   697084k used,  2448636k free,  8215400k cached
 
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                   
 1334 z1234567  20   0  193m  31m 1048 R 100.0  0.0   2715:00 python                                                                                                                                  
 2887 z1234567  20   0  193m  35m 1096 R 100.0  0.0   9719:26 python                                                                                                                                  
 3350 z1234567  20   0  193m  31m 1048 R 100.0  0.0 457:05.60 python                                                                                                                                  
 8433 z1234567  20   0  193m  63m 1076 R 100.0  0.0   2410:54 python                                                                                                                                  
 9268 z1234567  20   0  193m  35m 1100 R 100.0  0.0   9399:17 python                                                                                                                                  
32445 z1234567  20   0  193m  35m 1088 R 100.0  0.0   8326:01 python                                                                                                                                  
 2053 z1234567  20   0  193m  31m 1048 R 99.6  0.0 510:09.12 python                                                                                                                                   
 2111 z1234567  20   0  193m  65m 1136 R 99.6  0.1 507:40.68 python                                                                                                                                   
 5843 z1234567  20   0  193m  31m 1048 R 99.6  0.0   2526:49 python                                                                                                                                   
 9223 z1234567  20   0  193m  35m 1084 R 99.6  0.0  11649:07 python                                                                                                                                   
21254 z1234567  20   0  193m  35m 1096 R 99.6  0.0   6438:50 python                                                                                                                                   
25398 z1234567  20   0  193m  31m 1048 R 99.6  0.0   3872:47 python                                                                                                                                   
33723 z1234567  20   0  193m  31m 1048 R 99.6  0.0   1307:27 python                                                                                                                                   
35989 z1234567  20   0  193m  35m 1088 R 99.6  0.0  10460:07 python                                                                                                                                   
43998 z1234567  20   0  193m  35m 1096 R 99.6  0.0  10087:35 python                                                                                                                                   
44187 z1234567  20   0  193m  31m 1048 R 99.6  0.0 808:02.18 python                                                                                                                                   
    1 root      20   0 22228 1456 1216 S  0.0  0.0   0:03.29 init                                                                                                                                      
    2 root      20   0     0    0    0 S  0.0  0.0   0:02.33 kthreadd                                                                                                                                  
    3 root      RT   0     0    0    0 S  0.0  0.0   0:18.74 migration/0                                                                                                                               
    4 root      20   0     0    0    0 S  0.0  0.0  46:21.22 ksoftirqd/0
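On a busy node it can be easier to limit the display to your own processes; both top and htop accept a -u flag for this (put your own zID after it):

[z1234567@kc05b10 ~]$ top -u z1234567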