UNSW - Science - HPC

Frequently Asked Questions about Job Status and Reporting

Each of the Faculty of Science clusters has its own status page, which shows the current load on each of the cluster nodes as well as the state of the cluster as a whole.

Apart from the information that you can have emailed to you upon job completion, the best way of getting basic information is to log on to the individual compute nodes and run commands to see exactly what is happening on each node.

For detailed information on using job profiling to determine the best way to run your job, there is a starting page in the HPC Basics area.

What does it mean when a job that's running displays a CPU usage percentage that exceeds the number of cores requested?

This can happen if your job starts more threads than the number of cores you requested in the job script. A common way to do that is to request one core and then start an implicitly multithreaded MATLAB job. This is less of a problem on Katana because we are able to constrain each job to a particular set of cores. It is also possible to produce an unusually high system load if your job performs a significant amount of I/O. If it does, you should consider making use of local scratch.
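One way to avoid starting more threads than you requested is to derive the thread count from the job's allocation. The following is a minimal sketch, not the cluster's recommended method. It assumes a Torque/PBS-style scheduler that sets PBS_NODEFILE (one line per allocated core) and a program that honours OMP_NUM_THREADS. The name allocated_cores is a hypothetical helper, not a command provided by the cluster.

```shell
# Hypothetical helper: count the cores allocated to the current job.
# Assumes PBS_NODEFILE lists one line per allocated core. Outside a job
# (variable unset or file unreadable) it falls back to 1.
allocated_cores() {
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        wc -l < "$PBS_NODEFILE" | tr -d ' '
    else
        echo 1
    fi
}

# Cap OpenMP-style threading at the number of cores actually requested.
OMP_NUM_THREADS="$(allocated_cores)"
export OMP_NUM_THREADS
```

MATLAB does not necessarily read this variable. Starting it with its -singleCompThread option is the usual way to hold a one-core MATLAB job to a single computational thread.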

Why is it taking longer to run my job on a cluster than on my desktop?

Given the rate at which CPUs advance, it doesn't take long for the CPUs in a cluster to be outperformed by more recent desktop chips when processing serial jobs. This is balanced by the number of cores installed in a cluster, along with additional factors such as memory and networking, which make clusters much better for large numbers of serial jobs, parallel processing of jobs and access to more resources than are typically available on the desktop.

How can I see exactly what resources (I/O, CPU, memory and scratch) my job is currently using?

If you run

qstat -nru $USER

then you can see a list of your running jobs and where they are running. You can then use ssh to log on to the individual nodes and run top or dtop to see the load on the node including memory usage for each of the processes on the node. For more detailed information on the resources that your job is using, visit the page on job profiling.
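The log-on-and-look step can be scripted when a job spans several nodes. This is a sketch only: show_node_load is a hypothetical helper, the node names you pass in must come from your own qstat output, and the SSH variable exists purely so the function can be dry-run with echo before it connects anywhere.

```shell
# Hypothetical helper: print a one-shot snapshot of your processes on each node.
# Usage: show_node_load NODE [NODE...]
# Set SSH=echo to print what would be run instead of connecting.
show_node_load() {
    user="${USER:-$(id -un)}"
    for node in "$@"; do
        echo "== $node =="
        # top -b -n 1 prints one batch-mode snapshot and exits;
        # -u limits the listing to the named user's processes.
        "${SSH:-ssh}" "$node" top -b -n 1 -u "$user"
    done
}
```

Calling show_node_load with the node names reported by qstat prints one block per node, which is easier to compare side by side than several interactive top sessions.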

What do the different graphs on the cluster status pages mean?

That is an excellent question. We have written a page on understanding the cluster status to help you understand what everything on the status page means.

How can I check the status of a cluster from my phone or computer at home?

For security reasons, the clusters in the Faculty of Science only allow access from on-campus computers. To work on the clusters or check the status of your jobs from elsewhere, you will first need to connect to the UNSW VPN using the details available on the UNSW IT web site. Once you are connected to the VPN, come back to this site and click on the Status link associated with the cluster you are interested in. All of the clusters have a version of the status page that has been optimised for viewing on a mobile phone, and you can use that view by selecting the correct tab.

What reports on cluster usage are available?

The Faculty of Science HPC team is currently working on improving the usage reports that you can obtain from the clusters. In the interim there are a number of different resources that you can piece together.

On Katana, combining the output of the command diagnose -f with the information on the status page is a good starting point.

What can I look at if I really want to understand what is happening with the scheduling of my jobs?

There are some interesting commands on the page http://www.hpc.science.unsw.edu.au/about/job-management-and-status which will give you a list of who is running what and where.

Then a list of who owns what node is available at http://www.hpc.science.unsw.edu.au/cluster/katana-node-list.

The graphical view of the status of Katana (only available on campus or via the VPN) is at http://katana.science.unsw.edu.au/ganglia. These graphs are most useful when your job is the only one on the node.

To see who is in what group for the purposes of access you can look at the file /usr/local/etc/group on Katana.

Finally, for more detail you can use the command “diagnose -f”, which gives you the calculated weightings, and “diagnose -p” or “showq -i”, which give you the calculated priority of every queued job and how it was calculated.

How can I kill one of my running jobs?

The command qdel allows you to kill a running job on Katana. Use the command qstat to discover the job ID of the job that you wish to delete and then use the command “qdel JOB_ID”, where JOB_ID is the number that you obtained from the qstat command.
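Because qdel acts immediately, it can be worth checking that what you pass it really looks like a job ID. The sketch below is illustrative only: kill_job is a hypothetical wrapper, the accepted pattern (digits, optionally followed by a dot and a server name) is an assumption about how job IDs appear in your qstat output, and the QDEL variable is there so the wrapper can be dry-run with echo.

```shell
# Hypothetical wrapper: refuse anything that does not look like a job ID.
# Assumed ID shape: digits, optionally followed by ".servername".
# Set QDEL=echo to print the ID instead of deleting the job.
kill_job() {
    job_id="$1"
    if ! printf '%s\n' "$job_id" | grep -Eq '^[0-9]+(\.[A-Za-z0-9.-]+)?$'; then
        echo "not a job ID: $job_id" >&2
        return 1
    fi
    "${QDEL:-qdel}" "$job_id"
}
```

With QDEL=echo set, kill_job followed by an ID copied from qstat prints that ID back without touching the job, which is a quick way to confirm you have the right one.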

How can I find out information about my recent jobs?