UNSW - Science

UNSW - Science - HPC

Some Job Tricks and Tips

There are a number of things that you can do to make your job run sooner, and this page lists a collection of them. It is also important to understand the limits that the queueing system places on jobs, so that your job doesn't sit in the queue and never start.

General Tips

Here are a few general tips to get you started.

Keep your jobs under 12 hours if possible

If you request more than 12 hours of WALLTIME then you can only use the nodes bought by your school or research group, or the Faculty of Science. Keeping your job's run time request under 12 hours means that it can run on any node in the cluster.

Two 10 hour jobs will probably finish sooner than one 20 hour job

In fact, if there is spare capacity on Katana, which there is most of the time, six 10 hour jobs will finish before a single 20 hour job will.
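The arithmetic behind this tip can be sketched as follows. This is only an illustration, assuming the work can be divided evenly into independent pieces; the function name `split_walltime` is hypothetical and the 12 hour default is taken from the limit described above. It is not part of any Katana tool.

```python
import math

def split_walltime(total_hours, limit_hours=12):
    """Split a workload into the fewest equal jobs of at most limit_hours each.

    Assumes the work divides evenly into independent pieces.
    Returns (number_of_jobs, hours_per_job).
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    n_jobs = math.ceil(total_hours / limit_hours)
    return n_jobs, total_hours / n_jobs

# A 20 hour workload becomes two 10 hour jobs.
print(split_walltime(20))
```

Each of the resulting jobs requests no more than 12 hours, so each one can be scheduled on any node in the cluster.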

Requesting more resources for your job reduces the number of places where the job can run

The most obvious example is going over the 12 hour limit, which limits the number of compute nodes that your job can run on, but other resource requests have the same effect. For example, specifying the CPU in your job script restricts you to the nodes with that CPU. Memory works the same way: a 128Gb node with a 100Gb job already running has 28Gb free, so a job that requests 20Gb will run there but a job that requests 30Gb will not.
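The memory example above can be written as a small check. This is a simplified sketch: the function name `fits_on_node` is hypothetical, and it ignores any memory that the scheduler or operating system may reserve on a real node.

```python
def fits_on_node(request_gb, node_total_gb, already_used_gb):
    """Return True if a memory request fits in what is left on a node."""
    free_gb = node_total_gb - already_used_gb
    return request_gb <= free_gb

# The example from the text: a 128Gb node with a 100Gb job already running.
print(fits_on_node(20, 128, 100))  # the 20Gb request fits in the 28Gb that is free
print(fits_on_node(30, 128, 100))  # the 30Gb request does not
```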

Running your jobs interactively makes it hard to manage multiple concurrent jobs

If you are currently only running jobs interactively then you should move to batch jobs, which let you submit many jobs that then start, run and finish automatically.

Array jobs are an easy way of submitting a large number of jobs
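As an illustration of how one script can serve every sub-job in an array, here is a minimal sketch. It assumes that the scheduler passes each sub-job its index through an environment variable; the variable name `PBS_ARRAY_INDEX` is an assumption that you should check against the Katana documentation, and the function name and input file names are hypothetical.

```python
import os

def input_for_subjob(inputs, env=None, index_var="PBS_ARRAY_INDEX"):
    """Pick the input that belongs to this array sub-job.

    index_var is an assumed variable name; indices are taken to start at 1.
    """
    env = os.environ if env is None else env
    index = int(env[index_var])
    if not 1 <= index <= len(inputs):
        raise IndexError(f"array index {index} outside 1..{len(inputs)}")
    return inputs[index - 1]

# Hypothetical input files, one per sub-job.
inputs = ["sample_a.dat", "sample_b.dat", "sample_c.dat"]

# Simulate the environment of the second sub-job.
print(input_for_subjob(inputs, env={"PBS_ARRAY_INDEX": "2"}))
```

In a real job the `env` argument would be left out so that the index is read from the environment set up by the scheduler.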

 

Specific Job Tips

The resources that you request can cause problems if they don't match what you have access to. There is a list of Katana nodes that you should look at to see what nodes you have access to. 

If you are not part of certain groups then a job requiring more than 128Gb of memory will only run if it has a WALLTIME of 12 hours or less

The only nodes with more than 128Gb of memory were purchased by the School of Mathematics and Statistics, the UNSW Business School and the Climate Change Research Centre. This means that if you are not a member of one of those groups and you request a run time of 24 hours and 150Gb of memory then your job will just sit in the queue. For information on the nodes that you have access to, have a look at the node ownership page. Even if your job has a run time of less than 12 hours it may take a while to start because there are only a few nodes with this much memory.

Do not submit a job that will run for over 200 hours

If you request a WALLTIME of greater than 200 hours then your job WILL NOT run unless you are a member of the Astrobiology group.
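The limits described so far can be collected into a quick check to run before you submit. This is a sketch of the rules as stated on this page, not the scheduler's actual logic; the function name `check_request` and its parameter names are hypothetical.

```python
def check_request(walltime_hours, mem_gb,
                  owns_large_memory_nodes=False, in_astrobiology=False):
    """Return a list of warnings for a job request, following the limits in the text."""
    warnings = []
    if walltime_hours > 200 and not in_astrobiology:
        warnings.append("over 200 hours: job will not run")
    if walltime_hours > 12:
        warnings.append("over 12 hours: restricted to nodes bought by your group or faculty")
        if mem_gb > 128 and not owns_large_memory_nodes:
            warnings.append("over 128Gb and over 12 hours: job will sit in the queue")
    return warnings

# The example from the text: 24 hours and 150Gb of memory.
for warning in check_request(24, 150):
    print(warning)
```

An empty list means that the request does not hit any of the limits listed here, although the job may still have to wait for free nodes.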

Do not specify the CPU type until you have looked at the node list and considered the reduction in available cores

Specifying the CPU can help your jobs run more consistently, but a long running job (over 12 hours) that specifies a CPU that you don't have access to will never start.

If you want to specify the CPU then you should look at the Katana node list to see what nodes you have access to. 

Requested Core Limits

There is a limit to the number of cores that a research group can use. If you request all of those cores for a single job then it will not start until no-one else in your group is running any jobs. In fact it may never start, because smaller jobs from other users fit into the free cores and keep starting ahead of it while your job continues to wait.
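The effect of the group core limit can be sketched as a simple comparison. The limit of 100 cores used below is a made-up number for illustration only, and the function name `can_start` is hypothetical; check the actual limit for your group.

```python
def can_start(requested_cores, group_limit, cores_in_use_by_group):
    """Return True if a request fits within the group's remaining core allowance."""
    return requested_cores <= group_limit - cores_in_use_by_group

GROUP_LIMIT = 100  # hypothetical limit, not Katana's real value

# Requesting the whole allowance fails as soon as anyone in the group is running a job.
print(can_start(100, GROUP_LIMIT, cores_in_use_by_group=8))
# A smaller request fits alongside the group's other jobs.
print(can_start(50, GROUP_LIMIT, cores_in_use_by_group=8))
```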