UNSW - Science - HPC

Resource Requirements

Whether submitting an interactive job or a batch job, the first step is to declare the job's resource requirements. These resources typically include a number of CPU cores, an amount of memory, and a length of time. Based on this information, the job scheduler will determine when and where the job will run. During the time allocated to the job, the physical resources such as CPU cores and memory will be dedicated to the job for its sole use. If the job uses more memory or time than it requested, the resource manager will kill it. Conversely, if the requested resources far exceed what the job actually needs, the job will still run, but its start will be unnecessarily delayed while it waits for all of the requested resources to become available. Consequently, it is very important to specify resource requirements as accurately as possible.

Summary

The following flags can be used to specify the resource requirements for your job. The nodes flag has a number of different options that can be specified. Further information is given below.

Flag      Meaning
nodes     Number of compute nodes that you want the job to run on. Default is 1 node. Other options appear below.
vmem      The total amount of memory you require. Do NOT use when requesting more than 1 node. Default is 1 GB.
pvmem     The amount of memory per core for a multi-node job.
walltime  The expected run time for the job in the form HH:MM:SS. Default is 1:00:00 (1 hour).

Node options

Flag      Meaning
ppn       Number of CPU cores per compute node requested. Default is 1 core per node.
E7540     Run only on nodes with an Intel E7540 CPU.
X5660     Run only on nodes with an Intel X5660 CPU.
X5675     Run only on nodes with an Intel X5675 CPU.
E52670    Run only on nodes with an Intel E5-2670 CPU.
E52650V2  Run only on nodes with an Intel E5-2650v2 CPU.
X56       Run only on nodes with an Intel X56xx CPU, i.e. X5660 or X5675.
E5        Run only on nodes with an Intel E5-xxxx CPU, i.e. E5-2670 or E5-2650v2.

Compute Nodes, CPU Cores and Processor Selection

The number of CPU cores required by a job is specified with nodes=X:ppn=Y where X represents the number of compute nodes and Y represents the number of CPU cores per node. For example, a single-threaded job requires nodes=1:ppn=1, a multi-threaded job may require nodes=1:ppn=12, and a distributed memory job may require nodes=4:ppn=12.
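The relationship between the two numbers can be sketched in a few lines of shell: the total number of cores requested is simply X multiplied by Y. The variable names here are illustrative only and are not part of PBS.

```shell
# Illustrative values: the distributed memory example from the text.
nodes=4   # X: number of compute nodes
ppn=12    # Y: CPU cores per node

# Total cores requested = nodes x cores per node.
total_cores=$((nodes * ppn))

echo "nodes=${nodes}:ppn=${ppn} requests ${total_cores} CPU cores in total"
```

Working out the total first is a quick sanity check that the request matches the number of threads or processes the program will actually start.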

You can also specify the required type of processor by adding a colon followed by a processor code from the table above. For example, adding :X56 to a node statement will restrict the job to nodes with an Intel X5660 or X5675 CPU:

#PBS -l nodes=2:X56:ppn=12
#PBS -l pvmem=4gb
#PBS -l walltime=12:00:00

Note: If you want to specify the CPU, check the Katana node list to see which nodes you have access to. If you request more than 12 hours of walltime, you can only use the nodes bought by your school or research group, or by the Faculty of Science. A long-running job that specifies a CPU that you don't have access to will never start.

Memory

The amount of memory required by a job is specified with vmem=X where X can be expressed in terms of megabytes or gigabytes. For example, a job with relatively modest memory requirements might use vmem=4gb.

If you are running a job over more than one node then use pvmem=X to request an amount of memory per core.
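Because pvmem is a per-core figure, the total memory a multi-node job asks for is pvmem multiplied by the total number of cores. The following is a minimal sketch of that arithmetic; the values and variable names are illustrative assumptions, and the script only prints the directives rather than submitting anything.

```shell
# Illustrative request: 2 nodes, 12 cores each, 4 GB per core.
nodes=2
ppn=12
pvmem_gb=4

# Total memory across the whole job = per-core memory x total cores.
total_gb=$((nodes * ppn * pvmem_gb))

# Print the matching directives for a job script.
cat <<EOF
#PBS -l nodes=${nodes}:ppn=${ppn}
#PBS -l pvmem=${pvmem_gb}gb
EOF

echo "Total memory requested: ${total_gb}gb"
```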

Note: Because of node ownership, if you are not part of the School of Mathematics and Statistics, UNSW Business School or the Climate Change Research Centre, a job requiring more than 128 GB of memory will only run if it has a walltime of 12 hours or less.

Run-Time

The expected run-time of a job is specified with walltime=HH:MM:SS where H, M and S are hours, minutes and seconds. This is the time required by the job as measured by a wall clock. In other words, it is not time spent executing instructions in the CPU, and it is independent of the number of CPUs requested. Note that long jobs are more constrained than short jobs. See the Job Scheduling section for further details. The actual run time will also vary depending on which compute node the job runs on. The Katana node list provides information, including benchmarks, on the CPUs in the nodes that you are using.
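If you have timed a test run in seconds, a small helper can turn that figure into the HH:MM:SS form. This is only a sketch; the function name to_walltime is made up for this example and is not a PBS command.

```shell
# Convert a run time in seconds to the HH:MM:SS form used by walltime.
to_walltime() {
    secs=$1
    printf '%02d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

# Example: a test run that took 5400 seconds (an hour and a half).
to_walltime 5400   # prints 01:30:00
```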

Note: Unless you have a specific reason to choose a different number, the walltime that you request should be one of 12:00:00, 48:00:00, 100:00:00 or 200:00:00. These numbers represent the longest time that jobs can run in each of the queues.

Note: If possible, break up your jobs so that they need less than 12 hours of run time. That way they will likely start sooner, as they can run on any node.

Note: If you request a WALLTIME of greater than 200 hours then your job WILL NOT run unless you are a member of the Astrobiology group.
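One way to apply the notes above is to round a run-time estimate up to the nearest of the four queue limits. The sketch below assumes the estimate is given in whole hours; pick_walltime is a hypothetical helper name, not a cluster command.

```shell
# Round an estimated run time (whole hours) up to the nearest queue
# limit listed above: 12, 48, 100 or 200 hours.
pick_walltime() {
    est_hours=$1
    for limit in 12 48 100 200; do
        if [ "$est_hours" -le "$limit" ]; then
            echo "${limit}:00:00"
            return 0
        fi
    done
    # More than 200 hours: none of the listed limits covers the estimate.
    echo "estimate exceeds 200 hours" >&2
    return 1
}

pick_walltime 30   # prints 48:00:00
```

The helper fails for estimates above 200 hours, mirroring the note that such jobs will not run for most users.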

Conclusion

All of these resource requirements are passed to the qsub command with the -l (lowercase L) flag. You may use the -l flag multiple times, once for each type of resource requirement, or you can follow a single -l with a comma-separated list of all your resource requirements. If resource requirements for CPU cores, memory or time are not explicitly provided then default values will be used: nodes=1:ppn=1,vmem=1gb,walltime=1:00:00.
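The two equivalent ways of passing resource requirements can be sketched as follows. The script name myjob.pbs and the resource values are placeholders chosen for illustration, and the commands are only printed here rather than executed.

```shell
# Hypothetical job script name; replace with your own.
job_script=myjob.pbs

# Form 1: one -l flag per resource requirement.
cmd_separate="qsub -l nodes=1:ppn=12 -l vmem=24gb -l walltime=12:00:00 ${job_script}"

# Form 2: a single -l flag followed by a comma-separated list.
cmd_combined="qsub -l nodes=1:ppn=12,vmem=24gb,walltime=12:00:00 ${job_script}"

echo "$cmd_separate"
echo "$cmd_combined"
```

Both forms request the same resources, so the choice between them is a matter of readability.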

Finally, please note that a job requesting more resources than are physically available will not necessarily be rejected. For example, a job that requests 64 CPU cores in a single compute node (nodes=1:ppn=64) will not run because there is no node with that many CPU cores, but the job will remain in the queue indefinitely, waiting for a node that does not exist. The pestat command provides a useful summary of the physical resources available in the cluster, and the checkjob command can be helpful when determining why a job is stuck in the queue.
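Since the scheduler will not reject an impossible request, it can be worth checking the request yourself before submitting. The sketch below compares a requested ppn against the largest core count you have seen in the pestat output. The function name ppn_fits is hypothetical, and the value 32 is a placeholder rather than a real Katana figure.

```shell
# Succeeds if the requested cores per node fit within the largest node.
# max_ppn is a value you supply yourself after reading the pestat output.
ppn_fits() {
    requested_ppn=$1
    max_ppn=$2
    [ "$requested_ppn" -le "$max_ppn" ]
}

# 32 is a placeholder for the largest per-node core count on the cluster.
if ppn_fits 64 32; then
    echo "request can be satisfied by at least one node"
else
    echo "no node has 64 cores; this job would never start"
fi
```

For a job that is already queued, running checkjob with the job's ID as its argument is the usual starting point for finding out why it has not started.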

Additional Notes

If you are running a MATLAB job, see the MATLAB page for MATLAB-specific resource requirement information.