Simple Batch Jobs

A batch job is a script that runs autonomously on a compute node. The script must contain the complete sequence of commands needed to finish a task without any input from the user. Typically these are shell scripts, written in a language such as bash. For example, the following script simply executes a pre-compiled program in the user's home directory:

#!/bin/bash
 
cd $HOME
 
./myprogram

This job can be submitted to the cluster with the qsub command. Assuming the script is saved as myjob.pbs, the following command will submit the job with the default resource requirements (1 CPU core, 1GB of memory and 1 hour of walltime):

z1234567@katana:~$ qsub myjob.pbs
1237.katana.science.unsw.edu.au

As with interactive jobs, the -l (lowercase L) flag can be used to specify resource requirements for the job:

z1234567@katana:~$ qsub -l nodes=1:ppn=1,vmem=4gb,walltime=12:00:00 myjob.pbs
1238.katana.science.unsw.edu.au

Job Scripts

Job scripts offer a much more convenient method for invoking any of the options that can be passed to qsub on the command-line. In a shell script, a line starting with # is a comment and will be ignored by the shell interpreter. However, in a job script, a line starting with #PBS can be used to pass options to the qsub command.

The sections below examine the different parts of a job script in more detail. As a first example, the previous job could be rewritten as:

#!/bin/bash
 
#PBS -l nodes=1:ppn=1
#PBS -l vmem=4gb
#PBS -l walltime=12:00:00
 
cd $HOME
 
./myprogram

Then the script can be submitted with much less typing on the command-line:

z1234567@katana:~$ qsub myjob.pbs
1239.katana.science.unsw.edu.au

Unlike submitting an interactive job, which results in a login session ready to accept commands, submitting a batch job simply returns the ID of the new job. This is confirmation that the job was submitted successfully; the job is now in the hands of the job scheduler and the resource manager. Commands for checking the status of the job can be found in the Job Monitoring section.
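For a quick check straight away, the standard qstat command accepts the job ID returned by qsub (job monitoring is covered in detail in the Job Monitoring section):

z1234567@katana:~$ qstat 1239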

Notifications

If you wish to be notified by email about the progress of a job, use the -M flag to specify the email address and the -m flag to declare which events trigger a notification.

#PBS -M fred.bloggs@unsw.edu.au
#PBS -m ae

This example will send an email if the job aborts (-m a) due to an error or ends (-m e) naturally. If required, users can also be notified when the job begins (-m b). The email sent when the job ends includes a summary of all the resources used while the job was running. This information is very useful for refining the resource requirements for future jobs.
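The event flags can also be combined. For instance, the following directives (a sketch; substitute your own address) request an email for all three events, abort, begin and end:

#PBS -M fred.bloggs@unsw.edu.au
#PBS -m abe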

Job Output

The standard output and error streams of a batch job are redirected by the resource manager to files on the compute node where the job is running. Only when the job finishes are the output and error files transferred to the head node. By default these files will be called JOB_NAME.oJOB_ID and JOB_NAME.eJOB_ID, and they will appear in the directory that was the current working directory when the job was submitted.
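For example, if myjob.pbs is submitted and receives job ID 1239, then (assuming the default job name, which is taken from the script filename) the job would deliver two files:

myjob.pbs.o1239
myjob.pbs.e1239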

You can also specify the names of the output files with the -o and -e flags. For example, the following directives combine the output and error streams into the single file /home/z1234567/results/Output_Report once the job completes:

#PBS -j oe
#PBS -o /home/z1234567/results/Output_Report

and the following directives will save standard output and standard error to two separate files:

#PBS -e /home/z1234567/results/Error_Report
#PBS -o /home/z1234567/results/Output_Report

The -j oe option in the first example is what merges the two streams: it instructs the resource manager to join the error stream onto the standard output stream, so a single file is produced instead of two.

Even though the output and error files are not made available until the job finishes, it is possible to monitor the output and error streams with the qpeek command while the job is still running. For example, the following command will provide a live view of the standard output from job 1234:

z1234567@katana:~$ qpeek -f 1234

Job Directories

When a job starts, its current working directory is defined by the variable $PBS_O_INITDIR. By default the resource manager assigns the user's home directory to $PBS_O_INITDIR. So unless all of your scripts and executables are stored in your home directory (not recommended!), it is very important that each job sets its current working directory appropriately. This can be achieved by changing directory at the beginning of the job script:

#!/bin/bash
 
#PBS -l nodes=1:ppn=1,vmem=1gb
#PBS -l walltime=1:00:00
#PBS -j oe
 
cd $HOME/projects/hardsums
 
./myprogram

However, if that job script were reused elsewhere it would need to be updated, because the working directory is hard-wired into the script. An alternative approach is to use another variable provided by the resource manager: $PBS_O_WORKDIR. By default $PBS_O_WORKDIR is assigned the current working directory of the qsub command that launched the job. In most cases the directory from which you submit the job is exactly where you would like the job to start running. Consequently, the following script provides a more convenient and reusable way of giving the job an appropriate working directory:

#!/bin/bash
 
#PBS -l nodes=1:ppn=1,vmem=1gb
#PBS -l walltime=1:00:00
#PBS -j oe
 
cd $PBS_O_WORKDIR
 
./myprogram

The directory referenced by $PBS_O_WORKDIR is also the default location for any standard output and error files produced by the job.

In some circumstances it might be useful to specify values for $PBS_O_INITDIR and $PBS_O_WORKDIR rather than accept the default values. This can be achieved with the -d and -w flags respectively:

#!/bin/bash
 
#PBS -l nodes=1:ppn=1,vmem=1gb
#PBS -l walltime=1:00:00
#PBS -j oe
#PBS -d /home/z1234567/projects/hardsums/data
#PBS -w /home/z1234567/projects/hardsums/output
 
./myprogram

Home directories ($HOME) and global scratch directories (/srv/scratch/$USER) reside on remote filesystems provided by the storage nodes. All data read from or written to those directories is transferred over the network and consequently incurs a performance penalty. Applications that are sensitive to file I/O performance should instead use a local scratch directory for intermediate files, as sketched below. More information on local scratch is available in the Advanced Batch Jobs section.
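As a minimal sketch of the pattern, a job can stage its input into local scratch, run there, and copy the results back when it finishes. Note that the use of $TMPDIR as the per-job local scratch directory, and the file names input.dat and results.dat, are assumptions for illustration; see the Advanced Batch Jobs section for the correct details on Katana.

#!/bin/bash

#PBS -l nodes=1:ppn=1,vmem=1gb
#PBS -l walltime=1:00:00
#PBS -j oe

# $TMPDIR as a per-job local scratch directory is an assumption;
# check the Advanced Batch Jobs section for the correct location.
cd $TMPDIR

# Stage input from the submission directory, run locally, stage results back.
cp $PBS_O_WORKDIR/input.dat .
$PBS_O_WORKDIR/myprogram
cp results.dat $PBS_O_WORKDIR/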

Finally, note that directory and file names containing spaces can cause problems for the resource manager. For example, if the name of the working directory contains spaces, the resource manager will be unable to deliver the job's output files to that directory. It is therefore recommended that directory and file names used by jobs do not contain spaces.
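As a defensive measure inside your own scripts, double-quoting shell variables at least protects the script itself if a path does happen to contain spaces (although it does not help the resource manager deliver output files):

cd "$PBS_O_WORKDIR"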