Examples of Common Mistakes

Here are some common mistakes:

Not having a look at the HPC web site.

This site has an extensive collection of information.

Not reading the module help.

When new software is installed on Katana, information about the software and how it was installed is added to the module file. Before using a piece of software, run the module help command for it.

Running your job on the head node.

Running your job on the head node can cause problems for everyone on the cluster. Rather than using the head node to run jobs, look at the alternatives and run your job on a compute node.

Running interactive jobs rather than batch.

If you run your compute jobs interactively (i.e. you start them manually using the command qsub -I) then you should try to run them as batch jobs instead. By running your compute jobs in batch you can submit more jobs at one time without the manual handling that interactive jobs require.
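As a rough illustration, a batch job is just a shell script with scheduler directives at the top. The sketch below assumes a PBS Professional style scheduler; the job name, the resource values and the file name myjob.pbs are hypothetical placeholders, so check the resource requirements page for the values that apply to you.

```shell
#!/bin/bash
# Hypothetical example job; all values below are placeholders.
#PBS -N example_job
#PBS -l select=1:ncpus=1:mem=4gb
#PBS -l walltime=01:00:00
#PBS -j oe

# PBS records the directory the job was submitted from in PBS_O_WORKDIR.
# Fall back to the current directory when the script is run outside PBS.
JOB_DIR="${PBS_O_WORKDIR:-$PWD}"
cd "$JOB_DIR"

# Replace this line with the commands that do the real work.
echo "Job running in $JOB_DIR on $(uname -n)"
```

Saved as myjob.pbs, a script like this would be submitted with qsub myjob.pbs rather than started by hand with qsub -I.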

Not using array jobs.

If you have a number of jobs that are the same except for different input data then it is more efficient to submit one array job rather than multiple independent batch jobs.
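A minimal sketch of an array job follows. It assumes a PBS Professional style scheduler, where the -J option requests the array and PBS_ARRAY_INDEX holds the index of each sub-job (Torque-based systems use -t and PBS_ARRAYID instead). The file names and resource values are hypothetical.

```shell
#!/bin/bash
# Hypothetical array job: ten sub-jobs, one per input file.
#PBS -N example_array
#PBS -l select=1:ncpus=1:mem=4gb
#PBS -l walltime=01:00:00
#PBS -J 1-10

# PBS Professional sets PBS_ARRAY_INDEX to this sub-job's index (1..10).
# Default to 1 so the script can also be tried outside the scheduler.
INDEX="${PBS_ARRAY_INDEX:-1}"

# Each sub-job picks its own input and output file (names are placeholders).
INPUT_FILE="input_${INDEX}.dat"
OUTPUT_FILE="output_${INDEX}.dat"

# Replace this line with the real command, for example:
#   ./my_program "$INPUT_FILE" > "$OUTPUT_FILE"
echo "Sub-job $INDEX would read $INPUT_FILE and write $OUTPUT_FILE"
```

One qsub of this script stands in for ten separate submissions, and each sub-job differs only in the index it receives.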

Keeping data in your H-Drive and expecting it to be there when you run a compute job.

Whilst you can access the files in your UNSW H-Drive from the Katana Head Node and the KDM server, the files are not available on the compute nodes. The storage section of the web site has more information about copying files from the H-Drive to the local file system.

Not understanding / specifying resource requirements.

The more resources you request, the longer it can take to find a place for your job to run. This is especially the case for jobs with a very long WALLTIME. To have your jobs run as soon and as efficiently as possible, look at the resource requirements page, and use the information reported when each job completes to refine the resources that you request for the next one.

Not making use of the power of parallel processing.

If your batch job has multiple parts that run one after another but are actually independent then you would be better off splitting your batch job up and submitting all of the parts at the same time. The calculations will then occur in parallel and will likely finish sooner. In fact, any part that requests under 12 hours of WALLTIME can run on any node in Katana.

Saving data to /tmp rather than $TMPDIR

By default some software will try to save temporary or working data to the directory /tmp, as this is where system-wide temporary files are usually kept. Whilst /tmp exists on the compute nodes, you should use $TMPDIR instead because /tmp is limited in size. This is a common problem with SAS.
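One way to follow this advice inside a job script is sketched below. It assumes the scheduler sets TMPDIR for each job; the function name make_scratch_dir and the scratch. prefix are made up for the example, and the fallback to /tmp is only there so the snippet can be tried outside a job.

```shell
# Create a private scratch directory under $TMPDIR.
# Falls back to /tmp only when TMPDIR is unset (e.g. outside a job).
make_scratch_dir() {
    mktemp -d "${TMPDIR:-/tmp}/scratch.XXXXXX"
}

WORK_DIR="$(make_scratch_dir)"
echo "Temporary files go in: $WORK_DIR"

# ... run the program here, pointing its temporary or working
# directory option at "$WORK_DIR" ...

# Remove the scratch directory once the work is done.
rm -rf "$WORK_DIR"
```

Many programs pick up TMPDIR on their own; for those that do not, check the software's documentation for an option that sets its working directory.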

Filling up your cluster home drive whilst running a compute job

When you log on to Katana you get a message telling you how much free space you have on your cluster home drive. If your compute job creates files there, make sure that it doesn't fill the drive up. Read the storage pages for information on where data should be kept.
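If you want a job script to check the space itself before it writes anything, something like the sketch below may help. It assumes the limit on the home drive is visible to df; if the limit is enforced as a quota, the log-on message is the more reliable figure. The function name free_kb and the 1 GB threshold are arbitrary choices for the example.

```shell
# Print the free space, in kilobytes, on the file system holding a directory.
# Defaults to your home directory.
free_kb() {
    df -Pk "${1:-$HOME}" | awk 'NR == 2 { print $4 }'
}

# Hypothetical threshold: warn when less than about 1 GB is free.
MIN_FREE_KB=1048576
HOME_FREE_KB="$(free_kb "${HOME:-/}")"

if [ "${HOME_FREE_KB:-0}" -lt "$MIN_FREE_KB" ]; then
    echo "Warning: only ${HOME_FREE_KB} KB free in home directory" >&2
else
    echo "Home directory has ${HOME_FREE_KB} KB free"
fi
```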

Requesting resources that you don't have access to

If you request resources that are not available to your research group then your job will not run. For example, because of node ownership, if you are not part of the School of Mathematics and Statistics, UNSW Business School or the Climate Change Research Centre then a job requiring more than 128GB of memory will only run if it has a WALLTIME of 12 hours or less. Similarly, if you request a job which will take longer than 200 hours (for everyone other than Astrobiology) or more CPU cores than the WALLTIME will allow you to use then your job will never run. The page describing how to figure out when your job will start may help you work out whether your job will never run or is just delayed.

Having files or directories with spaces in the names

If you have files or directories with spaces in the name then some programs will interpret the name as two or more names rather than a single name with space(s) in it. For this reason, best practice under Linux is to not have spaces (or strange characters) in the names.
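The effect is easy to see in the shell itself. The short sketch below uses a made-up file name and a small helper, count_args, to show how an unquoted name containing a space is split into two arguments while a quoted one is not.

```shell
# Count how many separate arguments a command receives.
count_args() {
    echo "$#"
}

FILE_NAME="my results.txt"   # hypothetical file name containing a space

# Unquoted: the shell splits the name at the space into two arguments.
UNQUOTED_COUNT="$(count_args $FILE_NAME)"

# Quoted: the name is passed as one argument.
QUOTED_COUNT="$(count_args "$FILE_NAME")"

echo "unquoted: $UNQUOTED_COUNT argument(s), quoted: $QUOTED_COUNT argument(s)"
```

Quoting every variable avoids the problem in your own scripts, but you cannot rely on third-party programs doing the same, which is why avoiding spaces in names is the safer habit.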