Creating a Job Script

Typically, resource managers require the user to prepare a job script that defines how the job should be run (e.g. location of executable; list of arguments) and declares the job's requirements (e.g. machine architecture; memory usage). Condor is no exception, and this section provides a few simple examples of how to create a job script for use with Condor.

There are some parameters that will be found in almost any Condor job script. These common parameters include:

  • universe: To which universe does this job belong?
  • executable: The path to your executable (can be relative or absolute).
  • arguments: Any command-line arguments your executable might need.
  • requirements: A logical expression that describes a suitable machine for your executable.
  • rank: What attribute should be used to rank the list of machines that satisfy your requirements?
  • output: Where to send the job's stdout?
  • error: Where to send the job's stderr?
  • log: Where to send Condor messages relating to your job?
  • initialdir: Use of this parameter is slightly limited because the School of Mathematics Condor pool does not really have a shared filesystem. It will not affect the directory in which a Vanilla universe job runs; that is always a temporary directory created by Condor. Any relative pathname (e.g. executable, output, error, log) is interpreted relative to initialdir, and any remote I/O in Standard universe jobs is also relative to initialdir. By default, initialdir is the directory from which you submitted the job.
  • queue: This is more of a command than a parameter. It submits a job with the parameter values that have been defined prior to the appearance of that queue command. There can be several queue commands in a single job script (Condor will launch a new job for each instance of queue), and you can change parameter values between queue commands; a short sketch of this is given below.

See the section on condor_submit in the Condor Manual for a full list of parameters.
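
For example, the following sketch (with hypothetical scene files, and with the file transfer parameters discussed below omitted for brevity) submits two rendering jobs from a single script, changing the arguments between the two queue commands. Lines beginning with # are comments.

universe   = vanilla
executable = /usr/bin/povray

# first job: quick low-resolution preview
arguments = -D +W320 +H240 +Ipreview.pov +Opreview.png
queue

# second job: full-resolution render
arguments = -D +W1280 +H1024 +Iscene.pov +Oscene.png
queue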

Vanilla Universe

universe = vanilla
 
executable = /usr/bin/povray
arguments  = -D +W1280 +H1024 +Iscene.pov +Oscene.png
 
requirements = memory > 400
rank         = mips
 
output = myjob.out
error  = myjob.err
log    = myjob.log
 
should_transfer_files   = yes
when_to_transfer_output = on_exit
transfer_executable     = false
 
transfer_input_files  = scene.pov
transfer_output_files = scene.png
 
queue

In this example, the combination of requirements and rank causes Condor to find all currently idle machines with more than 400 MB of memory and then, from that list, select the fastest machine according to its MIPS rating.
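
Both requirements and rank are ordinary ClassAd expressions, so they can combine several machine attributes. As a sketch (assuming the pool advertises the standard OpSys, Arch and Memory attributes), the following would restrict the job to 64-bit Linux machines with more than 1 GB of memory and prefer whichever of those machines has the most memory:

requirements = (OpSys == "LINUX") && (Arch == "X86_64") && (Memory > 1024)
rank         = Memory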

Since this is a Vanilla universe job, we must arrange for the transfer of all files required by the job. In fact, if we did not specify how to transfer the files then this job would never be scheduled to run, because Condor would assume the job relies on a shared filesystem and would wait for one to appear (that could be a very long wait!).

The combination of should_transfer_files = yes and when_to_transfer_output = on_exit enables the Condor file transfer mechanism. By default, the job's executable is transferred to the compute host, but in the example above that is unnecessary, so it has been explicitly disabled with transfer_executable = false. It remains the user's responsibility to have any required input files transferred to the compute host, and any output files transferred back to the submit host; this is done by providing comma-separated lists of files as the values of the transfer_input_files and transfer_output_files parameters.
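
If a job needs several input files, or produces several output files, the lists simply grow. For example (the extra filenames here are hypothetical):

transfer_input_files  = scene.pov, textures.inc, camera.inc
transfer_output_files = scene.png, scene_depth.png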

Standard Universe

universe = standard
 
executable = big_simulation
arguments  = -m 1024 -n 1024
 
requirements = memory > 400
rank         = mips
 
output = myjob.out
error  = myjob.err
log    = myjob.log
 
queue

Job scripts for Standard universe jobs can be much simpler than those for the Vanilla universe because there is no need to worry about file transfers (all I/O is redirected to the submit host).
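
Note that a Standard universe executable must be relinked against the Condor libraries. This is normally done by prefixing the usual compile/link command with condor_compile; as a sketch (the source filename and the choice of compiler are only placeholders):

condor_compile gcc -o big_simulation big_simulation.c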

Parametric Jobs

universe = vanilla
 
executable = /usr/bin/povray
arguments  = -D +W1280 +H1024 +Iscene.$(Process).pov +Oscene.$(Process).png
 
requirements = memory > 400
rank         = mips
 
output = myjob.$(Process).out
error  = myjob.$(Process).err
log    = myjob.log
 
should_transfer_files   = yes
when_to_transfer_output = on_exit
transfer_executable     = false
 
transfer_input_files  = scene.$(Process).pov
transfer_output_files = scene.$(Process).png
 
queue 50

Condor provides a useful facility to support the submission of parametric jobs. Typically, a parametric job is composed of several sub-jobs, where all sub-jobs invoke the same executable but with slightly different parameters, and all of the sub-jobs can be run in parallel. In the example above we have 50 input files (scene.0.pov, scene.1.pov, …, scene.49.pov) that describe the geometry of a 3D scene, possibly the first 50 frames of a special-effects sequence in a movie. Each of these 3D scenes needs to be rendered into a high-quality 2D image (scene.0.png, scene.1.png, …) using the povray application. Essentially we are repeating the same task 50 times, each time using a different input file and producing a different output file.

In Condor terminology, a parametric job is called a cluster, and each associated sub-job is simply called a job. A job is uniquely identified by the combination of its parent cluster id and its job id (<cluster id>.<job id>), as seen in the output of condor_q. Supplying a numerical argument to the queue parameter in the job script instructs Condor to launch the specified number of job instances within the cluster. Each job instance can be referenced in the job script using the $(Process) macro, which is replaced with that job's id (0, 1, 2, …); in the example above, queue 50 therefore produces jobs 0 through 49.
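
As an illustration of the substitution, the first and last jobs of the 50-job cluster above effectively see the following parameter values (the output and error filenames are expanded in the same way):

# job 0 (the first job in the cluster)
arguments             = -D +W1280 +H1024 +Iscene.0.pov +Oscene.0.png
transfer_input_files  = scene.0.pov
transfer_output_files = scene.0.png

# job 49 (the last job in the cluster)
arguments             = -D +W1280 +H1024 +Iscene.49.pov +Oscene.49.png
transfer_input_files  = scene.49.pov
transfer_output_files = scene.49.png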