UNSW - Science - HPC

Choosing a Universe

In Condor terms, a universe is a means of categorising a job. When a user wishes to submit a job they must declare which universe the job belongs to. Each universe has an associated set of requirements and services. It is the user's responsibility to ensure that the job satisfies the requirements of its universe. In return, Condor will ensure that the job benefits from the services offered by that particular universe.

The universes available in Condor include: Vanilla, Standard, Java, PVM, MPI, Globus and Scheduler. This document will discuss two of the most commonly used universes: Vanilla and Standard. These universes can be distinguished by their checkpointing and I/O facilities.
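
In practice, the universe is declared with a single line in the job's submit description file, for example:

universe = vanilla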

Vanilla Universe

The Vanilla universe is the most basic of the universes: it imposes almost no requirements, so it will accept any kind of job, but in return it offers only a very simple service.

In the Vanilla universe, checkpointing is the responsibility of the job itself. When Condor determines that the job must be migrated, the job is simply killed and re-run from the beginning on the next available machine. Unless the user takes precautions, this effectively means that any work done prior to migration is lost. One such precaution is sketched below.
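
The simplest approach is for the application to write its own state file periodically, and to wrap the executable in a script that resumes from that file when it exists. The sketch below assumes a hypothetical application big_simulation with a --resume flag, and assumes state.dat is on a shared filesystem or is otherwise transferred with the job:

#!/bin/sh
# Hypothetical self-checkpointing wrapper for a Vanilla universe job.
# big_simulation is assumed to write its progress to state.dat as it runs;
# if the job is killed and re-run elsewhere, it picks up from that file.
if [ -f state.dat ]; then
    ./big_simulation --resume state.dat
else
    ./big_simulation
fi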

Also, the Vanilla universe provides no support for remote I/O. If the job requires access to certain input files, or if the job generates an output file, then those files must be transferred to and from the compute host by the job itself. Fortunately, there are mechanisms available for manually arranging the transfer of such files; one is sketched below, and the Vanilla Universe section of Create a Job Script covers them in detail.
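
As a sketch of one such mechanism (the file names here are hypothetical), Condor's file transfer commands can be added to the job's submit description file; input files listed in transfer_input_files are copied to the compute host before the job starts, and files the job creates are copied back when it exits:

universe                = vanilla
executable              = big_simulation
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = data.in
queue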

Standard Universe

The Standard universe improves upon the Vanilla universe by offering automatic checkpointing and remote I/O facilities. However, in order for Condor to offer these facilities, the job's executable must satisfy a set of constraints and must be relinked against the Condor checkpointing and I/O libraries.

The constraints imposed upon Standard universe jobs are as follows:

  1. Multi-process jobs are not allowed. This includes system calls such as fork(), exec() and system().
  2. Interprocess communication is not allowed. This includes pipes, semaphores and shared memory.
  3. Network communication must be brief. A job may make network connections using system calls such as socket(), but a network connection left open for long periods will delay checkpointing and migration.
  4. Sending or receiving the SIGUSR2 or SIGTSTP signals is not allowed. Condor reserves these signals for its own use. Sending or receiving all other signals is allowed.
  5. Alarms, timers and sleeping are not allowed. This includes system calls such as alarm(), getitimer() and sleep().
  6. Multiple kernel-level threads are not allowed. However, multiple user-level threads are allowed.
  7. Memory mapped files are not allowed. This includes system calls such as mmap() and munmap().
  8. File locks are allowed, but not retained between checkpoints.
  9. All files must be opened read-only or write-only. A file opened for both reading and writing will cause trouble if a job must be rolled back to an old checkpoint image; for compatibility reasons, such a file results in a warning rather than an error.
  10. On Linux, your job must be statically linked.
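
A quick way to check the last of these constraints is to run ldd over the executable; for a statically linked binary, ldd reports that the file is not dynamic:

[matht001]$ ldd big_simulation
        not a dynamic executable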

All that is required to relink your code against the Condor checkpointing and I/O libraries is to prefix your usual link line with the condor_compile command.

Both the GNU and PGI compilers are supported by Condor; with the PGI compilers, however, it is necessary to add -lg2c to the end of the condor_compile line. The Intel compilers are not supported by the condor_compile command.

An example using the GNU C compiler is shown below.

[matht001]$ condor_compile gcc foo.o bar.o -o big_simulation -lm
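
For the PGI compilers the line is the same apart from the compiler name and the trailing -lg2c; a sketch using the PGI C compiler pgcc:

[matht001]$ condor_compile pgcc foo.o bar.o -o big_simulation -lm -lg2c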

In the Standard universe, checkpointing is handled automatically. When Condor determines that it is necessary to migrate the job, it saves a checkpoint of your job to the checkpoint server, kills your job, and then restarts it from the saved checkpoint. Unlike the Vanilla universe, when the job is launched on a new host it resumes from the point it had reached immediately before migration. In addition to the checkpoints saved as part of job migration, Condor also saves a periodic checkpoint (every 2 hours), in case a machine is switched off unexpectedly and there is no time for job migration.

Also, the Standard universe provides support for remote I/O. When a job in the Standard universe performs any I/O (e.g. reading or writing a file), the operation is redirected back to the machine from which the job was submitted. Consequently, there is no need to transfer input or output files to and from the machine on which the job is running; from an I/O perspective, it is as if the job were running on the submit host. A minimal submit description file for such a job is sketched below.
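
As a sketch (reusing the big_simulation executable built with condor_compile above), a Standard universe submit description file needs no file transfer commands, since all file I/O is redirected to the submit host:

universe   = standard
executable = big_simulation
output     = big_simulation.out
error      = big_simulation.err
log        = big_simulation.log
queue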