|Batch Services at Nevis|
This is a description of the batch job submission services available on the Linux cluster at Nevis Labs.
This web page, like the batch system itself, is a work in progress. It was last modified on 06-Jun-2006.
As of 16-Aug-2007, the following paragraph is obsolete. The machine hermes is being repaired; the replacement Condor batch manager is riverside.nevis.columbia.edu. However, as documented below, you don't need to log in to riverside to use Condor.
The system responsible for administering batch services is hermes.nevis.columbia.edu. Users typically cannot log in to this machine; you submit and monitor jobs from your local box on the Linux cluster. As far as job submission and execution are concerned, the existence of hermes may be completely transparent to you.
In addition to any RAID drives attached to your workgroup's servers, there are "common" RAID drives that are intended to be shared among the users of the Nevis batch system. They are initially to be used by the ATLAS and D0 groups, as noted below, but may be made available to other groups as the need arises. These disks are available via automount on the Linux cluster; each has a capacity of about 1.5TB. The names of these drives are:
For example, the permissions on the drives have been set so that you can do the following from any machine on the Linux cluster (if you're a member of the ATLAS group):
Important! If you're skimming this page, stop and read the following paragraph!
The files on these /data partitions, like those on the /data partitions of any other systems on the Nevis cluster, are not backed up. They are stored on RAID5 arrays, which are a reliable form of storage, and monitoring software warns us if any individual drive fails. However, RAID arrays have been known to fail (and we've had at least one such failure at Nevis). If you have any critical data stored on these drives, make sure you back up the files yourself.
One more time: the files on these partitions are not backed up!
|Submitting batch jobs|
The batch job submission system we're using at Nevis is Condor, developed at the University of Wisconsin. You can learn more about Condor from the User's Manual.
The simplest way to use Condor at Nevis is with the setup command:
Condor tips and tricks
Read the README file; type make to compile the programs; type sh submit to submit a few test jobs.
You may notice that the sh_loop script will not execute; it will sit in the "Idle" state indefinitely. It won't execute unless you submit it in the vanilla universe as described below.
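A minimal sketch of a submit file that runs the script in the vanilla universe. Only the name sh_loop comes from the text above; the output, error, and log file names are hypothetical placeholders.

```condor
# Shell scripts cannot be relinked for the standard universe,
# so ask for the vanilla universe explicitly.
universe   = vanilla
executable = sh_loop

# Hypothetical file names; $(Process) is the job number within the cluster.
output     = sh_loop.$(Process).out
error      = sh_loop.$(Process).err
log        = sh_loop.log

queue
```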
Obviously, some machines are more powerful than others. Before you arbitrarily decide that only the most powerful machines are good enough for your jobs, consider:
The best way to tell Condor that you'd prefer your job to execute on the faster machines is to use the Rank attribute in your submit file:
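A minimal sketch of such a statement, assuming the standard Condor machine attribute Mips is an adequate measure of speed for your job:

```condor
# Among the machines that satisfy the job's requirements,
# prefer the one with the highest Mips rating.
Rank = Mips
```

Because Rank expresses only a preference, the job can still start on a slower machine when no faster one is free.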
With all that said, if you want to restrict your job to the faster machines, you can try a statement like the following in your submit file:
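A sketch of such a statement follows. The threshold of 2500 is a placeholder, not a value taken from the Nevis configuration; replace it with a number suited to the machines currently on the cluster.

```condor
# Only match machines whose Mips rating exceeds the threshold.
# 2500 is a hypothetical cutoff.
Requirements = (Mips > 2500)
```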
This would restrict your job to the fastest processors on the cluster.
If you wish to enable your job to execute on both the 32-bit and 64-bit machines at Nevis, include the following statement in your submit file:
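A sketch of such a statement, assuming the architecture names used by Condor releases of this period ("INTEL" for 32-bit x86 and "X86_64" for 64-bit); the names may differ on other versions:

```condor
# Accept either 32-bit or 64-bit x86 machines.
Requirements = (Arch == "INTEL" || Arch == "X86_64")
```

If a submit file does not mention Arch at all, condor_submit restricts the job to the architecture of the submitting machine, which is why both values have to be listed explicitly.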
This can be combined with other job requirements; e.g.:
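A sketch of a combined statement, again assuming the "INTEL" and "X86_64" architecture names and using a placeholder Mips threshold:

```condor
# Either architecture, but only machines above a hypothetical speed cutoff.
Requirements = (Arch == "INTEL" || Arch == "X86_64") && (Mips > 2500)
```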
If I/O efficiency is important, then consider manually transferring your files to a particular server, and requiring your job to execute on that machine in your submit file; for example:
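A sketch of such a requirement. The host sasha.nevis.columbia.edu is used only because it is the server named in the Rank discussion on this page; substitute the server that actually holds your files.

```condor
# Run only on the machine where the input files already reside.
Requirements = (Machine == "sasha.nevis.columbia.edu")
```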
If you don't mind if the files are transferred, but would prefer it if they were not, instead of the above command you can use the Rank attribute (in this example, we also include a preference for a faster machine):
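A sketch of such a Rank statement, using the server name and scaling factor discussed in the next paragraph:

```condor
# Prefer sasha (where the files already are), then prefer faster machines.
Rank = 2000 * (Machine == "sasha.nevis.columbia.edu") + Mips
```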
The 2000 is a scaling factor. The value of Machine == "sasha.nevis.columbia.edu" will be 1.0 or 0.0, so we must scale it in order for the term to have roughly the same weight as the number of Mips.
Many of the above tips, and others, have been combined into a set of example scripts. They are in ~seligman/condor/; start with the README file, which will point you to the other relevant files in the directory.
|Availability of batch services|
Condor is not available on all systems at Nevis. If you would like access to the batch services (or feel that your system was omitted in error), please contact both Gustaaf Brooijmans and Bill Seligman.