Condor Tutorial
Clarification: If you are not in one of the Nevis particle-physics groups, the following instructions won’t work as-is. However, if you’re trying to come up with an initial set of files, namely a condor command file (a .cmd file), a shell script (a .sh file)1, and a program (here a .py file), then going over these instructions will still be useful on other condor-based batch farms.
Log in to your file server
For the purposes of illustration, I’m going to assume that your file server is olga (a machine name that does not exist for the Nevis particle-physics systems), and that your Nevis account name is $USER (a variable that’s automatically assigned in most UNIX-related operating systems).
> ssh $USER@olga.nevis.columbia.edu
Create a directory on your file server
Disks are divided into partitions, and directories are created within partitions. All the Nevis systems have a /data partition. Let’s go there, create our test directory, and go into the new directory:
> cd /nevis/olga/data
> mkdir $USER
> cd $USER
Copy the files used in this tutorial
> cp ~seligman/root-class/condor-example.* $PWD
For a list of files you’ve copied:
> ls -lh
Look at the files
As noted before, typically condor jobs require at least three files: the condor command file, a shell script executed by the command file, and a program executed by the shell script. Take a look at these files and read the comments:
> less condor-example.cmd
> less condor-example.sh
> less condor-example.py
The command file (the .cmd file) submits the shell script to be executed on some machine in the condor pool. The shell script (the .sh file) sets up the environment for the program to execute. The program (here a .py file), when it executes, writes an output file. That output file is copied by condor to the directory from which you originally executed condor_submit.
If you’re not at Nevis: if you click on the following links, you should be able to see these files in your web browser. It’s worth reviewing them for both the commands and the comments:
However you’re looking at the files, note that each “instance” of this job has its own Process parameter. This is one way to differentiate between multiple instances of the same program running on different job queues; i.e., process 0 gives a different result than process 1, and process 1 is different from process 2, etc.
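If you can’t open the example files, the following sketch may help you picture what a command file of this kind contains. It is not the real condor-example.cmd: the function name, the log-file names, and the choice of settings are my own assumptions, though the keywords themselves (executable, arguments, transfer_input_files, queue, and so on) are standard condor submit commands. I’ve written it as a small python function that builds the text, so you can see exactly where $(Process) and the job count end up.

```python
# Hypothetical sketch: this is NOT the content of the real condor-example.cmd.
def build_submit_description(script="condor-example.sh",
                             program="condor-example.py",
                             n_jobs=1):
    """Return the text of a minimal condor command file."""
    lines = [
        "universe   = vanilla",
        # The wrapper script that condor runs on the remote machine.
        f"executable = {script}",
        # condor replaces $(Process) with 0, 1, 2, ... for each job.
        "arguments  = $(Process)",
        # Send the program along with the wrapper script.
        f"transfer_input_files = {program}",
        "should_transfer_files   = YES",
        "when_to_transfer_output = ON_EXIT",
        # Assumed names for the per-job text output and log files.
        "output = condor-example-$(Process).out",
        "error  = condor-example-$(Process).err",
        "log    = condor-example-$(Process).log",
        # The number of instances to submit.
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_submit_description(n_jobs=10), end="")
```

Running the script prints a command file that would submit ten instances, each receiving its own process number as the first argument of the shell script.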
The Process number
Realistically, at most you’re just going to skim the contents of these example files. But there’s one aspect I ask you to pay special attention to: process.
This is what makes each run of your program unique. If you didn’t make use of $(Process) in some way, as shown in the examples, then every run of your program would produce identical results. This will wreak havoc on your analysis.
How will $(Process) affect your program? That’s up to you. Among the most common uses is to have a program read a selected number of events from a file; e.g., if the process is 0 read the first 10,000 events, if the process is 1 read events 10,001 to 20,000, etc.2
If you’re running a simulation, your program will be generating lots of random numbers. Typically you’d have the process be part of the random-number seed.
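Here’s a minimal sketch of both ideas in python. The function names, the chunk size of 10,000 events, and the base seed are assumptions I’ve made for illustration; they don’t come from the example files.

```python
import random

# Assumed chunk size, matching the 10,000-event example in the text.
EVENTS_PER_JOB = 10_000


def event_range(process, events_per_job=EVENTS_PER_JOB):
    """Return (first, last) event numbers for a job, 1-based and inclusive."""
    first = process * events_per_job + 1
    last = (process + 1) * events_per_job
    return first, last


def make_rng(process, base_seed=12345):
    """Return a random-number generator whose seed depends on the process."""
    # base_seed is an arbitrary illustrative value.
    return random.Random(base_seed + process)


if __name__ == "__main__":
    for process in range(3):
        first, last = event_range(process)
        rng = make_rng(process)
        print(f"process {process}: events {first} to {last}, "
              f"first random number {rng.random():.4f}")
```

Because the seed is a fixed function of the process number, each job gets its own random sequence, and re-running the same process number reproduces the same sequence.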
Another common use of $(Process) is to make it part of a file name. In this way, your program will generate a unique output file. After all the instances of your program have run, you’ll have to combine the outputs in some way. There’s a ROOT feature designed for this: TChain.
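As a sketch of the file-name idea, here is one way to build a per-process output name and, later, to collect the files that all the jobs produced. The helper names are hypothetical; the prefix is chosen to match the condor-example-test-0.root name that appears later in this tutorial. The sketch only handles the names and uses nothing from ROOT.

```python
import glob
import os


def output_name(process, prefix="condor-example-test"):
    """Build an output file name that is unique to this process."""
    return f"{prefix}-{process}.root"


def _process_part(path, prefix):
    """Extract the text between '<prefix>-' and '.root' in a file name."""
    name = os.path.basename(path)
    return name[len(prefix) + 1:-len(".root")]


def find_outputs(directory=".", prefix="condor-example-test"):
    """List the per-process output files, ordered by process number."""
    pattern = os.path.join(directory, f"{prefix}-*.root")
    # Keep only the files whose suffix really is a process number.
    numbered = [p for p in glob.glob(pattern)
                if _process_part(p, prefix).isdigit()]
    return sorted(numbered, key=lambda p: int(_process_part(p, prefix)))


if __name__ == "__main__":
    print(output_name(7))
    for path in find_outputs():
        print(path)
```

Sorting by the numeric process number, rather than alphabetically, keeps file 10 after file 9. Each path in the resulting list could then be handed to a TChain with its Add method.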
Make sure the files are executable
For a file to be executed as a program, it must be executable. I’ve already made sure that condor-example.sh and condor-example.py are executable programs via the following commands, but I suggest you type them in again to both be certain and to know how to do this when you start writing your own scripts.
> chmod +x condor-example.sh
> chmod +x condor-example.py
Let’s try it
Submit your condor command file to the condor cluster:3
> condor_submit condor-example.cmd
Quickly (before the program has a chance to finish), type
> condor_q
> condor_q -run
The first command shows all the jobs you submitted on this computer. The second command shows the jobs you submitted which are currently executing, and on which computer.
Within a minute or so, the job will complete and condor_q will no longer list any jobs under your account ID. Take a look at the contents of your directory:
> ls -lrth
The files are listed in ascending order by date. Note the new files at the bottom of the list. Compare these file names to the ones given in condor-example.sh. Can you see how condor-example-test-0.root got its name?
Multiple jobs
Edit the file condor-example.cmd and change the last line to read
queue 10
This means to submit 10 jobs.4 Save the file and execute the condor_submit command again. Note how the submitted jobs are “counted off” by periods. Type condor_q and condor_q -run to see which computers execute the jobs. When they’re all done, look at the contents of your directory to see all the new files.
Run ROOT and look at the contents of condor-example-test-9.root. Does it contain the histogram you expect? Look at the mean and the histogram limits.
Aborting a job
It happens all the time: You submit 10,000 jobs, and then realize that something is wrong. Fortunately, you can quickly abort a cluster of condor jobs.
Run condor_submit again. The message that comes out looks something like this:
> condor_submit condor-example.cmd
Submitting job(s)..........
Logging submit event(s)..........
10 job(s) submitted to cluster 14.
The identifier for this particular cluster is “14” (you’ll almost certainly see a different number). If you want to cancel all the jobs in that cluster at once, the command is:
> condor_rm 14
If you forget the cluster ID, you can always remind yourself with condor_q.
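If you ever submit jobs from a script, you may want to record the cluster ID automatically. Here is a small sketch that pulls the number out of a message like the one shown above. The function name is hypothetical, and the pattern assumes the message has exactly the “submitted to cluster N” wording shown in this tutorial; other condor versions or options may print something different.

```python
import re


def parse_cluster_id(submit_output):
    """Return the cluster ID found in condor_submit's message, or None."""
    # Assumes the wording "... submitted to cluster <number>."
    match = re.search(r"submitted to cluster (\d+)", submit_output)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    message = (
        "Submitting job(s)..........\n"
        "Logging submit event(s)..........\n"
        "10 job(s) submitted to cluster 14.\n"
    )
    print(parse_cluster_id(message))
```

The returned number is what you would pass to condor_rm to cancel the whole cluster.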
Clean up
Finished? Get rid of the files you no longer need:
> rm condor-example-test*
Or if you really want to wipe a directory that you’re never going to use again:
> cd /nevis/olga/data
> rm -rf $USER
A couple of tricks
Here are a couple of extra tricks you can do with python to improve this process a little bit.
Eliminating the “middle-man”, that is, the .sh file. You can do this by placing all the environment set-up commands in your .py file.
A different method of getting command-line arguments into your python program.
If you’ve deleted your temporary directory in /data, create it again and cd to it. Copy over these example files:
> cp ~seligman/root-class/root-python-setup.* $PWD
Take a look at root-python-setup.cmd. It looks pretty much the same as that other condor command file, with one big difference: instead of executing a shell script that will execute another program, this command file will execute the python program directly.
Now look at root-python-setup.py and look at the comments. Two new things are happening in this program:
The python program is setting up its own environment. This requires the “stupid python trick” I mention in the comments (causing the program to run itself again). This altering of an external environment is something python can do but C++ cannot, at least not without even more trickery than you see here.
The python program is parsing its arguments; that is, it’s looking for options and arguments instead of just assuming that the first argument has a particular meaning. This can be done in C++ as well. When I’m writing code that requires only a couple of parameters, I like to use “getopt” or “argparse” methods because they help tell a user what a program is doing. Which is clearer to you?
> condor-example.py 5 myfile-5.root
> root-python-setup.py --mean=5 --outputfile=myfile.root
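To show what the “argparse” approach looks like, here is a minimal sketch that accepts the two options from the second command line above. It is not the code in root-python-setup.py; the function name, the description, the default values, and the help text are my own assumptions.

```python
import argparse


def parse_arguments(argv=None):
    """Parse the command-line options; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(
        description="Example program meant to be run as a condor job."
    )
    # The defaults below are illustrative assumptions.
    parser.add_argument("--mean", type=float, default=0.0,
                        help="mean value used by the program")
    parser.add_argument("--outputfile", default="myfile.root",
                        help="name of the output file to write")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_arguments()
    print(f"mean = {args.mean}, output file = {args.outputfile}")
```

A side benefit is that running the program with --help prints a summary of the options, which is part of what I mean by telling a user what a program is doing.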
- 1
For the rest of this discussion, I’m going to assume any wrapper script is written in a bash-style shell. There’s nothing wrong with having the condor executable written in a csh-style shell, or in any other scripting or programming language.
- 2
If you’re working with RDataFrame, the Range method described on the RDataFrame web page makes this approach easy.
- 3
“After all those figures you showed us, don’t we have to know the name of the condor master or any of the nodes?” If everything works, no! Condor handles it all for you.
If you want to see how condor handles it, you can look at the following files on your server:
> less /etc/condor/condor_config
> less /etc/condor/condor_config.local
- 4
If you’re going to frequently change the number of jobs you’re going to submit, there’s another way to approach this. Delete the queue line from the .cmd file. Then include a -queue N option in condor_submit, where N is the number of jobs you want to submit. For example:
> condor_submit condor-example.cmd -queue 10