Condor Tutorial

Clarification: If you are not in one of the Nevis particle-physics groups, the following instructions won’t work as-is. However, if you’re trying to come up with an initial set of files — that is, a condor command file (a .cmd file), a shell script (a .sh file),[1] and a program (here a .py file) — then going over these instructions will still be useful on other condor-based batch farms.

Login to your file server

For the purposes of illustration, I’m going to assume that your file server is olga (a machine name that does not exist for the Nevis particle-physics systems), and that your Nevis account name is $USER (a variable that’s automatically assigned in most UNIX-related operating systems).

> ssh $USER@olga.nevis.columbia.edu

Create a directory on your file server

Disks are divided into partitions, and directories are created within partitions. All the Nevis systems have a /data partition. Let’s go there, create our test directory, and go into the new directory:

> cd /nevis/olga/data
> mkdir $USER
> cd $USER

Copy the files used in this tutorial

> cp ~seligman/root-class/condor-example.* $PWD

For a list of files you’ve copied:

> ls -lh

Look at the files

As noted before, typically condor jobs require at least three files: the condor command file, a shell script executed by the command file, and a program executed by the shell script. Take a look at these files and read the comments:

> less condor-example.cmd
> less condor-example.sh
> less condor-example.py

The command file (the .cmd file) submits the shell script to be executed on some machine in the condor pool. The shell script (the .sh file) sets up the environment for the program to execute. The program (here a .py file), when it executes, writes an output file. That output file is copied by condor to the directory from which you originally executed condor_submit.
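For orientation, a minimal condor command file generally has the following shape. This is a generic sketch, not the contents of condor-example.cmd itself; the file names reuse the ones from this tutorial, but the individual settings are assumptions:

```
# Generic sketch of a condor command file (not condor-example.cmd itself).
universe   = vanilla
executable = condor-example.sh
arguments  = $(Process)
output     = condor-example-$(Process).out
error      = condor-example-$(Process).err
log        = condor-example-$(Process).log
queue
```

The executable line names the shell script to run, arguments passes the process number to it, and the output/error/log lines tell condor where to write each job's stdout, stderr, and job history.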

If you’re not at Nevis: clicking on the following links should let you see these files in your web browser. It’s worth reviewing them for both the commands and the comments.

However you’re looking at the files, note that each “instance” of this job has its own Process parameter. This is one way to differentiate between multiple instances of the same program running as separate jobs; i.e., process 0 gives a different result than process 1, process 1 differs from process 2, and so on.

The Process number

Realistically, you’re probably just going to skim the contents of these example files. But there’s one aspect I ask you to pay special attention to: $(Process).

This is what makes each run of your program unique. If you didn’t make use of $(Process) in some way, as shown in the examples, then every run of your program would produce identical results, which would wreak havoc on your analysis.

How will $(Process) affect your program? That’s up to you. One of the most common uses is to have the program read a selected range of events from a file; e.g., if the process is 0, read the first 10,000 events; if the process is 1, read events 10,001 to 20,000; and so on.[2]
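As a sketch of that arithmetic (the function name and the events-per-job count are illustrative assumptions, not code from condor-example.py):

```python
# Sketch: derive an event range from the condor process number.
# EVENTS_PER_JOB is an illustrative assumption; your own job may
# divide the work differently.
EVENTS_PER_JOB = 10000

def event_range(process):
    """Return (first, last) event numbers for this process, 1-based and inclusive."""
    first = process * EVENTS_PER_JOB + 1
    last = (process + 1) * EVENTS_PER_JOB
    return first, last

# Process 0 reads events 1-10000; process 1 reads events 10001-20000.
```

The shell script would pass $(Process) to the program as a command-line argument, which the program converts to an integer before doing this arithmetic.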

If you’re running a simulation, your program will be generating lots of random numbers. Typically you’d have the process be part of the random-number seed.
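For example (a minimal sketch; the base seed and the function name are assumptions, not taken from the example files):

```python
# Sketch: fold the condor process number into the random-number seed,
# so each job produces a different, but reproducible, stream.
import random

BASE_SEED = 12345  # illustrative assumption

def make_rng(process):
    """Return a random-number generator seeded uniquely for this process."""
    return random.Random(BASE_SEED + process)

# Different processes draw different numbers; re-running the same
# process number reproduces the same stream.
```

Without the process number in the seed, every job would generate the same "random" events.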

Another common use of $(Process) is to make it part of a file name. In this way, your program will generate a unique output file. After all the instances of your program have run, you’ll have to combine the outputs in some way. There’s a ROOT feature designed for this: TChain.
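A sketch of the file-name approach, mirroring the condor-example-test-0.root pattern that appears later in this tutorial (the function name and prefix argument are illustrative):

```python
# Sketch: build a unique output file name from the process number.
def output_name(process, prefix="condor-example-test"):
    """Return a .root file name unique to this condor process."""
    return "{}-{}.root".format(prefix, process)

# With 10 jobs queued, the outputs run from condor-example-test-0.root
# through condor-example-test-9.root; a ROOT TChain can then read them
# as if they were a single file.
```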

Make sure the files are executable

For a file to be executed as a program, it must be marked executable. I’ve already made condor-example.sh and condor-example.py executable with the following commands, but I suggest you type them in again, both to be certain and to learn how to do this when you start writing your own scripts.

> chmod +x condor-example.sh
> chmod +x condor-example.py

Let’s try it

Submit your condor command file to the condor cluster:[3]

> condor_submit condor-example.cmd

Quickly (before the program has a chance to finish), type

> condor_q
> condor_q -run

The first command shows all the jobs you submitted on this computer. The second command shows the jobs you submitted which are currently executing, and on which computer.

Within a minute or so, the job will complete, and condor_q will no longer list any jobs under your account ID. Take a look at the contents of your directory:

> ls -lrth

The files are listed in ascending order by date. Note the new files at the bottom of the list. Compare these file names to the ones given in condor-example.sh. Can you see how condor-example-test-0.root got its name?

Multiple jobs

Edit the file condor-example.cmd and change the last line to read

queue 10

This means to submit 10 jobs.[4] Save the file and execute the condor_submit command again. Note how the submitted jobs are “counted off” by periods. Type condor_q and condor_q -run to see which computers execute the jobs. When they’re all done, look at the contents of your directory to see all the new files.

Run ROOT and look at the contents of condor-example-test-9.root. Does it contain the histogram you expect? Look at the mean and the histogram limits.

Aborting a job

It happens all the time: You submit 10,000 jobs, and then realize that something is wrong. Fortunately, you can quickly abort a cluster of condor jobs.

Do condor_submit again. The message that comes out looks something like this:

> condor_submit condor-example.cmd
Submitting job(s)..........
Logging submit event(s)..........
10 job(s) submitted to cluster 14.

The identifier for this particular cluster is “14” (you’ll almost certainly see a different number). If you want to cancel all the jobs in that cluster at once, the command is:

> condor_rm 14

If you forget the cluster ID, you can always remind yourself with condor_q.

Clean up

Finished? Get rid of the files you no longer need:

> rm condor-example-test*

Or if you really want to wipe a directory that you’re never going to use again:

> cd /nevis/olga/data
> rm -rf $USER

A couple of tricks

Here are a couple of extra tricks you can do with python to improve this process a little bit.

  • Eliminating the “middle-man”, that is, the .sh file. You can do this by placing all the environment set-up commands in your .py file.

  • A different method of getting command-line arguments into your python program.

If you’ve deleted your temporary directory in /data, create it again and cd to it. Copy over these example files:

> cp ~seligman/root-class/root-python-setup.* $PWD

Take a look at root-python-setup.cmd. It looks pretty much the same as that other condor command file, with one big difference: instead of executing a shell script that will execute another program, this command file will execute the python program directly.

Now look at root-python-setup.py and look at the comments. Two new things are happening in this program:

  • The python program is setting up its own environment. This requires the “stupid python trick” I mention in the comments (causing the program to run itself again). This altering of an external environment is something python can do but C++ cannot, at least not without even more trickery than you see here.

  • The python program is parsing its arguments; that is, it’s looking for options and arguments instead of just assuming that the first argument has a particular meaning. This can be done in C++ as well. When I’m writing code that requires only a couple of parameters, I like to use “getopt” or “argparse” methods because they help tell a user what a program is doing. Which is clearer to you?

    > condor-example.py 5 myfile-5.root
    
    > root-python-setup.py --mean=5 --outputfile=myfile.root
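
As a sketch of the argparse approach, using the option names from the example above (the defaults and help strings here are assumptions, not the contents of root-python-setup.py):

```python
# Sketch: parse --mean and --outputfile options with argparse.
import argparse

def parse_options(argv=None):
    parser = argparse.ArgumentParser(
        description="Generate a histogram and write it to a ROOT file.")
    parser.add_argument("--mean", type=float, default=0.0,
                        help="mean of the generated distribution")
    parser.add_argument("--outputfile", default="output.root",
                        help="name of the output file")
    return parser.parse_args(argv)

# argparse accepts both "--mean=5" and "--mean 5", and running the
# program with --help prints a usage summary automatically.
```

This self-documenting behavior is exactly why named options read more clearly than bare positional arguments once a program takes more than one or two parameters.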
    

[1]

For the rest of this discussion, I’m going to assume any wrapper script is written in a bash-style shell. There’s nothing wrong with having the condor executable written in a csh-style shell, or in any other scripting or programming language.

[2]

If you’re working with RDataFrame, the Range method described on the RDataFrame web page makes this approach easy.

[3]

“After all those figures you showed us, don’t we have to know the name of the condor master or any of the nodes?” If everything works, no! Condor handles it all for you.

If you want to see how condor handles it, you can look at the following files on your server:

> less /etc/condor/condor_config
> less /etc/condor/condor_config.local

[4]

If you frequently change the number of jobs you submit, there’s another way to approach this: delete the queue line from the .cmd file, then include a -queue N option with condor_submit, where N is the number of jobs you want to submit. For example:

condor_submit condor-example.cmd -queue 10