Resource Planning

If you’ve made it this far, then:

  1. Congratulations!

  2. You now know how to submit a condor job with very simple input/output requirements.

  3. Your head must be spinning with all these files: .cmd files, .sh files, .py files, etc.

There’s a reason for the nested scripts. It has to do with resource planning: telling condor what inputs are needed to run your program, and how to handle any outputs.

Physicists hate resource planning.1 But if you’re going to use condor for any real task, you’ll have to do it.

Understanding the execution environment

Let’s take a look at some lines from condor-example.cmd, with a couple of modifications for the sake of discussion:

executable     = condor-example.sh
transfer_input_files = condor-example.py
input          = experiment.root
transfer_output_files = condor-example-$(Process).root 
output         = condor-example-test-$(Process).out
error          = condor-example-test-$(Process).err
log            = condor-example-test-$(Process).log

The purpose of these lines is to instruct condor which files have to be transferred to the batch node for your program to work, and which files should be copied back to you after your program is finished.2

This diagram illustrates what’s going on:

Figure 120: The relationship between the lines in a .cmd file and the condor execution environment.

Note that output refers to the text your job writes to its standard output (such as print statements in python or cout statements in C++). The error file captures your job’s standard error, which typically contains any error messages. The log file will include some condor status messages, including the name of the batch node on which your job executed.3

You can learn more about these statements in the HTCondor manual.

Stay within the execution environment

You’re not in Kansas anymore

Or to be less flippant, you’re not in your home directory anymore. When your job is executing on a batch node, your program (and any shell script that runs it) doesn’t have access to the files and environment settings that you might have set up in your home directory.

Take another look at condor-example.sh. The first few non-comment lines set up the execution environment for the Nevis particle-physics systems.4

If you’re not at Nevis, of course, you’ll have to learn what form of preamble or set-up is required for your institution’s batch farm to replicate your environment.
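To give you a flavor of what such a preamble might look like, here’s a sketch; the set-up script path and module name below are made up, and the real commands depend entirely on your site:

# Hypothetical set-up preamble for a condor .sh script.
# Replace these lines with whatever your batch farm requires.

# Source the site-wide environment script (made-up path).
source /path/to/site-setup.sh

# Make the software your program needs available (made-up module name).
module load root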

In general, it’s not a good idea for your shell script (the .sh file) to reference any directory outside the execution environment for that particular process. The reason is that you may not know which batch nodes will run your job, or whether they have access to any external directories.

You can fiddle with directories in condor, but you’ll want to keep it within the condor environment; e.g.:

# Create a new sub-directory within the condor execution environment.
mkdir myDirectory

# Visit this directory.
cd myDirectory

# Do something in that directory.
echo "This is a temporary file" > file.txt

# Leave the directory
cd ..

# Run something that uses that directory.
./myprogram.py --inputfile=myDirectory/file.txt

Test to see if everything works

This may sound trivial, but you’d be surprised at how often I see people skip this step: Test your condor job submission with 2-3 jobs before submitting a huge quantity like 20,000.

The reason why I’m giving this advice is that I’ve seen what happens when someone doesn’t do it, and there’s a problem with their job:

  • When a condor job fails, it sends an error message.

  • 20,000 failed jobs means 20,000 error messages.

  • Sending out 20,000 error messages at once will clog our mail server, slowing it down. This does not make one popular.

  • The 20,000 error messages are sent to your email address at your home institution. Your home institution becomes suspicious, and shuts down your email address.

  • The nevis.columbia.edu mail server is identified as a potential source of spam, since it sent out 20,000 emails to a single address. Our mail server is added to spam block lists. Members of the Nevis faculty discover that their emails never arrive, without any warning. Again, this does not make one popular.

All of the above have happened, though fortunately not all at once. Still, to maintain your popularity, test your scripts first!
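One simple way to do this is to limit the queue statement in your .cmd file while you test; a sketch, assuming a submit file along the lines of condor-example.cmd:

# While testing, queue only a handful of processes:
queue 3

Inspect the resulting .out, .err, and .log files. Only once they look right should you raise the count for the full production run (e.g., queue 20000).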

Find reasonable execution times for your jobs

The “sweet spot” for the length of a job is about 20-60 minutes.

  • Shorter than 5 minutes, and the overhead involved in condor setting up the execution environment and copying your files begins to take up a substantial percentage of the job’s execution time.

  • Longer than that, and your job may run up against condor’s time limit for a job. That limit is assigned by a systems administrator; for the Nevis particle-physics systems I’ve left this limit at its default of two hours.

    That’s not a “hard” limit; if no other jobs are waiting to be executed, condor will let a job run over two hours. However, as soon as any other job requests to be executed, condor will “suspend” a job that’s taking too long. A suspended job has to start again from the beginning.5

A practical way to discover how long your job takes to run is to do some test submissions with varying values of an appropriate parameter; e.g., the number of events read from an input file, or the number of events to generate in a simulation.
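For example, if your program accepts an option that controls the number of events (the --nevents flag below is made up; use whatever parameter your program actually provides), you can time a few short interactive runs and extrapolate:

# Hypothetical timing test: run interactively with increasing
# event counts to estimate the time per event.
for n in 100 1000 10000; do
    echo "Timing ${n} events:"
    time ./myprogram.py --nevents=${n}
done

If 1,000 events take about a minute, then a job of roughly 30,000 events will land comfortably within the 20-60 minute “sweet spot.”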

Too many inputs

As you work with more elaborate projects, you may discover that you need a longer and longer list of files to transfer with the transfer_input_files option in the .cmd file.

There are two ways to deal with this. One way is to let transfer_input_files copy an entire directory for you. For example:

transfer_input_files=myprogram.py,my_directory

If you have to transfer a lot of input files, put them in that directory. Then you reference that directory within your condor environment; e.g.:

./myprogram.py --inputfile=my_directory/experiment.root

It’s easy to get sloppy and forget about the contents of any such special directories. They tend to pick up “junk” that your condor job never uses, but transfers anyway because condor can’t tell the difference.

Therefore, I like to use the tar command to create an archive file that I will unpack in my condor job.6

Here’s an example from work that I’m doing right now (May-2022). I’ve got a program that I want to submit to condor. I developed the program and its associated files as I worked in a directory. That directory is filled with all kinds of temporary files, test scripts, and other junk:

> ls work-directory
10000events.mac       example.hepmc2    grams.gdml           LArHits.h              root
atestfil.fit          example.hepmc3    gramssky             mac                    scripts
bin                   example.treeroot  GramsSky             Makefile               test
btestfil.fit          GDMLSchema        gramssky.hepmc3      options.xml            test.mac
c1.gif                gdmlsearch        grams-z20.gdml       options.xml~           test.root
c1.pdf                gdmlsearch.cc~    hepmc3.mac           outputs-pretrajectory  treeViewer.C
CMakeCache.txt        gramsdetsim       hepmc3.mac~          output.txt             user.properties
CMakeFiles            GramsDetSim       hepmc-ntuple.root    parsed.gdml            view.mac
cmake_install.cmake   gramsdetsim.root  HepRApp.jar          radialdistance.root    xinclude.xml
crab-45.mac           gramsg4           heprep.mac           rd.pdf
crab.mac              GramsG4           HitRestructure.root  README.md
detector-outline.mac  gramsg4.root      LArHits.C            RestructuredEdx.root

I copy the work directory:

> cp -arv work-directory job-directory

Then I clean up job-directory so that it contains only the files I’ll need to run my condor job. Here it is after I’m done:

> ls job-directory
bin  GDMLSchema  gramsdetsim  gramsg4  grams.gdml  gramssky  mac  options.xml

For me, this is a “reference directory” for the jobs I’m going to submit. I can continue to fiddle with things in work-directory, but I’ll leave job-directory pristine until there’s a reason to change it.

This directory still contains a lot of files:7

# Count the number of files in job-directory 
# and all its sub-directories.
> find job-directory -type f | wc -l
60

So it’s easier for me to transfer all these files as a single archive. I’ll create an archive of that directory:

> tar -cvf jobDir.tar ./job-directory
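Before shipping the archive off, it doesn’t hurt to list its contents as a sanity check:

# Verify what actually went into the archive.
> tar -tf jobDir.tar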

It’s the single jobDir.tar file that I’ll transfer and unpack in my condor job. In my .cmd file, I’ll have:

transfer_input_files=jobDir.tar

In my .sh file, I’ll have:

tar -xf jobDir.tar

Then within my condor execution environment, I’ll have access to all the files in job-directory. For example:

./job-directory/gramssky --rngseed=${process}

That last option to gramssky supplies a unique random-number seed for my sky simulation… but that’s another story.
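By the way, condor doesn’t define ${process} for you. A common pattern (a sketch, not the only way to do it) is to pass the $(Process) macro from the .cmd file to the shell script as a command-line argument:

# In the .cmd file:
arguments = $(Process)

Then, near the top of the .sh file:

# Pick up the process number that condor passed as the first argument.
process=$1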

Figure 121: https://xkcd.com/1168/ by Randall Munroe


1

In this discussion, when I say “physicists hate resource planning,” it’s not exclusive. Other fields of study also hate resource planning. However, since I mostly hang out with physicists and not physicians or morticians, I’ll only speak for the folks I know.

2

You might initially think that if your program produces any outputs, condor should simply transfer everything back to you. In reality, an execution environment often contains a lot of “junk”; one classic example is conda work files. It’s generally better to explicitly tell condor which output files are relevant to you.

Why didn’t I do this in condor-example.cmd? When I first created that file, selective transfer of output files wasn’t available in condor. And now… well, perhaps I’ll get around to it in time for the next edition of this tutorial.

3

The .log output from condor is a bit different from the .out and .err output. If you run your job more than once with identical values for output= and error= (in this example, that would mean running the same job with the same process number), the .out and .err files will be overwritten.

However, the .log file is always appended to, not overwritten. If you see your .log file gradually become larger and larger as you re-submit your jobs, it’s because the file is basically maintaining a “history” of every time you’ve run the job.

4

Actually, on Nevis particle-physics systems, the directory /usr/nevis/adm is located outside the condor execution environment. This works on our batch farms because of the way directories are managed here.

5

At one time we had someone try to submit a week-long spline calculation on a Nevis batch farm. It never executed, because any time someone submitted a different job to the farm, the spline job was suspended and had to start again from the beginning.

6

Why don’t I use zip? Mostly it’s personal preference: although tar is a bit harder to use, it’s more powerful than zip. Also, zip isn’t always part of a standard Linux installation, though I make sure it’s available on all the Nevis particle-physics systems.

7

Mystified by the find? By the wc? And what’s the deal with that vertical bar?

Hey, I told you that you could spend a lifetime learning about UNIX!