# Resource Planning
If you've made it this far, then:
1. Congratulations!
1. You now know how to submit a condor job with very simple
input/output requirements.
1. Your head must be spinning with all these files: `.cmd` files,
`.sh` files, `.py` files, etc.
There's a reason for the nested scripts. It has to do with resource
planning: Telling condor what inputs are needed to run your program,
and how to handle any outputs.
Physicists hate resource planning.[^hate] But if you're going to use
condor for any real task, you'll have to do it.
## Understanding the execution environment
Let's take a look at some lines from
[condor-example.cmd](https://www.nevis.columbia.edu/~seligman/root-class/files/condor-example.cmd), with a couple of modifications for the sake of discussion:
```
executable = condor-example.sh
transfer_input_files = condor-example.py
input = experiment.root
transfer_output_files = condor-example-$(Process).root
output = condor-example-test-$(Process).out
error = condor-example-test-$(Process).err
log = condor-example-test-$(Process).log
```
The purpose of these lines is to tell condor which files must be
transferred to the batch node for your program to work, and which
files should be copied back to you after your program finishes.[^outputs]
(The `$(Process)` macro expands to each job's process number, so every
job in a cluster reads and writes its own set of files.)
This diagram illustrates what's going on:
:::{figure-md} resource-planning-fig
:align: center
The relationship between the lines in a `.cmd` file and the condor execution environment.
:::
Note that `output` refers to the direct text output of your job
(such as `print` statements in Python or `cout` statements in
C++). The `error` file will contain any error messages written to
standard error. The `log` file will include condor status messages,
including the name of the batch node on which your job executed.[^log]
You can learn more about these statements in the [HTCondor
manual](https://htcondor.readthedocs.io/en/latest/users-manual/file-transfer.html).
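The lines shown earlier are only an excerpt. For context, a minimal complete submit file might look something like this; the `universe`, `should_transfer_files`, `when_to_transfer_output`, and `queue` values below are typical choices, not lines taken from `condor-example.cmd`:

```
# A minimal sketch of a complete .cmd file; the file names match
# the excerpt above, but the remaining settings are illustrative.
universe                = vanilla
executable              = condor-example.sh
transfer_input_files    = condor-example.py
input                   = experiment.root
transfer_output_files   = condor-example-$(Process).root
output                  = condor-example-test-$(Process).out
error                   = condor-example-test-$(Process).err
log                     = condor-example-test-$(Process).log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
# Submit 10 jobs, with $(Process) running from 0 to 9.
queue 10
```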
## Stay within the execution environment
You're not in Kansas anymore.
Or, to be less flippant: you're not in your home directory
anymore. When your job is executing on a batch node, your program (and
any shell script that runs it) doesn't have access to the files and
environment that you may have set up in your home directory.
Take another look at
[condor-example.sh](https://www.nevis.columbia.edu/~seligman/root-class/files/condor-example.sh). The
first few non-comment lines set up the execution environment for the
Nevis particle-physics systems.[^preamble]
If you're not at Nevis, of course, you'll have to learn what form of
preamble or set-up your institution's batch farm requires to
replicate your environment.
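If you have to write your own preamble, it usually amounts to setting a few environment variables before the main program runs. This is a purely illustrative sketch, not the actual Nevis setup; the variable name `MYANALYSIS_DATA` is hypothetical:

```shell
#!/bin/sh
# Illustrative preamble for a condor .sh wrapper (not the actual
# Nevis setup, which comes from /usr/nevis/adm).
export PATH="$PWD/bin:$PATH"       # prefer executables transferred with the job
export MYANALYSIS_DATA="$PWD"      # hypothetical variable your program might read
echo "Job starting on $(hostname) in $PWD"
```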
In general, it's not a good idea for your shell script (the `.sh` file)
to reference any directory outside the execution environment for that
particular process. The reason is that you can't know in advance which
batch node will run your job, or whether it has access to any external
directories.
You can fiddle with directories in condor, but you'll want to keep it
within the condor environment; e.g.:
```
# Create a new sub-directory within the condor execution environment.
mkdir myDirectory
# Visit this directory.
cd myDirectory
# Do something in that directory.
echo "This is a temporary file" > file.txt
# Leave the directory.
cd ..
# Run something that uses that directory.
./myprogram.py --inputfile=myDirectory/file.txt
```
## Test to see if everything works
This may sound trivial, but you'd be surprised at how often I see
people skip this step: Test your condor job submission with 2-3 jobs
before submitting a huge quantity like 20,000.
The reason why I'm giving this advice is that I've seen what happens
when someone doesn't do it, and there's a problem with their job:
- When a condor job fails, it sends an error message.
- 20,000 failed jobs means 20,000 error messages.
- Sending out 20,000 error messages at once will clog our mail
server, slowing it down. This does not make one popular.
- The 20,000 error messages are sent to your email address at your
home institution. Your home institution becomes suspicious, and
shuts down your email address.
- The `nevis.columbia.edu` mail server is identified as a potential
  source of spam, since it sent out 20,000 emails to a single
  address. Our mail server is added to spam block lists. Members of
  the Nevis faculty discover that their emails never arrive, without
  any warning. Again, this does not make one popular.
All of the above have happened, though fortunately not all at
once. Still, to maintain your popularity, test your scripts first!
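One easy way to do this is with the `queue` statement at the end of your `.cmd` file: submit a handful of jobs, inspect the outputs, and only then raise the count. A sketch:

```
# In your .cmd file: test with a few jobs first...
queue 3
# ...and only after checking the .out/.err files, scale up:
# queue 20000
```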
## Find reasonable execution times for your jobs
The "sweet spot" for the length of a job is about 20-60 minutes.
- Shorter than 5 minutes, and the overhead involved in condor
setting up the execution environment and copying your files
begins to take up a substantial percentage of the job's execution
time.
- Longer than about an hour, and your job may run up against
  condor's time limit for a job. That limit is assigned by a systems
  administrator; for the Nevis particle-physics systems I've left
  this limit at its default of two hours.
That's not a "hard" limit; if no other jobs are waiting to be
executed, condor will let a job run over two hours. However, as
soon as any other job requests to be executed, condor will
"suspend" a job that's taking too long. A suspended job has to
start again from the beginning.[^long]
A practical way to discover how long your job takes to run is to do
some test submissions with varying values of an appropriate parameter;
e.g., the number of events read from an input file, or the number of
events to generate in a simulation.
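For example, a hypothetical program with an `--nevents` option could be timed like this; a `sleep` stands in for the real program so the sketch is self-contained:

```shell
#!/bin/sh
# Time a job for increasing event counts to estimate how
# execution time scales. "myprogram.py --nevents=$n" is a
# placeholder; a sleep stands in for it here.
for n in 100 1000 10000
do
    start=$(date +%s)
    sleep 1    # replace with: ./myprogram.py --nevents=$n
    end=$(date +%s)
    echo "$n events: $((end - start)) seconds"
done
```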
## Too many inputs
As you work with more elaborate projects, you may discover that you
need a longer and longer list of files to transfer with the
`transfer_input_files` option in the `.cmd` file.
There are two ways to deal with this. One way is to let
`transfer_input_files` copy an entire directory for you. For example:
```
transfer_input_files=myprogram.py,my_directory
```
If you have to transfer a lot of input files, put them in that directory.
Then you reference that directory within your condor environment;
e.g.:
```
./myprogram.py --inputfile=my_directory/experiment.root
```
It's easy to get sloppy and forget about the contents of any such
special directories. They tend to pick up "junk" that your condor job
never uses, but transfers anyway because condor can't tell the
difference.
Therefore, I like to use the
[tar](https://www.geeksforgeeks.org/tar-command-linux-examples/)
command to create an archive file that I will unpack in my condor job.[^zip]
Here's an example from work that I'm doing right now (May-2022). I've
got a program that I want to submit to condor. I developed the program
and its associated files as I worked in a directory. That directory is
filled with all kinds of temporary files, test scripts, and other
junk:
```
> ls work-directory
10000events.mac example.hepmc2 grams.gdml LArHits.h root
atestfil.fit example.hepmc3 gramssky mac scripts
bin example.treeroot GramsSky Makefile test
btestfil.fit GDMLSchema gramssky.hepmc3 options.xml test.mac
c1.gif gdmlsearch grams-z20.gdml options.xml~ test.root
c1.pdf gdmlsearch.cc~ hepmc3.mac outputs-pretrajectory treeViewer.C
CMakeCache.txt gramsdetsim hepmc3.mac~ output.txt user.properties
CMakeFiles GramsDetSim hepmc-ntuple.root parsed.gdml view.mac
cmake_install.cmake gramsdetsim.root HepRApp.jar radialdistance.root xinclude.xml
crab-45.mac gramsg4 heprep.mac rd.pdf
crab.mac GramsG4 HitRestructure.root README.md
detector-outline.mac gramsg4.root LArHits.C RestructuredEdx.root
```
I copy the work directory:
```
> cp -arv work-directory job-directory
```
Then I clean up job-directory so that it contains only the files I'll
need to run my condor job. Here it is after I'm done:
```
> ls job-directory
bin GDMLSchema gramsdetsim gramsg4 grams.gdml gramssky mac options.xml
```
For me, this is a "reference directory" for the jobs I'm going to
submit. I can continue to fiddle with things in work-directory, but
I'll leave job-directory pristine until there's a reason to change it.
This directory still contains a lot of files:[^huh]
```
# Count the number of files in job-directory
# and all its sub-directories.
> find job-directory -type f | wc -l
60
```
So it's easier for me to transfer all these files as a single archive.
I'll create an archive of that directory:
```
> tar -cvf jobDir.tar ./job-directory
```
It's the single `jobDir.tar` file that I'll transfer and unpack in my
condor job. In my `.cmd` file, I'll have:
```
transfer_input_files=jobDir.tar
```
In my `.sh` file, I'll have:
```
tar -xf jobDir.tar
```
Then within my condor execution environment, I'll have access to all
the files in job-directory. For example:
```
./job-directory/gramssky --rngseed=${process}
```
That last option to `gramssky` supplies a unique random-number seed
for my sky simulation... but that's another story.
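The pack-and-unpack pattern can be tried out on its own; every file and directory name in this sketch is a placeholder:

```shell
#!/bin/sh
# At "submission time": build a directory and archive it.
mkdir -p demo-directory
echo "input data" > demo-directory/data.txt
tar -cf demoDir.tar demo-directory
# On the "batch node": unpack the archive and use its contents.
mkdir -p scratch
cd scratch
tar -xf ../demoDir.tar
cat demo-directory/data.txt
```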
:::{figure-md} tar-fig
:align: center
by Randall Munroe
:::
[^hate]: In this discussion, when I say "physicists hate resource planning," it's
not exclusive. Other fields of study also hate resource planning.
However, since I mostly hang out with physicists and not
physicians or morticians, I'll only speak for the folks I know.
[^outputs]: You might initially think that if your program produces
any outputs, condor should simply transfer everything back to
you. In reality, an execution environment often contains a lot of
"junk"; one classic example is
[conda](https://docs.conda.io/en/latest/miniconda.html) work
files. It's generally better to explicitly tell condor which output
files are relevant to you.
Why didn't I do this in `condor-example.cmd`? When I first created
that file, selective transfer of output files was not available in
condor. And now... well, perhaps I'll get around to it in time for
the next edition of this tutorial.
[^log]: The `.log` output from condor is a bit different from the
`.out` and `.err` output. If you run your job more than once with
identical values for `output=` and `error=` (in this example, that
would mean you're re-running the same job with the same process
number), the `.out` and `.err` files will be overwritten.
However, the `.log` file is always appended to, not overwritten. If you
see your `.log` file gradually become larger and larger as you
re-submit your jobs, it's because the file is basically maintaining
a "history" of every time you've run the job.
[^preamble]: Actually, on Nevis particle-physics systems, the
directory `/usr/nevis/adm` *is* located outside the condor
execution environment. This works on our batch farms because of
the way directories are {ref}`managed ` here.
[^long]: At one time we had someone try to submit a week-long spline
calculation on a Nevis batch farm. It never executed, because any
time someone submitted a different job to the farm, the spline job
was suspended and had to start from the beginning.
[^zip]: Why don't I use
[zip](https://linuxize.com/post/how-to-zip-files-and-directories-in-linux/)?
Mostly it's personal preference: although `tar` is a bit harder to
use, it's more powerful than `zip`. Also, `zip` isn't always part
of a standard Linux installation, though I make sure it's available on all the
Nevis particle-physics systems.
[^huh]: Mystified by the
[find](https://www.geeksforgeeks.org/find-command-in-linux-with-examples/)?
By the
[wc](https://www.geeksforgeeks.org/wc-command-linux-examples/)?
And what's the deal with that [vertical
bar](https://www.geeksforgeeks.org/piping-in-unix-or-linux/)?
Hey, I told you that you could spend a lifetime learning about UNIX!