# Shared filesystems
We're almost finished with the subject of batch systems. Hang in
there. There's a really sick xkcd cartoon at the end.
After lecturing you on keeping your execution environment clean and
independent of outside directories, I have to confess: I wasn't
telling you the whole truth.
The reality is that in many batch-node installations, there is an
external filesystem of some sort that all the nodes share. Here are a
couple of reasons why that might be needed:
- Software libraries, such as those maintained in [environment
modules](https://modules.readthedocs.io/en/latest/) and in
[conda](https://docs.conda.io/projects/conda/en/latest/index.html).
Keeping
such libraries in a shared filesystem may be the only practical
way to assure that all the batch nodes have access to the same
software.[^adm]
- Large data files.
Condor is pretty good about transferring files,
but there can be problems when those files get bigger than 1GB or
so. It may be easier to read those files via a shared filesystem,
even though there'll be a speed and/or bandwidth penalty for
many programs reading a large file over the network at the same time.
There's no single standard for shared filesystems and condor. The two
accelerator labs with which I'm most familiar, CERN and Fermilab, each have
their own method. You've already guessed that I'm going to describe
what I've set up on the Nevis particle-physics batch farms, because
it's the one you're most likely to use if you've read my condor
pages so far.
From this point forward, everything I describe is in the context of
the Nevis particle-physics [Linux
cluster](https://twiki.nevis.columbia.edu/twiki/bin/view/Main/LinuxCluster). If
you're not in one of the Nevis particle-physics groups, or you're
outside Nevis, you'll have to ask how they handle their shared
filesystem (if any).
## Nevis shared filesystem
The cluster shares its files via
[NFS](https://www.atera.com/blog/what-is-nfs-understanding-the-network-file-system/),
a standard protocol that lets systems view each other's directories.
At this point, you may want to read the Nevis wiki page on
[automount](https://twiki.nevis.columbia.edu/twiki/bin/view/Main/Automount). The
basic summary: if a system (let's say `olga`) has a partition
(let's say `share`), then to view that partition you use the path
`/nevis/olga/share`.[^milne]
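For example, from another system in the cluster you can simply list that
path; the automounter mounts it on demand. A minimal check, using the
same example server `olga`:

```
# List the contents of olga's "share" partition from some other
# Nevis system; automount takes care of mounting it for you.
ls /nevis/olga/share
```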
Anyone can set up
[permissions](https://www.guru99.com/file-permissions.html) to
restrict others from viewing their directories. For the most part, you
can view others' directories without having accounts on the individual
machines. That's why you can view the directory
`~seligman/root-class/`, which expands to
`/nevis/tanya/home/seligman/root-class/`, even though you
can't login to my desktop computer `tanya`.
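If you're curious about (or want to tighten) who can see a directory of
yours, the usual Unix tools apply. A minimal sketch, with a made-up
directory name:

```
# Show who currently has access to the directory.
ls -ld ~/my-analysis

# Remove read and "search" access for everyone outside your group,
# so other people can no longer browse it over NFS.
chmod o-rx ~/my-analysis
```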
## Nevis filesystems and the batch farms
At this point, you may be thinking, "Hey, I don't have to bother with
all the {ref}`resource planning` stuff.[^stuff] Bill just said everything is
shared, right? So I'll just copy-and-paste the same lines I use to run
my program into a `.sh` file and submit that. Easy-peasy!"
Sorry, but that won't work. The reasons why:
- It may be troublesome to figure out how to use `input=`,
`transfer_input_files=`, and `transfer_output_files=`. But
condor's file-transfer mechanism is much more robust than
NFS. I've seen systems running hundreds of
[shadow](https://htcondor.readthedocs.io/en/latest/users-manual/managing-a-job.html)
processes without slowing down the system from which the jobs' files came.
- The NFS file-sharing scheme has been deliberately set up in such
a way that you *can't* refer to your home directory within a
condor job.
It's reasonable to ask "Why not?" Consider what might happen if the
batch nodes could access your home directory, and all the batch nodes
on a cluster wanted to access that directory at once:
:::{figure-md} dont-do-fig
:align: center
This is what we *don't* permit the NFS-based shared filesystem to do.
:::
NFS is a robust protocol, but handling hundreds of access requests to
the same location on a single partition is a bit much for it. If it's
just reading, the server may slow down so much that it becomes
unusable, which irritates any users who are logged into that server
to do their work. If those hundreds of jobs are *writing* to that
directory at the same time, the server will crash.
The servers with users' home directories are called [login
servers](https://twiki.nevis.columbia.edu/twiki/bin/view/Main/DiskSharing),
because those are the servers that users primarily login to. If a
login server slows down or crashes, users can't login. Since our mail
server requires a user's home directory be available to process email,
if a login server slows down or crashes, our email slows down or
crashes.[^email]
Each Nevis particle-physics group resolves this issue by having
dedicated file servers that are distinct from their login
servers. Remember the diagram I showed you at the beginning of the
first class?
:::{figure-md} LinuxCluster-fig
:align: center
On the first day of class, I predicted that you'd forget this diagram.
Was my prediction correct?
:::
The file servers are the smaller boxes to the right in the above
figure. Each one of those file servers has at least two partitions:
- `/share`
This partition is meant for read-only access by the
batch farms. `/share` is intended for software libraries or similar
resources that a job may need in order to execute.[^backup] The size of the
`/share` partition is typically on the order of 150GB, and it's
shared among the users in that research group.
- `/data`
This partition is meant for big data files (either
inputs or outputs), and any other recreatable files associated
with your jobs.[^recreate] Typically `/data` is about a dozen or more
terabytes, though that varies widely between file servers.
This is how it looks:
:::{figure-md} do-do-fig
:align: center
This is what we permit the NFS-based shared filesystem to do.
:::
It's still possible to crash a file server in this way. But if you do,
it only affects your research group, not all the Nevis
particle-physics groups, faculty, or email. Your *group* may be irritated
with you, but that's a different story.[^havent]
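One practical note about those partitions: it's easy to check how much
room is left before you start filling them up. For example (again using
`olga` as the example file server):

```
# Show the size, space used, and space available on the shared
# partitions of your group's file server.
df -h /nevis/olga/share /nevis/olga/data
```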
## Planning a batch job
Here's the general workflow (a complete submit-file sketch follows this list):
- Develop a program or procedure in your home directory. Get the
program to work on a small scale.
- Once you're confident you have a working program, copy the
relevant files to a `/share` partition. Typically you can do something
like this (using `olga` as an example file server, and
`my-working-directory` as a stand-in for wherever you developed the
program):
> mkdir -p /nevis/olga/share/$USER
> cp -arv my-working-directory /nevis/olga/share/$USER
> cd /nevis/olga/share/$USER
Then clean things up in `/nevis/olga/share/$USER` in preparation
for the batch farm.
- In your `.cmd` file, refer to any distinct input files by their
absolute path to the `/share` directory in which you keep
them. For example:
executable=/nevis/olga/share/$USER/myjob.sh
transfer_input_files=/nevis/olga/share/$USER/myprog.py,/nevis/olga/share/$USER/jobDir.tar
- Create a directory on `/data` for your output files; e.g.:
> mkdir -p /nevis/olga/data/$USER
- Set `initialdir` in the `.cmd` file to that directory, so that
output files will go there; e.g.:
initialdir=/nevis/olga/data/$USER
- Do a small test run of your job to see that it all works; e.g. (in
`.cmd`):
queue 4
Then:
> condor_submit myjob.cmd
- Examine the files in `/nevis/olga/data/$USER` to make sure that
everything works the way you want it to. Then go to town!
queue 1000
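Putting those pieces together, a complete `.cmd` file might look
roughly like the sketch below. The file names (`myjob.sh`, `myprog.py`,
`jobDir.tar`) are the same placeholders used above; the rest is one
plausible arrangement, not the only correct one.

```
# myjob.cmd -- a sketch of a submit file following the workflow above.
# "olga" is the example file server; substitute your group's server,
# and your own account name for $USER.
universe                = vanilla
executable              = /nevis/olga/share/$USER/myjob.sh
transfer_input_files    = /nevis/olga/share/$USER/myprog.py,/nevis/olga/share/$USER/jobDir.tar
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

# Output files (including the .out/.err/.log bookkeeping files) land here.
initialdir              = /nevis/olga/data/$USER
output                  = myjob-$(Process).out
error                   = myjob-$(Process).err
log                     = myjob-$(Process).log

# Start with a small test; switch to something like "queue 1000"
# once you've checked the outputs.
queue 4
```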
## Bringing the job to the data
There's one last wrinkle in the whole job-design process.
Suppose your analysis requires reading large disk files. They're too
big to let condor do the transfer, and you also don't want to hog the
network bandwidth by reading them via NFS.
One answer is to make sure that a job that requires access to a big
file runs on the machine with the partition that holds that file.
This requires:
- A list of which large file is located on which batch node. That
list, in turn, must come from some earlier procedure that
created/downloaded the file onto selected nodes.
- A wrapper script around the `.cmd` file to insert or modify a
`requirements=` line, forcing the job to run on a particular
node (a sketch of such a wrapper follows below).
:::{figure-md} bigfile-fig
:align: center
A sketch of how one might "bring the job to the data." In this
example, our program needs to access `bigfile4.root`.
:::
To some degree this brings us all the way back to {numref}`Figure %s
`: a bunch of programs with the same input file all
being forced to run on a single computer. The main difference is that
the process of downloading the files and submitting the jobs can be
automated. The details of how that's automated depend on each research
group and the tools they use.
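To make that second bullet a little more concrete, here's a minimal
sketch of such a wrapper. The script name, the `file-locations.txt`
map, and the node name in it are all invented for the example; the
only standard pieces are `condor_submit -append` and the
`requirements = (Machine == "...")` expression.

```
#!/bin/bash
# submit-near-data.sh -- a sketch, not a tested production script.
# Usage: ./submit-near-data.sh bigfile4.root myjob.cmd
#
# "file-locations.txt" is a hypothetical map with lines like:
#    bigfile4.root   batch04.nevis.columbia.edu

bigfile=$1
cmdfile=$2

# Look up which batch node holds the requested file.
node=$(awk -v f="$bigfile" '$1 == f {print $2}' file-locations.txt)
if [ -z "$node" ]; then
    echo "No entry for $bigfile in file-locations.txt" >&2
    exit 1
fi

# Tell condor that the job must run on that node. The "-append" option
# adds the line as if it appeared just before the queue statement.
condor_submit -append "requirements = (Machine == \"$node\")" "$cmdfile"
```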
## Disk space
Physicists like to live in an idealized world where there's infinite
disk space.[^infinite] The reality is that there's only so much disk
space available to you, or that is shared among your research group.
Batch jobs tend to consume large amounts of disk space. In particular,
if you submit N jobs, you're going to get at least 3*N files written
to `initialdir`; each job will write a `.out`, a `.err`, and a `.log`
file. These files, and other miscellaneous outputs associated with
different projects, can accumulate and be forgotten. They passively
take up space and are never looked at again.
At some point, please consider deleting these "scrap" files once you
no longer need them. Only the most intelligent and clever physicists
remember to do this... and we have high hopes for you!
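If you'd like a low-effort way to do that, here's one sketch: every so
often, delete the bookkeeping files that are older than some cutoff.
The path and the 30-day cutoff are just examples; adjust to taste, and
run the command without `-delete` first to see what it would remove.

```
# Delete .out/.err/.log files under your /data area that are more
# than 30 days old.
find /nevis/olga/data/$USER \
     \( -name "*.out" -o -name "*.err" -o -name "*.log" \) \
     -mtime +30 -delete
```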
:::{figure-md} ideal-experiment-fig
:align: center
An xkcd cartoon by Randall Munroe. This is another example of
an idealized environment that, for practical reasons, physicists can't
use.
:::
[^adm]: An example of this at Nevis, to which I alluded in
{ref}`resource planning`, is the `/usr/nevis/adm` directory. On
all the systems in the Nevis particle-physics [Linux
cluster](https://twiki.nevis.columbia.edu/twiki/bin/view/Main/LinuxCluster),
including the batch nodes, this is an external directory mounted
from `/nevis/library/usr/nevis/adm`.
[^milne]: This may explain why your Nevis home directory is
`/nevis/milne/files/$USER`. It means your home directory is
named after your account name (`$USER`), on partition `files`, on machine `milne`, in
the Nevis cluster.
[^stuff]: I thought about using the more polite word "nonsense"
here. But we're all adults, and I know you can handle the
word "stuff", which (despite being slang) is more direct
and to-the-point.
[^email]: *Your* email may not be handled at Nevis, but the *faculty's*
email is. If you were to submit a job that would crash the faculty's
email, you would not be popular.
[^backup]: Like the `/home` partitions (and similar critical
directories, such as the student accounts on `milne`), the `/share`
partitions are backed up nightly. Note that `/data` directories
**are not backed up**; we have neither the bandwidth nor the spare
disk storage for overnight multi-terabyte backups.
Keep that in mind as you're deciding what to keep in `/share` and
what to keep in `/data`.
[^recreate]: Here, "recreatable" means that if something were to
happen to the `/data` partition, you could easily recreate the
file by running a program again. An example of a file that is
*not* easy to recreate is a research paper that you write; keep
that and its associated plots in your home directory!
[^havent]: If your group hassles you about crashing a file server, ask
them whether any of them have ever crashed a server. They'll look
embarrassed, mumble an apology, and politely offer to work with you so
that it doesn't happen again.
[^infinite]: In this discussion, when I say "physicists like to live
in an idealized world"... Oh, never mind. By this point I'm sure
you've got the joke.