Shared filesystems

We’re almost finished with the subject of batch systems. Hang in there. There’s a really sick xkcd cartoon at the end.

After lecturing you on keeping your execution environment clean and independent of outside directories, I have to confess: I wasn’t telling you the whole truth.

The reality is that in many batch-node installations, there is an external filesystem of some sort that all the nodes share. Here are a couple of reasons why that might be needed:

  • Software libraries, such as those maintained in environment modules and in conda.

    Keeping such libraries in a shared filesystem may be the only practical way to ensure that all the batch nodes have access to the same software.1

  • Large data files.

    Condor is pretty good about transferring files, but there can be problems when those files get bigger than 1GB or so. It may be easier to read such files via a shared filesystem, even though there’ll be a speed and/or bandwidth penalty when many programs read a large file over the network at the same time.

There’s no single standard for shared filesystems and condor. The two accelerator labs with which I’m the most familiar, CERN and Fermilab, each have their own method. You’ve already guessed that I’m going to describe what I’ve set up on the Nevis particle-physics batch farms, because it’s the one you’re most likely to use if you’ve read my condor pages so far.

From this point forward, everything I describe is in the context of the Nevis particle-physics Linux cluster. If you’re not in one of the Nevis particle-physics groups, or you’re outside Nevis, you’ll have to ask how your group or site handles its shared filesystem (if any).

Nevis shared filesystem

The cluster shares its files via NFS, a standard protocol that lets systems view each other’s directories.

At this point, you may want to read the Nevis wiki page on automount. The basic summary: if a system (let’s say olga) has a partition (say share, as an example), then to view that partition you use the path /nevis/olga/share.2
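
For example, sticking with the hypothetical olga/share names from the paragraph above, any machine in the cluster can browse that partition (the automounter mounts it on first access):

    > ls /nevis/olga/share
    > df -h /nevis/olga/share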

Anyone can set up permissions to restrict others from viewing their directories. For the most part, you can view others’ directories without having accounts on the individual machines. That’s why you can view the directory ~seligman/root-class/, which expands to /nevis/tanya/home/seligman/root-class/, even though you can’t login to my desktop computer tanya.
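
The flip side: if there’s a directory you’d rather keep to yourself, the usual UNIX permission commands work on these NFS-mounted paths as well. A quick sketch (the directory name here is made up):

    > chmod -R go-rwx ~/my-private-analysis
    > ls -ld ~/my-private-analysis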

Nevis filesystems and the batch farms

At this point, you may be thinking, “Hey, I don’t have to bother with all the Resource Planning stuff.3 Bill just said everything is shared, right? So I’ll just copy-and-paste the same lines I use to run my program into a .sh file and submit that. Easy-peasy!”

Sorry, but that won’t work. The reasons why:

  • It may be troublesome to figure out how to use input=, transfer_input_files=, and transfer_output_files=, but condor’s file-transfer mechanism is much more robust than NFS. I’ve seen systems running hundreds of shadow processes without slowing down the system from which the jobs’ files came.

  • The NFS file-sharing scheme has been deliberately set up in such a way that you can’t refer to your home directory within a condor job.

It’s reasonable to ask “Why not?” Consider what might happen if the batch nodes could access your home directory, and all the batch nodes on a cluster wanted to access that directory at once:

don't do this!

Figure 83: This is what we don’t permit the NFS-based shared filesystem to do.

NFS is a robust protocol, but handling hundreds of access requests to the same location on a single partition is a bit much for it. If it’s just reading, the server may slow down so much that it becomes unusable, which irritates any users who are logged into that server to do their work. If those hundreds of jobs are writing to that directory at the same time, the server will crash.

The servers with users’ home directories are called login servers, because those are the servers that users primarily login to. If a login server slows down or crashes, users can’t login. Since our mail server requires a user’s home directory be available to process email, if a login server slows down or crashes, our email slows down or crashes.4

Each Nevis particle-physics group resolves this issue by having dedicated file servers that are distinct from their login servers. Remember the diagram I showed you at the beginning of the first class?

Nevis Linux Cluster

Figure 84: On the first day of class, I predicted that you’d forget this diagram. Was my prediction correct?

The file servers are the smaller boxes to the right in the above figure. Each one of those file servers has at least two partitions:

  • /share

    This partition is meant for read-only access by the batch farms. /share is intended for software libraries or similar resources that a job may need in order to execute.5 The size of the /share partition is typically on the order of 150GB, and it’s shared among the users in that research group.

  • /data

    This partition is meant for big data files (either inputs or outputs), and any other recreatable files associated with your jobs.6 Typically /data holds a dozen terabytes or more, though that varies widely between file servers (you can check the actual sizes with df, as shown just after this list).
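
Since both partitions are shared among everyone in your group, it’s worth checking how full they are before copying in a large batch of files. Using olga as the example server again:

    > df -h /nevis/olga/share /nevis/olga/data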

This is how it looks:

do do this!

Figure 85: This is what we permit the NFS-based shared filesystem to do.

It’s still possible to crash a file server in this way. But if you do, it only affects your research group, not all the Nevis particle-physics groups, faculty, or email. Your group may be irritated with you, but that’s a different story.7

Planning a batch job

Here’s the general work flow (a complete example submit file, pulling these steps together, appears just after the list):

  • Develop a program or procedure in your home directory. Get the program to work on a small scale.

  • Once you’re confident you have a working program, copy the relevant files to a /share partition. Typically you can do this (using olga as an example file server):

    > mkdir -p /nevis/olga/share/$USER
    > cp -arv <my-program-directory> /nevis/olga/share/$USER
    > cd /nevis/olga/share/$USER
    

    Then clean things up in <my-program-directory> in preparation for the batch farm.

  • In your .cmd file, refer to any input files by their absolute paths in the /share directory where you keep them. (In a submit file, use $ENV(USER) to pick up the value of your USER environment variable; condor does not expand a bare $USER the way the shell does.) For example:

    executable=/nevis/olga/share/$ENV(USER)/myjob.sh
    transfer_input_files=/nevis/olga/share/$ENV(USER)/myprog.py,/nevis/olga/share/$ENV(USER)/jobDir.tar
    
  • Create a directory on /data for your output files; e.g.:

    > mkdir -p /nevis/olga/data/$USER
    
  • Set initialdir in the .cmd file to that directory, so that output files will go there; e.g.:

    initialdir=/nevis/olga/data/$ENV(USER)
    
  • Do a small test run of your job to see that it all works; e.g. (in .cmd):

    queue 4
    

    Then:

    > condor_submit myjob.cmd
    
  • Examine the files in /nevis/olga/data/$USER to make sure that everything works the way you want it to. Then go to town!

    queue 1000
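
For reference, here’s a minimal sketch of a complete myjob.cmd that pulls the fragments above together. The executable, input files, initialdir, and queue count come straight from the steps above; the universe, file-transfer settings, and per-job output/error/log names are additions of mine that you’d adjust for your own job:

    # myjob.cmd -- a sketch assembling the pieces described above
    universe=vanilla
    executable=/nevis/olga/share/$ENV(USER)/myjob.sh
    transfer_input_files=/nevis/olga/share/$ENV(USER)/myprog.py,/nevis/olga/share/$ENV(USER)/jobDir.tar
    should_transfer_files=YES
    when_to_transfer_output=ON_EXIT
    initialdir=/nevis/olga/data/$ENV(USER)
    output=myjob-$(Process).out
    error=myjob-$(Process).err
    log=myjob-$(Process).log
    queue 4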
    

Bringing the job to the data

There’s one last wrinkle on the whole job-design process.

Suppose your analysis requires reading large disk files. They’re too big to let condor do the transfer, and you also don’t want to hog the network bandwidth by reading them via NFS.

One answer is to make sure that a job that requires access to a big file runs on the machine with the partition that holds that file. This requires:

  • A list of which large file is located on which batch node. That list, in turn, must come from some earlier procedure that created/downloaded the file onto selected nodes.

  • A wrapper script around the .cmd file to insert or modify a requirements= line, to force the job to run on a particular node (a sketch of such a wrapper appears below, after Figure 86).

bring job to data

Figure 86: A sketch of how one might “bring the job to the data.” In this example, our program needs to access bigfile4.root.

To some degree this brings us all the way back to Figure 73: a bunch of programs with the same input file all being forced to run on a single computer. The main difference is that the process of downloading the files and submitting the jobs can be automated. The details of how that’s automated depend on each research group and the tools they use.
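
Since those details vary, here’s only a hypothetical sketch of such a wrapper. It assumes a hand-maintained list, file-locations.txt, whose format and node names are made up for illustration; condor_submit’s -append option, which adds a line to the submit description at submission time, is real:

    #!/bin/bash
    # submit-near-data.sh <bigfile> <cmd-file>  (hypothetical wrapper)
    # Assumes a hand-maintained list, file-locations.txt, with lines like:
    #   bigfile4.root  batch04.nevis.columbia.edu
    bigfile=$1
    cmdfile=$2
    node=$(awk -v f="$bigfile" '$1 == f {print $2}' file-locations.txt)
    if [ -z "$node" ]; then
        echo "No node listed for $bigfile" >&2
        exit 1
    fi
    # Force the job onto the node that holds the file.
    condor_submit -append "requirements = (Machine == \"$node\")" "$cmdfile"

You’d then run something like ./submit-near-data.sh bigfile4.root myjob.cmd instead of invoking condor_submit directly.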

Disk space

Physicists like to live in an idealized world where there’s infinite disk space.8 The reality is that there’s only so much disk space available to you, or shared among your research group.

Batch jobs tend to consume large amounts of disk space. In particular, if you submit N jobs, you’re going to get at least 3*N files written to initialdir; each job will write a .out, a .err, and a .log file. These files, and other miscellaneous outputs associated with different projects, can accumulate and be forgotten. They passively take up space and are never looked at again.

At some point, please consider deleting these “scrap” files once you no longer need them. Only the most intelligent and clever physicists remember to do this… and we have high hopes for you!
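
One low-effort habit: every so often, list the condor .out/.err/.log files in your /data area that haven’t been touched in a while, then delete the ones you no longer need. A sketch, using the directory from the examples above (the 30-day cutoff is arbitrary):

    > find /nevis/olga/data/$USER -maxdepth 1 \( -name '*.out' -o -name '*.err' -o -name '*.log' \) -mtime +30

Once you’re happy that the command only lists files you can live without, append -delete to make it remove them.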

xkcd experiment

Figure 87: https://xkcd.com/669/ by Randall Munroe. This is another example of an idealized environment that, for practical reasons, physicists can’t use.


1

An example of this at Nevis, to which I alluded in Resource Planning, is the /usr/nevis/adm directory. On all the systems in the Nevis particle-physics Linux cluster, including the batch nodes, this is an external directory mounted from /nevis/library/usr/nevis/adm.

2

This may explain for you why your Nevis home directory is /nevis/milne/files/<account>. It means your home directory is named <account>, on partition files, on machine milne, in the Nevis cluster.

3

I thought about using the more polite word “nonsense” here. But we’re all adults, and I know you can handle the word “stuff”, which (despite being slang) is more direct and to-the-point.

4

Your email may not be handled at Nevis, but the faculty’s email is. If you were to submit a job that crashed the faculty’s email, you would not be popular.

5

Like the /home partitions (and similar critical directories, such as the student accounts on milne), the /share partitions are backed up nightly. Note that /data directories are not backed up; we have neither the bandwidth nor the spare disk storage for overnight multi-terabyte backups.

Keep that in mind as you’re deciding what to keep in /share and what to keep in /data.

6

Here, “recreatable” means that if something were to happen to the /data partition, you could easily recreate the file by running a program again. An example of a file that is not easy to recreate is a research paper that you write; keep that and its associated plots in your home directory!

7

If your group hassles you about crashing a file server, ask them whether any of them have ever crashed a server. They’ll look embarrassed, mumble an apology, and politely offer to work with you so that it doesn’t happen again.

8

In this discussion, when I say “physicists like to live in an idealized world”… Oh, never mind. By this point I’m sure you’ve got the joke.