# Shared filesystems

We're almost finished with the subject of batch systems. Hang in there. There's a really sick xkcd cartoon at the end.

After lecturing you on keeping your execution environment clean and independent of outside directories, I have to confess: I wasn't telling you the whole truth. The reality is that in many batch-node installations, there is an external filesystem of some sort that all the nodes share. Here are a couple of reasons why that might be needed:

- Software libraries, such as those maintained in [environment modules](https://modules.readthedocs.io/en/latest/) and in [conda](https://docs.conda.io/projects/conda/en/latest/index.html). Keeping such libraries in a shared filesystem may be the only practical way to assure that all the batch nodes have access to the same software.[^adm]

- Large data files. Condor is pretty good about transferring files, but there can be problems when those files get bigger than 1GB or so. It may be easier to read those files via a shared filesystem, even though there'll be a speed and/or bandwidth penalty when many programs read a large file over the network at the same time.

There's no single standard for shared filesystems and condor. The two accelerator labs with which I'm the most familiar, CERN and Fermilab, each have their own method. You've already guessed that I'm going to describe what I've set up on the Nevis particle-physics batch farms, because it's the one you're most likely to use if you've read my condor pages so far.

From this point forward, everything I describe is in the context of the Nevis particle-physics [Linux cluster](https://twiki.nevis.columbia.edu/twiki/bin/view/Main/LinuxCluster). If you're not in one of the Nevis particle-physics groups, or you're outside Nevis, you'll have to ask how they handle their shared filesystem (if any).

## Nevis shared filesystem

The cluster shares its files via [NFS](https://www.atera.com/blog/what-is-nfs-understanding-the-network-file-system/), a standard protocol that lets one computer view directories on another. At this point, you may want to read the Nevis wiki page on [automount](https://twiki.nevis.columbia.edu/twiki/bin/view/Main/Automount). The basic summary is that if a system (let's say `olga`) has a partition (perhaps `share` as an example), then to view that partition you use the path `/nevis/olga/share`.[^milne]

Anyone can set up [permissions](https://www.guru99.com/file-permissions.html) to restrict others from viewing their directories. For the most part, you can view others' directories without having accounts on the individual machines. That's why you can view the directory `~seligman/root-class/`, which expands to `/nevis/tanya/home/seligman/root-class/`, even though you can't log in to my desktop computer `tanya`.

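For example, to check who can see a directory that you keep on a shared partition, and to restrict it so that only you and your group can read it, the usual Unix commands work over NFS just as they do locally. The path below is only an illustration; substitute your own file server and directory:

> ls -ld /nevis/olga/share/$USER
> chmod g+rx,o-rwx /nevis/olga/share/$USER

The first command displays the directory's owner, group, and current permission bits; the second keeps read access for your group while removing it for everyone else.
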
## Nevis filesystems and the batch farms

At this point, you may be thinking, "Hey, I don't have to bother with all the {ref}`resource planning` stuff.[^stuff] Bill just said everything is shared, right? So I'll just copy-and-paste the same lines I use to run my program into a `.sh` file and submit that. Easy-peasy!"

Sorry, but that won't work. The reasons why:

- It may be troublesome to figure out how to use `input=`, `transfer_input_files=`, and `transfer_output_files=`. But condor's file-transfer mechanism is much more robust than NFS. I've seen systems running hundreds of [shadow](https://htcondor.readthedocs.io/en/latest/users-manual/managing-a-job.html) processes without slowing down the system from which the jobs' files came.

- The NFS file-sharing scheme has been deliberately set up in such a way that you *can't* refer to your home directory within a condor job. It's reasonable to ask "Why not?" Consider what might happen if the batch nodes could access your home directory, and all the batch nodes on a cluster wanted to access that directory at once:

:::{figure-md} dont-do-fig
:align: center

don't do this!

This is what we *don't* permit the NFS-based shared filesystem to do.
:::

NFS is a robust protocol, but handling hundreds of access requests to the same location on a single partition is a bit much for it. If it's just reading, the server may slow down so much that it becomes unusable, which irritates any users who are logged into that server to do their work. If those hundreds of jobs are *writing* to that directory at the same time, the server will crash.

The servers with users' home directories are called [login servers](https://twiki.nevis.columbia.edu/twiki/bin/view/Main/DiskSharing), because those are the servers that users primarily log into. If a login server slows down or crashes, users can't log in. Since our mail server requires that a user's home directory be available to process email, if a login server slows down or crashes, our email slows down or crashes.[^email]

Each Nevis particle-physics group resolves this issue by having dedicated file servers that are distinct from their login servers. Remember the diagram I showed you at the beginning of the first class?

:::{figure-md} LinuxCluster-fig
:align: center

Nevis Linux Cluster

On the first day of class, I predicted that you'd forget this diagram. Was my prediction correct?
:::

The file servers are the smaller boxes to the right in the above figure. Each one of those file servers has at least two partitions:

- `/share`

  This partition is meant for read-only access by the batch farms. `/share` is intended for software libraries or similar resources that a job may need in order to execute.[^backup] The size of the `/share` partition is typically on the order of 150GB, and it's shared among the users in that research group.

- `/data`

  This partition is meant for big data files (either inputs or outputs), and any other recreatable files associated with your jobs.[^recreate] Typically `/data` is about a dozen or more terabytes, though that varies widely between file servers.

This is how it looks:

:::{figure-md} do-do-fig
:align: center

do do this!

This is what we permit the NFS-based shared filesystem to do.
:::

It's still possible to crash a file server in this way. But if you do, it only affects your research group, not all the Nevis particle-physics groups, faculty, or email. Your *group* may be irritated with you, but that's a different story.[^havent]

## Planning a batch job

Here's the general workflow (a complete example `.cmd` file is sketched after this list):

- Develop a program or procedure in your home directory. Get the program to work on a small scale.

- Once you're confident you have a working program, copy the relevant files to a `/share` partition. Typically you can do this (using `olga` as an example file server, and `my-work-dir` as a stand-in for the directory that holds your files):

  > mkdir -p /nevis/olga/share/$USER
  > cp -arv my-work-dir /nevis/olga/share/$USER
  > cd /nevis/olga/share/$USER

  Then clean things up in that directory in preparation for the batch farm.

- In your `.cmd` file, refer to any distinct input files by their absolute path to the `/share` directory in which you keep them. For example:

      executable=/nevis/olga/share/$USER/myjob.sh
      transfer_input_files=/nevis/olga/share/$USER/myprog.py,/nevis/olga/share/$USER/jobDir.tar

- Create a directory on `/data` for your output files; e.g.:

  > mkdir -p /nevis/olga/data/$USER

- Set `initialdir` in the `.cmd` file to that directory, so that output files will go there; e.g.:

      initialdir=/nevis/olga/data/$USER

- Do a small test run of your job to see that it all works; e.g. (in the `.cmd` file):

      queue 4

  Then:

  > condor_submit myjob.cmd

- Examine the files in `/nevis/olga/data/$USER` to make sure that everything works the way you want it to. Then go to town!

      queue 1000

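Putting those pieces together, a complete `.cmd` file for this workflow might look something like the sketch below. The names are the same stand-ins used above (`olga` for your group's file server, `$USER` for your account name, `myjob.sh` and friends for your own files); adjust everything to match your setup.

    # A sample submit file for the workflow above; all names are placeholders.
    executable           = /nevis/olga/share/$USER/myjob.sh
    transfer_input_files = /nevis/olga/share/$USER/myprog.py,/nevis/olga/share/$USER/jobDir.tar

    # Send the job's output, error, and log files to your group's /data partition.
    initialdir           = /nevis/olga/data/$USER
    output               = myjob-$(Process).out
    error                = myjob-$(Process).err
    log                  = myjob-$(Process).log

    # Start small; once the test outputs look right, raise this number.
    queue 4

Submit it with `condor_submit`, exactly as in the test step above.
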
## Bringing the job to the data

There's one last wrinkle on the whole job-design process. Suppose your analysis requires reading large disk files. They're too big to let condor do the transfer, and you also don't want to hog the network bandwidth by reading them via NFS.

One answer is to make sure that a job that requires access to a big file runs on the machine with the partition that holds that file. This requires:

- A list of which large file is located on which batch node. That list, in turn, must come from some earlier procedure that created/downloaded the file onto selected nodes.

- A wrapper script around the `.cmd` file to insert or modify a `requirements=` line, to force the job to run on a particular node; a sketch follows below.

:::{figure-md} bigfile-fig
:align: center

bring job to data

A sketch of how one might "bring the job to the data." In this example, our program needs to access `bigfile4.root`.
:::

To some degree this brings us all the way back to {numref}`Figure %s `: a bunch of programs with the same input file all being forced to run on a single computer. The main difference is that the process of downloading the files and submitting the jobs can be automated. The details of how that's automated depend on each research group and the tools they use.

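For concreteness, here's one way such a wrapper might look. Everything in it is hypothetical: the location list `bigfile_locations.txt`, the node name in the comment, and the template submit file `myjob.cmd` are assumptions for illustration, and your group's actual tools will differ.

    #!/bin/bash
    # Hypothetical wrapper: run a job on the batch node that holds a given big file.
    # Usage: ./submit_near_data.sh bigfile4.root
    #
    # Assumes a two-column list, one "filename node" pair per line, maintained by
    # whatever earlier procedure copied the big files onto the nodes; for example:
    #   bigfile4.root  batch04.nevis.columbia.edu

    bigfile=$1
    node=$(awk -v f="$bigfile" '$1 == f {print $2}' bigfile_locations.txt)

    if [ -z "$node" ]; then
        echo "No batch node listed for $bigfile" >&2
        exit 1
    fi

    # Add a requirements line so condor will only match this job to that node,
    # then submit the usual .cmd file.
    condor_submit -append "requirements = ( Machine == \"$node\" )" myjob.cmd

Here `-append` adds the requirements line to the submit description without editing the template `.cmd` file itself; a loop over many files would turn this into the kind of automated submission mentioned above.
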
## Disk space

Physicists like to live in an idealized world where there's infinite disk space.[^infinite] The reality is that there's only so much disk space available to you, or that is shared among your research group.

Batch jobs tend to consume large amounts of disk space. In particular, if you submit N jobs, you're going to get at least 3*N files written to `initialdir`; each job will write a `.out`, a `.err`, and a `.log` file. These files, and other miscellaneous outputs associated with different projects, can accumulate and be forgotten. They passively take up space and are never looked at again.

At some point, please consider deleting these "scrap" files once you no longer need them. Only the most intelligent and clever physicists remember to do this... and we have high hopes for you!

:::{figure-md} ideal-experiment-fig
:align: center

xkcd experiment

An xkcd cartoon by Randall Munroe. This is another example of an idealized environment that, for practical reasons, physicists can't use.
:::

[^adm]: An example of this at Nevis, to which I alluded in {ref}`resource planning`, is the `/usr/nevis/adm` directory. On all the systems in the Nevis particle-physics [Linux cluster](https://twiki.nevis.columbia.edu/twiki/bin/view/Main/LinuxCluster), including the batch nodes, this is an external directory mounted from `/nevis/library/usr/nevis/adm`.

[^milne]: This may explain why your Nevis home directory is `/nevis/milne/files/` followed by your account name: your home directory is named after your account, on partition `files`, on machine `milne`, in the Nevis cluster.

[^stuff]: I thought about using the more polite word "nonsense" here. But we're all adults, and I know you can handle the word "stuff", which (despite being slang) is more direct and to-the-point.

[^email]: *Your* email may not be handled at Nevis, but the *faculty's* email is. If you were to submit a job that would crash the faculty's email, you would not be popular.

[^backup]: Like the `/home` partitions (and similar critical directories, such as the student accounts on `milne`), the `/share` partitions are backed up nightly. Note that `/data` directories **are not backed up**; we have neither the bandwidth nor the spare disk storage for overnight multi-terabyte backups. Keep that in mind as you're deciding what to keep in `/share` and what to keep in `/data`.

[^recreate]: Here, "recreatable" means that if something were to happen to the `/data` partition, you could easily recreate the file by running a program again. An example of a file that is *not* easy to recreate is a research paper that you write; keep that and its associated plots in your home directory!

[^havent]: If your group hassles you about crashing a file server, ask them whether any of them have ever crashed a server. They'll look embarrassed, mumble an apology, and politely offer to work with you so that it doesn't happen again.

[^infinite]: In this discussion, when I say "physicists like to live in an idealized world"... Oh, never mind. By this point I'm sure you've got the joke.