/usr/local to /usr/nevis

The basics

The applications directory is where I put the physics software that we use, such as ROOT and Geant4. This directory is currently shared to all the boxes on the Linux cluster with the name /usr/local. I'd like to change that name to /usr/nevis. It's not easy to do this quickly, so I propose to do it gradually. Here's the procedure:

  1. Define a variable ${NevisAppBase} that points to whatever application directory is seen by a given box. I've already done this; you can type echo $NevisAppBase on any cluster box to confirm it.

  2. Replace the hard-coded name /usr/local with the variable $NevisAppBase in the user scripts.

  3. Change the directory links and mount points on each system individually.

I will take care of steps (1) and (3) above. As for step (2):
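
As a rough illustration of step (2), here is a minimal sketch for a Bourne-style shell script. The helper name nevis_app_path and the "root" sub-directory are hypothetical, chosen only for the example. The fallback to /usr/local when $NevisAppBase is unset is likewise an assumption, not something the cluster setup requires.

```shell
#!/bin/sh
# Before (hard-coded; the sub-directory name is hypothetical):
#   ROOTSYS=/usr/local/root

# Hypothetical helper: build a path under the application directory.
# Uses $NevisAppBase when it is set, and falls back to /usr/local
# otherwise (the fallback is an assumption made for this sketch).
nevis_app_path() {
    printf '%s/%s\n' "${NevisAppBase:-/usr/local}" "$1"
}

# After: the same setting, written in terms of the variable.
ROOTSYS=$(nevis_app_path root)
export ROOTSYS
echo "ROOTSYS=${ROOTSYS}"
```

Written this way, a script should keep working both before and after a given box switches from /usr/local to /usr/nevis.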

The details

History

Back in the 1990s, I set up the physics analysis software on Nevis' single central UNIX machine, nevis1. I put the software on nevis1:/usr/local, which seemed the logical location for it.

As we shifted from nevis1 to the Linux cluster, I kept the physics software in /usr/local, in the hope that users would be able to keep the same scripts on both systems.

It rapidly became clear that I couldn't put a copy of the physics software on each of the Linux systems, so I worked out a scheme using automount for the /usr/local directory. (According to the formal UNIX file system hierarchy, this was a mistake, since /usr/local is supposed to be for files unique to that specific system, but this seemed to be a minor issue at the time.)

At first I automounted the /usr/local directory from hypatia, with franklin as the "replica". However, I finally ran out of disk space on those systems, and moved the physics software to karthur (aliased to library.nevis.columbia.edu), with kolya as the replica (aliased to assistant.nevis.columbia.edu).

The problem

All of the above worked until the first time that karthur went down: every system on the cluster slowed to a crawl, including the mail server. On subsequent occasions when karthur has gone down or been overloaded, the same thing has happened.

Why does this happen? Two reasons:

  1. Automount is not quite doing what I want it to do. The first time it tries to mount library:/usr/local, it pauses for a while; then it mounts assistant:/usr/local instead. That's fine. Then when it tries to mount /usr/local from some other process:

    Multiply the above delay by every process running on the machine, and you can see why performance suffers.

    (Actually, according to the documentation none of the above scenarios are what's supposed to occur. The automounter is supposed to try both servers simultaneously, then go with the one that responds first. I don't know why this does not appear to happen.)

  2. The directory /usr/local is in the default $PATH and $LD_LIBRARY_PATH, which means that every single process that invokes a sub-shell tries to access these directories. There are lots of such processes continuously created on a Linux system.

    Combine this with (1), and we get what we observe: When karthur goes down, everyone suffers.
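
To see how point (2) applies to a particular account, a small check along these lines may help. It is only a sketch: the function name entries_under is hypothetical, and it simply lists the entries of a colon-separated search path that lie in or under a given directory.

```shell
#!/bin/sh
# Hypothetical helper: print each entry of a colon-separated search
# path ($2) that is the directory $1 itself or lies beneath it.
entries_under() {
    prefix=$1
    # "|| true" keeps the function from failing when nothing matches.
    printf '%s\n' "$2" | tr ':' '\n' | grep -E "^${prefix}(/|\$)" || true
}

# Any line printed here is a directory that the shell or the dynamic
# linker may touch under /usr/local when a process starts.
entries_under /usr/local "${PATH:-}"
entries_under /usr/local "${LD_LIBRARY_PATH:-}"
```

If either call prints something, processes started from that environment can end up waiting on the /usr/local mount.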

Moving to /usr/nevis will solve issue (2): the only shells that will try to mount /usr/nevis will be the login shells. That means that when karthur is down it will take you a while to log in, which is a pain, but no other shells will be affected. The systems won't slow to a crawl anymore.

As for (1): Periodically I review the automount options, trying to see if anything can be set differently. At present the various NFS timeouts are set to be quite short. If I discover the reason for the delays, I'll make the appropriate changes to the automount configuration files.