/usr/local to /usr/nevis
The basics
The applications
directory is where I put the physics
software that we use, such as ROOT and Geant4. This directory is
currently shared to all the boxes on the Linux cluster with the
name /usr/local. I'd like to change that name to
/usr/nevis. It's not easy to do this quickly, so I propose to do it
gradually. Here's the procedure:
1. Define a variable ${NevisAppBase} that points to whatever
application directory is seen by a given box. I've already done this;
you can type echo $NevisAppBase on any cluster box to confirm
it.
2. Replace the hard-coded name /usr/local with the
variable $NevisAppBase in the user scripts.
3. Change the directory links and mount points on each system individually.
I will take care of steps (1) and (3) above. As for step (2):
- On Wed 16-May-2007 at about 6PM, I'll run a script that will go
through every home directory on the cluster. It will look for the
following files in your home directory:
- .profile
- .cshrc
- .myprofile
- .mycshrc
The script will create a backup of the file in
<filename>.bak (e.g., .profile will be saved
as .profile.bak) and replace all occurrences of
/usr/local with ${NevisAppBase}.
- Any other scripts you've written that reference /usr/local will
be your responsibility.
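If you want to apply the same kind of change to your own files ahead of time, the backup-and-replace step described above can be sketched as follows. This is a minimal illustration, not the script I will actually run; the function name migrate_home is made up for this example, and it assumes a POSIX sh with the standard cp and sed commands.

```shell
#!/bin/sh
# Hypothetical sketch of the backup-and-replace step; not the real cluster script.

# migrate_home DIR: for each of the four startup files found in DIR,
# save a copy as <filename>.bak and replace /usr/local in the original.
migrate_home() {
    dir=$1
    for name in .profile .cshrc .myprofile .mycshrc; do
        file="$dir/$name"
        # Skip startup files that this user does not have.
        [ -f "$file" ] || continue
        # Keep the untouched original as <filename>.bak.
        cp -p "$file" "$file.bak"
        # Single quotes keep ${NevisAppBase} literal, so the file ends up
        # containing the variable reference rather than its current value.
        sed 's|/usr/local|${NevisAppBase}|g' "$file.bak" > "$file"
    done
}
```

To try it, copy your startup files into a scratch directory and run migrate_home /path/to/scratch before pointing it at a real home directory.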
Note that there's no reason to reference /usr/local
directly if what you're trying to do is compile or link with ROOT,
CERNLIB, or Geant4:
- For ROOT, use the root-config command; type
root-config --help for a list of options. A typical
invocation might be:
g++ myrootprogram.cxx `root-config --cflags --libs`
- For CERNLIB, use the cernlib command; for example:
g77 myprogram.f `cernlib packlib kernlib`
- For Geant4, use the variables defined in the Geant4
installation; e.g., $G4INSTALL.
IMPORTANT: The variable $NevisAppBase is only defined for
login shells; it is not automatically defined for scripts
submitted via batch. To make sure this variable is defined in a
batch script, you
have these choices:
- Run the script using a login shell, which you can do even in
a batch environment by adding the -l option to the command
that invokes the shell; e.g., if your script begins
#!/bin/sh modify it to #!/bin/sh -l (see this FAQ
for more information).
- Define $NevisAppBase yourself. For sh-style
shells:
[ -d /usr/nevis/adm ] && export NevisAppBase=/usr/nevis || export NevisAppBase=/usr/local
For csh-style shells:
[ -d /usr/nevis/adm ] && setenv NevisAppBase /usr/nevis || setenv NevisAppBase /usr/local
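For sh-style batch scripts, the second choice can also be written out in a longer form that leaves any existing value alone. The following is only a sketch under the same assumption as the one-liner above (that /usr/nevis/adm exists only on boxes that have already been switched over); the echo line is there purely for illustration.

```shell
#!/bin/sh
# Sketch of a batch script preamble; adjust to your own job.

# Define NevisAppBase only if the environment has not already provided it.
if [ -z "${NevisAppBase:-}" ]; then
    if [ -d /usr/nevis/adm ]; then
        # This box already uses the new name.
        NevisAppBase=/usr/nevis
    else
        # This box still uses the old name.
        NevisAppBase=/usr/local
    fi
    export NevisAppBase
fi

echo "Using application directory: $NevisAppBase"
```

The explicit if/else avoids relying on the && ... || chain, in which the last command also runs if the middle one fails.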
The details
History
Back in the 1990s, I set up the physics analysis software on Nevis'
single central UNIX machine, nevis1.
I put the software on nevis1:/usr/local, which seemed the
logical location for it.
As we shifted from nevis1 to the Linux cluster, I kept the
physics software in /usr/local, in the hope that users would
be able to keep the same scripts on both systems.
It became rapidly clear that I couldn't put a copy of the physics
software on each of the Linux systems, so I worked out a scheme using
automount
for the /usr/local directory. (According to the formal UNIX file system
hierarchy, this was a mistake, since /usr/local is
supposed to be for files unique to that specific system, but this
seemed to be a minor issue at the time.)
At first I automounted the /usr/local directory from
hypatia, with franklin as the "replica". However, I
finally ran out of disk space on those systems, and moved the physics
software to karthur (aliased to
library.nevis.columbia.edu), with kolya as the replica
(aliased to assistant.nevis.columbia.edu).
The problem
All of the above worked until the first time that karthur
went down: every system on the cluster slowed to a crawl, including the
mail server. On subsequent occasions when karthur has gone
down or has been overloaded, the same thing has happened.
Why does this happen? Two reasons:
1. Automount is not quite doing what I want it to do. The first
time it tries to mount library:/usr/local, it pauses for a
while; then it mounts assistant:/usr/local instead. That's
fine. Then when it tries to mount /usr/local from some other
process:
- What I want it to do is say, "Hey, I've already mounted
/usr/local, so there's no need to mount it again. Just
return a link to the directory I've already mounted."
- What it actually does is try to mount
library:/usr/local again, pause for a while, then settle
for the replica.
Multiply the above delay by every process running on the machine, and
you can see why performance suffers.
(Actually, according to the documentation, neither of the above scenarios
is what's supposed to occur. The automounter is supposed to try both
servers simultaneously, then go with the one that responds first. I
don't know why this does not appear to happen.)
2. The directory /usr/local is in the default
$PATH and $LD_LIBRARY_PATH, which means that every
single process that invokes a sub-shell tries to access these
directories. There are lots of such processes continuously created on
a Linux system.
Combine this with (1), and we get what we observe: When
karthur goes down, everyone suffers.
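To see whether a particular shell is exposed to the search-path issue described above, you can list the search-path entries that point into /usr/local. This is a small sketch for sh-style shells; the function name list_matching_entries is made up for this example.

```shell
#!/bin/sh
# list_matching_entries PATHLIST PREFIX: print each colon-separated entry
# of PATHLIST that is PREFIX itself or a directory underneath it.
list_matching_entries() {
    old_ifs=$IFS
    IFS=:
    # With IFS set to ":", the unquoted expansion splits on colons.
    for entry in $1; do
        case $entry in
            "$2"|"$2"/*) printf '%s\n' "$entry" ;;
        esac
    done
    IFS=$old_ifs
}

# Check both search paths; an unset variable is treated as empty.
list_matching_entries "${PATH:-}" /usr/local
list_matching_entries "${LD_LIBRARY_PATH:-}" /usr/local
```

Any directory printed here is one that the shell may touch, and therefore may trigger an automount, whenever it searches for a command or library.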
Moving to /usr/nevis will solve issue (2): The only
shells that will try to mount /usr/nevis will be the login
shells. That means it will take you a while to log in, which is a
pain, but no other shells will be affected. The systems won't slow to
a crawl anymore.
As for (1): Periodically I review the automount options, trying to see
if anything can be set differently. At present the various NFS
timeouts are set to be quite short. If I discover the reason for the
delays, I'll make the appropriate changes to the automount
configuration files.