Nevis UPS Management
This page describes how Uninterruptible Power Supplies ("UPS") are monitored
at Nevis. There's also a web page on which you can see the current UPS status.
Power outages are a fact of life at Nevis. It's not unusual to have
two or three multi-hour outages per year.
To protect the systems from surges or equipment damage due to sudden
power loss, all of the computer servers and workstations in the Room
119 computer enclosure at Nevis are protected by uninterruptible
power supplies; other devices (e.g., the firewall in the
Nevis network room and the server in the Nevis Annex) are
connected to UPSes as well. Note that none of the processing nodes on the batch farm
are connected to a UPS; they are not considered
"critical" systems.
As you can see from the UPS
status page, the UPSes can power the various systems
for between about 10 and 60 minutes. Since this is
shorter than a typical multi-hour power outage at Nevis, there is a
system in place to shut down the systems when the UPS batteries get low
on power, and to automatically turn the systems on again when power is
restored. The idea is that the Nevis systems will respond
properly and automatically in the event of a power outage, even during
times when a system administrator is not immediately available.
The software used to monitor the UPSes and control the
attached systems is Network UPS Tools, or
"NUT". The details of the NUT configuration, in /etc/ups,
are not accessible to most users. Here's the general policy applied
when configuring NUT on the various systems:
- A UPS goes "critical" if both of the following are true:
- There is no AC power being supplied to the UPS.
- The UPS battery goes "low"; that is, the UPS determines that it
has 3-5 minutes of power left under its current load.
- If a system is directly attached to a UPS, the system uses NUT to
monitor when the UPS goes critical. If the attached UPS goes
critical, NUT sends a shutdown command to the system.
- The network switches, including the firewall in the network room,
are also attached to UPSes. If a system's network is connected to a
switch whose UPS goes critical, NUT will shut down the system.
The idea is that if a system loses its network connectivity, odds are that
its NIS and automount
services will get into a bizarre state that would delay or prevent the
completion of a shutdown. It's best to issue the shutdown command
before that occurs.
- Some UPSes supply power to more than one system; as of May-2007, an
example of this is that both lincoln and sullivan
are plugged into the same UPS. In such a situation, one system is the
"UPS master" and the other is the "UPS slave"; NUT on the master
usually communicates directly with the UPS, while the slave gets the
UPS status by communicating with the master. If the UPS goes
critical, the slave will shut down immediately; the master will wait a
minute or so to give the slave a chance to receive the critical
signal.
- Some UPSes communicate their status via serial cables, which can
only be connected to a single system; that's the reason for the
"master-slave" situation described in the previous point.
- The rest of the UPSes have SNMP management cards attached, which
communicate their status over Ethernet. This has two consequences:
- The SNMP management cards are all on the private
network, while many of the systems that monitor them are on the
public network. All public<->private network traffic goes through the
firewall. That means if the firewall goes down, the systems would
lose connection to important UPSes. So if the firewall battery goes
critical, all the Nevis cluster systems shut down (with the exceptions
noted below).
- As of May-2007, the UPS attached to the firewall only supplies
power for about 13 minutes. Given the previous point, this means that
about ten minutes into a power outage, the systems will start shutting
themselves down; the remaining three minutes give the
systems time to shut down cleanly.
- The BIOS on all the systems has been set to automatically start
the system back up on AC power restore. If that were not set, then the
system would remain off even after Nevis power came back on and the
UPS began supplying power to the system again.
- Some systems have an older BIOS that cannot be set to
automatically start on AC power restore;
polaris.nevis.columbia.edu is an example. The BIOS on those
systems was fixed at the factory to go to the "last state": if the
system was powered down normally, then when AC power is restored the
system will remain down. On such systems, NUT has been configured
not to issue a system shutdown when the attached UPS goes
critical. So these "old-BIOS" systems run until their UPS battery runs
out of power, then crash; they come up immediately when the UPS starts
providing power again.
This is a risk; the point of a UPS
and NUT is to help machines shut down and start up cleanly. However,
it turns out the delays caused by waiting for a systems administrator
to give such systems personal attention outweigh the risk.
- Some UPSes (e.g., the one that supplies power to the mail server)
do not turn on their power immediately after Nevis power is restored;
they are set to delay a few minutes. The reason is that those systems
will come up more smoothly if other Nevis systems are already on; this is
the case for the mail server, which mounts a lot of directories from
other systems.
- Once a week, hypatia.nevis.columbia.edu sends a command
to each UPS to test its status. Once a month, hypatia sends
a command to calibrate each UPS's battery under its current load.
These tests are run in the early-morning hours, between 2 AM
and 5 AM.
Aside from keeping the UPS
status page accurate, these tests help assure us that the UPS
batteries are functioning properly. Typically, a UPS battery has to
be replaced about once every five years; these tests let us know when
it's time for a replacement.
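The shutdown policy described above can be summarized in a short Python sketch. This is only an illustration, not the actual Nevis configuration (which lives in /etc/ups and is not shown here). It assumes the UPS status is available as a string of NUT-style flags, where "OB" means the UPS is on battery and "LB" means the battery is low. The System class, its fields, and the function names are all hypothetical, and treating "old-BIOS" systems as exempt from every shutdown trigger is an assumption drawn from the exceptions noted above.

```python
from dataclasses import dataclass


def ups_is_critical(status: str) -> bool:
    """A UPS is critical only if it has no AC power (OB) AND its battery is low (LB)."""
    flags = set(status.split())
    return "OB" in flags and "LB" in flags


@dataclass
class System:
    # All fields are illustrative; they are not taken from the real NUT setup.
    name: str
    own_ups_status: str        # status of the UPS that powers this system
    switch_ups_status: str     # status of the UPS that powers its network switch
    old_bios: bool = False     # True if the BIOS cannot auto-start on AC restore


def should_shut_down(system: System) -> bool:
    """Decide whether a shutdown command should be issued for this system."""
    if system.old_bios:
        # Assumption: old-BIOS systems are never shut down; they run until
        # the battery is exhausted and restart when power returns.
        return False
    return (ups_is_critical(system.own_ups_status)
            or ups_is_critical(system.switch_ups_status))


def minutes_until_critical(runtime_min: float, low_battery_min: float) -> float:
    """Rough time into an outage at which a UPS reports a low battery."""
    return runtime_min - low_battery_min


if __name__ == "__main__":
    # A system whose own UPS is fine, but whose switch UPS has gone critical.
    example = System("example-host", own_ups_status="OL",
                     switch_ups_status="OB LB")
    print(should_shut_down(example))
    # Firewall UPS: about 13 minutes of runtime, 3-minute low-battery margin.
    print(minutes_until_critical(13, 3))
```

Note that a UPS that is merely on battery ("OB" without "LB") is not critical in this sketch; that matches the policy that both conditions must hold before any system is shut down.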