Nevis Redhat Upgrades

This is a discussion of the issues we've identified that are associated with upgrading to the latest version of Redhat Linux on the Nevis Linux cluster. It is a work in progress. Please send me your comments and questions. I'll try to keep this document up-to-date with what we've decided.

In brief: What must be done

Short term (before the end of 2003)

All of the systems on the Linux cluster will either be:

I cannot guarantee that all of your software will run properly on Redhat 9. If you need to preserve the current version of Redhat Linux on a particular machine, then it will have to be blocked at the firewall; this means that although the machine can make out-going connections (e.g., you can SSH from it) you won't be able to make in-coming connections (e.g., you won't be able to SSH to it from outside Nevis).

Long term (before the end of April, 2004)

Of the options listed below, the one that we've selected is to purchase Redhat Enterprise Linux for each machine on the Nevis Linux cluster. The cost (with educational discounts, described below) should be about $1000.

At this point, you can stop reading, unless you want to review the details behind the Redhat upgrades.

Background

UNIX is the primary operating system for software development in the high-energy physics community. Linux has become the "flavor" of UNIX that we use most, largely because its software has been free, it does not require proprietary hardware, and its kernel has been modified relatively quickly to accommodate new hardware. At Nevis we primarily use the Redhat distribution of Linux, mostly because the national labs (BNL, FNAL, CERN) use it; presumably they chose Redhat because it's the best-known distribution.

Versions

Redhat has released several versions of their Linux distribution over the past few years. The versions that are relevant to the cluster are 6.2, 7.2, 7.3, 8.0, and 9. Each of these versions (and the corresponding software patches) could be obtained for free from Redhat, or from mirrors at many sites, including the national labs (such as the BNL mirror or the FNAL mirror).

Redhat's support model (at least for its Linux distributions that could be obtained for free) was to maintain each release via software patches for a time, then discontinue the maintenance for an old version as a new version was released.

Realistically, a company cannot make money by distributing free software. Redhat is changing its distribution model. At the end of December 2003, Redhat Linux versions 7.2, 7.3, and 8.0 will no longer be maintained (support for 6.2 ended several months ago). At the end of April 2004, Redhat Linux 9 will no longer be maintained. The "free ride" is over.
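The end-of-maintenance dates above can be collected into a small lookup table. The following Python sketch is only an illustration: the names END_OF_MAINTENANCE and is_maintained are made up for this example, and "the end of December 2003" and "the end of April 2004" are assumed to mean the last day of each month. Redhat 6.2 is left out of the table because its exact end date is not given here.

```python
from datetime import date

# End-of-maintenance dates for the Redhat Linux versions on the cluster.
# Assumption: "end of the month" is read as the last calendar day.
END_OF_MAINTENANCE = {
    "7.2": date(2003, 12, 31),
    "7.3": date(2003, 12, 31),
    "8.0": date(2003, 12, 31),
    "9": date(2004, 4, 30),
}


def is_maintained(version, on_date):
    """Return True if Redhat still issues patches for `version` on `on_date`."""
    end = END_OF_MAINTENANCE.get(version)
    if end is None:
        # Version not in the table (e.g. 6.2): treat it as unmaintained.
        return False
    return on_date <= end
```

For example, is_maintained("9", date(2004, 1, 15)) is true, while the same call for "7.3" is false.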

Redhat will continue to distribute a free version of Linux called Fedora. This distribution will focus on testing new technologies, and will definitely be experimental (this used to be called "RawHide" Linux). There is no plan to issue regular security patches. As far as I know, no physics lab used RawHide Linux for any kind of serious development, and I think the same will be true of Fedora as well.

The "mainstream" version of Linux will be Redhat Enterprise Linux. The presentation on "RHEL" that was given at HEPiX 2003 is available here. Some additional useful comments on end-of-life issues are here.

Support

Support is important for two reasons:

  1. Bug fixes. Actually, the above-listed versions of Redhat Linux have been fairly stable in their final state. If it wasn't for the next reason, we could just live with whatever version existed at its end-of-life.

  2. Security. This is a serious issue, and drives the remainder of the discussion on this web page. New security flaws are being discovered all the time in existing programs, and it's vital that they be repaired quickly.

The usual cycle for a security flaw is:

Of course, system crackers are also searching existing programs for security flaws. Linux is free because all of its source code is available for inspection by everyone, white hats and black hats alike.

A firewall will not necessarily protect against such security exploits. For example, if there is a security flaw in sendmail, a firewall will not protect our mail server against the exploit. We can control access to our mail server; for example, we can block access from a specific system if we see that it's trying to crack the server. But we'd only see such activity after the possibly-successful attempt had been made. Although a firewall is useful and necessary, it does not eliminate the need to keep our software updated against security flaws.

Backports

In order to understand what our options are, and how some of the national labs are supporting their users, it's helpful to know what "backporting" means. I'm going to explain the concept by example. For this example, I'm going to use SSH; I hope it's clear that the concept of backporting patches applies to any program in Linux, and that it's not enough for us to just fix SSH periodically.

Assume a security flaw is discovered in SSH. The version of SSH used in Redhat Linux is that supplied by the OpenSSH group. When they discover or are informed of a security flaw, the developers of OpenSSH prepare a patch (or, in some cases, a new version) and distribute it.

However, Redhat Linux does not always use the very latest version of every software package. This is sensible, because new software versions often include features (perhaps unrelated to security) that have not yet been fully tested. It's also a good idea to test new software features in the context of all the other programs in a Linux distribution to see if a change in one program affects another. Such tests take time; as illustrated above, there may not be much time between the discovery of a potential exploit and its actual use.

In the case of OpenSSH, the latest version is 3.7.1 (as of Sep-2003). The version of OpenSSH that comes with Redhat is 3.5. So what Redhat would do is to analyze the source code of the OpenSSH-3.7.1 patch, and re-write the fix so that it will work for OpenSSH-3.5. This is known as "backporting" the patch.
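One consequence of backporting is that a package's upstream version number alone does not tell you whether a fix is present: a patched OpenSSH 3.5 is still based on 3.5. The Python sketch below illustrates the idea. The function names and the flaw label "example-flaw" are hypothetical, and the data is invented for illustration; this is not how Redhat actually tracks its patches.

```python
def parse_version(text):
    """Turn a dotted version string such as "3.7.1" into a tuple (3, 7, 1)."""
    return tuple(int(part) for part in text.split("."))


def is_fixed(installed, fixed_upstream, backported_fixes, flaw):
    """Decide whether `flaw` is fixed in the installed package.

    installed        -- upstream version the installed package is based on
    fixed_upstream   -- first upstream version that contains the fix
    backported_fixes -- set of flaw labels patched into the installed package
    """
    if parse_version(installed) >= parse_version(fixed_upstream):
        return True  # the fix arrived with the upstream code itself
    return flaw in backported_fixes  # otherwise it must have been backported


# Hypothetical data: the distribution ships 3.5, upstream fixed the flaw in 3.7.1.
print(is_fixed("3.5", "3.7.1", set(), "example-flaw"))             # no backport yet
print(is_fixed("3.5", "3.7.1", {"example-flaw"}, "example-flaw"))  # fix backported
```

The first call reports the flaw as unfixed and the second as fixed, even though the base version is 3.5 in both cases.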

What others are doing

If the national labs were all doing the same thing, there'd be no decision to make: Nevis would do what they're doing. Unfortunately, the different labs have chosen somewhat different approaches to this issue.

In the quotes below, "RHEN" and "RHEL" stand for "RedHat Enterprise Linux".

Fermilab

I sent an inquiry to Troy Dawson at FNAL. He replied on 06-Nov-2003:

The timing of this e-mail is rather funny. Just this afternoon we finally got permission to send out our solution to the Fermi Linux users. I actually thought you were responding to that e-mail/web page.

This page isn't yet linked from our main web page, but it will be soon.

http://www-oss.fnal.gov/projects/fermilinux/common/Fermi_Linux_Support_Structure.html

So far two questions that have been asked and answered are

Is the licensing understood now, will we be able to use it for free (as in beer)?

Yes.

Isn't RHEL3 really three products (WS, ES and AS), do you use WS - workstation - as the base?

Yes. There is no difference bit wise between AS and ES. WS has a few less RPMS. Notably bind,dhcp-server.

RedHat just releases the SRPMS and does not distinguish if they are AS, WS, ES. Since we and compiling them into our own release we can do anything we feel like. So we expect Fermi Linux LTS 3.0 to have all of the RPMS from all 3 versions.

Troy

p.s. A new question we haven't answered and can't officially say is

Q. Are you going to work with CERN and the grid projects in a unifying linux distrubution?

A. the grid project people don't seem to be working with anyone, wich is a shame. But we are going to be releasing a release called FREE Linux (Fermi Recompiled Enterprise Enviroment) that will be very generic. We will be offering it to the High energy labs such as CERN and slac, although pretty much anyone can use it/distribute it.

We still can't officially say that though, although we have already built it.

In a message on an ESnet discussion board dated 14-Nov-2003, Mark Kaletka wrote:

Fermilab has been working this issue for some months. It was discussed in detail at the most recent HEPiX meeting (http://www.triumf.ca/hepix2003/). SLCCC has also been having some discussions.

At Fermilab we currently use a "localized" Fermi Linux which is based on the RedHat distribution. One of our boundary conditions is that the collider Run II experiments are in production and desire/require support for Fermi Linux 7.3 for one to three years. "Support" means security errata. These of course will not be available from RedHat after the end of the CY but we expect to continue to provide them through a combination of of community support (the "Fedora Legacy" project) and local effort.

The rest of our strategy is to continue our Fermi Linux effort, based on builds of the source RPM's for RedHat Enterprise, which are available as open source. Applications where we need "real" commercial support (i.e. Oracle servers) will use RedHat Enterprise.

We don't expect to provide support for the Fedora releases, since we don't expect a great deal of stability there.

Our approach is consistent with the discussions with other HEP labs at HEPiX.

In short: Fermilab will continue to maintain and make available its own distributions of Linux based on Redhat, and backport security patches when necessary. We should treat FREE Linux as just a rumor right now.

Brookhaven

General information on UNIX at BNL can be found here.

I asked BNL IT support, and received the following response from Richard Casella on 06-Nov-2003:

I manage Unix admins for the Brookhaven Computing Facility.

We run a little under 300 Red Hat systems, there are probably in the neighborhood of 200 at BNL.

This subject is the talk of most of the labs. It was discussed at HEPIX 2 weeks ago, and SLCCC last week. There are lots of proposals and I don't think anybody is sure where it is going to fall out at this point.

We are having a lab-wide Unix admin meeting next Thursday to discuss it further.

If I had to guess, I would guess that some combination of systems using RHEN and some not will be where we end up, but what those numbers will be, I have no Idea.

Seems to me there is a lot of polarization going on.

Chi forwarded a note to me that was posted on phenix-comp-l mailing list by Charlie Maguire:

On our Departmental farm we are planning to follow the FermiLab migration path, unless there are more attractive new options from the RedHat company.

For the larger VAMPIRE farm (local publicity reference at http://sitemason.vanderbilt.edu/newspub/bjfTyg?id=7886) it's a more complicated issue because of the mix of groups involved. However, we certainly can't afford a per node license arrangement.

Because of its defined, near End-of-Lifetime date, RH8 has never been a consideration locally.

It's important to note that BNL's concern with Linux security issues is different from that of Nevis, because BNL has a different security model (or at least US-ATLAS does). To access US-ATLAS systems, one must first login to a "gateway" machine, and then login to one of the internal workstations at BNL; you cannot login to a workstation directly from the outside.

It would add a large measure of inconvenience, but we could adopt a similar security model for the Nevis Linux cluster. This would eliminate many of the security concerns behind the need to upgrade to the latest Redhat Linux distribution.

CERN

The CERN Linux pages can be viewed here.

According to this news item, CERN plans to stay with its custom version of Redhat Linux 7.3.1. It will backport all the security patches through to the end of 2004. During 2004, a new distribution will be proposed, and CERN systems will start migrating.

Columbia

I sent e-mail to ACiS and asked if they had any licensing plans for Redhat Enterprise Linux. Walter Bourne replied, stating that Columbia has no plans to subscribe to Redhat, though he'll look at this issue again in mid-December.

What are our options?

These are the options I've explored. If you have any other suggestions, I'm anxious to listen.

Purchase RHEL for the Nevis Linux cluster

If we can afford it, this is the solution I recommend.

There are about 35 boxes in the current cluster. Of these, five boxes (hypatia [NIS server], franklin [mail server], annex [general annex server], hammurabi [FTP server], and the nevis1 replacement) would require Redhat Enterprise ES. For the remaining boxes, Redhat Enterprise WS would be sufficient.

The base prices for these packages are $179 for each WS license, and $349 for each ES license. For 30 WS licenses and five ES licenses, that adds up to $7115. I recommend that we not bother with the extended support options; in the five years I've maintained Linux systems at Nevis, I've only used their support services once, and their advice was completely useless.

Fortunately, we may not have to pay that much. Redhat is now offering academic discounts, and I think that Nevis qualifies. The individual academic prices are $25 for WS and $50 for AS, with no telephone support (Nevis does not require enough licenses to make a site subscription worthwhile). This would be ideal for us; the total cost would drop to $1000.
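The two totals can be checked with a few lines of Python. This is just a worked version of the arithmetic, assuming exactly 35 boxes, of which the five servers take the more expensive license and the other 30 take WS; the function name total_cost is made up for the example.

```python
# Box counts from the text: about 35 boxes, five of which are servers.
TOTAL_BOXES = 35
SERVER_BOXES = 5
WORKSTATION_BOXES = TOTAL_BOXES - SERVER_BOXES


def total_cost(workstation_price, server_price):
    """Total license cost in dollars for the whole cluster."""
    return WORKSTATION_BOXES * workstation_price + SERVER_BOXES * server_price


base_cost = total_cost(179, 349)    # base prices for WS and ES
academic_cost = total_cost(25, 50)  # academic prices quoted above

print(base_cost)      # 7115
print(academic_cost)  # 1000
```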

Of course, as we purchased new boxes or as new versions of RHEL became available, we'd have to purchase additional licenses. Redhat has promised to support each version of RHEL for five years, but over the course of time we'd have to renew the licenses on each box.

The advantage of this approach is that a large company will be in charge of maintaining the security of our Linux distribution. Redhat has a proven track record in this area; they've maintained the security patches on our Linux distributions in the past, and there's no reason to doubt them for the future.

The disadvantage is two-fold:

  1. The expense. In the near future, we want to replace nevis1 and franklin with faster servers, and possibly get a new tape jukebox. Adding an additional expense on top of this may be more than we can afford.

  2. None of the other labs are taking exactly this solution. This is basically the relationship that the Nevis Linux cluster has with the other labs right now: I manage things the best I can to make everyone happy. However, it's possible that a program or script developed at Nevis won't work elsewhere, and vice versa.

Laptops

Laptops are a special case. Installing Redhat 9 on Hal Evans' laptop was a nightmare; configuring the wireless card was particularly difficult. Redhat 8.0 did not pose these problems. It's become clear that Redhat is aiming towards the workstation and cluster market, and not configuring their distributions for laptops.

For now, I'm willing to support Redhat 8.0 for laptops. Laptops should not be offering services on the network anyway, so the security risks are minimized. The existing physics applications I maintain on the applications server (CERNLIB, ROOT, Geant4) will function with GCC 3.2, which is the default in Redhat 8.0.

Over time, more and more software is going to function on RHEL systems but not on laptops running RH8.0. We'll have to deal with that situation when it arrives.

Use the FNAL distribution

In this model, we use the Fermilab Redhat distribution (whether it's "FREE" or some other version) for our cluster at Nevis. (We can't use the CERN 7.3.1 distribution, since our mail server badly needs functionality not available in Redhat 7.x.)

This has the advantage of being free. I'm uncertain as to the reputation of FNAL's Computing Division. However, I'm assured that they are very responsive when it comes to security issues; they are often quicker than Redhat in offering security patches to their distribution.

Up until now, I've been able to maintain a cluster that (judging by the lack of complaints) works equally well for software development and data analysis for collaborators from BNL, CERN, FNAL, and other institutions. I'm concerned that if we use the FNAL Linux distribution, the systems here may become less useful for users from other institutions.

Use a gateway to protect our systems

If we had to keep a substantial number of our systems at earlier versions of Redhat for development purposes, we could think about restricting access to the cluster through a gateway. This would be similar to the BNL approach to security, as noted above.

This has the advantage of being secure, and being low-cost in actual dollars (since we'd only get RHEL for the gateway). However, it would inconvenience all the users, and (in the short term) be a drain on manpower as we re-designed the network services for this new configuration.

Maintain our own distribution

This is the least costly in terms of actual dollars, but the most costly in terms of sysadmin hours.

The idea is that we'd do our own maintenance on the current Linux distributions we have installed. First, we'd have to "inventory" the exact network services we want to maintain (SSH, mail, etc.) and firewall everything else. Then we'd have to maintain the patches and software versions for those network services.
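The "inventory" step amounts to comparing what is actually listening on a box against the short list of services we're willing to maintain. Here is a minimal Python sketch of that comparison. The function name plan_firewall and the sample inventory are hypothetical; a real inventory would have to be gathered from each machine.

```python
def plan_firewall(listening, maintained):
    """Split the services found on a box into those to keep and those to block.

    listening  -- dict mapping port number to service name found on the box
    maintained -- set of service names we commit to patching ourselves
    """
    keep = {port: name for port, name in listening.items() if name in maintained}
    block = {port: name for port, name in listening.items() if name not in maintained}
    return keep, block


# Hypothetical inventory of one box, using the standard port numbers.
listening = {22: "ssh", 25: "smtp", 111: "portmap", 631: "ipp"}
maintained = {"ssh", "smtp"}

keep, block = plan_firewall(listening, maintained)
print("keep open:", sorted(keep))
print("firewall: ", sorted(block))
```

Everything that lands in the second group would be blocked at the firewall; only the first group would need patches and version tracking.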

I think we should only adopt this approach if we have no other choice -- or we hire another systems administrator.

Move to a different distribution

There are other Linux distributions than Redhat: SUSE, Mandrake, Debian; there are even other flavors of UNIX that are free such as FreeBSD. If we're having problems with Redhat as a company, perhaps we should switch to a different source for our OS software.

The main disadvantage here is that none of the other labs are following this path. We use Linux here at Nevis for software development. We must make sure that the code and scripts that we create here work at the other physics labs.

Exception: One or more of the other distributions I mentioned above may be better for laptops than Redhat. I don't have the time to provide support for more than one flavor of Linux distribution, but I encourage Linux-on-laptop users to explore these other possibilities.

Fragment the cluster

It may be that, with the different national labs and institutions moving in different directions, it's not appropriate to try to manage a single central cluster at Nevis. The HiRes group already operates a separate cluster of Linux systems; they're responsible for the maintenance and security of those systems. Perhaps the different research groups at Nevis should follow the same model: the D0 group manages the D0 boxes and configures them with FNAL Linux, the ATLAS group manages the ATLAS boxes and configures them with CERN Linux, etc. There would remain a few systems (e.g., mail server, web server) that would be maintained by the Nevis systems administrators.