User Authentication and Disk Monitoring Discussions | Beowulf List

Authentication and disk help on the way

The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this article I review some postings to the Beowulf list on user authentication within clusters, along with some postings to the smartmontools mailing list on monitoring disks.

Authentication Within Clusters

A very good cluster topic for discussion is how people authenticate within a cluster. Strictly speaking, authentication is the process of verifying who you are, and authorization then determines what you can do on a system. In layman's terms, authentication allows you to log into a node and run jobs. On January 30, 2004, Brent Clements asked the Beowulf mailing list how people did authentication on their clusters.

One should expect a number of responses to this question. The first response was from Daniel Widyono, who said that they use one form of authentication to log into the head node and their own system for authentication inside the cluster. They copy /etc/passwd to all of the nodes via a cron script and have written wrappers for useradd and userdel that copy /etc/passwd and /etc/shadow to the nodes when a user is added or removed. They use /etc/passwd for account information and update an authentication token on each node once it is assigned to a user (through a scheduling system). ssh then checks the authentication token using a PAM module before execution begins. They also use BProc to determine ownership on the head node.
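The push model Daniel describes can be sketched in a few lines. This is only an illustration, not his actual script: the node names, the use of rsync as the transport, and the function names are all assumptions made for the example.

```python
import subprocess

# Files that carry account data; both must stay in step on every node.
ACCOUNT_FILES = ("/etc/passwd", "/etc/shadow")


def build_push_commands(nodes, files=ACCOUNT_FILES):
    """Return one rsync argument list per node (node names are hypothetical)."""
    commands = []
    for node in nodes:
        # -a preserves ownership and permissions, which matters for /etc/shadow.
        commands.append(["rsync", "-a", *files, f"root@{node}:/etc/"])
    return commands


def push_account_files(nodes, dry_run=True):
    """Run the rsync commands and return the nodes whose push failed.

    With dry_run=True the commands are only printed, never executed.
    """
    failed = []
    for node, argv in zip(nodes, build_push_commands(nodes)):
        if dry_run:
            print(" ".join(argv))
            continue
        if subprocess.run(argv).returncode != 0:
            failed.append(node)
    return failed
```

A wrapper around useradd or userdel could call push_account_files() right after the account change, and a cron job could call the same function as a safety net for nodes that were down at the time.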

Robert Brown (RGB to his friends) then pointed out what many experienced cluster people know: NIS is a high-overhead protocol that impacts the performance of clusters. There have been past discussions about NIS usage in clusters, and if you search the web for "NIS" and "cluster" you should be able to find them (try adding "beowulf" to refine the search). RGB pointed out that you will get NIS traffic any time a file stat is performed. Imagine this across many nodes and you will see how NIS can become a drain on network performance. RGB also discussed security aspects of NIS. There have been many well-known problems with NIS, including the fact that NIS sends information in the clear (i.e., not encrypted). RGB then pointed out that many people use rsync to copy /etc/passwd and /etc/shadow to the nodes (in much the fashion that Daniel mentioned). However, RGB cautioned that you have to watch for password changes and copy the appropriate files to the nodes when they happen (you could write a wrapper for passwd to perform parts of this operation).
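One way to "watch for password changes" without wrapping passwd is to compare file digests between runs and push only when something differs. The sketch below is an assumption about how that might look, not something posted in the thread; the function names and the in-memory state dictionary are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file's contents, as a hex string."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_files(paths, state):
    """Return {path: new digest} for files whose contents differ from `state`.

    `state` maps path -> digest recorded at the last successful push.
    It is not modified here, so the caller can update it only after
    the push has actually succeeded.
    """
    changed = {}
    for path in paths:
        key = str(path)
        digest = file_digest(key)
        if state.get(key) != digest:
            changed[key] = digest
    return changed
```

A cron job would call changed_files(), push the files if the result is non-empty, and then merge the result into its saved state with state.update().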

A user with the email name of "Jag" replied that at his university they configured PAM on the head node to authenticate against the main Kerberos server, but they remap the home directories and other things for the cluster. They also use NIS within the cluster, but only for name service information. To access the compute nodes they use host-based authentication with ssh. Jag also suggested that people using NIS could run nscd (the Name Service Cache Daemon), which is part of glibc. nscd doesn't stop the NIS traffic, but it limits it by caching the results of lookups for subsequent requests. Leif Nixon posted that he was suspicious of nscd because he has seen it hang on to stale information for no good reason.
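To see why a cache limits traffic without removing it, consider the toy lookup cache below. It is only in the spirit of nscd and does not reflect how nscd is implemented; the class name, the 600-second lifetime, and the backend callable are all assumptions for illustration.

```python
import time


class LookupCache:
    """Tiny time-to-live cache in front of a slow lookup (illustration only)."""

    def __init__(self, backend, ttl_seconds=600, clock=time.monotonic):
        self.backend = backend        # slow lookup, e.g. a query to a NIS server
        self.ttl = ttl_seconds        # arbitrary lifetime chosen for the example
        self.clock = clock
        self.entries = {}             # key -> (expiry time, value)
        self.backend_calls = 0

    def get(self, key):
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]             # answered locally, no network traffic
        # Miss or expired entry: ask the backend and remember the answer.
        self.backend_calls += 1
        value = self.backend(key)
        self.entries[key] = (now + self.ttl, value)
        return value
```

Repeated lookups inside the lifetime never reach the server, which is the traffic saving. The flip side is that a changed record stays invisible until its entry expires. That bounded staleness is expected behavior and is not the same as the unexplained stale entries Leif reported.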

Mark Hahn chipped in that he uses the ubiquitous rsync-ing of the password/shadow files and uses ssh to get to the nodes inside the cluster. Mark also had some good comments about why he doesn't like centralized authentication for a campus: it creates a central point of failure (despite fail-over servers, etc.) and a network hotspot, and it can increase the workload on the poor person who has to administer the central authentication system.

Joe Landman posted that he was very leery of NIS because he has had customers crash it when serving login information just by running a simple script across the cluster. Joe said that he prefers to push name service lookups through DNS, particularly dnsmasq. Joe added that configuring a full-blown named/bind system for a cluster is significant overkill in many cases. For authentication, Joe had been hoping that LDAP would solve his problems, but he hasn't been able to repeatably build a working LDAP server with databases. He said that he's beginning to think about a simple database with PAM modules on the front end (such as pam-mysql).

Brent Clements responded that they had been using LDAP and found it to work very well, especially with Red Hat. They like it because they can integrate it with a web-based account management system for various groups within the campus. Joe responded that he thought the client side of LDAP was very easy to configure and run, but it was the server side that he had trouble with. He used Red Hat's LDAP RPMs and tried various things but could never get it to work the way he wanted.

The final poster was Steve Timm, and he had some good information about NIS. Steve has used NIS on their cluster but found problems with it. In particular, when a job, such as a cron job that runs a script, starts on all the nodes at once, the NIS server is hammered by all the nodes (a.k.a. an "NIS storm"). In an effort to prevent NIS storms, they tried making each node in the cluster a NIS slave, but found that the transmission protocol is not perfect and there were always a few slaves that were down a map or two. Steve said they ended up pushing the password and shadow files out to the compute nodes from the head node using rsync.

It seems that, for the time being, many people prefer using rsync to copy the password and shadow files to the compute nodes. While not an ideal method, it is very simple and effective and has a very low impact on the network (unlike NIS). Perhaps some ingenious person will come up with a better way some day (hint, hint).

SMART: Usage Within Big Clusters

In the past I have mentioned the SMART (Self-Monitoring, Analysis and Reporting Technology) system included in virtually all modern hard drives. SMART-capable hard drives have added intelligence in the firmware to monitor the drive and to attempt to detect hard drive failures. SMART-capable drives can also perform various types of self-tests, which are very useful for diagnostics, and can report the temperature of the hard drive (note: not all hard drives report the same information). There is a nice package for Linux, called smartmontools, that allows you to access the SMART information and to run self-tests on SMART-capable drives to help detect drives that are failing.

On February 14, 2004, Konstantin Kudin asked if anyone was using SMART monitoring of IDE drives in big clusters. He was curious how often SMART was able to give some kind of warning of a failing drive within 24 hours of failure. Steve Timm responded that they had been using SMART monitoring tools on their cluster and SMART was able to predict failure about 50% of the time. Steve seemed very happy with this number.

Joe Mack posted a question about how one can get information out of smartd (the daemon in smartmontools). Steve Timm replied that they were using an older version that didn't have smartd and just used a cron script to run a short test every night and capture the output to a file. Steve also said that they were probably going to switch over to using smartd and an agent that is already grep-ing through /var/log/messages to capture the SMART information.
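A nightly check along the lines Steve describes might look like the sketch below. It is not his script: the function names and log path handling are hypothetical, and the health-line text it parses is the form smartctl prints for ATA/IDE drives, so other drive types may need a different pattern.

```python
import subprocess

# Line printed by `smartctl -H` for ATA drives (assumed drive type).
HEALTH_PREFIX = "SMART overall-health self-assessment test result:"


def parse_health(smartctl_output):
    """Return the health verdict (e.g. 'PASSED'), or None if no verdict line."""
    for line in smartctl_output.splitlines():
        if line.startswith(HEALTH_PREFIX):
            return line[len(HEALTH_PREFIX):].strip()
    return None


def nightly_check(device, log_path):
    """Start a short self-test, then append the current report to a file."""
    # The short test runs inside the drive and takes a few minutes,
    # so the self-test log read below shows earlier runs, not this one.
    subprocess.run(["smartctl", "-t", "short", device], check=False)
    report = subprocess.run(
        ["smartctl", "-H", "-l", "selftest", device],
        capture_output=True, text=True, check=False,
    ).stdout
    with open(log_path, "a") as log:
        log.write(report)
    return parse_health(report)
```

Run from cron as root, nightly_check() leaves a growing per-node file that a separate agent can collect, and the returned verdict can be used to raise an alert right away.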

Felix Rauch posted that he was using smartmontools as well and had a few troubles grep-ing through the system logs, particularly when the logs rotate. He now uses a simple setuid-root program to monitor temperatures on the drives. Daniel Fernandez also mentioned that it's possible to have smartd write to a file other than the system logs and check it regularly for temperature. He also mentioned that you can have smartd run a script if a problem develops.
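Reading the temperature straight from the drive avoids the log-rotation problem altogether. The sketch below parses the attribute table printed by smartctl -A. It assumes the drive reports temperature as attribute 194, which is common but not universal, and the function name is hypothetical.

```python
import re


def parse_temperature(attributes_output):
    """Extract degrees Celsius from `smartctl -A` text, or None if not found."""
    for line in attributes_output.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; the raw value is the last one.
        # ID 194 is assumed to be the temperature attribute.
        if len(fields) >= 10 and fields[0] == "194":
            match = re.match(r"\d+", fields[9])
            if match:
                return int(match.group())
    return None
```

The text to parse would come from running smartctl -A on the device as root, for example through subprocess.run() with capture_output=True. A drive that does not report attribute 194 simply yields None.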

Sidebar One: Links Mentioned in Column

Beowulf Archives

Smartmontools Archive

Smartmontools

NIS on Linux

LDAP on Linux HOWTO

LDAP Implementation HOWTO

Rsync

Kerberos

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Jeff Layton has been a cluster enthusiast since 1997 and spends far too much time reading mailing lists. He can be found hanging around the Monkey Tree at ClusterMonkey.net (don't stick your arms through the bars though).