Grids, Ganglia Metrics, IO, and UML | Beowulf List

The Beowulf mailing list provides detailed discussions about issues
concerning Linux HPC clusters. In this column I turn my attention first
to the bioclusters mailing list, where I
report on using semi-public PCs for grid-type applications and on how
to handle large numbers of files. I also turn to the
ganglia-developers mailing list to report on how one can add a
"disk alive" metric to ganglia. You can consult the
Beowulf archives
or the
Bioclusters archives.

Bioclusters: Using Semi-Public PCs

There was an interesting discussion on the bioclusters mailing list
about using semi-public PCs for heavy computational jobs. On Feb. 15,
2004, Arnon Klein asked about running his jobs on semi-public machines
that run various flavors of Windows. Arnon asked this question
because he is doing his graduate research and needs computational power.
He had already exhausted the machines easily available to him, so he was
looking for suggestions about what to do next.

The first response came from Chris Dwan. Chris responded that he's in
a similar boat but has managed to put together some systems from
various campuses into something like a grid. He also provided a very
useful ranking of systems in terms of access difficulty. For example,
systems that he maintains were easiest to get into followed by systems
running Linux or OS X (which Chris also runs). The lowest two ranked
systems were Windows machines that either could be rebooted at night
or could not be rebooted at all. Chris went on to talk about some
schedulers that can steal cycles from idle workstations (e.g.,
SGE,
Torque,
LSF),
although he said that integrating disparate schedulers
can be very difficult. He did mention
Condor from the University of
Wisconsin as a possible solution. He also mentioned the grid software
from
United Devices, which runs
on Windows machines but will use compute cycles from other machines.

Farud Ghazali mentioned that he's also looking for a solution to
this type of problem. He pointed out that there were many practical
difficulties including authentication across disparate resources. Chris
Dwan jumped in to explain how he has hacked up something to do
authentication for him.

Ron Chen joined the conversation to mention that
SGE (Sun Grid Engine)
version 6.0 will integrate with JXTA, which then offers Jgrid, which in turn offers
P2P (peer-to-peer) workload management in a fashion similar to SETI@home.
However, he did say that SGE 6.0 won't be out until May of 2004 (and it
may slip slightly from then). Until then, Ron recommended using
BOINC.
This package starts jobs and transmits data using port 80, which makes
it easier to get through a firewall than other approaches. It also
has versions for Windows, Linux, Solaris, and OS X. John van Workum
also mentioned GreenTea (www.greenteatech.com), which offers a Java
P2P client that gives grid capabilities for running jobs. Bruce
Moxon also mentioned that the
Cornell Theory Center, which is about
the only place doing clusters with Windows, has some tools that
might help with Windows machines.

While this discussion was short, it did offer some ideas that could
help people in similar situations. There are many people and groups
thinking about the same things that Arnon mentioned in his first posting.

Ganglia-Developers: Disk Alive Metric

I'm sure many readers are aware of
ganglia. It is a
scalable distributed monitoring system for high performance computing
systems such as clusters and grids. It is open source and in use on over
500 clusters throughout the world. On December 22, 2003, on the
ganglia-developers mailing list, Federico Sacerdoti asked about a metric that
ganglia could watch that would report whether a disk was alive.
Federico had been talking to a Purdue (my alma mater) system
administrator about a cluster put together from old PCs. The disks in
the machines keep failing, but ganglia fails to report the disks
as down because the ganglia daemon still reports a heartbeat even when
the node is basically down. Federico posted a possible solution that
he had worked out with the administrator but had not yet tried.

Brooks Davis replied that he didn't think it would work, at least in
FreeBSD, because of the way Unix and Unix-like systems work. He did
offer another solution that read random blocks from a file system to
make sure the drive was still functioning.
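
The thread summary does not include Brooks' code, so here is only a minimal Python sketch of the random-read idea. The function name `read_random_blocks`, the 4096-byte block size, and the sample count are my own assumptions, not details from the list. Note also that reads of a regular file may be satisfied from the page cache, so a real check would more likely point this at a block device (with sufficient privileges) or bypass the cache.

```python
import os
import random

BLOCK_SIZE = 4096  # assumed block size, not taken from the thread


def read_random_blocks(path, samples=8, block_size=BLOCK_SIZE, rng=None):
    """Return True if `samples` randomly chosen blocks of `path` can be read.

    `path` may be a large file or, with enough privilege, a block device.
    Any OSError (missing file, I/O error, permission denied) counts as failure.
    """
    rng = rng or random.Random()
    try:
        # buffering=0 avoids Python-level buffering; the OS cache still applies
        with open(path, "rb", buffering=0) as fh:
            size = fh.seek(0, os.SEEK_END)
            if size == 0:
                return False
            last_block = (size - 1) // block_size
            for _ in range(samples):
                fh.seek(rng.randint(0, last_block) * block_size)
                if not fh.read(block_size):
                    return False
        return True
    except OSError:
        return False
```

The return value could then be handed to ganglia as a 0/1 metric.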

Robert Walsh responded that he has been trying to get information from
the
SMART
(Self-Monitoring, Analysis and Reporting Technology)
data in most hard drives into ganglia. Brooks Davis mentioned that
he thought integrating
smartmontools
with ganglia might offer a solution. smartmontools is a package
that allows you to control and monitor the SMART data contained in
virtually all modern hard drives.

The discussion spilled over into January of 2004, when Sander van Vliet
announced that he had a preliminary working version of a gmetric code
that would test if the drives were alive. The code walks the
/proc/mounts file looking for drives that are mounted and then attempts
to write 4 bytes to the end of the currently used file system to determine
if the disk is alive. If there are no errors along the way, then the
disk is alive. Sander then posted that he had a version of his code
working that used the SMART data, but the job has to be run as root.
This problem was sorted out fairly quickly, though. Throughout the
conversation, there was an effort to make the code work under Linux
and the various BSD flavors, especially FreeBSD. At this point the
thread died out, but it appears as though the code was working correctly
for Linux and FreeBSD.
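
Sander's code is not reproduced in the thread summary, so the following is only a rough Python sketch of the same general approach, under several assumptions of mine. It writes 4 bytes to a temporary file on each mounted file system rather than to "the end" of the file system, it only considers entries whose device starts with /dev/, and the metric naming scheme is hypothetical. The `gmetric` options used (`--name`, `--value`, `--type`, `--units`) are the standard ones, but the command is only built here, not run.

```python
import os
import tempfile


def mounted_devices(mounts_text):
    """Return (device, mount_point) pairs for entries backed by a /dev node.

    `mounts_text` is the content of /proc/mounts. Mount points containing
    spaces appear escaped there (\\040); this sketch does not decode them.
    """
    pairs = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("/dev/"):
            pairs.append((fields[0], fields[1]))
    return pairs


def disk_alive(mount_point):
    """Write 4 bytes under mount_point and force them to disk; True on success."""
    try:
        with tempfile.NamedTemporaryFile(dir=mount_point, prefix=".alive-") as fh:
            fh.write(b"\x00" * 4)
            fh.flush()
            os.fsync(fh.fileno())
        return True
    except OSError:
        # Also triggers for read-only or permission-denied mounts
        return False


def gmetric_command(mount_point, alive):
    """Build (but do not run) a gmetric call reporting the result as 0 or 1."""
    # Hypothetical naming scheme: /data/a becomes disk_alive_data_a
    name = "disk_alive_" + (mount_point.strip("/").replace("/", "_") or "root")
    return ["gmetric", "--name", name, "--value", str(int(alive)),
            "--type", "uint8", "--units", "alive"]
```

On a node, one would read /proc/mounts, pass its text to `mounted_devices`, and hand each resulting command to `subprocess.run` from a cron job or similar. A read-only mount would be reported as dead by this sketch, which a real implementation would need to handle.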

Bioclusters: Large Numbers of Files

In some cases, the bioinformatics world has a need for handling large
numbers of files. This need can be a problem when you are trying to
address over 10,000 files in one directory! The people with large mp3
collections can sympathize. The 
bioclusters
mailing list had a very
interesting brief discussion about how to handle this. On Jan. 28, 2004,
Dan Bolser posted a question looking for new information for an old
problem - working with directories with over 10,000 files. Dan had some
tools to get around the problem of handling this number of files in
bash scripts, but felt that the filesystem was sluggish in working
with the files. He said that the file systems used a linear, unindexed
search of directories to find files. He said that he accidentally
created a directory with more than 300,000 files which he referred to
as a "... death trap for the system." He posted some quick thoughts
about using a hash table to access the files with each node in the hash
table being a directory. You would then follow the directory structure
to find the file.

Elijah Wright posted that ReiserFS was designed to cope with exactly
this problem (accessing files in directories with a large number of
files). Joe Landman said that he liked
XFS because it used B*-trees
which could easily handle this situation. He said in theory that XFS
can handle more than 10**7 files per directory. He thought
JFS could
handle on the order of 10**4 files per directory. Joe felt that none
of the other file systems could handle this problem. Arnon Klein offered
the possibility of using MySQL in a file system manner. In particular,
he mentioned
LinFS which is a file
system of sorts that uses MySQL as a backend.

Dan, the original poster, mentioned that he would try to persuade the
administrators to try
ReiserFS or XFS. Joe Landman
offered the opinion
that if the administrators would not switch, then the hash table
idea that Dan originally mentioned should work well. Joe also mentioned
that he has been badly burned by ReiserFS in the past. Elijah Wright
and Joe Landman also mentioned that XFS and ReiserFS are not really
"new" file systems in that they have been around for several years.
Joe Landman also posted some information about 
ext3. He said that
under heavy journal pressure (performing lots of I/O to files) ext3
had problems. He said that the journal can become a liability because
he felt it wasn't optimized yet. Joe said that he has several customers
that are regularly seeing problems when using ext3 and software RAID.

To end the discussion Tim Cutts posted a nice short Perl script for
hashing filenames. It has a hash depth of two directories and Tim
said it was good for up to about 64 million files.
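
Tim's Perl script is not reproduced here, so what follows is only a Python sketch of the same general idea with a hash depth of two directories. The choice of MD5, the two-hex-character directory names, and the names `hashed_path` and `store` are my assumptions, not details from his post.

```python
import hashlib
import os


def hashed_path(root, filename, depth=2):
    """Map filename to root/xx/yy/filename using the first `depth` bytes of its MD5.

    MD5 is used only to spread names evenly, not for security.
    """
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    levels = [digest[2 * i:2 * i + 2] for i in range(depth)]
    return os.path.join(root, *levels, filename)


def store(root, filename, data):
    """Write data to the hashed location, creating directories as needed."""
    path = hashed_path(root, filename)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fh:
        fh.write(data)
    return path
```

With two levels of 256 directories each, this layout has 65,536 leaf directories, so about 1,000 files per leaf works out to roughly 65 million files. That is in the same range as Tim's figure, although I don't know whether his script arrives at it the same way.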

The discussion was interesting in that it shows how one can use file
systems to improve the performance of applications and, if that doesn't work
or is not possible, how one can use simple user-space scripts to get
around problems. While writing scripts to handle problems may not
be the ideal solution for many people, it does allow you to solve
your problems.

Beowulf: Hypothetical Situation

Brent Clements posted an interesting conundrum to the Beowulf mailing
list. He has had requests from researchers to use a queuing/scheduling
system to submit kernel builds and reboots. Preferably, a normal user
could compile a customized kernel and boot a cluster node with it.
When the job finished, or if the node failed to boot, the node would
reboot to the baseline kernel.

There were a variety of solutions proposed. Many thought
UML (User Mode
Linux, that is, Linux running Linux) might do the trick, but they were not
sure how to incorporate it into a batch system. Others thought diskless
nodes and PXE/DHCP booting was the way to go. After considering all
the input, Brent proposed a series of "stock" kernels known to work
with their cluster. The researchers could then modify the source and
submit their job using a Perl script they had developed. The script
allows the users to reboot the allocated nodes using the new kernel
via DHCP and TFTP. If the nodes don't respond within 15 minutes, then
the nodes are rebooted with a stock kernel.
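
Brent's Perl script was not posted in the material I summarized, so the following Python sketch only illustrates the timeout-and-fallback logic described above. The callbacks `reboot` and `is_up` are hypothetical placeholders for whatever site-specific mechanism switches the DHCP/TFTP boot target and checks that a node has come back; only the 15-minute limit comes from the thread.

```python
import time

BOOT_TIMEOUT = 15 * 60  # seconds; the 15-minute limit from the thread


def boot_with_fallback(node, test_kernel, stock_kernel, reboot, is_up,
                       timeout=BOOT_TIMEOUT, poll=30,
                       clock=time.monotonic, sleep=time.sleep):
    """Boot node with test_kernel; return the kernel the node ends up on.

    reboot(node, kernel) and is_up(node) are caller-supplied (hypothetical)
    hooks. clock and sleep are injectable so the logic can be tested
    without waiting.
    """
    reboot(node, test_kernel)
    deadline = clock() + timeout
    while clock() < deadline:
        if is_up(node):
            return test_kernel
        sleep(poll)
    # No response within the timeout: fall back to the known-good kernel
    reboot(node, stock_kernel)
    return stock_kernel
```

A batch-system prologue could call this once per allocated node, and an epilogue could reboot every node back to the stock kernel when the job ends.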

Sidebar One: Links Mentioned in Column

Bioclusters

Ganglia

Smartmontools

This article was originally published in ClusterWorld Magazine. It has been
updated and formatted for the web. If you want to read more about HPC
clusters and Linux you may wish to visit
Linux Magazine.

Jeff Layton has been a cluster enthusiast since 1997 and spends far
too much time reading mailing lists. He has been to
38 countries and hopes to see all 192 some day.