mailing list
about using semi-public PCs for heavy computational jobs. On Feb. 15,
2004, Arnon Klein asked about running his jobs on semi-public machines
that are running various flavors of Windows. Arnon asked this question
because he is doing his graduate research and needs computational power.
He's already exhausted the machines easily available to him, so he was
looking for suggestions about what to do next.
The first response came from Chris Dwan. Chris responded that he's in
a similar boat but has managed to put together some systems from
various campuses into something like a grid. He also provided a very
useful ranking of systems in terms of access difficulty. For example,
systems that he maintains were easiest to get into followed by systems
running Linux or OS X (which Chris also runs). The lowest two ranked
systems were Windows machines that either could be rebooted at night
or could not be rebooted at all. Chris went on to talk about some
schedulers that can steal cycles from idle workstations (e.g., SGE,
Torque, LSF), although he said that integrating disparate schedulers
can be very difficult. He did mention Condor from the University of
Wisconsin as a possible solution. He also mentioned the grid software
from United Devices, which runs on Windows machines but will use
compute cycles from other machines.
Farud Ghazali mentioned that he's also looking for a solution to
this type of problem. He pointed out that there were many practical
difficulties including authentication across disparate resources. Chris
Dwan jumped in to explain how he has hacked up something to do
authentication for him.
Ron Chen joined the conversation to mention that SGE (Sun Grid Engine)
version 6.0 will integrate with JXTA, which then offers Jgrid, which in
turn offers P2P (Peer-to-Peer) workload management in a fashion similar
to SETI@home. However, he did say that SGE 6.0 won't be out until May of
2004 (and it may slip slightly from then). Until then, Ron recommended
using BOINC.
This package starts jobs and transmits data using port 80, which makes
it easier to get in and out of a firewall than other approaches. It also
has versions for Windows, Linux, Solaris and OS X. John van Workum
also mentioned GreenTea (www.greenteatech.com), which offers a Java
P2P client that gives grid capabilities for running jobs. Bruce
Moxon also mentioned that the Cornell Theory Center, which is about
the only place doing clusters with Windows, has some tools that
might help with Windows machines.
While this discussion was short, it did offer some ideas that could
help people in similar situations. There are many people and groups
thinking about the same things that Arnon mentioned in his first posting.
Ganglia-Developers: Disk Alive Metric
I'm sure many readers are aware of
ganglia. It is a
scalable distributed monitoring system for high performance computing
systems such as clusters and grids. It is open source and in use on over
500 clusters throughout the world. On December 22, 2003, on the Ganglia
Developers mailing list, Federico Sacerdoti asked about a metric that
ganglia could watch that would report if a disk was alive or not.
It seems that Federico was talking to a Purdue (my alma mater) sys
admin about a cluster that is put together from old PCs. The disks in
the machines keep failing, but ganglia fails to report the disks
as down since the ganglia daemon will still report a heartbeat even
when the node is basically down. Federico posted a possible solution
that he worked out with the administrator but had not tried.
Brooks Davis replied that he didn't think it would work, at least in
FreeBSD, because of the way Unix and Unix-like systems work. He did
offer another solution that reads random blocks from a file system to
make sure the drive was still functioning.
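
The random-read idea can be sketched in a few lines of Python. This is only an illustration of the approach, not the code Brooks posted: the function name, the 4096-byte block size, and the sample count are all assumptions of mine. Note also that reads may be satisfied from the operating system's cache, so a real check would need some way to force the reads to reach the disk.

```python
import os
import random

BLOCK_SIZE = 4096  # assumed block size; the thread did not specify one


def disk_seems_alive(path, samples=8, block_size=BLOCK_SIZE, rng=None):
    """Read `samples` randomly chosen blocks from `path`.

    `path` can be a large file on the file system under test or, with
    enough privileges, a block device. Returns True if every read
    succeeds and False if the OS reports an error.
    """
    rng = rng or random.Random()
    try:
        # buffering=0 avoids Python-level buffering (not the kernel cache).
        with open(path, "rb", buffering=0) as f:
            size = f.seek(0, os.SEEK_END)
            if size == 0:
                return True  # nothing to read, so nothing can fail
            last_block = (size - 1) // block_size
            for _ in range(samples):
                f.seek(rng.randint(0, last_block) * block_size)
                f.read(block_size)
        return True
    except OSError:
        return False
```

The boolean result could then be handed to ganglia as a 0/1 metric.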
Robert Walsh responded that he has been trying to get information from
the SMART (Self-Monitoring, Analysis and Reporting Technology System)
data in most hard drives into ganglia. Brooks Davis mentioned that he
thought integrating smartmontools with ganglia might offer a solution.
smartmontools is a package that allows you to control and monitor the
SMART data contained in virtually all modern hard drives.
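
As a rough sketch of what such an integration might involve, the Python below runs `smartctl -H` (the command-line tool shipped with smartmontools) and looks for the overall-health line that it prints for ATA drives. The function names are my own, the parsing assumes that ATA-style output line, and `smartctl` normally needs root privileges to query a drive. Treat it as an illustration rather than anything posted to the list.

```python
import re
import subprocess

# Line printed by `smartctl -H` for ATA drives, for example:
#   SMART overall-health self-assessment test result: PASSED
_HEALTH_RE = re.compile(r"overall-health self-assessment test result:\s*(\S+)")


def parse_smart_health(output):
    """Return True for PASSED, False for any other result, and None
    if the health line is not present in `output`."""
    match = _HEALTH_RE.search(output)
    if match is None:
        return None
    return match.group(1).upper() == "PASSED"


def smart_health(device):
    """Run `smartctl -H DEVICE` and interpret its output.

    Raises FileNotFoundError if smartctl is not installed.
    """
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses nonzero exit codes to flag problems
    )
    return parse_smart_health(result.stdout)
```

For example, `smart_health("/dev/sda")` would return True on a drive that reports a passing self-assessment, where `/dev/sda` is just a placeholder device name.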
The discussion spilled over into January of 2004, where Sander van Vliet
announced that he had a preliminary working version of a gmetric code
that would test if the drives were alive. The code walks the
/proc/mounts file looking for drives that are mounted and then attempts
to write 4 bytes to the end of the currently used file system to determine
if the disk is alive. If there were no errors along the way, then the
disk is alive. Sander then posted that he had a version of his code
working that used the SMART data, but the job has to be run as root.
This problem was sorted out fairly quickly, though. Throughout the
conversation, there was an effort to make the code work under Linux
and the various BSD flavors, especially FreeBSD. At this point the
thread died out, but it appears as though the code was working correctly
for Linux and FreeBSD.
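
To make the write test more concrete, here is a minimal Python sketch of the same general idea. It is not Sander's code: instead of writing to the end of the file system, it writes a 4-byte temporary file on each mount point and forces it to disk, and all function and metric names are hypothetical. The last helper only builds a `gmetric` command line using its `--name`, `--value`, `--type`, and `--units` options.

```python
import os
import tempfile


def mounted_block_devices(mounts_text):
    """Return (device, mount_point) pairs for /dev/* entries in text
    formatted like /proc/mounts. Spaces in mount points appear there
    as the escape \\040, which this sketch does not decode."""
    pairs = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("/dev/"):
            pairs.append((fields[0], fields[1]))
    return pairs


def write_probe(mount_point):
    """Write 4 bytes to a temporary file under `mount_point` and fsync it.

    Returns False on any OS error, which also covers read-only mounts
    and directories the caller is not allowed to write to.
    """
    try:
        with tempfile.NamedTemporaryFile(dir=mount_point,
                                         prefix=".disk_alive_") as f:
            f.write(b"\x00" * 4)
            f.flush()
            os.fsync(f.fileno())
        return True
    except OSError:
        return False


def check_all(mounts_path="/proc/mounts"):
    """Probe every mounted /dev/* file system; returns {device: bool}."""
    with open(mounts_path) as f:
        pairs = mounted_block_devices(f.read())
    return {device: write_probe(mount) for device, mount in pairs}


def gmetric_command(device, alive):
    """Build (but do not run) a gmetric invocation for one device."""
    name = "disk_alive_" + os.path.basename(device)
    return ["gmetric", "--name=" + name, "--value=" + ("1" if alive else "0"),
            "--type=uint8", "--units=bool"]
```

A cron job could call `check_all()` and pass each resulting command to `subprocess.run` so that the values show up in ganglia.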