How do you keep clusters running....
Doug J Nordwall
nordwall at pnl.gov
Wed Apr 3 17:46:31 EST 2002
On Wed, 2002-04-03 at 13:04, Cris Rhea wrote:
What are folks doing about keeping hardware running on large clusters?
Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...
Sure seems like every week or two, I notice dead fans (each RS-1200
has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).
You running lm_sensors on your nodes? That's a handy tool for paying
attention to things like that. We use ours in combination with ganglia
and pump it to a web page and to big brother to see when a cpu might be
getting hot, or a fan might be too slow. We actually saved a dozen
machines that way...we have 32 4-processor Racksaver boxes in a rack,
and the rack was not designed to handle Racksaver's fan system. That is
to say, there was a solid sidewall on the rack, and it kept in heat. I
set up lm_sensors on all the nodes (homogeneous, so configured on one and
pushed it out to all), then pumped the data into ganglia
(ganglia.sourceforge.net) and then to a web page. I noticed that the
temp on a dozen of the machines was extremely high. So, I took off the
side panel of the rack. The temp dropped by 15 C on all the nodes, and
everything was within normal parameters again.
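For anyone who wants to try the same idea, here is a rough sketch of the "read the sensors, flag the bad ones, hand them to ganglia" step. It is not our actual setup. It assumes the text output of the lm_sensors `sensors` command looks like the sample below (labels and layout vary by chip and version), the thresholds are made-up numbers, and the helper names (`parse_sensors`, `find_alerts`, `gmetric_args`) are hypothetical. The last helper only builds an argument list for ganglia's `gmetric` tool; it does not run it.

```python
import re

# Matches lines such as "fan1:  3000 RPM (min = 2000 RPM)" or
# "temp1:  +45.0°C (high = +80.0°C)". Only the first number on the line
# (the current reading) is captured. The layout is an assumption.
READING = re.compile(
    r"^(?P<label>[^:]+):\s+\+?(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>RPM|°C|C)\b"
)

# Illustrative sample, not captured from a real node.
SAMPLE = """\
w83781d-isa-0290
Adapter: ISA adapter
fan1:       3000 RPM  (min = 2000 RPM)
fan2:          0 RPM  (min = 2000 RPM)
temp1:     +45.0°C  (high = +80.0°C)
temp2:     +78.5°C  (high = +80.0°C)
"""


def parse_sensors(text):
    """Return {label: (value, unit)} with unit normalized to 'RPM' or 'C'."""
    readings = {}
    for line in text.splitlines():
        match = READING.match(line.strip())
        if match:
            unit = match.group("unit").lstrip("°")
            readings[match.group("label").strip()] = (float(match.group("value")), unit)
    return readings


def find_alerts(readings, max_temp_c=70.0, min_fan_rpm=2000.0):
    """List readings outside the (assumed) thresholds."""
    alerts = []
    for label, (value, unit) in readings.items():
        if unit == "RPM" and value < min_fan_rpm:
            alerts.append(f"{label}: fan slow or dead ({value:.0f} RPM)")
        elif unit == "C" and value > max_temp_c:
            alerts.append(f"{label}: running hot ({value:.1f} C)")
    return alerts


def gmetric_args(name, value, unit):
    """Build (but do not run) a gmetric command line for one reading."""
    return ["gmetric", "--name", name, "--value", str(value),
            "--type", "float", "--units", unit]


if __name__ == "__main__":
    readings = parse_sensors(SAMPLE)
    for alert in find_alerts(readings):
        print(alert)
    for label, (value, unit) in readings.items():
        print(" ".join(gmetric_args(label, value, unit)))
```

On a real node you would feed `parse_sensors` the output of `sensors` (for example via `subprocess.run`) from cron and pass each argument list to `subprocess.run` as well. Check the regex against what your own chips print first.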
My last fan failure was a CPU fan that toasted the CPU and motherboard.
Ya, we would have seen this on ours earlier...excellent tool
How are folks with significantly more nodes than mine dealing with constant
maintenance on their nodes? Do you have whole spare nodes sitting around-
ready to be installed if something fails, or do you have a pile of
No, we don't actually, but we've talked about it
Did you get the vendor (if you purchased prebuilt systems)
to supply a stockpile of warranty parts?
We use Racksaver as well, so our experience is similar. Probably should
talk to our people about getting some spare nodes.
One of the problems I'm facing is that every time something croaks,
Racksaver is very good about replacing it under warranty, but getting
the new parts delivered usually takes several days.
Ya...this is another area where just monitoring the data can be
helpful...if a fan is failing, you can see it coming (temperature slowly
rises), so you can order the replacement beforehand and schedule downtime.
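A minimal sketch of what "seeing it coming" could look like in code, assuming you already log (hours, temperature) pairs per node somewhere. It fits a straight line through recent samples and estimates how long until a limit is reached. The function names, the sample data, and the 70 C limit are all made up for illustration, and a straight-line fit is only one simple way to spot a slow rise.

```python
def slope_per_hour(samples):
    """Least-squares slope of (hours, temp_c) samples, in degrees C per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den


def hours_until(samples, limit_c):
    """Estimated hours until limit_c is reached, or None if not rising."""
    rate = slope_per_hour(samples)
    if rate <= 0:
        return None
    latest_temp = samples[-1][1]
    return max(0.0, (limit_c - latest_temp) / rate)


if __name__ == "__main__":
    # Hypothetical daily readings from one node: creeping up 2 C per day.
    history = [(0, 50.0), (24, 52.0), (48, 54.0), (72, 56.0)]
    remaining = hours_until(history, limit_c=70.0)
    if remaining is not None:
        print(f"about {remaining:.0f} hours until 70 C at the current trend")
```

With the sample history above the estimate comes out at roughly a week, which is about the lead time you need if parts take several days to arrive.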
Cristopher J. Rhea Mayo Foundation
Research Computing Facility Pavilion 2-25
crhea at Mayo.EDU Rochester, MN 55905
Fax: (507) 266-4486 (507) 284-0587
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Douglas J Nordwall http://rex.nmhu.edu/~musashi
System Administrator Pacific Northwest National Labs