How do you keep clusters running....

Roger L. Smith roger at ERC.MsState.Edu
Wed Apr 3 16:23:44 EST 2002

I don't know how to say this without sounding condescending, but we
resolved this problem by purchasing high-quality machines.  We currently
use IBM x330s (although I also had good luck with our SGI 1100s before
SGI discontinued them).  We have enough nodes on hand that IBM has
stocked a couple of spare motherboards, power supplies, etc., but we don't
need them that often.  I've never had a fan failure.

In general, hardware problems are a very minor part of the care and
feeding of our cluster.

On Wed, 3 Apr 2002, Cris Rhea wrote:

> What are folks doing about keeping hardware running on large clusters?
> Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...
> Sure seems like every week or two, I notice dead fans (each RS-1200
> has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).
> My last fan failure was a CPU fan that toasted the CPU and motherboard.
> How are folks with significantly more nodes than mine dealing with constant
> maintenance on their nodes?  Do you have whole spare nodes sitting around-
> ready to be installed if something fails, or do you have a pile of
> spare parts?  Did you get the vendor (if you purchased prebuilt systems)
> to supply a stockpile of warranty parts?
> One of the problems I'm facing is that every time something croaks,
> Racksaver is very good about replacing it under warranty, but getting
> the new parts delivered usually takes several days.
> For some things like fans, they sent extras for me to keep on-hand.
> For my last fan/CPU/motherboard failure, the node pair will be
> down ~5 days waiting for parts.
> Comments? Thoughts? Ideas?
> Thanks-
> --- Cris
> ----
>   Cristopher J. Rhea                      Mayo Foundation
>   Research Computing Facility              Pavilion 2-25
>   crhea at Mayo.EDU                        Rochester, MN 55905
>   Fax: (507) 266-4486                     (507) 284-0587
> _______________________________________________
> Beowulf mailing list, Beowulf at
> To change your subscription (digest mode or unsubscribe) visit

| Roger L. Smith                        Phone: 662-325-3625               |
| Systems Administrator                 FAX:   662-325-7692               |
| roger at ERC.MsState.Edu                 http://WWW.ERC.MsState.Edu/~roger |
|                       Mississippi State University                      |
|_______________________Engineering Research Center_______________________|
