In need of Beowulf data

Jim Lux James.P.Lux at jpl.nasa.gov
Tue Jul 22 13:49:48 EDT 2003


At 08:30 PM 7/21/2003 +0200, Farrel Lifson wrote:
>Hi there,
>
>As part of my M.Sc I hope to carry out a case study using Markov Reward
>Models of a large distributed system. Being a Linux fan, a Beowulf
>cluster was the obvious choice.
>
>Performance data seems to be quite readily available, however finding
>reliability data seems to be more of a challenge. Specifically I am
>looking for real world failure and repair rates for the various
>components of a Beowulf node (HDD, power supply, CPU, RAM) and the
>larger cluster (software failure, network, etc).
>
>While some components have a mean time to failure rating, this is
>sometimes underestimated by the manufacturer and I am interested in
>getting an as accurate as possible model of a real world Beowulf
>cluster.

I don't know that the manufacturer failure rate data is actually 
underestimated (they tend to pay pretty close attention to this, it being a 
legally enforceable specification). More probably, the data is being 
misinterpreted by the casual consumer of it.  Take, for example, an 
MTBF rating for a disk drive. A typical rating might be 50,000 
hrs.  However, what temperature is that rating at (20C)? What temperature 
are you really running the drive at (40C)? What's the life derating for 
the 20C temperature rise? What sort of operation rate is presumed in that 
failure rate (constant seeks, or some smaller duty cycle)?  What counts as 
a failure?  How many power on/power off cycles are assumed?
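
To make the temperature question concrete, here is a rough sketch of how a 
rated MTBF might be derated for a hotter operating point. It assumes an 
Arrhenius-style acceleration model and a constant failure rate, neither of 
which comes from any drive vendor. The 0.6 eV activation energy, the 
100-drive cluster, and all function names are illustrative assumptions; a 
real analysis should use the manufacturer's own derating curves.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K
HOURS_PER_YEAR = 8760


def acceleration_factor(rated_c, actual_c, activation_energy_ev=0.6):
    """Arrhenius acceleration factor between two temperatures in Celsius.

    The default activation energy is an assumed, illustrative value.
    """
    rated_k = rated_c + 273.15
    actual_k = actual_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / rated_k - 1.0 / actual_k
    )
    return math.exp(exponent)


def derated_mtbf(rated_mtbf_hours, rated_c, actual_c, activation_energy_ev=0.6):
    """Scale the rated MTBF down by the acceleration factor."""
    factor = acceleration_factor(rated_c, actual_c, activation_energy_ev)
    return rated_mtbf_hours / factor


def failure_probability(mtbf_hours, hours=HOURS_PER_YEAR):
    """Chance that one unit fails within `hours`, assuming a constant rate."""
    return 1.0 - math.exp(-hours / mtbf_hours)


def expected_failures(n_units, mtbf_hours, hours=HOURS_PER_YEAR):
    """Expected failures across n_units, assuming failed units are replaced."""
    return n_units * hours / mtbf_hours


if __name__ == "__main__":
    rated = 50_000  # hours, the example rating from the text
    hot = derated_mtbf(rated, rated_c=20, actual_c=40)
    print(f"Acceleration factor 20C -> 40C: {acceleration_factor(20, 40):.2f}")
    print(f"Derated MTBF at 40C: {hot:.0f} hours")
    print(f"One-year failure probability at rating: {failure_probability(rated):.3f}")
    print(f"One-year failure probability at 40C: {failure_probability(hot):.3f}")
    print(f"Expected failures/year, 100 drives at 40C: {expected_failures(100, hot):.1f}")
```

The result is very sensitive to the assumed activation energy, which is 
one reason a bare MTBF number is easy to misread. Duty cycle and power 
cycling would need their own derating terms and are not modeled here.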

Most of the major manufacturers have very detailed writeups on the 
reliability of their components (e.g., go to Seagate's site, and there are 
many pages describing how they do life tests, what the results are, how to 
apply them, etc.)

For "no-name" power supplies, though, you might have a bit more of a challenge.


>If anyone has any data they would be willing to share, or if you know of
>any papers or reports which list such data I would greatly appreciate
>any links or pointers to them.
>
>Thanks in advance,
>Farrel Lifson
>--
>Data Network Architecture Research Lab    mailto:flifson at cs.uct.ac.za
>Dept. of Computer Science                 http://people.cs.uct.ac.za/~flifson
>University of Cape Town                   +27-21-650-3127

James Lux, P.E.
Spacecraft Telecommunications Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875
