In need of Beowulf data
James.P.Lux at jpl.nasa.gov
Tue Jul 22 13:49:48 EDT 2003
At 08:30 PM 7/21/2003 +0200, Farrel Lifson wrote:
>As part of my M.Sc I hope to carry out a case study using Markov Reward
>Models of a large distributed system. Being a Linux fan, a Beowulf
>cluster was the obvious choice.
>Performance data seems to be quite readily available, however finding
>reliability data seems to be more of a challenge. Specifically I am
>looking for real world failure and repair rates for the various
>components of a Beowulf node (HDD, power supply, CPU, RAM) and the
>larger cluster (software failure, network, etc).
>While some components have a mean time to failure rating, this is
>sometimes underestimated by the manufacturer and I am interested in
>getting an as accurate as possible model of a real world Beowulf
I don't know that the manufacturer failure rate data is actually
underestimated (they tend to pay pretty close attention to this, it being a
legally enforceable specification), but, more probably, the data is
being misinterpreted by the casual consumer of it. Take, for example, an
MTBF rating for a disk drive. A typical rating might be 50,000
hrs. However, what temperature is that rating at (20C)? What temperature
are you really running the drive at (40C)? What's the life derating for
the 20C temperature rise? What sort of operation rate is presumed in that
failure rate (constant seeks, or some smaller duty cycle)? What counts as
a failure? How many power on/power off cycles are assumed?
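As a rough illustration of the temperature question, here is a minimal Python sketch of how a rated MTBF might be derated. It assumes an Arrhenius temperature model and a constant (exponential) failure rate. The 0.4 eV activation energy is a placeholder picked for illustration, not a figure from any drive datasheet, and the function names are hypothetical. A real analysis should use the manufacturer's own derating curves.

```python
import math

# Boltzmann constant in eV per kelvin.
BOLTZMANN_EV_PER_K = 8.617333262e-5


def arrhenius_acceleration(t_rated_c, t_actual_c, activation_energy_ev):
    """Factor by which the failure rate grows when running at t_actual_c
    instead of t_rated_c, under an (assumed) Arrhenius model."""
    t_rated_k = t_rated_c + 273.15
    t_actual_k = t_actual_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_rated_k - 1.0 / t_actual_k
    )
    return math.exp(exponent)


def derated_mtbf_hours(rated_mtbf_hours, t_rated_c, t_actual_c, activation_energy_ev):
    """Rated MTBF divided by the temperature acceleration factor."""
    factor = arrhenius_acceleration(t_rated_c, t_actual_c, activation_energy_ev)
    return rated_mtbf_hours / factor


def annual_failure_probability(mtbf_hours, duty_cycle=1.0):
    """Probability of at least one failure in a year, assuming a constant
    failure rate of 1/MTBF during powered-on hours."""
    powered_hours = 8760.0 * duty_cycle
    return 1.0 - math.exp(-powered_hours / mtbf_hours)


# Example from the text: 50,000 hr rating at 20C, drive actually run at 40C.
ASSUMED_EA_EV = 0.4  # assumption for illustration only
factor = arrhenius_acceleration(20.0, 40.0, ASSUMED_EA_EV)
derated = derated_mtbf_hours(50_000.0, 20.0, 40.0, ASSUMED_EA_EV)
print(f"acceleration factor: {factor:.2f}")
print(f"derated MTBF: {derated:.0f} hrs")
print(f"annual failure probability (rated):   {annual_failure_probability(50_000.0):.3f}")
print(f"annual failure probability (derated): {annual_failure_probability(derated):.3f}")
```

The point is only that the same datasheet number can imply quite different failure rates depending on the operating temperature and duty cycle you plug in. Seek rate, power cycling and the definition of "failure" are not modeled here at all.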
Most of the major manufacturers have very detailed writeups on the
reliability of their components (e.g. go to Seagate's site, and there are
many pages describing how they do life tests, what the results are, how to
apply them, etc.)
For "no-name" power supplies, though, you might have a bit more of a challenge.
>If anyone has any data they would be willing to share, or if you know of
>any papers or reports which list such data I would greatly appreciate
>any links or pointers to them.
>Thanks in advance,
>Data Network Architecture Research Lab mailto:flifson at cs.uct.ac.za
>Dept. of Computer Science http://people.cs.uct.ac.za/~flifson
>University of Cape Town +27-21-650-3127
James Lux, P.E.
Spacecraft Telecommunications Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109