[Beowulf] Re: HPC fault tolerance using virtualization
Dave Love
d.love at liverpool.ac.uk
Mon Jun 29 08:33:37 EDT 2009
Greg Lindahl <lindahl at pbm.com> writes:
>> What I typically see from smartd is alerts when one or more sectors have
>> already gone bad, although that tends not to be something that will
>> clobber the running job. How should it be configured to do better
>> (without noise)?
>
> That isn't noise, that's signal.
Of course I didn't mean that bad block alerts were noise. However,
there is what I and a hardware expert think is noise from the default
smartd configuration. I'm interested in how best to configure it for
useful warnings. I did have a look on the web, of course.
> You're just lucky that your running
> job doesn't need the data off the bad sector.
Not if the problem is, say, on /usr, which the job normally isn't going
to need before it finishes.
> You can try waiting
> until the job finishes before taking the node out of service; from the
> sounds of it, you will usually win. But if you don't have
> application-level end-to-end checksums of your data, how do you know
> if you won or not?
I know where the job is doing I/O, and I'm not going to kill multi-day,
multi-node jobs -- especially not automatically -- because there's a bad
sector somewhere irrelevant. Also, we have better things to worry about
here, at least, than application checksums, much as they might feature
in an ideal world.