[Beowulf] Re: RAM ECC errors
mathog at caltech.edu
Tue Feb 23 12:05:30 EST 2010
Carsten Aulbert wrote:
> > Are you saying that now that you are monitoring you are seeing kernel
> > panics which did not appear before?
> No, but there seems to be a switch in the kernel module that allows it to
> trigger a kernel panic upon discovering uncorrectable errors.
By "switch" do you mean:
A. There is an option that may be set when that module is loaded which
will then cause it to panic on an uncorrectable error, where normally it
would not, or
B. There has been a change in the module code between kernel versions
that causes it to panic now on events where it formerly did not panic?
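
One way to tell case A from case B on a running node is to look at the
module's parameters. The sketch below assumes the module in question is
the EDAC core and that its panic-on-uncorrectable-error setting is exposed
as edac_mc_panic_on_ue under /sys/module/edac_core/parameters; the thread
does not name the module, so treat the path as an assumption and adjust it
for your kernel. The function name is made up for illustration.

```python
from pathlib import Path

# Assumed sysfs location of the EDAC core "panic on uncorrectable error"
# parameter; it may be absent or live elsewhere on a given kernel.
PANIC_ON_UE = Path("/sys/module/edac_core/parameters/edac_mc_panic_on_ue")


def panic_on_ue_enabled(param_path=PANIC_ON_UE):
    """Return True or False for the setting, or None if it cannot be read."""
    try:
        text = Path(param_path).read_text().strip()
    except OSError:
        # Module not loaded, parameter not present, or no permission.
        return None
    # Integer parameters read back as "0"/"1"; boolean ones as "N"/"Y".
    return text not in ("0", "N", "n", "")
```

If this returns True on the nodes that panic and False (or None) on the
ones that never did, that points toward A rather than B.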
> > You can get some information through netconsole, but you know that
> Yup, already running; the question is whether a kernel panic would also be fully
> logged via netconsole - we are glad that we rarely have those ;)
I have seen one kernel panic since turning on netconsole, and it did log
across the network and showed up in /var/log/messages as it was supposed
to, with the same information presented as in the tests. Limited data,
but it would seem the answer is "at least sometimes".
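
To collect more than one data point, the log host could be checked
periodically for panic messages. A minimal sketch, assuming the netconsole
output ends up in a plain-text syslog file such as /var/log/messages and
that the kernel's usual "Kernel panic" banner is what to look for; the
function name is made up for illustration.

```python
def panic_lines(lines):
    """Return the log lines that contain the kernel's panic banner."""
    return [ln.rstrip("\n") for ln in lines if "Kernel panic" in ln]


# Example use on the log host (path is an assumption):
# with open("/var/log/messages", errors="replace") as f:
#     for ln in panic_lines(f):
#         print(ln)
```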
> Yes, but the memory of any process might get corrupted, thus this is
> learn which user is currently running jobs. Which in turn enables us
> these users that this particular machine running these jobs had a
> the user might need to re-run her jobs to prevent "false" data
If the node blows up presumably the output of all the jobs currently
running there will clearly indicate that there was a failure - so you
should not have to notify those users since they will see the problem in
their results. (Unless MPI, or PVM, or whatever is being used to spread
jobs around, ignores fatal errors, which should never be the case.) For
jobs which completed earlier on the same node, this would have been
before an uncorrectable error took place, so the results should be OK.
Or am I missing something?
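
The reasoning above amounts to a time-overlap check: only jobs that were
running on the node at the moment of the uncorrectable error are suspect.
A rough sketch, assuming the scheduler can supply per-job start and end
times for the node; the Job record and the function name are hypothetical,
not anything from a real batch system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    user: str
    start: float            # epoch seconds when the job started on the node
    end: Optional[float]    # epoch seconds when it finished, None if running


def jobs_at_risk(jobs, error_time):
    """Return jobs that were running on the node at error_time."""
    return [
        j for j in jobs
        if j.start <= error_time and (j.end is None or j.end >= error_time)
    ]
```

Jobs that finished before error_time are left out, matching the argument
that their results predate the error.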
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech