intermittent crashing of programs

Donald Becker becker at
Thu Feb 21 11:43:01 EST 2002

On Thu, 21 Feb 2002, Kris Thielemans wrote:

> (2nd resubmit after subscribing with a different email address...)

OK, I just deleted them from the moderation-hold queue.
I usually approve held posts in a few hours during the week.  The volume
of attempted spam has become very high in the past few months, so I'm
unlikely to loosen the requirement that non-member messages be held for

> we have a cluster of 4 dual Pentium III 600 MHz systems, running SuSE Linux
> 7.1. On one of the PCs, our programs occasionally crash with a segmentation
> fault. This also happens with an ordinary serial program with all its IO to
> local disks. (It does use NIS to get user info though, so I cannot easily
> test it without network). The crash NEVER occurs on any of the other
> systems.

This is pretty clearly a hardware problem.  Luckily you have other
similar systems to compare against.

> Feb 21 14:22:58 pp4 kernel: Uhhuh. NMI received. Dazed and confused, but
> trying to continue
> Feb 21 14:22:58 pp4 kernel: You probably have a hardware problem with your
> RAM chips

Hmmm, there is a similar problem reported in the eepro100 list on a Dell
4400 server.  There the problem occurs when a PCI device is accessed
(and of course the driver is blamed).  I'm guessing that problem
is a datapath parity error, which is slightly different than a PCI
parity error.

You might want to read that thread, which starts on 16 Feb 2002.

The important detail to remember is that NMI is once again being used to
report system data errors, and there are additional error sources beyond
memory parity errors.

> So, we ran memtest86-2.5 for 4 days continuously. No error was reported.

I would swap RAM between two systems and see if the problem follows the
RAM.  If the problem just goes away, you should still relegate the suspect
RAM to a machine that doesn't need to be reliable.
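
One way to tell whether the problem moves with the RAM is to count the
NMI reports per host before and after the swap.  A minimal sketch in
Python, assuming the classic syslog line format shown in the quoted log
above; the function name, script name and log path are hypothetical:

```python
import re
import sys
from collections import Counter

# Classic syslog line, as in the quoted report:
#   "Feb 21 14:22:58 pp4 kernel: Uhhuh. NMI received. Dazed and confused, but"
# Assumption: the hostname is the field right after the timestamp.
NMI_LINE = re.compile(
    r"^\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+"
    r"(?P<host>\S+)\s+kernel:.*NMI received"
)


def count_nmi_by_host(lines):
    """Return a Counter mapping hostname -> number of NMI reports seen."""
    counts = Counter()
    for line in lines:
        match = NMI_LINE.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage: python nmi_count.py /var/log/messages
    for path in sys.argv[1:]:
        with open(path, errors="replace") as log:
            for host, n in sorted(count_nmi_by_host(log).items()):
                print(f"{path}: {host}: {n} NMI report(s)")
```

If the count drops to zero on the original machine and starts climbing
on the machine that received the RAM, the memory is the likely culprit.
If it stays with the original machine, look at the other error sources.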

Donald Becker				becker at
Scyld Computing Corporation
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993
