[Beowulf] Odd SuperMicro power off issues
smulcahy at aplpi.com
Mon Dec 8 07:59:00 EST 2008
Chris Samuel wrote:
> Very occasionally we find one of our Barcelona nodes with
> a SuperMicro H8DM8-2 motherboard powered off. IPMI reports
> it as powered down too.
We had a similar experience with one of our compute nodes - intermittent
power-offs when running our model and absolutely nothing in the logs. I
modified Ganglia to track voltage and temp in an effort to see if
anything unusual happened to those before-hand but there was no
I ran memtest86+ a number of times on the problem node and neither it
nor mcelog showed any problems.
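
For anyone wanting to track voltage and temperature in Ganglia in a similar way, here is a minimal sketch. It is not the actual Ganglia modification described above. It assumes the node's BMC is readable with `ipmitool sensor` and that Ganglia's `gmetric` is on the path; the `ipmi_` metric prefix and the function names are made up for illustration, and sensor names and units vary by board.

```python
import subprocess


def parse_ipmi_sensors(text):
    """Parse pipe-delimited `ipmitool sensor`-style output.

    Returns (name, value, unit) tuples for voltage and temperature
    sensors only; unreadable ("na") and discrete sensors are skipped.
    """
    readings = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 3:
            continue
        name, raw, unit = fields[0], fields[1], fields[2]
        try:
            value = float(raw)
        except ValueError:
            continue  # "na" or a discrete reading such as "0x0"
        if unit in ("Volts", "degrees C"):
            readings.append((name, value, unit))
    return readings


def gmetric_command(name, value, unit):
    """Build the gmetric argument list for one reading.

    The "ipmi_" prefix is an arbitrary naming choice for this sketch.
    """
    metric = "ipmi_" + name.lower().replace(" ", "_")
    return ["gmetric", "--name", metric, "--value", str(value),
            "--type", "float", "--units", unit]


def publish():
    """Read the sensors once and push each reading to Ganglia."""
    out = subprocess.run(["ipmitool", "sensor"], capture_output=True,
                         text=True, check=True).stdout
    for reading in parse_ipmi_sensors(out):
        subprocess.run(gmetric_command(*reading), check=True)
```

Calling `publish()` from cron every minute or so on each node would give a voltage and temperature history to look back over after an unexplained power-off.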
Subsequent to that, I found a BIOS upgrade for those systems which
included an Opteron microcode update to fix an AMD processor erratum
(sp?) - I can dig out the details if the specific problem is of interest.
Around the same time, we finally started to see memory errors, so we
also replaced the bad memory in the system.
Unfortunately I can't tell you which was responsible for fixing the
problem. My understanding is that Fluent is quite memory and I/O
intensive - do you run other equally intensive models without seeing the
Anyways, in summary - if you're totally stumped - try swapping out the
memory and/or rolling to the latest firmware and see if that improves
Stephen Mulcahy Applepie Solutions Ltd. http://www.aplpi.com
Registered in Ireland, no. 289353 (5 Woodlands Avenue, Renmore, Galway)