[Beowulf] Odd SuperMicro power off issues
gerry.creager at tamu.edu
Mon Dec 8 08:51:32 EST 2008
Bogdan Costescu wrote:
> On Mon, 8 Dec 2008, Chris Samuel wrote:
>> Very occasionally we find one of our Barcelona nodes with
>> a SuperMicro H8DM8-2 motherboard powered off. IPMI reports
>> it as powered down too.
>> No kernel panic, no crash, nothing in the system logs.
> So IPMI still works? Then this is _not_ like yanking the power cable,
> in which case IPMI would stop working as well.
> I've seen this exact behaviour (computer is off, IPMI works and reports
> that the computer is off) being triggered by computational loads on
> SuperMicro H8QC8. I've had several nodes and I was able to swap power
> supplies - the problem moved with the power supplies, so exchanging the
> "faulty" ones made this behaviour disappear. We don't run Fluent
> here, but we do run other codes, like Gromacs, that are known to load the
> system heavily. The power supplies are supposed to deliver a max. of 1 kW
> for a system with 4 Opteron 875, 8 GB RAM and 2 internal disks. The
> "turning off" behaviour was also quite random, sometimes appearing within
> an hour, sometimes taking hours to days; it started to appear about 5-6
> months after the nodes were purchased. I still have one node where this
> occurs so rarely (about once a month) that it's not accepted as
> grounds for an exchange ;-(
Continuing on the thread of power-related issues, this is beginning to
sound like a thermally related mechanical problem. In the power industry
it is common to assume that circuit breakers have a finite life,
determined by the number of times they cycle (are tripped and reset). I'm
extrapolating here, as I've not had time to track down my power supply
guru and ask him... however, some time back a company introduced the
"polyfuse", a thermal-trip breaker that auto-resets once it has cooled
down after tripping. I used a number of these years ago while at NASA,
and saw some evidence of a phenomenon similar to the limited breaker
life described above.
I'm wondering if there might be a single voltage that's over-taxed, and
whether a breaker opening on that supply might halt the node into a
quiescent state while leaving IPMI alive...
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843