A Tyan S2466 gotcha

Robert G. Brown rgb at phy.duke.edu
Wed Nov 5 13:00:13 EST 2003


On Wed, 5 Nov 2003, David Mathog wrote:

> Anyway, the take home lesson seems to be that one should
> scan the /proc/cpuinfo on all nodes following a reboot to
> verify that all came up at the rated speed.

xmlsysd:

Content-Length: 728

<?xml version="1.0"?>
<xmlsysd init="1">
  <proc>
    <cpuinfo tv_sec="1068054883" tv_usec="135507">
      <processor id="0">
        <vendor_id>AuthenticAMD</vendor_id>
        <family>6</family>
        <model_num>6</model_num>
        <model_name>AMD Athlon(tm) MP 1900+</model_name>
        <clock units="MHz">1600.096</clock>
        <cachesize units="KB">256</cachesize>
      </processor>
      <processor id="1">
        <vendor_id>AuthenticAMD</vendor_id>
        <family>6</family>
        <model_num>6</model_num>
        <model_name>AMD Athlon(tm) Processor</model_name>
        <clock units="MHz">1600.096</clock>
        <cachesize units="KB">256</cachesize>
      </processor>
    </cpuinfo>
  </proc>
</xmlsysd>

plus wulfstat:

r00 |AMD Athlon(tm) MP 1900+ |1600| 256|12:51:31 pm| 21d:04h:56m:09s| 98
r01 |AMD Athlon(tm) MP 1900+ |1600| 256|12:51:30 pm| 15d:23h:17m:58s| 94
r02 |AMD Athlon(tm) MP 1900+ |1600| 256|12:51:31 pm| 15d:23h:17m:50s| 93
r03 |AMD Athlon(tm) MP 1900+ |1600| 256|12:51:31 pm| 15d:23h:17m:26s| 93
r04 |AMD Athlon(tm) MP 1900+ |1600| 256|12:51:31 pm| 21d:04h:55m:39s| 98
...

make it easy to scan a cluster for this particular problem -- all the
rnodes are 2466's:-)
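
For nodes that aren't running xmlsysd, the check David describes can be roughed out directly against the text of /proc/cpuinfo.  This is only a sketch: the function names, the 5 MHz tolerance, and the 1200 MHz reading in the sample are made up for illustration, and it assumes the usual x86 "cpu MHz" line.

```python
import re

# Hypothetical sample in /proc/cpuinfo layout; the second CPU's low
# clock is invented to illustrate a node that came up slow.
SAMPLE_CPUINFO = (
    "processor\t: 0\n"
    "cpu MHz\t\t: 1600.096\n"
    "processor\t: 1\n"
    "cpu MHz\t\t: 1200.072\n"
)

def cpu_clocks_mhz(cpuinfo_text):
    """Return the 'cpu MHz' value of every processor, in listing order."""
    found = re.findall(r"^cpu MHz\s*:\s*([\d.]+)", cpuinfo_text, re.MULTILINE)
    return [float(value) for value in found]

def slow_cpus(cpuinfo_text, rated_mhz, tolerance_mhz=5.0):
    """Return (index, clock) for each CPU running below the rated clock.

    tolerance_mhz is an assumed allowance for the small offset between
    the nominal and the measured clock.
    """
    return [
        (index, mhz)
        for index, mhz in enumerate(cpu_clocks_mhz(cpuinfo_text))
        if mhz < rated_mhz - tolerance_mhz
    ]

for index, mhz in slow_cpus(SAMPLE_CPUINFO, rated_mhz=1600.0):
    print(f"cpu{index}: {mhz:.0f} MHz is below the rated speed")
```

On a real node one would pass in open("/proc/cpuinfo").read() instead of the sample string and run it from whatever remote shell loop you already use.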

Did the clock drop on just ONE CPU or on both?  xmlsysd provides both as
you can see, but up to now I have only displayed the clock of the first
one in wulfstat, as it never occurred to me that they might be different.
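
Checking both CPUs from the xmlsysd output is simple enough.  A minimal sketch, assuming the element names shown in the sample above (processor, clock) and that the Content-Length header and XML declaration have already been stripped; the function names and the 1 MHz tolerance are my own inventions, not part of xmlsysd or wulfstat:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cpuinfo section from the xmlsysd output above.
SAMPLE_XML = """<xmlsysd init="1">
  <proc>
    <cpuinfo tv_sec="1068054883" tv_usec="135507">
      <processor id="0">
        <clock units="MHz">1600.096</clock>
      </processor>
      <processor id="1">
        <clock units="MHz">1600.096</clock>
      </processor>
    </cpuinfo>
  </proc>
</xmlsysd>"""

def processor_clocks(xml_text):
    """Map each processor id to its reported clock in MHz."""
    root = ET.fromstring(xml_text)
    return {
        proc.get("id"): float(proc.findtext("clock"))
        for proc in root.iter("processor")
    }

def clocks_differ(xml_text, tolerance_mhz=1.0):
    """True if the CPUs on one node report clocks that disagree."""
    clocks = list(processor_clocks(xml_text).values())
    if len(clocks) < 2:
        return False  # nothing to compare on a single-CPU node
    return max(clocks) - min(clocks) > tolerance_mhz

print(processor_clocks(SAMPLE_XML))
print("mismatch" if clocks_differ(SAMPLE_XML) else "both CPUs agree")
```

Comparing the CPUs against each other only catches the case where one dropped; if both dropped together you still have to compare against the rated clock.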

> Is there some way to configure these nodes so that
> they cannot drop into the lower speed? 

What BIOS revision are you running?  Most of the problems we've had with
2466's are related to running an older BIOS.  I think it should be at
least 4.03 to run fairly stably.

Although if this is thermal throttling to avoid processor burnout,
what it may be telling you is that this particular node has a bad CPU
cooler or a ribbon cable somewhere that is partially obstructing
airflow.  The Tyan/Athlon combination >>really<< hates heat and responds
to an excess with temper tantrums and worse.

We've found that just having CPU-coolers that "work" but rattle a bit
while working is enough to induce node failure under load.  You might
not WANT to override the BIOS action here, but rather tweak the node to
run cooler.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu



_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


