custom hardware (was: Xbox clusters?)

J Harrop jharrop at shaw.ca
Thu Nov 29 14:20:58 EST 2001


We have had similar problems over the years, some of which we tracked down 
to poor grounding conditions in the building wiring.  I know one location 
where the weather (in particular rain) can affect the behavior of some of 
the system.  I expect the grounding problem would create problems with 
similar symptoms on the newer power supplies - but I can't give a detailed 
explanation such as the excellent one posted.  I seem to recall that we 
also had this problem with the older power supplies.  Solution was the same 
- unplug, wait, reboot.

My favorite hardware problem was when I was working down in Honduras.  One 
of the laptops became more and more flaky and finally quit booting at 
all.  When I swapped out the CD-ROM module to try and boot from a floppy I 
found a stray ant sitting on the inside edge of the connector!  On further 
inspection the inside of the laptop turned out to be packed with them.  I 
wanted to duct-tape the machine closed and mail the box back to Dell with a 
"bug report" taped on it ;-)

John Harrop

At 10:02 AM 29/11/2001 -0500, you wrote:
>On Thu, Nov 29, 2001 at 09:15:15AM +0100, Daniel Pfenniger wrote:
> >
> > David Vos wrote:
> > >
> > ....
> > > There is one computer in our cluster that would make me think twice before
> > > doing a custom build.  I prefer to call it the node from heck.  It only
> > > has one problem: it won't boot.  If you press the power button, the
> > > powerlight flashes while the cpu and case fans turn a quarter turn, then
> > > nothing.  You have to wait a minute before you even get that reaction
> > > again.  (Sounds like a short somewhere).  The problem only surfaces if the
> > > computer has been off for a little while, and nearly every time at that.
> >
> > I have seen similar strange behavior of some boxes in a set of 66's, and the
> > way to restart is also rather odd.
> > Basically, and this has been repeatedly observed on several boxes of the same
> > composition (dual Pentium III with ASUS P2BD motherboard) aligned on a metallic
> > shelf, the ATX box would stop after months of activity, and the simplest found
> > way to restart it is to unplug everything (power and ethernet), touch it for
> > a few seconds with hands, replug and voila.  No need to open the box!
> > My guess is that some capacitor needs to be discharged, but exactly why
> > one needs to unplug every cable appears curious.
>
>One thing to understand is that, unless there is a physical
>switch on the power supply itself, ATX systems are never
>*really* turned off as long as they are plugged in -- they
>only go to a "standby" state, wherein +5V power is still
>being applied to a single pin (the purple wire). When you
>press the power button on the front of the chassis, it
>merely shorts a header that ultimately causes the
>motherboard to short the green wire in the ATX cable to
>ground -- this is a signal to the power supply to leave
>standby and start generating power for all the other
>outputs.
>
>Another thing to observe is that generally, ATX power
>supplies are switching supplies, which means that (to
>simplify things somewhat) they generate the correct voltage
>by charging and discharging a capacitor at a high rate. The
>switching controller constantly monitors the voltage on the
>capacitor and connects or disconnects the capacitor to the
>incoming supply, depending on whether the charge is above or
>below the desired level (the detailed truth behind this is
>fairly complex and typically involves multiple stages and
>inductors as well as capacitors, but this model is probably
>good enough for this discussion...). Thus, even when an ATX
>system is "off", the power supply is chugging along, keeping
>a capacitor charged to provide +5V at a low current. BTW, if
>you have the resources to do this, put a current sensor on
>the incoming AC line for a running system and feed the
>output to an oscilloscope.  You should see a series of
>alternating positive and negative spikes -- those are the
>capacitors charging at the peaks and troughs of the AC
>voltage.
>
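
[Aside on the simplified model above: a minimal Python sketch of the "connect when low, disconnect when high" controller. The function name and every component value (capacitance, currents, hysteresis band, time step) are made-up illustrative assumptions, not taken from any real supply, and the inductors and extra stages mentioned above are left out entirely.]

```python
def simulate_bang_bang(target_v=5.0, band_v=0.05, cap_f=1e-3,
                       charge_a=2.0, load_a=0.5, dt=1e-6, steps=20_000):
    """Toy hysteretic regulator: one capacitor, one switch, constant load.

    All parameter values are illustrative assumptions.
    Returns the capacitor voltage after every time step.
    """
    v = 0.0            # capacitor starts discharged
    connected = True   # switch state: capacitor connected to the source?
    trace = []
    for _ in range(steps):
        # Controller: compare the capacitor voltage with the desired band.
        if v < target_v - band_v:
            connected = True
        elif v > target_v + band_v:
            connected = False
        # Net current into the capacitor: charging (if connected) minus load.
        i_net = (charge_a if connected else 0.0) - load_a
        # dV = I * dt / C, and the voltage cannot go negative.
        v = max(0.0, v + i_net * dt / cap_f)
        trace.append(v)
    return trace


if __name__ == "__main__":
    trace = simulate_bang_bang()
    settled = trace[len(trace) // 2:]
    print(f"settled range: {min(settled):.3f} V to {max(settled):.3f} V")
```

With these assumed numbers the voltage ramps up once and then saw-tooths inside a narrow band around the target, which is the behavior the paragraph describes.
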
>Now, if the ATX board were simply to run the green-wire
>contact straight through to the power on/off header, you
>wouldn't need much oomph at all on the +5V standby line, and
>older ATX power supplies in fact didn't. However, newer
>boards have things like Wake-on-LAN, Wake-on-Modem, and
>other various and sundry goodies that have to run off the
>+5V standby.  It has gotten to the point that, in order to
>do all the processing that is required to leave standby, the
>standby current draw is greater than what some older
>supplies can provide. So in the case of a power supply that
>either by design or fault cannot provide sufficient current
>under standby, what (I think) happens is that while the
>motherboard is waiting for the main supply voltages to come
>up to full power, the standby processing bleeds off the
>capacitor to the point that the standby voltage sags below
>the minimum required for operation. At that point, the
>standby processing halts, the motherboard stops holding the
>green wire to ground, and the power supply stops trying to
>power up. It then returns to standby mode, re-charges the
>standby capacitor, and the cycle begins again.
>
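
[Aside: the failure loop described above can be put into a back-of-the-envelope check. This is only a sketch of the hypothesis as stated, assuming a single standby capacitor and constant currents. The function name, the capacitance, the minimum voltage and the rail rise time are all hypothetical placeholders, not values from the ATX spec.]

```python
def power_up_succeeds(supply_limit_a, wake_draw_a, cap_f=2200e-6,
                      v_start=5.0, v_min=4.5, rail_rise_s=0.1):
    """Does the standby rail survive until the main rails are up?

    supply_limit_a: most current the standby supply can deliver (assumed)
    wake_draw_a:    current the board draws while leaving standby (assumed)
    The capacitor has to cover any shortfall for rail_rise_s seconds.
    """
    deficit_a = wake_draw_a - supply_limit_a
    if deficit_a <= 0:
        # Supply keeps up on its own; the capacitor never drains.
        return True
    # Constant-current discharge: t = C * dV / I
    time_to_brownout_s = cap_f * (v_start - v_min) / deficit_a
    return time_to_brownout_s >= rail_rise_s


if __name__ == "__main__":
    print(power_up_succeeds(supply_limit_a=0.72, wake_draw_a=0.5))   # supply keeps up
    print(power_up_succeeds(supply_limit_a=0.10, wake_draw_a=0.72))  # capacitor drains first
```

When the function returns False, the model corresponds to the cycle above: the standby logic halts, the green wire is released, and the supply drops back to standby to recharge.
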
>If you have a system that is behaving like this, try putting
>a voltmeter on the standby pin of the ATX header (you can
>usually jab a probe down into the back of the connector).
>You should see it at +5V when the system is "off". Then
>press the system's "on" button and watch the voltage. You'll
>most likely see it sag down to a couple of volts or so.  If
>this doesn't happen, you've probably got some other problem,
>perhaps a POST failure of some sort. Also, this may not be
>the end of the diagnosis -- it is possible that the failure
>to provide enough current on standby may not be the fault of
>the power supply itself. It could be a faulty component
>(e.g. the SCSI drive we heard about) sucking down too much
>current on power-up, or an overburdened AC supply circuit
>that sags just a bit when your system starts up -- in the
>latter case I imagine that you could wind up with a
>seemingly jinxed spot in the equipment rack. :-)
>
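
[Aside: if you log the two voltmeter readings from the procedure above, the decision it describes is simple enough to write down. A rough sketch only: the function name and the thresholds (0.25 V tolerance, 4.0 V sag threshold) are assumptions picked for illustration, and as noted above a sag does not by itself tell you whether the supply, a component, or the AC circuit is at fault.]

```python
def classify_standby_readings(off_v, pressed_min_v, nominal_v=5.0,
                              tolerance_v=0.25, sag_threshold_v=4.0):
    """Interpret two voltmeter readings taken on the standby pin.

    off_v:         reading with the system "off" but plugged in
    pressed_min_v: lowest reading seen after pressing the power button
    Thresholds are illustrative assumptions, not spec values.
    """
    if abs(off_v - nominal_v) > tolerance_v:
        return "standby rail out of range while off"
    if pressed_min_v < sag_threshold_v:
        return "standby sag on power-up"
    return "standby holds; look elsewhere (e.g. POST failure)"


if __name__ == "__main__":
    print(classify_standby_readings(5.02, 2.1))
    print(classify_standby_readings(5.02, 4.95))
```
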
>BTW, if the power supply has too little oomph on standby by
>*design*, the system will probably *never* power up.  If the
>supply's design meets the new spec only marginally, or if it
>is malfunctioning, say, because of a damaged or weakened
>capacitor, then it might behave differently when cold than
>it does when it is fully warmed up. In this event,
>unplugging the supply for a while and reconnecting it can
>create a short window in which the supply can get the system
>over the hump to leave standby. I in fact have a supply at
>home that has this problem, and I just sort of live with it
>because it's not my main system. Someday perhaps I'll
>replace the supply.
>
>As to why you have to disconnect the Ethernet as well, I
>really don't have a clue.
>
>HTH,
>--Bob Drzyzgula
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit 
>http://www.beowulf.org/mailman/listinfo/beowulf
