[Beowulf] UPS & power supply instability

Robert G. Brown rgb at phy.duke.edu
Wed Sep 28 09:08:01 EDT 2005


David Kewley writes:

> Hi all,
> 
> Our compute nodes, the main power load, are Dell PowerEdge 1850s with a 
> single power supply per node.  This power supply is 
> power-factor-corrected, so the Liebert PDUs see a power factor of 0.99 
> or 1.00.
> 
> I've balanced the loads on the three phases about as well as possible.  
> We still have neutral current, about 1/3 to 1/2 the magnitude of any of 
> the per-phase currents.

I know this song all too well -- not for lieberts/dells per se but power
woes in general.  From what I learned messing with this before:

If the power supplies on the load side are really power factor
corrected, you shouldn't have a neutral current when the load is
balanced.  Certainly not one 1/2 the magnitude of the per phase current.
That is a fairly clear signature of switching power supplies on 3 phase
power, where the fact that the supplies draw current pretty much only in
the middle third of each half cycle prevents the three phase
cancellation.
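
A rough numerical sketch of that non-cancellation, assuming an idealised
load that conducts a flat pulse only during the middle third of each half
cycle.  The function names and the one-sample-per-degree grid are
illustrative choices, not anything measured on this cluster:

```python
import math

N = 360  # one sample per electrical degree (assumed grid)

def sine_load(deg):
    """Ideal PFC load: the current is a pure sinusoid."""
    return math.sin(math.radians(deg))

def pulsed_load(deg):
    """Crude switching supply: conducts only from 60 to 120 degrees
    of each half cycle, with the sign of that half cycle."""
    deg %= 360
    if 60 <= deg % 180 < 120:
        return 1.0 if deg < 180 else -1.0
    return 0.0

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def neutral_to_phase_ratio(load):
    """RMS neutral current over RMS phase current for a balanced wye."""
    phase = [load(k) for k in range(N)]
    # The three phases are the same waveform shifted by 120 degrees.
    neutral = [load(k) + load(k - 120) + load(k - 240) for k in range(N)]
    return rms(neutral) / rms(phase)

print(neutral_to_phase_ratio(sine_load))
print(neutral_to_phase_ratio(pulsed_load))
```

Under these assumptions the sinusoidal load gives a ratio of essentially
zero, while the pulsed load gives a neutral RMS of about 1.73 times the
per-phase RMS, all of it at 180 Hz.  Real supplies would presumably fall
somewhere in between.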

Also, you shouldn't be sharing a neutral line anywhere between the
transformer/wall and the load -- each outgoing phase should have its own
neutral back to the UPS.  That is, there shouldn't BE a "neutral current"
for you to measure in a shared neutral line of three phase wye, although
if you are running off a UPS I suppose they could float the neutral line
(something that strikes me as cosmically boneheaded to do and if done, a
very likely source of your woes and then some).

Assume nothing.  That is, don't assume that your wiring was done sanely
or safely until/unless YOU'VE traced it back on through, and beware
ground loops and worse.  Our wiring was done by area contractors who
supposedly knew what they were doing, and obviously passed inspection.
Our wiring was FUBAR anyway -- the contractors were clueless and didn't
even follow the architect's spec; the inspector inspected it as if it
was household wiring and not server room wiring.  It is entirely
possible for your wiring to have been done by well-meaning electricians
who haven't the faintest idea how to correctly and safely wire a server
room environment with its (usually) highly nonlinear loads.

Also, don't assume that Dell's power supplies are really PFC just because
Dell says that they are.  Believe it when you put a dual input scope on
one, measure the current and voltage simultaneously as a function of
time with a triggered sweep, and see two perfectly sinusoidal waves, in
phase.
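
If the scope can export its two traces, the same check can be done
numerically.  A minimal sketch, assuming samples that span whole cycles;
the waveforms below are synthetic stand-ins and `true_power_factor` is a
hypothetical helper, not part of any scope's software:

```python
import math

N = 360  # one sample per electrical degree (assumed sampling)

def rms(xs):
    return math.sqrt(sum(x * x for x in xs) / len(xs))

def true_power_factor(v, i):
    """Average real power divided by apparent power (Vrms * Irms)."""
    real_power = sum(a * b for a, b in zip(v, i)) / len(v)
    return real_power / (rms(v) * rms(i))

voltage = [math.sin(math.radians(k)) for k in range(N)]

# Ideal PFC supply: sinusoidal current in phase with the voltage.
pfc_current = [0.1 * math.sin(math.radians(k)) for k in range(N)]

# Crude non-PFC supply: flat pulses in the middle third of each half
# cycle, centred on the voltage peaks (so the fundamental is "in phase").
pulsed_current = [
    (1.0 if k < 180 else -1.0) if 60 <= k % 180 < 120 else 0.0
    for k in range(N)
]

print(true_power_factor(voltage, pfc_current))
print(true_power_factor(voltage, pulsed_current))
```

The sinusoidal case comes out at 1.0, while the pulsed case comes out at
roughly 0.78 even though its fundamental is perfectly in phase.  That is
why "in phase" alone is not enough to call a supply PFC.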

> The problem is this: We can fire up our cluster to about 40% of maximum 
> load and everything is fine.  But if we go over some threshold right 
> around 40% of max, the output currents from the PDUs go unstable.  It's 
> a fairly sharp edge: Approximately speaking, if I stay below the 
> threshold, the current variation is <1%.  But if I go to the top end of 
> the stable range, then add another ~2% load, the output currents vary 
> over something like 30%.  The instability gets worse with increasing 
> load above the threshold.  Reducing the load below the threshold 
> restores stability (with perhaps a slight bit of hysteresis).
> 
> This instability only happens when the UPS is online.  If we put the UPS 
> in bypass, we can go up to around 70% of max load with no instability 
> (all computers on but idling in the OS; we haven't tested all nodes at 
> 100% CPU yet).
> 
> We suspect the problem is due to some interaction between the computer 
> power supplies and the output stage of the UPS.  Perhaps the UPS isn't 
> regulating correctly with this load.  Or perhaps it's regulating *too 
> well*, and the rock-solid voltages allow the oscillations to grow 
> instead of damp.  I don't know.

Ummm, yes, something like this is possible, especially if the UPS is
also being fed by a switching power supply in its own right.  You could
end up with some odd ripple on the line from the 180 Hz harmonics.  It
sounds somewhat as if its primary capacitors are being driven to where
they are undercharged (and can no longer effectively filter the ripple
which then is bleeding through).  Additionally, every transformer in the
supply system is an inductor chained to capacitance and if your load has
harmonics, it can drive resonance-like behaviors.  A secondary problem
is that with three-phase wye transformers in particular, switching power
supply loads with odd harmonics (e.g. 180 Hz) can drive loop/eddy
currents within the transformer itself, causing it to overheat (wasting
power and costing you money) which will shorten its lifetime.  The
mid-phase overloads also brown out the computer power supplies during
the draw part of the cycle.

By far the best (and in fact nearly the only:-) decent explanation of
harmonics and harmonic mitigation is to be found here:

  http://www.mirusinternational.com/pages/faq.html

I would recommend reading ALL of this -- in fact, print it out and just
keep it handy to use in testing and discussions with Liebert and/or
Dell.  In particular see #7 "Why do 3rd harmonic currents overload
neutral conductors".  I would have THOUGHT that Liebert would be all
over this stuff as well, but from the sound of it that might not be the
case.  I would expect Dell to know none of it, and to not really know
what a power factor correcting power supply is or what it does and why
you need it.

I don't know what you can do to positively diagnose the situation, but I
expect that it will involve a dual trace oscilloscope rigged so it can
function as a line voltmeter and ammeter at the same time, in a test
circuit you'll probably have to hand wire so you can insert the one
(ammeter) and run the other (voltmeter) across all three wire pair
combos, and a handful of nodes to load the test circuit with.  If you
aren't comfortable with wiring, and don't know why you NEVER put an
ammeter across two voltage lines and ALWAYS put a voltmeter across the
voltage lines, and which wire is hot and which is not and which is
neutral, DON'T TRY THIS YOURSELF.  Dying is such a drag.  Be sure to rig
a scope to measure current (safely, without starting fires or injuring
living things including yourself), which isn't horribly easy but can be
done.  

Look at the shape of the neutral line current, compared to the line
voltage, when a single Dell is on the system, and compare it to the
pictures in Mirus FAQ #7.  This should give you a quick-and-dirty
picture of whether the Dell power supplies are really PFC or if they're
just ordinary switching power supplies that are supposedly more
efficient or something so Dell claims that they are "PFC".  Dell may in
ignorance interpret "PFC" as having current and voltage "in phase" for
the primary draws but ignore the presence of third harmonics.  Honestly,
from your reported neutral current I expect that this is the case (and
I'm assuming since you REPORT the neutral current that you do indeed
know how to measure it -- but looking at it is better).  Look for
voltage distortion on the supply lines as well.
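
If eyeballing the trace is ambiguous, the third-harmonic content can be
estimated from one exported cycle of current samples.  A minimal sketch
using a single-bin discrete Fourier sum; the helper names are
hypothetical and the test waveform is the same idealised pulse shape,
not real Dell data:

```python
import math

def harmonic_amplitude(samples, h):
    """Amplitude of harmonic h, given samples covering exactly one
    fundamental cycle at uniform spacing."""
    n = len(samples)
    a = sum(x * math.cos(2 * math.pi * h * k / n)
            for k, x in enumerate(samples))
    b = sum(x * math.sin(2 * math.pi * h * k / n)
            for k, x in enumerate(samples))
    return 2.0 * math.hypot(a, b) / n

def third_harmonic_ratio(samples):
    """180 Hz amplitude relative to the 60 Hz fundamental."""
    return harmonic_amplitude(samples, 3) / harmonic_amplitude(samples, 1)

sine = [math.sin(math.radians(k)) for k in range(360)]
pulsed = [
    (1.0 if k < 180 else -1.0) if 60 <= k % 180 < 120 else 0.0
    for k in range(360)
]

print(third_harmonic_ratio(sine))
print(third_harmonic_ratio(pulsed))
```

A clean sinusoid gives a ratio near zero; the idealised pulse gives a
third harmonic about two thirds the size of the fundamental.  Anything
well away from zero on a supposedly PFC supply deserves a closer look.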

If you discover that -- surprise -- the "PFC" supplies aren't, you can
either:

a) Bug Dell for "real" PFC supplies, directing them to the Mirus FAQ in
case they are clueless about what that means and telling them that when
you slap the aforementioned scope on them under load you'd better see
nearly perfect 60 Hz sinusoids, in phase, in both voltage and current with
"no" odd harmonics -- once you can find an engineer somewhere who
understands a word of this; 

b) Live with it (this is what we did, reasonably successfully).  Rewire
the shared neutral so each phase has its own neutral back to a solid
ground (e.g.  building steel, depending on how your setup is wired).
Try to ensure that the runs from the primary circuit panels are as short
as possible and use as heavy gauge wire as possible/practical (minimally
12/2, but 10/2 would be even better although it is a PITA to work with
in conduits) to keep the overcurrents in the middle third of each half
cycle from browning out the supply.  Also watch the circuit breakers --
when we shared a neutral we would pop the breaker whenever load went
above about 60% of theoretical line capacity because of breaker
overheating caused by the extra non-cancelled current.  That sounds a
lot like your current problem; 

c) Give Mirus a call and get a harmonic correcting primary transformer
for the space.  Then forget about the problem and use whatever kind of
power supplies you like (but still avoid sharing a neutral and all
that).  Or get Liebert to work on this for/with you.

If the dells ARE OK and HAVE PFC supplies when you test them
independently on otherwise quiet lines, then I suspect that you have a
bigger problem.  At least you'll know it isn't in the dells, which
limits the number of people you have to yammer at.  Consequently it must
be in the Lieberts, the UPS, or in the wiring itself.  I'd suspect
something wired egregiously incorrectly -- a floating neutral on the
UPS, for example -- that causes the neutral line to accumulate a
voltage bias relative to true ground and undercharge power supply
capacitors, create a significant ground loop risk, and all sorts of
other things.  Or something else, maybe something worse.  Maybe
something dangerous.  Take it pretty seriously -- people have been known
to melt down racks of equipment (as in "melt the metal and burn the
epoxy and insulation", not as in "cause equipment to momentarily smoke
and break") from incorrect multiphase wiring.  People have also been
known to have been killed by faulty wiring.

USUALLY you can detect egregious problems with an ordinary voltmeter or
scope or maybe even a kill-a-watt -- if there is a significant voltage
between the neutral line and the (unloaded) ground wire on any circuit
(where I'm not certain what "significant" is -- greater than the 1-3
volts that might represent the resistive voltage on the driven neutral
line from load to wall at any rate) this is a problem.  If for any
reason the neutral is far away from the local ground spatially (long
runs of wire between them increase the voltage disparity), ground
loops can be quite dangerous and can cause system malfunction.
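
For a feel of what a merely resistive neutral-to-ground voltage looks
like, here is a back-of-envelope sketch using V = I * R.  The 16 A
neutral current and 100 ft run are made-up illustrative numbers; the
resistances are approximate handbook values for copper wire at room
temperature:

```python
# Approximate DC resistance of copper wire at 20 C, ohms per 1000 ft.
OHMS_PER_1000_FT = {"12": 1.588, "10": 0.999}

def neutral_drop_volts(neutral_amps, run_feet, awg):
    """Resistive voltage along one neutral conductor: V = I * R."""
    resistance = OHMS_PER_1000_FT[awg] * run_feet / 1000.0
    return neutral_amps * resistance

# Assumed example: 16 A returning on a 100 ft neutral run.
for awg in ("12", "10"):
    print(awg, round(neutral_drop_volts(16.0, 100.0, awg), 2))
```

With those assumed numbers the drop is about 2.5 V on 12 gauge and about
1.6 V on 10 gauge, inside the 1-3 volt range mentioned above.  A reading
well beyond what this kind of estimate predicts for your own run lengths
and currents suggests something other than simple wire resistance.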

Also see Mirus FAQ #9.  I'm GUESSING that similar things to the pictures
on this page can happen to the UPS under harmonic loads -- decrease in
ride-through capacity as the caps are incorrectly charged.  An
interesting possibility is that the UPS switching power supplies are NOT
harmonic corrected and share a neutral back to the transformers, so the
fact that the dells are PFC is completely erased by having the UPS
inline.  This is an appealing possibility, really -- you have spent much
money to ensure that you don't have a harmonic distortion problem, but
in fact moved the harmonic distortion problem one step upstream and if
anything exacerbated it (since the UPS has its own inefficiencies and
ADDS those to the inefficiencies in the node power supplies, so it draws
EVEN MORE peak current in the middle third of each half cycle than the
aggregate nodes would have done:-).

SO, you might want to put your dual scope on the UPS supply lines
themselves under various loads, looking for ripple and harmonics on both
sides.

Some of this stuff you can check for on your own, but really you may
need to find a COMPETENT electrician -- one that specializes in server
room wiring and is e.g. union trained -- to help you out.  My
brother-in-law is a journeyman electrician in the Detroit area, and I
know what he went through in his journeyman training -- serious physics,
actually.  I also know what the local electricians who wired our server
room had as training -- think "How to Wire Your Own Home" from Home
Depot (well, maybe a BIT more than this, but you get the idea...:-).
There exist competent people but you'll have to look for them and
probably pay for their knowledge.

> Liebert has been on this case for something like 4 weeks now.  So far 
> they have no solution.  Mind you, the "blame" may be shared by the 
> Liebert UPS and the Dell power supplies, but I'm relying on Liebert to 
> figure out why things go unstable *when their UPS is online, supplying 
> a load that should be quite normal*, and so far they have no solution 
> for me.  We can't just wait on Liebert; this problem is hamstringing 
> our use of our new 1024-node cluster.  So now I turn to this list.
> 
> Can anyone here offer ideas, or better yet, experience?

I've done my best above.  I'm sorry you're having this problem, but you
are certainly not the first person to get bitten by it and probably
won't be the last, even though it SOUNDS like you did everything right
(from your end) during the server room design phase.

Good luck.

   rgb

> 
> David
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf