Thermal Problems

Robert G. Brown rgb at phy.duke.edu
Thu Jul 24 10:09:15 EDT 2003


On Wed, 23 Jul 2003, Mitchel Kagawa wrote:

> Here are a few pictures of the culprit.  Any suggestions on how to fix it
> other than buying a whole new case would be appreciated.
> http://neptune.navships.com/images/oscarnode-front.jpg
> http://neptune.navships.com/images/oscarnode-side.jpg
> http://neptune.navships.com/images/oscarnode-back.jpg

The case design doesn't look totally insane, although that depends a bit
on the actual capacity of some of the fans.  You've got a fairly large,
clear aperture at the front, three fans pulling from it and blowing cool
air over the memory and all three heatsinks, and a rotary/turbine fan in
the rear corner to exhaust the heated air.  The ribbon cables are off to
the side where they don't appear to obstruct the airflow.  The hard disk
presumably has its own fan and pulls front to back over on the other
side more or less independent of the case flow.

At a guess, your problem really is just the CPU coolers, which may not
be optimal for 1U cases.  A few minutes with Google turns up a lot of
alternatives, e.g.:

  http://www.buyextras.com/cojaiuracpuc.html

which is engineered to pull air in through the copper (very good heat
conductor) fins and exhaust it to the SIDE and not out the TOP.  Another
couple of things you can try are to contact AMD and find out what CPU
cooler(s) THEY recommend for 1U systems or join one of the AMD hardware
user support lists (I'll let you do the googling on this one, but they
are out there) and see if somebody will give you a glowing testimonial
on some particular brands for quality, reliability, effectiveness.

The high end coolers aren't horribly cheap -- the one above is $20
(although the site also had a couple of coolers for $16 that might also
be adequate).  However, retrofitting fans is a lot cheaper than
replacing 64 1U cases with 2U cases AND likely having to replace the CPU
coolers anyway, as a cheap cooler is a cheap cooler and likely to fail.

If you bought the cluster from a vendor selling "1U dual Athlon nodes"
and they picked the hardware, they should replace all of the cheap fans
with good fans at their cost, and they should do it right away as you're
losing money by the bucketful every time a node goes down and you have
to mess with it.  Downtime and your time are EXPENSIVE -- hardware is
cheap.  If they refuse to, please post their name on the list so the
rest of us can avoid them plague-like (a thing I'm tempted to do anyway
if their advice on "fixing" your cooling is to install your 1U node on a
2U spacing).

If you picked the hardware and they just assembled it, well, tough luck,
but they should still help out some -- perhaps take back the cheap fans
and replace them with good fans at cost.  However, even if they decide
to do nothing at all for you and you're stuck doing it all yourself,
you're better off spending $40 per node (two $20 coolers) x 64 nodes =
$2560 and a couple of days of your time and ending up with a
functional cluster than living with days/weeks
of downtime fruitlessly cycling cheap replacement fans doomed to die in
their turn.  Also, eventually your CPUs will start to die and not just
crash your systems, and that gets very expensive very quickly quite
aside from the cost of downtime and labor.
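
To make the arithmetic explicit, here is a minimal Python sketch of the
retrofit cost.  The 64 dual-CPU nodes and the roughly $20 cooler price
come from this thread; the function name is purely illustrative.

```python
def retrofit_cost(nodes, cpus_per_node, price_per_cooler):
    """Total cost of replacing every CPU cooler in the cluster."""
    return nodes * cpus_per_node * price_per_cooler

# Figures from this thread: 64 dual-CPU nodes, roughly $20 per cooler.
total = retrofit_cost(nodes=64, cpus_per_node=2, price_per_cooler=20)
print(f"Full retrofit: ${total}")  # $40 per node x 64 nodes
```

Plugging in one of the $16 coolers instead gives the cheaper figure for
comparison, assuming that model turns out to be adequate.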

There are no free lunches, and it may be that going with expensive (but
effective!) CPU cooler fans isn't enough to stabilize your systems.  For
example, if the rear exhaust fan doesn't have adequate capacity or the
cooler fans can't be installed in such a way as to establish a clean
airflow of cool air from the front, the CPU cooler fans will just end up
blowing heated air around in a turbulent loop inside the case and even
though the fans may not fail (as they won't be obstructed) the CPUs may
run hotter than you'd like.  You'll have no way of knowing without
trying.

If your vendor doesn't handle this for you I'd recommend that you
immediately spring for a "sample" of the high end fans -- perhaps eight
of them, perhaps sixteen -- and use them to repair your downed systems.
Run the nodes in their usual environment with the new fans and sample
CPU core temperatures.  I'd predict that the CPUs will run cooler than
they do now in any event, but it is good to be sure.  Once you're
confident that, given unobstructed airflow, they will a) keep the CPUs
cool and b) run reliably, you can either buy them as you need them and
replace the cheap fans as they die or, if your cluster really needs to
be up and stay up, spring for the complete set.
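
As a rough sketch of how the sampled readings might be triaged, the
Python below flags nodes that run hot or whose fan has slowed down.  It
assumes you have already collected a CPU temperature and fan speed per
node by whatever means you have (lm_sensors, BIOS readouts, etc.).  The
5400 rpm rating is from this thread; the 70C limit, the 80% cutoff, the
function name and the sample readings are all assumptions made up for
illustration, not recommendations.

```python
NOMINAL_RPM = 5400       # rated fan speed quoted in this thread
MIN_RPM_FRACTION = 0.8   # assumption: below 80% of rated speed is suspect
MAX_TEMP_C = 70.0        # assumption: take the real limit from your CPU's datasheet

def flag_nodes(readings, max_temp_c=MAX_TEMP_C,
               min_rpm=NOMINAL_RPM * MIN_RPM_FRACTION):
    """Return {node: [reasons]} for nodes that run hot or have a slow fan.

    readings maps node name -> (cpu_temp_c, fan_rpm).
    """
    flagged = {}
    for node, (temp_c, rpm) in sorted(readings.items()):
        reasons = []
        if temp_c > max_temp_c:
            reasons.append(f"hot: {temp_c:.0f}C")
        if rpm < min_rpm:
            reasons.append(f"slow fan: {rpm} rpm")
        if reasons:
            flagged[node] = reasons
    return flagged

# Made-up sample readings, for illustration only.
sample = {
    "node01": (52.0, 5400),
    "node02": (78.0, 1000),   # a seized fan like the ones described
    "node03": (55.0, 6050),
}
for node, reasons in flag_nodes(sample).items():
    print(node, "->", ", ".join(reasons))
```

Run against readings from the nodes with the new coolers and from a few
with the old ones, this gives a quick side-by-side before you commit to
the complete set.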

BTW, you should check to make sure that the fan at the link above is
actually correct for your CPUs -- it seems like it would be, but caveat
emptor.

Good luck,

    rgb

> 
> You can also see how many I'm down... it should read 65 nodes (64 + 1 head
> node)
> http://neptune.navships.com/ganglia
> 
> Mitchel Kagawa
> Systems Administrator
> 
> ----- Original Message -----
> From: "Robert G. Brown" <rgb at phy.duke.edu>
> To: "Mitchel Kagawa" <mitchel at navships.com>
> Cc: <beowulf at beowulf.org>
> Sent: Wednesday, July 23, 2003 10:14 AM
> Subject: Re: Thermal Problems
> 
> 
> > On Wed, 23 Jul 2003, Mitchel Kagawa wrote:
> >
> > > I run a small 64-node cluster, each node with dual AMD MP2200's in a
> > > 1U chassis.  I am having problems with some of the nodes overheating
> > > and shutting down.  We are using Dynatron 1U CPU fans which are
> > > supposed to spin at 5400 rpm, but I notice that a lot (25%) of the
> > > fans tend to freeze up or blow the bearings and spin at only 1000
> > > RPM, which causes the CPU to overheat.  After careful inspection I
> > > noticed that the heatsink and fan sit very close to the lid of the
> > > case.  I was wondering how much clearance is needed between the lid
> > > and the fan that blows down onto the short copper heatsink?  When I
> > > put the lid on the case it is almost as if the fan is working in a
> > > vacuum because it actually speeds up an additional 600-700 rpm to
> > > over 6000 rpm... like there is no air resistance.  Could this be why
> > > the fans are crapping out?  I was thinking that a 60x60x10mm CPU fan
> > > that has air intakes on the side of the fan might work better but I
> > > have not seen any... have you?
> > >
> > > Also the vendor suggested that we separate the 1U cases because he
> > > believes that there is heat transfer between the nodes when they are
> > > stacked right on top of each other.  I thought that if one node is
> > > running at 50C and another node is running at 50C it won't generate
> > > a combined heat load of more than 50C, right?
> >
> > AMDs really hate to run hot, and duals in 1U require some fairly
> > careful engineering to run cool enough, stably.  Who is your vendor?
> > Did they do the node design or did you?  If they did, you should be able
> > to ask them to just plain fix it -- replace the fans or if necessary
> > reengineer the whole case -- to make the problem go away.
> >
> > Issues like fan clearance and stacking and overall airflow through the
> > case are indeed important.  Sometimes things like using round instead of
> > ribbon cables (which can turn sideways and interrupt airflow) make a
> > big difference.  Keeping the room's ambient air "cold" (as opposed to
> > "comfortable") helps.  There is likely some heat transfer vertically
> > between the 1U cases, but if you go to the length of separating them you
> > might as well have used 2U cases in the first place.
> >
> > From your description, it does sound like you have some bad fans.
> > Whether they are bad (as in a bad design, poor vendor), or bad (as in
> > installed "incorrectly" in a case/mobo with inadequate clearance causing
> > them to fail), or bad (as in you just happened to get some fans from a
> > bad production batch but replacements would probably work fine) it is
> > very hard to say, and I don't envy you the debugging process of finding
> > out which.  We've been the route of replacing all of the fans once
> > ourselves so it can certainly happen...
> >
> >    rgb
> >
> > >
> > >
> > > Mitchel Kagawa
> > > Systems Admin.
> > >
> > >
> > > _______________________________________________
> > > Beowulf mailing list, Beowulf at beowulf.org
> > > To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
> > >
> >
> > Robert G. Brown                        http://www.phy.duke.edu/~rgb/
> > Duke University Dept. of Physics, Box 90305
> > Durham, N.C. 27708-0305
> > Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
> >
> >
> >
> >
> >
> 
> 
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





