Q: Building a small machine room? Materials/costs/etc.

Robert G. Brown rgb at phy.duke.edu
Wed Sep 17 16:00:09 EDT 2003


On Wed, 17 Sep 2003, Michael Stein wrote:

> >   a) Thermal kill switch.  You may well want the room to be equipped
> > with one.  This is a thermostatted switch that kills all room power if
> > the ambient temperature exceeds some threshold, e.g. 90F.  The idea is
> > that if AC fails and you don't get down there to shut down nodes fast
> > enough, the kill switch kills power all at once rather than permit the
> > nodes to overheat to where they physically break (as they will if they
> > get hot enough).
> > 
> > Remember, at 10 KW/rack, four racks is 40 KW all being released in your
> > itty bitty 15x25' room.  The room will go from ambient cool to meltdown
> > in a matter of minutes if AC fails, and (Murphy's law being what it is)
> > it WILL FAIL sooner or later.
> 
> At that size/power, a lot less than minutes.
> 
> One 1U machine might output about 30 CFM of 99 F air.  Four racks full of
> them (42 * 4 = 168) would be about 5000 CFM of 99 F air.
> 
> You have 15 x 25 x 8? so about 3000 cubic feet of 75 F air to start.
> At 5000 CFM through the machines the whole room will be 99 F in about
> 36 seconds.   After that it gets interesting (the machines are now taking
> in 99 F air)...
> 
> And this assumes uniform/perfect air flow (no hot/cold spots).   

Ahh, you theorists.  My claim of "minutes" was experimental. Alas.
(Just kidding, I'm really a theorist myself;-)

The point being that, minutes or seconds, a thermal kill switch is very
much a good idea.  So are things like netbotz, other machine-readable
thermal monitors, or the use of lmsensors and so forth if you've got
'em.  An automated kill SCRIPT that triggers before the kill SWITCH
permits a clean shutdown instead of an instant loss of power (a good
thing even though with e.g. ext3 the latter is no longer quite so
worrisome).
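
For concreteness, here is a minimal sketch of what such a kill script
might look like -- purely illustrative, not anybody's production script.
It assumes a Linux node that exposes a temperature in millidegrees C
under /sys/class/thermal, and a hypothetical /etc/cluster/nodes file of
hostnames reachable by passwordless ssh; adjust paths and the threshold
for your own site:

  #!/usr/bin/env python
  # Sketch of an automated thermal kill script: poll a temperature and
  # run a clean distributed shutdown before the hardwired kill switch
  # (or the hardware itself) trips.  Paths and threshold are assumptions.
  import subprocess
  import time

  SENSOR       = "/sys/class/thermal/thermal_zone0/temp"  # millidegrees C
  NODES_FILE   = "/etc/cluster/nodes"   # hypothetical: one hostname per line
  THRESHOLD_F  = 85.0                   # trip well below the hardware kill
  POLL_SECONDS = 30

  def read_temp_f():
      with open(SENSOR) as f:
          # millidegrees C -> degrees F
          return int(f.read().strip()) / 1000.0 * 9.0 / 5.0 + 32.0

  def shutdown_all():
      for node in open(NODES_FILE).read().split():
          # A clean shutdown beats an abrupt power kill, ext3 or not.
          subprocess.call(["ssh", node, "/sbin/shutdown", "-h", "now"])

  while True:
      if read_temp_f() >= THRESHOLD_F:
          shutdown_all()
          break
      time.sleep(POLL_SECONDS)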

Anyway, in our direct experience it isn't QUITE the "less than a minute"
a pessimistic estimate yields.  There is a certain amount of thermal
ballast in the room itself, the cases, and the air (I think you're
overestimating the perfection of the circulation and mixing process, for
example); the walls and floor and ceiling do let some heat out,
especially if they start good and cold (concrete has a fairly high
specific heat); and if it is "just" the chiller that goes out but the
blower keeps running, there is a bit more "stored cold" (I know, stored
absence of heat:-) in the AC coils and ductwork.  We've found
semi-empirically (the hard way) that it takes between five and fifteen
minutes for the room to really get over 90F on a full AC kill,
admittedly starting an easy 10F colder than 75F -- barely enough time to
run a distributed shutdown that heads off some of the meltdown/thermal
kill process, or to run down, open the door to the hall, and throw a big
fan in if one happens to be paying close attention.
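
If you want to see where the sub-minute figure comes from, and why the
thermal ballast matters, a quick back-of-the-envelope sketch of
Michael's numbers looks like this (perfect mixing, air only, no ballast
-- exactly the assumptions that make it pessimistic):

  # Two ways to estimate how fast the room air alone heats up.
  ROOM_FT3    = 15 * 25 * 8     # ~3000 cubic feet of air
  EXHAUST_CFM = 42 * 4 * 30     # 168 1U nodes x ~30 CFM each, ~5000 CFM
  POWER_W     = 40e3            # ~10 KW/rack x 4 racks

  # 1) Flow argument: time for the machines to turn over all the room air once.
  turnover_s = ROOM_FT3 / float(EXHAUST_CFM) * 60.0
  print("air turnover: %.0f s" % turnover_s)        # ~36 s

  # 2) Heat capacity argument: time for 40 KW to warm the air from 75F to 99F.
  air_kg   = ROOM_FT3 * 0.0283168 * 1.2             # ft^3 -> m^3, x 1.2 kg/m^3
  cp       = 1005.0                                 # J/(kg*K) for air
  delta_k  = (99.0 - 75.0) / 1.8                    # 24F is about 13.3 K
  heatup_s = air_kg * cp * delta_k / POWER_W
  print("air-only heatup: %.0f s" % heatup_s)       # ~34 s

Either way the air alone buys you a bit over half a minute; everything
else in the room (concrete, chassis, ductwork, coils) is what stretches
that out to minutes.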

In fact, I'd also generally recommend holding the room ambient
temperature well under 70F (as low as 60F or lower if you can stand it),
partly to widen this window in which you can intervene less
destructively than with a thermal kill, and partly because computer
components tend to give up roughly a year of projected lifetime for
every 10F increase in ambient operating temperature.
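
As a toy illustration of that rule of thumb (the four-year baseline is a
made-up nominal figure, not a measurement):

  def projected_lifetime_years(ambient_f, baseline_f=65.0, baseline_years=4.0):
      # Rule of thumb from above: lose ~1 year of life per 10F above baseline.
      # baseline_years is hypothetical, just to make the scaling concrete.
      return baseline_years - (ambient_f - baseline_f) / 10.0

  for t in (65, 75, 85):
      print("%dF -> ~%.0f years" % (t, projected_lifetime_years(t)))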

So a cluster with ambient air at 75F will have a LOT more hardware
problems and a shorter lifetime than one at 65F, and one with ambient
85F air will just plain break all the time, including bit-level errors
that don't permanently break the hardware but do cause a node to freeze
up.  Our cluster operated for a fairly extended period (several weeks)
with ambient air in the high 70s to low 80s when they first turned off
the chiller last winter, before we convinced them that several hundred
thousand dollars' worth of equipment was at risk and that they'd
PROMISED, back in our original design meeting for the space, to keep the
room cold all year long.  We got to see a whole lot of these problems
firsthand -- I still have broken nodes lying around that are good only
for spare parts, and everybody experienced system crashes and lost
work.

We keep the incoming air pretty cold, just about as cold as we possibly
can.  After mixing, the ambient air ends up cold but not unbearably so,
and a bit colder in front of the racks (where the air vents are
directed).  In front it is actively uncomfortable due to a cold draft (a
low-grade wind) you could catch a cold from.  Behind a rack the air is a
balmy 75F or so after mixing (not right at the vents but a few inches
away) -- warm but not hot.

This is probably what extends our meltdown time out to minutes, and it
is also why we keep a jacket down there for general use -- in the
summertime, shorts, sandals, and a Hawaiian shirt outside don't
translate well into cluster room garb;-)

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu



_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


