[Beowulf] Re: HVAC and room cooling...

Eckhoff.Peter at epamail.epa.gov Eckhoff.Peter at epamail.epa.gov
Tue Feb 3 09:26:37 EST 2004

Hello Jim

The main goal for us is to stay up and running as long as we can.

(Please read the last paragraph before responding to this one.)
Most of our temperature problems have been caused by AC maintenance
temperature spikes.  Having "security" open the doors slows the room
process.  The Sensaphone call to us helps us to know that there is a
problem, and we can phone in to be briefed:  "Do we have to come in or
has the room already begun to cool?"

The last of the Solutions is for just the type of incident that you
describe.  These are very rare but, like you say, they need to be
planned for.  Our ideal solution would be one that signals a problem to
the cluster.  The cluster takes the signal and gracefully shuts down the
programs and then shuts down the nodes.  We did not find such a
solution on the commercial market for our "came with the room" UPS.

Instead we found a sensor/software combination where the sensor ties
into the serial port of one of the nodes.  So far we **have** been able
to gracefully shut down the programs that are running.  We have **not**
found a way to automatically turn off the various cluster nodes.  That's
where we need some help/suggestions.
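
For the missing node power-off step, here is a minimal sketch of one
possible approach.  It assumes the monitoring node can reach every
compute node over passwordless ssh as root, and that "shutdown -h now"
actually powers the hardware off (that depends on the nodes' power
management support).  The node names, the 35 C threshold, and all
function names are hypothetical; none of this comes from the
sensor vendor's software.

```python
import subprocess

# Hypothetical values: adjust for the actual cluster.
NODES = ["node01", "node02", "node03"]
SHUTDOWN_TEMP_C = 35.0  # assumed software threshold, below any hard power trip


def shutdown_command(node):
    """Build the ssh command that asks one node to halt and power off."""
    return ["ssh", "-o", "ConnectTimeout=5", f"root@{node}",
            "shutdown", "-h", "now"]


def run_command(cmd):
    """Run a command and report success; never raise, so one dead node
    cannot stop the loop over the remaining nodes."""
    try:
        return subprocess.run(cmd, timeout=30).returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False


def shutdown_nodes(nodes, runner=run_command):
    """Ask every node to shut down; return the nodes whose request failed."""
    failed = []
    for node in nodes:
        if not runner(shutdown_command(node)):
            failed.append(node)
    return failed


def handle_reading(temp_c, nodes=NODES, runner=run_command):
    """Shut the nodes down once the room reaches the threshold.

    Returns None below the threshold, otherwise the list of nodes
    that did not accept the shutdown request."""
    if temp_c >= SHUTDOWN_TEMP_C:
        return shutdown_nodes(nodes, runner)
    return None
```

The existing sensor software would call something like
handle_reading() after it has stopped the running programs.  ssh can
return a nonzero status when the connection drops as a node goes down,
so the returned list is best read as "check these nodes", not as proof
that they are still up.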

Peter Eckhoff
Environmental Scientist
U.S. Environmental Protection Agency
4930 Page Road, D243-01
Research Triangle Park, NC 27709

Tel: (919) 541-5385
Fax: (919) 541-0044
E-mail: eckhoff.peter at epa.gov
Website:  www.epa.gov/scram001

From:     Jim Lux <James.P.Lux at jpl.nasa.gov>
Date:     02/02/04 07:56 PM
To:       Peter Eckhoff/RTP/USEPA/US at EPA, beowulf at scyld.com
cc:
Subject:  Re: [Beowulf] Re: HVAC and room cooling...

At 04:27 PM 2/2/2004 -0500, Eckhoff.Peter at epamail.epa.gov wrote:

>Problem 2:  What do you do when the AC stops?  Maintenance and the
>occasional AC system oops can be devastating to a cluster in a small
>room.
>Solution 2a:  We are tied directly into a security system.  When a
>sensor in the room reaches a temperature level, "Security" responds
>dependent upon the level detected.
>Solution 2b:  We installed a backup automated telephone dialer.  Not
>that we don't trust "Security", but we wanted a backup to let us know
>what was going on.
>    When the temperature reaches a certain level, the phone dials us
>    with an automated message:
>    " This is the Sensaphone 1108.  The time is 1:36 AM and ...
>    [ ed.  your CPUs are about to fry... Have a nice night!!!"  ;-)  ]

You need to seriously consider a "failsafe" totally automated shutdown
(as in, chop the power when the temperature gets to, say, 40C in the
room)...  Security might be busy (maybe there was a big problem with the
chiller plant catching fire or the boiler exploding); if they're
directing fire engine traffic, the last thing they're going to be
thinking about is going over to your machine room and shutting down your
hardware.

The autodialer is nice, but what if you're out of town when the balloon
goes up?

A simple temperature sensor with a contact closure wired into the "shunt
trip" on your power distribution will work quite nicely as a "kill it
before it melts". Sure, the file system will be corrupted, and so forth,
but, at least, you'll have functioning hardware to rebuild it on.

Automated monitoring and tcp sockets are nice for management in the
day-to-day situation, ideal for answering questions like: Should we get
another fan? or Maybe Rack #3 needs to be moved closer to the vent. But
what if there's a DDoS attack on someone near you, and netops decides to
shut down the router?  What if all those Windows desktops run amok,
sending mass emails to each other or trying to remotely manage each
other's IIS, bringing the network to a grinding halt?

The upshot is: Do not trust computers to save your computers in the
ultimate extreme.  Have a totally separate, bulletproof system.  It's
cheap, it's reliable, all that stuff.
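
To make the layering concrete, here is a small sketch of how the
escalation levels might be written down.  Only the 40 C figure comes
from the text above; the 30 C and 35 C levels and all names are
hypothetical.  The top level is listed purely as documentation: the
shunt trip is wired in hardware and must fire whether or not this code
is running.

```python
# Hypothetical escalation table, highest threshold first.
# Only the 40 C value is taken from the message; the rest are assumptions.
LEVELS = [
    (40.0, "hardware shunt trip (independent of this code)"),
    (35.0, "graceful software shutdown"),
    (30.0, "alert staff"),
]


def action_for(temp_c):
    """Return the response expected at a given room temperature in C."""
    for threshold, action in LEVELS:
        if temp_c >= threshold:
            return action
    return "normal"
```

Keeping the software thresholds below the hardware one gives the
graceful path a chance to finish first, while the hard trip still
covers the case where the software path never runs.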

James Lux, P.E.
Spacecraft Telecommunications Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875
