Cluster Environments

There was some more discussion about the thermal kill switch for the room, in particular about how much time passes between a room cooling failure and thermal damage to the systems. It was pointed out that in some cases it could be only a matter of tens of seconds before systems start having all kinds of thermal problems. Jim Lux added a very important point: when the power is killed, the power to the lights and a few receptacles needs to be left on. He also asked the proverbial question, "you also don't want the thermal kill to shut down the power to the blowers, now, do you?" Jim went on to say that there is a reason for shutting down the blowers in a fire: it prevents the HVAC (heating, ventilation, and air conditioning) system from spreading the fire around the building. He suggested that one might want to consider a configuration with staged responses to overheating. For example, a moderate overheat would shut down the computer equipment, a bigger overheat, such as a fire, would shut down the blowers, and then you would have the big red emergency button next to the door that shuts down all power. There was some humorous discussion about locating the big red button close to the light switch and the door-opening switch, and about what has happened and could happen as a result. Robert Brown added that one could also use scripts and lm_sensors as part of the first stage to shut down nodes before they figuratively melt down and before the emergency facility crews can identify and fix the cooling problem.
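The staged response Jim described can be sketched as a simple threshold check. This is only an illustration: the threshold values, the Celsius units, and the names `Action` and `staged_response` are assumptions made for the example, not figures or code from the discussion. The big red button is a manual control and is deliberately left out.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action"
    SHUTDOWN_NODES = "shut down computer equipment"
    SHUTDOWN_BLOWERS = "shut down computer equipment and blowers"


# Hypothetical thresholds in degrees Celsius, chosen only for illustration.
NODE_SHUTDOWN_C = 35.0     # stage 1: moderate overheat
BLOWER_SHUTDOWN_C = 60.0   # stage 2: severe overheat, such as a fire


def staged_response(room_temp_c: float) -> Action:
    """Map a room temperature reading to the most severe stage it triggers."""
    if room_temp_c >= BLOWER_SHUTDOWN_C:
        return Action.SHUTDOWN_BLOWERS
    if room_temp_c >= NODE_SHUTDOWN_C:
        return Action.SHUTDOWN_NODES
    return Action.NONE


if __name__ == "__main__":
    for temp in (24.0, 40.0, 75.0):
        print(temp, "->", staged_response(temp).value)
```

Checking the most severe stage first keeps the stages ordered, so a reading that crosses both thresholds always triggers the stronger response.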

Andrew Latham also added some general guidelines, including contacting a halon installation company (halon is a gas used to suppress a fire so that a sprinkler system is not needed - water and electronics don't mix well). There was some discussion about whether halon could be used in new installations, and Joel Jaeggli pointed out that halon has been banned because it is a CFC (chlorofluorocarbon) and can damage the ozone layer. Joel gave several replacements for halon. Finally, Luc Vereecken gave everyone a lesson in how halon works by describing the chemistry of combustion (fire) and how halon disrupts it. Luc also pointed out that he uses his cluster for doing research in combustion chemistry!

This was a very good discussion about many of the things that go into making a good machine room for clusters. If you are planning a new machine room or want to upgrade or retrofit an old one, you would be wise to review the postings to the Beowulf mailing list and perhaps ask further questions there.

Beowulf: Environment Monitoring

Along with the discussion of designing a machine room for clusters came a discussion of environmental monitoring of clusters. On the 30th of September, Mitchel Kagawa started this discussion by asking about environmental monitoring appliances like NetBotz/RackBots that email or call you in the event of a problem in the machine room (Mitchel's machine room hit 145 degrees because the cooling shut down, but amazingly 20 of his 64 nodes were still running!). Robert Brown (who is that masked man?) responded that the NetBotz boxes would work fine, but were a bit expensive in his opinion. He suggested using a temperature probe on the serial port of a select number of nodes and then a series of scripts to perform whatever action you desire based on the readings. Bob also went on to describe a do-it-yourself (DIY) setup using a PC-TV card and an X10 camera to monitor the room remotely (finally a use for those stupid pop-up ads!).

Several people suggested using lm_sensors and scripts to monitor and shut down nodes appropriately. This allows you to address each node individually, in addition to having an overall room monitoring system. Robert Brown and others suggested using lm_sensors with a polling cron script to watch the systems and take appropriate action if and when needed (please see the previous discussion). If you get one or two emails from a script based on lm_sensors, you might not have a problem, but if you start to get a number of them, this might indicate a room problem. There was some discussion about how lm_sensors presents the monitoring information and how that information is presented to users. Robert Brown, Don Becker, Rocky McGaugh and others joined in the discussion, which spilled over to the lm_sensors mailing list.
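A polling script of the kind described might look roughly like the sketch below. It assumes the `sensors -u` command from lm_sensors is installed and that temperature readings appear in its raw output as `tempN_input:` lines. The function names, the 70-degree limit, and the rule that a quarter of the nodes running hot suggests a room problem are all assumptions for illustration, not anything posted to the list.

```python
import re
import subprocess

# Matches raw-output lines such as "  temp2_input: 45.000".
TEMP_INPUT = re.compile(r"^\s*temp\d+_input:\s*(-?\d+(?:\.\d+)?)\s*$", re.MULTILINE)


def hottest_reading(sensors_raw: str):
    """Return the highest temperature in `sensors -u` output, or None if none found."""
    temps = [float(value) for value in TEMP_INPUT.findall(sensors_raw)]
    return max(temps) if temps else None


def read_local_sensors() -> str:
    """Run `sensors -u` on this node (requires lm_sensors to be installed)."""
    result = subprocess.run(["sensors", "-u"], capture_output=True, text=True, check=True)
    return result.stdout


def classify(node_temps: dict, limit_c: float = 70.0, room_fraction: float = 0.25) -> str:
    """Distinguish an isolated hot node from a likely room-wide problem.

    node_temps maps node name -> hottest reading (or None if unavailable).
    limit_c and room_fraction are hypothetical values for this sketch.
    """
    hot = [name for name, temp in node_temps.items() if temp is not None and temp >= limit_c]
    if not hot:
        return "ok"
    if len(hot) / len(node_temps) >= room_fraction:
        return "possible room problem"
    return "node problem"
```

Run from cron on each node, `hottest_reading(read_local_sensors())` gives the per-node value, and a collector that gathers those values can call `classify` to decide whether a handful of alerts is one sick node or the start of a cooling failure.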

Bill Broadley presented an alternative to lm_sensors. If a system using lm_sensors goes down, you can no longer receive any information from its sensors. Bill mentioned an inexpensive standalone temperature probe that can be used to monitor temperature even if a node is shut down. The monitoring data even includes a time stamp, and the device can build a temperature histogram for you. In his cluster they put one behind the machine (what he calls the rack temperature), one on top of the rack (what he calls the room temperature), and one in the air conditioner output, and connect them all to the same wire. He has found them useful in helping to convince facility people that the room was getting hot more often than they thought.
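To show how time-stamped readings can support that kind of argument, here is a small sketch that bins readings into a histogram and reports how often a limit was exceeded. The data layout (a list of `(timestamp, temperature)` pairs), the bin width, and the function names are assumptions for the example and do not describe the probe Bill used.

```python
from collections import Counter


def temperature_histogram(readings, bin_width=5.0):
    """Count readings per temperature bin.

    readings: iterable of (timestamp, temperature) pairs.
    Returns a dict mapping each bin's lower edge to the number of readings in it.
    """
    counts = Counter()
    for _timestamp, temp in readings:
        lower_edge = (temp // bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


def fraction_above(readings, limit):
    """Fraction of readings strictly above the given limit (0.0 if no readings)."""
    readings = list(readings)
    if not readings:
        return 0.0
    return sum(1 for _timestamp, temp in readings if temp > limit) / len(readings)


if __name__ == "__main__":
    # Made-up sample values in whatever units the probe reports.
    room = [("09:00", 21.0), ("10:00", 23.5), ("11:00", 26.0), ("12:00", 31.0)]
    print(temperature_histogram(room))
    print(fraction_above(room, 25.0))
```

Keeping one such list per probe location (rack, room, air conditioner output) makes it easy to compare how often each one ran hot.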

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Jeff Layton has been a cluster enthusiast since 1997 and spends far too much time reading mailing lists. He occasionally finds time to perform experiments on clusters in his basement.

    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.