Cluster Environments

A summary of past postings from the Beowulf mailing list up to October 10, 2003

In the fall of 2003, there was something of a general theme on the Beowulf mailing list: the environment in which our clusters live, that is, the machine room. The topic covers the design of machine rooms and how to save our dear clusters from imminent disaster when the cooling fails. Join us as we take a look at killing power (quickly), building machine rooms, and environment monitoring.

Beowulf: Kill the Power Faster than poweroff?

On the 11th of September 2003, David Mathog asked about ways to shut down a system and kill the power faster than the poweroff command does. He was interested in ways to shut down systems in emergency overheating conditions: he has some Athlon systems that he wanted to shut down in the event of a cooling fan failure. The ensuing discussion was very interesting because it covered not only fast shutdowns but also some old Unix habits.

Initially, David mentioned he wanted something like running a sync command and then powering off the system. The sync command would flush the file system buffers and leave the file system in a consistent state, hopefully completely flushing the journal on a journaled file system. The first suggestion, from Ariel Sabiguero, was to use either the halt -p -f command or poweroff -f. He said that in his tests it took only 3 seconds to shut down his system instead of 20 seconds. David responded that this approach did indeed work quickly, but it was not a clean shutdown: the file system had to be repaired via fsck upon reboot, including fixing inodes. He didn't necessarily mind this since, in his opinion, an fsck is better than fried hardware. Bernd Schubert added that, with a 2.4.21 or later kernel, a series of writes to /proc/sysrq-trigger would force a shutdown in less than a second on his machine.
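
As a rough illustration of the /proc/sysrq-trigger approach, here is a minimal Python sketch. The exact sequence Bernd used is not recorded in this summary, so the keys below (s to sync, u to remount read-only, o to power off) are an assumption drawn from the kernel's standard SysRq commands. The function name and the dry_run flag are hypothetical. A real run needs root and a kernel built with magic SysRq support.

```python
# SysRq command keys from the kernel's SysRq documentation:
#   s = sync all mounted file systems
#   u = remount all file systems read-only
#   o = power the machine off
# The choice and order of keys is an assumption, not the exact
# sequence posted on the list.
EMERGENCY_KEYS = ("s", "u", "o")


def emergency_poweroff(trigger_path="/proc/sysrq-trigger", dry_run=True):
    """Send each SysRq key in turn and return the list of keys handled.

    With dry_run=True (the default) nothing is written, so the function
    can be exercised safely on a machine you do not want to switch off.
    """
    handled = []
    for key in EMERGENCY_KEYS:
        if not dry_run:
            # Each key is a separate one-character write to the trigger file.
            with open(trigger_path, "w") as trigger:
                trigger.write(key)
        handled.append(key)
    return handled
```

Calling emergency_poweroff(dry_run=False) as root would be the live version. As with the forced halt, expect that the file systems may still need an fsck afterwards.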

At this point the discussion turned to the question of how to sync the file system prior to shutdown. Alan Grossfield mentioned the ever-popular system administrator approach of running the sync command three times before shutting down. Donald Becker and others said that this sysadmin habit predates good journaling file systems. Now, just one sync should be sufficient to ensure a consistent file system before shutting down. Who says you can't teach an old sysadmin new tricks?


The final piece of the discussion was how Linux unmounts file systems during shutdown. The esteemed Robert Brown (Bob or rgb to people on the list) started off by mentioning that applications with open files would have to be killed quickly to avoid a race and to satisfy David's initial request for a very fast shutdown. Greg Lindahl provided some great insight into how Linux shuts down. He pointed out that Linux kills processes nicely during shutdown, and that if you want to do it faster, using the kill -9 command will greatly speed things along. Robert Brown also added that during a fast shutdown you might get some of the infamous .nfs20800200 leftover files if the system had an active NFS mount.
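
The difference between a "nice" kill and kill -9 can be sketched for a single child process. This is an illustration only, assuming a POSIX system. The names stop_process and grace_seconds are hypothetical, and a real shutdown script signals every process on the machine rather than one child.

```python
import signal
import subprocess
import sys


def stop_process(proc, grace_seconds=5.0):
    """Ask a child process to exit, then force it if it does not.

    First sends SIGTERM (the polite request a normal shutdown uses),
    waits up to grace_seconds, and only then sends SIGKILL, which is
    what kill -9 does. Returns the child's exit status.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; cannot be caught or ignored
        proc.wait()
    return proc.returncode
```

Setting grace_seconds to zero approximates going straight to kill -9: faster, but processes get no chance to close their files cleanly.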

The moral of these discussions is that if you have to do a very fast shutdown, you should first make sure you are using a journaling file system on all disks on the system in question, and then follow one of the suggested methods to shut down the system. Even so, you could end up having to fsck the file systems. The final moral is that you don't need to run sync three times before shutting down a system.

Beowulf Q: Building a small machine room? Materials/costs/etc.

There was a very interesting discussion about designing machine rooms for clusters, initiated by Brian Dobbins on the 16th of September 2003. He wanted to solicit the advice of people who had experience designing small machine rooms for their clusters. Of course the first reply was from Robert Brown, who has lately taken machine room requirements, especially electrical ones, to heart. He responded with many good comments about power, airflow, structural integrity (primarily weight), sound and light, networking, security, comfort, and convenience. Michael Stein and Bob Brown added many more details on the electrical requirements for supplying power to the room, including estimating the power required, what kind of room power supplies to use, and where to put the power distribution panels.

Bob went on to add further items such as a thermal kill switch for the machine room in the event of a complete cooling failure. He pointed out that when room cooling fails, the room can go from a reasonable temperature to system thermal failure in just a few minutes (it's better to spend some time fixing file systems than to have to purchase all new equipment). He also extended his comments about a raised floor for the machine room. Finally, Bob made the very good point that it is highly recommended to get facilities people involved very early in the design process, not only for the design of the room but also for operational issues such as not shutting down the chillers in the winter just because it's cold outside!
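
To give a feel for the kind of power estimate mentioned above, here is a small back-of-the-envelope sketch in Python. The conversion factors are standard (1 W is about 3.412 BTU/hr, and one ton of cooling is 12,000 BTU/hr), but the function name, the node count, and the per-node wattage are made-up illustrative values, not figures from the thread. The current estimate is simply watts divided by volts and ignores power factor, and nothing here accounts for network gear, lighting, people, or headroom.

```python
WATTS_TO_BTU_PER_HOUR = 3.412   # 1 W of load is about 3.412 BTU/hr of heat
BTU_PER_HOUR_PER_TON = 12000.0  # 1 ton of cooling = 12,000 BTU/hr


def room_estimate(nodes, watts_per_node, volts=120.0):
    """Rough electrical and cooling load for a rack of identical nodes."""
    total_watts = nodes * watts_per_node
    btu_per_hour = total_watts * WATTS_TO_BTU_PER_HOUR
    return {
        "watts": total_watts,
        "amps": total_watts / volts,  # ignores power factor
        "btu_per_hour": btu_per_hour,
        "cooling_tons": btu_per_hour / BTU_PER_HOUR_PER_TON,
    }


# Hypothetical example: 16 nodes drawing 250 W each.
estimate = room_estimate(nodes=16, watts_per_node=250)
for name, value in estimate.items():
    print(f"{name}: {value:.1f}")
```

Even this toy case works out to about 4 kW of heat, a little over one ton of cooling, which is why the people who run the chillers belong in the conversation from the start.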

    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.