Thoughts on Choosing a Cluster File System

Article Index

To use local file systems, you're going to have to do some extra manual work. If each parallel task of your program needs to read from a file, you need to copy the file out to every node before you begin (most cluster management systems have a single command that will do this). While this is easy to do, it takes time. If you must copy the file from a single node to each compute node, you have essentially the same bottleneck as NFS. However, if your program reads the file multiple times, or you will do multiple runs using the same file, this can be a huge win.

If your program needs to write data to a file instead of just read it, you have a bigger problem. If the program expects a single file system, each task is probably going to write to a different part of the file. Using local file systems, you may end up with lots of copies of the same file that are mostly empty space. Merging these back into a single coherent file when the program finishes typically requires some extra code, and may not be for the faint of heart. If you are programming your own application around a local file system model, you may be able to have all programs write data at the beginning of each file; then gathering data is simply a matter of copying the files to one place and concatenating the files together. Again, this takes some time, so your program better be doing enough I/O to amortize the cost of distributing and collecting files before and after the run.

Another option for using all those local disks is to use a parallel file system such as PVFS. PVFS will make use of multiple disks in your cluster, divide your data among them, but still present the user or program with the same set of files on every node. PVFS removes the need for distributing and collecting your data, but it is a little harder to set up and more susceptible to hardware failures than NFS (after all, it's using more hardware). More importantly, PVFS can perform badly if you are accessing small files, or just performing small reads and writes on big files. PVFS provides tremendous performance, but only on applications with tremendous I/O requirements. {mosgoogle right}

How To Advise Users

In the end, you'll end up with a combination of these approaches, perhaps all three. You'll find there are situations where each one works well. The trick is in helping your users recognize these situations, and take advantage of the proper option.

For programs doing only a little I/O, for instance reading a few parameters out of a configuration file, or writing a single line of output every few minutes to mark progress, NFS is the clear choice. You'll find many programs that fit this mold, and you should look no further. At the opposite end of the spectrum, if you have applications using large numbers of nodes doing large amounts of I/O, read/write transactions of a megabyte or more and total I/O of hundreds of megabytes or gigabytes, a parallel file system is absolutely essential. In most cases, NFS will not only get out-performed, it will flat out fail.

Local file systems tend to work best when programs need to make small to medium sized reads and writes repeatedly to the same file or set of files. An important extension of this concept is not a single application that does this but perhaps a large set of runs of an application that may share the same input data. The cost of the initial replication of data across nodes or collection at the end can be high; it's usually not worth it if the data will only be read once and discarded. This approach also works best with users willing to do a little work to squeeze out maximum performance. If it's a program that will run once and be discarded, it's probably not worth the effort to deal with multiple local file systems.

Onward

Once again, we've reached the end of a column about the time I feel like I've finished the introduction. Hopefully, you a general feel for what your file system options are (short of calling your local Storage Area Network vendor), and when the appropriate times to use them. Like many things, the key to using file systems effectively in clusters is to not become to addicted to any one approach, and to spend a little time experimenting with what works best for a particular problem. You'll inevitably find the need for multiple file system options in your cluster anyway; don't hesitate to try the same problem on each. You'll be surprised at the difference it can make.

Finally, an astute reader pointed out that we missed a resource manger in a previous column. SLURM is a production resource manager used and developed at Lawrence Livermore National Labs. It is now more widely available under the GNU public license. Like PBS and LSF, it allows for integration with MAUI and other schedulers. One of the strengths of SLURM is it's ability to tolerate node failures and continue functioning. SLURM is in use on cluster of 1,000 nodes already.

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Dan Stanzione is currently the Director of High Performance Computing for the Ira A. Fulton School of Engineering at Arizona State University. He previously held appointments in the Parallel Architecture Research Lab at Clemson University and at the National Science Foundation.

    Search

    Login And Newsletter

    Create an account to access exclusive content, comment on articles, and receive our newsletters.

    Feedburner

    Share The Bananas


    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.