File Systems O'Plenty Part One: The Basics, Taxonomy and NFS

Storage: it's where we put things

Clusters have become the dominant type of HPC system, but that doesn't mean they are perfect (sounds like a Dr. Phil show, doesn't it?). While you get a huge bang for the buck from them, somehow you have to get the data to and from the processors. Moreover, some applications have fairly benign IO requirements while others need really large amounts of IO. Regardless of your IO requirements, you will need some type of file system for your cluster.

I wrote a file system/storage survey article for clusters in the past, but as always, things change rather rapidly in the HPC arena. Originally, I had wanted to update that article, but the updates became so large that the result is really an entirely new article. So this article, I hope, is a bit more in depth and a bit more helpful than the past file system article.

[Editor's note: This article represents the first in a series on cluster file systems (See Part Two: NAS, AoE, iSCSI, and more! and Part Three: Object Based Storage). As the technology continues to change, Jeff has done his best to "snapshot" the state of the art. If products are missing or specifications are out of date, then please contact Jeff or myself. Personally, I want to stress the magnitude and difficulty of this undertaking and thank Jeff for the amount of work he put into this series.]

Introduction

Storage has become a very important part of clusters and is likely to become even more important as problem sizes grow. For example, the size of CFD meshes grows by almost a factor of 2 every year. As the problem size grows, the file size grows as well. Very soon CFD solutions will become so large that transferring them to and from a user's desktop will consume too much time, and post-processing them on a desktop may not be possible because they will not fit in memory. The best idea then is to visualize, manipulate, and post-process the results where they were created - on the cluster. CFD is but one example of the ever increasing need for more storage and higher speed storage.
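To get a rough feel for how quickly this becomes a problem, here is a small back-of-the-envelope sketch. Only the doubling per year comes from the discussion above. The starting file size (10 GB) and the desktop link speed (1 Gbit/s at full line rate) are assumptions picked purely for illustration, not measurements.

```python
# Back-of-the-envelope: how long does it take to copy a solution file to a
# desktop if the file size doubles every year?
# ASSUMPTIONS (illustrative only): 10 GB starting size, 1 Gbit/s desktop link
# running at full line rate with no protocol overhead.

START_SIZE_GB = 10.0      # assumed size of today's solution file
LINK_GBIT_PER_S = 1.0     # assumed desktop network speed
GROWTH_PER_YEAR = 2.0     # "almost a factor of 2 every year"


def transfer_minutes(size_gb, link_gbit_per_s):
    """Ideal transfer time in minutes for a file of size_gb gigabytes."""
    seconds = size_gb * 8.0 / link_gbit_per_s  # 8 bits per byte
    return seconds / 60.0


def projected_sizes(start_gb, growth, years):
    """File size in GB for year 0 through `years`, inclusive."""
    return [start_gb * growth ** year for year in range(years + 1)]


if __name__ == "__main__":
    sizes = projected_sizes(START_SIZE_GB, GROWTH_PER_YEAR, 5)
    for year, size in enumerate(sizes):
        minutes = transfer_minutes(size, LINK_GBIT_PER_S)
        print(f"year {year}: {size:7.0f} GB, {minutes:6.1f} min to copy")
```

Under these assumptions the copy time doubles each year right along with the file, which is why leaving the data on the cluster starts to look attractive.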

In a 2007 presentation at ISC (the International Supercomputing Conference) in Dresden, IDC pointed out that storage and data management are growing in importance. In a survey that IDC conducted, 92% of the people who responded said they have applications that are constrained by IO. Also, 48% of the people who responded said that they were constrained by total file size. This number grew to 60% in three years. So based on the IDC study, it looks like high speed data storage, particularly parallel file systems, for clusters will become increasingly important.

This article will only discuss distributed and parallel file systems for clusters. I won't be discussing typical Linux file systems such as XFS, ext4, ext3, ext2, ReiserFS, Reiser4, JFS, Btrfs, etc. These file systems are what I think of as local file systems. That is, the physical storage for these systems can be Direct Attached Storage (DAS) or part of a Storage Area Network (SAN). I also won't be discussing file system tools such as LVM, EVMS, md, etc. A discussion of these tools is better left to an article on local file systems.

Moreover, where possible this article will not focus on the hardware part of cluster storage, but rather, it will focus on file systems or parallel storage solutions. The hardware will only be discussed when it is part of an overall solution and can't be separated from the file system. I may disappoint people by not discussing which hard drive manufacturer is better, which type of hard drive is better (hint: SATA drives have the same "failure rate" as Fibre Channel (FC) and SCSI drives), or which RAID card is better, or which RAID level is best for certain workloads, etc. as these questions aren't really the focus of this article.

Due to time constraints, I can only touch on a few file systems. Please don't be alarmed if your favorite is not in this article. Its absence is not intended to dissuade anyone from considering or using that file system. Rather, it was just a choice I made to cut the size of this article and to finish it in a reasonable amount of time. I have tried to cover the popular systems, but alas, popularity depends on application area as well. Finally, the discussion of a file system is not to be considered an endorsement by me or ClusterMonkey. If you think your file system deserves more attention, by all means, contact me (my contact link is at the end of the article) and tell me how you use it and why it works for you. And, ClusterMonkey is always looking for cutting edge writers.

In an effort to scratch the proverbial IO itch, this series of articles is designed as a high level survey that just touches upon file systems (and file system issues) used by clusters. My goal is to at least give you some ideas of available file system options and some links that allow you to investigate. Anything else starts to look like a dissertation and I already did one of those. Before I start, however, I want to discuss some of the enabling technologies for high performance parallel file systems.

Enabling Technologies for Cluster File Systems

I think there are a number of enabling technologies that help make parallel file systems a much more prevalent technology in today's clusters. These technologies are:

  • High-speed networks with a large bandwidth, such as InfiniBand
  • Multi-core processors
  • MPI-2 (MPI-IO)
  • NFS

You may scoff at some of these, but let me try to explain why I think they are enabling technologies. If you disagree or if you can think of other technologies please let me know.

As I discussed in another article, InfiniBand (IB) has very good performance and the price has been steadily dropping. DDR (Double Data Rate) InfiniBand is now pretty much the standard for all IB systems, replacing Single Data Rate (SDR). DDR IB has a theoretical bandwidth of 20 Gbps (gigabits per second), a latency of less than 3 microseconds (depending upon how you measure it), an N/2 of about 110 bytes, and a very high message rate. The price has dropped to about $1,000-$1,400 per port (average costs including switching and cables). In addition, Myrinet 10GigE is also a contender in this space, as is 10GigE (if the price ever comes down). But at the same time, most applications don't need all of that bandwidth. Even with the added communication needs of multi-core processors, many applications will only use a fraction of the bandwidth available in IB or 10GigE. To make the best of all the interconnect capability, vendors and customers are using the leftover bandwidth to feed the data appetite of cluster nodes. So they are running both the computational traffic and the IO traffic over the same network.

In my opinion, another enabling technology is the multi-core processor. People may disagree with me on this issue, but let me explain why I think multi-core processors can be, in a sense, an enabling technology. The plethora of cores could be very useful in terms of IO. There may be opportunities to use one or more cores per socket, or one core per node, to do nothing but IO. Such a core is also programmable (it's a CPU) and very fast compared to other processors in the storage path, such as those on RAID controllers. So it gives you a great deal of flexibility. I think it might be possible to dedicate one of the cores solely to IO processing, freeing the other cores on the system for computing.

Imagine a dual-socket, quad-core node that has a total of 8 cores. Just a few years ago you had at most 2 cores per node. That means only 2 cores had to share a Network Interface Card (NIC) or share access to local storage. Now you have 8 cores vying for the same resources. When each core performs IO you will have 8 cores trying to push data through the same NIC. But what if you could write an application where 7 of the cores did computational processing and one of the cores did all of the IO for the node? When one of the 7 cores needs to perform IO, particularly writes, it just pins the memory holding the data and passes the memory address to the eighth core. Then this eighth core performs the IO while the other seven cores continue processing. When finished with the IO, the eighth core can just release the memory back to the other cores. The HPC community has always talked about overlapping communication and computation. Now you can begin to talk about overlapping computation and IO.
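The hand-off pattern described above can be sketched in a few lines. This is only an illustration of the structure, under several assumptions: Python threads stand in for cores (so there is no real parallel speedup here), a local file stands in for cluster storage, passing an object through a queue stands in for passing a pinned memory address, and every name in the code is hypothetical.

```python
# Sketch of "one worker does all the IO": compute threads hand finished
# buffers to a single IO thread through a queue and keep computing.
# ASSUMPTIONS: threads stand in for cores, a local file stands in for
# cluster storage, and all names here are hypothetical.
import os
import queue
import tempfile
import threading

STOP = None  # sentinel telling the IO thread that no more work is coming


def io_worker(path, work):
    """The 'eighth core': drain buffers from the queue into one file."""
    with open(path, "wb") as out:
        while True:
            item = work.get()
            if item is STOP:
                break
            out.write(item)


def compute_worker(worker_id, steps, work):
    """A compute 'core': produce a result, hand it off, keep going."""
    for step in range(steps):
        result = f"worker {worker_id} step {step}\n".encode()
        work.put(result)  # hand off the buffer; do not wait for the write


def run(path, n_compute=7, steps=3):
    """Run n_compute compute threads plus one IO thread; return lines written."""
    work = queue.Queue()
    io_thread = threading.Thread(target=io_worker, args=(path, work))
    io_thread.start()
    workers = [
        threading.Thread(target=compute_worker, args=(i, steps, work))
        for i in range(n_compute)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    work.put(STOP)  # all results are queued; let the IO thread finish
    io_thread.join()
    with open(path, "rb") as f:
        return len(f.readlines())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run(os.path.join(tmp, "node_output.dat")))
```

The design point is that only one thread ever touches the output file, so the compute workers never block on the write itself; they only pay the cost of putting a reference on the queue.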

However, while I think there is something to the idea of overlapping computation and IO, I think the most important reason multi-core processors are an enabling technology is because of the IO demand they can create. As I previously mentioned, just a few years ago we had at most two cores per node. Now we can easily have eight and soon we will have sixteen (AMD's quad-core CPU with 4 sockets per board or dual-socket Intel Nehalem CPUs with 8 cores per socket in 2008). With this many processors you may have to use a faster network so communication does not become a huge bottleneck (e.g. InfiniBand). So now you've got IB to solve your computational communication problem with lots of cores. As I previously mentioned, you can take advantage of the large amount of bandwidth that is leftover for IO. So multi-core processors could easily drive IO demand in this fashion. It's something of a backwards argument since the multi-core chips are not really driving IO requirements. But since they are driving high speed interconnects on each node, they create the opportunity to put a high-speed storage system on the same network.

In addition, with multi-core processors becoming so prevalent (soon you won't even be able to find new single core processors) you will see people start to run jobs with more and more cores. Just a couple of years ago you would have two cores per node and run on perhaps 8-16 nodes, for a total of 16-32 cores. Now with 8-16 nodes you can get 64-128 cores for about the same hardware price. People will want to run their jobs across more cores and they are going to want to run larger problems. Larger problems and problems spread across more cores can lead to more IO being required. So again, multi-core processors could help drive up IO requirements.

I think another important enabling technology is MPI-2, which includes something called MPI-IO. This is an addition to the MPI standard that covers IO functions. Prior to MPI-2, writing IO in a high-speed and portable manner was very ad hoc. You could use normal POSIX functions, but this was difficult to do effectively if you had complicated data structures or unstructured data. In addition, people sometimes used proprietary toolkits to write data on a parallel file system, thus making portability an issue. With MPI-2, you now have a set of functions that are portable (as long as the MPI implementation is compliant with MPI-2 or at the very least MPI-IO), high performance, and easy to use for complicated data structures or unstructured data. Being part of the MPI standard did not hurt things since most parallel applications use MPI. Now you have a portable way to do IO for MPI codes.
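One of the basic access patterns that MPI-IO standardizes is many processes writing disjoint regions of a single shared file. The sketch below is not MPI-IO and does not use any MPI library. It only imitates the idea, with threads standing in for MPI ranks and the POSIX-only `os.pwrite` standing in for an explicit-offset write such as `MPI_File_write_at`. The fixed record size and all function names are assumptions made for the illustration.

```python
# NOT MPI-IO: a stand-in that mimics explicit-offset writes to a shared file.
# ASSUMPTIONS: threads play the role of MPI ranks, os.pwrite (POSIX only)
# plays the role of an explicit-offset write, and each "rank" owns one
# fixed-size record. All names are hypothetical.
import os
import tempfile
import threading

RECORD_SIZE = 16  # assumed fixed record size per rank, in bytes


def write_block(fd, rank):
    """Write this rank's record at byte offset rank * RECORD_SIZE."""
    payload = f"rank {rank}".encode().ljust(RECORD_SIZE, b".")
    os.pwrite(fd, payload, rank * RECORD_SIZE)


def write_shared_file(path, n_ranks):
    """Have n_ranks 'ranks' write into one file; return the file contents."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        ranks = [
            threading.Thread(target=write_block, args=(fd, r))
            for r in range(n_ranks)
        ]
        for t in ranks:
            t.start()
        for t in ranks:
            t.join()
    finally:
        os.close(fd)
    with open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = write_shared_file(os.path.join(tmp, "shared.dat"), 4)
        print(len(data), data[RECORD_SIZE:2 * RECORD_SIZE])
```

Because every rank computes its own offset, no rank needs to know what the others wrote or in what order they ran. A real MPI-IO code gets the same effect portably, and can additionally describe complicated or unstructured data layouts rather than simple fixed-size records.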

The MPI-2 committee was also very kind to the storage vendors. They created a method where you can pass "hints" to the underlying file system or MPI layer to take advantage of any special features of the storage. If you build your application for one storage platform and use hints, you will have to change those hints for another storage platform. This change is usually just a few lines of code (less than 10 in general). So portability has not been adversely affected and you get to take advantage of the storage.

Finally, one enabling technology that I think people forget about is the file system itself. Clusters need a file system that at the very least can be used by all nodes in the cluster. This capability allows every node running a parallel application to access the same data set as all of the other nodes. But just as important as having a common file system for all of the nodes is that the file system should be a standard and, if at all possible, part of the distribution itself.

If a file system is a standard and part of the OS, then it becomes possible for the hardware of various vendors to work together. This combination includes the hardware of the node and the network. So if I take hardware from vendor X and hardware from vendor Y and plug them into a network, then they should be able to communicate. This is pretty much a given in today's world. But, equally important is the ability for the hardware from vendor X and Y to share data (files). This situation means you can shop for the best price, or the best set of features, or the best support, or any other criteria, and the hardware/OS combination should be able to share data.

The only true distributed file system standard today is NFS. NFS was and is one of the enabling technologies for clusters because, as a standard, it allows different hardware/OS combinations to share data. In fact, a huge number of hardware and software vendors get together at Connectathon every year to test their interoperability, particularly for NFS. Without NFS we probably would have a difficult time with IO for clusters since we would have multiple implementations that would be vendor dependent.

You don't have to use NFS for data sharing in a cluster, however. As long as you use hardware and software in a cluster that all of the nodes have in common, you can use a proprietary file system. This situation does restrict you in some cases to a single vendor. But it also means that you can take advantage of a high performance file system in your cluster. What many of the vendors of proprietary cluster file systems have done is to also provide NFS gateways so you can at least get to the file system using a standard protocol such as NFS.

However, if you read further in this article, you will see that there is about to be a new file system standard that is specifically designed for parallel systems (i.e. clusters).

    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.