
lots of bits in lots of places

In the final part of our world's biggest and best Parallel File Systems Review, we take a look at object-based parallel file systems. Although this installment stands on its own, you may find it valuable to take a look at Part One: The Basics, Taxonomy and NFS and Part Two: NAS, AoE, iSCSI, and more! to round out your knowledge. Let's jump in! And don't miss the biggest and best summary table I have ever created, on the last page.

[Editor's note: This article is the third in a series on cluster file systems. As the technology continues to change, Jeff has done his best to snapshot the state of the art. If products are missing or specifications are out of date, please contact Jeff or myself. In addition, because the article is in three parts, we will continue the figure numbering from part one. Personally, I want to stress the magnitude and difficulty of this undertaking and thank Jeff for the amount of work he put into this series.]

A Brief Introduction to Object Based Storage

While object storage sounds simple, and conceptually it is, there are many gotchas that need to be addressed to make it reliable and useful as a real file system. The key concept of an object-based file system is worth emphasizing because it is a powerful methodology for parallel file systems. In a more traditional, block-based file system, the metadata manager is contacted and a set of inodes or blocks is allocated for a file (assuming a write operation). The client passes the data to the metadata manager, which then sends the data to the file system and ultimately to the disk. Notice that the metadata manager is a key part of the process: it is responsible not only for the metadata itself but also for where the data is located on the storage. This design creates a performance bottleneck.

Object storage takes a different approach, allowing the storage devices themselves to manage where the data is stored. In an object storage system, the metadata manager and the storage are usually separate devices. The storage devices are not just dumb hard drives; they have some processing capability of their own. When a client wants to perform a file operation, such as a write, it contacts the metadata manager. The metadata manager contacts the storage devices and determines where the data can be placed based on rules in the file system (e.g. striping for performance, load or capacity balancing across storage devices, RAID layout for resiliency). It then passes back to the client a map of which storage devices to use for the file operation, along with a set of capabilities. At that point the metadata manager gets out of the way and lets the client contact the assigned storage devices directly. The metadata manager monitors the operations so that it knows roughly where the data is located, but it stays out of the middle of the data path. If something about the file operation changes, such as another client wanting to read or write the same file, the metadata manager gets involved again to arbitrate. In general, though, the data operations happen without the metadata manager in the middle.
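
To make the division of labor concrete, here is a minimal, purely conceptual sketch of the flow just described. The class and function names (MetadataManager, StorageDevice, open_for_write, and so on) are invented for illustration and do not correspond to any real product's API; the point is simply that the metadata server hands back a layout map once, and the bulk data then moves directly between the client and the storage devices.

```python
# Conceptual sketch of an object-storage write path (hypothetical names,
# not any particular product's API). The metadata manager hands the client
# a layout map; the client then talks to the storage devices directly.

class StorageDevice:
    """A storage device with enough smarts to place its own objects."""
    def __init__(self, name):
        self.name = name
        self.objects = {}          # object_id -> bytes

    def put(self, object_id, data):
        self.objects[object_id] = data   # device decides internal placement

class MetadataManager:
    """Tracks which devices hold which objects, but stays out of the data path."""
    def __init__(self, devices):
        self.devices = devices
        self.layouts = {}          # filename -> list of (device, object_id)

    def open_for_write(self, filename, num_stripes=2):
        # Pick devices (e.g. for striping or load balancing) and return a map.
        chosen = [self.devices[i % len(self.devices)] for i in range(num_stripes)]
        layout = [(dev, f"{filename}.{i}") for i, dev in enumerate(chosen)]
        self.layouts[filename] = layout
        return layout              # the "map plus capability" handed to the client

def client_write(mds, filename, data):
    layout = mds.open_for_write(filename)          # one metadata round trip
    stripe = -(-len(data) // len(layout))          # ceiling division
    for i, (dev, oid) in enumerate(layout):        # data goes device-direct
        dev.put(oid, data[i * stripe:(i + 1) * stripe])

devices = [StorageDevice(f"osd{i}") for i in range(4)]
mds = MetadataManager(devices)
client_write(mds, "results.dat", b"x" * 1_000_000)
print({d.name: list(d.objects) for d in devices})
```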

A key benefit of object storage is that if the data on the storage devices needs to be moved, it can easily happen without the intervention of the metadata manager. The movement of the data can be for many reasons, most of them important for performance or resiliency of the file system. The storage devices just move the data and inform the metadata manager where it is now located.

Jan Jitze Krol of Panasas has a great analogy for object storage. Let's assume you are going to a mall to go shopping. You can drive your car into the parking garage and park on the third level, zone 5, seventh spot from the end and then go in and shop. When you finish you come out and you have to remember where you parked and then you go directly to your car. This is an analogy to the way a typical file system works.

For the object storage analogy, let's assume you arrive at the parking garage and a valet parks your car for you. You go in and shop, and when you finish you simply present the valet with your parking ticket; the valet determines where your car is located and gets it for you. The advantage of the valet is that they can move the car around as needed (perhaps they want to work on part of the garage). They can even tear down and rebuild the garage and you won't know it happened. Like the valet, an object-based file system has tremendous flexibility. It can optimize data layout (where your car is located), add more storage (add to the parking garage), check the data to make sure it's still correct (check your car to make sure it's not damaged and, if it is, fix it or inform you), and handle a whole host of other tasks. Notice that these tasks can be done with minimal involvement of the metadata manager.

So you can see that object storage has a great deal of flexibility and great potential for improving file system performance and scalability for clusters. Let's examine the three major object-based file systems for clusters: Panasas, Lustre, and PVFS.

Panasas

Panasas is a storage company focused on the HPC market. Their ActiveStor Parallel Storage Cluster is a high-speed, scalable, global, parallel storage system that uses an object-based file system called PanFS. Panasas couples the file system with a commodity-based storage system that uses blades (called Panasas Storage Blades). The basic storage unit is a 4U chassis, called a shelf, that can accommodate up to 11 blades. Each blade is a complete system with a small motherboard, CPU, memory, 1 or 2 disks, and two internal GigE connections. There are two types of blades: a Director Blade (DB), which fills the role of a metadata manager and contains only a single disk, and a Storage Blade (SB), which stores data.

Each shelf can hold up to three Director Blades or 11 Storage Blades. The most common configuration is 1 Director Blade and 10 Storage Blades. Currently, Panasas offers 500GB, 1TB, and 1.5TB storage blades, giving a nominal capacity of 5TB, 10TB, and 15TB per shelf. A Storage Blade can have either 512MB or 2GB of memory; the extra memory is used as cache. Each shelf also has redundant power supplies as well as battery backup. Each blade is connected to an internal 16-port GigE switch: eleven of the ports are internal and connect to the blades, four are used to connect the shelf to an outside network, and the remaining GigE link can be used for diagnostics.

The object-based file system, PanFS, is the unique feature of Panasas' storage system. As data is written to PanFS, it is broken into 64KB chunks. File attributes, an object ID, and other information are added to each chunk to complete an object. The resulting object is then written to one of the storage blades in the file system. Files smaller than 64KB are written to PanFS using RAID-1 on two different storage blades. Files larger than 64KB are written using RAID-5 across independent storage blades. Using more than one blade and more than one hard drive gives PanFS parallelism, and therefore speed. The objects are also spread across the storage blades to provide capacity load balancing, so the amount of used space on each blade stays about the same.
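
As a rough illustration of this per-file layout decision, the sketch below mirrors files under 64KB and stripes larger files into 64KB chunks plus an XOR parity chunk. This is not Panasas code; the chunking, the parity math, and the blade assignment are simplified stand-ins for what PanFS actually does internally.

```python
# Illustrative sketch of a per-file layout decision (not Panasas code):
# small files are mirrored (RAID-1), larger files are striped with a
# parity chunk (RAID-5 style) across independent storage blades.

CHUNK = 64 * 1024   # 64KB objects, as described in the text

def xor_parity(chunks):
    size = max(len(c) for c in chunks)
    parity = bytearray(size)
    for c in chunks:
        for i, b in enumerate(c):
            parity[i] ^= b
    return bytes(parity)

def layout_file(data, num_blades):
    if len(data) < CHUNK:
        # RAID-1: two copies on two different blades
        return {"scheme": "RAID-1", "objects": [(0, data), (1, data)]}
    # RAID-5 style: split into 64KB chunks, add parity, spread over blades
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    chunks.append(xor_parity(chunks))
    return {"scheme": "RAID-5",
            "objects": [(i % num_blades, c) for i, c in enumerate(chunks)]}

print(layout_file(b"a" * 1000, 10)["scheme"])      # RAID-1
print(layout_file(b"a" * 500_000, 10)["scheme"])   # RAID-5
```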

There is a subtle distinction in how Panasas does RAID and how other file systems, particularly non-object file systems, use RAID. For non-object based file systems, RAID is done on the block level. So the hardware is what is actually RAID-ed. In the case of PanFS, it is the file itself that is RAID-ed. This gives huge advantages to PanFS that I'll talk about in a bit.

Another aspect of PanFS that gives it performance is the Panasas-developed data access protocol, DirectFlow. In virtually all other file systems, if a client requests data, either to read or write, the metadata manager is always involved in the transaction even when it is not needed, and this single bottleneck can limit performance. With DirectFlow, when a client requests data it first contacts the metadata manager that controls the part of the file system containing that data. The metadata manager passes back a map of where the data is located along with a set of capabilities. The client then uses the map and capabilities to communicate directly with the storage blades. This communication happens in parallel, so performance is greatly improved, and the metadata manager is not a bottleneck in the data access process. These features give PanFS its performance. Panasas has measured the performance of a single shelf with a single Director Blade at 350MB/s for reading data.

In addition to DirectFlow, PanFS can also be used with the NFS and CIFS protocols. Each Director Blade acts as an NFS and CIFS server, so if you have more than one DB you can load balance NFS and CIFS across them. In effect, for NFS and CIFS needs it becomes a clustered NAS solution. This makes the file system very useful for desktops or for systems that do not run Linux.

In any file system, one of the most important aspects to consider is its resiliency, or its ability to tolerate failures and still function. In the case of PanFS, if a storage blade fails, some set of files is damaged (i.e. some objects are missing), while the remaining objects and their parity are stored on other blades. As soon as a blade or disk failure is detected, all of the Director Blades start the reconstruction process. Since the metadata manager knows which files were on the failed blade, the Director Blades rebuild the missing objects for only the affected files and put them on the remaining storage blades. One of the key features of PanFS, or any object-based storage system, is that only the files affected by the blade or disk loss are reconstructed. Contrast this with the case of block-based storage.

In block-based storage the drives are attached to a RAID controller. When a drive fails, every block on the failed drive must be reconstructed, not just the blocks that actually held data. This means that reconstruction takes longer, which also means there is a larger window of vulnerability in which another drive could fail. In block-based storage you also have only a single RAID controller doing the reconstruction. With Panasas, because of the object-based nature of the file system, all of the Director Blades can participate in the reconstruction, reducing the time to rebuild the data. In fact, the rebuild rate scales linearly with the number of Director Blades.

Moreover, during block-based reconstruction, if an unrecoverable read error (URE) is encountered, the drive with the URE is considered failed. For a RAID-5 array, this means the reconstruction process fails and you have to restore all of the data from a backup. For a RAID-6 array, the data on the second failed drive can be recovered and the reconstruction can continue, but you no longer have any protection against a further failure. In the case of PanFS, if a URE is encountered, the file associated with the bad media sector is marked as bad, but the reconstruction process continues with the remaining files. This is true for any number of UREs that might be encountered during reconstruction. Only the affected file(s) then need to be copied from a backup, which is far less data than restoring an entire RAID array from backup just because of a failed drive.
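
The reason only affected files need attention is that the parity is per file: any one missing chunk of a file can be recomputed by XORing that file's surviving data and parity chunks, while files with no chunk on the failed device are untouched. Here is a tiny sketch of that recovery step, illustrative only and not PanFS code:

```python
# Sketch of why only affected files need rebuilding: with per-file parity
# (as in the RAID-5 style layout above), a missing chunk is recomputed by
# XORing the survivors; files with no chunk on the failed device are untouched.

def rebuild_missing(surviving_chunks):
    """XOR the surviving data and parity chunks to recover the lost chunk."""
    size = max(len(c) for c in surviving_chunks)
    lost = bytearray(size)
    for c in surviving_chunks:
        for i, b in enumerate(c):
            lost[i] ^= b
    return bytes(lost)

# Example: three data chunks plus parity; pretend chunk 1 was on the failed blade.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = bytes(a ^ b ^ c for a, b, c in zip(*data))
recovered = rebuild_missing([data[0], data[2], parity])
assert recovered == data[1]
```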

The current generation of ActiveStor software, ActiveStor 3.x, which includes PanFS, adds several new features designed to help prevent UREs and other data corruption. A daemon runs in the background to check all of the media sectors on the drives as well as the parity of existing files. With Panasas' new Tiered Parity, PanFS can repair files from parity information even if a media sector within a drive goes bad; normally, with block-based schemes, if a bad sector is encountered the file has to be restored from a backup. The daemon also touches every sector on the disk, causing bad sectors to be remapped (note: if a sector is marked as bad, Tiered Parity can restore that sector without having to restore the file from backup). RAID controllers can also "walk" the sectors, forcing bad sectors to be remapped, but they cannot restore sectors that contain data; those files have to be restored from backup.

Tiered Parity also includes the capability of checking the data that has been sent over the network (network parity). The client makes sure that the data delivered by the storage wasn't corrupted on its way to the client. This could become increasingly important as the number of IO devices and the number of network connections increase.
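
Conceptually, network parity is an end-to-end integrity check: the storage side sends a checksum along with the data and the client verifies it on arrival, re-requesting the data if the check fails. Below is a minimal sketch; the function names are invented, and PanFS's actual mechanism is internal to the product.

```python
# Minimal sketch of an end-to-end ("network parity") style check: the sender
# attaches a checksum and the receiver verifies it before using the data.
# Function names are illustrative, not part of any real product's API.

import zlib

def send_with_checksum(payload: bytes):
    return payload, zlib.crc32(payload)

def receive_and_verify(payload: bytes, checksum: int) -> bytes:
    if zlib.crc32(payload) != checksum:
        raise IOError("data corrupted in transit; re-request from storage")
    return payload

data, crc = send_with_checksum(b"object contents")
ok = receive_and_verify(data, crc)
```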

ActiveStor also offers High Availability as an option, with fail-over between Director Blades. This allows one Director Blade (DB) to take over the functions of another in the event of a DB failure. When the failed DB is replaced, the functions it was performing can easily be migrated to the new DB. There is also a feature called Blade Drain that allows a blade to be marked as down; PanFS then moves all of the existing objects from that blade to other blades. This lets the admin mark a suspect blade as down and move the data off of it before the blade fails completely.

The ActiveStor Storage Cluster is in use at many HPC sites, with systems ranging from just a few Terabytes to almost a Petabyte. Since ActiveStor uses GigE connections, it is relatively easy to connect the storage to an existing cluster, which is likely to have GigE connections to the nodes already. The DirectFlow protocol is implemented via a kernel module, and Panasas supports a wide range of kernels and distributions.

Lustre

One of the few open-source parallel file systems is Lustre. Lustre is an object-based file system that can potentially scale to tens of thousands of nodes and Petabytes of data. Lustre stores data as objects, called containers, that are very similar to files but are not part of a directory tree. The objects are written to an underlying file system, currently a modified ext3, since Lustre really only runs on Linux. The modifications improve the performance and scalability of ext3, and the Lustre team is very good about pushing their patches back into the mainstream ext3 source.

The advantage of an object-based file system is that allocation management of data is distributed over many nodes, avoiding a central bottleneck. Only the file system needs to know where the objects are located, not the applications, so the data can be put anywhere in the file system. This gives the designers of the file system tremendous flexibility to enhance performance, robustness, resiliency, and so on.

As with other file systems, Lustre has a metadata component, a data component, and a client component that accesses the data. However, Lustre allows these components to be put on various machines, or on a single machine (usually only for home clusters or for testing). The metadata is handled by a server called a MetaData Server (MDS). In version 1.x of Lustre there is only one active MDS, but it can be paired with a second MDS for fail-over in the event that the first fails (one in active mode and one in standby mode). In Lustre 2.x the goal is to have tens or hundreds of MDS machines.

The file system data itself is stored as objects on the Object Storage Server (OSS) machines. The data can be spread across the OSS machines in a round-robin fashion (striped in a RAID-0 sense) to allow parallel data access across many nodes, resulting in higher throughput. Spreading the data also means that the failure of a single OSS machine does not take all of the data with it, although the data is not written in any kind of RAID format that would help preserve or recover it. The client machines mount the Lustre file system in the same way other file systems are mounted; Lustre mount points can be put into /etc/fstab or in an automounter.
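
The striping itself is simple arithmetic: a byte offset in a file maps to one stripe object (and therefore one OSS/OST) plus an offset inside that object. The sketch below shows that mapping in the abstract; it is not Lustre code, and in a real deployment the stripe size and count are set per file or per directory with Lustre's own tools.

```python
# Sketch of RAID-0 style striping arithmetic (illustrative, not Lustre code):
# a file offset maps to a particular object/OST plus an offset within it.

def stripe_location(offset, stripe_size, stripe_count):
    stripe_index = offset // stripe_size          # which stripe unit overall
    ost = stripe_index % stripe_count             # round-robin over the OSTs
    obj_offset = (stripe_index // stripe_count) * stripe_size + offset % stripe_size
    return ost, obj_offset

# With a 1MB stripe size over 4 OSTs, byte 5,500,000 of the file lands on:
print(stripe_location(5_500_000, 1 << 20, 4))     # (1, 1305696)
```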

For clusters, the MDS and OSS functions are put on separate machines. As previously mentioned, the current version of Lustre limits you to two MDS machines, only one of which is active, but you can have as many OSS machines as you want, need, or can afford. There are some very large Lustre sites in production. At large sites the MDS machines are usually very large, shall we say "beefy", machines that can handle the metadata traffic. But with object-based file systems the idea is to minimize the amount of metadata involvement and traffic in a data operation (read or write). Consequently, these large Lustre configurations can easily handle thousands of clients and tens or even hundreds of OSS machines without overly taxing the MDS machine.

Lustre uses an open network protocol to allow the components of the file system to communicate. The protocol, called Portals, was originally developed at Sandia National Laboratories. It abstracts the networking portion of Lustre so that new networks can be added easily. Currently Lustre supports TCP networks (Fast Ethernet, Gigabit Ethernet, 10GigE), Quadrics Elan, Myrinet GM, Scali SDP, and InfiniBand. Lustre also uses Remote Direct Memory Access (RDMA) and OS-bypass capabilities to improve I/O performance.

Notice that Lustre is purely a file system: you have to select the storage hardware you want to use for the OSS and MDS systems. Examining some of the Lustre systems running around the world gives some insight into how to architect such a system. For example, it's advisable to use pairs of OSS machines configured in fail-over mode so that if one OSS fails, the other can continue to access the storage. It's also advisable to use some sort of RAID protection on the storage itself so that you can recover from a failed disk; if you are using large disks or a large number of disks in a single RAID group, RAID-6 is advisable. Just be aware that RAID-6 requires more disks for a given capacity and is expensive in terms of parity computations (i.e. it taxes the RAID controller heavily). It is also a good idea to configure the MDS node with HA (High Availability) in case the MDS node goes down.

One of the longstanding complaints about Lustre has been that you have to patch the kernel for clients to use it. Very few people could take the patches, apply them to a kernel.org kernel, and successfully build and run Lustre, so people had to rely on ClusterFS for their kernels. This was a burden on the users and on ClusterFS. With the advent of Lustre 1.6.x, ClusterFS now offers a patchless client if you use kernel 2.6.15-16 or later. According to ClusterFS, there may be a performance reduction when using the patchless client, but for many people this is worthwhile since they can quickly rebuild a Lustre client kernel if need be.

To help finance development, Lustre follows a slightly unusual open-source model. The newest version of Lustre is only available from Cluster File Systems, Inc. (ClusterFS), the company developing Lustre, while the previous version is freely available from www.lustre.org. The open-source community around Lustre is pretty good about helping each other, and even people from ClusterFS will chime in to help. You can also buy support from ClusterFS for the current version or older versions.

Recently, Sun purchased ClusterFS. As part of the announcement, Sun and ClusterFS committed to keeping the current licensing for Lustre and to supporting people who use Lustre on non-Sun hardware. In an earlier announcement, ClusterFS said it would develop Lustre to use ZFS as the underlying file system; it is assumed that this version will be available only for Sun hardware.

About a year ago, some performance testing was done with Lustre on a cluster at Lawrence Livermore National Laboratory (LLNL) with 1,100 nodes. Lustre was able to achieve a throughput of 270 MB/s per OST and an average client I/O throughput of 260 MB/s. In total, for the 1,000 clients within the cluster, it achieved an I/O rate of 11.1 GB/s.


PVFS

One of the first parallel file systems for Linux clusters was PVFS (Parallel Virtual File System). PVFS is an open-source project with a number of contributors. PVFS is not designed to be a persistent file system for something like users' home directories, but rather a high-speed scratch file system for storing data during the run of an HPC job.

The original PVFS system, now called PVFS1, was developed at Clemson University, but with the advent of PVFS2 it is no longer used as much. PVFS2 is a complete rewrite of PVFS1 focused on improving the scalability, flexibility, and performance of PVFS. Currently PVFS2 natively supports TCP, InfiniBand, and Myrinet networks. In addition, PVFS2 accommodates heterogeneous clusters, allowing x86, x86_64, PowerPC, and Itanium machines to all be part of the same PVFS file system. PVFS2 also adds management interfaces and tools to allow easier administration.

PVFS2 divides the functions of the file system into two pieces: metadata servers and data servers. In practice, PVFS2 has only one type of server, pvfs2-server; the function a particular server fulfills is defined in a configuration file. Any given pvfs2-server can act as a metadata server, a data server, or both. PVFS2 can also accommodate multiple metadata servers (PVFS1 could only accommodate one). After starting the metadata and data servers, the clients mount PVFS as if it were a normal file system.

The "V" in PVFS stands for Virtual. This means that it doesn't write directly to the storage devices, but instead the data resides in another file systems that does the actual IO operations to the storage devices. When data is written to PVFS, it is sent to the underlying file system in chunks. The underlying file system then writes the data to the storage. For example, you could use ext3, ResierFS, JFS, XFS, ext2, etc. as the underlying file system for PVFS.

The virtual part of the file system sounds more complicated than it actually is. If you look at a PVFS directory on a particular underlying file system, you will see lots of smaller files with strange names. Each of these files is part of a PVFS file; if you combine them in the appropriate way, you get the real file that is stored in PVFS. So you can think of PVFS as a file system on top of a file system. Consequently PVFS looks a lot like an object-based file system, and is, in fact, an object-based file system.
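
A toy version of that "file system on top of a file system" idea is sketched below: a logical file is split into chunk files that live in ordinary local directories (standing in for the data servers' underlying file systems), and reading it back simply reassembles the chunks in round-robin order. All names and the 64KB chunk size here are invented for illustration; real PVFS naming and distribution are more involved.

```python
# Conceptual sketch of the "virtual" layer (names invented for illustration):
# a logical file is stored as chunk files inside ordinary local file systems,
# and reading it back just reassembles the chunks in round-robin order.

import os

CHUNK = 64 * 1024

def pvfs_like_write(data, server_dirs, name):
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    for i, chunk in enumerate(chunks):
        path = os.path.join(server_dirs[i % len(server_dirs)], f"{name}.{i:06d}")
        with open(path, "wb") as f:      # the underlying file system does the IO
            f.write(chunk)
    return len(chunks)

def pvfs_like_read(server_dirs, name, num_chunks):
    out = bytearray()
    for i in range(num_chunks):
        path = os.path.join(server_dirs[i % len(server_dirs)], f"{name}.{i:06d}")
        with open(path, "rb") as f:
            out += f.read()
    return bytes(out)

dirs = ["server0", "server1", "server2"]
for d in dirs:
    os.makedirs(d, exist_ok=True)
data = os.urandom(300_000)
n = pvfs_like_write(data, dirs, "logical_file")
assert pvfs_like_read(dirs, "logical_file", n) == data
```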

This approach gives PVFS a great deal of flexibility. For example, you could have several storage nodes running ext3 and another set running XFS, and PVFS will work perfectly well with all of them. When data is written to PVFS it is broken into chunks of a certain size and then sent to the data servers, usually in a round-robin fashion. The size of the chunks, which storage nodes are used, and how the chunks are written are all configurable, allowing PVFS to be tuned for maximum performance for a given application or class of applications.

There are a number of features PVFS2 uses for improved performance. For example, it can utilize multiple network interfaces on the client node, so if a node has both an Ethernet NIC and a Myrinet NIC, both can carry PVFS data. You can also change the underlying file system for improved performance; the behavior of the local file system can have a big impact on PVFS2 performance, so there is room for improvement there. You can also use extended attributes to set directory hints to improve performance.

Recall that PVFS is designed to be a high-speed scratch file system; consequently, it does not have much fault tolerance built into it. You can achieve some level of fault tolerance, but you have to configure it yourself. For example, PVFS2 does not natively tolerate disk failures, but you can use disks that are mirrored or behind a RAID controller to provide some fault tolerance. You can do the same for network interfaces by using more than one interface. You can even make the servers fault tolerant by using HA between two servers; the PVFS website has a paper on how to go about using HA.

PVFS is fault tolerant in one respect: if you lose a server, the file system will not go down. For example, let's assume you have several data servers and one of them goes down while it is being used by clients. Any client that was actively writing to that server loses the data that was on the network or in the process of being written, but any client that was not using the server continues to run just fine. A client that was using the server but not actively writing to it will continue to run and simply will not use the down server during its next IO phase. If the down server is returned to service without any changes to the file system underneath PVFS, the data that was on the server becomes available once again; but if the disks in the down server are wiped, for example, then the data is lost. This is a design decision that the PVFS developers made. PVFS is designed to be a high-speed storage space, not a place for home directories. The intent is to use PVFS to store data during an application run and then move the data off of PVFS onto permanent storage that is backed by tape or some other type of archiving media.

The decision to focus on PVFS as a high-speed scratch system, rather than a file system with lots of redundancy, has freed the developers to focus on improving performance without being shackled by compatibility with a traditional file system interface. The developers have created several ways to access PVFS from the clients. First, there is a system interface that is used as a low-level interface for all of the other layers. The second interface is the management interface, intended for administrative tasks such as fsck or for low-level file information. Then there is a Linux kernel driver, really a kernel module that can be loaded into an unmodified Linux kernel, so that the Linux VFS (Virtual File System) can access PVFS2; this allows standard UNIX commands such as ls and cp to work correctly. Finally, there is a ROMIO PVFS2 interface that is part of the standard ROMIO distribution. ROMIO supports MPI-IO for various MPI libraries, and using this interface ROMIO can better utilize PVFS2 for IO functions.
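
For completeness, here is a minimal mpi4py sketch of what MPI-IO access through ROMIO looks like from an application's point of view: each rank writes its own block to a disjoint offset with a collective call. The "pvfs2:" prefix and the mount point are assumptions about how a particular MPI/PVFS2 installation is built; on many systems a plain path on the mounted file system works just as well.

```python
# Minimal mpi4py sketch of MPI-IO through ROMIO. The "pvfs2:" prefix and the
# mount point are assumptions about a particular installation; a plain path
# on the mounted file system is the more portable alternative.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank contributes a 1MB block; ranks write to disjoint offsets.
block = np.full(1 << 20, rank, dtype=np.uint8)

fh = MPI.File.Open(comm, "pvfs2:/mnt/pvfs2/mpiio_test.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
fh.Write_at_all(rank * block.nbytes, block)   # collective, parallel write
fh.Close()
```

Launched with something like mpiexec -n 4, this produces a single 4MB file written in parallel by all four ranks.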

The performance of PVFS is heavily dependent upon the network. Typical Ethernet networks will work well but will limit scalability and ultimate performance; high-speed networks such as Myrinet, Quadrics, and InfiniBand (IB) greatly improve both. In tests, the IO performance of PVFS scales linearly with the number of data servers up to at least 128, at which point performance exceeds 1 Gigabyte/s. PVFS2 has been tested with 350 data servers, each with a simple IDE drive and connected via Myrinet 2000. With 100 clients, PVFS2 was able to achieve an aggregate performance of about 4.5 Gigabytes/s for writes and almost 5 Gigabytes/s for reads running an MPI-IO test code.

PVFS2 has also been ported to IBM's BlueGene/L. In a fairly recent test, PVFS2 was run on a BlueGene/L at Argonne National Laboratory. In a test using MPI-IO-TEST on 1,024 CPUs, 32 IO nodes (part of BlueGene/L), and 16 P4 storage nodes, PVFS2 achieved an aggregate of about 1.2 GB/s for reads and about 350 MB/s for writes.

So Long and Thanks for All of the File Systems

The file systems that I have covered in this article are just some of what's available. I've tried to cover the more popular file systems for clusters, but undoubtedly I have missed some. Moreover, there may be file systems, such as OCFS2, that have not yet been tested or even considered for clusters. I invite you to investigate other file systems; after reading this series of articles, you should be able to look at them with a more critical eye.

The big summary table is next!


Quick Summary

Because the article is so long, I wanted to include a table summarizing the features of each file system.



Distributed File Systems
NFS/NAS
  Networking:
  • TCP, UDP
  • NFS/RDMA (InfiniBand) very soon
  Features:
  • Easy to configure and manage
  • Well understood (easy to debug)
  • Client comes with every version of Linux
  • Can be cost effective
  • Provides enough IO for many applications
  • May be enough capacity for your needs
  Limitations:
  • Single connection to network
  • GigE throughput is about 100 MB/s max.
  • Limited aggregate performance
  • Limited capacity scalability
  • May not provide enough capacity
  • Potential load imbalance (if using multiple NAS devices)
  • "Islands" of storage are created if you use multiple NAS devices

Clustered NAS
  Networking:
  • Currently TCP only (almost entirely GigE)
  Features:
  • Usually a more scalable file system than other NAS models
  • Only one file server is used for the data flow (forwarding model could potentially use all of the file servers)
  • Uses NFS as protocol between client and file server (gateway)
  • Many applications don't need large amounts of IO for good performance (can use low gateway/client ratio)
  Limitations:
  • Can have scalability problems (block allocation and write traffic)
  • Load balancing problems
  • Need a high gateway/client ratio for good performance

AFS
  Networking:
  • Currently TCP only (primarily GigE)
  Features:
  • Caching (clients cache data, and servers can go down without loss of access to data)
  • Security (Kerberos and ACLs)
  • Scalability (additional servers just increase the size of the file system)
  Limitations:
  • Limited single-client performance (only as fast as data access inside an individual node)
  • Not in widespread use
  • Uses UDP
  Example vendors: Open-source (link)

iSCSI
  Networking:
  • Currently TCP only (primarily GigE)
  Features:
  • Allows for extremely flexible configurations
  • Software (target and initiator) comes with Linux
  • Centralized storage (easier administration and maintenance)
  • You don't have to use just SCSI drives
  Limitations:
  • Performance is not always as fast as it could be
  • Requires careful planning (not a limitation, but just a requirement)
  • Centralized storage (if centralized storage goes down, all clients go down)
  Example vendors: Open-source (link)

HyperSCSI
  Networking:
  • Currently IP only (it uses its own packets)
  Features:
  • Performance can be faster than iSCSI (since it uses its own packet definition, it can be more efficient than TCP)
  • Allows for very flexible configurations
  Limitations:
  • Hasn't been updated in a while
  • Cannot route packets since they aren't UDP or TCP
  Example vendors: Open-source (link)

AoE
  Networking:
  • Currently IP only (it uses its own packets)
  Features:
  • Performance can be faster than iSCSI (since it uses its own packet definition, it can be more efficient than TCP)
  • Drivers are part of the Linux kernel
  Limitations:
  • Uses ATA protocol (really a requirement and not a limitation)
  • Cannot route packets since they aren't UDP or TCP
  Example vendors: Open-source (link)

dCache
  Networking:
  • Currently TCP
  Features:
  • Can use spare disk space on all available machines (even clients)
  • Tertiary Storage Manager (HSM)
  Limitations:
  • Performance (it's only as fast as the local storage)
  • Limited use (primarily high-energy physics labs)
  Example vendors: Open-source (link)


Parallel File Systems
GPFS
  Networking:
  • Currently TCP only
  • Native IB soon (4.x)
  Features:
  • Probably the most mature of all parallel file systems
  • Variable block sizes (up to 2MB)
  • 32 sub-blocks (can help with small files)
  • Multi-cluster
  • IO pattern recognition
  • Can be configured with fail-over
  • NFS and CIFS gateways
  • Open portability layer (makes kernel updates easier)
  • File system only solution (you can select whatever hardware you want)
  Limitations:
  • Pricing is by the server and client (i.e. you have to pay for every client and server)
  • Block-based (has to use a sophisticated lock manager)
  • Can't change block size after deployment
  • Current access is via TCP only (but IB is coming in version 4.x)
  • File system only solution (it allows people to select unreliable hardware)
  Vendors: IBM

Rapidscale
  Networking:
  • Currently TCP only (primarily GigE)
  Features:
  • Uses standard Linux tools (md, lvm)
  • Distributed metadata
  • Good load balancing
  • NFS and CIFS gateways
  • High availability
  Limitations:
  • More difficult to expand capacity while load balancing
  • Dependent on RAID groups of disks for resiliency and reconstruction
  • Modified file system, modified iSCSI
  • Current network protocol is TCP (limits performance)
  • Must use Rackable hardware
  Vendors: Rackable

IBRIX
  Networking:
  • Currently TCP only (primarily GigE)
  Features:
  • Can split files and directories across several servers
  • Can split a directory across segment servers (good for directories that have lots of IO and lots of files)
  • Segment ownership can be migrated from one server to another
  • Segments can be taken off-line for maintenance without bringing the entire file system down
  • Can configure HA for segment fail-over
  • Snapshot tool
  • File replication tool
  • File system only solution (you can select whatever hardware you want)
  • Distributed metadata
  • NFS and CIFS gateways
  Limitations:
  • Administration load can be higher than other file systems (some of this is due to the flexibility of the product)
  • Dependent on RAID groups of disks for resiliency and reconstruction
  • Native access is currently only TCP (limits performance)
  • File system only solution (it allows people to select unreliable hardware)
  • Rumors of having to pay for each client as well as segment servers (data servers)
  Vendors: IBRIX

GlusterFS
  Networking:
  • TCP
  • InfiniBand
  Features:
  • Open-source
  • Excellent, very fast performance
  • Can use almost any hardware
  • Plug-ins (translators) provide a huge amount of flexibility and tuning capability
  • File system only solution (you can select whatever hardware you want)
  • No metadata server
  • Automated File Recovery (AFR) and auto-healing if a data server is lost
  • NFS and CIFS gateways
  Limitations:
  • Relatively new
  • Dependent on RAID groups of disks for resiliency and reconstruction
  • File system only solution (it allows people to select unreliable hardware)
  • Extremely flexible (it takes some time to configure the file system the way you want it)
  Vendors: Open-source (link)

EMC Highroad (MPFSi)
  Networking:
  • TCP
  • Fibre Channel
  Features:
  • Uses iSCSI as data protocol
  • NFS and CIFS gateways
  • Uses EMC storage so backups may be easier
  Limitations:
  • Only EMC hardware can be used
  • Dependent on RAID groups of disks for resiliency and reconstruction
  • Single metadata server
  • FC protocol requires an FC HBA in each node and an FC network ($$)
  • Most popular deployments use TCP (limits performance)
  Vendors: EMC

SGI CXFS
  Networking:
  • TCP (metadata) and FC (data)
  Features:
  • Multiple metadata servers (although only 1 is active)
  • Lots of redundancy in design (recovery from data server failure)
  • Guaranteed IO rate
  • NFS and CIFS gateways?
  Limitations:
  • Doesn't scale well on clusters with many nodes
  • FC protocol requires an FC HBA in each node and an FC network ($$)
  • Only one active metadata server
  • NFS and CIFS gateways?
  • Dependent on RAID groups of disks for resiliency and reconstruction
  • Restricted to SGI-only hardware
  Vendors: SGI

Red Hat GFS
  Networking:
  • Fibre Channel (FC)
  • TCP (iSCSI)
  Features:
  • Open-source
  • Global locking
  • Can use almost any hardware for storage
  • Quotas
  • NFS and CIFS gateways
  Limitations:
  • Limited expandability (but the limit is large)
  • Dependent on RAID groups of disks for resiliency and reconstruction
  Vendors: Open-source (link)


Object Based File Systems/Storage
Panasas
  Networking:
  • Currently TCP only (primarily GigE)
  Features:
  • Object based file system
  • Easy to set up, manage, and expand
  • Performance scales with shelves
  • Distributed metadata
  • Metadata fail-over
  • Fast reconstruction in the event of a disk failure
  • Disk sector scrubbers (look for bad sectors)
  • Can restore a sector if it is marked bad
  • Network parity
  • Blade Drain
  • NFS and CIFS gateways (scalable NAS)
  • Coupled hardware/software solution (more like an appliance)
  Limitations:
  • Have to use Panasas hardware
  • Limited small-file performance
  • Kernel modules for kernel upgrades come from Panasas
  • Single client performance is limited by network (TCP)
  • Coupled hardware/software solution (limits hardware choice)
  Vendors: Panasas

Lustre
  Networking:
  • TCP
  • Quadrics Elan
  • Myrinet GM and MX
  • InfiniBand (Mellanox, Voltaire, Infinicon, OFED)
  • RapidArray (Cray XD1)
  • Scali SDP
  • LNET (Lustre Networking)
  Features:
  • Open-source
  • Object based file system
  • Can use a wide range of networking protocols
  • Can use native IB protocols for much higher performance
  • Excellent performance with a high-speed network
  • NFS and CIFS gateways (scalable NAS)
  • File system only solution (you can select whatever hardware you want)
  Limitations:
  • Single metadata server
  • Dependent on RAID groups of disks for resiliency and reconstruction
  • File system only solution (it allows people to select unreliable hardware)
  Vendors: Lustre

PVFS
  Networking:
  • TCP
  • Myrinet (gm and mx)
  • Native IB protocols
  • Quadrics Elan
  Features:
  • Object based file system
  • Easy to set up
  • Distributed metadata
  • Open-source
  • High-speed performance
  • Can use multiple networks
  • File system only solution (you can select whatever hardware you want)
  Limitations:
  • Lacks some of the resiliency of other file systems (but wasn't designed for that functionality)
  • File system only solution (it allows people to select unreliable hardware)
  Vendors: PVFS


I want to thank Marc Unangst, Brent Welch, and Garth Gibson of Panasas for their help in understanding the complex world of cluster file systems. While I haven't come close to achieving the understanding that they have, I'm much better off than when I started. This article, an attempt to summarize the world of cluster file systems, is the result of many discussions in which they answered many, many questions from me. I want to thank them for their help and their patience.

I also hope this series of articles, despite its length, has given you some good general information about file systems and even storage hardware. And to borrow some parting comments, "Be well, Do Good Work, and Stay in Touch."

A much shorter version of this article was originally published in ClusterWorld Magazine. It has been greatly updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Dr. Jeff Layton hopes to someday have a 20 TB file system in his home computer. He lives in the Atlanta area and can sometimes be found lounging at the nearby Fry's, dreaming of hardware and drinking coffee (but never during working hours).

© Copyright 2008, Jeffrey B. Layton. All rights reserved.
This article is copyrighted by Jeffrey B. Layton. Permission to use any part of the article or the entire article must be obtained in writing from Jeffrey B. Layton.