
lots of bits in lots of places

In the final part of our world's biggest and best Parallel File Systems Review, we take a look at object based parallel file systems. Although this installment stands on its own, you may find it valuable to take a look at Part One: The Basics, Taxonomy and NFS and Part Two: NAS, AoE, iSCSI, and more! to round out your knowledge. Let's jump in! And don't miss the biggest and best summary table I ever created on the last page.

[Editor's note: This article represents the third in a series on cluster file systems. As the technology continues to change, Jeff has done his best to "snap shot" the state of the art. If products are missing or specifications are out of date, then please contact Jeff or myself. In addition, because the article is in three parts, we will continue the figure numbering from part one. Personally, I want to stress the magnitude and difficulty of this undertaking and thank Jeff for the amount of work he put into this series.]

A Brief Introduction to Object Based Storage

While object storage sounds simple, and conceptually it is, there are many gotchas that need to be addressed to make it reliable and useful as a real file system. The key concept of an object based file system needs to be emphasized since it is a powerful methodology for parallel file systems. In a more traditional, block based file system, the metadata manager is contacted and a set of inodes or blocks is allocated for a file (assuming a write operation). The client passes the data to the metadata manager, which then sends the data to the file system and ultimately to the disk. Notice that the metadata manager is a key part of the process: it is responsible not only for the metadata itself but also for where the data is located on the storage, and because it sits in the middle of every transfer, this design creates a performance bottleneck.

Object storage takes a different approach, allowing the storage devices themselves to manage where the data is stored. In an object storage system, the metadata manager and the storage are usually separate devices. The storage devices are not just dumb hard drives, but have some processing capability. With an object storage system the metadata manager is contacted by a client about a file operation, such as a write. The metadata manager contacts the storage devices and determines where the data can be placed based on some rules in the file system (e.g. striping for performance, load or capacity balancing across storage devices, RAID layout for resiliency). The metadata manager then passes back a map of which storage devices the client can use for the file operation, as well as what capabilities it has, and then gets out of the way of the actual file operation, allowing the client to contact the assigned storage devices directly. The metadata manager constantly monitors the operations so that it knows the basics of where the data is located, but it stays out of the middle of the data operations. If there is a change in the file operation, such as another client wanting to read or write data to the file, then the metadata manager has to get involved to arbitrate the file operations. But in general the data operations happen without the metadata manager in the middle.
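To make the flow concrete, here is a minimal sketch, in Python, of the write path just described: the client asks the metadata manager for a placement map, then moves the data directly to the storage devices. The class and method names are invented for illustration and are not taken from any particular file system.

    class StorageDevice:
        def __init__(self, ident):
            self.ident = ident
            self.objects = {}          # object key -> bytes
            self.used = 0

        def put(self, key, data):      # the device, not the manager, stores the bytes
            self.objects[key] = data
            self.used += len(data)

    class MetadataManager:
        def __init__(self, devices):
            self.devices = devices
            self.layouts = {}          # file name -> devices holding its objects

        def open_for_write(self, name, stripe_count=2):
            # Stand-in for the real placement rules: pick the least-used devices.
            chosen = sorted(self.devices, key=lambda d: d.used)[:stripe_count]
            self.layouts[name] = chosen
            return chosen              # the "map" handed back to the client

    def client_write(mgr, name, data):
        layout = mgr.open_for_write(name)        # one metadata round trip...
        for i, dev in enumerate(layout):         # ...then data flows straight to the devices
            dev.put((name, i), data[i::len(layout)])

    devices = [StorageDevice(i) for i in range(4)]
    mgr = MetadataManager(devices)
    client_write(mgr, "results.dat", b"x" * 1_000_000)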

A key benefit of object storage is that if the data on the storage devices needs to be moved, it can easily happen without the intervention of the metadata manager. The movement of the data can be for many reasons, most of them important for performance or resiliency of the file system. The storage devices just move the data and inform the metadata manager where it is now located.

Jan Jitze Krol of Panasas has a great analogy for object storage. Let's assume you are going to a mall to go shopping. You can drive your car into the parking garage and park on the third level, zone 5, seventh spot from the end and then go in and shop. When you finish you come out and you have to remember where you parked and then you go directly to your car. This is an analogy to the way a typical file system works.

For the object storage analogy, let's assume you arrive at the parking garage and the valet parks your car for you. You go in and shop, and when you finish you just present the valet with your parking ticket; the valet determines where your car is located and gets it for you. The advantage of the valet is that they can move the car around as they need to (perhaps they want to work on part of the garage). They can even tear down and rebuild the garage and you won't even know it happened. Like the valet, an object based file system has tremendous flexibility. It can optimize data layout (where your car is located), add more storage (add to the parking garage), check the data to make sure it's still correct (check your car to make sure it's not damaged and, if it is, fix it for you or inform you), and a whole host of other tasks. Notice that these tasks can be done with minimal involvement of the metadata manager.

So you can see that object storage has a great deal of flexibility and possibilities for improving file system performance and scalability for clusters. Let's examine the three major object file systems for clusters: Lustre, Panasas, and PVFS.

Panasas

Panasas is a storage company focused on the HPC market. Their ActiveStor Parallel Storage Cluster is a high-speed, scalable, global, parallel storage system that uses an object based file system called PanFS. Panasas couples their file system with a commodity based storage system that uses blades (called Panasas Storage Blades). The basic storage unit consists of a 4U chassis, called a shelf, that can accommodate up to 11 blades. Each blade is a complete system with a small motherboard, CPU, memory, one or two disks, and two internal GigE connections. There are two types of blades - a Director Blade (DB) that fills the role of a metadata manager and only contains a single disk, and a Storage Blade (SB) that stores data.

Each shelf can hold up to three Director Blades or 11 Storage Blades. The most common configuration is to have 1 Director Blade and 10 Storage Blades. Currently, Panasas has 500GB, 1TB, and 1.5TB storage blades, giving a nominal capacity of 5TB, 10TB, and 15TB per shelf. In addition, the Storage Blade can have either 512MB or 2GB of memory; the extra memory is used as cache. Each shelf also has redundant power supplies as well as battery backup. Each blade is connected to an internal 16-port GigE switch. Eleven of the ports are internal and connect to the blades, four are used to connect the shelf to an outside network, and the remaining GigE link can be used for diagnostics.

The object-based file system, called PanFS, is the unique feature of Panasas' storage system. As data is written to PanFS, it is broken into 64KB chunks. File attributes, an object ID, and other information are added to each chunk to complete an object. The resulting object is then written to one of the storage blades in the file system. For files that are less than 64KB, the objects are written to PanFS using RAID-1 on two different storage blades. For files larger than 64KB, the objects are written using RAID-5 across independent storage blades. Using more than one blade and more than one hard drive gives PanFS parallelism, and therefore speed. The objects are also spread across the storage blades to provide capacity load balancing so the amount of used space on each blade is about the same.
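The following short Python sketch illustrates the layout rule just described (RAID-1 mirroring below 64KB, RAID-5 striping above). It is not PanFS code; the stripe width and helper names are assumptions made purely for illustration.

    CHUNK = 64 * 1024                   # 64KB chunk/object size

    def choose_layout(file_size, blades_available):
        if file_size < CHUNK:
            return {"scheme": "RAID-1", "blades": 2}          # mirror the single object
        stripe_width = min(blades_available, 9)               # illustrative stripe width
        return {"scheme": "RAID-5", "blades": stripe_width}   # data plus rotating parity

    def num_objects(file_size):
        return max(1, -(-file_size // CHUNK))                 # ceiling division

    for size in (10 * 1024, 5 * 1024 * 1024):
        print(size, choose_layout(size, blades_available=10), num_objects(size), "objects")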

There is a subtle distinction in how Panasas does RAID and how other file systems, particularly non-object file systems, use RAID. For non-object based file systems, RAID is done on the block level. So the hardware is what is actually RAID-ed. In the case of PanFS, it is the file itself that is RAID-ed. This gives huge advantages to PanFS that I'll talk about in a bit.

Another aspect of PanFS that gives it performance is the Panasas-developed data access protocol, DirectFlow. In virtually all other file systems, if a client requests data, either to read or write, the metadata manager is always involved in the transaction even if it is not needed. This single bottleneck can limit performance. In DirectFlow, when a client requests data it first contacts the metadata manager that controls the part of the file system containing that data. The metadata manager passes back a map of where the data is located as well as a set of capabilities to the client. The client can then use the map and capabilities to communicate directly with the storage blades. This communication happens in parallel, so performance is greatly improved. The metadata manager is not involved in this part of the process and consequently is not a bottleneck in the data access. These features give PanFS its performance. Panasas has measured the performance of a single shelf with a single Director Blade at 350MB/s for reading data.

In addition to DirectFlow, PanFS can also be used with the NFS and CIFS protocols. Each Director Blade acts as an NFS and CIFS server, so if you have more than one DB, you can load balance NFS and CIFS across them. In effect, for NFS and CIFS needs it becomes a clustered NAS solution. This makes the file system very useful for desktops or for systems that do not run Linux.

In any file system, one of the most important aspects to consider is its resiliency, or the ability to tolerate failures and still function. In the case of PanFS, if a storage blade fails, some set of files is damaged (i.e. some objects are missing). The remaining objects and their parity are stored on other blades. As soon as a blade or disk failure is detected, all of the Director Blades start the reconstruction process. Since the metadata manager knows which files were on the failed blade, the Director Blades rebuild the missing objects for only the affected files and put them on the remaining storage blades. One of the key features of PanFS, or any object based storage system, is that only the files affected by the blade or disk loss are reconstructed. Contrast this with the case of block-based storage.

In block-based storage the drives are attached to a RAID controller. When a drive fails, all of the blocks on the failed drive must be reconstructed, not just the blocks that actually held data. This means that reconstruction takes longer, which also means that there is a larger window of vulnerability during which another drive could fail. In block based storage you also have only a single RAID controller doing the reconstruction. With Panasas, because of the object based nature of the file system, all of the Director Blades can participate in the reconstruction, reducing the amount of time to rebuild the data. In fact the rebuild time is linear with the number of Director Blades.
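A back-of-the-envelope calculation shows why this matters. The sketch below compares a whole-drive rebuild on a single RAID controller with a per-file rebuild spread across several Director Blades; the throughput figures and data sizes are made-up illustrative values, not measurements.

    def raid_rebuild_hours(drive_tb, rebuild_mb_s=50):
        # Block-based RAID must reconstruct every block on the failed drive.
        return drive_tb * 1e6 / rebuild_mb_s / 3600

    def object_rebuild_hours(affected_tb, director_blades, per_db_mb_s=50):
        # Object-based rebuild touches only the affected files and runs on all DBs in parallel.
        return affected_tb * 1e6 / (per_db_mb_s * director_blades) / 3600

    print(f"RAID controller, 1TB drive   : {raid_rebuild_hours(1.0):.1f} hours")
    print(f"Object rebuild, 0.6TB, 4 DBs : {object_rebuild_hours(0.6, 4):.1f} hours")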

Moreover, during block-based reconstruction, if an unrecoverable read error (URE) is encountered, the drive with the URE is considered failed. For a RAID-5 array, this means that the reconstruction process fails and you have to restore all of the data from a backup. For a RAID-6 array, the data on the second failed drive can be recovered and the reconstruction can continue, but you no longer have any protection against a further failure. In the case of PanFS, if a URE is encountered then the file associated with the bad media sector is marked as bad, but the reconstruction process continues with the remaining files. This is true for any number of UREs that might be encountered during reconstruction. Then only the missing file(s) are copied from a backup, which is far less data than restoring an entire RAID array from backup just for a failed drive.
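Here is a simplified Python sketch of that per-file behavior: an unreadable sector only poisons the file whose object it hits, while every other affected file is still rebuilt. The data and names below are made up for illustration.

    surviving = {
        "a.dat": ["obj0", "obj2", "parity"],
        "b.dat": ["obj0", None, "parity"],     # None stands in for a URE on a needed object
        "c.dat": ["obj1", "obj2", "parity"],
    }

    def reconstruct(files):
        damaged = []
        for name, pieces in files.items():
            if None in pieces:                 # hit a URE: mark just this one file bad...
                damaged.append(name)
                continue                       # ...and keep rebuilding the rest
            rebuilt_object = "+".join(pieces)  # stand-in for recomputing the lost object
        return damaged

    print("restore from backup:", reconstruct(surviving))   # -> ['b.dat']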

The current generation of ActiveStor software, ActiveStor 3.x, which includes PanFS, offers several new features that are designed to help prevent UREs and other data corruption. There is a daemon that runs in the background to check all of the media sectors on the drives as well as to check the parity of existing files. With Panasas' new Tiered Parity, PanFS now has the ability to repair files from parity information even if a media sector within the drive goes bad. Normally with block based schemes, if a bad sector is encountered, the file has to be restored from a backup. This daemon will also touch every sector on the disk, causing bad sectors to be remapped (note: if a sector is marked as bad, then Tiered Parity can restore it without having to restore the file from backup). RAID controllers can also "walk" the sectors forcing bad sectors to be remapped, but they cannot restore sectors that contain data. Those files have to be restored from backup.

Tiered Parity also includes the capability of checking the data that has been sent over the network (network parity). The client makes sure that the data delivered by the storage wasn't corrupted on its way to the client. This could become increasingly important as the number of IO devices and the number of network connections increases.
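The idea is easy to illustrate: send a checksum along with the data and verify it on arrival, so corruption introduced between the storage and the client is caught. The sketch below uses a simple CRC purely as an example; it does not show PanFS's actual on-the-wire checks.

    import zlib

    def send(data: bytes):
        return data, zlib.crc32(data)            # payload plus its checksum

    def receive(payload: bytes, checksum: int) -> bytes:
        if zlib.crc32(payload) != checksum:
            raise IOError("data corrupted in transit; re-request from storage")
        return payload

    payload, crc = send(b"object contents")
    assert receive(payload, crc) == b"object contents"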

ActiveStor also offers High Availability as an option, with fail-over between Director Blades. This allows one Director Blade (DB) to take over the functions of another in the event of a DB failure. When the failed DB is replaced, the functions it was performing can be easily migrated to the new DB. There is also a feature called Blade Drain that allows a blade to be marked as down, after which PanFS moves all of the existing objects from that blade to other blades. This feature allows the admin to mark suspect blades as down and move the data off of them before the blade goes bad.

The ActiveStor Storage Cluster is in use at many HPC sites. Some of the systems range from just a few Terabytes to almost a Petabyte. Since ActiveStor uses GigE connections, it is relatively easy to connect the storage to an existing cluster since there are likely to be GigE connections to the nodes. The DirectFlow protocol is implemented via a kernel module and Panasas supports a wide range of kernels and distributions.

Lustre

One of the few open-source parallel file systems is Lustre. Lustre is an object based file system that can potentially scale to tens of thousands of nodes and Petabytes of data. Lustre also stores data as objects, called containers, that are very similar to files but are not part of a directory tree. The objects are written to an underlying file system, currently a modified ext3, since Lustre really only runs on Linux. The modifications improve the performance and scalability of ext3, and the Lustre team is very good about pushing their patches back into the mainstream ext3 source.

The advantage of an object based file system is that allocation management of data is distributed over many nodes, avoiding a central bottleneck. Only the file system needs to know where the objects are located, not the applications, so the data can be put anywhere in the file system. This gives tremendous flexibility to the designers of the file system to enhance performance, robustness, resiliency, etc.

As with other file systems, Lustre has a metadata component, a data component, and a client part that accesses the data. However, Lustre allows these components to be put on various machines, or on a single machine (usually only for home clusters or for testing). The metadata is handled by a server called a MetaData Server (MDS). In version 1.x of Lustre, there is only one active MDS, but it can also be built with a second MDS for fail-over in the event that the first dies (one in active mode and one in standby mode). In Lustre 2.x the goal is to have tens or hundreds of MDS machines.

The file system data itself is stored as objects on the Object Storage Server (OSS) machines. The data can be spread across the OSS machines in a round-robin fashion (striped in a RAID-0 sense) to allow parallel data access across many nodes, resulting in higher throughput. This distribution also helps to ensure that a failure of an OSS machine will not cause the loss of data. But the data is not written in any kind of RAID format to help preserve or recover data. The client machines mount the Lustre file system in the same way other file systems are mounted. Lustre mount points can be put into /etc/fstab or in an automounter.
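As a simple picture of the round-robin striping described above, the Python sketch below maps consecutive stripes of a file onto a set of OSS machines. The stripe size and server count are arbitrary illustrative values, not Lustre defaults.

    STRIPE_SIZE = 1 * 1024 * 1024        # 1MB stripes (illustrative)

    def stripe_map(file_size, num_oss):
        """Return which OSS holds each consecutive stripe of the file."""
        stripes = -(-file_size // STRIPE_SIZE)          # ceiling division
        return [i % num_oss for i in range(stripes)]

    print(stripe_map(file_size=5 * 1024 * 1024 + 1, num_oss=4))
    # -> [0, 1, 2, 3, 0, 1]: six stripes spread over four servers, accessed in parallel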

For clusters, the MDS and OSS functions are put on different machines. As previously mentioned, the current version of Lustre limits you to two MDS machines, only one of which is active, but you can have as many OSS machines as you want, need, or can afford. There are some very large Lustre sites in use. For large sites, the MDS machines are usually very large, shall we say "beefy", machines to handle the metadata traffic. But with object based file systems, the idea is to reduce or minimize the amount of metadata involvement and traffic in a data operation (read or write). Consequently, these large Lustre configurations can easily handle thousands of clients and tens or even hundreds of OSS machines without overly taxing the MDS machine.

Lustre uses an open network protocol to allow the components of the file system to communicate. The protocol, called Portals, was originally developed at Sandia National Laboratories. This allows the networking portion of Lustre to be abstracted so new networks can be easily added. Currently it supports TCP networks (Fast Ethernet, Gigabit Ethernet, 10GigE), Quadrics Elan, Myrinet GM, Scali SDP, and InfiniBand. Lustre also uses Remote Direct Memory Access (RDMA) and OS-bypass capabilities to improve I/O performance.

Notice that Lustre is purely a file system. You have to select what storage hardware you want to use for the OSS and MDS systems. If you examine some of the Lustre systems that are functioning around the world you can gain some insight into how you can architect such a system. For example, it's advisable to use two machines configured in failover mode for OSS nodes so that if one OSS fails, the other(s) can continue to access the storage. It's also advisable to use some sort of RAID protection on the storage itself so that you can recover from a failed disk. If you are using large disks or a large number of disks in a single RAID group, then it's advisable to use RAID-6. But just be careful in that RAID-6 requires more disks for a given capacity and also is expensive in terms of parity computations (i.e. it taxes the RAID controller heavily). One other thing to notice is that it is a good idea to configure the MDS node using HA (High Availability) in the event that the MDS node goes down.
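To see the capacity cost mentioned above, recall that RAID-6 spends two disks per group on parity versus one for RAID-5. A quick calculation, with example disk counts and sizes, makes the trade-off concrete:

    def usable_tb(disks, disk_tb, parity_disks):
        return (disks - parity_disks) * disk_tb

    for level, parity in (("RAID-5", 1), ("RAID-6", 2)):
        print(level, usable_tb(disks=8, disk_tb=1.0, parity_disks=parity), "TB usable of 8.0 TB raw")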

One of the previous complaints about Lustre has been that you have to patch the kernel for the clients to use Lustre. There were very few people who could take the patches, apply them to a kernel.org kernel, and successfully build and run Lustre, so people had to rely on ClusterFS for their kernels. This was a burden on the users and on ClusterFS. With the advent of Lustre 1.6.x, ClusterFS now offers a patchless client if you use a 2.6.15-16 kernel or greater. According to ClusterFS, there may be a performance reduction when using the patchless client, but for many people this is a useful thing since they can quickly rebuild a Lustre client kernel if need be.

As a way to finance development, Lustre follows a slightly unusual open-source model. In this model, the newest version of Lustre is only available from Cluster File Systems, Inc. (ClusterFS), the company developing Lustre. The previous version of Lustre is available freely from www.lustre.org. The open-source community around Lustre is pretty good about helping each other, and even people from ClusterFS will chime in to help. You can also buy support from ClusterFS for the current version or older versions.

Recently, Sun purchased ClusterFS. As part of the announcement there was a commitment from Sun and ClusterFS to keep the licensing for Lustre and to support people who use Lustre on non-Sun hardware. An earlier announcement from ClusterFS stated that they will be developing Lustre to use ZFS as the underlying file system. It is assumed that this version will be available only for Sun hardware.

About a year ago some performance testing was done with Lustre on a cluster at Lawrence Livermore National Laboratory (LLNL) with 1,100 nodes. Lustre was able to achieve a throughput of 270 MB/sec for each OST and an average client I/O throughput of 260 MB/sec. In total, for the 1,000 clients within the cluster, it achieved an I/O rate of 11.1 GB/sec.
