|
Page 1 of 3
lots of bits in lots of places
In the final part of our worlds biggest and best Parallel File Systems Review, we take a look at object based parallel file systems. Although this installment stands on its own, you may find it valuable to take a look at
Part One: The Basics, Taxonomy and NFS and
Part Two: NAS, AoE, iSCSI, and more! to round out your knowledge. Let's jump in! And, don't miss the biggest and best summary table I ever created on the last page.
[Editors note: This article represents the third in a series on cluster file systems. As the technology continues to change, Jeff has done his best to "snap shot" the state of the art. If products are missing or specifications are out of
date, then please contact
Jeff or myself.
In addition,, because the article is in three parts, we will continue the figure numbering from part one.
Personally, I want to stress magnitude and difficulty of this undertaking and thank Jeff for amount of work he put into this series.]
A Brief Introduction to Object Based Storage
While object storage sounds simple, and conceptually it is, there are many gotchas that need
to be addressed to make it reliable and useful as a real file system. The key concept of an object based file system needs to be emphasized
since it is a powerful methodology for parallel file systems. In a more
traditional file system, such as a block based file system, the metadata
manager is contacted and a set of inodes or blocks is allocated for a file
(assuming a write operation). The client passes the data to the metadata
manager which then sends the data to the file system and ultimately to the
disk. If you notice, the metadata manager is a key part of the process.
This design creates a bottleneck in performance. The metadata manager is responsible
not only for the metadata itself but also where the data is located on the
storage.
Object storage takes a different approach allowing the storage devices
themselves to manage where the data is stored. In an object storage system,
the metadata manager and the storage are usually separate devices. The
storage devices are not just dumb hard drives, but have some processing
capability. With an object storage system the metadata manager is contacted
by a client about a file operation, such as a write. The metadata manager
contacts the storage devices and determines where the data can be placed
based on some rules in the file system (e.g. striping for performance,
load or capacity balancing across storage devices, RAID layout for
resiliency). Then the metadata manager passes back a map of which storage
devices the client can use for the file operation as well as what capabilities
it has. Then the metadata manager gets out of the way of the actual file
operation and allows the client to directly contact the assigned storage
devices. The metadata manager constantly monitors the operations so that
it knows the basics of where the data is located but it stays out of the
middle of the data operations. If there is a change in the file operation,
such as another client wanting to read or write data to the file, then the
metadata manager has to get involved to arbitrate the file operations. But
in general the data operations happen without the metadata manager being
in the middle.
A key benefit of object storage is that if the data on the storage devices
needs to be moved, it can easily happen without the intervention of the
metadata manager. The movement of the data can be for many reasons, most
of them important for performance or resiliency of the file system. The
storage devices just move the data and inform the metadata manager where
it is now located.
Jan Jitze Krol of Panasas has a great analogy for object storage. Let's
assume you are going to a mall to go shopping. You can drive your car into
the parking garage and park on the third level, zone 5, seventh spot from
the end and then go in and shop. When you finish you come out and you have
to remember where you parked and then you go directly to your car. This is
an analogy to the way a typical file system works.
For the object storage
analogy, let's assume you arrive at the parking garage and the valet parks
your car for you. You go in and shop and when you finish you just present
the valet with your parking ticket and the valet determines where your car
is located and gets it for you. The advantage of the valet is that they can
move the car around as they need to (perhaps they want to work on part of
the garage). They can even tear down and rebuild the garage and you won't
even know it happened. Similar to the valet, object based file systems give
the file system tremendous flexibility. It can optimize
data layout (where your car is located), add more storage (add to the parking
garage), check the data to make sure it's still correct (check your car to
make sure it's not damaged and if it is, fix it for you or inform you), and
a whole host of other tasks. Notice that these tasks can be done with minimal
involvement of the metadata manager.
So you can see that object storage has a great deal of flexibility and
possibilities for improving file system performance and scalability for
clusters. Let's examine the three major object file systems for clusters,
Lustre, Panasas, and PVFS.
Panasas
Panasas is a storage company
focused on the HPC market. Their ActiveStor Parallel Storage Cluster
is a high-speed, scalable, global, parallel storage system that uses an
object based file system called PanFS. Panasas couples their file system
with a commodity based storage system that uses blades (called Panasas
Storage Blades). The basic storage unit consists of a 4U chassis, called
a shelf, that can accommodate up to 11 blades. Each blade is a complete
system with a small motherboard, CPU, memory,1 or 2 disks, and
two internal GigE connections. There are two types of blades - a Director
Blade (DB) that fills the role of a metadata manager and only contains
a single disk, and a Storage Blade (SB) that stores data.
Each shelf can hold up to three Director Blades or 11 Storage Blades.
The most common configuration is to have 1 Director Blade
and 10 Storage Blades. Currently, Panasas has 500GB, 1TB, and 1.5TB
storage blades giving a nominal storage of 5TB, 10TB, and 15TB per
shelf. In addition, the Storage Blade can have either 512MB of
memory or 2GB of memory, the extra memory is used for cache.
In addition, each shelf has redundant power supplies as well as
battery backup. Each blade is connected to an internal GigE switch
with 16 ports. Eleven of the ports are internal to connect to the disk
and four are used to connect the shelf to an outside network (the
remaining GigE link can be used for diagnostics).
The object-based file system, called PanFS, is the unique
features of Panasas' storage system. As data is written to PanFS, it
is broken into 64KB chunks. File attributes, an object ID, and other
information are added to each chunk to complete an object. The resulting
object is then written to one of the storage blades in the file system.
For files that are less than 64KB, the objects are written to PanFS
using RAID-1 on two different storage blades. For files larger then
64KB, the objects are written using RAID-5 using independent storage
blades. Using more than one blade and more than one hard drive gives
PanFS parallelism resulting in speed. The objects are also spread
across the storage blades to provide capacity load balancing so the
amount of used space on each blade is about the same.
There is a subtle distinction in how Panasas does RAID and how
other file systems, particularly non-object file systems, use
RAID. For non-object based file systems, RAID is done on the
block level. So the hardware is what is actually RAID-ed. In the
case of PanFS, it is the file itself that is RAID-ed. This gives
huge advantages to PanFS that I'll talk about in a bit.
Another aspect of PanFS that gives it performance is the Panasas
developed data access protocol, DirectFlow. In virtually all other
file systems, if a client requests data, either to read or write,
the metadata manager is always involved in the transaction even if it
is not needed. This single bottleneck can limit performance. In
DirectFlow, when a client requests data it contacts a metadata manager
first, that controls that part of the file system containing that data.
The metadata manager passes back a map of where the data is located
as well as a set of capabilities to the client. Then the client can
use the map and capabilities to communicate directly with the
storage blades. This communication happens in parallel so performance
is greatly improved. The metadata manager is not involved in
this part of the process. Consequently, the metadata manager is
not a bottleneck in the data access process. These features give
PanFS its performance. Panasas has measured the performance of a
single shelf with a single Director Blade at 350MB/s for reading data.
In addition to DirectFlow, PanFS can also used with NFS and CIFS
protocols. Each Director Blades acts an NFS and CIFS server. So if you
have more than one DB, you can load balance NFS and CIFS across them.
In effect for NFS and CIFS needs, it becomes a clustered NAS solution.
This makes the file system very useful desktops or for systems that
do not run Linux.
In any file system, one of the most important aspects to consider
is its resiliency or the ability to tolerate failures and still
function. In the case of PanFS,
if a storage blade fails some set of files are damaged (i.e. some
objects are missing). The remaining objects and their parity are
stored on other blades. As soon as a blade or disk failure is
detected, all of the Director Blades start the reconstruction
process. Since the metadata knows what files were on the failed blade
the Director Blades start rebuilding the missing objects for only the
affected files and put them on remaining storage blades.
One of the key features of PanFS, or any object based storage
system, is that only the files affected by the blade or disk loss are
reconstructed. Contrast this with the case of block-based storage.
In block-based storage the drives are attached to a RAID controller.
When a drive fails, all of the blocks on the failed
drive much be reconstructed, not just the data that was on the
failed drive. This means that reconstruction takes longer which
also means that there is a larger window of vulnerability
where another drive could fail. In block based storage you also have
only a single RAID controller doing the reconstruction. With Panasas,
because of the object based nature of the file system, all of the
Director Blades can participate in the reconstruction, reducing the
amount of time to rebuild the data. In fact the rebuild time is linear
with the number of Director Blades.
Moreover, during block-based reconstruction, if an unrecoverable read
error (URE) is encountered, then the drive with the URE is considered
failed. For a RAID-5 array, this means that the reconstruction process
fails and you have to restore all of the data from a backup. For a RAID-6
array, this means that data on the second failed drive can be recovered
and the reconstruction can continue, but you also don't have any more
protection for a second failure. In the case of PanFS, if a URE is
encountered then the file associated with the bad media sector is marked
as bad, but the reconstruction process continues with the remaining
files. This is true for any number of URE's that might be encountered
during reconstruction. Then only the missing file(s) are copied from
a backup, which is far less data than restoring all of the data from
backup for an entire RAID array just for a failed drive.
The current generation of ActiveStor software, ActiveStor 3.x, which
include PanFS, includes several new features that are designed to
help prevent URE's and other data corruption. There is a daemon that
runs in the background to check all of the media sectors on the drives
as well as to check the parity of existing files. With Panasas' new
Tiered Parity,
PanFS now has the ability to repair files from parity information even
if a media sector within the drive goes bad. Normally with block based
schemes, if a bad sector is encountered, the file has to be restored
from a backup. This daemon will also touch every sector on the disk,
causing bad sectors to be remapped (Note: if a bad sector is marked
as bad, then the Tiered Parity can restore the bad sector without
having to restore the file from backup). RAID controllers can also
"walk" the sectors forcing bad sectors to be remapped, but it cannot
restore sectors that contain data. Those files have to be restored
from backup.
Tiered Parity also includes the capability of checking the data that
has been sent over the network (network parity). The client makes sure
that the data delivered by the storage wasn't corrupted while on it's
way to the client. This could become increasing important as the number
of IO devices and the number of network connections increases.
ActiveStor also offers High Availability as an option, with fail-over
between Director Blades. This allows Director Blades (DB) to take over
the functions of another in the even of a failure of a DB. When the
DB is replaced the functions that it was performing can be easily
migrated to the new DB. There is also a feature called Blade Drain that
allows a blade to be marked as down and then PanFS moves all of the
existing objects from that blade to other blades. This feature allows the admin
to mark blades that are suspect as down and move the data off of them
before the blade goes bad.
The ActiveStor Storage Cluster is in use at many HPC sites. Some of
the systems range from just a few Terabytes to almost a Petabyte.
Since ActiveStor uses GigE connections, it is relatively easy to connect
the storage to an existing cluster since there are likely to be GigE
connections to the nodes. The DirectFlow protocol is implemented via
a kernel module and Panasas supports a wide range of kernels and
distributions.
Lustre
One of the few open-source parallel file systems is
Lustre. Lustre is an object based
file system that can potentially scale to ten of thousands of nodes and
Petabytes of data. Lustre also stores data as objects, called containers, that are very similar to
files, but are not part of a directory tree. The objects are written
to an underlying file system. Currently that file system is a modified
ext3 since Lustre
really only runs on Linux. The modifications are to improve performance
of ext3 and to improve the scalability of ext3. The Lustre team is
very good about pushing their patches back into the mainstream ext3
source as well.
The advantage to an object based file system is that allocation management
of data is distributed over many nodes, avoiding a central bottleneck.
Only the file system needs to know is where the objects are located,
not the applications. So the data can be put anywhere on the file system.
This gives tremendous flexibility to the designers of the file system
to enhance performance, robustness, resiliency, etc.
As with other file systems, Lustre has a metadata component, a data
component, and a client part that accesses the data. However, Lustre
allows these components to be put on various machines, or a single
machine (usually only for home clusters or for testing). The metadata
is handled by a server called a MetaData Server (MDS). In version 1.x
of Lustre, there is only one MDS but it can also be built with a second
MDS for fail-over in the even that the first dies (one in active mode
and one in standby mode). In Lustre 2.x the goal is to have tens or
hundreds of MDS machines.
The file system data itself is stored as objects on the Object
Storage Servers (OSS) machines. The data can be spread across the OSS
machines in a round-robin fashion (striped in a RAID-0 sense) to allow
parallel data access across many nodes resulting in higher
throughput. This distribution also helps to ensure that a failure of
an OSS machine will not cause the lose of data. But, the data is not
written in any kind of RAID format to help preserve or recover data.
The client machines mount the Lustre file system in the same way
other file systems are mounted. Lustre mount points can be put into
/etc/fstab or in an automounter.
For clusters, the MDS machines and the OSS machines are put on different
machines. As previously mentioned, the current version of Lustre limits
you to two MDS machines, only one of which is active, but you can have
as many OSS machines as you want, need, or can afford. There are some
very large Lustre sites in use. For large sites, the MDS machines are
usually very large, shall we say "beefy", machines to handle the metadata
traffic. But with object based file systems, the idea is to reduce or
minimize the amount of metadata involvement and traffic in a data
operation (read or write). Consequently, these large Lustre
configurations can easily handle thousands of clients, tens or even
hundreds of OSS machines without overly taxing the MDS machine.
Lustre uses an open network protocol to allows the components of
the file system to communicate. It uses an open protocol called
Portals,
originally developed at Sandia National Laboratories. This allows the
networking portion of Lustre to be abstracted so new networks can be
easily added. Currently it supports TCP networks (Fast Ethernet,
Gigabit Ethernet, 10GigE), Quadrics Elan, Myrinet GM, Scali SDP, and
InfiniBand. Lustre also uses Remote Direct Memory Access (RDMA) and
OS-bypass capabilities to improve I/O performance.
Notice that Lustre is purely a file system. You have to select
what storage hardware you want to use for the OSS and MDS systems. If
you examine some of the Lustre systems that are functioning around
the world you can gain some insight into how you can architect such a
system. For example, it's advisable to use two machines configured in
failover mode for OSS nodes so that if one OSS fails, the other(s)
can continue to access the storage. It's also advisable to use some
sort of RAID protection on the storage itself so that you can recover
from a failed disk. If you are using large disks or a large number of
disks in a single RAID group, then it's advisable to use RAID-6. But
just be careful in that RAID-6 requires more disks for a given capacity
and also is expensive in terms of parity computations (i.e. it taxes the
RAID controller heavily). One other thing to notice is that it is a good
idea to configure the MDS node using HA (High Availability) in the
event that the MDS node goes down.
One of the previous complaints about Lustre has been that you have
to patch the kernel for the clients to use Lustre. There were very
few people who could take the patches and apply them to a kernel.org kernel
and successfully build and run Lustre. People had to rely on
ClusterFS
for their kernels. This was a burden on the users and on ClusterFS. With
the advent of Lustre 1.6.x, ClusterFS now has a patchless kernel for
the client if you use a
2.6.15-16 kernel
or greater. According to ClusterFS, there may be a
performance reduction when using the patchless client, but for many
people this is a useful thing since they can quickly rebuild a
Lustre client kernel if need be.
In a way to finance development, Lustre follows a slightly unusual open-source mode. In this model, the newest version of
Lustre is only available from
Cluster File Systems, Inc.,
the company developing Lustre. The previous version of Lustre is
available freely from
www.lustre.org. The open-source
community around Lustre is pretty good about helping each other.
Even people from ClusterFS will chime in to help. You can even buy
support from ClusterFS for the current version or older versions.
Recently, Sun purchased ClusterFS.
As part of the announcement there was a commitment from Sun and
ClusterFS to keep the licensing for Lustre and support people who use
Lustre on non-Sun hardware. An earlier announcement from ClusterFS
is that they will be developing Lustre to use
ZFS
as the underlying file system for Lustre. It is assumed that this
version will be available only for Sun hardware.
About a year ago some performance testing was done with Lustre on a
cluster at Lawrence Livermore National Laboratories (LLNL) with
1,100 nodes. Lustre was able to achieve a throughput for each OST of
270 MB/sec and an average client I/O throughput of 260 MB/sec. In
total, for the 1,000 clients within the cluster, it achieved an I/O
rate of 11.1 GB/sec.
|