|
Page 2 of 3
GPFS
IBM has had a global parallel file system for many years.
The
Global Parallel File System (GPFS) started with the IBM SP systems
running AIX, but has been ported to Linux for certain IBM eServer,
blade server, and xServer products.
GPFS is a high-speed, parallel, distributed file system that achieves
high-performance by striping data across multiple disks and multiple
nodes. To further improve performance, it uses client-side caching as
well as read-ahead and write-behind functions. It can use large block
sizes from 16KB to 1024 KB to help improve IO throughput depending
upon requirements.
To get HA capability, it can be configured using logging and replication.
It can be configured for fail over at a disk-level and at the server
level. This means that if you lose a node GPFS will not lose access
to data nor degrade performance unacceptably. As part of this it can
transparently fail-over lock servers and other core GPFS functions
so that the system stays up and performance is at an acceptable level.
Lustre
There is a relatively new open-source file system called
Lustre
that has been developed for Linux clusters.
It follows a model that the newest version is only available from
Cluster File Systems, Inc. while the previous
version is available freely from
www.lustre.org.
It can potentially scale to ten of
thousands of nodes and hundreds of Terabytes of storage. Lustre stores
data as objects, called containers, that are very similar to files,
but are not part of a directory tree. The advantage to an object
based file system is that allocation management of data is
distributed over many nodes, avoiding a central bottleneck. Only the
file system needs to know where the objects are located, not the
applications, so the data can be put anywhere on the network.
As with other file systems, Lustre has a metadata component, a data
component, and a client part that accesses the data. However, Lustre
allows these components to be put on various machines, or a single machine
(usually for home clusters or for testing). The metadata
can be distributed across machines called MetaData Servers
(MDS), to ensure that the failure of one machine will not cause the
file system to crash. The MetaData Servers support failover as well.
In Lustre 1.x, you can use up to two MDS machines while in Lustre
2.x, you can have tens or even hundreds of MDS machines.
The file system data itself, is stored as objects on the Object Storage
Servers (OSS) machines. The data can be spread across the OSS
machines in a round-robin fashion (striped in a RAID sense) to allow
parallel data access across many nodes resulting in higher
throughput. The data is also distributed to ensure that a failure of
an OSS machine will not cause the lose of data. The client machines
mount the Lustre file system in the same way other file systems are
mounted. They communicate with the OSS machines for direct data
access.
The MDS machines, OSS machines, and clients can all reside on one
machine (most likely for testing or learning) or can be split among
multiple machines. For example,
you could make all of the nodes within a cluster OSS machines and
clients so that they can see the entire file system. You could choose
one or more of the nodes in the cluster to be MDS machines for fail
over redundancy.
Lustre uses open network protocols to allows the components to communicate.
It uses an open protocol called
Portals,
originally developed at
Sandia National Laboratories. This allows the networking portion of
Lustre to be abstracted so new networks can easily be added.
Currently it supports TCP networks (Fast Ethernet, Gigabit Ethernet),
Quadrics Elan, Myrinet GM, Scali SDP, and Infiniband. Lustre also
uses Remote Direct Memory Access (RDMA) and OS-bypass capabilities to
improve I/O performance.
Recently some performance testing was done with Lustre on a cluster at
Lawrence Livermore National Laboratories (LLNL) with 1,100 nodes.
Lustre was able to achieve a throughput for each OST of 270 MB/sec
and an average client I/O throughput of 260 MB/sec. In total, for the
1,000 clients within the cluster, it achieved an I/O rate of 11.1
GB/sec.
PVFS
One of the first parallel file systems for Linux clusters was
PVFS (Parallel Virtual File System).
The original PVFS system, now called
PVFS1, was originally developed at Clemson University. It is not
designed to be a file system for users home directories but rather
a high-speed scratch file system for the storage of
input and output data during the run of a job. It is an open-source
project that has involved several developers. There is a new version
of PVFS called PVFS2 which is a complete rewrite the code to make
PVFS more expandable for new systems and to add new features.
PVFS2 is a complete rewrite of PVFS1 focusing on improving the scalability of
PVFS, the flexibility of PVFS, and the performance of PVFS. PVFS1 has
been modified for various networking protocols beside TCP, but the
modifications were very invasive and difficult to support. PVFS2 has
abstracted the networking layer allowing new networking protocols to
be used very quickly. For example, Infiniband support was added in
about a week with only about 3,000 lines of C code. Currently PVFS2
supports TCP, Infiniband, and Myrinet. In addition, PVFS2 accomodates
heterogeneous clusters allowing x86, PowerPC, Itanium machines to all
be part of the same PVFS file system. PVFS2 also adds some management
interfaces and tools to allow easier administration.
Both PVFS1 and 2 divide the functions of the file system into two pieces,
a metadata server and data servers, and allow clients to mount PVFS as
a normal file system. The metadata server holds information about the
data in the file system, and the IO devices (IOD's) actually hold the
data. In PVFS1 the metadata is put onto one machine. In PVFS2, the
metadata can be distributed. The data is then spread across the IO
devices (IOD's) which can be the nodes within a cluster. PVFS
stores the data as files on top of an existing file systems such as
ext3 or xfs. So you can mix the file systems under a PVFS system
without worry. When data is written to PVFS it is broken into chunks
of a certain size and then written to the IOD's in some fashion,
usually in a round-robin fashion. The size of the chunks, what IOD
nodes are used for storage, and how the chunks are written, are all
configurable, allowing PVFS to be tuned for maximum performance for a
given application or class of applications.
One of the features of PVFS is that one of the IOD's can go down and PVFS will
still be available. PVFS will recognize that one of the IOD's is down
and not write to it. Instead it will go on to the next IOD in the
list. Consequently, applications that are writing will not hang or
crash (as long as they are not writing to the IOD node when it
crashes). However, any application that was reading data that was on
the IOD will not be able to fully access all of the data since that
IOD is down. If the IOD is brought back into production without the
storage space being altered, the data will once again be available
for use. On the other hand, if the storage space on the returned IOD
has been altered, such as re-installing Linux, then the file(s) that
used that IOD are corrupt and cannot be recovered. This is a design
decision that the PVFS developers have made. PVFS is designed to be a
temporary high-speed storage space, not a place for home directories.
The intent is to use PVFS for storage of data during an application
run and then move the data off of PVFS onto a permanent storage space
that is backed by tape or some other type of archiving media.
There are several ways to use PVFS. The best way to take advantage of the
parallel capabilities involves using an MPI-IO implementation such as
ROMIO or ChaMPIon that uses PVFS directly (either PVFS1 or PVFS2).
You can also use the normal Linux kernel interface but you won't get
the parallel speed of PVFS. This interface allows PVFS to be mounted
as a normal file system on the client nodes and for users to interact
with PVFS using typical commands such as ls or mv or
rm. Currently PVFS1 and PVFS2 cannot run binaries (i.e. you
can't run an application stored in PVFS). In the case of PVFS1, there is
also a third interface,
the PVFS library. This library has semantics very similar to the standard
C library file functions. It allows the parallel performance
capabilities of PVFS to be used, but requires some changes to your
application.
PVFS is in use at various sites around the world. PVFS1 is in production
use at several locations including the University of Utah and the Ohio State
Supercomputer Center. PVFS2 in production use as well. It is being used
on Orion Multi-System desktop clusters.
The performance of PVFS is heavily dependent upon the network
performance. The typical ethernet networks will work well, but will
limit scalability and ultimate performance. High-speed networks such
as Myrinet, Quadrics, and Infiniband (IB) greatly improve scalability
and performance. In tests, the IO performance of PVFS scales linearly
with the number of IOD's up to at least 128 IODs. At that point, the
performance exceeds 1 Gigabyte/sec. PVFS2 has been tested with 350
IOD's, each with a simple IDE and connected via Myrinet 2000. With
100 clients, PVFS2 was able to achieve an aggregate performance of
about 4.5 Gigabyte/sec in writes and almost 5 Gigabyte/sec in read
performance running an MPI-IO test code.
GFS
A few years ago some researchers
from the University of Minnesota formed a company called Sistina to
develop open-source file system tools for Linux. They developed lvm
(Logical Volume Manager) for Linux and GFS (Global File System). The goal of
GFS is to provide a true global name space for clustered machines
using various networking options (such as Fibre Channel or GigE) that
is resilient to server or network loss.
After a period of time, Sistina took GFS and made it
closed-source. Recently, Redhat has purchased Sistina and
re-open-sourced
GFS.
Redhat is offering GFS with
commercial support as part of their Red Hat Cluster Suite
for $499 for up to 8 nodes (it requires a subscription to
Red Hat Enterprise Linux (RHEL) on these servers).
GFS is a global distributed file
system that can run on Linux systems that are connected to a Storage
Area Network (SAN). These networks can be constructed with Fibre
Channel or iSCSI networks. It has distributed locking as well as
distributed meta data so if a server in the file system goes down no
data or access to data is lost. It has a dynamic path capability so
in the event that a switch or HBA (Host Bus Adapter) goes down, it
can still get to the other servers in the SAN. It has quota
capability as well. GFS is flexible enough that various
configurations for clusters are possible. This includes both
networking and storage options as well. Redhat is claiming that GFS
scales to hundreds of machines and tens of Terabytes.
|