Plenty of Options for Plenty of Files
Welcome to Cluster Monkey (aka "The Monkey") and the Cluster
File Systems Column. This column will take a look at the new
file systems for clusters: how
they work, how they fit into the HPC world, and how you can deploy
them for maximum effect. It will also explore some of the details
of file systems to help you understand such things as the difference
between metadata and inodes, and it will discuss some of the
underlying hardware that is important to file systems, such as
networking and the data storage devices themselves.
This first column will present a brief overview of many of the file
systems that are available today for clusters as well as some storage
options for the file systems. This includes true parallel
file systems, new network file systems, and even a hardware option
for high-performance storage for clusters. The list isn't exhaustive
but is intended to whet your appetite for more information about the
explosion of storage that is happening.
Storage has become a very important part of clusters. Initially,
people used NFS to distribute data to all of the nodes of the
cluster. However, as clusters grew to hundreds and thousands of
nodes, and the demand for increased data I/O rates grew, people
realized that NFS was not going to cut it. So they used somewhat
kludgy systems for a while or turned to other ideas such as PVFS
(Parallel Virtual File System). Recently, companies have started to
realize that the market for HPC file systems is larger than they
thought and largely untapped. Couple this with very high-speed,
low-cost networks such as InfiniBand, and the time is right for
an explosion of HPC file systems.
IBRIX
IBRIX is a relatively new company offering a distributed file system
that presents itself as a global name space to all the clients. IBRIX's
Fusion product is a software-only product that takes whatever
data space you designate on whatever machines you choose and creates
a global, parallel, distributed file system on them. This file
system, or "name space," can be mounted by clients, who then share
the same data with all of the other clients. In essence, each client
sees the exact same data, hence the phrase "single" or "global" name
space. The key to Fusion is that the common bottlenecks in
parallel global file systems have been removed. Consequently, the
file system scales almost linearly with the number of data servers
(also called IO servers). This architecture allows the file system to
grow to tens of Petabytes (a Petabyte is about 1,000 Terabytes or
about 1,000,000 Gigabytes). It can also achieve IO (Input/Output)
speeds of tens of Gigabytes per second for large or small files.
IBRIX has automatic fail-over as well as metadata journaling to speed
recovery in case of a crash. Perhaps more importantly, IBRIX has
developed a distributed metadata capability so losing several nodes
will not result in losing access to any data. This unique feature
also allows parts of the name space to be taken off-line for
maintenance, upgrades, or even backups, while the rest of the name
space stays on-line. You can also add storage space while Fusion is
running and it will automatically incorporate it. It can also export
the file system using NFS or CIFS (for the Windows users that haven't
gotten a clue yet).
It's easy to see that Fusion could be deployed in an HPC cluster by using
all of the latent space available on the compute nodes. Since most
nodes come with at least something like a 40 Gig or 80 Gig hard drive
and the OS only takes about 2-4 GB (Gigabytes) of space, you have some extra
space to do something with. Fusion allows you to combine that extra
space and create a global name space for all of the nodes within the
cluster. Alternatively, you could choose a few nodes and load them
with storage space, create a global name space using these data
servers, and mount it on the client nodes at speeds much faster than
traditional NFS. These client nodes don't need a local disk, so you
can run them diskless.
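The arithmetic behind the "latent space" idea is easy to sketch. The
Python snippet below is only a back-of-the-envelope estimate, not
anything that ships with Fusion. The drive and OS sizes are the ones
quoted above; the 128-node cluster size and the function name are
made-up assumptions for illustration.

```python
def spare_capacity_gb(nodes, disk_gb, os_gb):
    """Raw space left on the compute nodes once the OS is installed."""
    if os_gb > disk_gb:
        raise ValueError("OS footprint is larger than the disk")
    return nodes * (disk_gb - os_gb)


# Hypothetical 128-node cluster (the node count is an assumption).
NODES = 128

# Pessimistic case: small drives and a large OS install.
low = spare_capacity_gb(NODES, disk_gb=40, os_gb=4)
# Optimistic case: large drives and a small OS install.
high = spare_capacity_gb(NODES, disk_gb=80, os_gb=2)

print(f"{low / 1000:.1f} TB to {high / 1000:.1f} TB of raw spare space")
```

For this hypothetical cluster the estimate works out to roughly 4.6 to
10 TB of raw space, before any file system overhead or replication is
taken into account.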
IBRIX Fusion currently has some limitations. It isn't 64-bit (yet), and
it requires IP for its networking. Also, it doesn't support SUSE on the
client nodes. However, IBRIX is aware of these issues and is working
to provide all of these features.
IBRIX Fusion also comes bundled with a number of systems. For example,
Dell is shipping Fusion with some of its cluster products, Rackable
Systems recently announced an OEM agreement with IBRIX, and
Scali has announced a reseller arrangement with IBRIX.
Polyserve
Polyserve Inc. has a unique product in the storage world. The Polyserve
Matrix Server takes up to 16 SAN (Storage Area Network) attached
servers and creates a high-performance, low-cost NAS (Network
Attached Storage) system.
Polyserve takes low-cost PC servers running Linux that are attached via a Fibre
Channel (FC) network to a SAN and installs its proprietary file
system on them. This file system is a true symmetric file system with high
availability services and cluster and storage management
capabilities, providing a global name space. Polyserve states that
there are no central lock or metadata servers, so there is no single
point of failure.
The servers that are part of the Matrix Server network can then export
the file system via NFS to compute nodes within a cluster. Since
there are up to 16 servers in the Matrix Server, each server could
provide NFS service for a portion of the cluster. Also, since the file system
is global, if one server goes down, another server can provide NFS
services to the nodes the original server was servicing.
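The idea of splitting the cluster across the NFS-exporting servers, and
moving clients when a server fails, can be sketched in a few lines of
Python. This is only an illustration of the concept, not Polyserve's
actual mechanism: the round-robin policy, the host names, the 256-node
cluster size, and both function names are assumptions.

```python
def assign_clients(clients, servers):
    """Spread compute nodes round-robin over the NFS-exporting servers."""
    if not servers:
        raise ValueError("need at least one server")
    return {c: servers[i % len(servers)] for i, c in enumerate(clients)}


def fail_over(assignment, failed, servers):
    """Move the clients of a failed server onto the surviving servers."""
    survivors = [s for s in servers if s != failed]
    if not survivors:
        raise ValueError("no surviving servers")
    orphans = [c for c, s in assignment.items() if s == failed]
    updated = dict(assignment)
    for i, client in enumerate(orphans):
        updated[client] = survivors[i % len(survivors)]
    return updated


servers = [f"nas{i:02d}" for i in range(16)]    # up to 16 servers
clients = [f"node{i:03d}" for i in range(256)]  # cluster size is made up

before = assign_clients(clients, servers)
after = fail_over(before, "nas03", servers)

moved = [c for c in clients if before[c] != after[c]]
print(f"{len(moved)} clients moved off nas03")
```

Because every server sees the same global name space, a client that is
moved to another server still sees the same files; only the NFS
endpoint changes.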
Panasas
Panasas is one of the storage vendors contending for a part of the HPC
market. Their ActiveScale Storage Cluster is a high-speed, scalable,
global storage system that uses an object-based file system called
ActiveScale. Panasas couples their file system with a proprietary,
but commodity-based, storage system termed Panasas Storage Blades.
The basic storage unit consists of a 4U chassis and a
number of blades that fit into the chassis with direct attached
storage (hard drives). In each chassis is also a director blade that
is in essence a part of the file system.
This file system is one of the unique features of Panasas' storage system.
ActiveScale turns files into objects and then dynamically distributes
the data activity across Panasas Storage Blades. The role of the
director blade is to virtualize the data objects (the files) and put
them onto the storage blades. This is a unique concept where the
storage finds the data rather than the usual approach of the data
looking for the storage.
Each chassis that is part of the Panasas Storage Cluster is called a
shelf. Each shelf can hold up to two director blades and 10 storage
blades, creating up to 5 TB (Terabytes) of space across the 10
storage blades (500 Gigabytes of data per blade). Each shelf also has
a built-in Gigabit Ethernet (GigE) switch for traffic within the chassis and
for traffic to other shelves or outside the storage system.
Panasas claims that their Storage Cluster system can achieve
a data throughput of up to 10 Gigabytes per second.
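The shelf numbers above make sizing a simple calculation. The Python
sketch below uses only the figures quoted in the text (10 storage
blades per shelf, 500 Gigabytes per blade); the function names and the
42 TB target are arbitrary, hypothetical examples, and the result is
raw capacity rather than a vendor sizing guide.

```python
import math

BLADES_PER_SHELF = 10  # storage blades per shelf (from the text)
GB_PER_BLADE = 500     # capacity per storage blade (from the text)


def shelf_capacity_gb():
    """Raw capacity of one fully populated shelf, in Gigabytes."""
    return BLADES_PER_SHELF * GB_PER_BLADE


def shelves_needed(target_tb):
    """Smallest number of full shelves that reaches the target capacity."""
    return math.ceil(target_tb * 1000 / shelf_capacity_gb())


print(shelf_capacity_gb() / 1000, "TB per shelf")
print(shelves_needed(42), "shelves for a 42 TB target")  # arbitrary target
```

A full shelf comes to 5 TB, matching the figure above, so the
hypothetical 42 TB target needs nine shelves.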
The ActiveScale Storage Cluster is very useful for providing high-speed
storage within a cluster. A typical approach for HPC clusters would be
to have a dedicated network for parallel communication and to attach each
node of the cluster to a storage network where the ActiveScale
Storage Cluster is attached. Then each node can communicate directly
with the file system.
Terrascale
Terrascale Technologies has a software-only solution for
high-performance storage for clusters. Their product, TerraGrid, uses
standard Linux file system tools such as md, lvm, and evms
in conjunction with Linux file systems such as ext2 to create a
global name space. The key to TerraGrid is the use of the iSCSI
protocol together with proprietary drivers and file system patches to
unify the storage space across multiple servers. It supports native
Linux file systems and can export the file system using
NFS and CIFS (for those occasional Windows holdouts).
TerraGrid is a global name space file system. It uses the md
tools in Linux to aggregate the space
together, presenting the file system layer with a large multi-port
virtual hard disk.
In tests, TerraGrid-enabled compute nodes can sustain 100 Mbyte/sec
(Megabytes per second) of single-stream I/O
(Input/Output) performance. It scales fairly linearly to hundreds of
nodes until either the network or the pool of I/O servers is
saturated.
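That scaling behavior can be captured with a very simple model:
aggregate throughput grows with the number of nodes until it hits the
first limit. The Python sketch below is a toy model, not a TerraGrid
benchmark. Only the 100 Mbyte/sec per-node figure comes from the text;
the function name, the network capacity, and the I/O server numbers are
hypothetical.

```python
def aggregate_mb_per_s(nodes, per_node=100.0, network_cap=None, server_cap=None):
    """Toy model: linear scaling until the network or I/O servers saturate."""
    demand = nodes * per_node
    limits = [demand]
    # Any cap that is supplied acts as a ceiling on the total throughput.
    limits += [cap for cap in (network_cap, server_cap) if cap is not None]
    return min(limits)


# Hypothetical limits: 8 I/O servers at 400 Mbyte/sec each, and a
# network that can carry 10,000 Mbyte/sec in total.
SERVER_CAP = 8 * 400.0
NETWORK_CAP = 10_000.0

for n in (16, 32, 64):
    total = aggregate_mb_per_s(n, network_cap=NETWORK_CAP, server_cap=SERVER_CAP)
    print(f"{n:3d} nodes -> {total:.0f} Mbyte/sec")
```

With these made-up limits, throughput doubles from 16 to 32 nodes and
then stays flat at 64 nodes because the I/O server pool is saturated.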
Data Direct Networks
While Data Direct Networks (DDN) does not deliver a complete storage
solution with both a storage system and a file system, they are a major
distributor in the HPC and cluster market of a robust, high-speed,
scalable storage system. Their S2A8500 storage system can achieve
1.5 Gigabytes/sec in
sustained throughput with either Fibre Channel disks or Serial ATA
(SATA) disks. The company says that they can scale from a handful of
disks to over 1,000 disks. This corresponds to tens of Terabytes in
space up to over a Petabyte of storage. This allows the throughput
to scale from 1.5 Gigabytes/sec to tens of Gigabytes/sec.
The S2A8500 comes either as a single 2U box with
four 2 GB/sec ports or as two 4U boxes with eight 2 GB/sec ports. It can
support up to 20 Fibre Channel loops with Fibre Channel or SATA disks.
It can accommodate up to 1,120 disks, resulting in up to 130 TB using
Fibre Channel disks or 250 TB using SATA disks. The controllers can
be configured in a SwiftCluster configuration to achieve over 1
Petabyte in storage.
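As a quick sanity check on those capacity figures, the short Python
sketch below divides the quoted totals by the quoted maximum disk count.
The results are simple averages and should not be read as drive
specifications, since the text does not say whether the totals are raw
or usable capacity; the function name is hypothetical.

```python
MAX_DISKS = 1120  # maximum disk count quoted in the text


def implied_gb_per_disk(total_tb, disks=MAX_DISKS):
    """Average capacity per disk implied by a quoted total capacity."""
    return total_tb * 1000 / disks


fc = implied_gb_per_disk(130)    # Fibre Channel configuration
sata = implied_gb_per_disk(250)  # SATA configuration

print(f"Fibre Channel: about {fc:.0f} GB per disk")
print(f"SATA:          about {sata:.0f} GB per disk")
```

The averages come out to about 116 GB per Fibre Channel disk and about
223 GB per SATA disk.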
The file system built using the storage
system can be exported and mounted on the compute nodes using a
variety of schemes. For instance, you can use normal NFS or connect
the compute nodes directly using Fibre Channel networking.