|
Page 1 of 2
Spinning platters is what matters
A convergence of several key technologies has begun to change some of
the old assumptions about how to build clusters. These technologies
are high-speed affordable interconnects, high-speed file systems and
storage network protocols, MPI implementations with MPI-IO, and high
performance processors. This nexus of technologies is allowing the
next possible steps in cluster evolution to become
mainstream - diskless nodes.
Diskless nodes don't have local disks for storage so they need to
have access to some storage system to function as computational
nodes. In some cases the data needs to be seen by all nodes in the
cluster. This is sometimes referred to as a "global file system"
or a "shared" file system. In other cases the data can just be
replicated on the nodes so that each node has it's own copy. I will
refer to this as "local" storage.
Which type of storage do you need? Good question grasshopper.
The answer is, "it depends upon your application(s)." I hate to
give such a vague answer, but it's the truth." But let me explain.
If every process in your application (I'm assuming MPI here) needs
to read the same input file, but does not need to read or
write any data during the run that needs to be shared b all processes,
then you can use just local storage. If your code doesn't meet this
requirement, then you need a shared file system. Of course, there
is always that third case where the code doesn't necessarily need
a shared file system, but it makes things easier and the users run
using local storage anyway. We'll ignore this third possibility for
now.
In the past, diskless nodes have used NFS as a way to provide
global cluster wide storage. While used on many clusters NFS can
handle the basics, but when pushed, is not adequate for many I/O
heavy applications. Fortunately, there are a variety of storage
system options such as AFS or high-speed parallel file systems
such as Lustre, IBRIX, GPFS, Terrascale, or Panasas, and file systems
that use network storage protocols such as iSCSI, HyperSCSI, or
ATA-Over-Ethernet (AOE). Also, the growing availability of MPI
implementations that include MPI-IO allow codes that utilize its
capabilities to use alternative file systems. This capability broadens
the available file systems to include PVFS2.
In this article I'd like discuss these file systems
and how they enable diskless cluster nodes. In this article I'll focus
on small to medium clusters, up to perhaps 64 or 128 nodes. In the
next article I'll focus on larger clusters, over 128 nodes, including
clusters with over 1,000 nodes.
Assumptions
To discuss diskless systems, let's assume that each compute node
does not have a hard drive. The necessary OS files are either
stored in memory locally (RAM Disk), or are mounted on the node
via NFS, or are available through a network based file system.
While this assumptions seems straight forward and reasonable
you would be suprised how many people have a fundamental problem
with nodes without disks.
As I mentioned previously in this article I will focus on what I call
small to medium systems. These systems are interesting because there
are so many options
available. Some of the options are open-source and some are commercial
products. However, they all provide the storage capability that
diskless nodes need.
Let's loo at the file systems that enable diskless
nodes to be possible, starting with our old friend - NFS.
NFS
The first storage capability that clusters of this size should consider
is good old
NFS. NFS
has been around for a few years and while slow compared to other
file systems it
is very well known and understood. Moreover, for small to medium
systems it has been found to be very stable. It uses a basic
client/server model with the server exporting a file system to various
clients who mount the file system.
One of the goals of a networked file system such as NFS is to allow
all systems that mount the file system to access the exact same
files. In the case of clusters this allows all of the nodes within
the cluster to access and share the same data. However, be
careful when writing data to the same file from more than one
node because NFS is stateless and you likely not to get the results
you expected. In other words - don't do it.
The history of NFS is very interesting. With the explosion of network
computing came the need for a user to access a file on a remote machine
in the same or similar way that they access a local file. Sun
developed NFS (Network File System) to fill this need. NFS is now
a standard and is in version 3, with version 4 coming out soon. Since
it is a standard, the client and server can run different operating
systems yet share the same data.
The storage on the server can take many forms. Probably the simplest
form is for the NFS server to have internal disks. For smaller
clusters, the NFS server and the master node can be one in the same.
The internal disks can be configured in any way you like. To make
life easier down the road, it is a good idea to configure the disks
using lvm. This step allows the storage space to be easily grown or
shrunk as needed. If the file system can tolerate shrinking or growing
the file system, while it's on-line, then you can avoid having to
reformat the file system and restore the data from a backup.
Once the drives are configured, and the file system formatted on the
drives, then the file /etc/exports is edited to "export" the
desired file systems. Details of this process, if you are having
trouble, are at the Linux NFS
website. There is also a
Howto on configuring NFS
and for
performance tuning.
There has also been some good discussion on the NFS mailing list
about good options to choose for both exporting the file system and
for mounting the file system. In addition, there have been some
comments about NFS for diskless systems. The sidebar contains some
links to good discussions from the NFS mailing list.
Another option for using NFS is to buy a preconfigured NAS
(Network Attached Storage) box. Many of these boxes are basically servers
stuffed with a bunch of disks. Other NAS boxes have some special hardware
and software to improve the performance. NAS boxes have storage with a
file system that it exports to the cluster using NFS.
There are many, many vendors who sell basic NAS boxes. Quite a number
of them are of the basic white-box variety. If you want to purchase
boxes from these vendors be sure to check whether they have delivered
tuned NAS boxes for Linux. Be sure to ask for as many details
as possible. Also, ask for performance guarantees from them or ask
for a test box that you can use for testing.
There are other NAS box companies such as
Netapp,
EMC,
Agami,
and
BlueArc that
provide higher performance NAS boxes. Whether these companies can
deliver cost effective equipment for small to medium clusters is up
to you to decide. Again, be sure to do your homework and ask for a
test box or a performance guarantee.
PVFS and PVFS2
I've written several columns about
PVFS and
PVFS2. PVFS, like NFS,
provides a global file system for the cluster. However, this
file system
breaks the mold of a typical file system in that it is a specialized
file system intended as a high-speed parallel scratch file system.
However, for codes to take true advantage of PVFS, they need to be,
ideally, ported to use MPI-IO.
PVFS usually works well, if you use it in combination with either
another global file system such as NFS, or local file systems.
The primary reason for this is that PVFS does not currently allow
binaries to be run from the file system. So you cannot copy your
executable to a PVFS file system and run it from there. So, you need
to run the executable from another location, such as an NFS file system
or a local file system.
Like iSCSI and HyperSCSI, PVFS has a very flexible configuration.
PVFS is a virtual file system. That is to say, it is built on top
of an existing
file system such as ext2, ext3, jfs, reiserfs, or xfs. You select
which machines will run PVFS, either as an I/O storage node or a
metadata node. Currently PVFS2 supports TCP, Infiniband, and Myrinet
interconnects. It also supports multiple network connections (mulithome).
Configuring PVFS2 is very easy. You select some machines with
disks as servers (although you can use a single machine for this
purpose). A machine can act as a server and/or a metadata server.
Then PVFS2 is installed and configured on these machines. The client
software is installed on the diskless nodes to mount the PVFS file
system.
The
PVFS2 FAQ
gives some good guidance on selecting the number of
servers. However, for small clusters that only run one job and have
only one PVFS server, NFS may give better results. For larger
clusters, having more than one PVFS server will definitely help.
PVFS2 is very flexible allowing many possible configurations.
For example, one can take a few servers that have several disks
and use them as PVFS2 servers. To make things easier I recommend
using md to combine the block devices on the disks into a
single block device. If you think you might need additional storage
space at some point, then it would be a good idea to use lvm
with the disks to allow more storage space to be easily added. Of
course, you could use the disks in a RAID configuration, either
hardware or software RAID, to gain some resilience, and use lvm
on top of the RAID-ed block devices.
One other option is to use multiple networks from the servers to the
clients. Many server boards come with multiple GigE ports, so it's
very easy to just add another set of GigE switches to the network.
This extra network also gives you some resilience in case of network
failure.
AFS
The Andrew File System
or AFS grew out of the
Andrew Project
which was a joint project between Carnegie Mellon University and IBM. It
was very much a project focused on the needs of academia, but IBM took
much of the project and commercialized it. Of of these projects was AFS.
AFS is somewhat similar to NFS in
that you can access the same data on any AFS client, but
is designed to be more scalable and more secure than NFS. AFS is distributed
so you can have multiple AFS servers to help spread the load throughout
the cluster. To add more space, you can easily add more servers. This is
more difficult to do with NFS.
One of the features
of AFS is that the clients cache data (either on a local hard disk or in memory). Consequently, file access is much
quicker than NFS. However, this does require a mechanism for ensuring file
consistently across the file system and can cause a slowdown with shared file writes. Also, if a server goes down, then other
clients can access the data by contacting a client that has cached the data.
However while, the server is down, the cached data can not be updated
(but it is readable). The caching feature of AFS allows it to scale better
than NFS as the number of clients is increased.
iSCSI
An option for providing local storage is
iSCSI. I've talked about
iSCSI before, but just to recap, it stands for Internet SCSI (Small
Computer System Interface). The iSCSI protocol defines how to encapsulate
SCSI commands and data requests into IP packets and sends them over
an Ethernet connection, unpacks them, and creates a local stream of
SCSI commands or data requests.
In an iSCSI configuration, a server, which is called a target, provides
storage space in the form of block devices that are "exposed" over
an Ethernet network. A client, called an initiator, mounts the block
device(s) and builds a file system on them and then uses it as though
it it were local storage. One important point to make is that iSCSI
will only provide a file system local to each node (initiator) that
as far as the OS is concerned is a local disk. NFS on the other hand
provides a global file system so that all of the nodes see the same
files.
|