
Spinning platters is what matters

A convergence of several key technologies has begun to change some of the old assumptions about how to build clusters. These technologies are affordable high-speed interconnects, high-speed file systems and storage network protocols, MPI implementations with MPI-IO, and high performance processors. This nexus of technologies is allowing the next step in cluster evolution to become mainstream - diskless nodes.

Diskless nodes don't have local disks for storage, so they need access to some storage system in order to function as computational nodes. In some cases the data needs to be seen by all nodes in the cluster. This is sometimes referred to as a "global file system" or a "shared" file system. In other cases the data can just be replicated on the nodes so that each node has its own copy. I will refer to this as "local" storage.

Which type of storage do you need? Good question, grasshopper. The answer is, "it depends upon your application(s)." I hate to give such a vague answer, but it's the truth. Let me explain. If every process in your application (I'm assuming MPI here) needs to read the same input file, but does not need to read or write any data during the run that must be shared by all processes, then you can use just local storage. If your code doesn't meet this requirement, then you need a shared file system. Of course, there is always that third case where the code doesn't necessarily need a shared file system but one would make things easier, and the users run with local storage anyway. We'll ignore this third possibility for now.

In the past, diskless nodes have used NFS to provide global, cluster-wide storage. NFS is used on many clusters and can handle the basics, but when pushed it is not adequate for many I/O-heavy applications. Fortunately, there are a variety of storage options, such as AFS, high-speed parallel file systems such as Lustre, IBRIX, GPFS, Terrascale, or Panasas, and file systems built on network storage protocols such as iSCSI, HyperSCSI, or ATA-over-Ethernet (AoE). Also, the growing availability of MPI implementations that include MPI-IO allows codes that use its capabilities to take advantage of alternative file systems. This capability broadens the available file systems to include PVFS2.

In this article I'd like to discuss these file systems and how they enable diskless cluster nodes, focusing on small to medium clusters of up to perhaps 64 or 128 nodes. In the next article I'll focus on larger clusters, over 128 nodes, including clusters with over 1,000 nodes.

Assumptions

To discuss diskless systems, let's assume that each compute node does not have a hard drive. The necessary OS files are either stored in memory locally (RAM disk), mounted on the node via NFS, or available through a network-based file system. While this assumption seems straightforward and reasonable, you would be surprised how many people have a fundamental problem with nodes that have no disks.
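
As a purely illustrative example, here is roughly what a diskless node's boot entry might look like when booting over PXE with pxelinux and mounting its root file system over NFS. The server address and paths are placeholders, and the node's kernel must be built with NFS-root support.

    # /tftpboot/pxelinux.cfg/default  (hypothetical paths and server address)
    default linux
    label linux
        kernel vmlinuz-node
        append root=/dev/nfs nfsroot=192.168.1.1:/export/rootfs ip=dhcp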

As I mentioned previously, in this article I will focus on what I call small to medium systems. These systems are interesting because there are so many options available. Some of the options are open source and some are commercial products. However, they all provide the storage capability that diskless nodes need.

Let's look at the file systems that make diskless nodes possible, starting with our old friend - NFS.

NFS

The first storage capability that clusters of this size should consider is good old NFS. NFS has been around for many years, and while it is slow compared to other file systems, it is very well known and understood. Moreover, for small to medium systems it has been found to be very stable. It uses a basic client/server model, with the server exporting a file system to various clients that mount it.

One of the goals of a networked file system such as NFS is to allow all systems that mount the file system to access the exact same files. In the case of clusters, this allows all of the nodes within the cluster to access and share the same data. However, be careful when writing data to the same file from more than one node: because NFS is stateless, you are likely not to get the results you expect. In other words - don't do it.

The history of NFS is very interesting. With the explosion of network computing came the need for a user to access a file on a remote machine in the same or similar way that they access a local file. Sun developed NFS (Network File System) to fill this need. NFS is now a standard and is in version 3, with version 4 coming out soon. Since it is a standard, the client and server can run different operating systems yet share the same data.

The storage on the server can take many forms. Probably the simplest form is for the NFS server to have internal disks. For smaller clusters, the NFS server and the master node can be one and the same. The internal disks can be configured in any way you like. To make life easier down the road, it is a good idea to configure the disks using lvm. This step allows the storage space to be easily grown or shrunk as needed. If the file system can be grown or shrunk while it is on-line, then you can avoid having to reformat it and restore the data from a backup.
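
A rough sketch of that approach is shown below: two internal disks are pooled into an LVM volume group and a logical volume is carved out that can be grown later. The device names and sizes are placeholders, and on-line growth of ext3 needs a reasonably recent kernel and e2fsprogs.

    pvcreate /dev/sdb1 /dev/sdc1              # mark the partitions as LVM physical volumes
    vgcreate vg_export /dev/sdb1 /dev/sdc1    # pool them into one volume group
    lvcreate -L 200G -n lv_home vg_export     # carve out a logical volume
    mkfs.ext3 /dev/vg_export/lv_home          # format it

    # later, to grow the volume and resize the file system
    lvextend -L +100G /dev/vg_export/lv_home
    resize2fs /dev/vg_export/lv_home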

Once the drives are configured and the file system formatted on them, the file /etc/exports is edited to "export" the desired file systems. Details of this process, if you are having trouble, are at the Linux NFS website. There is also a Howto on configuring NFS and on performance tuning.
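
As a minimal sketch, an /etc/exports file on the server might contain entries like these; the network range and export options are only examples, so see the Howto for choices that fit your cluster.

    # /etc/exports on the NFS server
    /home          192.168.1.0/24(rw,sync,no_root_squash)
    /opt/shared    192.168.1.0/24(ro,sync)

    # tell the server to (re)read the exports table
    exportfs -ra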

There has also been some good discussion on the NFS mailing list about good options to choose for both exporting the file system and for mounting the file system. In addition, there have been some comments about NFS for diskless systems. The sidebar contains some links to good discussions from the NFS mailing list.
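
Without repeating those threads here, a client-side /etc/fstab entry often ends up looking something like the line below; the rsize/wsize values and other options are only a starting point and should be tested against your own workload.

    # /etc/fstab on a compute node (example values only)
    master:/home   /home   nfs   rw,hard,intr,tcp,rsize=32768,wsize=32768   0 0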

Another option for using NFS is to buy a preconfigured NAS (Network Attached Storage) box. Many of these boxes are basically servers stuffed with a bunch of disks. Other NAS boxes have special hardware and software to improve performance. A NAS box has storage with a file system that it exports to the cluster using NFS.

There are many, many vendors who sell basic NAS boxes. Quite a number of them are of the basic white-box variety. If you want to purchase boxes from these vendors, be sure to check whether they have delivered tuned NAS boxes for Linux. Be sure to ask for as many details as possible. Also, ask for performance guarantees, or ask for a test box that you can use for evaluation.

There are other NAS box companies such as Netapp, EMC, Agami, and BlueArc that provide higher performance NAS boxes. Whether these companies can deliver cost effective equipment for small to medium clusters is up to you to decide. Again, be sure to do your homework and ask for a test box or a performance guarantee.

PVFS and PVFS2

I've written several columns about PVFS and PVFS2. PVFS, like NFS, provides a global file system for the cluster. However, PVFS breaks the mold of a typical file system in that it is specialized, intended as a high-speed parallel scratch file system. To take true advantage of PVFS, codes ideally need to be ported to use MPI-IO.

PVFS usually works well if you use it in combination with either another global file system such as NFS, or local file systems. The primary reason for this is that PVFS does not currently allow binaries to be run from the file system, so you cannot copy your executable to a PVFS file system and run it from there. Instead, you need to run the executable from another location, such as an NFS file system or a local file system.

Like iSCSI and HyperSCSI, PVFS has a very flexible configuration. PVFS is a virtual file system. That is to say, it is built on top of an existing file system such as ext2, ext3, jfs, reiserfs, or xfs. You select which machines will run PVFS, either as I/O storage nodes or as metadata nodes. Currently PVFS2 supports TCP, InfiniBand, and Myrinet interconnects. It also supports multiple network connections (multihoming).

Configuring PVFS2 is very easy. You select some machines with disks as servers (although you can use a single machine for this purpose). A machine can act as an I/O server, a metadata server, or both. Then PVFS2 is installed and configured on these machines. The client software is installed on the diskless nodes to mount the PVFS file system.
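
Here is a hedged sketch of what that looks like with PVFS2's own tools. The hostname, paths, and port are placeholders, and the exact steps vary a bit between PVFS2 releases, so treat this as an outline rather than a recipe.

    # on one machine: generate the configuration (answer the prompts)
    pvfs2-genconfig /etc/pvfs2/fs.conf

    # on each server: create the storage space once, then start the server daemon
    pvfs2-server -f /etc/pvfs2/fs.conf     # one-time storage-space creation
    pvfs2-server /etc/pvfs2/fs.conf        # start the daemon

    # on each diskless client: load the kernel module, start the client daemon, mount
    modprobe pvfs2
    pvfs2-client
    mount -t pvfs2 tcp://io01:3334/pvfs2-fs /mnt/pvfs2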

The PVFS2 FAQ gives some good guidance on selecting the number of servers. However, for small clusters that only run one job and have only one PVFS server, NFS may give better results. For larger clusters, having more than one PVFS server will definitely help.

PVFS2 is very flexible, allowing many possible configurations. For example, one can take a few servers that have several disks and use them as PVFS2 servers. To make things easier, I recommend using md to combine the block devices on the disks into a single block device. If you think you might need additional storage space at some point, then it would be a good idea to use lvm with the disks so that more storage space can easily be added. Of course, you could use the disks in a RAID configuration, either hardware or software RAID, to gain some resilience, and use lvm on top of the RAID-ed block devices.
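
One hedged sketch of that layering on a single server, with illustrative device names, RAID level, and sizes:

    # combine four disks into one software RAID-5 block device
    mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

    # put lvm on top so the space can be grown later
    pvcreate /dev/md0
    vgcreate vg_pvfs /dev/md0
    lvcreate -L 500G -n lv_pvfs vg_pvfs

    # format and mount as the PVFS2 storage space for this server
    mkfs.ext3 /dev/vg_pvfs/lv_pvfs
    mount /dev/vg_pvfs/lv_pvfs /pvfs2-storage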

One other option is to use multiple networks from the servers to the clients. Many server boards come with multiple GigE ports, so it's very easy to just add another set of GigE switches to the network. This extra network also gives you some resilience in case of network failure.

AFS

The Andrew File System, or AFS, grew out of the Andrew Project, a joint project between Carnegie Mellon University and IBM. It was very much a project focused on the needs of academia, but IBM took much of the work and commercialized it. One of the results was AFS.

AFS is somewhat similar to NFS in that you can access the same data on any AFS client, but is designed to be more scalable and more secure than NFS. AFS is distributed so you can have multiple AFS servers to help spread the load throughout the cluster. To add more space, you can easily add more servers. This is more difficult to do with NFS.

One of the features of AFS is that the clients cache data (either on a local hard disk or in memory). Consequently, file access is much quicker than with NFS. However, this does require a mechanism for ensuring file consistency across the file system, which can cause a slowdown with shared file writes. Also, if a server goes down, other clients can still access the data by contacting a client that has cached it. However, while the server is down, the cached data cannot be updated (but it is readable). The caching feature of AFS allows it to scale better than NFS as the number of clients increases.

iSCSI

An option for providing local storage is iSCSI. I've talked about iSCSI before, but just to recap, it stands for Internet SCSI (Small Computer System Interface). The iSCSI protocol defines how to encapsulate SCSI commands and data requests into IP packets, send them over an Ethernet connection, unpack them, and recreate the local stream of SCSI commands or data requests at the other end.

In an iSCSI configuration, a server, called a target, provides storage space in the form of block devices that are "exposed" over an Ethernet network. A client, called an initiator, attaches to the block device(s), builds a file system on them, and then uses them as though they were local storage. One important point to make is that iSCSI only provides a file system local to each node (initiator); as far as the OS is concerned, it is a local disk. NFS, on the other hand, provides a global file system so that all of the nodes see the same files.


There are many options with iSCSI. You can use a central server (target) with a number of disks, or several targets, or several networks, or combinations (as the Vulcans say, "infinite diversity in infinite combinations" -- while not infinite, there are a number of options). iSCSI makes block devices available via the Ethernet network. To create block devices suitable for iSCSI, you can use md (Multiple Devices, or software RAID), lvm, evms, and so on. These tools allow you to build block devices from partitions or from different disks. They also allow you to export block devices from one target to multiple nodes (initiators). iSCSI also allows you to export block devices over various Ethernet networks to various initiators.

Using software RAID (md) on the server (target) machine, you can combine block devices before exposing them to iSCSI initiators. This allows you to apply some kind of RAID protection before exposing the block devices. For example, you could create a software RAID-5 array on the target machine using all of the disks, then use LVM to create volume group(s) and logical volumes across the array, and then expose the logical volumes to nodes (initiators). This way, if a disk is lost, it can be replaced without losing the file system on any node. However, if the server (target) goes down, you lose the storage on the nodes (initiators) it was serving.
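
Assuming, purely for illustration, the iSCSI Enterprise Target (IET) software on the target machine, the steps might look roughly like this; the IQN, device names, and sizes are placeholders.

    # on the target: software RAID-5 across the disks, lvm on top
    mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
    pvcreate /dev/md0
    vgcreate vg_iscsi /dev/md0
    lvcreate -L 50G -n lv_node01 vg_iscsi      # one logical volume per node

    # /etc/ietd.conf: expose each logical volume as its own target
    Target iqn.2006-01.com.example:node01
        Lun 0 Path=/dev/vg_iscsi/lv_node01,Type=fileio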

An alternative is to take a number of servers with disks and expose a set of block devices from each target to a set of nodes, such that each node (initiator) mounts one block device from each of several targets. The node then takes the iSCSI block devices and uses software RAID-5 (or RAID-1) and LVM to create a final block device that is formatted with a file system. This configuration allows an entire target machine to go down without the nodes (initiators) losing their storage, since the final block device is RAID-5 or at least RAID-1, so you still have access to the data. You can also use RAID-5 on the targets so that the loss of a single disk will not interrupt the initiators. This configuration might also have some speed advantages, depending upon how the storage is used.
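
On the initiator side, and assuming the open-iscsi tools, a sketch might look like the following; the target addresses, IQNs, and the device names that the logins produce are placeholders.

    # discover and log in to one block device on each of three targets
    iscsiadm -m discovery -t sendtargets -p 192.168.2.10
    iscsiadm -m node -T iqn.2006-01.com.example:node01 -p 192.168.2.10 --login
    # repeat the discovery and login for the other targets (192.168.2.11, 192.168.2.12)

    # the logins appear as local disks (say /dev/sdb, /dev/sdc, /dev/sdd);
    # build software RAID-5 across them, format, and mount as local scratch
    mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
    mkfs.ext3 /dev/md0
    mount /dev/md0 /scratch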

You can also use striping via RAID or lvm to improve the disk performance on the target prior to exposing the storage block to the initiator(s). However, this will likely shift the bottleneck to the network. You could also stripe on the initiator side by using the block devices from various targets in md to create the final block device for the file system.

Since Gigabit Ethernet (GigE) is relatively inexpensive today, it's possible to have the target machines expose block devices on various networks. This feature allows you to reduce the number of block devices communicating over a given network thus improving throughput.

There are many possible ways to configure an iSCSI storage solution. Using md and lvm or evms, you can create block devices on the targets and expose those to the initiators. Then you can use exposed devices from various targets on a single initiator to get good performance and improve resiliency.

HyperSCSI

HyperSCSI can also be used to provide local storage on the nodes. HyperSCSI is a network storage protocol like iSCSI, but rather than use IP as iSCSI does, it uses its own packets over raw Ethernet. By doing so, it can be more efficient because of the reduction in TCP/IP overhead. However, because it doesn't use IP packets, it's not a routable protocol. For small to medium clusters this is not likely to be an issue.

Configurations for HyperSCSI are conceptually very similar to iSCSI configurations. It uses block devices as iSCSI does, and it uses Ethernet networks. As I said before, the big advantage of HyperSCSI is that it doesn't use IP but its own packets. This feature makes for an extremely efficient network storage protocol and is very well suited to clusters, since they typically don't use routed networks inside the cluster.

Commercial Offerings

There are several commercial options for providing storage, both local and as global file systems. For example, one could use Lustre, IBRIX, GPFS, or Terrascale with various storage devices, or use the Panasas ActiveScale Storage Cluster. One could also use Coraid's ATA-over-Ethernet product to provide local storage for each node in a fashion similar to iSCSI or HyperSCSI.

For smaller clusters, these solutions are likely to be too expensive. For larger clusters, perhaps from 32 nodes and up, they might prove to be a price/performance winner. However, there are some applications that are very I/O intensive and could benefit from a high performance file system regardless of the size of the cluster.

Summary

As you can see, there are a number of options for providing either global storage or local storage to diskless nodes. Depending upon your code(s), you can choose global storage, local storage, or a combination of the two.

For small to medium clusters, which I define as up to 64 or 128 nodes, NFS will work well enough if you have a good storage subsystem behind it and your I/O usage isn't too heavy (high I/O rates can easily kill performance over NFS). In addition, AFS offers some very attractive features compared to NFS, so you should seriously consider it. If you need lots of I/O, then PVFS or PVFS2 will work well, as long as you understand that it is a high-speed scratch file system and not a place for storing your files on a longer-term basis, as a home file system requires.

If you need storage local to each node for running your codes, then either iSCSI or HyperSCSI will work well. Plus, they are very flexible and can be configured in just about any way you want or need. In some cases you might also have to use global storage such as NFS alongside them.

In my next installment I'll discuss commercial options more in depth as I continue discussing file system options for diskless clusters larger than 128 nodes.


Sidebar One: Links Mentioned in Column

Linux NFS

NFS mailing list - a discussion about good performance

NFS mailing list - a discussion about diskless systems

iSCSI on Linux - Article on Configuring iSCSI target and initiators

PVFS

PVFS2


The core of this article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Dr. Jeff Layton hopes to someday have a 20 TB file system in his home computer (donations gladly accepted). He can sometimes be found lounging at a nearby Fry's, dreaming of hardware and drinking coffee (but never during working hours).