Plenty of Options for Plenty of Files
Welcome to Cluster Money (aka' "The Monkey") and the Cluster File Systems Column. This column will be taking a look at these new file systems, how they work, how they fit into the HPC world, and how you can deploy them for maximum effect. It will also explore some of the details of file systems to help you understand such things as the difference between metadata and inodes. It will also discuss some of the underlying hardware that is important to file systems such as networking and the data storage devices itself.
This first column will present a brief overview of many of the file systems that are available today for clusters as well as some storage options for the file systems. This includes true parallel file systems, new network file systems, and even a storage option for high performance storage for clusters. The list isn't extensive but is intended to wet your appetite for more information about the explosion of storage that is happening.
Storage has become a very important part of clusters. Initially people used use NFS to distributed data to all of the nodes of the cluster. However, as clusters grew to hundreds and thousands of nodes, and the demand for increased data I/O rates grew, people realized that NFS was not going to cut it. So they used somewhat kludgy systems for a while or turned to other ideas such as PVFS (Parallel Virtual File System). Recently companies have started to realize that the market for HPC file systems is larger than they thought and largely untapped. Coupling this with very high-speed, low-cost networks such as Infiniband, results in the right time for an explosion of HPC file systems. This article will present a brief overview of some of the file systems that are available today for clusters. The list isn't extensive but is intended to wet your appetite for more information about the explosion of storage that is happening.
IBRIX
IBRIX is a relatively new company offering a distributed file system that presents itself as a global name space to all the clients. IBRIX' Fusion product is a software only product takes whatever data space you designate on what ever machines you choose and creates a global, parallel, distributed file system on them. This file system, or "name space," can be mounted by clients who can share the same data with all of the other clients. In essence, each client sees that exact same data, hence the phrase, "single" or "global" name space. The key to Fusion is that the common bottlenecks in parallel global file systems have been removed. Consequently, the file systems scales almost linearly with the number of data servers (also called IO servers). This architecture allows the file system to grow to tens of Petabytes (a Petabyte is about 1,000 Terabytes or about 1,000,000 Gigabytes). It can also achieve IO (Input/Output) speeds of Tens of Gigabytes per second for large or small files.
IBRIX has automatic fail-over as well as metadata journaling to speed recover in case of a crash. Perhaps more importantly IBRIX has developed a distributed metadata capability so losing several nodes will not result in losing access to any data. This unique feature also allows parts of the name space to be taken off line for maintenance, upgrades, or even backups, while the rest of the name space stays on-line. You can also add storage space while Fusion is running and it will automatically incorporate it. It can also export the file system using NFS or CIFS (for the Windows users that haven't gotten a clue yet.
It's easy to see that Fusion could be deployed in an HPC cluster by using all of the latent space available on the compute nodes. Since most nodes come with at least something like a 40 Gig or 80 Gig had drive and the OS only takes about 2-4 GB (Gigabytes) of space, you have some extra space to do something with. Fusion allows you to combine that extra space and create a global name space for all of the nodes within the cluster. Alternatively you could choose a few nodes and load them with storage space, create a global name space using the data servers, and mount it on the client nodes at speeds much faster than traditional NFS. These clients nodes don't need a local disk so you can run them diskless.
IBRIX Fusion currently has some limitations. It isn't 64-bit (yet), and requires IP for it's networking. Also, it doesn't support SUSE on the client nodes. However, IBRIX is aware of these issues and is working to provide all of these features.
Also, IBRIX Fusion comes bundled with a number of systems. For example, Dell is shipping Fusion with some of its cluster products. Also, recently, Rackable Systems has announced an OEM agreement with IBRIX. In addition, Scali has announced a reseller arrangement with IBRIX.
Polyserve
Polyserve Inc. has a unique product in the storage world. The Polyserve Matrix Server takes up to 16 SAN (Storage Area Network) attached servers and creates a high-performance, low-cost NAS (Network Attached Storage) system.
Polyserve takes low-cost PC Servers running Linux that are attached via a Fibre Channel (FC) network to a SAN and installs their proprietary file system. This file system is a true symmetric file system with high availability services and cluster and storage management capabilities, providing a global name space. Polyserve states that there is not central lock of metadata servers so there is no single point of failure. It provides a global name space.
The servers that are part of the Matrix Server network can then export the file system via NFS to compute nodes within a cluster. Since there are up to 16 servers in the Matrix Server, each server could NFS support for a portion of the cluster. Also, since the file system is global, if one server goes down, another server can provide NFS services to the nodes the original server was servicing.
Panasas
Panasas is one of the storage vendors contending for a part of the HPC market. Their ActiveScale Storage Cluster is a high-speed, scalable, global, storage system that uses an object based file system called ActiveScale. Panasas couples their file system with a proprietary, but commodity based, storage system termed Panasas Storage Blades. The basic storage unit consists of a 4U chassis and a number of blades that fit into the chassis with direct attached storage (hard drives). In each chassis is also a director blade that is in essence a part of the file system.
This file system is one of the unique features of Panasas' storage system. ActiveScale turns files into objects and then dynamically distributes the data activity across Panasas Storage Blades. The role of the director blade is to virtualize the data objects (the files) and put them onto the storage blades. This is a unique concept where the storage finds the data rather than the usual approach of the data looking for the storage.
Each chassis that is part of the Panasas Storage Cluster is called a shelf. Each shelf can hold up to two director blades and 10 storage blades, creating up to 5 TB (Terabytes) of space across the 10 storage blades (500 Gigabytes of data per blade). Each shelf also has a built-in Gigabit (GigE) switch for traffic within the chassis and for traffic within other shelves or outside the storage system. Panasas claims that their Storage Cluster system can achieve a data throughput of up to 10 Gigabytes per second.
The ActiveScale Storage Cluster is very useful for providing high-speed storage within a cluster. A typical approach for HPC clusters would have a dedicated network for parallel communication and attach each node of the cluster to a storage network where the ActiveScale Storage Cluster is attached. Then each node can communicate directly to the file system.
Terrascale
Terrascale Technologies has a software only solution for high-performance storage for clusters. Their product, TerraGrid, uses standard Linux file system tools such as md, lvm, and evms in conjunction with Linux file systems such as ext2 for a global name space. The key to TerraGrid is the use of the iSCSI protocol together with proprietary drivers and file system patches to unify the storage space across multiple servers. It supports native Linux file systems and can export the file system using NFS and CIFS (for those occasional Windows hold out).
TerraGrid is a global name space file system. It uses the md tools in Linux to aggregate the space together, presenting the file system layer with a large multi-port virtual hard disk.
In tests TerraGrid enable compute nodes can sustain 100 Mbyte/sec (Megabytes per second) of single-stream I/O (Input/Output) performance. It scales fairly linearly to hundreds of nodes until either the network or the pool of I/O servers is saturated.
Data Direct Networks
While Data Direct Networks (DDN) does not deliver a complete storage solution with a storage system and a file system, they are a major distributor in the HPC and cluster market for a robust high-speed scalable storage system. Their S2A8500 storage system can achieve 1.5 Gbytes/sec in sustained throughput with either Fibre Channel disks or Serial ATA (SATA) disks. The company says that they can scale from a handful of disks to over 1,000 disks. This corresponds to tens of Terabytes in space up to over a Petabyte of storage. This allows the throughput to scale from 1.5 Gigabytes/sec to tens of Gigabytes/sec.
The S2A8500 is a 2U box that supports four 2 GB/sec ports or two 4U boxes with eight 2 GB/sec ports. It can support up to 20 Fibre-Channel loops supporting Fibre or SATA disks. It can accommodate up to 1,120 disks resulting in up to 130 TB using Fibre Channel disks or 250 TB using SATA disks. The controllers can be configured in a SwiftCluster configuration to achieve over 1 Petabyte in storage.
The file system built using the storage system can be exported and mounted on the compute nodes using a variety of schemes. For instance, you can use normal NFS or connect the nodes using Fibre Channel networking for the compute nodes.