Article Index


dCache is an open-source project for a file system to store and retrieve large amounts of data. In a sense it's somewhat similar to AFS, but it is focus is different. The focus of dCache is to store huge amounts of data, such as the data from a physics experiment, on servers that are distributed throughout a cluster or a Grid. It is not necessarily focused on high-speed parallel access to the data, but rather, it is focused on storing large amounts of data efficiently and quickly, and giving access to the data through typical methods (e.g. POSIX commands).

dCache is a virtual file system that allows the files to be accessed using most of the typical access methods. It splits the name space of the file system and where the data is actually located. The name space is managed by a database, usually PostgreSQL. If an application or a client requests access to the data, then the response, the data, is handled by the nfs2 protocol. It can also respond using various ftp filename operations. Similar to AFS, dCache allows the files to exist on any of the various dCache servers, which are referred to as pools. dCache also has the capability of accessing data of a Tertiary Storage Manager which could be used with a HSM system. dCache handles all necessary data transfer between the servers and optionally between the external storage manager and the cache itself.

dCache has something called, dCAP, that is a native protocol that allows regular file access functionality. dCache also includes a C language client implementation of the POSIX functions >open, read, write, seek, stat, and close, as well as standard file system name space operations (available for Solaris, Linux, Irix64, and Windows). This library can be linked against client applications or it may be preloaded to over-ride the file system IO. The library also supports pluggable security modules. Currently, dCache comes with GssApi which is a Kerberos module, and ssl.

The library also performs all actions to survive a network or pool failures. Given that the name space and the data are decoupled, this task is much easier than it could have been. dCache also keeps several copies around in the pools, so this also helps in surviving a data server crash or a network crash.

As mentioned before, dCache can keep several copies of a data in the file system. This capability is part of a several data management capabilities in dCache. For example, dCache allows you to define where you might like to put data among the pools. Normally, dCache distributes the data around the pools in a load-balancing sense. But if you have a preference about where you want certain files you can specify that. As an example, you may want certain types of data or certain groups of data files to be closer to a group working on them. Another way to use what is termed a pool attraction model is to use the hard drives in dCache to accept data from an experiment and then move into the Tertiary Storage System (tape).

Coupled with the pool attraction model is the idea of load balancing. This module monitors file system metrics such as, the number of active data transfers and the age of the least recently used file for each pool. With this information, it can select the most appropriate pool. It can even decide when and if to replicate a file.

The Replica Manager Module within dCache makes sure there are at least N copies of a file distributed over different pool nodes. It also makes sure there are no more than M copies. In addition to making data access quicker, the Replica Manager can also help in data configuration if you need to shut down storage servers without affecting the system availability or to overcome node or disk failures.

So far dCache has mostly found acceptance in the high-energy physics community. dCache is in use at FNAL, DESY, Brookhaven, CERN, SDSC, and Fermi Labs, to name but a few. These labs all handle tremendous amounts of data from science experiments and dCache was developed by these labs to allow the efficient migration of data from these experiments to the physicists that need to examine the data. Currently, dCache has been stressed up to several hundred Terabytes and several hundred data servers.

Parallel File Systems

"Parallel File Systems - what do they look like?." OK, this isn't Real Genius but parallel file systems are a hotly discussed technology for clusters. They can provide high amounts of IO for clusters -- if that's what you truly need. I'm constantly surprised at how many people don't know how IO influences the performance of their codes. They can also provide a centralized file system for your clusters and in many cases you can link parallel file systems from geographically distinct sites so you can access data from other sites as though it was local. Centralized file systems can also ease a management burden and also improve the scalability of your cluster storage. You can also use them for diskless compute nodes by providing a reasonable high performance file system for storage and scratch space.

Parallel File Systems are distinguished from Distributed File Systems because the clients contact multiple storage devices instead of a single device or a gateway. An easy explanation of the difference is that with Clustered NAS devices, the client contacts a single gateway. This gateway contacts the various storage devices and assembles the data to be sent back to the client. A Parallel File System allows the client to directly contact the storage devices to get the data. This can greatly improve the performance by allowing parallel access to the data, using more spindles, and possible moving the metadata manager out of the middle of every file transaction.

There are lots of ways a parallel file system can be implemented. I'm going to categorize the parallel file systems into two groups. The first group uses more traditional methods in the file system such as block based or even file based schemes. And, being traditional is not a bad thing by any stretch. The second group is the next generation in parallel file systems -- object based file systems. We'll start the discussion of parallel file systems by covering the traditional approach then move on to object based systems. The first traditional parallel file system we'll cover is perhaps the oldest and arguably the most mature parallel file system, GPFS.


Until recently, GPFS (General Purpose File System) was only available only on AIX systems, but several years ago IBM ported it to Linux systems. Initially it was available only on IBM hardware, but in about 2005 IBM made GPFS available for OEMs so that it's now available for non-IBM hardware. The only OEM I'm aware of that used it was Linux Networx.

GPFS is a high-speed, parallel, distributed file system. In typical HPC deployments it is configured similarly to other parallel file systems. There are nodes within the cluster that are designated as IO nodes. These IO nodes have some type of storage attached to them whether it is direct attached storage (DAS) or some type of SAN (Storage Area Network) storage. (Note: There are many variations on this theme where you can combine various types of storage.). Then the file system is mounted by the clients and any IO allows the clients to contact the IO nodes, typically using a special driver, for high performance access to the data.

GPFS achieves high-performance by striping data across multiple disks on multiple storage devices. As data is written to GPFS, it stripes successive blocks of each file across successive disks. But it makes no assumptions about the striping pattern. There are 3 striping patterns it uses:

  • Round Robin: A block is written to each LUN in the file system before the first LUN is written to again. The initial order of LUNs is random, but that same order is used for subsequent blocks.

  • Random: A block is written to a random LUN using a uniform random distribution with the exception that no two replicas can be in the same failure group.

  • Balanced Random: A block is written to a random LUN using a weighted random distribution. The weighting comes from the size of the LUN.

To further improve performance, GPFS uses client-side caching and deep prefetching such as read-ahead and write-behind. It recognizes standard access patterns such as sequential, reverse sequential, and random and will optimize IO accesses for these particular patterns. Furthermore GPFS can read or write large blocks of data in a single IO operation.

The amount of time GPFS has been in production has allowed it to develop several more features that can be used to help improve performance. When the file system is created you can chose the blocksize. Currently, 16KB, 64KB, 512KB, 1MB, and 2MB block sizes are supported with 256KB being the most common. Using large block sizes helps improve performance when large data accesses are common. Small block sizes are used when small data accesses are common. But unfortunately, you cannot change the blocksize once a file system has been built without erasing the data.

To help with smaller file sizes, GPFS can subdivide the blocks into 32 sub-blocks. A block is the largest chunk of contiguous data that can be accessed. A sub-block is the smallest chunk of contiguous data that can be accessed. Sub-blocks are useful for files that are smaller than a block and are stored using the sub-blocks. This can help the performance of applications that use lots of small data files (e.g. Life Sciences applications). More over, this is very useful for file systems that have very large block sizes but still have to store some small files.

To create resiliency in the file system, GPFS uses distributed metadata so that there is no single point of failure nor a performance bottleneck. To get HA (High Availability) capabilities, GPFS can be configured using logging and replication. GPFS will log (journal) the metadata of the file system so in the event of a disk failure, it can be replayed so that the file system can be brought to a consistent state. Replication can be done on both the data and the metadata to provide even more redundancy at the expensive of less usable capacity. GPFS can also be configured for fail-over both at a disk level and at a server level. This means that if you lose a node GPFS will not lose access to data nor degrade performance unacceptably. As part of this it can transparently fail-over lock servers and other core GPFS functions, so that the system stays up and performance is at an acceptable level.

As mentioned earlier GPFS has been in use probably longer than any other parallel file system. In the Linux world, there are GPFS clusters with over 2,400 nodes (clients). One aspect of GPFS that should be mentioned in this context is that the GPFS is priced by the node for both IO nodes and clients. So you pay for licenses whether you buy more servers or buy more clients.

In the current version (3.x) GPFS only uses TCP as the transport protocol. In version 4.x it will also have the ability to use native IB protocols (presumably SDP). If NFS protocols are required, then the GPFS IO nodes can act as NFS gateways. Typically this is done for people who want to mount the GPFS file system on their desktop, but also allows you to use GPFS as a clustered NAS file system. Note this is a fairly easy and inexpensive way to get a clustered NAS. GPFS provides a very reliable back-end to the clustered NAS and you can add gateways for increased performance. Plus you only have to pay for the licenses used for IO nodes, not the clients. Moreover, GPFS can also use CIFS for Windows machines. Of course NFS and CIFS are a lot slower than the native GPFS, but it does give people access to the file system using these protocols.

While GPFS has more features and capabilities than I have space to write about, I want to mention two other somewhat unique features. GPFS has a feature called multi-cluster. This allows two different GPFS file systems to be connected over a network. As long as gateways on both file systems can "see" each other on a network, you can access data from both file systems as though they are local. This is a great feature for groups that may be located around the world to easily share data.

The last feature I want to mention is called the GPFS Open Source Portability Layer (GPL). GPFS comes with certain GPFS kernel modules built for a basic distribution. The portability layer allows these GPFS kernel modules to communicate with the Linux kernel. While some people may think this is a way to create a bridge from the GPL kernel to a non-GPL set of kernel modules, it actually serves a very useful purpose. Suppose there is a kernel security notice and you have to build a new kernel. If you are dependent upon a vendor for a new kernel, you may have to wait a while for them to produce a new kernel for you. During this time you are vulnerable to a security flaw. With the portability layer you can quickly rebuild the new modules for the kernel and reboot the system yourself.


Rapidscale is a parallel file system from that Rackable. Rapidscale was bought from what was Terrascale (actually Rackable purchased all of Terrascale). The Rapidscale approach aggregates storage nodes into a continuous file system (single name space). The resulting name space is mounted by each client node and can communicate directly with the data servers using a modified iSCSI protocol. Rapidscale's overall approach focuses on parallelization at the block device layer. In between the application and the storage devices sits the Rapidscale software that uses a slightly modified iSCSI to communicate with the storage servers. Rapidscale uses standard Linux file system tools such as md and lvm to create storage pool for the clients.

To get improved performance, Rapidscale builds the name space by aggregating the storage into a single virtual disk using the Linux tool md to create a RAID-0 stripe. Using RAID-0 allows the data to be striped across the storage servers, resulting in improved IO performance. You can format the resulting file system using mkfs that is in Linux.

Remember that Rapidscale is using md to create a RAID-0 device across all of the storage devices. When a file system is created across the devices, all accesses, whether they be data or metadata are distributed across the devices. This also means that metadata is not a single location and that metadata operations are spread across the storage servers.

Rapidscale is a 64-bit file system that requires a Linux kernel of 2.6.9 or greater. Rapidscale is scalable to 18 exabytes (18 million TBs). There are two primary components to Rapidscale: a client side piece and a storage server piece. The client side piece is a kernel module that runs on the clients and the storage server piece runs on the storage servers. The client software uses an enhanced iSCSI initiator that includes cache coherency (more on that in a bit).

The storage server pieces take on the role of an iSCSI target. Rapidscale uses a modified iSCSI target to add all of the functionality necessary to handle all of the simultaneous data and metadata requests from the clients. It also makes sure that cache coherency is maintained on the client nodes and prioritizes and manages metadata operations to ensure that the file system is responsive (no long delays) and that it is constant. This is handled by keeping the order of the operation requests from the clients.

Any change in to a given block with the name space made by any client will be available on demand using the Rapidscale cache coherence mechanism. Remember that Rapidscale is distributed so that means that the file system cache, which is in memory, is now distributed as well. To make sure that each client receives the contents of a data block as it was when the request was made, Rapidscale has added what is called a cache invalidation protocol. This protocol allows the block layer of the file system to maintain a local cache of any data blocks that are being used through the block layer interface. The Rapidscale client software operates synchronously. This means that the local client cache is a write through cache (i.e. as soon as an application writes, the client sends the blocks to the storage servers). But if a client has a read request, the local cache from the storage pool is always invalidated. This ensures the client is getting the latest data on the blocks requested.

To create a global file system, Rapidscale has taken a simple local file system and modified it slightly. The block device layer has been modified to force a re-validation of all block reads, ensuring that the reads get the current data. The block device layer has also been modified to provide a mechanism for maintaining consistency of the local file system control structures with their images on the storage server. When Rapidscale was Terrascale, it used ext2. Now, Rapidscale uses XFS, which Rackable calls zXFS.

Locking is also a very important part of a parallel file system. With a local file system, like XFS, ext3, reiserFS, etc., the locking is handled by local memory based data structures. These locks ensure that if one client (application) is writing to a block, that another client (application) that wants to read that block waits until the write operation is finished. This is also true for write operations that are waiting on a read operation to finish. But remember that Rapidscale is a distributed global file system and not a local file system on your desktop. Rapidscale handles locking by extending the local cache invalidation to the maintenance of POSIX locks across all clients. This is done similarly to what is done for the maintenance of other file system metadata information.

Fundamentally, Rapidscale uses byte range locking (as do many other file systems). It uses the Rapidscale cache invalidation protocol for locking. Any file system locks are propagated to each client using that file system. This means that if one client sets a lock, then all of the clients will also have the lock. The same is true that once the lock is removed, it is removed from all clients via the cache invalidation protocol.

Rapidscale also has the capability of being configured in a High Availability (HA) configuration. Rapidscale has three options for creating parity groups. These parity groups can tolerate the lose of an entire storage server without losing data. Rapidscale can be configured as 2+1 (two active storage servers with one stand by server), 4+1, and 8+1. If a storage server fails the a reconstruction takes place in the background using the stand-by server. The reconstruction proceeds at full disk speed.

You can also use gateways in Rapidscale to provide NFS and CIFS access. This is similar to other approaches. You just add servers to the storage network, mount the file system on the server, and then re-export the file system using NFS and/or CIFS. This creates a clustered NAS system. Since Rapidscale is a global parallel file system, you can increase the storage behind the NFS/CIFS gateways if more storage is needed. If you need more aggregate NFS/CIFS bandwidth then you can add more gateways. It's a very simple process.

Initially Rapidscale used only TCP networks, but now it can also use Myrinet networks and InfiniBand networks. This can be done between storage servers and also between clients and the storage servers. This allows Rapidscale to have a very high aggregate bandwidth. Also since, Rapidscale evenly distributes the file system across storage servers, aggregate performance increases linearly with the number of storage servers.

Performance of Rapidscale depends upon how many storage servers there are, how fast the storage is on each server, how many clients there are, and the data access pattern. But this brings up an important point -- designing or configuring a "balanced" storage system. In case you missed it, you can read about a balanced storage system in the NAS section above.


IBRIX offers a distributed file system that presents itself as a global name space to all the clients. The IBRIX Fusion product is a software only product that takes whatever data space you designate on what ever machines you choose and creates a global, parallel file system on them. This file system, or "name space,", can be then mounted in a normal fashion. According to IBRIX, the file systems can scale almost linearly with the number of data servers (also called IO servers or segment servers). It can scale up to 16 Petabytes (a Petabyte is about 1,000 Terabytes or about 1,000,000 Gigabytes). It can also achieve IO (Input/Output) speeds of up to 1 Terabyte/s (TB/s).

The IBRIX file system is composed of "segments." A segment is a repository for files and directories. A segment does not have to be a specific part of a directory tree and the segments don't have to be the same size. You can also split a directory across segments. This allows the segments to be organized however the file system needs them or according to policies set by the administrator or for performance. The file system can place files and directories in the segments irrespective of their locations within in the space. For example, a directory could be on one segment while the files within the directory could be spread across several segments. In fact, files can span multiple segments. This improves throughput because multiple segments can be accessed in parallel. The specifics of where and how the files and directories are placed occurs dynamically when the files or directories are created based on an allocation policy. This allocation policy is set by the administrator based on what they think the access patterns will be and any other criteria that will influence its function (e.g. file size, performance, ease of management). IBRIX Fusion also has a built-in Logical Volume Manager (LVM) and a segmented lock manager (for shared files).

The segments are managed by segment servers with each segment only having one server. However segment servers can manage more than one segment. Currently, IBRIX uses TCP or UDP as the communication protocol for the file system.

The clients mount the IBRIX file system through one of 3 ways: (1) using the IBRIX driver, (2) NFS, (3) or CIFS. Naturally the IBRIX driver knows about the segmented architecture and will take advantage of it to improve performance by routing data requests directly to the segment server(s) containing the data or metadata needed. The IBRIX driver is just a kernel module that requires no patches to the kernel itself. There is a version of the IBRIX driver for Linux and Windows. It can also accommodate mixed 64-bit and 32-bit environments (an advantage when uniting old and new clusters with a single file system).

If you use the NFS or CIFS protocol the client mounts the file system from one of the segment servers which acts as a gateway to the file system. You will most likely have more than one segment server so this allows you to load balance NFS or CIFS. As with other parallel file systems, this allows IBRIX to behave as an infinitely scalable clustered NAS.

The segmented file system approach has some unique properties. For example the ownership of a segment can be migrated from one server to another while the file system is being actively used. You can also add segment servers while the file system is functioning, allowing segments to be spread out to improve performance (if you are using the IBRIX driver). In addition to adding segment servers to improve performance, you can add capacity by adding segments to current segment servers. Since IBRIX doesn't have a centralized lock manager or centralized metadata manager, when you add segment servers you automatically get better lock performance and metadata performance. This can improve the IOPS (IO Operations per Second) of the file system.

One feature of IBRIX that many people overlook is the ability to split a directory across more than one segment server. All parallel and distributed file systems typically have problems with small files and with directories that have lots of files in them (you would be surprised at how many organizations use benchmarks that produce thousands or even of millions of files in a single directory). With IBRIX, since you can split a directory across various segment servers, it can handle a much higher file creation rate, deletion rate, access rate, etc. for directories with huge amounts of files. For instance, Bioinformatic applications can produce huge number of files in a single directory. So IBRIX has a distinct advantage in these situations.

The segmented approach as implemented in IBRIX Fusion has some resiliency features. For example, parts of the name space can stay active even though one or more segments might fail. To gain additional resiliency you can configure the segment servers with fail-over (High Availability - HA). IBRIX has a product called IBRIX Fusion HA, that allows you can also configure the the segment servers to fail over to a stand-by server. These servers can be configured in an active-active configuration so that both servers are being used rather than have one server in a standby mode where it's not being used.

IBRIX Fusion also features a snap shot tool called IBRIX FusionSnap. It allows snap shots of the file system to be taken and used for backups. You can also use it for recovering accidentally deleted files since you can store the snapshot and then recover the deleted files from it.

One other significant feature that IBRIX Fusion has is something called IBRIX FusionFileReplication. This feature allows administrators to set policies about creating replicas (copies) of files. The replicas are self-healing which means that if one of the replicas is damaged, it can be repaired by the other replicas. Also, Fusion makes sure that the replicas are distributed across the segment servers to avoid the replicas from being on the same storage device or segment server. Presumably, the replicas can also be used to share data access. For example, if several clients want to read the same file, then Fusion splits the IO across the replicas.

Recently, IBRIX and Mellanox announced a cluster NAS product that uses the IBRIX file system FusionFS and the Mellanox NFS/RDMA package. This is the first clustered NAS product to use native IB protocols from the client to the NFS gateways. This new product uses FusionFS as the backend storage for the clustered NAS allowing an extremely large single name space. Then NFS gateways are added to the file system that use the NFS/RDMA SDK. This means that the clients access the IBRIX file system using IB links, greatly increasing the bandwidth. All other clustered NAS systems use TCP links to the backend file system which limits their performance to each client (in the case of GigE, this means a maximum of about 100 MB/s). The IBRIX/Mellanox combination should allow a client to have a read bandwidth of about 1,400 MB/s and a write bandwidth of about 400 MB/s to a single NFS gateway. Don't forget that you can also add more NFS gateways to the system and load balance the clients across the gateways. This means you know have a clustered NAS product with a file system that can scale to Petabytes, and gateways that can use native IB protocols for very high speed single client access and you can add as many gateways as you want!

You have no rights to post comments


Login And Newsletter

Create an account to access exclusive content, comment on articles, and receive our newsletters.


This work is licensed under CC BY-NC-SA 4.0

©2005-2023 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.