Article Index

File System Introduction and Taxonomy

To make things a little bit easier I have broken file systems for clusters into two groups: Distributed File Systems, and Parallel File Systems. The difference between the file systems is that parallel file systems are exactly that, "parallel". They can utilize multiple data servers and the clients can access the data through multiple servers. On the other hand, distributed file systems use a single server for client access to the file system. The client access is not parallel because only a single server is being used, but they can give the user access to a parallel file system.

So, before we jump in and start looking at file systems for clusters, let's do some initial discussion of the data movement from the application down to the hardware. This will allow us to have a good frame of reference and will also allow me to point out differences in file systems. The diagrams you will see below illustrating the data flow are from Marc Unangst and Brent Welch at Panasas.

Let's start with Figure One below (courtesy of Marc Unangst and Brent Welch at Panasas). It represents the data flow for a single node with a single storage device.

File Systems/Storage Protocol Stack (Courtesy of Panasas)
Figure One: File Systems/Storage Protocol Stack (Courtesy of Panasas)

You can think of this diagram being similar to the OSI model for network protocols. At the top, the operating system performs file system policy checks such as user authorization ("can this user access this file?"), permission checks ("Does the user have read or write access?", "Will the user's quota be exceeded?"), and file level attributes (e.g. time stamps, physical size, etc.).

The file system then translates the data into blocks and they are then written to the storage device (hardware). The OS is responsible for the performance of the storage device because it is responsible for the layout and organization of the storage requests. It is also responsible for the prefetching and buffer caching of the data (this can have a great impact on the overall performance of the data transfer).

I will be referring back to this diagram from time to time to help explain how various file systems differ. At something of a more fundamental level, some file systems use the "block based" approach similar to a local file system (SAN's are examples of this), some use a file based approach (NFS is an example of this), and some are object based (Panasas and Lustre are examples of this).

If you're not as familiar with the structure of file systems, let me take a little bit of space to give a very high-level view of a classic, typical file system, particularly Linux/UNIX file systems. In general file systems break data into two parts - the metadata which is data about the data, and the data itself. The metadata contains information about the data such as when the file was created, when it was last accessed, any permissions on the data, the size of the file, the storage location or locations on disk, etc. Then you have the data itself. In the case of small files, the metadata can be as large as the data itself (or the data is as small as the metadata depending upon your point of view). For files that you are likely to see on clusters, the data is much larger than the metadata (of course there are always exceptions to this).

But the metadata is extremely important. Anytime the actual data is touched, changed, etc., the metadata has to be updated. So the file system always has to modify the metadata if the data is touched, modified, or in any way changed. As an example, if you do a "ls -l" command, the file system has to examine the metadata for each file in the directory to get the current information for the file or files. This can be a source of a great bottleneck in file system performance, particularly for distributed file systems, because they have to read the metadata and return it to the client. If you do this with wildcards or for a directory with a great number of files, then it can take a long time.

So given the simple concept of a file system, and a way to examine how a file system gets data from a client to a file system and back, let's go forth and examine some file systems. As I mentioned before, I'm going to divide the file systems into two classes, distributed file systems and parallel file systems.

Distributed File Systems

The first set of file systems I want to discuss are what I call distributed file systems. These are file systems that are network based (i.e. the actual storage hardware is not necessarily on the storage nodes) but is not necessarily parallel (i.e. there may not be multiple servers that are delivering the file system). For distributed file systems this means that the client will get and put data through a single server. I think you will recognize some of the names of the distributed file systems.


NFS has been the primary file system protocol for clusters because it is "there" and is pretty much "plug and play" on most *nix systems. It was the first widespread file system that allowed distributed systems to share data effectively. In fact, it's the only standard network file system. Consequently, it can be viewed as one of the enabling technologies for HPC clusters (i.e. getting user directories onto the nodes was trivial).

While NFS is likely to be the most ubiquitous cluster file system, it has gotten a little long in the tooth, so to say, and has some limitations. For example, it doesn't scale well for large clusters and has limited performance. It also used to have some security issues, but these were addressed in Version 4 of the NFS protocol. Despite these limitations, it remains the most popular cluster file system because:

  • It comes with virtually every OS (it can even be an add-on to Windows)
  • Easy to configure and manage
  • Well understood
  • Works across multiple platforms
  • Usually requires little administration (until it goes down)
  • It just works

NFS is a fairly easy protocol to follow. All information, data and metadata, flows through the file server. This is commonly referred to as an "in-band" data flow model shown in Figure Two below.

n-Band Data Flow Model
Figure Two: In-Band Data Flow Model (Courtesy of Panasas)

Notice that the file server touches and manages all data and metadata. This model does make things a bit easier to configure and monitor. Moreover, it has narrow, well defined failure modes. Some drawbacks include an obvious bottleneck for performance, has problems with load balancing (more on that later in the article), and security is a function of the server node and not the protocol (this situation means that security features can be all over the map).

With NFS, at least one server "exports" some storage space to the nodes of the cluster. These nodes mount the exported file system(s). When a file request is made to one of these mounted file systems, the mount daemon transfers the request to the NFS server, which then accesses the file on the local file system. The data is the transferred from the NFS server to the requesting node, typically using TCP. Notice that NFS is "file" based. That is, when a data request is made, it is done on a file, not blocks of data or a byte range. So we say that NFS is a file based protocol.

Figure Three below presents NFS using the data flow model introduced in Figure One.

NFS Protocol Stack (Courtesy of Panasas)
Figure Three: NFS Protocol Stack (Courtesy of Panasas)

If you compare this diagram with Figure One (the local storage stack), you will notice a few things. First, between the system call interface and the file system user component is a network protocol (usually TCP). This separation is fundamental to a distributed file system where the storage is not necessarily in the node itself. Second, there is usually some NVRAM (Non-Volatile Random Access Memory) in the server that is used to buffer IO.

History of NFS:

NFS is a standard protocol originally developed by Sun Microsystems in 1984 as a distributed file system that allows a system to access files over a network. It is built using a client-server architecture, where a NFS server exports some storage to various clients. The clients mount the storage as though it were a local file system and can then read and write to the space (it is possible to mount the file system as read-only so the clients can only read the data but not write to it). But it does lack some important features of true parallel file systems.

The very first version of NFS was NFS Version 1 (NFSv1), was an experimental version only used internally at Sun. NFS Version 2 (NFSv2) was released in approximately 1989 and originally operated entirely over UDP. The designers chose NFS to be a stateless protocol with other features such as locking to be implemented outside of the protocol. In this context stateless means that each request is independent of all others and that the NFS server is not required to maintain any information about a client, a file, or request stream between requests. The reason the server was chosen to be stateless was to make the implementation much easier.

In the late 1980's the University of Berkeley developed a freely distributable but compatible version of NFSv2. This lead to many OS's becoming NFS compatible so different machines could share a common storage pool. There were some incremental developments of NFSv2 to improve performance. One of these was to add support for TCP to NFSv2.

The Version 3 specification (NFSv3) was released around 1995. It added some new features to keep up with modern hardware, but still did not address the stateless feature nor address some of the security issues that had developed over time. However, NFSv3 added support for 64-bit file sizes and offsets to handle files larger than 4 gigabytes (GB), added support for asynchronous writes on the server to improve write performance, and added TCP as a transport layer. These are three major features that users of NFS had been asking to be added for some time. NFSv3 was quickly adopted and put into production and is probably the most popular version of NFS in use today.

However, even NFSv3 has some problems. For example it does not have what is usually termed "strict cache coherency". This means that when a change is made to a file, that change is not immediately visible to all other clients. In NFS, this is because there is no message from the server to the other clients to notify them that the file has changed (recall that the NFSv3 server is stateless so it keeps no information outside of the single request so it can’t notify other clients that the data has changed). This can have a great impact on writing to the same file. In other words - it can be dangerous. But NFS does not prohibit you from doing this so we can say that NFS has loose coherency.

However, even with loose coherency, two clients can still write to the same file without having problems, as long as their writes don't overlap and don't share pages. Not sharing pages is usually the most problematic aspect, because the necessary alignment depends on the page size of both machines, and there is no way for client A to find the page size of client B. Page-sharing causes problems because many NFS clients (including Linux) do their reads and writes in units of whole VM pages (4K on x86 and x86-64, 16K on IA-64). This means that if the application writes data from 0-2K in a 4K page, the NFS client will read the whole 4K page from the server, apply the new data to the first half of the page, and then write out the whole 4K page. This process is usually referred to as read-modify-write.

This is what leads to the "last writer wins" scenario. Consider the case where, at the same time client A is writing from 0-2K in the page, client B reads the same page from the server, modifies the other half of the page (offset 2K-4K), and then writes back the whole page to the server. Since the server sees two 0-4K writes for the same range in the file, the page will contain either [AAAold] or [oldBBB], depending on whether A or B wrote last, but not [AAABBB] as it should.

The other issue with NFSv3 is that the client is not required to write back modified data immediately to the server. It is only required to write data when the file is closed. This is what leads to the "open-to-close" cache consistency ascribed to NFS. If client A writes to a file, and client B subsequently reads from the modified range, it is not guaranteed to see the new data until client A closes the file. If code writers and users don't understand this, they can get unexpected results.

Around 2003, Version 4 of NFS (NFSv4) was released with some improvements. In particular, it added some speed improvements (who doesn't like speed), strong security (with the ability for multiple security protocols), and most important, NFSv4 finally became a stateful protocol (for the most part). Some server state was added, primarily in the form of delegations and file locks. Using file locks requires that the server track the state. File locks are the main item that requires the server to track state. In addition to the lock state itself, the server has to know about, and track, open files from the client's point of view, to enforce Win32 "share mode" locks. The server is not required to keep this state persistently (i.e., a server restart is allowed to destroy state about open files), and there is mechanism in the protocol that allows the client to detect when the state has been lost and recover it.

NFSv4.0 added strong security to NFS focusing on authentication, integrity, and privacy. To do this and allow for multiple security implementations, a new interface called RPCSEC GSSAPI was developed. It allows for authentication plug-ins to be created for protocols like NFS that use RPC (Remote Procedure Call). An example of one of the plug-ins is Kerberos 5.

NFSv4.0 also has an extensible architecture to allow easier evolution of the standard. Despite the better security, improved performance, and making NFS a stateful protocol, NFSv4 has not yet seen the widespread adoption that NFSv3 has. It has all of the components of an industrial strength file system but it still lacks performance and scalability. But this could change very soon.


NFS/RDMA,sometimes also called NFSoRDMA (NFS over RDMA) is a binding of NFS on top of RDMA (Remote Direct Memory Access) protocols such as iWarp (RDMA over Ethernet) and InfiniBand. The goal is to improve the network performance of NFS (version 2, 3, and 4) since a higher bandwidth and lower latency interconnect is being used. Using RDMA should allow the following aspects:

  • Reduced CPU usage on the client for data transfer
  • Zero-copy data transfers
  • User-space IO (kernel bypass)
  • Reduced latency
  • Improved data throughput

Work on NFS/RDMA is on-going for both the Linux kernel and for OpenSolaris. In the Linux 2.6.24 kernel, the NFS/RDMA client was released while the 2.6.25 kernel contained the server portion.

One of the other motivations for developing NFS/RDMA is that many clusters are shipping with a high-speed interconnect such as Myrinet, InfiniBand, or Quadrics. A vast majority of these clusters use a high-speed interconnect for computational traffic and a second network for data traffic (many times just a single GigE line). The high-speed interconnects have very low latency and very high bandwidth, but most cluster applications don't need all of the bandwidth. So there is room to push more data through the network. So why not use the left over bandwidth for data traffic? This also allows two networks to be combined into a single network since you can combine the computational traffic with data traffic. You can actually do that today with "standard" NFS, but you have to wrap TCP packets in IB packets to transfer them over the IB network. This has to be done since NFS only understands UDP and TCP (most people use TCP). For example, with InfiniBand this means that you have to use IPoIB (IP over IB). This limits performance since you have to encode and decode the TCP packets at either end of the network. NFS/RDMA adds the ability to use RDMA protocols to the RPC layer, so you don't have to wrap the TCP packets.

In the case of 10GigE, you could just run "normal" NFS over the NIC, but the of 10GigE will limit NFS performance and increase the load on the client CPU. Since many 10GigE NICs are moving towards using iWARP, NFS/RDMA will allow the highest possible performance for data transfer while reducing the CPU load on the client.

There is a great deal of work went into developing NFS/RDMA since it's an open IETF (Internet Engineering Task Force) standard. A number of people worked on getting NFS/RDMA functioning in Linux. Tom Tucker at Open Grid Computing was one of the leaders in the efforts to get both the NFS client and the NFS server into the Linux kernel. As mentioned earlier, the client when into the 2.6.24 kernel and the server went into the 2.6.25 kernel. You can visit the NFS-RDMA project to get the patches and see what work has been done. Since the NFS-RDMA, are at the RPC layer, then NFSv2, NFSv3, and NFSv4 should just work out of the box.

Mellanox demonstrated NFS/RDMA over a 20 Gb/s InfiniBand link from clients to a new Mellanox MTD2000 NFS-RDMA Server Product Development Kit (PDK) that features a RAID-5 back-end storage subsystem with SAS (Serial Attached SCSI) hard disk drives. With two clients connected to the MTD2000 each using a 20 Gb/s IB link and reading from the file server cache, Mellanox was able to achieve 1,400 MB/s of read throughput measured using IOzone. Actual write performance to the disk was measured at about 400 MB/s. Mellanox used an open-source NFS/RDMA server and client code for the demonstration and is compatible with the OpenFabrics Enterprise Distribution (OFED) v. 1.1. This shows the performance potential for running NFS over a high-speed interconnect using appropriate protocols.

But NFS is still limited in that all data traffic from a particular client must go through a single server to access data on the physical media. This creates a bottleneck in performance (all IO goes through one node). People have always wanted more performance and scalability out of NFS (who doesn't like performance?) but have not received it. That is about to change.

You have no rights to post comments


Login And Newsletter

Create an account to access exclusive content, comment on articles, and receive our newsletters.


This work is licensed under CC BY-NC-SA 4.0

©2005-2023 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.