|
Page 3 of 4
File System Introduction and Taxonomy
To make things a little bit easier I have broken file systems for
clusters into two groups: Distributed File Systems, and Parallel File
Systems. The difference between the file systems is that parallel file
systems are exactly that, "parallel". They can utilize multiple data
servers and the clients can access the data through multiple servers. On
the other hand, distributed file systems use a single server for client
access to the file system. The client access is not parallel because
only a single server is being used, but they can give the user access to
a parallel file system.
So, before we jump in and start looking at file systems for clusters,
let's do some initial discussion of the data movement from the application
down to the hardware. This will allow us to have a good frame of reference
and will also allow me to point out differences in file systems. The diagrams
you will see below illustrating the data flow are from Marc Unangst and
Brent Welch at Panasas.
Let's start with Figure One below (courtesy of Marc Unangst and Brent
Welch at Panasas). It represents the data flow for a single node with a
single storage device.
 Figure One: File Systems/Storage Protocol Stack (Courtesy of Panasas)
You can think of this diagram being similar to the
OSI model for network
protocols. At the top, the operating system performs file system
policy checks such as user authorization ("can this user access this
file?"), permission checks ("Does the user have read or write access?",
"Will the user's quota be exceeded?"), and file level attributes
(e.g. time stamps, physical size, etc.).
The file system then translates the data into blocks and they are then
written to the storage device (hardware). The OS is responsible for the
performance of the storage device because it is responsible for the
layout and organization of the storage requests. It is also responsible
for the prefetching and buffer caching of the data (this can have a great
impact on the overall performance of the data transfer).
I will be referring back to this diagram from time to time to help
explain how various file systems differ. At something of a more
fundamental level, some file systems use the
"block based" approach similar to a local file system (SAN's are examples
of this), some use a file based approach (NFS is an example of this),
and some are object based (Panasas and Lustre are examples of this).
If you're not as familiar with the structure of file systems, let me
take a little bit of space to give a very high-level view of a
classic, typical file system, particularly Linux/UNIX file systems.
In general file systems break data into two parts - the metadata which
is data about the data, and the data itself. The metadata contains
information about the data such as when the file was created, when it
was last accessed, any permissions on the data, the size of the file,
the storage location or locations on disk, etc. Then you have the data
itself. In the case of small files, the metadata can be as large as
the data itself (or the data is as small as the metadata depending upon
your point of view). For files that you are likely to see on clusters,
the data is much larger than the metadata (of course there are always
exceptions to this).
But the metadata is extremely important. Anytime the actual data is
touched, changed, etc., the metadata has to be updated. So the file
system always has to modify the metadata if the data is touched,
modified, or in any way changed. As an example, if you do a "ls -l"
command, the file system has to examine the metadata for each file in the
directory to get the current information for the file or files. This can
be a source of a great bottleneck in file system performance, particularly
for distributed file systems, because they have to read the metadata and
return it to the client. If you do this with wildcards or for a directory
with a great number of files, then it can take a long time.
So given the simple concept of a file system, and a way to examine how
a file system gets data from a client to a file system and back, let's
go forth and examine some file systems. As I mentioned before, I'm going
to divide the file systems into two classes, distributed file systems and
parallel file systems.
Distributed File Systems
The first set of file systems I want to discuss are what I call distributed
file systems. These are file systems that are network based (i.e. the
actual storage hardware is not necessarily on the storage nodes) but is not
necessarily parallel (i.e. there may not be multiple servers that are
delivering the file system). For distributed file systems this means that
the client will get and put data through a single server. I think you will
recognize some of the names of the distributed file systems.
NFS
NFS
has been the primary file system protocol for clusters because it
is "there" and is pretty much "plug and play" on most *nix systems.
It was the first widespread file system that allowed distributed systems
to share data effectively. In fact, it's the only standard network
file system. Consequently, it can be viewed as one of the enabling
technologies for HPC clusters (i.e. getting user directories onto the nodes
was trivial).
While NFS is likely to be the most ubiquitous cluster file system, it
has gotten a little long in the tooth, so to say, and has some limitations.
For example, it doesn't scale well for large clusters and has limited
performance. It also used to have some security issues, but these were
addressed in Version 4 of the NFS protocol. Despite these limitations,
it remains the most popular cluster file system because:
- It comes with virtually every OS (it can even be an add-on to Windows)
- Easy to configure and manage
- Well understood
- Works across multiple platforms
- Usually requires little administration (until it goes down)
- It just works
NFS is a fairly easy protocol to follow. All information, data and metadata,
flows through the file server. This is commonly referred to as an "in-band"
data flow model shown in Figure Two below.
 Figure Two: In-Band Data Flow Model (Courtesy of Panasas)
Notice that the file server touches and manages all data and metadata.
This model does make things a bit easier to configure and monitor. Moreover,
it has narrow, well defined failure modes. Some drawbacks include an obvious
bottleneck for performance, has problems with load balancing
(more on that later in the article), and security is a function
of the server node and not the protocol (this situation means that security
features can be all over the map).
With NFS, at least one server "exports" some storage space to the nodes
of the cluster. These nodes mount the exported file system(s). When a
file request is made to one of these mounted file systems, the mount
daemon transfers the request to the NFS server, which then accesses the
file on the local file system. The data is the transferred from the NFS
server to the requesting node, typically using TCP. Notice that NFS is
"file" based. That is, when a data request is made, it is done on a file,
not blocks of data or a byte range. So we say that NFS is a file based
protocol.
Figure Three below presents NFS using the data flow model introduced
in Figure One.
 Figure Three: NFS Protocol Stack (Courtesy of Panasas)
If you compare this diagram with Figure One (the local storage stack),
you will notice a few things. First, between the system call interface
and the file system user component is a network protocol (usually TCP).
This separation is fundamental to a distributed file system where the
storage is not necessarily in the node itself. Second, there is
usually some NVRAM (Non-Volatile Random Access Memory) in the server
that is used to buffer IO.
History of NFS:
NFS is a standard protocol originally developed by Sun Microsystems in
1984 as a distributed file system that allows a system to access files
over a network. It is built using a client-server architecture, where a NFS
server exports some storage to various clients. The clients mount the
storage as though it were a local file system and can then read and write
to the space (it is possible to mount the file system as read-only so the
clients can only read the data but not write to it). But it does lack some
important features of true parallel file systems.
The very first version of NFS was NFS Version 1 (NFSv1), was an experimental
version only used internally at Sun.
NFS Version 2 (NFSv2)
was released in approximately 1989 and originally operated entirely over
UDP.
The designers chose NFS to be a stateless protocol with other features
such as locking to be implemented outside of the protocol. In this context
stateless means that each request is independent of all others and that
the NFS server is not required to maintain any information about a client,
a file, or request stream between requests. The reason the server was
chosen to be stateless was to make the implementation much easier.
In the late 1980's the University of Berkeley developed a freely distributable
but compatible version of NFSv2. This lead to many OS's becoming NFS
compatible so different machines could share a common storage pool. There
were some incremental developments of NFSv2 to improve performance. One of
these was to add support for TCP to NFSv2.
The
Version 3 specification
(NFSv3) was released around 1995. It added some new features to keep up
with modern hardware, but still did not address the stateless feature nor
address some of the security issues that had developed over time. However,
NFSv3 added support for 64-bit file sizes and offsets to handle files larger
than 4 gigabytes (GB), added support for asynchronous writes on the server
to improve write performance, and added TCP as a transport layer. These are
three major features that users of NFS had been asking to be added for some time.
NFSv3 was quickly adopted and put into production and is probably the most
popular version of NFS in use today.
However, even NFSv3 has some problems. For example it does not have what
is usually termed "strict
cache coherency".
This means that when a change
is made to a file, that change is not immediately visible to all other
clients. In NFS, this is because there is no message from the server to
the other clients to notify them that the file has changed (recall that
the NFSv3 server is stateless so it keeps no information outside of the
single request so it can’t notify other clients that the data has changed).
This can have a great impact on writing to the same file. In other
words - it can be dangerous. But NFS does not prohibit you from doing
this so we can say that NFS has loose coherency.
However, even with loose coherency, two clients can still write to the
same file without having problems, as long as their writes don't overlap
and don't share pages. Not sharing pages is usually the most
problematic aspect, because the necessary alignment depends on the page
size of both machines, and there is no way for client A to find
the page size of client B. Page-sharing causes problems because many
NFS clients (including Linux) do their reads and writes in units of
whole VM pages (4K on x86 and x86-64, 16K on IA-64). This means that if
the application writes data from 0-2K in a 4K page, the NFS client will
read the whole 4K page from the server, apply the new data to the first
half of the page, and then write out the whole 4K page. This process
is usually referred to as read-modify-write.
This is what leads to the "last writer wins" scenario. Consider the case
where, at the same time client A is writing from 0-2K in the page, client B
reads the same page from the server, modifies the other half of the page
(offset 2K-4K), and then writes back the whole page to the server. Since
the server sees two 0-4K writes for the same range in the file, the page
will contain either [AAAold] or [oldBBB], depending on whether A or B wrote
last, but not [AAABBB] as it should.
The other issue with NFSv3 is that the client is not required to write
back modified data immediately to the server. It is only required to
write data when the file is closed. This is what leads to the
"open-to-close" cache consistency ascribed to NFS. If client A writes
to a file, and client B subsequently reads from the modified range, it
is not guaranteed to see the new data until client A closes the file.
If code writers and users don't understand this, they can get unexpected
results.
Around 2003, Version 4 of
NFS (NFSv4) was released with some improvements. In particular, it added
some speed improvements (who doesn't like speed), strong security (with the
ability for multiple security protocols), and most important, NFSv4
finally became a stateful protocol (for the most part). Some server state
was added, primarily in the form of delegations and file locks. Using file
locks requires that the server track the state. File locks are the main
item that requires the server to track state. In addition to the lock state
itself, the server has to know about, and track, open files from the client's
point of view, to enforce Win32 "share mode" locks. The server is not
required to keep this state persistently (i.e., a server restart is allowed
to destroy state about open files), and there is mechanism in the protocol
that allows the client to detect when the state has been lost and recover it.
NFSv4.0 added strong security to NFS focusing on authentication,
integrity, and privacy. To do this and allow for multiple security
implementations, a new interface called
RPCSEC GSSAPI
was developed. It allows for authentication plug-ins to be created for
protocols like NFS that use RPC (Remote Procedure Call). An example of one
of the plug-ins is Kerberos 5.
NFSv4.0 also has an extensible architecture to allow easier evolution of the
standard. Despite the better security, improved performance, and making NFS
a stateful protocol, NFSv4 has not yet seen the widespread adoption that
NFSv3 has. It has all of the components of an industrial strength file
system but it still lacks performance and scalability. But this could change
very soon.
NFS/RDMA
NFS/RDMA,sometimes also called NFSoRDMA (NFS over RDMA) is a binding of
NFS on top of
RDMA (Remote Direct Memory
Access) protocols such as iWarp
(RDMA over Ethernet) and
InfiniBand. The goal
is to improve the network performance of NFS (version 2, 3, and 4) since a
higher bandwidth and lower latency interconnect is being used. Using RDMA
should allow the following aspects:
- Reduced CPU usage on the client for data transfer
- Zero-copy data transfers
- User-space IO (kernel bypass)
- Reduced latency
- Improved data throughput
Work on NFS/RDMA is on-going for both the Linux kernel and for OpenSolaris.
In the Linux 2.6.24 kernel, the NFS/RDMA client was released while the 2.6.25
kernel contained the server portion.
One of the other motivations for developing NFS/RDMA is that many clusters
are shipping with a high-speed interconnect such as Myrinet, InfiniBand,
or Quadrics. A vast majority of these clusters use a high-speed interconnect
for computational traffic and a second network for data traffic (many
times just a single GigE line). The high-speed interconnects have very low
latency and very high bandwidth, but most cluster applications don't need
all of the bandwidth. So there is room to push more data through the
network. So why not use the left over bandwidth for data traffic? This also
allows two networks to be combined into a single network since you can
combine the computational traffic with data traffic. You can actually do
that today with "standard" NFS, but you have to wrap TCP packets in IB packets
to transfer them over the IB network. This has to be done since NFS only
understands UDP and TCP (most people use TCP). For example, with InfiniBand
this means that you have to use IPoIB (IP over IB). This limits performance
since you have to encode and decode the TCP packets at either end of the
network. NFS/RDMA adds the ability to use RDMA protocols to the RPC layer,
so you don't have to wrap the TCP packets.
In the case of 10GigE, you could just run "normal" NFS over the NIC,
but the of 10GigE will limit NFS performance and
increase the load on the client CPU. Since many 10GigE NICs are moving
towards using iWARP, NFS/RDMA will allow the highest possible performance
for data transfer while reducing the CPU load on the client.
There is a great deal of work went into developing NFS/RDMA since it's an open IETF (Internet Engineering Task Force) standard. A number of people worked on getting NFS/RDMA functioning in Linux. Tom Tucker at
Open Grid Computing
was one of the leaders in the efforts to get both the NFS client and the NFS server into the Linux kernel. As mentioned earlier, the client when into the 2.6.24 kernel and the server went into the 2.6.25 kernel. You can visit the NFS-RDMA project to get the patches and see what work has been done. Since the NFS-RDMA, are at the RPC layer, then NFSv2, NFSv3, and NFSv4 should just work out of the box.
Mellanox
demonstrated NFS/RDMA over a 20 Gb/s InfiniBand link from clients to a
new Mellanox MTD2000 NFS-RDMA Server Product Development Kit (PDK) that
features a RAID-5 back-end storage subsystem with SAS (Serial Attached SCSI)
hard disk drives. With two clients connected to the MTD2000 each using a
20 Gb/s IB link and reading from the file server cache, Mellanox was able
to achieve 1,400 MB/s of read throughput measured using IOzone. Actual write
performance to the disk was measured at about 400 MB/s. Mellanox used an
open-source NFS/RDMA server and client code for the demonstration and is
compatible with the OpenFabrics Enterprise Distribution (OFED) v. 1.1. This
shows the performance potential for running NFS over a high-speed interconnect
using appropriate protocols.
But NFS is still limited in that all data traffic from a particular client must go through a single server to access data on the physical media. This creates a bottleneck in performance (all IO goes through one node). People have always wanted more performance and scalability out of NFS (who doesn't like performance?) but have not received it. That is about to change.
|