|
Page 1 of 3
Coding for PVFS1 and PVFS2
Arguably, one of the most popular parallel file systems is PVFS
(Parallel Virtual File System). Work on PVFS began around 1993 at
Clemson University. It has since grown through partnerships with
the academic, government, and industrial community. PVFS also enjoys
a licensed under the LGPL (Lesser General Public License). It has
been deployed in production at a number of sites for several years.
There are currently two versions of PVFS: PVFS1, the original
version; and PVFS2, the new version under development and very close
to production release. While PVFS is a great concept, most efficient
use comes from recoding your application to take true advantage of
either PVFS1 or PVFS2 (unless your code already uses MPI-IO). However,
rewriting the I/O portion of the your code to use MPI-IO is really
not a difficult task. In this column I'll be examining the various ways
one can code to use PVFS1 or PVFS2.
Before any coding or recoding begins you should remember that PVFS
was designed to be a high-speed scratch file system. It was not
designed to be a permanent file system with backup capability.
Consequently, the aim of PVFS development is on speed. This single
mindedness makes development of codes for PVFS fairly easy.
PVFS1 was the original file system started at Clemson University in the
Parallel Architecture Research Laboratory (PARL). Once clusters began
to become popular, it was evident that commercial parallel machines
enjoyed an advantage over clusters in the area of parallel file
systems. PVFS1 was developed with two objectives in mind: to provide
a basic software platform for pursuing the development of parallel I/O
and parallel file systems, and to provide a stable, full-featured
parallel file system for Linux clusters.
PVFS2 uses some the basic concepts from PVFS1 but redesigns the
architecture to make it more scalable, more flexible, and more
portable. It abstracts the storage layer allowing different types
of storage techniques, and it also abstracts the networking layer
allowing different types of networking protocols besides TCP to
be easily integrated into PVFS2. In addition, scalability has
been re-examined in all phases of the redesign to ensure that
PVFS2 is scalable to thousands if not tens of thousands of nodes.
Both PVFS1 and PVFS2 are similar in their basic architecture.
PVFS splits the tasks of the file system into pieces. It allows
the multiple metadata to be separate from the data servers.
PVFS1 uses one metadata server, while PVFS2 allows multiple
servers. Data are stored on other servers or I/O nodes, called
"iod's" (short for IO daemons). Both PVFS1 and PVFS2 are
"virtual" file systems in that they use existing local file
systems such as Ext2, Ext3, ReiserFS, and XFS to store data local
to each server, rather than relying on its own file system design.
PVFS data are striped in files across some or all of the iod's
within the PVFS system. Typically, these data are written in
round-robin fashion although PVFS2 allows different access
patterns.
I won't be discussing how to setup and configure either PVFS1 nor
PVFS2. Please see the sidebar that lists the links for PVFS
installation support. Let's start with by looking at PVFS1 and
how one can access the file system from your codes.
PVFS1 Coding
PVFS1 was designed so that there would be several API's (Application
Programming Interfaces) to PVFS and that applications developed
using the standard UNIX I/O API must still work with PVFS. Three
classes of API's are made available to the user: a native PVFS1 API;
the "UNIX/POSIX-like" I/O API; and other API's such as MPI-IO.
The UNIX/POSIX I/O API allows codes that have been built with the
standard UNIX I/O functions such as fopen(), open(),
fprintf(), fread(), fwrite(), read(),
write(), fclose(), etc., to all will work with PVFS1
without any changes
to the code and without any recompiling or relinking. The PVFS
kernel module provides this functionality. On each PVFS1 client,
there is a small daemon (pvfsd) that is run. All function calls
are processed by the VFS (Linux kernel Virtual File System interface)
which decides if the operation is a PVFS function. If it is a PVFS
operation, then the pvfsd performs the operation on behalf of the
application. The data from the I/O functions will use the default
stripe size from when PVFS1 was compiled, usually 64K. Also, all of
the machines defined in PVFS1 will be used beginning with the first
machine in the .iodtab file.
The functions read()and write() perform the data transfer
with the I/O nodes each time a call is made. For accessing small
amounts of data, this can be terribly time consuming since the data
may lie on another node which must be accessed over the network.
On the other hand, fread() and fwrite() locally buffer small
file access and perform data exchanges with the I/O nodes in chunks of
some minimum size. Consequently, the data transfer rate is much better
using these functions. However, there are some consistency issues with
fread() and fwrite() due to the buffering. Also, using PVFS1 for
formatted I/O using functions such as fprintf() and fscanf() is
not a good use of the file system because the data streams are usually
small and will not take advantage of the speed capability of PVFS1.
While running codes that use the "UNIX/POSIX-like" API with PVFS1
will work, to get better performance one should look at porting the
code to use the native PVFS API. This port may seem daunting but is
in fact it is fairly painless. The PVFS1 I/O functions are intended
to be very close to the UNIX/POSIX calls, so porting can be require
minimum effort.
For example, the normal open() function can be replaced with the
pvfs_open() function:
int open(const char *pathname,
int flags,
mode_t mode);
int pvfs_open(const char *pathname,
int flags,
mode_t mode,
struct pvfs_filestat *dist);
The structure pvfs_filestat is an optional argument to define the data
distribution that the user wants. If the data structure is not passed then
the default distribution is used. The data structures looks like the following:
struct pvfs_filestat {
int base; /* first iod node */
int pcount; /* number of iod nodes*/
int ssize; /* stripe size */
int soff; /* NOT USED */
int bsize; /* NOT USED */
}
where base is the starting I/O node number (it starts with 0) and
pcount is the number of I/O nodes you want to use, up to the total
number of nodes in PVFS1 (set pcount = -1 to use all of the nodes).
The element ssize is the stripe size. The elements soff and
bsize are not currently used. You can see the semantics of using
the PVFS1 API is very similar to the standard UNIX/POSIX I/O API.
Many of the other UNIX/POSIX I/O functions have PVFS1 counterparts:
pvfs_write(),
pvfs_open(), pvfs_lseek(), pvfs_lseek64(),
pvfs_close() and so on. Most of the time you can port your code to
use the native PVFS1 library by a simple global find/replace that just
changes the I/O function called.
|