Hits: 27620

Coding for PVFS1 and PVFS2

Arguably, one of the most popular parallel file systems is PVFS (Parallel Virtual File System). Work on PVFS began around 1993 at Clemson University. It has since grown through partnerships with the academic, government, and industrial community. PVFS also enjoys a licensed under the LGPL (Lesser General Public License). It has been deployed in production at a number of sites for several years. There are currently two versions of PVFS: PVFS1, the original version; and PVFS2, the new version under development and very close to production release. While PVFS is a great concept, most efficient use comes from recoding your application to take true advantage of either PVFS1 or PVFS2 (unless your code already uses MPI-IO). However, rewriting the I/O portion of the your code to use MPI-IO is really not a difficult task. In this column I'll be examining the various ways one can code to use PVFS1 or PVFS2.

Before any coding or recoding begins you should remember that PVFS was designed to be a high-speed scratch file system. It was not designed to be a permanent file system with backup capability. Consequently, the aim of PVFS development is on speed. This single mindedness makes development of codes for PVFS fairly easy.

PVFS1 was the original file system started at Clemson University in the Parallel Architecture Research Laboratory (PARL). Once clusters began to become popular, it was evident that commercial parallel machines enjoyed an advantage over clusters in the area of parallel file systems. PVFS1 was developed with two objectives in mind: to provide a basic software platform for pursuing the development of parallel I/O and parallel file systems, and to provide a stable, full-featured parallel file system for Linux clusters.

PVFS2 uses some the basic concepts from PVFS1 but redesigns the architecture to make it more scalable, more flexible, and more portable. It abstracts the storage layer allowing different types of storage techniques, and it also abstracts the networking layer allowing different types of networking protocols besides TCP to be easily integrated into PVFS2. In addition, scalability has been re-examined in all phases of the redesign to ensure that PVFS2 is scalable to thousands if not tens of thousands of nodes.

Both PVFS1 and PVFS2 are similar in their basic architecture. PVFS splits the tasks of the file system into pieces. It allows the multiple metadata to be separate from the data servers. PVFS1 uses one metadata server, while PVFS2 allows multiple servers. Data are stored on other servers or I/O nodes, called "iod's" (short for IO daemons). Both PVFS1 and PVFS2 are "virtual" file systems in that they use existing local file systems such as Ext2, Ext3, ReiserFS, and XFS to store data local to each server, rather than relying on its own file system design. PVFS data are striped in files across some or all of the iod's within the PVFS system. Typically, these data are written in round-robin fashion although PVFS2 allows different access patterns.

I won't be discussing how to setup and configure either PVFS1 nor PVFS2. Please see the sidebar that lists the links for PVFS installation support. Let's start with by looking at PVFS1 and how one can access the file system from your codes.

PVFS1 Coding

PVFS1 was designed so that there would be several API's (Application Programming Interfaces) to PVFS and that applications developed using the standard UNIX I/O API must still work with PVFS. Three classes of API's are made available to the user: a native PVFS1 API; the "UNIX/POSIX-like" I/O API; and other API's such as MPI-IO.

The UNIX/POSIX I/O API allows codes that have been built with the standard UNIX I/O functions such as fopen(), open(), fprintf(), fread(), fwrite(), read(), write(), fclose(), etc., to all will work with PVFS1 without any changes to the code and without any recompiling or relinking. The PVFS kernel module provides this functionality. On each PVFS1 client, there is a small daemon (pvfsd) that is run. All function calls are processed by the VFS (Linux kernel Virtual File System interface) which decides if the operation is a PVFS function. If it is a PVFS operation, then the pvfsd performs the operation on behalf of the application. The data from the I/O functions will use the default stripe size from when PVFS1 was compiled, usually 64K. Also, all of the machines defined in PVFS1 will be used beginning with the first machine in the .iodtab file.

The functions read()and write() perform the data transfer with the I/O nodes each time a call is made. For accessing small amounts of data, this can be terribly time consuming since the data may lie on another node which must be accessed over the network. On the other hand, fread() and fwrite() locally buffer small file access and perform data exchanges with the I/O nodes in chunks of some minimum size. Consequently, the data transfer rate is much better using these functions. However, there are some consistency issues with fread() and fwrite() due to the buffering. Also, using PVFS1 for formatted I/O using functions such as fprintf() and fscanf() is not a good use of the file system because the data streams are usually small and will not take advantage of the speed capability of PVFS1.

While running codes that use the "UNIX/POSIX-like" API with PVFS1 will work, to get better performance one should look at porting the code to use the native PVFS API. This port may seem daunting but is in fact it is fairly painless. The PVFS1 I/O functions are intended to be very close to the UNIX/POSIX calls, so porting can be require minimum effort.

For example, the normal open() function can be replaced with the pvfs_open() function:

int open(const char *pathname, 
         int flags, 
         mode_t mode);

int pvfs_open(const char *pathname,
              int flags, 
              mode_t mode, 
              struct pvfs_filestat *dist);

The structure pvfs_filestat is an optional argument to define the data distribution that the user wants. If the data structure is not passed then the default distribution is used. The data structures looks like the following:

struct pvfs_filestat {
 int base;   /* first iod node */
 int pcount; /* number of iod nodes*/
 int ssize;  /* stripe size */
 int soff;   /* NOT USED */
 int bsize;  /* NOT USED */
where base is the starting I/O node number (it starts with 0) and pcount is the number of I/O nodes you want to use, up to the total number of nodes in PVFS1 (set pcount = -1 to use all of the nodes). The element ssize is the stripe size. The elements soff and bsize are not currently used. You can see the semantics of using the PVFS1 API is very similar to the standard UNIX/POSIX I/O API. Many of the other UNIX/POSIX I/O functions have PVFS1 counterparts: pvfs_write(), pvfs_open(), pvfs_lseek(), pvfs_lseek64(), pvfs_close() and so on. Most of the time you can port your code to use the native PVFS1 library by a simple global find/replace that just changes the I/O function called.

Another way to take advantage of PVFS is to use MPI-IO which uses PVFS1 as the underlying file system. MPI-IO stands for Message Passing Interface-Input/Output which is defined in the MPI-2 standard. MPI-IO is a response to the call for a parallel I/O capability that is part of the MPI standard. There are many features of MPI-IO that allow the MPI processes to participate in the I/O process rather than more traditional approaches of having the rank 0 process perform all I/O or splitting the data so each processes has it's own set of data. While one might think that the native PVFS1 library interface would result in the faster I/O, using MPI-IO features such as noncontiguous accesses and collective I/O coupled with PVFS1 allows you to achieve some very high I/O rates. For example, ROMIO, an implementation of MPI-IO from Argonne National Laboratory, can be built to use PVFS1 and provide MPI-IO functions for the MPICH and LAM/MPI packages. The semantics of the MPI-IO functions were designed to be fairly close to the UNIX/POSIX I/O semantics. There are C versions of the functions as well as Fortran versions of the functions. While discussing MPI-IO is a bit beyond the scope of this column, let's take a quick look at some of the MPI-IO functions.

Opening a file using MPI-IO is fairly similar to the open() function.

int MPI_File_open(MPI_COMM comm, 
                  char *pathname, 
                  int amode, 
                  MPI_Info info, 
                  MPI_File *fh);
In this function all MPI processes call this function. The function is passed the MPI communicator (MPI_COMM_WORLD is usually passed), the pathname to the file, which should be a path to a PVFS mounted file system (although it doesn't have to be), the mode of the file access (the MPI-IO datatypes for used for this argument), and a "hint" argument which is type MPI_Info. The function returns a file handle that is of type MPI_File to the file. Then each MPI process can perform IO using this file handle. Here are the basic MPI-IO functions:
int MPI_File_open(MPI_COMM comm, 
                  char *pathname, 
                  int amode, 
                  MPI_Info info, 
                  MPI_File *fh);

int MPI_File_seek(MPI_File fh, 
                  MPI_Offset offset, 
                  int whence);

int MPI_File_read(MPI_File fh, 
                  void *buf, 
                  int count,
                  MPI_Datatype datatype, 
                  MPI_Status *status);

int MPI_File_write(MPI_File fh, 
                   void *buf, 
                   int count, 
                   MPI_Datatype datatype, 
                   MPI_Status *status);

int MPI_File_close(MPI_File *fh);

Using PVFS2 in Your Codes

PVFS2 is a new PFS developed by the same team that built the original version of PVFS. The developers found that modifying PVFS1 to add new networking protocols or new storage systems was very time consuming and difficult to maintain. Also, scaling PVFS1 to thousands and even tens of thousands of nodes was likely to be a problem. PVFS2 was a solution to these problems. The developers took the opportunity to redesign PVFS1 to make it easier to add new networking protocols and new storage techniques.

In redesigning PVFS the developers have taken the opportunity to rethink the APIs for accessing PVFS2. The developers found that not many people used the "UNIX/POSIX-like" I/O compatibility feature since the speed gain was fairly small due to the I/O not being optimized. Also people were not willing to port their applications to the PVFS-specific UNIX-like library just to get a little performance gain, particularly if they wanted to run the code somewhere else. Consequently, the PVFS2 developers are just focusing on the VFS support and MPI-IO support as the application API's at this time.

The best way to write code to take advantage of the tremendous speed potential of PVFS2 is to use MPI-IO. ROMIO has been updated to readily use PVFS2. The various hints and data access techniques within ROMIO have been updated to use PVFS2 to the fullest extent possible. This change makes a lot of sense since MPI-IO is part of the MPI-2 standard and is not supposed to change from vendor to vendor.

In addition to MPI-IO, there will be a library for interacting with PVFS2. However rather than have I/O functions that are very close to the UNIX/POSIX I/O functions, these are likely to be closer to the VFS functions that PVFS2 needs for speed and scalability. The difficulty for most people will be the lack of familiarity with the VFS functions in Linux. The best suggestion that Dr. Rob Ross, the lead developer of PVFS2 gives is to, "... suggest that people write serial tools to the UNIX I/O API and parallel ones to the MPI-IO API; that will get them both the best overall performance in the parallel case and the best portability in both cases."

Sidebar One: Links Mentioned in Column



MPI-2 Book

MPI-IO Doc 1

MPI-IO Doc 2

MPI-IO Doc 3

"Scalable I/O on Clusters, Part 1";, Forrest Hoffman, Linux Magazine, July 2002.

"High Performance I/O PVFS2 for Clusters," Neill Miller, Rob Latham, Rob Ross, and Phil Carns, ClusterWorld Magazine, Vol. 2, No. 4, April 2004.

"A Next Generation Parallel File System for Linux Clusters," Robert Ross, Rob Latham, Neill Miller, and Phillip Carns, January 2004.

"Scalable I/O on Clusters, Part ", Forrest Hoffman, Linux Magazine, August 2002.

This article was originally published in ClusterWorld Magazine. It has been updated and formated for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Dr. Jeff Layton hopes to someday have a 20 TB file system in his home computer. He lives in the Atlanta area and can sometimes be found lounging at the nearby Fry's, dreaming of hardware and drinking coffee (but never during working hours).