Benchmarking Parallel File Systems

Published on Wednesday, 21 December 2005 14:00
Written by Jeff Layton

In past columns, we've been talking about PVFS: how to configure it for performance, flexibility, and fault tolerance. If you are interested in performance, you need some way of measuring how it changes when you make adjustments. In this column, I'll talk about how one benchmarks parallel file systems. Of course, when I talk about benchmarks I don't mean comparing parallel file systems to one another; rather, I mean the ability to determine the effects of changes on the performance of a parallel file system. This information gives you the ability to tune applications to maximize performance on a given parallel file system, or to tune a parallel file system for a given set of codes.

Serial Benchmarking

Let's start out the discussion by looking at how plain old serial file systems are benchmarked. The absolute best way to test any file system is to test it with your application. However, this is not always possible. Consequently people have developed applications that simulate some pattern of file system usage. The most common benchmarks are Bonnie++, IOZone, Postmark, and homegrown tests that use commands such as dd, tar, or cpio.

Bonnie++ is a benchmark suite that performs a number of simple tests to exercise the storage/file system combination. It primarily tests database-type access to a single file or a set of files. It also tests the creation, reading, and deletion of small files, which can simulate the access patterns of codes such as Squid, INN, or Maildir. Despite its database focus, people have found Bonnie++ to be a very good benchmarking program for comparing various file systems to one another, and it is also used to examine the effects of changing file system parameters.

As with all benchmarking codes, there are a number of caveats to observe when running Bonnie++ to produce any sort of meaningful result. The most important is to use a file size much larger than physical memory, to avoid the effects of buffering on the measurements. Recommendations for the file size range from twice the size of physical memory to ten times; I usually recommend five times the size of physical memory.
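As a quick way to apply that rule of thumb, the sketch below computes a suggested test-file size from the machine's physical memory. It assumes a POSIX system where `os.sysconf` exposes the page size and page count; the `-s` flag shown is Bonnie++'s file-size option, given in megabytes.

```python
import os

def recommended_file_size(multiple=5):
    """Suggest a Bonnie++ test-file size as a multiple of physical RAM.

    Assumes a POSIX system where sysconf reports page size and count.
    """
    page_size = os.sysconf("SC_PAGE_SIZE")
    phys_pages = os.sysconf("SC_PHYS_PAGES")
    return multiple * page_size * phys_pages

# Bonnie++ takes its file size in megabytes via -s:
size_mb = recommended_file_size() // (1024 * 1024)
print(f"bonnie++ -s {size_mb}")
```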

IOZone is another common file system benchmarking tool. It generates and measures a variety of file operations such as read, write, re-read, re-write, read backwards, read strided, fread, fwrite, random read, pread, mmap, aio_read, and aio_write. The goal of IOZone is to test a broad range of functions so that vendors cannot tune their hardware and file systems to enhance the performance of just a few applications.

Another popular benchmark is Postmark, which is used to test the performance of file systems on small files. Examples of applications dominated by small-file performance are email applications, news servers, and web-based commerce applications. The developer of Postmark, NetApp, feels that small-file performance is often overlooked, even though small files make up a very important part of Internet server workloads. Consequently, Postmark simulates heavy small-file system loads.

The Postmark benchmark was designed to create a large pool of small files that changes continuously, reflecting the transient nature of the files on a large Internet email server. It measures transaction rates, where a transaction is one of creating a file, deleting a file, reading a file, or appending to a file. The way transactions are processed is designed to minimize the effects of buffering and caching, although this depends on the parameters chosen for the benchmark run.
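A toy version of such a transaction loop might look like the following; the parameter names and the uniform create/delete/read/append mix are illustrative choices of mine, not Postmark's actual options or defaults.

```python
import os
import random
import shutil
import tempfile
import time

def postmark_style_run(n_files=100, n_transactions=500, size=1024):
    """Toy Postmark-like workload: build a pool of small files, then
    run a random mix of create/delete/read/append transactions
    against it and report transactions per second."""
    workdir = tempfile.mkdtemp()
    pool = []
    for i in range(n_files):
        path = os.path.join(workdir, f"f{i}")
        with open(path, "wb") as f:
            f.write(os.urandom(size))
        pool.append(path)

    next_id = n_files
    start = time.perf_counter()
    for _ in range(n_transactions):
        op = random.choice(("create", "delete", "read", "append"))
        if op == "create":
            path = os.path.join(workdir, f"f{next_id}")
            next_id += 1
            with open(path, "wb") as f:
                f.write(os.urandom(size))
            pool.append(path)
        elif op == "delete" and pool:
            os.remove(pool.pop(random.randrange(len(pool))))
        elif op == "read" and pool:
            with open(random.choice(pool), "rb") as f:
                f.read()
        elif op == "append" and pool:
            with open(random.choice(pool), "ab") as f:
                f.write(os.urandom(size))
    elapsed = time.perf_counter() - start
    shutil.rmtree(workdir)
    return n_transactions / elapsed

rate = postmark_style_run(n_files=20, n_transactions=100, size=256)
print(f"{rate:.0f} transactions/sec")
```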

These benchmarks (Bonnie++, IOZone, and Postmark) are good if your applications have similar access patterns. For applications that are not well represented by them, people have developed their own benchmarks. One example is to use the dd command to write a number of blocks to a file system; you can choose the block size and the number of blocks written to best represent your applications.
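A minimal dd-style write test can be sketched in a few lines of Python. The fsync call is there so the page cache does not make the numbers meaningless for small files, and the block size and count are placeholders you would tune to match your application.

```python
import os
import tempfile
import time

def write_throughput(path, block_size=1 << 20, n_blocks=64):
    """dd-style sequential write test: write n_blocks blocks of
    block_size bytes, fsync, and report MB/s. Without the fsync
    (or a file far larger than RAM), you mostly measure the cache."""
    block = b"\0" * block_size
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n_blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    return (block_size * n_blocks) / (1024 * 1024) / elapsed

path = os.path.join(tempfile.mkdtemp(), "ddtest.dat")
rate = write_throughput(path, block_size=1 << 16, n_blocks=16)
print(f"{rate:.1f} MB/s")
```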

There have also been a number of file system benchmarks written to stress the metadata operations of a file system. These benchmarks perform operations such as creating directories, retrieving statistics on files, truncating files, deleting files, and creating files.
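A homegrown metadata test along those lines is easy to sketch; the particular operation set and counts below are arbitrary choices for illustration, not any standard benchmark's.

```python
import os
import tempfile
import time

def metadata_ops_per_second(n=200):
    """Time a loop of create, stat, truncate, and delete operations
    on many empty files and report metadata operations per second."""
    workdir = tempfile.mkdtemp()
    start = time.perf_counter()
    for i in range(n):
        path = os.path.join(workdir, f"meta{i}")
        open(path, "wb").close()   # create
        os.stat(path)              # stat
        os.truncate(path, 0)       # truncate
        os.remove(path)            # delete
    elapsed = time.perf_counter() - start
    os.rmdir(workdir)
    return (4 * n) / elapsed

ops = metadata_ops_per_second(50)
print(f"{ops:.0f} metadata ops/sec")
```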

All of these benchmarks are good for testing applications that use small files on serial file systems. They can prove somewhat useful for certain parallel file systems such as PVFS, but may not be useful for others. They can also prove useful if you are going to use NFS for some of your cluster data storage needs – for example, mounting users' home directories on the cluster nodes.

Parallel file systems are used for high-speed I/O, primarily on larger files. Serial benchmarks take into account neither the memory access patterns nor the file access patterns of parallel codes. Consequently, using serial benchmarks to test parallel file systems is not appropriate.

Existing Parallel File System Benchmarks

In general, how one benchmarks parallel file systems depends upon the I/O requirements of the application. However, the range of I/O requirements is very large. For example, some codes require storing large datasets, some require out-of-core data access, some have database access requirements, and some require the reading and writing of large amounts of data in a serial fashion. Also, some codes push all of the I/O through the rank-0 process or have each process write a small part of the output that is then assembled together into a single file after all of the computation is finished. Consequently, the tools to evaluate a parallel file system for such a range of codes are very disparate.

Writing benchmarks for parallel file systems is much more difficult than writing serial benchmarks. Parallel file systems raise the issue of how the data is laid out in memory on the nodes and how this maps to the file(s) on the parallel file system. Moreover, there are issues with how the data is written to the file system. Because parallel file systems are used to store large amounts of data at high speeds, issues such as buffer sizes and caching techniques become much more important than they are for serial file systems. Even something as simple as timing operations is difficult, because the nodes' clocks may differ.
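To make the timing problem concrete, here are the two aggregation rules you most often see, sketched over gathered per-process (start, end) timestamps. The function and its inputs are illustrative, not part of any benchmark's API.

```python
def aggregate_io_time(per_process_times):
    """Aggregate per-process (start, end) timestamps for an I/O phase.

    wallclock: max(end) - min(start); correct only if the nodes'
    clocks are synchronized.
    max_local: the longest individual duration; immune to clock skew,
    but blind to processes that started late.
    """
    starts = [s for s, _ in per_process_times]
    ends = [e for _, e in per_process_times]
    wallclock = max(ends) - min(starts)
    max_local = max(e - s for s, e in per_process_times)
    return wallclock, max_local

# Two processes: one runs 0-2s, the other 1-4s.
print(aggregate_io_time([(0.0, 2.0), (1.0, 4.0)]))  # (4.0, 3.0)
```

The gap between the two numbers is exactly the ambiguity clock skew introduces: a benchmark has to pick one definition and state it.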

There have been some efforts to create benchmarks for parallel file systems. Some cover a range of access patterns, while others target specific workloads. Let's examine five of the more common benchmarks.


b_eff_io

The first one I'll examine is b_eff_io, the "Effective I/O Bandwidth Benchmark." In a general sense, this benchmark measures the time it takes to transfer data from a location in memory to a location in a file, from which the effective bandwidth can be computed. A huge number of parameters could influence the time for this transfer. The benchmark developers chose to classify the parameters of a parallel file system benchmark into six general categories: application parameters, access methods, file system parameters, programming interface, usage aspects, and statistical aspects.

The developers consider application parameters to include things such as the way data is organized in memory (contiguous or noncontiguous), the way data is read from or written to a file (again, contiguous or noncontiguous), the sizes of memory pages, the sizes of data blocks, and the distribution of the blocks.

The interface aspect of the benchmark concerns the choice of file access API. Several APIs could have been used: POSIX I/O (buffered or raw), vendor-specific APIs, or MPI-IO. The developers chose MPI-IO, since it is a standard interface used to access many parallel file systems.

The access methods are all based on the interface aspect, namely MPI-IO. There is a wide range of parameters covered for access methods. Since the code is testing I/O bandwidth and there is no overlap of computation and I/O, the authors chose to use blocking calls only.

File system parameters include such things as how many nodes are used, how much memory is used as buffer space, the disk block size, the disk striping size, and the number of parallel striping devices used. The b_eff_io benchmark cannot control these aspects directly, but the user can change them as needed.

Usage aspects include parameters such as how many processes are used and how many parallel processors and threads are used per process.

Statistical aspects include things such as repetition factors and how to calculate the effective bandwidth based on only a subset of the parameters used in the testing.

To make life easier for testing, the developers of b_eff_io have tried to ensure that the benchmark runs in 10-15 minutes. It tries a number of different access patterns to the data in memory so that a variety of codes are effectively represented. However, only a single run of each specific pattern is performed, so repeatability is never tested. In addition, only a small amount of data is used for each access pattern, limiting the applicability of the results to larger problems. Consequently, only a rough average of the effective bandwidth is obtained.

IOR

Lawrence Livermore National Laboratory has developed a benchmark called IOR - "Interleaved or Random." The random portion of the code has been removed from the benchmark, but the name has remained the same. The goal of the benchmark is to measure I/O rates for combined open, write, read, and close operations.

There are two versions of the code: a POSIX version that views files primarily as streams of bytes, and an MPI-IO version that allows regions of files to be represented as a datatype. In contrast to b_eff_io, IOR performs a great deal of I/O during the benchmark. A run takes longer than b_eff_io, but it is more likely to simulate applications that use a parallel file system. Unfortunately, IOR performs only one access pattern, so applying the results to codes with differing I/O requirements is not easy.

When IOR is run, it creates a file in which data are written by each process at some offset in the file. The data do not overlap; they are contiguous, with no gaps. The data are then read back, each block by a different process. Then the file is deleted and the resulting throughput is computed.
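The access pattern is easy to mimic in a single-process sketch using positional reads and writes. The shift-by-one reader rule below is my stand-in for "a different process reads the data back"; real IOR of course runs the ranks as separate MPI processes.

```python
import os
import tempfile

def ior_style_pattern(path, n_procs=4, block_size=4096):
    """Mimic IOR's file layout: rank r writes its block at offset
    r * block_size (contiguous, no gaps, no overlap); then each
    rank reads back a block written by a different rank, and the
    file is deleted."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        for rank in range(n_procs):
            os.pwrite(fd, bytes([rank]) * block_size, rank * block_size)
        ok = True
        for rank in range(n_procs):
            src = (rank + 1) % n_procs  # read a block another rank wrote
            data = os.pread(fd, block_size, src * block_size)
            ok = ok and data == bytes([src]) * block_size
    finally:
        os.close(fd)
    os.remove(path)  # IOR deletes the file afterwards
    return ok

ok = ior_style_pattern(os.path.join(tempfile.mkdtemp(), "ior.dat"))
print("verified" if ok else "mismatch")
```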

With IOR you can repeat the I/O phase of the benchmark as many times as you wish. You can also vary the block size using an environment variable.

mpi-tile-io

Rob Ross at Argonne National Laboratory wrote this benchmark. Its goal is to test an MPI-IO implementation and the underlying file system, usually a parallel one, under a noncontiguous access workload.

The benchmark simulates the workload of a large visualization code in which a single large scene is broken up across multiple monitors, such as in a video wall. The scene is divided into a grid, and each tile of the grid is assigned to a process. The data for each process are accessed contiguously in a sequential manner. There can be some overlap in the data between tiles. This access pattern also matches some other types of codes.
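The decomposition itself can be sketched as pure index arithmetic. Everything here (the names, the row-major layout, four-byte elements) is an illustrative assumption of mine rather than mpi-tile-io's actual parameters, but it shows why each tile's data are noncontiguous in the scene file: every row of a tile starts at a different file offset.

```python
def tile_offsets(scene_w, scene_h, tiles_x, tiles_y, overlap=0, elem_size=4):
    """Split a scene_w x scene_h grid into tiles_x * tiles_y tiles,
    one per process, with an optional overlap of elements between
    neighboring tiles. For each tile, return the byte offset of each
    of its rows in a row-major scene file, plus the row length in
    bytes. Names and layout are illustrative assumptions."""
    tile_w = scene_w // tiles_x
    tile_h = scene_h // tiles_y
    offsets = {}
    for ty in range(tiles_y):
        for tx in range(tiles_x):
            x0 = max(tx * tile_w - overlap, 0)
            y0 = max(ty * tile_h - overlap, 0)
            x1 = min((tx + 1) * tile_w + overlap, scene_w)
            y1 = min((ty + 1) * tile_h + overlap, scene_h)
            rows = [(y * scene_w + x0) * elem_size for y in range(y0, y1)]
            offsets[(tx, ty)] = (rows, (x1 - x0) * elem_size)
    return offsets

# An 8x8 scene split into a 2x2 grid of tiles:
print(tile_offsets(8, 8, 2, 2)[(1, 1)])  # ([144, 176, 208, 240], 16)
```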

The benchmark reports the minimum, mean, maximum, and variance of the times to open, close, and read the data. The application can also write the data out to a file that can be used for subsequent testing. As its name implies, the benchmark makes heavy use of MPI-IO, and to get good performance it also uses MPI datatypes for file access.

The benchmark is also smart enough to check the number of processes available against the number of tiles in the grid. It will throw out extra processes, removing them from the collective operations, and it will return an error message if there are fewer processes than tiles requested.

NPB IO

The NAS Parallel Benchmarks (NPB) are probably the best-known public parallel benchmark codes. As the codes were tested on new systems and developed for new architectures, the problem sizes were increased, and it was realized that the I/O portion of the codes was becoming increasingly important. Some attention was therefore focused on developing an MPI benchmark to test the I/O capability of file systems, particularly parallel file systems.

The initial code for testing I/O performance was provided by the Block-Tridiagonal (BT) NPB problem. The MPI implementation employs a fairly complex domain decomposition called diagonal multi-partitioning. The first implementation was called BTIO and used MPI-IO for file access.

The BTIO code was then formalized into a benchmark specification and released under the name NPB-MPI 2.4 I/O. The code is basically a CFD-like solver (a time-stepping code that solves a set of nonlinear equations). After every five time steps, the entire solution field is written to one or more files. After all time steps are finished, all the data belonging to a single time step must be stored in the same file and may be sorted. Any rearrangement of the data due to the file system or file system access is included in the final timings. The benchmark has also been designed to minimize the possibility that output data reside in system buffers by postponing a re-read of the data until the end of the program.

The benchmark includes a number of options for writing the data. These options allow the benchmark to write the data in various ways so you can simulate various I/O patterns for your code. However, the memory access and file access patterns are, in general, fixed; consequently, the benchmark cannot be adjusted for a specific memory or file access pattern.

Summary of Current Benchmarks

These parallel file system benchmarks provide some measure of I/O performance. However, they have all made decisions about the workload: whether it is small or large, how many times it is repeated, what API is used to perform the I/O, how the data are laid out in memory and on the file system (contiguously or noncontiguously), and how the I/O is timed. While timing may seem somewhat trivial, it is very important in clusters, since there is a propagation delay between nodes and all of the nodes have independent clocks.

In future columns I'll examine some new parallel file system benchmarks that may mitigate some of the limitations of these benchmarks.

Sidebar One: Links Mentioned in Column

Bonnie++

Postmark

IOZone

b_eff_io

IOR

mpi-tile-io

NPB IO


Dr. Jeff Layton hopes to someday have a 20 TB file system in his home computer. He lives in the Atlanta area and can sometimes be found lounging at the nearby Fry's, dreaming of hardware and drinking coffee (but never during working hours).