A Benchmark for Parallel File Systems

Written by Jeff Layton; Published: 25 January 2006

Our own benchmark, we are special you know

In a previous article I started to explore benchmarks for parallel file systems. In the article, we learned that benchmarks for serial file systems are not the best tools to measure the performance of parallel file systems (big surprise). Five of the most common parallel file system benchmarks were also mentioned, but the use of these was limited because they were only applicable to certain workloads and/or certain access patterns -- either memory access or storage access.

In this article we will take a look at a relatively new parallel file system benchmark suite that was designed to capture the behavior of several classes of parallel access signatures.

Introduction

A very good starting point for a new synthetic parallel file system benchmark suite is the master's thesis by Frank Shorter at the Parallel Architecture Research Laboratory (PARL) at Clemson University, written under the direction of Dr. Walt Ligon. [Just to be clear, there is no connection (that we know of) between the name of our mascot (Walt) and the esteemed Walt Ligon. -- Ed.]

This article will discuss synthetic benchmarks, that is, benchmarks that are not real applications, but are designed to reflect real applications. Due to the wide range of I/O requirements from various codes, developing synthetic benchmarks is the only realistic approach to benchmark parallel file systems.

Before jumping into a discussion of the new benchmark suite that Mr. Shorter suggests, we should ask what a parallel file system benchmark should produce, or what it should do.

Let's start with the premise that parallel file systems are designed to deliver high performance on large amounts of data. So we're interested in measuring how long it takes to read and write data to and from the file system. One can think of this as the bandwidth of the file system (given the time and the amount of data, you can compute the bandwidth).
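As a rough illustration of that parenthetical, bandwidth is simply the number of bytes moved divided by the elapsed time. The sketch below assumes decimal megabytes (10^6 bytes); the function name and the sample numbers are hypothetical and are not taken from any benchmark.

```python
def bandwidth_mb_per_s(total_bytes, elapsed_seconds):
    """Return bandwidth in MB/s, assuming 1 MB = 10**6 bytes."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return total_bytes / elapsed_seconds / 1e6


# Hypothetical run: 1 GB moved in 10 seconds.
print(bandwidth_mb_per_s(1_000_000_000, 10.0))
```

The hard part, as the next paragraph points out, is not the division but deciding what counts as the elapsed time.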

However, there are some caveats here. First, what is a "large amount of data"? Second, how do we effectively measure time for distributed I/O when the nodes are independent of one another? Hopefully I'll answer these questions as we discuss some new benchmarks.

MPI-IO Is Here

Before the advent of the MPI 2.0 standard, reading and writing data to and from parallel file systems was done on an ad hoc basis that often included non-standard methods. MPI 2.0 brought MPI-based functions for reading and writing data to the file system, in our case, a parallel file system. This change provides portability to MPI applications needing parallel I/O. Some MPI implementations have provided MPI-IO functions in their 1.X versions as well.

To simplify coding and help improve performance, MPI-IO was designed to abstract the I/O function. It allows the code to set what is called a virtual file view, which defines which part of a file is visible to a process. It also defines collective I/O functions, called by all of the processes together, that help with non-contiguous file access. By defining a virtual file view and using collective I/O functions, the code can then perform the I/O in a few function calls.
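To make the idea of a file view a little more concrete, here is a small pure-Python model of the arithmetic behind one common view. It is only a sketch under an assumed layout (fixed-size blocks dealt out to processes round-robin), and the function name is hypothetical. In a real MPI program the view would be set with MPI_File_set_view rather than computed by hand.

```python
def view_to_file_offset(rank, nprocs, block_size, logical_offset):
    """Map an offset in a rank's contiguous view to an offset in the shared file.

    Assumed layout: the file is a sequence of block_size-byte blocks dealt out
    to ranks round-robin, so rank r sees blocks r, r + nprocs, r + 2*nprocs, ...
    """
    displacement = rank * block_size   # where this rank's first block starts
    stride = nprocs * block_size       # distance between this rank's blocks
    block_index, within = divmod(logical_offset, block_size)
    return displacement + block_index * stride + within


# Rank 1 of 4 with 8-byte blocks: where do bytes 0..9 of its view live?
print([view_to_file_offset(1, 4, 8, i) for i in range(10)])
```

Each process reads and writes its view as if it were one contiguous file, while the mapping scatters those bytes across the shared file.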

Also, MPI-IO allows complex derived data types to be constructed for data of any type and size. The combination of a file view and derived data types allows virtually any access pattern to be addressed. In a generic sense, it allows the data, whether in memory or in a file, to be stored either contiguously or non-contiguously. The data can be stored non-contiguously in memory and contiguously in a file, contiguously in memory and non-contiguously in a file, or non-contiguously in both.

For the rest of the discussion, I'll focus on benchmarks that use MPI-IO: first, because it is a standard that allows codes to be portable, and second, because it allows different access patterns to be easily coded into the benchmark.

Work Loads

Before synthetic benchmark(s) can actually be written, we must choose the access patterns that we desire to simulate. These patterns come from examining the typical work loads that people see in their codes.

As discussed in Mr. Shorter's thesis, past efforts at defining work loads provide a very good summary of the dominant work loads. One of the dominant patterns found by examining a large number of codes is what is called strided access. A stride is the distance within the file from the beginning of one part of a data block to the beginning of the next part.

Mr. Shorter has described some past work on the CHARISMA project that has shown there are two types of strided access. The first, termed simple strided, was found to be used in many codes with and without serial access patterns. In other words, the data may be accessed in a serial manner, but there was some distance between the data accesses. The CHARISMA researchers went on to say that serial access can be mathematically decomposed into a series of simple strides.
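A simple strided pattern is easy to write down as a list of file offsets. The sketch below is my own illustration, with hypothetical names and sizes; it is not code from pio-bench. It also shows one way to read the remark about serial access: when the stride equals the size of each access, the accesses sit back to back.

```python
def simple_strided_offsets(start, stride, count):
    """Return the starting file offset of each access in a simple strided pattern."""
    return [start + i * stride for i in range(count)]


# Four accesses, each beginning 64 bytes after the previous one began.
print(simple_strided_offsets(0, 64, 4))

# Serial access as a special case: 16-byte accesses with a 16-byte stride.
print(simple_strided_offsets(0, 16, 4))
```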

The second popular stride pattern is called "nested stride." In this case, the code is accessing a stripe of data. Within the stripe, the data can be accessed in a set of simple strides. Hence, the term, nested stride.
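A nested stride can be sketched the same way: an outer stride picks the start of each stripe, and an inner simple stride walks within it. The parameter names and sample values below are assumptions made for illustration only.

```python
def nested_strided_offsets(start, outer_stride, outer_count, inner_stride, inner_count):
    """Return access offsets for a two-level (nested) strided pattern."""
    offsets = []
    for outer in range(outer_count):
        stripe_base = start + outer * outer_stride      # beginning of this stripe
        for inner in range(inner_count):
            offsets.append(stripe_base + inner * inner_stride)  # simple stride inside it
    return offsets


# Two stripes 1000 bytes apart, three accesses 100 bytes apart inside each stripe.
print(nested_strided_offsets(0, 1000, 2, 100, 3))
```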

In addition to how the data is actually accessed (the spatial access pattern), past efforts at examining the I/O patterns of programs have shown that there is a temporal access pattern. As you would guess, researchers found that codes sometimes have access patterns that reflect various procedures in the code. Such patterns vary with time, and are called temporal access patterns.

Introducing pio-bench

The result of Mr. Shorter's thesis was a new benchmark suite, named pio-bench. In this section, I'll go over some of the design aspects that he considered. While it may seem a bit dry, understanding the critical issues and how they were implemented in the benchmark suite will help you understand the benchmark results.

Finding The Time

In his work, Mr. Shorter has taken great pains to develop a framework that provides accurate timing. The first step was to divide the actual benchmark into three pieces: a setup phase, an operations phase (reads or writes), and a cleanup phase. While this choice may sound simplistic, it allows the timings to focus on the various pieces of the whole I/O process that may be of interest to certain people. Also, it allows for standardized timing reports for all of the various benchmarks.

As I mentioned in the previous column, measuring the time it takes to perform I/O on a distributed system is a difficult proposition. The clocks on the nodes are skewed relative to one another and the nodes will finish their various portions of the I/O at different times. So, how does one measure time for a parallel file system benchmark?

To resolve this issue, the pio-bench code uses aggregate time. That is, the time from when the earliest process starts its I/O, to the time that the last process finishes its I/O.

The general flow of the benchmarks begins with the setup phase. After the setup phase, an MPI_Barrier is called. This step ensures that the various MPI processes are synchronized. Then the operations take place, where the I/O is actually performed, followed by another MPI_Barrier. This aggregate time is written as:

Aggregate Time = T + S + B

Here, the term S is the time between the end of the first MPI_Barrier and the beginning of the first file access by any process. The term T is the total amount of time that all accesses take from beginning to end. Note that some processes may be done before T is reached. At the end of the operations phase, another MPI_Barrier is called. The term B is the time that this second barrier takes after every access has completed. This second barrier ensures that all processes have finished executing their operations.
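The three terms can be worked through with made-up timestamps. The sketch below assumes, purely for illustration, that every timestamp is measured on one common clock, which is exactly what a real cluster does not give you; the function name and the numbers are hypothetical.

```python
def aggregate_time(barrier1_end, io_starts, io_ends, barrier2_end):
    """Split the interval between the two barriers into the terms S, T, and B.

    All arguments are seconds on a single shared clock (an assumption).
    io_starts and io_ends hold one entry per process.
    """
    first_start = min(io_starts)       # earliest process to begin its I/O
    last_end = max(io_ends)            # last process to finish its I/O
    s = first_start - barrier1_end     # gap between first barrier and first access
    t = last_end - first_start         # all accesses, from beginning to end
    b = barrier2_end - last_end        # second barrier, after the final access
    return {"S": s, "T": t, "B": b, "aggregate": s + t + b}


# Two processes; the second one starts later and finishes later.
print(aggregate_time(10.0, [10.5, 11.0], [20.0, 22.0], 22.25))
```

Under these assumptions the three terms always add up to the interval between the end of the first barrier and the end of the second one.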

If the amount of data processed is small, including B in the timings can skew the results. So, the time for the I/O processing must be long enough that B is insignificant relative to the aggregate time. This requirement helps define the minimum amount of data the benchmark needs to process for reproducible results.
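One back-of-the-envelope way to turn this requirement into a number is to demand that B be no more than some fraction of T, then convert the resulting minimum T into bytes using an estimate of the file system's bandwidth. The criterion, the function name, and all of the input values below are my assumptions for illustration; the thesis may size its runs differently.

```python
import math


def min_bytes_for_small_barrier(bandwidth_bytes_per_s, barrier_seconds, max_fraction):
    """Smallest data size for which B <= max_fraction * T at the assumed bandwidth."""
    if max_fraction <= 0:
        raise ValueError("max_fraction must be positive")
    min_io_seconds = barrier_seconds / max_fraction    # T must be at least this long
    return math.ceil(bandwidth_bytes_per_s * min_io_seconds)


# Hypothetical inputs: 200 MB/s file system, 2 ms barrier, B held to 1% of T.
print(min_bytes_for_small_barrier(200e6, 0.002, 0.01))
```

The faster the file system or the slower the barrier, the more data the benchmark has to move before the barrier cost disappears into the noise.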

This work is licensed under CC BY-NC-SA 4.0
