Resilient PVFS, Yes It Is Possible

Checkpointing HOWTO

Many codes do checkpointing. That is, they write their current state of computation to a file. The concept is that if the code dies for whatever reason, the code could be restarted from the last good checkpoint. This method saves time, particularly for long computations.

Remember that PVFS is intended as a high-speed scratch file system. Because the file system is so fast, it is a very good place to write your checkpoint files. There is a chance, however, that one of the IO servers could go down, in which case you will lose access to any files that had data on that server. This danger exists for any file system, not just PVFS. Let's take a few moments to examine how we might modify our codes to do checkpointing better.

A simple approach to checkpointing is to write the state of the computation to a file at some interval during the code run. The checkpoint file name is usually the same each time, since this saves file space. This method is also convenient from a coding point of view, since the code uses the same file name for writing and for reading the checkpoint. However, if a problem occurs while the checkpoint file is being written, the file will be corrupt and you will have lost the benefits of checkpointing, i.e., you must restart your entire program. Moreover, if the file system becomes corrupt or goes off-line, you will have to wait until it has been repaired or restored to get the checkpoint file back.

There are several ways to avoid some of these problems. They are the same for PVFS or any other file system. The first thing you should do is write to multiple files and, if available, multiple partitions. I would recommend rotating through at least two, and preferably three, files and partitions.

The first checkpoint should be written to the first file name. If possible, read the data back in to make sure the file is the correct size (this is optional, of course). You can also estimate the expected size of the file and compare it against the actual size. After the file has been written and you have determined that its size is correct, run md5sum on the file and save the result as well. Also, if possible, copy the file from PVFS to a file system that is backed up.
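The steps for a single checkpoint can be sketched in Python as shown here. This is a minimal illustration, not a prescribed implementation: the function names, the `.md5` sidecar file, and the `backup_dir` argument are all assumptions made for the example, and the checkpoint state is assumed to fit in memory as a `bytes` object.

```python
import hashlib
import os
import shutil
from pathlib import Path
from typing import Optional


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it back in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checkpoint(state: bytes, path: Path,
                     backup_dir: Optional[Path] = None) -> str:
    """Write one checkpoint, check its size, save its MD5, optionally copy it."""
    path = Path(path)
    with open(path, "wb") as f:
        f.write(state)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out to storage

    # Size check: the file on disk must be as large as the data we wrote.
    actual = path.stat().st_size
    if actual != len(state):
        raise IOError(f"{path}: expected {len(state)} bytes, found {actual}")

    # Save the checksum next to the checkpoint (hypothetical ".md5" sidecar).
    checksum = md5_of_file(path)
    sidecar = Path(str(path) + ".md5")
    sidecar.write_text(checksum + "\n")

    # If a backed-up file system was given, copy the checkpoint and checksum.
    if backup_dir is not None:
        backup_dir = Path(backup_dir)
        shutil.copy2(path, backup_dir / path.name)
        shutil.copy2(sidecar, backup_dir / sidecar.name)
    return checksum
```

Computing the digest by reading the file back, rather than hashing the in-memory data, doubles as the optional read-back step: the checksum describes what actually landed on disk.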

The next checkpoint should be written to the next file name. After writing, follow the same process: check the size, compute the md5sum of the checkpoint file, and copy the file to a file system that is backed up.

This process continues for as many checkpoint files as you want. After you have written the last file in the series, you start over with the first file name, then the second, and so on.
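The rotation might be sketched as follows, together with one hedged idea for the restart side: choose the most recently modified checkpoint whose MD5 still matches. The function names, the `.md5` sidecar convention, and the use of modification time to decide which file is newest are illustrative assumptions.

```python
import hashlib
from pathlib import Path
from typing import List, Optional


def slot_for(checkpoint_number: int, slots: List[Path]) -> Path:
    """Return the file for the Nth checkpoint (0-based), wrapping around."""
    return slots[checkpoint_number % len(slots)]


def is_intact(path: Path) -> bool:
    """True if the file exists and matches the digest in its '.md5' sidecar."""
    sidecar = Path(str(path) + ".md5")
    if not (path.exists() and sidecar.exists()):
        return False
    expected = sidecar.read_text().strip()
    # Reads the whole file at once; a chunked read is kinder to large files.
    return hashlib.md5(path.read_bytes()).hexdigest() == expected


def newest_intact(slots: List[Path]) -> Optional[Path]:
    """Pick the most recently modified checkpoint that passes the MD5 check."""
    good = [p for p in slots if is_intact(p)]
    return max(good, key=lambda p: p.stat().st_mtime) if good else None
```

With three slots, checkpoints 0, 1, 2, 3, 4 land in slots 0, 1, 2, 0, 1. The entries in `slots` can point at different partitions, so that losing one partition still leaves an older checkpoint to restart from.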

The key to this process is using multiple files for writing checkpoint data. Also, be sure to compute the md5sum and if possible copy the checkpoint files from PVFS to another file system that is backed up.

RAID-1 Within PVFS Itself

Every so often the idea of using RAID-1 (mirroring) within PVFS itself comes up on the PVFS and PVFS2 mailing lists. The concept would be to split the IO servers in half, create a PVFS file system from one half, and then mirror it on the other half of the IO servers. Then if an IO server goes down, the mirrored PVFS can take over until the faulty IO server is brought back on line.

There are a couple of downsides to this idea. First, you are only using half of your IO servers for data, which means you will only get half the speed. Second, the RAID-1 operation means that throughput will be slowed further by the need to copy the data to the mirrored IO servers. You can look at this one of two ways: you will get less than half the speed you could be getting, or you are paying twice the money for the same speed.
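As a rough illustration of that arithmetic, the sketch below assumes that aggregate throughput scales linearly with the number of IO servers, which real systems only approximate. The function, the `mirror_efficiency` factor, and the numbers in the note that follows are hypothetical and are not measurements.

```python
def aggregate_throughput(n_servers: int, per_server_mb_s: float,
                         mirrored: bool = False,
                         mirror_efficiency: float = 1.0) -> float:
    """Idealized aggregate throughput in MB/s, assuming linear scaling."""
    if mirrored:
        # Only half the servers hold primary data; the other half hold copies.
        # mirror_efficiency < 1.0 models the extra cost of writing the copies.
        return (n_servers // 2) * per_server_mb_s * mirror_efficiency
    return n_servers * per_server_mb_s
```

For example, with 8 servers at a hypothetical 50 MB/s each, the idealized figure is 400 MB/s unmirrored and at most 200 MB/s mirrored. Any mirroring overhead, modeled here as a `mirror_efficiency` below 1.0, pushes the mirrored figure lower still.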

Moreover, remember the intention of PVFS. It is designed to be a high-speed scratch file system. The key word is scratch. Therefore, redesigning or adding internal components to make PVFS more resilient goes against the basic tenet of the PVFS design. Even though the developers of PVFS do their best to add things that help the resiliency of PVFS, they will normally not do anything that sacrifices its performance potential.

Parting Comments

This column illustrates many ways you can improve the resilience and the flexibility of PVFS. Some of these options are trade-offs, and some improve both the throughput and the flexibility of PVFS. As always, your application should dictate how you deploy PVFS.

Sidebar One: Links Mentioned in Column

PVFS1

PVFS2

Software RAID Article

Software RAID Article

LVM2

LVM


Dr. Jeff Layton hopes to someday have a 20 TB file system in his home computer. He lives in the Atlanta area and can sometimes be found lounging at the nearby Fry's, dreaming of hardware and drinking coffee (but never during working hours).

    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.