How to sleep well when running PVFS
In the last article we looked at performance improvements for PVFS1 and PVFS2. In this installment, we'll examine improving the resilience or redundancy of PVFS as well as putting some flexibility into the configuration.
Redundancy, or resiliency, is the ability to tolerate errors or failures without the entire system failing. For PVFS (Parallel Virtual File System), this means tolerating individual failures without the whole file system becoming unavailable.
As with performance, there are three areas on which we can focus to improve the resilience of PVFS: PVFS configuration, IO server configuration, and storage space. Attention to these three areas improves the ability of PVFS to tolerate errors and faults and keep functioning. You'll also find that many of these steps improve the flexibility of PVFS.
Since it's easy to think about PVFS in terms of storage space, let's examine how we can configure storage space to improve the resiliency and flexibility of PVFS.
Using RAID on IO Servers
In the previous column we discussed using RAID to improve the performance of the IO servers. We can also use RAID to let PVFS tolerate the loss of one or more drives without losing any data. The IO servers and the metadata server(s) (recall that PVFS2 allows you to have more than one metadata server) can both take advantage of RAID levels for redundancy. Using at least RAID-1 (mirroring) will allow IO servers to tolerate the loss of a drive without losing any PVFS data. For example, you could use two drives, mirror them (RAID-1), build a file system on them, and run PVFS from this file system. If a drive fails, you replace it and it rebuilds from the other drive in the background. If you have hot-swap drives, PVFS never goes down.
You can also use RAID-5 to combine a larger number of drives with redundancy. Combinations of RAID levels, such as RAID-10 (combining RAID-1 and RAID-0) or RAID-50 (combining RAID-5 and RAID-0), could also be used. These combinations get their redundancy from RAID-1 or RAID-5 respectively, and also gain some speed from the striping (RAID-0).
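To make the trade-off between these RAID levels concrete, here is a minimal sketch that computes usable capacity and the number of drive failures each level is guaranteed to survive. The function name raid_summary is ours, and the RAID-50 case assumes exactly two RAID-5 sub-arrays striped together; real controllers may offer other layouts.

```python
def raid_summary(level, drives, drive_gb):
    """Return (usable_gb, guaranteed_drive_failures_tolerated) for one RAID set.

    Assumes all drives are the same size. "Guaranteed" means the worst case:
    RAID-10 and RAID-50 can survive more failures if they land on different
    sub-arrays, but only one failure is always safe.
    """
    if level == "0":
        if drives < 2:
            raise ValueError("RAID-0 needs at least 2 drives")
        return drives * drive_gb, 0
    if level == "1":
        if drives < 2:
            raise ValueError("RAID-1 needs at least 2 drives")
        # Every drive holds a full copy, so capacity is one drive.
        return drive_gb, drives - 1
    if level == "5":
        if drives < 3:
            raise ValueError("RAID-5 needs at least 3 drives")
        # One drive's worth of space is used for parity.
        return (drives - 1) * drive_gb, 1
    if level == "10":
        if drives < 4 or drives % 2:
            raise ValueError("RAID-10 needs an even number of drives, 4 or more")
        # Mirrored pairs striped together.
        return (drives // 2) * drive_gb, 1
    if level == "50":
        if drives < 6 or drives % 2:
            raise ValueError("this sketch assumes two equal RAID-5 sub-arrays")
        # Two RAID-5 sets, each giving up one drive to parity.
        return (drives - 2) * drive_gb, 1
    raise ValueError("unknown RAID level: " + level)


if __name__ == "__main__":
    for level, drives in (("1", 2), ("5", 7), ("10", 4), ("50", 6)):
        usable, tolerated = raid_summary(level, drives, 160)
        print("RAID-%s with %d x 160 GB: %d GB usable, survives %d failure(s)"
              % (level, drives, usable, tolerated))
```

The numbers only cover capacity and fault tolerance; they say nothing about rebuild time or performance, which also matter when you pick a level.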
As mentioned in the previous column, you can use either dedicated hardware RAID controllers or software RAID (included in most Linux distributions). From a redundancy standpoint either solution will be fine. You can also use combinations of them. For example, you could use two hardware RAID controllers each with a RAID-1 set, and then use software RAID to stripe across the two sets (RAID-0).
Using LVM with PVFS
It's a fundamental law of nature and physics that users always want more space. If you prepare beforehand you can deal with the never-ending whining (oops, I mean "requests") of users for more space. Fortunately, given the design of PVFS, it is fairly easy to accommodate the request for more space.
Recall that PVFS is a virtual file system that is built on an existing file system such as ext2, ext3, jfs, xfs, reiserfs, or reiser4. The easiest way to add space to PVFS is to extend whichever file system you have chosen to put on your hard drives.
LVM (Logical Volume Manager) allows storage space to be adjusted as needs change. There are a number of things you can do with LVM to help you with PVFS. First, by creating the underlying file system using LVM, you can add more space to the file system as needed. You can also add a physical storage device (aka, a hard drive) and then extend a volume group, extend the logical volume, and then resize the file system to use the new space.
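The sequence just described (add a device, extend the volume group, extend the logical volume, resize the file system) can be sketched as a small helper that builds the commands without running them. The device, volume group, and mount point names below are hypothetical, and the sketch only knows about ext2/ext3 (resize2fs) and xfs (xfs_growfs). Whether the file system can be grown while mounted depends on the file system and kernel, so check before trying this on a live IO server.

```python
def lvm_grow_commands(new_device, volume_group, logical_volume,
                      grow_by, fs_type, mount_point):
    """Build (but do not run) the commands that add a device and grow a file system.

    grow_by is an LVM size string such as "100G".
    """
    lv_path = "/dev/%s/%s" % (volume_group, logical_volume)
    commands = [
        ["pvcreate", new_device],                    # label the new device for LVM
        ["vgextend", volume_group, new_device],      # add it to the volume group
        ["lvextend", "-L", "+" + grow_by, lv_path],  # grow the logical volume
    ]
    if fs_type in ("ext2", "ext3"):
        commands.append(["resize2fs", lv_path])      # grow to fill the volume
    elif fs_type == "xfs":
        commands.append(["xfs_growfs", mount_point])  # xfs grows via the mount point
    else:
        raise ValueError("no resize step known in this sketch for " + fs_type)
    return commands


if __name__ == "__main__":
    # All names here are made up for illustration.
    for command in lvm_grow_commands("/dev/sdc1", "pvfs_vg", "pvfs_lv",
                                     "100G", "ext3", "/pvfs-storage"):
        print(" ".join(command))
```

Printing the commands first and running them by hand is a deliberate choice: a typo in a device name is much cheaper to catch on the screen than on the disk.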
For a performance boost, you can configure LVM to use striping. Combining this with RAID-0 should be done carefully to get the best performance and retain the flexibility of LVM.
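To see why layering one striping scheme on top of another needs care, it helps to look at how striping maps a byte offset to a device. The sketch below is plain round-robin striping arithmetic, not LVM's actual code; the function name and the 64 KB stripe size are our assumptions.

```python
def stripe_location(offset, stripe_size, n_devices):
    """Map a byte offset in a striped volume to (device_index, offset_on_device).

    Stripes are handed out round-robin: stripe 0 goes to device 0, stripe 1 to
    device 1, and so on, wrapping around after the last device.
    """
    if offset < 0 or stripe_size <= 0 or n_devices <= 0:
        raise ValueError("offset must be >= 0; stripe_size and n_devices > 0")
    stripe_index = offset // stripe_size
    device = stripe_index % n_devices
    # Each full pass over all devices advances every device by one stripe.
    device_offset = (stripe_index // n_devices) * stripe_size + offset % stripe_size
    return device, device_offset


if __name__ == "__main__":
    stripe = 64 * 1024
    for offset in (0, stripe, 2 * stripe + 10):
        print(offset, "->", stripe_location(offset, stripe, 2))
```

If the LVM stripe size and the RAID-0 stripe size underneath do not line up, a single large request can be chopped into many small ones, which is one reason to combine the two carefully.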
Hardening Metadata Servers
PVFS1 supports one metadata server while PVFS2 supports multiple metadata servers for performance reasons, but in both cases there is no failover design. So, in general, the metadata server is a single point of failure. Although the metadata server is very lightly loaded, it is worth considering options to improve its reliability.
One of the first things you can do is to improve the resilience of components in the metadata server to failure. For example, you can have redundant fail-over power supplies in the metadata servers. Another good idea is to make the storage on the metadata server more redundant. For example, using RAID-1 or RAID-5 with hot swap would allow a more resilient metadata storage system.
One could even go a bit further and use a High-Availability (HA) configuration for the metadata server. By configuring an active-passive HA configuration, the passive machine could take over in the event that the active system failed.
HA for the IO/Metadata Servers
One other thing you can do is to make the IO servers into high-availability systems. In this case, you could make one IO server an "active" machine and another IO server a "passive" machine. The passive machine won't participate in PVFS, but could take over the function of the active machine if the active machine fails for some reason. This configuration means that you will use twice as many machines for the same level of performance. You get a much more redundant PVFS, but the configuration is twice as costly and you could be using those extra machines for improved throughput. However, because PVFS is so flexible, such a configuration is definitely possible if you decide it is needed.
There is a document in the ~/doc section of the PVFS2 source that discusses a high-availability experiment the PVFS2 development team performed. They began with two Dell machines with PowerEdge RAID Controllers in them. The two machines also shared a Dell PowerVault with seven 160 GB disks in a RAID-5 configuration. The two nodes were connected with a cross-over GigE cable. If you are going to do an active-active configuration, then you have to create two partitions on the shared storage, one for each node. Otherwise a single partition is enough.
In the first experiment the team configured the two nodes as an active-passive pair. In other words, one of the nodes is considered the "active" node providing storage and the other node just watches the active node via a heartbeat cable between the two machines. The active machine has an IP address known to the rest of the machines in the cluster. If it fails, then the passive machine will change its IP address to match the known address. The rest of the machines won't even know the first one is down except that they may have to retry some operations (Note: PVFS2 has the ability to retry operations).
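The retry behavior mentioned in the note is what hides a failover from the clients. The sketch below shows the general idea only; it is not PVFS2's retry mechanism, and the function name, the attempt count, and the use of ConnectionError are all our assumptions.

```python
import time


def with_retries(operation, attempts=5, delay=0.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Only ConnectionError is treated as "server temporarily unreachable";
    any other exception is passed straight through to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError as err:
            last_error = err
            if attempt < attempts - 1:
                sleep(delay)  # give the standby time to take over the address
    raise last_error


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_read():
        # Pretend the first two tries hit the window while the IP address moves.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("server not answering")
        return b"data"

    print(with_retries(flaky_read), "after", calls["count"], "tries")
```

The delay between attempts should be chosen with the heartbeat timeout in mind: if clients give up before the passive node has had a chance to take over, the failover does them no good.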
During the experiment with the two nodes operating in failover, they brought down the first node (the active one) by simply turning off the heartbeat software. The second node concluded that the first one was down, so it took over the IP address, the file system, and the programs. When the first node is brought back into production, operations migrate back to it if you configured the heartbeat software to do so.
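The takeover and fail-back logic in this experiment can be modeled in a few lines. This is a toy state machine under our own assumptions (a missed-heartbeat counter and an auto_failback switch), not the heartbeat software the PVFS2 team used; the node names and the service address are hypothetical.

```python
class FailoverPair:
    """Toy model of an active-passive pair sharing one service IP address."""

    def __init__(self, primary, standby, service_ip, timeout=3, auto_failback=True):
        self.primary = primary
        self.standby = standby
        self.service_ip = service_ip        # the address the clients know about
        self.timeout = timeout              # missed heartbeats before takeover
        self.auto_failback = auto_failback  # move back when the primary returns?
        self.owner = primary                # node currently holding service_ip
        self.missed = 0

    def tick(self, primary_heartbeat_seen):
        """Process one heartbeat interval and return the current owner."""
        if primary_heartbeat_seen:
            self.missed = 0
            if self.owner == self.standby and self.auto_failback:
                self.owner = self.primary   # fail back to the original node
        else:
            self.missed += 1
            if self.missed >= self.timeout and self.owner == self.primary:
                self.owner = self.standby   # standby takes over the address
        return self.owner


if __name__ == "__main__":
    pair = FailoverPair("node1", "node2", "10.0.0.10", timeout=2)
    for seen in (True, False, False, True):
        print("heartbeat seen:", seen, "-> owner of", pair.service_ip, "is", pair.tick(seen))
```

Note that turning off the heartbeat software, as the team did, looks exactly like a dead node to this model, which is why it makes a convenient test.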
They also did an experiment with an active-active configuration. In this configuration you have two nodes both serving as PVFS2 storage. Each one has its own storage area on the shared storage device. The idea behind an active-active configuration is that both nodes are serving storage space, but if one dies the other will serve out all of the other's storage space. You have to configure the servers carefully, but it's not difficult thanks to the efforts of the PVFS2 development team.
Both configurations, active-passive and active-active, were shown to work by the PVFS2 team. If you want to implement high availability on your metadata server(s) and/or your IO servers, you can pick either configuration. The active-active configuration is appealing because you can use all of the servers while they are up and functioning. However, the active-active configuration is a bit more difficult to set up.
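Here is a minimal sketch of the active-active idea: each storage area has a home node, and a surviving node picks up the areas of a dead one. The function and the names part1, part2, node1, and node2 are hypothetical, and picking the first survivor alphabetically is just a simple rule for the sketch.

```python
def serving_map(home, alive):
    """Return which node serves each storage area.

    home  maps storage area -> the node that normally serves it.
    alive is the set of nodes currently up.
    An area whose home node is down moves to the first surviving node
    (alphabetically); if no node is up, the area is served by None.
    """
    survivors = sorted(node for node in set(home.values()) if node in alive)
    result = {}
    for area, node in home.items():
        if node in alive:
            result[area] = node
        else:
            result[area] = survivors[0] if survivors else None
    return result


if __name__ == "__main__":
    home = {"part1": "node1", "part2": "node2"}
    print(serving_map(home, {"node1", "node2"}))  # both nodes up
    print(serving_map(home, {"node2"}))           # node1 has died
```

With both nodes up, each serves its own area; with one down, the survivor serves both, at roughly half the throughput per area.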
Multiple PVFS Partitions
Another benefit of PVFS is that you can group the IO servers in any fashion you wish to meet your requirements. One of those requirements might be to keep PVFS functioning as much as possible. We've already discussed some things that help, but one option people often overlook is to take all of the IO servers and split them into distinct PVFS file systems.
For example, if you have 18 IO servers, you could break them into two groups of 9 IO servers, or three groups of 6 IO servers, etc. You can then mount each group on its own set of clients. Every client would mount its PVFS group as /mnt/pvfs, but the file system behind that mount point would come from a different set of IO servers for each client grouping. The applications could then use /mnt/pvfs regardless of the client they are on.
You could then run a code on each group of clients, each of which has its own PVFS subsystem. This configuration gives you the benefit of being able to take down one of the PVFS groups for maintenance while the other PVFS groups stay in production. Or if a failure occurs in one of the groups, the other groups are still functional.
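As a small illustration of the 18-server example, the sketch below splits a list of IO servers into equal groups. The host names io01 through io18 are made up, and insisting on equal-sized groups is our simplification; nothing in PVFS requires it.

```python
def split_into_groups(servers, n_groups):
    """Split a list of IO servers into n_groups contiguous groups of equal size."""
    if n_groups < 1 or len(servers) % n_groups:
        raise ValueError("server count must divide evenly into the groups")
    size = len(servers) // n_groups
    return [servers[i * size:(i + 1) * size] for i in range(n_groups)]


if __name__ == "__main__":
    io_servers = ["io%02d" % n for n in range(1, 19)]  # io01 .. io18
    for n_groups in (2, 3):
        groups = split_into_groups(io_servers, n_groups)
        print(n_groups, "groups of", len(groups[0]), "->", groups)
```

Each resulting list would then become the IO server set for one PVFS file system.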
However, there may be problems with this approach, since you might have to adjust the scheduling configuration so that a particular parallel job only gets the clients associated with one group.
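The scheduling constraint boils down to a simple check: every node assigned to a job must belong to the same client group. The sketch below is a stand-alone check with hypothetical group and node names; a real batch scheduler would express the same rule through its own queue or node-property settings.

```python
def group_for_job(job_nodes, client_groups):
    """Return the name of the one client group that contains every job node.

    client_groups maps group name -> list of client host names.
    Returns None if the job is empty or its nodes span more than one group.
    """
    wanted = set(job_nodes)
    if not wanted:
        return None
    for name, members in client_groups.items():
        if wanted <= set(members):
            return name
    return None


if __name__ == "__main__":
    groups = {
        "groupA": ["c01", "c02", "c03", "c04"],
        "groupB": ["c05", "c06", "c07", "c08"],
    }
    print(group_for_job(["c01", "c03"], groups))  # fits inside one group
    print(group_for_job(["c04", "c05"], groups))  # spans two groups
```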
An alternative configuration is to create several different PVFS groups and mount them with different mount point names. In the previous example you could have /mnt/pvfs1, /mnt/pvfs2, and /mnt/pvfs3. The user codes could use whichever group they wanted. Or you could adapt the codes to look at each group. If one group has more space or perhaps is faster, then you could have the application write to that group.
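If the application is going to pick a group by free space, the choice can be as simple as the sketch below. It uses Python's shutil.disk_usage, which reports whatever the mounted file system tells the operating system; how accurate that is on a PVFS mount is something we have not verified, so treat it as an assumption. The free-space lookup is passed in as a parameter so it can be swapped out.

```python
import shutil


def free_bytes(path):
    """Free space under path, as reported by the operating system."""
    return shutil.disk_usage(path).free


def pick_roomiest(mount_points, free=free_bytes):
    """Return the mount point with the most free space."""
    if not mount_points:
        raise ValueError("no mount points given")
    return max(mount_points, key=free)


if __name__ == "__main__":
    # Stand-in numbers so the example runs without PVFS mounted.
    fake_free = {"/mnt/pvfs1": 10, "/mnt/pvfs2": 50, "/mnt/pvfs3": 30}
    print(pick_roomiest(list(fake_free), free=fake_free.get))
```

The same pattern works for picking the fastest group: replace the free-space lookup with a measured write rate.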
Moreover, the codes could write a copy of their data to each PVFS group (a quasi-RAID 1). This method will require more time since the code is basically writing its output multiple times. However, since PVFS is so fast, this may not be noticeable. The upside is that you are unlikely to lose a node in all three groups. Consequently, the data should always be available in at least one group.
Of course, if you have a fixed number of IO servers, creating separate PVFS groups will limit the maximum PVFS performance available to your cluster. You can always add more IO nodes to your PVFS groups, however.
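A minimal sketch of the quasi-RAID 1 idea follows: write the same data under every mount point, then read it back from the first one that still has it. The function names are ours, and the sketch uses ordinary file calls, so it works on any directories. It skips a mount point whose directory is missing, but it does not verify that PVFS is actually mounted there.

```python
import os


def write_replicated(mount_points, relative_path, data):
    """Write data to relative_path under every reachable mount point.

    Returns the list of files written; raises OSError if none could be written.
    """
    written = []
    for mount in mount_points:
        if not os.path.isdir(mount):
            continue  # this group looks unavailable, so skip it
        target = os.path.join(mount, relative_path)
        try:
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as handle:
                handle.write(data)
            written.append(target)
        except OSError:
            continue  # a failure in one group should not stop the others
    if not written:
        raise OSError("could not write to any PVFS group")
    return written


def read_first_available(mount_points, relative_path):
    """Return the data from the first mount point that still has the file."""
    for mount in mount_points:
        try:
            with open(os.path.join(mount, relative_path), "rb") as handle:
                return handle.read()
        except OSError:
            continue
    raise FileNotFoundError(relative_path)
```

A code would call write_replicated(["/mnt/pvfs1", "/mnt/pvfs2", "/mnt/pvfs3"], "run1/out.dat", data) at each output step. Since the copies are written one after another, the cost is roughly one extra write per extra group.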
For slower networks connecting the clients and the IO servers, breaking the IO servers into more than one group could improve overall throughput. With too many IO servers communicating over a relatively slow network, the network will be saturated and performance will either plateau or get worse. By separating into multiple groups, the traffic is potentially going to be better balanced since the user applications will be at different points in their computations. Therefore the overall throughput should be better. Alternatively, you could put the IO servers on separate networks that are slower so that PVFS does not saturate the network. Then overall throughput will be better than using a single network.
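The saturation argument can be put into a toy model: a group of IO servers can deliver no more than the smaller of what its servers can supply and what its network can carry. The model and every number in it are illustrative assumptions, not measurements, and it ignores contention effects that make real performance drop rather than just plateau.

```python
def group_throughput(n_servers, per_server_mb_s, network_mb_s):
    """Upper bound on one group's throughput: the servers or the network, whichever is smaller."""
    return min(n_servers * per_server_mb_s, network_mb_s)


def total_throughput(group_sizes, per_server_mb_s, network_mb_s):
    """Sum over groups, assuming each group sits on its own separate network."""
    return sum(group_throughput(n, per_server_mb_s, network_mb_s)
               for n in group_sizes)


if __name__ == "__main__":
    # Made-up numbers: 40 MB/s per IO server, 100 MB/s per network.
    print("one group of 18, one network:      ", total_throughput([18], 40, 100), "MB/s")
    print("three groups of 6, three networks: ", total_throughput([6, 6, 6], 40, 100), "MB/s")
```

In this model the single network caps 18 servers at the network's limit, while three separate networks each carry their own group's traffic.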