How to sleep well when running PVFS
In the last article we looked at performance improvements for PVFS1
and PVFS2. In this installment, we'll examine improving the resilience or
redundancy of PVFS as well as putting some flexibility into the
configuration.
Redundancy, or resiliency, is the ability to tolerate errors or failures
without failure of the entire system. For PVFS (Parallel Virtual File System), this means the ability
to tolerate individual failures without all of PVFS becoming unavailable.
As with performance, there are three areas on which we can focus to
improve the resilience of PVFS: PVFS configuration, IO server
configuration, and storage space. Attention to these three areas lets
PVFS tolerate errors and faults and keep functioning. At the same
time you'll find that many of these steps also improve the
flexibility of PVFS.
Since it's easy to think about PVFS in terms of storage space, let's
examine how we can configure storage space to improve the resiliency
and flexibility of PVFS.
Using RAID on IO Servers
In the previous column we discussed using RAID to improve the
performance of the IO servers. We can also use RAID so that PVFS
can tolerate the loss of one or more drives without
losing any data. The IO servers and the metadata server(s)
(recall that PVFS2 allows you to have more than one metadata server)
can both take advantage of RAID levels for redundancy. Using at
least RAID-1 (mirroring) will allow IO servers to tolerate the loss
of a drive without losing any PVFS data. For example, you could use
two drives, mirror them (RAID-1), build a file system on them, and
run PVFS from this file system. If a drive fails, you replace it
and it rebuilds from the other drive in the background. If you have
hot-swap drives, PVFS does not even have to go down for the replacement.
You can also use RAID-5 to combine a larger number of drives with
redundancy. Combinations of RAID, such as RAID-10 (combining
RAID-1 and RAID-0) or RAID-50 (combining RAID-5 and RAID-0)
could also be used. These combinations have some redundancy from
RAID-1 or RAID-5 respectively, but also gain some speed because
of the striping (RAID-0).
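To make the trade-off between capacity and redundancy concrete, here is a
minimal sketch in Python that computes the usable space and the number of
drive failures each RAID level is guaranteed to survive. The function name
raid_summary and the drive counts are illustrative assumptions, and the
arithmetic uses the standard textbook layouts rather than the behavior of
any particular controller.

```python
def raid_summary(level, drives, drive_gb, sets=2):
    """Return (usable_gb, guaranteed_failures) for one RAID array.

    guaranteed_failures is the worst-case number of drive losses the array
    survives. RAID-10 and RAID-50 can survive more if the failed drives
    happen to sit in different mirrors or different RAID-5 sets.
    `sets` is only used for RAID-50 (number of RAID-5 sets striped together).
    """
    if level == "0":
        return drives * drive_gb, 0
    if level == "1":
        if drives < 2:
            raise ValueError("RAID-1 needs at least 2 drives")
        return drive_gb, drives - 1
    if level == "5":
        if drives < 3:
            raise ValueError("RAID-5 needs at least 3 drives")
        return (drives - 1) * drive_gb, 1
    if level == "10":
        if drives < 4 or drives % 2:
            raise ValueError("RAID-10 needs an even number of drives, 4 or more")
        return (drives // 2) * drive_gb, 1
    if level == "50":
        if sets < 2 or drives % sets or drives // sets < 3:
            raise ValueError("RAID-50 needs 2+ equal RAID-5 sets of 3+ drives")
        return (drives - sets) * drive_gb, 1
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    # Hypothetical layouts built from 160 GB drives.
    for level, drives in [("1", 2), ("5", 7), ("10", 4), ("50", 6)]:
        usable, failures = raid_summary(level, drives, 160)
        print(f"RAID-{level} with {drives} drives: {usable} GB usable, "
              f"survives at least {failures} drive failure(s)")
```

The point of the sketch is simply that every level other than RAID-0 gives
up some raw capacity in exchange for surviving at least one drive failure.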
As mentioned in the previous column, you can use either dedicated
hardware RAID controllers or software RAID (included in most Linux
distributions). From a redundancy standpoint either solution will
be fine. You can also use combinations of them. For example, you
could use two hardware RAID controllers each with a RAID-1 set, and
then use software RAID to stripe across the two sets (RAID-0).
Using LVM with PVFS
It's a fundamental law of nature and physics that users always
want more space. If you prepare beforehand you can deal with the
never-ending whining, oops, I mean "requests," of users for more
space. Fortunately, given the design of PVFS, it is fairly easy
to accommodate the request for more space.
Recall that PVFS is a virtual file system that is built on an
existing file system such as ext2, ext3, jfs, xfs, reiserfs,
or reiser4. The easiest way to add space to PVFS is to
extend whichever file system you have chosen to put on
your hard drives.
LVM (Logical Volume Manager) allows storage space to be adjusted
as needs change. There are a number of things you can do with
LVM to help you with PVFS. By creating the underlying
file system on an LVM logical volume, you can add more space to the
file system as needed: add a physical storage device (i.e., a
hard drive), extend the volume group, extend the logical
volume, and then resize the file system to use the new space.
For a performance boost, you can configure LVM to use striping.
Combining this with RAID-0 should be done carefully to get the
best performance and retain the flexibility of LVM.
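The sequence of adding a drive, extending the volume group, extending the
logical volume, and resizing the file system can be sketched as a list of
commands. The Python below only builds and prints the commands; it does not
run them. The device, volume group, logical volume, and mount point names
are hypothetical, and only ext2/ext3 and xfs are covered. Whether the file
system can be grown while mounted depends on the file system and kernel, so
treat this as an outline and check the man pages before trying it on a
production IO server.

```python
def lvm_grow_commands(new_device, volume_group, logical_volume,
                      fs="ext3", mount_point=None):
    """Build (but do not run) the commands that grow an LVM-backed file system."""
    lv_path = f"/dev/{volume_group}/{logical_volume}"
    commands = [
        ["pvcreate", new_device],                  # label the new drive for LVM
        ["vgextend", volume_group, new_device],    # add it to the volume group
        ["lvextend", "-l", "+100%FREE", lv_path],  # give all free extents to the LV
    ]
    if fs in ("ext2", "ext3"):
        # resize2fs grows the file system to fill the enlarged volume.
        commands.append(["resize2fs", lv_path])
    elif fs == "xfs":
        # xfs_growfs works on the mounted file system, not the device.
        if mount_point is None:
            raise ValueError("xfs_growfs needs the mount point")
        commands.append(["xfs_growfs", mount_point])
    else:
        raise ValueError(f"no resize step sketched for {fs}")
    return commands


if __name__ == "__main__":
    # All names here are made up for illustration.
    for cmd in lvm_grow_commands("/dev/sdc", "vg_pvfs", "lv_pvfs"):
        print(" ".join(cmd))
```

Repeating this on every IO server grows the space available to PVFS
without rebuilding the underlying file systems.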
Hardening Metadata Servers
PVFS1 supports one metadata server while PVFS2 supports multiple
metadata servers for performance reasons, but in both cases there
is no failover design. So, in general, the metadata server is a
single point of failure. Even though the metadata server
is very lightly loaded, it is worth considering
options to improve its reliability.
One of the first things you can do is to improve the resilience of
components in the metadata server to failure. For example, you can
have redundant fail-over power supplies in the metadata servers.
Another good idea is to make the storage on the metadata server
more redundant. For example, using RAID-1 or RAID-5 with hot swap
would allow a more resilient metadata storage system.
One could even go a bit further and use a High-Availability (HA)
configuration for the metadata server. In an active-passive
HA configuration, the passive machine takes over in the event
that the active system fails.
HA for the IO/Metadata Servers
One other thing you can do is to make the IO servers into high
availability systems. In this case, you could make one IO server
an "active" machine and another IO server a "passive" machine. The
passive machine won't participate in PVFS, but could take over the
function of the active machine if the original active machine fails
for some reason. This configuration means that you will use twice as
many machines for the same level of performance. You get a much more
redundant PVFS but the configuration is twice as costly and you could
be using those extra machines for improved throughput. However,
because PVFS is so flexible such a configuration is definitely possible
if you decide it is needed.
There is a document in the ~/doc section of the PVFS2 source that
discusses a high-availability experiment the PVFS2 development team performed.
They began with two Dell machines with PowerEdge RAID Controllers in
them. They also shared a Dell PowerVault with seven 160 GB disks in a
RAID-5 configuration. The two nodes were connected with a cross-over
GigE cable. If you are going to do an active-passive configuration then
you create one partition on the shared storage. For an active-active
configuration you create two partitions, one for each node.
In the first experiment the team configured the two nodes as an
active-passive pair. In other words, one of the nodes is considered
the "active" node providing storage and the other node just watches
the active node via a heartbeat cable between the two machines. The
active machine has an IP address known to the rest of the machines
in the cluster. If it fails, then the passive machine will change
its IP address to match the known address. The rest of the machines
won't even know the first one is down except that they may have to
do a retry for some functions (Note: PVFS2 has the ability to
retry operations).
During the experiment with the two nodes operating in failover, they
brought down the first node (the active one) by simply turning off
the heartbeat software. The second node thought the first one was
down so the second one took over the IP address, the file system,
and programs. When the first node is brought back up into production
the operations will migrate back to the first node, if you configured
the heartbeat software to do so.
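The active-passive behavior described above can be modeled in a few lines of
Python. This is a toy simulation, not the heartbeat software the PVFS2 team
used: the Node and ActivePassivePair classes, the node names, and the
service address 10.0.0.10 are all assumptions made for illustration. It
shows the service address moving to the standby when the primary appears
dead, and moving back only if fail-back is enabled.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True        # what the heartbeat link reports
        self.addresses = set()   # service addresses this node currently holds


class ActivePassivePair:
    def __init__(self, primary, standby, service_ip, auto_failback=True):
        self.primary = primary
        self.standby = standby
        self.service_ip = service_ip
        self.auto_failback = auto_failback
        self.owner = primary
        primary.addresses.add(service_ip)

    def _move_to(self, target):
        # The address the clients know about follows the active node.
        self.owner.addresses.discard(self.service_ip)
        target.addresses.add(self.service_ip)
        self.owner = target

    def heartbeat_check(self):
        """Run one monitoring round and return the name of the serving node."""
        if self.owner is self.primary:
            if not self.primary.alive and self.standby.alive:
                self._move_to(self.standby)
        elif self.primary.alive and (self.auto_failback or not self.standby.alive):
            self._move_to(self.primary)
        return self.owner.name


if __name__ == "__main__":
    node1, node2 = Node("node1"), Node("node2")
    pair = ActivePassivePair(node1, node2, "10.0.0.10")
    print(pair.heartbeat_check())   # primary is serving
    node1.alive = False             # e.g. heartbeat stopped on node1
    print(pair.heartbeat_check())   # standby has taken over the address
    node1.alive = True
    print(pair.heartbeat_check())   # service migrates back (auto_failback=True)
```

Clients only ever talk to the service address, which is why they see at
most a retried operation during the switch.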
They also did an experiment with an active-active configuration. In
this configuration you have two nodes both serving as PVFS2 storage.
Each one has its own storage area on the shared storage device. The
idea behind an active-active configuration is that both nodes are
serving storage space, but if one dies the other will serve out all
of the other's storage space. You have to configure the servers
carefully, but it's not difficult thanks to the efforts of the PVFS2
development team.
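The active-active idea can be sketched the same way. In the Python below,
each storage area on the shared device has a preferred node, and a surviving
node picks up the area of a failed one. The function serving_nodes and the
area and node names are hypothetical; a real setup would be driven by the
HA software's resource configuration rather than by a function like this.

```python
def serving_nodes(preferred_owner, alive):
    """Map each storage area to the node that should serve it right now.

    preferred_owner: dict of storage area -> node that normally serves it
    alive: set of node names currently reported up by the heartbeat
    """
    result = {}
    for area, owner in preferred_owner.items():
        if owner in alive:
            result[area] = owner             # normal case: owner serves its area
        else:
            survivors = sorted(alive)        # deterministic choice of a stand-in
            result[area] = survivors[0] if survivors else None
    return result


if __name__ == "__main__":
    owners = {"area1": "node1", "area2": "node2"}
    print(serving_nodes(owners, {"node1", "node2"}))  # each node serves its own area
    print(serving_nodes(owners, {"node2"}))           # node2 serves both areas
```

While both nodes are up, both contribute to PVFS2 throughput; after a
failure, the survivor carries the full load.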
Both configurations, active-passive and active-active, were shown to
work by the PVFS2 team. If you want to implement high availability
on your metadata server(s) and/or your IO servers, you can pick
either configuration. The active-active configuration is appealing
because you can use all of the servers while they are up and
functioning. However, setting up the active-active configuration
is a bit more difficult.
Multiple PVFS Partitions
Another benefit to PVFS is that it is easy to configure and group the
IO servers in any fashion you wish to meet your requirements. One of
the requirements might be to keep PVFS functioning as much as possible.
We've already discussed some things that you can do to help with this,
but one option that people often overlook is to take all of the
IO servers and group them into distinct PVFS systems.
For example, if you have 18 IO servers, you could break them into
two groups of 9 IO servers, or three groups of 6 IO servers, etc.
You can then mount each group on its own set of clients. Each set of
clients would mount its PVFS group as /mnt/pvfs, but the file system
behind that mount point will come from a different set of IO servers
for each client grouping. The applications could then use /mnt/pvfs
regardless of the client they are on.
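Splitting the IO servers into equal groups is simple bookkeeping, sketched
here in Python. The function split_into_groups and the host names io01
through io18 are made up for illustration; each resulting group would then
be configured as its own PVFS file system.

```python
def split_into_groups(servers, group_count):
    """Split a list of IO servers into group_count equally sized groups."""
    if group_count < 1 or len(servers) % group_count:
        raise ValueError("servers must divide evenly into the groups")
    size = len(servers) // group_count
    return [servers[i * size:(i + 1) * size] for i in range(group_count)]


if __name__ == "__main__":
    # Hypothetical host names for the 18 IO servers in the example.
    io_servers = [f"io{i:02d}" for i in range(1, 19)]
    for number, group in enumerate(split_into_groups(io_servers, 3), start=1):
        print(f"PVFS group {number}: {', '.join(group)}")
```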
You could then run a code on each group of clients, each of which has its
own PVFS subsystem. This configuration gives you the benefit of being
able to take down one of the PVFS groups for maintenance while the
other PVFS groups stay in production. Or if a failure occurs in one
of the groups, the other groups are still functional.
However, this approach may require adjusting the scheduler
configuration so that a particular parallel job only gets the
clients associated with one group.
An alternative configuration is to create several different PVFS
groups and mount them with different mount point names. In the previous
example you could have /mnt/pvfs1, /mnt/pvfs2, and
/mnt/pvfs3. The user codes could use whichever group they wanted.
Or you could adapt the codes to look at each group. If one group
has more space or perhaps is faster, then you could have the
application write to that group.
Moreover, the codes could write a copy of their data to each PVFS group
(a quasi-RAID 1). This method will require more time since the code
is basically writing its data multiple times. However, since PVFS is so
fast, this may not be noticeable. The upside is that you are unlikely
to lose a node in all three groups. Consequently, the data should
always be available in at least one group.
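As a rough sketch of this quasi-RAID 1 idea, the Python below writes the
same data under every PVFS mount point and reads it back from the first
group that still has it. The helper names write_to_all and read_from_any
are hypothetical, and the mount points follow the /mnt/pvfs1, /mnt/pvfs2,
/mnt/pvfs3 example above. The require_mount check is there because an
unmounted mount point is just an empty local directory; the sketch does not
verify that the copies are identical or up to date.

```python
import os
from pathlib import Path


def _usable(mount, require_mount):
    # With require_mount=True, skip directories where nothing is mounted.
    return os.path.ismount(mount) if require_mount else os.path.isdir(mount)


def write_to_all(mount_points, name, data, require_mount=True):
    """Write data under every mount point; return the mounts that succeeded."""
    written = []
    for mount in mount_points:
        if not _usable(mount, require_mount):
            continue
        try:
            Path(mount, name).write_bytes(data)
            written.append(str(mount))
        except OSError:
            continue  # this group is unavailable; try the next one
    if not written:
        raise OSError("no PVFS group accepted the write")
    return written


def read_from_any(mount_points, name, require_mount=True):
    """Return the data from the first mount point that can supply it."""
    for mount in mount_points:
        if not _usable(mount, require_mount):
            continue
        try:
            return Path(mount, name).read_bytes()
        except OSError:
            continue
    raise FileNotFoundError(name)


if __name__ == "__main__":
    groups = ["/mnt/pvfs1", "/mnt/pvfs2", "/mnt/pvfs3"]
    try:
        print(write_to_all(groups, "checkpoint.dat", b"results"))
    except OSError as err:
        print(f"write failed: {err}")
```

The cost is one extra write per group, which is the time penalty mentioned
above.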
Of course if you have a fixed number of IO servers, creating separate
PVFS groups will limit the maximum PVFS performance available to
your cluster. You can always add more IO nodes to your PVFS groups,
however.
For slower networks connecting the clients and the IO servers, breaking
the IO servers into more than one group could improve overall throughput.
With too many IO servers communicating over a relatively slow network,
the network will be saturated and performance will either plateau or
get worse. By separating into multiple groups the traffic is potentially
going to be better balanced since the user applications will be
at different points in their computations. Therefore the overall
throughput should be better. Alternatively, you could put each group
of IO servers on its own network, even if those networks are slower, so
that PVFS does not saturate any single network. Then overall throughput
can be better than using a single network.
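A back-of-the-envelope model shows why separate networks can help. The
Python below assumes that aggregate throughput is capped by whichever is
smaller, the combined IO server bandwidth or the network bandwidth. That is
a deliberate simplification: it models the plateau but not the case where a
saturated network makes things worse. All of the numbers are invented for
illustration.

```python
def aggregate_throughput(io_servers, server_mb_s, network_mb_s):
    """Throughput of one PVFS group on one network, capped by the network."""
    return min(io_servers * server_mb_s, network_mb_s)


def grouped_throughput(group_sizes, server_mb_s, network_mb_s):
    """Total throughput when each group of IO servers has its own network."""
    return sum(aggregate_throughput(size, server_mb_s, network_mb_s)
               for size in group_sizes)


if __name__ == "__main__":
    # Hypothetical: 18 IO servers at 40 MB/s each, one 100 MB/s network.
    print(aggregate_throughput(18, 40, 100))
    # Three groups of 6, each on its own slower 50 MB/s network.
    print(grouped_throughput([6, 6, 6], 40, 50))
```

Under these assumed numbers the single network caps the cluster at 100 MB/s,
while three slower networks together deliver 150 MB/s.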