Storage: it's where we put things
Clusters have become the dominant type of HPC system, but that
doesn't mean they are perfect (sounds like a Dr. Phil show,
doesn't it?). While you get a huge bang for the buck from them,
somehow you have to get the data to and from the processors.
Moreover, some applications have fairly benign IO requirements
and others need really large amounts of IO. Regardless of your
IO requirements you will need some type of file system for your cluster.
I wrote a file system/storage survey article for clusters in the past,
but as always things change rather rapidly in the HPC arena. Originally, I had wanted to update
the original article. However, the updates became so large that this is really
an entirely new article. So this article, I hope, is a bit more
in depth and a bit more helpful than the past file system article.
[Editor's note: This article is the first in a series on cluster file systems (see Part Two: NAS, AoE, iSCSI, and more! and
Part Three: Object Based Storage). As the technology continues to change, Jeff has done his best to "snapshot" the state of the art. If products are missing or specifications are out of
date, then please contact Jeff or myself. Personally, I want to stress the magnitude and difficulty of this undertaking and thank Jeff for the amount of work he put into this series.]
Introduction
Storage has become a very important part of clusters and is likely to
become even more important as problem sizes grow. For example, the size
of CFD meshes grows by almost a factor of 2 every
year. As the problem size grows, the file size grows as well.
Very soon the CFD solutions will become large enough that transferring
them to and from a user's desktop will consume too much time, and
post-processing them on a desktop may fail because they do not fit in memory. The best idea
then is to visualize, manipulate, and post-process the results where they
were created - on the cluster. CFD is but one example of the ever
increasing need for more storage and higher speed storage.
In a 2007 presentation at the ISC (International Supercomputing Conference)
in Dresden, IDC pointed out that storage and data management are
growing in importance. In a survey that IDC conducted, 92% of
the people who responded said they have applications that are
constrained by IO. Also, 48% of the people who responded said that they
were constrained by total file size. This number grew to 60% in
three years. So based on the IDC study, it looks like high speed data
storage, particularly parallel file systems, for clusters will become
increasingly important.
This article will only be discussing distributed and
parallel file systems for clusters. I won't be discussing typical Linux
file systems such as XFS, ext4, ext3, ext2, Reiserfs, Reiser4, JFS, BTRFS, etc.
These file systems are what I think of as local file systems. That is, the
physical storage for these systems can be Direct Attached Storage (DAS) or
part of a Storage Area Network (SAN). I also won't be discussing file system
tools such as LVM, EVMS, md, etc.
A discussion of these tools is better left to an article on local
file systems.
Moreover, where possible this article will not focus on the hardware part of cluster storage,
but rather, it will focus on file systems or
parallel storage solutions. The hardware will only be discussed when
it is part of an overall solution and can't be separated from the
file system. I may disappoint people by not discussing which hard drive
manufacturer is better, which type of hard drive is better (hint: SATA
drives have the same "failure rate" as Fibre Channel (FC) and SCSI
drives), or which RAID card is better, or which RAID level is best for
certain workloads, etc. as these questions aren't really the focus
of this article.
Due to time constraints, I can only touch on a
few file systems. Please don't be alarmed if your favorite is not in
this article. It's not intended to dissuade anyone from considering
or using that file system. Rather, it was just a choice I made to cut
the size of this article and to finish it in a reasonable amount of time.
I have tried to cover the popular systems, but alas popularity depends
on application area as well. Finally, the discussion of a file system is
not to be considered an endorsement by me or ClusterMonkey. If you think
your file system deserves more attention, by all means, contact me (my
contact link is at the end of the article) and tell me how you use it and why
it works for you. And, ClusterMonkey is always looking for cutting edge writers.
In an effort to scratch the proverbial IO itch, this series of articles is designed
as a high level survey that just touches upon file systems (and file
system issues) used by clusters. My goal is to at least give you some
ideas of available file system options and some links that allow you
to investigate. Anything else starts to look like a dissertation and
I already did one of those. Before I start, however, I want to discuss
some of the enabling technologies for high performance parallel file
systems.
Enabling Technologies for Cluster File Systems
I think there are a number of enabling technologies that help make
parallel file systems a much more prevalent technology in today's
clusters. These technologies are:
- High-speed networks with a large bandwidth, such as InfiniBand
- Multi-core processors
- MPI-2 (MPI-IO)
- NFS
You may scoff at some of these, but let me try to explain why I think they
are enabling technologies. If you disagree or if you can think of other
technologies please let me know.
As I discussed in another article, InfiniBand (IB) has very good performance and the price has been steadily
dropping. DDR (Double Data Rate) InfiniBand is now pretty much the standard
for all IB systems, replacing Single Data Rate (SDR). DDR IB has a theoretical
bandwidth of 20 Gbps (gigabits per second), a latency of less than 3
microseconds (depending upon how you measure it), an N/2 of about 110 bytes,
and a very high message rate. The price has dropped to about
$1,000-$1,400 per port (average costs including switching and cables).
In addition, Myrinet 10GigE is
also a contender in this space, as is 10GigE (if the price ever comes
down). At the same time, most applications don't need all of that bandwidth.
Even with the added communication needs of multi-core processors, many
applications will only use a fraction of the bandwidth available from
IB or 10GigE. To make the best use of all that interconnect
capability, vendors and customers are using the leftover bandwidth to feed
the data appetite of cluster nodes. So they are running both the computational
traffic and the IO traffic over the same network.
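As a rough, back-of-the-envelope illustration of the leftover-bandwidth argument, the sketch below assumes a 4x DDR InfiniBand link: 20 Gbps is the signaling rate, and the 8b/10b encoding used by DDR IB leaves 16 Gbps for data. The application that uses 25% of the link for MPI traffic is purely hypothetical, as is the function name; neither is a measurement.

```python
def leftover_bandwidth_gbps(signal_rate_gbps, mpi_fraction, encoding_efficiency=0.8):
    """Return (data_rate, leftover) in Gbps for one link.

    signal_rate_gbps    -- raw signaling rate of the link (20 for 4x DDR IB)
    mpi_fraction        -- assumed share of the data rate used by MPI traffic
    encoding_efficiency -- 8b/10b encoding carries 8 data bits per 10 line bits
    """
    if not 0.0 <= mpi_fraction <= 1.0:
        raise ValueError("mpi_fraction must be between 0 and 1")
    data_rate = signal_rate_gbps * encoding_efficiency
    leftover = data_rate * (1.0 - mpi_fraction)
    return data_rate, leftover


# Hypothetical application that uses a quarter of the link for MPI traffic.
data_rate, leftover = leftover_bandwidth_gbps(20.0, 0.25)
print(f"data rate: {data_rate:.1f} Gbps, left over for IO: {leftover:.1f} Gbps "
      f"(about {leftover / 8:.2f} GB/s)")
```

Under these assumptions roughly 12 Gbps (about 1.5 GB/s) per node would remain for storage traffic. Real numbers depend heavily on the application's communication pattern, so treat this only as a way to frame the estimate.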
In my opinion, another enabling technology is the multi-core processor.
People may disagree with me on this issue, but let me explain why I think multi-core
processors can be, in a sense, an enabling technology. The plethora of cores could be very useful in terms of IO.
There may be opportunities to use one or more cores per socket to do
nothing but IO. In the simplest form of this concept you dedicate one core per node to IO processing.
This core is programmable (it's a CPU) and very fast compared to
other processors, such as those on RAID controllers or in other
parts of the storage system, so it gives you a great deal of flexibility.
Dedicating one core solely to IO processing also frees
the other cores on the system for computing.
Imagine a dual-socket, quad-core node that has a total of 8 cores. Just a
few years ago you had at most 2 cores per node. That means only 2 cores
had to share a Network Interface Card (NIC) or share access to local storage. Now you have 8 cores vying for the
same resources. When each core performs IO you will have 8 cores
trying to push data through the same NIC. But what if you could write
an application where 7 of the cores did computational processing and one of
the cores did all of the IO for the node? When one of the 7 cores needed
to perform IO, particularly writes, it would just pin the
memory holding the data and pass the memory address to the eighth core. Then
this eighth core performs the IO while the other seven cores continue
processing. When finished with the IO, the eighth core could just release
the memory back to the other cores. The HPC community has always talked about
overlapping communication and computation. Now you can begin to talk about
overlapping computation and IO.
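The idea of handing writes to a dedicated IO core can be sketched in a few lines. This is only an illustration under stated assumptions: a Python thread stands in for the dedicated core (Python does not pin threads to cores), a bounded queue stands in for passing the address of pinned memory, and the names io_worker, compute_step, and run are hypothetical.

```python
import os
import queue
import tempfile
import threading


def io_worker(path, work_queue):
    """Drain buffers from the queue and write them to one file.

    Plays the role of the "eighth core": it does nothing but IO.
    """
    with open(path, "wb") as out:
        while True:
            buf = work_queue.get()
            if buf is None:          # sentinel: no more data is coming
                break
            out.write(buf)           # the compute loop is not waiting on this


def compute_step(step, size=1024):
    """Stand-in for real computation; returns `size` bytes of results."""
    return bytes([step % 256]) * size


def run(path, steps=8):
    """Overlap computation with IO by handing buffers to the IO worker."""
    work_queue = queue.Queue(maxsize=4)   # bounded, so memory use stays capped
    writer = threading.Thread(target=io_worker, args=(path, work_queue))
    writer.start()
    for step in range(steps):
        result = compute_step(step)
        work_queue.put(result)            # hand off the buffer, keep computing
    work_queue.put(None)                  # tell the worker to finish
    writer.join()
    return os.path.getsize(path)


with tempfile.TemporaryDirectory() as tmp:
    written = run(os.path.join(tmp, "results.bin"))
    print(f"IO worker wrote {written} bytes")
```

The bounded queue is a deliberate choice: if the IO worker falls behind, the compute loop eventually blocks on put() instead of filling memory with unwritten results, which mirrors the limited amount of memory a real node could keep pinned.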
However, while I think there is something to the idea of overlapping
computation and IO, I think the most important reason multi-core processors
are an enabling technology is because of the IO demand they can create.
As I previously mentioned, just a few years ago we had at most two cores
per node. Now we can easily have eight and soon we will have sixteen
(AMD's quad-core CPU with 4 sockets per board or dual-socket Intel
Nehalem CPUs with 8 cores per socket in 2008). With this many processors
you may have to use a faster network so communication does not become a
huge bottleneck (e.g. InfiniBand). So now you've got IB to solve your
computational communication problem with lots of cores. As I
previously mentioned, you can take advantage of the large amount of
bandwidth that is leftover for IO. So multi-core processors could easily
drive IO demand in this fashion. It's something of a backwards argument
since the multi-core chips are not really driving IO requirements. But
since they are driving high speed interconnects on each node, they
create the opportunity to put a high-speed storage system on the same
network.
In addition, with multi-core processors becoming so prevalent (soon you
won't even be able to find new single core processors) you will see
people start to run jobs with more and more cores. Just a couple of years
ago you would have two cores per node and run on perhaps 8-16 nodes, for
a total of 16-32 cores. Now with 8-16 nodes you can get 64-128 cores
for about the same hardware price. People will want to
run their jobs across more processors and they are going to want
to run larger problems. Larger problems and problems across more
cores can lead to more IO being required. So again, multi-core processors
could help drive up IO requirements.
I think another important enabling technology is MPI-2, which now includes
something called MPI-IO. This is an addition to the MPI standard
that covers IO functions. Prior to MPI-2, writing IO in a high-speed and
portable manner was very ad hoc. You could use normal POSIX functions,
but this was difficult to do effectively if you had complicated data
structures or unstructured data. In addition, people sometimes used
proprietary toolkits to write data on a parallel file system, thus
making portability an issue. With MPI-2, you now have a set of functions
that are portable (as long as the MPI implementation is compliant
with MPI-2 or at the very least MPI-IO), high performance, and easy to
use for complicated data structures or unstructured data. Being part
of the MPI standard does not hurt things since most parallel applications
use MPI. Now you have a portable way to do IO for MPI codes.
The MPI-2 committee was also very kind to the storage vendors. They
created a method where you could pass "hints" to the underlying file
system or MPI layer to take advantage of any special features of the
storage. If you build your application for one storage platform
and use hints, you will have to change it for another storage
platform. This change is usually just a few lines of code (less than
10 in general). So portability is not adversely affected and
you get to take advantage of the storage's special features.
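To make the hint mechanism concrete, here is a minimal sketch using the mpi4py bindings, which is my choice for illustration and assumes mpi4py is installed. The hint keys (striping_factor, striping_unit, cb_buffer_size) are reserved names from the MPI-2 standard, but the platform names and the values are hypothetical; real values would come from your storage vendor's documentation.

```python
# Hint keys are reserved names from the MPI-2 standard. The platform names
# and the values are hypothetical placeholders, not vendor recommendations.
PLATFORM_HINTS = {
    "platform_a": {"striping_factor": "8", "striping_unit": "1048576"},
    "platform_b": {"cb_buffer_size": "16777216"},
}


def hints_for(platform):
    """Return a copy of the hint table for a platform (empty if unknown)."""
    return dict(PLATFORM_HINTS.get(platform, {}))


def write_block(filename, data, platform):
    """Have every rank write `data` at its own offset in one shared file.

    Assumes mpi4py is installed and that every rank passes a buffer of
    the same length, so rank r owns bytes [r*n, (r+1)*n) of the file.
    """
    from mpi4py import MPI  # imported here so the hint table works without MPI

    comm = MPI.COMM_WORLD
    info = MPI.Info.Create()
    for key, value in hints_for(platform).items():
        info.Set(key, value)              # an implementation may ignore hints

    amode = MPI.MODE_WRONLY | MPI.MODE_CREATE
    fh = MPI.File.Open(comm, filename, amode, info)
    offset = comm.Get_rank() * len(data)  # byte offset in the default file view
    fh.Write_at_all(offset, data)         # collective write across all ranks
    fh.Close()
    info.Free()
```

A script that ends with a call such as write_block("out.dat", bytearray(1024), "platform_a") could be launched with something like mpiexec -n 4 python script.py. In this sketch, moving to a different storage platform only means editing the few lines of the hint table, which is the kind of small change described above.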
Finally, one enabling technology that I think people forget about is the file
system itself. Clusters need a file system that at the very least can be used
by all nodes in the cluster. This capability allows parallel applications to access
the same data set as all of the other nodes. But, just as important as
having a common file system for all of the nodes is that the file system
should be standard and if at all possible, part of the distribution
itself.
If a file system is a standard and part of the OS, then it becomes possible
for the hardware of various vendors to work together. This combination includes the
hardware of the node and the network. So if I take hardware
from vendor X and hardware from vendor Y and plug them into a network,
then they should be able to communicate. This is pretty much a given in
today's world. But, equally important is the ability for the hardware from
vendor X and Y to share data (files). This situation means you can shop for the best
price, or the best set of features, or the best support, or any other
criterion, and the hardware/OS combination should be able to share data.
The only true distributed file system standard today is NFS. NFS was and is one of the enabling
technologies for clusters because, as a standard, it allows different hardware/OS combinations
to share data. In fact, a huge number
of hardware and software vendors get together at Connectathon every year to test
their interoperability, particularly for NFS. Without NFS we probably would
have a difficult time with IO for clusters since we would have multiple
implementations that would be vendor dependent.
You don't have to use NFS for data sharing in
a cluster, however. As long as you use hardware and software in a cluster that all of
the nodes have in common, you can use a proprietary file system. This situation does
restrict you in some cases to a single vendor. But it also means that you
can take advantage of a high performance file system in your cluster. What
many of the vendors of proprietary cluster file systems have done is to
also provide NFS gateways so you can at least get to the file system using
a standard file system such as NFS.
However, if you read further in this article, you will see that there is about
to be a new file system standard that is specifically designed for parallel
systems (i.e. clusters).