Written by Jeff Layton
Published: 29 October 2008
Quick Summary
Because this article is so long, I wanted to include a quick summary of the features, limitations, and vendors for each file system.
Distributed File Systems
For each file system, the summary lists the networking options, features, limitations, and example vendors.

NFS/NAS
Networking:
- TCP, UDP
- NFS/RDMA (InfiniBand) very soon
Features:
- Easy to configure and manage
- Well understood (easy to debug)
- Client comes with every version of Linux
- Can be cost effective
- Provides enough IO for many applications
- May be enough capacity for your needs
Limitations:
- Single connection to the network
- GigE throughput is about 100 MB/s max. (a quick back-of-the-envelope check follows this entry)
- Limited aggregate performance
- Limited capacity scalability
- May not provide enough capacity
- Potential load imbalance (if using multiple NAS devices)
- "Islands" of storage are created if you use multiple NAS devices
Clustered NAS
Networking:
- Currently TCP only (almost entirely GigE)
Features:
- Usually a more scalable file system than other NAS models
- Only one file server is used for the data flow (forwarding model could potentially use all of the file servers)
- Uses NFS as protocol between client and file server (gateway)
- Many applications don't need large amounts of IO for good performance (can use low gateway/client ratio)
Limitations:
- Can have scalability problems (block allocation and write traffic)
- Load balancing problems
- Need a high gateway/client ratio for good performance
AFS
Networking:
- Currently TCP only (primarily GigE)
Features:
- Caching (clients cache data, and servers can go down without loss of access to data)
- Security (Kerberos and ACLs)
- Scalability (additional servers just increase the size of the file system)
Limitations:
- Limited single-client performance (only as fast as data access inside an individual node)
- Not in widespread use
- Uses UDP
Example Vendors:
- Open-source
iSCSI
Networking:
- Currently TCP only (primarily GigE)
Features:
- Allows for extremely flexible configurations
- Software (target and initiator) comes with Linux
- Centralized storage (easier administration and maintenance)
- You don't have to use just SCSI drives
Limitations:
- Performance is not always as fast as it could be
- Requires careful planning (not a limitation, but just a requirement)
- Centralized storage (if the centralized storage goes down, all clients go down)
Example Vendors:
- Open-source
HyperSCSI
Networking:
- Currently Ethernet only (it uses its own packet type, not TCP/IP)
Features:
- Performance can be faster than iSCSI (since it uses its own packet definition, it can be more efficient than TCP)
- Allows for very flexible configurations
Limitations:
- Hasn't been updated in a while
- Cannot route packets since they aren't UDP or TCP
Example Vendors:
- Open-source
AoE
Networking:
- Currently Ethernet only (it uses its own packet type, not TCP/IP)
Features:
- Performance can be faster than iSCSI (since it uses its own packet definition, it can be more efficient than TCP)
- Drivers are part of the Linux kernel
Limitations:
- Uses the ATA protocol (really a requirement and not a limitation)
- Cannot route packets since they aren't UDP or TCP
Example Vendors:
- Open-source
dCache
Features:
- Can use storage space on all available machines (even clients)
- Tertiary Storage Manager (HSM)
Limitations:
- Performance (it's only as fast as the local storage)
- Limited use (primarily high-energy physics labs)
Example Vendors:
- Open-source
Parallel File Systems
GPFS
Networking:
- Currently TCP only
- Native IB soon (4.x)
Features:
- Probably the most mature of all parallel file systems
- Variable block sizes (up to 2MB)
- 32 sub-blocks per block (can help with small files; a quick calculation follows this entry)
- Multi-cluster
- IO pattern recognition
- Can be configured with fail-over
- NFS and CIFS gateways
- Open portability layer (makes kernel updates easier)
- File system only solution (you can select whatever hardware you want)
Limitations:
- Pricing is by the server and client (i.e., you have to pay for every client and server)
- Block-based (has to use a sophisticated lock manager)
- Can't change the block size after deployment
- Current access is via TCP only (but IB is coming in version 4.x)
- File system only solution (it allows people to select unreliable hardware)
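The sub-block feature is easier to appreciate with a quick calculation. The sketch below assumes, as the "32 sub-blocks per block" item suggests, that each block is divided into 32 sub-blocks and that a sub-block is the smallest unit of space a small file can occupy; the block sizes chosen are just examples.

```python
# Smallest allocation unit (sub-block) for a few example block sizes,
# assuming each block is divided into 32 sub-blocks.
SUB_BLOCKS_PER_BLOCK = 32

for block_size_kb in (256, 1024, 2048):      # 256 KB, 1 MB, 2 MB blocks
    sub_block_kb = block_size_kb / SUB_BLOCKS_PER_BLOCK
    print(f"{block_size_kb:>5} KB block -> {sub_block_kb:.0f} KB minimum allocation")

# Even with a 2 MB block, a small file consumes at least 64 KB rather than
# a whole 2 MB block, which is why sub-blocks help with small files.
```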
Rapidscale
Networking:
- Currently TCP only (primarily GigE)
Features:
- Uses standard Linux tools (md, lvm)
- Distributed metadata
- Good load balancing
- NFS and CIFS gateways
- High availability
Limitations:
- More difficult to expand capacity while load balancing
- Dependent on RAID groups of disks for resiliency and reconstruction
- Modified file system, modified iSCSI
- Current network protocol is TCP (limits performance)
- Must use Rackable hardware
Vendors:
- Rackable
IBRIX
Networking:
- Currently TCP only (primarily GigE)
Features:
- Can split files and directories across several servers
- Can split a directory across segment servers (good for directories that have lots of IO and lots of files)
- Segment ownership can be migrated from one server to another
- Segments can be taken off-line for maintenance without bringing the entire file system down
- Can configure HA for segment fail-over
- Snapshot tool
- File replication tool
- File system only solution (you can select whatever hardware you want)
- Distributed metadata
- NFS and CIFS gateways
Limitations:
- Administration load can be higher than for other file systems (some of this is due to the flexibility of the product)
- Dependent on RAID groups of disks for resiliency and reconstruction
- Native access is currently TCP only (limits performance)
- File system only solution (it allows people to select unreliable hardware)
- Rumors of having to pay for each client as well as segment servers (data servers)
Vendors:
- IBRIX
GlusterFS
Features:
- Open-source
- Excellent performance
- Can use almost any hardware
- Plug-ins (translators) provide a huge amount of flexibility and tuning capability
- Very fast performance
- File system only solution (you can select whatever hardware you want)
- No metadata server
- Automated File Recovery (AFR) and auto-healing if a data server is lost
- NFS and CIFS gateways
Limitations:
- Relatively new
- Dependent on RAID groups of disks for resiliency and reconstruction
- File system only solution (it allows people to select unreliable hardware)
- Extremely flexible (it takes some time to configure the file system the way you want it)
Vendors:
- Open-source
EMC Highroad (MPFSi)
Networking:
- Uses iSCSI as data protocol
Features:
- NFS and CIFS gateways
- Uses EMC storage so backups may be easier
Limitations:
- Only EMC hardware can be used
- Dependent on RAID groups of disks for resiliency and reconstruction
- Single metadata server
- FC protocol requires an FC HBA in each node and FC network ($$)
- Most popular deployments use TCP (limits performance)
Vendors:
- EMC
SGI CXFS
Networking:
- TCP (metadata) and FC (data)
Features:
- Multiple metadata servers (although only 1 is active)
- Lots of redundancy in the design (recovery from data server failure)
- Guaranteed IO rate
- NFS and CIFS gateways?
Limitations:
- Doesn't scale well on clusters with many nodes
- FC protocol requires an FC HBA in each node and FC network ($$)
- Only one active metadata server
- Dependent on RAID groups of disks for resiliency and reconstruction
- Restricted to SGI-only hardware
Vendors:
- SGI
Red Hat GFS
Networking:
- Fibre Channel (FC)
- TCP (iSCSI)
Features:
- Open-source
- Global locking
- Can use almost any hardware for storage
- Quotas
- NFS and CIFS gateways
Limitations:
- Limited expandability (but the limit is large)
- Dependent on RAID groups of disks for resiliency and reconstruction
Vendors:
- Open-source
Object Based File Systems/Storage
Panasas
Networking:
- Currently TCP only (primarily GigE)
Features:
- Object-based file system
- Easy to set up, manage, and expand
- Performance scales with shelves
- Distributed metadata
- Metadata fail-over
- Fast reconstruction in the event of a disk failure (a toy parity-reconstruction sketch follows this entry)
- Disk sector scrubbers (look for bad sectors)
- Can restore a sector if it is marked bad
- Network parity
- Blade drain
- NFS and CIFS gateways (scalable NAS)
Limitations:
- Coupled hardware/software solution (more like an appliance)
- Have to use Panasas hardware
- Limited small-file performance
- Kernel modules for kernel upgrades come from Panasas
- Single-client performance is limited by the network (TCP)
- Coupled hardware/software solution (limits hardware choice)
Vendors:
- Panasas
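Panasas's network parity and the "dependent on RAID groups of disks for resiliency and reconstruction" limitation that appears throughout these tables rest on the same idea: parity lets you rebuild any single lost piece of a stripe from the surviving pieces. The snippet below is a toy single-parity (XOR) sketch to illustrate that idea; it is not Panasas's actual reconstruction code, and real systems layer much more on top of this.

```python
# Toy illustration of single-parity (XOR) protection: any one lost
# stripe unit can be rebuilt from the survivors plus the parity unit.
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]        # stripe units on three "disks"
parity = xor_blocks(data)                  # stored on a fourth "disk"

lost = 1                                   # pretend disk 1 fails
survivors = [d for i, d in enumerate(data) if i != lost]
rebuilt = xor_blocks(survivors + [parity]) # XOR the survivors with parity

assert rebuilt == data[lost]
print("rebuilt:", rebuilt)                 # b'BBBB'
```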
Lustre
Networking:
- TCP
- Quadrics Elan
- Myrinet GM and MX
- InfiniBand (Mellanox, Voltaire, Infinicon, OFED)
- RapidArray (Cray XD1)
- Scali SDP
- LNET (Lustre Networking)
Features:
- Open-source
- Object-based file system
- Can use a wide range of networking protocols
- Can use native IB protocols for much higher performance
- Excellent performance with a high-speed network
- NFS and CIFS gateways (scalable NAS)
- File system only solution (you can select whatever hardware you want)
Limitations:
- Single metadata server
- Dependent on RAID groups of disks for resiliency and reconstruction
- File system only solution (it allows people to select unreliable hardware)
Vendors:
- Lustre
PVFS
Networking:
- TCP
- Myrinet (GM and MX)
- Native IB protocols
- Quadrics Elan
Features:
- Object-based file system
- Easy to set up
- Distributed metadata
- Open-source
- High-speed performance
- Can use multiple networks
- File system only solution (you can select whatever hardware you want)
Limitations:
- Lacks some of the resiliency of other file systems (but wasn't designed for that same functionality)
- File system only solution (it allows people to select unreliable hardware)
Vendors:
- PVFS
I want to thank Marc Unangst, Brent Welch, and Garth Gibson at
Panasas for their help in understanding the complex world of cluster
file systems. While I haven't come close to achieving the
understanding that they have, I'm much better off than when I started.
This article, an attempt to summarize the world of cluster file
systems, is the result of many discussions in which they answered
many, many questions from me. I want to thank them for their help and
their patience.
I also hope this series of articles, despite its length, has given you some good
general information about file systems and even storage hardware. And
to borrow some parting comments, "Be well, Do Good Work, and Stay in
Touch."
A much shorter version of this article was originally published in
ClusterWorld Magazine. It has been greatly updated and formatted for the
web. If you want to read more about HPC
clusters and Linux, you may wish to visit
Linux Magazine.
Dr. Jeff Layton hopes to someday have a 20 TB file system in his home
computer. He lives in the Atlanta area
and can sometimes be found lounging at the nearby Fry's, dreaming of
hardware and drinking coffee (but never during working hours).
© Copyright 2008, Jeffrey B. Layton. All rights reserved.
This article is copyrighted by Jeffrey B. Layton. Permission to use any
part of the article or the entire article must be obtained in writing
from Jeffrey B. Layton.