Quick Summary

Because the article is so long, I wanted to include a quick summary of the features of each file system.



Distributed File Systems
Each entry below lists the networking options, features, limitations, and example vendors.
NFS/NAS
  Networking:
    • TCP, UDP
    • NFS/RDMA (InfiniBand) coming very soon
  Features:
    • Easy to configure and manage
    • Well understood (easy to debug)
    • Client comes with every version of Linux
    • Can be cost effective
    • Provides enough IO for many applications
    • May provide enough capacity for your needs
  Limitations:
    • Single connection to the network
    • GigE throughput tops out at roughly 100 MB/s (see the quick calculation below)
    • Limited aggregate performance
    • Limited capacity scalability
    • May not provide enough capacity
    • Potential load imbalance if multiple NAS devices are used
    • "Islands" of storage are created if you use multiple NAS devices
Clustered NAS
  Networking:
    • Currently TCP only (almost entirely GigE)
  Features:
    • Usually a more scalable file system than other NAS models
    • Only one file server is used for the data flow (the forwarding model could potentially use all of the file servers)
    • Uses NFS as the protocol between client and file server (gateway)
    • Many applications don't need large amounts of IO for good performance (so a low gateway/client ratio can be used)
  Limitations:
    • Can have scalability problems (block allocation and write traffic)
    • Load balancing problems
    • Needs a high gateway/client ratio for good performance (see the sizing sketch below)
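
As a rough illustration of the gateway/client sizing mentioned above, here is a minimal sketch; the client count, per-client demand, and per-gateway bandwidth are assumed values chosen only for illustration:

    import math

    # Assumed values, chosen only to illustrate the gateway/client ratio.
    clients = 128
    demand_per_client_mb_s = 20     # sustained IO demand per client (assumed)
    gateway_bandwidth_mb_s = 100    # roughly one GigE link per NFS gateway

    aggregate_demand = clients * demand_per_client_mb_s
    gateways = math.ceil(aggregate_demand / gateway_bandwidth_mb_s)

    print(f"Aggregate demand: {aggregate_demand} MB/s")
    print(f"Gateways needed:  {gateways} (about 1 per {clients // gateways} clients)")
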
AFS
  Networking:
    • UDP (primarily over GigE)
  Features:
    • Caching (clients cache data, and servers can go down without loss of access to data)
    • Security (Kerberos and ACLs)
    • Scalability (additional servers simply increase the size of the file system)
  Limitations:
    • Limited single-client performance (only as fast as data access inside an individual node)
    • Not in widespread use
    • Uses UDP
  Vendors: Open-source (link)
iSCSI
  Networking:
    • Currently TCP only (primarily GigE)
  Features:
    • Allows for extremely flexible configurations
    • Software (target and initiator) comes with Linux
    • Centralized storage (easier administration and maintenance)
    • You don't have to use only SCSI drives
  Limitations:
    • Performance is not always as fast as it could be
    • Requires careful planning (not a limitation so much as a requirement)
    • Centralized storage (if the centralized storage goes down, all clients go down)
  Vendors: Open-source (link)
HyperSCSI
  Networking:
    • Ethernet only (it uses its own packet format rather than TCP or UDP)
  Features:
    • Performance can be faster than iSCSI (since it uses its own packet definition, it can be more efficient than TCP)
    • Allows for very flexible configurations
  Limitations:
    • Hasn't been updated in a while
    • Packets cannot be routed since they aren't UDP or TCP
  Vendors: Open-source (link)
AoE
  Networking:
    • Ethernet only (it uses its own packet format rather than TCP or UDP)
  Features:
    • Performance can be faster than iSCSI (since it uses its own packet definition, it can be more efficient than TCP)
    • Drivers are part of the Linux kernel
  Limitations:
    • Uses the ATA protocol (really a requirement rather than a limitation)
    • Packets cannot be routed since they aren't UDP or TCP
  Vendors: Open-source (link)
dCache
  Networking:
    • Currently TCP
  Features:
    • Can use storage space on all available machines (even clients)
    • Tertiary Storage Manager (HSM) support
  Limitations:
    • Performance (it's only as fast as the local storage)
    • Limited use (primarily high-energy physics labs)
  Vendors: Open-source (link)


Parallel File Systems
Each entry below lists the networking options, features, limitations, and vendors.
GPFS
  Networking:
    • Currently TCP only
    • Native InfiniBand soon (4.x)
  Features:
    • Probably the most mature of all parallel file systems
    • Variable block sizes (up to 2 MB)
    • 32 sub-blocks per block (can help with small files; see the quick calculation below)
    • Multi-cluster support
    • IO pattern recognition
    • Can be configured with fail-over
    • NFS and CIFS gateways
    • Open portability layer (makes kernel updates easier)
    • File-system-only solution (you can select whatever hardware you want)
  Limitations:
    • Pricing is by the server and client (i.e., you have to pay for every client and server)
    • Block-based (has to use a sophisticated lock manager)
    • Can't change the block size after deployment
    • Current access is via TCP only (but InfiniBand is coming in version 4.x)
    • File-system-only solution (it allows people to select unreliable hardware)
  Vendors: IBM
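
To show why the 32 sub-blocks can help with small files, here is a quick calculation; the 1 MB block size is just an example value (GPFS block sizes vary, up to 2 MB as noted above):

    import math

    # Illustrative only: how sub-block allocation reduces waste for small files.
    block_size = 1 * 1024 * 1024                 # example block size: 1 MB
    sub_block_size = block_size // 32            # 32 sub-blocks -> 32 KB each

    small_file = 10 * 1024                       # a 10 KB file

    waste_whole_block = block_size - small_file  # if a full block were allocated
    allocated = math.ceil(small_file / sub_block_size) * sub_block_size
    waste_sub_block = allocated - small_file     # with sub-block allocation

    print(f"Sub-block size: {sub_block_size // 1024} KB")
    print(f"Waste for a 10 KB file: {waste_whole_block // 1024} KB vs "
          f"{waste_sub_block // 1024} KB")
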
Rapidscale
  Networking:
    • Currently TCP only (primarily GigE)
  Features:
    • Uses standard Linux tools (md, lvm)
    • Distributed metadata
    • Good load balancing
    • NFS and CIFS gateways
    • High availability
  Limitations:
    • More difficult to expand capacity while load balancing
    • Dependent on RAID groups of disks for resiliency and reconstruction
    • Modified file system and modified iSCSI
    • Current network protocol is TCP (limits performance)
    • Must use Rackable hardware
  Vendors: Rackable
IBRIX
  Networking:
    • Currently TCP only (primarily GigE)
  Features:
    • Can split files and directories across several servers (see the striping sketch below)
    • Can split a directory across segment servers (good for directories that have lots of IO and lots of files)
    • Segment ownership can be migrated from one server to another
    • Segments can be taken off-line for maintenance without bringing the entire file system down
    • Can configure HA for segment fail-over
    • Snapshot tool
    • File replication tool
    • File-system-only solution (you can select whatever hardware you want)
    • Distributed metadata
    • NFS and CIFS gateways
  Limitations:
    • Administration load can be higher than with other file systems (some of this is due to the flexibility of the product)
    • Dependent on RAID groups of disks for resiliency and reconstruction
    • Native access is currently only TCP (limits performance)
    • File-system-only solution (it allows people to select unreliable hardware)
    • Rumors of having to pay for each client as well as for the segment (data) servers
  Vendors: IBRIX
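
To make "splitting a file across several servers" concrete, here is a minimal round-robin striping sketch; the 64 KB stripe size, four servers, and the mapping itself are generic assumptions for illustration, not IBRIX's actual layout:

    # Generic round-robin striping: map a byte offset within a file to a
    # (data server, offset on that server) pair. Illustrative values only.
    STRIPE_SIZE = 64 * 1024      # assumed stripe unit: 64 KB
    NUM_SERVERS = 4              # assumed number of data/segment servers

    def locate(offset):
        stripe_index = offset // STRIPE_SIZE
        server = stripe_index % NUM_SERVERS
        chunk_on_server = stripe_index // NUM_SERVERS
        local_offset = chunk_on_server * STRIPE_SIZE + offset % STRIPE_SIZE
        return server, local_offset

    for off in (0, 100_000, 1_000_000):
        srv, local = locate(off)
        print(f"file offset {off:>9} -> server {srv}, local offset {local}")
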
GlusterFS
  Networking:
    • TCP
    • InfiniBand
  Features:
    • Open-source
    • Excellent performance
    • Can use almost any hardware
    • Plug-ins (translators) provide a huge amount of flexibility and tuning capability
    • File-system-only solution (you can select whatever hardware you want)
    • No metadata server
    • Automatic File Replication (AFR) and self-healing if a data server is lost
    • NFS and CIFS gateways
  Limitations:
    • Relatively new
    • Dependent on RAID groups of disks for resiliency and reconstruction
    • File-system-only solution (it allows people to select unreliable hardware)
    • Extremely flexible (it takes some time to configure the file system the way you want it)
  Vendors: Open-source (link)
EMC Highroad (MPFSi)
  Networking:
    • TCP
    • Fibre Channel
    • Uses iSCSI as the data protocol
  Features:
    • NFS and CIFS gateways
    • Uses EMC storage, so backups may be easier
  Limitations:
    • Only EMC hardware can be used
    • Dependent on RAID groups of disks for resiliency and reconstruction
    • Single metadata server
    • The FC option requires an FC HBA in each node and an FC network ($$)
    • Most popular deployments use TCP (limits performance)
  Vendors: EMC
SGI CXFS
  Networking:
    • TCP (metadata) and FC (data)
  Features:
    • Multiple metadata servers (although only one is active)
    • Lots of redundancy in the design (recovery from data server failure)
    • Guaranteed IO rate
    • NFS and CIFS gateways(?)
  Limitations:
    • Doesn't scale well on clusters with many nodes
    • The FC protocol requires an FC HBA in each node and an FC network ($$)
    • Only one active metadata server
    • Dependent on RAID groups of disks for resiliency and reconstruction
    • Restricted to SGI-only hardware
  Vendors: SGI
Red Hat GFS
  Networking:
    • Fibre Channel (FC)
    • TCP (iSCSI)
  Features:
    • Open-source
    • Global locking
    • Can use almost any hardware for storage
    • Quotas
    • NFS and CIFS gateways
  Limitations:
    • Limited expandability (but the limit is large)
    • Dependent on RAID groups of disks for resiliency and reconstruction
  Vendors: Open-source (link)


Object Based File Systems/Storage
Each entry below lists the networking options, features, limitations, and vendors.
Panasas
  Networking:
    • Currently TCP only (primarily GigE)
  Features:
    • Object-based file system
    • Easy to set up, manage, and expand
    • Performance scales with shelves
    • Distributed metadata
    • Metadata fail-over
    • Fast reconstruction in the event of a disk failure
    • Disk sector scrubbers (look for bad sectors)
    • Can restore a sector if it is marked bad
    • Network parity (see the parity sketch below)
    • Blade drain
    • NFS and CIFS gateways (scalable NAS)
  Limitations:
    • Coupled hardware/software solution (more like an appliance)
    • Have to use Panasas hardware
    • Limited small-file performance
    • Kernel modules for kernel upgrades come from Panasas
    • Single-client performance is limited by the network (TCP)
    • Coupled hardware/software solution (limits hardware choice)
  Vendors: Panasas
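
As a reminder of how parity protection (whether computed in a RAID controller or, as with network parity, by the client) lets a lost block be rebuilt, here is a minimal XOR sketch; the block contents are arbitrary and this is a generic illustration, not Panasas's actual format:

    # Minimal XOR parity illustration: lose any single data block and the
    # surviving blocks plus the parity block are enough to rebuild it.
    data_blocks = [b"AAAA", b"BBBB", b"CCCC"]    # arbitrary example contents

    def xor_blocks(blocks):
        out = bytes(len(blocks[0]))
        for blk in blocks:
            out = bytes(a ^ b for a, b in zip(out, blk))
        return out

    parity = xor_blocks(data_blocks)

    # Simulate losing block 1 and rebuilding it from the survivors + parity.
    survivors = [blk for i, blk in enumerate(data_blocks) if i != 1]
    rebuilt = xor_blocks(survivors + [parity])
    print(rebuilt == data_blocks[1])             # True: the lost block is back
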
Lustre
  Networking:
    • TCP
    • Quadrics Elan
    • Myrinet GM and MX
    • InfiniBand (Mellanox, Voltaire, Infinicon, OFED)
    • RapidArray (Cray XD1)
    • Scali SDP
    • LNET (Lustre Networking)
  Features:
    • Open-source
    • Object-based file system
    • Can use a wide range of networking protocols
    • Can use native IB protocols for much higher performance
    • Excellent performance with a high-speed network
    • NFS and CIFS gateways (scalable NAS)
    • File-system-only solution (you can select whatever hardware you want)
  Limitations:
    • Single metadata server
    • Dependent on RAID groups of disks for resiliency and reconstruction
    • File-system-only solution (it allows people to select unreliable hardware)
  Vendors: Lustre
PVFS
  Networking:
    • TCP
    • Myrinet (GM and MX)
    • Native IB protocols
    • Quadrics Elan
  Features:
    • Object-based file system
    • Easy to set up
    • Distributed metadata
    • Open-source
    • High-speed performance
    • Can use multiple networks
    • File-system-only solution (you can select whatever hardware you want)
  Limitations:
    • Lacks some of the resiliency features of other file systems (but it wasn't designed for that same functionality)
    • File-system-only solution (it allows people to select unreliable hardware)
  Vendors: PVFS


I want to thank Marc Unangst, Brent Welch, and Garth Gibson at Panasas for their help in understanding the complex world of cluster file systems. While I haven't come close to achieving the understanding that they have, I understand much more than I did when I started. This article, an attempt to summarize the world of cluster file systems, is the result of many discussions in which they answered many, many questions from me. I want to thank them for their help and their patience.

I also hope this series of articles, despite its length, has given you some good general information about file systems and even storage hardware. And to borrow some parting comments, "Be well, Do Good Work, and Stay in Touch."

A much shorter version of this article was originally published in ClusterWorld Magazine. It has been greatly updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Dr. Jeff Layton hopes to someday have a 20 TB file system in his home computer. He lives in the Atlanta area and can sometimes be found lounging at the nearby Fry's, dreaming of hardware and drinking coffee (but never during working hours).

© Copyright 2008, Jeffrey B. Layton. All rights reserved.
This article is copyrighted by Jeffrey B. Layton. Permission to use any part of the article or the entire article must be obtained in writing from Jeffrey B. Layton.
