Some aid for those that use RAID

The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this article I turn to other mailing lists that can also provide useful information, reviewing some postings on the Rocks-Discuss and LVM mailing lists about RAID and file system preferences.


Most of the time the mailing lists for specific cluster applications or cluster distributions are devoted to specific questions about the application or distribution. However, sometimes you will see general questions and very good responses from knowledgeable people on these lists. Rocks is a popular cluster distribution. On January 6, 2004, a simple question to the Rocks mailing list gave rise to some good recommendations. Purushotham Komaravolu asked for recommendations for a RAID configuration for about 200 GB of data (recall that RAID stands for Redundant Array of Inexpensive Disks).

Greg Bruno provided the first answer. He said that for pure capacity (not necessarily throughput) you should use a 3ware 8006-2LP serial ATA controller with two 200 GB (Gigabyte) serial ATA drives that are configured for mirroring (RAID-1). He said that this should give about 80 Megabytes/sec (MB/s) in read performance and about 40 MB/s in write performance. For more performance, Greg recommended using a 3ware 8506-4LP serial ATA controller and four 100 GB ATA drives configured as RAID-10 (two sets of mirrored drives which are then striped over the two sets). Greg was estimating performance as 160 MB/s for read IO and 80 MB/s for write IO, if you use decent disks.
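Greg's figures follow a simple pattern, which can be sketched in a few lines of Python. This is only an idealized model, not anything posted to the list: it assumes each disk sustains about 40 MB/s (a value inferred here because it reproduces Greg's estimates), that reads are spread over every disk, and that writes proceed at the speed of the mirrored pairs. The function name `raid_estimate` and its parameters are made up for illustration.

```python
def raid_estimate(level, n_disks, disk_gb, disk_mb_s=40.0):
    """Return (usable_gb, read_mb_s, write_mb_s) under an idealized model.

    Assumption (not from the mailing list): every disk sustains
    disk_mb_s, and the controller adds no overhead.
    """
    if level == "raid1":
        if n_disks != 2:
            raise ValueError("this sketch models a two-disk mirror only")
        usable_gb = disk_gb                # both disks hold the same data
        read_mb_s = n_disks * disk_mb_s    # reads can be served by either disk
        write_mb_s = disk_mb_s             # every write goes to both disks
    elif level == "raid10":
        if n_disks < 4 or n_disks % 2:
            raise ValueError("RAID-10 needs an even number of disks, at least 4")
        pairs = n_disks // 2               # mirrored pairs that are striped together
        usable_gb = pairs * disk_gb
        read_mb_s = n_disks * disk_mb_s
        write_mb_s = pairs * disk_mb_s
    else:
        raise ValueError("unsupported RAID level: " + level)
    return usable_gb, read_mb_s, write_mb_s


if __name__ == "__main__":
    for level, n, gb in (("raid1", 2, 200), ("raid10", 4, 100)):
        usable, rd, wr = raid_estimate(level, n, gb)
        print(f"{level}: {usable} GB usable, ~{rd:.0f} MB/s read, ~{wr:.0f} MB/s write")
```

Both configurations come out at 200 GB usable, which is why Greg offered them as alternatives for the same data set. Real throughput depends on the controller, the disks, and the access pattern.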

Jon Forrest joined the discussion, saying that he had a difficult time getting the Promise and Iwill RAID cards (RAID-0 or RAID-1) working with Linux. Greg Bruno responded that they had good luck with 3ware controllers and bad luck with the controllers that Jon mentioned. However, Tim Carlson added that he was not impressed with the RAID-5 performance of the 3ware controllers, even using serial ATA (SATA) drives. Tim said that he had never gotten more than 50 MB/s using RAID-5 and SATA. He recommended going with SCSI drives and a SCSI RAID controller along with software RAID. Tim finally suggested using a box of IDE (ATA) disks with a back end controller that converts things over to SCSI or FC (Fibre Channel). He said that in his experience this solution scales nicely to tens of TB (terabytes).

Joe Landman jumped in to say that using RAID-5 for high performance is not a good idea. Rather, one should use something like RAID-0 (striping) for increased performance. Joe also took issue with the idea of using SCSI disks. Joe said that in his experience ATA drives were very good but suffered from an interrupt problem that leads to increased CPU load, to the point that you could swamp a CPU by writing many, many small blocks at the same time (think of a cluster head node or NFS file server). SCSI controllers hide this behind a controller interface. Joe went on to say that current CPUs have much more power than the controller in a RAID card. However, combining software RAID with a cheap hardware controller is asking for trouble, particularly for large loads. Joe ended by agreeing with Tim's recommendation of using IDE disks with a back end controller that converts to SCSI or FC.

A little later Joe said that the important question was what file system people were running on their RAID disks. Joe said that XFS was the best and should be incorporated into ROCKS (note that XFS is now part of the standard 2.4 and 2.6 kernels). Joe Kaiser chimed in that he thought XFS was great and that they have had very good luck with it. Tim Carlson jumped back in to say that he has had good luck with ext3. Joe Kaiser responded that they had some data corruption with ext3 on large arrays when the disk had been filled all of the way. Joe and Tim then discussed several aspects of design, including the importance of understanding your data needs and your data layout.

This discussion points out that there are several important considerations when designing a file server for a cluster. Your data layout, the host machine (CPU power), disk types, RAID controllers, monitoring capabilities, and file system choice can all have a great effect on the resulting IO performance.

ROCKS: Using Other File Systems

A couple of months after the previous discussion about RAID, a discussion about alternative file systems began on the ROCKS-Discuss mailing list. On April 16, 2004, Yaron Minsky asked about using something other than ext2 on the master node of his ROCKS cluster, particularly ReiserFS or XFS. Phillip Papadopoulos replied that a bug in ROCKS 3.1.0 forced you to use ext2 and that it would be fixed in the next release. However, he did say that you could convert the ext2 file system to ext3 by adding a journal.
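The conversion mentioned above amounts to adding a journal to an existing ext2 file system. One common way to do that is the `tune2fs -j` command from e2fsprogs. Whether that is the command named in the thread is an assumption here, as are the helper names and the device path `/dev/sdb1` in the sketch below. The helper only prints the command unless it is explicitly told to run it.

```python
import shlex
import subprocess


def add_journal_command(device):
    """Build the argv for adding an ext3 journal to an ext2 file system.

    Assumption: tune2fs from e2fsprogs is installed; its -j flag
    adds a journal to the named device.
    """
    return ["tune2fs", "-j", device]


def convert_ext2_to_ext3(device, dry_run=True):
    """Return the command as a string; run it only when dry_run is False."""
    cmd = add_journal_command(device)
    if not dry_run:
        # Needs root privileges and a real block device.
        subprocess.run(cmd, check=True)
    return shlex.join(cmd)


if __name__ == "__main__":
    # /dev/sdb1 is a placeholder device, not one taken from the discussion.
    print(convert_ext2_to_ext3("/dev/sdb1"))
```

After a journal has been added, the file system still has to be mounted as ext3 (for example by changing its type in /etc/fstab) before the journal is used.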

Laurence Liew responded that he thought ext2, ext3, ReiserFS, and XFS all had their strengths and weaknesses. He suggested using ext2 for a while to understand the application usage pattern. He also said that in some cases, modifying the layout of the cluster would have a bigger impact than changing file systems. Yaron replied that he thought ext3 fared worse than XFS or JFS in benchmarks. Laurence replied that he remembered some SNAP benchmark results that showed ext3 winning in certain cases.

There was some discussion about whether Red Hat included ReiserFS and/or XFS in the version of RHEL (Red Hat Enterprise Linux) that ROCKS uses. It was finally determined that XFS was not included, while ReiserFS was included but only as an unsupported RPM. Later on, Josh Brandt mentioned that he thought ReiserFS would do better than other file systems on lots of small files but worse on large files. Yaron, the original poster, posted his basic usage pattern (size of files, number of files, number of directories, etc.). Josh thought he should give ReiserFS a try.
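A usage pattern like the one Yaron posted (file sizes, file counts, directory counts) is easy to collect before asking for advice. The following is a minimal sketch, not something from the thread: the function name `survey_tree` and the 4096-byte threshold for a "small" file are arbitrary choices made for illustration.

```python
import os
import stat
import statistics


def survey_tree(root, small_bytes=4096):
    """Summarize file and directory counts and file sizes under root.

    small_bytes is an arbitrary cutoff for calling a file "small".
    """
    sizes = []
    n_dirs = 0
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        n_dirs += len(dirnames)
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):  # regular files only, no symlinks
                sizes.append(st.st_size)
    n_files = len(sizes)
    n_small = sum(1 for s in sizes if s < small_bytes)
    return {
        "files": n_files,
        "dirs": n_dirs,
        "total_bytes": sum(sizes),
        "median_bytes": statistics.median(sizes) if sizes else 0,
        "small_fraction": n_small / n_files if n_files else 0.0,
    }


if __name__ == "__main__":
    for key, value in survey_tree(".").items():
        print(f"{key}: {value}")
```

A tree dominated by small files points toward the kind of workload Josh had in mind for ReiserFS, while a handful of very large files suggests a different trade-off.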

While this discussion is brief, it does show that file system preferences, and the performance people observe, differ among various people and groups.


It's been a few months, but there was an interesting short discussion on the Linux LVM (Logical Volume Manager) mailing list. On January 15, 2004, Rajesh Saxena asked which file system, JFS or XFS, would be better for a file, mail, or web server that is running LVM. A number of people with extensive experience with LVM and file systems, and with using them in production, responded to the question. The first response was from a poster named neuron. Neuron suggested that Rajesh use ReiserFS instead of either JFS or XFS since it was designed for handling lots of small files. Neuron also said he had had some trouble with JFS in the past. The respected and very experienced Austin Gonyou posted some comments about journaling file systems in general.

Greg Freemyer, who is also a very experienced Linux user, agreed with neuron that ReiserFS would be a good choice. Greg also suggested that anyone using XFS stay away from any version earlier than 1.3.1, because earlier versions ignored the sync command, which could cause the loss of the journal information sitting in the disk cache. Then a user called spam added that they thought the ReiserFS tools for recovering from and repairing file system problems were very mature. Steven Lembark chimed in that he would suggest changing the phrase "if you have a problem" to "when you have a problem."

Rajesh posted again and thanked everyone for their suggestion of using ReiserFS, which he had not considered before. Rajesh did some homework and found that XFS had some features he really liked for taking snapshots. However, Rajesh also mentioned that he had heard of some problems with ReiserFS over LVM when taking snapshots (a snapshot is a 'copy' of a file system that is used for backups so that a live file system need not be taken off-line for backups). Heinz Mauelshagen, one of the big LVM wranglers, told Rajesh that there was an LVM patch that took care of the snapshot issue with ReiserFS. Alasdair Kergon pointed out that ReiserFS snapshots for the 2.6 kernel were not yet in the kernel tree because of the switch from LVM1 to LVM2 in the 2.6 kernel. However, as you read this, the snapshot patches are in the 2.6 kernel.

This discussion is very interesting because it points out that asking about an idea you have, or asking for an opinion prior to implementation, can greatly help. Rajesh had not considered ReiserFS prior to his mailing list posting. However, after his posting he discovered that what people were suggesting was a good idea. So the moral of the story is, before implementing anything, ask for some opinions on the mailing lists (everyone knows that the mailing lists are not shy about offering opinions).

Sidebar One: Links Mentioned in Column

ROCKS Archives







This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Jeff Layton has been a cluster enthusiast since 1997 and spends far too much time reading mailing lists. He can be found hanging around the Monkey Tree (don't stick your arms through the bars though).