A Balanced Storage System
One would think that storage for clusters would be a simple task. You just slap some disks together with a RAID controller, use something like NFS, and you're off to the races. You can do this, but it's not what anyone would consider a well thought out design. What qualifies as a well thought out design? In my opinion, it's what I call a "balanced" storage system.
A balanced storage system just means that the storage hardware can keep up with the amount of data from the clients and that the clients are capable of providing enough data to almost max out the storage hardware. For example, let's assume that you have 4 storage nodes that have enough disk in them to sustain about 200 MB/s in disk performance. This means that you have an aggregate of 800 MB/s in disk performance. Let's assume that the clients are connected to the storage system using GigE and a single GigE connection can achieve about 80 MB/s in IO performance.
800 MB/s / 80 MB/s = 10This number means you need, 10 clients to be able to provide enough data to keep the disk subsystem busy.
Why is achieving balance so important? I'm glad you asked. Let's return to the previous example. If you have more than 10 clients, then they are providing more than 800 MB/s of data to the storage system (if they are performing IO at the same time). So the storage hardware becomes a bottleneck and you will never get more than 800 MB/s in aggregate IO performance. This trade off may be acceptable to you, but you have introduced a bottleneck.
On the other hand, if you have fewer than 10 clients, then they cannot provide enough data to keep the storage hardware busy. For example, if you only had 6 clients, then they would provide no more than 600 MB/s in data to the storage hardware. So, you would have more disk hardware performance than you actually need. In other words, you have spent more money that you needed to keep up with the IO demands.So we can say that the storage solution is unbalanced.
To make sure you have a balanced file system you have to do some work. For example, one thing you will need to know, is the IO performance of the disk hardware. Then you will need to know how the clients are connected to the storage hardware and test that connection to determine how much data they can push onto the storage hardware. But, one of the most important things you need to determine is how much IO your applications will consume/produce.
If your applications produce a great deal of data, then it is more likely that you will need a balanced storage system. In other words, IO is a driving factor in performance and a balanced system allows you to match data requirements to hardware, saving money and taking full advantage of the hardware.
On the other hand, if your applications don't do as much IO and if IO performance is not a big driver for application performance, then you might be able to reduce the storage hardware. Let's look at a quick example.
If your applications only spend about 10% of their time on IO, then it's not as important to have a balanced storage system. The reason is that a majority of the time your applications are computing, not doing IO. If you had a balanced storage system, you would only be matching the peak IO requirements of your application. But even if you had an infinitely fast IO system, you could only improve the application performance by 10%. And we all know how much an infinitely fast storage system would cost. Perhaps it would be better to add a few more nodes with the likely-hood of increasing performance more than 10% (This gets into cluster design where you have to know the scalability of the applications and the network, etc. This is beyond the scope of this article â sorry).
Whether you choose to have a balanced storage system or not is up to you. I personally prefer to have a balanced storage system because I know I'm taking full advantage of my hardware. But if my applications aren't doing too much IO, then I will cut back on my storage hardware since even really fast IO won't have much impact on the overall performance. Then I can buy a few more nodes.
Data Corruption - Coming to a Storage System Near You!
Everyone is absolutely paranoid about losing data and rightfully so. I don't want my collection of "KC and the Sunshine Band" MP3's to disappear. So we use RAID devices to create RAID groups that can tolerate the loss of a disk or two without losing data. But, I hate to tell you that the assumption that RAID will protect our data may not be true in the future. Let me explain why.
Today we have 1 TB disks and soon we will have 1.5 TB drives and even 2 TB drives (Maybe I can add in my collection of "Surf Punks" MP3's). A 1 TB drive has about 2 billion (2 x 10^9) sectors. The drive manufacturers claim that an Unrecoverable Read Error (URE) happens about every 1 x 10^14 bits. This converts to 24 x 10^9 sectors (assuming that we have 512 byte sectors or 512 * 8 = 4096 bits per sector).
If you divide the URE by the number of bytes per disk,
24 x 10^9 / 2 x 10 ^9 = 12So if you have 12, 1TB disks, your probability of hitting a URE is one (it is going to happen).
So if you have a RAID-5 group of thirteen 1TB disks and you lose a disk the RAID array starts to reconstruct by reading all of the sectors on the remaining 12 drives. In this case RAID is on a block-level so all the blocks on the remaining drives have to be read even if they donât contain any data. Therefore you are almost guaranteed that during reconstruction you will hit a URE, causing the reconstruction to fail. This means that you will have to restore the RAID group from a backup. Restoring the backup involves having to copy up to restore up to 13TB. This could take quite a while.
RAID-6 was designed to help with this since you can lose up to 2 disks before compromising the ability to reconstruct the data. So if you have a RAID-6 group of 13 1TB disks and you lose a disk, reconstruction starts on the remaining 12 drives. But as shown previously, the probability of hitting a URE with 12, 1TB drives is nearly 1. If you lose another drive during reconstruction you are now down to 11 drives. While the probability of hitting a URE isnât 1 itâs very close. If you lose another drive during reconstruction with the 11 drives, your RAID group is toast and you have to restore from a backup.
Not many people realize that if you have enough 1TB disks in a RAID group, you are almost guaranteed that you will hit a URE at some point. So creating large RAID groups for your MP3 or YouTube collection is not a good idea.
So what do you do? There are a few things you can do:
- Put fewer disks in the RAID group (keep the total below 10-12 TB)
- Run a RAID-1 across RAID groups
- Hope the disk vendors change the URE rate (or ask them)
- Switch to a storage scheme that doesn't require a RAID controller and RAID groups (such as object storage)
Let's talk about these options because some of them are better than others.
You can put fewer disks in a RAID group to keep the total usable space below 10-12 TB. This limits the amount of storage in a single RAID group which may or may not be a problem for you. But, if you are using NFS and these are NAS boxes, then you might end up with a whole bunch of NFS file systems being exported. This makes management a mess. So this option, while it works, is probably not the best for cluster storage.
Another option is to create a RAID group with as many drives and space as you want, but then use RAID-1 to mirror the group. So you would create a RAID-51 or a RAID-61 using this approach. In either case, you will be wasting 50% of your usable space. Then if one of the RAID groups loses a drive and then hits a URE, then you can copy the data over from one side of the RAID-1 to the affected one (presumably after the disk(s) have been restored). But the gotcha is that during this the copy, you are reading all sectors from the unaffected RAID-group. This means that you're probability of hitting a URE is 100%. So this approach, while seemingly a good one, is actually not a good idea either. But it does give you a little more protection since a RAID-51 can tolerate the loss of 2 drives, 1 on each RAID-5 group, and RAID-61 can tolerate the lose of 4 drives, 2 on each RAID-6 group, before you have to restore data from a backup.
On the other hand, if you can create more than 2 RAID groups in a RAID-1, such as using software RAID, you could copy part of the sectors from one unaffected RAID group and another part from the other unaffected RAID group. Again, not a stellar idea. So using the RAID-1 approach again says once again that you want to use fewer than 10-12 TB in a single RAID group.
The third option, hoping the disk vendors change the URE rate, is something that may or may not work. From what I understand, the disk vendors are capable of making disks with much smaller URE rates. But the resulting drives would be smaller in capacity and more expensive (if they're smaller you wouldn't likely hit a URE so you wouldn't even need a lower URE rate). Conversely the drive manufacturers can also make larger drives with a higher URE rate (i.e. more likely to hit a URE). But this too might not be a good idea for obvious reasons. Plus cost is always an issue. You can put pressure on the drive manufacturers, but I'm not sure they are able to respond (most definitely not able to respond quickly).
The last option is switching to a storage scheme that doesn't require a RAID controller and RAID groups. There aren't many file systems that can do this. The only one I know of is Panasas. Since Panasas is using an object based file system and a per-file RAID layout, they don't have to use RAID controllers and RAID groups. We will have more on this topic in Part Three.
I like morals. The moral of this section is, "If you're walking on egg shells, don't hop" (my apologies to Blue Thunder). More precisely, we could say, "Watch the size of your RAID group versus the URE rate." If you put too much data (too many sectors) in a single RAID group, you are almost guaranteed to encounter a URE if the group performs a RAID reconstruction. This has implications for how you architect storage as well. If you use a file system that has IO nodes (e.g. GPFS, IBRIX, Lustre), then you will have to use more IO nodes since of the limitation of the RAID group size.
Enough scary bedtime cluster stories (but you have been warned). Let's move onto the file system themselves.