There are many options with iSCSI. You can use a central server (target) with a number of disks, several targets, several networks, or combinations of these (as the Vulcans say, "infinite diversity in infinite combinations"; while not infinite, there are a number of options). iSCSI makes block devices available over an Ethernet network. To create block devices suitable for iSCSI, you can use md (Multiple Devices, or software RAID), lvm, evms, etc. These tools let you build block devices from partitions or from different disks, and they also let you export block devices from one target to multiple nodes (initiators). iSCSI also allows you to export block devices over various Ethernet networks to various initiators.
Using software RAID (md) on the server (target) machine, you can combine block devices and give them RAID protection before exposing them to iSCSI initiators. For example, you could create a software RAID 5 array on the target machine using all of the disks, then use LVM to create volume group(s) and logical volumes across all of the disks, and then expose the logical volumes to nodes (initiators). This way, if a disk is lost, it can be replaced without losing the file system on any node. However, if the server (target) goes down, you lose the storage on the nodes (initiators) it was serving.
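As a rough sketch of the RAID 5 plus LVM layout just described, the following Python builds (but does not run) the commands a target might use. The device names, volume group name, and logical volume names and sizes are hypothetical placeholders. The step that actually exports the logical volumes depends on the iSCSI target software in use, so it is left out.

```python
def plan_target_storage(disks, md_device="/dev/md0", vg_name="vg_iscsi", volumes=None):
    """Return the commands (as argument lists) to build RAID 5 + LVM on a target.

    Nothing is executed here; the caller can review the plan first.
    All device and volume names are illustrative assumptions.
    """
    if len(disks) < 3:
        raise ValueError("RAID 5 needs at least three member devices")
    volumes = volumes or {}
    plan = [
        # Combine all disks into one software RAID 5 array.
        ["mdadm", "--create", md_device, "--level=5",
         f"--raid-devices={len(disks)}", *disks],
        # Put LVM on top of the array.
        ["pvcreate", md_device],
        ["vgcreate", vg_name, md_device],
    ]
    # One logical volume per node (initiator) that will be served.
    for lv_name, size in volumes.items():
        plan.append(["lvcreate", "-L", size, "-n", lv_name, vg_name])
    return plan


if __name__ == "__main__":
    plan = plan_target_storage(
        ["/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde"],
        volumes={"node01": "50G", "node02": "50G"},
    )
    for cmd in plan:
        print(" ".join(cmd))
```

Returning the plan as data rather than running it makes it easy to check the device list before anything touches the disks.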
An alternative is to take a number of servers with disks and expose a set of block devices from each target to a set of nodes, such that each node (initiator) gets one block device from each of several targets. The node then combines these iSCSI block devices with software RAID-5 (or RAID-1) and LVM to create a final block device that is formatted with a file system. This configuration allows an entire target machine to go down without losing the storage on the nodes (initiators): because the final block device is RAID-5, or at least RAID-1, you still have access to the data. You can also use RAID-5 on the targets so that the loss of a single disk will not interrupt the initiators. This configuration might also have some speed advantages, depending upon how the storage is used.
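To make the resiliency argument concrete, here is a minimal sketch of how many targets an initiator-side array can lose. It assumes each member of the md array comes from a different target, as in the layout above. The function names and the target names in the usage lines are made up for illustration.

```python
def tolerated_failures(level, members):
    """How many member devices (here, whole targets) the array can lose."""
    if level == 1:
        if members < 2:
            raise ValueError("RAID-1 needs at least two members")
        return members - 1          # any single surviving mirror holds the data
    if level == 5:
        if members < 3:
            raise ValueError("RAID-5 needs at least three members")
        return 1                    # single parity covers one lost member
    raise ValueError("only RAID-1 and RAID-5 are modeled in this sketch")


def usable_capacity_gb(level, members, member_size_gb):
    """Usable size of the final block device, for equal-sized members."""
    tolerated_failures(level, members)  # reuse the argument validation
    if level == 1:
        return member_size_gb
    return (members - 1) * member_size_gb


def survives(level, members, failed_targets):
    """True if the array is still readable after the given targets fail."""
    return len(set(failed_targets)) <= tolerated_failures(level, members)


if __name__ == "__main__":
    # Four targets, one 100 GB block device from each, RAID-5 on the node.
    print(usable_capacity_gb(5, 4, 100))
    print(survives(5, 4, {"target2"}))
    print(survives(5, 4, {"target2", "target3"}))
```

The trade-off shows up directly: RAID-5 across the targets keeps more usable space but survives only one target failure, while RAID-1 keeps the capacity of a single member but survives the loss of all but one.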
You can also use striping via RAID or lvm to improve disk performance on the target before exposing the block device to the initiator(s). However, this will likely move the bottleneck to the network. You could also stripe on the initiator side by combining the block devices from various targets in md to create the final block device for the file system.
Since Gigabit Ethernet (GigE) is relatively inexpensive today, it's possible to have the target machines expose block devices on various networks. This feature allows you to reduce the number of block devices communicating over a given network thus improving throughput.
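A back-of-the-envelope sketch of the network argument follows. It treats a GigE link as 125 MB/s of raw bandwidth shared evenly among the block devices active on it, and ignores protocol overhead and contention, so the results are upper bounds only. The per-disk throughput figure in the usage lines is an assumed number, not a measurement.

```python
GIGE_MB_PER_S = 1000 / 8  # 1 Gbit/s expressed in MB/s, before any protocol overhead


def per_device_share(devices_on_network, link_mb_per_s=GIGE_MB_PER_S):
    """Best-case bandwidth each block device gets on a shared link."""
    if devices_on_network < 1:
        raise ValueError("need at least one device on the network")
    return link_mb_per_s / devices_on_network


def effective_throughput(disk_mb_per_s, stripe_width, devices_on_network,
                         link_mb_per_s=GIGE_MB_PER_S):
    """Upper bound for one exported device: the slower of storage and network."""
    storage = disk_mb_per_s * stripe_width
    network = per_device_share(devices_on_network, link_mb_per_s)
    return min(storage, network)


if __name__ == "__main__":
    # Eight exported devices on one network versus four on each of two networks.
    print(per_device_share(8))
    print(per_device_share(4))
    # Striping four disks (assumed 50 MB/s each) behind a dedicated GigE link:
    # the network, not the disks, sets the limit.
    print(effective_throughput(50, 4, 1))
```

Under these assumptions, splitting the devices across two networks doubles the best-case share per device, which is the effect the paragraph above describes.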
There are many possible ways to configure an iSCSI storage solution. Using md and lvm or evms, you can create block devices on the targets and expose those to the initiators. Then you can use exposed devices from various targets on a single initiator to get good performance and improve resiliency.
HyperSCSI can also be used to provide local storage on the nodes. HyperSCSI is a network storage protocol like iSCSI, but rather than use IP as iSCSI does, it uses its own packets over raw Ethernet. By doing so, it can be more efficient because it avoids the TCP/IP overhead. However, because it doesn't use IP packets, it's not a routable protocol. For small to medium clusters this is not likely to be an issue.
Configurations for HyperSCSI are conceptually very similar to iSCSI configurations. Like iSCSI, it uses block devices and Ethernet networks. As I said before, the big advantage of HyperSCSI is that it doesn't use IP but its own packets. This can make for an extremely efficient network storage protocol, and it is very well suited to clusters, since they typically don't use routed networks inside the cluster.
There are several commercial options for providing storage, both locally and for global file systems. For example, one could use Lustre, IBRIX, GPFS, or Terrascale with various storage devices, or use the Panasas ActiveScale Storage Cluster. One could also use Coraid's ATA-over-Ethernet product to provide local storage for each node in a fashion similar to iSCSI or HyperSCSI.
For smaller clusters, these solutions are likely to be too expensive. For larger clusters, perhaps from 32 nodes and up, they might prove to be a price/performance winner. However, there are some applications that are very I/O intensive and could benefit from a high performance file system regardless of the size of the cluster.
As you can see there are a number of options for providing either global storage to diskless nodes or local storage for diskless nodes. Depending upon your code(s), you can choose to use either global storage or local storage or a combination of the two.
For small to medium clusters, which I define as up to 64 or 128 nodes, NFS will work well enough if you have a good storage subsystem behind it and your I/O usage isn't too large (high I/O rates can easily kill performance over NFS). In addition, AFS offers some very attractive features compared to NFS, so you should seriously consider it. If you need lots of I/O, then PVFS or PVFS2 will work well, as long as you understand that it is a high-speed scratch file system and not a place for storing files on a longer-term basis, as a home file system requires.
If you need storage local to each node for running your codes then either iSCSI or HyperSCSI will work well. Plus they are very flexible and can be configured in just about any way you want or need. In some cases you might have to also use global storage such as NFS to help. In either way
In my next installment I'll discuss commercial options in more depth as I continue discussing file system options for diskless clusters larger than 128 nodes.
Sidebar One: Links Mentioned in Column
NFS mailing list - a discussion about good performance
NFS mailing list - a discussion about diskless systems
iSCSI on Linux - Article on Configuring iSCSI target and initiators
The core of this article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.
Dr. Jeff Layton hopes to someday have a 20 TB file system in his home computer (donations gladly accepted). He can sometimes be found lounging at a nearby Fry's, dreaming of hardware and drinking coffee (but never during working hours).