Enter Dynamic Provisioning

The key to an HPC Cloud is an idea called "dynamic provisioning." A traditional cluster usually runs a fixed Operating System (OS) on all the compute servers; a user program must conform to this environment or it may not run. Similar to a standard Cloud, an HPC Cloud should allow the user to pick and choose (even design) the OS environment for the compute servers. This capability is made possible through dynamic provisioning, where all compute servers are bare-metal provisioned by the resource scheduler. In essence, the compute nodes are rebuilt each time a program is executed.
While dynamic provisioning may seem time consuming and inefficient, there are a few things to consider. First, most HPC applications run for hours, days, or even weeks; giving up a small chunk of run-time is a small price to pay for a flexible, Cloud-like environment. Second, and perhaps more important, there are provisioning methods that do not require the hard drive on each worker node to be re-imaged, thus reducing the time required to provision the node.
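To put the provisioning cost in perspective, a quick back-of-the-envelope calculation helps; the five-minute provisioning time and the job durations below are illustrative assumptions, not measured figures:

```python
# Rough overhead of bare-metal re-provisioning before each job.
# All numbers are illustrative assumptions, not benchmarks.

def provisioning_overhead(provision_minutes, job_hours):
    """Fraction of total wall time spent provisioning."""
    provision = provision_minutes / 60.0          # convert to hours
    return provision / (provision + job_hours)

# A 5-minute provision ahead of typical HPC run lengths:
for hours in (1, 24, 24 * 7):                     # one hour, one day, one week
    pct = 100 * provisioning_overhead(5, hours)
    print(f"{hours:4d} h job: {pct:.2f}% overhead")
```

Even for a one-hour job the overhead is a few percent, and for the day- or week-long runs typical of HPC it falls well below one percent.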
Using options such as RAM-based disks, NFS, and other standard *NIX tools, nodes can easily be provisioned with unique OS environments without touching any of the node hard drives (should they even exist). Hard drives on the nodes can still be used for local scratch storage, but all important OS files and directories are loaded by the resource allocator into a RAM disk. One interesting example of this type of tool is the Warewulf Project. The Warewulf toolset is a freely available package that allows easy creation and management of node images that are then loaded as RAM disk images on the compute servers. Booting a node is actually very fast, and the image can easily be changed to suit user or application preferences. A flexible commercial solution is Bright Cluster Manager, which allows easy installation, monitoring, and management of clusters. Bright also offers Cloud Bursting, where jobs can be directed to external Clouds.
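As a concrete illustration of the diskless idea (a generic sketch using standard *NIX tools, not the Warewulf or Bright procedure itself; all paths and addresses are made-up examples), a head node can serve a read-only root image to PXE-booted nodes:

```shell
# Sketch: diskless node boot via PXE + NFS root.
# Paths, addresses, and ranges below are illustrative assumptions.

# 1. Export a shared node image directory read-only over NFS
echo "/srv/node-image 10.0.0.0/24(ro,no_root_squash)" >> /etc/exports
exportfs -ra

# 2. Serve DHCP and TFTP for PXE boot (dnsmasq example)
cat >> /etc/dnsmasq.conf <<'EOF'
dhcp-range=10.0.0.100,10.0.0.200
dhcp-boot=pxelinux.0
enable-tftp
tftp-root=/srv/tftp
EOF

# 3. The kernel command line tells each node to mount its root over NFS:
#    root=/dev/nfs nfsroot=10.0.0.1:/srv/node-image ip=dhcp
```

Because the root image lives on the head node (or in a RAM disk built from it), swapping the OS environment for the next job is a matter of pointing nodes at a different image rather than re-imaging disks.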
HPC Clouds that combine high performance interconnects and dynamic provisioning can offer the most desirable Cloud features such as flexibility, scalability, and software choice while also maintaining HPC features that deliver expected performance levels.
The Storage Issue

An application that performs heavy I/O, which implies predictable and consistent I/O rates, is usually highly tuned for a given I/O environment (and vice versa). Contrast this with standard Cloud storage, where bulk storage is generally flexible and robust. There are few if any service level agreements (SLAs) that will guarantee high I/O rates. The storage is there, growable, and reliable, but not guaranteed to work at HPC levels.
HPC storage is often an engineered solution based on the user's needs and the application's data flow. The baseline plug-and-play solution is the Network File System (NFS), which was never designed to support high performance or parallel access. In light I/O cases, NFS is a valid solution; however, it can quickly become a bottleneck for large clusters due to its shared design. A specialized high performance NAS can help in this situation, but even this capability is not normally found in the standard Cloud. Higher performance solutions that employ distributed or parallel file systems are the preferred method in many clusters.
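The shared-server bottleneck is easy to quantify. Assuming (purely for illustration) a single NFS server that can sustain about 1 GB/s of aggregate throughput, the best-case per-node bandwidth collapses as the cluster grows:

```python
# Per-node I/O bandwidth when N nodes share one NFS server.
# The 1 GB/s aggregate figure is an assumed, illustrative number.
SERVER_GBPS = 1.0  # aggregate server throughput in GB/s

def per_node_bandwidth(nodes, server_gbps=SERVER_GBPS):
    """Best-case even split of a single server's throughput."""
    return server_gbps / nodes

for n in (4, 64, 1024):
    print(f"{n:5d} nodes: {per_node_bandwidth(n) * 1000:8.2f} MB/s each")
```

At a few nodes the split is tolerable; at a thousand nodes each gets roughly 1 MB/s, which is why large clusters turn to distributed or parallel file systems that scale bandwidth with added storage servers.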
One solution is the adoption of pNFS (parallel NFS). The pNFS standard allows storage vendors to supply high performance storage through a standard and familiar method. The back-end storage, which will determine the ultimate performance, depends on the vendor's technology. The I/O rate will be a necessary part of Cloud HPC SLAs. It is unlikely that the traditional Cloud will ever offer this level of service, and there is not much incentive to include pNFS or any other high performance storage. A properly engineered HPC Cloud should have a robust I/O solution if it is to address a larger portion of the HPC market.
But Will It Work For Me?

As presented above, the typical Cloud may not be the best candidate for production HPC work. Recent estimates from IDC put the HPC or technical server market at about $10 Billion for last year (2011). The exact amount is not important, but the fact that it is a sizable market seems to have attracted many traditional Cloud vendors to the "HPC space."
Making a business case based on a large unsegmented market is a rookie mistake. Experienced start-up veterans will often use market segmentation to better define the "real market." A simplified analysis is useful in the case of Cloud HPC. Starting with the total market size of $10 Billion, we can do a first segmentation based on the need for a high speed interconnect. This requirement means your Cloud needs either InfiniBand (IB) or high performance Gigabit Ethernet. If your applications can be considered Embarrassingly Parallel (EP), then standard 10 Gigabit Ethernet (or even Gigabit Ethernet) would be adequate. A generous ballpark estimate of 60% for the EP portion of the total HPC market results in a $6B market that can be addressed by traditional Clouds.
Next, consider I/O requirements. There are many applications that require heavy I/O; otherwise, computation will stall waiting to read or write data (e.g., scratch files, restart files, and "big data" applications). In this analysis, heavy I/O means relying on enhanced technologies to boost I/O performance. These capabilities are not normally part of a traditional Cloud and may include optimized NAS, distributed file systems, parallel file systems, and even the use of SSDs.
If we assume 40% of the EP market uses non-heavy (light) I/O, then the total market share for HPC computing on traditional Clouds shrinks to 24%, an estimated $2.4 billion. A summary of this oversimplified analysis is shown in Figure Two below.
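The segmentation reduces to simple arithmetic; the 60% EP and 40% light-I/O shares are the ballpark estimates used above, not measured data:

```python
# Back-of-the-envelope market segmentation from the text.
TOTAL_MARKET = 10e9    # ~$10B HPC/technical server market (IDC, 2011)
EP_SHARE = 0.60        # estimated embarrassingly parallel fraction
LIGHT_IO_SHARE = 0.40  # estimated light-I/O fraction of the EP segment

ep_market = TOTAL_MARKET * EP_SHARE       # market reachable without IB
addressable = ep_market * LIGHT_IO_SHARE  # also tolerant of Cloud storage
fraction = addressable / TOTAL_MARKET     # share of the whole HPC market

print(f"EP segment:        ${ep_market / 1e9:.1f}B")
print(f"Traditional Cloud: ${addressable / 1e9:.1f}B ({fraction:.0%} of total)")
```

Each cut compounds: two modest-looking filters leave traditional Clouds with less than a quarter of the nominal market.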
While the numbers may not be exact, the lesson is clear: not all HPC applications will work "out of the box" in traditional Clouds. Indeed, other factors may further segment the market, such as data movement to and from the Cloud, security, availability, and backups of big data results.
A True HPC Cloud

The following are some general conclusions that may be helpful in deciding whether Cloud-based HPC is right for you or your organization. Keep in mind, there are vendors who specialize in HPC Cloud computing, such as R-HPC and Penguin Computing.
- Standard Cloud services can only address a portion of the HPC market. Those applications are usually embarrassingly parallel and have low I/O requirements (light I/O). Careful analysis of application requirements is needed in order to determine the effectiveness of HPC in standard Cloud offerings.
- The desirable Cloud features, such as instant availability, large capacity, software choice, and virtualized environments, can be made available to HPC users through specially designed HPC Clouds. The use of high performance interconnects and dynamic provisioning can offer Cloud features while maintaining HPC performance levels.
- Providing a high performance I/O component to the HPC Cloud is necessary to ensure many of the I/O-heavy HPC applications will run to their fullest potential. pNFS will eventually provide a good plug-and-play interface for many of these users. The back-end storage design, however, will be important in achieving acceptable performance.
Based on these observations, performing HPC in the Cloud is indeed possible, but many applications cannot be shoehorned into just any Cloud solution. Clouds designed for HPC are needed and represent a viable solution for many organizations. In addition, there may be other issues that need to be addressed before HPC Cloud can deliver low cost and flexible HPC cycles. Don't decommission that cluster just yet!