What you may not know can cost you
The commodity cluster has changed the High Performance Computing (HPC) landscape in a significant way. Indeed, clusters have had a disruptive influence on many sectors of the IT market in addition to HPC. As with most disruptive technologies, clusters hit the market with the promise of "faster, better, cheaper" computing. Marketing numbers from IDC seem to support the perception that clusters are delivering on their promise. A deeper look, however, reveals that some of the "cheaper" promise is due to shifting certain costs from the traditional HPC vendor to the customer.
Purchasing an HPC cluster can be likened to buying low-cost, self-assembled furniture. The pile of flat-packed boxes that you take home is often a far cry from the professionally assembled model in the showroom. You have saved money, but you will also spend time deciphering instructions, hunting down tools, and undoing missteps before you can enjoy your new furniture. Indeed, should you not have the mechanical inclination to put bolt 32 into offset hole 19, your new furniture may never live up to your expectations.
The furniture analogy actually breaks down in a perverse way, because that new rack of servers does not come with an HPC assembly manual. Indeed, a typical HPC cluster procurement has many hidden costs that are not included in the raw hardware purchase and can force the actual cost to much higher levels. Because there is a huge difference between a stack of hardware and a functioning HPC cluster, understanding the hidden cost variables involved in an HPC procurement can save time, money, and most of all, aggravation.
The most famous of these companies was, of course, Cray Computer. When you purchased a Cray, the system was delivered ready to run. An end user could sit down and begin compiling and running code right away. Should the user need assistance (e.g., with optimization or debugging), an end-user manual was never far away and training was always available. If there was a problem with a compiler switch or a performance issue, Cray had the ability to examine the issue "end-to-end" because they integrated the hardware and software into a functioning system.
This level of integration, as well as the specialized hardware, came at a justified premium cost. Users focused on programming solutions, system administrators supported the users, and supercomputer companies focused on running the programs in the least amount of time. This traditional relationship worked well for the HPC market.
While clustering systems together for greater performance was not a new concept, the use of commodity hardware was somewhat novel. The price for such hardware is low (due to it being sold in large volume) and the performance has been steadily increasing (due to competition within the desktop and server sector). In addition, Ethernet and other high end networking products are available for connecting individual cluster servers (or nodes).
Historically, HPC has used the UNIX operating system to drive high-end hardware. The Linux® operating system (and its subsequent distributions) has emerged as an open (freely available) and virtually plug-and-play replacement for these UNIX environments. In addition, the openness of Linux has fostered an ecosystem in which other HPC software could be easily ported or written.
For those wanting to use clusters, however, it remains difficult to purchase a fully integrated cluster because the components come from a variety of manufacturers, and integrators are reluctant to take responsibility for the whole system. The operating system often comes from a Linux vendor (or project). The middleware (MPI libraries) comes from one of several possible sources. An optimizing Fortran or C/C++ compiler, which is not part of the standard OS bundle, comes from still another source. Storage, interconnects, switches, debuggers, parallel file systems, and many other options also add to the list of possible sources.
Another misconception that extends far beyond HPC clusters is the notion that openly available software is free and therefore adds no cost to a cluster. While the initial cost of open software may be non-existent, there is a substantial cost associated with software support and integration. In the case of HPC clusters, these costs can be quite substantial and are, in essence, now the responsibility of the customer.
Support and infrastructure costs can range from small to substantial depending on the users' goals. In general, the more people that use the cluster, the greater the amount of work the end users must shoulder. Hidden costs for a cluster can be broken down into five categories: Integration, Validation, Maintenance, Upgrading, and Infrastructure. These topics will be discussed separately below.
Another issue with the classic cluster design (OS image installed on each node) is that of version skew or "node personalities." Initially, keeping nodes in sync seems trivial: just install the same thing on all the nodes. This approach breaks down as the cluster ages, because replacement nodes must be re-imaged to match all other nodes. To accomplish this, changes must be tracked and a current "snapshot" created. Changes also include OS tuning parameters and tweaks that must be applied to nodes so that certain software applications will run correctly. This "change/snapshot/re-image" cycle is expensive and can incur significant downtime for the simplest of maintenance issues.
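To make version skew concrete, here is a minimal sketch of how it might be detected. It assumes you have already collected, by whatever means your site uses, a manifest of package names and versions for each node. The function name, node names, package names, and version strings are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def find_version_skew(manifests):
    """Return {package: {version: [nodes]}} for packages that differ across nodes.

    `manifests` maps a node name to a {package: version} dict.
    A package that is missing from a node is reported under the version None.
    """
    all_packages = set()
    for packages in manifests.values():
        all_packages.update(packages)

    skew = {}
    for package in sorted(all_packages):
        by_version = defaultdict(list)
        for node in sorted(manifests):
            by_version[manifests[node].get(package)].append(node)
        # More than one distinct version (or a missing package) means skew.
        if len(by_version) > 1:
            skew[package] = dict(by_version)
    return skew


# Hypothetical manifests: node003 is a replacement node imaged at a later date.
manifests = {
    "node001": {"kernel": "1.0", "mpi": "2.0", "compiler-runtime": "3.0"},
    "node002": {"kernel": "1.0", "mpi": "2.0", "compiler-runtime": "3.0"},
    "node003": {"kernel": "1.1", "mpi": "2.0"},
}

for package, versions in find_version_skew(manifests).items():
    print(package, versions)
```

A report like this only shows that nodes have diverged. Deciding which version is the correct one, and re-imaging to match it, is still the manual and costly part of the cycle.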
There are several more advanced cluster provisioning methodologies, such as NFS-root or RAM-disk, that help solve some of these issues. These approaches must be evaluated carefully, as changing your provisioning scheme after the cluster is operational can be difficult and disruptive.
These numbers are more striking when the cost of the entire cluster is taken into account. Consider a typical cluster purchase in today's market, where a typical node costs $3,500 (including racks, switches, etc.). Using standard dual-core technology, a node provides two processors and four cores. A typical 128-node cluster will then provide 256 processors and 512 cores and cost $448,000. Based on the above assumptions, the annual power and cooling budget is then $67,300. Over a three-year period this amounts to $202,000, or 45% of the system cost.
While costs may vary due to market conditions and location, the above analysis illustrates that for a typical commodity cluster the three year power cost can easily reach 40-50% of the hardware purchase price.
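The arithmetic in the example above can be reproduced with a short script, which also makes it easy to substitute your own figures. This is a sketch only: the node price, node count, and annual power and cooling budget are the values quoted in the text, and the function and variable names are illustrative.

```python
def cluster_cost_summary(node_cost, node_count, annual_power_and_cooling, years=3):
    """Compare multi-year power and cooling cost with the hardware purchase price."""
    hardware_cost = node_cost * node_count
    power_cost = annual_power_and_cooling * years
    return {
        "hardware_cost": hardware_cost,
        "power_cost": power_cost,
        "power_fraction": power_cost / hardware_cost,
    }


# Figures from the example in the text.
summary = cluster_cost_summary(
    node_cost=3500,                   # dollars per node, including racks and switches
    node_count=128,
    annual_power_and_cooling=67_300,  # dollars per year, as quoted in the text
)

print(f"Hardware cost:            ${summary['hardware_cost']:,}")
print(f"Three-year power/cooling: ${summary['power_cost']:,}")
print(f"Power as share of system: {summary['power_fraction']:.0%}")
```

With the figures from the text, the three-year power and cooling cost comes to $201,900, which the text rounds to $202,000.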
Other infrastructure issues can affect cost as well. A typical industrial rack-mount chassis can hold 42 cluster nodes. An average cluster node weighs around 45 pounds. Thus, each rack requires a floor capable of supporting roughly 2000 pounds in the footprint of a single rack-mount enclosure. In a typical data center, rack-mount hardware is a mix of storage and servers, with many underpopulated racks. HPC clusters, on the other hand, represent the densest and therefore heaviest load in a data center. In our 128-node example, the cluster will require support for roughly 6000 pounds in a 4x8 foot area.
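The floor-loading estimate can be checked the same way. The sketch below uses only the figures given in the text (42 nodes per rack, 45 pounds per node, 128 nodes, a 4x8 foot area). It counts node weight only, which is an assumption on our part: the rack enclosures, switches, and cabling would add to the totals.

```python
import math

NODES_PER_RACK = 42
NODE_WEIGHT_LB = 45
CLUSTER_NODES = 128
FOOTPRINT_SQ_FT = 4 * 8  # the 4x8 foot area from the text

# Weight of one fully populated rack (nodes only).
full_rack_weight = NODES_PER_RACK * NODE_WEIGHT_LB

# Racks needed and total node weight for the example cluster.
racks_needed = math.ceil(CLUSTER_NODES / NODES_PER_RACK)
cluster_weight = CLUSTER_NODES * NODE_WEIGHT_LB

# Average floor loading over the footprint.
load_per_sq_ft = cluster_weight / FOOTPRINT_SQ_FT

print(f"Full rack:     {full_rack_weight} lb")
print(f"Racks needed:  {racks_needed}")
print(f"Cluster total: {cluster_weight} lb")
print(f"Floor loading: {load_per_sq_ft:.0f} lb per square foot")
```

The node weights alone come to 1890 pounds per full rack and 5760 pounds for the cluster, consistent with the rounded 2000 and 6000 pound figures above.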
For the more scientifically inclined, there is a kind of conservation of cost when it comes to HPC. Cost in this sense is both time and money, because the time to solve an implementation problem often cannot be reduced with money. The low price of clusters did eliminate some costs, but it shifted many of the non-reducible costs to the end user, which ultimately affects how much computing per dollar the cluster user can achieve. These costs, coupled with infrastructure costs, often push the total cost of ownership much higher than originally anticipated.
Factoring the hidden costs into such a number can be very difficult. The amount of time and money required depends on your level of in-house expertise. Building and maintaining a production HPC cluster requires a skill set that is currently in short supply and thus expensive. If your organization does not have the technical depth, then purchasing hardware is, in a very real sense, putting the cart before the horse. Infrastructure costs, on the other hand, are more easily estimated and therefore should be an integral part of all success metrics.
If you are planning to purchase an HPC cluster, consider the additional work required to achieve a functioning system. Failure to account for the hidden time and money will result in lost up-time, higher costs, and poor performance. As part of your cluster plan, determine whether you have the in-house expertise to accomplish these tasks in a cost-effective manner. If you need help, look to a vendor that has an intimate understanding of your needs and experience with HPC systems. In reality, most large vendors will go only slightly beyond a "standard install," using a professional services organization (internal or external), at which point you are on your own. There are a number of smaller vendors that can help minimize the hidden costs and provide real long-term support for your HPC needs. Finally, there is a small number of consultants that specialize in cluster integration, testing, and support.
Cluster HPC is a powerful and effective computing platform. Understanding the real cost structure will help set expectations and assist in planning and implementing your HPC resource.