
Read this article and become a cluster design expert! Use a new tool from aggregate.org to determine price and performance before you buy! Get a handle on everything from Ethernet cables, to GFLOPS, to power and cooling. ClusterMonkey likes to call it the Clustanator; you will probably call it extremely useful.

Whenever someone asks what hardware to buy for their new cluster, the answer is always, "It depends." It depends on what application the cluster will be used for, it depends on how much space, power, and cooling are available, it depends on the costs of operating the cluster, and it depends on how much the parts that might be used cost. The standard process is to analyze the application and use rules of thumb and experience to guess what hardware will work best. The sophistication of the analysis depends on how much money is involved and how much computer engineering expertise the designer has.

Large clusters, with price tags in the millions of dollars, justify spending a lot of effort to characterize the applications that will run on them and to design the best system for those applications. However, the vast majority of clusters that get built are smaller, typically costing anywhere from $10,000 to $200,000 to build. It is not economically viable to pay an expert for time to design such a cluster. The number of components available and the ways they can be combined can be overwhelming, even for experienced designers. Moreover, the scientists and engineers wanting a cluster are typically not computer engineers. Rather, they are experts in their own field who are just using the cluster as a tool to help them solve their problems faster than is otherwise possible. Automating the design process is the key to helping both experienced and inexperienced designers get the most for their money for these low cost systems.

The Cluster Design Rules (CDR) is a web-based software tool for designing this sort of cluster supercomputer. Users specify the requirements of their applications and the resources available to them, like power, cooling, and floor space. The CDR uses these constraints along with a performance model and a database of available components to find a design that meets all the constraints and optimizes performance.

The Cluster Design Rules

The CDR models a cluster based on commodity components available to the end user. The CDR combines network interfaces, cables, switches, motherboards, processors, memory parts, disk drives, cases and racks from a database to design a cluster. The CDR searches for viable designs by selecting a number of nodes and a motherboard type. It then tries to build various network topologies (no network, ring, 2D mesh, 3D mesh, switched network, tree, fat tree, flat neighborhood network, flat neighborhood network of trees) using available network interfaces, switches, and cables. The remaining system components are added to the design one at a time until a complete design is available. Complete designs that do not meet application or resource constraints are discarded. If at any stage a partial design cannot be completed without violating the application or resource constraints, the design is removed from the search. For example, if a partial design costs more than the acquisition budget, then it cannot possibly be part of a valid full design because adding more components will only increase the cost.
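
To illustrate the pruning idea, the sketch below enumerates hypothetical node, network interface, and switch choices and discards any partial design that already exceeds the budget. The component types, prices, and scoring function are invented for illustration; they are not the CDR's actual database, topologies, or metric.

/* A minimal sketch of the kind of pruned search described above, not the
 * CDR's actual code. Component lists, prices, and the budget are
 * hypothetical; the point is that any partial design already over budget
 * is discarded before its remaining parts are enumerated. */
#include <stdio.h>

#define BUDGET 100000.0

static const double node_cost[]   = { 900.0, 1400.0, 2100.0 };   /* per-node board+CPU+RAM bundles */
static const double nic_cost[]    = { 0.0, 40.0, 450.0 };        /* onboard, GigE, high-speed NIC  */
static const double switch_cost[] = { 0.0, 800.0, 6000.0 };      /* none/ring, small switch, large */

static double best_score = 0.0;

/* crude stand-in for a performance metric: more nodes and faster parts score higher */
static double score(int nodes, int board, int nic, int sw)
{
    return nodes * (1.0 + 0.5 * board) * (1.0 + 0.3 * nic + 0.2 * sw);
}

int main(void)
{
    for (int nodes = 4; nodes <= 512; nodes++) {
        for (int board = 0; board < 3; board++) {
            double partial = nodes * node_cost[board];
            if (partial > BUDGET)            /* prune: adding parts only raises cost */
                continue;
            for (int nic = 0; nic < 3; nic++) {
                double with_nics = partial + nodes * nic_cost[nic];
                if (with_nics > BUDGET)
                    continue;
                for (int sw = 0; sw < 3; sw++) {
                    double total = with_nics + switch_cost[sw];
                    if (total > BUDGET)      /* complete design over budget: discard */
                        continue;
                    double s = score(nodes, board, nic, sw);
                    if (s > best_score) {
                        best_score = s;
                        printf("%d nodes, board %d, nic %d, switch %d: $%.0f, score %.1f\n",
                               nodes, board, nic, sw, total, s);
                    }
                }
            }
        }
    }
    return 0;
}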

The application and resource constraints describe the needs of the application and the resources available to the user. Table One lists the application parameters users can set as constraints in the CDR. Some constraints, like data size, may be known a priori from the application and the type of problem being solved. Other constraints, like memory bandwidth and the networking parameters, can either be estimated from source code analysis and knowledge of the problem, or measured by profiling running versions of the application. It is worth pointing out that memory bandwidth is specified in bytes/FLOP so that the requirement scales with processor speed. Also, bisection bandwidth is specified per processor core, because all of the cores in a node share the network links attached to that node.

Table One: Application constraints modeled by the CDR
Memory Size for Data (Bytes/cluster)
Memory Size for Code (Bytes/node)
Memory Size for Operating System (Bytes/node)
Memory Bandwidth (Bytes/FLOP)
Virtual Memory Size (multiple of node memory)
Local Disk Storage (Bytes/cluster)
Message Latency (μs)
Collective Latency (μs)
Bisection Bandwidth per Processor Core (Mbps/core)
Coordinality (Nodes/Node)
Number of Nodes/Processors/Cores (n², n³, 2ⁿ)
GFLOPS

Resource constraints describe the budget and infrastructure available to the user. Table Two lists the resource constraints considered by the CDR. The first resource constraint is often the acquisition budget for the cluster. However, available power, cooling capacity, floor space, and operating costs are often more important limiting factors. It is not uncommon for users to buy as many nodes as they can afford only to find out that they do not have enough power or that their current air conditioner cannot keep the room cool. The CDR considers these constraints and avoids designs that will not fit within existing infrastructure.

Table Two: Resource constraints modeled by the CDR
Floor Space
Power
Air Conditioning
Operating Budget
Acquisition Budget

All designs that meet application constraints and resource constraints are ranked by a performance metric. The performance metric can be a weighted sum of system-wide parameters, like usable GFLOPS, network bisection bandwidth per processor, network latency, memory bandwidth per processor, or acquisition cost. Alternatively, the metric can be based on an application model. When no application model is available, the weighted sum is useful for approximating performance. The weighted sum method measures each system parameter relative to the minimum amount specified as a design constraint and multiplies it by a weighting factor. The most important system parameter is weighted most heavily, followed by the second most important parameter, and so on. Determining precisely what weightings should be used can be difficult, but it is usually easy to guess a reasonable range for them. The CDR runs quickly enough that designs covering a range of weightings can be computed in a short amount of time. Designs that rank among the top designs for many combinations of settings are likely to work well.
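
As a rough illustration, the C fragment below shows one way such a weighted sum could be computed. The parameter names and weights are placeholders, not the CDR's actual formula.

/* Sketch of the weighted-sum ranking, with made-up weights and parameter
 * names; each parameter is taken relative to the minimum specified as a
 * constraint, so a design that exactly meets every minimum scores the sum
 * of the weights. */
double weighted_score(double gflops,      double min_gflops,
                      double bisect_mbps, double min_bisect_mbps,
                      double mem_bw,      double min_mem_bw,
                      double w_gflops, double w_bisect, double w_membw)
{
    return w_gflops * (gflops      / min_gflops)
         + w_bisect * (bisect_mbps / min_bisect_mbps)
         + w_membw  * (mem_bw      / min_mem_bw);
}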

As an alternative to the weighting factors, the CDR provides several application performance models. An application performance model estimates application performance based on design parameters. Application models are usually more accurate than a simple weighting formula because they can use the system parameters, including network topology, in any arbitrary calculation that can be expressed as C code. Currently, application performance models are available for the SWEEP3D benchmark [HoLW00] and the HPL benchmark [PWDC04], but a programming interface is available to add new models.
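
The sketch below illustrates the general shape of such a model: a C function that maps a handful of design parameters to a predicted GFLOPS figure. The struct fields, function name, and derating factors are hypothetical; the real CDR interface and the HPL and SWEEP3D models are considerably more detailed.

/* Illustration only: the CDR's real model interface is defined in its
 * source; this hypothetical struct and function just show the idea of a
 * model that maps design parameters to a predicted performance figure. */
struct design_params {
    int    nodes;
    int    cores_per_node;
    double peak_gflops_per_core;
    double mem_bw_bytes_per_flop;
    double latency_us;            /* nearest-neighbor message latency */
};

/* A toy model: peak rate derated by memory bandwidth and by a communication
 * factor that grows with latency. Real models are far more detailed. */
double predict_gflops(const struct design_params *d)
{
    double peak     = d->nodes * d->cores_per_node * d->peak_gflops_per_core;
    double mem_eff  = d->mem_bw_bytes_per_flop / (d->mem_bw_bytes_per_flop + 0.5);
    double comm_eff = 1.0 / (1.0 + d->latency_us / 100.0);
    return peak * mem_eff * comm_eff;
}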

Design Examples

The CDR can be used to find a single design for an application or for design space exploration. To find a single design, the user simply inputs the relevant constraints and selects the performance model, and the CDR generates the design that best meets those constraints while maximizing performance on the desired benchmark. For example, if the application execution time is dominated by solving a dense linear system of 150,000 equations with 150,000 unknowns using LU decomposition, the High Performance Linpack (HPL) benchmark is likely to predict performance well. Assuming the system will need one matrix of 150,000 by 150,000 double precision floating point values, the cluster will need 150,000 × 150,000 × 8 bytes/double = 180 GB of memory. The memory size constraint can be set by selecting 128 GB to 256 GB in the memory section of the CDR web form, shown in Figure One.

Figure One: Memory entry section of the CDR web form
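
For readers who want to check the arithmetic, this trivial C program reproduces the data-size estimate above; the values match the article's example, not CDR output.

#include <stdio.h>

int main(void)
{
    long long n = 150000;                     /* matrix order       */
    long long bytes = n * n * 8;              /* 8 bytes per double */
    printf("%.1f GB needed\n", bytes / 1e9);  /* prints 180.0 GB    */
    return 0;
}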

The HPL executable is approximately 574 KB, so the code size is set to "1 MB or less." Though HPL is sensitive to memory bandwidth, we do not need to specify a memory bandwidth constraint in the memory section; the HPL performance model automatically adjusts performance based on memory bandwidth. If paying to improve memory bandwidth helps performance more than paying to improve some other aspect of the system, the CDR will select designs with higher memory bandwidth. Otherwise, it will spend money on another aspect of the system with a larger impact on performance. Likewise, network parameters need not be specified, because network performance is already included in the HPL performance model. We assume the budget is $100,000, there are no requirements for disk within the cluster, and there are no constraints on space, power, cooling, or operating costs.

For this input specification, the CDR fully evaluates just over 9,000 designs after eliminating millions of designs with a partial evaluation. The best design is summarized by the CDR in Figure Two. The design has 200 nodes with 13 cold spares and costs just under $100,000. With approximately 400 GB of total DRAM (2 GB per node), it more than meets our constraints of at least 160 GB per cluster for data and 1 MB per node for code. Running at full power, the cluster will need 30.6 kW and 8.8 tons of air conditioning. The design uses a Flat Neighborhood Network (FNN) [DiMa00, HMLDH00] of trees, shown here as Figure Three, to achieve connectivity between all 200 nodes.

Figure Two: Summary of HPL-optimized cluster

The components used in this design are shown in Figure Four. Component prices come from the CDR components database. Prices of non-commodity hardware (like InfiniBand, Myrinet, and Quadrics network interfaces) are entered into the database by hand, and thus do not change frequently. Commodity hardware prices are automatically updated daily using Amazon.com's web API [Levi05]. As faster processors, memory, and networking hardware become less expensive (or more expensive), the best design for a particular application will change.

Figure Four: Components used in the HPL-optimized cluster design

While it is a bit disconcerting to know that the CDR can produce different designs given the same input, following price fluctuations is an important feature of the CDR. It is difficult for a human designer to follow the prices of all of the hundreds of components in the CDR's database. It is even more difficult for the designer to know how those price changes affect trade-offs in price and performance. A designer can try to recompute the trade-offs by hand, but by the time the calculation is done, prices may have changed.


Figure Five shows operating costs based on electrical power consumption for the cluster itself, power consumption for the air conditioning unit, and rental of floor space using recent business power and space rental costs for Lexington, Kentucky. The power cost for this design is a significant fraction of the purchase price — electricity for powering and cooling the cluster costs approximately 16 % of the purchase price. In other regions these costs can be significantly higher. Electric rates in other regions of the U.S. are commonly twice those of Kentucky.

Figure Five: Operating costs of the HPL-optimized cluster
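
The arithmetic behind such an estimate is straightforward. The sketch below uses the 30.6 kW and 8.8 tons from the design summary with placeholder electric-rate and air-conditioner-draw figures, so the printed number is illustrative rather than the CDR's output.

#include <stdio.h>

int main(void)
{
    double cluster_kw   = 30.6;      /* from the design summary       */
    double ac_tons      = 8.8;
    double kw_per_ton   = 1.2;       /* assumed air-conditioner draw  */
    double rate_per_kwh = 0.05;      /* assumed electric rate, $/kWh  */
    double hours_year   = 24 * 365;

    double power_cost = (cluster_kw + ac_tons * kw_per_ton) * hours_year * rate_per_kwh;
    printf("Annual power + cooling: $%.0f\n", power_cost);
    return 0;
}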

For each metric, the CDR outputs metric-specific information. Figure Six shows the HPL-specific output for the top design. This design is expected to have an Rmax of 1,620 GFLOPS with an Nmax of 230,741 elements. This problem size is bigger than the desired problem size because the HPL benchmark allows the problem to be scaled with the cluster size to maximize performance. The metric function could be modified to use a fixed problem size if desired. The other parameters listed are HPL-specific tuning parameters that the CDR used to determine Rmax. Actual performance may differ from predicted performance, but a comparison of the HPL model used in the CDR with the machines on the Top 500 list shows that reported results typically fall within 20% of the result predicted by the HPL model.

Figure Six: HPL metric information for the HPL-optimized cluster
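
A quick way to sanity-check Nmax is to note that HPL fills most of the available memory with one N × N matrix of doubles, so Nmax is roughly the square root of (total bytes / 8). The short program below applies that estimate to 200 nodes of 2 GiB each; the small gap from the reported 230,741 presumably reflects memory reserved for code and the operating system.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double total_bytes = 200.0 * 2.0 * 1024 * 1024 * 1024;   /* 200 nodes x 2 GiB */
    printf("Nmax approx %.0f\n", sqrt(total_bytes / 8.0));   /* ~231,700          */
    return 0;
}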

HPL involves a trade-off between memory size and communication cost. Since the data size is allowed to increase with memory size, it is difficult to know whether money is better spent adding nodes or adding memory. Interestingly, if we relax the constraint on how much memory is required, the fastest HPL cluster for $100,000 is a similar design. It has 248 nodes connected by a FNN of trees. Each node has only 512 MB of memory, giving the cluster just about 120 GB of usable memory. The smaller memory size allows more nodes to be purchased, giving an Rmax of 1,711 GFLOPS. It may seem odd that the CDR chose a design with 2 GB per node when the optimal design for any memory size has 512 MB per node and 1 GB per node would more than meet the memory requirement. However, the balance between the number of nodes and memory in a 1 GB-per-node design makes it slower than the 2 GB-per-node design, and memory parts only come in discrete sizes.

Many applications have additional constraints on the resources available. For example, the design described above takes 126 ft² and needs 8.8 tons of air conditioning. Many researchers do not have that much space or air conditioning available for their cluster. If we constrain memory to 128 GB to 256 GB, limit floor space to two racks, air conditioning to three tons, and power to 60 A (four standard 15 A circuits), then the best design has only 14 powered nodes with an Rmax of 98.7 GFLOPS. The design statistics in Figure Seven show several reasons for losing essentially an order of magnitude in performance. First, power is the limiting factor for this design: it takes only one of the two racks allowed and less than one third of the cooling budget, but all of the power budget. Second, to get enough memory into the few nodes that fit within the power budget, the CDR must choose more expensive 2 GB memory DIMMs; about half the budget is spent on DRAM.

Figure Seven: Best design result for HPL metric with 256 GB, no more than 2 racks, 3 tons of air conditioning, and 60 A

If we relax the power budget and keep the two-rack limit, performance improves dramatically, from 98.7 GFLOPS to 466 GFLOPS. This speedup comes at a higher power cost. When power is not constrained, the draw increases to 171 A, requiring at least 12 separate 15 A circuits. Moreover, the annual power and cooling costs are almost $4,600 versus a little over $2,000 for the 60 A limit. This kind of analysis can help a buyer decide whether an air conditioning or power upgrade is worthwhile.

Conclusion

The CDR allows a designer to quickly evaluate many design options and trade-offs. Pricing information is what makes these trade-offs possible. Without application requirements, it is easy to spend a lot of money on hardware that does not help performance. Without price information, the designer cannot tell how much performance is being given up in one area to buy more performance in another.

We find the CDR useful in exploring the cluster design space when helping people design clusters, and we are working to add more features to improve its usefulness. The web interface, available at http://aggregate.org/CDR, makes the CDR easy to try for new applications. Source code is available through the BDR (Beowulf Design Rules) project at http://www.sourceforge.net/projects/bdr.

References

[DiMa00] Henry G. Dietz and Tim I. Mattox. KLAT2's flat neighborhood network. In Proceedings of the Extreme Linux track of the 4th Annual Linux Showcase, Atlanta, GA, USA, October 2000. USENIX Association.

[HMLDH00] Thomas Hauser, Timothy I. Mattox, Raymond P. LeBeau, Henry G. Dietz, and P. George Huang. High-cost CFD on a low-cost cluster. In Proceedings of SC2000 Conference on High Performance Networking and Computing, Dallas, TX, USA, November 2000. ACM Press.

[HoLW00] Adolfy Hoisie, Olaf Lubeck, and Harvey Wasserman. Performance and scalability analysis of teraflop-scale parallel architectures using multidimensional wavefront applications. International Journal of High Performance Computing Applications, 14(4):330–346, 2000.

[Levi05] Jason Levitt. The Web Developer's Guide to Amazon E-commerce Service. Lulu Press, 2005.

[PWDC04] Antoine Petitet, R. Clint Whaley, Jack J. Dongarra, and Andrew Cleary. HPL — a portable implementation of the high-performance linpack benchmark for distributed-memory computers. http://www.netlib.org/benchmark/hpl/index.html, January 2004.