Read this article and become a cluster design expert! Use a new tool from aggregate.org to determine price and performance before you buy! Get a handle on everything from Ethernet cables to GFLOPS to power and cooling. ClusterMonkey likes to call it the Clustanator; you will probably call it extremely useful.
Whenever someone asks what hardware to buy for their new cluster, the
answer is always, "It depends." It depends on what application the
cluster will be used for, it depends on how much space, power, and
cooling are available, it depends on the costs of operating the
cluster, and it depends on how much the parts that might be used cost.
The standard process is to analyze the application and use rules of
thumb and experience to guess what hardware will work best. The
sophistication of the analysis depends on how much money is involved
and how much computer engineering expertise the designer has.
Large clusters, with price tags in the millions of dollars, justify
spending a lot of effort to characterize the applications that will
run on them and to design the best system for those applications.
However, the vast majority of clusters that get built are smaller,
typically costing anywhere from $10,000 to $200,000 to build. It
is not economically viable to pay an expert to
design such a cluster. The number of components available and the
ways they can be combined can be overwhelming, even for experienced
designers. Moreover, the scientists and engineers wanting a cluster
are typically not computer engineers. Rather, they are experts in
their own field who are just using the cluster as a tool to help them
solve their problems faster than is otherwise possible. Automating
the design process is the key to helping both experienced and inexperienced
designers get the most for their money for these low-cost systems.
The Cluster Design Rules (CDR) is a web-based software tool for designing this
sort of cluster supercomputer. Users specify requirements of their
applications and resources available to them, like power, cooling,
and floor space. The CDR uses these constraints along with a performance
model and a database of available components to find a design that
meets all the constraints and optimizes performance.
The Cluster Design Rules
The CDR models a cluster based on commodity components available to
the end user. The CDR combines network interfaces, cables, switches, motherboards,
processors, memory parts, disk drives, cases and racks from a database
to design a cluster. The CDR searches for viable designs by selecting
a number of nodes and a motherboard type. It then tries to build various
network topologies (no network, ring, 2D mesh, 3D mesh, switched network,
tree, fat tree, flat neighborhood network, flat neighborhood network
of trees) using available network interfaces, switches, and cables.
The remaining system components are added to the design one at a time
until a complete design is available. Complete designs that do not
meet application or resource constraints are discarded. If at any
stage a partial design cannot be completed without violating the application
or resource constraints, the design is removed from the search. For
example, if a partial design costs more than the acquisition budget,
then it cannot possibly be part of a valid full design because adding
more components will only increase the cost.
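The pruning idea described above can be sketched as a small recursive search. This is only an illustration, not the CDR's actual implementation: the parts database, the names `Part`, `search`, and `best_design`, and all prices and GFLOPS figures are invented, and the network topology step is left out entirely.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Part:
    name: str
    cost: float          # dollars per node (hypothetical)
    gflops: float = 0.0  # peak GFLOPS contributed per node (hypothetical)

# Hypothetical parts database; every entry is invented for illustration.
PARTS = {
    "motherboard": [Part("mb-basic", 100.0), Part("mb-dual", 250.0)],
    "cpu": [Part("cpu-slow", 150.0, 4.0), Part("cpu-fast", 400.0, 8.0)],
    "memory": [Part("mem-1g", 50.0), Part("mem-4g", 180.0)],
}
KINDS = ["motherboard", "cpu", "memory"]  # order in which parts are added

def search(nodes, budget, min_gflops, chosen=(), results=None):
    """Extend a partial design one component at a time, pruning early."""
    if results is None:
        results = []
    cost = nodes * sum(p.cost for p in chosen)
    if cost > budget:
        # Adding parts only raises the cost, so no completion can be valid.
        return results
    if len(chosen) == len(KINDS):
        gflops = nodes * sum(p.gflops for p in chosen)
        if gflops >= min_gflops:  # discard complete designs that miss the target
            results.append((gflops, cost, nodes, tuple(p.name for p in chosen)))
        return results
    for part in PARTS[KINDS[len(chosen)]]:
        search(nodes, budget, min_gflops, chosen + (part,), results)
    return results

def best_design(node_counts, budget, min_gflops):
    """Return the highest-GFLOPS valid design, breaking ties by lower cost."""
    designs = []
    for n in node_counts:
        designs.extend(search(n, budget, min_gflops))
    return max(designs, key=lambda d: (d[0], -d[1]), default=None)

if __name__ == "__main__":
    print(best_design([4, 8, 16], budget=5000.0, min_gflops=30.0))
```

Because the budget check runs on partial designs, whole subtrees of the search are skipped as soon as the motherboard and processor alone exceed the budget.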
The application and resource constraints describe the needs of the
application and the resources available to the
user. Table One lists the application
parameters users can set as constraints in the CDR. Some constraints,
like data size, may be known a priori from the application and the
type of problem being solved. Other constraints, like memory bandwidth
and the networking parameters, can either be estimated from source
code analysis and knowledge of the problem or measured by profiling
running versions of the application. Note that memory bandwidth is
specified in bytes/FLOP so that the requirement scales with processor
speed. Likewise, bisection bandwidth is specified per processor core,
because all of the cores in a node share the network links attached
to that node.
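A short worked example may help show why these two units are convenient. The function names and all numbers below are assumptions chosen for illustration; they are not taken from the CDR.

```python
def required_memory_bandwidth_gbs(bytes_per_flop, node_gflops):
    """Memory bandwidth a node must sustain, in GB/s.

    (bytes/FLOP) * (GFLOP/s) = GB/s, so a faster processor automatically
    raises the bandwidth the memory system has to deliver.
    """
    return bytes_per_flop * node_gflops

def bisection_per_core_mbps(bisection_mbps, nodes, cores_per_node):
    """Bisection bandwidth available to each core, in Mbps.

    Cores in the same node share that node's network links, so the
    cluster-wide figure is divided by the total number of cores.
    """
    return bisection_mbps / (nodes * cores_per_node)

if __name__ == "__main__":
    # Hypothetical application needing 0.5 bytes/FLOP on an 8 GFLOPS node.
    print(required_memory_bandwidth_gbs(0.5, 8.0))
    # Hypothetical 8000 Mbps bisection shared by 16 dual-core nodes.
    print(bisection_per_core_mbps(8000.0, 16, 2))
```

Doubling the cores per node in the second call halves the per-core figure, which is exactly the effect the per-core unit is meant to expose.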
Table One: Application constraints modeled by the CDR
| Memory Size for Data (Bytes/cluster) | Message Latency (μs) |
| Memory Size for Code (Bytes/node) | Collective Latency (μs) |
| Memory Size for Operating System (Bytes/node) | Bisection Bandwidth/Processor Core (Mbps/Core) |
| Memory Bandwidth (Bytes/FLOP) | Coordinality (Nodes/Node) |
| Virtual Memory Size (Multiple of node memory) | Number of Nodes/Processors/Cores (n², n³, 2ⁿ) |
| Local Disk Storage (Bytes/Cluster) | GFLOPS |
Resource constraints describe the budget and infrastructure available
to the user. Table Two lists the resource constraints considered by
the CDR. The first resource constraint considered is often the
acquisition budget for the cluster. However, available power, cooling
capacity, floor space, and operating costs are often more important
limiting factors. It is not uncommon for a user to buy as many nodes
as they can afford only to find out they do not have enough power or
that their current air conditioner cannot keep the room cooled. The
CDR considers these constraints and avoids designs that will not fit
within existing infrastructure.
Table Two: Resource constraints modeled by the CDR
| Floor space | Operating Budget |
| Power | Acquisition Budget |
| Air Conditioning | |
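The kind of infrastructure check described above might look like the following sketch. It is not the CDR's code: the function name, the per-node wattage, the rack footprint, and the room limits are all hypothetical. The sketch assumes that every watt drawn ends up as heat the air conditioner must remove, using the standard conversions of about 3.412 BTU/hr per watt and 12,000 BTU/hr per ton of cooling.

```python
BTU_PER_HOUR_PER_WATT = 3.412   # approximate heat output of one watt
BTU_PER_HOUR_PER_TON = 12000.0  # one ton of air conditioning capacity

def infrastructure_violations(nodes, watts_per_node, nodes_per_rack,
                              rack_sq_ft, max_watts, max_cooling_tons,
                              max_sq_ft):
    """Return the names of the resource constraints a design violates."""
    total_watts = nodes * watts_per_node
    # Assumption: all electrical power becomes heat that must be removed.
    cooling_tons = total_watts * BTU_PER_HOUR_PER_WATT / BTU_PER_HOUR_PER_TON
    racks = -(-nodes // nodes_per_rack)  # ceiling division
    floor_sq_ft = racks * rack_sq_ft

    violations = []
    if total_watts > max_watts:
        violations.append("power")
    if cooling_tons > max_cooling_tons:
        violations.append("cooling")
    if floor_sq_ft > max_sq_ft:
        violations.append("floor space")
    return violations

if __name__ == "__main__":
    # Hypothetical room: 20 kW of power, 5 tons of cooling, 50 square feet.
    room = dict(max_watts=20000.0, max_cooling_tons=5.0, max_sq_ft=50.0)
    for n in (48, 64):
        print(n, infrastructure_violations(n, 300.0, 32, 10.0, **room))
```

With these made-up numbers, 64 nodes fit within the power and floor space limits but need about 5.5 tons of cooling, which is the situation the text warns about.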
All designs that meet application constraints and resource constraints
are ranked by a performance metric. The performance metric can
be a weighted sum of system-wide parameters, such as usable GFLOPS, network
bisection bandwidth per processor, network latency, memory bandwidth
per processor, and acquisition cost. Alternatively, the metric can
be based on an application model. When no application model is available
the weighted sum is useful for approximating performance. The weighted
sum method measures each system parameter relative to the minimum
amount specified as a design constraint and multiplies it by a weighting
factor. The most important system parameter is weighted most heavily
followed by the second most important parameter, and so on. Determining
precisely what weightings should be used can be difficult, but
it is usually easy to guess a reasonable range for the weightings.
The CDR runs quickly enough that designs covering a range of weightings
can be computed in a short amount of time. Designs that rank among
the top designs for many combinations of settings are likely to work
well.
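The weighted sum and the sweep over weightings can be sketched as follows. This is a minimal illustration under stated assumptions: the design names, parameter values, minimums, and weightings are invented, and only parameters where larger is better are included, since the text here does not say how the CDR treats lower-is-better parameters such as latency or cost.

```python
from collections import Counter

def weighted_score(design, minimums, weights):
    """Sum of (value / required minimum) * weight over the weighted parameters."""
    return sum(weights[k] * design[k] / minimums[k] for k in weights)

def rank(designs, minimums, weights):
    """Design names ordered from best to worst score."""
    return sorted(designs,
                  key=lambda name: weighted_score(designs[name], minimums, weights),
                  reverse=True)

def top_counts(designs, minimums, weightings, top=2):
    """Count how often each design lands in the top `top` across weightings."""
    counts = Counter()
    for weights in weightings:
        counts.update(rank(designs, minimums, weights)[:top])
    return counts

if __name__ == "__main__":
    # Hypothetical designs and minimum requirements.
    designs = {
        "A": {"gflops": 100.0, "bisection": 200.0},
        "B": {"gflops": 60.0, "bisection": 500.0},
        "C": {"gflops": 80.0, "bisection": 300.0},
    }
    minimums = {"gflops": 50.0, "bisection": 100.0}
    # Sweep the relative importance of GFLOPS from 1x to 8x.
    weightings = [{"gflops": w, "bisection": 1.0} for w in (1.0, 2.0, 4.0, 8.0)]
    print(top_counts(designs, minimums, weightings))
```

A design that keeps appearing near the top as the weightings change is a safer choice than one that wins under only a single setting.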
As an alternative to the weighting factors, the CDR provides several
application performance models. An application performance model
estimates application performance based on design
parameters. Application models are usually more accurate than a simple
weighting formula because they can use the system parameters,
including network topology, in any arbitrary calculation that can be
expressed as C code. Currently, application performance models are
available for the SWEEP3D benchmark [HoLW00] and
the HPL benchmark [PWDC04], but a programming
interface is available to add new models.
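To give a feel for what an application performance model computes, here is a toy example. It is not the CDR's programming interface (CDR models are expressed as C code), and it is not the SWEEP3D or HPL model. The function name, dictionary keys, and all numbers are hypothetical, and the model assumes computation and communication do not overlap.

```python
def estimated_runtime_s(design, app):
    """Toy runtime estimate: compute time plus non-overlapped message time."""
    compute_s = app["total_gflop"] / design["usable_gflops"]
    # Each message pays the network latency plus its transfer time.
    per_message_s = (design["latency_us"] * 1e-6
                     + app["message_bytes"] * 8 / (design["link_mbps"] * 1e6))
    communicate_s = app["messages_per_node"] * per_message_s
    return compute_s + communicate_s

if __name__ == "__main__":
    # Hypothetical application profile.
    app = {"total_gflop": 1000.0, "messages_per_node": 10000,
           "message_bytes": 125000}
    # Two hypothetical designs: more GFLOPS versus a faster network.
    fast_cpu = {"usable_gflops": 100.0, "latency_us": 50.0, "link_mbps": 1000.0}
    fast_net = {"usable_gflops": 80.0, "latency_us": 5.0, "link_mbps": 10000.0}
    for name, design in (("fast_cpu", fast_cpu), ("fast_net", fast_net)):
        print(name, round(estimated_runtime_s(design, app), 2))
```

For this communication-heavy profile, the design with fewer GFLOPS but a faster network comes out ahead, which a weighting that favors GFLOPS could miss.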