How to Write a Technical Cluster RFP

Need a cluster? Read this first!

Writing
a technical RFP (Request for Proposal) for a cluster is not something
to be taken lightly. There are many options available for clusters
that must be researched and addressed before writing the technical
RFP. Moreover, there are things one can put into a technical RFP that can
help discriminate between vendors and help make your decision process
easier (or at least make things clearer). To help with your procurement, there is an RFP outline/template provided at the end of the article.

This article was originally a synopsis of
a tutorial given at the 2003 ClusterWorld Conference and Expo on how to
write a technical cluster RFP. However, it has been expanded and edited
to discuss some topics more in depth and to add topics that weren't
discussed before. 

Introduction

Before
beginning with a discussion of how to go about writing a technical
cluster RFP, we should define some terms and go over some ground
rules. First, let's define a technical RFP as a set of requirements
for the cluster itself, excluding procurement and company policies
such as costs, warranties, delivery schedules, etc. However,
there are cases where you might want to include warranties, support,
delivery schedules, etc., in your technical RFP. Whether you include
these or not is up to you and your company policies. In my opinion,
putting delivery schedules in the technical RFP is definitely warranted
since it can impact the technical requirements.

In many
cases, the RFP makes up the vast majority of the document that is
sent to the vendors from
the procurement group. However, the procurement group adds all of the required
procurement procedures and legal language and policies. This article
is not concerned with the procurement aspects of the RFP and/or
purchasing process since these details vary from company to company.
This article is interested in only the technical details of
the RFP. But, as I mentioned in the previous paragraph, some
details that usually come from a procurement office can and should
be put into the technical RFP because they affect the technical aspects of
the cluster and they affect the technical evaluation.

In this article, I'll walk through the steps you should go through
in writing the RFP. You may be surprised that the first step in
writing an RFP is understanding your code(s).

Step One - Understanding Your Code(s) and Users - Profiling

We all want speed,
glorious speed. And we want it for nothing. While we're at it, it
would be nice if it were reliable, easy to maintain, used only a small
amount of power, had a small footprint, and so on. However, we
can't have this, sigh... So what do we do? Well, we have to find
the best system to meet our requirements. These requirements form the
basis of our technical cluster RFP. This means you have to do your
homework (seems like we never escape homework, doesn't it?). The first
step in writing our technical RFP is to understand the code(s) and
understand the user base for our cluster. The generic term I'm going
to use for this is profiling, but I'm going
to go beyond the usual meaning of the term. I use it
to mean profiling your codes and profiling your user base.

Usage Profiling

Since clusters are almost always for technical computing, many
people ask why they have to understand their user base.
Profiling your user
base is a very important source of information that many people
overlook when preparing for a technical cluster RFP. Determining what
applications the users are running, how often they run them, the
problem sizes, typical number of processors used, what time of day
they run, etc., can be a great source of information.

For example, by
profiling your users, you can determine the largest problem size and
the number of processors for a single job. This will tell you the
minimum "size" of the cluster (where size is the combination of number
of processors and amount of memory per processor). You can also find
the largest number of processors anyone is using in their runs or the
largest amount of memory anyone is using in their runs.
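
As a rough illustration, the sizing logic above can be sketched in a few lines of Python. The record fields and the sample jobs below are hypothetical; a real queuing system will export its accounting data in its own format, and you would load that instead.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """One finished job (hypothetical fields, not any real scheduler's format)."""
    user: str
    processors: int          # processors used by the job
    mem_per_proc_gb: float   # peak memory per processor, in GB


def minimum_cluster_size(jobs):
    """Return (largest processor count, largest memory per processor,
    largest total memory used by any single job)."""
    max_procs = max(j.processors for j in jobs)
    max_mem_per_proc = max(j.mem_per_proc_gb for j in jobs)
    max_job_mem = max(j.processors * j.mem_per_proc_gb for j in jobs)
    return max_procs, max_mem_per_proc, max_job_mem


# Made-up records for illustration only.
jobs = [
    JobRecord("alice", 16, 1.0),
    JobRecord("bob", 64, 0.5),
    JobRecord("carol", 8, 4.0),
]

procs, mem_per_proc, job_mem = minimum_cluster_size(jobs)
print(f"largest job: {procs} processors")
print(f"largest memory per processor: {mem_per_proc} GB")
print(f"largest total memory for one job: {job_mem} GB")
```

Note that the largest processor count and the largest memory per processor may come from different jobs, so treat the three numbers as separate lower bounds rather than as a single configuration.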

It would
also be good to have this same data (problem size and number of
processors) for the past couple of years. You can use this historical
data to make projections about the number of nodes you are likely to
need, how much memory you might have to add, etc. More precisely, if
the cluster is to last 3 years, which is the maximum you should keep
a cluster, you need to know what the cluster should look like in
3 years (more on this later).
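
Here is a minimal sketch of such a projection, assuming the growth is roughly linear (that is my assumption for the example; your own history may grow faster, in which case a straight line will underestimate). The yearly figures are invented for illustration.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def project(xs, ys, target_x):
    """Extrapolate the fitted line to target_x."""
    slope, intercept = linear_fit(xs, ys)
    return slope * target_x + intercept


# Hypothetical history: largest job size (in processors) seen each year.
years = [2001, 2002, 2003]
largest_job_procs = [16, 32, 48]

# Project three years past the last data point, the end of the cluster's life.
projected = project(years, largest_job_procs, 2006)
print(f"projected largest job in 2006: about {projected:.0f} processors")
```

The same function can be reused for memory per processor or total jobs per month; only the input lists change.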

How do you get this
user base information? Good question. There are many ways to get it.
You can survey your user base on a periodic basis (but be careful,
users don't like to fill out too many surveys). You can also watch
your systems and capture job information from the queuing system and
the nodes. Usually monitoring systems can help capture this
information as well.

Application Profiling

The other side of the profiling coin is to profile the
user applications themselves. I'm going to focus on MPI (Message
Passing Interface) applications since they are probably the dominant
class of cluster applications. The first step is to gather a list of
parallel user applications. If you have access to the application's
source code then your life can be a bit easier. Regardless, you need
to profile your applications in the traditional sense of the word.
That is, determine which parts of the code consume the most time
including MPI functions.

How you profile depends
upon the MPI implementation being used. Many of the main MPI
implementations, such as MPICH, MPICH2, LAM, MPI-Pro,
Scali MPI Connect, and OpenMPI, all have ways to
gather profiling information. The systems are flexible enough that
you can gather a great deal of information or a little bit of
data. From this information you want to know the message sizes
and the number of messages passed for the various MPI functions used
in the code. You also should look for how much total time is used in
each MPI function call. 

Now that you have all of this MPI data, what do you do with it?
All of this data will help determine the typical message size and what
MPI function(s) are used the most frequently. You can then pass this
information on to prospective vendors to help them select an
appropriate interconnect to bid. Or you can use the data to help you
specify an interconnect for the RFP.

Since we're talking
about parallel codes, we need to do some profiling on parallel
systems and vary the number of processors. The point of this profiling
is to gather scaling information (how your code performs as you increase
the number of processors) as well as how much the memory usage per node
changes as you increase the number of processors. To gather this
information, if possible, start with one processor or one node and then
increase the number of processors, but don't
use too many processors. Profiling codes can produce a great deal of
information, so be careful how many processors you use. Also, don't
worry about the cluster interconnect you are using. Get some
profiling data first and then worry about the interconnect details
later.

Now that you have this
large database of profiling information of user applications, what do
you do with it? That's a very good question and I'll try not to
weasel out of answering it, but my general answer is that it depends
upon your specific case (how's that for weaseling out of answering
the question!). However, there are a few general things we can do
with the data. First look at the message sizes as a function of the
problem size and number of processes for a particular code. If the
code is using a large number of relatively small messages, then the
code is likely to be more sensitive to interconnect latency rather
than bandwidth. The code is also likely to require a small N/2
for best performance (see Sidebar One for an explanation of N/2).
If the code(s) are using a large number of large
messages, then it is likely that the code is going to be more
sensitive to interconnect bandwidth than latency. If the message
sizes are distributed fairly evenly, with not too many small or large
messages, then the code is probably reasonably well designed and
could scale fairly well, possibly even on Fast Ethernet (be sure to
check the scalability results).
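
The rule of thumb above can be written down as a small classifier. This is only a sketch: the cutoffs for "small" and "large" messages and the 60% threshold are assumptions I picked for the example, not recommended values, and the message sizes would come from whatever your MPI profiling tool reports.

```python
def classify_message_profile(message_sizes,
                             small_cutoff=1024,     # bytes; assumed cutoff
                             large_cutoff=65536,    # bytes; assumed cutoff
                             dominant=0.6):         # assumed threshold
    """Classify a list of MPI message sizes (in bytes).

    Returns "latency-sensitive" when small messages dominate,
    "bandwidth-sensitive" when large messages dominate, else "mixed".
    """
    total = len(message_sizes)
    if total == 0:
        raise ValueError("no messages to classify")
    small = sum(1 for size in message_sizes if size <= small_cutoff)
    large = sum(1 for size in message_sizes if size >= large_cutoff)
    if small / total >= dominant:
        return "latency-sensitive"
    if large / total >= dominant:
        return "bandwidth-sensitive"
    return "mixed"


# Made-up message traces for illustration.
mostly_small = [64] * 8 + [1 << 20] * 2
mostly_large = [1 << 20] * 7 + [64] * 3
spread_out = [64] * 3 + [8192] * 4 + [1 << 20] * 3

for name, trace in [("mostly_small", mostly_small),
                    ("mostly_large", mostly_large),
                    ("spread_out", spread_out)]:
    print(name, "->", classify_message_profile(trace))
```

Counting messages treats every message equally. Weighting by the time spent in each MPI call is a reasonable variation if your profiler gives you that information.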

You should also look
for scalability trends. How does the code perform on the same data
set as you increase the number of processors? You can make the
classic scalability plots of speedup on the y-axis and number of
processors on the x-axis to get the scalability information. You can
also increase the problem size as you increase the number of nodes.
What you are looking for is the increase in MPI traffic as the number
of nodes increases. Does this traffic increase much faster than the
number of processors? If so, how much? This information gives some
insight into how the code scales.
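
For the classic scalability plot, the numbers you need are speedup and parallel efficiency. The sketch below computes both from wall-clock times on the same data set; the timings are invented, and speedup is measured relative to the smallest processor count you were able to test.

```python
def scaling_table(timings):
    """Return (processors, speedup, efficiency) rows.

    timings maps processor count -> wall-clock seconds for the same data set.
    Speedup is relative to the smallest processor count tested; efficiency
    is the speedup divided by the ideal (linear) speedup.
    """
    base_procs = min(timings)
    base_time = timings[base_procs]
    rows = []
    for procs in sorted(timings):
        speedup = base_time / timings[procs]
        ideal = procs / base_procs
        rows.append((procs, speedup, speedup / ideal))
    return rows


# Hypothetical timings (seconds) for one code and one data set.
timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 18.0}

for procs, speedup, efficiency in scaling_table(timings):
    print(f"{procs:3d} processors: speedup {speedup:5.2f}, "
          f"efficiency {efficiency:4.0%}")
```

A steadily falling efficiency column is the numeric version of a speedup curve that bends away from the ideal line.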

Sidebar One: Definition of N/2

The usual way to characterize the bandwidth and latency of an
interconnect is to send messages of varying size between two
nodes through the interconnect. You measure the time it takes for
the messages to travel between the two nodes. From the time you
can compute bandwidth (amount of data per second). Then you can
create a plot of bandwidth on the y-axis and message size on the
x-axis. The peak bandwidth is the largest value of bandwidth on
the plot. Latency is the amount of time it takes to send a
0 byte packet.

During this testing, both nodes are sending and receiving data.
N/2 is the size of the message where you get half of the peak
bandwidth. Why half of the peak bandwidth? N/2 is the message size
where you get the full bandwidth in one direction (i.e., sending
data from one node to another). If you profile MPI codes you
are likely to see lots of small messages. Being able to transmit
these messages as fast as possible helps improve the performance of the
code and improves scalability. A small N/2 means that even small
messages achieve a good fraction of the peak bandwidth. This is why
N/2 is important.
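
To make the definition concrete, here is a small sketch that turns (message size, time) measurements into a bandwidth curve and estimates N/2. The sample measurements are made up, and the linear interpolation between neighboring points is my simplification; a real test would use many more message sizes.

```python
def bandwidth_curve(samples):
    """Convert (message_bytes, microseconds) pairs into a list of
    (message_bytes, bandwidth in bytes per microsecond, i.e. MB/s),
    sorted by message size."""
    return sorted((size, size / usec) for size, usec in samples if size > 0)


def n_half(samples):
    """Estimate the message size at which half of the peak bandwidth is
    reached, interpolating linearly between measured points."""
    curve = bandwidth_curve(samples)
    peak = max(bw for _, bw in curve)
    target = peak / 2.0
    prev_size, prev_bw = curve[0]
    if prev_bw >= target:
        # Even the smallest message tested already reaches half of peak.
        return float(prev_size)
    for size, bw in curve[1:]:
        if bw >= target:
            frac = (target - prev_bw) / (bw - prev_bw)
            return prev_size + frac * (size - prev_size)
        prev_size, prev_bw = size, bw
    raise ValueError("bandwidth curve never reaches half of its peak")


# Hypothetical measurements: (message size in bytes, time in microseconds).
samples = [(100, 10.0), (800, 20.0), (3200, 40.0), (8000, 80.0)]

for size, bw in bandwidth_curve(samples):
    print(f"{size:5d} bytes: {bw:6.1f} MB/s")
print(f"estimated N/2: {n_half(samples):.0f} bytes")
```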

Operation Counting

When you profile
the user codes for MPI information, you can also take the opportunity
to count certain operations of the codes at the same time. Most
modern processors have hardware counters that allow you to count
certain CPU operations while a code is running. You can get to this
information by applying a small patch to the Linux kernel and then
using some user-space tools to extract the information desired.

There are several
counter projects for Linux. The most widespread project, the
Performance Application Programming Interface (PAPI), is hosted at
the University of Tennessee. It is an application programming
interface (API) that allows you to instrument your code to count
events. There are also tools built on top of PAPI that allow you to
watch your codes without having to modify the source code, such as
PapiEx, TAU, Kojak, PerfSuite, and HPCToolkit.
For Pentium or AMD processors, you have to use the perfctr
kernel patch to get system
level access to the counters. You can control what information you
extract, but you should at least extract the following:

Number of floating-point operations performed
Total number of cycles
Total number of instructions
L1 cache misses
L2 cache misses

Coupling the time
it took to run the code with the floating-point operation count
allows you to compute FLOPS (Floating-Point Operations per Second).
You can use the FLOPS information when you increase the number of
processors and the problem size to watch the efficiency of the code.
You can also see the effects of various size caches on the
performance of the code by watching the cache misses.
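
Once you have the raw counts, the derived numbers are simple arithmetic. The sketch below assumes you have already collected the five counts listed above for one run; the dictionary keys are names I made up for the example (they are not PAPI event names), and the counts are invented.

```python
def counter_summary(counts, wall_seconds):
    """Derive rates from raw hardware counter totals for a single run.

    counts uses hypothetical keys: fp_ops, cycles, instructions,
    l1_misses, l2_misses.
    """
    instructions = counts["instructions"]
    return {
        "flops": counts["fp_ops"] / wall_seconds,
        "instructions_per_cycle": instructions / counts["cycles"],
        "l1_misses_per_1000_instr": 1000.0 * counts["l1_misses"] / instructions,
        "l2_misses_per_1000_instr": 1000.0 * counts["l2_misses"] / instructions,
    }


# Made-up counter totals for a 120 second run.
counts = {
    "fp_ops": 240_000_000_000,
    "cycles": 360_000_000_000,
    "instructions": 450_000_000_000,
    "l1_misses": 9_000_000_000,
    "l2_misses": 900_000_000,
}

summary = counter_summary(counts, wall_seconds=120.0)
print(f"{summary['flops'] / 1e9:.2f} GFLOPS")
print(f"{summary['instructions_per_cycle']:.2f} instructions per cycle")
print(f"{summary['l1_misses_per_1000_instr']:.1f} L1 misses per 1000 instructions")
print(f"{summary['l2_misses_per_1000_instr']:.1f} L2 misses per 1000 instructions")
```

Running the same summary at several processor counts and problem sizes lets you see whether the FLOPS per processor holds up as the job grows.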

Step 2 - Testing Candidate Hardware

Now that you know what
applications will be run on the cluster, what kind of datasets will
be used, the size of the datasets, and how the applications behave
and scale, you need to test candidate hardware. While this may seem
like a logical thing to do, many people do not take the opportunity
to test their codes prior to writing the technical RFP. 

Testing before you issue the technical RFP will only give you more
information to help write the specifications or to help judge the
proposals from the vendors (still more homework!). Many vendors or
labs or universities have clusters you can test on and they are very
good about helping people test their codes. The
beowulf mailing list
is a good place to start. I will warn you that some people may give
you free time on their machines for testing, but do not abuse that
privilege by using too much time. Instead determine how much time
you will need and talk to the cluster owner about the best time
to run your tests. You might also discuss the possibility of sharing
your results with the cluster owner.

Try to test on various
processors at various speeds and various cache sizes (if possible).
Also try to test on clusters with various cluster interconnects.
Ideally you would like to test your codes in a cluster with multiple
interconnects so you can sort out the effects of the interconnect on
performance.

Also, try various MPI
implementations. There are various implementations of the MPI
standard and each has its own goals and assumptions. Testing
your code(s) with the various MPI implementations will tell you
which ones are appropriate for the
particular code. Many of the MPI implementations also have tuning
options that you can use to improve performance. Be sure to try these
on the codes since several simple changes can greatly improve code
performance.

As I mentioned before, there are
times when you can't test your codes. Perhaps they are commercial
codes, or they are covered by security rules, or they are
proprietary. All I can suggest is to do your best to test. I have
even heard of vendors shipping small systems so you can do on-site
testing (just be sure to wipe the disks thoroughly or better yet,
replace the disks with new ones). If all of this won't work for you,
then you may have to write pseudo-applications that approximate the
behavior of your codes. This can be a great deal of work, but it
means the codes are easy to ship for testing. In fact, writing
application simulators is exactly what the Department of
Defense Major Shared Resource Centers (MSRC) do. They write codes
that approximate the behavior of user codes.

What happens if you don't test? I have seen many RFPs where it is
obvious the writers have never done any testing. In that case, they
rely on standard benchmarks to measure performance. I've seen people
use HPL (the Top500 benchmark), the NAS Parallel Benchmarks (NPB), and some
other codes to measure performance of the system. The problem is
that in almost every case, they have no idea how the results
of these benchmarks correlate to the performance of their codes.
Moreover, I have seen people purchase clusters based solely on the
HPL result. Unfortunately, the system they
chose based on this performance measure was actually slower
on their user codes than another system that didn't have as high
a Top500 result. This is the danger in not testing - you might end
up with a system that doesn't run your codes as well as you thought.

Step 3 - Selecting Prospective Vendors

At this point you
should have a pretty good idea of how your code(s) perform and you
should have started testing. The next step I recommend is to start
selecting vendors. A good way to get started is to do what every
person does today - use Google. After your Google search, you
might spend some time on the beowulf mailing list asking for advice
and recommendations for vendor candidates. You can
also read the archives of various mailing lists to get an idea of
companies you want to consider, and ask on those lists
which vendors people recommend and which ones
they stay away from. Be careful though, you will get conflicting
recommendations. But sometimes you can get names of companies you
didn't even think of considering. However you choose to build it,
you will end up with a list of various vendors,
some small and some not so small. What do we do with the list? I
recommend classifying the prospective vendors into various categories.

I like to classify
vendors in three ways. The first way to classify vendors is based on
their overall size, sales, and experience with computer systems.
The second way to classify them is
based on their size, sales, and experience with clusters.
And the third way is their understanding of clusters,
particularly how to architect them, how to tune the distribution for
performance, and how they work with you (the customer) to get the
best price/performance possible.

First Classification - Computer System Vendors

The first classification ranks a company on its business across
all computer systems and support.
You can divide companies into the usual Tier 1, Tier
2, and Tier 3 vendors that you read about in the literature. These
breakdowns are usually based on revenue from computer systems and
their surrounding services (e.g. support, etc.). However,
remember that we're talking about the overall size of the company and
sales on all computers, not just clusters. I call these the
"conventional" Tiers.

Second Classification - Cluster System Vendors

You can do the same basic classification as we did before, but base
the tiers on cluster sales.
That is, you can break vendors into Tier 1, Tier 2, Tier 3, etc. vendors
based only on cluster sales. The ranking based on
clusters will definitely be different than the first ranking. I call
these the "cluster" Tiers.

Third Classification - True Cluster Vendors

The third
classification is a bit different from the previous two classifications,
and in many ways becomes a subjective
ranking. But, since I'm on my soapbox here, let me give you my
interpretation of true cluster vendors.

There are many vendors who are quite capable of
assembling computers and connecting them with networks. They
may even put a standard cluster distribution on them. I call these
vendors the "rack-n-stack" vendors. They really don't know much
about clusters but just ship hardware. Unfortunately, you can expect
little support from these kinds of vendors. If you get into trouble
that involves more than, "how do I power off nodes?" then you will
be out of luck. In some cases, they won't even be able to answer
questions about the cluster management system they installed!

The second kind of true cluster vendor
can assemble clusters correctly, put on a cluster distribution with
a cluster management system, and even provide some basic support.
They can help you if you have a node go down or if you have to put a
node back into production, or if you have some basic error. I call
these vendors "basic cluster" vendors. They may have a decent
idea of how to create a cluster, but if you get into trouble with
your applications, they will never be able to help you. Also, they
have little or no idea about how to architect a cluster for
your application(s) and how to tune performance. 

The third group of true
cluster vendors really understand how to assemble a cluster based on
your requirements and your code. They can tune your cluster for
performance for your code(s) and can help you if you have a problem
with your codes. Let me give you an example.

I know of one case where a major
"conventional" Tier 1 vendor and a "conventional" Tier 4 vendor
(who was also a Tier 1 "true cluster vendor") used
identical hardware and installed the same distribution of Linux, but
the conventional Tier 4 vendor tuned their system and the Tier 1
did not. The conventional Tier 4 vendor's cluster ran 20% faster using
the same compilers, compiler options, and MPI! Remember this is using
the same hardware and software. The Tier 4 vendor understood how to
integrate the cluster and tune for performance. I call these vendors
"cluster architects" and unfortunately, there are very few of them. 

New Approach - Team Cluster!!!!

People usually want a single vendor to provide everything - hardware,
software, services, support, warranty, etc. This is often referred
to as the "One Throat to Choke" approach. Since I'm in a
Zen-like mood let me
ask, why? Why do you have to have one company do everything for you?

Looking at the general commodity computer market you will see that
there are companies who are very, very good at hardware. These companies
are very good at making hardware that works well and is very
low cost. Many of them also manage to make money in the process. However,
they are not the best when it comes to software and they are
definitely not the best when it comes to understanding how to architect,
tune, and integrate clusters.

Then we have small companies that I call "cluster architects" as
I mentioned before. They are usually behind the curve on hardware
because of costs and volume, but they know clusters very well,
including cluster software.

Why can't you buy hardware from the hardware vendor and cluster
software and services from a true cluster company? Basically you
build your
own "team" - a hardware vendor, a cluster software and integrator,
and yourself. While this isn't the "one throat to choke" approach
that virtually all IT departments have become fixated upon, it
does have some compelling arguments.

First, this team approach gives you the best performance from the
various aspects of a cluster. You get the best hardware from a
company that truly understands hardware. You get the best
cluster software and integration from a vendor who knows clusters.
The hardware vendor doesn't have to worry about software, which
many of them find to be the bane of their existence. The cluster
vendor doesn't have to worry about hardware which requires a great
deal of time, effort, capital, and usually has low margins.

Also, this approach allows you to select hardware from vendors
who are not thought of as cluster companies. If you are bound by
the "one throat to choke" mentality you have to pick a single
vendor for cluster hardware, software, and services. This limits
your selection. However, what if you could choose a hardware company
that is not a cluster vendor? This gives you a wider range of
companies to choose from - this gives you more flexibility and
possibly better price/performance. What if you could choose a
cluster software/integrator that truly understands clusters?
This company gives you a wide range of cluster software to choose
from (as opposed to the "one throat to choke" who only have one set
of cluster software). This gives you more flexibility and perhaps
better price/performance as well. 

The approach of "decoupling" hardware and software gives you
opportunities that the "one throat to choke" concept does not
give you. For example, you can standardize on a single set of
cluster software and then select the hardware vendor that gives
the best price/performance, the latest hardware, or some specialized
hardware adapted for your application.

Just like real life, there are downsides to this approach. Because
you don't have "one throat to choke" you have to bear more of the
responsibility for the cluster. You can't just call a single phone
number and expect "cluster batman" to fix whatever problem you
have. You now need to make sure the hardware vendor and the
software vendor work together (I would make sure they can do this
prior to buying anything) and that the habit of blaming each other
does not become a common occurrence. Also, this idea depends upon
the cluster software and the hardware vendor adhering to standards.
This allows the hardware piece and the software piece to be
interchangeable.

I can promise you that any IT manager who is reading this is silently
saying that they would never consider this idea and that I'm nuts.
Why would they trade ease of support for more headache on their
part? Well, the answer is simple - you can get much better
price/performance with this approach. As I mentioned before you
can get the best price/performance in hardware and the best
price/performance in software. If the integration is done well,
then you should have the best price/performance system.

This concept is a developing one that customers are starting to
embrace. It bears further thought, but I'm personally betting this
is the wave of the future.

Recommendations (but don't sue me)

You want to have a reasonable number of candidate vendors without
having too many. How many is enough? That's really up to you.
But, there are some easy recommendations I can give you (at least
"easy" in my opinion). Then I will give you my idea of how many
vendors I would pick and which kind of vendors I would pick.

The easiest recommendation I can make is to stay
away from the "rack-n-stack" vendors unless you have some good
in-house cluster staff. If you are trying the "cluster team"
approach, then the "rack-n-stack" companies could make sense. However,
I would recommend looking for some "basic cluster" vendors
since they are plentiful in numbers. Most important, however, is to
spend as much time as you can looking for the "cluster architects."
These are the companies you will want to buy from.

In general, I would select around 4-5 vendors in total. I might
select at
least one conventional Tier 1 vendor, one or two "basic cluster vendors"
which are not conventional Tier 1 vendors, and at least one, but hopefully
two or more "cluster architect" companies. Don't select too many companies
since this will
make your life difficult when doing an evaluation, but also don't
select too few, since you will have a difficult time comparing the
vendors and their offerings.

I have one more
comment about selecting vendors. Don't select a vendor based on just
their size and sales. Companies, or rather IT managers, 
have a tendency to select
companies based on their size and sales (the bigger the better).
People seem to have some comfort in large companies because
they believe they will always be around. While they have a valid
point, remember that clusters, and Beowulfs in particular, are made
from commodity components that can be easily found from other
companies. Also remember that Enron and Worldcom were huge
corporations with large sales and supposedly large cash reserves just
before they went bankrupt.

I have also seen a disturbing trend in cluster sales recently. There
are many people who are shopping for cluster vendors, and making
decisions, based purely on one number. This number is $/GFlop
(price per billion floating-point operations per second). The GFlop
number is either the peak performance or the Top500 performance
(I've seen requests for both numbers). What these people are looking
for is the cheapest hardware possible. By focusing on this single
number, they have immediately eliminated all discussion about the
ability of a vendor to support their hardware, fix problems, interconnect
options, software tuning, cluster management, etc. Also, they have
now eliminated any discussion of the performance of real codes.
It's really sad to hear people ask for these numbers knowing that
they are likely to end up with a cluster that doesn't work as
advertised, and doesn't deliver the best performance on user codes.
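
A tiny example shows how the two ways of ranking can disagree. Everything here is invented for illustration: the two quotes, the HPL results, and the run times of "your" code. The second metric (price multiplied by run time) is just one simple way to fold real application performance into the comparison, not a standard measure.

```python
def dollars_per_gflop(quote):
    """Price divided by the benchmark GFlop figure; lower looks cheaper."""
    return quote["price"] / quote["hpl_gflops"]


def cost_time_product(quote):
    """Price multiplied by the run time of your own code; lower is better."""
    return quote["price"] * quote["app_seconds"]


# Hypothetical vendor quotes and test results.
quotes = {
    "System A": {"price": 100_000, "hpl_gflops": 500.0, "app_seconds": 1200.0},
    "System B": {"price": 110_000, "hpl_gflops": 440.0, "app_seconds": 900.0},
}

best_by_gflop = min(quotes, key=lambda name: dollars_per_gflop(quotes[name]))
best_by_app = min(quotes, key=lambda name: cost_time_product(quotes[name]))

for name, quote in quotes.items():
    print(f"{name}: ${dollars_per_gflop(quote):.0f}/GFlop, "
          f"{quote['app_seconds']:.0f} s on the user code")
print("cheapest per GFlop:", best_by_gflop)
print("best on the user code for the money:", best_by_app)
```

In this made-up case the system that wins on $/GFlop loses once the run time of the real code is taken into account.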

Step 4 - Technical Specifications

So we understand the code(s) that will be used on the cluster and
we understand our user base. Hopefully you are testing the codes on
clusters with various interconnects, processors, etc. Plus we are
investigating what vendors
we might select. The next step is the fun one - defining the technical
specifications!!

While this sounds like the fun section of this article, I'm afraid I
can't write too many details because the exact specifications
depend upon the specific code(s), the results of your testing,
and your specific company or
specific needs. But I will discuss various aspects of clusters and things to
consider. OK, it will be like a laundry list, but at least it will
give you some ideas to consider as well as some things to think about.

Aspects of clusters

Processor
   Integer Computations
   Floating-point Computations
   Memory bandwidth
   Cache (L1 and L2)
   Power (and heat)
Interconnect
Storage
   Disks
   Storage Network
   File System
Software
   Cluster Management
   Compilers
   MPI
Infrastructure
   Power requirements
   Weight
   Size of racks
   Cooling requirements
Support
Warranty

and so on. Let's talk about some of these aspects.

Processors

Processors are what people typically fixate upon. Who can blame them?
Clusters are for computing and processors do the computing so it's
logical to focus on them. However, they are but one part of clusters.
But we are all human, or rather geek, so let's talk about processors!

Currently there are two main processor sources - Intel and AMD. Both
make processors for the server market. I'm not going to review them in
any detail in this article, but rather I'll just make a few comments
about them, focusing on setting requirements.

Integer Computations

You may have some codes that are primarily driven by integer computations.
These codes don't have to be life science codes that do a great deal of
pattern searching. I have seen many engineering codes that are heavily
driven by integer computations. One way to specify an integer computation
capability is by requiring a certain level of integer performance. A
good way to do this is by specifying a certain SpecInt2000
level of performance. However, these results are for single core chips.
If you have more than one socket or more than one core on a chip (pretty
soon this will be the standard) then a better way to specify integer
performance is by SpecInt2000_rate.
It basically measures the integer performance as you add cores.

Floating-Point Computations

In technical computing people tend to focus on floating-point computations.
How do you measure the floating-point performance so you can specify it?
One way is by requiring a certain level of floating-point capability. A
good way to do this is by specifying a certain SpecFP2000
level of performance. As with the integer performance, this measure of
performance is for a single core. For multi-core chips and multi-socket
systems, a better way to specify floating-point performance is by
SpecFP2000_rate.

Memory Bandwidth

In my opinion, memory bandwidth is one of the most underrated performance
measures for clusters. It measures how fast one can move data in and out of
memory. Many computational codes need as much memory bandwidth as possible.
Fortunately there is a simple benchmark for measuring this performance. The
Stream benchmark measures
sustainable memory bandwidth for a simple vector kernel.

However, the Stream benchmark is not perfect. There are other
memory bandwidth benchmarks available that will predict worst case memory
performance. If possible, you should ask for these benchmarks as well.

Cache (L1 and L2)

Cache is generally important, particularly if the code is what is called
"cache-friendly". Some codes are very dependent on cache size and
cache design. For example, structural analysis codes depend heavily on the
L1 and L2 cache size and design or performance. You can specify L1 and L2
cache sizes if your codes are cache-friendly, but be careful not to
specify cache sizes that aren't realistic. 

Power and Heat

Processor power and heat are becoming two of the biggest concerns for
cluster users. The reason is that current processors use a great deal
of power for computing. The resulting heat has to be cooled in some fashion
so the nodes don't overheat. Cooling requires even more power.

If you feel it is appropriate to specify the required power for the
CPU and/or the heat produced by the CPU, you certainly can do this.
I would caution you to talk to the vendors to specify realistic
values.

Interconnect

There is a huge range of choices for cluster interconnects. Everything
from simple Fast Ethernet in a single unmanaged switch to complicated
network topologies with layers of Infiniband switches is available for
clusters. When you write a technical RFP, what should you do?

I recommend one of two options. The first option is for you to specify
exactly what interconnect you want. I don't recommend you specify the
hardware in detail, unless you are constrained to use that hardware for
some reason. The second option is to ask for RFP responses for several
interconnects. However, with each interconnect you should ask for
benchmark results with your codes. Then you can do a price/performance
analysis of the various interconnects.
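
A minimal sketch of such a price/performance comparison follows. The interconnect names, prices, and relative speeds are made-up placeholders, and the function name is hypothetical; in practice the performance figure would come from the vendors' benchmark runs of your own codes.

```python
def price_performance(quotes):
    """Return (name, cost per unit of performance), lowest cost first.

    quotes maps interconnect name -> (total_cost, relative_performance),
    where relative_performance is your code's speed relative to a baseline.
    """
    ranked = [(name, cost / perf) for name, (cost, perf) in quotes.items()]
    return sorted(ranked, key=lambda item: item[1])


# Hypothetical quotes: (price in dollars, speed relative to baseline)
quotes = {
    "Interconnect A": (100_000.0, 1.00),
    "Interconnect B": (150_000.0, 1.25),
    "Interconnect C": (171_000.0, 1.80),
}

for name, dollars_per_unit in price_performance(quotes):
    print(f"{name}: {dollars_per_unit:,.0f} dollars per unit of performance")
```

In this made-up case the most expensive option is also the best value, which is exactly the kind of result a simple price comparison would hide.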

One thing I don't recommend is asking for all possible interconnect
options. Usually cluster vendors will only recommend a few interconnects,
but I have seen customers who have asked for benchmark results for
just about all possible interconnects for a wide variety of codes.
This causes problems for the vendors, forcing them to spend a great deal
of time benchmarking and developing quotes.

A reasonable alternative to specifying interconnects is to specify
interconnect characteristics that you know work well for your code.
For example, you could specify a lower bound on bandwidth, and an
upper bound on latency and N/2.
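
To make these bounds concrete, here is a small sketch that checks a set of measurements against them. It assumes N/2 means the message size at which half of the peak bandwidth is reached, and it approximates that by the smallest measured size that reaches half of peak. The sample data, thresholds, and function names are all hypothetical.

```python
def n_half(samples):
    """Smallest measured message size reaching at least half of peak bandwidth.

    samples is a list of (message_size_bytes, bandwidth_mb_s) pairs.
    """
    peak = max(bw for _, bw in samples)
    return min(size for size, bw in samples if bw >= peak / 2.0)


def meets_requirements(samples, latency_us, min_bandwidth, max_latency_us, max_n_half):
    """Check measured values against the bounds stated in the RFP."""
    peak = max(bw for _, bw in samples)
    return (peak >= min_bandwidth
            and latency_us <= max_latency_us
            and n_half(samples) <= max_n_half)


# Hypothetical measurements, not taken from any real interconnect.
samples = [(64, 10.0), (1024, 60.0), (8192, 95.0), (65536, 110.0)]
print("N/2 (bytes):", n_half(samples))
print("meets bounds:", meets_requirements(samples, latency_us=40.0,
                                           min_bandwidth=100.0,
                                           max_latency_us=50.0,
                                           max_n_half=4096))
```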

Storage

Storage is such a broad topic that I can't really discuss all of the
details and options in this article. One suggestion I can make is to
characterize storage by three items: disks, storage networks, and
file systems. Then you can either request information on these aspects
in the vendor response or you can specify bounds on these aspects.

Disks

You can specify certain types of disks if you like. What I think is a
better idea is to specify characteristics you think you need. For
example, you could ask for disk speeds, access times, number of
platters, or power requirements.

Alternatively, rather than specify particular disk characteristics
you could also specify which disk manufacturers you do not want.
I have used this approach on technical RFPs when I knew of bad disks
and did not want to have those disks show up in a cluster response
from a vendor.

Storage Network

You are likely to need some sort of single name space file system,
whether something as simple as NFS or as complicated as a high-speed parallel
file system. This file system will have to use a network to allow
the nodes access to the data. Which network should it use? Can it
share the network with computational (MPI) traffic? If it needs its
own network, what characteristics do you need?

File System

As with interconnects, there are a number of options for file systems
for clusters. They range from the simple, such as NFS, to the more complex,
such as high-speed parallel file systems. You can specify a single file
system, but I think it's better to specify the IO requirements you
need. What kind of IO speed do you need simultaneously on how many
nodes? What kind of metadata access do you need? What size files do
you typically work with?

By specifying the IO requirements you allow the vendor to choose a
file system that they believe meets your requirements and one that
they think will be stable and robust. One important aspect to consider
is asking if the vendor has experience with the file system.
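
One way to turn "what IO speed on how many nodes" into a number you can put in the RFP is sketched below. It assumes a simple case where every node writes one file of the same size and all of them must finish within a time window; the node count, file size, and window are hypothetical, as are the function names.

```python
def required_aggregate_mb_s(nodes, file_size_mb, window_seconds):
    """Aggregate write rate needed for every node to finish within the window."""
    return nodes * file_size_mb / window_seconds


def required_per_node_mb_s(nodes, file_size_mb, window_seconds):
    """Share of the aggregate rate that each node must sustain."""
    return required_aggregate_mb_s(nodes, file_size_mb, window_seconds) / nodes


# Hypothetical case: 64 nodes each write a 2000 MB file within 128 seconds.
print(required_aggregate_mb_s(64, 2000.0, 128.0), "MB/s aggregate")
print(required_per_node_mb_s(64, 2000.0, 128.0), "MB/s per node")
```

Real workloads are rarely this uniform, so treat the result as a starting point for the requirement rather than a guarantee of application performance.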

Another thing to consider is why do you need one file system? Disks
and networks are fairly cheap so why have just one file system?
For example, you could use a more resilient file system for your
home directories and then something like PVFS2 for a high-speed
parallel file system for running your codes.

Software

Software is a bit more nebulous to specify because there are so
many options. It's better to specify functionality rather than specifics
if you can help it. However, there may be a time when you have to
specify a certain software because you have standardized on it or have
found that it gives you the best performance.

Cluster Management

Cluster management is a topic likely to provoke lots of argument. Which
one is better? Which one is faster? Which one is more stable? Which one
does blank? While there are a huge number of cluster management packages
available I think it's better to specify functionality rather than a
specific package.

If you specify functionality, you need to be as specific as possible. For
example, you might state that from the cluster master node you want
to be able to shut down the power to a single node, group of nodes, or
all nodes with a single command. If you can, be as specific as possible
to help the vendors develop the best solution for your situation.

Compilers

Compilers, fortunately, are a bit simpler than cluster management
software. In the RFP you just need to specify which compiler or
compilers you want. Which one(s) you should specify is dependent
on your test results. The only advice I can offer is to be sure you
specify exactly what you want (just saying "I want the Portland
Group compilers" can lead to problems).

MPI

I think MPI libraries are one of the most important aspects to good
performance. Consequently, specifying the correct MPI library or
libraries is very important for the RFP.

On the other hand, you can try the opposite approach and let the vendor
select the MPI library based on benchmarks. You would specify the benchmarks
you want to see run, preferably your own codes, and then let the vendor
pick the best one in terms of performance or price/performance. This
approach will also help tell you if the vendor is a cluster architect
company or just a box company. If they are a true "cluster architect"
company, then they will test multiple MPI libraries and show you which
one is best in their response. Others will just reach for the standard
MPICH1 MPI library and give you the results.

Infrastructure

This is an often neglected aspect in a technical RFP. However, as I
will mention, these aspects can often make or break a successful
cluster.

Power Requirements

One requirement you can add to your RFP is the power requirement.
You can either set an upper bound on the power or you can simply
ask the vendor to tell you the power requirements for their proposed
system.

I would go beyond a simple number for the power requirements to include
the number and type of power connectors. Many times I have seen confusion
between the vendor and customer on this issue. You should explicitly
ask for a certain number and type of connectors or you should ask the
vendor to specify what they propose. All too often I have seen customers
not read the proposal and vendors not explicitly tell the customer
what kind of connectors they are proposing and how many.

Weight

I've heard stories of customers who have received fully loaded racks
and put them on raised floors only to watch them tip over because the
supports collapsed. I don't know if they are true, but you need to
pay attention to the total weight of the system and the weight of each
rack. As with the other aspects, you can set a weight limit or you
can ask for the total weight of the system to be reported.

One aspect that tends to be ignored in regard to weight is the route
to the final location of the cluster. I highly recommend that you
walk the delivery route of the cluster to its final location. Then
check with the facilities people as to the weight bearing limit of
the floors, elevators, ramps, lifts, etc. along the route.

Size of Racks

In addition to weight, the dimensions of the racks are also important.
If the cluster is to be in a raised floor area, then be sure to know
the distance from the rack to the ceiling. Many times there are
restrictions on the height of racks relative to the ceiling. Also,
more fundamentally, be sure the system will fit in the area you intend.

Similar to the discussion on weight, you also need to pay attention
to the route from the delivery dock to the final cluster location. Be sure
to find out the lowest clearance along this route as well as the
minimum width and depth of any turns along the way. I have seen several
cases where there is no way to get the cluster to the final location
without the cluster being totally dismantled. My favorite story was
an end user who ordered a rack, but didn't tell the vendor that it was
to go on the second floor, in an office cubicle, and there was no
elevator.
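
The weight and size checks along the delivery route can be written down as a simple checklist. The sketch below is one way to do it; the rack dimensions, the route segments, and all of the limits are hypothetical, and it ignores the depth of turns, which you would still need to measure by hand.

```python
from dataclasses import dataclass


@dataclass
class Rack:
    height_in: float
    width_in: float
    weight_lb: float


@dataclass
class RouteSegment:
    name: str
    clearance_height_in: float
    clearance_width_in: float
    weight_limit_lb: float


def route_problems(rack, route):
    """List every place along the route where the loaded rack will not pass."""
    problems = []
    for seg in route:
        if rack.height_in > seg.clearance_height_in:
            problems.append(f"{seg.name}: rack is too tall")
        if rack.width_in > seg.clearance_width_in:
            problems.append(f"{seg.name}: rack is too wide")
        if rack.weight_lb > seg.weight_limit_lb:
            problems.append(f"{seg.name}: rack is too heavy")
    return problems


# Hypothetical rack and route, for illustration only.
rack = Rack(height_in=84.0, width_in=24.0, weight_lb=1800.0)
route = [
    RouteSegment("loading dock door", 96.0, 48.0, 5000.0),
    RouteSegment("freight elevator", 90.0, 40.0, 1500.0),
    RouteSegment("machine room door", 80.0, 36.0, 4000.0),
]
for problem in route_problems(rack, route):
    print(problem)
```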

Cooling Requirements

Along with weight, size, and power, the overall cooling requirements
are very important. However, I would not stop with just asking for
or specifying the total amount of cooling. I would also ask for
or specify the pressure from any vented tiles (if it's a raised floor
machine room). I've seen cases where a machine room has enough cooling
but not enough pressure to effectively cool the nodes at the top of
a rack.
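
Cooling is often quoted in tons of air-conditioning, so it helps to be able to convert from the power numbers the vendor gives you. The sketch below uses the standard conversions (1 watt is about 3.412 BTU/hr, and 1 ton of cooling is 12,000 BTU/hr) and assumes that essentially all electrical power drawn by the cluster ends up as heat. The node count and per-node wattage are hypothetical.

```python
BTU_PER_HOUR_PER_WATT = 3.412142
BTU_PER_HOUR_PER_TON = 12000.0


def watts_to_btu_per_hour(watts):
    """Heat load in BTU/hr, assuming all electrical power becomes heat."""
    return watts * BTU_PER_HOUR_PER_WATT


def cooling_tons(watts):
    """Tons of air-conditioning needed to remove the given heat load."""
    return watts_to_btu_per_hour(watts) / BTU_PER_HOUR_PER_TON


# Hypothetical cluster: 128 nodes drawing 300 W each.
total_watts = 128 * 300.0
print(f"{total_watts:.0f} W -> {watts_to_btu_per_hour(total_watts):,.0f} BTU/hr "
      f"-> {cooling_tons(total_watts):.1f} tons")
```

This gives only the total load; it says nothing about air pressure or airflow at the top of a rack, which still has to be asked for separately.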

Support

Support is one of those things that depends upon your company policies
and/or your personal tendencies. If your company requires a certain
level of support, then this is something you need to specify. However,
if you have some flexibility in choosing a support model, then you
have many options. I recommend asking for pricing for at least two support
models.

The first support model is the traditional enterprise model of 4-hour
on-site service/repair/replace model. This model can be quite expensive,
but this is what traditional IT managers have come to expect. I think
that IT managers don't truly understand the cluster concept. Why worry
about one node out of many going down? You only lose a small percentage
of the compute power of the cluster and you can still run jobs on the
cluster. So why spend huge amounts of money to get a node back into
production quickly when it has only a small impact on production?
However, one good thing this support model does is tell you the upper
end of support costs.

The second support model is at the lower end of the support spectrum.
Here you mail the node back to the vendor, who diagnoses it, repairs it,
and mails it back. Plus, email support for normal "problems" within the
cluster is included. To me this support model makes much more sense
because it gets the nodes repaired and/or replaced, plus it covers
software problems.

There is also a third alternative that I would consider. It
is really a mix of the two previous models. The model focuses on the
critical pieces of the cluster. The model has critical response for
the cluster interconnect and perhaps the master node and storage
(you can decide what "critical response" means to you) and uses a
lower support model on things such as the compute nodes and software.
However, I would perhaps put a clause into the support contract to
cover systematic software problems and systematic hardware problems.
I consider this to be a "balanced" support model.

Warranty

A warranty is a bit different than support. But you can use a warranty
to effectively replace support for hardware. I know of one major site
that does this very effectively. The concept is that you just need
hardware fixed and/or replaced in a timely manner. By asking for a
warranty that covers the hardware for the life of the machine, you
can get rid of support costs. Of course, this doesn't cover software
support, but if you have a good staff, then the software support is
effectively covered except for upgrades which can be handled directly
by the software vendor.

Another option you might want to think about is having on-site spares.
Choose some percentage of the number of nodes - perhaps 2% - and require
that many nodes to be on-site. However, rather than have them sit on
a shelf somewhere, put them into production. You get the performance
from them while all nodes are operating. Plus, you can lose up to
that many nodes before you go below the required number. For example,
if you have 4 on-site spare nodes, you can lose up to 4 nodes before you
go below the required number.
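
The spare-node arithmetic is simple enough to sketch. The 2% default matches the figure suggested above, rounded up to a whole node; the cluster size and function names are hypothetical.

```python
import math


def spare_nodes(required_nodes, spare_percent=2):
    """Number of on-site spares, rounded up to a whole node."""
    return math.ceil(required_nodes * spare_percent / 100)


def shortfall(spares, failed_nodes):
    """Nodes below the required count after failures (0 while spares cover them)."""
    return max(0, failed_nodes - spares)


required = 128  # hypothetical cluster size
spares = spare_nodes(required)
print(f"{required} required nodes -> {spares} spares, {required + spares} installed")
for failed in range(spares + 2):
    print(f"{failed} failed: {shortfall(spares, failed)} below requirement")
```

As long as the number of failed nodes does not exceed the number of spares, the cluster still has its required node count; the first failure beyond that puts it one node short.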

Step 4 - Let's Write the Technical RFP

Now that you have a
good set of requirements in hand, your homework is done and you are
ready for your final exam - writing the technical RFP. There are a
few general guidelines I recommend in doing the actual writing.
First, always be as specific as possible to reduce miscommunication
with the vendor. However, always be ready for miscommunication since
it will happen. Also, be as flexible as possible to allow the vendor
to innovate and, hopefully, demonstrate their ability to provide good
value. If you can, provide figures. They will help reduce the
miscommunication. And finally, be as concise and clear as possible
(sounds like an English exam doesn't it?). 

What you put into the
technical cluster RFP is really up to you and your specific case. At
the bare minimum you need to specify what hardware you want (or don't
want) and possibly what kind of codes you want to run. However, there
are some things I strongly recommend go into the RFP. Let's
break these things up into three categories: miscellaneous, company,
and benchmarking.

Miscellaneous

The miscellaneous
category is one that many people forget. This category deals with
things such as the environmental aspects and practical aspects.
Environmental aspects include what kind of power and connections you
need for the cluster, the footprint of the cluster, weight, height,
required cooling, noise, etc. These are very important things to
consider when planning your cluster. You can either request this
information from the various vendors or you can make them
requirements in your technical cluster RFP. One thing you need to
think about is getting the cluster from its delivery point to its
final destination. Make sure the path can accommodate the weight and
the size, particularly the height, of the cluster (you would be
surprised how many people forget to check all of the
doorways from the delivery point to the server room!).

Miscellaneous items
also include things like asking for the OS to be included in the
response to the RFP (unless you specify it otherwise) and a
description of the cluster management system (CMS) including
instructions on how to rebuild the cluster from bare metal, restoring
a node, bringing down the cluster, bringing the cluster up, and
monitoring the cluster. It should also include a request for details
about the warranty period and exactly what is covered and
not covered in the warranty. Moreover, you should request details
about tech support. The vendor should provide a single phone number
and point of contact (POC) for your cluster. They should also provide
details on what is supported (including possible OS problems), and
how quickly they will respond to the problem. This should cover both
hardware and software support. Be sure to ask about installing
commercial applications after the cluster is installed. Some vendors
will use this as an excuse for not supporting their clusters. Ask if
the vendor installs and supports MPI. Also, inquire about security
patches and the procedure for installing them. You can also ask the
vendor about what patches they apply to the kernel (if they do patch
the kernel).

Company

The next category, "company", allows you to get some information about the company
itself. You can ask for information about things such as total
cluster revenue, but many companies don't like to volunteer that
information and as far as I know, they are under no obligation to do
so. But some good things to ask about the company are:

What have they given to the Beowulf community?
Do they support open-source projects and if so which ones and how deep is their involvement?
Do they support open-source cluster projects and if so which ones and how deep is their involvement?
Do they support on-site administrator training? (I suggest requiring this)
Does the vendor stock replacement parts for your cluster?
Ask the company to provide a description of a support call that has gone well.
Ask the company to provide a description of a support call that has not gone well. What did they do to recover from it?
Can they describe their experience with clusters? Experience with Linux and clusters?
What do they do to tune their clusters for performance?
Recommendations from customers?

I'm sure you can think
of more good questions to ask the vendors. If the vendor is a good
one, then it should have no problem in answering these questions.

Benchmarks

In the final category,
"benchmarks", I recommend asking the vendors to run benchmarks on
the proposed hardware. There are several reasons for running the
benchmarks: it forces the vendor to actually test the proposed
hardware; it determines which vendors can run and complete
all of the benchmarks; it allows a direct comparison between vendors
on the benchmarks; it allows the vendors to show off their tuning
capabilities; and it gives the vendor some flexibility so they can
show their knowledge of clusters.

Ideally, you should have the vendors run your benchmarks. This will give
you the best information on the performance of the cluster running your
applications with your data sets. However, I know this isn't always
possible. The next best thing is to run synthetic or open-source
benchmarks.

The benchmarks I
recommend asking the vendors to run are all open-source, so there are
no issues with obtaining them. However, be warned that these are
"synthetic" benchmarks in that they are not your codes. So, I
wouldn't recommend betting the speed of your codes on the results of
the benchmarks. I recommend asking for four varieties of benchmarks:
nodal benchmarks which are sometimes called micro-benchmarks; network
benchmarks; message passing benchmarks; and file system benchmarks. I
would have the vendor run various codes in each category several
times and report the average, standard deviation, geometric mean, and
the raw scores. This allows you to see the spread of the scores.
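
The summary statistics are easy to compute yourself from the raw scores, which is a useful cross-check on what the vendor reports. The sketch below uses Python's standard statistics module (geometric_mean requires Python 3.8 or later); the seven run results are made-up numbers, and the standard deviation shown is the sample standard deviation.

```python
import statistics


def summarize(scores):
    """Average, sample standard deviation, and geometric mean of benchmark runs."""
    return {
        "raw": list(scores),
        "average": statistics.mean(scores),
        "std_dev": statistics.stdev(scores),
        "geometric_mean": statistics.geometric_mean(scores),
    }


# Hypothetical results from seven runs of one benchmark (e.g. MB/s).
runs = [2510.0, 2495.0, 2502.0, 2388.0, 2507.0, 2499.0, 2511.0]
for key, value in summarize(runs).items():
    print(key, value)
```

A single low run, like the fourth one here, barely moves the averages but shows up clearly in the raw scores and the standard deviation, which is why I ask for all of them.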

Nodal benchmarks, such
as lmbench, stream, and cpu_rate, are very
useful in measuring various aspects of the performance of the
nodes. This allows you to differentiate between various hardware
offerings from the vendors.

Network benchmarks can
spot problems in network configurations and also allow you to compare
network performance between vendors. The best program for doing this
is probably Netpipe.
I would require that Netpipe be run in several
ways including via MPI over the proposed network. Also, I would
require the vendor to run MPI Link Checker from Microway on the test
cluster. As an added precaution, I would require it to also be run on
the delivered cluster with some guarantee about performance (latency
and bandwidth) between all connections.

The next category of
benchmarks, message-passing benchmarks, is useful for comparing
vendors and their ability to tune the cluster for performance. The
current best set of message-passing benchmarks is the
NAS Parallel Benchmarks. 

The final category,
file system benchmarks, will give you performance numbers for local
disk performance (if you have disks in the nodes), and file system
performance over the network. Benchmarks such as IOZone, Bonnie++,
and Postmark are useful for performance testing. I would have the
vendors run the benchmarks using the proposed file system (you can
either pick the file system or let the vendor choose) on the proposed
cluster nodes. I would also have them run the exact same benchmarks
on any NFS mounted file systems you are using in the cluster. Be sure
to ask the vendor to provide the mount and export options for all NFS
exported file systems.

Step 5 - Selecting Prospective Vendors

Once you get back the
benchmark results and the cost of the proposed solutions, it's time
to either select the winner from a technical point of view or to down
select to the finalists. However, before you do this, I would suggest
developing a scoring scheme for the various aspects of the cluster.
You could assign certain scores for completing each benchmark and
another score based on performance on the benchmark. You can also
assign scores based on other factors such as weight, power, cooling,
etc. Then when you receive the response from the vendors, you can
give them an overall technical score.

Depending upon the
procurement procedures in your company, the technical score may only
be a certain percentage of the overall score. For example, the
technical score could be 70% of the total and the remaining 30% could
be the score of the vendor itself including cost. The procurement
policies of your business or lab or university will determine the
final breakdown between the technical score and the other scores.
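
As a small worked example of combining the scores, the sketch below applies the 70/30 split mentioned above. It assumes both scores are on the same scale (0 to 100 here); the vendor names and scores are hypothetical, and your own procurement rules determine the real weights.

```python
def overall_score(technical, vendor, technical_weight=0.70):
    """Weighted combination of a technical score and a vendor score."""
    return technical_weight * technical + (1.0 - technical_weight) * vendor


# Hypothetical (technical score, vendor score) pairs on a 0-100 scale.
responses = {
    "Vendor X": (80.0, 90.0),
    "Vendor Y": (88.0, 60.0),
}
for name, (technical, vendor) in responses.items():
    print(f"{name}: {overall_score(technical, vendor):.1f}")
```

Here Vendor X comes out slightly ahead despite the lower technical score, which shows how much the non-technical 30% can matter.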

A few quick comments
about the process. Once you have the RFP developed, send it out!
Don't waste time. Also, be sure to give the vendors time to
perform the benchmarks. If you don't hear from the vendors, be sure
to check with them to make sure they understand everything and are
making progress. Also, be flexible, because there will be questions
and concerns from the vendors. If there are changes to be made, be
sure all vendors know about it. Also, don't share pricing
information or benchmark results between the vendors. And finally,
beware of companies that low bid. They are trying to buy your current
business but may end up costing you down the road.

Step 6 - How to Down Select

Now that you have sent
out the RFP (or rather the procurement people have sent it out) and
you have gotten the responses back from the vendors and have scored
all of the vendors, how do you down select or select the winner?
Well, that's a very good question, and one that is difficult to
answer, because it depends upon your procurement policies. However, I
can offer a few words of advice.

Be sure to define the
scoring before you send out the RFP, but don't tell the
vendors the scoring. Then when you get the responses back, put together
a review team that has a vested interest in the cluster. Have the
team members score the vendors and on the basis of the scores, rank
the vendors. Then have the team discuss the ranking of the vendors
and perhaps make adjustments in the rankings. Then you can select the
winner or the vendors to be considered for the final competition.
Depending upon the policies of where you work, your winner(s) will
have to be filtered through procurement and the central IT
management. Be ready to explain the scoring system, the actual
scores, the rankings based on the scores, and any adjustments done to
the rankings. 
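
Keeping the individual scores and the resulting ranking in a simple, reproducible form makes that explanation easier. The sketch below averages each vendor's scores across the review team and ranks the vendors; the names and numbers are hypothetical, and any adjustments made after discussion would be recorded separately.

```python
from statistics import mean


def rank_vendors(team_scores):
    """Rank vendors by average score across review team members, best first.

    team_scores maps vendor name -> list of scores, one per team member.
    """
    averages = {vendor: mean(scores) for vendor, scores in team_scores.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores from a three-person review team.
team_scores = {
    "Vendor X": [82.0, 79.0, 85.0],
    "Vendor Y": [74.0, 88.0, 75.0],
    "Vendor Z": [90.0, 86.0, 88.0],
}
for place, (vendor, average) in enumerate(rank_vendors(team_scores), start=1):
    print(f"{place}. {vendor}: {average:.1f}")
```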

Then, if you can,
provide feedback to the companies that were not selected. This
feedback will help them improve their product offering for the next
competition you have.

Example Scenario

In Sidebar Three, a sample
technical RFP for the scenario below is listed. This scenario is totally fictitious and does not
represent any real competition that I'm aware of. Also, the numbers
used in the sample technical cluster RFP are totally fictitious as
well. However, the overall structure is one that I recommend for a
technical cluster RFP. Here is the scenario:

A group of researchers
is interested in a cluster to support an MPI application. In this
case, based on the requirements of the users and the speed of certain
processors that came from testing their application, the group knows
how many nodes they need. Various interconnect technologies have been
studied to understand the impact of increased networking bandwidth
and decreasing latency on the performance of the codes. The number of
processors is fixed at 256 and dual processor nodes are allowed.
There is a master node that serves out the compilers and
queuing/scheduling software to the rest of the cluster. There is also
a dedicated file system server that has to provide 4 TB (terabytes) of
space to the cluster only (not on the "outside" network). The
goal is to meet all of these requirements for the lowest cost from a
company that provides good value. Please read the sidebar for an
example technical RFP that I have put together.

From this basic
framework you can add or subtract things that fit your specific
needs. You can also turn some of the requirements into requests for
information or just as easily turn the requests for information into
requirements. You can also use the concepts to create a technical
cluster RFP for the most computational power for a fixed price. There
are many variations that can be done using the pieces the example
provides.

Final Comments

I hope this article has
proved useful to you. It's a bit long, but I wanted to make sure that
most people could take away one useful thing from the article.
Writing a technical RFP can be a very long and grueling process with
the potential for many disagreements. However, if you do your
homework then writing one is not difficult. In the end, doing your
homework and following some of these guidelines can help you save
time and money.

Sidebar Two: Links Mentioned in Article

PAPI

perfctr

TAU

svPablo

MPICH

LAM

MPI-Pro

lmbench

stream

cpu_rate

Netpipe

MPI Link Checker

NPB

Scali MPI Connect

Sidebar Three: RFP Outline

Overview

The Cluster shall have 256 processors in the compute nodes, one master node
with up to two processors, and one file server node with up to two processors,
all on a private network with the master node also having an additional
network connection to an outside network.

Compute Node Requirements:

Nodes can have dual processors
Nodes can have either:
Intel P4 Xeon processors running at least 3.0 GHz
AMD Opteron processors - at least Opteron 248
(Note: Opterons give 35% improvement in speed compared to Xeon)
Each node should have two built-in Gigabit (GigE) NICs
At least one PCI-Express slot shall be available per node
One riser card that supports PCI-Express cards per node is required
Each node should have at least 2 Gigabytes (GB) of DDR ECC memory per processor
Each node should have at least one hard drive - ATA or SATA drives are acceptable
Each node should have built-in graphics of some type that is supported by Linux
The nodes can be rack mountable or blades. A 2U rack mount case is the largest case acceptable. Smaller cases are preferred.

Master Node Requirements:

Node can have dual processors
Node can have either:
Intel P4 Xeon processors running at least 3.0 GHz
AMD Opteron processors - at least Opteron 248
(Note: Processors should match the compute nodes)

Master node should have two built-in Gigabit (GigE) NICs
At least one PCI-Express slot shall be available per node
One riser card that supports PCI-Express cards per node is required
Master node should have at least 2 Gigabytes (GB) of DDR ECC memory per processor
Master node should have at least two hard drives in a RAID-1 configuration. ATA or SATA disks are acceptable.
Master node should have a 3D graphics card with at least 256 MB of dedicated video memory that is supported by Linux
Master node should have hot swappable power supplies
Master node can be rack mountable or blade. A 2U rack mount case is the largest case acceptable. Smaller cases are preferred.

File Serving (FS) Node Requirements:

Node can have dual processors
Node can have either:

Intel P4 Xeon processors running at least 3.0 GHz
AMD Opteron processors - at least Opteron 248
(Note: Processors should match the compute nodes)

FS node should have two built-in Gigabit (GigE) NICs
At least one PCI-Express slot shall be available per node
One riser card that supports PCI-Express cards per node is required
FS node should have at least 6 Gigabytes (GB) of DDR ECC memory (3 GB per processor)
FS node should have at least two hard drives in a RAID-1 configuration for the OS. ATA or SATA disks are acceptable.
FS node should have at least 4 Terabytes (TB) of usable space in a RAID-5 configuration. Disks should be hot-swappable. More
space is preferred. ATA or SATA disks are acceptable if hot-swappable.
FS node should have a graphics card of some type supported by Linux
FS node should have hot swappable power supplies
FS node can be rack mountable or blade. No greater than 5U case is acceptable. Smaller cases are preferred.

Networking Requirements:

There are two private networks connecting the compute nodes,
the master node, and the file serving node.

A GigE network for NFS and management traffic connecting
the compute nodes, the master node, and the file server node
A separate computational network for message-passing
traffic that connects the compute nodes, the master node,
and the file server node.

The switch for the gigabit network is vendor selected.
A single switch to connect all nodes is required. Please
provide the following information:

   Switch manufacturer and model number

Back plane bandwidth

The computational network is vendor selected from the
list below. The following performance numbers may be used to
select the network (Note: higher performance is preferred):

   Gigabit Ethernet: Baseline

Myrinet - 10% faster than baseline

Quadrics - 15% faster than baseline

Dolphin - 12% faster than baseline

IB - 30% faster than baseline

Please provide the following information:

   Network chosen including model numbers

Detailed layout of the two networks. Please include a diagram.

Tested or estimated cross-sectional bandwidth (specify whether tested or estimated)

Tested or estimated worst-case latency (specify whether tested or estimated)

Physical and Environmental Requirements:

Cluster should fit through an 84" high doorway without disassembly
Cluster racks should have wheels for movement
Cluster racks should have lockable front and rear doors
Cluster racks should have mesh front and back except for specific cooling system
Please provide the following information:

   Weight of each rack with all nodes and cabling installed in pounds

Dimensions of each rack with all nodes and cabling installed

Power requirements for each rack, specifying the number and type of outlets

Cooling requirements for the entire cluster in tons of air-conditioning
   required. Also please give ambient temperature.

Estimated total noise level of cluster

Software Requirements:

A cluster management system (CMS) should be installed in the cluster.
The CMS will have the ability to perform the following functions:

   Ability to install OS on each node (unless CMS does not do this)

Ability to rebuild a node that has been down

Ability to execute commands across cluster (parallel commands)

Ability to rebuild cluster from bare metal

Ability to shut down cluster

Ability to restart cluster without overloading circuits

Extensive monitoring capability (Please provide details of what is monitored)

Insert Operating System of Choice

   Describe OS update process including security updates

Vendor will install and test GNU and commercial compilers. The
commercial compilers will be provided 15 days prior to installation
date to the vendor.
Vendor will install and test commercial MPI implementation
(provided by customer). It will be provided 15 days prior to installation
date. Open-source MPICH and LAM will also be installed and tested.
Vendor will install and test commercial queuing/scheduling software
(provided by customer). It will be provided 15 days prior to installation
date.  

Benchmarking Requirements:

The vendor will be responsible for benchmarking proposed hardware.
The following benchmarks will be run:

   Stream benchmark:

      Run a single copy of stream seven times on proposed compute nodes
      and compute average and geometric mean

Run two copies of stream at the same time (one on each processor)
      seven times on the proposed compute nodes and compute average and
      geometric mean. Perform this benchmark only if proposing dual CPU nodes.

Report all stream results (all cases)

Report compiler, version, and compile flags used

Report version of stream benchmark used

No code modifications to the benchmark are allowed

lmbench benchmark:

      Run a single copy of lmbench seven times on proposed compute
      nodes and compute average and geometric mean

Run two copies of lmbench at the same time (one on each processor)
      seven times on the proposed compute nodes and compute average and
      geometric mean. Perform this benchmark only if proposing dual CPU nodes.

Report all lmbench results (all cases)

Report compiler, version, and compile flags used

Report version of lmbench benchmark used

No code modifications to the benchmark are allowed

cpu_rate benchmark:

      Run a single copy of cpu_rate seven times on the proposed compute
      nodes and compute the average and geometric mean

Run two copies of cpu_rate at the same time (one on each processor)
      seven times on the proposed compute nodes and compute the average and
      geometric mean. Perform this benchmark only if proposing dual CPU nodes.

Report all cpu_rate results (all cases)

Report compiler, version, and compile flags used

Report version of cpu_rate benchmark used

No code modifications to the benchmark are allowed

NetPIPE benchmark:

      Run NetPIPE between two compute nodes through the proposed gigabit
      switch and the proposed computational network.

Run NetPIPE seven times and compute the average of peak
      bandwidth in MB/sec (Megabytes per second), average latency in
      microseconds, geometric mean of peak bandwidth in MB/sec (Megabytes
      per second), geometric mean of latency in microseconds.

         Do this for the gigabit network

Do this for the computational network

Provide graphs of all seven runs for each network

Report compiler, version, and compile options used

Report network options used (particularly MTU)

Report OS details including kernel and network drivers. Also report
      network driver parameters used. Report all kernel patches.

Run NetPIPE with TCP, LAM, MPICH, and Commercial MPI

No code modifications to the benchmark are allowed
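To make the NetPIPE numbers comparable across vendors, it helps to state how peak bandwidth and latency are extracted from each run. The sketch below assumes each run has already been parsed into rows of (message size in bytes, throughput in Mbit/s, transfer time in seconds); that layout is an assumption and should be checked against the output of the NetPIPE version actually used. Taking latency as the transfer time of the smallest message is likewise one common convention, not something this outline mandates.

```python
def peak_bandwidth_mbytes(run):
    """Peak throughput of one run, converted from Mbit/s to MB/s."""
    return max(mbps for _size, mbps, _secs in run) / 8.0


def small_message_latency_us(run):
    """Transfer time of the smallest message in the run, in microseconds."""
    _size, _mbps, secs = min(run, key=lambda row: row[0])
    return secs * 1e6


# One hypothetical run: (bytes, Mbit/s, seconds)
sample_run = [
    (1, 0.2, 40e-6),
    (1024, 150.0, 55e-6),
    (1048576, 900.0, 9.3e-3),
]
print(peak_bandwidth_mbytes(sample_run), "MB/s peak")
print(small_message_latency_us(sample_run), "microseconds latency")
```

The per-run values from all seven runs can then be fed into the average and geometric mean for each network.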

NPB (NASA Parallel Benchmarks) benchmarks:

      Run NPB 2.4 benchmarks with Class C sizes over the proposed computational
      network with proposed computational nodes

Run all 8 benchmarks for the following combinations of number of processes and MPI:

         4 processors (Single CPU per node - 4 nodes. Two processes per node - 2 nodes)

16 processors (Single CPU per node - 16 nodes. Two processes per node - 8 nodes)

64 processors (Single CPU per node - 64 nodes. Two processes per node - 32 nodes)

MPI: MPICH, LAM, and commercial MPI

For each of the 8 codes in NPB 2, you will have 12 sets of results:
         three MPI implementations and four node configurations

Run all combinations seven times and compute average and
         geometric mean for each of the 8 benchmarks in NPB 2. Do this for each of the
         12 combinations.

Report all results in a spreadsheet

Report compiler, version, and compile options used

Report network configuration

Report OS and kernel version used. Report all patches to kernel

No code modifications to the benchmark are allowed

Bonnie++ benchmark:

      Run Bonnie++ seven times on the proposed compute node
      hardware, master node hardware, and file server hardware

         This is a total of 21 runs

Run with file sizes that are 5 times physical memory

Compute average and geometric mean of results (throughput and CPU usage)

Report all results in a spreadsheet

Report compiler, version, and compile options

No code modifications to the benchmark are allowed
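The "5 times physical memory" rule is easy to get wrong by hand when node types have different amounts of RAM. The sketch below derives the file size from the host's memory and builds a Bonnie++ command line. The helper names and the scratch directory are hypothetical, and the `-d` (test directory) and `-s` (file size in megabytes) flags should be verified against the man page of the Bonnie++ version being proposed.

```python
import os


def bonnie_file_size_mb(ram_bytes: int, factor: int = 5) -> int:
    """File size in MiB that is `factor` times physical memory, rounded up."""
    mib = 1024 * 1024
    return -(-ram_bytes * factor // mib)  # ceiling division


def physical_ram_bytes() -> int:
    """Physical memory of this host via sysconf (available on Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def bonnie_command(directory: str, ram_bytes: int) -> list:
    """Argument list for one Bonnie++ run in `directory`."""
    return ["bonnie++", "-d", directory, "-s", str(bonnie_file_size_mb(ram_bytes))]


# Example for a hypothetical node with 2 GiB of RAM and a /scratch directory
print(" ".join(bonnie_command("/scratch", 2 * 1024**3)))
```

Sizing the file well beyond RAM keeps the page cache from hiding the real disk throughput, which is presumably why the outline asks for it.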

Bonnie++ over NFS benchmarks:

      Run Bonnie++ over NFS from proposed file server to two proposed
      clients over the proposed gigabit network including the proposed GigE switch.

         NFS server is the proposed file server

NFS client is the proposed compute node

Run Bonnie++ seven times on each of the two clients. Compute average
      and geometric mean of the results (throughput and CPU usage).

         Unmount file systems in between runs

Report all results for both clients in a spreadsheet

Export file system using the following options:

         sync, rw, no_root_squash

Client mount options are variable with the following exceptions that must be set:

         hard, intr, tcp

Report compiler, version, and compile options

Report network configuration used

Report NFS server configuration used including export options

Report NFS client mount options used

Report OS version including kernel, kernel patches, and GigE NIC drivers
      (with driver versions and options) for the NFS file server and NFS clients.

No code modifications to the benchmark are allowed.
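Because the export options are fixed and three client mount options are mandatory, reported configurations can be checked mechanically. The following sketch builds an `/etc/exports` entry with the mandated options and reports any required client option that is missing from a vendor's mount option string. The path, host name, and function names are hypothetical.

```python
EXPORT_OPTIONS = ("sync", "rw", "no_root_squash")   # fixed by the RFP
REQUIRED_CLIENT_OPTIONS = {"hard", "intr", "tcp"}   # must be set on clients


def exports_line(path: str, client: str) -> str:
    """Build an /etc/exports entry carrying the mandated export options."""
    return f"{path} {client}({','.join(EXPORT_OPTIONS)})"


def missing_options(option_string: str, required: set) -> set:
    """Return the required options absent from a comma-separated option string."""
    present = {opt.strip() for opt in option_string.split(",") if opt.strip()}
    return required - present


# Hypothetical export path and client name
print(exports_line("/export/bench", "node01"))

# Hypothetical mount options reported by a vendor
reported = "rw,hard,tcp,rsize=32768,wsize=32768"
print("missing:", missing_options(reported, REQUIRED_CLIENT_OPTIONS))
```

A check like this makes it easy to reject results that were gathered with different options, since such runs would not be comparable between vendors.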

Other Information:

Please provide the following information:

   List your largest cluster installation to date

List your smallest cluster installation to date

How many clusters have you installed?

How many clusters have you installed with a high-speed interconnect?

What open-source projects do you support?

What open-source cluster projects do you support?

Do you have experience with installing parallel file systems? If so,
   which ones?

Please provide 3 references that may be contacted

Please describe a recent support call that is an example of a
   "good" support call.

Please describe a recent support call that is an example of a
   "bad" support call.

      What was done to recover this call?

Please describe your experience with Linux clusters

Please describe all high-speed networks that you have installed at customer sites

Please list all kernel patches that you apply

Warranty/Maintenance Requirements:

Please describe warranty coverage and length of warranty. Please
provide details about what components are not covered.

Please provide a separate line item in costing for yearly hardware
and software maintenance for the following:

   On-site, next day repair of all components

On-site, next day repair of interconnect and master node

Please provide the same information for all equipment that is from
third party companies (e.g. the networking equipment)

Please describe warranty repair procedure beginning with support call
initiation

Please describe maintenance repair procedure beginning with support
call initiation

Delivery Requirements:

Cluster will be delivered within 30 calendar days of receipt of purchase
order (PO).

   Customer will provide all customer-provided software within 15 days
   of the receipt of the PO

Any changes to this schedule must be agreed upon by vendor and customer

All nodes will be configured and operational within 14 days of delivery
to customer site.

   MPI Link-Checker from Microway will be run on entire cluster to ensure
   that all nodes are working correctly. It will be run over the computational
   network and the NFS and management network. The following performance (bandwidth
   and latency) must be met:

      insert requirements for each network here

A 30 day test period will begin after the 14 day installation period

   If the cluster does not meet the following specifications then it is
   deemed unacceptable and will be sent back to the vendor:

      insert acceptance requirements here

On-site training of two administrators will be provided

Performance guarantees on user codes or open-source benchmarks can
be inserted here
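The delivery requirements above chain several deadlines off the PO date, so it can help to put the resulting calendar dates into the RFP response or the project plan. The sketch below computes the latest date for each milestone. The function name, the dictionary keys, and the example PO date are hypothetical, and it assumes every period is counted in calendar days.

```python
from datetime import date, timedelta


def milestones(po_received: date) -> dict:
    """Latest allowed date for each milestone, counted in calendar days."""
    delivery = po_received + timedelta(days=30)      # delivery within 30 days of PO
    operational = delivery + timedelta(days=14)      # nodes up within 14 days of delivery
    return {
        "customer_software_due": po_received + timedelta(days=15),
        "delivery_due": delivery,
        "nodes_operational_due": operational,
        "test_period_ends": operational + timedelta(days=30),  # 30 day test period
    }


# Hypothetical PO date
for name, when in milestones(date(2024, 3, 1)).items():
    print(f"{name}: {when.isoformat()}")
```

If the cluster is delivered or installed early, the later milestones move up accordingly, so the dates are best recomputed from the actual delivery date.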

This article was originally published in ClusterWorld Magazine. It has been
updated and formatted for the web. If you want to read more about HPC
clusters and Linux you may wish to visit
Linux Magazine.

Jeff Layton is proud that he has 4.33 computers for every person in his
house - the most in his neighborhood but he's not telling his neighbors.