Support
Support is one of those things that depends upon your company policies and/or your personal tendencies. If your company requires a certain level of support, then this is something you need to specify. However, if you have some flexibility in choosing a support model, then you have many options. I recommend asking for pricing on at least two support models.
The first support model is the traditional enterprise model of 4-hour on-site service, repair, or replacement. This model can be quite expensive, but it is what traditional IT managers have come to expect. I think that IT managers don't truly understand the cluster concept. Why worry about one node out of many going down? You only lose a small percentage of the compute power of the cluster and you can still run jobs on the cluster. So why spend huge amounts of money to get a node back into production quickly when its loss has only a small impact on production? However, one good thing this support model does is tell you the upper end of support costs.
The second support model is at the lower end of the support spectrum. Here the model is mail-in repair: you mail the node back, it gets diagnosed and repaired, and it is mailed back to you. Plus, email support for normal "problems" within the cluster is included. To me this support model makes much more sense because it gets the nodes repaired and/or replaced, plus it covers software problems.
There is also a third alternative that I would consider. It is really a mix of the two previous models, and it focuses on the critical pieces of the cluster. The model has critical response for the cluster interconnect and perhaps the master node and storage (you can decide what "critical response" means to you) and uses a lower support model on things such as the compute nodes and software. However, I would perhaps put a clause into the support contract to cover systematic software problems and systematic hardware problems. I consider this to be a "balanced" support model.
Warranty
A warranty is a bit different than support. But you can use a warranty to effectively replace support for hardware. I know of one major site that does this very effectively. The concept is that you just need hardware fixed and/or replaced in a timely manner. By asking for a warranty that covers the hardware for the life of the machine, you can get rid of support costs. Of course, this doesn't cover software support, but if you have a good staff, then the software support is effectively covered except for upgrades which can be handled directly by the software vendor.
Another option you might want to think about is having on-site spares. Choose some percentage of the number of nodes - perhaps 2% - and require that many extra nodes to be on-site. However, rather than have them sit on a shelf somewhere, put them into production. You get the performance from them while all nodes are operating. Plus, you can lose up to that many nodes before you go below the required number. For example, if you have 4 on-site spare nodes in production, you can lose up to 4 nodes and still have the required number; it takes a fifth failure to drop below it.
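To make the spare arithmetic concrete, here is a minimal Python sketch. The 200-node requirement and the function names are made up for illustration; only the 2% figure comes from the discussion above, and rounding the spare count up is my assumption.

```python
import math


def spare_plan(required_nodes, spare_percent=2):
    """Size an in-production spare pool as a percentage of the required nodes.

    Rounding up is an assumption: a fractional spare is not much use.
    """
    spares = math.ceil(required_nodes * spare_percent / 100)
    return {
        "required": required_nodes,
        "spares": spares,
        "installed": required_nodes + spares,
    }


def meets_requirement(plan, failed_nodes):
    """True while the working node count is still at or above the requirement."""
    return plan["installed"] - failed_nodes >= plan["required"]


# Hypothetical cluster: 200 required compute nodes, 2% spares.
plan = spare_plan(200)
for failed in range(6):
    status = "OK" if meets_requirement(plan, failed) else "below requirement"
    working = plan["installed"] - failed
    print(f"{failed} failed -> {working} working: {status}")
```

With these made-up numbers the plan calls for 4 spares, so the cluster stays at or above 200 working nodes until a fifth node fails.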
Step 4 - Let's Write the Technical RFP
Now that you have a good set of requirements in hand, your homework is done and you are ready for your final exam - writing the technical RFP. There are a few general guidelines I recommend in doing the actual writing. First, always be as specific as possible to reduce miscommunication with the vendor. However, always be ready for miscommunication since it will happen. Also, be as flexible as possible to allow the vendor to innovate and, hopefully, demonstrate their ability to provide good value. If you can, provide figures. They will help reduce the miscommunication. And finally, be as concise and clear as possible (sounds like an English exam doesn't it?).
What you put into the technical cluster RFP is really up to you and your specific case. At the bare minimum you need to specify what hardware you want (or don't want) and possibly what kind of codes you want to run. However, there are some things I strongly recommend go into the RFP. Let's break these things up into three categories: miscellaneous, company, and benchmarking.
Miscellaneous
The miscellaneous category is one that many people forget. This category deals with things such as the environmental aspects and practical aspects. Environmental aspects include what kind of power and connections you need for the cluster, the footprint of the cluster, weight, height, required cooling, noise, etc. These are very important things to consider when planning your cluster. You can either request this information from the various vendors or you can make them requirements in your technical cluster RFP. One thing you need to think about is getting the cluster from its delivery point to its final destination. Make sure the path can accommodate the weight and the size, particularly the height, of the cluster (you would be surprised how many people forget to check all of the doorways from the delivery point to the server room!).
Miscellaneous items also include things like asking for the OS to be included in the response to the RFP (unless you specify it otherwise) and a description of the cluster management system (CMS), including instructions on how to rebuild the cluster from bare metal, restore a node, bring down the cluster, bring the cluster up, and monitor the cluster. The RFP should also include a request for details about the warranty period and exactly what is covered and not covered in the warranty. Moreover, you should request details about tech support. The vendor should provide a single phone number and point of contact (POC) for your cluster. They should also provide details on what is supported (including possible OS problems) and how quickly they will respond to a problem. This includes both hardware and software support. Be sure to ask about installing commercial applications after the cluster is installed. Some vendors will use this as an excuse for not supporting their clusters. Ask if the vendor installs and supports MPI. Also, inquire about security patches and the procedure for installing them. You can also ask the vendor about what patches they apply to the kernel (if they do patch the kernel).
Company
The next category, "company", allows you to get some information about the company itself. You can ask for information about things such as total cluster revenue, but many companies don't like to volunteer that information and, as far as I know, they are under no obligation to do so. But some good things to ask about the company are:
- What have they given to the Beowulf community?
- Do they support open-source projects and if so which ones and how deep is their involvement?
- Do they support open-source cluster projects and if so which ones and how deep is their involvement?
- Do they support on-site administrator training? (I suggest requiring this)
- Does the vendor stock replacement parts for your cluster?
- Can the company describe a support call that has gone well?
- Can the company describe a support call that has not gone well? What did they do to recover from it?
- Can they describe their experience with clusters? Their experience with Linux and clusters?
- What do they do to tune their clusters for performance?
- Can they provide recommendations from customers?
I'm sure you can think of more good questions to ask the vendors. If the vendor is a good one, then it should have no problem in answering these questions.
Benchmarks
In the final category, "benchmarks", I recommend asking the vendors to run benchmarks on the proposed hardware. There are several reasons for running the benchmarks: it forces the vendor to actually test the proposed hardware; it determines which vendors can run and complete all of the benchmarks; it allows a direct comparison between vendors on the benchmarks; it allows the vendors to show off their tuning capabilities; and it gives the vendor some flexibility so they can show their knowledge of clusters.
Ideally, you should have the vendors run your benchmarks. This will give you the best information on the performance of the cluster running your applications with your data sets. However, I know this isn't always possible. The next best thing is to run synthetic or open-source benchmarks.
The benchmarks I recommend asking the vendors to run are all open-source, so there are no issues with obtaining them. However, be warned that these are "synthetic" benchmarks in that they are not your codes. So, I wouldn't recommend betting the speed of your codes on the results of the benchmarks. I recommend asking for four varieties of benchmarks: nodal benchmarks, which are sometimes called micro-benchmarks; network benchmarks; message-passing benchmarks; and file system benchmarks. I would have the vendor run various codes in each category several times and report the average, standard deviation, geometric mean, and the raw scores. This allows you to see the spread of the scores.
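As an illustration of the summary I would ask for, here is a minimal Python sketch that reduces repeated runs of one benchmark to the average, standard deviation, and geometric mean while keeping the raw scores. The run values are invented, the function name is hypothetical, and I am assuming the sample (not population) standard deviation is what you want. It needs Python 3.8 or later for `statistics.geometric_mean`.

```python
import statistics


def summarize_runs(scores):
    """Summarize repeated runs of one benchmark.

    Needs at least two positive scores: stdev() requires two data points
    and geometric_mean() requires positive values.
    """
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation
    return {
        "runs": len(scores),
        "mean": mean,
        "stdev": stdev,
        "geometric_mean": statistics.geometric_mean(scores),
        # Relative spread: handy for comparing benchmarks with different units.
        "relative_spread": stdev / mean,
        "raw": list(scores),
    }


# Invented numbers standing in for three runs of a bandwidth benchmark (MB/s).
summary = summarize_runs([4000.0, 4100.0, 3900.0])
for key, value in summary.items():
    print(f"{key}: {value}")
```

A large relative spread between runs is worth a follow-up question to the vendor before you compare their average against anyone else's.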
Nodal benchmarks, such as lmbench, stream, and cpu_rate, are very useful in measuring various aspects of the performance of the nodes. This allows you to differentiate between various hardware offerings from the vendors.
Network benchmarks can spot problems in network configurations and also allow you to compare network performance between vendors. The best program for doing this is probably NetPIPE. I would require that NetPIPE be run in several ways, including via MPI over the proposed network. Also, I would require the vendor to run MPI Link Checker from Microway on the test cluster. As an added precaution, I would require it to also be run on the delivered cluster with some guarantee about performance (latency and bandwidth) between all connections.
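If you do write a latency and bandwidth guarantee into the RFP, checking it on the delivered cluster is simple bookkeeping. The Python sketch below assumes you have already collected one latency and one bandwidth figure per node pair into a dictionary. That layout, the node names, and the thresholds are all hypothetical; this is not the output format of NetPIPE or MPI Link Checker.

```python
def find_weak_links(measurements, max_latency_us, min_bandwidth_mbps):
    """Return every node pair that misses the guaranteed latency or bandwidth.

    measurements maps (source, destination) -> (latency_us, bandwidth_mbps).
    """
    weak = []
    for (src, dst), (latency, bandwidth) in sorted(measurements.items()):
        if latency > max_latency_us or bandwidth < min_bandwidth_mbps:
            weak.append((src, dst, latency, bandwidth))
    return weak


# Invented measurements for a three-node test.
measurements = {
    ("n01", "n02"): (8.1, 890.0),
    ("n01", "n03"): (8.3, 885.0),
    ("n02", "n03"): (25.0, 410.0),  # a suspect cable or switch port
}

# Invented guarantee: at most 10 microseconds, at least 800 Mbps.
weak_links = find_weak_links(measurements, max_latency_us=10.0,
                             min_bandwidth_mbps=800.0)
for src, dst, latency, bandwidth in weak_links:
    print(f"{src} <-> {dst}: {latency} us, {bandwidth} Mbps misses the guarantee")
```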
The next category of benchmarks, message-passing benchmarks, is useful for comparing vendors and their ability to tune the cluster for performance. The current best set of message-passing benchmarks is the NAS Parallel Benchmarks.
The final category, file system benchmarks, will give you performance numbers for local disk performance (if you have disks in the nodes) and file system performance over the network. Benchmarks such as IOZone, Bonnie++, and Postmark are useful for performance testing. I would have the vendors run the benchmarks using the proposed file system (you can either pick the file system or let the vendor choose) on the proposed cluster nodes. I would also have them run the exact same benchmarks on any NFS mounted file systems you are using in the cluster. Be sure to ask the vendor to provide the mount and export options for all NFS exported file systems.
Step 5 - Selecting Prospective Vendors
Once you get back the benchmark results and the cost of the proposed solutions, it's time to either select the winner from a technical point of view or to down select to the finalists. However, before you do this, I would suggest developing a scoring scheme for the various aspects of the cluster. You could assign certain scores for completing each benchmark and another score based on performance on the benchmark. You can also assign scores based on other factors such as weight, power, cooling, etc. Then when you receive the responses from the vendors, you can give each of them an overall technical score.
Depending upon the procurement procedures in your company, the technical score may only be a certain percentage of the overall score. For example, the technical score could be 70% of the total and the remaining 30% could be the score of the vendor itself, including cost. The procurement policies of your business or lab or university will determine the final breakdown between the technical score and the other scores.
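Here is a minimal Python sketch of one way such a scoring scheme could be put together. The criteria names, point values, and vendor numbers are all invented; only the 70/30 split comes from the example above. Treat it as a starting point to adapt to your own procurement rules, not as a recommended rubric.

```python
# Example split from the text: 70% technical, 30% vendor (including cost).
TECHNICAL_PERCENT = 70
VENDOR_PERCENT = 30

# Hypothetical criteria and the most points each one can earn.
MAX_POINTS = {
    "benchmarks_completed": 20,
    "benchmark_performance": 50,
    "power_and_cooling": 15,
    "footprint_and_weight": 15,
}


def technical_score(points):
    """Scale the points a vendor earned to a 0-100 technical score."""
    earned = sum(points[name] for name in MAX_POINTS)
    return 100.0 * earned / sum(MAX_POINTS.values())


def overall_score(technical, vendor):
    """Combine two 0-100 scores using the percentage split above."""
    return (TECHNICAL_PERCENT * technical + VENDOR_PERCENT * vendor) / 100


# Invented response from "Vendor A".
vendor_a_points = {
    "benchmarks_completed": 20,
    "benchmark_performance": 40,
    "power_and_cooling": 10,
    "footprint_and_weight": 10,
}
vendor_a_technical = technical_score(vendor_a_points)
vendor_a_overall = overall_score(vendor_a_technical, vendor=60.0)
print(f"technical: {vendor_a_technical:.1f}, overall: {vendor_a_overall:.1f}")
```

Writing the scheme down this explicitly before the responses arrive also makes it much easier to explain the rankings later.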
A few quick comments about the process. Once you have the RFP developed, send it out! Don't waste time sitting on it. Also, be sure to give the vendors time to perform the benchmarks. If you don't hear from the vendors, be sure to check with them to make sure they understand everything and are making progress. Also, be flexible, because there will be questions and concerns from the vendors. If there are changes to be made, be sure all vendors know about them. Also, don't share pricing information or benchmark results between the vendors. And finally, beware of companies that low bid. They are trying to buy your current business but may end up costing you down the road.
Step 6 - How to Down Select
Now that you have sent out the RFP (or rather the procurement people have sent it out) and you have gotten the responses back from the vendors and have scored all of the vendors, how do you down select or select the winner? Well, that's a very good question, and one that is difficult to answer, because it depends upon your procurement policies. However, I can offer a few words of advice.
Be sure to define the scoring before you send out the RFP, but don't tell the vendors the scoring. Then when you get the responses back, put together a review team that has a vested interest in the cluster. Have the team members score the vendors and, on the basis of the scores, rank the vendors. Then have the team discuss the ranking of the vendors and perhaps make adjustments in the rankings. Then you can select the winner or the vendors to be considered for the final competition. Depending upon the policies of where you work, your winner(s) will have to be filtered through procurement and the central IT management. Be ready to explain the scoring system, the actual scores, the rankings based on the scores, and any adjustments done to the rankings.
Then, if you can, provide feedback to the companies that were not selected. This feedback will help them improve their product offering for the next competition you have.
Example Scenario
In Sidebar 2, a sample scenario is listed. This scenario is totally fictitious and does not represent any real competition that I'm aware of. Also, the numbers used in the sample technical cluster RFP are totally fictitious as well. However, the overall structure is one that I recommend for a technical cluster RFP. Here is the scenario:
A group of researchers is interested in a cluster to support an MPI application. In this case, based on the requirements of the users and the speed of certain processors that came from testing their application, the group knows how many nodes they need. Various interconnect technologies have been studied to understand the impact of increased networking bandwidth and decreased latency on the performance of the codes. The number of processors is fixed at 256 and dual processor nodes are allowed. There is a master node that serves out the compilers and queuing/scheduling software to the rest of the cluster. There is also a dedicated file system server that has to provide 4 TB (terabytes) of space to the cluster only (not on the "outside" network). The goal is to meet all of these requirements for the lowest cost from a company that provides good value. Please read the sidebar for an example technical RFP that I have put together.
From this basic framework you can add or subtract things that fit your specific needs. You can also turn some of the requirements into requests for information or just as easily turn the requests for information into requirements. You can also use the concepts to create a technical cluster RFP for the most computational power for a fixed price. There are many variations that can be done using the pieces the example provides.
Final Comments
I hope this article has proved useful to you. It's a bit long, but I wanted to make sure that most people could take away one useful thing from the article. Writing a technical RFP can be a very long and grueling process with the potential for many disagreements. However, if you do your homework, then writing one is not difficult. In the end, doing your homework and following some of these guidelines can help you save time and money.
Sidebar Two: Links Mentioned in Article
Sidebar Three: RFP Outline
Overview
The Cluster shall have 256 processors in the compute nodes, one master node with up to two processors, and one file server node with up to two processors, all on a private network with the master node also having an additional network connection to an outside network.
Compute Node Requirements:
Master Node Requirements:
File Serving (FS) Node Requirements:
Networking Requirements:
There are two private networks connecting the compute nodes, the master node, and the file serving node.
The switch for the gigabit network is vendor selected. A single switch to connect all nodes is required. Please provide the following information:
The computational network is vendor selected from the list below. The following performance numbers may be used to select the network (Note: higher performance is preferred):
Please provide the following information:
Physical and Environmental Requirements:
Software Requirements:
Benchmarking Requirements:
Other Information:
Warranty/Maintenance Requirements:
Delivery Requirements:
This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.
Jeff Layton is proud that he has 4.33 computers for every person in his house - the most in his neighborhood but he's not telling his neighbors.