Getting the most out of your cluster is always important. But how does one do this? Do you really need to dissect your code and analyze every instruction to get optimal performance? By testing basic assumptions, the authors were able to improve the performance of an eight node cluster, getting the equivalent of three extra nodes, at no additional cost.
Introduction
Cluster computing is great, so they say. Cobble together a few thousand servers and some Ethernet, grab some freely available software and you can now calculate the meaning of Life the Universe and Everything Well, OK, that has been done, for those of you that have not read Hitchhikers Guide to the Galaxy, the answer is there. It only took seven and half million years for the Deep Thought computer to determine this important answer. On to the next problem. Well, maybe, what if someone managed to optimize Deep Thought by 20%, then the question could be answered in a mere six million years saving those anxious to know the answer to the ultimate question quite a lot of waiting. What about your cluster? How much more performance could you be getting out of your system?
Rest assured, we will not be tackling such large problems, however, the right cluster optimizations can make things go much faster. In our quest, we have found that optimization is more like a strategy than flipping a switch. Let's start where all journeys start - getting money.
Because We All Have Budgets
Of course if your budget was unlimited, there is no need to optimize. You can just throw lots of money at buying the fastest performing hardware man has ever known (and in large quantities) and compute your heart out. For most of us however, we have a budget and we need to fit our problem into these budgets. Optimization is, therefore, about getting the best "bang for your buck" because you only have so many bucks.
Previously on Cluster Monkey, we ran a series of articles entitled "The Value Cluster" where we described how to built an eight node cluster from commodity parts for less than $2500. While many people would think that it is rather silly to build a cluster for less than the cost of "real cluster" node, we beg to differ. The optimization problem is identical to "real clusters". And, more importantly, through careful design and optimization much can be learned (and computed) using a small amount of money these days (all hail Moore's Law!).
Interestingly, our friend Forest Hoffman, the author of the Extreme Linux column, has also written about the idea as well, but wasn't crazy enough to take it as far as we did. Our series of three articles ended with the cluster running some tools, such as LAM/MPI and Sun Grid Engine (SGE) along with some benchmarks.
In this article we want to talk about how you might go about tuning your cluster for increased performance. Of course there is nothing better than optimizing your application, but we are going to use a more general measure of performance, the HPL benchmark. HPL stands for High Performance Linpack and is used to rate computers on the Top500 list.
Top500 - Yeah We Know
There is an adaptation of a famous phrase that is appropriate for computers: "There are lies, damn lies, and then there are benchmarks." While this is true in a general sense, benchmarks can perform a vital function for clusters. If you want to change something in the cluster, how do you know if you have improved or hurt the performance? The only way to tell is to benchmark the cluster before and after the change. You can think of the benchmark results as a time trial for a cluster. By twiddling with the carburetor or changing the fuel mixture we can run the time trial again and see if the changes to the cluster have made things worse or better. Instead of miles per hour, however, we will use GFLOPS (Trillion Floating point operations per second).
Now that we're resolved to the fact that we have to run benchmarks, we have to select what benchmark(s) we are going to run. Ideally, we would like to have a number of benchmarks that test all aspects of the cluster. However, for the purposes of this article we are going to restrict our scope to a single benchmark.
The Top500 benchmark has become the source of a love/hate relationship for many people. The love part is derived from the fact that it is an easy benchmark to run and has been used for many years, so there is a great volume of historical data. The hate part comes from the fact that the Top500 doesn't measure the performance of your codes on the cluster. Consequently, it is meaningless for estimating the true performance of the cluster. Think of it as plotting a line with one data point.
In our case, however, we are not trying to get on the Top500 list, we are simply using a well understood benchmark to test changes made to the cluster.
Dr. Jack Dongarra started collecting performance data for a range of computer systems using a Linpack based benchmark he developed. The benchmark tests how long it takes to solve the dense linear matrix equation Ax = b for the vector x. At first the benchmark only tested single processor or SMP machines. Then it was modified for distributed memory machines, such as clusters, and was called HPL High Performance Linpack) or as most know it "the Top500 benchmark." Dr. Dongarra and others started keeping a list of the 500 fastest machines in the world based on the benchmark. The list is announced twice a year: at the Supercomputer conference, usually in November in the U.S., and in at the International Supercomputer Conference, usually in Germany in June.
SUBHEAD: The Value Cluster The value cluster was designed to be low cost, low power, but provide usable computing performance. Before you can do anything with a cluster, however, you need to name it. After it was built it reminded us of an old Sci-Fi movie called KRONOS. Pictures can be found in the first Value Cluster article. While in the movie KRONOScame from outer space and destroyed things, our Kronos, came in separate parts and was brought by brown trucks.
The Kronos cluster has eight AMD Sempron 2500 processors (and nice shiny silver cases). The head node has 512 MBytes of RAM (PC2700), a 64MB AGP Video card, a DVD/CD Reader/Burner, 160 GBytes of RAID-1 storage, two Fast Ethernet links, and one Gigabit Ethernet link (Intel Pro 1000 NIC). The remaining seven nodes have 256 MBytes of RAM (PC2700), a Fast Ethernet link, and a Gigabit Ethernet link. To minimize cost and hassle, these nodes are disk-less. To support this design, we used the Warewulf Cluster Toolkit (see Sidebar One). The underlying distribution is Fedora Core 2. There are also two switches; an eight port Fast Ethernet switch and an eight port Gigabit Ethernet switch (with jumbo packet support). All this was purchased in the fall of 2004 for about $2500. Consult the Resources Sidebar for more information on how to construct your own Kronos cluster.
To build HPL we used gcc/g77 version 3.3.3 and LAM/MPI version 7.0.6. Of course other compiler and MPI versions are available and in the future we plan on testing these combinations.
Sidebar One: The Warewulf Distribution |
In deciding which software would be best suited for a small personal cluster we chose the Warewulf distribution (see the Resources sidebar). Warewulf has many nice features that allow you to manage the cluster. We chose to use disk-less nodes to keep our cost low, but also to eliminate version skew and excessive image copying required by many other cluster distributions. To keep things simple, Warewulf uses a small ramdisk (hard disk emulated in memory) on each compute node to hold the minimum number of files required to run the node. Once the nodes are booted NFS is used for mounting things like /home or /opt. Warewulf is based on building a Virtual Node File System (VNFS) on the master node. Once you have this files system built, it is packaged and sent to the computer nodes when they boot up. The VNFS image can be made very small (30-40 MB) as it contains only what you need to run codes on the nodes. If you run a du -hon a compute node you will see something like the following: Filesystem Size Used Avail Use% Mounted on none 110M 36M 74M 33% / none 110M 0 110M 0% /dev/shm 10.0.0.253:/home 8.4G 5.5G 2.5G 70% /home The number of files (libraries and executables) to "just run binaries" on the compute nodes is surprisingly small. The advantage of this approach is that the nodes are managed from one place (the master node) and will not develop "hard drive personalities" that make administration difficult. In addition, nodes can be quickly rebooted with different kernels, libraries, etc, without having to wait for hard drive spin up, file system checks, or other overhead. Of course the VNFS "eats" into the available RAM on the computer nodes, but we believe that memory density will continue to increase and cost will continue to drop so that a 50-60MB RAM disk will be become a small percentage of available memory. Note, we also give up 32 MB to the video system because of the built-in video on the motherboard, but the same argument applies, RAM is cheap, the incremental cost of adding 128 MB or more of RAM to cover that lost to both the video and ramdisk is around $10-$12 per node as of the writing of this article. Finally, it is important to note that everything on the compute nodes is lost upon reboot. All the configuration aspects of the nodes are managed from the VNFS on master node and not by editing configuration files on the compute nodes. Some users are surprised to find that vi is not installed on the compute nodes. |
Optimization Strategy
One very helpful feature of Linux clusters is the availability of source code for all of the plumbing. In addition, since clusters use commodity components you are free to buy parts and adapt your cluster to your needs. You could never do this with traditional supercomputers.
Let's start discussing the things we decided to tweak in our cluster to get better performance. There are a large number of things we could tweak, but we're going to focus on the easier things that make the most impact on performance (at least in our experience). However, rather than just willy-nilly "tweak and peak" to see what happens, we should actually take a few minutes and plan out what we want to do.
The first most obvious thing to tweak are the parameters of the Top500 benchmark itself. It has a number of parameters that can be varied to obtain the best performance for a given machine. The HPL website and an IBM paper give you some tuning hints (see Resources Sidebar).
The second thing we would like to attack is the interconnect. The performance of the interconnect can have a large impact on the performance of the application. Consequently, spending some time on the interconnect performance can go a long way to improving the performance of the cluster. Of course, the interconnect can be a large expense, but in our application we want to see how well we can do with commodity interconnects.
Thirdly, since we're talking about the Top500 benchmark that calls a BLAS (Basic Linear Algebra Subprograms) library, it would be a good thing to try various BLAS libraries to see how they help performance.
Finally, we should mention that trying various compiler/MPI combinations can have a huge impact on performance. Good MPI libraries also allow you to profile the execution and then tweak the MPI implementation for maximum performance. Also, trying various compilers in combination these libraries can go a long way in helping performance. We unfortunately will not have the space to cover all these possibilities. In theory, swapping compilers and MPI libraries is a simple process as everything is standardized. In reality it is a bit more tricky and is will be the subject of another article.
Sidebar Two: A Word About Optimizing Multiple Parameters. |
Optimization is a process of finding the best values for controllable parameters (i.e. things you can change). Think of these parameters as knobs on control panel. Ideally, if you turn one knob at a time to maximize your result, then the parameters are independent. In the real world, and with clusters, changing one parameter may force other parameters to be changed to maximize the result. These parameters are called dependent. Dependent parameters make optimization tricky and volumes have been written on this subject. Keep in mind that with clusters, changing a parameter may make the setting of other parameters, previous optimized, less optimal and more "fiddling with knobs may be in order." |
So we're going to turn three dials on our cluster (HPL parameters, interconnects, and libraries). It may sound like just a few tweaks, but as you will see, the right choice can have big effects on the performance. We are also going to keep track our progress in Table One. Changing multiple parameters at the same time can be a challenge (see Sidebar Two), so we are going to move carefully through the changes.
Tuning HPL
While HPL is designed to be tuned to the machine on which it is running, the key lesson here is that your applications may benefit from this type of tuning as well. However we will not be covering how to tune HPL in detail but will instead mention some of the major parameters that we tuned for best performance. Please check the Resources Sidebar for detailed information about tuning the HPL program.
Building HPL is pretty straight forward. You will need to following the instructions and make sure you have a working mpif77 and mpicc compiler wrappers and the correct path to your BLAS libraries. As mentioned, HPL is a very "tunable" program. It uses an input file, HPL.dat, to input many of the program parameters.
We found that tuning the the following parameters was most effective:
- Problem Size - (Ns) this variable is the maximum array that can be fit into your total cluster memory. If you set this too large, your nodes will run out of memory. If you set this too low then you will not achieve your maximum performance.
- Block Size - (Nb) the block size is the size for data distribution.
- Process Grids - (PxQ) these variables determine how the problem should be mapped on to the cluster and thus the communication pathways. They must be a multiple of your total number of processors. In our case we use 1x8 or 2x4. The HPL website suggests that P and Q should be as close as possible, but notes that with Ethernet, 1x8 may be best.
As mentioned, we used LAM as our default MPI, the g77 compiler as our default compiler, and ATLAS as our BLAS Library. The HPL.dat file can be set up to run a whole range of values for many given parameters. This option can generate a lot of test data and may also take a lot of time to run. For our first run, we used some of the default values in the HPL.dat file. The results are shown as Test 1 in Table One. Our results were lousy, to say the least. Time to read the documentation (see Resources). According to the documentation we could be using more sane values for our cluster. The results, indicated as Test 2 (See Table), used better values for our cluster and indeed, we now have an initial rate of 10.6 GFLOPS (good enough for 24th place on the Top500 list in June 1993 or 417th place in November 1997).
At this point, most users need to make a decision. If this were a program we actually would use, we can choose to accept this performance and start queuing up jobs or we can try and optimize for better performance. Let's adjust the parameters a bit (with a real program that is not as tunable as HPL, this would mean some code tinkering). In Test 3, we increased both the block and problem size and we get better performance (11.51 GFLOPS).
These results were for a 1x8 process grid, as recommended for Ethernet in the documentation. We decided to test this assumption and ran using a 2x4 process grid. Low and behold, we got better performance. (There is a lesson here, by the way.) The results are under Test 4 (12.74 GFLOPS) in Table One. These improvements place us at about 45% of the peak performance of all the processors. At the point we were happy with what we had achieved in the time we spent and decided to move onto the next cluster parameter to tweak; the interconnect.
Table 1: Benchmark Results (See text for abbreviations)
Test | Ns | Nb | PxQ | Lib | Ethernet | % of Peak | GFLOPS |
1 | 34 | 4 | 1x8 | Atlas | GigE-defaults | 0 | .0084 |
2 | 10000 | 144 | 1x8 | Atlas | GigE-defaults | 38 | 10.6 |
3 | 11500 | 240 | 1x8 | Atlas | GigE-defaults | 41 | 11.51 |
4 | 11500 | 240 | 2x4 | Atlas | GigE-defaults | 45 | 12.74 |
5 | 11500 | 240 | 2x4 | Atlas | GigE, ITR=0, RID=0, MTU=1500 | 47 | 13.26 |
6 | 11500 | 240 | 2x4 | AMD | GigE, ITR=0, RID=0, MTU=1500 | 41 | 11.45 |
7 | 12300 | 180 | 2x4 | Atlas | GigE, ITR=0, RID=0, MTU=6000 | 52 | 14.53 |
8 | 12300 | 180 | 2x4 | Atlas | FastE-defaults | 27 | 7.45 |