Life, The Universe, and Your Cluster - A Study in Cluster Optimization

Published on Saturday, 25 March 2006 14:00
Written by Doug Eadline and Jeff Layton
Hits: 17400

Getting the most out of your cluster is always important. But how does one do this? Do you really need to dissect your code and analyze every instruction to get optimal performance? By testing basic assumptions, the authors were able to improve the performance of an eight node cluster, getting the equivalent of three extra nodes, at no additional cost.

Introduction

Cluster computing is great, so they say. Cobble together a few thousand servers and some Ethernet, grab some freely available software and you can now calculate the meaning of Life the Universe and Everything Well, OK, that has been done, for those of you that have not read Hitchhikers Guide to the Galaxy, the answer is there. It only took seven and half million years for the Deep Thought computer to determine this important answer. On to the next problem. Well, maybe, what if someone managed to optimize Deep Thought by 20%, then the question could be answered in a mere six million years saving those anxious to know the answer to the ultimate question quite a lot of waiting. What about your cluster? How much more performance could you be getting out of your system?

Rest assured, we will not be tackling such large problems, however, the right cluster optimizations can make things go much faster. In our quest, we have found that optimization is more like a strategy than flipping a switch. Let's start where all journeys start - getting money.

Because We All Have Budgets

Of course if your budget was unlimited, there is no need to optimize. You can just throw lots of money at buying the fastest performing hardware man has ever known (and in large quantities) and compute your heart out. For most of us however, we have a budget and we need to fit our problem into these budgets. Optimization is, therefore, about getting the best "bang for your buck" because you only have so many bucks.

Previously on Cluster Monkey, we ran a series of articles entitled "The Value Cluster" where we described how to built an eight node cluster from commodity parts for less than $2500. While many people would think that it is rather silly to build a cluster for less than the cost of "real cluster" node, we beg to differ. The optimization problem is identical to "real clusters". And, more importantly, through careful design and optimization much can be learned (and computed) using a small amount of money these days (all hail Moore's Law!).

Interestingly, our friend Forest Hoffman, the author of the Extreme Linux column, has also written about the idea as well, but wasn't crazy enough to take it as far as we did. Our series of three articles ended with the cluster running some tools, such as LAM/MPI and Sun Grid Engine (SGE) along with some benchmarks.

In this article we want to talk about how you might go about tuning your cluster for increased performance. Of course there is nothing better than optimizing your application, but we are going to use a more general measure of performance, the HPL benchmark. HPL stands for High Performance Linpack and is used to rate computers on the Top500 list.

Top500 - Yeah We Know

There is an adaptation of a famous phrase that is appropriate for computers: "There are lies, damn lies, and then there are benchmarks." While this is true in a general sense, benchmarks can perform a vital function for clusters. If you want to change something in the cluster, how do you know if you have improved or hurt the performance? The only way to tell is to benchmark the cluster before and after the change. You can think of the benchmark results as a time trial for a cluster. By twiddling with the carburetor or changing the fuel mixture we can run the time trial again and see if the changes to the cluster have made things worse or better. Instead of miles per hour, however, we will use GFLOPS (Trillion Floating point operations per second).

Now that we're resolved to the fact that we have to run benchmarks, we have to select what benchmark(s) we are going to run. Ideally, we would like to have a number of benchmarks that test all aspects of the cluster. However, for the purposes of this article we are going to restrict our scope to a single benchmark.

{mosgoogle right}

The Top500 benchmark has become the source of a love/hate relationship for many people. The love part is derived from the fact that it is an easy benchmark to run and has been used for many years, so there is a great volume of historical data. The hate part comes from the fact that the Top500 doesn't measure the performance of your codes on the cluster. Consequently, it is meaningless for estimating the true performance of the cluster. Think of it as plotting a line with one data point.

In our case, however, we are not trying to get on the Top500 list, we are simply using a well understood benchmark to test changes made to the cluster.

Dr. Jack Dongarra started collecting performance data for a range of computer systems using a Linpack based benchmark he developed. The benchmark tests how long it takes to solve the dense linear matrix equation Ax = b for the vector x. At first the benchmark only tested single processor or SMP machines. Then it was modified for distributed memory machines, such as clusters, and was called HPL High Performance Linpack) or as most know it "the Top500 benchmark." Dr. Dongarra and others started keeping a list of the 500 fastest machines in the world based on the benchmark. The list is announced twice a year: at the Supercomputer conference, usually in November in the U.S., and in at the International Supercomputer Conference, usually in Germany in June.

SUBHEAD: The Value Cluster The value cluster was designed to be low cost, low power, but provide usable computing performance. Before you can do anything with a cluster, however, you need to name it. After it was built it reminded us of an old Sci-Fi movie called KRONOS. Pictures can be found in the first Value Cluster article. While in the movie KRONOScame from outer space and destroyed things, our Kronos, came in separate parts and was brought by brown trucks.

The Kronos cluster has eight AMD Sempron 2500 processors (and nice shiny silver cases). The head node has 512 MBytes of RAM (PC2700), a 64MB AGP Video card, a DVD/CD Reader/Burner, 160 GBytes of RAID-1 storage, two Fast Ethernet links, and one Gigabit Ethernet link (Intel Pro 1000 NIC). The remaining seven nodes have 256 MBytes of RAM (PC2700), a Fast Ethernet link, and a Gigabit Ethernet link. To minimize cost and hassle, these nodes are disk-less. To support this design, we used the Warewulf Cluster Toolkit (see Sidebar One). The underlying distribution is Fedora Core 2. There are also two switches; an eight port Fast Ethernet switch and an eight port Gigabit Ethernet switch (with jumbo packet support). All this was purchased in the fall of 2004 for about $2500. Consult the Resources Sidebar for more information on how to construct your own Kronos cluster.

To build HPL we used gcc/g77 version 3.3.3 and LAM/MPI version 7.0.6. Of course other compiler and MPI versions are available and in the future we plan on testing these combinations.

Sidebar One: The Warewulf Distribution

In deciding which software would be best suited for a small personal cluster we chose the Warewulf distribution (see the Resources sidebar). Warewulf has many nice features that allow you to manage the cluster. We chose to use disk-less nodes to keep our cost low, but also to eliminate version skew and excessive image copying required by many other cluster distributions. To keep things simple, Warewulf uses a small ramdisk (hard disk emulated in memory) on each compute node to hold the minimum number of files required to run the node. Once the nodes are booted NFS is used for mounting things like /home or /opt. Warewulf is based on building a Virtual Node File System (VNFS) on the master node. Once you have this files system built, it is packaged and sent to the computer nodes when they boot up. The VNFS image can be made very small (30-40 MB) as it contains only what you need to run codes on the nodes. If you run a du -hon a compute node you will see something like the following:

Filesystem       Size  Used Avail Use% Mounted on
none             110M   36M   74M  33% /
none             110M     0  110M   0% /dev/shm
10.0.0.253:/home 8.4G  5.5G  2.5G  70% /home

The number of files (libraries and executables) to "just run binaries" on the compute nodes is surprisingly small. The advantage of this approach is that the nodes are managed from one place (the master node) and will not develop "hard drive personalities" that make administration difficult. In addition, nodes can be quickly rebooted with different kernels, libraries, etc, without having to wait for hard drive spin up, file system checks, or other overhead.

Of course the VNFS "eats" into the available RAM on the computer nodes, but we believe that memory density will continue to increase and cost will continue to drop so that a 50-60MB RAM disk will be become a small percentage of available memory. Note, we also give up 32 MB to the video system because of the built-in video on the motherboard, but the same argument applies, RAM is cheap, the incremental cost of adding 128 MB or more of RAM to cover that lost to both the video and ramdisk is around $10-$12 per node as of the writing of this article.

Finally, it is important to note that everything on the compute nodes is lost upon reboot. All the configuration aspects of the nodes are managed from the VNFS on master node and not by editing configuration files on the compute nodes. Some users are surprised to find that vi is not installed on the compute nodes.

Optimization Strategy

One very helpful feature of Linux clusters is the availability of source code for all of the plumbing. In addition, since clusters use commodity components you are free to buy parts and adapt your cluster to your needs. You could never do this with traditional supercomputers.

Let's start discussing the things we decided to tweak in our cluster to get better performance. There are a large number of things we could tweak, but we're going to focus on the easier things that make the most impact on performance (at least in our experience). However, rather than just willy-nilly "tweak and peak" to see what happens, we should actually take a few minutes and plan out what we want to do.

The first most obvious thing to tweak are the parameters of the Top500 benchmark itself. It has a number of parameters that can be varied to obtain the best performance for a given machine. The HPL website and an IBM paper give you some tuning hints (see Resources Sidebar).

The second thing we would like to attack is the interconnect. The performance of the interconnect can have a large impact on the performance of the application. Consequently, spending some time on the interconnect performance can go a long way to improving the performance of the cluster. Of course, the interconnect can be a large expense, but in our application we want to see how well we can do with commodity interconnects.

Thirdly, since we're talking about the Top500 benchmark that calls a BLAS (Basic Linear Algebra Subprograms) library, it would be a good thing to try various BLAS libraries to see how they help performance.

Finally, we should mention that trying various compiler/MPI combinations can have a huge impact on performance. Good MPI libraries also allow you to profile the execution and then tweak the MPI implementation for maximum performance. Also, trying various compilers in combination these libraries can go a long way in helping performance. We unfortunately will not have the space to cover all these possibilities. In theory, swapping compilers and MPI libraries is a simple process as everything is standardized. In reality it is a bit more tricky and is will be the subject of another article.

Sidebar Two: A Word About Optimizing Multiple Parameters.

Optimization is a process of finding the best values for controllable parameters (i.e. things you can change). Think of these parameters as knobs on control panel. Ideally, if you turn one knob at a time to maximize your result, then the parameters are independent. In the real world, and with clusters, changing one parameter may force other parameters to be changed to maximize the result. These parameters are called dependent. Dependent parameters make optimization tricky and volumes have been written on this subject. Keep in mind that with clusters, changing a parameter may make the setting of other parameters, previous optimized, less optimal and more "fiddling with knobs may be in order."

So we're going to turn three dials on our cluster (HPL parameters, interconnects, and libraries). It may sound like just a few tweaks, but as you will see, the right choice can have big effects on the performance. We are also going to keep track our progress in Table One. Changing multiple parameters at the same time can be a challenge (see Sidebar Two), so we are going to move carefully through the changes.

Tuning HPL

While HPL is designed to be tuned to the machine on which it is running, the key lesson here is that your applications may benefit from this type of tuning as well. However we will not be covering how to tune HPL in detail but will instead mention some of the major parameters that we tuned for best performance. Please check the Resources Sidebar for detailed information about tuning the HPL program.

Building HPL is pretty straight forward. You will need to following the instructions and make sure you have a working mpif77 and mpicc compiler wrappers and the correct path to your BLAS libraries. As mentioned, HPL is a very "tunable" program. It uses an input file, HPL.dat, to input many of the program parameters.

We found that tuning the the following parameters was most effective:

As mentioned, we used LAM as our default MPI, the g77 compiler as our default compiler, and ATLAS as our BLAS Library. The HPL.dat file can be set up to run a whole range of values for many given parameters. This option can generate a lot of test data and may also take a lot of time to run. For our first run, we used some of the default values in the HPL.dat file. The results are shown as Test 1 in Table One. Our results were lousy, to say the least. Time to read the documentation (see Resources). According to the documentation we could be using more sane values for our cluster. The results, indicated as Test 2 (See Table), used better values for our cluster and indeed, we now have an initial rate of 10.6 GFLOPS (good enough for 24th place on the Top500 list in June 1993 or 417th place in November 1997).

At this point, most users need to make a decision. If this were a program we actually would use, we can choose to accept this performance and start queuing up jobs or we can try and optimize for better performance. Let's adjust the parameters a bit (with a real program that is not as tunable as HPL, this would mean some code tinkering). In Test 3, we increased both the block and problem size and we get better performance (11.51 GFLOPS).

These results were for a 1x8 process grid, as recommended for Ethernet in the documentation. We decided to test this assumption and ran using a 2x4 process grid. Low and behold, we got better performance. (There is a lesson here, by the way.) The results are under Test 4 (12.74 GFLOPS) in Table One. These improvements place us at about 45% of the peak performance of all the processors. At the point we were happy with what we had achieved in the time we spent and decided to move onto the next cluster parameter to tweak; the interconnect.

Table 1: Benchmark Results (See text for abbreviations)

Test Ns Nb PxQ Lib Ethernet % of Peak GFLOPS
1 34 4 1x8 Atlas GigE-defaults 0 .0084
2 10000 144 1x8 Atlas GigE-defaults 38 10.6
3 11500 240 1x8 Atlas GigE-defaults 41 11.51
4 11500 240 2x4 Atlas GigE-defaults 45 12.74
5 11500 240 2x4 Atlas GigE, ITR=0, RID=0, MTU=1500 47 13.26
6 11500 240 2x4 AMD GigE, ITR=0, RID=0, MTU=1500 41 11.45
7 12300 180 2x4 Atlas GigE, ITR=0, RID=0, MTU=6000 52 14.53
8 12300 180 2x4 Atlas FastE-defaults 27 7.45

 


Tweak your Interconnect

The interconnect is one of the biggest areas for performance improvement. It can be as simple as upgrading a Fast Ethernet network to a Gigabit Ethernet network, or as subtle as moving the NIC (Network Interface Card) to a different slot. Tweaking your interconnect is a simple way to improve your performance. Better yet, it doesn't cost you anything! This aspect is one of the beauties of clusters - the software is very accessible so adjusting parameters from their default settings is allowed.

We did an initial test of HPL using the Fast Ethernet network on the cluster (recall that the cluster has a Fast Ethernet network for cluster administration traffic and a Gigabit Ethernet network for computational traffic). We used the optimal HPL problem parameters we found in the previous section but ran over the Fast Ethernet network. In Test 8 in Table One, we found that we could achieve 7.45 GFLOPS over the Fast Ethernet network (only about 27% of theoretical peak performance).

Compare the performance of Fast Ethernet, Test 8, to the performance of Gigabit Ethernet, Test 4, we get about a 71% increase in performance by switching to a higher performance network - in this case Gigabit Ethernet. However, being the cluster monkeys that we are, we thought we could still improve the performance of our Gigabit Ethernet network.

We tested the Gigabit Ethernet network of Kronos using Netpipe (see Resources Sidebar) with the default network settings. Netpipe is a great tool for measuring the latency and peak bandwidth of a network connection between two nodes. We found that the default latency of the Gigabit Ethernet network (between two nodes and through the switch) was 117 microseconds. In our experience, we thought this was a bit high (the Fast Ethernet network was giving us latencies in the 42 microsecond range), so we started to investigate the Intel Pro/1000 NIC driver (e1000).

As with most Gigabit Ethernet drivers these days, Intel provides documentation for optimizing small packet transfers. The Intel e1000 driver incorporates interrupt rate throttling (IRT) that is also called interrupt coalescence or interrupt moderation. Normally, when an Ethernet frame is received the kernel is interrupted so that is can process the frame. At high data rates, these interrupts can cause performance problems including a CPU that is constantly interrupted to process the frame. The interrupt throttling concept holds multiple Ethernet frames for processing so that only a single system interrupt is used to process multiple frames. This reduces the load on the CPU but also has the potential to increase latency because the frames are held for some period before they are processed.

From the Intel manual, we found the parameters to turn off were Interrupt Throttling Rate (InterruptThrottleRate=0) and Receive Interrupt Delay (RxIntDelay=0). We re-ran Netpipe and found that the latency had dropped to 29 microseconds (not bad for Gigabit Ethernet). In Table One, we abbreviate InterruptThrottleRate as ITR and RxIntDelay as RID.

In our giddy cluster excitement, we re-ran HPL with interrupt throttling off and found that our performance had increased from 12.74 GFLOPS to 13.26 GFLOPS (Test 5 in Table 1). So we increased performance by 4% with some simple changes to the NIC driver. What other changes could we make?

In designing the cluster, we specifically chose a SMC 8508T 8-port Gigabit Ethernet switch because it allows the use of jumbo packets (data packets that are larger than the 1500 byte Ethernet standard) that could be a potential boost to performance. We experimented with various packets sizes by adjusting the MTU (Maximum Transmission Unit) for the Gigabit Ethernet NICs. We tried values up to 9000 and found that an MTU of 6000 gave us the best HPL performance. These results are shown as Test 7 and gave us a performance of 14.53 GFLOPS. We also twiddled with Ns and Nb a bit to get these numbers, but in all tests we found the larger MTU improved performance. Interestingly, This result would have moved us from 24th to 8th place the Top500 list in June 1993 and kept us on the list at 431 in June of 1998. This result is also 52% of the theoretical peak performance which is quite good for Gigabit Ethernet.

Libraries for Acceleration

When optimizing, one easy method to improve performance, particularly for HPL, is to try and find carefully tuned libraries for mathematics functions. HPL depends upon calls to the BLAS (Basic Linear Algebra Subroutine) library. Improving the performance of the BLAS library should improve the performance of HPL. This technique works for other codes as well. Examining the performance of supporting libraries can help lead to improving the performance of the overall code.

Up to this point we had been using ATLAS as our BLAS Library. We had built version 3.7.8 of ATLAS for the system using gcc/g77 version 3.3.3. ATLAS is a unique library in that it creates code based on your specific processor architecture. (see Sidebar Three on ATLAS).

Sidebar Three: Atlas

The Basic Linear Algebra Subroutine (BLAS) package is a standard or specification of the semantics and syntax for computing basic vector and matrix operations. There is a reference version of the BLAS libraries, written in Fortran, but it only serves as a reference and it's performance is fairly low. Many vendors have written their own tuned BLAS libraries for their architecture that have much better performance than the reference implementation.

ATLAS - Automatically Tuned Linear Algebra Software - was born at Jack Dongarra's Innovative Computing Laboratory at the University of Tennessee. It is a software package that creates a tuned BLAS library for the hardware on which it is to be run. It basically is a code generator that tests a wide range of options, such as blocking and unrolling factors, in generated code to achieve the performance for a vast majority of the BLAS functions. It also does this for some of the LINPACK (Linear Algebra Package) functions.

Many codes take the approach of finding the best performance that spans a wide range of platforms. Consequently, the performance is almost guaranteed not to be optimal or at least close to it. What makes ATLAS unique is that it adapts the resulting BLAS library code to the host architecture. Consequently, performance should be better than a generic BLAS library.

You can get ATLAS at Source Forge. All you need is a good C compiler (the website mentions the versions of gcc that produce acceptable results).

There are other fast BLAS libraries available such as ACML (AMD Core Math Library) and the GOTO Library (see Resources Sidebar). Due to limited time and limited space in this article we chose to test the ACML library.

We downloaded the latest version from AMD's website (ACML is free, by the way) and rebuilt HPL using ACML instead of ATLAS. We ran HPL with interrupt throttling off and with our old MTU setting (MTU=1500) to compare to the performance of ATLAS. The result, Test 6, was only 11.45 GFLOPS, which was down from 13.26 GFLOPS with the ATLAS library. We haven't had time to fully investigate why ACML was so much slower than ATLAS. However, it does illustrate that switching supporting libraries can have a large impact on performance. And, it pays to test all your assumptions. If this is starting to sound like a theme, it is.

Don't Panic

There are many other options we did not cover in this article, which we hope to cover in a future article. For example, we could try different MPI implementations, different compilers, different BLAS libraries. We could even try a different networking software instead of TCP/IP.

One software option that looks very appealing for our situation is called GAMMA (see Resources Sidebar). GAMMA stands for Genoa Active Messages and is a low latency, high throughput message library that runs over Gigabit Ethernet networks. It promises lower latencies than normal TCP traffic - approximately 10.6 microseconds for Intel Gigabit Ethernet hardware.

We Did Break a Record

Since our little eight node cluster will never make the Top500 list, we thought it might be interesting to see how how we stand in the dollars per GFLOPS column. Previous records in this area have been held by systems from the Aggregate.org site where the reining champ was KASY0, which provided an HPL double precision cost of $211 per GFLOPS. KASY0 also set an astounding single precision record of $84 per GFLOPS. The Top500 results are based on the double precision HPL.

We are pleased to announce that we have broken the double precision HPL record held by KASY0. Our current record is $171 per GFLOPS. If you use today's prices for the hardware, we have broken the $150 per GFLOPS barrier with an even lower value of $142 per GFLOPS. Table Two summarizes our results.

Table Two - Optimization Results

Metric Value
Total GFLOPS 14.53
Percent of Peak 52%
Cost at Construction $2,490
Dollars per GFLOPS $171
Cost Today $2,063
Dollars per GFLOPS (Today) $142
Power Usage at Load (Watts) 900
MFLOPS per Watt 16.14

Tim Mattox (father of KASY0) helped us with the theoretical peak number for Kronos. According to Tim, the theoretical peak is computed by simply adding the total GHz (1.756) of all the processors and then multiplying by 2 for double precision because the Athlon/Sempron has two independent floating point units, one that can perform an FADD per clock, and the other can do one FMUL per clock. The peak performance was 28.10 GFLOPS in our case.

We should also note that the cost at construction vs. cost today could be considered less than fair as all computer hardware costs less in the future. If you were to build the cluster today, the prices would be $427 lower, thus decreasing the dollar cost per GFLOP. Now if we really wanted to get lower, we would remove one of the 160 MB hard drives (there are two in a RAID1 configuration) and the video card (use on-board video) reducing the price to $1,961 and a dollars per GFLOPS of $135!. But, we believe the point is made without striping down Kronos any further.

There was also no accounting for construction cost. This number is hard to capture. If more people get interested in these types of metrics, we can assume two categories: DIY (Do It Yourself) and Turn-key (no construction the system is delivered ready to run). For now we will consider KRONOS in the DIY category.

We were not as happy with the MFLOPS per Watt, but still found it tolerable as KRONOS draws less current than a typical hair drier.

Good Bye and Thanks for all the FLOPS

We would like thank AMD for supporting this project. The results indicate that the idea of a low cost "value" cluster is not so crazy after all. Perhaps the biggest lesson in this process is to test the assumptions. The naive user might assume that default Ethernet settings, vendor math libraries, and recommended program parameters will be optimal. In our case, we found that we could increase the performance from a default base line of 10.6 to 14.53 GFLOPS. To add some perspective, the performance increase is the equivalent of adding three more compute nodes to our cluster. Your codes may be able to achieve similar enhancements.

We believe that there is more performance improvements to be had in our cluster. We are not quite ready to set it loose on finding the answer to the eternal question of "Life, the Universe, and Everything", but we know we are getting closer. We did find it strangely improbable, however, that the last 28 machines on the first Top500 list (June 1993) had a rating 0.42 GFLOPS. Moreover, progress towards understanding the question continues.{mosgoogle right}

This article was originally published in Linux Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Douglas Eadline and Jeffrey Layton can be found swinging from the trees at clustermonkey.net.

Sidebar Resources

The Kronos Value Cluster: You can find a complete hardware/software recipe for the Kronos cluster here on Cluster Monkey.

HPL Tuning

Top500 List

Running HPL LINPACK on IBM xSeries Linux Clusters

Warewulf Cluster Toolkit

AMD (thanks again)

AMD Libraries/

Atlas Libraries

GOTO Math Library

KASY0 at Aggregate.org

Netpipe benchmark

Intel e1000 Optimization

GAMMA Cluster Optimized Ethernet