Article Index

Tweak your Interconnect

The interconnect is one of the biggest areas for performance improvement. A change can be as simple as upgrading a Fast Ethernet network to Gigabit Ethernet, or as subtle as moving the NIC (Network Interface Card) to a different slot. Tweaking your interconnect is a simple way to improve performance, and in many cases it doesn't cost you anything. This is one of the beauties of clusters: the software is very accessible, so you are free to adjust parameters away from their default settings.

We did an initial test of HPL using the Fast Ethernet network on the cluster (recall that the cluster has a Fast Ethernet network for administration traffic and a Gigabit Ethernet network for computational traffic). We used the optimal HPL problem parameters found in the previous section, but ran the test over Fast Ethernet. In Test 8 in Table One, we found that we could achieve 7.45 GFLOPS over the Fast Ethernet network (only about 27% of theoretical peak performance).

Comparing the performance of Fast Ethernet (Test 8) to the performance of Gigabit Ethernet (Test 4), we get about a 71% increase in performance by switching to a higher performance network, in this case Gigabit Ethernet. However, being the cluster monkeys that we are, we thought we could still improve the performance of our Gigabit Ethernet network.

We tested the Gigabit Ethernet network of Kronos using Netpipe (see Resources Sidebar) with the default network settings. Netpipe is a great tool for measuring the latency and peak bandwidth of a network connection between two nodes. We found that the default latency of the Gigabit Ethernet network (between two nodes and through the switch) was 117 microseconds. Based on our experience this seemed a bit high (the Fast Ethernet network was giving us latencies in the 42 microsecond range), so we started to investigate the Intel PRO/1000 NIC driver (e1000).
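If you want to check your own network, a Netpipe TCP test takes only a minute to set up. The sketch below uses hypothetical host and file names; one node runs the receiver and a second node connects to it and logs the results:

    # on the receiving node (for example, node01)
    NPtcp

    # on the sending node, point NPtcp at the receiver and log the results
    NPtcp -h node01 -o np.out

    # the lines for the smallest messages give a good estimate of the latency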

As with most Gigabit Ethernet drivers these days, Intel provides documentation for optimizing small packet transfers. The Intel e1000 driver incorporates interrupt rate throttling (IRT), also called interrupt coalescence or interrupt moderation. Normally, when an Ethernet frame is received, the kernel is interrupted so that it can process the frame. At high data rates, these interrupts can cause performance problems, including a CPU that is constantly being interrupted to process frames. With interrupt throttling, the NIC holds multiple Ethernet frames so that a single system interrupt is used to process several frames at once. This approach reduces the load on the CPU, but it also has the potential to increase latency because the frames are held for some period before they are processed.

From the Intel manual, we found the parameters to turn off were Interrupt Throttling Rate (InterruptThrottleRate=0) and Receive Interrupt Delay (RxIntDelay=0). We re-ran Netpipe and found that the latency had dropped to 29 microseconds (not bad for Gigabit Ethernet). In Table One, we abbreviate InterruptThrottleRate as ITR and RxIntDelay as RID.
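For those who want to try this at home, the change amounts to reloading the e1000 module with the new parameter values. This is a rough sketch; the interface name is an example, and whether you set the options on the modprobe command line or in /etc/modprobe.conf will depend on your distribution and driver version:

    # take the interface down and reload the driver with throttling disabled
    ifdown eth1
    rmmod e1000
    modprobe e1000 InterruptThrottleRate=0 RxIntDelay=0
    ifup eth1

    # or make it permanent in /etc/modprobe.conf (one value per e1000 port)
    # options e1000 InterruptThrottleRate=0,0 RxIntDelay=0,0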

In our giddy cluster excitement, we re-ran HPL with interrupt throttling off and found that our performance had increased from 12.74 GFLOPS to 13.26 GFLOPS (Test 5 in Table One). So we increased performance by about 4% with some simple changes to the NIC driver. What other changes could we make?

In designing the cluster, we specifically chose an SMC 8508T 8-port Gigabit Ethernet switch because it allows the use of jumbo frames (data packets larger than the 1500 byte Ethernet standard), which could potentially boost performance. We experimented with various packet sizes by adjusting the MTU (Maximum Transmission Unit) for the Gigabit Ethernet NICs. We tried values up to 9000 and found that an MTU of 6000 gave us the best HPL performance. These results are shown as Test 7 and gave us a performance of 14.53 GFLOPS. We also twiddled with Ns and Nb a bit to get these numbers, but in all tests we found the larger MTU improved performance. Interestingly, this result would have moved us from 24th to 8th place on the Top500 list from June 1993, and kept us on the list at position 431 in June of 1998. This result is also 52% of the theoretical peak performance, which is quite good for Gigabit Ethernet.
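For reference, changing the MTU is a one line change per node (the interface name below is an example; the switch and every NIC on the network must support the larger frames):

    # set a 6000 byte MTU on the Gigabit Ethernet interface (here, eth1)
    ifconfig eth1 mtu 6000

    # verify the new setting
    ifconfig eth1 | grep -i mtu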

Libraries for Acceleration

One easy way to improve performance, particularly for HPL, is to find carefully tuned libraries for the underlying math functions. HPL depends upon calls to the BLAS (Basic Linear Algebra Subprograms) library, so improving the performance of the BLAS library should improve the performance of HPL. This technique works for other codes as well: examining the performance of supporting libraries can help improve the performance of the overall code.

Up to this point we had been using ATLAS as our BLAS library. We had built version 3.7.8 of ATLAS for the system using gcc/g77 version 3.3.3. ATLAS is a unique library in that it creates code tuned to your specific processor architecture (see Sidebar Three).

Sidebar Three: ATLAS

The Basic Linear Algebra Subprograms (BLAS) package is a standard, or specification, of the semantics and syntax for basic vector and matrix operations. There is a reference version of the BLAS library, written in Fortran, but it only serves as a reference and its performance is fairly low. Many vendors have written their own tuned BLAS libraries for their architectures that have much better performance than the reference implementation.

ATLAS (Automatically Tuned Linear Algebra Software) was born at Jack Dongarra's Innovative Computing Laboratory at the University of Tennessee. It is a software package that creates a tuned BLAS library for the hardware on which it is to be run. It is basically a code generator that tests a wide range of options, such as blocking and unrolling factors, in the generated code to achieve good performance for the vast majority of the BLAS functions. It also does this for some of the LAPACK (Linear Algebra PACKage) functions.

Many libraries take the approach of finding settings that give the best performance across a wide range of platforms. Consequently, on any one platform the performance is almost guaranteed not to be optimal, or even close to it. What makes ATLAS unique is that it adapts the resulting BLAS library code to the host architecture. As a result, performance should be better than that of a generic BLAS library.

You can get ATLAS at SourceForge. All you need is a good C compiler (the website mentions the versions of gcc that produce acceptable results).
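As a rough sketch of what an ATLAS build looked like in the 3.6/3.7 era (the file and architecture names below are examples, and the build procedure has changed over time, so check the install notes for your version):

    # unpack the source and run the interactive configuration step
    tar xzf atlas3.7.8.tar.gz
    cd ATLAS
    make config CC=gcc

    # the config step reports an architecture name; build and tune with it
    make install arch=<arch>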

There are other fast BLAS libraries available, such as ACML (AMD Core Math Library) and the GOTO library (see Resources Sidebar). Due to limited time, and limited space in this article, we chose to test only the ACML library.

We downloaded the latest version from AMD's website (ACML is free, by the way) and rebuilt HPL using ACML instead of ATLAS. We ran HPL with interrupt throttling off and with our old MTU setting (MTU=1500) to compare to the performance of ATLAS. The result, Test 6, was only 11.45 GFLOPS, which was down from 13.26 GFLOPS with the ATLAS library. We haven't had time to fully investigate why ACML was so much slower than ATLAS. However, it does illustrate that switching supporting libraries can have a large impact on performance. And, it pays to test all your assumptions. If this is starting to sound like a theme, it is.
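For the curious, swapping BLAS libraries in HPL comes down to editing the linear algebra section of your Make.<arch> file and rebuilding. The library paths below are hypothetical examples:

    # in Make.<arch>, point LAdir/LAlib at ATLAS ...
    #   LAdir = /usr/local/atlas/lib
    #   LAlib = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
    # ... or at ACML (which may also need the g77 runtime, -lg2c)
    #   LAdir = /opt/acml/gnu32/lib
    #   LAlib = $(LAdir)/libacml.a

    # then rebuild HPL for that architecture
    make arch=<arch> clean_arch_all
    make arch=<arch>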

Don't Panic

There are many other options we did not explore in this article, which we hope to cover in a future one. For example, we could try different MPI implementations, different compilers, or different BLAS libraries. We could even try different networking software in place of TCP/IP.

One software option that looks very appealing for our situation is called GAMMA (see Resources Sidebar). GAMMA stands for Genoa Active Message MAchine and is a low-latency, high-throughput messaging library that runs over Gigabit Ethernet networks. It promises lower latencies than normal TCP traffic, approximately 10.6 microseconds on Intel Gigabit Ethernet hardware.

We Did Break a Record

Since our little eight node cluster will never make the Top500 list, we thought it might be interesting to see how we stand in the dollars per GFLOPS column. Previous records in this area have been held by systems from the Aggregate.org site, where the reigning champ was KASY0, which provided an HPL double precision cost of $211 per GFLOPS. KASY0 also set an astounding single precision record of $84 per GFLOPS. The Top500 results are based on double precision HPL.

We are pleased to announce that we have broken the double precision HPL record held by KASY0. Our current record is $171 per GFLOPS. If you use today's prices for the hardware, we have broken the $150 per GFLOPS barrier with an even lower value of $142 per GFLOPS. Table Two summarizes our results.

Table Two - Optimization Results

Metric                          Value
Total GFLOPS                    14.53
Percent of Peak                 52%
Cost at Construction            $2,490
Dollars per GFLOPS              $171
Cost Today                      $2,063
Dollars per GFLOPS (Today)      $142
Power Usage at Load (Watts)     900
MFLOPS per Watt                 16.14
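The derived numbers in Table Two come straight from the measured values:

    $2,490 / 14.53 GFLOPS ≈ $171 per GFLOPS (cost at construction)
    $2,063 / 14.53 GFLOPS ≈ $142 per GFLOPS (cost today)
    14,530 MFLOPS / 900 Watts ≈ 16.14 MFLOPS per Watt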

Tim Mattox (father of KASY0) helped us with the theoretical peak number for Kronos. According to Tim, the theoretical peak is computed by adding up the clock rates of all the processors (1.756 GHz each) and then multiplying by 2 for double precision, because the Athlon/Sempron has two independent floating point units: one can perform an FADD per clock and the other an FMUL per clock. The peak performance was 28.10 GFLOPS in our case.
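Spelled out, with one processor in each of the eight nodes:

    8 processors x 1.756 GHz x 2 floating point operations per clock = 28.10 GFLOPS
    14.53 GFLOPS / 28.10 GFLOPS ≈ 52% of theoretical peak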

We should also note that comparing the cost at construction to the cost today could be considered less than fair, as all computer hardware costs less in the future. If you were to build the cluster today, the prices would be $427 lower, thus decreasing the dollar cost per GFLOPS. Now, if we really wanted to get lower, we would remove one of the 160 GB hard drives (there are two in a RAID1 configuration) and the video card (use on-board video), reducing the price to $1,961 and the dollars per GFLOPS to $135! But we believe the point is made without stripping down Kronos any further.

There was also no accounting for the cost of construction, that is, the labor of assembling the cluster. This number is hard to capture. If more people get interested in these types of metrics, we can assume two categories: DIY (Do It Yourself) and turn-key (no construction; the system is delivered ready to run). For now we will consider Kronos to be in the DIY category.

We were not as happy with the MFLOPS per Watt, but still found it tolerable as Kronos draws less power than a typical hair dryer.

Good Bye and Thanks for all the FLOPS

We would like to thank AMD for supporting this project. The results indicate that the idea of a low cost "value" cluster is not so crazy after all. Perhaps the biggest lesson in this process is to test your assumptions. The naive user might assume that default Ethernet settings, vendor math libraries, and recommended program parameters will be optimal. In our case, we found that we could increase the performance from a default baseline of 10.6 GFLOPS to 14.53 GFLOPS. To add some perspective, that performance increase is the equivalent of adding three more compute nodes to our cluster. Your codes may be able to achieve similar gains.

We believe that there are more performance improvements to be had in our cluster. We are not quite ready to set it loose on finding the answer to the eternal question of "Life, the Universe, and Everything", but we know we are getting closer. We did find it strangely improbable, however, that the last 28 machines on the first Top500 list (June 1993) had a rating of 0.42 GFLOPS. Moreover, progress towards understanding the question continues.

This article was originally published in Linux Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Douglas Eadline and Jeffrey Layton can be found swinging from the trees at clustermonkey.net.

Sidebar Resources

The Kronos Value Cluster: You can find a complete hardware/software recipe for the Kronos cluster here on Cluster Monkey.

HPL Tuning

Top500 List

Running HPL LINPACK on IBM xSeries Linux Clusters

Warewulf Cluster Toolkit

AMD (thanks again)

AMD Libraries

ATLAS Libraries

GOTO Math Library

KASY0 at Aggregate.org

Netpipe benchmark

Intel e1000 Optimization

GAMMA Cluster Optimized Ethernet

This work is licensed under CC BY-NC-SA 4.0

©2005-2023 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.