Tools of the cluster trade, don't leave home without them

In past articles, we looked at basic Linux networking. At this point you should have a pretty good idea of how the basic network, common to nearly every modern computer system (TCP/IP over Ethernet), is structured at the packet level. We have learned about Ethernet packets and their header, encapsulating IP packets with their header, encapsulating TCP packets with their header. This process of encapsulation isn't quite finished -- there is nothing to prevent anyone from adding YAH (yet another header) inside the TCP payload to further encapsulate the actual data -- but since a lot of network communications carry the data inside the TCP layer without further headers or encapsulation, we'll quit at this point and move on to the next burning question.

Given this marvelous understanding of how the network is supposed to be working, the burning question du jour is: How well is it working?

Like all things associated with the network, this question is not terribly easy to answer. The network is complicated (as we have seen), with lots of moving parts and many layers (each with its own header) where things can be misconfigured or go wrong. Also, even when it is working "correctly," it can be working poorly, taking a relatively long time to send and receive messages.

In this column we'll examine and review some of the many tools available to verify function and to measure performance of a Linux (or other Unix) based network. The latter is especially important for would-be cluster engineers, as network performance is often a critical design parameter for clusters intended to do parallel computations involving some degree of interprocessor communications (IPCs).

Here is a list of useful tools. We won't examine all of them in detail, as we continue to assume a working knowledge of Linux/Unix administration. Since many of these commands are "fronted" these days by graphical interfaces that hide what is actually being done, one can set up a Linux system successfully without ever using many of them directly. If you fall into this category, I'd suggest reading the man pages associated with the commands and the related HOWTO documents.

  • ifconfig is the basic network configuration tool, but it also can be used to examine current network settings.
  • route is the tool that configures the default IP routes for each interface. It also can be used to examine the current routing table.
  • ping is a simple tool that "bounces" a packet off of a targeted host using ICMP (the Internet Control Message Protocol). We have not yet covered ICMP in this column; the only things you need to know about it right now are that it is a protocol restricted to root (so all ICMP-based applications are either suid root or must be run by root) and that it is the basis of ping. ping is extremely useful for verifying that a network is "up and running": if you can ping a host, you can probably connect to ports on it, if any are offered for connection.
  • traceroute is a more involved ICMP-based tool that actually traces out a route between hosts. This tool isn't so useful on a cluster (where nodes are typically "on the same network") but is very useful overall when a network is failing, especially one with one or more router hops in between hosts.
  • tcpdump is a "network microscope" used to watch actual network traffic at the packet level, covered in previous columns. It requires an interface in "promiscuous mode" (where an application can read packets intended for other applications and users) and can generally only be run by the superuser. "Generally" being one of the many reasons that unencrypted network traffic cannot be assumed to be secure, of course.
  • netstat is a tool that gives you a broad picture of the instantaneous state of the network, including all open sockets. It can be filtered or restricted to only certain kinds of things, and can be run with a delay so that it updates its output every few seconds.
  • /proc/net/dev contains information about all running interfaces. This information is read and digested and turned into usable traffic rates by a number of performance monitors and related tools. We'll likely examine some in a future column.
  • nmap is a tool generally used to test network security, but it is also a valuable diagnostic tool for verifying network function and discovering open ports associated with tools and processes.

The tools above are primarily used to control the network and verify network functionality; the following tools test network performance.

  • ping can be used as a very crude measure of network (ICMP/ping) latency in flood mode (usable only by root, as it is very nearly a denial of service attack if run for more than a second or so). Figure One is an example where lucifer pings uriel, both on the same 100BT switch. Note the relatively poor latency at 0.06 milliseconds or so each way. While not indicative of the interface's real potential, it does verify that the connection between the two hosts works, that both hosts are up, and that the target host (uriel) is responsive.
  • lmbench This package is Larry McVoy's famous benchmark suite, available from BitMover (see below). It has been around a long time and is one of the many tools Linus Torvalds and Friends use to test kernel and library performance while they work. It is a suite of "microbenchmark" tools, each of which typically tests and times just one thing. Some of the tools test and time network bandwidth and latency for various kinds of connections. These numbers are quite comparable across systems and can help you form a reliable picture of raw network performance.
  • netpipe This program is a very powerful tool for benchmarking network performance. It has the advantage of being able to be built to directly test MPI and PVM performance, as well as the performance of specialized cluster interfaces such as Myrinet that don't run on TCP/IP. It is being actively developed and maintained by Guy Helmer at Ames Laboratory at Iowa State University. You can find a detailed discussion of netpipe in Probing Gigabit Ethernet. We will only discuss it briefly in this column.
  • netperf This package is due to Rick Jones of Hewlett-Packard (which hosts it) and is one of the original network benchmark tools -- the first one I ever used to any major extent. It is very easy to use and is reasonably powerful and informative. It languished unloved for a number of years, to the point where it would no longer compile on many of the systems I owned without a bit of hacking, but in the last couple of years it has revived, moved to version 2.3, and seems to still be very useful. We will spend some time with netperf below to see what it can tell us about network performance.

# ping -i 0 uriel
PING uriel: 56(84) bytes of data.
64 bytes from uriel: icmp_seq=1 ttl=64 time=0.139 ms
64 bytes from uriel: icmp_seq=2 ttl=64 time=0.123 ms
64 bytes from uriel: icmp_seq=3 ttl=64 time=0.121 ms
Figure One: Ping times example
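As the ping entry above notes, if you can ping a host you can probably connect to any ports it offers. For the curious, here is a minimal Python sketch of that kind of TCP connect test -- the sort of probe that nmap performs en masse. The helper name is my own invention for illustration, not part of any tool above.

```python
import socket

def port_open(host, port, timeout=1.0):
    # Hypothetical helper: report whether a TCP connection to host:port
    # succeeds within the timeout (i.e., something is listening there).
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_open("uriel", 22)` would report whether uriel appears to offer ssh. Note that this only sees TCP listeners; it says nothing about ICMP reachability or UDP services.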


To get lmbench you need to visit BitMover's website and work your way down through their software projects links. This method gives BitMover an opportunity to convince you that their source control software is the best in the world and worth buying. They accomplish this goal, in part, by requiring you to use their bitkeeper product to download (clone) their open source projects, including lmbench (which is GPL, with a couple of very reasonable restrictions concerning modification).

The download/build procedure is complicated enough, and the tool powerful enough, to warrant a column all its own in the future. For the moment, then, we'll give only the executive summary. After downloading and unpacking bitkeeper and using it to clone lmbench's LMbench3 repository, one uses bitkeeper to "get" the sources out of the repository SCCS directories. On my system a straightforward "make" was all that was then required, although I'm sure YMMV (Your Mileage May Vary).

I then ran the "bw_tcp" (TCP bandwidth) benchmark on two hosts (ganesh and g01). On g01 (remote node) I executed:

$bw_tcp -s 

(which starts a "server" -- a daemon -- that listens for a connection and then services it) and on ganesh (host node) I executed:

$bw_tcp g01 
0.065536 11.67 MB/sec 

Very simple. Note that the bandwidth between these two hosts is 93.4 Mbps out of the 100 available, or about 93% efficiency. This is a fairly typical data throughput number for a good TCP connection on 100BT Ethernet with an MTU of 1500, where the theoretical maximum would be 1436 data bytes (an Ethernet frame less the TCP/IP headers) out of every 1518 bytes on the wire, or 94.6 Mbps. Given that there are mandatory pauses between frames, this result is very close to the theoretical maximum indeed.
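The arithmetic behind those figures is easy to check. A quick sketch, using only the numbers quoted above (bw_tcp's 11.67 MB/sec result and the 1436/1518 frame accounting):

```python
# Check the bandwidth arithmetic quoted in the text.
measured_MBps = 11.67                  # bw_tcp's reported result
measured_Mbps = measured_MBps * 8      # bytes/sec -> bits/sec
print(f"measured:    {measured_Mbps:.1f} Mbps")    # 93.4 Mbps

# Theoretical maximum per the frame accounting in the text:
# 1436 data bytes in every 1518-byte Ethernet frame at 100 Mbps.
theoretical_Mbps = 1436 / 1518 * 100
print(f"theoretical: {theoretical_Mbps:.1f} Mbps")  # 94.6 Mbps
```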

Similarly, lat_tcp can be used to measure the latency. After running lat_tcp -s on g01, the latency can be measured by running the following on ganesh:

$lat_tcp g01
TCP latency using g01: 145.8439 microseconds

This number is a bit higher than it might be, probably because there are two switch hops between these two hosts. We can test this hypothesis by running the same measurement between two hosts on the same switch:

$lat_tcp g01   
TCP latency using g01: 90.9072 microseconds

which is around 55 microseconds faster, as one would expect.
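For the record, the "around 55 microseconds" is just the difference between the two lat_tcp measurements quoted above:

```python
two_hops_us    = 145.8439  # lat_tcp across two switch hops
same_switch_us = 90.9072   # lat_tcp on a single switch
print(f"difference: {two_hops_us - same_switch_us:.1f} microseconds")  # 54.9
```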

Both benchmarks permit variation of message size and repetition in order to obtain a statistical picture of network performance at various scales. Graphing performance as a function of message size is often very revealing.
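To see what such a microbenchmark does under the hood, here is a hedged Python sketch of a TCP "ping-pong" timing loop in the spirit of lat_tcp -- purely illustrative, as lmbench's real code is carefully tuned C and these function names are my own:

```python
import socket
import threading
import time

def echo_server(sock):
    # Accept connections forever and echo bytes back: a stand-in for the
    # "server" side (lat_tcp -s / NPtcp -r) described in the text.
    while True:
        conn, _ = sock.accept()
        with conn:
            while data := conn.recv(65536):
                conn.sendall(data)

def round_trip_us(host, port, size, reps=100):
    # Average round-trip time, in microseconds, for size-byte messages.
    msg = b"x" * size
    with socket.create_connection((host, port)) as s:
        start = time.perf_counter()
        for _ in range(reps):
            s.sendall(msg)
            got = 0
            while got < size:            # the echo may arrive in pieces
                chunk = s.recv(65536)
                if not chunk:
                    raise ConnectionError("server closed early")
                got += len(chunk)
        return (time.perf_counter() - start) / reps * 1e6

if __name__ == "__main__":
    # Demo against a local echo server: sweep message sizes, as the text
    # suggests, to see latency grow with payload.
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))           # any free loopback port
    srv.listen(5)
    port = srv.getsockname()[1]
    threading.Thread(target=echo_server, args=(srv,), daemon=True).start()
    for size in (1, 64, 1024, 16384):
        print(f"{size:6d} bytes: {round_trip_us('127.0.0.1', port, size):8.1f} us")
```

Graphing the output of a loop like this one, message size against time, is exactly the "revealing" exercise suggested above, though loopback numbers will of course be far better than anything a real wire delivers.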


As noted, netpipe has been covered in a recent issue so we will only briefly review it here for purposes of comparison. After downloading its source (see Resource Sidebar) and building it, one similarly runs a "receiver" (daemon) on the remote host:

$NPtcp -r

Then the benchmark itself is run on the host to measure times and rates connecting to the remote host. An example of netpipe is shown in Figure Two.

$NPtcp -t -h g01 -P
Latency: 0.000079
Now starting main loop
  0:         1 bytes 3179 times -->    0.11 Mbps in 0.000072 sec
  1:         2 bytes 3461 times -->    0.21 Mbps in 0.000073 sec
Figure Two: Example netpipe results

The most interesting thing to note is that netpipe and lmbench get very different answers for the single packet latency: 145 microseconds versus 79 microseconds. They differ elsewhere as well. This result leads us to a number of very natural questions (such as: which one is correct?) which we will defer to a future column. For the moment, let us accept each as valid (sort of) in the context of comparing systems but not necessarily accurate measurements of anything but the particular kind of code used for the test in the two cases.

This work is licensed under CC BY-NC-SA 4.0

©2005-2023 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.