Hits: 11727

Know the network, know the cluster.

In previous columns we looked briefly at compute node hardware for use in clusters and learned some basic cluster design principles. The most important of these was to focus on the problem at hand – getting the most work done for the least amount of money, effort, and time invested. This analysis means being able to resist the natural inclination to buy nodes with the highest possible clock speed to the exclusion of all else (unless you know for a fact that clock speed is the only important rate determining factor for your cluster and that the highest clock nodes buy you the most clock cycles per second, in aggregate, for your money, which is almost never true). We learned that there were actually quite a few very broad systems descriptors – CPU architecture, clock, memory architecture, disk, the network – any one of which might turn out to be "the" rate limiting resource that bottlenecks your application and determines the amount of work you get done for a given dollar investment.

We learned to study our own applications to understand where they are likely to be bottlenecked (what kind of work is being done where it is doing the most work and what resource is holding it back). We learned to use that application as the critical benchmark wherever possible – measuring the performance of a node prototype on your application beats the heck out of trying to predict that performance on the basis of the vendor's published benchmark numbers or a bare knowledge of its CPU clock plus the fact that Joe down the hall told you that the boxes were "really fast" and Joe is well known to be an expert in all things computational.

However, we have barely scratched the surface in terms of learning about computer architecture. Things we haven't discussed at all include the details of computer architecture – how motherboards are organized, just what a bus is and why bus speed and width might be important to your application, how a microprocessor talks to memory (and how MANY different kinds of memory there are likely to be on a system), and how the processors talk to peripheral devices. We haven't discussed operating systems and compilers and libraries. And we haven't discussed the network.

All of this is important in cluster design, but if you are armed with the right benchmarks (your application) and are somewhat constrained in your choice of operating systems and compilers and libraries (or are just going to stick with Linux, gcc, and the standard libraries that accompany your favorite Linux distribution) then knowing all of these details yet isn't that important. Of course, knowing some of these details is important, and that's where we are going next.

If you've been reading this column from the beginning, you should already have a pretty good idea that the network is almost as important as the computer in cluster design. Yes, the computer nodes do the actual computational work, but we derived in moderate detail equations for the speedup you can expect dividing a task among many nodes, and it turned out that in many cases it was the speed of the network that determined whether or not the application would actually complete in less time when spread out on many nodes. This result has profound implications in cluster design, which we will now explore.

A Networking Primer

Networking is one of those subjects that is really too deep and complex to be reducible to a "simple" column that makes you an instant expert. However, the world's greatest expert (it isn't me – I don't even know the world's greatest expert) wasn't born knowing all about networking, they had to start somewhere. So today we're going to start here. If you are already an expert, my apologies – skip ahead to the next article, go get a beer, stare out the window for a while.

What is a network? For our purposes it is everything associated with communications between an application on computer A and another application on computer B. We will abstract the communication process only one degree and assume that the information being communicated is transmitted in finite-size chunks called packets. A packet can be thought of as a single unitary burst of information going from computer A to computer B or vice versa.

Whoa, you say. Hold on there, dawg. That means my application's communications code, the communications libraries it calls, the operating system that runs that communications code on the CPU, the CPU itself, the computer's memory, its bus(es), its physical device(s) such as network interface(s), the physical wire itself, and all of it repeated all over again on the other end, that's the network?

Yes, that's it exactly. The network is the totality of how information gets between applications, and technically includes the networked applications themselves! It can even describe at least some of the ways that information gets between two applications running on the same computer. The network is more than just the physical layer over which data passes between two computers!

Here is the venerable ISO/OSI (International Standards Organization, Open System Interconnect) model for "a network". It consists of seven layers, the last of which are themselves applications! In this model, you the user are the final layer of any network:

  1. The Physical Layer. This is layer defines the physical media over which the network runs, e.g. wire or optical fiber or radio frequency.
  2. The Data Link Layer. This defines the way raw data is encapsulated for transmission on the physical layer – the size and structure of basic packets of information. This layer is generally associated with the actual interface in your computer to which the physical layer is coupled. The one you are most likely to be familiar with is Ethernet, although there are others.
  3. The Network Layer. This layer is responsible for routing data between specific systems and contains (possibly elaborate) mechanisms for assigning and retrieving a systems identity and address information. In general the network layer is represented by the Internet Protocol (IP), which specifies its own addressing information, data encapsulation, and has its own packet header (itself encapsulated within an e.g. Ethernet packet).
  4. The Transport Layer. This layer manages transmission control and the division of long messages into suitable packets. It is also typically the layer where reliability is built into the network protocol (or not). There are two protocols in common use on top of IP. TCP (Transmission Control Protocol), is a reliable but relatively slow protocol that manages sequencing of a stream of packets and retransmission on failures (both important to run a service over a wide area network where packets that are part of the same message can take different routes and arrive out of order or missing). TCP is state-aware – a connection is maintained for an extended period of time for most TCP transactions. UDP (User Datagram Protocol) is stateless/connectionless, unreliable (in the sense that the protocol itself doesn't provide sequencing or retransmission), and much faster. Important services are built on top of both of these protocols.
  5. The Session Layer. The layers above are all recognizable as "the network" to most people, but the last three layers move increasingly into application and user territory. The session layer defines how data moves over the network. Session layer tools in Linux are typically Remote Procedure Calls (RPCs) that manage data transformation transparently for the user.
  6. The Presentation Layer. This layer continues this work by providing a "standard" data representation that is independent of the host. The "eXternal Data Representation" (XDR) sits at this layer; tools convert local data into a canonical form and vice versa.
  7. The Application Layer. This layer contains network services (user applications). For example: mail, ftp, http, DNS (domain name service) and more are all network applications.

Although the ISO/OSI model is well-reasoned and respected, it is also a bit more complex than it needs to be. Much of the software in common use in the Unix/Linux world (and by inheritance in the Windows world as well) can be equally understood in terms of the more economical TCP/IP model:

  1. The Link Layer. This layer includes layers 1 and 2 above, and includes the physical network, the network devices, and their (raw) device drivers. "100BT Ethernet on UTP" is one possible description of such a layer. Others include ATM (Asynchronous Transmission Mode) on various media and a variety of cluster specific, proprietary link layers such as Myrinet, Scalable Coherent Interconnect (SCI), QsNet, and more.
  2. The Network Layer. This layer is more or less the same as the ISO/OSI layer – IP and things such as ICMP live here to help the network route physical packets.
  3. The Transport Layer. Again the same, this layer is e.g. TCP and UDP.
  4. The Application Layer. Also the same.
This model skips the session and presentation layer. In other words, in the TCP/IP model, the physical network isn't split into media and device, and the user is left responsible for any transformations required for the data. This model is popular because the near-universal acceptance of TCP/IP networking over Ethernet as the basis for e.g. the web has led to far more commonality in data representation than there once was, and because the tools required to manage data conversion add significantly to both overhead and security risk.

It is still useful to know something about both models, including the missing parts, when engineering a cluster. In future columns, we'll get really serious about learning about the network, and look inside some actual packets, because a bunch of computers without a network (even a "network" consisting of a grad student that walks from system to system carrying a floppy disk cannot do work together and is therefore not a cluster by any definition. For now, though, let's focus on a few very important design elements of networking in the context of cluster computing.

Latency, Bandwidth, and Interface.

The cruel truth of cluster design is that for many problems the speed of the computer itself is less important than the speed of the network. If the problem requires a lot of communication per step of computation, communication speed may well be the rate determining component of the overall computation.

If the computation requires sending lots of small packets to many different hosts per computational step, it is likely to be latency bound, that is, bound by the speed with which the computer can set up and send a small packet of information, possibly containing only a single byte of data. If the computation requires sending large packets between many hosts or a few, it is more likely to be bound by the bandwidth of the interface – the amount of data per unit time it can send when the data is encapsulated for maximum transmission rate. A given network can have high bandwidth and poor latency. It is in principle possible for it to have low bandwidth and good latency. The best, and most expensive, networks have high bandwidth and good (low) latency.

What determines the latency and bandwidth of a network (recalling that this is everything between the running executable on one node and that on another node)? In rough order of importance, we start with the link layer itself, including its bus interface to the computer's hardware, puts hard upper bounds on bandwidth and hard lower bounds on latency. A given network card, on a given bus, interconnected with a given medium, simply cannot exceed the physical limitations imposed by the physics and engineering of the overall design. In many cases the link layer is named for its upper-bound bandwidth: 100BT Ethernet can send as much as 100 million bits per second of information, including all link-layer headers and overhead, down UTP wires. Gigabit Ethernet can send (you guessed it) at most 1 billion bits per second.

Sidebar: Sneakernet
There is indeed a network known to networking gurus as "sneakernet". As an exercise, think about how and why this can be considered a network. Not necessarily a particular slow network, by the way – imagine the grad student carrying a 200 GB external hard disk from one computer to another instead of a floppy! [There is also bongo-net – Ed.]

The link layer also comes with a latency minimum. This limit might be determined by the amount of time required to send a minimum-length packet in the link protocol, plus a protocol-mandated pause in between packets, for example. Real world link layer latency will almost certainly be larger, because in nearly all situations the packets will pass through a network switch, and the switch itself adds 5-20 microseconds of latency as it routes the packets here and there. Latency can be much higher on a loaded network, as Ethernet is particularly expensive (inserting mandated dead times) when there is contention for a particular wire or device.

The latency seen by an application will be larger still! This results because both the network and transport layers (needed to route the packets properly and to arrange for reliable delivery of messages) add overhead. Much of this overhead is inside the kernel in the TCP stack, for TCP/IP communications, and is therefore beyond your ability to control or minimize. To give you an idea of the penalty here, pure link layer Ethernet latency on a high end card with direct memory access (DMA) might well be < 20 microseconds including the intermediary switch and the network driver component. By the time you add in the TCP layer, this can increase to 50-60 microseconds even on a quiet line.

Using Netpipe (See our tutorial on Netpipe) one can directly measure network performance (both latency and bandwidth) for TCP stream packets of different size passing between two hosts. For example, on my home network on a relatively quiet and very cheap 100BT Ethernet switch (but with relatively expensive 3Com 3c905 Ethernet cards) I measure 55 microseconds per packet sending packets with a payload of one byte. This is a "data bandwidth" of only 0.0175 megabytes per second!

Netpipe also tells me what the bandwidth is for large messages, ones that are very much limited by wire capacity. On this particular link, I can get just under 90 megabits per second out of 100 theoretical capacity. 90% of wire speed is fairly typical for a well designed Ethernet, although one can easily do much worse if one uses cheap network components! Beware especially the network adapters that come "free" on many motherboards or that are sold for $10 apiece by many vendors. These adapters are often missing buffers and DMA capabilities that make them far slower than a better card. You might not notice it on a PC running Windows and doing nothing but browsing the web and reading email, but in a high performance computing application such a card can seriously cripple a cluster.

In most parallel applications using the network, network performance will be still a bit worse than this. Most people do not write raw socket code to pass messages between application components running on different nodes – they use a message passing library that handles a lot of the bookkeeping and details of reliable message transmission for them such as PVM or MPI. This task adds one more layer of overhead, increasing latency and decreasing bandwidth to the application a tiny bit. In most cases, though, this is well worth it as PVM or MPI will likely do a better job of implementing this essential bookkeeping than you would writing it all yourself inside your application. Netpipe can be linked directly to PVM and MPI to test performance using the library calls for communications instead of raw sockets. It is therefore a very versatile tool.


We aren't finished with the network by any means. This month's column has done little more than scratch the surface of networking as a crucial design component of a serious Beowulf or compute cluster. At this point, though, you should at least understand the network in very general terms (and should understand the meaning and importance of terms like latency and bandwidth and application network interface).

There is so much more to learn, though! We haven't discussed the details of TCP/IP (the contents and layout of the Ethernet, the IP, and the TCP layer headers). Although I've mentioned Netpipe as a useful tool, I haven't told you where to get it or how to use it. And we have yet to discuss the more expensive "super-networks" that are what you need to consider if you are designing a cluster for a fine-grained, synchronous parallel application! These are the networks with application layer latencies of perhaps 1-5 microseconds, and with peak bandwidths higher than a mere gigabit per second. In many cases, they accomplish these excellent numbers by providing you with libraries for e.g. PVM (sometimes) or MPI (nearly always) that don't use TCP/IP at all! They provide their own custom network and transport layer (which are still required if you want reliable delivery of packets) and these libraries directly use hardware features to ensure minimum latency and maximum bandwidth in packet delivery.

We'll come back to networking many times in future columns and see if we cannot make you real experts. A good cluster engineer should know how to use Netpipe to measure real world performance of a network and a tool such as tcpdump to be able to look at raw packets and understand at least approximately what they are seeing. So should a good cluster software engineer – writing an efficient network application requires a deep understanding of the delays introduced by network IPCs, as we saw in our very first few columns in this space.

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Robert Brown, Ph.D, is has written extensively about Linux clusters. You can find his work and much more on his home page