Cluster Networking: The Dark Side of IP over Ethernet | Cluster Newbie

The cruel truth about IP Datagrams and other things you may have forgotten (or never learned).

In the last column we learned some things that every Cluster Engineer should know about Ethernet and the Internet Protocol (IP). The former specification, recall, is defined by IEEE documents that are "open" but not freely re-publishable; the latter by fully open RFCs that you can read yourself for free and from which I can actually cut and paste while describing them. The article contained a synopsis of information from RFC 791 (IP) and referenced RFC 792 (ICMP) and RFC 894 (IP over Ethernet).

Of course when you read that article (studied it, really) you noticed that the smallest packet that can deliver one single byte of actual data via IP over Ethernet is exactly 64 bytes, the smallest permitted Ethernet packet size. Of these 64 bytes, 18 bytes are Ethernet header and CRC, 20 bytes are IP header, one byte is data, and the rest (25 bytes, about 40% of the packet) is "padding" (although in practice it will generally be at least partially used for higher level headers, e.g. the TCP header discussed below).

It is worthwhile to spend a moment meditating upon this cruel truth. The ratio of 63 bytes of mandatory envelope (40% of which might well be "blank paper") to one byte of message is one reason that IP over Ethernet is a poor choice for cluster designs and applications that are expected to send lots of small messages, and we haven't even gotten to the TCP layer yet (which uses some of the wasted padding for its own header but doesn't alter the 63:N ratio of overhead to message for sending small messages with N bytes).
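The byte accounting above is easy to check for yourself. What follows is only a sketch: the constant and function names are my own inventions, and it assumes a standard Ethernet frame with an option-free 20 byte IP header.

```python
ETH_OVERHEAD = 18   # Ethernet header plus CRC, as counted in the text
IP_HEADER = 20      # IP header with no options (assumption)
MIN_FRAME = 64      # smallest permitted Ethernet packet
MAX_FRAME = 1518    # largest standard packet (MTU of 1500)

def frame_budget(data_len, upper_header=0):
    """Return (frame_size, padding, overhead) in bytes for one packet.

    upper_header is the size of any header carried above IP:
    0 for bare IP, 8 for UDP, 20 for TCP without options.
    """
    unpadded = ETH_OVERHEAD + IP_HEADER + upper_header + data_len
    if unpadded > MAX_FRAME:
        raise ValueError("data does not fit in one standard frame")
    frame = max(unpadded, MIN_FRAME)   # short frames are padded up to 64
    padding = frame - unpadded
    overhead = frame - data_len        # everything that is not message
    return frame, padding, overhead

print(frame_budget(1))       # one byte of data over bare IP
print(frame_budget(1, 20))   # one byte of data over TCP
```

Both calls report a 64 byte packet with 63 bytes of overhead. The only difference is how much of the envelope is blank paper: 25 bytes for bare IP and 5 bytes once a 20 byte TCP header moves in.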

While we are considering cruel truths, we also should recall that (counted or not) there are 8 more bytes of metaphorical bell-ringing time required to raise the carrier and grab the line to send any packet at all, and that the probability of collisions goes way up if our metaphorical room full of individuals with messages for one another have to shout their messages one word at a time. These observations have led to the development of quite a number of alternative network protocols that transmit packets in rings, use much smaller headers, or do more with dedicated hardware. In future columns we will probably get around to looking at some of these efficient but expensive networks.

Fortunately (or not) on many systems the ratio of header size to data size turns out to be nearly irrelevant because other elements, many of them hardware based, determine the irreducible packet latency (the absolute minimum time between Ethernet packets of minimum size). For example, between two systems in my home network (switched 100BT) the transmission latency for a 1 byte message (64 byte packet plus 8 bytes worth of preamble) is around 50 microseconds according to NPtcp. (NetPIPE was discussed in the Right Stuff column in the very first issue of ClusterWorld, December 2003, but we'll come back to it in this column fairly soon, as it is a critical network benchmarking tool.)

Achieving 50 microseconds, much of it fixed by the switch and hardware independent of the protocol stack, is actually quite good for a minimum length TCP/IP packet on Ethernet containing a single byte of actual data as these things go. For a packet size roughly twice the minimum the latency is only roughly 68 microseconds, considerably less than twice 50. This is good news and bad news. The good news is that the bandwidth is growing rapidly with packet size as it costs only 18 more microseconds to send some 80 times as much data (we'll learn below just how to compute the TCP data capacity of a 128 byte Ethernet packet) and that this slow growth continues until one approaches packet sizes that saturate the medium.

The bad news is that this is really quite poor in absolute terms. The interface can send at most 20,000 minimum size packets per second, or around 20 KBytes/second, on a network that can carry 10 MBytes/second for large packet sizes. 50 microseconds translates to 50,000 to 150,000 instructions on modern CPUs. In many cluster applications the actual computation will block, wasting all these cycles, while waiting for communications to complete.
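The arithmetic behind those numbers fits in a few lines. This is a back-of-the-envelope sketch, not a benchmark: the function name is hypothetical, and it assumes one message per latency interval and (by default) one instruction per clock cycle.

```python
def small_message_costs(latency_us, data_bytes, clock_mhz, instr_per_cycle=1.0):
    """Return (packets/second, data bytes/second, instructions per latency).

    latency_us      -- time per packet in microseconds
    data_bytes      -- message bytes carried by each packet
    clock_mhz       -- CPU clock in MHz (cycles per microsecond)
    instr_per_cycle -- assumed instructions retired per cycle
    """
    packets_per_s = 1_000_000 / latency_us
    bytes_per_s = data_bytes * packets_per_s
    instructions = latency_us * clock_mhz * instr_per_cycle
    return packets_per_s, bytes_per_s, instructions

print(small_message_costs(50, 1, 1000))   # 1 GHz CPU
print(small_message_costs(50, 1, 3000))   # 3 GHz CPU
```

At 50 microseconds per packet this gives 20,000 packets and 20,000 data bytes per second, and 50,000 or 150,000 instructions per packet time on a 1 GHz or 3 GHz CPU. If you assume a 128 byte packet carries 70 data bytes (18 + 20 + 20 bytes of Ethernet, IP, and option-free TCP headers), the same function at 68 microseconds gives about 1 MByte/second.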

This is not all of the dark news. Last month we learned that IP is lovely, but the protocol by itself has many warts. It is not very reliable. A variety of things can cause a network to drop occasional packets. IP has no way of positively identifying that a packet has been dropped, even if it is in the middle of an important message, and has no way of requesting that the dropped packet be retransmitted. IP has no way of dealing with the vagaries of networks with complicated routes, where packets that are part of a single (fragmented) datagram can arrive out of order. IP can send datagrams of at most 64 kilobytes in length, but real messages might be much longer.

Applications require "connections", and their connections need to be multiplex-able -- several applications on one host need to be able to exchange data with several applications on another host "at the same time" -- but IP is connectionless. It just drops a packet onto a wire in the untested expectation that it will be received, is totally unaware of higher level applications, and does not support the notion of a persistent "connection" between applications running on different ends of a network.

We cannot do much about the latency issue for IP on Ethernet as most of the problem is beyond our control -- in the hardware itself or equally inaccessible in the kernel. However, we can do quite a lot to achieve application connectivity, flexibility, and reliability. To get there, we need to add one or more layers in the ISO/OSI stack. This requirement leads us to learn about the Transmission Control Protocol (TCP) and its unreliable (but faster) cousin, User Datagram Protocol (UDP). Let's look at the latter first.

The User Datagram Protocol

By now the idea should be familiar. Ethernet is very low overhead but not routable or robust and does not support the notion of connections (persistent or otherwise) between applications as opposed to kernels. IP adds routability via a header tailored to that purpose, but remains less than robust and doesn't grok connections between applications. There are two ways to get reliability and connections. One is to invent a big new header with support for both. This idea adds a certain amount of overhead to every connection, whether or not every feature is being used or is necessary. For some applications, missing the occasional packet may not matter compared to getting the packets one does get as efficiently as possible. The other idea is to add connections only (an abstraction we will assume is required for any pair of applications to talk over a network) and let those applications deal with reliability to the extent that they feel appropriate.

This is the basis of UDP, defined in RFC 768. The UDP header (copied verbatim from this RFC) is very, very simple and shown in Figure One:

  0      7 8     15 16    23 24    31  
 +--------+--------+--------+--------+ 
 |     Source      |   Destination   | 
 |      Port       |      Port       | 
 +--------+--------+--------+--------+ 
 |                 |                 | 
 |     Length      |    Checksum     | 
 +--------+--------+--------+--------+ 
 |                                     
 |          data octets ...            
 +---------------- ...

Figure One: UDP header specification (Note the bitwise layout.)

This design is simplicity itself. UDP introduces a new abstraction, that of the port. A port is presumed to be associated with an application running on either end of the connection. Yes, UDP is described as a "connectionless protocol" but by this it is meant that UDP connections are not persistent and verified, not that they are not defined by the addition of the notion of the port.

We won't say much about ports here. You can read /etc/services if you want to see what "well known ports" are known on your machine(s), or you can read any of a long string of RFCs that define or modify this list. There are also ports that are reserved from being assigned in this way so that applications can find free port numbers on both ends to create connections dynamically without blocking a well known port in the meantime.

Beyond the source and destination ports, the header contains the length of the UDP message, including header, and yet another checksum. (A port should be thought of as being concatenated with an IP number, so that a packet addressed to port 80 on host 192.168.1.129 will be received by the application listening on port 80 of host 192.168.1.129.) Following this is the data, which for the purposes of the checksum calculation is once again treated as padded so that it contains an even number of bytes. The UDP header is 8 bytes long, so our minimum packet now becomes 18 bytes of Ethernet header and CRC encapsulating 20 bytes of IP header encapsulating 8 bytes of UDP header (46 bytes total header) encapsulating a data message padded as required (at the Ethernet level) so that the minimum packet length is 64 bytes and the maximum is 1518 (for a standard MTU of 1500)!
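To make the layout in Figure One concrete, here is a minimal sketch that packs and unpacks the four 16-bit header fields in network byte order. The function names are mine, and the checksum is simply left at zero, which UDP over IPv4 interprets as "no checksum computed".

```python
import struct

# Source port, destination port, length, checksum: four 16-bit fields,
# network (big-endian) byte order, 8 bytes in all.
UDP_HEADER = struct.Struct("!HHHH")

def build_udp_datagram(src_port, dst_port, data, checksum=0):
    """Prepend a UDP header to data; Length counts header plus data."""
    length = UDP_HEADER.size + len(data)
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + data

def parse_udp_datagram(datagram):
    """Split a UDP datagram back into its header fields and data."""
    src_port, dst_port, length, checksum = UDP_HEADER.unpack_from(datagram)
    return {
        "source_port": src_port,
        "destination_port": dst_port,
        "length": length,
        "checksum": checksum,
        "data": datagram[UDP_HEADER.size:length],
    }

datagram = build_udp_datagram(40000, 80, b"x")
print(datagram.hex())
print(parse_udp_datagram(datagram))
```

A one byte message becomes a nine byte datagram. Everything beyond that, up to the 64 byte minimum, belongs to the IP and Ethernet envelopes or is padding.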

Corrupted packets can be detected and dropped at the UDP level, at the IP header level, and at the Ethernet level. It is up to the applications sending and receiving data to ensure that message streams arrive in the right order, that dropped packets or corrupted messages are retransmitted, and so forth. Such problems are "unlikely" to occur on local area networks where the packets cannot take multiple routes to their destination and where Ethernet itself ensures a fairly reliable delivery of packets on good hardware, so UDP is often used for local, non-persistent connections where sender and receiver are "on the same wire". It is also used for applications that want to achieve reliability at the absolutely lowest cost and think that they can beat TCP. Some applications in this category include parallel computing messaging libraries (e.g. PVM) and core network services such as NFS.
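The checksum used by the IP, UDP, and TCP headers is the 16-bit ones' complement sum described in RFC 1071. The sketch below computes it over whatever bytes it is handed. A real UDP checksum also covers a "pseudo header" of IP addresses, which is left out here for brevity, and the function name is my own.

```python
def internet_checksum(data):
    """16-bit ones' complement of the ones' complement sum of data."""
    if len(data) % 2:
        data += b"\x00"            # pad for the calculation only
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]   # big-endian 16-bit words
    while total >> 16:             # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

message = bytes.fromhex("0001f203f4f5f6f7")    # example bytes from RFC 1071
checksum = internet_checksum(message)
print(hex(checksum))

# A receiver recomputes over data plus checksum; zero means "looks intact".
print(internet_checksum(message + checksum.to_bytes(2, "big")))
```

A receiver that gets a nonzero result simply drops the packet. Nothing at this level asks for a replacement, which is exactly the gap the application (or TCP) has to fill.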

TCP isn't that easy to beat, though. While it does have (perhaps) more controls than many connections need, as we observed last month latency scales weakly with packet length and the TCP code tends to be pretty well optimized (having been kicked around performance-wise for a rather long time). TCP does have some annoying features associated with its notion of persistent connections (which can be both blessing and curse). As I write this column, vendors are appearing that promise to move the TCP stack out of the kernel altogether and into the network interface. This feature will lower the latency cost of TCP and (perhaps more importantly) reduce the CPU burden of doing the actual work associated with reliable (re)transmission of data on an existing connection.

The Transmission Control Protocol

Like UDP, TCP adds support for connections by encapsulating the data being sent inside a TCP header encapsulated inside an IP datagram encapsulated in an Ethernet frame. However, it also adds reliability. TCP absolutely positively guarantees that if the hosts at both ends are functioning normally and the network in between isn't too horribly broken, messages will be delivered between applications without corruption. Period. If the message cannot be so delivered (indicating a "broken connection"), it promises to let the application know so it can try to re-form the connection and retransmit the data at a later time.

This reliability is not cheap. It costs latency and CPU overhead and a bit of bandwidth but provides essential reliability for those who want to live in a reasonably deterministic and functional networking universe, especially one that extends over a wide area network where the second and third packets of a message stream can literally take different routes to their destination at the whim of intermediate routers with the third arriving before the second.

To learn about TCP we as usual look up the appropriate RFC (in this case 793). There we learn that the TCP header (basically the first part of an IP datagram's data section in a TCP/IP message) looks like Figure Two.

    0                   1                   2                   3   
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |       Destination Port        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Acknowledgment Number                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Data |           |U|A|P|R|S|F|                               |
   | Offset| Reserved  |R|C|S|S|Y|I|            Window             |
   |       |           |G|K|H|T|N|N|                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Checksum            |         Urgent Pointer        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             data                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure Two: TCP header format

We note several additions compared to UDP. TCP supports the notion of persistent connections, each identified by the pair of sockets (address plus port) at its two ends. Applications that communicate via a network connection have associated ports (as did UDP's transient connections). Typically an application on one host listens on one port awaiting a connection while an application on another host requests a connection, usually from a port drawn from the pool of open, unassigned ports. When a connection is obtained, the listening application can either service it itself until it is broken by either party (delaying further connections in the meantime) or fork a copy of itself to manage the new connection. Because each connection is distinguished by the address and port on both ends, the original application can then go back to listening for new connections on the original port while the forked copy manages the persistent connection until it terminates.

This is the basis for forking daemons and nearly all Unix network services. Let's look at the rest of the header.
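The listen, accept, and hand-off pattern can be sketched in a few lines of Python. This is an illustration only, not how any particular daemon is written: it hands each connection to a thread rather than a forked process so that it runs unchanged on any platform (on Unix, the standard library's socketserver.ForkingTCPServer does the same job with a real fork), and all the function names are mine.

```python
import socket
import threading

def handle(conn):
    """Echo everything back until the peer closes the connection."""
    with conn:
        while True:
            chunk = conn.recv(4096)
            if not chunk:          # empty read: the other end hung up
                break
            conn.sendall(chunk)

def serve(listener, max_connections):
    """Accept connections and hand each one to its own worker."""
    for _ in range(max_connections):
        conn, _peer = listener.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()
        # the loop goes straight back to listening on the original port

def start_echo_server(host="127.0.0.1", port=0, max_connections=1):
    """Start listening; port 0 asks the kernel for any free port."""
    listener = socket.create_server((host, port))
    threading.Thread(target=serve, args=(listener, max_connections),
                     daemon=True).start()
    return listener
```

To try it, call start_echo_server(), read the port the kernel chose from listener.getsockname()[1], and connect to it with socket.create_connection. Whatever you send comes straight back.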

To detect and correct packets that arrive out of order a sequence number is added. The Acknowledgment number permits handshaking -- if a receiver doesn't acknowledge receipt of each packet in sequence within a reasonable timeout, the sender retransmits the missing packet. The data offset points to the first byte of data (just past the end of the header). There are a number of control bits used for specific purposes beyond the scope of this article (read the RFC, which is a good idea anyway). There are several more fields, but the most important field remaining is the checksum, which as always is used to detect a corrupted packet. Last there is the data itself, a TCP stream encapsulated in an IP datagram encapsulated in an Ethernet frame.
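As with UDP, the picture in Figure Two translates directly into a few lines of unpacking code. The sketch below decodes the 20 fixed bytes of the header and uses the data offset (counted in 32-bit words) to find where the options end and the data begins. The names are mine, and it makes no attempt to verify the checksum or interpret the options.

```python
import struct

# Ports (2 x 16 bits), sequence and acknowledgment numbers (2 x 32 bits),
# offset/reserved/flags, window, checksum, urgent pointer (4 x 16 bits).
TCP_FIXED = struct.Struct("!HHIIHHHH")
FLAG_NAMES = ("FIN", "SYN", "RST", "PSH", "ACK", "URG")  # bit 0 upward

def parse_tcp_segment(segment):
    """Decode the fixed TCP header and split off options and data."""
    (src, dst, seq, ack, off_flags,
     window, checksum, urgent) = TCP_FIXED.unpack_from(segment)
    data_offset = (off_flags >> 12) * 4    # header length in bytes
    flags = [name for bit, name in enumerate(FLAG_NAMES)
             if off_flags & (1 << bit)]
    return {
        "source_port": src,
        "destination_port": dst,
        "sequence": seq,
        "acknowledgment": ack,
        "data_offset": data_offset,
        "flags": flags,
        "window": window,
        "checksum": checksum,
        "urgent_pointer": urgent,
        "options": segment[TCP_FIXED.size:data_offset],
        "data": segment[data_offset:],
    }

# A hand-built segment: SYN flag set, no options (data offset of 5 words).
example = struct.pack("!HHIIHHHH", 40000, 80, 1000, 0,
                      (5 << 12) | 0x02, 65535, 0, 0) + b"hi"
print(parse_tcp_segment(example))
```

With no options the header is 20 bytes, two and a half times the size of UDP's, and that is before any of the handshaking traffic is counted.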

There are many details of TCP that are important but that we are perforce skipping. For example, TCP undergoes some rather elaborate rituals establishing connections, transmitting each packet in a sequence, handshaking so that correct-order receipt of the uncorrupted message is ensured, and breaking connections either deliberately (because one or both socket ends are closed) or because an end application, an end host, or the network itself fails. TCP has to be prepared to deal with literally anything that can happen on a network, as in the best tradition of Murphy's Law, anything that can go wrong eventually will. TCP has to provide enough power at the application level that an application can guarantee reliable delivery of a message -- eventually -- while not causing the system or a well-written TCP-based application to actually crash due to a failed connection.
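To give a flavor of just one of those rituals, here is a toy receiver that uses sequence numbers to put segments back in order and answers each one with a cumulative acknowledgment. It is a deliberately simplified sketch and not TCP itself: the class name is hypothetical, segments are assumed never to overlap partially, and sequence number wraparound, windows, and timers are all ignored.

```python
class InOrderReceiver:
    """Deliver bytes in sequence order no matter how segments arrive."""

    def __init__(self, initial_seq=0):
        self.next_seq = initial_seq   # next byte number we expect
        self.pending = {}             # seq -> data held until its turn
        self.delivered = bytearray()  # what the application gets to see

    def receive(self, seq, data):
        """Accept one segment; return the acknowledgment number to send."""
        if seq >= self.next_seq:      # anything older is a duplicate
            self.pending.setdefault(seq, data)
        # Deliver every segment that is now contiguous with what we have.
        while self.next_seq in self.pending:
            chunk = self.pending.pop(self.next_seq)
            self.delivered += chunk
            self.next_seq += len(chunk)
        return self.next_seq

receiver = InOrderReceiver()
print(receiver.receive(0, b"ab"))   # in order: acknowledge through byte 2
print(receiver.receive(4, b"ef"))   # arrived early: still waiting for byte 2
print(receiver.receive(2, b"cd"))   # gap filled: everything is delivered
print(bytes(receiver.delivered))
```

A sender that keeps seeing the same acknowledgment number knows which byte the receiver is still waiting for, and that is what it retransmits when its timeout expires.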

Mind you, there is a lot of wiggle room in that "well-written" descriptor. A badly written application may well crash or hang if a connection fails, and it isn't horribly easy to write an application well using low level network system calls. This is one of many reasons that most cluster applications that use network IPC mechanisms do so via message passing libraries such as MPI or PVM that are well-written and insulate you from the care of managing a socket without getting into trouble.

Still, there is nothing like writing a daemon of your own, especially if you want absolutely maximal efficiency. One day this column will likely show you how.

Conclusion: Looking at TCP/IP Packets

One of the best ways to learn about real networks and debugging is to watch one work. In Linux the tool that permits you to do this is /usr/sbin/tcpdump. For example, try:

 tcpdump -i eth0

This dumps packet headers. Read the man pages for tcpdump to see all the different options and ways one can probe the network for problems. This application must be run as root. Other interesting options to try include:

 tcpdump -e -i eth0

(dumps Ethernet headers) or

 tcpdump -c 100 -s 0 -X -i eth0 -l | tee eth0.dump

This latter view lets you see pretty much all of everything. If you don't use ssh or ssl to bidirectionally encrypt network traffic, you can read passwords and valuable data with ease. This view is what you must presume is available to crackers on any traffic that leaves strictly controlled network space (and if you are wise you'll assume it is available to crackers even within your controlled network space).

The final tool worth mentioning is nmap (likely /usr/bin/nmap on your system, if installed). This tool is a security probe and you might annoy your system administrator if you probe your network for security holes, so use this with caution unless you are said system administrator probing your own network. nmap can yield all sorts of valuable data about ports and services that are open and listening on any given host. It is thus a way to see if some service you think is being offered is in fact there as well as a way of determining whether or not some cracker is offering a back door service that you didn't know was there.
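The simplest question nmap answers, "is anything listening on this port?", comes down to attempting a TCP connection and seeing whether it succeeds. The sketch below does just that for a single port. It is in no way a substitute for nmap, the function name is my own, and the same caution applies: only point it at hosts you administer.

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return probe.connect_ex((host, port)) == 0

if port_is_open("127.0.0.1", 22):
    print("something is listening on port 22 of this host")
else:
    print("nothing answered on port 22 of this host")
```

A port that answers tells you only that some application accepted the connection, not which one. Comparing the result against what you expect from /etc/services and your own configuration is where the detective work starts.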

That seems enough for this column. At this point you should have a really good idea of how TCP/IP over Ethernet works. Next month we'll continue with our discussion of the network, concentrating on measuring network performance.

Sidebar: Networking Resources

Charles Spurgeon's Ethernet Web Site: This is a truly excellent resource and has been converted into an O'Reilly book.

Javin's Protocol Dictionary: This site has a nice review of the 802.3 specifications and the structure of packets, in particular the changes associated with gigabit Ethernet.

Charles L. Hedrick's Introduction to the Internet Protocols document is the document I originally used to learn about TCP/IP networking.

The original (and still operant) RFCs that define e.g. TCP and IP: There are many more as well, including RFCs that deal specifically with sending TCP/IP over Ethernet.

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Robert Brown, Ph.D, has written extensively about Linux clusters. You can find his work and much more on his home page.