Cluster Networking: TCP/IP Over Ethernet

Published on Monday, 23 October 2006 07:00
Written by Robert G. Brown

Packets 'n' Protocols: Dark Secrets of Datagrams

Networks are critical components of parallel cluster supercomputers. While really advanced clusters (and big iron parallel supercomputers) use high performance dedicated function networks to connect processors and their associated memory, Beowulf-style cluster computers got their start with TCP/IP over humble Ethernet, and even today there are likely well over ten clusters running inexpensive TCP/IP over Ethernet to one cluster that uses a dedicated and expensive high end network.

Even cluster designs that do have a high end network for interprocessor communications typically have TCP/IP and Ethernet to handle the mundane traffic associated with accessing network disk, distributing tasks, and connecting to nodes for installation and management purposes. Most server-class motherboards come with one or more Ethernet interfaces built right in. It is truly ubiquitous.

Basically, it is simply impossible to become an "expert" on cluster supercomputing without a working knowledge of TCP/IP over Ethernet, and such knowledge should begin from the ground up. This column is devoted to teaching you the essential structure of a TCP/IP packet, also known as a "datagram", as it sits inside an Ethernet packet.

We will begin our exploration by working our way up the first few layers of the ISO/OSI model we described in some detail in my last column. We won't pay a lot of attention to the physical medium per se, since Ethernet (more or less) runs over unshielded twisted pair, coaxial cable, fiber, or even wireless. Instead we'll start with the Ethernet interface itself.

Ethernet

Ethernet was originally developed at Xerox's famous Palo Alto Research Center (PARC), by Bob Metcalfe. It is a network that is largely defined by how it deals with Carrier Sense Multiple Access/Collision Detection (CSMA/CD). That is, by how adapters can share a medium.

It is impossible to communicate everything that you might need to know about Ethernet in a single column. Unfortunately, it is also impossible to provide you with a website that contains the complete, open specification for Ethernet protocols to fill in the inevitable gaps in the review below. This situation exists because "Ethernet" is defined not by the Request For Comments (RFCs) that constitute the truly open specification for TCP/IP but rather by the IEEE 802 family of documents, in particular 802.3. These documents are not free. In order to read them, you must pay (and pay quite a lot) for them.

Much as I would like to spend a few thousand words indicating how annoying and evil I find this, I will refrain. IEEE is more than a professional society, it is a business, and a fairly successful one at that. They provide a valuable service, no doubt, but when I compare this definition of "open specification" with that of the RFC, I find it somewhat lacking.

Fortunately, in spite of this barrier to the actual core technical documentation, it is not terribly difficult to find websites that provide fairly complete technical reviews of Ethernet. These sites provide most of the critical information without violating the actual copyright on the technical documents themselves. Since we are not interested in actually engineering an Ethernet adapter (a process that would almost certainly require the detailed specification) but rather in understanding how adapters work in general terms, these reviews are all that we need. Some of them are collected in the Networking Resources sidebar below; others can be googled up in short order on your own.

To remain concrete, we'll stick to 10/100 Mbps (Megabits per second) Ethernet and describe its "Media Access Control (MAC) Frame" (Ethernet packet, IEEE 802.3 and 802.3u). The details differ for gigabit Ethernet (802.3z) and for jumbo frames (a vendor extension rather than an IEEE standard), but the idea is the same and the differences are minor.

An Ethernet packet has the following format:

Figure One: Ethernet frame (packet)
  +----------+------------+-----------------+-------+
  |  7    1  |  6   6   2 |      46-1500    |   4   |
  | PRE  SFD | DST SRC LT |     Data/Pad    |  FCS  |
  +----------+------------+-----------------+-------+
    CSMA/CD     Header          Message        CRC     

There is a preamble of seven bytes of alternating 1's and 0's, followed by a one-byte start-of-frame delimiter, for eight bytes in all. It is not part of the actual packet -- it is used to grab the line and give receiving adapters time to synchronize to the incoming bit stream.

Think of this metaphorically as ringing a little bell before beginning to talk, with a rule that if you hear a ringing bell, you must remain silent until the bell-ringer is done talking. If you start ringing your bell but (because of e.g. speed of sound delays) you hear somebody else ringing theirs before the mandated interval of bell-ringing ends, you and the other bell-ringer (who has presumably heard yours as well) must both remain silent for a randomly chosen but short period of time before again trying to ring your bell and speaking your message. In Ethernet parlance, this is known as a "collision".

Although it sounds excessively polite and convoluted enough for a Swift dystopia, this collision resolving mechanism is robust and is actually a lovely way to stick dozens of people into a large room with messages to shout to one another at random times and ensure that no two messages are ever shouted out at the same time (which would garble information irretrievably). If only my three boys and all their friends would consent to carry a little bell...
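The "randomly chosen but short period of time" can be made concrete. Standard Ethernet picks it with a scheme called truncated binary exponential backoff, which this column does not spell out, so treat the following as a minimal sketch rather than a description of any particular adapter. The constants reflect my understanding of the 10 Mbps rules, and the function name is purely illustrative.

```python
import random

SLOT_TIME_BITS = 512   # one slot = 512 bit times (51.2 microseconds at 10 Mbps)
MAX_EXPONENT = 10      # the random window stops growing after 10 collisions
MAX_ATTEMPTS = 16      # assumption: the frame is abandoned at the 16th collision


def backoff_slots(collision_count, rng=random):
    """Return how many slot times to stay silent after the nth collision in a row."""
    if not 1 <= collision_count < MAX_ATTEMPTS:
        raise ValueError("no backoff defined: frame not yet collided or already abandoned")
    exponent = min(collision_count, MAX_EXPONENT)
    # Choose uniformly from 0 .. 2**exponent - 1 slots, so the window doubles
    # with each successive collision until it is capped.
    return rng.randrange(2 ** exponent)
```

After a first collision each bell-ringer waits zero or one slot; after a third, anywhere from zero to seven. The widening window is what lets a crowded room sort itself out.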

The frame itself actually begins with a mandatory header that contains the six-byte destination address, the six-byte source address, and a two byte length/type descriptor that can be used to describe the type or length of the contents.

The addresses referred to here are Ethernet addresses, also known as MAC addresses, and are supposed to be unique at the hardware level across all Ethernet adapters in the world. In practice, they often aren't and can sometimes even be specified in software. Altering them in this way can lead (and has led in my direct experience) to spectacular networking failures. The fact that they can be altered at all creates exploitable security holes if they are used (as they often are) as a means of host identification.

Two tools that you might find useful for determining MAC addresses are /sbin/ifconfig (for your own) and /sbin/arp (to determine the address associated with other systems on your network that have sent packets to your system). Read the man pages to see what they do.
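The 14-byte header described above can be picked apart in a few lines. This is only a sketch: it assumes the preamble and start-of-frame delimiter have already been stripped (as they are by the time software sees a frame), and the function name and dictionary keys are my own invention. The rule used to tell a length from a type (values of 1536 and above are types) is the usual convention, not something taken from this column.

```python
import struct


def parse_ethernet_header(frame):
    """Split the 14-byte header off a frame and decode its three fields."""
    if len(frame) < 14:
        raise ValueError("need at least 14 bytes of header")
    # Network byte order: 6-byte destination, 6-byte source, 16-bit length/type.
    dst, src, length_or_type = struct.unpack("!6s6sH", frame[:14])
    return {
        "dst": ":".join(f"{b:02x}" for b in dst),
        "src": ":".join(f"{b:02x}" for b in src),
        "length_or_type": length_or_type,
        # Assumed convention: 1536 (0x0600) and up is a protocol type,
        # 1500 and below is a data length.
        "is_type": length_or_type >= 0x0600,
    }
```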

The Ethernet header is followed by the packet contents, the actual data you wish to send. Note that there is a 46 byte minimum message length. If your actual data is shorter than this, it must be padded e.g. with zeros out to this minimum length.

Finally, the frame terminates with a frame check sequence, a 32 bit cyclic redundancy check (CRC) computed by the sending hardware and recomputed by the receiving hardware. They must match or the frame is rejected as damaged.
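Real adapters compute the frame check sequence in hardware, but the idea is easy to imitate in software. The sketch below leans on the fact that Python's zlib.crc32 uses the same CRC-32 polynomial as Ethernet; the byte order in which the four bytes are appended is my assumption about the wire format, and both function names are hypothetical.

```python
import struct
import zlib


def append_fcs(frame_without_fcs):
    """Return the frame with a 4-byte CRC-32 frame check sequence appended."""
    crc = zlib.crc32(frame_without_fcs) & 0xFFFFFFFF
    # Assumption: the FCS is appended least significant byte first.
    return frame_without_fcs + struct.pack("<I", crc)


def fcs_ok(frame):
    """Recompute the CRC over everything but the last 4 bytes and compare."""
    body, received = frame[:-4], frame[-4:]
    return struct.pack("<I", zlib.crc32(body) & 0xFFFFFFFF) == received
```

Flipping any single bit of a frame built with append_fcs makes fcs_ok return False, which is exactly the "rejected as damaged" case.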

If one adds things up, one will see that the minimum length of an Ethernet packet is 64 bytes (plus the preamble, which is usually ignored) and the maximum length is 1518 bytes -- 14 header, 1500 data, and 4 CRC. The 8 byte bell-ringing period is not counted -- it is part of the minimum interpacket latency. Often the Ethernet header itself isn't counted in discussions as it is fixed for all encapsulated protocols. The 1500 byte maximum data component is called the maximum transmission unit (MTU).
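The arithmetic is simple enough to write down directly. The helper below is just an illustration of the sums in the paragraph above; the names are mine.

```python
ETH_HEADER = 14   # destination (6) + source (6) + length/type (2)
ETH_FCS = 4       # frame check sequence
MIN_DATA = 46     # shorter payloads are padded up to this
MAX_DATA = 1500   # the standard MTU


def frame_length(data_bytes):
    """Length in bytes of the frame carrying data_bytes, excluding the preamble."""
    if not 0 <= data_bytes <= MAX_DATA:
        raise ValueError("data must fit in one MTU")
    # Padding brings short payloads up to the 46-byte minimum.
    return ETH_HEADER + max(data_bytes, MIN_DATA) + ETH_FCS
```

A single byte of data still costs a 64-byte frame, and a full MTU of data costs 1518 bytes.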

The actual upper bound of the number of bytes that can be safely checked with the CRC is around 12000, a number much greater than the standard Ethernet MTU. It costs system resources to build a packet, and the more packets a message has to be broken into, the greater the overhead of sending the message. This potential scaling efficiency has led to an extension of Ethernet called jumbo frames that have a larger MTU. These larger packets can transmit data in pages or blocks that match those used by the kernel or important applications such as NFS (for example, in chunks of 4096 or 8192 bytes plus protocol header length) which increases speed and lowers the overhead. Jumbo frames are not supported by all hardware, but they should be able to coexist within reason with normal MTU frames.
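To get a feel for the savings, one can count the frames needed to move a block of data at two different MTUs. This is a rough sketch only: it ignores the IP and TCP headers carried inside each frame, and the 9000-byte jumbo MTU in the usage note is a commonly quoted value that I am assuming, not one taken from this column.

```python
PER_FRAME_OVERHEAD = 8 + 14 + 4   # preamble/SFD + Ethernet header + FCS


def frames_needed(message_bytes, mtu):
    """Number of Ethernet frames needed to carry message_bytes of data."""
    return -(-message_bytes // mtu)   # integer ceiling division


def framing_overhead(message_bytes, mtu):
    """Bytes spent on Ethernet framing alone for the whole message."""
    return frames_needed(message_bytes, mtu) * PER_FRAME_OVERHEAD
```

An 8192-byte NFS block needs six frames at the standard 1500-byte MTU but only one at an assumed jumbo MTU of 9000, and the system has to build one packet instead of six.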

An Ethernet packet could be used directly to transmit raw data. However, it rarely is. The reason is that Ethernet addresses are not hierarchically organized and hence are not routable in and of themselves over wide area networks -- you have to "be in the same room" and have to have an ARP (address resolution protocol, RFC 826) table to map Ethernet addresses to particular hosts. For example, the MAC address of the wireless adapter of my laptop is 00:04:5A:CE:7F:9B (six bytes, each represented as a hexadecimal number). This knowledge will not help you send it a packet, though, unless you happen to be on the same network as my laptop, which for all practical purposes means "inside my house". And you aren't.

Now, I'd really like for my laptop to be able to send packets to machines that are far away on different networks. To accomplish that, we need a different kind of "address".

Basically, networks and systems on those networks should have human-recognizable, hierarchically organized names that resolve into machine maskable, hierarchically organized addresses that otherwise will function much like an Ethernet address functions. Also, we'd like to be able to establish reliable connections with certain abstractions. These requirements are the basis of the Internet Protocol (and the Internet itself) and are the subject of many RFCs.


RFCs and the Internet Protocol

Unlike Ethernet's "open" but "not public and not free" specifications, the specification documents of the Internet Protocol are truly open, truly public, and truly free. They were developed by a loose consortium of super-geeks working in academia, industry, and in government labs, funded by the Defense Advanced Research Projects Agency (DARPA) and openly published as RFCs, and are one of the most marvelous of human works of all time. DARPA has long since relinquished primary control of the Internet to the Internet Engineering Task Force (IETF) but the Internet remains as a shining proof that defense research can produce peaceful dividends. The RFC process itself has proved to be a tremendous contribution in its own right. It is a nearly perfect realization of a genetic optimization process that allows for evolutionary growth, and is the direct parent for hundreds of mailing lists and development groups that even today directly drive the technical development of the Internet and Open Source software such as Linux and FreeBSD.

At this point a rather large fraction of the world's economy is derived from the network DARPA conceived and funded. Not even the much-touted space program has paid off its investment so overwhelmingly. Note that I'm using my bully pulpit to draw a harsh contrast between two competing "standards" paradigms -- IEEE's semi-closed process that ultimately yields intellectual property belonging to, blessed by and resold by the IEEE versus the fully open RFC process that leads to an openly and freely published standard specification. It is pretty clear which one I think is superior.

The particular RFC that originally specified IP itself is RFC 791 although there are others that govern (for example) the particular encapsulation of IP within Ethernet packets we are about to discuss and various extensions or modifications. All RFCs are readily available on the Internet where you can read them for free. The Resource sidebar has links to information should you wish to browse.

An IP packet is called a datagram, emphasizing the metaphor that it is like a piece of mail or a telegram -- it has an "envelope" (the header) that tells where and how to send its "contents", the actual message.

The IP specification actually goes beyond just providing hierarchical, routable, maskable addresses. It also provides for the rudiments of reliable data transmission. As we examine an IP header below, we'll note that it has a lot more fields, fields that deal with fragmentation (data streams that are too large to fit into a single packet's MTU), lost packets, and more. IP alone still isn't very reliable over wide area networks, but it is more reliable than Ethernet. As you doubtless recall from my last column, it is TCP that adds true reliability to the data transmission stream, where IP mostly adds routability.

Figure Two shows the IP header as it appears in RFC 791.

Figure Two: IP Header from RFC 791

    0                   1                   2                   3   
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |    Protocol   |         Header Checksum       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Source Address                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Destination Address                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields in Figure Two are as follows: four bits for the IP version (version 4 for the headers we are describing, hence IPv4), four bits for the Internet Header Length (which points to the beginning of the data in 32 bit words, minimum value of 5), 8 bits for Type of Service (really quality of service), 16 bits for the total length of the datagram including header(s) (which consequently must be less than or equal to 65,535 bytes) and several fields associated with fragmentation.

Next is the "Time to Live" (TTL) field, which is very important. When a new packet is created this field is filled with a value determined by your IP implementation (typically 32 or 64, obviously less than 256 for IPv4) and then decremented by every IP (router) hop the packet traverses moving toward its destination. If the count ever reaches 0, the packet is killed. This prevents packets from being perpetuated by networking loops and clogging up the Internet the way Yorkshire pudding clogs up your arteries. Of course, it also means that if it is started at too low a value or if the network route is too long, packets may not reach their destination even when the network is traversable.

This condition can happen, especially if your TTL is set to 32 (a number that used to be generous compared to the "hop radius" of the Internet). The /usr/sbin/traceroute command is a good way to determine the number of routing hops between locations, although it can also fail if one of the intervening hops blocks Internet Control Message Protocol (ICMP) packets (a component of IP described by RFC 792).
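The TTL rule is easy to model. The toy function below is a sketch with a made-up name; it only mimics the decrement-and-discard behavior described above and sends no packets.

```python
def reaches_destination(initial_ttl, router_hops):
    """Return True if a packet survives router_hops routers on its way."""
    ttl = initial_ttl
    for _ in range(router_hops):
        ttl -= 1            # every router decrements the field
        if ttl == 0:
            return False    # the count reached 0, so the packet is killed
    return True
```

With a starting TTL of 32, a route through 31 routers works and a route through 32 does not, even though the network itself is perfectly traversable.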

You can see your (Linux) system's default TTL by entering:

cat /proc/sys/net/ipv4/ip_default_ttl
and can alter it with the /sbin/sysctl command, although this is likely not necessary or advisable.

The protocol field specifies the next layer of encapsulation used by the packet. The header checksum makes corruption of the header itself (only) detectable. The rest of the header is much like the Ethernet header but with a different order: source address first followed by destination address, each four bytes long (for IPv4). The options and padding are themselves optional in that they may not appear in all packets. Sometimes data will start right after the destination address, sometimes not. The minimum (and typical) length of a primary IP header is thus twenty bytes (five 32 bit words). The maximum length is specified in the RFC as 60 bytes depending on options used and padding.
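The header checksum is worth seeing in action. RFC 791 defines it as the 16 bit one's complement of the one's complement sum of all 16 bit words in the header, computed with the checksum field itself set to zero. Here is a minimal sketch of that rule; the function name is my own.

```python
import struct


def ip_checksum(header):
    """One's complement of the one's complement sum of the header's 16-bit words."""
    if len(header) % 2:
        raise ValueError("IP headers are a whole number of 32-bit words")
    total = 0
    for (word,) in struct.iter_unpack("!H", header):
        total += word
    # Fold any carries out of the low 16 bits back in (end-around carry).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF
```

A sender calls this with the checksum field zeroed and stores the result in the header. A receiver can run it over the header as received: an undamaged header yields 0.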

Following the header is the data in the datagram. The shortest message that can be sent is one byte of data accompanied by 20 bytes of header, or 21 bytes total (with no fragmentation possible).
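The fixed 20-byte part of the header in Figure Two maps neatly onto a single struct format string. The decoder below is a sketch under that reading of the figure; it ignores options, and the function name and dictionary keys are illustrative only.

```python
import socket
import struct


def parse_ipv4_header(datagram):
    """Decode the fixed 20-byte part of an IPv4 header, plus the data that follows."""
    (ver_ihl, tos, total_length, ident, flags_frag,
     ttl, protocol, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", datagram[:20])
    ihl = ver_ihl & 0x0F                  # header length in 32-bit words
    return {
        "version": ver_ihl >> 4,
        "header_bytes": ihl * 4,
        "tos": tos,
        "total_length": total_length,
        "identification": ident,
        "flags": flags_frag >> 13,        # top 3 bits
        "fragment_offset": flags_frag & 0x1FFF,
        "ttl": ttl,
        "protocol": protocol,
        "checksum": checksum,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
        # The data runs from the end of the header to Total Length.
        "data": datagram[ihl * 4:total_length],
    }
```

Fed the smallest possible datagram, 20 bytes of header and one byte of data, it reports a header_bytes of 20, a total_length of 21, and a one-byte data field.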

The entire IP datagram must itself be encapsulated as the "data" part of an Ethernet packet. This encapsulation is not arbitrary. As you might expect, there is an RFC (894) that describes how it is to be done. Fortunately this is a very simple RFC and the encapsulation is done in pretty much the obvious way. The IP datagram becomes the data part of the Ethernet packet (basically wrapping it inside an Ethernet prologue/header and epilogue/footer). Small datagrams are padded with zeros as needed to reach the minimum Ethernet data size, but the padding is not included in the datagram length so the zeros are ignored by the receiving system that unwraps the packets to get at their contents.
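The wrapping and unwrapping can be sketched in a few lines. The type value 0x0800 is the one RFC 894 assigns to IP datagrams; everything else here (function names, leaving out the frame check sequence) is a simplification of mine, not a real network stack.

```python
import struct

ETHERTYPE_IP = 0x0800   # type value RFC 894 assigns to IP datagrams
MIN_DATA = 46           # minimum Ethernet data size
MAX_DATA = 1500         # standard MTU


def encapsulate(dst_mac, src_mac, datagram):
    """Wrap an IP datagram in an Ethernet header, zero-padding short ones (no FCS)."""
    if len(datagram) > MAX_DATA:
        raise ValueError("datagram exceeds the 1500-byte MTU")
    padding = b"\x00" * max(0, MIN_DATA - len(datagram))
    return dst_mac + src_mac + struct.pack("!H", ETHERTYPE_IP) + datagram + padding


def decapsulate(frame):
    """Recover the datagram, using its Total Length field to drop any padding."""
    data = frame[14:]                                   # skip the Ethernet header
    (total_length,) = struct.unpack("!H", data[2:4])    # bytes 2-3 of the IP header
    return data[:total_length]
```

A 21-byte datagram goes out as 14 bytes of header plus 46 bytes of padded data, and decapsulate hands back exactly the original 21 bytes because the zeros lie beyond the datagram's stated length.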

Let's quit here for now, and come back to this next time, when we will learn that IP over Ethernet has a dark side (we promised you dark secrets, if you recall). As it stands, it is connectionless (you throw a packet out there hoping it will find a home). It isn't reliable (it may or may not ever find a home but the protocol has no way of determining if it does or doesn't). It is quite costly for certain patterns of messages. Some of these issues will be addressed by adding on the TCP layer, and some won't. Hopefully we'll see you there.

Sidebar: Networking Resources
Charles Spurgeon's Ethernet Web Site: This is a truly excellent resource and has been converted into an O'Reilly book.

Javin's Protocol Dictionary: This site has a nice review of the 802.3 specifications and the structure of packets, in particular the changes associated with gigabit Ethernet.

Charles L. Hedrick's Introduction to the Internet Protocols document is the document I originally used to learn about TCP/IP networking.

The RFCs: The original (and still operant) RFCs define e.g. TCP and IP. There are many more, however, including RFCs that deal specifically with sending TCP/IP over Ethernet.

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Robert Brown, Ph.D, has written extensively about Linux clusters. You can find his work and much more on his home page.
