Cluster Networking: The Dark Side of IP over Ethernet

The Transmission Control Protocol

Like UDP, TCP adds support for connections by encapsulating the data being sent inside a TCP header encapsulated inside an IP datagram encapsulated in an Ethernet frame. However, it also adds reliability. TCP absolutely, positively guarantees that if the hosts at both ends are functioning normally and the network in between isn't too horribly broken, messages will be delivered between applications without corruption. Period. If a message cannot be so delivered (indicating a "broken connection"), it promises to let the application know so it can try to re-form the connection and retransmit the data at a later time.

This reliability is not cheap. It costs latency, CPU overhead, and a bit of bandwidth, but it is essential for those who want to live in a reasonably deterministic and functional networking universe. That is especially true of a universe that extends over a wide area network, where the second and third packets of a message stream can literally take different routes to their destination at the whim of intermediate routers, with the third arriving before the second.
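To make the reordering problem concrete, here is a minimal Python sketch of a receiver putting segments back in order. It is a toy model, not how any real TCP stack is written. It relies on one fact from RFC 793: TCP numbers the bytes of a stream, so each segment carries the sequence number of its first byte. The function name `reassemble` and the (sequence number, payload) tuples are invented for illustration, and the sketch assumes segments never partially overlap and ignores wraparound of the 32-bit sequence number.

```python
def reassemble(segments, initial_seq=0):
    """Rebuild a byte stream from (sequence_number, payload) pairs.

    Toy model: segments may arrive in any order and may be duplicated,
    but are assumed never to overlap partially.
    Returns (bytes delivered in order, next sequence number expected).
    """
    pending = {}
    for seq, payload in segments:
        # Keep the first copy of each segment and drop retransmitted duplicates.
        pending.setdefault(seq, payload)

    stream = bytearray()
    next_seq = initial_seq
    # Deliver segments only while the next expected byte is available.
    while next_seq in pending:
        payload = pending.pop(next_seq)
        stream += payload
        next_seq += len(payload)
    return bytes(stream), next_seq
```

Fed the segments of a message with the third arriving before the second, the function still returns the bytes in their original order. If a segment is missing it stops at the gap and reports the next byte it is waiting for, which is the number a real receiver would acknowledge.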

To learn about TCP we as usual look up the appropriate RFC (in this case 793). There we learn that the TCP header (basically the first part of an IP datagram's data section in a TCP/IP message) looks like Figure Two.

    0                   1                   2                   3   
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |       Destination Port        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Acknowledgment Number                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Data |           |U|A|P|R|S|F|                               |
   | Offset| Reserved  |R|C|S|S|Y|I|            Window             |
   |       |           |G|K|H|T|N|N|                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Checksum            |         Urgent Pointer        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             data                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure Two: TCP Header Protocol

We note several additions compared to UDP. TCP supports the notion of persistent connections between sockets. Applications that communicate via a network connection have associated ports (as did UDP's transient connections). Typically an application on one host listens on one port awaiting a connection while an application on another host requests a connection, usually from a port drawn from the pool of open, unassigned ports on its own host. When a connection is obtained, the receiving application can either service it itself until it is broken by either party (accepting no further connections in the meantime), or it can fork a copy of itself to manage the new connection. A forked copy does not need a new port number on the listening host: each connection is identified by the combination of source and destination addresses and ports. The original application can then go back to listening for new connections on the original port while the forked copy manages the persistent connection until it terminates.

This is the basis for forking daemons and nearly all Unix network services. Let's look at the rest of the header.
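The listen, accept, fork pattern can be sketched in a few lines of Python. This is a bare-bones illustration under stated assumptions, not a production daemon: the function name `serve_forking`, the echo behaviour, and the `max_connections` bound (a real daemon would loop forever) are all invented for this example, and `os.fork` restricts it to Unix-like systems.

```python
import os
import socket


def serve_forking(listener, max_connections):
    """Accept connections on `listener`, forking one child per client.

    Each child echoes whatever its client sends until the client closes
    its end of the connection, then exits. The parent only accepts and forks.
    """
    children = []
    for _ in range(max_connections):
        conn, _peer = listener.accept()
        pid = os.fork()
        if pid == 0:
            # Child: it never accepts new clients, so drop the listener.
            listener.close()
            try:
                while True:
                    data = conn.recv(4096)
                    if not data:      # empty read: the client closed its end
                        break
                    conn.sendall(data)
            finally:
                conn.close()
                os._exit(0)           # never fall back into the accept loop
        # Parent: the child owns this connection now.
        conn.close()
        children.append(pid)
    for pid in children:
        os.waitpid(pid, 0)            # reap children so no zombies are left
```

To try it, create a TCP socket, bind it to an address, call `listen()` on it, and pass it in. The parent closes its copy of each accepted connection immediately, which is what lets it go straight back to `accept()` while the child carries on the conversation.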

To detect packets that arrive out of order and put them back in sequence, a sequence number is added. The acknowledgment number permits handshaking -- if a receiver doesn't acknowledge receipt of each packet in sequence within a reasonable timeout, the sender retransmits the missing packet. The data offset gives the length of the header in 32-bit words and so points to the first byte of data (just past the end of the header). There are a number of control bits used for specific purposes beyond the scope of this article (read the RFC, which is a good idea anyway). There are several more fields, but the most important field remaining is the checksum, which as always is used to detect a corrupted packet. Last, there is the data itself, a TCP stream encapsulated in an IP datagram encapsulated in an Ethernet frame.
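As a rough illustration of how these fields sit in the first 20 bytes of a segment, the following Python sketch unpacks the fixed part of the header laid out in Figure Two. The names `TcpHeader`, `FLAG_NAMES`, and `parse_tcp_header` are made up for this example, and options (anything between byte 20 and the data offset) are skipped rather than decoded.

```python
import struct
from collections import namedtuple

# Field names are illustrative; they follow the labels in Figure Two.
TcpHeader = namedtuple(
    "TcpHeader",
    "src_port dst_port seq ack header_len flags window checksum urgent",
)

# Control bits from least significant (FIN) to most significant (URG).
FLAG_NAMES = ("FIN", "SYN", "RST", "PSH", "ACK", "URG")


def parse_tcp_header(segment):
    """Split a raw TCP segment into (TcpHeader, payload bytes)."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    # "!" selects network (big-endian) byte order: H = 16 bits, I = 32 bits.
    (src_port, dst_port, seq, ack,
     offset_and_flags, window, checksum, urgent) = struct.unpack(
        "!HHIIHHHH", segment[:20])
    # The data offset is the top 4 bits, counted in 32-bit words.
    header_len = (offset_and_flags >> 12) * 4
    if header_len < 20 or header_len > len(segment):
        raise ValueError("data offset does not fit the segment")
    flag_bits = offset_and_flags & 0x3F
    flags = tuple(name for bit, name in enumerate(FLAG_NAMES)
                  if flag_bits & (1 << bit))
    header = TcpHeader(src_port, dst_port, seq, ack, header_len,
                       flags, window, checksum, urgent)
    # Everything past the header (including any options) is the data.
    return header, segment[header_len:]
```

The function returns the decoded header together with the payload, so the data offset does exactly the job described above: it tells the receiver where the header stops and the data begins.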

There are many details of TCP that are important but that we are perforce skipping. For example, TCP undergoes some rather elaborate rituals establishing connections, transmitting each packet in a sequence, handshaking so that correct-order receipt of the uncorrupted message is ensured, and breaking connections either deliberately (because one or both socket ends are closed) or because an end application, an end host, or the network itself fails. TCP has to be prepared to deal with literally anything that can happen on a network because, in the best tradition of Murphy's Law, anything that can go wrong eventually will. TCP has to provide enough power at the application level that an application can guarantee reliable delivery of a message -- eventually -- while not causing the system or a well-written TCP-based application to actually crash due to a failed connection.

Mind you, there is a lot of wiggle room in that "well-written" descriptor. A badly written application may well crash or hang if a connection fails, and it isn't horribly easy to write an application well using low-level network system calls. This is one of many reasons that most cluster applications that use network IPC mechanisms do so via message passing libraries such as MPI or PVM, which are well-written and spare you the care of managing a socket without getting into trouble.
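To give a flavour of the care involved, here is one small example of a classic pitfall, sketched in Python. A single `recv` call on a TCP socket may return fewer bytes than were asked for, and it returns an empty result when the peer has closed the connection; code that ignores either case tends to hang or to process half a message. The helper name `recv_exact` is an invention for this example, not part of any library.

```python
import socket


def recv_exact(sock, nbytes):
    """Read exactly `nbytes` from a connected stream socket.

    Raises ConnectionError if the peer closes the connection first,
    instead of looping forever or returning a short message.
    """
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)   # may return fewer bytes than requested
        if not chunk:                  # empty read: the peer closed its end
            raise ConnectionError(
                f"connection closed with {remaining} of {nbytes} bytes missing"
            )
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

A caller can then catch `ConnectionError` in one place and decide whether to reconnect and retry, rather than discovering the broken connection as a mysterious hang somewhere downstream.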

Still, there is nothing like writing a daemon of your own, especially if you want absolutely maximal efficiency. One day this column will likely show you how.

Conclusion: Looking at TCP/IP Packets

One of the best ways to learn about real networks and debugging is to watch one work. In Linux the tool that permits you to do this is /usr/sbin/tcpdump. For example, try:

 tcpdump -i eth0 

This dumps packet headers. Read the man pages for tcpdump to see all the different options and ways one can probe the network for problems. This application must be run as root. Other interesting options to try include:

 tcpdump -e -i eth0 

(dumps Ethernet headers) or

 tcpdump -c 100 -s 0 -X -i eth0 -l | tee eth0.dump

This latter view lets you see pretty much everything. If you don't use SSH or SSL to encrypt network traffic in both directions, you can read passwords and valuable data with ease. This is the view you must presume is available to crackers on any traffic that leaves strictly controlled network space (and, if you are wise, you'll assume it is available to crackers even within your controlled network space).

The final tool worth mentioning is nmap (likely /usr/bin/nmap on your system, if installed). This tool is a security probe, and you might annoy your system administrator if you probe your network for security holes, so use it with caution unless you are said system administrator probing your own network. nmap can yield all sorts of valuable data about ports and services that are open and listening on any given host. It is thus a way to see if some service you think is being offered is in fact there, as well as a way of determining whether or not some cracker is offering a back door service that you didn't know was there.
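The simplest question of this kind, "is anything listening on this port?", can be asked with nothing more than an ordinary TCP connection attempt. The Python sketch below is emphatically not nmap, which does far more and far more cleverly; it only tries to complete a single connection. The function name `is_port_open` and the one-second default timeout are assumptions made for this example, and the same caution applies: only probe hosts you administer.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        # create_connection performs the full TCP handshake; the "with"
        # block closes the connection again as soon as it succeeds.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

Calling it with your own host name and a port number such as 22 tells you whether the service you think is being offered there is in fact accepting connections.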

That seems enough for this column. At this point you should have a really good idea of how TCP/IP over Ethernet works. Next month we'll continue with our discussion of the network, concentrating on measuring network performance.

Sidebar: Networking Resources
Charles Spurgeon's Ethernet Web Site: This is a truly excellent resource and has been converted into an O'Reilly book.

Javin's Protocol Dictionary: This site has a nice review of the 802.3 specifications and the structure of packets, in particular the changes associated with gigabit Ethernet.

Charles L. Hedrick's Introduction to the Internet Protocols document is the document I originally used to learn about TCP/IP networking.

The original (and still operant) RFCs define, e.g., TCP and IP. There are many more, however, including RFCs that deal specifically with sending TCP/IP over Ethernet.

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Robert Brown, Ph.D., has written extensively about Linux clusters. You can find his work and much more on his home page.

    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.