Clos networks, Numactl and Multi-Core chips, and Parallel Storage

Article Index

These are the days of our lives

The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this article I review some postings to the Beowulf list on Clos networks, Numactl and Multi-Core chips, and Parallel Storage. I found some threads that I thought were very useful to highlight. Although the discussions are several years old, I think there is still useful information that can be used by everyone.

Clos networks

There are many, many types of networks - almost as many as there are opinions about clusters! One that has been used for a while and perhaps people did not know about it, is the Clos network. A Clos network was originally designed for telephone switching and are used in computing networks for growing switch infrastructures to large port counts. On June 7, 2005, Joe Mack, a long time cluster expert, asked a question about Clos networks. In particular, he wanted to know if smaller Clos networks (3 layers and below) are blocking and if they have 5 or more layers, if they are non-blocking.

Greg Lindahl was the first to reply and explain that blocking and non-blocking Clos networks are all a myth. He explained that there is always interference on the network because of traffic from other nodes (the blocking and non-blocking comments were for the telecom industry).

{mosgoogle right}

There was some discussion about what is meant by "blocking" and "non-blocking" but then Patrick Geoffray from Myricom responded with a clear, but long, response. Patrick explained that blocking for Clos networks, particularly for Myricom's implementation, you can get a "classical" wormhole blocking where the FIFO (First In, First Out) on the input port is full and the requested output port is used by another packet at that precise time. But he pointed out that this is independent of the dimension of the Clos. Patrick also explained how blocking can happen and how networks are designed to perhaps alleviate some of the possibility of blocking. Interestingly, he also pointed ways to reduce the overall contention in a Clos network:

  • Load balance the routes over redundant paths
  • Fragment messages (i.e. pass smaller packets if possible)
  • Use route dispersion (Use multiple routes per destination)

Patrick went on to say, "There is just a maximum number of nodes you can connect on a Clos of diameter n with N-port crossbars. In 2002, at the time of the referenced email, the Myrinet crossbar we shipped had 16 ports. So the scaling table was the following:"

Clos diameter:    1  3   5    7
Max nodes count: 16 128 1024 8192

Then Patrick said, "We now use a 32-port crossbar in the 256 ports box, so the crossover points have shifted a bit:"

Dimension (Clos): 3    5
Max nodes count: 512 (a lot, ouch my head hurts)

Then Patrick finished up with some theoretical discussion about Clos networks. It's a good email to read if you are interested in networking.

This was a nice, albeit short, discussion about Clos networking. The interesting part is that there is more than one way to skin a network cat, so to speak, for cluster network design. If you are at all interested in networking for clusters, be sure to read this thread.

Numactl and Multi-Core chips

The entire world is multi-core now. I fully expect the next Christmas toy craze, whatever it is, will have toys with a couple of cell chips in there performing image recognition, speech synthesis, and robotic movement. (Could I make a cluster out of Teddy Ruxpin 2? Never mind).

A discussion started on the beowulf list about performance on a dual-core Opteron 275 node (the discussion even dared to talk about Gaussian performance, but that is verboten). But the discussion started talking about processor affinity, locking a process to a specific core, and the tools that surround this. While not the beginning of the discussion, a post by Igor Kozin, a very knowledgeable poster to the beowulf list, talked about using 1 core per socket for some performance testing. He mentioned using 'taskset' for that purpose. Then Mikhail Kuzminsky mentioned the command, numactl. Both tools can be used to produce the desired outcome - pinning a process to a specific core and/or to make sure computationally intense tasks don't end up sharing the same socket (or core) if another socket (or core) is free.

Another project that is worth looking into is the Portable Linux Processor Affinity (PLPA). Started as part of the OpenMPI effort, the PLPA is a single API that attempts to solve the problem that there are multiple API's for processor affinity within the Linux family. Specifically, the functions sched_setaffinity() and sched_getaffinity() have numbers and types of parameters depending on your Linux vendor and/or version of glibc.

This can be an issue with NUMA (Non-Uniform Memory Access) architectures such as the Opteron. With the Opteron you have a bank of memory tied to each socket. Plus the CPU has the memory controller on-board. But each socket (core) can access the memory on the other socket. You may have a numerically intensive process on one socket, but you may be accessing memory on another socket (not good for latency).

Mark Hahn, posted that the sched_setaffinity function actually does most of the work But he also pointed out that using affinity for all tasks on a node might not be a good idea. For example, he wondered if having one of the cores do the interrupt handling would be worthwhile (but he did point that many Opteron boards at the time tied the IO bridge to a single socket).

Then the discussion branched off to talk more about numactl and what it does for you. Vincent Diepeveen posted an email about some testing he did with numactl and measuring latency in his own program. He saw some differences in latency when using the various cores on the node (there were 4 cores on the board). Stuart Midgley then pointed out that numactl and indeed all affinity tools are not designed to help latency. He said that the 2.6 kernel does a pretty good job of putting the pages on the memory controller attached to the core that the process is using. But things are not perfect, so occasionally the memory pages can get spread around. He also pointed out that, "... the system buffer cache will get spread around effecting everyone." Then he went on to say, "With numactl tools you will force the pages to be allocated on the right memory/cpu. The processes buffer cache will also be locked down (which is another VERY important issue)... ." Stuart mentioned that he has used numactl tools to double the performance of his codes.

I have also seen the Linux kernel scheduler move processes around the cores. In particular, when a daemon wakes up and needs some CPU time, the kernel can move a numerically intensive task from one core and put it on another so the daemon can use the first core. This move can be very costly in terms of performance since you now have two numerically intensive tasks sharing the same core. But, once the daemon does what it needs to and goes back to sleep, the kernel is pretty good about moving one of the numerically intensive tasks back to the free core. If you have a number of daemons on the node, then this is more likely to happen. For MPI jobs this can impact overall performance. It can also affect the repeatability of performance. For example, if you run the same code a number of times, the spread in the performance is much larger when you don't use numactl than if you use it. So using numactl can help your performance and help get more repeatable timings. You can also help yourself by turning off as many daemons as possible on the compute nodes.

    Search

    Feedburner

    Login Form

    Share The Bananas


    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.