The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this article I review some postings to the Beowulf list on checkpoint/restart on 2.6 series kernels, torus vs. fat-tree topologies for cluster interconnects, NFS on diskless clusters, and a cluster built from the new Mac mini.
Checkpoint/restart has been something of a Holy Grail for several years. On November 2, 2004, Brian Dobbins asked what people used for checkpoint/restart and whether anyone knew of a solution for a 2.6 kernel on the AMD64 architecture. He said that they were interested in checkpointing because they wanted to be able to stop a long-running application and run another application that was much more critical.
Reuti was the first to respond. He mentioned that if you wanted to do checkpointing at the kernel level, you would be saddled with restrictions such as no forking, no threads, etc. He suggested looking at checkpointing at the application level instead. In a later post he suggested a method for having the application checkpoint itself. He also suggested looking at the Condor project.
Jeff Moyer suggested sending the SIGSTOP signal to the application running on the nodes. Then you could run the higher priority application, and send the SIGCONT signal to the initial application. Jeff pointed out that as long as the memory, including swap, wasn't exhausted, this approach should work. Reuti mentioned that such an approach might invalidate timings since the initial application is "paused" for some period of time. However, if Brian is only interested in running another application, then the timings from the first application shouldn't make too much difference.
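As a rough illustration of the pause-and-resume approach Jeff described, here is a minimal Python sketch for a single node. The function names (pause, resume, run_with_preemption) are hypothetical and not from the thread, and a real parallel job would need the same signals delivered to its processes on every node.

```python
import os
import signal
import subprocess
import sys
import time


def pause(pid: int) -> None:
    """Suspend a process. SIGSTOP cannot be caught or ignored."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid: int) -> None:
    """Let a stopped process continue where it left off."""
    os.kill(pid, signal.SIGCONT)


def run_with_preemption(low_cmd, high_cmd, head_start_s=0.2):
    """Start low_cmd, pause it, run high_cmd to completion, then resume.

    Returns (high_returncode, low_returncode). The paused process keeps
    its memory (possibly swapped out), so both jobs must fit in RAM plus
    swap for this to work.
    """
    low = subprocess.Popen(low_cmd)
    time.sleep(head_start_s)          # let the first job get going
    pause(low.pid)
    try:
        high_rc = subprocess.run(high_cmd).returncode
    finally:
        resume(low.pid)               # always wake the first job up again
    return high_rc, low.wait()


if __name__ == "__main__":
    long_job = [sys.executable, "-c", "import time; time.sleep(1)"]
    urgent_job = [sys.executable, "-c", "print('urgent job done')"]
    print(run_with_preemption(long_job, urgent_job))
```

Note that wall-clock timings of the first job include the time it spent stopped, which is the concern Reuti raised.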
Glen Gardner mentioned that forcing a program to pause at some point could cause problems with the MPI codes (you might need to use blocking message passing to make sure the code looks for the pause).
Chris Samuel pointed out that LAM/MPI has some checkpointing capability. It uses Berkeley Lab's "checkpoint" package to do the checkpointing while LAM handles the parallel coordination. However, at the time of the posting, "checkpoint" only worked with 2.4 kernels.
Isaac Dooley posted to mention the Charm++ project at UIUC. As part of the project, an Adaptive MPI (AMPI) is available that allows checkpointing. He also said that AMPI has some load-balancing capabilities as well as some fault tolerance.
One of the fun things about clusters is that since you have control over all aspects of the machine, you can change things as you want. One of the things that you can change is the network topology. On Nov. 7, 2004, Chris Sideroff asked about some of the pros and cons of torus (2D/3D) and fat-tree topologies.
Joachim Worringen responded that he thought a torus layout was easier to scale because effort and cost scale linearly with the number of nodes. He pointed out that many popular supercomputers have used this topology, including ASCI Red, the Cray T3D/E, and IBM's BlueGene/L. Dolphin's SCI allows you to do the same thing with clusters. He also thought that a fat-tree topology had better bisection bandwidth than a torus. Interconnects such as Myrinet and Quadrics both use this topology. Joachim went on to say that he thought a fat-tree topology would be easier to manage because a failed node does not affect message routing. However, he said that a good administrator with good tools would not have trouble administering a cluster with a torus topology.
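To make the scaling argument concrete, here is a small back-of-the-envelope sketch. It is not from the thread: it assumes an idealized k x k x k torus with wraparound links (k even, at least 4) and an idealized full-bisection fat tree, and it counts links rather than measuring real bandwidth.

```python
def torus3d_links(k: int) -> int:
    """Total links in a k x k x k torus: 3 per node, so cost grows linearly."""
    return 3 * k ** 3


def torus3d_bisection_links(k: int) -> int:
    """Links cut when the torus is split in half across one dimension.

    Each of the k*k rings running along that dimension is cut twice.
    """
    return 2 * k ** 2


def fat_tree_bisection_links(nodes: int) -> int:
    """Idealized full-bisection fat tree: half the nodes can each talk
    to a partner in the other half at full link speed."""
    return nodes // 2


if __name__ == "__main__":
    for k in (4, 8, 16):
        n = k ** 3
        print(n, torus3d_links(k), torus3d_bisection_links(k),
              fat_tree_bisection_links(n))
```

Under these assumptions the torus link count per node stays constant, while its bisection link count per node shrinks as 2/k. That is one way to read both of Joachim's points at once.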
Mark Hahn responded that he thought switched networks were the way to go. He said that they have fewer hops than a mesh-like topology and that the cabling is also much easier. He went on to say that he thought fat-tree topologies were more popular because of their bandwidth (although he didn't understand why people like bandwidth) and their latency (fewer hops). Mark mentioned that a mesh-like topology was good for certain codes. He also mentioned that you don't have to have just a mesh topology or a switched network - you could have a combination (aren't clusters just great!).
Chris Sideroff, the originator of the discussion, then posted to say that he had tested Dolphin's SCI network and found that it had exceptional latency performance.
At this point the well-regarded Greg Lindahl jumped in to say that the good latency performance of SCI was due to the NIC, not the topology. He said that there were some examples of low-latency fat-tree topologies.
Joachim Worringen responded that the torus network adds insignificant latency for each hop the data has to take (nanoseconds per hop). Patrick Geoffray said that going through a crossbar costs about 100-150 nanoseconds (ns). He also thought that a hop on a torus network was similar. He then discussed the latency on a larger system for both a 3D torus and a Clos topology. His final comment was that the latency depended upon the size of the system. He also said that a torus topology had many more cables than a Clos topology.
As a follow-up, Ashley Pittman said that the company he works for estimates the latency through the crossbar to be about 25 ns for Elan4 (you can guess who he works for). Then there was some discussion as to the exact latency through the crossbars for various interconnects.
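The dependence of latency on system size can be sketched with a simple hop-count model. This is an illustration under stated assumptions, not Patrick's actual calculation: it assumes a k x k x k torus with shortest-path routing, a folded Clos (fat tree) built from crossbars of a hypothetical radix, and a per-hop cost of 125 ns picked as the midpoint of the 100-150 ns range quoted above.

```python
HOP_NS = 125  # assumed per-hop cost, midpoint of the quoted 100-150 ns


def torus3d_max_hops(k: int) -> int:
    """Worst case: half way around the ring in each of 3 dimensions."""
    return 3 * (k // 2)


def torus3d_avg_hops(k: int) -> float:
    """Average shortest-path hops to a uniformly random node."""
    ring_avg = sum(min(d, k - d) for d in range(k)) / k
    return 3 * ring_avg


def clos_levels(nodes: int, radix: int) -> int:
    """Levels of radix-port crossbars needed in a folded Clos network.

    With L levels, up to 2 * (radix / 2) ** L nodes can be attached.
    """
    half = radix // 2
    levels = 1
    while 2 * half ** levels < nodes:
        levels += 1
    return levels


def clos_max_hops(nodes: int, radix: int) -> int:
    """Worst case: up to the top level and back down again."""
    return 2 * clos_levels(nodes, radix) - 1


if __name__ == "__main__":
    radix = 16  # hypothetical crossbar size
    for k in (4, 8, 16):
        n = k ** 3
        print(n, "nodes: torus worst case",
              torus3d_max_hops(k) * HOP_NS, "ns, Clos worst case",
              clos_max_hops(n, radix) * HOP_NS, "ns")
```

In this model the torus hop count grows with the cube root of the node count while the Clos hop count grows logarithmically, so the gap only matters once the system gets large.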
On Dec. 7, 2004, Josh Kayse asked a fun question. He built a diskless cluster that got its image over GigE. The master node was built using five 36 GB SCSI drives in a RAID-5 configuration. However, the engineers who use the cluster use files over NFS for their message passing (yuck) and will not be changing to MPI any time soon. So Josh wanted to know if there were ways of testing NFS performance and then improving it.
Craig Tierney was the first to respond and asked some questions about the disk performance in the master node. He thought that Josh should be seeing about 100 MB/s for both reads and writes.
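One simple way to get a number to compare against Craig's estimate is to time a large sequential write. The sketch below is an assumption-laden illustration rather than anything posted in the thread: the function name and default sizes are made up, and it only measures sequential writes, which says little about the many-small-files pattern that NFS-based message passing would produce.

```python
import os
import time


def write_throughput_mb_s(path, total_mb=64, block_kb=1024):
    """Write about total_mb of zeros to path and return MB/s.

    The file is fsync'ed before the clock stops so that data still
    sitting in the client's cache is not counted as written.
    """
    block = b"\0" * (block_kb * 1024)
    n_blocks = max(1, (total_mb * 1024) // block_kb)
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n_blocks):
            written += f.write(block)
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    return written / (1024 * 1024) / elapsed


# Example (the mount point is hypothetical):
#   write_throughput_mb_s("/mnt/nfs/scratch/throughput.bin", total_mb=1024)
```

Running it once against a local disk on the master node and once against the NFS mount from a compute node gives a rough idea of how much is lost to the network and to NFS itself. Read tests need more care, since a file smaller than the client's memory may be served from cache.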
Chris Samuel also asked some questions about the configuration. In particular, was the GigE network capable of jumbo frames? He also pointed Josh to the NFS Performance Howto. Ed Hill also posted some suggestions about NFS options to try to improve performance.
Bernd Schubert responded by suggesting switching to TCP instead of UDP. He also had a "P.S." at the end of his post suggesting a switch to unfs3 (user-space NFS version 3).
Ariel Sabiguero also posted and suggested adding some fault tolerance to the file system on the master node because so much depended upon it. He also suggested adding some memory to the master node to help performance.
And finally, in the thread, our favorite author, Robert Brown, posted (and I actually read the whole thing). Robert asked for details about the cluster, such as the type of nodes, number of nodes, etc. Robert also spent some time talking about scaling of the user codes. It was quite a good posting and he ended by explaining that Josh should take the time to explain, "... your engineers are either going to have to accept algebraically derived reality and restructure their code so it speeds up (almost certainly abandoning the NFS communications model) or accept that it not only won't speed up, it will actually slow down run in parallel on your cluster." However, in famous Robert fashion he finished with, "Unless, of course, you compute for minutes compared to communicate for seconds, in which case your speedup should be fine. That's why you have to analyze the problem itself and learn the details of its work pattern before designing a cluster or parallelization."
While this thread was fairly short, it ended on a very strong note, courtesy of Robert. He spent some time analyzing the kind of code that Josh's users were running. He looked at the two extremes that could influence the code, communication dominating and processing dominating, and backed it up with a simple example. Robert pointed out that from this simple analysis, you can determine where to focus your efforts.
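The flavor of that kind of analysis can be captured in a toy model. The sketch below is not Robert's example. It assumes the compute time divides evenly across nodes and that the communication time per step grows linearly with the number of nodes, which is a crude stand-in for every node going through one NFS server.

```python
def parallel_time(compute_s: float, comm_s: float, nodes: int) -> float:
    """Time per step on `nodes` nodes under the toy model.

    compute_s: serial compute time per step
    comm_s:    communication cost per node per step (assumed to add up
               because all traffic passes through a single server)
    """
    if nodes == 1:
        return compute_s          # no communication needed on one node
    return compute_s / nodes + comm_s * nodes


def speedup(compute_s: float, comm_s: float, nodes: int) -> float:
    """Speedup relative to running on a single node."""
    return compute_s / parallel_time(compute_s, comm_s, nodes)


if __name__ == "__main__":
    for label, compute_s, comm_s in (
        ("compute for minutes, communicate for seconds", 600.0, 1.0),
        ("communication dominates", 10.0, 5.0),
    ):
        print(label)
        for n in (1, 2, 4, 8, 16):
            print("  nodes", n, "speedup", round(speedup(compute_s, comm_s, n), 2))
```

With these made-up numbers the first case speeds up nicely on a handful of nodes, while the second runs slower in parallel than it does on a single node, which is the outcome the quote above warns about.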
The new Mac mini was announced at the beginning of this year. So, in Slashdot fashion, and with some tongue in cheek, Mark Hahn, on January 11, 2005, asked if anyone was working on building a cluster of Mac minis. He thought the boxes might be good for bioinformatics codes, Monte Carlo simulations, or even video wall applications (the Mac mini has decent graphics capability for such a small box). However, the Mac mini only has Fast Ethernet.
Dean Johnson posted and thought the box would be very interesting. He pointed out how small the box is: he could fit 14 of them in the same space as his desktop box. He offered the phrase "Mac mini blade bracket."
Several people, including Dean Johnson and Sean Dilda, said that they were concerned about the size of the hard drive (2.5" laptop drive).
Robert Depenbrock then posted that he thought you could get 6 of them onto one 19" 1.5U rack module.
Finally, John Sjoholm satisfied all of our dreams and said that there was a project in Sweden featuring 15 Mac minis. He said he would report back to the list about progress, but I haven't seen a post yet (hint - please email John and let's find out about the progress).