

The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this article I review some postings to the Beowulf list on aspects of cluster management, network bandwidth in clusters (with some extremely good information), Intel EM64T code and Opterons, and PVFS on a medium size cluster.

About Management

This thread, while short, points out a few concepts that are worth repeating. On Oct. 15, 2004, llwaeva asked a question about how to manage an 8 node cluster. In particular, he wanted to know how to easily update all 8 nodes at the same time, how to handle user accounts, NFS, NIS, and so on.

The ever vigilant (or is it vigilante) Robert Brown was the first to respond. Robert suggested that for a small 8 node cluster you could easily use the head node as the NIS (Network Information Service) server as well as an NFS server. After a little reminiscing, Robert said that NIS and NFS would not put much load on the head node. He also said that there are some alternatives to NIS. For example, you could use rsync to propagate accounts from the head node to the other nodes. He also suggested Kickstart or yum to automate OS installation and maintenance.
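Robert's rsync suggestion translates into just a few lines of scripting. Here is a minimal sketch of the idea in Python (my own illustration, not from the thread); the node names and the list of account files are assumptions.

#!/usr/bin/env python3
# Sketch: push the usual account files from the head node to every compute node.
# Run on the head node as root; assumes passwordless ssh to node01..node08.
import subprocess

NODES = [f"node{i:02d}" for i in range(1, 9)]            # assumed node names
ACCOUNT_FILES = ["/etc/passwd", "/etc/shadow", "/etc/group"]

for node in NODES:
    for path in ACCOUNT_FILES:
        # -a preserves ownership, permissions, and timestamps
        subprocess.run(["rsync", "-a", path, f"{node}:{path}"], check=True)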

John Hearns, from across the pond (since we are waxing nostalgic), mentioned that cluster management can be handled by several toolkits, Rocks and Warewulf to name two. He also mentioned that there are utilities that allow parallel execution of commands (i.e., commands that run on all of the nodes at once); pdsh is one of the better known parallel command tools. John also mentioned that rsync was a reasonable way of managing user accounts.
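pdsh is the tool to reach for here, but the idea behind parallel command execution is simple enough to sketch. The toy example below (mine, not from the thread; node names assumed) runs one command on all of the nodes at the same time over ssh.

#!/usr/bin/env python3
# Toy stand-in for a parallel shell like pdsh: run the same command on all
# nodes concurrently and print each node's output.
import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = [f"node{i:02d}" for i in range(1, 9)]            # assumed node names

def run(node, command="uptime"):
    result = subprocess.run(["ssh", node, command],
                            capture_output=True, text=True)
    return node, result.stdout.strip()

with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
    for node, output in pool.map(run, NODES):
        print(f"{node}: {output}")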

Gary Milenkamp said that he used SystemImager for software management. He also said that he had recently started using LDAP and that it greatly simplified account management.

The respected Sean Dilda jumped in to say that his cluster relies heavily on NIS and NFS. He uses NFS for user home directories but packages up the commonly used codes so they can be installed directly on the nodes. He also said that on a smallish cluster you could easily run NIS and NFS on the head node along with a job scheduler.

Bandwidth: who needs it?

There has been some discussion about the importance of network bandwidth and latency for application performance. On Oct. 16, 2004, Mark Hahn, resident expert, posted a question to the mailing list. He was looking for applications that push the limits of MPI bandwidth (around 800-900 MB/sec).
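The thread does not name a particular benchmark, but a ping-pong test is the standard way to measure MPI bandwidth. Below is a minimal sketch using mpi4py and NumPy (my choice of tooling, not anything from the list); run it with two ranks, e.g. mpirun -np 2 python pingpong.py.

# Bare-bones MPI ping-pong bandwidth test between ranks 0 and 1.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

size = 4 * 1024 * 1024                     # 4 MB message
buf = np.zeros(size, dtype=np.uint8)
reps = 100

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)
        comm.Recv(buf, source=1, tag=0)
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=0)
elapsed = MPI.Wtime() - t0

if rank == 0:
    # Each repetition moves 2*size bytes (out and back).
    print(f"~{2 * size * reps / elapsed / 1e6:.0f} MB/s")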

The esteemed Greg Lindahl posted with some very interesting observations. He said that bandwidth is important not only for large messages but also for medium-sized ones. He gave what he called a naive formula for how long it takes to send a message:

T_size = T_0 + size/max_bandwidth
where T_size is the time it takes to send a message of a given size, T_0 is a fixed constant (zero-byte latency), size is the size of the message, and max_bandwidth is the maximum bandwidth of the interconnect. He gave an example of a 4 KB message with T_0 = 5 microseconds: at a maximum bandwidth of 400 MB/s the message takes 15 microseconds, and at 800 MB/s it takes 10 microseconds. He pointed out that you are effectively getting only 266 MB/s and 400 MB/s, respectively.
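To make the arithmetic concrete, here is the same calculation spelled out in a few lines of Python (my worked example; the thread just quotes the results):

# Greg's naive formula: T = T_0 + size / max_bandwidth
T0 = 5e-6                                  # 5 microsecond zero-byte latency
size = 4e3                                 # 4 KB message (4000 bytes)

for bw in (400e6, 800e6):                  # 400 MB/s and 800 MB/s
    t = T0 + size / bw
    effective = size / t
    print(f"{bw/1e6:.0f} MB/s link: {t*1e6:.0f} us, effective {effective/1e6:.0f} MB/s")
# 400 MB/s link: 15 us, effective 267 MB/s
# 800 MB/s link: 10 us, effective 400 MB/s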

Jim Lux posted to add that bandwidth is important if you have any sort of "store and forward" process (for example, a switch). The reason is that you have to wait for the entire message to arrive before you can send it on to its next destination.

Richard Walsh, knowledgeable contributor, posted to say that Greg's formula yields some interesting observations about the mid-range regime of message sizes, whose transfer times are affected about equally by bandwidth and latency. Richard pointed out that if you halve the latency in Greg's formula, you get the same effect as doubling the bandwidth (for a 4 KB message). He went on to point out that for a given interconnect there is a characteristic message size whose transfer time is equally sensitive to perturbations in bandwidth and latency (i.e., the latency and bandwidth pieces of the transfer time are equal). He gave an example of a Quadrics Elan-4-like interconnect, which has a characteristic message size of 1.6 KB.

Philippe Blaise then posted that you can rearrange Greg's formula to get the characteristic message size:

size_1/2 = T_0*max_bandwidth
where size_1/2 is the characteristic message size (sometimes called N/2). For example, he said that for T_0 = 5 microseconds, the characteristic message sizes for 400 MB/s and 800 MB/s bandwidths would be:
size_1/2 = 5 * 400 = 2 KB
size_1/2 = 5 * 800 = 4 KB
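A quick check of Philippe's numbers (again mine, not from the thread):

# size_1/2 = T_0 * max_bandwidth: the message size at which the latency term
# and the bandwidth term of the transfer time are equal.
T0 = 5e-6                                  # 5 microseconds
for bw in (400e6, 800e6):                  # 400 MB/s and 800 MB/s
    print(f"{bw/1e6:.0f} MB/s -> N/2 = {T0 * bw / 1e3:.0f} KB")
# 400 MB/s -> N/2 = 2 KB
# 800 MB/s -> N/2 = 4 KB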

Given the importance of N/2 (the characteristic message size) to the performance and scaling of codes, there are some interesting things we can take from this discussion. First, bandwidth, latency, and the time to transmit a message are all coupled. While some people focus on latency, latency, latency at the expense of all else, that focus may be misplaced. In addition to latency, bandwidth is important in determining the time to transmit a message (and this is what we all want to focus on - performance and scalability).

This analysis also shows that you can't just improve bandwidth without improving latency if you want to maintain good performance. As Philippe pointed out, increasing bandwidth alone increases N/2 (not always a good thing). So, to maintain a good value of N/2 you have to balance both bandwidth and latency.

This discussion shows how latency and bandwidth can be related using a simple formula. Focusing on one without the other may not be very useful.


Intel 64-bit (EM64T) Fortran code and AMD Opteron

I'm always amazed at what comes out of seemingly innocent questions. Oftentimes simple questions get short, simple answers. In other instances, a simple question can generate an avalanche of responses and conversations.

For example, on Oct. 28, 2004, Roland Krause asked about experiences with the Intel EM64T compiler on Opteron systems. Craig Tierney then posted asking whether binaries from the Intel EM64T compiler would even work on Opterons, which do not have SSE3. He went on to say that if you don't vectorize, or only build 32-bit applications, you should be fine. "However, for most applications the vectorization is going to give you the big win." This last statement started some interesting discussion.

Greg Lindahl quickly pointed out that most people think that vectorization is "going to give you a big win" but that SIMD (Single Instruction Multiple Data) optimization doesn't help any of the codes in the SPECfp benchmark. He mentioned that the Opteron can use both floating point pipes with scalar code, which is different than the Pentium 4. He went on to say, "I'd say this myth is the #1 myth in the HPC industry right now." I think that for this observation alone, this thread was well worth reading.

The accomplished Mikhail Kuzminsky then posted that you could run code generated with the Intel EM64T compiler on an Opteron as long as it did not use any SSE3 instructions. He also said it was possible to tell the compiler not to use SSE3. Serguei Patchkovskii echoed Mikhail's comments.
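If you are not sure whether a given machine supports SSE3, the CPU flags will tell you. Here is a small check in Python (my addition, not from the thread); on Linux, SSE3 shows up in /proc/cpuinfo as the "pni" flag.

# Report whether the CPU advertises SSE3 ("pni" in /proc/cpuinfo).
def has_sse3(cpuinfo_path="/proc/cpuinfo"):
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("flags"):
                return "pni" in line.split()
    return False

if has_sse3():
    print("SSE3 supported")
else:
    print("No SSE3: avoid binaries built with SSE3 vectorization")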

Nicholas Breakley pointed to Polyhedron's benchmarks. He mentioned that the Intel compiler came in just behind Pathscale's compiler on Opteron.

Intel's compilers have had some difficulties on Opteron processors. Here is a web page that describes the problems and offers some solutions. While many people tend to blame Intel for these problems, and I can see their point since AMD is a competitor, I also think Intel is correct in saying that it is difficult to support the Opteron at high levels of optimization since it's not their chip. So, in the meantime, the people mentioned on that web page have been patching the Intel compilers to help out the AMD folks.

PVFS on 80 proc (40 node) cluster

On Oct. 30, 2004 (Halloween Eve), Jeff Candy asked about experiences people had with PVFS (Parallel Virtual File System) on a cluster with about 80 CPUs using GigE. He was considering PVFS instead of NFS. Jeff quickly followed up to say that the code he was running was a large physics code that does about 200 KB of I/O every 10 to 60 seconds, and that every 10 minutes or so a 100 MB file is written. Jeff also said that he wanted a single file system for /home and for his working directory.
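As a back-of-envelope check of Jeff's numbers (my arithmetic, not anything from the thread), here is what that I/O pattern averages out to:

# Jeff's stated I/O pattern: ~200 KB every 10-60 seconds, plus a 100 MB file
# roughly every 10 minutes.
small_max = 200e3 / 10                     # 200 KB every 10 s  -> 20 KB/s
small_min = 200e3 / 60                     # 200 KB every 60 s  -> ~3 KB/s
checkpoint = 100e6 / 600                   # 100 MB every 10 min -> ~167 KB/s

print(f"steady I/O: {small_min/1e3:.0f}-{small_max/1e3:.0f} KB/s")
print(f"checkpoint I/O (time-averaged): ~{checkpoint/1e3:.0f} KB/s")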

Brian Smith then posted that Jeff should consider PVFS or any other parallel file system over NFS mounting for concurrent scratch space. Brian thought that PVFS2 was better than PVFS1.

Rob Latham, one of the PVFS developers, then posted to mention that if you used shared storage, "heartbeat," and enough hardware, you could have redundant PVFS1 and PVFS2 nodes (a previous posting had indicated that PVFS did not have redundancy built in). Rob also pointed out that while PVFS did not currently have software redundancy, it was a very active area of research. He added that people have been using PVFS and have not found reliability to be a problem: they run their applications using PVFS as the scratch space and then just copy their checkpoint data to tape or other long-term storage.

Of course, one solution that people didn't really put forth was to use both NFS and PVFS. You can use NFS for /home and PVFS for scratch space. You could even make your compute nodes diskless if you like. Ahhh, the flexibility of clusters.

Sidebar One: Links Mentioned in Column

Rocks

Warewulf

PVFS

PVFS2

Intel Compilers on Opteron

Fortran Compiler Comparisons


This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.