The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this article I review some postings to the Beowulf list on aspects of cluster management, network bandwidth in clusters (with some extremely good information), Intel EM64T code and Opterons, and PVFS on a medium size cluster.

About Management

This thread, while short, points out a few concepts that are worth repeating. On Oct. 15, 2004, llwaeva asked how to manage an 8 node cluster. In particular, he wanted to know how to easily update all 8 nodes at the same time and how to handle user accounts, NFS, NIS, and so on.

The ever vigilant (or is it vigilante) Robert Brown was the first to respond. Robert suggested that for a small 8 node cluster you could easily use the head node as the NIS (Network Information Service) server as well as an NFS server. After a little reminiscing, Robert said that NIS and NFS would not even affect the head node much. He also said that there were some alternatives to NIS. For example, you could use rsync to propagate accounts from the head node to the other nodes. He also suggested kickstart or yum to automate OS installation and maintenance.

John Hearns, from across the pond (since we are waxing nostalgic), mentioned that cluster management can be handled by several toolkits, Rocks and Warewulf to name two. He also mentioned that there are utilities that allow parallel execution of commands (i.e. commands that run on all of the nodes at once). pdsh is one of the better known parallel command tools. John also mentioned that rsync was a reasonable way of managing user accounts.
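To make the idea of a parallel command tool concrete, here is a minimal Python sketch of the concept. It is not pdsh and does not reproduce its features; it simply runs one command on every node over ssh using threads. The function names, the 60 second timeout, and the `runner` parameter are all assumptions made for illustration, and it presumes passwordless ssh from the head node to each compute node.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_on_node(node, command, runner=subprocess.run):
    """Run one shell command on one node via ssh and collect the result.

    `runner` defaults to subprocess.run; it is a parameter only so the
    sketch can be exercised without a real cluster (an assumption of
    this example, not something from the mailing list thread).
    """
    result = runner(
        ["ssh", node, command],
        capture_output=True,  # keep stdout/stderr instead of printing them
        text=True,            # decode output as str rather than bytes
        timeout=60,           # arbitrary limit so a dead node cannot hang us
    )
    return node, result.returncode, result.stdout.strip()


def run_everywhere(nodes, command, runner=subprocess.run):
    """Run the same command on all nodes concurrently, one thread per node.

    Results come back in the same order as `nodes`.
    """
    workers = max(1, len(nodes))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda n: run_on_node(n, command, runner), nodes))
```

With hypothetical host names such as node01 through node08, a call like `run_everywhere(["node01", "node02"], "uname -r")` would return one `(node, return code, output)` tuple per node. A real tool adds fan-out limits, output collation, and better error handling, which is why the toolkits mentioned in the thread are usually the better choice.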

Gary Milenkamp said that he used systemimager for software management. He also said that he recently started using LDAP and it greatly simplified the account management process.

The respected Sean Dilda jumped in to say that his cluster relies heavily on NIS and NFS. He uses NFS for user home directories but packaged up the commonly used codes to install on the nodes. He also said that you could easily run NIS/NFS on the head node of a smallish cluster as well as a job scheduler.

Bandwidth: who needs it?

There has been some discussion about the importance of network bandwidth and latency for application performance. On Oct. 16, 2004 Mark Hahn, resident expert, posted a question to the mailing list. He was looking for applications that push the limits of MPI bandwidth (around 800-900 MB/sec).

The esteemed Greg Lindahl posted with some very interesting observations. He said that bandwidth is important not only for large messages but also medium ones. He gave what he called a naive formula for how long it takes to send a message.

T_size = T_0 + size/max_bandwidth
where T_size is the time it takes to send a packet of a given size, T_0 is a fixed constant (the zero-size packet latency), size is the size of the packet, and max_bandwidth is the maximum bandwidth of the interconnect. He gave an example of a 4K message with T_0 = 5 microseconds. With a maximum bandwidth of 400 MB/s the message takes about 15 microseconds, and with 800 MB/s it takes about 10 microseconds. He pointed out that the effective bandwidth is therefore only about 266 MB/s and 400 MB/s, respectively.
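The arithmetic in Greg's example is easy to check with a few lines of Python. This is only a sketch of the naive formula; the function names are mine, and it assumes that "4K" means 4000 bytes and that 1 MB/s means 10^6 bytes per second (which is one byte per microsecond, so bytes divided by MB/s gives microseconds directly).

```python
def transfer_time_us(size_bytes, t0_us, bandwidth_mb_s):
    """Naive model: T_size = T_0 + size / max_bandwidth, in microseconds.

    1 MB/s is taken as 1e6 bytes/s, i.e. 1 byte per microsecond, so
    size_bytes / bandwidth_mb_s is already a time in microseconds.
    """
    return t0_us + size_bytes / bandwidth_mb_s


def effective_bandwidth_mb_s(size_bytes, t0_us, bandwidth_mb_s):
    """Bandwidth actually achieved for one message, latency included."""
    return size_bytes / transfer_time_us(size_bytes, t0_us, bandwidth_mb_s)


SIZE = 4000   # assumption: "4K" read as 4000 bytes
T0 = 5.0      # zero-size packet latency in microseconds

for bw in (400, 800):
    t = transfer_time_us(SIZE, T0, bw)
    eff = effective_bandwidth_mb_s(SIZE, T0, bw)
    print(f"{bw} MB/s link: {t:.1f} us, effective {eff:.1f} MB/s")
# 400 MB/s link: 15.0 us, effective 266.7 MB/s
# 800 MB/s link: 10.0 us, effective 400.0 MB/s
```

Doubling the link bandwidth from 400 to 800 MB/s cuts the transfer time by only a third, because the fixed 5 microseconds of latency does not shrink.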

Jim Lux posted to add that bandwidth is important if you have any sort of "store and forward" process (for example, a switch). The reason is that the whole message has to arrive before it can be sent on to its next destination.

Richard Walsh, knowledgeable contributor, posted to say that Greg's formula says something interesting about the mid-range regime of message sizes, where transfer times are affected about equally by bandwidth and latency. Richard pointed out that if you halve the latency in Greg's formula, you get the same effect as doubling the bandwidth (for a 4K message in the 800 MB/s case, where the latency term and the bandwidth term are both about 5 microseconds). Richard went on to point out that for a given interconnect there is a characteristic message size whose transfer time is equally sensitive to perturbations in bandwidth and latency (i.e. the latency and bandwidth pieces of the transfer time are equal). He gave an example of a Quadrics Elan-4-like interconnect, which has a characteristic message size of 1.6K.
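Richard's observation can be checked numerically with Greg's naive formula. The sketch below again assumes that "4K" means 4000 bytes and that 1 MB/s is one byte per microsecond; the function and variable names are illustrative only. It compares halving the latency with doubling the bandwidth for both of Greg's example links.

```python
def transfer_time_us(size_bytes, t0_us, bandwidth_mb_s):
    """Naive model: T_size = T_0 + size / max_bandwidth, in microseconds."""
    return t0_us + size_bytes / bandwidth_mb_s


SIZE = 4000   # assumption: "4K" read as 4000 bytes
T0 = 5.0      # latency in microseconds

for bw in (800.0, 400.0):
    base = transfer_time_us(SIZE, T0, bw)
    half_latency = transfer_time_us(SIZE, T0 / 2, bw)
    double_bw = transfer_time_us(SIZE, T0, 2 * bw)
    print(f"{bw:.0f} MB/s: base {base:.1f} us, "
          f"half latency {half_latency:.1f} us, "
          f"double bandwidth {double_bw:.1f} us")
# 800 MB/s: base 10.0 us, half latency 7.5 us, double bandwidth 7.5 us
# 400 MB/s: base 15.0 us, half latency 12.5 us, double bandwidth 10.0 us
```

At 800 MB/s the two improvements are worth exactly the same, because 4000 bytes is the size at which the latency and bandwidth terms are equal. At 400 MB/s the same message is bandwidth dominated, so doubling the bandwidth helps more than halving the latency.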

Philippe Blaise then posted that you can manipulate Greg's formula to get the characteristic message size.

size_1/2 = T_0*max_bandwidth
where size_1/2 is the characteristic message size (sometimes called N/2). For example, he said that for T_0 = 5 microseconds, the characteristic message sizes for 400 MB/s and 800 MB/s of bandwidth would be:
size_1/2 = 5 microseconds * 400 MB/s = 2 KB
size_1/2 = 5 microseconds * 800 MB/s = 4 KB
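A short Python sketch shows where the name N/2 comes from: at the characteristic size, the latency term and the bandwidth term of the naive formula are equal, so the message achieves exactly half of the link's maximum bandwidth. The function names are mine, and the sketch assumes 1 MB/s is one byte per microsecond and 1 KB is 1000 bytes.

```python
def characteristic_size_bytes(t0_us, bandwidth_mb_s):
    """size_1/2 = T_0 * max_bandwidth (microseconds * bytes/us = bytes)."""
    return t0_us * bandwidth_mb_s


def effective_bandwidth_mb_s(size_bytes, t0_us, bandwidth_mb_s):
    """Achieved bandwidth under T_size = T_0 + size / max_bandwidth."""
    return size_bytes / (t0_us + size_bytes / bandwidth_mb_s)


T0 = 5.0  # latency in microseconds

for bw in (400.0, 800.0):
    n_half = characteristic_size_bytes(T0, bw)
    eff = effective_bandwidth_mb_s(n_half, T0, bw)
    print(f"{bw:.0f} MB/s: size_1/2 = {n_half:.0f} bytes, "
          f"effective bandwidth there = {eff:.0f} MB/s")
# 400 MB/s: size_1/2 = 2000 bytes, effective bandwidth there = 200 MB/s
# 800 MB/s: size_1/2 = 4000 bytes, effective bandwidth there = 400 MB/s
```

Messages much smaller than size_1/2 are dominated by latency, and messages much larger are dominated by bandwidth.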

Given the importance of N/2 (characteristic message size) to the performance of codes and their scaling, there are some interesting things we can take from this discussion. Firstly, bandwidth, latency, and the time to transmit a message are all coupled. While some people focus on latency, latency, latency at the expense of all else, their focus may be misplaced. In addition to latency, bandwidth is important in determining the time to transmit a message (and this is what we all want to focus on: performance and scalability).

This analysis also shows that you can't just improve bandwidth without improving latency if you want to maintain good performance. As Philippe pointed out, just increasing bandwidth increases N/2 (not always a good thing). So, to maintain a good value of N/2 you have to balance both bandwidth and latency.

This discussion shows how latency and bandwidth can be related using a simple formula. Focusing on one without the other may not be very useful.



This work is licensed under CC BY-NC-SA 4.0

©2005-2023 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.