The answer is: Aspects of Cluster Management,
network bandwidth in clusters (extremely good information here),
Intel EM64T code and Opterons,
and PVFS on a medium size cluster
The Beowulf mailing list provides detailed discussions about
issues concerning Linux HPC clusters.
In this article I review some postings to the
Beowulf list on
aspects of cluster management, network bandwidth in clusters
(with some extremely good information),
Intel EM64T code and Opterons, and PVFS on a medium size cluster.
About Management
This thread, while short, points out a few concepts that are worth
repeating. On Oct. 15, 2004, llwaeva asked how to manage an 8 node
cluster. In particular, the poster wanted to know how to easily update
all 8 nodes at the same time and how to handle user accounts, NFS,
NIS, and so on.
The ever vigilant (or is it vigilante?) Robert Brown was the first to
respond. Robert suggested that for a small 8 node cluster you could
easily use the head node as the NIS (Network Information Service)
server as well as an NFS server. After a little reminiscing, Robert
said that NIS and NFS would not even affect the head node much. He
also said that there were some alternatives to NIS. For example, you
could use rsync to propagate accounts from the head node to the other
nodes. He also suggested kickstart or yum to automate OS installation
and maintenance.
John Hearns, from across the pond (since we are waxing nostalgic),
mentioned that cluster management could be handled by several
toolkits, Rocks and Warewulf to name two. He also mentioned that
there are utilities that allow parallel execution of commands (i.e.,
commands that run on all of the nodes at once). pdsh is one of
the better known parallel command tools. John also mentioned that
rsync was a reasonable way of managing user accounts.
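
To make the idea of parallel command execution concrete, here is a minimal Python sketch of what a tool like pdsh does conceptually: run the same command on every node over ssh at once. This is not how pdsh itself is implemented. The hostnames (node01 through node08) and the function names are hypothetical, and the sketch assumes passwordless ssh from the head node to the compute nodes.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical hostnames for an 8 node cluster; substitute your own.
NODES = [f"node{i:02d}" for i in range(1, 9)]


def ssh_command(host, command):
    """Build the argument list that runs `command` on `host` over ssh."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, command]


def run_one(host, command, runner=subprocess.run):
    """Run the command on one host; return (host, exit code, stdout)."""
    result = runner(ssh_command(host, command), capture_output=True, text=True)
    return host, result.returncode, result.stdout.strip()


def run_on_all(hosts, command, runner=subprocess.run):
    """Run the same command on every host concurrently, one thread per host."""
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        # pool.map returns results in the same order as `hosts`.
        return list(pool.map(lambda h: run_one(h, command, runner), hosts))
```

Calling run_on_all(NODES, "uname -r") from the head node would return one (host, exit code, output) tuple per node. The runner argument exists only so the ssh call can be swapped out for testing.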
Gary Milenkamp said that he used SystemImager for software
management. He also said that he had recently started using LDAP and
that it greatly simplified the account management process.
The respected Sean Dilda jumped in to say that his cluster relies
heavily on NIS and NFS. He uses NFS for user home directories but
packages up the commonly used codes and installs them on the nodes. He
also said that you could easily run NIS/NFS, as well as a job
scheduler, on the head node of a smallish cluster.
Bandwidth: who needs it?
There has been some discussion about the importance of network
bandwidth and latency for application performance. On Oct. 16, 2004,
Mark Hahn, resident expert, posted a question to the mailing list. He
was looking for applications
that push the limits of MPI bandwidth (around 800-900 MB/sec).
The esteemed Greg Lindahl posted with some very interesting
observations. He said that bandwidth is important not only for
large messages but also medium ones. He gave what he called a naive
formula for how long it takes to send a message:
T_size = T_0 + size/max_bandwidth
where T_size is the time it takes to send a packet of a given size,
T_0 is a fixed constant (the zero-size packet latency), size is the
size of the packet, and max_bandwidth is the maximum bandwidth of the
interconnect. He gave an example of a 4K message with
T_0 = 5 microseconds. With a maximum bandwidth of 400 MB/s the
message takes 15 microseconds, and with 800 MB/s it takes 10
microseconds. He pointed out that the effective bandwidth is therefore
only 266 MB/s and 400 MB/s, respectively.
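
Greg's arithmetic is easy to check with a few lines of Python. This is only a sketch of the naive formula above. The function names are mine, and I am assuming that 4K means 4000 bytes and that 1 MB is 10^6 bytes, so that 1 MB/s is exactly 1 byte per microsecond, which is what the rounded figures in the example imply.

```python
def transfer_time_us(size_bytes, t0_us, max_bandwidth_mb_s):
    """Naive transfer time in microseconds: T_0 + size / max_bandwidth."""
    # With 1 MB = 10**6 bytes, 1 MB/s is 1 byte per microsecond,
    # so bytes divided by MB/s gives microseconds directly.
    return t0_us + size_bytes / max_bandwidth_mb_s


def effective_bandwidth_mb_s(size_bytes, t0_us, max_bandwidth_mb_s):
    """Bandwidth actually achieved once the fixed latency is included."""
    return size_bytes / transfer_time_us(size_bytes, t0_us, max_bandwidth_mb_s)


# Assumed values from the example: 4K message (taken as 4000 bytes), T_0 = 5 us.
for bandwidth in (400, 800):
    t = transfer_time_us(4000, 5.0, bandwidth)
    eff = effective_bandwidth_mb_s(4000, 5.0, bandwidth)
    print(f"{bandwidth} MB/s link: {t:.1f} us, effective {eff:.1f} MB/s")
```

Under those assumptions the 400 MB/s link takes 15 microseconds (about 266.7 MB/s effective) and the 800 MB/s link takes 10 microseconds (400 MB/s effective), matching the figures quoted above.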
Jim Lux posted to add that bandwidth is important if you have any
sort of "store and forward" process (for example, a switch). The
reason is that you have to wait for the whole message to arrive before
you can send it to its next destination.
Richard Walsh, knowledgeable contributor, posted to say that Greg's
formula offers some insight into the mid-range regime of message sizes,
whose transfer times are affected about equally by bandwidth and
latency. Richard pointed out that if you halve the latency in Greg's
formula, you get the same effect as doubling the bandwidth (for a
message of 4K). Richard went on to point out that for a given
interconnect there is a characteristic message size whose transfer
time is equally sensitive to perturbations in bandwidth and
latency (i.e., the latency and bandwidth pieces of the transfer
time are equal). He gave an example of a Quadrics Elan-4-like
interconnect, which has a characteristic message size of 1.6K.
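
Richard's observation can be checked against Greg's naive formula with a short sketch. The function names are mine, and I am again assuming 4K means 4000 bytes and 1 MB/s is 1 byte per microsecond. With the example numbers (T_0 = 5 microseconds), the two changes give exactly the same transfer time on the 800 MB/s link, where the latency and bandwidth terms are both 5 microseconds. On the 400 MB/s link the bandwidth term dominates, so doubling the bandwidth helps more.

```python
def transfer_time_us(size_bytes, t0_us, max_bandwidth_mb_s):
    """Naive transfer time in microseconds: T_0 + size / max_bandwidth."""
    return t0_us + size_bytes / max_bandwidth_mb_s


def halve_latency_vs_double_bandwidth(size_bytes, t0_us, max_bandwidth_mb_s):
    """Return (time with half the latency, time with twice the bandwidth)."""
    half_latency = transfer_time_us(size_bytes, t0_us / 2, max_bandwidth_mb_s)
    double_bandwidth = transfer_time_us(size_bytes, t0_us, 2 * max_bandwidth_mb_s)
    return half_latency, double_bandwidth


# Assumed values: 4K message (taken as 4000 bytes), T_0 = 5 microseconds.
for bandwidth in (400, 800):
    half_lat, double_bw = halve_latency_vs_double_bandwidth(4000, 5.0, bandwidth)
    print(f"{bandwidth} MB/s: half latency {half_lat} us, "
          f"double bandwidth {double_bw} us")
```

At 800 MB/s both changes bring the time from 10 down to 7.5 microseconds. At 400 MB/s halving the latency gives 12.5 microseconds while doubling the bandwidth gives 10.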
Philippe Blaise then posted that you can manipulate Greg's formula
to get the characteristic message size:
size_1/2 = T_0*max_bandwidth
where size_1/2 is the characteristic message size (sometimes
called N/2).
For example, he said that for T_0 = 5 microseconds, the
characteristic message sizes for 400 MB/s and 800 MB/s bandwidths
would be:
size_1/2 = 5 microseconds * 400 MB/s = 2 KB
size_1/2 = 5 microseconds * 800 MB/s = 4 KB
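
The same relationship in a few lines of Python, as a sketch only. The function names are mine, and I assume 1 MB/s is 1 byte per microsecond and 1 KB is 1000 bytes. The third case (T_0 = 2.5 microseconds at 800 MB/s) is my own illustrative value, not from the thread. It shows that halving the latency while doubling the bandwidth keeps N/2 where it was.

```python
def half_size_bytes(t0_us, max_bandwidth_mb_s):
    """Characteristic message size N/2 = T_0 * max_bandwidth, in bytes."""
    return t0_us * max_bandwidth_mb_s


def effective_bandwidth_mb_s(size_bytes, t0_us, max_bandwidth_mb_s):
    """Achieved bandwidth under the naive model T = T_0 + size / max_bandwidth."""
    return size_bytes / (t0_us + size_bytes / max_bandwidth_mb_s)


# (T_0 in microseconds, max bandwidth in MB/s); the last pair is illustrative.
for t0_us, bandwidth in [(5.0, 400), (5.0, 800), (2.5, 800)]:
    n_half = half_size_bytes(t0_us, bandwidth)
    eff = effective_bandwidth_mb_s(n_half, t0_us, bandwidth)
    print(f"T_0={t0_us} us, {bandwidth} MB/s: N/2 = {n_half:.0f} bytes, "
          f"effective bandwidth at N/2 = {eff:.0f} MB/s")
```

In this model a message of size N/2 achieves exactly half of the maximum bandwidth, which is where the name comes from. Doubling the bandwidth alone moves N/2 from 2000 to 4000 bytes, so messages have to be twice as large to reach the same fraction of peak.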
Given the importance of N/2 (characteristic message size) to the
performance of codes and their scaling, there are some interesting
things we can take from this discussion. First, bandwidth, latency,
and the time to transmit a message are all coupled. While some people
focus on latency, latency, latency at the expense of all else, their
focus may be misplaced. In addition to latency, bandwidth is important
in determining the time to transmit a message (and this is what we all
want to focus on: performance and scalability).
This analysis also shows that you can't just improve bandwidth without
improving latency if you want to maintain good performance. As Philippe
pointed out, just increasing bandwidth increases N/2 (not always a
good thing). So, to maintain a good value of N/2 you have to balance
both bandwidth and latency.
This discussion shows how latency and bandwidth can be related
using a simple formula. Focusing on one without the other may
not be very useful.