[Fwd: Re: 32-port gigabit switch]
Robert G. Brown
rgb at phy.duke.edu
Fri Mar 7 15:58:00 EST 2003
On Fri, 7 Mar 2003, Jeff Layton wrote:
> > I'm not trying to start a flame war, and I'm really curious. I suggest
> > that you're starting the flame war with your attacking tone and lack of
> > any facts (or even one example) backing up your statements. Just saying
> > "it depends" doesn't help the rest of us learn. When is Gigabit better?
> Where's RGB when you need him? :) I think enough people have
> pointed out that your statement is wrong. Have you looked in the
> beowulf archives? How about a googling?
God, you don't need me -- your answer below was better than anything I
could do...except maybe saying "children, children, children, play nice
on the list".
Seriously, everybody remember that NOBODY is flaming and no purpose is
served by even introducing the term. This is a serious, low noise list,
these are modestly complex issues, and (as Jeff's example fairly clearly
shows) you CAN'T RELY on a theoretical understanding alone to predict
performance, so fighting over whether latency or bandwidth is more
important, especially without a particular application in hand for
context, is pretty useless. Understanding them is more useful for
explaining performance observed a posteriori on YOUR CODE, or for helping
you identify the parts you CAN'T explain so you can figure them out.
As several people have already answered, parallel performance can be
dominated by latency (lots of little messages), bandwidth (fewer BIG
messages) or, quite often, neither one! Some tasks just don't do a lot
of communicating. Others communicate a fair bit, but are cleverly
organized. One whole point of parallel program design is to minimize
delays due to communications, and sometimes one can actually succeed.
In both of Jeff's more puzzling results below, perhaps the application
does a lot of the communications with DMA in approximate parallel with
the application, so the messaging bandwidth is at least partially hidden
from the application thread itself. Or something else may be going on.
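To make the latency/bandwidth distinction concrete, here is a minimal sketch of the usual first-order model: the time to move one message is a fixed latency plus the message size divided by the bandwidth. The network figures in it are made-up placeholders, not measurements of Myrinet, GigE or anything else, and the overlap parameter is only a crude stand-in for communication hidden behind computation (by DMA or otherwise). It may help explain what you measure; it is no substitute for measuring.

```python
# First-order message cost model. All names and numbers here are
# illustrative assumptions, not measured values for any real network.

def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to deliver one message: fixed latency plus size/bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def latency_fraction(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the transfer time that is pure latency (0..1)."""
    return latency_s / transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s)


def effective_bandwidth(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Bytes/second actually achieved for a message of this size."""
    return size_bytes / transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s)


def step_time(compute_s, comm_s, overlap=0.0):
    """Wall-clock time of one compute+communicate step.

    overlap is the fraction of the communication time (0..1) that is
    hidden behind computation; no more than compute_s can be hidden.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")
    hidden = min(overlap * comm_s, compute_s)
    return compute_s + comm_s - hidden


if __name__ == "__main__":
    # Hypothetical network: 100 microseconds latency, 100 MB/s bandwidth.
    latency, bandwidth = 100e-6, 100e6
    for size in (64, 1024, 65536, 10_000_000):
        frac = latency_fraction(size, latency, bandwidth)
        eff = effective_bandwidth(size, latency, bandwidth)
        print(f"{size:>10} bytes: {frac:6.1%} latency, {eff / 1e6:8.2f} MB/s effective")
```

With these placeholder numbers a 64-byte message spends nearly all of its time in latency, while a 10 MB message spends nearly all of it in the bandwidth term. With full overlap, step_time() collapses to the compute time alone, which is one way a faster network can buy you almost nothing.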
All one should >>really<< learn is that Jeff would have been stupid to
buy a more expensive network for a marginal cost of (say) $800/node that
had twice the nominal bandwidth and a fraction of the expected latency
if in fact his application would show little or no speedup with the
faster network. However, Jeff is not stupid, Jeff is smart, Jeff tested
and prototyped his application(s) on various candidate networks,
measured relative speeds, computed marginal costs, and selected the
network that let him do the most work for the least money. Jeff
deserves a raise, praise, and to be emulated! Jeff da Man!
This is a totally flame-free lesson on cluster design, and not just on
networking. AssUMe makes an Ass outa U and Me. YMMV. Try before you
buy. The only meaningful benchmark is YOUR APPLICATION. The only goal
is to MAXIMIZE WORK DONE PER DOLLAR SPENT (with a setup that doesn't
drive you nuts handling maintenance and admin). And make your granting
agency happy, have really sexy looking nodes, give your bid-buddy a lot
of business and so forth, of course...the usual;-)
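As a sketch of the work-per-dollar arithmetic, assume a fixed budget, a throughput that scales with node count (many independent jobs, say), and a speedup measured on your own application for the faster network. Every number below is hypothetical apart from the $800/node marginal cost used as an example above; plug in your own quotes and your own measurements.

```python
# Work-per-dollar comparison under a fixed budget. The prices and the
# speedup are hypothetical; the model assumes throughput scales linearly
# with the number of nodes, which will not hold for every workload.

def nodes_affordable(budget, node_cost, network_cost_per_node):
    """Whole nodes (including their share of the network) the budget buys."""
    return int(budget // (node_cost + network_cost_per_node))


def aggregate_throughput(budget, node_cost, network_cost_per_node, relative_speed):
    """Total work rate: node count times measured per-node relative speed."""
    return nodes_affordable(budget, node_cost, network_cost_per_node) * relative_speed


def breakeven_speedup(node_cost, cheap_net_cost, fast_net_cost):
    """Speedup the fast network must deliver to match the cheap one per dollar.

    Ignores the rounding to whole nodes done by nodes_affordable().
    """
    return (node_cost + fast_net_cost) / (node_cost + cheap_net_cost)


if __name__ == "__main__":
    budget = 100_000            # hypothetical total budget, dollars
    node_cost = 2_000           # hypothetical cost of one node without network
    cheap_net, fast_net = 200, 1_000   # per-node network cost; $800 marginal

    cheap = aggregate_throughput(budget, node_cost, cheap_net, 1.00)
    fast = aggregate_throughput(budget, node_cost, fast_net, 1.20)  # assumed 20% gain
    print(f"cheap network: {cheap:.1f} units of work per unit time")
    print(f"fast network:  {fast:.1f} units of work per unit time")
    print(f"break-even speedup: {breakeven_speedup(node_cost, cheap_net, fast_net):.2f}x")
```

With these made-up prices the faster network would have to speed the application up by about 36% to break even, so a measured 20% gain would not pay for itself. Different prices or a more communication-bound code can easily reverse that conclusion, which is the whole point of measuring first.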
This doesn't mean you shouldn't learn all about latency, bandwidth,
bottlenecks and so forth -- au contraire! -- only that you should take
them with a grain of salt and use them to help understand measured
performance with an eye to improving it (or your code) and be very
chary/wary about prediction. Jeff da Man is even doing THIS below.
> > In my experience the computation portion of a Beowulf will always
> > require low latencies for optimal performance.
> OK. We have 3 MPI applications. Two are internally written and
> one is from NASA. We have extensively tested these 3 codes with
> many varying data sets on all kinds of HPC equipment (Crays,
> SGI Origins, SPs, clusters, etc.). However, I'll focus on clusters
> (beowulfs in particular).
> We have tested on equipment with Myrinet, GigE, and FastE. The
> nodes were the same and only the network changed along with
> some tuning to get the best performance out of each. Here's what
> we have found:
> Code 1 - First internal code. Running on Myrinet compared to GigE
> only gives you about 20% better wall-clock time for some cases. For
> other cases, Myrinet is slower than GigE (still trying to explain that
> one :). Myrinet is about twice as fast as FastE.
> Observations - We think this code is more constrained by latency
> than bandwidth when you compare Myrinet and GigE. We have
> looked at the message sizes and they are fairly small (tiny). This
> pushes this code down the bandwidth/message size curve almost
> to the point where you measure latency. So latency appears to be
> a driver for this code. Also, not much overlapping communication/
> computation in this code.
> Code 2 - Second internal code. Running on Myrinet compared to
> GigE is only about 3% faster for just about all cases. Myrinet is
> about twice as fast as FastE.
> Observations - Although we should see better performance with
> Myrinet compared to GigE due to better bandwidth, we think this
> code is limited by bandwidth instead of latency. The message sizes
> for this code are very large, pushing the code way up the bandwidth/
> message size curve. We're still working on identifying all of the
> bottlenecks, but from a networking standpoint, this is what we
> have concluded so far. Also, not much overlapping communication/
> computation in this code.
> Code 3 - NASA code. This code only runs about 2-3% faster on
> GigE and Myrinet compared to FastE. The code appears to be well
> thought out with respect to overlapping communication/computation.
> Observations - This code appears not to be constrained by either
> latency or bandwidth.
> Disclaimer - There are lots of things I ignored in this simple analysis
> such as memory bandwidth, etc. The data to support these observations
> also came from testing on other systems and on testing with other
> types of networking (Quadrics, Scali, etc.). All of the numbers are
> wall-clock times.
> With these general rules of thumb (we always test before we
> buy) and knowing the mix between the codes, we do a price/performance
> to configure the best system. Right now (and this is subject to change),
> GigE provides better price/performance for our code mix.
> Of course, this also depends on what GigE equipment we're talking
> about. I think Mark has pointed out in the past, as well as others, that
> not all GigE equipment is created equal (this is also generally true for
> FastE as well). However, for the GigE equipment we have tested on
> and also have in production we have found GigE is the way to go for
> us for our mix of codes.
> > On the other hand, when I have applications that need to transfer a lot
> > of data as well, I find that having two networks is the way to go. One
> > for control and messaging traffic (low latency - Myrinet) and one for
> > data traffic (high throughput - Gigabit).
> What kinds of applications?
> So you run control and MPI message traffic over Myrinet and
> NFS over GigE? Myrinet has better bandwidth than GigE, so
> it appears that if data transfer is important I would switch NFS
> to Myrinet and MPI traffic to GigE (unless of course you see a
> big difference in performance). If you do see a big difference in
> performance, what about using two Myrinet networks (trying to
> get you some sales Patrick! :)?
> If latency is that important, have you tried Quadrics? In our
> experience it has lower latencies than Myrinet. What MPI
> implementations have you tried? Do you run 1 ppn with single
> CPUs, or 1 ppn with SMP nodes, or 2 ppn with SMP nodes,
> or something else? All of these things can have a large impact on
> > If you would rather take it off list, then feel free to email me
> > directly, but I would really like to know because I can't think of one
> > example that works.
> I hope my response answered your question. Anybody care to
> present another example where bandwidth is more important than
> latency? Greg? Mark? RGB? Doug? Don?
> Dr. Jeff Layton
> Senior Engineer
> Lockheed-Martin Aeronautical Company - Marietta
> Aerodynamics & CFD
> "Is it possible to overclock a cattle prod?" - Irv Mullins
> This email may contain confidential information. If you have received this
> email in error, please delete it immediately, and inform me of the mistake by
> return email. Any form of reproduction, or further dissemination of this
> email is strictly prohibited. Also, please note that opinions expressed in
> this email are those of the author, and are not necessarily those of the
> Lockheed-Martin Corporation.
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu