New cluster benchmark proposal (Re: top500 list)

Wed Nov 19 09:15:33 EST 2003

On Wed, 19 Nov 2003, Felix Rauch wrote:

> On Tue, 18 Nov 2003, Robert G. Brown wrote:
> > I don't think that they are quite "done", though, (at least the last
> > time I checked) so yes, I'd call it a "start" to the idea.  Not really
> > my idea, as you can see.  I think there are lots of folks who have
> > thought on this, and lots more that have a de facto suite they use
> > whether or not they are packaged.  lmbench, netpipe, netperf, bonnie,
> > memtest86 -- lots of tools out there for doing bits of this, some of
> > them very nice.
> 
> Please correct me if I'm wrong, but if I remember correctly netpipe
> and netperf are one-to-one benchmarks. While these are important to
> find out more about (and tune) the performance of your NICs, we need
> more to find out about the overall performance of the whole cluster
> network.

No, of course I agree in detail with all of the observations below.
This was what I meant when I suggested tests involving various message
passing communications patterns in raw sockets, MPI, PVM -- in more
detail, master-slave (boring but often relevant), tree distribution,
all-to-all with and without some effort to avoid collisions, etc.
Netpipe is very nice and does let you test PVM and MPI, but isn't
really engineered for driving a cluster switch to its figurative knees.
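
To make the patterns above concrete, here is a minimal sketch (Python,
purely illustrative) that generates who-sends-to-whom schedules for
master-slave, tree distribution, and all-to-all with and without an
effort to avoid collisions.  The function names and the "round" model
are my own assumptions, not part of netpipe or any existing suite; a
real test would hand each (src, dst) pair to a socket, MPI or PVM
transfer and time it.

```python
# Illustrative schedule generators; a "round" is a list of (src, dst)
# transfers that are meant to run at the same time.

def master_slave(n):
    """Node 0 sends to each other node in turn, one transfer per round."""
    return [[(0, dst)] for dst in range(1, n)]

def tree(n):
    """Binomial-tree distribution from node 0: the set of nodes holding
    the data doubles every round."""
    rounds, have = [], 1
    while have < n:
        rounds.append([(src, src + have)
                       for src in range(have) if src + have < n])
        have *= 2
    return rounds

def all_to_all_naive(n):
    """Every node sends to the same destination at once (collisions)."""
    return [[(src, dst) for src in range(n) if src != dst]
            for dst in range(n)]

def all_to_all_shifted(n):
    """Round r: node i sends to (i + r) % n, so every node receives
    exactly one message per round."""
    return [[(i, (i + r) % n) for i in range(n)] for r in range(1, n)]

def max_fan_in(round_):
    """Largest number of transfers aimed at a single receiver."""
    counts = {}
    for _, dst in round_:
        counts[dst] = counts.get(dst, 0) + 1
    return max(counts.values())
```

Comparing timings for all_to_all_naive() against all_to_all_shifted()
on the same switch is one cheap way to see how much the collisions
actually cost.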

> 
> There are switches whose backplane offers only half bisectional
> bandwidth, which might be fine for some applications. Other switches
> are advertised to offer full bisectional bandwidth, but they simply
> can't keep that promise. Other switches are expensive but deliver real
> full bisection bandwidth. Some applications don't care if they don't
> have a full-bisection-bandwidth network -- others do.
> 
> So, for a comprehensive cluster benchmark, we should also have tools
> to get insight into the inner workings of the network. Our research
> group introduced such a benchmark as part of our paper
> "Cost/Performance Tradeoffs in Network Interconnects for Clusters of
> Commodity PCs" presented at this year's CAC workshop (see [1]). We
> found out that some switches perform rather poorly for some
> communication patterns and that a full bisection bandwidth can play a
> role for the performance of some applications (e.g. car traffic
> simulation).
> 
> While we don't have a ready-to-be-used-for-all-clusters kind of
> benchmark, I still hope the ideas might be valuable for this
> discussion.
> 
> - Felix
> 
> [1] http://www.cs.inf.ethz.ch/CoPs/publications/#cac03

This is the kind of thing that should ultimately be a component of any
full suite.
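
As a rough sketch of what such a component might compute (this is my
own illustration, NOT the benchmark from Felix's paper): split the
nodes into two halves, drive one stream per pair across the cut, and
compare the aggregate rate to what the links could carry.  The
function names are hypothetical, and the measured per-pair rates are
assumed to come from some other tool.

```python
def bisection_pairs(n):
    """Pair node i in the lower half with node i + n//2 in the upper
    half, so every transfer crosses the same cut."""
    half = n // 2
    return [(i, i + half) for i in range(half)]

def bisection_fraction(pair_rates, link_rate):
    """Aggregate measured rate across the cut divided by the ideal
    (number of pairs * link_rate).  1.0 means full bandwidth for this
    particular cut, 0.5 means half."""
    return sum(pair_rates) / (len(pair_rates) * link_rate)
```

Note that this tests only one cut.  Bisection bandwidth proper is the
minimum over all ways of halving the nodes, so a real test would have
to try many pairings.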

What we really need are some handy dandy students who want to write and
GPL all of this stuff and publish it.  Alas, I'm a physicist and don't
have the right kind of students, and although I do work on writing it
myself I lack the time to really put it all together.  It does seem like
the sort of project a CS department with research efforts in cluster
computing might want to tackle and "own", the way the Clemson guys own
PVFS.

Maybe I'll talk to my CS cluster colleagues here at Duke and see if a
joint proposal can be worked out, perhaps collaboratively with a few
other interested groups elsewhere.  I seriously think that there is real
computer science work to be done here, with an end stage goal being the
creation of a daemon or kernel module that automagically generates
microbenchmark numbers (ideally from a suite of modules that can be
added or deleted at any time by e.g. dropping a suitably instrumented
program file in a suitable directory) that are subsequently published in
/proc (I've suggested this on the lmbench list at least twice now, to no
avail).
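
A user-space sketch of the drop-in idea, just to show the shape of it.
Everything here is an assumption on my part: the "name value" output
convention, the function names, and the use of an ordinary text file
as a stand-in for /proc (a real /proc entry would need a kernel
module).

```python
import os
import subprocess

def parse_metrics(text):
    """Parse 'name value' lines; skip comments and malformed lines."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or parts[0].startswith("#"):
            continue
        try:
            metrics[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return metrics

def run_plugin(path, timeout=60):
    """Run one benchmark program; return its stdout, or '' on failure."""
    try:
        done = subprocess.run([path], capture_output=True, text=True,
                              timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        return ""
    return done.stdout if done.returncode == 0 else ""

def collect(plugin_dir, run=run_plugin):
    """Run every executable file currently in plugin_dir.  Modules are
    added or removed simply by adding or removing files."""
    results = {}
    for name in sorted(os.listdir(plugin_dir)):
        path = os.path.join(plugin_dir, name)
        if os.path.isfile(path) and os.access(path, os.X_OK):
            for key, value in parse_metrics(run(path)).items():
                results[f"{name}.{key}"] = value
    return results

def publish(results, out_path):
    """Write results via a temp file and rename, so a reader never
    sees a half-written file."""
    tmp = out_path + ".tmp"
    with open(tmp, "w") as f:
        for key in sorted(results):
            f.write(f"{key} {results[key]:.6g}\n")
    os.replace(tmp, out_path)
```

A daemon would just call collect() and publish() in a loop with a
sleep in between, ideally only when the system is otherwise idle.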

The advantage of this is that one COULD then rewrite e.g. ATLAS so that
instead of having to be rebuilt for each micro-architecture on which it
might run (a tedious and time consuming process) it simply drops its
basic parametric tests in (if they aren't already in the default set)
and runs.  When it runs it reads in increasingly accurate numbers from
/proc and dynamically autotunes.  One could likely add a damped gradient
search to the autotuning routine so that it can actually adjust itself
(gradually) to very specific features of the system on which it is
running, including the effect of the rest of its typical dynamic load.
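
One possible reading of "damped gradient search", as a sketch only
(this is not how ATLAS tunes itself): estimate the slope of a measured
cost with a finite difference and move only a small fraction of the
way it suggests, so noisy timings can't jerk the parameter around.
The names, the rate, and the idea of treating the tunable as a
continuous number are all my assumptions.

```python
def autotune(cost, x0, rate=0.1, h=1.0, iters=100):
    """Damped gradient search on a single tunable parameter.

    cost  -- callable returning a measured cost (e.g. seconds) at x
    x0    -- starting value of the parameter (e.g. a block size)
    rate  -- damping: fraction of the raw gradient step actually taken
    h     -- half-width of the central finite difference
    """
    x = float(x0)
    for _ in range(iters):
        grad = (cost(x + h) - cost(x - h)) / (2.0 * h)
        x -= rate * grad
    return x

# Usage with a made-up cost that is cheapest at x = 64; in practice
# cost() would time the real kernel or read a number published by the
# benchmark daemon.
best = autotune(lambda x: (x - 64.0) ** 2, x0=16)
```

A program that keeps running this with a small rate drifts toward
whatever is best under its actual load, rather than what was best at
build time.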

And not just ATLAS, of course.  ANY program that might need to switch
algorithm or access pattern based on microperformance metrics could
benefit.  As a single example, it might be possible to write a PVM or
MPI program that automagically selects an optimal message passing
pattern IF there were microbenchmark results immediately available
indicating message passing efficiency at various scales (varying message
size, distribution pattern, number of nodes).
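
A sketch of what that selection could look like, assuming the results
were available as a table of (message size, number of nodes, seconds)
rows per pattern.  The table layout and function name are hypothetical,
and the numbers in the usage lines are placeholders, not measurements.

```python
def pick_pattern(table, msg_bytes, nodes):
    """Return the pattern whose nearest benchmarked point is fastest.

    table -- {pattern_name: [(msg_bytes, nodes, seconds), ...]}
    The nearest point is chosen by node count first, then by message
    size; no interpolation is attempted.
    """
    def nearest(rows):
        return min(rows, key=lambda r: (abs(r[1] - nodes),
                                        abs(r[0] - msg_bytes)))
    return min(table, key=lambda name: nearest(table[name])[2])

# Placeholder values for illustration only.
table = {
    "tree":         [(1024, 16, 0.002), (1048576, 16, 0.9)],
    "master_slave": [(1024, 16, 0.003), (1048576, 16, 0.5)],
}
small = pick_pattern(table, msg_bytes=2000, nodes=16)
large = pick_pattern(table, msg_bytes=10**6, nodes=16)
```

With this placeholder table the choice flips between small and large
messages, which is exactly the kind of decision a program can't make
sensibly without the numbers in hand.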

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf