[Beowulf] query: aggregate cluster performance monitoring without multicast

Robert G. Brown rgb at phy.duke.edu
Fri Jan 9 08:29:23 EST 2004

On Thu, 8 Jan 2004, Chris Dagdigian wrote:

> {Forwarded to this list on behalf of a friend with some email troubles...}
> > I am in the process of trying to get a stopgap performance monitoring
> > system going on a 64 CPU Linux cluster with LSF.  Ultimately, I hope to
> > use PCP for data collection, but since nobody seems to be doing this yet,
> > we are going to be rolling our own solution.  To meet some of their needs,
> > management has asked for an interim solution that gives them a web page
> > with aggregate usage statistics and such.
> > 
> > Unfortunately, ganglia is a non-starter because our networking group can't
> > enable multicast for the private network the cluster lives on (it would
> > break almost everything else).
> > 
> > Does anyone have any suggestions for an alternative that can be quickly
> > implemented, doesn't rely on multicast, and that can generate graphs of aggregate
> > statistics on the fly?
> Regards,
> Chris
> BioTeam.net


xmlsysd is a daemon that runs on each client.  It parses /proc and runs
certain system commands on demand, extracts the requested data, wraps
it in XML tags that are fairly self-explanatory and trivial to parse
with any of a number of XML toolsets and libraries in a variety of
scripting and compiled languages, and sends it back over a standard
point-to-point TCP connection (no multicast) to the connecting host.  It
is throttleable -- it returns only the information you ask it to
monitor, in fairly coarse-grained blocks.  For example, it tends to
deliver all the information from a given /proc file at once, because the
primary overhead is rewinding and reading the file at all, not parsing
any given line once it is read, and the cost of sending a 300-byte
packet is nearly identical to that of sending a 750-byte one.
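
As a rough illustration of the point-to-point pattern described above,
here is a minimal Python sketch of a client that opens a plain TCP
connection, sends one command line, and reads the reply.  The command
string, the port number, and the closing tag used as an end-of-reply
marker are all assumptions made for illustration; they are not taken
from the xmlsysd documentation, so check them by telnetting to a real
daemon first.

```python
import socket


def read_reply(sock, end_marker, bufsize=4096):
    """Read from sock until end_marker appears or the peer closes."""
    buf = b""
    while end_marker not in buf:
        data = sock.recv(bufsize)
        if not data:          # peer closed the connection
            break
        buf += data
    return buf.decode("utf-8", errors="replace")


def query_daemon(host, port, command, end_marker, timeout=5.0):
    """Send one command over TCP (no multicast) and return the reply text.

    `command` and `end_marker` are hypothetical placeholders: the real
    control commands and reply framing must come from the daemon itself.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii") + b"\n")
        return read_reply(sock, end_marker)
```

A call might look like query_daemon("node01", PORT, "send", b"</reply>"),
where PORT, "send", and "</reply>" are stand-ins for whatever the
installed daemon actually uses.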

wulfstat is a provided tty/ncurses monitoring program that reads in an
XML-based cluster descriptor/config file (it is very easy to define a
cluster using what amount to scanf/printf wildcards and numerical
ranges), connects to xmlsysd on all hosts specified in the config file,
and builds a display that updates every five seconds or so of the entire
cluster in a tty window that can be scrolled or paged.  The default
display currently presents a vmstat-like set of data, but one can also
monitor only the 1/5/15 minute load averages, network stats, memory
stats, cpu and time info (including duty cycle), and userspace running
jobs (which can additionally be masked in the config file with -- you
guessed it -- an XML-based set of tags).
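
The idea of describing a cluster with a printf-style wildcard plus a
numerical range can be sketched in a few lines of Python.  This is only
an illustration of the concept; the function name and the "node%02d"
pattern are made up here and say nothing about the actual syntax of the
wulfstat config file.

```python
def expand_hosts(fmt, first, last):
    """Expand a printf-style pattern over the inclusive range first..last."""
    return [fmt % i for i in range(first, last + 1)]


# Hypothetical naming scheme for a 64-node cluster: node01 ... node64
hosts = expand_hosts("node%02d", 1, 64)
```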

Wulfstat doesn't generate graphs of statistics on the fly, but
because xmlsysd is trivial to connect to and control (you can telnet to
the port, type in its control commands, and see exactly what it returns
and how to control, throttle, and configure it online) and because it
returns XML-wrapped data, it would be easy, I think, to build something
on top of it that generates graphs, and it will definitely scale to 64
nodes.  In fact I think one could probably write a simple perl or python
script that connects to the cluster's xmlsysds, polls the cluster every
(say) minute, extracts the quantities you wish to monitor, and writes
them out on a per-node basis to a file where any graphing utility you
like could plot them.  There are also almost certainly web/php widgets
for plotting -- I just haven't had the need for them and so haven't
written this myself (my nodes tend to go to load 1 per CPU and stay
there for a month or so, which is pretty boring -- the "interesting"
exceptions are easy to spot in real time in wulfstat).
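
A minimal Python sketch of such a polling script follows.  It assumes
the caller supplies a fetch(host) function that returns one XML reply as
text, and it assumes a reply layout such as
<xmlsysd><loadavg><load1>...</load1></loadavg></xmlsysd>.  That layout
and the element names are hypothetical, chosen only to make the example
concrete; substitute whatever the daemon really returns.

```python
import os
import time
import xml.etree.ElementTree as ET


def extract_values(xml_text, paths):
    """Return {path: float} for each ElementTree path found in the reply."""
    root = ET.fromstring(xml_text)
    values = {}
    for path in paths:
        node = root.find(path)
        if node is not None and node.text:
            values[path] = float(node.text)
    return values


def append_sample(outdir, host, timestamp, values, paths):
    """Append one line to <outdir>/<host>.dat: time, then a column per path."""
    cols = [str(values.get(p, "nan")) for p in paths]
    with open(os.path.join(outdir, host + ".dat"), "a") as fh:
        fh.write("%d %s\n" % (timestamp, " ".join(cols)))


def poll(hosts, fetch, paths, outdir, interval=60, rounds=None):
    """Poll every host each `interval` seconds; rounds=None means forever."""
    n = 0
    while rounds is None or n < rounds:
        now = int(time.time())
        for host in hosts:
            try:
                values = extract_values(fetch(host), paths)
            except (OSError, ET.ParseError):
                continue  # skip a node that is down or sent a bad reply
            append_sample(outdir, host, now, values, paths)
        n += 1
        if rounds is None or n < rounds:
            time.sleep(interval)
```

Each per-node file is plain whitespace-separated columns, so gnuplot or
any similar tool can plot it directly, and summing a column across the
files gives the aggregate figure for the whole cluster.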

If you go this route and implement the fully GPL (v2b, where the b means
you owe me a beer if and when we ever meet if you use it:-) toolset,
please consider sending me any cool tools you develop, also under the
GPL, that extend the usefulness of the basic xmlsysd, and I'll wrap them
up in the package for others.  In fact, GPL v2b them and I'll have to
buy YOU a beer back if/when we ever meet in person with a bar handy...;-)


P.S. - I don't know exactly how many sites are using xmlsysd/wulfstat,
but there are a few -- the tools have been around for a while and I do
actively maintain them and have been known to add displays to wulfstat
on request just to make a given user happy.  One day I'll have LOTS of
time on my hands I'm sure and will even port wulfstat to a gtk "real"
GUI form, but the nice thing about a tty based display is that it works
on a basic tty console if no X is running at all (which may well happen)
and everybody has a vast range of xterm choices under X, so it is
definitely the common denominator of monitor interfaces.

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu

