[Beowulf] query: aggregate cluster performance monitoring without multicast
Robert G. Brown
rgb at phy.duke.edu
Fri Jan 9 11:57:41 EST 2004
On Fri, 9 Jan 2004, Lombard, David N wrote:
> I'd argue the web interface is the common denominator. Viewable via
> text browser, plus you get out of the OS wars...
For users, David, I agree, although we tend to be Linux-centric and
wulfstat is popular enough. procstatd had a Perl CGI web interface
written by a friend here at Duke and I'm sure it would be easy enough to
> Yes, I know that real sys admins *never* use a web interface...
It isn't never, but I did write this as a pretty high-end
administrative/monitoring interface that would be useful to sysadmins
running LANs and not just cluster people running clusters. I also
wrote/designed it so that it would be easy to write things like web
interfaces for it. I just haven't got the time (or TOO much motivation)
to do so myself, as I don't have users clamoring for it.
Somebody (I can't offhand recall who) was looking at it to possibly wrap
it up to work within ganglia -- one way to get a web interface. From
what I've seen of the ganglia interface, though, wulfstat is MUCH more
practically useful for most cluster users or administrators as it
stands, xterm or not. Icons and nifty bar charts are pretty, but take
up a lot of room. A grid of regularly updated number cells is ugly, but
oh, so informative and compact.
The current default display is something like (sorry for the ncurses
cut/paste ugliness, but hey, it does cut and paste and the column/boxes
do actually align:-):
name   |st|Load 1,5,15|eth0 rx|eth0 tx| si/so | pi/po |ctxt|intr|users
ganesh |up|0.0,0.0,0.0|   8415|    882|  0/0  |  0/2  | 156| 160|  x
c00    |up|2.0,2.0,2.0|    330|    676|  0/0  |  0/2  |   9| 105|  x
c01    |up|2.0,2.0,2.0|    330|    675|  0/0  |  0/2  |   9| 105|  x
The columns are status, load averages, network traffic (ganesh's is
relatively large because it is receiving all of the xmlsysd data from
the many monitored nodes), swapin/swapout, pagein/pageout, context
switches, and interrupts. The users column doesn't do anything yet but
is intended to somehow indicate what and whose jobs are running (already
available in detail on another display). At a glance and without a
mouse or even a keyboard used for
paging, one has a very good chance of identifying cluster problems on as
many hosts as will fit in a tty window on as many tty windows as one can
fit on a screen -- hundreds, at least. Usually problems show up as
downed nodes, anomalous load averages, unusual network traffic, lots of
swapping or paging (indicating memory problems) or lots of ctxt/intr
activity (indicating a possible thrashing kernel task, I/O where there
shouldn't be, etc.).
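
A minimal sketch, in Python, of the kind of check described above: parse
one row of the pasted display and flag the symptoms listed (downed node,
anomalous load, swapping or paging, heavy ctxt/intr activity). The field
order is taken from the paste and every column is assumed to be
populated; the names NodeRow, parse_row and find_problems and all of the
threshold values are hypothetical, and this is not how wulfstat itself
works.

```python
from dataclasses import dataclass


@dataclass
class NodeRow:
    name: str
    status: str
    load: tuple   # (1, 5, 15 minute load averages)
    rx: int       # eth0 receive counter as displayed
    tx: int       # eth0 transmit counter as displayed
    swap: tuple   # (si, so)
    page: tuple   # (pi, po)
    ctxt: int
    intr: int


def parse_row(line):
    """Split one '|'-separated display row into a NodeRow."""
    f = [cell.strip() for cell in line.split("|")]
    load = tuple(float(x) for x in f[2].split(","))
    swap = tuple(int(x) for x in f[5].split("/"))
    page = tuple(int(x) for x in f[6].split("/"))
    return NodeRow(f[0], f[1], load, int(f[3]), int(f[4]),
                   swap, page, int(f[7]), int(f[8]))


def find_problems(row, max_load=4.0, max_page=1000, max_activity=10000):
    """Return a list of symptom labels; thresholds are made-up examples."""
    if row.status != "up":
        return ["down"]
    found = []
    if row.load[0] > max_load:
        found.append("load")
    if any(row.swap):
        found.append("swapping")
    if sum(row.page) > max_page:
        found.append("paging")
    if row.ctxt > max_activity or row.intr > max_activity:
        found.append("ctxt/intr")
    return found


row = parse_row("c00 |up|2.0,2.0,2.0| 330 | 676 | 0/0 | 0/2 | 9| 105| x")
print(row.name, find_problems(row))
```

What counts as an "anomalous" load depends entirely on the cluster (a
load of 2.0 is normal on a busy dual-CPU node), so the thresholds would
have to be set per site.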
I do have a mental vision of an even more efficient "panel of lights"
display that would run under X as a GTK process, use "temperature"
metaphor colors (green to red) to encode the number values, and still
present the number values if the mouse pointer hovered over a light --
such a display could probably allow a really big cluster (1024 nodes
plus) to be monitored at a glance with detail readily available as
problems emerged. I just don't have so many nodes that wulfstat
doesn't work locally, and writing the app and debugging it is maybe a
week's work. With such a display and lights that were on the order of
10-20 pixels square, one could put a LOT of them onscreen in a neat little
I don't know if this would be possible in a web browser, at least on a
timescale you could live with. Binary can have portability issues, but
it is as fast as it gets.
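
A rough sketch, in Python, of the two calculations behind the "panel of
lights" idea: mapping a value onto a green-to-red color and working out
how much screen a square grid of lights would need. The function names
and the linear color ramp are assumptions made for illustration; no such
display exists in the tools discussed here.

```python
import math


def temperature_color(value, lo, hi):
    """Map value in [lo, hi] to an (r, g, b) tuple, green at lo, red at hi."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    t = (value - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))  # clamp out-of-range values
    return (round(255 * t), round(255 * (1 - t)), 0)


def grid_size(n_nodes, light_px):
    """Return (cols, rows, width_px, height_px) for a near-square grid."""
    cols = math.ceil(math.sqrt(n_nodes))
    rows = math.ceil(n_nodes / cols)
    return cols, rows, cols * light_px, rows * light_px


print(temperature_color(0.0, 0.0, 4.0))   # idle node
print(temperature_color(4.0, 0.0, 4.0))   # node at the top of the range
print(grid_size(1024, 16))
```

For 1024 nodes and 16-pixel lights this works out to a 32 x 32 grid, 512
pixels on a side, before any spacing between lights is added.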
> David N. Lombard
> My comments represent my opinions, not those of Intel Corporation.
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu