Environment monitoring

Donald Becker becker at scyld.com
Wed Oct 1 12:36:26 EDT 2003

On Wed, 1 Oct 2003, Rocky McGaugh wrote:

> On Wed, 1 Oct 2003, Robert G. Brown wrote:
> > Alas, if only somebody would give the lm_sensors folks a copy of a good
> > book on XML for christmas, and they decided to take the monumental step
> > then we could ALL reap the fruits of their labor without needing a copy
> > of the lm78 version 1.22a API manual and having to write an application
> > that supports each of the sensors THROUGH THEIR INTERFACE one at a
> > time...;-)
> We have that. lm_sensors+cron+gmond.

I think you missed RGB's point.  The lm_sensors implementation sucks.
Sure, any one specific implementation can be justified.  But having each
implementation use a different output and calibration shows that this
is not an architecture, just a collection of hacks.

The usual reply at this point is "just update the user-level script for
the new motherboard type".  Yup... and you should probably update the
constants in your programs' delay loops at the same time.

With lm_sensors you can get a one-off hack working, but cannot implement
a general case.  Compare this to IPMI, which presents the same
information.  IPMI has a crufty design and ugly implementations, but it
is an architected system.  With care you can implement and deploy code
that works on a broad range of current and future machines.

While I'm on the soapbox, gmond deserves its own mini-butane-torch.
I implemented the translator from Beostat (our status/statistics
subsystem) to gmond (per-machine information for Ganglia), so I have a
pretty good side-by-side comparison.

First, how did they choose what statistics to present?
Apparently just because the numbers were there.

What is the point of using an XML DTD if it is just used to
package undefined data types?  A wrapper around a wrapper...

Example metric lines:
<METRIC NAME="load_fifteen" VAL="1.41" TYPE="float" UNITS="" TN="246"
 TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
<METRIC NAME="proc_total" VAL="77" TYPE="uint32" UNITS="" TN="154"
 TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
Not only are these metric types not enumerated, they are made more
confusing by abbreviations and no definition.
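
To make the complaint concrete, here is a minimal sketch of what a consumer
of those two lines has to do. It is not gmond or Ganglia code. The <HOST>
wrapper is only there to give the parser a single root element, and the
CONVERTERS table and parse_metrics() name are hypothetical. The point is
that the reader must guess a type table from the TYPE strings it happens to
see, and nothing in the data says what a name means or what its units are.

```python
import xml.etree.ElementTree as ET

# The two metric lines quoted above, wrapped in a <HOST> element
# (an assumption, added only so the text parses as one document).
SAMPLE = """<HOST>
<METRIC NAME="load_fifteen" VAL="1.41" TYPE="float" UNITS="" TN="246"
 TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
<METRIC NAME="proc_total" VAL="77" TYPE="uint32" UNITS="" TN="154"
 TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
</HOST>"""

# Hypothetical type table: covers only the two TYPE strings seen above.
# Anything else falls back to a plain string.
CONVERTERS = {"float": float, "uint32": int}


def parse_metrics(xml_text):
    """Return {name: (converted value, units string)} for each METRIC."""
    metrics = {}
    for elem in ET.fromstring(xml_text).iter("METRIC"):
        convert = CONVERTERS.get(elem.get("TYPE"), str)
        metrics[elem.get("NAME")] = (convert(elem.get("VAL")), elem.get("UNITS"))
    return metrics


if __name__ == "__main__":
    for name, (value, units) in parse_metrics(SAMPLE).items():
        # UNITS is empty for both metrics, so the meaning has to come
        # from somewhere outside the data.
        print(name, value, repr(units))
```

Both metrics come back with an empty units string, so the XML wrapper has
carried a number and a name across the wire and nothing more.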

To tie both together:  What is "proc_total"?
Number of processors?  Number of processes?  Does it count system
daemons?  It seems to be the useless number "ps x | wc", rather than
the number of end user, application processes.

Many statistics are only usable when used/presented as a set.  Why split
the numbers into multiple elements?  It just multiplies the size and
parsing load.
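
A rough sketch of the size argument, using the load averages as the set.
Everything here is illustrative: the grouped "load" element is a
hypothetical format, not something gmond emits, the load_one and load_five
values are made-up placeholders, and the common attributes are simply
copied from the load_fifteen line quoted earlier.

```python
import xml.etree.ElementTree as ET

# Attributes repeated on every element, copied from the quoted example.
COMMON = {"TYPE": "float", "UNITS": "", "TN": "246", "TMAX": "950",
          "DMAX": "0", "SLOPE": "both", "SOURCE": "gmond"}

# Only the load_fifteen value comes from the quoted line; the other two
# values are placeholders for illustration.
LOADS = {"load_one": "1.12", "load_five": "1.30", "load_fifteen": "1.41"}


def split_form(loads):
    """One METRIC element per number, as in the quoted output."""
    root = ET.Element("HOST")
    for name, value in loads.items():
        ET.SubElement(root, "METRIC", NAME=name, VAL=value, **COMMON)
    return ET.tostring(root, encoding="unicode")


def grouped_form(loads):
    """Hypothetical alternative: the whole set in a single element."""
    root = ET.Element("HOST")
    ET.SubElement(root, "METRIC", NAME="load",
                  VAL=" ".join(loads.values()), **COMMON)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(len(split_form(LOADS)), "bytes as three elements")
    print(len(grouped_form(LOADS)), "bytes as one element")
```

The split form repeats every common attribute once per number, and the
reader has to parse three elements and reassemble the set before it can
use any of them.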

Background: Beostat is our status/statistics interface that we published
3+ years ago.  It exports interfaces at multiple levels:
    network protocol
    shared memory table
       only for very performance sensitive programs, such as schedulers
    dynamic library
       the preferred interface for programs
    command output
Thus Beostat is an infrastructure subsystem, rather than a single-purpose
stack of programs.

Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
914 Bay Ridge Road, Suite 220		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993

Beowulf mailing list, Beowulf at beowulf.org