XML for formatting (Re: Environment monitoring)

Robert G. Brown rgb at phy.duke.edu
Fri Oct 17 09:29:47 EDT 2003

On 14 Oct 2003, Dean Johnson wrote:

> As someone who has done programming environment tools most of his
> reasonably long professional life, I must say you have hit the nail on
> the head. I have rooted through more than my share of shitty binary
> formats in my day, and I can honestly say that I go home happier as a
> result of dealing with an XML trace file in my current project. I was
> happily working away dealing with only XML, but then it happened. The
> demons of my past rose their ugly heads when I decided that it would be
> a good thing to get some ELF information outta some files. Being the
> industrious guy I am, I went and got ELF docs from Dave Anderson's
> stash. Did that help? Nope, not really, as it was mangled 64-bit focused
> ELF. Was it documented? Nope, not really. You could look at the elfdump
> code to see what that does, so in a backwards way, it was documented.
> The alternative was to ferret out the format by bugging enough compiler
> geeks until they gave up the secret handshake. The alternative that I
> eventually took was to go lay down until the desire to have the ELF
> information went away. ;-)

And yet Don's points are also very good ones, although I think that is
at least partly a matter of designer style.  XML isn't, after all, a
markup language -- it is a markup language specification.  As an
interface designer, you can implement tags that are reasonably human
readable and well-separated in function or not.  His observation that
what one would REALLY like is a self-documenting interface, or an
interface with its data dictionary included as a header, is very
apropos.  I also >>think<< that he is correct (if I understood his final
point correctly) in saying that someone could sit down and write an
XML-compliant "DML" (data markup language) with straightforward and
consistent rules for encapsulating data streams.

Since those rules would presumably be laid down in the design phase, and
since a wise implementation of them would release a link-level library
with prebuilt functions for creating a new data file and its embedded
data dictionary, writing data out to the file, opening the file, and
reading/parsing data in from the file, it would actually reduce the
amount of wheel reinventing (and very tedious coding!) that has to be
done now while creating/enforcing a fairly rigorous structural
organization on the data itself.

One has to be very careful not to assume that XML will necessarily make
a data file tremendously longer than it likely is now.  For short files
nobody (for the most part) cares, where by short I mean short enough
that file latencies dominate the read time -- using long very
descriptive tags is easy in configuration files.  For longer data files
(which humans cannot in general "read" anyway unless they have a year or
so to spare) there is nothing to prevent XMLish of the following sort of
very general structure:

<?xml version="1.0"?>
<description>
This is part of the production data of Joe's Orchards.  Eat Fruit from
</description>
  <line id="0">
   <field id="0"><name>apples</name><fmt>%-10.6f</fmt><units>bushels</units></field>
   <field id="1"><sep>|</sep></field>
   <field id="2"><name>oranges</name><fmt>%-12.5e</fmt><units>crates</units></field>
   <field id="3"><sep>|</sep></field>
   <field id="4"><name>price</name><fmt>%-10.2f</fmt><units>dollars</units></field>
  </line>
<data>
13.400000  |77.00000e+2 |450.00
589.200000 |102.00000e+8|6667.00
</data>

The stuff between the <data> tags could even be binary.  Note that the
data itself isn't individually wrapped and tagged, so this might be a
form of XML heresy, but who cares?  For a configuration file or a
small/short data file containing numbers that humans might want to
browse/read without an intermediary software layer, I would say this is
a bad thing, but for a 100 MB data file (a few million lines of data)
the overhead introduced by adding the XML encapsulation and dictionary
is utterly ignorable and the mindless repetition of tags in the
datastream itself pointless.

Note well that this encapsulation is STILL nearly perfectly human
readable, STILL easily machine parseable, and will still be both in
twenty years after Joe's Orchard has been cut down and turned into
firewood (or would be, if Joe had bothered to tell us a bit more about
the database in question in the description).  The data can even be
"validated", if the associated library has appropriate functions for
doing so (which are more or less the data reading functions anyway, with
error management).  I should note that the philosophy above might be
closer to that of e.g. TeX/LaTeX than XML/SGML/MML (as discussed below).
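
As a rough illustration of what the "data reading functions with error
management" might look like, here is a minimal Python sketch.  It uses
the standard xml.etree.ElementTree parser, and it assumes a well-formed
variant of the orchard file: the <dml> root element, the shortened
description, and the function name read_dml are all inventions of this
sketch, not part of any existing library.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed variant of the orchard file.  The <dml> root
# and the exact nesting are assumptions made for this sketch.
DOC = """<?xml version="1.0"?>
<dml>
  <description>Production data of Joe's Orchards.</description>
  <line id="0">
    <field id="0"><name>apples</name><fmt>%-10.6f</fmt><units>bushels</units></field>
    <field id="1"><sep>|</sep></field>
    <field id="2"><name>oranges</name><fmt>%-12.5e</fmt><units>crates</units></field>
    <field id="3"><sep>|</sep></field>
    <field id="4"><name>price</name><fmt>%-10.2f</fmt><units>dollars</units></field>
  </line>
  <data>
13.400000  |77.00000e+2 |450.00
589.200000 |102.00000e+8|6667.00
  </data>
</dml>
"""


def read_dml(text):
    """Read the embedded dictionary, then use it to parse and check the data."""
    root = ET.fromstring(text)
    fields = root.find("line").findall("field")
    # Fields carrying a <name> are data columns; fields carrying a <sep>
    # only describe the separator between columns.
    names = [f.findtext("name") for f in fields if f.find("name") is not None]
    seps = {f.findtext("sep") for f in fields if f.find("sep") is not None}
    if len(seps) != 1:
        raise ValueError("this sketch handles exactly one separator")
    sep = seps.pop()

    rows = []
    lines = root.findtext("data").strip().splitlines()
    for lineno, raw in enumerate(lines, 1):
        parts = raw.split(sep)
        if len(parts) != len(names):
            raise ValueError(
                f"data line {lineno}: expected {len(names)} fields, got {len(parts)}"
            )
        # float() raises ValueError on anything that is not a number, which
        # is all the "validation" this sketch attempts.
        rows.append({name: float(part) for name, part in zip(names, parts)})
    return names, rows


names, rows = read_dml(DOC)
```

Note that the reader never looks at the <fmt> entries; those would
matter to the writing side, which this sketch leaves out.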

I've already done stuff somewhat LIKE this (without the formal data
dictionary, because I haven't taken the time to write a general purpose
tool for my own specific applications, which is likely a mistake in the
long run but in the long run, after all, I'll be dead:-) in wulfstat.
The .wulfhosts xml permits a cluster to be entered "all at once" using a
format like:


which is used to generate the hostname strings required to open
connections to hosts e.g. g01, g02, ... g15.  Obviously the same trick
could be used to feed scanf, or to feed a regex parser.
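
A minimal Python sketch of the idea, for illustration only.  The tag
names used here (hostrange, hostfmt, imin, imax) are assumptions made
for the sketch and should not be read as the actual .wulfhosts schema;
the point is just that a printf-style format plus an index range is
enough to generate g01 through g15.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the tag names are assumptions for this sketch.
FRAGMENT = """
<hostrange>
  <hostfmt>g%02d</hostfmt>
  <imin>1</imin>
  <imax>15</imax>
</hostrange>
"""


def expand_hosts(xml_text):
    """Expand a printf-style host format over an inclusive index range."""
    node = ET.fromstring(xml_text.strip())
    fmt = node.findtext("hostfmt")
    lo = int(node.findtext("imin"))
    hi = int(node.findtext("imax"))
    # Python's % operator understands the same %02d conversion as printf.
    return [fmt % i for i in range(lo, hi + 1)]


hosts = expand_hosts(FRAGMENT)
```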

The biggest problem I have with XML as a data description/configuration
file base isn't really details like these, as I think they are all
design decisions and can be done poorly or done well.  It is that on the
parsing end, libxml2 DOES all of the above, more or less.  It generates
on the fly a linked list that mirrors the XML source, and then provides
tools and a consistent framework of rules for walking the list to find
your data.  How else could it do it?  The one parser has to read
arbitrary markup, and it cannot know what the markup is until it opens the
file, and it opens/reads the file in one pass, so all it can do is mosey
along and generate recursive structs and link them.

However, that is NOT how one wants to access the data in code that wants
to be efficient.  Walking a linked list to find a float data entry that has a
tag name that matches "a" and an index attribute that matches "32912" is
VERY costly compared to accessing a[32912].  At this point, the only
solution I've found is to know what the data encapsulation is (easy,
since I created it:-), create my own variables and structs to hold it
for actual reference in code, open and read in the xml data, and then
walk the list with e.g. xpath and extract the data from the list and
repack it into my variables and structs.
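
To make the cost difference concrete, here is a small Python sketch.  It
uses the standard xml.etree.ElementTree module as a stand-in for libxml2
(so the calls are not libxml2's API), and the document layout, with one
<a index="..."> element per value, is a hypothetical one built just for
the example.

```python
import xml.etree.ElementTree as ET

# Build a small hypothetical document: one <a index="i">value</a> per entry.
N = 1000
root = ET.Element("data")
for i in range(N):
    ET.SubElement(root, "a", index=str(i)).text = repr(i * 0.5)


def lookup_by_walk(root, index):
    """Search the tree for the entry with a matching index attribute.

    Every call scans the children until it finds a match, so the cost
    grows with the number of entries.
    """
    node = root.find(f"./a[@index='{index}']")
    return float(node.text)


def repack(root, n):
    """Walk the tree once and copy the values into a plain list."""
    a = [0.0] * n
    for node in root.iter("a"):
        a[int(node.get("index"))] = float(node.text)
    return a


a = repack(root, N)

# After the one-time repacking, access is a plain indexed read.
same = lookup_by_walk(root, 912) == a[912]
```

The repacking walk costs about as much as a single worst-case search,
but it is paid once instead of on every access.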

This latter step really sucks.  It is very, very tedious, although
perfectly straightforward, to write the parsing/repacking code (so much
so that the libxml guy "apologizes" for the tedium of the parsing code
in the xml.org documentation:-).  It is this latter step that could be
really streamlined by the use of an xmlified data dictionary or even (in
the extreme C case) encapsulating the actual header file with the
associated variable struct definitions.
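
A hedged sketch of how an embedded data dictionary could drive the
repacking generically, so that the per-application parsing code shrinks
to a single call.  Everything about the layout here (the <config>,
<dictionary>, <var>, and <values> tags, the type names, and the load
function) is hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: a dictionary of typed variables, then their values.
DOC = """
<config>
  <dictionary>
    <var name="nhosts" type="int"/>
    <var name="delay" type="float"/>
    <var name="label" type="str"/>
  </dictionary>
  <values>
    <nhosts>16</nhosts>
    <delay>2.5</delay>
    <label>ganesh</label>
  </values>
</config>
"""

# Map the type names used in the dictionary onto Python converters.
CASTS = {"int": int, "float": float, "str": str}


def load(text):
    """Repack every variable named in the dictionary into a typed dict."""
    root = ET.fromstring(text.strip())
    out = {}
    for var in root.find("dictionary"):
        name = var.get("name")
        cast = CASTS[var.get("type")]
        raw = root.findtext(f"values/{name}")
        if raw is None:
            raise KeyError(f"no value found for {name}")
        out[name] = cast(raw)
    return out


settings = load(DOC)
```

In C the analogous move would be generating the struct definitions and
the repacking code from the dictionary, which is a bigger job than this
sketch suggests.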

It is interesting and amusing to compare two different approaches to the
same problem in applications where the issue really is "markup" in a
sense.  I write lots of things using latex, because with latex one can
write equations in a straightforward ascii encoding like $1 =
\sin^2(\theta) + \cos^2(\theta)$.  This input is taken out of an ascii
stream by the tex parser, tokenized and translated into characters, and
converted into an actual equation layout according to the prescriptions
in a (the latex) style file plus any layered modifications I might
impose on top of it.

[Purists could argue about whether or not latex is a true markup language
-- tex/latex are TYPESETTING languages and not really intended to
support other functions (such as translating this equation into an
internal algebraic form in a computer algebra program such as macsyma or
maple).  However, even though it probably isn't, because every ENTITY
represented in the equation string isn't individually tagged wrt
function, it certainly functions like markup at a high level with
entries entered inside functional delimiters and presented in a
way/style that is associated with the delimiters "independent" of the
delimiters themselves.]

If one compares this to the same equation wrapped in MML (math markup
language, which I don't know well enough to be able to reproduce here)
it would likely occupy twenty or thirty lines of markup and be utterly
unreadable by humans.  At least "normal" humans.  Machines, however,
just love it, as one can write a parser that can BOTH display the
equation AND can create the internal objects that permit its
manipulation algebraically and/or numerically.  This would be difficult
to do with the latex, because who knows what all these components are?
Is \theta a constant, a label, a variable?  Are \sin and \cos variables
or functions, or is each a string of single-letter variables (do I mean
s*i*n*(theta), where all the objects are variables)?  The equation is
HUMAN readable and TYPESETTABLE without ambiguity, given a style file
and low level definitions that treat these elements as non-functional
symbols of a certain size and shape to be assembled according to fixed
rules, but it is far from adequately described for doing math with it.

For all that, one could easily write an XML compliant LML -- "latex
markup language" -- a perfectly straightforward translation of the
fundamental latex structures into XML form.  Some of these could be
utterly simple (aside from dealing with special character issues):

{\em emphasized text} ->  <em>emphasized text</em>
\begin{equation}a = b+c\end{equation} -> <equation>a = b+c</equation>

linuxdoc is very nearly this translation, actually, except that it
doesn't know how to handle equation content AFAIK.  This sort of
encapsulation is highly efficient for document creation/typesetting
within a specific domain, but less general purpose.
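
For what it is worth, the two translations shown above can be sketched
in a few lines of Python.  This is only a toy covering those two
constructs (no nested braces, no other environments); the function name
latex_to_lml is made up for the example, and the "special character
issues" are handled by XML-escaping &, < and > before any tags are added.

```python
import re
from xml.sax.saxutils import escape


def latex_to_lml(src):
    """Translate {\\em ...} and the equation environment into XML-style tags."""
    # Escape &, < and > first so that literal characters in the source
    # cannot be mistaken for markup once the tags are added.
    out = escape(src)
    # {\em text} -> <em>text</em>  (the text may not contain braces)
    out = re.sub(r"\{\\em\s+([^{}]*)\}", r"<em>\1</em>", out)
    # \begin{equation}...\end{equation} -> <equation>...</equation>
    out = re.sub(
        r"\\begin\{equation\}(.*?)\\end\{equation\}",
        r"<equation>\1</equation>",
        out,
        flags=re.S,
    )
    return out


emphasized = latex_to_lml(r"{\em emphasized text}")
equation = latex_to_lml(r"\begin{equation}a = b+c\end{equation}")
```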

The point is <beep>.... [the following text that isn't there was omitted
in the fond hope that my paypal account will swell, following which I
will make a trip to a purveyor of fine beverages.]


Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu

Beowulf mailing list, Beowulf at beowulf.org