[OT] statistical calculations

Andrew Piskorski atp at piskorski.com
Mon Nov 24 09:29:18 EST 2003

> From: "Robert G. Brown" <rgb at phy.duke.edu>
> To: Martin WHEELER <mwheeler at startext.co.uk>

> On Sun, 23 Nov 2003, Martin WHEELER wrote:
> > I have to process a group of several thousand acquired datasets, each
> > containing well over one hundred numerical items; and eventually, I'm
> > going to have to work with a statistician to pull some meaningful
> > figures out of it all.
> > In other words, the data have to be massaged in some pretty fancy ways.
> > 
> > For various reasons outwith my control this is being done principally
> > via a spreadsheet (wouldn't have been an obvious choice for me, but hey,
> > I only know about words, not numbers).  Can anyone on this list used to
> > doing this stuff point me towards a GPLed spreadsheet with built-in
> > statistical functions?  or an add-in to gnumeric / OpenOffice etc.?
> > (I believe such exist.)  Or maybe a library of GPLed spreadsheet macros?
> > Please correct me if I'm barking up a wrong tree here.

> Ask on the GSL (Gnu Scientific Library) list.  There have been mentions
> on the list of people wrapping/encapsulating list functions in various
> ways, but I can't remember offhand if any of them were inside a
> spreadsheet per se.  It also depends to some extent on what you mean by
> "built in statistical functions" -- GSL has the basic functions but is
> not a package like R.  Which is the second thing you should probably
> look at on: www.r-project.org.  R is a full-service stats suite with a
> variety of interfaces including web -- hopefully somebody has wrapped it
> up into a spreadsheet of some sort.

Martin, R should definitely do whatever statistical stuff you want.
There is also an R plugin for the Gnumeric spreadsheet, and some stuff
to let MS Excel call R.  I've never tried either of those plugins, but
they might be good if you don't want to use R directly:


For general vendor data clean-up and conversion issues, well, that
depends.  :)  You didn't say enough for me to know whether you need to
worry about that or not, but most of the vendor data I've seen (not in
linguistics) has needed cleanup of some sort!

In my own line of work, for that sort of thing (which means for
financial/market data), I mostly write Tcl code to read and manipulate
the files, shove all the data into an RDBMS like Oracle or PostgreSQL,
then sometimes do additional processing in the database.  This works
well, but if you're not already using an RDBMS you probably do NOT
want to get into that just for this one application.
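
For what it's worth, the load-then-process-in-the-database pattern
looks roughly like the sketch below.  It is written in Python with the
standard-library sqlite3 module purely as a self-contained stand-in,
not the Tcl plus Oracle/PostgreSQL setup I actually use, and the
"ticks" table, its columns, and the load_and_summarize() name are all
invented for illustration:

```python
import sqlite3


def load_and_summarize(rows):
    """Load (symbol, price) rows into an in-memory table and return
    the per-symbol row count and average price, computed in SQL."""
    # ":memory:" keeps the example self-contained; a real setup would
    # connect to an existing database server instead (assumption).
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical table layout, made up for this sketch.
        conn.execute(
            "CREATE TABLE ticks (symbol TEXT NOT NULL, price REAL NOT NULL)"
        )
        conn.executemany(
            "INSERT INTO ticks (symbol, price) VALUES (?, ?)", rows
        )
        conn.commit()
        # The "additional processing in the database" step.
        cur = conn.execute(
            "SELECT symbol, COUNT(*), AVG(price) FROM ticks "
            "GROUP BY symbol ORDER BY symbol"
        )
        return cur.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("ABC", 10.0), ("ABC", 12.0), ("XYZ", 5.0)]
    for symbol, count, average in load_and_summarize(sample):
        print(symbol, count, average)
```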

Most likely, as long as your data all fits (or almost fits?) into RAM,
and you don't need the many-readers many-writers (concurrency,
atomicity, etc.) support that a real RDBMS provides, stuffing all your
data into R's built-in matrix or data frame types should be fine.
Depending on what the vendor files look like to begin with, you may
want to pre-process them a bit with a Tcl, Perl, Python, or whatever
script first to make them easier to get into R via R's read.table().
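
As a rough sketch of that kind of pre-processing (Python here only
because it is one of the options named above), something like the
following turns a messy delimited file into tab-separated text.  The
semicolon delimiter, the '#' comment lines, the set of missing-value
markers, and the function names are all assumptions made up for
illustration, since I don't know what your vendor files look like:

```python
import csv
import io

# Hypothetical markers a vendor might use for missing values (assumption).
MISSING = {"", "N/A", "n/a", "-", "NULL"}


def clean_vendor_lines(lines, delimiter=";"):
    """Yield cleaned rows (lists of strings) from raw vendor lines.

    Skips blank lines and '#' comment lines, strips whitespace around
    each field, and replaces missing-value markers with 'NA', which
    R's read.table() treats as missing by default.
    """
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        fields = [field.strip() for field in stripped.split(delimiter)]
        yield ["NA" if field in MISSING else field for field in fields]


def to_tsv(lines, delimiter=";"):
    """Return the cleaned rows as one tab-separated string."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(clean_vendor_lines(lines, delimiter))
    return out.getvalue()


if __name__ == "__main__":
    raw = [
        "# exported 2003-11-23",
        "id; length ; score",
        "1; 12.5 ; N/A",
        "",
        "2; 7 ; 0.93",
    ]
    print(to_tsv(raw), end="")
```

If the output were saved as, say, cleaned.tsv (a made-up name), it
could then be loaded in R with something like
read.table("cleaned.tsv", header=TRUE, sep="\t").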

Andrew Piskorski <atp at piskorski.com>
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
