[Beowulf] filesystem metadata mining tools

Lux, Jim (337C) james.p.lux at jpl.nasa.gov
Sat Sep 12 19:02:10 EDT 2009

On 9/12/09 8:10 AM, "Rahul Nabar" <rpnabar at gmail.com> wrote:

> As the total number of files on our server was exploding (~2.5 million
> / 1 Terabyte), I wrote a simple shell script that used find to tell me
> which users have how many. So far so good.
> But I want to drill down more:
> * Are there lots of duplicate files? I suspect so. Stuff like job submission
> scripts which users copy rather than link etc. (fdupes seems puny for
> a job of this scale.)
> * What is the most common file (or filename)?
> * A distribution of filetypes (executables; netcdf; movies; text) and
> prevalence.
> * A distribution of file age and prevalence (to know how much of this
> material is archivable). Same for frequency of access; i.e. maybe the last
> access stamp.
> * A file size versus number plot. i.e. Is 20% of space occupied by 80% of
> files? etc.
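
Most of the questions quoted above can be answered from the stat data
gathered in a single walk of the tree. What follows is only a minimal
sketch, not a tool anyone in this thread has described: the function
names (summarize, space_share_of_largest), the three age buckets, and
the use of the filename extension as a stand-in for "filetype" are all
assumptions made for illustration. It reports the share of bytes held by
the largest 20% of files, which is one way to read the size-versus-number
question.

```python
import os
import stat
import time
from collections import Counter


def summarize(root, now=None):
    """Walk root once; return (count per extension, count per age bucket, sizes)."""
    now = time.time() if now is None else now
    by_ext = Counter()
    by_age = Counter()
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # only regular files are counted
            # Assumption: the extension approximates the file type.
            ext = os.path.splitext(name)[1].lower() or "(none)"
            by_ext[ext] += 1
            # Assumption: age is measured from the modification time.
            age_days = (now - st.st_mtime) / 86400
            if age_days < 30:
                by_age["<30d"] += 1
            elif age_days < 365:
                by_age["<1y"] += 1
            else:
                by_age[">=1y"] += 1
            sizes.append(st.st_size)
    return by_ext, by_age, sizes


def space_share_of_largest(sizes, fraction=0.2):
    """Fraction of all bytes held by the largest `fraction` of files."""
    if not sizes:
        return 0.0
    ordered = sorted(sizes, reverse=True)
    top = max(1, int(len(ordered) * fraction))
    total = sum(ordered)
    return sum(ordered[:top]) / total if total else 0.0
```

Swapping st.st_mtime for st.st_atime would give the last-access view
instead, with the caveat that access times are not updated on filesystems
mounted noatime.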

Another useful application for such a tool would be to get better KLOC
counts of source code trees.  I find that our trees have lots of duplication
among branches (e.g., everyone has a "test.c" for unit tests in with their
modules, and all of them are pretty similar).
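
One way such a count might work is to hash each source file and add up
lines only for contents not seen before. This is a rough sketch under
stated assumptions: unique_line_count and the default suffixes are
hypothetical names chosen for illustration, lines are counted as raw
newlines (blanks and comments included), and only byte-identical copies
are collapsed. Files that are merely "pretty similar" would still be
counted separately.

```python
import hashlib
import os


def unique_line_count(root, suffixes=(".c", ".h")):
    """Return (lines in distinct files, number of byte-identical copies skipped)."""
    seen = set()          # content digests already counted
    total_lines = 0
    duplicate_files = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffixes):
                continue  # not a source file under the assumed suffixes
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    data = f.read()
            except OSError:
                continue  # unreadable file; skip it
            digest = hashlib.sha256(data).hexdigest()
            if digest in seen:
                duplicate_files += 1  # exact copy of a file already counted
                continue
            seen.add(digest)
            total_lines += data.count(b"\n")
    return total_lines, duplicate_files
```

Calling unique_line_count("/path/to/tree") and dividing the first value
by 1000 gives a deduplicated KLOC figure; the second value shows how many
exact copies were left out.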

Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
