[Beowulf] filesystem metadata mining tools
rpnabar at gmail.com
Sat Sep 12 11:10:43 EDT 2009
As the total number of files on our server was exploding (~2.5 million
files / 1 terabyte), I wrote a simple shell script that uses find to tell
me how many files each user has. So far so good.
But I want to drill down more:
* Are there lots of duplicate files? I suspect so. Stuff like job submission
scripts that users copy rather than link, etc. (fdupes seems puny for
a job of this scale.)
* What is the most common file (or filename)?
* A distribution of filetypes (executables; NetCDF; movies; text), and
* A distribution of file age and prevalence (to know how much of this
material is archivable), and the same for frequency of access, i.e. the last
access times.
* A file size versus number plot, i.e. is 20% of the space occupied by 80% of
the files?
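For the first few of these, plain GNU find pipelines can get surprisingly far. A minimal sketch, assuming GNU find, coreutils, and awk (the power-of-two bucketing at the end is just one choice of histogram):

```shell
#!/bin/sh
# Sketches for the queries above -- assumes GNU find, coreutils, awk.
ROOT="${1:-.}"

# Duplicates: hash everything, then print groups whose first 32 chars
# (the md5 digest) repeat. On millions of files you would group by
# size first and hash only the size collisions.
find "$ROOT" -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate

# Most common filename (basename only):
find "$ROOT" -type f -printf '%f\n' | sort | uniq -c | sort -rn | head

# Age distribution: file count per modification year:
find "$ROOT" -type f -printf '%TY\n' | sort | uniq -c

# Size-vs-count: histogram over power-of-two size buckets (bytes):
find "$ROOT" -type f -printf '%s\n' \
  | awk '{ b = 1; while (b < $1) b *= 2; c[b]++ }
         END { for (k in c) print k, c[k] }' \
  | sort -n
```

Each of these is a full filesystem walk, so on a big tree you would run them off-hours or against a metadata snapshot rather than the live tree.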
I've used cushion plots in the past (SequoiaView; pydirstat), but those
seem more desktop-oriented than suitable for a job like this.
Essentially I want to data-mine my file usage to strategize. Are there any
tools for this? Writing a new find invocation each time seems laborious.
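One way around rerunning find for every question is to snapshot the metadata once into a flat file and query that instead. A rough sketch, assuming GNU find; the field layout and the per-user query are my own choices, not an established tool:

```shell
#!/bin/sh
# Snapshot: user, size, mtime epoch, atime epoch, path -- one row per file.
# Run once (e.g. nightly); every later question becomes an awk query
# against the snapshot instead of a fresh filesystem walk.
find "${1:-.}" -type f -printf '%u\t%s\t%T@\t%A@\t%p\n' > fs-metadata.tsv

# Example query against the snapshot: total bytes per user, largest first.
awk -F'\t' '{ bytes[$1] += $2 }
            END { for (u in bytes) print bytes[u], u }' fs-metadata.tsv \
  | sort -rn
```

The same snapshot answers the age, access-frequency, and size-distribution questions by swapping in a different awk body, so the expensive walk happens only once.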
I suspect forensics might also help identify anomalies in usage across
users that might be indicative of other maladies, e.g. a user whose
runaway job wrote a 500 GB file.
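That kind of outlier is cheap to scan for directly. A sketch, again assuming GNU find; the 100G threshold is only an example cutoff:

```shell
#!/bin/sh
# Ten largest files with their owners -- runaway-job output stands out here.
find "${1:-.}" -type f -printf '%s\t%u\t%p\n' | sort -rn | head -10

# Or flag only files above a hard threshold (GNU find size suffixes):
find "${1:-.}" -type f -size +100G -printf '%s\t%u\t%p\n'
```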
Essentially are there any "filesystem metadata mining tools"?
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing