data storage location

hanzl at
Fri Sep 12 12:43:27 EDT 2003

> The alternative approach is to keep copies of the data on local disk on
> each node. This gives you good IO rates, but you then have a substantial
> data management problem; how do you copy 100GB to each node in your
> cluster in a sensible amount of time, and how do you update the data and
> make sure it is kept consistent?
> ...
> If your dataset is larger than the amount of local disk on your nodes, you
> then have to partition your data up, and integrate that with your queuing
> system, so that jobs which need a certain bit of the data end up on a node
> which actually holds a copy.

This is exactly what we do. But moving the right data to the right
place while also doing good job scheduling is not easy. Ideally we
would like to automate it via huge caches on local disks:

 - central server has some 400GB of read-only data
 - nodes cache it on their hard disks as needed
 - queueing system prefers some regular patterns in job/node assignment
   to make cache hits likely
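The scheduling idea in the last point can be sketched roughly as follows. This is only an illustrative toy, not a real queueing system; the job/partition/node names and the `assign_jobs` helper are hypothetical, and a production scheduler would also weigh memory, CPU, and fairness:

```python
# Toy sketch of cache-affinity scheduling: prefer nodes whose local
# cache already holds the data partition a job needs, falling back to
# the least-loaded node on a miss.

def assign_jobs(jobs, node_caches):
    """jobs: list of (job_id, partition) pairs.
    node_caches: dict mapping node name -> set of cached partitions.
    Returns a dict mapping job_id -> node name."""
    load = {node: 0 for node in node_caches}
    assignment = {}
    for job_id, part in jobs:
        # Nodes that would give a cache hit for this partition.
        hits = [n for n, cached in node_caches.items() if part in cached]
        candidates = hits if hits else list(node_caches)
        # Break ties by current load so work stays spread out.
        node = min(candidates, key=lambda n: load[n])
        assignment[job_id] = node
        load[node] += 1
        # After running, the node holds the partition in its cache.
        node_caches[node].add(part)
    return assignment
```

With two nodes caching partitions A and B respectively, repeated jobs on A keep landing on the node that already holds A, which is exactly the "regular pattern" that makes cache hits likely.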

Cache-like behavior would save a lot of manual work, but unfortunately
I am not aware of any working solution for Linux. I want something
like cachefs (nonexistent for Linux), the caching ability of AFS/Coda
(too cumbersome for a cluster), or the theoretical features of
InterMezzo (still, and maybe forever, unfinished).
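For read-only data the cache-like behavior we want can even be faked in user space. Here is a minimal read-through sketch, assuming the central dataset is visible via NFS; the paths and the `cached_path` helper are hypothetical, and since the data never changes there is no invalidation problem:

```python
import shutil
from pathlib import Path

# Assumed layout: /mnt/server is the NFS export of the central
# read-only dataset, /scratch/cache is the node's local-disk cache.
SERVER_ROOT = Path("/mnt/server")
CACHE_ROOT = Path("/scratch/cache")

def cached_path(relpath, server_root=SERVER_ROOT, cache_root=CACHE_ROOT):
    """Return a local path for `relpath`, copying the file from the
    server on a cache miss. Safe only because the data is read-only."""
    local = cache_root / relpath
    if not local.exists():
        local.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(server_root / relpath, local)
    return local
```

A job would then open `cached_path("corpus/part-042")` instead of the server path: the first access on a node pays the copy, later jobs on the same node hit the local disk. A real cachefs would add eviction when the local disk fills, which this sketch omits.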

At the moment we are working on a small kernel hack to solve this,
unfortunately repeating for the n-th time what many others once did
and never maintained.

Maybe genome research will generate more need for this data access
pattern, and more chance for a re-usable software solution?


Vaclav Hanzl

Beowulf mailing list, Beowulf at