data storage location
Donald Becker
becker at scyld.com
Sat Sep 13 07:56:38 EDT 2003
On Fri, 12 Sep 2003 hanzl at noel.feld.cvut.cz wrote:
> > The alternative approach is to keep copies of the data on local disk on
> > each node. This gives you good IO rates, but you then have a substantial
> > data management problem; how do you copy 100Gb to each node in your
> > cluster in a sensible amount of time, and how do you update the data and
> > make sure it is kept consistent?
One significant question is the access pattern and its requirements.
Is there a single application access pattern, or several?
Does the application read all or only a subset of the data?
Is the subset predictable? Externally by a scheduler?
Does the application step linearly through the data, or access it randomly?
If linear stepping, is the state space of the application small?
If the state space is small, as with a best-match search, the processing
time per byte of data tends to be small. We recommend a static split of
the data across machines, migrating the process rather than the data.
In the case of a single file read we can often do this without
modifying the application, or localize the changes to a few
lines around the read loop, as in the sketch below.
If large, e.g. building a tree in memory from the data, what is the
per-byte processing time? If it is still low, the same advice applies:
split the data statically across machines and migrate the process.
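
To make the read-loop change concrete, here is a minimal Python sketch of
the static split, assuming a single large file of fixed-size records that
is mirrored onto each node's local disk. The record size, the
NODE_ID/NUM_NODES parameters, the file path, and the process() stub are
illustrative assumptions, not anything from the original setup.

# Minimal sketch: restrict an existing read loop to this node's share of a
# single large file of fixed-size records. NODE_ID/NUM_NODES, the record
# size, and the path are hypothetical; process() stands in for the real work.

import os

RECORD_SIZE = 4096          # assumed fixed-size records
NODE_ID = int(os.environ.get("NODE_ID", "0"))
NUM_NODES = int(os.environ.get("NUM_NODES", "1"))

def process(record):
    pass                    # placeholder for the real per-record work

def scan(path):
    nrecords = os.path.getsize(path) // RECORD_SIZE
    # Contiguous split: node i takes records [start, end).
    start = NODE_ID * nrecords // NUM_NODES
    end = (NODE_ID + 1) * nrecords // NUM_NODES
    with open(path, "rb") as f:
        f.seek(start * RECORD_SIZE)          # the only changes are here ...
        for _ in range(start, end):          # ... and the loop bounds
            process(f.read(RECORD_SIZE))

if __name__ == "__main__":
    scan("/local/data/dataset.bin")          # hypothetical node-local path

The only application changes are the seek and the loop bounds; the
scheduler or a wrapper script supplies NODE_ID and NUM_NODES.
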
How is the data set updated?
A read-only data set allows many file handling options.
If files are updated as a whole, you may use a caching versioned file
system. That is specialized, but provides many opportunities for
optimization; see the sketch below.
Handling arbitrary writes in the middle of files requires a consistent file
system, and the cost for consistency is very high.
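
For the whole-file-update case above, here is a rough Python sketch of
version-keyed caching done at the application level rather than in the
file system; the /shared and /local paths are hypothetical, and
mtime-plus-size merely stands in for a real version tag.

# Minimal sketch of whole-file, version-keyed caching at application level,
# assuming a master copy is visible via a shared mount and that files are
# only ever replaced as a whole. Paths are hypothetical.

import os
import shutil

MASTER = "/shared/data"     # assumed shared (slow) master copy
CACHE = "/local/cache"      # assumed node-local (fast) disk

def version_of(path):
    st = os.stat(path)
    return "%d-%d" % (st.st_mtime_ns, st.st_size)   # whole-file version tag

def cached(name):
    """Return a node-local path for name, refreshing it if the master changed."""
    src = os.path.join(MASTER, name)
    dst = os.path.join(CACHE, name + "." + version_of(src))
    if not os.path.exists(dst):
        os.makedirs(CACHE, exist_ok=True)
        tmp = dst + ".tmp"
        shutil.copyfile(src, tmp)
        os.rename(tmp, dst)                  # atomic publish of this version
    return dst

# usage: open(cached("model.dat"), "rb") instead of opening the master copy
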
> Cache-like behavior would save a lot of manual work but unfortunately
> I am not aware of any working solution for linux,
Several exist. It depends on the semantics you need, as doing this
efficiently requires making assumptions about the access patterns and
desired semantics.
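
One of the simplest sets of assumptions is read-only data with a subset
known in advance (e.g. supplied by the scheduler, as asked above); under
those semantics the "cache" can be a pre-stage step onto local disk before
the job starts. A hedged sketch, with hypothetical host names and paths:

# Minimal sketch: under read-only semantics, pre-stage a predicted subset of
# the data onto node-local disk rather than caching on demand. The master
# host, paths, and the per-job file list are hypothetical.

import subprocess
import sys

MASTER = "fileserver:/export/data/"   # assumed rsync-reachable master copy
LOCAL = "/local/data/"                # assumed node-local staging area

def prestage(file_list):
    # rsync copies only the files named in the list, and only if they changed.
    subprocess.run(
        ["rsync", "-a", "--files-from=" + file_list, MASTER, LOCAL],
        check=True,
    )

if __name__ == "__main__":
    prestage(sys.argv[1])   # e.g. a per-job list produced by the scheduler
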
> or caching ability of AFS/Coda
> (too cumbersome for cluster) or theoretical features of
> Intermezzo (still and maybe forever unfinished).
...Declare it a success and move on
--
Donald Becker becker at scyld.com
Scyld Computing Corporation http://www.scyld.com
914 Bay Ridge Road, Suite 220 Scyld Beowulf cluster system
Annapolis MD 21403 410-990-9993