data storage location

Donald Becker becker at scyld.com
Sat Sep 13 07:56:38 EDT 2003


On Fri, 12 Sep 2003 hanzl at noel.feld.cvut.cz wrote:

> > The alternative approach is to keep copies of the data on local disk on
> > each node. This gives you good IO rates, but you then have a substantial
> > data management problem; how do you copy 100Gb to each node in your
> > cluster in a sensible amount of time, and how do you update the data and
> > make sure it is kept consistent?

The significant questions are the access pattern and the requirements.

Is there a single application access pattern, or several?
Does the application read all of the data, or only a subset?
  Is the subset predictable?  Can it be predicted externally, by a scheduler?
Does the application step linearly through the data, or access it randomly?
  If it steps linearly, is the state space of the application small?
     If small, as with a best-match search, the processing time per file
        byte tends to be small.  We recommend a static split of the data
        across machines, migrating the process instead (see the sketch
        after this list).
        In the case of a single file read we can often do this without
        modifying the application, or localize the changes to a few
        lines around the read loop.
     If large, e.g. building a tree in memory from the data, what is the
        per-byte processing time?  If that time is large, it can again
        pay to split the data statically across machines and migrate
        the process.
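
As a concrete illustration of that static split, here is a minimal
sketch in Python: round-robin the file list into per-node shards, push
each shard to local disk once, then run the job where the data lives.
The node names, paths, and the run_search command are hypothetical
placeholders, not anything from a real cluster.

  #!/usr/bin/env python
  # Sketch only: NODES, FILES, and "run_search" are hypothetical.
  import subprocess

  NODES = ["node%02d" % i for i in range(8)]
  FILES = ["/data/chunk%04d.dat" % i for i in range(1024)]

  def shard(files, nodes):
      """Round-robin the file list into one shard per node."""
      shards = dict((n, []) for n in nodes)
      for i, f in enumerate(files):
          shards[nodes[i % len(nodes)]].append(f)
      return shards

  def split_and_migrate(shards):
      for node, files in shards.items():
          # Copy this node's shard to its local disk, once.
          subprocess.run(["rsync", "-a"] + files + [node + ":/local/data/"],
                         check=True)
          # Then migrate the process to the data rather than the data
          # to the process.
          subprocess.run(["ssh", node, "run_search /local/data/*"],
                         check=True)

  if __name__ == "__main__":
      split_and_migrate(shard(FILES, NODES))

Because each shard is copied exactly once and never written back,
there is no consistency problem left to manage.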

How is the data set updated?
  A read-only data set allows many file-handling options.
  If files are updated as a whole, you may use a caching versioned file
    system.  That approach is specialized, but it provides many
    opportunities for optimization (see the sketch below).
  Handling arbitrary writes in the middle of files requires a consistent
    file system, and the cost of consistency is very high.
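
To make the whole-file-update case concrete, here is a minimal sketch
of per-node caching with a whole-file version check.  The MASTER and
CACHE paths and the cached_path helper are assumptions for
illustration; a real caching versioned file system does far more.

  import os, shutil

  MASTER = "/nfs/master"   # assumed authoritative copy on a shared mount
  CACHE  = "/local/cache"  # assumed per-node local cache directory

  def is_stale(src, dst):
      """Files are replaced as a whole, so comparing mtime and size of
      the whole file is a sufficient version check; no byte-range
      consistency is needed."""
      if not os.path.exists(dst):
          return True
      s, d = os.stat(src), os.stat(dst)
      return (s.st_mtime, s.st_size) != (d.st_mtime, d.st_size)

  def cached_path(name):
      """Return a local path for name, refreshing the cached copy only
      when the master's version has changed."""
      src = os.path.join(MASTER, name)
      dst = os.path.join(CACHE, name)
      if is_stale(src, dst):
          os.makedirs(os.path.dirname(dst), exist_ok=True)
          shutil.copy2(src, dst)  # copy2 preserves mtime for the check
      return dst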

> Cache-like behavior would save a lot of manual work but unfortunately
> I am not aware of any working solution for linux,

Several exist.  Which one fits depends on the semantics you need:
doing this efficiently requires making assumptions about the access
patterns and the consistency guarantees the application can tolerate.

> or caching ability of AFS/Coda
> (too cumbersome for cluster) or theoretical features of
> Intermezzo (still and maybe forever unfinished).

...Declare it a success and move on


-- 
Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
914 Bay Ridge Road, Suite 220		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993
