Re: multicast copy or snowball copy

Donald Becker becker at scyld.com
Mon Aug 18 12:50:57 EDT 2003


On Mon, 18 Aug 2003, Rene Storm wrote:

> Rene: Yes, I think 1 MB is large, but I have to copy files of up to
> 2 GB each (30 GB overall), and the cluster is 128++ nodes.

Those are important parameters.
What network type are you using?
  If Ethernet, what switches and topology?
     (My guess is that you are using "smart" switches, likely connected
      with a chassis backplane.)

>> Here is an example: ...the longer the multicast filter list ... the
>> more packets dropped.

> Rene: That's right, but what if I ignore dropped packets and accept
> the corrupt files?  I would be able to rsync them later on.

This is costly.  "Open loop" multicast protocols work by having the
receiver track the missing blocks and request (or interpolate) them
later.  Here you are discarding that information, and doing much extra
work on both the sending and receiving sides to locate the missing
blocks afterward.
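
As a rough illustration, here is a minimal sketch of the receiver side
of such an open-loop protocol in Python.  The multicast group, port,
block size, and packet format (a 4-byte sequence number followed by the
payload) are all assumptions for the example, not any particular tool's
wire format:

    import socket
    import struct

    GROUP, PORT = "239.1.1.1", 5000   # illustrative multicast group/port
    BLOCK_SIZE  = 1024                # payload bytes per packet
    NBLOCKS     = 30000               # block count, announced out of band

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    mreq = socket.inet_aton(GROUP) + socket.inet_aton("0.0.0.0")
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    missing = set(range(NBLOCKS))     # the receiver's record of lost blocks
    with open("image.dat", "wb") as out:
        out.truncate(NBLOCKS * BLOCK_SIZE)
        while missing:
            pkt, _ = sock.recvfrom(4 + BLOCK_SIZE)
            (seq,) = struct.unpack("!I", pkt[:4])  # block number in header
            if seq in missing:
                out.seek(seq * BLOCK_SIZE)
                out.write(pkt[4:])
                missing.discard(seq)
            # On an end-of-pass marker from the sender, 'missing' would be
            # sent back as the NACK/repair-request list (not shown).

The point is that 'missing' is exactly the information you throw away
if you just accept corrupt files and rsync later.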

An alternative is closed-loop multicast, where the sender requires a
positive acknowledgment from every receiver before advancing more than
one window.
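
A sketch of the sender's pacing loop under that scheme, again with
illustrative addresses, ports, and ACK format (receivers are assumed to
ACK a window by echoing its first block number over unicast UDP):

    import socket, struct

    GROUP, DATA_PORT, ACK_PORT = "239.1.1.1", 5000, 5001  # illustrative
    BLOCK_SIZE, WINDOW = 1024, 64

    def send_file(path, receivers):      # receivers: list of host addresses
        data = open(path, "rb").read()
        nblocks = (len(data) + BLOCK_SIZE - 1) // BLOCK_SIZE
        mcast = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        acks = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        acks.bind(("", ACK_PORT))
        acks.settimeout(2.0)
        for first in range(0, nblocks, WINDOW):
            last = min(first + WINDOW, nblocks)
            while True:                  # resend window until all ACK
                for seq in range(first, last):
                    blk = data[seq * BLOCK_SIZE:(seq + 1) * BLOCK_SIZE]
                    mcast.sendto(struct.pack("!I", seq) + blk,
                                 (GROUP, DATA_PORT))
                pending = set(receivers)
                try:
                    while pending:
                        msg, (host, _) = acks.recvfrom(64)
                        if msg == struct.pack("!I", first):
                            pending.discard(host)
                except socket.timeout:
                    continue             # a receiver missed blocks: resend
                break                    # whole window acknowledged; advance

With 128++ receivers, collecting the ACKs is the scaling concern, so
the window needs to be large enough to amortize that cost.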

> First multicast to create the files; the second step is to compare
> with rsync.  I've tried this and it isn't really slow if you're doing
> the rsync via snowball.

This is verifying/filling with a neighbor instead of the original
sender.  The catch is that you don't know when you and the neighbor are
missing the same blocks.
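
One way to see the problem: if each side keeps the set of block numbers
it actually received, the repair plan splits into what the neighbor can
supply and what has to come from upstream.  A tiny sketch (names are
illustrative):

    # Compare block sets with a neighbor so blocks missing on BOTH sides
    # are detected and fetched from the original sender instead.
    def repair_plan(mine, neighbor, nblocks):
        """mine/neighbor: sets of block numbers each side received."""
        wanted        = set(range(nblocks)) - mine
        from_neighbor = wanted & neighbor   # the neighbor can fill these
        from_sender   = wanted - neighbor   # both missing: go upstream
        return from_neighbor, from_sender

Plain rsync against a neighbor skips this test, so commonly dropped
blocks silently propagate down the snowball chain.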

>> If you are doing this for yourself, the solution is easy.
...
> Rene: That's the problem with all the things you do: first they are
> for your own use, and then everybody wants them ;o)

If your end goal is to publish papers, do the hack.
If your end goal is to make work useful for others, you have to start
with a wider view.

>> [Do] *not* implement a program that copies a file
>> to every available node.  Instead use a system where you first get a
>> list of available ("up") nodes, and then copy the files to that node
>> list.  When the copy completes continue to use that node list rather
>> than letting jobs use newly-generated "up" lists.
> 
> Rene: Good idea

This approach applies to a wide range of cluster tasks.
A similar idea is that you don't care as much about which nodes are
currently up as you care about which nodes have remained up since you
last checked.

[[ Ideally you could ask "which nodes will be up when this program
completes", but there are all sorts of temporal and halting issues
there. ]]
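
A minimal sketch of that discipline in Python, assuming ICMP ping as
the liveness test and scp as the copy mechanism (node names, paths, and
the 128-node count are illustrative):

    import subprocess

    def node_up(host):
        # One ICMP probe; the ping flags here assume Linux.
        return subprocess.call(["ping", "-c1", "-w1", host],
                               stdout=subprocess.DEVNULL,
                               stderr=subprocess.DEVNULL) == 0

    # Snapshot the "up" list once, copy to exactly that list...
    nodes = [n for n in ("node%d" % i for i in range(128)) if node_up(n)]
    for n in nodes:
        subprocess.call(["scp", "image.dat", "%s:/tmp/image.dat" % n])

    # ...and schedule jobs only on 'nodes', the machines the copy
    # actually reached, never on a freshly generated "up" list.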

>> You can implement low-overhead fault checking by counting down job
>> issues and job completions.  As the first machine falls idle, check that
>> the final machine to assign work is still running.  As the next-to-last
>> job completes, check that the one machine still working is up.
> 
> Rene: But how do I get this status back to my "master", e.g. a
> command from the master: node16, copy to node17?

We have a positive completion indication as part of the Job/Process
Management subsystem.

If you consider the problem, the final acknowledgment must flow from
the last worker to the process that is checking for job completion.
You might as well put that process on the cluster master.  The natural
Unix-style implementation is to have the controlling machine hold the
parent of the process tree that implements the work, even if the work
itself is carried out elsewhere.
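
Outside of any particular cluster system, that looks roughly like the
following sketch: the master spawns one child per assignment, counts
completions, and runs the countdown check quoted above.  Here ssh and
ping stand in for whatever remote-execution and liveness mechanisms you
actually have, and one job per node is assumed:

    import subprocess, time

    def probe(node):
        # Hypothetical liveness check: one ICMP ping (Linux flags).
        # On failure you would reschedule that node's job (not shown).
        return subprocess.call(["ping", "-c1", "-w1", node],
                               stdout=subprocess.DEVNULL) == 0

    def run_all(assignments):       # assignments: list of (node, command)
        # The master holds the parent of the whole process tree; each
        # child's exit is the positive completion indication flowing back.
        procs = {node: subprocess.Popen(["ssh", node, cmd])
                 for node, cmd in assignments}
        done = set()
        while len(done) < len(procs):
            for node, p in procs.items():
                if node not in done and p.poll() is not None:
                    done.add(node)
                    if len(done) == 1:             # first machine fell idle:
                        probe(assignments[-1][0])  # check the last assigned
                    elif len(done) == len(procs) - 1:
                        (last,) = set(procs) - done
                        probe(last)                # check the one still working
            time.sleep(0.5)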

-- 
Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
914 Bay Ridge Road, Suite 220		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993
