[Beowulf] copying big files (Henning Fehrmann)

Henning Fehrmann henning.fehrmann at aei.mpg.de
Mon Aug 11 12:45:02 EDT 2008


I found some time to play with dolly and nettee. They do what I was
looking for. Thank you for the hints.

> > I will say that my dream would be for something like dolly to get some
> > sort of transfer recovery mechanism, though I realize that would be
> > quite difficult in such a topology.
> nettee has some failover and continuation capabilities at different
> points - but not what I think you want. The development version has a
> few extra modes for cases where data is being merged, but that isn't
> relevant to this discussion. When setting up the initial chain nettee
> can connect to an alternate node (from a list of failovers) if the
> target node will not answer.  It also has the ability to keep going if
> the local disk becomes unwritable, and it can continue a download on a
> chain down to the node above the point of failure. 
> However, nettee cannot at present rewire around a failed node to
> continue a download to the node(s) below it.  That would indeed be quite
> difficult, since one could have a situation like this:
>   A -> B  (A knows it has sent 100MB)
>   B -> C  (B knows it has sent  98MB, then it blows up)
>   C       (C knows it has received 98 MB)
> A and C will eventually figure out that B has died, and they could
> conceivably negotiate a new connection, but A may no longer have the
> missing 2 MB (it might have been sent out a pipe, processed, and not
> stored in the raw state anywhere.)  On the other hand, the development
> version uses ring buffers, and one could set those to be very large,
> enabling a certain level of "redo" from A.  So if C comes back and says
> "I only have 98MB" A can see if it has the missing parts and go on if it
> does.  It still might not though.  If B has stalled for long enough
> the ring buffer on A may have completely filled from the previous node,
> overwriting the data needed to recover.  I guess it would be possible to
> implement a "safety region" in the ring buffer which could not be
> overwritten.
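
The "safety region" idea quoted above could be sketched roughly as
follows. This is only an illustration in Python, not nettee code: the
class name ReplayRing, its methods, and the assumption that the
downstream node reports how many bytes it has safely received are all
hypothetical. The buffer refuses to overwrite bytes that have not been
acknowledged yet, so a node like A could resend them after a failure
further down the chain.

```python
class ReplayRing:
    """Fixed-size ring buffer that never overwrites unacknowledged bytes.

    Hypothetical sketch: offsets are absolute positions in the stream,
    and 'acked' is assumed to be reported by the downstream node.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = bytearray(capacity)
        self.head = 0    # total bytes accepted from upstream so far
        self.acked = 0   # downstream has confirmed everything below this

    def write(self, data):
        """Store as much of data as fits; return the number of bytes stored.

        Bytes in [acked, head) form the protected safety region, so the
        free space is whatever the capacity leaves beyond that region.
        """
        free = self.capacity - (self.head - self.acked)
        n = min(len(data), free)
        for i in range(n):
            self.buf[(self.head + i) % self.capacity] = data[i]
        self.head += n
        return n

    def ack(self, offset):
        """Record that downstream holds every byte below offset."""
        if offset > self.head:
            raise ValueError("cannot acknowledge bytes never written")
        self.acked = max(self.acked, offset)

    def replay(self, offset):
        """Return the bytes from offset up to head, or None if any are gone."""
        oldest = max(0, self.head - self.capacity)
        if offset < oldest or offset > self.head:
            return None
        return bytes(self.buf[i % self.capacity]
                     for i in range(offset, self.head))


ring = ReplayRing(8)
ring.write(b"abcdefgh")    # buffer is now full
ring.write(b"ij")          # stores nothing: no byte has been acknowledged
ring.ack(6)                # downstream confirms the first 6 bytes
ring.write(b"ijkl")        # now fits, overwriting only acknowledged bytes
missing = ring.replay(6)   # everything downstream has not confirmed yet
```

The price of this scheme is back-pressure: when the downstream node
stalls, write() accepts fewer bytes and the sender above has to wait,
instead of the needed data being silently overwritten.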

I successfully distributed a 10G file to 50 nodes. The rate was 140Mb/s
for nettee and a bit slower using dolly.
I guess it was due to a busy node somewhere in the chain.
Increasing the number of clients to 100 failed in both cases.

For nettee I got:
nettee: fatal error writing to child: Connection reset by peer

For dolly:
Sent MB: 40, MB/s: 66.752, Current MB/s: 35.710
movebytes read/write: Connection reset by peer
         errno = 104

I will do more systematic tests in the next few days.
David Mathog, are you interested in bug reports?

Henning Fehrmann