mulitcast copy or snowball copy
becker at scyld.com
Mon Aug 18 10:51:27 EDT 2003
On Mon, 18 Aug 2003, Rene Storm wrote:
> I want to distribute large files over a cluster.
How large? Some people think that 1MB is large, while others consider
large files to be 2GB+ (e.g. "Large File Summit"). This will have a
significant impact on how you copy the file.
> To raise performance I decided to copy the file to the local HD of any
> node in the cluster.
> Did someone find a multicast solution for that or maybe something with
> snowball principle?
There are several multicast file distribution protocols, but they all
share the same potential flaw: they use multicast. That means that they
will work great in a few specific installations, generally small
clusters on a single Ethernet switch.
But as you grow, multicast becomes more of a problem.
Here is a strong indicator for using multicast
A shared media or repeater-based network (e.g. traditional Ethernet)
Here are a few of the contra-indications for using multicast
"Smart" Ethernet switches which try to filter packets
Random communication traffic while copying
Heavy non-multicast traffic while copying
Multiple multicast streams
NICs with mediocre, broken or slow to configure multicast filters
Drivers not tuned for rapid multicast filter changes
Or, in summary, "using the cluster for something besides a multicast demo.
Here is an example: The Intel EEPro100 design configures the multicast
filter with a special command appended to the transmit command queue.
The command is followed by a list of the multicast addresses to accept.
While the command is usually queued to avoid delaying the OS, the chip
makes an effort to keep the Rx side synchronous by turning off the
receiver while it's computing the new multicast filter. So the longer
the multicast filter list and the more frequently it is changed, the
more packets dropped. And what's the biggest performance killer with
multicast? Dropped packets..
> My idea was to write some scripts which copy files via rsync with snowball,
If you are doing this for yourself, the solution is easy.
Try the different approaches and stop when you find one that works for you.
If you are building a system for use by others (as we do), then the
problem becomes more challenging.
> but there are some heavy problems.
> What happens if one node (in the middle) is down.
Good: first consider the semantics of failure. That means both recovery
and reporting the failure.
My first suggesting is that *not* implement a program that copies a file
to every available node. Instead use a system where you first get a
list of available ("up") nodes, and then copy the files to that node
list. When the copy completes continue to use that node list rather
then letting jobs use newly-generated "up" lists.
A geometrically cascading copy can work very well. It very effectively
uses current networks (switched Ethernet, Myrinet, SCI, Quadrics,
Infiniband), and can make use of the sendfile() system call.
For a system such as Scyld, use a zero-base geometric cascade: move the
work off of the master as the first step. The master generates the work
list and immediately shifts the process creation work off to the first
compute node. The master then only monitors for completion.
You can implement low-overhead fault checking by counting down job
issues and job completion. As the first machine falls idle, check that
the final machine to assign work is still running. As the next-to-last
job completes, check that the one machine still working is up.
Donald Becker becker at scyld.com
Scyld Computing Corporation http://www.scyld.com
914 Bay Ridge Road, Suite 220 Scyld Beowulf cluster system
Annapolis MD 21403 410-990-9993
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
More information about the Beowulf