Q: ATM Beowulf

Wed Nov 12 09:10:35 EST 2003

On Wed, 12 Nov 2003, Rafael Angel Garcia Leiva wrote:

> 
> Hi everybody,
> 
> I am planning to build a cluster (around 300 nodes) for Monte-Carlo 
> simulation. I will run the same program, but with different input data files, 
> on each node. I expect that the computation time is much greater than 
> communication time, and that I will have to transfer large amount of (input 
> and output) data files from working nodes to the master server.
> 
> Does make sense to use LAN emulation over ATM for this kind of clusters? Has 
> anyone experimented with ATM interconnections? Do you think is it 
> cost-effective today (specially compared to Fast / Gigabit Ethernet)?

>From what you describe, it perhaps depends on what "large amounts of
input and output files" works out to in more detail, but the answer is
almost certainly not.

The problem is embarrassingly parallel (completely independent programs)
which makes it relatively easy to figure out how performance is likely
to depend on the actual sizes (transfer times) of the programs relative
to their run time.

What you probably need to do is set up (or borrow from a friendly vendor
-- most serious cluster vendors have a test cluster and will cheerily
loan you an account) a few test nodes with gigabit interconnects.
Measure the time it takes to actually run your program alone, then the
time it takes to run your program WHILE copying its "next" input data
set in and its "last" data set out (without any sort of e.g. ssh
encryption -- use as raw as possible a data transfer tool).  Depending
on how effective your NIC is at doing DMA transfers and how I/O bound
the MC code is, copying large files while your job is running may not
count as a "serial penalty" against your CPU/memory bound computation.

It will also give you a pretty accurate idea of what the actual transfer
times are on Gbps ethernet relative to run times.  This in turn will
give you some clue as to required server capacity and whether or how to
distribute/gather the files from a single server or multiple servers
(whether or not this will help will of course depend on what use you
make of the files when you get them).

Part of the problem with your question is that as you frame it nobody
can answer it -- yet.  It requires detailed data.  If by "large" you
mean a 10 MB input file (which is yeah, pretty large) and a 100 MB
output file (ditto), well, that is roughly 1-2 seconds on a 100BT
connection for input transfer, 10-15 seconds for output transfer.  If
the program runs for 24 hours per input and output transfer, well, you
could run on 300 nodes with 100BT and never warm up the lines.

If by "large" you mean 10x (100MB in, 1 GB out), but still 24 hours
computation it would STILL run pretty perfectly on 100BT.  If by "large"
you mean ANOTHER 10x (1 GB out, 10 GB in), you're finally up to a
significant fraction of an hour for the data transfer at 100BT relative
to a daylong run.  However, at 1000BT you are still on the order of
minutes of I/O total (maybe 90 Gbits to transfer on a 1 Gbps line at
perhaps 50% efficiency -- three minutes or so?) and keeping all 300
nodes fed takes only 900 minutes (fifteen hours), which is less than the
1440 minutes of a day.  So a single server with Gbps ethernet could
distribute and collect results from 15+ hour long computations with 3
minutes of pure serial I/O per computation on 300 nodes and (barely) not
block.  If your NIC and disk channel manages DMA and can run modestly in
parallel with your computation, it simply improves things.

Obviously the important thing is the RATIO of computation to additional
per node serial communication (assuming optimal round-robin task
organization); if this ratio remains at roughly 300:1 a single server
with the cheapest network that can sustain the ratio should suffice.  If
the ratio is less than this, you have to start to think.

For example, would it be better (or even possible) to a) channel bond to
increase bandwidth and decrease serial I/O time; b) use more than one
server (servers are cheap relative to high end networks, and fortunately
your task can use stacked relatively cheap switches as you'll only use a
single channel at a time to the nodes in round robin -- IIRC jgigabit
ethernet gets more expensive than alternative high end networks if you
insist on putting 300 nodes on a single switching fabric); c) use a high
end network.  I don't usually think of ATM in this last category -- I'd
think of Myrinet or SCI, probably the latter because it is switchless
and perhaps a bit cheaper per node while still adequate, and you're VERY
LIKELY to be able to find a task organization where it is adequate.
Also, I'm pretty sure both Myrinet and SCI use DMA very effectively and
will largely parallelize the actual data transfer with your computation.

Ultimately, when you work out the actual numbers for the different
networks (at least approximately) you have to do a cost benefit analysis
and just pick the cheapest alternative that will scale to 300 nodes.
Fortunately, with an EP computation it is pretty easy to actually do
this very systematically and be pretty confident that you have a
near-optimal design.

   rgb

> 
> Thanks in advance.
> 
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf