On Sat, Feb 14, 2009 at 6:43 PM, David Mathog <span dir="ltr">&lt;<a href="mailto:mathog@caltech.edu">mathog@caltech.edu</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="Ih2E3d">Tiago Marques &lt;<a href="mailto:a28427@ua.pt">a28427@ua.pt</a>&gt;<br>

<br>

<br>

</div><div class="Ih2E3d">&gt; I&#39;ve been trying to get the best performance on a small cluster we have here<br>
&gt; at University of Aveiro, Portugal, but I&#39;ve not been able to get most<br>
&gt; software to scale to more than one node.<br>

<br>

</div>&lt;SNIP&gt;<br>

<div class="Ih2E3d"><br>

&gt; The problem with this setup is that even calculations that take more than 15<br>
&gt; days don&#39;t scale to more than 8 cores, or one node. Usually performance is<br>
&gt; lower with 16 cores, or 12 cores, than with just 8. From what I&#39;ve been reading,<br>
&gt; I should be able to scale fine at least till 16 cores and 32 for some<br>
&gt; software.<br>

<br>

</div>&lt;SNIP&gt;<br>

<div class="Ih2E3d">&gt;<br>

&gt; I tried with Gromacs to have two nodes using one processor each, to check if<br>
&gt; 8 cores were stressing the GbE too much, and the performance dropped too<br>
&gt; much compared with running two CPUs on the same node.<br>

<br>

</div>Lots of possibilities here. Most of them probably come down to the<br>
code not being written to make good use of a cluster environment, and/or<br>
there not being any way to do that (for example, single-threaded code with a lot of<br>
unpredictable branching).<br>

<br>

For Gromacs I suggest you ask on that mailing list. &nbsp;My recollection is<br>

that it was known to scale poorly, but that was a couple of years ago,<br>

and maybe they have improved it since then. &nbsp;If it doesn&#39;t scale you can<br>

always get more throughput by running one independent job on each of<br>

your nodes, using local storage to avoid network contention to the file<br>

server. &nbsp;It may take 15 days to finish a run, but at least you&#39;ll have N<br>

times more work completed. &nbsp;Running N independent jobs will give you at<br>

least as much throughput as running 1 job on N cores. &nbsp;Admittedly it is<br>

nice to have the results in 1/Nth the time.<br>
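<div>As a rough illustration of the throughput argument above, here is a minimal Python sketch. The function name, the 2-node count and the 60% parallel efficiency are made-up assumptions for illustration; only the 15-day run time comes from this thread.</div>
<pre>
```python
def jobs_per_day(n_nodes, days_per_job_on_one_node, parallel_efficiency):
    """Return (independent, combined) throughput in jobs per day."""
    # N independent jobs, one per node: each still takes the single-node time.
    independent = n_nodes / days_per_job_on_one_node
    # One job spread over N nodes: ideal speedup N, scaled by the efficiency.
    speedup = n_nodes * parallel_efficiency
    combined = speedup / days_per_job_on_one_node
    return independent, combined


# Hypothetical figures: 2 nodes, 15 days per job on one node, 60% efficiency.
independent, combined = jobs_per_day(2, 15.0, 0.6)
print(f"independent jobs:   {independent:.3f} jobs/day")
print(f"one job on 2 nodes: {combined:.3f} jobs/day")
print(f"time to first result: {15.0 / (2 * 0.6):.1f} days instead of 15.0")
```
</pre>
<div>The combined figure only matches the independent one when the efficiency is 1.0, which is the "at least as much throughput" point. What the parallel run buys is the shorter wait for a single result.</div>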

</blockquote><div></div><div>I already did that, but there were not many helpful people on the Gromacs list. They just told me to wait for version 4.0, which I did. It scales better, though still not as well as I had hoped.</div><div></div><div>
We have already been running a single job per node for months, but it would be good to be able to run jobs faster, since sometimes that is needed.</div><div>&nbsp;</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

Some of what you may be seeing with poorer performance on more cores on<br>
one node is probably related to the effect on memory access, especially<br>
through cache. &nbsp;Code whose working set stays in cache runs much faster<br>
than anything which has to go to main memory, and as soon as you run two<br>
processes that compete for a shared cache (which ones do depends on the architecture) you may find that<br>
the two programs are throwing each other&#39;s data out of it,<br>
which can result in dramatic slowdowns.<br>

<br>

Give gprof a shot too. &nbsp;You want to see where your code is spending most<br>

of its time. &nbsp;If it spends 95% of its time in routines with no network<br>

IO, then the network is likely not your issue. &nbsp;And vice versa.<br>
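<div>To put a number on that rule of thumb, here is a small Python sketch of the usual Amdahl-style bound: if a fraction of the run time is spent in network routines, removing that time entirely can speed the run up by at most 1 / (1 - fraction). The function name is hypothetical, and the 30% and 60% shares are illustrative values, not measurements.</div>
<pre>
```python
def max_speedup_if_network_were_free(network_fraction):
    """Upper bound on speedup if all time in network routines vanished."""
    # The remaining run time is the non-network share; invert it for speedup.
    return 1.0 / (1.0 - network_fraction)


# 0.05 matches the "95% of time with no network IO" case; the rest are made up.
for share in (0.05, 0.30, 0.60):
    bound = max_speedup_if_network_were_free(share)
    print(f"{share:.0%} of time in network routines: at most {bound:.2f}x faster")
```
</pre>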

<div class="Ih2E3d"></div></blockquote><div></div><div>I have thought of that, but I didn&#39;t manage to do it on the more important codes. It compiles but just doesn&#39;t spit out the profiling output.</div><div></div>

<div>I have used &quot;iftop&quot; to measure network usage and it&#39;s probably around 300-400 Mbit/s, so I was attributing the problem to latency, since throughput seems fine. While copying files with &quot;scp&quot;, I can get 93 MB/s.</div>
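<div>For reference, 93 MB/s is about 744 Mbit/s. A rough way to see why latency rather than throughput could be the limit is the simple cost model "time per message = latency + size / bandwidth". The Python sketch below uses the scp figure above as the bandwidth; the 50 microsecond latency is an assumed placeholder, not a measurement from this cluster, and would have to be replaced with a measured value.</div>
<pre>
```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Simple latency + size/bandwidth cost model for one message."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


BANDWIDTH = 93e6  # bytes/s, the scp figure quoted above
LATENCY = 50e-6   # seconds; ASSUMED value, measure it on the real network

# Message size at which latency and transmission time are equal.
break_even = LATENCY * BANDWIDTH
print(f"break-even message size: {break_even:.0f} bytes")

for size in (100, 1_000, 10_000, 1_000_000):
    total = transfer_time(size, LATENCY, BANDWIDTH)
    latency_share = LATENCY / total
    print(f"{size:9d} bytes: {total * 1e6:9.1f} us, {latency_share:.0%} latency")
```
</pre>
<div>Under this model, messages well below the break-even size are dominated by latency, so a code that exchanges many small messages can scale badly even when the link is far from saturated.</div>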

<div>&nbsp;</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="Ih2E3d">

&gt; unexpected for me, since the benchmarks I&#39;ve seen on Gromacs website state<br>

&gt; that I should be able to have 100% scaling on this case, sometimes more.<br>

<br>

</div>Contact the person who said that, get the exact conditions, and see if<br>

you can replicate them. &nbsp;You might have a network issue, but unless you<br>

are comparing apples to apples it may be hard to figure it out.</blockquote><div></div><div>True. Thanks for the help.</div><div></div><div>I must ask: does anybody on this list get good scaling with 16 cores on two nodes, for a code and job that completes in about a week?</div>
<div>Or does most code that finishes in a week or two only scale with InfiniBand and the like, in something like 99% of cases?</div><div></div><div>Best regards,</div><div>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Tiago Marques</div><div>&nbsp;</div>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><br>

<br>

Regards,<br>

<font color="#888888"><br>

David Mathog<br>

<a href="mailto:mathog@caltech.edu">mathog@caltech.edu</a><br>

Manager, Sequence Analysis Facility, Biology Division, Caltech<br>

</font></blockquote><br>
