<FONT face="Default Sans Serif,Verdana,Arial,Helvetica,sans-serif" size=2><div><div><font color="#990099"></font><blockquote style="border-left: 2px solid #000000; padding-right: 0px; padding-left: 5px; margin-left: 5px; margin-right: 0px;">To: beowulf@beowulf.org<br>From: Greg Lindahl &lt;lindahl@pbm.com&gt;<br>Sent by: beowulf-bounces@beowulf.org<br>Date: 03/27/2009 12:03AM<br>Subject: Re: [Beowulf] Lowered latency with multi-rail IB?<br><br><font size="3" face="Courier New,Courier,monospace">On Thu, Mar 26, 2009 at 11:32:23PM -0400, Dow Hurst DPHURST wrote:<br><br>&gt; We've got a couple of weeks max to finalize spec'ing a new cluster.&nbsp; Has <br>&gt; anyone knowledge of lowering latency for NAMD by implementing a <br>&gt; multi-rail IB solution using MVAPICH or Intel's MPI?<br><br>Multi-rail is likely to increase latency.<br><br>BTW, Intel MPI usually has higher latency than other MPI<br>implementations.<br><br>If you look around for benchmarks you'll find that QLogic InfiniPath<br>does quite well on NAMD and friends, compared to that other brand of<br>InfiniBand adaptor. For example, at<br><br><a href="http://www.ks.uiuc.edu/Research/namd/performance.html">http://www.ks.uiuc.edu/Research/namd/performance.html</a><br><br>the lowest line == best performance is InfiniPath. Those results<br>aren't the most recent, but I'd bet that the current generation of<br>adaptors has the same situation.<br><br>-- Greg<br>(yeah, I used to work for QLogic.)<br></font>

</blockquote><br>I'm very familiar with that benchmark page. ;-)<br><br>One motivation for designing an MPI layer that lowers latency with multi-rail is the use of accelerator cards or GPUs. Each node gets so much more work done that the interconnect quickly becomes the limiting factor. With the current implementation of NAMD/CUDA, one Tesla GPU is roughly equal to 12 CPU cores, so the scaling efficiency really suffers. I'd like to see how someone could scale efficiently beyond 16 IB connections with only two GPUs per IB connection when running NAMD/CUDA.<br><br>Some codes are sped up far beyond 12x and reach 100x, such as VMD's cionize utility. I don't think that particular code requires parallelization (not sure). However, as NAMD/CUDA is tuned, its efficiency on the GPU improves, and new bottlenecks in previously ignored sections of code are found and fixed, the speedup will grow beyond 12x. So a solution to the interconnect bottleneck needs to be developed, and I wondered whether multi-rail would be the answer. Thanks so much for your thoughts!<br>Best wishes,<br>Dow<br></div></div></FONT>
