<br><br><div><span class="gmail_quote">On 12/18/07, <b class="gmail_sendername">Mark Hahn</b> &lt;<a href="mailto:hahn@mcmaster.ca" target="_blank">hahn@mcmaster.ca</a>

&gt; wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

 &gt; The machines are running the 2.6 kernel and I have confirmed that the max<br>&gt; TCP send/recv buffer sizes are 4MB (more than enough to store the full<br>&gt; 512x512 image).<br><br>the bandwidth-delay product in a lan is low enough to not need 

<br>this kind of tuning.</blockquote><div><br>I didn&#39;t actually do any tuning; I just checked that the max buffer size the Linux auto-tuning can use is sufficient.&nbsp;</div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

 &gt; I loop with the client side program sending a single integer to rank 0, then<br>&gt; rank 0 broadcasts this integer to the other nodes, and then all nodes send<br>&gt; back 1MB / N of data.<br><br>hmm, that&#39;s a bit harsh, don&#39;t you think?&nbsp;&nbsp;why not have the rank0/master 

<br>ask each slave for its contribution sequentially?&nbsp;&nbsp;sure, it introduces a bit<br>of &quot;dead air&quot;, but it&#39;s not as if two slaves can stream to a single master<br>at once anyway (each can saturate its link, therefore the master&#39;s link is 

<br>N-times overcommitted.)</blockquote><div><br>I guess I figured that the data is relatively small compared to the bandwidth, whereas the latency for ethernet is relatively high. I also thought the switch would be able to efficiently buffer and forward the data. I am not much of a networking guy (more a graphics guy) so I realize I could be way off base here. 

</div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

 &gt; To make sure there was not an issue with the MPI broadcast, I did one test<br>&gt; run with 5 nodes only sending back 4 bytes of data each.&nbsp;&nbsp;The result was a<br>&gt; RTT of less than 0.3 ms.<br><br>isn&#39;t that kind of high?&nbsp;&nbsp;a single ping-pong latency should be ~50 us - 

<br>maybe I&#39;m underestimating the latency of the broadcast itself.</blockquote><div><br>This is quite a bit more than a single ping-pong. The viewer sends to the master node (rank 0), then the master node broadcasts to all other nodes, and then all nodes send back to the viewer node. I don&#39;t know if this still seems high? 

<br>&nbsp;</div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

 &gt; One interesting pattern I noticed is that the hiccup frame RTTs, almost<br>&gt; without exception, fall into one of three ranges (approximately 50-60,<br>&gt; 200-210, and 250-260). Could this be related to exponential back-off? 

<br><br>perhaps introduced by the switch, or perhaps by the fact that the bcast<br>isn&#39;t implemented as an atomic (eth-level) broadcast.<br></blockquote><div><br>But the bcast is always just sending 4 bytes (a single integer), and as mentioned above no hiccups occur until the size of the final gather packets (from all nodes to the viewer) is increased.

<br>&nbsp;</div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br>&gt; Tomorrow I will experiment with jumbo frames and flow control settings (both 

<br>&gt; of which the HP Procurve claims to support).&nbsp;&nbsp;If these do not solve the<br>&gt; problems I will start sifting through tcpdump.<br><br>I would simply serialize the slaves&#39; responses first.&nbsp;&nbsp;the current design

<br>

 tries to trigger all the slaves to send results at once, which is simply<br>not logical if you think about it, since any one slave can saturate<br>the master&#39;s link.<br></blockquote><div><br>I still have the feeling that the switch should be able to handle this more efficiently, but since your idea is relatively simple to implement I will give it a try and see what the performance is like.

<br><br>Thanks for your input.<br><br>&nbsp;</div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br>regards, mark hahn.<br></blockquote></div><br>
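<div><br>For reference, a rough back-of-the-envelope sketch of the two quantities argued about above: the bandwidth-delay product of the LAN, and the idealized time to gather the image with simultaneous versus serialized senders. The 1 Gbit/s link speed is an assumption (the thread does not state it), the 50 us round trip is the ping-pong figure quoted above, and the function names are made up for illustration. The model deliberately ignores packet loss, switch buffering and TCP retransmission, so it only bounds the cost of the &quot;dead air&quot;; it says nothing about the hiccups themselves.</div>

```python
def bdp_bytes(link_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to keep the link full."""
    return link_bps / 8.0 * rtt_s


def gather_time_s(total_bytes: int, n_workers: int, link_bps: float,
                  rtt_s: float, serialized: bool) -> float:
    """Idealized time for one receiver to collect total_bytes from n_workers.

    The receiver's link is the bottleneck in both cases, so the wire time is
    the same. Serialized mode pays one request round trip per worker; the
    simultaneous mode pays a single round trip. Loss and retransmission are
    NOT modeled.
    """
    wire_time = total_bytes * 8.0 / link_bps
    round_trips = n_workers if serialized else 1
    return wire_time + round_trips * rtt_s


if __name__ == "__main__":
    LINK_BPS = 1e9              # assumed gigabit Ethernet (not stated in the thread)
    RTT_S = 50e-6               # ping-pong latency figure quoted in the thread
    IMAGE_BYTES = 1024 * 1024   # 1MB gathered per frame
    N_WORKERS = 5               # node count from the 4-byte test run

    print("BDP (bytes):", bdp_bytes(LINK_BPS, RTT_S))
    for serialized in (False, True):
        t = gather_time_s(IMAGE_BYTES, N_WORKERS, LINK_BPS, RTT_S, serialized)
        label = "serialized" if serialized else "simultaneous"
        print(f"{label}: {t * 1e3:.3f} ms")
```

<div><br>Under these assumptions the bandwidth-delay product is a few kilobytes, far below the 4MB buffer limit, and serializing the responses adds only one extra round trip per additional node on top of the wire time for 1MB.</div>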
