[Beowulf] SATA II - PXE+NFS - diskless compute nodes

Simon Kelley simon at thekelleys.org.uk
Thu Dec 14 18:01:47 EST 2006

Donald Becker wrote:

>>I'm not quite following here: It seems like you might be advocating
>>retransmits every half second. I'm currently doing classical exponential
>>backoff, 1 second delay, then two, then four etc. Will that bite me?
> Where are you doing exponential back-off?
Retransmits in the TFTP server: send a block and await the
corresponding ACK; if it doesn't arrive within the timeout, re-send.
This is needed to recover from lost data packets; client retries only
recover from lost ACKs (at least they do in implementations which have
been immunised against Sorcerer's Apprentice Syndrome).
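To make the server side concrete, here is a minimal sketch in Python (not actual server code). The names Transfer, on_ack and on_tick and the 1-second starting timeout are assumptions for illustration only. It shows the two rules described above: a duplicate ACK never triggers a re-send, and a missing ACK triggers a re-send with a doubling timeout.

```python
from dataclasses import dataclass

INITIAL_TIMEOUT = 1.0  # seconds; assumed starting value, doubled on each retry


@dataclass
class Transfer:
    """Hypothetical per-client state for one TFTP read transfer."""
    block: int = 1                      # block currently awaiting its ACK
    timeout: float = INITIAL_TIMEOUT    # current retransmit interval
    deadline: float = INITIAL_TIMEOUT   # absolute time of the next re-send


def on_ack(t: Transfer, acked_block: int, now: float) -> str:
    """Handle an incoming ACK and return the action the server should take."""
    if acked_block == t.block:
        # Expected ACK: move to the next block and reset the backoff.
        t.block += 1
        t.timeout = INITIAL_TIMEOUT
        t.deadline = now + t.timeout
        return f"send block {t.block}"
    # Duplicate or stale ACK: do not re-send. Answering duplicates is what
    # causes Sorcerer's Apprentice Syndrome; the timer handles real losses.
    return "ignore"


def on_tick(t: Transfer, now: float) -> str:
    """Called periodically; re-send the outstanding block if its ACK is overdue."""
    if now >= t.deadline:
        t.timeout *= 2                  # classical backoff: 1 s, 2 s, 4 s, ...
        t.deadline = now + t.timeout
        return f"resend block {t.block}"
    return "wait"
```

With a block first sent at time 0, this re-sends at 1 s, 3 s and 7 s, which is the 1, 2, 4 second pattern mentioned above.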

> The TFTP client will/should/might do a retry every second.  (Background:
> TFTP uses "ACK" of the previous packet to mean "send the next one".  The
> only way to detect that this is a retry is timing.) The client might do a
> re-ARP first.  In corner cases it might not reply to ARP itself.
> [[ Step up on the soapbox. ]]
> What idiot thought that exponential backoff was a good idea?
> Exponential backoff doesn't make sense where your base time period is a
> whole second and you can't tell if the reason for no response is
> failure, busy network or no one listening.
> My guess is that they were just copying Ethernet, where modified,
> randomized exponential backoff is what makes it magically good.
> Exponential backoff makes sense at the microsecond level, where you have
> a collision domain and potentially 10,000 hosts on a shared ether.  Even
> there the idea of "carrier sense" or 'is the network busy' is what
> enables Ethernet to work at 98+% utilization rather than the 18% or 37%
> theoretical of Aloha Net.  (Key difference: deaf transmitter.)
> What usually happens with DHCP and PXE is that the first packet is used
> getting the NIC to transmit correctly.  The second packet is used to get
> the switch to start passing traffic.  The third packet gets through but we
> are already well into the exponential fallback.
> PXE would be much better and more reliable if it started out
> transmitting a burst of four DHCP packets evenly spaced in the first
> second, then falling back to once per second.  If there is a concern
> about DHCP being a high percentage of traffic in huge installations
> running 10baseT, tell them to buy a server. Or, like, you know, a
> router.  Because later the ARP traffic alone will dwarf a few DHCP
> broadcasts.
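To put numbers on the schedule proposed above, here is a small Python sketch comparing send times. The function names are made up, and reading "evenly spaced in the first second" as 0, 0.25, 0.5 and 0.75 s is an assumption; the exponential case uses the 1, 2, 4 second delays from earlier in the thread, not the timings of any particular PXE ROM.

```python
def burst_then_steady(n, burst=4):
    """Send times (s) for n packets: `burst` spread over the first second, then 1/s."""
    times = []
    for i in range(n):
        if i < burst:
            times.append(i / burst)            # 0, 0.25, 0.5, 0.75 (assumed spacing)
        else:
            times.append(1.0 + (i - burst))    # 1, 2, 3, ...
    return times


def exponential(n, base=1.0):
    """Send times (s) for n packets with delays base, 2*base, 4*base, ..."""
    times, t, delay = [], 0.0, base
    for _ in range(n):
        times.append(t)
        t += delay
        delay *= 2
    return times
```

If the first two packets are lost bringing up the NIC and the switch port, the third packet leaves at 0.5 s under the burst schedule and at 3 s under this exponential schedule.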

It's probably worth differentiating DHCP and TFTP here. I guess the
reason for exponential backoff is to avoid congestion collapse as the
ratio of useful work to bits-on-the-wire decreases. By the time a host
is doing TFTP the network path should be established, so bursting
packets shouldn't be needed. Maybe delaying backoff would make sense.
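One possible reading of "delaying backoff", sketched in Python: keep the timeout flat for the first few retries and only then start doubling. The function name, the three flat retries and the 16-second cap are all assumptions picked for illustration.

```python
def retry_timeout(attempt, base=1.0, flat_retries=3, cap=16.0):
    """Timeout in seconds before retransmission number `attempt` (0-based)."""
    if attempt < flat_retries:
        return base                      # no backoff yet: retry at a fixed interval
    # After the flat phase, double on every further attempt, up to the cap.
    return min(base * 2 ** (attempt - flat_retries + 1), cap)
```

With the defaults this gives timeouts of 1, 1, 1, 2, 4, 8, 16, 16 seconds, so a couple of early losses cost little while a dead client still backs off.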
>>I'm doing round-robin, but I don't see how to throttle active
>>connections: do I need to do that, or just limit total bandwidth?
> Yes, you need to throttle active TFTP connections.  The clients
> currently winning can turn around a next-packet request really quickly.
> If a few get in lock step, the server will have the next chunk of the
> file warm in the cache.  This is the start of locking out the first
> loser.
> You can't just let the ACKs queue up in the socket as a substitute for
> deferring responses either.  You have to pull them out ASAP and mark
> that client as needing a response.  This doesn't cost very much.   You
> need to keep the client state structure anyway.  This is just one more
> bit, plus updating the timeval that you should be keeping anyway.
All true. I'll experiment with some throttling approaches.
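As a starting point for that experiment, here is a Python sketch of the "one more bit" idea: ACKs are recorded as soon as they are read, and responses go out round-robin with a limit per pass. The Scheduler class, the per_pass limit and the field names are hypothetical, and this is only one of several ways to throttle.

```python
from collections import OrderedDict


class Scheduler:
    """Hypothetical round-robin scheduler over clients that are owed a block."""

    def __init__(self, per_pass=2):
        self.per_pass = per_pass         # assumed cap on blocks sent per pass
        self.clients = OrderedDict()     # addr -> {"pending": bool, "last_ack": float}

    def on_ack(self, addr, now):
        """Read the ACK off the socket right away; just mark the client."""
        state = self.clients.setdefault(addr, {"pending": False, "last_ack": now})
        state["pending"] = True          # the "needs a response" bit
        state["last_ack"] = now          # the timestamp kept per client

    def next_batch(self):
        """Pick up to per_pass pending clients and rotate them to the back."""
        chosen = [a for a, s in self.clients.items() if s["pending"]][: self.per_pass]
        for addr in chosen:
            self.clients[addr]["pending"] = False
            self.clients.move_to_end(addr)   # served clients go last next time
        return chosen
```

Because served clients move to the back of the order, a client that ACKs quickly cannot jump ahead of one that is still waiting for its block.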


