Network RAM revisited

Wed May 28 19:29:52 EDT 2003

On Wed, 28 May 2003, Mark Hahn wrote:

> > Another question that bothers me is network latency deteriorates severely 
> > after packet size goes beyond 1-1.5 KB. 
> 
> I don't see that, unless by "severe" you mean latency=bandwidth/size ;)
> fragmenting a packet should definitely not cause a big decrease in 
> throughput.  also, support for jumbo MTU's is not that uncommon.

Mark is dead right here.  In fact, there are two regimes of bottleneck
in networking.  Small packets are latency dominated -- the interface
cranks out packets as fast as it can, and typically bandwidth (such as
it is) increases linearly with packet SIZE as R_l * P_s (max rate for
latency bounded packets times packet size).  Double the packet size,
double the "bandwidth", but you just can't get any more pps through the
interface/switch/interface combo (often with TCP stack on the side
dominating the whole thing).

What you're seeing around P_s = 1 KB is a crossover from latency
dominated to bandwidth dominated (bottlenecked) traffic.  You are
approaching wire speed.  As soon as this occurs you CAN'T continue to
spit out packets at the maximum rate, so speed can no longer increase
linearly with packet size; the wire simply won't hold any more bps during
the time you are using it.  This causes latency to "increase" (or rather,
the packet rate to decrease) as packet delivery starts to be delayed by
the sheer time required to load the packet onto the wire at the maximum
rate, not the time required to put the packet together and initiate
transmission.

The result is a near-exponential saturation curve: linear growth at
first, then saturation at wirespeed less all sorts of cumulative overhead,
retransmissions and other inefficiencies, typically about 10 MBps data
transmission (around 90% of the theoretical limit after allowing for
mandatory headers) for 100BT TCP/IP although this varies a LOT with the
quality of your NIC and switch and wiring and protocol/stack.  At one
time fragmenting a packet stream of messages each just larger than the
MTU caused one to fall off to a performance region that was once again
latency dominated and cost one a "jump" down in bandwidth, but in recent
years this jump has been small to absent as NIC latencies have dropped
so even fragmented/split packets are still in the bandwidth dominated
region where bandwidth is nearly saturated and slowly varying.
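
To make the two regimes concrete, here is a minimal sketch of the
crossover in Python.  It is a toy model, not a measurement: R_L (the
latency-bound packet rate) is an assumed 10,000 pps, picked only so the
crossover lands near the ~1 KB discussed above, and WIRE_BPS is raw
100 Mbit/s with headers and all other overhead ignored.  The hard min()
gives a sharp knee where a real interface shows a smooth saturation.

```python
# Toy two-regime model: latency-bound below the crossover, wire-bound above.
# R_L and WIRE_BPS are illustrative assumptions, not measured values.
R_L = 10_000        # max packets/s the NIC + stack can issue (assumed)
WIRE_BPS = 12.5e6   # 100 Mbit/s expressed in bytes/s, headers ignored


def throughput(packet_size):
    """Bytes/s delivered for a given packet size in bytes."""
    return min(R_L * packet_size, WIRE_BPS)


def packet_rate(packet_size):
    """Packets/s actually achieved at that packet size."""
    return throughput(packet_size) / packet_size


# Packet size at which R_L * P_s reaches the wire limit.
crossover = WIRE_BPS / R_L

if __name__ == "__main__":
    print(f"crossover at {crossover:.0f} bytes")
    for size in (64, 256, 1024, 1500, 4096):
        print(f"{size:5d} B  {throughput(size) / 1e6:6.2f} MB/s  "
              f"{packet_rate(size):8.0f} pps")
```

Below the crossover the packet rate sits at R_L and throughput scales
with P_s; above it throughput is pinned at the wire limit and the packet
rate falls off as 1/P_s, which is the apparent "latency increase".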

> in summary: I believe network shared memory is simply not a great computing
> model.  if I was supervising a thesis project, I'd probably try to steer 
> the student towards something vaguely like Linda...

I'm not sure I agree with this.  There have certainly been major CS
projects (like Duke's Trapeze) that have been devoted to creating a
reasonably transparent network-based large memory model because there
ARE problems (or at least have been problems) where there is a need for
very large virtual memory spaces but disk-based swap is simply too slow.
A FAST network (not Ethernet, and not TCP/IP) with 5 usec or so latency
and 100 MBps (megabytes, large B) scale bandwidth can still beat up disk
swap for certain (fairly random or at least nonlocal) memory access
patterns.  You assert that these patterns can be avoided by careful design,
and you could be right.  However, there is some virtue in having a general
purpose magic-wand level tool where accessing more memory than you have
kicks in a transparent mechanism for distributing the memory and
runtime-optimizing its access -- basically creating an additional level
of memory speed in the memory speed hierarchy -- so users don't HAVE to
code for a particular size or architecture.
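
As a rough back-of-envelope check on the network-versus-disk claim, the
sketch below estimates the time to fetch one page as latency plus
transfer time.  The network figures (5 usec, 100 MBps) are the ones
quoted above; the 4 KB page size and the disk figures (8 ms access
time, 40 MBps transfer) are my assumptions for illustration only, so
plug in numbers for your own hardware.

```python
# Single-page fetch time = latency + size / bandwidth.
PAGE = 4096            # bytes; assumed page size
NET_LATENCY = 5e-6     # seconds; from the text
NET_BW = 100e6         # bytes/s; from the text
DISK_LATENCY = 8e-3    # seconds; ASSUMED seek + rotational delay
DISK_BW = 40e6         # bytes/s; ASSUMED sustained transfer rate


def fetch_time(latency, bandwidth, nbytes=PAGE):
    """Seconds to fetch nbytes in one request."""
    return latency + nbytes / bandwidth


net = fetch_time(NET_LATENCY, NET_BW)
disk = fetch_time(DISK_LATENCY, DISK_BW)

if __name__ == "__main__":
    print(f"network page fetch: {net * 1e6:8.1f} usec")
    print(f"disk page fetch:    {disk * 1e6:8.1f} usec")
    print(f"ratio disk/network: {disk / net:8.1f}")
```

Under these assumptions the random-access page fetch is dominated by
transfer time on the network and by access latency on the disk, which is
why the advantage shows up for nonlocal access patterns and shrinks for
long sequential streams.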

I do think that simply providing "networked swap" through the existing
VM is unlikely to be a great solution (although it might "work" with
tools already in the kernel for at least testing purposes).  The VM is
almost certainly tuned for a single, highly expensive level of memory
outside of physical DRAM, and there are too many orders of magnitude
between DRAM and disk latencies and bandwidths for the tuning to "work"
correctly for an intermediary network layer with very different
advantages and nonlinearities.  Then there is the page issue which if
nothing else might require tuning or might favor certain hardware
(capable of jumbo packets that can hold a full page) or both (different
tunings for different hardware).
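
To illustrate the page issue, here is a small sketch counting how many
frames it takes to move one page at a given MTU.  The 4 KB page, the
40 bytes of IP + TCP headers, and the 1500 and 9000 byte MTUs are
assumptions on my part (common conventions, not anything specific to a
particular NIC or to a non-TCP/IP transport).

```python
import math

PAGE = 4096      # bytes; assumed page size
HEADERS = 40     # bytes; assumed IP (20) + TCP (20) headers, no options


def frames_per_page(mtu, page=PAGE, headers=HEADERS):
    """Number of frames needed to carry one page at this MTU."""
    payload = mtu - headers          # usable bytes per frame
    return math.ceil(page / payload)


if __name__ == "__main__":
    for mtu in (1500, 9000):         # standard and a common jumbo MTU
        print(f"MTU {mtu}: {frames_per_page(mtu)} frame(s) per page")
```

With a standard MTU every page fault costs several packets (and several
per-packet latencies); with a jumbo MTU large enough to hold the page it
costs one, which is the kind of hardware dependence the tuning would
have to account for.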

What I would recommend is that the student talk to somebody like Jeff
Chase at Duke and look over the literature on existing and past
projects that have addressed the issue.  They'll need to quote Jeff's
and the other people's work anyway in any sort of a sane dissertation,
and they are by far the best people to tell them if the idea is still
extant and worthy of further work (perhaps built on top of their base,
perhaps not) or not.

It is also wise to (as they are apparently doing) find a few people with
applications that would "use a huge VM tomorrow" if it existed as a
mainstream option in (say) an OTC distribution or even a specialized
Scyld or homebrew kernel.  Weather, cosmology: there are a few
"enormous" problems where researchers ALWAYS want to work bigger than
current physical limits and budgets permit and can still get useful
results even with the penalties imposed by disk or other VM extensions.
For those workers, a "cluster" might be a single compute node and a farm
of VM extension nodes providing some sort of CC-NUMA across the
aggregate memory space for just the one core processor.  If you have the
problem in hand, it makes developing and testing a meaningful solution a
whole lot easier...

Or, as you say Mark, 64 bit systems may rapidly make the issue at least
technically moot.  COST nonlinearities, though, might still make a
distributed/cluster DRAM VM attractive, as it might well be cheaper to
buy a farm of 16 systems with 2 GB each, even with Myrinet or SCI
interconnects, to get to the ballpark of 32 GB VM than it would be to buy
a motherboard and all the memory capable of running 32 GB native on a
single board.  They
won't sell a lot of the latter (at least at first), and they'll charge
through the nose for developing them... and then there are the folks
that would say fine, how about a network of 32 nodes with 32 GB each to
let me get to a TB of transparent VM?

Indeed, one COULD argue that this is an idea that ONLY makes sense
(again) now that there are 64 bit systems (and address spaces, kernels,
compiler support) proliferating.  Kernels that can use (mostly) all the
memory one can put NOW on a 32 bit board have only recently come along,
and grafting a de facto 64 bit address space onto a 32 bit architecture
to permit accessing more than 4 GB of VM with no existing compiler or
kernel support would be immensely painful and/or require evil special
libraries ("message passing" to slaves that do nothing but look up or
store data), which may be why Trapeze more or less stopped.  64 bit
architectures could revive it again, especially while the aforementioned
nonlinear price break between 64 bit (but still 2-4 GB limited on the
cheap end) motherboards and 64 bit (but capable of holding 32 GB or
more, expensive) motherboards holds.  An Opteron is hardly more
expensive than a plain old high end Athlon at similar levels of memory
filling, and even a kilodollar scale premium for low-latency
interconnects could still keep the price well (an order of magnitude?)
below a large memory box for quite a while.

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org