Solaris Fire Engine.

Robert G. Brown rgb at
Tue Oct 21 08:20:10 EDT 2003

On Tue, 21 Oct 2003 pesch at wrote:

> In a cluster, would it not make more sense to catch an attack in a firewall rather than at the kernel level? If
> so, should cluster builders  perhaps look for other - more cluster specific - kernels? Should kernel development
> at some point split in two distinct lines: one for single computer applications and one for clusters?

It's the usual problem (and a continuation of my XML rant in a way, as
it is at least partly motivated by this).  Sure, one can do this.
However, it is very, very expensive to do so, a classic case of 90% of
the work producing 10% of the benefit, if that.  

As Don pointed out, even Scyld, with highly talented people who are (in
principle:-) even making money doing so, found maintaining a separate
kernel line crushingly expensive very quickly.  Whenever expense is
mentioned, especially in engineering, one has to consider benefit, and
do a cost-benefit analysis (CBA).  The CBA is the crux of all
optimization theory; find the
point of diminishing returns and stay there.  I would argue that
splitting the kernel is WAY far beyond that point.  Folks who agree can
skip the editorial below.  For that matter, so can folks who

The expense can be paid in one of several ways.  One is to get a distinct
kernel optimized and stable, get an entire associated distribution
optimized and stable, and then freeze everything except for bugfixes.
You then get a local optimum (after a lot of work) that doesn't take a
lot of work to maintain, BUT you pay the penalty of drifting apart from
the rest of linux and can never resynchronize without redoing all that
work (and accepting all that new expense).  New, more efficient gcc?
Forget it -- the work of testing it with your old kernel costs too much.
New device drivers?  Hours to days of testing for each one.  Eventually
a key application or improvement appears in the main kernel line (e.g.
64 bit, Opteron support) that is REALLY different, REALLY worth more to
nearly everybody than the benefit they might or might not gain from the
custom HPC optimized kernel, and your optimized but stagnant kernel is

Alternatively, you can effectively twin the entire kernel development
cycle, INCLUDING the testing and debugging.  Back in my ill-spent youth
I spent a considerable amount of time on the linux-smp list (I couldn't
take being on the main linux kernel list even then, as its traffic
dwarfs both the beowulf list and the linux-smp list combined).  I also
played a tiny bit with drivers on a couple of occasions.  The amount of
work, and number of human volunteers, required to drive these processes
is astounding, and I would guess that it would have to be done on
twinned lists as the kernelvolken would likely not welcome a near
doubling of traffic on their lists or doubling of the work burden trying
to figure out just who owns a given emergent bug (and inevitably they
WOULD have to help figure out who owns emergent bugs, as some of them
WOULD belong to them, others to the group supporting the split off
sources, if they were to proceed independently but "keep up" with the
development kernel so that true divergence did not occur).

A better alternative exists (and is even used to some extent).  The
linux kernel is already highly modular.  It is already possible to e.g.
bypass the IP stack altogether (as is done by myrinet and other high
speed networks) with custom device drivers that work below the IP and
TCP layers -- just doing this saves you a lot of the associated latency
hit in high speed networks, as TCP/IP is designed for WAN routing and
security and tends to be overkill for a secure private LAN IPC channel
in a beowulf.  This route requires far less maintenance and
customization -- specialized drivers for MPI and/or PVM and/or a network
socket layer, plus a kernel module or three.  Even this is "expensive"
and tends to be done only by companies that make hefty marginal profits
for their specific devices, but it is FAR cheaper than maintaining a
separate kernel altogether.  

I would also lump into this group applying and testing on an ad hoc
basis things like Josip's network optimization patches which make
relatively small, relatively specific changes that might technically
"break" a kernel for WAN application but can produce measurable
benefits for certain classes of communication pattern.  This sort of
thing is NOT for everybody.  It is like a small scale version of the
first alternative -- the patches tend to be put together for some
particular kernel revision and then frozen (or applied "blindly" to
succeeding kernel revisions until they manifestly break).  Again this
motivates one to freeze kernel and distribution once one gets everything
working and live with it until advances elsewhere make it impossible to
continue doing so.  This is the kind of thing where MAYBE one could get
the patches introduced into the mainstream kernel sources in a form that
was e.g. sysctl controllable -- "modular", as it were, but inside the
non-modular part of the kernel as a "proceed at your own risk" feature.
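
For anyone who hasn't poked at one, a sysctl-controllable knob on linux
is just a file under /proc/sys that you can read (and, as root, write).
Here is a minimal sketch of looking one up from userspace.  The helper
names are mine, net.ipv4.tcp_sack is a real knob on linux systems, and
net.ipv4.hpc_lan_mode is purely hypothetical, standing in for the kind
of "proceed at your own risk" switch described above.

```python
from pathlib import Path
from typing import Optional

PROC_SYS = Path("/proc/sys")  # where linux exposes sysctl variables as files


def sysctl_path(name: str, root: Path = PROC_SYS) -> Path:
    """Map a dotted sysctl name to its file under the sysctl root."""
    return root.joinpath(*name.split("."))


def read_sysctl(name: str, root: Path = PROC_SYS) -> Optional[str]:
    """Return the knob's current value as text, or None if it is absent."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        # Missing knob, non-linux system, or no permission to read it.
        return None


if __name__ == "__main__":
    # The first name is a real linux knob; the second is hypothetical.
    for knob in ("net.ipv4.tcp_sack", "net.ipv4.hpc_lan_mode"):
        print(knob, "=", read_sysctl(knob))
```

A patch exposed this way costs ordinary users nothing as long as the
default leaves the WAN-safe behavior alone.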

Expense alternatives in hand, one has to measure benefit.  We could
break up HPC applications very crudely into groups.  One group is code
that is CPU bound -- where the primary/only bottleneck is the number of
double precision floating point (and associated integer) computations
that the computer can retire per second.  Another might be memory bound
-- limited primarily by the speed with which the system can move values
into and out of memory doing some simple operations on them in the
meantime.  Still another might be disk or other non-network I/O bound
(people who crunch large data sets to and from large storage devices).
Finally, yes, one group might be bound by the network and network-based
IPCs in a parallel division of a program.

This latter group is the ONLY group that would really benefit from the
kernel split; the rest of the kernel is reasonably well optimized for
raw computations, memory access, and even hardware device access (or can
be configured and tuned to be without the need of a separate kernel
line). I would argue that even the network group splits again, into
latency limited and bandwidth limited.  Bandwidth limited applications
would again see little benefit from a hacked kernel split as TCP can
deliver data throughput that is roughly 90% of wire speed (or better)
for ethernet, depending on the quality of hardware as much as the
kernel.  Of course, the degree of the CPU's involvement in sending and
receiving these messages could be improved; one would like to be able to
use DMA as much as possible to send the messages without blocking the
CPU, but this matters only if the CPU can do something useful while
awaiting the network IPC transfers; often it cannot.
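
The claim that only network-bound code gains much can be put in rough
numbers with an Amdahl-style estimate: if a fraction f of the run time
is network IPC and a custom kernel makes that part s times faster, the
whole job speeds up by 1/((1-f) + f/s).  The sketch below is only that,
a sketch; the workload fractions and the factor of two are invented for
illustration and are not measurements of anything.

```python
def overall_speedup(network_fraction: float, network_speedup: float) -> float:
    """Amdahl-style estimate: only the network share of run time gets faster."""
    if not 0.0 <= network_fraction <= 1.0:
        raise ValueError("network_fraction must be between 0 and 1")
    if network_speedup <= 0.0:
        raise ValueError("network_speedup must be positive")
    return 1.0 / ((1.0 - network_fraction) + network_fraction / network_speedup)


if __name__ == "__main__":
    # Hypothetical fractions of wall-clock time spent in network IPC.
    workloads = {
        "CPU bound": 0.02,
        "memory bound": 0.05,
        "disk bound": 0.05,
        "network bound": 0.60,
    }
    for name, fraction in workloads.items():
        # Assume, generously, that the custom kernel halves the network cost.
        print(f"{name:14s} {overall_speedup(fraction, 2.0):.2f}x")
```

With these made-up fractions the CPU bound job gains about 1% and the
network bound job about 43%, which is the whole argument in two numbers.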

The one remaining group that would significantly benefit is the latency
limited group -- true network parallel applications that need to send
lots of little messages that cannot be sensibly aggregated in software.
The benefit there could be profound, as the TCP stack adds quite a lot
of latency (and CPU load) on top of the irreducible hardware latency,
IIRC, even on a switched network where the CPU doesn't have to deal with
a lot of spurious network traffic.  Are there enough members of this
group to justify splitting the kernel?  I very much doubt it.  I don't
even think that the existence of this group has motivated the widespread
adoption of a non-IP ethernet transport layer -- nearly everybody just
lives with the IP stack latency OR...

...uses one of the dedicated HPC networks.
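
To see why lots of little messages are the hard case, take the crudest
possible cost model: the time to deliver one message is a fixed latency
plus its size divided by the sustained bandwidth.  The numbers in the
sketch are assumptions chosen for illustration (100 Mbit/s ethernet
running at 90% of wire speed, 100 microseconds of TCP latency), not
benchmarks.

```python
def message_time(size_bytes: float, latency_s: float,
                 bandwidth_bytes_per_s: float) -> float:
    """Time for one message: fixed latency plus size over sustained bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def crossover_size(latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Message size at which latency and transfer time contribute equally."""
    return latency_s * bandwidth_bytes_per_s


if __name__ == "__main__":
    # Assumed figures, for illustration only.
    bandwidth = 0.9 * 100e6 / 8   # bytes per second at 90% of wire speed
    latency = 100e-6              # seconds of per-message latency
    print(f"crossover ~ {crossover_size(latency, bandwidth):.0f} bytes")
    for size in (64, 1024, 65536, 1048576):
        t = message_time(size, latency, bandwidth)
        share = latency / t       # fraction of the time that is pure latency
        print(f"{size:8d} bytes: {t * 1e6:9.1f} us, latency share {share:.0%}")
```

Under these assumptions the two costs balance at a little over a
kilobyte.  Messages much smaller than that are almost pure latency, so
no amount of extra bandwidth helps them; messages much larger are almost
pure bandwidth, so shaving latency barely matters.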

This is the real kicker.  TCP latency is almost two orders of magnitude
greater than either myrinet or dolphin/sci latency (which are both on
the order of microseconds instead of hundreds of microseconds).  They
>>also<< deliver very high bandwidth.  Sure, they are expensive, but you
know that you are paying for precisely what YOU need for YOUR HPC
computations.  I don't have to pay for them (even indirectly, by helping
out with a whole secondary kernel development track) when MY code is CPU
bound; the big DB guys don't have to pay for it when THEIR code depends
on how long it takes to read in those ginormous databases of e.g.
genetic data; the linear algebra folks who need large, fast memory don't
pay for it (unless they try splitting up their linear algebra across the
network, of course:-) -- it is paid for only by the people who need it, who
send lots of little messages or who need its bleeding edge bandwidth or

One COULD ask, very reasonably, for just about any of the kernel
optimizations that can be implemented at the modular level -- that is a
matter of writing the module, accepting responsibility for its
integration into the kernel and sequential debugging in perpetuity (that
is, becoming a slave of the lamp, in perpetuity bound to the kernel
lists:-).  Alas, TCP/IP is so bound up inside the main part of the
kernel that I don't think it can be separated out into modules any more
than it already is.

^^^^^ ^^^^^, (closing omitted in the fond hope of remuneration)


(C'mon now -- here I am omitting all sorts of words from my rants and my
paypal account is still dry as a bone, dry as a desert, bereft of all
money, parched as my throat in the noonday sun.  Seriously, either I
make some money or I'm gonna compose a 50 kiloword opus for my next


Robert G. Brown	             
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at

Beowulf mailing list, Beowulf at