Request on advice on which kernel? 2.2 or 2.4?

Martin Siegert siegert at sfu.ca
Wed Oct 3 18:15:57 EDT 2001


On Wed, Oct 03, 2001 at 09:57:44AM -0400, Donald Becker wrote:
> On Wed, 3 Oct 2001, Michelle Kuttel wrote:
> 
> > I would like to request some opinions/advice on which kernel is best for
> > my Beowulf cluster.  We have a cluster of 16 Dual processor PentiumIII-866
> > MHz work nodes (head node AMD athlon 1Ghz CPU, single processor).  It has
> > been running for a few months now (computational chemistry CHARMM code
> > principally).
> >  I have installed both 2.2.14-5 kernel (with Loncaric's
> > tcpfix kernel patch)
> 
> We use and recommend this TCP patch.  Josip did excellent work.
> 
> > and the 2.2.4 kernel at different times.
> 
> The biggest advantage of 2.4 kernel is the SMP improvements to the
> network stack.  You'll see less benefit with your single processor
> nodes, with most of the benefit on four processor nodes.

This brings up another issue: the APIC code (bugs?) in the 2.4 series
of kernels. I encounter the following problem: with 2.4 kernels (I have
tried almost every version, from RedHat's 2.4.3-12 smp kernel through
2.4.5 - 2.4.10, including various ac versions) and the LAM MPI
distribution, some MPI programs hang almost every time.
These are mostly parallel FFT jobs (from the fftw library) using global
communication patterns (MPI_Alltoall). I am using dual Athlon 1.2GHz nodes,
each with four 3Com NICs, three of which are channel bonded.
I make the following observations:

- the program appears to hang while executing a r = read(sock, buf, nbytes)
  statement over and over again. Typically r=56 or r=696 while
  nbytes=116765796, i.e., the remaining byte count decreases from 116765796
  in steps of only 56 or 696, so for practical purposes the program hangs.

- when using mpich the program does not hang.

- when using the 2.2.19 smp kernel the program does not hang.

- using the append="noapic" setting in /etc/lilo.conf with a 2.4.x kernel
  reduces the failure rate, but the program still hangs with a probability
  that is unacceptable for a production environment.
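
To make the first observation concrete, here is a minimal sketch of the kind
of loop that is involved. It is a generic "read until nbytes have arrived"
loop, not LAM's actual code; the names read_full, reads_needed and
demo_roundtrip are hypothetical and exist only for this illustration. When
each read() returns only 56 or 696 bytes, a 116765796-byte message needs
roughly 2.1 million or 168 thousand calls, respectively.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Read exactly nbytes from fd, looping over short reads.
 * Returns the number of bytes read (less than nbytes only on EOF),
 * or -1 on error. If calls is non-NULL it receives the number of
 * read() calls that were made. (Hypothetical helper, not LAM code.) */
ssize_t read_full(int fd, void *buf, size_t nbytes, unsigned long *calls)
{
    size_t done = 0;
    unsigned long n = 0;

    while (done < nbytes) {
        ssize_t r = read(fd, (char *)buf + done, nbytes - done);
        n++;
        if (r < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: retry */
            if (calls)
                *calls = n;
            return -1;              /* real error */
        }
        if (r == 0)
            break;                  /* peer closed the connection */
        done += (size_t)r;          /* short read: only r bytes arrived */
    }
    if (calls)
        *calls = n;
    return (ssize_t)done;
}

/* Number of read() calls needed if every call returns only chunk bytes. */
unsigned long reads_needed(unsigned long nbytes, unsigned long chunk)
{
    return (nbytes + chunk - 1) / chunk;    /* ceiling division */
}

/* Self-check: send a short message over a local socket pair and read it
 * back with read_full(). Returns 1 on success, 0 on failure. */
int demo_roundtrip(void)
{
    int sv[2];
    const char msg[] = "0123456789";
    char buf[sizeof msg];
    unsigned long calls = 0;
    ssize_t got;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return 0;
    if (write(sv[0], msg, sizeof msg) != (ssize_t)sizeof msg) {
        close(sv[0]);
        close(sv[1]);
        return 0;
    }
    got = read_full(sv[1], buf, sizeof msg, &calls);
    close(sv[0]);
    close(sv[1]);
    assert(calls >= 1);             /* at least one read() was issued */
    return got == (ssize_t)sizeof msg && memcmp(buf, msg, sizeof msg) == 0;
}
```

A loop like this is correct for any chunk size; it only looks like a hang
because each call delivers so little data. Whether the small chunks come
from LAM, the driver, or the kernel is exactly the open question.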

From this I concluded that I cannot use a 2.4 kernel and LAM. I do not know
with certainty what is causing the failures:

- is it a LAM bug?

- is it a 3c59x driver bug?

- is it a 2.4 kernel bug?

Besides this problem, I have by now encountered several RedHat 7.1 machines
on campus (UP or SMP) with network problems that could be solved by
including the "noapic" option in lilo.conf. Is there a chance that the
APIC problems in the 2.4 kernels will be resolved soon (there seem to be
changes to the APIC code in 2.4.10, but I still have problems)? Is there a
performance hit related to the "noapic" option?

Anyway, with the release of mpich-1.2.2 this problem isn't as pressing
as it was a few weeks ago. The performance of MPI jobs under
mpich-1.2.2 is much improved, particularly for smaller message sizes. A big
thank you to the mpich developers!

Martin

========================================================================
Martin Siegert
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at sfu.ca
Canada  V5A 1S6
========================================================================
