upgrading rh73 on an xCAT cluster

Joe Landman landman at scalableinformatics.com
Wed Sep 24 03:59:19 EDT 2003


Hi Cristian:

  This sounds like a mixture of a few problems, including a name
service/host lookup issue for the ssh slowness.  The load on the NFS
server could have a number of causes.  Which kernel are you using?  How
many machines are mounting the NFS file system?  What does nfsstat say,
and what does vmstat indicate?  How many nfsd's are running?
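
As a quick way to answer the last question, here is a minimal sketch that
counts nfsd threads on the server.  It assumes a Linux procps-style ps
where "ps -e -o comm=" prints one bare command name per line; the
function names are only illustrative.

```python
import subprocess


def count_nfsd(ps_output):
    """Count lines whose command name is exactly 'nfsd'."""
    return sum(1 for line in ps_output.splitlines() if line.strip() == "nfsd")


def running_nfsd():
    """Ask ps for every command name and count the nfsd entries.

    Assumption: 'ps -e -o comm=' is available and lists kernel nfsd
    threads under the name 'nfsd'.
    """
    out = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return count_nfsd(out)


if __name__ == "__main__":
    print("nfsd threads running:", running_nfsd())
```

If the count is small relative to the number of clients mounting the file
system, that is one thing worth looking at alongside nfsstat and vmstat.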

  I have found that when name services (NIS, DNS, etc.) time out, ssh
gets quite slow.  One way to test this is to ssh to the IP address of the
compute node rather than its name.  If the times are quite similar, there
may be other issues at work.  If the IP address method is much faster,
then name resolution is not working somewhere.
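
The same comparison can be made without ssh by timing the lookups
directly.  This is a rough sketch, not a definitive diagnostic: it times a
forward lookup of a node name and then a reverse lookup of the address it
returns (sshd usually does a reverse lookup on the connecting client as
well).  The helper names are made up for illustration, and the node names
you pass in are whatever your cluster uses.

```python
import socket
import sys
import time


def timed(fn, *args):
    """Run fn(*args); return (elapsed_seconds, result_or_exception)."""
    start = time.perf_counter()
    try:
        result = fn(*args)
    except OSError as exc:  # socket.gaierror and socket.herror derive from OSError
        result = exc
    return time.perf_counter() - start, result


def check_node(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Time a forward lookup of `name`, then a reverse lookup of its address.

    `forward` and `reverse` are injectable so the logic can be exercised
    without touching a real resolver.
    """
    fwd_time, addr = timed(forward, name)
    report = {"name": name, "forward_s": fwd_time, "address": addr}
    if isinstance(addr, str):  # only attempt the reverse lookup if forward succeeded
        rev_time, rev = timed(reverse, addr)
        report["reverse_s"] = rev_time
        report["reverse"] = rev
    return report


if __name__ == "__main__":
    # Usage: python lookup_timing.py <node> [<node> ...]
    for node in sys.argv[1:]:
        print(check_node(node))
```

Lookups that take seconds rather than milliseconds, or that come back as
an error, point at the resolver (/etc/hosts, NIS, DNS) rather than at ssh
itself.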

Joe

  

On Wed, 2003-09-24 at 14:51, Cristian Tibirna wrote:
> Hello
> 
> Yesterday I upgraded (first time after 7 months... I know, I know) the rh73 
> rpms and the kernel. Since then, I have two nasty issues:
> 
> 1)
> The update installed a new openssh (3.1.p1-14)
> 
> The auth of sshd through pam is annoyingly slower. All ssh connections (both 
> from outside to the master and from any node to any node inside) _are_ 
> succeeding, but a lot slower. I see this in the /var/log/messages too:
> 
> Sep 24 13:16:04 n15 sshd(pam_unix)[27164]: authentication failure; logname=\ 
> uid=0 euid=0 tty=NODEVssh ruser= rhost=n01  user=root
> Sep 24 13:16:04 n15 sshd(pam_unix)[24856]: session opened for user root by\ 
> (uid=0)
> 
> Both messages are for the same ssh connection attempt and the attempt 
> succeeds, as I said. The only visible effect to the user is the slowness (the 
> first failure is followed by a programmed delay in pam).
> 
> I looked a bit around the 'net and people have already complained a lot about 
> this problem but I found no solution.
> 
> 2)
> I also updated the kernel to 2.4.20-20.7 (redhat rpm).
> 
> Afterwards, my (and other users') SGE qmake jobs just get stuck in the middle 
> (i.e. function correctly for a while then suddenly just sit there and do 
> nothing for a long time, without having completed). I feel it's some sort of 
> NFS lockup problem as the master node (NFS server) gets very high loads 
> (6.0-8.0) compared to before (2.0-3.0) the update of the kernel. The 
> /var/log/messages says nothing useful.
> 
> 
> Has anybody already updated a rh73 cluster equipped with SGE and using ssh 
> internally? Observed these problems? Found solutions?
> 
> Thanks a lot.
-- 
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615



