lam - recon works lamboot doesn't!

Steve Yam styam at hns.com
Mon Aug 6 15:41:45 EDT 2001


I believe that your problem is that the /etc/hosts file on each machine maps that machine's own hostname to
127.0.0.1 (loopback).  This causes a problem because lamboot starts lamd on the 2nd (and 3rd, etc.) node and
passes it the IP address of the 1st node.  If kitkat has an /etc/hosts file containing "127.0.0.1 kitkat", that
address comes out as 127.0.0.1 (note the "-H 127.0.0.1" in the hboot command sent to snickers), so the remote
lamd on snickers tries to reach the 1st node at 127.0.0.1, which is snickers itself, instead of at kitkat's real
IP address.  You can solve the problem by changing that entry in /etc/hosts to point to the machine's actual IP
address instead of the loopback.
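
To check each node for this, something like the following rough sketch could help.  It is not part of LAM;
the function name, the sample file contents, and the 192.168.1.x addresses are made up for illustration, and
the host list is simply copied from the boot schema in the output below.  It scans the text of a hosts file
and reports any cluster hostname that is mapped to a loopback address.

```python
import ipaddress


def loopback_hostnames(hosts_text, cluster_hosts):
    """Return the cluster hostnames that a hosts file maps to a loopback address."""
    flagged = []
    for line in hosts_text.splitlines():
        # Drop comments and skip blank lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        try:
            is_loopback = ipaddress.ip_address(address).is_loopback
        except ValueError:
            continue  # first field is not a valid IP address; ignore the line
        if is_loopback:
            for name in names:
                if name in cluster_hosts and name not in flagged:
                    flagged.append(name)
    return flagged


# Hypothetical hosts file contents for kitkat (addresses are assumptions).
BROKEN_HOSTS = """\
127.0.0.1    localhost kitkat
192.168.1.2  snickers
"""

# The same file after the fix: kitkat points at its (assumed) real address.
FIXED_HOSTS = """\
127.0.0.1    localhost
192.168.1.1  kitkat
192.168.1.2  snickers
"""

# Hostnames taken from the boot schema in the lamboot output.
CLUSTER = ["kitkat", "snickers", "twix", "rolo", "butterfinger"]

if __name__ == "__main__":
    print(loopback_hostnames(BROKEN_HOSTS, CLUSTER))
    print(loopback_hostnames(FIXED_HOSTS, CLUSTER))
```

To run it against a real node, read /etc/hosts into a string and pass that in place of the sample text.  An
empty list means none of the cluster hostnames resolve to the loopback through that file.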

-Steve Yam
Hughes Network Systems


Eric Linenberg wrote:

> I am trying to run lam and recon works A-OK, but lamboot gives me errors.
> Could someone possibly give me some insight into this problem? I have read
> everything I can to no avail.  Help a newbie!
>
> Thanks,
> eric
>
> [guest at kitkat bin]$ lamboot -d -v -b beowulf
>
> LAM 6.5.4/MPI 2 C++/ROMIO - University of Notre Dame
>
> lamboot: boot schema file: /usr/local/lam/etc/beowulf
> lamboot: opening hostfile /usr/local/lam/etc/beowulf
> lamboot: found the following hosts:
> lamboot:   n0 kitkat
> lamboot:   n1 snickers
> lamboot:   n2 twix
> lamboot:   n3 rolo
> lamboot:   n4 butterfinger
> lamboot: found 5 host node(s)
> lamboot: origin node is 0 (kitkat)
> Executing hboot on n0 (kitkat - 2 CPUs)...
> lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H
> 127.0.0.1 -P 35993 -n 0 -o 0
>    ""
> hboot: process schema = "/usr/local/lam/etc/lam-conf.lam"
> hboot: found /usr/local/bin/lamd
> hboot: performing tkill
> hboot: tkill
> hboot: booting...
> hboot: fork /usr/local/bin/lamd
> hboot: attempting to execute
> [1]  24080 lamd -H 127.0.0.1 -P 35993 -n 0 -o 0 -d
> Executing hboot on n1 (snickers - 2 CPUs)...
> lamboot: -b used, assuming same shell on remote nodes
> lamboot: got local shell /bin/bash
> lamboot: attempting to execute "/usr/bin/rsh snickers -n hboot -t -c
> lam-conf.lam -d -v -s -I "-H 127.0.0.1 -P 35993 -n 1 -o 0    ""
> hboot: process schema = "/usr/local/lam/etc/lam-conf.lam"
> hboot: found /usr/local/lam/bin/lamd
> hboot: performing tkill
> hboot: tkill
> hboot: booting...
> hboot: fork /usr/local/lam/bin/lamd
> [1]    918 lamd -H 127.0.0.1 -P 35993 -n 1 -o 0 -d
> -----------------------------------------------------------------------------
> lamboot encountered some error (see above) during the boot process,
> and will now attempt to kill all nodes that it was previously able to
> boot (if any).
>
> Please wait for LAM to finish; if you interrupt this process, you may
> have LAM daemons still running on remote nodes.
> -----------------------------------------------------------------------------
> wipe ...
>
> LAM 6.5.4/MPI 2 C++/ROMIO - University of Notre Dame
>
> Executing tkill on n0 (kitkat)...
> Executing tkill on n1 (snickers)...
> lamboot did NOT complete successfully
>
> thanks,
> -eric
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

