From franz.marini at mi.infn.it Tue Jul 1 03:20:34 2003 From: franz.marini at mi.infn.it (Franz Marini) Date: Tue, 1 Jul 2003 09:20:34 +0200 (CEST) Subject: Job Posting - cluster admin. In-Reply-To: <1056715034.2172.21.camel@rohgun.cse.duke.edu> References: <1056715034.2172.21.camel@rohgun.cse.duke.edu> Message-ID: On Fri, 27 Jun 2003, Bill Rankin wrote: > FYI - we are seeking a Beowulf admin for our university cluster. If you > know of anyone that is interested, please forward them this information. Hrm... From the description it looks like the perfect job for me :) Just wondering: would you sponsor an H-1B? ;) Have a good day y'all ! Franz --------------------------------------------------------- Franz Marini Sys Admin and Software Analyst, Dept. of Physics, University of Milan, Italy. email : franz.marini at mi.infn.it --------------------------------------------------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From joachim at ccrl-nece.de Tue Jul 1 04:03:17 2003 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Tue, 1 Jul 2003 10:03:17 +0200 Subject: interconnect latency, dissected. In-Reply-To: <19WuWG-1Uo-00@etnus.com> References: <19WuWG-1Uo-00@etnus.com> Message-ID: <200307011003.17755.joachim@ccrl-nece.de> James Cownie: > Mark Hahn wrote: > > does anyone have references handy for recent work on interconnect > > latency? > > Try http://www.cs.berkeley.edu/~bonachea/upc/netperf.pdf > > It doesn't have Infiniband, but does have Quadrics, Myrinet 2000, GigE and > IBM. Nice paper showing interesting properties. But some metrics seem a little bit dubious to me: in 5.2, they seem to see an advantage if the "overlap potential" is higher (when they compare Quadrics and Myrinet) - which usually just results in higher MPI latencies, as this potential (on small messages) cannot be exploited. Even with overlapping multiple communication operations, the faster interconnect remains faster. This is especially true for small-message latency. Among the contemporary (cluster) interconnects, SCI is missing alongside Infiniband. It would have been interesting to see the results for SCI, as it has a very different communication model than most of the other interconnects (most resembling the T3E one). Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Andrew.Cannon at nnc.co.uk Tue Jul 1 09:15:21 2003 From: Andrew.Cannon at nnc.co.uk (Cannon, Andrew) Date: Tue, 1 Jul 2003 14:15:21 +0100 Subject: Cluster over standard network Message-ID: Hi all, Has anyone implemented a cluster over a normal office network using the PCs on people's desks as part of the cluster? If so, what was the performance of the cluster like? What sort of performance penalty was there for the ordinary user and what was the network traffic like? Just a thought... TIA Andy Andrew Cannon, Nuclear Technology (J2), NNC Ltd, Booths Hall, Knutsford, Cheshire, WA16 8QZ.
Telephone; +44 (0) 1565 843768 email: mailto:andrew.cannon at nnc.co.uk NNC website: http://www.nnc.co.uk *********************************************************************************** NNC Limited Booths Hall Chelford Road Knutsford Cheshire WA16 8QZ Country of Registration: United Kingdom Registered Number: 1120437 This e-mail and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this e-mail in error please notify the NNC system manager by e-mail at eadm at nnc.co.uk. *********************************************************************************** _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From dtj at uberh4x0r.org Tue Jul 1 09:48:30 2003 From: dtj at uberh4x0r.org (Dean Johnson) Date: 01 Jul 2003 08:48:30 -0500 Subject: Cluster over standard network In-Reply-To: References: Message-ID: <1057067310.26428.14.camel@terra> On Tue, 2003-07-01 at 08:15, Cannon, Andrew wrote: > Hi all, > > Has anyone implemented a cluster over a normal office network using the PCs > on people's desks as part of the cluster? If so, what was the performance of > the cluster like? What sort of performance penalty was there for the > ordinary user and what was the network traffic like? > > Just a thought... > > It all depends on what your application does with that network and how beefy your nodes are. For instance, if you were to run something like Amber (molecular dynamics) over an office LAN, I can pretty much guarantee that you will not win any office popularity polls. It simply saturates the network. If your nodes are reasonably slow you might do better, relatively speaking, as you might have reduced network traffic because your nodes are spending more time thinking. I wouldn't depend on it, though. On the other hand, you have to consider what those pesky co-workers are doing to YOUR network. ;-) Use of M$ Outlook and streaming mp3's off fileservers, to mention a couple, will cut into YOUR bandwidth, causing performance problems. Just my $0.02 worth. -Dean _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rene.storm at emplics.com Tue Jul 1 10:00:16 2003 From: rene.storm at emplics.com (Rene Storm) Date: Tue, 1 Jul 2003 16:00:16 +0200 Subject: AW: Cluster over standard network Message-ID: <29B376A04977B944A3D87D22C495FB2301275E@vertrieb.emplics.com> Hi Andy, I think this depends on your desktop PCs. I have already installed such a "cluster", but the desktops were dual 2.4 GHz PCs with their own gigabit Ethernet. It all worked with dual boot via the boot loader and automatic switching of the boot options in the evening and morning (a rough sketch of such a switch is below). But there are some problems you should take a closer look at. What would you do if your job is still running in the morning and the employees are on the way to their offices? Could your network bear up under the heavy traffic, or would it disturb things like e.g. the backup server (if you don't have a separate backbone)? What if someone would like to impress the boss and do some overtime? I would recommend that you use one of the diskless CDs or floppies out there (like Knoppix or mosix-on-floppy) to check your equipment against your demands.
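A minimal sketch of such an evening/morning switch, assuming LILO with two boot entries; the labels ("office", "cluster") and the times are illustrative only, not taken from the setup described above:

    # /etc/lilo.conf carries two images, labelled "office" and "cluster".
    # Evening cron entry on each desktop (e.g. 20:00, Mon-Fri):
    #   0 20 * * 1-5  /sbin/lilo -R cluster && /sbin/shutdown -r now
    # Morning cron entry (e.g. 06:30, Mon-Fri):
    #   30 6 * * 1-5  /sbin/lilo -R office && /sbin/shutdown -r now
    /sbin/lilo -R cluster   # select the "cluster" image for the next reboot only
    /sbin/shutdown -r now   # and reboot into it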
If your office PCs are already running Linux, you could/should take a look at openMosix. From openmosix.org: "Once you have installed openMosix, the nodes in the cluster start talking to one another and the cluster adapts itself to the workload. Processes originating from any one node, if that node is too busy compared to others, can migrate to any other node. openMosix continuously attempts to optimize the resource allocation." We are using openMosix on our clusters and on our servers as well. Works fine for non-parallel jobs. Greetings René -----Original Message----- From: Cannon, Andrew [mailto:Andrew.Cannon at nnc.co.uk] Sent: Tuesday, 1 July 2003 15:15 To: Beowolf (E-mail) Subject: Cluster over standard network Hi all, Has anyone implemented a cluster over a normal office network using the PCs on people's desks as part of the cluster? If so, what was the performance of the cluster like? What sort of performance penalty was there for the ordinary user and what was the network traffic like? Just a thought... TIA Andy Andrew Cannon, Nuclear Technology (J2), NNC Ltd, Booths Hall, Knutsford, Cheshire, WA16 8QZ. Telephone; +44 (0) 1565 843768 email: mailto:andrew.cannon at nnc.co.uk NNC website: http://www.nnc.co.uk *********************************************************************************** NNC Limited Booths Hall Chelford Road Knutsford Cheshire WA16 8QZ Country of Registration: United Kingdom Registered Number: 1120437 This e-mail and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this e-mail in error please notify the NNC system manager by e-mail at eadm at nnc.co.uk. *********************************************************************************** _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From louisr at aspsys.com Tue Jul 1 10:40:04 2003 From: louisr at aspsys.com (Louis J. Romero) Date: Tue, 1 Jul 2003 08:40:04 -0600 Subject: Hard link /etc/passwd In-Reply-To: <20030630032016.88507.qmail@web10607.mail.yahoo.com> References: <20030630032016.88507.qmail@web10607.mail.yahoo.com> Message-ID: <200307010840.04590.louisr@aspsys.com> Hi Justin, Keep in mind that concurrent access is not a given. The last writer gets to update the file. All other edits will be lost. Louis On Sunday 29 June 2003 09:20 pm, Justin Cook wrote: > Good day, > I have an 11 node diskless cluster. All slave node > roots are under /tftpboot/node1 ... /tftpboot/node2 > ... so on. Is it safe to hard link the /etc/passwd > and /etc/group file to the server nodes for > consistency across the network? > > __________________________________ > Do you Yahoo!? > SBC Yahoo! DSL - Now only $29.95 per month! > http://sbc.yahoo.com > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Louis J. Romero Chief Software Architect Aspen Systems, Inc. 3900 Youngfield Street Wheat Ridge, Co 80033 Toll Free: (800) 992-9242 Tel +01 (303) 431-4606 Ext.
104 Cell +01 (303) 437-6168 Fax +01 (303) 431-7196 louisr at aspsys.com http://www.aspsys.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bob at drzyzgula.org Tue Jul 1 10:40:01 2003 From: bob at drzyzgula.org (Bob Drzyzgula) Date: Tue, 1 Jul 2003 10:40:01 -0400 Subject: Cluster over standard network In-Reply-To: References: Message-ID: <20030701104001.I1838@www2> On Tue, Jul 01, 2003 at 02:15:21PM +0100, Cannon, Andrew wrote: > > Hi all, > > Has anyone implemented a cluster over a normal office network using the PCs > on people's desks as part of the cluster? If so, what was the performance of > the cluster like? What sort of performance penalty was there for the > ordinary user and what was the network traffic like? > > Just a thought... > > TIA This is actually the way much of this stuff used to be done, before commodity computers became both powerful and inexpensive enough [1] to make it worth buying them just to place in a computing cluster. It was quite common in the early 1990s (and likely still is, in many organizations), for example, to have PVM running on production office and lab networks. However, one did have to be reasonably considerate. One didn't usually use these ad hoc clusters during business hours (or at least ran the jobs at idle priority if one did) and one usually asked permission of the person to whom the computer had been assigned before adding it to the cluster. One also had to be careful not to cause problems with other off-hours operations, such as filesystem backups. Of course this approach has disadvantages, and may not work well at all for certain types of network-intensive applications. But if one had, for example, a Monte Carlo simulation to run, and there was no hope of getting mo' better computers, it could make the difference between the analysis being done or not done. --Bob [1] Or perhaps I should say before cast-off computers were powerful enough, since that's what the first Beowulf was made from, but that phase didn't last very long; it soon became obvious that the cluster idea was useful enough to justify the purchase of new machines, and cast-off machines had problems with reliability and power consumption that made them less than ideal for this application. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rene.storm at emplics.com Tue Jul 1 11:57:54 2003 From: rene.storm at emplics.com (Rene Storm) Date: Tue, 1 Jul 2003 17:57:54 +0200 Subject: WG: Cluster over standard network Message-ID: <29B376A04977B944A3D87D22C495FB23D50F@vertrieb.emplics.com> Hi Andy, I think this depends on your desktop PCs. I have already installed such a "cluster", but the desktops were dual 2.4 GHz PCs with their own gigabit Ethernet. It all worked with dual boot via the boot loader and automatic switching of the boot options in the evening and morning. But there are some problems you should take a closer look at. What would you do if your job is still running in the morning and the employees are on the way to their offices? Could your network bear up under the heavy traffic, or would it disturb things like e.g. the backup server (if you don't have a separate backbone)? What if someone would like to impress the boss and do some overtime?
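For the "job still running in the morning" problem just mentioned, one simple mitigation is to push the compute jobs out of the way from cron before people arrive; a sketch, assuming the jobs run under a dedicated account (here called "hpc", an illustrative name):

    # Morning cron entry on each desktop (e.g. 07:30, Mon-Fri):
    #   30 7 * * 1-5  /usr/local/sbin/reclaim-desktop
    renice 19 -u hpc         # drop all of user hpc's processes to idle priority
    # or, more drastically, suspend them until the evening:
    # killall -STOP -u hpc   # (resume later with: killall -CONT -u hpc)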
I would recommend that you use one of the diskless CDs or floppies out there (like Knoppix or mosix-on-floppy) to check your equipment against your demands. If your office PCs are already running Linux, you could/should take a look at openMosix. From openmosix.org: "Once you have installed openMosix, the nodes in the cluster start talking to one another and the cluster adapts itself to the workload. Processes originating from any one node, if that node is too busy compared to others, can migrate to any other node. openMosix continuously attempts to optimize the resource allocation." We are using openMosix on our clusters and on our servers as well. Works fine for non-parallel jobs. Greetings René ############################## Hi all, Has anyone implemented a cluster over a normal office network using the PCs on people's desks as part of the cluster? If so, what was the performance of the cluster like? What sort of performance penalty was there for the ordinary user and what was the network traffic like? Just a thought... TIA Andy Andrew Cannon, Nuclear Technology (J2), NNC Ltd, Booths Hall, Knutsford, Cheshire, WA16 8QZ. Telephone; +44 (0) 1565 843768 email: mailto:andrew.cannon at nnc.co.uk NNC website: http://www.nnc.co.uk *********************************************************************************** _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From siegert at sfu.ca Tue Jul 1 18:48:08 2003 From: siegert at sfu.ca (Martin Siegert) Date: Tue, 1 Jul 2003 15:48:08 -0700 Subject: Linux support for AMD Opteron with Broadcom NICs Message-ID: <20030701224808.GA15167@stikine.ucs.sfu.ca> Hello, I have a dual AMD Opteron for a week or so as a demo and am trying to install Linux on it - so far with little success. First of all: doing a Google search for x86-64 Linux turns up a lot of press releases but not much more, particularly nothing one could download and install. Even a direct search on the SuSE and Mandrake sites shows only press releases. Sigh. Anyway: I found a few ftp sites that supply a Mandrake-9.0 x86_64 version. Thus I did an FTP installation, which after (many) hiccups actually worked. However, that distribution does not support the onboard Broadcom 5704 NICs. I also could not get the driver from the Broadcom web site to work (insmod fails with "could not find MAC address in NVRAM"). Thus I tried to compile the 2.4.21 kernel, which worked, but "insmod tg3" freezes the machine instantly. Thus, so far I am not impressed. For those of you who have such a box: which distribution are you using? Any advice on how to get those GigE Broadcom NICs to work?
Cheers, Martin -- Martin Siegert Manager, Research Services WestGrid Site Manager Academic Computing Services phone: (604) 291-4691 Simon Fraser University fax: (604) 291-4242 Burnaby, British Columbia email: siegert at sfu.ca Canada V5A 1S6 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From shewa at inel.gov Tue Jul 1 19:41:16 2003 From: shewa at inel.gov (Andrew Shewmaker) Date: Tue, 01 Jul 2003 17:41:16 -0600 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <20030701224808.GA15167@stikine.ucs.sfu.ca> References: <20030701224808.GA15167@stikine.ucs.sfu.ca> Message-ID: <3F021C1C.4050309@inel.gov> Martin Siegert wrote: > Hello, > > I have a dual AMD Opteron for a week or so as a demo and try to install > Linux on it - so far with little success. > > First of all: doing a google search for x86-64 Linux turns up a lot of > press releases but not much more, particularly nothing one could download > and install. Even a direct search on the SuSE and Mandrake sites shows > only press releases. Sigh. > > Anyway: I found a few ftp sites that supply a Mandrake-9.0 x86_64 version. > Thus I did a ftp installation which after (many) hickups actually worked. > However, that distribution does not support the onboard Broadcom 5704 > NICs. I also could not get the driver from the broadcom web site to work > (insmod fails with "could not find MAC address in NVRAM"). > > Thus I tried to compile the 2.4.21 kernel which worked, but > "insmod tg3" freezes the machine instantly. > > Thus, so far I am not impressed. > > For those of you who have such a box: which distribution are you using? > Any advice on how to get those GigE Broadcom NICs to work? > > Cheers, > Martin > The evaluation box I had an account on ran SuSE and Mark Hahn is running RedHat 9 without problems. Other than customizing a regular x86 distro, you'll probably have to buy an enterprise or cluster version for now. http://www.suse.com/us/business/products/server/sles/prices_amd64.html http://www.mandrakesoft.com/products/clustering It doesn't look like Debian is ready yet: https://alioth.debian.org/projects/debian-x86-64/ I couldn't find redhat's opteron pages. Andrew -- Andrew Shewmaker, Associate Engineer Phone: 1-208-526-1276 Idaho National Eng. and Environmental Lab. P.0. Box 1625, M.S. 3605 Idaho Falls, Idaho 83415-3605 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From nashif at planux.com Tue Jul 1 20:14:47 2003 From: nashif at planux.com (Anas Nashif) Date: Tue, 1 Jul 2003 20:14:47 -0400 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <20030701224808.GA15167@stikine.ucs.sfu.ca> References: <20030701224808.GA15167@stikine.ucs.sfu.ca> Message-ID: <200307012014.47101.nashif@planux.com> On July 1, 2003 06:48 pm, Martin Siegert wrote: > Hello, > > I have a dual AMD Opteron for a week or so as a demo and try to install > Linux on it - so far with little success. > > First of all: doing a google search for x86-64 Linux turns up a lot of > press releases but not much more, particularly nothing one could download > and install. Even a direct search on the SuSE and Mandrake sites shows > only press releases. Sigh. > > Anyway: I found a few ftp sites that supply a Mandrake-9.0 x86_64 version. 
> Thus I did a ftp installation which after (many) hickups actually worked. > However, that distribution does not support the onboard Broadcom 5704 > NICs. I also could not get the driver from the broadcom web site to work > (insmod fails with "could not find MAC address in NVRAM"). > > Thus I tried to compile the 2.4.21 kernel which worked, but > "insmod tg3" freezes the machine instantly. > > Thus, so far I am not impressed. > > For those of you who have such a box: which distribution are you using? SuSE SLES 8. > Any advice on how to get those GigE Broadcom NICs to work? Works out of the box with broadcom. (bcm5700 module, tg3 is not always recommended) Anas > > Cheers, > Martin _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jhearns at freesolutions.net Wed Jul 2 06:13:04 2003 From: jhearns at freesolutions.net (John Hearns) Date: Wed, 02 Jul 2003 11:13:04 +0100 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <4.3.2.7.2.20030702102245.00adc830@pop.freeuk.net> References: <20030701224808.GA15167@stikine.ucs.sfu.ca> <20030701224808.GA15167@stikine.ucs.sfu.ca> <4.3.2.7.2.20030702102245.00adc830@pop.freeuk.net> Message-ID: <3F02B030.1040305@freesolutions.net> Simon Hogg wrote: > > As you say, at the moment the best bet seems to be to *buy* the > enterprise editions. For those of us who are loathe to spend any > money or who 'just like' Debian, there is a bit of waiting still to > do. According to one developer; > > "There is work ongoing on a Debian port, but it will be a while yet - > the x86-64 really needs sub-architecture support for effective support > (to allow mixing of 32- and 64-bit things), and that is a big step for > us. Feel free to chip in and help! :-)". > > However, as far as I am aware, it should be possible to install a > vanilla x86-32 distribution and recompile everything for 64-bit (with > a recent GCC (3.3 is the best bet at the moment I suppose)). > That's how Gentoo does things. Anyone heard of Gentoo running on X86-64 ? Might be fun. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From seth at hogg.org Wed Jul 2 05:31:53 2003 From: seth at hogg.org (Simon Hogg) Date: Wed, 02 Jul 2003 10:31:53 +0100 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <3F021C1C.4050309@inel.gov> References: <20030701224808.GA15167@stikine.ucs.sfu.ca> <20030701224808.GA15167@stikine.ucs.sfu.ca> Message-ID: <4.3.2.7.2.20030702102245.00adc830@pop.freeuk.net> At 17:41 01/07/03 -0600, Andrew Shewmaker wrote: >Martin Siegert wrote: > >>Hello, >>I have a dual AMD Opteron for a week or so as a demo and try to install >>Linux on it - so far with little success. >>First of all: doing a google search for x86-64 Linux turns up a lot of >>press releases but not much more, particularly nothing one could download >>and install. Even a direct search on the SuSE and Mandrake sites shows >>only press releases. Sigh. >>Anyway: I found a few ftp sites that supply a Mandrake-9.0 x86_64 version. >>Thus I did a ftp installation which after (many) hickups actually worked. >>However, that distribution does not support the onboard Broadcom 5704 >>NICs. 
I also could not get the driver from the broadcom web site to work >>(insmod fails with "could not find MAC address in NVRAM"). >>Thus I tried to compile the 2.4.21 kernel which worked, but >>"insmod tg3" freezes the machine instantly. >>Thus, so far I am not impressed. >>For those of you who have such a box: which distribution are you using? >>Any advice on how to get those GigE Broadcom NICs to work? >>Cheers, >>Martin > >The evaluation box I had an account on ran SuSE and Mark Hahn is running >RedHat 9 without problems. Other than customizing a regular x86 distro, >you'll probably have to buy an enterprise or cluster version for now. As you say, at the moment the best bet seems to be to *buy* the enterprise editions. For those of us who are loathe to spend any money or who 'just like' Debian, there is a bit of waiting still to do. According to one developer; "There is work ongoing on a Debian port, but it will be a while yet - the x86-64 really needs sub-architecture support for effective support (to allow mixing of 32- and 64-bit things), and that is a big step for us. Feel free to chip in and help! :-)". However, as far as I am aware, it should be possible to install a vanilla x86-32 distribution and recompile everything for 64-bit (with a recent GCC (3.3 is the best bet at the moment I suppose)). However, your original problem seems not to be how to get it installed, but rather how to get your Broadcom GigE to work. I'm afraid I don't know the answer to that one! I know this doesn't answer your question, but hope it gives somebody some more impetus to get this darned Debian port finished :-) HTH (although probably won't). -- Simon _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From fgp at pcnet.ro Wed Jul 2 07:46:34 2003 From: fgp at pcnet.ro (Florian Gabriel) Date: Wed, 02 Jul 2003 14:46:34 +0300 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <20030701224808.GA15167@stikine.ucs.sfu.ca> References: <20030701224808.GA15167@stikine.ucs.sfu.ca> Message-ID: <3F02C61A.4060409@pcnet.ro> Martin Siegert wrote: >Hello, > >I have a dual AMD Opteron for a week or so as a demo and try to install >Linux on it - so far with little success. > >First of all: doing a google search for x86-64 Linux turns up a lot of >press releases but not much more, particularly nothing one could download >and install. Even a direct search on the SuSE and Mandrake sites shows >only press releases. Sigh. > >Anyway: I found a few ftp sites that supply a Mandrake-9.0 x86_64 version. >Thus I did a ftp installation which after (many) hickups actually worked. >However, that distribution does not support the onboard Broadcom 5704 >NICs. I also could not get the driver from the broadcom web site to work >(insmod fails with "could not find MAC address in NVRAM"). > >Thus I tried to compile the 2.4.21 kernel which worked, but >"insmod tg3" freezes the machine instantly. > >Thus, so far I am not impressed. > >For those of you who have such a box: which distribution are you using? >Any advice on how to get those GigE Broadcom NICs to work? 
> >Cheers, >Martin > > > You can try the preview distribution "gingin64" from here: http://ftp.redhat.com/pub/redhat/linux/preview/gingin64/en/iso/x86_64/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From wathey at salk.edu Wed Jul 2 11:01:25 2003 From: wathey at salk.edu (Jack Wathey) Date: Wed, 2 Jul 2003 08:01:25 -0700 (PDT) Subject: memory nightmare Message-ID: <20030702075618.D6562-100000@euler.salk.edu> I need some advice about how to handle some ambiguous results from memtest86. I also have some general questions about bios options related to ECC memory. First some background: I'm building a diskless cluster that will soon grow to 100 dual athlon nodes. At present it has 10 diskless nodes and a server. The boards are Gigabyte Technologies model GA7DPXDW-P, and the cpus are Athlon MP 2200+. In April I bought 69 1 gigabyte ecc registered ddr modules from a vendor who had twice before sold me reliable memory. This time, however, the memory was bad. Testing in batches of 3 sticks per motherboard, nearly 100% failed memtest86, and some machines crashed or would not even boot. They replaced all 69 sticks. Of this second batch, about 60% failed memtest86, and the longer I tested, the more would fail. I again returned them all. In both of these batches, the failures were numerous, often thousands or hundreds of thousands or even millions of errors. The errors were usually multibit errors, where the "fail bits" were things like 0f0f0f0f or ffffffff. The most commonly failing test seemed to be test number 6, but others failed, too. I am now testing the third batch of 69 sticks. I decided, more-or-less arbitrarily, that I would consider them good if they passed 48 hours of memtest86. Testing in batches of 3 per board, all but 6 groups of 3 sticks passed 48 hours of memtest86. I have been able to identify a single failing stick in 2 of the 6 failed batches by testing 1 stick per motherboard. I am still testing the others, 1 stick per board, but so far none has failed. So here is the problem: I have these 4 batches, of 3 sticks each, which failed memtest86 when tested in batches of 3. The failures did not occur on each pass of memtest's 16 tests. Instead the sticks would pass all of the tests for several passes. In one case the failure did not occur until after memtest86 had been running, without error, for 42 hours on that machine. That particular failure was in a single word in test 6. The worst of the 4 batches failed at 14 memory locations. I have now been testing 9 of these 12 suspect sticks, 1 stick per motherboard, for several days. Several have now passed more than 100 hours of memtest86 without error. Can I trust them? Should I keep them or return them? If I return them, how long must I run memtest86 on the replacements before I can trust those? Can I trust the 55 or so sticks that passed 48 hours of memtest86 in batches of 3? The vendor has been making a good-faith effort to solve the problem, and has even agreed to refund my money for the whole purchase if I'm not happy with it. What would you do in this situation? Those are the most urgent questions for which I need answers, but I have a few others of a more general nature: Is there a specific vendor or brand of memory that is much more reliable than others? Since the above-described ordeal, I've heard that Kingston has a good reputation. Anyone care to endorse or refute that? 
Any other good brands/vendors you care to mention? My understanding is that ECC can correct only single-bit errors, and so would not help with the kind of multibit errors that have been troubling me lately. But I have some basic questions on ECC that you might be able to answer (I've asked the motherboard maker's tech support, but to no avail!): In the bios for my GA7DPXDW-P motherboards, there are these 4 alternatives for the SDRAM ECC Setting: Disabled Check only Correct Errors Correct + scrub I'm pretty sure I understand what 'Disabled' does. Can anyone explain to me what the others do, and how they differ? Also, if ECC correction is enabled, does this slow down the machine in any way? Is there any disadvantage to having ECC correction enabled? TIA, Jack _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hahn at physics.mcmaster.ca Wed Jul 2 10:38:05 2003 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed, 2 Jul 2003 10:38:05 -0400 (EDT) Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <4.3.2.7.2.20030702102245.00adc830@pop.freeuk.net> Message-ID: > However, as far as I am aware, it should be possible to install a vanilla > x86-32 distribution it is. dual-opterons are very xeon-compatible. I was in a hurry to fiddle with one that came into my hands, so I just ripped the HD out of a crappy i815/PIII system (containing a basic RH9 install), and plugged it into the dual-opteron (MSI board). worked fine. I compiled a specific kernel for it, and it was even finer (I don't use modules, but the AMD Viper ide controller and broadcom gigabit drivers seemed to work perfectly fine.) the machine is now in day-to-day use as a workstation running Mandrake (ia32 version, I think, though probably also with a custom kernel). I did some basic testing, and was pleased with performance - about what I'd expect from a dual-xeon 2.6-2.8. none of that testing was with an x86-64 compiler/kernel/runtime, though - in fact, I was just using Intel's compilers ("scp -r xeon:/opt/intel /opt"!) do be certain that your dimms are arranged right - our whitebox vendor seemed to think that all the dimms should go in cpu0's bank first, with no inter-bank or inter-node interleaving. performance was ~30% better under Stream when the dimms were properly distributed and both kinds of interleaving enabled in bios. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From sgaudet at wildopensource.com Wed Jul 2 13:26:29 2003 From: sgaudet at wildopensource.com (Stephen Gaudet) Date: Wed, 02 Jul 2003 13:26:29 -0400 Subject: memory nightmare References: <20030702075618.D6562-100000@euler.salk.edu> Message-ID: <3F0315C5.6020401@wildopensource.com> Hello Jack, > So here is the problem: I have these 4 batches, of 3 sticks each, > which failed memtest86 when tested in batches of 3. The failures did > not occur on each pass of memtest's 16 tests. Instead the sticks would > pass all of the tests for several passes. In one case the failure > did not occur until after memtest86 had been running, without error, > for 42 hours on that machine. That particular failure was in a single > word in test 6. The worst of the 4 batches failed at 14 memory > locations. 
I have now been testing 9 of these 12 suspect sticks, > 1 stick per motherboard, for several days. Several have now passed > more than 100 hours of memtest86 without error. > > Can I trust them? > > Should I keep them or return them? > > If I return them, how long must I run memtest86 on the replacements > before I can trust those? > > Can I trust the 55 or so sticks that passed 48 hours of memtest86 in > batches of 3? > > The vendor has been making a good-faith effort to solve the problem, > and has even agreed to refund my money for the whole purchase if I'm > not happy with it. > > What would you do in this situation? First, I'd make sure the memory comes from a major supplier: Kingston, Crucial, Virtium, Ventura, Transcend, etc... Next, make sure all the RAM has the same chipset: Samsung, Infineon, etc... If you have various sticks in these systems where the chip manufacturer is different, they sometimes don't behave well. So try to make everything match. Lastly, I check cooling. Do these systems have proper cooling? > Those are the most urgent questions for which I need answers, but I > have a few others of a more general nature: > > Is there a specific vendor or brand of memory that is much more > reliable than others? Since the above-described ordeal, I've heard > that Kingston has a good reputation. Anyone care to endorse or > refute that? Any other good brands/vendors you care to mention? See above. I personally never buy RAM unless it's on Intel's approved list and comes with a lifetime warranty. I realize this is an AMD solution; however, anyone that is approved by Intel is in most cases a real supplier with technical depth and could have helped with this problem. When I had strange problems like this in the past with various systems, Virtium, Ventura and others took a system into their lab in order to fix the problem. > My understanding is that ECC can correct only single-bit errors, and > so would not help with the kind of multibit errors that have been > troubling me lately. But I have some basic questions on ECC that > you might be able to answer (I've asked the motherboard maker's tech > support, but to no avail!): > > In the bios for my GA7DPXDW-P motherboards, there are these 4 > alternatives for the SDRAM ECC Setting: > > Disabled > Check only > Correct Errors > Correct + scrub > > I'm pretty sure I understand what 'Disabled' does. Can anyone > explain to me what the others do, and how they differ? Also, if ECC > correction is enabled, does this slow down the machine in any way? > Is there any disadvantage to having ECC correction enabled? What does the motherboard manufacturer call for? Cheers, and Happy 4th of July, Steve Gaudet Wild Open Source (home office) ---------------------- Bedford, NH 03110 pH:603-488-1599 cell:603-498-1600 http://www.wildopensource.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From wathey at salk.edu Wed Jul 2 14:02:36 2003 From: wathey at salk.edu (Jack Wathey) Date: Wed, 2 Jul 2003 11:02:36 -0700 (PDT) Subject: memory nightmare In-Reply-To: <3F0315C5.6020401@wildopensource.com> Message-ID: <20030702104756.M6682-100000@euler.salk.edu> On Wed, 2 Jul 2003, Stephen Gaudet wrote: > First, I'd make sure the memory comes from a major supplier, Kingston, > Crucial, Virtium, Ventura, Transend, etc... The supplier is not one of those you listed above.
I've been dealing with them as well as with the vendor, and, at this point, I'd prefer not to disclose their name on the list. (Yes, I know, Steve: I should have just bought these sticks from you in the first place! Oh well. We live and learn.) > > Next, make sure all the RAM has the same chipset: Samsung, Infineon, > etc... If you have various sticks in these systems where the chip > manufacturer is different, they sometimes don't behave well. So try to > make everything match. The latest batch of 69 sticks all used Samsung chips. > > Lastly, I check cooling. Do these systems have proper cooling? > Yes, definitely. I monitor that closely. Ambient temperature around the motherboards never exceeded 77 deg F throughout these tests, and was less than 70F most of the time. I can't monitor cpu temperature directly when memtest86 is running, but, in the same enclosure, when I can monitor cpu temperatures, they are typically 55C or less. I've been experimenting with different heatsinks. Some of the boards have Thermalright sk6+/Delta 60X25mm coolers, which keep the cpus below 40C most of the time. > Thanks and best wishes, Jack _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From award at andorra.ad Wed Jul 2 13:41:02 2003 From: award at andorra.ad (Alan Ward) Date: Wed, 02 Jul 2003 19:41:02 +0200 Subject: sharing a power supply Message-ID: <3F03192E.4040904@andorra.ad> Dear listpeople, I am building a small beowulf with the following configuration: - 4 motherboards w/ onboard Ethernet - 1 hard disk - 1 (small) switch - 1 ATX power supply shared by all boards The intended boot sequence is the classical one: (1) master boots off the hard disk; (2) after a suitable delay, slaves boot off the master with dhcp and root nfs. I would appreciate comments on the following: a) A 450 W power supply should have ample power for all - but can it deliver on the crucial +5V and +3.3V lines? Has anybody got real-world measurements of the current drawn on these lines for Athlons that I can compare to the supply's specs? b) I hung two motherboards off a single ATX supply. When I hit the switch on either board, the supply goes on and both motherboards come to life. Does anybody know a way of keeping the slaves still until the master has gone through boot? e.g. Use the reset switch? Can one of the power lines control the PLL on the motherboard? Best regards, Alan Ward _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From becker at scyld.com Wed Jul 2 13:47:58 2003 From: becker at scyld.com (Donald Becker) Date: Wed, 2 Jul 2003 10:47:58 -0700 (PDT) Subject: memory nightmare In-Reply-To: <20030702075618.D6562-100000@euler.salk.edu> Message-ID: On Wed, 2 Jul 2003, Jack Wathey wrote: > I need some advice about how to handle some ambiguous results from > memtest86. I also have some general questions about bios options > related to ECC memory. .. > The boards are Gigabyte Technologies model GA7DPXDW-P, > ...Testing in batches of 3 sticks per motherboard, nearly 100% failed My immediate reaction is that you have a motherboard that has memory configuration restrictions. A typical restriction is that you can only use two DIMMs when they are "double sided" (with two memory chips per signal line instead of one) or have larger-capacity memory chips.
My second reaction is that you are running the chips too fast for ECC, either because the serial EEPROM has been reprogrammed to claim that the chips are faster or the BIOS settings have been tweaked. Remember that an ECC memory system is slower than the same chips without ECC! > In the bios for my GA7DPXDW-P motherboards, there are these 4 > alternatives for the SDRAM ECC Setting: > > Disabled > Check only As the memory read is happening, start checking the data. If the check fails, interrupt later. > Correct Errors When the memory read is started, check the data. Hold the result until the check passes or the data is corrected. > Correct + scrub Correct read data as above, holding the transaction and writing corrected data back to the DIMM if an error is found. > I'm pretty sure I understand what 'Disabled' does. Can anyone > explain to me what the others do, and how they differ? Also, if ECC > correction is enabled, does this slow down the machine in any way? Yes. The typical cost is one clock cycle of read latency. It might seem obviously easy to overlap the ECC check when it usually passes, but you can't really hide all of the cost. The memory-read path is always latency-critical. -- Donald Becker becker at scyld.com Scyld Computing Corporation http://www.scyld.com 914 Bay Ridge Road, Suite 220 Scyld Beowulf cluster system Annapolis MD 21403 410-990-9993 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From kuku at physik.rwth-aachen.de Tue Jul 1 02:11:22 2003 From: kuku at physik.rwth-aachen.de (Christoph P. Kukulies) Date: Tue, 1 Jul 2003 08:11:22 +0200 Subject: Hard link /etc/passwd In-Reply-To: References: <20030630032016.88507.qmail@web10607.mail.yahoo.com> Message-ID: <20030701061122.GA18433@gilberto.physik.rwth-aachen.de> On Mon, Jun 30, 2003 at 05:15:21PM -0400, William Dieter wrote: > You have to be careful when doing maintenance. For example, if you do: > > mv /etc/passwd /etc/passwd.bak > cp /etc/passwd.bak /etc/passwd > > all of the copies will be linked to the backup copy. Normally you > might not do this, but some text editors sometimes do similar things > silently... > > A symbolic link might be safer. But it won't work in his diskless environment. Symbolic links are not visible outside the chrooted environment of the specific diskless clients. It's gotta be hard links. > > >Good day, > >I have an 11 node diskless cluster. All slave node > >roots are under /tftpboot/node1 ... /tftpboot/node2 > >... so on. Is it safe to hard link the /etc/passwd > >and /etc/group file to the server nodes for > >consistency across the network? > -- Chris Christoph P. U. Kukulies kukulies (at) rwth-aachen.de _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
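A minimal sketch of the hard-link arrangement discussed in this thread; the /tftpboot/nodeN layout follows the original post, everything else is illustrative:

    # Run on the server. Hard links only work if /tftpboot lives on the
    # same filesystem as /etc.
    for n in 1 2 3 4 5 6 7 8 9 10 11
    do
        ln -f /etc/passwd /tftpboot/node$n/etc/passwd
        ln -f /etc/group  /tftpboot/node$n/etc/group
    done
    # Per the caveat quoted above: tools that may replace the file by
    # rename (vipw, some editors) will silently break the links, so either
    # edit /etc/passwd in place or re-run this loop after such changes.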
From c00jsh00 at nchc.gov.tw Tue Jul 1 21:53:05 2003 From: c00jsh00 at nchc.gov.tw (Jyh-Shyong Ho) Date: Wed, 02 Jul 2003 09:53:05 +0800 Subject: Linux support for AMD Opteron with Broadcom NICs References: <20030701224808.GA15167@stikine.ucs.sfu.ca> Message-ID: <3F023B01.2706C3A0@nchc.gov.tw> Hi, We installed SuSE Enterprise 8 for AMD64 on our dual AMD Opteron box; it works fine for the on-board Broadcom NICs. SuSE Enterprise 8 for AMD64 is not free, however. It uses a special 2.4.19 SuSE kernel, for which SuSE has done a lot of work to make sure most drivers behave normally. We tried kernel 2.4.21 but it failed for Realtek NICs. At the moment, there are not so many drivers supported in kernel 2.4.21 for Opteron. Jyh-Shyong Ho, PhD. Research Scientist National Center for High-Performance Computing Hsinchu, Taiwan, ROC Martin Siegert wrote: > > Hello, > > I have a dual AMD Opteron for a week or so as a demo and am trying to install > Linux on it - so far with little success. > > First of all: doing a Google search for x86-64 Linux turns up a lot of > press releases but not much more, particularly nothing one could download > and install. Even a direct search on the SuSE and Mandrake sites shows > only press releases. Sigh. > > Anyway: I found a few ftp sites that supply a Mandrake-9.0 x86_64 version. > Thus I did an FTP installation, which after (many) hiccups actually worked. > However, that distribution does not support the onboard Broadcom 5704 > NICs. I also could not get the driver from the Broadcom web site to work > (insmod fails with "could not find MAC address in NVRAM"). > > Thus I tried to compile the 2.4.21 kernel, which worked, but > "insmod tg3" freezes the machine instantly. > > Thus, so far I am not impressed. > > For those of you who have such a box: which distribution are you using? > Any advice on how to get those GigE Broadcom NICs to work? > > Cheers, > Martin > > -- > Martin Siegert > Manager, Research Services > WestGrid Site Manager > Academic Computing Services phone: (604) 291-4691 > Simon Fraser University fax: (604) 291-4242 > Burnaby, British Columbia email: siegert at sfu.ca > Canada V5A 1S6 > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From johnt at quadrics.com Wed Jul 2 06:01:43 2003 From: johnt at quadrics.com (John Taylor) Date: Wed, 2 Jul 2003 11:01:43 +0100 Subject: interconnect latency, dissected. Message-ID: <010C86D15E4D1247B9A5DD312B7F5AA7CCDC9D@stegosaurus.bristol.quadrics.com> I agree with Joachim et al on the merit of the paper - it raises some important issues relating to the overall efficacy of MPI in certain circumstances. In relation to IB there has been some work at Ohio State comparing Myrinet and QsNet. The latter, however, only discusses MPI, whereas the UPC group in the former discusses lower-level APIs that better suit some algorithms as well as being the target of specific compiler environments. On the Berkeley paper specifically, my only concern is that there is no mention of the influence of the PCI-bridge implementation, notwithstanding its specification. For instance, the system at ORNL is based on the ES40, which on a similar system gives an 8-byte latency, so... prun -N2 mping 0 8 1 pinged 0: 0 bytes 7.76 uSec 0.00 MB/s 1 pinged 0: 1 bytes 8.11 uSec 0.12 MB/s 1 pinged 0: 2 bytes 8.06 uSec 0.25 MB/s 1 pinged 0: 4 bytes 8.35 uSec 0.48 MB/s 1 pinged 0: 8 bytes 8.20 uSec 0.98 MB/s . . . 1 pinged 0: 524288 bytes 2469.61 uSec 212.30 MB/s 1 pinged 0: 1048576 bytes 4955.28 uSec 211.61 MB/s similar to the latency and bandwidth achieved for the author's benchmark.
whereas the same code on the same Quadrics hardware running on a Xeon (GC-LE) platform gives prun -N2 mping 0 8 1 pinged 0: 0 bytes 4.31 uSec 0.00 MB/s 1 pinged 0: 1 bytes 4.40 uSec 0.23 MB/s 1 pinged 0: 2 bytes 4.40 uSec 0.45 MB/s 1 pinged 0: 4 bytes 4.39 uSec 0.91 MB/s 1 pinged 0: 8 bytes 4.38 uSec 1.83 MB/s . . . 1 pinged 0: 524288 bytes 1632.61 uSec 321.13 MB/s 1 pinged 0: 1048576 bytes 3252.28 uSec 322.41 MB/s It may also be the case that the Myrinet performance could also be improved (it is stated as PCI 32/66 in the paper) based on benchmarking a more recent PCI-bridge. These current performance measurements may lead to differing conclusions w.r.t latency although there is still the issue of the two-sided nature. For completeness here is the shmem_put performance on a new bridge. prun -N2 sping -f put -b 1000 0 8 1: 4 bytes 1.60 uSec 2.50 MB/s 1: 8 bytes 1.60 uSec 5.00 MB/s 1: 16 bytes 1.58 uSec 10.11 MB/s John Taylor Quadrics Limited http://www.quadrics.com > -----Original Message----- > From: Joachim Worringen [mailto:joachim at ccrl-nece.de] > Sent: 01 July 2003 09:03 > To: Beowulf mailinglist > Subject: Re: interconnect latency, dissected. > > > James Cownie: > > Mark Hahn wrote: > > > does anyone have references handy for recent work on interconnect > > > latency? > > > > Try http://www.cs.berkeley.edu/~bonachea/upc/netperf.pdf > > > > It doesn't have Inifinband, but does have Quadrics, Myrinet > 2000, GigE and > > IBM. > > Nice paper showing interesting properties. But some metrics > seem a little bit > dubious to me: in 5.2, they seem to see an advantage if the "overlap > potential" is higher (when they compare Quadrics and Myrinet) > - which usually > just results in higher MPI latencies, as this potiential (on > small messages) > can not be exploited. Even with overlapping mulitple communication > operations, the faster interconnect remains faster. This is > especially true > for small-message latency. > > From the contemporary (cluster) interconnects, SCI is missing next to > Infiniband. It would have been interesting to see the results > for SCI as it > has a very different communication model than most of the > other interconnects > (most resembling the T3E one). > > Joachim > > -- > Joachim Worringen - NEC C&C research lab St.Augustin > fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) > visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From sgaudet at wildopensource.com Wed Jul 2 14:24:42 2003 From: sgaudet at wildopensource.com (Stephen Gaudet) Date: Wed, 02 Jul 2003 14:24:42 -0400 Subject: memory nightmare References: <20030702104756.M6682-100000@euler.salk.edu> Message-ID: <3F03236A.3050106@wildopensource.com> Hello Jack, Jack Wathey wrote: > > On Wed, 2 Jul 2003, Stephen Gaudet wrote: > > >>First, I'd make sure the memory comes from a major supplier, Kingston, >>Crucial, Virtium, Ventura, Transend, etc... > > > The supplier is not one of those you listed above. I've been dealing with > them as well as with the vendor, and, at this point, I'd prefer not to > disclose their name on the list. (Yes, I know, Steve: I should have just > bought these sticks from you in the first place! Oh well. 
We live and > learn.) > > >>Next, make sure all the RAM has the same chipset: Samsung, Infineon, >>etc... If you have various sticks in these systems where the chip >>manufacturer is different, they sometimes don't behave well. So try to >>make everything match. > > The latest batch of 69 sticks all used Samsung chips. Same part number and speed? What does the motherboard manufacturer call for in regard to CAS latency, 2 or 3? Best is usually 2. >>Lastly, I check cooling. Do these systems have proper cooling? Ok. > Yes, definitely. I monitor that closely. Ambient temperature around the > motherboards never exceeded 77 deg F throughout these tests, and was > less than 70F most of the time. I can't monitor cpu temperature directly > when memtest86 is running, but, in the same enclosure, when I can monitor > cpu temperatures, they are typically 55C or less. I've been experimenting > with different heatsinks. Some of the boards have Thermalright sk6+/Delta > 60X25mm coolers, which keep the cpus below 40C most of the time. Don't rule out the motherboard or processors. I agree with you, it looks like RAM. However, it might turn out to be a bad series of motherboards and/or processors. Memtest86 also shows cache errors. My own system here at home had memory errors and I thought for sure it was the RAM. It turned out to be the memory controller chip on the motherboard. Steve Gaudet Wild Open Source (home office) ---------------------- Bedford, NH 03110 pH:603-488-1599 cell:603-498-1600 http://www.wildopensource.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mprinkey at aeolusresearch.com Wed Jul 2 14:50:05 2003 From: mprinkey at aeolusresearch.com (Michael T. Prinkey) Date: Wed, 2 Jul 2003 14:50:05 -0400 (EDT) Subject: memory nightmare In-Reply-To: <3F0315C5.6020401@wildopensource.com> References: <3F0315C5.6020401@wildopensource.com> Message-ID: <46008.66.118.77.29.1057171805.squirrel@ra.aeolustec.com> > > First, I'd make sure the memory comes from a major supplier: Kingston, > Crucial, Virtium, Ventura, Transcend, etc... > > Next, make sure all the RAM has the same chipset: Samsung, Infineon, > etc... If you have various sticks in these systems where the chip > manufacturer is different, they sometimes don't behave well. So try to > make everything match. > > Lastly, I check cooling. Do these systems have proper cooling? > I would add only to verify that you have sufficient and consistent power. I have seen many more "memory" errors caused by malfunctioning power supplies than by bad memory modules. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From erwan at mandrakesoft.com Tue Jul 1 10:52:36 2003 From: erwan at mandrakesoft.com (Erwan Velu) Date: 01 Jul 2003 16:52:36 +0200 Subject: Cluster over standard network In-Reply-To: References: Message-ID: <1057071156.9954.15.camel@revolution.mandrakesoft.com> On Tue, 01/07/2003 at 15:15, Cannon, Andrew wrote: > Hi all, > > Has anyone implemented a cluster over a normal office network using the PCs > on people's desks as part of the cluster? If so, what was the performance of > the cluster like? What sort of performance penalty was there for the > ordinary user and what was the network traffic like?
You may have a look at the quite "old" Icluster initiative http://www-id.imag.fr/Grappes/icluster/description.html. They did it, and you can see their benchmarks. It was a 200 E-PC cluster using an Ethernet network. It was in the Top500! -- Erwan Velu Linux Cluster Distribution Project Manager MandrakeSoft 43 rue d'aboukir 75002 Paris Phone Number : +33 (0) 1 40 41 17 94 Fax Number : +33 (0) 1 40 41 92 00 Web site : http://www.mandrakesoft.com OpenPGP key : http://www.mandrakesecure.net/cks/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From cgdethe at yahoo.com Wed Jul 2 00:55:45 2003 From: cgdethe at yahoo.com (chandrashekhar dethe) Date: Tue, 1 Jul 2003 21:55:45 -0700 (PDT) Subject: help Message-ID: <20030702045545.65035.qmail@web10806.mail.yahoo.com> Hello, I am Prof. C.G. Dethe, Asst. Professor in the Department of Electronics and Tele., SSGM College of Engg., Shegaon (M.S.), India. I wish to set up an experimental high-performance Linux cluster in our lab. I want to begin with just 8 nodes. This will be given as a project to PG students. I wish to write a proposal for this purpose to the Dept. of Science and Tech., Govt. of India. Please let us know the hardware + software requirements for this cluster, which will be used mainly for research work. with regards, -cgdethe Prof.C.G.Dethe SSGM College of Engg. Shegaon 444 203 Dist. Buldhana State: Maharashtra. INDIA. ===== with regards, - C.G.DETHE. __________________________________ Do you Yahoo!? SBC Yahoo! DSL - Now only $29.95 per month! http://sbc.yahoo.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From tim.carlson at pnl.gov Wed Jul 2 14:47:40 2003 From: tim.carlson at pnl.gov (Tim Carlson) Date: Wed, 02 Jul 2003 11:47:40 -0700 (PDT) Subject: [Rocks-Discuss]Dual Itanium2 performance In-Reply-To: <0258E449E0019844924F40FE68D15B2D5FFE8F@ictxchp02.rac.ray.com> Message-ID: On Wed, 2 Jul 2003, Leonard Chvilicek wrote: > I was reading in some of the mailing lists that the AMD Opteron dual > processor system was getting around 80-90% efficiency on the second > processor. I was wondering if that holds true to the Itanium2 platform? > I looked through some of the archives and did not find any benchmarks or > statistics on this. I found lots of dual Xeons but no dual Itaniums. You are not going to be able to beat a dual Itanium in terms of efficiency if you are talking about a linpack benchmark. Close to 98% efficient. Tim Tim Carlson Voice: (509) 376 3423 Email: Tim.Carlson at pnl.gov EMSL UNIX System Support _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From wathey at salk.edu Wed Jul 2 14:31:10 2003 From: wathey at salk.edu (Jack Wathey) Date: Wed, 2 Jul 2003 11:31:10 -0700 (PDT) Subject: memory nightmare In-Reply-To: Message-ID: <20030702111109.X6682-100000@euler.salk.edu> On Wed, 2 Jul 2003, Donald Becker wrote: > > My immediate reaction is that you have a motherboard that has memory > configuration restrictions.
A typical restriction is that can only use > two DIMMs when they are "double sided" (with two memory chips per signal > line instead of one) or have larger-capacity memory chips. I'll look into that. I doubt this is the problem, though, because last December I got a batch of 30 1-gig sticks from the same vendor that pass memtest86 just fine in batches of 3 per board, on the very same motherboards. The batch from December used Nanya chips and were high-profile. The latest batch are Samsung low-profile. I don't know if these are "double-sided" or not. The only restriction I know of, from the motherboard manual, is that the memory must be "registered ECC ddr", which these are. Also, most of the failing sticks I've seen fail when tested one stick per board. > > My second reaction is that you are running the chips too fast for ECC, > either because the serial EEPROM has been reprogrammed to claim that the > chips are faster or the BIOS settings have been tweaked. Remember than > a ECC memory system is slower than the same chips without ECC! ECC was turned off during the memtest86 runs. I'm using the default bios settings for memory timing parameters. > > > In the bios for my GA7DPXDW-P motherboards, there are these 4 > > alternatives for the SDRAM ECC Setting: > > > > Disabled > > Check only > > As the memory read is happening, start checking the data. If the check > fails, interrupt later. > > > Correct Errors > > When the memory read is started, check the data. Hold the result > until the check passes or the data is corrected. > > > Correct + scrub > > Correct read data as above, holding the transaction and writing > corrected data back to the DIMM if an error is found. > > > I'm pretty sure I understand what 'Disabled' does. Can anyone > > explain to me what the others do, and how they differ? Also, if ECC > > correction is enabled, does this slow down the machine in any way? > > Yes. The typical cost is one clock cycle of read latency. > It might seem obviously easy to overlap the ECC check when it usually > passes, but you can't really hide all of the cost. The memory-read path is > always latency-critical. Thanks, Don! That helps a lot. Best wishes, Jack > > -- > Donald Becker becker at scyld.com > Scyld Computing Corporation http://www.scyld.com > 914 Bay Ridge Road, Suite 220 Scyld Beowulf cluster system > Annapolis MD 21403 410-990-9993 > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From dtj at uberh4x0r.org Wed Jul 2 14:24:33 2003 From: dtj at uberh4x0r.org (Dean Johnson) Date: 02 Jul 2003 13:24:33 -0500 Subject: sharing a power supply In-Reply-To: <3F03192E.4040904@andorra.ad> References: <3F03192E.4040904@andorra.ad> Message-ID: <1057170273.26434.57.camel@terra> On Wed, 2003-07-02 at 12:41, Alan Ward wrote: > Dear listpeople, > > I am building a small beowulf with the following configuration: > > - 4 motherboards w/ onboard Ethernet > - 1 hard disk > - 1 (small) switch > - 1 ATX power supply shared by all boards > > The intended boot sequence is the classical (1) master boots off > hard disk; (2) after a suitable delay, slaves boot off master > with dhcp and root nfs. > > I would appreciate comments on the following: > > a) A 450 W power supply should have ample power for all - > but can it deliver on the crucial +5V and +3.3V lines? 
Has anybody > got real-world intensity measurements on these lines for Athlons > I can compare to the supply's specs? > > b) I hung two motherboards off a single ATX supply. When I hit > the switch on either board, the supply goes on and both motherboards > come to life. Does anybody know a way of keeping the slaves still > until the master has gone through boot? e.g. Use the reset switch? > Can one of the power lines control the PLL on the motherboard? > Use two power supplies, one for the master, one for the slaves. Not an optimal solution. How long will PXE sit around waiting? Is it settable? If it will wait long enough, it won't matter how long it takes for the master to boot. -- -Dean _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From leonard_chvilicek at rac.ray.com Wed Jul 2 13:47:42 2003 From: leonard_chvilicek at rac.ray.com (Leonard Chvilicek) Date: Wed, 2 Jul 2003 12:47:42 -0500 Subject: Dual Itanium2 performance Message-ID: <0258E449E0019844924F40FE68D15B2D5FFE8F@ictxchp02.rac.ray.com> Hello, I was reading in some of the mailing lists that the AMD Opteron dual processor system was getting around 80-90% efficiency on the second processor. I was wondering if that holds true to the Itanium2 platform? I looked through some of the archives and did not find any benchmarks or statistics on this. I found lots of dual Xeons but no dual Itaniums. Thanks in advance .... Leonard Chvilicek Senior IT Strategist I Raytheon Aircraft _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From James.P.Lux at jpl.nasa.gov Wed Jul 2 15:01:57 2003 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed, 02 Jul 2003 12:01:57 -0700 Subject: memory nightmare In-Reply-To: <20030702075618.D6562-100000@euler.salk.edu> Message-ID: <5.2.0.9.2.20030702115149.018931d0@mailhost4.jpl.nasa.gov> At 08:01 AM 7/2/2003 -0700, Jack Wathey wrote: >I need some advice about how to handle some ambiguous results from >memtest86. I also have some general questions about bios options >related to ECC memory. >My understanding is that ECC can correct only single-bit errors, and >so would not help with the kind of multibit errors that have been >troubling me lately. But I have some basic questions on ECC that >you might be able to answer (I've asked the motherboard maker's tech >support, but to no avail!): First off... you're correct that ECC (or, EDAC (error detection and correction)) corrects single bit errors, and detects double bit errors. It's designed to deal with occasional bit flips, usually from radiation (neutrons resulting from cosmic rays, background radiation from the packaging, etc.), and really only addresses errors in the actual memory cells. If you have errors in the data going to and from the memory, ECC does nothing, since the bus itself doesn't have EDAC. The probability of a single bit flip (or upset) is fairly low (I'd be surprised at more than 1 a day), the probability of multiple errors is vanishingly small. One rate I have seen referenced is around 2E-12 upsets/bit/hr. (remember that you won't see an upset in a bit if you don't read it).. There are some other statistics that show an upset occurs in a typical PC-like computer with 256MB of RAM about once a month. 
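Those two figures are consistent to within a small factor; a quick back-of-the-envelope check (the ~730 powered-on hours per month is my assumption, not from the post):

/* Expected soft-error count for 256 MB of DRAM at the quoted upset rate. */
#include <stdio.h>

int main(void)
{
    double bits  = 256.0 * 1024 * 1024 * 8;  /* 256 MB of DRAM */
    double rate  = 2e-12;                    /* upsets per bit per hour (quoted) */
    double hours = 730.0;                    /* roughly one month, always on */

    printf("expected upsets per month: %.1f\n", bits * rate * hours);
    return 0;
}

That works out to roughly three expected flips per month, the same order of magnitude as the once-a-month observation for a 256MB machine.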
Fermilab has a system called ACPMAPS with 156 Gbit of memory, and they saw about 2.5 upsets/day (7E-13 upset/bit/hr) Lots of interesting information at http://www.boeing.com/assocproducts/radiationlab/publications/SEU_at_Ground_Level.pdf and, of course, the origingal papers from IBM (Ziegler, May and Woods) On all systems I've worked on over the last 20 years that used ECC, multiple bit errors were always a timing or bus problem, i.e. electrical interfaces. If you're getting so many problems, it's indicative of some fundamental misconfiguration or mismatch between what the system wants to see and what your parts actually do. Maybe wait states, voltages, etc. are incorrectly set up? >James Lux, P.E. Spacecraft Telecommunications Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From wathey at salk.edu Wed Jul 2 15:46:37 2003 From: wathey at salk.edu (Jack Wathey) Date: Wed, 2 Jul 2003 12:46:37 -0700 (PDT) Subject: memory nightmare In-Reply-To: <5.2.0.9.2.20030702115149.018931d0@mailhost4.jpl.nasa.gov> Message-ID: <20030702124310.D6682-100000@euler.salk.edu> On Wed, 2 Jul 2003, Jim Lux wrote: > At 08:01 AM 7/2/2003 -0700, Jack Wathey wrote: > > > On all systems I've worked on over the last 20 years that used ECC, > multiple bit errors were always a timing or bus problem, i.e. electrical > interfaces. If you're getting so many problems, it's indicative of some > fundamental misconfiguration or mismatch between what the system wants to > see and what your parts actually do. Maybe wait states, voltages, etc. are > incorrectly set up? > Thanks, Jim. That's most enlightening. Several other respondents alluded to incorrect timing parameters, too. I'll look into this. Best wishes, Jack _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jeffrey.b.layton at lmco.com Wed Jul 2 12:33:44 2003 From: jeffrey.b.layton at lmco.com (Jeff Layton) Date: Wed, 02 Jul 2003 12:33:44 -0400 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: References: Message-ID: <3F030968.7030100@lmco.com> Mark Hahn wrote: > do be certain that your dimms are arranged right - our whitebox vendor > seemed to think that all the dimms should go in cpu0's bank first, > with no inter-bank or inter-node interleaving. performance was ~30% > better under Stream when the dimms were properly distributed and > both kinds of interleaving enabled in bios. > Care to post from Stream numbers as well as the hardware configuration? :) TIA! Jeff -- Jeff Layton Senior Engineer - Aerodynamics and CFD Lockheed-Martin Aeronautical Company - Marietta "Is it possible to overclock a cattle prod?" 
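For readers who have not run it, the "Stream numbers" being discussed are memory-bandwidth figures from loops like the one below (a minimal triad-style sketch, not the official STREAM benchmark; the array size is just a guess at "much larger than cache"):

/* Minimal triad-style bandwidth loop, in the spirit of STREAM. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N 10000000L   /* 10M doubles per array, ~240 MB total */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    struct timeval t0, t1;
    long i;

    if (!a || !b || !c) return 1;
    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    gettimeofday(&t0, NULL);
    for (i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];   /* 24 bytes of memory traffic per element */
    gettimeofday(&t1, NULL);

    {
        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        /* printing a[0] keeps the compiler from discarding the loop */
        printf("triad: %.0f MB/s (a[0]=%g)\n", 24.0 * N / s / 1e6, a[0]);
    }
    return 0;
}

Differences of the ~30% magnitude quoted above show up directly in a loop like this when DIMM placement or interleaving changes.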
- Irv Mullins _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From wathey at salk.edu Wed Jul 2 15:35:42 2003 From: wathey at salk.edu (Jack Wathey) Date: Wed, 2 Jul 2003 12:35:42 -0700 (PDT) Subject: memory nightmare In-Reply-To: <46008.66.118.77.29.1057171805.squirrel@ra.aeolustec.com> Message-ID: <20030702123131.T6682-100000@euler.salk.edu> On Wed, 2 Jul 2003, Michael T. Prinkey wrote: > I would add only to verify that you have sufficient and consistent power. I > have seen many more "memory" errors caused by malfunctioning power supplies > than by bad memory modules. Good point, but not likely to be the culprit here. Most of the nodes in these tests use 300W pfc power supplies from PC Power & Cooling. They're diskless nodes with no floppy, no cdrom, and no PCI cards except for the video cards, which are there only when I'm running memtest86. Jack _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From wathey at salk.edu Wed Jul 2 14:54:23 2003 From: wathey at salk.edu (Jack Wathey) Date: Wed, 2 Jul 2003 11:54:23 -0700 (PDT) Subject: memory nightmare In-Reply-To: <3F03236A.3050106@wildopensource.com> Message-ID: <20030702114239.R6682-100000@euler.salk.edu> On Wed, 2 Jul 2003, Stephen Gaudet wrote: > Same part number and speed? What does the motherboard manufacture call > for in regards to cas latency 2 or 3? Best is usually 2. I'm pretty sure they're all the same part number and speed, because the supplier fabricated them all at the same time for me. I don't know what the MB maker recommends for cas latency. They recommend setting DDR timing to "Auto" in the bios, which causes the bios to set the timing parameters automatically. That's how I have them set. If that parameter is set to manual, then a whole bunch of parameters, including cas latency, become accessible in the bios menu, but I have never tinkered with those, and the MB manual has no recommended values for them. > Don't rule out the motherboard or processors. I agree with you looks > like ram. However, might turn out to be a bad series of motherboards, > and or processors. Memtest86 also shows cache errors. My own system > here at home had memmory errors and I though for sure it was the ram. > Turned out to be the memory controller chip on the motherboard. > I suppose it's remotely possible, but not likely. All of the boards will run memtest86 for many days, and my number-crunching code for many weeks, with no problems at all, when I use memory from the batch I bought last December. Most of the failing sticks I've encountered since April will fail consistently, whether tested alone or with other sticks, whether tested on my Gigabyte GA7DPXDW-P boards or the Asus A7M266D board that I use in my server. It's only a few sticks in the most recent batch of 69 that are failing in this rare and intermittent way that I can't seem to reproduce when the sticks are tested one per motherboard. 
Jack _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From wathey at salk.edu Wed Jul 2 16:43:03 2003 From: wathey at salk.edu (Jack Wathey) Date: Wed, 2 Jul 2003 13:43:03 -0700 (PDT) Subject: sharing a power supply In-Reply-To: <3F03192E.4040904@andorra.ad> Message-ID: <20030702130036.J6682-100000@euler.salk.edu> On Wed, 2 Jul 2003, Alan Ward wrote: > I would appreciate comments on the following: > > a) A 450 W power supply should have ample power for all - > but can it deliver on the crucial +5V and +3.3V lines? Has anybody > got real-world intensity measurements on these lines for Athlons > I can compare to the supply's specs? I made these measurements for my diskless dual-Athlon nodes. They are Gigabyte Technologies GA7DPXDW-P, with MP2200+ processors. They have on-board NIC, which I use, but otherwise they are stripped down to the bare essentials: just motherboard, 2 cpus with coolers, and memory. No video card, no pci cards of any kind, no floppy, no cdrom, etc. They have 2 power connectors: the standard 20-pin ATX connector and a square 4-pin connector that supplies 12V to the board. I did the measurements by putting a 0.005 ohm precision resistor (www.mouser.com, part #71-WSR-2-0.005) in series with each of the 5v, 3.3V and 12V lines, and then measuring the voltage across that. Rather than cut up the wires of a power supply, I cut up the wires of extension cables: http://www.cablesamerica.com/product.asp?cat%5Fid=604&sku=22998 http://www.cablesamerica.com/product.asp?cat%5Fid=604&sku=27314 There are multiple wires in these cables for each voltage. Obviously you need to be careful to cut and solder together the right ones. A motherboard manual should give you the pinout details. Here are the results I got for my nodes: cpus memory installed voltage line current drawn ---------- ------------------ ------------ ------------- idle 2GB (2 sticks) +5V 13.1A loaded 2GB (2 sticks) +5V 17.1A idle 2GB (2 sticks) +3.3V 0.34A loaded 2GB (2 sticks) +3.3V 0.34A idle 2GB (2 sticks) +12V 4.2A loaded 2GB (2 sticks) +12V 5.3A idle 4GB (4 sticks) +5V 15.3A loaded 4GB (4 sticks) +5V 19.7A idle 4GB (4 sticks) +3.3V 0.34A loaded 4GB (4 sticks) +3.3V 0.34A idle 4GB (4 sticks) +12V 4.2A loaded 4GB (4 sticks) +12V 5.3A For my stripped-down nodes, only the +5V line turns out to be crucial. You might want to repeat the measurements yourself, especially if your nodes have more hardware plugged into them than mine. Hope this helps, Jack _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From johnt at quadrics.com Tue Jul 1 11:18:19 2003 From: johnt at quadrics.com (John Taylor) Date: Tue, 1 Jul 2003 16:18:19 +0100 Subject: interconnect latency, dissected. Message-ID: <010C86D15E4D1247B9A5DD312B7F5AA7CCDC96@stegosaurus.bristol.quadrics.com> I agree with Joachim et al on the merit of the paper. In relation to IB there has been some work at Ohio State, comparing Myrinet and QsNet. The latter however only discusses MPI, where the UPC group in the former, quite correctly IMHO, discuss lower level APIs that suit better some applications and algorithms as well as being the target of specific compiler environments. 
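At the MPI level, the small-message latencies being compared here are usually measured with a simple ping-pong; a generic sketch (not the Quadrics mping/sping tools whose output appears below) looks like this:

/* Generic MPI ping-pong: reports the 8-byte half-round-trip time.
 * Run on exactly two ranks, e.g. mpirun -np 2 ./pingpong */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, iters = 10000;
    char buf[8] = {0};
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);

    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("8-byte half-round-trip: %.2f usec\n",
               (t1 - t0) * 1e6 / iters / 2.0);
    MPI_Finalize();
    return 0;
}

Results from a loop like this fold in both the NIC and the PCI bridge, which is why the bridge comparison below matters.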
On the paper specifically at Berkeley my only concern is that there is no mention on the influence of the PCI-Bridge implementation, not withstanding its specification. For instance the system at ORNL is based on ES40 which on a similar system gives an 8byte latency so... prun -N2 mping 0 8 1 pinged 0: 0 bytes 7.76 uSec 0.00 MB/s 1 pinged 0: 1 bytes 8.11 uSec 0.12 MB/s 1 pinged 0: 2 bytes 8.06 uSec 0.25 MB/s 1 pinged 0: 4 bytes 8.35 uSec 0.48 MB/s 1 pinged 0: 8 bytes 8.20 uSec 0.98 MB/s . . . 1 pinged 0: 524288 bytes 2469.61 uSec 212.30 MB/s 1 pinged 0: 1048576 bytes 4955.28 uSec 211.61 MB/s similar to the latency and bandwidth achieved for the author's benchmark. whereas the same code on the same Quadrics hardware running on a Xeon (GC-LE) platform gives prun -N2 mping 0 8 1 pinged 0: 0 bytes 4.31 uSec 0.00 MB/s 1 pinged 0: 1 bytes 4.40 uSec 0.23 MB/s 1 pinged 0: 2 bytes 4.40 uSec 0.45 MB/s 1 pinged 0: 4 bytes 4.39 uSec 0.91 MB/s 1 pinged 0: 8 bytes 4.38 uSec 1.83 MB/s . . . 1 pinged 0: 524288 bytes 1632.61 uSec 321.13 MB/s 1 pinged 0: 1048576 bytes 3252.28 uSec 322.41 MB/s It may also be the case that the Myrinet performance could also be improved (it is stated as PCI 32/66 in the paper) based on benchmarking a more recent PCI-bridge. John Taylor Quadrics Limited http://www.quadrics.com > -----Original Message----- > From: Joachim Worringen [mailto:joachim at ccrl-nece.de] > Sent: 01 July 2003 09:03 > To: Beowulf mailinglist > Subject: Re: interconnect latency, dissected. > > > James Cownie: > > Mark Hahn wrote: > > > does anyone have references handy for recent work on interconnect > > > latency? > > > > Try http://www.cs.berkeley.edu/~bonachea/upc/netperf.pdf > > > > It doesn't have Inifinband, but does have Quadrics, Myrinet > 2000, GigE and > > IBM. > > Nice paper showing interesting properties. But some metrics > seem a little bit > dubious to me: in 5.2, they seem to see an advantage if the "overlap > potential" is higher (when they compare Quadrics and Myrinet) > - which usually > just results in higher MPI latencies, as this potiential (on > small messages) > can not be exploited. Even with overlapping mulitple communication > operations, the faster interconnect remains faster. This is > especially true > for small-message latency. > > From the contemporary (cluster) interconnects, SCI is missing next to > Infiniband. It would have been interesting to see the results > for SCI as it > has a very different communication model than most of the > other interconnects > (most resembling the T3E one). > > Joachim > > -- > Joachim Worringen - NEC C&C research lab St.Augustin > fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) > visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bruno at rocksclusters.org Wed Jul 2 14:27:05 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Wed, 2 Jul 2003 11:27:05 -0700 Subject: [Rocks-Discuss]Dual Itanium2 performance In-Reply-To: <0258E449E0019844924F40FE68D15B2D5FFE8F@ictxchp02.rac.ray.com> Message-ID: > I was reading in some of the mailing lists that the AMD Opteron dual > processor system was getting around 80-90% efficiency on the second > processor. 
just curious -- what benchmark was being used? > I was wondering if that holds true to the Itanium2 platform? > I looked through some of the archives and did not find any benchmarks > or > statistics on this. I found lots of dual Xeons but no dual Itaniums. running linpack and linking against the goto blas (http://www.cs.utexas.edu/users/flame/goto/), a two-cpu opteron achieved 87% of peak. a two-cpu itanium 2 achieved 98% of peak. - gb _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jducom at nd.edu Wed Jul 2 17:56:42 2003 From: jducom at nd.edu (Jean-Christophe Ducom) Date: Wed, 02 Jul 2003 16:56:42 -0500 Subject: 3ware Escalade 8500 Serial ATA RAID Message-ID: <3F03551A.8030608@nd.edu> Did anybody try this card? What are the performances compared to the parallel ATA? How stable is the driver on Linux? Thank you JC _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From alvin at Mail.Linux-Consulting.com Wed Jul 2 17:51:44 2003 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Wed, 2 Jul 2003 14:51:44 -0700 (PDT) Subject: sharing a power supply In-Reply-To: <3F03192E.4040904@andorra.ad> Message-ID: hi ya hang a 100uf or 1000uf ( +50v or +100v ) electrolytic capacitor across the mb power-on switch to slow down its power-on signal ... or do a extra resistor-capacitor circuit .. -- dont run 4 mb off one power supply.. you'd probably exceed the current output of the power supply - it will work.. it will just run hot and soon die ( 1/2 life rule for every 10C increase in temp ) c ya alvin On Wed, 2 Jul 2003, Alan Ward wrote: > Dear listpeople, > > I am building a small beowulf with the following configuration: > > - 4 motherboards w/ onboard Ethernet > - 1 hard disk > - 1 (small) switch > - 1 ATX power supply shared by all boards > > The intended boot sequence is the classical (1) master boots off > hard disk; (2) after a suitable delay, slaves boot off master > with dhcp and root nfs. > > I would appreciate comments on the following: > > a) A 450 W power supply should have ample power for all - > but can it deliver on the crucial +5V and +3.3V lines? Has anybody > got real-world intensity measurements on these lines for Athlons > I can compare to the supply's specs? > > b) I hung two motherboards off a single ATX supply. When I hit > the switch on either board, the supply goes on and both motherboards > come to life. Does anybody know a way of keeping the slaves still > until the master has gone through boot? e.g. Use the reset switch? > Can one of the power lines control the PLL on the motherboard? _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hahn at physics.mcmaster.ca Wed Jul 2 19:04:03 2003 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed, 2 Jul 2003 19:04:03 -0400 (EDT) Subject: 3ware Escalade 8500 Serial ATA RAID In-Reply-To: <3F03551A.8030608@nd.edu> Message-ID: > Did anybody try this card? What are the performances compared to the parallel > ATA? How stable is the driver on Linux? it's just their 7500 card with sata translators on the ports; I can't see how pata/sata would make any difference. 
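If anyone wants to check the PATA-versus-SATA question directly, a crude sequential-read timer is enough to show whether the translators cost anything (a sketch; the device path and the 256 MB read size are arbitrary examples, and raw devices need root):

/* Crude sequential-read timing for comparing two arrays or drives. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/dev/hda";  /* example device */
    char *buf = malloc(1 << 20);          /* 1 MB per read() */
    long long total = 0;
    struct timeval t0, t1;
    double secs;
    ssize_t n;
    int fd;

    fd = open(path, O_RDONLY);
    if (fd < 0 || buf == NULL) { perror(path); return 1; }

    gettimeofday(&t0, NULL);
    while (total < (256LL << 20) && (n = read(fd, buf, 1 << 20)) > 0)
        total += n;                       /* stop after 256 MB */
    gettimeofday(&t1, NULL);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%lld MB in %.2f s = %.1f MB/s\n",
           total >> 20, secs, (total >> 20) / secs);
    close(fd);
    return 0;
}

Against the page cache this will flatter any controller, so read a region much larger than RAM if the comparison is meant to be taken seriously.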
I've had good luck with my 7500-8, but have heard others both complain and praise them. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From leonard_chvilicek at rac.ray.com Wed Jul 2 16:20:25 2003 From: leonard_chvilicek at rac.ray.com (Leonard Chvilicek) Date: Wed, 2 Jul 2003 15:20:25 -0500 Subject: [Rocks-Discuss]Dual Itanium2 performance Message-ID: <0258E449E0019844924F40FE68D15B2D5FFE90@ictxchp02.rac.ray.com> The code that they were using was a CFD code called TAU and they were getting over 90% efficiency on the 2nd processor on the Dual Opteron system. Thanks for your information Tim & Greg Have a great 4th of July! Leonard -----Original Message----- From: Greg Bruno [mailto:bruno at rocksclusters.org] Sent: Wednesday, July 02, 2003 1:27 PM To: Leonard Chvilicek Cc: beowulf at beowulf.org; npaci-rocks-discussion at sdsc.edu Subject: Re: [Rocks-Discuss]Dual Itanium2 performance > I was reading in some of the mailing lists that the AMD Opteron dual > processor system was getting around 80-90% efficiency on the second > processor. just curious -- what benchmark was being used? > I was wondering if that holds true to the Itanium2 platform? I looked > through some of the archives and did not find any benchmarks or > statistics on this. I found lots of dual Xeons but no dual Itaniums. running linpack and linking against the goto blas (http://www.cs.utexas.edu/users/flame/goto/), a two-cpu opteron achieved 87% of peak. a two-cpu itanium 2 achieved 98% of peak. - gb _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jcookeman at yahoo.com Wed Jul 2 20:36:52 2003 From: jcookeman at yahoo.com (Justin Cook) Date: Wed, 2 Jul 2003 17:36:52 -0700 (PDT) Subject: SuSE 8.2 and LAM-MPI 7.0 Message-ID: <20030703003652.1234.qmail@web10606.mail.yahoo.com> Gents and Ladies, I am new to the Beowulf arena. I am trying to get a diskless cluster up with SuSE 8.2 and LAM-MPI 7.0. I plan on using nfs-root and nfs for all of the mount points. If I do a minimal install with gcc and install lam-mpi for my slave-node images am I on the right track? Does anyone have a better solution for me? Justin __________________________________ Do you Yahoo!? SBC Yahoo! DSL - Now only $29.95 per month! http://sbc.yahoo.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From anand at novaglobal.com.sg Wed Jul 2 22:01:02 2003 From: anand at novaglobal.com.sg (Anand Vaidya) Date: Thu, 3 Jul 2003 10:01:02 +0800 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <20030701224808.GA15167@stikine.ucs.sfu.ca> References: <20030701224808.GA15167@stikine.ucs.sfu.ca> Message-ID: <200307031001.05515.anand@novaglobal.com.sg> I have tested a Dual Opteron with Mandrake and RedHat Linux. (MSI board with 4GB, and Avant 1U) Mandrake did not have ISO images (when I downloaded) so I had to download the files & install via NFS. There were lot of problems though. 
Download it from
ftp://ftp.leo.org/pub/comp/os/unix/linux/Mandrake/Mandrake/9.0/x86_64

RedHat GinGin, which is RH's version of RHL for Opteron (64-bit), can be
downloaded from
ftp://ftp.redhat.com/pub/redhat/linux/preview/gingin64/en/iso/x86_64/
as ISO images.

RH installed and ran extremely well. We did run some benchmarks (smp jobs).
Pretty impressive!

HTH

-Anand

On Wednesday 02 July 2003 06:48 am, Martin Siegert wrote:
> Hello,
>
> I have a dual AMD Opteron for a week or so as a demo and try to install
> Linux on it - so far with little success.
>
> First of all: doing a google search for x86-64 Linux turns up a lot of
> press releases but not much more, particularly nothing one could download
> and install. Even a direct search on the SuSE and Mandrake sites shows
> only press releases. Sigh.
>
> Anyway: I found a few ftp sites that supply a Mandrake-9.0 x86_64 version.
> Thus I did a ftp installation which after (many) hiccups actually worked.
> However, that distribution does not support the onboard Broadcom 5704
> NICs. I also could not get the driver from the broadcom web site to work
> (insmod fails with "could not find MAC address in NVRAM").
>
> Thus I tried to compile the 2.4.21 kernel which worked, but
> "insmod tg3" freezes the machine instantly.
>
> Thus, so far I am not impressed.
>
> For those of you who have such a box: which distribution are you using?
> Any advice on how to get those GigE Broadcom NICs to work?
>
> Cheers,
> Martin
--
------------------------------------------------------------------------------
Regards,
Anand Vaidya
Technical Manager
NovaGlobal Pte Ltd
Tel: (65) 6238 6400
Fax: (65) 6238 6401
Mo: (65) 9615 7317
http://www.novaglobal.com.sg/
------------------------------------------------------------------------------
Fortune Cookie for today:
------------------------------------------------------------------------------
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From torsten at howard.cc Wed Jul 2 20:37:59 2003
From: torsten at howard.cc (torsten)
Date: Wed, 2 Jul 2003 20:37:59 -0400
Subject: Kickstart Help
Message-ID: <20030702203759.1232970b.torsten@howard.cc>

Hello All,

RedHat 9.0, headless node

I'm working on a bootable-CD-ROM (NFS-mounted distro) kickstart
installation method. When the computer boots, it gives me the
boot:
prompt, and waits. I have to type in
linux ks=cdrom:/ks.cfg
to get it going. Is there any way to make this automatic?

During the install, it gets through to the aspell-ca-someversion package
and stops, saying it is corrupt. I haven't checked if it is corrupt, or
even exists, because I only copied one CD-ROM (disc1). How do I control
which packages are installed (since only a bare minimum are needed, as
this is a headless node)?

Thanks for any pointers.

Torsten
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From csheer at hotmail.com Thu Jul 3 03:22:27 2003
From: csheer at hotmail.com (John Shea)
Date: Thu, 03 Jul 2003 00:22:27 -0700
Subject: Java Beowulf Cluster
Message-ID: 

For those who are interested in building beowulf cluster using Java, here is
a great software
package you can try out at: http://www.GreenTeaTech.com.

John
-----------------------------------------------------------------------------------------------------------------------------------
Build your own GreenTea Network Computer at home, in the office, or on the
Internet. Check it all out at http://www.GreenTeaTech.com
----------------------------------------------------------------------------------------------------------------------------------

_________________________________________________________________
MSN 8 with e-mail virus protection service: 2 months FREE*
http://join.msn.com/?page=features/virus
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From joachim at ccrl-nece.de Thu Jul 3 04:22:08 2003
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Thu, 3 Jul 2003 10:22:08 +0200
Subject: interconnect latency, dissected.
In-Reply-To: <010C86D15E4D1247B9A5DD312B7F5AA7CCDC9D@stegosaurus.bristol.quadrics.com>
References: <010C86D15E4D1247B9A5DD312B7F5AA7CCDC9D@stegosaurus.bristol.quadrics.com>
Message-ID: <200307031022.08268.joachim@ccrl-nece.de>

John Taylor:
> For completeness here is the shmem_put performance on a new bridge.
>
>
> prun -N2 sping -f put -b 1000 0 8
> 1: 4 bytes 1.60 uSec 2.50 MB/s
> 1: 8 bytes 1.60 uSec 5.00 MB/s
> 1: 16 bytes 1.58 uSec 10.11 MB/s

The latency decrease is impressive for this bridge - which one is it? Can you
tell?
Joachim

--
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From joachim at ccrl-nece.de Thu Jul 3 07:08:03 2003
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Thu, 3 Jul 2003 13:08:03 +0200
Subject: interconnect latency, dissected.
In-Reply-To: <010C86D15E4D1247B9A5DD312B7F5AA7CCDCC1@stegosaurus.bristol.quadrics.com>
References: <010C86D15E4D1247B9A5DD312B7F5AA7CCDCC1@stegosaurus.bristol.quadrics.com>
Message-ID: <200307031308.03813.joachim@ccrl-nece.de>

John Taylor:
> This result was achieved on a ServerWorks GC-LE within a HP Proliant DL380
> G3.

Hmm, this is not really a "new" bridge - or is it modified for HP? The other
numbers (4.4us for Xeon) that you gave were also achieved on a GC-LE system.
Where's the difference?

Joachim

--
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From johnt at quadrics.com Thu Jul 3 06:39:21 2003
From: johnt at quadrics.com (John Taylor)
Date: Thu, 3 Jul 2003 11:39:21 +0100
Subject: interconnect latency, dissected.
Message-ID: <010C86D15E4D1247B9A5DD312B7F5AA7CCDCC1@stegosaurus.bristol.quadrics.com>

This result was achieved on a ServerWorks GC-LE within a HP Proliant DL380
G3.

> -----Original Message-----
> From: Joachim Worringen [mailto:joachim at ccrl-nece.de]
> Sent: 03 July 2003 09:22
> To: John Taylor; 'beowulf at beowulf.org'
> Subject: Re: interconnect latency, dissected.
>
>
> John Taylor:
> > For completeness here is the shmem_put performance on a new bridge.
> >
> >
> > prun -N2 sping -f put -b 1000 0 8
> > 1: 4 bytes 1.60 uSec 2.50 MB/s
> > 1: 8 bytes 1.60 uSec 5.00 MB/s
> > 1: 16 bytes 1.58 uSec 10.11 MB/s
>
> The latency decrease is impressive for this bridge - which
> one is it? Can you
> tell?
> > Joachim > > -- > Joachim Worringen - NEC C&C research lab St.Augustin > fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From robert.crosbie at tchpc.tcd.ie Thu Jul 3 07:56:55 2003 From: robert.crosbie at tchpc.tcd.ie (Robert bobb Crosbie) Date: Thu, 3 Jul 2003 12:56:55 +0100 Subject: Kickstart Help In-Reply-To: <20030703011206.6d22b1b6.torsten@howard.cc> References: <20030703011206.6d22b1b6.torsten@howard.cc> Message-ID: <20030703115655.GB6647@tchpc01.tcd.ie> torsten hath declared on Thursday the 03 day of July 2003 :-: > Hello All, > > RedHat 9.0, headless node > > I'm working on a bootable-CD-ROM (NFS-mounted distro) kickstart > installation method. When the computer boots, it give me the > boot: > prompt, and waits. I havae to type in > linux ks=cdrom:/ks.cfg > to get it going. Is there any way to make this automatic? I have done this with a bootnet floppy on the 7.x series a number of times. mount the bootnet.img on a loopback ``mount -o loop bootnet.img /mnt'' then edit /mnt/syslinux.cfg and added: label ksfloppy kernel vmlinuz append "ks=floppy" initrd=initrd.img lang= lowres devfs=nomount ramdisk_size=8192 Then set "ksfloppy" to the default with: default ksfloppy We generally get the ks.cfg over nfs which might be handier if your going to be booting from cdrom, with something like the following: label ksnfs kernel vmlinuz append "ks=nfs:11.22.33.44:/kickstart/7.3/" initrd=initrd.img lang=lowres devfs=nomount ramdisk_size=8192 (Installing a machine with the IP 4.3.2.1 will then look for the file "/kickstart/7.3/4.3.2.1-kickstart" on the nfs server, we just use symlinks). Then umount /mnt and dd the image to floppy. I presume you could do something similar by mounting the ISO and editing /mnt/isolinux/isolinux.cfg, although I have never tried it. > During the install, it gets through to aspell-ca-somevesrsion package > and stops, saying it is corrupt. I haven't checked if it is corrupt, or > even exists, because I only copied one CD-ROM (disc1). How do I control > which packages are installed Under the "%packages" section of the ks.cfg you can specify either package collections "Software Developement" or individual packages "gcc" to be installed. A snippit from our ks.cfg for 7.3 workstation installs looks like: %packages --resolvedeps @Classic X Window System @GNOME @Software Development [...etc...] ntp vim-enhanced vim-X11 xemacs gv [...etc...] > (since only a bare minimum are needed, as this is a headless node)? Getting the package list setup is a little bit of trial and error, but you get there in the end :) HTH, - bobb -- Robert "bobb" Crosbie. Trinity Centre for High Performance Computing, O'Reilly Institute,Trinity College Dublin. Tel: +353 1 608 3725 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jlb17 at duke.edu Thu Jul 3 08:14:36 2003 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Thu, 3 Jul 2003 08:14:36 -0400 (EDT) Subject: Kickstart Help In-Reply-To: <20030703010006.65ab487a.torsten@howard.cc> Message-ID: On Thu, 3 Jul 2003 at 1:00am, torsten wrote > I'm working on a bootable-CD-ROM (NFS-mounted distro) kickstart > installation method. 
When the computer boots, it give me the > boot: > prompt, and waits. I havae to type in > linux ks=cdrom:/ks.cfg > to get it going. Is there any way to make this automatic? Modify syslinux.cfg to have the default be your ks entry. Also, crank down the timeout. > During the install, it gets through to aspell-ca-somevesrsion package > and stops, saying it is corrupt. I haven't checked if it is corrupt, or > even exists, because I only copied one CD-ROM (disc1). How do I control > which packages are installed (since only a bare minimum are needed, as > this is a headless node)? You control the packages in the, err, %packages section of the ks.cfg. You can specify families and individual packages in there, as well as specifying packages not to install. Kickstart is pretty well documented. All the options are listed here: http://www.redhat.com/docs/manuals/linux/RHL-9-Manual/custom-guide/s1-kickstart2-options.html -- Joshua Baker-LePain Department of Biomedical Engineering Duke University _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From wrankin at ee.duke.edu Thu Jul 3 08:27:55 2003 From: wrankin at ee.duke.edu (Bill Rankin) Date: 03 Jul 2003 08:27:55 -0400 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <200307030459.h634x5Y12821@NewBlue.Scyld.com> References: <200307030459.h634x5Y12821@NewBlue.Scyld.com> Message-ID: <1057235275.2186.22.camel@rohgun.cse.duke.edu> Anand Vaidya : > RedHat GinGin which is RH's version of RHL for Opteron (64bit) can be > downloaded from > ftp://ftp.redhat.com/pub/redhat/linux/preview/gingin64/en/iso/x86_64/ > as ISO images. > > RH installed and ran extremely well. We did run some benchmarks (smp jobs). > Pretty impressive! I am also running Gingin64 on a Penguin Computing dual Opteron which uses the Broadcom NICs. It is running fine at this moment with no complaints. The only issues were: 1 - No floppy boot/install image. Must boot from CD or (in my case) PXE boot and install. 2 - IIRC, the Broadcom NIC was not properly recognized, but using the one Broadcom NIC entry in the install list (forgot the model number) works fine. Do a google for "gingin64" and it should get you the links. There is a mailing list on Redhat for AMD64 https://listman.redhat.com/mailman/listinfo/amd64-list Performance wise, using the stock 64 bit gcc on my molecular dynamics codes shows overall performance of the 1.4 GHz Opteron 240 to be on par with Xeon 2.4s. YMMV. - bill _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bvds at bvds.geneva.edu Thu Jul 3 08:42:29 2003 From: bvds at bvds.geneva.edu (bvds at bvds.geneva.edu) Date: Thu, 3 Jul 2003 08:42:29 -0400 Subject: Linux support for AMD Opteron with Broadcom NICs Message-ID: <200307031242.h63CgTn02594@bvds.geneva.edu> Simon Hogg wrote: >However, as far as I am aware, it should be possible to install a vanilla >x86-32 distribution and recompile everything for 64-bit (with a recent GCC >(3.3 is the best bet at the moment I suppose)). I attempted this: start with 32-bit RedHat 9 and gradually move up to 64 bit. It proved to be rather difficult since you need to compile a 64-bit kernel and you need to install gcc as a cross-compiler to do this. 
And then you would need to figure out how to handle the 32- and 64-bit libraries, yuck! I found it much easier to start over with gingin64 (which has worked well for me). I found no advantage to installing a 32-bit OS. Brett van de Sande _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From angel at wolf.com Thu Jul 3 09:27:03 2003 From: angel at wolf.com (Angel Rivera) Date: Thu, 03 Jul 2003 13:27:03 GMT Subject: 3ware Escalade 8500 Serial ATA RAID In-Reply-To: References: Message-ID: <20030703132703.24703.qmail@houston.wolf.com> Mark Hahn writes: >> Did anybody try this card? What are the performances compared to the parallel >> ATA? How stable is the driver on Linux? > > it's just their 7500 card with sata translators on the ports; > I can't see how pata/sata would make any difference. > > I've had good luck with my 7500-8, but have heard others both > complain and praise them. We are using the 7500-8 to the tune of 20 of them in 10 boxes (28TB) in one rack and we are rather impressed with the card. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From award at andorra.ad Thu Jul 3 09:21:35 2003 From: award at andorra.ad (Alan Ward) Date: Thu, 03 Jul 2003 15:21:35 +0200 Subject: sharing a power supply References: <3F03192E.4040904@andorra.ad> Message-ID: <3F042DDF.9000700@andorra.ad> Thanks to everybody for the help. My final set-up will probably look like: - master node on a 300W supply - three slaves on a 450W supply. I am counting on the following maximum draws for each motherboard (Duron at 1300 + 512 MB RAM): 15A / 5V <1A / 3.3V 5A / 12V This is _just_ inside the 450W supply's specs - I hope they were not overly optimistic. On the other hand, a good 350W supply can power up a dual with 1GB RAM ... Best regards, Alan Ward _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From timm at fnal.gov Thu Jul 3 09:58:34 2003 From: timm at fnal.gov (Steven Timm) Date: Thu, 3 Jul 2003 08:58:34 -0500 (CDT) Subject: interconnect latency, dissected. In-Reply-To: <200307031308.03813.joachim@ccrl-nece.de> Message-ID: We also saw streams numbers that were much higher than expected while using a HP Proliant DL360 (compared to machines from other vendors that were supposedly using the exact same chipset, memory, and CPU speed.) HP didn't have an explanation for the increase. Steve ------------------------------------------------------------------ Steven C. Timm (630) 840-8525 timm at fnal.gov http://home.fnal.gov/~timm/ Fermilab Computing Division/Core Support Services Dept. Assistant Group Leader, Scientific Computing Support Group Lead of Computing Farms Team On Thu, 3 Jul 2003, Joachim Worringen wrote: > John Taylor: > > This result was achieved on a ServerWorks GC-LE within a HP Proliant DL380 > > G3. > > Hmm, this is not really a "new" bridge - or is it modified for HP? The other > numbers (4.4us for Xeon) that you gave where also achieved on a GC-LE system. > Where's the difference? 
> > Joachim > > -- > Joachim Worringen - NEC C&C research lab St.Augustin > fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From becker at scyld.com Thu Jul 3 09:29:14 2003 From: becker at scyld.com (Donald Becker) Date: Thu, 3 Jul 2003 06:29:14 -0700 (PDT) Subject: Java Beowulf Cluster In-Reply-To: Message-ID: On Thu, 3 Jul 2003, John Shea wrote: > Date: Thu, 03 Jul 2003 00:22:27 -0700 > From: John Shea > To: beowulf at beowulf.org > Subject: Java Beowulf Cluster > > For those who are interested in building beowulf cluster using Java, here is > a great software > package you can try out at: http://www.--GreenTeaTech.com. Sorry about this obvious no-content marketing shill... This person subscribed and immediately posted this message. A quick search shows the same type of marketing on many other mailing lists, usually posing as a unrelated user e.g. http://webnews.kornet.net/view.cgi?group=comp.parallel.pvm&msgid=9875 https://mailer.csit.fsu.edu/pipermail/java-for-cse/2001/000013.html BTW Greg, this person is actually Chris Xie, a marketing person at the company. -- Donald Becker becker at scyld.com Scyld Computing Corporation http://www.scyld.com 914 Bay Ridge Road, Suite 220 Scyld Beowulf cluster system Annapolis MD 21403 410-990-9993 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bobb at tchpc.tcd.ie Thu Jul 3 10:34:24 2003 From: bobb at tchpc.tcd.ie (bobb) Date: Thu, 3 Jul 2003 15:34:24 +0100 Subject: Kickstart Help In-Reply-To: <20030703011206.6d22b1b6.torsten@howard.cc> References: <20030703011206.6d22b1b6.torsten@howard.cc> <20030703115655.GB6647@tchpc01.tcd.ie> Message-ID: <20030703143424.GA15206@tchpc01.tcd.ie> torsten hath declared on Thursday the 03 day of July 2003 :-: > Hello All, > > RedHat 9.0, headless node > > I'm working on a bootable-CD-ROM (NFS-mounted distro) kickstart > installation method. When the computer boots, it give me the > boot: > prompt, and waits. I havae to type in > linux ks=cdrom:/ks.cfg > to get it going. Is there any way to make this automatic? I have done this with a bootnet floppy on the 7.x series a number of times. mount the bootnet.img on a loopback ``mount -o loop bootnet.img /mnt'' then edit /mnt/syslinux.cfg and added: label ksfloppy kernel vmlinuz append "ks=floppy" initrd=initrd.img lang= lowres devfs=nomount ramdisk_size=8192 Then set "ksfloppy" to the default with: default ksfloppy We generally get the ks.cfg over nfs which might be handier if your going to be booting from cdrom, with something like the following: label ksnfs kernel vmlinuz append "ks=nfs:11.22.33.44:/kickstart/7.3/" initrd=initrd.img lang=lowres devfs=nomount ramdisk_size=8192 (Installing a machine with the IP 4.3.2.1 will then look for the file "/kickstart/7.3/4.3.2.1-kickstart" on the nfs server, we just use symlinks). Then umount /mnt and dd the image to floppy. I presume you could do something similar by mounting the ISO and editing /mnt/isolinux/isolinux.cfg, although I have never tried it. 
> During the install, it gets through to aspell-ca-somevesrsion package > and stops, saying it is corrupt. I haven't checked if it is corrupt, or > even exists, because I only copied one CD-ROM (disc1). How do I control > which packages are installed Under the "%packages" section of the ks.cfg you can specify either package collections "Software Developement" or individual packages "gcc" to be installed. A snippit from our ks.cfg for 7.3 workstation installs looks like: %packages --resolvedeps @Classic X Window System @GNOME @Software Development [...etc...] ntp vim-enhanced vim-X11 xemacs gv [...etc...] > (since only a bare minimum are needed, as this is a headless node)? Getting the package list setup is a little bit of trial and error, but you get there in the end :) HTH, - bobb -- Robert "bobb" Crosbie. Trinity Centre for High Performance Computing, O'Reilly Institute,Trinity College Dublin. Tel: +353 1 608 3725 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jeffrey.b.layton at lmco.com Thu Jul 3 11:51:59 2003 From: jeffrey.b.layton at lmco.com (Jeff Layton) Date: Thu, 03 Jul 2003 11:51:59 -0400 Subject: Opteron benchmark numbers Message-ID: <3F04511F.8030903@lmco.com> Hello, I don't know if everyone has seen these results yet, but here's a link to some Opteron numbers for a small (4 node of dual) cluster: http://mpc.uci.edu/opteron.html Enjoy! Jeff -- Jeff Layton Chart Monkey - Aerodynamics and CFD Lockheed-Martin Aeronautical Company - Marietta _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From erwan at mandrakesoft.com Thu Jul 3 03:41:35 2003 From: erwan at mandrakesoft.com (Erwan Velu) Date: 03 Jul 2003 09:41:35 +0200 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <20030701224808.GA15167@stikine.ucs.sfu.ca> References: <20030701224808.GA15167@stikine.ucs.sfu.ca> Message-ID: <1057218095.2268.19.camel@revolution.mandrakesoft.com> > Anyway: I found a few ftp sites that supply a Mandrake-9.0 x86_64 version. > Thus I did a ftp installation which after (many) hickups actually worked. > However, that distribution does not support the onboard Broadcom 5704 > NICs. I also could not get the driver from the broadcom web site to work > (insmod fails with "could not find MAC address in NVRAM"). I will have a look on that point because MandrakeLinux for opteron owns the bcm5700 driver. Could you send me the PCI-ID of your card ? > For those of you who have such a box: which distribution are you using? The MandrakeClustering product (http://www.mandrakeclustering.com) has been shown during ISC2003 at Heidelberg (www.isc2003.org) on dual opteron systems. People who want to test it can contact me directly. 
Best regards, -- Erwan Velu Linux Cluster Distribution Project Manager MandrakeSoft 43 rue d'aboukir 75002 Paris Phone Number : +33 (0) 1 40 41 17 94 Fax Number : +33 (0) 1 40 41 92 00 Web site : http://www.mandrakesoft.com OpenPGP key : http://www.mandrakesecure.net/cks/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From atp at piskorski.com Thu Jul 3 14:00:22 2003 From: atp at piskorski.com (Andrew Piskorski) Date: Thu, 3 Jul 2003 14:00:22 -0400 Subject: sharing a power supply In-Reply-To: <200307031624.h63GOMY26657@NewBlue.Scyld.com> References: <200307031624.h63GOMY26657@NewBlue.Scyld.com> Message-ID: <20030703180022.GA66577@piskorski.com> On Thu, Jul 03, 2003 at 03:21:35PM +0200, Alan Ward wrote: > My final set-up will probably look like: > > - master node on a 300W supply > - three slaves on a 450W supply. Alan, how did you go about attaching three motherboard connectors to that one 450W supply? Where'd you buy the connectors, and did you have to solder them on or is there some sort of Y type splitter cable available? Also, did you do anything to get the three slaves to power on sequentially rather than all at once? Or are you just hoping that the supply will be able to handle the peak load on startup? In my limited experience with Athlons, I've seen cheap power supplies cause memory errors. (In my case, only while also spinning a hard drive while compiling the Linux kernel; memtest86 did not cach the problem.) So I'd definitely be inclined to try using one high quality supply rather than three cheap ones. But until your emails to the list though I hadn't heard of anyone doing it. -- Andrew Piskorski http://www.piskorski.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From torsten at howard.cc Thu Jul 3 14:49:35 2003 From: torsten at howard.cc (torsten) Date: Thu, 3 Jul 2003 14:49:35 -0400 Subject: Kickstart Help - Thanks! In-Reply-To: <20030703115655.GB6647@tchpc01.tcd.ie> References: <20030703011206.6d22b1b6.torsten@howard.cc> <20030703115655.GB6647@tchpc01.tcd.ie> Message-ID: <20030703144935.12bf170f.torsten@howard.cc> Thanks for the help. Redhat 9.0 uses "isolinux" for the boot dist, so the old "syslinux.cfg" is now "isolinux.cfg". Getting the packages right is indeed trial and error. I'm down to about 500MB, and reducing them one-by-one. Torsten _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From egan at sense.net Thu Jul 3 16:53:21 2003 From: egan at sense.net (Egan Ford) Date: Thu, 3 Jul 2003 14:53:21 -0600 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <20030701224808.GA15167@stikine.ucs.sfu.ca> Message-ID: <002e01c341a5$23e9a5b0$27b358c7@titan> > For those of you who have such a box: which distribution are > you using? > Any advice on how to get those GigE Broadcom NICs to work? I have 2 boxes with 2 Opterons and 2 onboard Broadcoms NICs and have had very minor but expected problems installing: SLES8 x86_64 SLES8 x86 RH 7.3 Issues: SLES8 x86_64 recognized the NIC in reverse order than that of RH73 and SLES8 x64. Adding netdevice=eth1 to Autoyast network installer was the work around. 
FYI, Autoyast is like kickstart but for SuSE distros. SLES8 x86 needed a minor tweak to the network boot image to find the BCM5700s. But the module was just fine. RH 7.3 needed a new module and pcitable entry in the network boot image for installation. I also had to update the runtime bcm5700 support. HINT: RH7.3 installs the athlon kernel. I'd love to know how to tell kickstart to force i686. I used version 6.2.11 from broadcom.com. I am too lazy to do CD installs so I only tested network installing. My demo machines came with IDE drives, I suspect that if I had SCSI that RH7.3 would have needed that updated as well in the installer. I just downloaded gingin64, but have not tested it yet. I suspect that it will work just fine. Anyone know what gingin64 is? RH8, RH9, RH10,...? I am impressed with SLES8 x86_64. The updated NUMA kernel with the numactl command is very nice. You can peg a process and its children to a processor and memory bus or threads of an OMP application to the memory of the processor the thread is running it. Helps with benchmarks like STREAM and SPECfp on multiprocessor systems. Now if someone will add it as an option to mpirun... _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From alvin at Mail.Linux-Consulting.com Thu Jul 3 19:08:05 2003 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Thu, 3 Jul 2003 16:08:05 -0700 (PDT) Subject: sharing a power supply In-Reply-To: <3F042DDF.9000700@andorra.ad> Message-ID: hi ya On Thu, 3 Jul 2003, Alan Ward wrote: > I am counting on the following maximum draws for > each motherboard (Duron at 1300 + 512 MB RAM): > > 15A / 5V > <1A / 3.3V > 5A / 12V > > This is _just_ inside the 450W supply's specs - > I hope they were not overly optimistic. if you're connecting 3 systems .. that's 45A that the power supply has to deliver ... -- double that for current spikes and optimal/normal performance and reliability of the power supply if the ps can't deliver that current, than you're degrading your powersupply and motherboard down to irreparable damage over time 450W power supply doesnt mean anything ... its the total amps per each delivered voltages that yoou should be looking at and how well you want it regulated ... there's not much room for noise on the +3.3v power lines and it uses lots of current on some of the memory sticks if the idea of hooking up 4 systems to one ps was to reduce heat and increase reliability, i think using multiple systems on a ps designed for one fully loaded mb/system will give you the opposite reliability effect i think 2 minimal-systems per powersupply is the max for any power supply .. most ps and cases is designed for fully loaded case fun stuff ... lots of smoke tests ... ( bad idea to let the blue smoke out... ( for some reason, the systmes always stop working ( after you let out the blue smoke ( and blue smoke smells funny too have fun alvin _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From alorant at octigabay.com Fri Jul 4 01:08:34 2003 From: alorant at octigabay.com (Adam Lorant) Date: Thu, 3 Jul 2003 22:08:34 -0700 Subject: GigE PCI-X NIC Cards Message-ID: <001201c341ea$54e9d870$0300a8c0@Adam> Hi folks.? 
Do any of you have any recommendations for a high performance Gigabit Ethernet NIC for PCI-X slots? Are there any that I should stay away from? My primary application is NAS access. Much appreciated, Adam. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From maurice at harddata.com Fri Jul 4 02:37:00 2003 From: maurice at harddata.com (Maurice Hilarius) Date: Fri, 04 Jul 2003 00:37:00 -0600 Subject: [Rocks-Discuss]Dual Itanium2 performance In-Reply-To: <200307021908.h62J8UY09280@NewBlue.Scyld.com> Message-ID: <5.1.1.6.2.20030704003523.033deaa0@mail.harddata.com> With regards to your message at 01:08 PM 7/2/03 from beowulf-request at scyld.com, where you stated: >On Wed, 2 Jul 2003, Leonard Chvilicek wrote: > > > I was reading in some of the mailing lists that the AMD Opteron dual > > processor system was getting around 80-90% efficiency on the second > > processor. I was wondering if that holds true for the Itanium2 platform? > > I looked through some of the archives and did not find any benchmarks or > > statistics on this. I found lots of dual Xeons but no dual Itaniums. > >You are not going to be able to beat a dual Itanium in terms of efficiency >if you are talking about a linpack benchmark. Close to 98% efficient. > >Tim Perhaps, but as linpack is not what most people actually run on their machines for production, I think it is more useful to consider what SMP efficiency you get on real production code. With our best regards, Maurice W. Hilarius Telephone: 01-780-456-9771 Hard Data Ltd.
FAX: 01-780-456-9772 11060 - 166 Avenue mailto:maurice at harddata.com Edmonton, AB, Canada http://www.harddata.com/ T5X 1Y3 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From alvin at Mail.Linux-Consulting.com Fri Jul 4 03:19:39 2003 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Fri, 4 Jul 2003 00:19:39 -0700 (PDT) Subject: memory nightmare In-Reply-To: <5.1.1.6.2.20030704004114.033e1a00@mail.harddata.com> Message-ID: hi ya On Fri, 4 Jul 2003, Maurice Hilarius wrote: > With regards to your message : > >From: Jack Wathey > >To: Stephen Gaudet > >cc: beowulf at beowulf.org > >Subject: Re: memory nightmare > > > >I suppose it's remotely possible, but not likely. All of the boards will > >run memtest86 for many days, and my number-crunching code for many weeks, > >with no problems at all, when I use memory from the batch I bought last > >December. Most of the failing sticks I've encountered since April will > >fail consistently, whether tested alone or with other sticks, whether > >tested on my Gigabyte GA7DPXDW-P boards or the Asus A7M266D board that I > >use in my server. It's only a few sticks in the most recent batch of 69 > >that are failing in this rare and intermittent way that I can't seem to > >reproduce when the sticks are tested one per motherboard. ditto that ... all the generic 1GB mem sticks ( ddr-2100) work fine by itself but fails big time with 2 of um in the same mb ... ( wasted about a months of productivity during the random failures ( and no failures since using 4x 512MB sticks we wound up replacing the cheap asus mb with intel D845/D865 series and changed to 4x 512MB sticks instead and it worked fine similarly, for finicky mb, we used name brand memory 256MB ddr-2100, and it worked fine ... > Have you tried raising the memory voltage level on the motherboards to 2.7V ? > I see characteristics of failure like you have described on many cheap > motherboards. > Works fine with 1 stick, errors with 3 sticks of RAM. forgetful memory is not a good thing c ya alvin _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From award at andorra.ad Fri Jul 4 03:53:42 2003 From: award at andorra.ad (Alan Ward) Date: Fri, 04 Jul 2003 09:53:42 +0200 Subject: sharing a power supply References: Message-ID: <3F053286.1090804@andorra.ad> Hi Alvin En/na Alvin Oga ha escrit: (snip) > 450W power supply doesnt mean anything ... > its the total amps per each delivered voltages > that yoou should be looking at and how well you > want it regulated ... there's not much room > for noise on the +3.3v power lines and it uses > lots of current on some of the memory sticks I am. As has been noted, it looks like there's very little draw on 3.3V; we are way above specs. You are right about 5V and spikes, though. Have to try and see. Luckily, I have no other 5V devices in the box (I think :-). This 450W is given for 45A/5v and 25A/3.3V, with a 250W limit across these two lines. > if the idea of hooking up 4 systems to one ps was > to reduce heat and increase reliability, i think > using multiple systems on a ps designed for one > fully loaded mb/system will give you the opposite > reliability effect This is a small mobile console type system, on wheels. 
The idea is to move it around from one desk to another, so different people can litteraly get their hands on it. Having little noise (thus fans) is about as important as pure computing power at this stage - I need to have them buy the concept first. The design isn't too bad; the pics will be on the web ASAP. Best regards, Alan _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From award at andorra.ad Fri Jul 4 03:53:52 2003 From: award at andorra.ad (Alan Ward) Date: Fri, 04 Jul 2003 09:53:52 +0200 Subject: sharing a power supply References: <200307031624.h63GOMY26657@NewBlue.Scyld.com> <20030703180022.GA66577@piskorski.com> Message-ID: <3F053290.50800@andorra.ad> Hi. En/na Andrew Piskorski ha escrit: > Alan, how did you go about attaching three motherboard connectors to > that one 450W supply? Where'd you buy the connectors, and did you > have to solder them on or is there some sort of Y type splitter cable > available? I started with dominoes, and when I was sure it worked soldered them. Jack Wathey posted the following: >> Rather than cut up the wires >> of a power supply, I cut up the wires of extension cables: >> >> http://www.cablesamerica.com/product.asp?cat%5Fid=604&sku=22998 >> http://www.cablesamerica.com/product.asp?cat%5Fid=604&sku=27314 Being in southern Europe, there's no hope of getting these here. But busted power supplies (for parts) are easy to find :-( > Also, did you do anything to get the three slaves to power on > sequentially rather than all at once? Or are you just hoping that the > supply will be able to handle the peak load on startup? Can't do anything about that. When the supply goes on, it powers the boards, and they start up, period. Maybe a breaker on the 5V and 3.3V lines would be a solution. However, I reason the following: power-on spikes come from condensators. But there are a lot more condensators in the power supplies than on the motherboards - at the very least a factor of 100 more in capacity. So I expect the spikes on the AC circuit as the supply is getting charged up, rather than on the DC part. (Comments, Alvin, Jack?) > In my limited experience with Athlons, I've seen cheap power supplies > cause memory errors. (In my case, only while also spinning a hard > drive while compiling the Linux kernel; memtest86 did not cach the > problem.) So I'd definitely be inclined to try using one high quality > supply rather than three cheap ones. But until your emails to the > list though I hadn't heard of anyone doing it. There seem to be two-stage power supplies for racks: a general 230V / 12V converter for the whole rack, plus a simplified low-voltage supply for each box. I've never even seen any of these around here, though. What I'm doing is not strictly COTS. I loose the advantage of just plugging the hardware in and worrying *only* about the soft ... 
Best regards, Alan _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bobb at tchpc.tcd.ie Fri Jul 4 04:28:06 2003 From: bobb at tchpc.tcd.ie (bobb) Date: Fri, 4 Jul 2003 09:28:06 +0100 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <002e01c341a5$23e9a5b0$27b358c7@titan> References: <20030701224808.GA15167@stikine.ucs.sfu.ca> <002e01c341a5$23e9a5b0$27b358c7@titan> Message-ID: <20030704082806.GA32158@tchpc01.tcd.ie> Egan Ford hath declared on Thursday the 03 day of July 2003 :-: > I just downloaded gingin64, but have not tested it yet. I suspect that it > will work just fine. Anyone know what gingin64 is? RH8, RH9, RH10,...? According to the release notes it's 8.0.95. http://ftp.redhat.com/pub/redhat/linux/preview/gingin64/en/os/x86_64/RELEASE-NOTES - bobb -- Robert "bobb" Crosbie. Trinity Centre for High Performance Computing, O'Reilly Institute, Trinity College Dublin. Tel: +353 1 608 3725 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From daniel at labtie.mmt.upc.es Fri Jul 4 12:08:31 2003 From: daniel at labtie.mmt.upc.es (Daniel Fernandez) Date: 04 Jul 2003 18:08:31 +0200 Subject: Small PCs cluster Message-ID: <1057334911.3814.28.camel@qeldroma.cttc.org> Hi there, I've just started learning how to maintain a cluster: monitoring activity and temperature, finding and replacing damaged components, and user control. We are now planning to add more nodes... but there's a big problem: space. So we recently bought a Small Form Factor PC to test, a Shuttle SN41G2 with an nForce2 chipset. The install was a bit tricky because our older PCs have 3Com cards and were installed via BOOTP, while that damn nVidia integrated ethernet only boots via PXE; well, that's relatively easy to solve. After installing the nVidia drivers it seemed to work flawlessly. It's obvious that we'll gain space, but on the other hand heat dissipation will be harder because there will be more watts dissipated per cubic meter; that small PC case does have a nice heat-pipe for cooling the main CPU, though. Are there any experiences (successful or not) with installing and managing clusters of Small Form Factor PCs? I'm not talking only about heat, but also about instability of the integrated ethernet under high activity. -- Daniel Fernandez Laboratori de Termotècnia i Energia - CTTC ( Heat and Mass Transfer Center ) Universitat Politècnica de Catalunya _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From tsyang at iesinet.com Fri Jul 4 13:25:17 2003 From: tsyang at iesinet.com (T.-S. Yang) Date: Fri, 04 Jul 2003 10:25:17 -0700 Subject: Small PCs cluster In-Reply-To: <1057334911.3814.28.camel@qeldroma.cttc.org> References: <1057334911.3814.28.camel@qeldroma.cttc.org> Message-ID: <3F05B87D.9070108@iesinet.com> Daniel Fernandez wrote: > .. > Are there any experiences (successful or not) with installing and > managing clusters of Small Form Factor PCs? I'm not talking only > about heat, but also about instability of the integrated ethernet > under high activity.
> Your cluster is similar to the Space Simulator Cluster http://space-simulator.lanl.gov/ There is a helpful paper in PDF format. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From James.P.Lux at jpl.nasa.gov Fri Jul 4 13:55:34 2003 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Fri, 4 Jul 2003 10:55:34 -0700 Subject: sharing a power supply References: <3F053286.1090804@andorra.ad> Message-ID: <001d01c34255$77eed4e0$02a8a8c0@office1> If quiet and compact is your goal, then maybe getting some standard smaller supplies and doing some repackaging might be a better solution. Pull the fans out of the small supplies, mount them with some ducting and use 1 or 2 larger diameter fans. In general a larger diameter fan will move more air, more quietly, than a small diameter fan. You're already straying into non-standard application of the parts, so opening up the power supplies is hardly a big deal. You might find that using 3 small 200W supplies might be a better way to go than 1 monster 450W supply. There are also conduction cooled power supplies available (no fans at all) ----- Original Message ----- From: "Alan Ward" To: "Alvin Oga" Cc: Sent: Friday, July 04, 2003 12:53 AM Subject: Re: sharing a power supply > Hi Alvin > > > En/na Alvin Oga ha escrit: > (snip) > > 450W power supply doesnt mean anything ... > > its the total amps per each delivered voltages > > that yoou should be looking at and how well you > > want it regulated ... there's not much room > > for noise on the +3.3v power lines and it uses > > lots of current on some of the memory sticks > > I am. As has been noted, it looks like there's very > little draw on 3.3V; we are way above specs. > You are right about 5V and spikes, though. Have to > try and see. Luckily, I have no other 5V devices > in the box (I think :-). > > This 450W is given for 45A/5v and 25A/3.3V, with a > 250W limit across these two lines. > > > if the idea of hooking up 4 systems to one ps was > > to reduce heat and increase reliability, i think > > using multiple systems on a ps designed for one > > fully loaded mb/system will give you the opposite > > reliability effect > > This is a small mobile console type system, on wheels. > The idea is to move it around from one desk to another, > so different people can litteraly get their hands on it. > Having little noise (thus fans) is about as important > as pure computing power at this stage - I need to have > them buy the concept first. The design isn't too bad; > the pics will be on the web ASAP. 
> > > Best regards, > Alan > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From gerry.creager at tamu.edu Fri Jul 4 13:29:16 2003 From: gerry.creager at tamu.edu (Gerry Creager N5JXS) Date: Fri, 04 Jul 2003 12:29:16 -0500 Subject: Small PCs cluster In-Reply-To: <1057334911.3814.28.camel@qeldroma.cttc.org> References: <1057334911.3814.28.camel@qeldroma.cttc.org> Message-ID: <3F05B96C.6040801@tamu.edu> Relatively speaking the Shuttle cases, while small for a P4 or Athelon processor class machine, are pretty big compared to the Mini-ITX systems. However, the heat-pipes seem to do a pretty good job of off-loading heat and making the heat-exchanger available to ambient air. I've not built a cluster so far using this sort of case, but I've got a lot of past heat-pipe experience. I'd be tring to maintain a low inlet temperature to the rack, and a fairly high, and (uncharacteristically) non-laminar airflow through the rack. The idea is to get as much airflow incident to the heat-pipe heat exchanger as possible. We did a fair bit of heat-pipe work while I was at NASA. We found cood radiative characteristics in heat-pipe heat exchangers (the heat-pipes wouldn't have worked otherwise!) but they work best when they combine both convective and radiative modes and use a cool-air transport. I've got a number of isolated small-form-factor PCs now running. I've seen no instability with the integrated components in any of these. gerry Daniel Fernandez wrote: > Hi there, > > I just started how to mantain a cluster, I mean monitoring > activity/temperature, finding/replacing damaged components and user > control. Recently we are planning here to add more nodes... but there's > a great problem, space. > > So we bought recently a Small Form Factor PC to test it, It's a Shuttle > SN41G2 equipped with a nForce2 chipset, It was a bit tricky at install > process because our older PCs were equipped with 3Com cards and > installed via BOOTP but that damn nVidia integrated ethernet only boots > via PXE, well, that's relatively easy to solve. And after installing > nVidia drivers seemed to work flawlessly. > > It's obvious that we'll gain space but on the other hand heat > dissipation will be more difficult because will be more dissipated watts > per cubic-meter, that small PC case has a nice Heat-pipe for cooling the > main cpu though. > > ? Are there experiences ( successful or not ) about installing and > managing clusters with Small Form Factor PCs ? I'm not talking only > about heat but instability problems with integrated ethernet ( under > high activity ) as well. 
> > > -- Gerry Creager -- gerry.creager at tamu.edu Network Engineering -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.847.8578 Page: 979.228.0173 Office: 903A Eller Bldg, TAMU, College Station, TX 77843 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From torsten at howard.cc Fri Jul 4 16:41:45 2003 From: torsten at howard.cc (torsten) Date: Fri, 4 Jul 2003 16:41:45 -0400 Subject: Kickstart ks.cfg file example for headless node Message-ID: <20030704164145.1e8be175.torsten@howard.cc> Hello, Does anyone have a kickstart file (ks.cfg) that they use for a very minimal install on a headless node? Thanks, Torsten _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From derek.richardson at pgs.com Fri Jul 4 18:12:27 2003 From: derek.richardson at pgs.com (Derek Richardson) Date: Fri, 04 Jul 2003 17:12:27 -0500 Subject: Kickstart ks.cfg file example for headless node In-Reply-To: <20030704164145.1e8be175.torsten@howard.cc> References: <20030704164145.1e8be175.torsten@howard.cc> Message-ID: <3F05FBCB.9080408@pgs.com> Torsten, If using redhat, try their kickstart configurator for a basic configuration. Here's a list of packages I use for compute nodes on a redhat 7.1 cluster : %packages @ Networked Workstation @ Kernel Development @ Development @ Network Management Workstation @ Utilities autofs dialog lsof ORBit XFree86 audiofile control-panel dialog esound gnome-audio gnome-libs gtk+ imlib kaffe linuxconf libungif modemtool netcfg pythonlib tcl timetool tix tk tkinter tksysv wu-ftpd ntp pdksh ncurses ncurses-devel ncurses4 compat-egcs compat-egcs-c++ compat-egcs-g77 compat-egcs-objc compat-glibc compat-libs compat-libstdc++ xosview quota expect uucp I can't send you the entire kickstart, since it contains information relevant to the company I work for ( not to mention everyone would hate me for filling their inbox... ). This list would probably need to be updated for what version you're using. I'll send you ( off-list ) a kickstart that I use for redhat9 workstations that doesn't contain anything sensitive, it contains some examples of scripting post-install configuration and whatnot. Oh, redhat maintains excellent documentation : http://www.redhat.com/docs/manuals/linux/RHL-9-Manual/custom-guide/ Regards, Derek R. torsten wrote: >Hello, > >Does anyone have a kickstart file (ks.cfg) that they >use for a very minimal install on a headless node? > >Thanks, >Torsten >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > -- Linux Administrator derek.derekson at pgs.com derek.derekson at ieee.org Office 713-781-4000 Cell 713-817-1197 A list is only as strong as its weakest link. 
-- Don Knuth _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From torsten at howard.cc Fri Jul 4 18:43:54 2003 From: torsten at howard.cc (torsten) Date: Fri, 4 Jul 2003 18:43:54 -0400 Subject: Kickstart ks.cfg file example for headless node In-Reply-To: <3F05FBCB.9080408@pgs.com> References: <20030704164145.1e8be175.torsten@howard.cc> <3F05FBCB.9080408@pgs.com> Message-ID: <20030704184354.61bed075.torsten@howard.cc> > I'll send you ( off-list ) a kickstart that I use for >redhat9 workstations that doesn't contain anything sensitive, it >contains some examples of scripting post-install configuration >and whatnot. Oh, redhat maintains excellent documentation : >http://www.redhat.com/docs/manuals/linux/RHL-9-Manual/custom-guide/ Thanks for the info. I'm most interested in %packages. The manual talks about package selection. In order to reduce the install size, I select no additional packages. I just want a base (40-50M) system. My current installed system turns out to be huge (700M+). I read in the manual, it says "The Package Selection window allows you to choose which package groups to install." I understand this to mean that choosing a package installs that package, in addition to the base system. Have I misread? By selecting no packages, is kickstart installing all packages by default? If I select "@ base", will this only install the base and skip the rest? My goal is a very small, very quick network install. Thanks to everyone for their help and patience. Extra thanks to Derek for sending me an excellent ks.cfg example. Torsten _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From seth at hogg.org Sat Jul 5 04:31:59 2003 From: seth at hogg.org (Simon Hogg) Date: Sat, 05 Jul 2003 09:31:59 +0100 Subject: OT? Opteron suppliers in UK? Message-ID: <4.3.2.7.2.20030705092404.00aa0de0@pop.freeuk.net> Attn: Any Opteron users in the UK I'm looking for an Opteron-based system supplier (nice white-box assembler) in the UK. Can any UK users recommend any suppliers (off-list!) The prices I have seen so far seem a bit steep compared to our American cousins. Thanks in advance, and apologies for the off-topic(?) post (but it is the weekend and just after 4th July, so list traffic is low :-) Simon _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From alvin at Mail.Linux-Consulting.com Sat Jul 5 21:44:09 2003 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Sat, 5 Jul 2003 18:44:09 -0700 (PDT) Subject: Small PCs cluster In-Reply-To: <3F05B96C.6040801@tamu.edu> Message-ID: hi ya On Fri, 4 Jul 2003, Gerry Creager N5JXS wrote: > Relatively speaking the Shuttle cases, while small for a P4 or Athelon > processor class machine, are pretty big compared to the Mini-ITX > systems. However, the heat-pipes seem to do a pretty good job of > off-loading heat and making the heat-exchanger available to ambient air. the folks at mini-box.com has cdrom-sized chassis (1.75" tall) running off +12v DC input ... and we have a mini-itx 1u chassis w/ 2 hd .. good up to p4-3Ghz ( noisier than ?? 
but keeps the cpu nice and cool ) c ya alvin _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From c00jsh00 at nchc.gov.tw Sun Jul 6 05:43:21 2003 From: c00jsh00 at nchc.gov.tw (Jyh-Shyong Ho) Date: Sun, 06 Jul 2003 17:43:21 +0800 Subject: GinGin64 on Opteron References: <20030624032259.48447.qmail@web16809.mail.tpe.yahoo.com> <3EF85B85.1090200@inel.gov> Message-ID: <3F07EF39.7D7110F7@nchc.gov.tw> Hi, This afternoon I tried to install RedHat's GinGig64 on our dual Opteron box (Riowork HDAMA motherboard with 8GB RAM) and found that the installation script failed at the initiation stage of system checking, the installation script only works normally when the memory size is reduced to 4GB (4 1GB RAM). I wonder if anyone has tried this and has the similar finding. On the other hand, SuSE Linux Enterprise Server 8 for AMD64 works fine for system with 8GB RAM. However, Unlike RedHat, SuSE SLES8 does not load 3w-xxxx driver before initiating the installation, so the installation script does not recognize device such as /dev/sda, /dev/sdb, etc, created by 3Ware RAID card earlier. I suspect that part of the reason might be caused by the power supply on my system is not large enough (460W for 9 120GB hard disks, a dual opteron motherboard, and 8GB RAM). I'll replace the power supply and try again next week. Jyh-Shyong Ho, PhD. Research Scientist National Center for High-Performance Computing Hsinchu, Taiwan, ROC > > Andrew Wang wrote: > > > How well the existing tools run on Opteron machines? > > > > Does LAM-MPI or MPICH run in 64-bit mode? Also, has > > anyone tried Gridengine or PBS on it? > > > > Lastly, is there an opensource Opteron compile farm > > that I can access? I would like to see if my code > > really runs correctly on them before buying! > > > > Andrew. > > Most vendors will give you a remote account or send you > an evaluation unit. I imagine you'll probably be > contacted off-list by several of them. > > I've compiled a 64-bit MPICH, GROMACS, and a few other > codes with a GCC 3.3 prerelease. I have also used the > beta PGI compiler with good results. Some build > scripts require slight modification to recognize > x86-64 as an architecture, but most porting is trivial. > GROMACS has some optimized assembly that didn't come > out quite right, but I bet they have it fixed by now. > > All my testing was a couple of weeks before the release, > but I haven't gotten any in yet unfortunately. > > Andrew > > -- > Andrew Shewmaker, Associate Engineer > Phone: 1-208-526-1276 > Idaho National Eng. and Environmental Lab. > P.0. Box 1625, M.S. 3605 > Idaho Falls, Idaho 83415-3605 > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mbosma at atipa.com Mon Jul 7 16:11:32 2003 From: mbosma at atipa.com (Mark Bosma) Date: 07 Jul 2003 15:11:32 -0500 Subject: GinGin64 on Opteron Message-ID: <1057608692.11660.38.camel@atipa-dp> We noticed the same behavior on a dual opteron machine last week that was the same setup as yours - the install script would only work with 4 or less gigs of RAM. 
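(A sketch of the workaround suggested a little further down in this thread, capping the memory the installer sees with a kernel mem= argument; the prompt and image name below are illustrative and depend on the boot media:

    boot: linux mem=4096M

Once the install has finished, booting the installed kernel without the mem= argument should let the OS see the full 8 GB again.)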
Once installation was complete, the full 8 gigs could be installed and the OS seemed to recognize it all. So I've had similar findings, but I haven't had time to find the cause yet. I'd be interested to hear if someone else has. Mark Bosma Atipa Technologies _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hahn at physics.mcmaster.ca Mon Jul 7 16:55:47 2003 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Mon, 7 Jul 2003 16:55:47 -0400 (EDT) Subject: GinGin64 on Opteron In-Reply-To: <1057608692.11660.38.camel@atipa-dp> Message-ID: > similar findings, but I haven't had time to find the cause yet. I'd be > interested to hear if someone else has. I'd guess that the kernel that boots and runs the installer simply isn't configured right (perhaps it's even just an ia32 one). Does the installer work on a >4G machine if you simply give it a mem=4G argument? I'd guess the installer has no use for even 2G of RAM... _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From siegert at sfu.ca Mon Jul 7 17:24:50 2003 From: siegert at sfu.ca (Martin Siegert) Date: Mon, 7 Jul 2003 14:24:50 -0700 Subject: GinGin64 on Opteron In-Reply-To: References: <1057608692.11660.38.camel@atipa-dp> Message-ID: <20030707212450.GA14775@stikine.ucs.sfu.ca> On Mon, Jul 07, 2003 at 04:55:47PM -0400, Mark Hahn wrote: > > similar findings, but I haven't had time to find the cause yet. I'd be > > interested to hear if someone else has. > > I'd guess that the kernel that boots and runs the installer simply > isn't configured right (perhaps it's even just an ia32 one). > > Does the installer work on a >4G machine if you simply give it a mem=4G > argument? I'd guess the installer has no use for even 2G of RAM... I tried GinGin64 on my demo box and it hung almost immediately: the last thing the installer displayed was running /sbin/loader ... Martin _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From adm35 at georgetown.edu Mon Jul 7 18:56:09 2003 From: adm35 at georgetown.edu (Arnold Miles) Date: Mon, 07 Jul 2003 18:56:09 -0400 Subject: Free 3-day seminar in using Beowulf clusters and programming MPI in Washington DC Message-ID: <40ebbe40b61f.40b61f40ebbe@georgetown.edu> All: Georgetown University in Washington DC is hosting a free 3-day workshop/seminar on High Performance Computing, High Throughput Computing and Distributed Computing on August 11, 12, and 13. The main emphasis of this workshop is using Beowulf clusters and writing algorithms and programs for Beowulf clusters using MPI. Information can be found at: http://www.georgetown.edu/research/arc/workshop2.html The first day is general information, and is aimed at anyone with any interest in Beowulf clusters and their use. We encourage project managers, administrators, researchers, faculty, and students to attend, as well as programmers who want to get started using their clusters. The second day will be split between lectures and labs on the use of Jini in distributed computing (Track 1), and parallel programming (Track 2). There will also be a session on using Beowulf clusters as a high throughput tool using Condor.
The third day will be an all day lab in parallel programming with MPI. Track 2 assumes a knowledge of either C, C++ or Fortran. Best of all, this seminar is fully funded by Georgetown University's Information Systems department, so there is no cost to attend this year! Seating for day 2 and day 3 is limited. Contact Arnie Miles at adm35 at georgetown.edu or Steve Moore at moores at georgetown.edu. Hope to see you there. Arnie Miles Systems Administrator: Advanced Research Computing Adjunct Faculty: Computer Science 202.687.9379 168 Reiss Science Building http://www.georgetown.edu/users/adm35 http://www.guppi.arc.georgetown.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From c00jsh00 at nchc.gov.tw Tue Jul 8 00:57:49 2003 From: c00jsh00 at nchc.gov.tw (Jyh-Shyong Ho) Date: Tue, 08 Jul 2003 12:57:49 +0800 Subject: etherchannel Message-ID: <3F0A4F4D.FF742BC4@nchc.gov.tw> Hi, Does anyone know how to set up and configure etherchannel on Linux system? I have a motherboard has two Broadcom gigabit ports, and a 24-port SMC Gigabit TigerSwitch which also has Broadcom chip on it. Both support IEEE 802.3ad protocol which allows to combine two physical LAN ports into a logical one and double the bandwitch.There are several name for such feature, etherchannel is just one of them. I wonder if anyone has try this on a Linux system, say SuSE Enterprise Server 8 or RedHat 9 ? any help or suggestion will be appreciated. Best Regards Jyh-Shyong Ho, PhD. Research Scientist National Center for High-Performance Computing Hsinchu, Taiwan, ROC _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bvds at bvds.geneva.edu Mon Jul 7 23:13:46 2003 From: bvds at bvds.geneva.edu (bvds at bvds.geneva.edu) Date: Mon, 7 Jul 2003 23:13:46 -0400 Subject: semaphore problem with mpich-1.2.5 Message-ID: <200307080313.h683Dk722726@bvds.geneva.edu> I have an Opteron system running GinGin64 with a 2.4.21 kernel and gcc-3.3. I compiled mpich-1.2.5 with --with-comm=shared, but mpirun crashes with the error: semget failed for setnum = 0 This is a known problem with mpich (see http://www-unix.mcs.anl.gov/mpi/mpich/buglist-tbl.html). Has anyone else seen this error? I found a discussion, reprinted below, by Douglas Roberts at LANL (http://www.bohnsack.com/lists/archives/xcat-user/1275.html) His fix worked for me. Does anyone know of a "real" solution? Brett van de Sande ******************************************************************** I think the reason we get sem_get errors is that the operating system is not releasing inter-process communication resources (e.g. semaphores) when a job is finished. It's possible to do this manually. ... I wrote the following script, which removes all the shared memory and semaphore resources held by the user: #! 
/bin/csh foreach id (`ipcs -m | gawk 'NR>4 {print $2}'`) ipcrm shm $id end foreach id (`ipcs -s | gawk 'NR>4 {print $2}'`) ipcrm sem $id end ******************************************************************** _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rgupta at cse.iitkgp.ernet.in Tue Jul 8 04:55:11 2003 From: rgupta at cse.iitkgp.ernet.in (Rakesh Gupta) Date: Tue, 8 Jul 2003 14:25:11 +0530 (IST) Subject: NIS problem .. Message-ID: Hi, I am setting up a small 8 node cluster .. I have installed RedHat 9.0 on all the nodes. Now I want to setup NIS .. I have ypserv , portmap, ypbind running on one of the nodes (The server) on the others I have ypbind and portmap. The NIS Domain is also set in /etc/sysconfig networkk .. Now when I do /var/yp/make .. an error of the following form comes " failed to send 'clear' to local ypserv: RPC: Unknown HostUpdating passwd.byuid " and a sequence of such messages follow.. can anyone please help me with this. Regards Rakesh -- ---------------------------------------------------------------------- Rakesh Gupta Research Consultant Computer Science and Engineering Department IIT Kharagpur West Bengal India - 721302 URL: http://www.crx.iitkgp.ernet.in/~rakesh/ Phone: 09832117500 -------------------------------------------------------------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rene.storm at emplics.com Tue Jul 8 06:42:16 2003 From: rene.storm at emplics.com (Rene Storm) Date: Tue, 8 Jul 2003 12:42:16 +0200 Subject: AW: etherchannel Message-ID: <29B376A04977B944A3D87D22C495FB2301276B@vertrieb.emplics.com> Hi, Take a look at /usr/share/doc/kernel-doc-2.4.18/networking/bonding.txt (at RH 7.3, don't know for higher versions) You will have to recompile ifenslave for network-trunking. This will result in a higher bandwidth, but your latency will grow (don't do that for mpich jobs, won't perform). Before starting to configure I would do some benches (ping, Pallas), cause latency gets really worse. greetings Rene ######################################################################## To install ifenslave.c, do: # gcc -Wall -Wstrict-prototypes -O -I/usr/src/linux/include ifenslave.c -o ifenslave # cp ifenslave /sbin/ifenslave 3) Configure your system ------------------------ Also see the following section on the module parameters. You will need to add at least the following line to /etc/conf.modules (or /etc/modules.conf): alias bond0 bonding Use standard distribution techniques to define bond0 network interface. For example, on modern RedHat distributions, create ifcfg-bond0 file in /etc/sysconfig/network-scripts directory that looks like this: DEVICE=bond0 IPADDR=192.168.1.1 NETMASK=255.255.255.0 NETWORK=192.168.1.0 BROADCAST=192.168.1.255 ONBOOT=yes BOOTPROTO=none USERCTL=no (put the appropriate values for you network instead of 192.168.1). All interfaces that are part of the trunk, should have SLAVE and MASTER definitions. For example, in the case of RedHat, if you wish to make eth0 and eth1 (or other interfaces) a part of the bonding interface bond0, their config files (ifcfg-eth0, ifcfg-eth1, etc.) 
should look like this: DEVICE=eth0 USERCTL=no ONBOOT=yes MASTER=bond0 SLAVE=yes BOOTPROTO=none (use DEVICE=eth1 for eth1 and MASTER=bond1 for bond1 if you have configured a second bonding interface). Restart the networking subsystem or just bring up the bonding device if your administration tools allow it. Otherwise, reboot. (For the case of RedHat distros, you can do `ifup bond0' or `/etc/rc.d/init.d/network restart'.) If the administration tools of your distribution do not support master/slave notation in configuration of network interfaces, you will need to configure the bonding device with the following commands manually: # /sbin/ifconfig bond0 192.168.1.1 up # /sbin/ifenslave bond0 eth0 # /sbin/ifenslave bond0 eth1 ##################################################### -----Original Message----- From: Jyh-Shyong Ho [mailto:c00jsh00 at nchc.gov.tw] Sent: Tuesday, July 8, 2003 06:58 To: beowulf at beowulf.org Subject: etherchannel Hi, Does anyone know how to set up and configure etherchannel on a Linux system? I have a motherboard that has two Broadcom gigabit ports, and a 24-port SMC Gigabit TigerSwitch which also has a Broadcom chip on it. Both support the IEEE 802.3ad protocol, which allows two physical LAN ports to be combined into one logical port and the bandwidth to be doubled. There are several names for this feature; etherchannel is just one of them. I wonder if anyone has tried this on a Linux system, say SuSE Enterprise Server 8 or RedHat 9? Any help or suggestion will be appreciated. Best Regards Jyh-Shyong Ho, PhD. Research Scientist National Center for High-Performance Computing Hsinchu, Taiwan, ROC _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From siegert at sfu.ca Tue Jul 8 15:09:34 2003 From: siegert at sfu.ca (Martin Siegert) Date: Tue, 8 Jul 2003 12:09:34 -0700 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <20030701224808.GA15167@stikine.ucs.sfu.ca> References: <20030701224808.GA15167@stikine.ucs.sfu.ca> Message-ID: <20030708190934.GA16851@stikine.ucs.sfu.ca> On Tue, Jul 01, 2003 at 03:48:08PM -0700, Martin Siegert wrote: > I have a dual AMD Opteron for a week or so as a demo and try to install > Linux on it - so far with little success. > > For those of you who have such a box: which distribution are you using? > Any advice on how to get those GigE Broadcom NICs to work? Thanks to all of you who have responded with suggestions and pointers. In the end this did turn out to be a hardware problem (these NICs plainly did not work) and had nothing to do with the drivers and the distributions that I tried. I am going to get another Opteron box and then will try once more.
Cheers, Martin -- Martin Siegert Manager, Research Services WestGrid Site Manager Academic Computing Services phone: (604) 291-4691 Simon Fraser University fax: (604) 291-4242 Burnaby, British Columbia email: siegert at sfu.ca Canada V5A 1S6 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From math at velocet.ca Tue Jul 8 17:15:18 2003 From: math at velocet.ca (Ken Chase) Date: Tue, 8 Jul 2003 17:15:18 -0400 Subject: lopsisded draw on power supplies Message-ID: <20030708171518.A27289@velocet.ca> So, what's people's experience with PC power supplies and power draw on various voltage lines? We have a buncha old but large SCSI drives here that are somewhat hefty, and we want to power them with as few ATX supplies as possible. We have no motherboard involved (yes, we have to find a hack to get the power on with a signal, but I think its just shorting a couple of the pins in the mobo connector for a sec -- anyone got info on that?). The thing is we'd only be drawing +5 and +12V out of the thing for the drives. Im not sure how much of each really, during operation, but the drives are all listed as max 1.1A +5V and 1.1 or 1.7A +12V (latter for bigger of the 2 types of drives). Even the 300W non-enermax cheapo power supply says it supplies 22A of +12V, which is the limiting factor for # of drives. (It gives 36A of +5V). The 650W enermax monster we have gives 46 +5V and 24 +12V strangely enough (strange because its only 2 more amps of 12 for such a big supply.) Im wondering what will happen if we have a load on only one type of voltage because of no motherboard or other perifs. Is this a lopsided load that we should beef up the power supply for? I dont think we should use a 300W for like 16 odd drives, but perhaps a 400 is enough? Should we go 650? Is it necessary? We'll certainly use enermax for this, with 2 fans in it. How close to the rated max should we go? We're looking at 16 drives here, which is short of the 22 or 24A listed on the supplies. Thanks. /kc -- Ken Chase, math at velocet.ca * Velocet Communications Inc. * Toronto, Canada Wiznet Velocet DSL.ca Datavaults 24/7: 416-967-4414 tollfree: 1-866-353-0363 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From deadline at plogic.com Wed Jul 9 13:12:23 2003 From: deadline at plogic.com (Douglas Eadline) Date: Wed, 9 Jul 2003 13:12:23 -0400 (EDT) Subject: Informal Survey Message-ID: I am curious where everyone gets information on clusters. Obviously this list is one source, but what about other sources. In addition, what kind of information do people most want/need about clusters. Please comment on the following questions if you have the time. You can respond to me directly and I will summarize the results for the list. 1. Where do you find "howto" information on clusters (besides this list) a) Google b) Vendor c) Trade Show d) News Sites (what news sites are there?) e) Other 2. If there were a subscription print/web magazine on clusters, what kind of coverage would you want? 
a) howto information b) new products c) case studies d) benchmarks e) other Thanks, Doug _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mohamed.siddiqu at wipro.com Tue Jul 8 04:45:16 2003 From: mohamed.siddiqu at wipro.com (Mohamed Abubakkar Siddiqu) Date: Tue, 8 Jul 2003 14:15:16 +0530 Subject: etherchannel Message-ID: <6353EB090D04484B9AFF8E257A4BF84D3D5F68@blrhomx2.wipro.co.in> Hi.. U can try Channel Bonding. Check Bonding Documentation from the Kernel source Siddiqu.T -----Original Message----- From: Jyh-Shyong Ho [mailto:c00jsh00 at nchc.gov.tw] Sent: Tuesday, July 08, 2003 10:28 AM To: beowulf at beowulf.org Subject: etherchannel Hi, Does anyone know how to set up and configure etherchannel on Linux system? I have a motherboard has two Broadcom gigabit ports, and a 24-port SMC Gigabit TigerSwitch which also has Broadcom chip on it. Both support IEEE 802.3ad protocol which allows to combine two physical LAN ports into a logical one and double the bandwitch.There are several name for such feature, etherchannel is just one of them. I wonder if anyone has try this on a Linux system, say SuSE Enterprise Server 8 or RedHat 9 ? any help or suggestion will be appreciated. Best Regards Jyh-Shyong Ho, PhD. Research Scientist National Center for High-Performance Computing Hsinchu, Taiwan, ROC _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf **************************Disclaimer************************************ Information contained in this E-MAIL being proprietary to Wipro Limited is 'privileged' and 'confidential' and intended for use only by the individual or entity to which it is addressed. You are notified that any use, copying or dissemination of the information contained in the E-MAIL in any manner whatsoever is strictly prohibited. *************************************************************************** _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From torsten at howard.cc Wed Jul 9 23:21:19 2003 From: torsten at howard.cc (torsten) Date: Wed, 9 Jul 2003 23:21:19 -0400 Subject: Realtek 8139 Message-ID: <20030709232119.5a0a378b.torsten@howard.cc> Hello All, This is an FYI, followed by a request for ethernet card suggestions. My secondary ethernet for my Beowulf cluster is a Realtek 8139 chip D-Link 530TX. I also have this chipset on the motherboard itself. The chipset on the MB works, it seems, my suspicions are because it is only 10MBit. On the subnet, a 100MBit net, it is falling over itself. First, I started getting NFS problems. I google'd and found out that A. The NFS "buffer" is overflowing, or not being cleared adequately. B. The ethernet card is misconfigured. C. The driver is poor or does not match the card. D. The card is defective. I also tried ftp, and after a few megs are transfered, the chip fails to be able to transfer more. I found many mentions of this chipset being the low of the low, and it is driving me nuts. Interestingly, I can IP masq the subnet and connect to the internet, seemingly ok. Just NFS and FTP are dying. Blah. I'm going to purchase some new network cards. 
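(Before swapping hardware, the interface error counters usually show whether the NIC or driver is actually dropping frames; a quick sketch using standard net-tools, where the interface name eth1 is only an assumption:

    /sbin/ifconfig eth1      # look at errors, dropped, overruns, carrier
    netstat -i               # per-interface RX/TX error summary
    cat /proc/net/dev        # the same counters straight from the kernel
    /sbin/mii-tool eth1      # confirm negotiated speed and duplex

A steadily climbing error count under NFS or FTP load points at the chip or driver; clean counters point more toward NFS mount options such as smaller rsize/wsize over UDP.)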
I'm leaning towards 3Com 3c905C-TXM cards because they are cheap enough ($20 pricewatch), PCI, 100MBit, and have PXE roms, and, most of all, are known stable and working under Linux. I would like to solicit ethernet card recommendations before I purchase another mistake. Thanks, Torsten _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From palott at math.umd.edu Wed Jul 9 23:14:31 2003 From: palott at math.umd.edu (P. Aaron Lott) Date: Wed, 9 Jul 2003 23:14:31 -0400 Subject: gentoo cluster Message-ID: <9FD878E4-B284-11D7-96C6-000393DC6E46@math.umd.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi, Our group is interested in building a beowulf cluster using gentoo linux as the OS. Has anyone on the list had experience with this or know anyone who has experience with this? We're trying to figure out the best way to spawn nodes once we have configured one machine properly. Any suggestions such as pseudo kickstart methods would be greatly appreciated. Thanks, Aaron palott at math.umd.edu http://www.lcv.umd.edu/~palott LCV: IPST 4364A (301)405-4865 Office: IPST 4364D (301)405-4843 Fax: (301)314-0827 P. Aaron Lott 1301 Mathematics Building University of Maryland College Park, MD 20742-4015 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.1 (Darwin) iD8DBQE/DNoizzvfVkBO8H4RAhquAJ0XVKDjkHxE6W52eZGNO80YKDJKdwCfSZqP d6iwjdalKhqGI4xHGH4d678= =QcSo -----END PGP SIGNATURE----- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From kpodesta at redbrick.dcu.ie Thu Jul 10 05:17:34 2003 From: kpodesta at redbrick.dcu.ie (Karl Podesta) Date: Thu, 10 Jul 2003 10:17:34 +0100 Subject: gentoo cluster In-Reply-To: <9FD878E4-B284-11D7-96C6-000393DC6E46@math.umd.edu> References: <9FD878E4-B284-11D7-96C6-000393DC6E46@math.umd.edu> Message-ID: <20030710091733.GD1661@prodigy.Redbrick.DCU.IE> On Wed, Jul 09, 2003 at 11:14:31PM -0400, P. Aaron Lott wrote: > Hi, > > Our group is interested in building a beowulf cluster using gentoo > linux as the OS. Has anyone on the list had experience with this or > know anyone who has experience with this? We're trying to figure out > the best way to spawn nodes once we have configured one machine > properly. Any suggestions such as pseudo kickstart methods would be > greatly appreciated. > > Thanks, > > Aaron Not gentoo-specific, but there was a thread a few weeks back where people posted up various (mostly similar) methods they use to clone nodes etc. On an old 23-node beowulf we have, we use a few small homegrown collected perl scripts written by the university networking society. Once configuring a machine, we make an image of it (simple gzip/tar, stores itself on the head node, takes 2 mins), then register the other nodes to 'clone' from this image we've just made, reboot the nodes from a floppy, and they clone themselves from the network at about 2 minutes a piece, takes about 5-10 mins maybe to clone all 23 nodes! Surprisingly quick for a simple ftp/un-tgz over standard ethernet from a single head node. We use the etherboot package to create a boot floppy which we use to boot the nodes, and our scripts modify the DHCP conf file to say which nodes should then be subsequently picked up and which linux kernel they should use to load up. 
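As a rough illustration of the kind of per-node DHCP entry being described here (ISC dhcpd syntax; the MAC address, IP and file name below are made up, and with etherboot the kernel would normally be a tagged image built with mknbi rather than a raw kernel):

    host node01 {
        hardware ethernet 00:50:8b:aa:bb:01;
        fixed-address 192.168.1.101;
        filename "/tftpboot/vmlinuz-clone.nbi";
    }

Re-pointing "filename" at a different image is what selects which kernel a given node loads for cloning versus normal operation.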
The startup scripts that load after the linux kernel ftp the node image down from the head node, un-gzip the image, and un-tar it onto the machine. Hey presto, etc. You could probably write something small yourself using etherboot/DHCP/targz and some alteration of config files, or you could use cloning software like g4u (which I found really slow? It took like 30 minutes to clone a node compared to 2 for our own scripts?), or you could use cluster software like ROCKS. Depends on your time and/or inclination! I'm not sure that simple tar'ing of a filesystem is the completely correct way to go about it, but we don't have many actively live users (at least not when I decide I'm going to clone nodes...), plus it's fast and dirty. So works for us, for now.. Something more 'proper' might require a dd'ing of the disk, or something? Kp -- Karl Podesta + School of Computing, Dublin City University, Ireland + National Institute for Cellular Biotechnology, Ireland _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From daniel at labtie.mmt.upc.es Thu Jul 10 08:14:42 2003 From: daniel at labtie.mmt.upc.es (Daniel Fernandez) Date: 10 Jul 2003 14:14:42 +0200 Subject: Small PCs cluster In-Reply-To: <3F05B96C.6040801@tamu.edu> References: <1057334911.3814.28.camel@qeldroma.cttc.org> <3F05B96C.6040801@tamu.edu> Message-ID: <1057839282.764.20.camel@qeldroma.cttc.org> Hi again, Thanks for the answers, we also checked the Mini-ITX mainboard, but C3 processors don't offer enough FPU raw speed. On the other hand, the integrated nVidia ethernet controller is in fact a Realtek 8201BL, this is our last trouble before we decide what to purchase. Our actual cluster is equipped with 3Com 3c905CX-TX-M ethernet controllers, our doubt is about that Realtek controller because I suspect that Realtek ethernet nics put more load onto the main CPU ? can anyone confirm this ? I suppose that the NIC for cluster of choice is 3Com around there, but... ? how about Realtek NICs under heavy load? If doesn't work well, we can afford an extra 3Com NIC of course. -- Daniel Fernandez Laboratori de Termot?cnia i Energia - CTTC > On Fri, 2003-07-04 at 19:29, Gerry Creager N5JXS wrote: > Relatively speaking the Shuttle cases, while small for a P4 or Athelon > processor class machine, are pretty big compared to the Mini-ITX > systems. However, the heat-pipes seem to do a pretty good job of > off-loading heat and making the heat-exchanger available to ambient air. > > I've not built a cluster so far using this sort of case, but I've got a > lot of past heat-pipe experience. I'd be tring to maintain a low inlet > temperature to the rack, and a fairly high, and (uncharacteristically) > non-laminar airflow through the rack. The idea is to get as much > airflow incident to the heat-pipe heat exchanger as possible. > > We did a fair bit of heat-pipe work while I was at NASA. We found cood > radiative characteristics in heat-pipe heat exchangers (the heat-pipes > wouldn't have worked otherwise!) but they work best when they combine > both convective and radiative modes and use a cool-air transport. > > I've got a number of isolated small-form-factor PCs now running. I've > seen no instability with the integrated components in any of these. 
> > gerry > > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From nashif at planux.com Thu Jul 10 10:25:19 2003 From: nashif at planux.com (Anas Nashif) Date: Thu, 10 Jul 2003 10:25:19 -0400 Subject: SuSE 8.2 for AMD64 Download Message-ID: <3F0D774F.4010908@planux.com> Hi, 8.2 for AMD64 is available on the FTP server: ftp://ftp.suse.com/pub/suse/x86-64/8.2-beta/ Press Release in German: http://www.suse.de/de/company/press/press_releases/archive03/82_x86_64_beta.html Anas _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From becker at scyld.com Thu Jul 10 11:04:44 2003 From: becker at scyld.com (Donald Becker) Date: Thu, 10 Jul 2003 11:04:44 -0400 (EDT) Subject: Small PCs cluster In-Reply-To: <1057839282.764.20.camel@qeldroma.cttc.org> Message-ID: On 10 Jul 2003, Daniel Fernandez wrote: > Thanks for the answers, we also checked the Mini-ITX mainboard, but C3 > processors don't offer enough FPU raw speed. On the other hand, the > integrated nVidia ethernet controller is in fact a Realtek 8201BL, this > is our last trouble before we decide what to purchase. The nVidia Ethernet NIC uses the rtl8201BL _transceiver_. Don't confuse this with the rtl8139 NIC chip, which has the transceiver integrated on the same chip with the NIC. There have been several reports of mediocre performance and kernel problems from using the proprietary, binary-only nVidia driver. It's likely more efficient than the standard rtl8139 interface (before the C+), but it's difficult to know without the driver source. > Our current cluster is equipped with 3Com 3c905CX-TX-M ethernet controllers, > our doubt is about that Realtek controller because I suspect that Realtek > ethernet NICs put more load onto the main CPU? Can anyone confirm this? The 3c905C is one of the best Fast Ethernet NICs available. It does well with everything but multicast filtering. -- Donald Becker becker at scyld.com Scyld Computing Corporation http://www.scyld.com 914 Bay Ridge Road, Suite 220 Scyld Beowulf cluster system Annapolis MD 21403 410-990-9993 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From fant at pobox.com Thu Jul 10 10:25:59 2003 From: fant at pobox.com (Andrew Fant) Date: Thu, 10 Jul 2003 10:25:59 -0400 (EDT) Subject: gentoo cluster In-Reply-To: <9FD878E4-B284-11D7-96C6-000393DC6E46@math.umd.edu> Message-ID: <20030710100848.N15741-100000@net.bluemoon.net> I am in the closing stages of a project to build a 64 CPU Xeon cluster that is using gentoo as its base OS. For installation and the like, I am using Systemimager. It's not perfect, but it has the decided advantage of not depending on any particular packaging system to handle the installs. You will probably want an HTTP proxy on a head node to simplify the installation process. I just did a manual install of the O/S on the head nodes and on one of the compute nodes, and cloned from there, though if you want further automation, there is a gentoo installer project on sourceforge, iirc, or you can script most of it in sh, of course.
Are you planning to run commercial apps on this cluster, or will it be primarily user developed code? I have found that most commercial apps can be coerced into running under gentoo, but modifying their installed scripts may be something of a PITA, and you almost certainly will get to be good friends with rpm2targz. One last caveat. Depending on how "production" you are going to make this cluster, you may need to be a little less aggressive about updating ebuilds and which versions of packages you install. A good regression test suite is good to have if you have layered software to install which isn't part of an ebuild to start. I'd be glad to talk to anyone else who has an interest in gentoo-based beowulfish clusters. In spite of the extra engineering work, I am pleased with the results. Andy Andrew Fant | This | "If I could walk THAT way... Molecular Geek | Space | I wouldn't need the talcum powder!" fant at pobox.com | For | G. Marx (apropos of Aerosmith) Boston, MA USA | Hire | http://www.pharmawulf.com On Wed, 9 Jul 2003, P. Aaron Lott wrote: > Our group is interested in building a beowulf cluster using gentoo > linux as the OS. Has anyone on the list had experience with this or > know anyone who has experience with this? We're trying to figure out > the best way to spawn nodes once we have configured one machine > properly. Any suggestions such as pseudo kickstart methods would be > greatly appreciated. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From c00jsh00 at nchc.gov.tw Thu Jul 10 05:23:44 2003 From: c00jsh00 at nchc.gov.tw (Jyh-Shyong Ho) Date: Thu, 10 Jul 2003 17:23:44 +0800 Subject: PVM Message-ID: <3F0D30A0.D572627A@nchc.gov.tw> Hi, I installed pvm-3.4.4-190.x86_64.rpm on my dual Opteron box running SLSE8 for AMD64, I got the following message: > pvm libpvm [pid1483]: mxfer() mxinput bad return on pvmd sock libpvm [pid1483] mksocs() connect: No such file or directory libpvm [pid1483] socket address tried: /tmp/pvmtmp001485.0 libpvm [pid1483]: Console: Can't contact local daemon I wonder if someone knows what is the reason causes this problem? Thanks for any suggestion and help. Best Regards Jyh-Shyong Ho, PhD. Research Scientist National Center for High-Performance Computing Hsinchu, Taiwan, ROC _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jducom at nd.edu Thu Jul 10 11:49:32 2003 From: jducom at nd.edu (Jean-Christophe Ducom) Date: Thu, 10 Jul 2003 10:49:32 -0500 Subject: etherchannel References: <6353EB090D04484B9AFF8E257A4BF84D3D5F68@blrhomx2.wipro.co.in> Message-ID: <3F0D8B0C.40209@nd.edu> Or you can have a look at: http://www.st.rim.or.jp/~yumo/ JC Mohamed Abubakkar Siddiqu wrote: > Hi.. > > > > U can try Channel Bonding. Check Bonding Documentation from the Kernel source > > Siddiqu.T > > > > -----Original Message----- > From: Jyh-Shyong Ho [mailto:c00jsh00 at nchc.gov.tw] > Sent: Tuesday, July 08, 2003 10:28 AM > To: beowulf at beowulf.org > Subject: etherchannel > > > Hi, > > Does anyone know how to set up and configure etherchannel > on a Linux system? > > I have a motherboard that has two Broadcom gigabit ports, and > a 24-port SMC Gigabit TigerSwitch which also has a Broadcom > chip on it.
Both support the IEEE 802.3ad protocol which allows > you to combine two physical LAN ports into a logical one and > double the bandwidth. There are several names for such a feature, > etherchannel is just one of them. > > I wonder if anyone has tried this on a Linux system, say > SuSE Enterprise Server 8 or RedHat 9? Any help or suggestion > will be appreciated. > > Best Regards > > Jyh-Shyong Ho, PhD. > Research Scientist > National Center for High-Performance Computing > Hsinchu, Taiwan, ROC > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > **************************Disclaimer************************************ > > Information contained in this E-MAIL being proprietary to Wipro Limited is > 'privileged' and 'confidential' and intended for use only by the individual > or entity to which it is addressed. You are notified that any use, copying > or dissemination of the information contained in the E-MAIL in any manner > whatsoever is strictly prohibited. > > *************************************************************************** > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From nordquist at geosci.uchicago.edu Thu Jul 10 01:37:48 2003 From: nordquist at geosci.uchicago.edu (Russell Nordquist) Date: Thu, 10 Jul 2003 00:37:48 -0500 (CDT) Subject: gentoo cluster In-Reply-To: <9FD878E4-B284-11D7-96C6-000393DC6E46@math.umd.edu> Message-ID: On Wed, 9 Jul 2003 at 23:14, P. Aaron Lott wrote: > > Hi, > > Our group is interested in building a beowulf cluster using gentoo > linux as the OS. Has anyone on the list had experience with this or > know anyone who has experience with this? We're trying to figure out > the best way to spawn nodes once we have configured one machine > properly. Any suggestions such as pseudo kickstart methods would be > greatly appreciated. > If all the nodes are identical hw-wise, systemimager (with network boot) is an easy way to go for any flavor of linux. come to think of it, they may not need to be that identical as long as your kernel supports the hardware. a search for "cloning" on freshmeat gives a few others. i'd be interested in how your gentoo-beowulf goes...i'm sure someone else is running one, but i don't know of any. russell > Thanks, > > Aaron > > > > palott at math.umd.edu > http://www.lcv.umd.edu/~palott > LCV: IPST 4364A (301)405-4865 > Office: IPST 4364D (301)405-4843 > Fax: (301)314-0827 > > P.
Aaron Lott > 1301 Mathematics Building > University of Maryland > College Park, MD 20742-4015 > > > - - - - - - - - - - - - Russell Nordquist UNIX Systems Administrator Geophysical Sciences Computing http://geosci.uchicago.edu/computing NSIT, University of Chicago - - - - - - - - - - - _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From nordquist at geosci.uchicago.edu Thu Jul 10 12:03:30 2003 From: nordquist at geosci.uchicago.edu (Russell Nordquist) Date: Thu, 10 Jul 2003 11:03:30 -0500 (CDT) Subject: Small PCs cluster In-Reply-To: Message-ID: On Thu, 10 Jul 2003 at 11:04, Donald Becker wrote: > On 10 Jul 2003, Daniel Fernandez wrote: > > > The 3c905C is one of the best Fast Ethernet NICs available. > It does well with everything but multicast filtering. Could you elaborate on it's issues with multicast filtering (or point me somewhere)? I am having some problems with multicast on a multihomed box with these NICs and this is the first I have heard of this. thanks russell > > -- > Donald Becker becker at scyld.com > Scyld Computing Corporation http://www.scyld.com > 914 Bay Ridge Road, Suite 220 Scyld Beowulf cluster system > Annapolis MD 21403 410-990-9993 > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > - - - - - - - - - - - - Russell Nordquist UNIX Systems Administrator Geophysical Sciences Computing http://geosci.uchicago.edu/computing NSIT, University of Chicago - - - - - - - - - - - _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From vanne at venda.uku.fi Thu Jul 10 10:06:35 2003 From: vanne at venda.uku.fi (Antti Vanne) Date: Thu, 10 Jul 2003 17:06:35 +0300 (EEST) Subject: kernel level ip-config and nic driver as a module Message-ID: Hi, I'm building my second beowulf cluster and ran into trouble with 3com 940 network interface chip that is embedded in the mobo. DHCP works fine, client gets IP, but tftp won't load the pxelinux.0, it tries twice (according to the in.tftpd's log), but the client doesn't try to look for pxelinux.cfg/C0... config files. I have one similar setup working using the Intel e1000, and according to http://syslinux.zytor.com/hardware.php there's been trouble with 3com cards, so I figure the fault is not in the config but in the network chip. The best option would be PXE (anyone have a working pxe setup with 3c940?), but since it seems impossible, I'm trying to boot clients from floppy and use nfsroot: however the driver for 3c940 is available (from www.asus.com) only as kernel module, and unfortunately kernel runs ip-config before loading the module from initrd?!? How is this fixed? I'm not really a kernel hacker, obviously one could browse the kernel source and look for ip-config and module loading, but isn't there any easier way to change the boot sequence so that network module would be loaded before running ip-config? Any help would be greatly appreciated. If there is no easy way to change the order, what would be the next thing to do? Have minimal root filesystem on the floppy and then nfs-mount /usr etc. from the server? 
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From samhdaniel at earthlink.net Thu Jul 10 13:33:50 2003 From: samhdaniel at earthlink.net (Sam Daniel) Date: 10 Jul 2003 13:33:50 -0400 Subject: ClusterWorld Message-ID: <1057858430.4664.4.camel@wulf> Didn't anyone attend? Doesn't anyone have anything to say about it? How were the sessions? Will there be any Proceedings available? Etc., etc., etc.... If not on this list, then where? -- Sam Come out in the open with Linux. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From twhitcomb at apl.washington.edu Thu Jul 10 16:52:50 2003 From: twhitcomb at apl.washington.edu (Timothy R. Whitcomb) Date: Thu, 10 Jul 2003 13:52:50 -0700 (PDT) Subject: help! MPI Calls not responding... Message-ID: We are trying to run the Navy's COAMPS atmospheric model on a Scyld Beowulf cluster, using the Portland Group FORTRAN compiler. The cluster is comprised of five nodes, each with dual AMD processors. After some modification to the supplied Makefile, the software now compiles and fully links. The makefile was modified to use the following options for the compiler ----------------------------------------------- "EXTRALIBS= -L/usr/lib -lmpi -lmpich -lpmpich -lbproc -lbpsh -lpvfs -lbeomap -lbeostat -ldl -llapack -lblas -lparpack_LINUX -L/usr/coamps3/lib -lfnoc -L/usr/lib/gcc-lib/i386-redhat-linux/2.96 -lg2c" ----------------------------------------------- However, when we try to run the code using mpirun -allcpus atmos_forecast.exe or mpprun -allcpus atmos_forecast.exe in a Perl script, it gives the following error: ----------------------------------------------- Fatal error; unknown error handler May be MPI call before MPI_INIT. Error message is MPI_INIT and code is 208 Fatal error; unknown error handler May be MPI call before MPI_INIT. Error message is MPI_COMM_RANK and code is 197 Fatal error; unknown error handler May be MPI call before MPI_INIT. Error message is MPI_COMM_SIZE and code is 197 NOT ENOUGH COMPUTATIONAL PROCESSES Fatal error; unknown error handler May be MPI call before MPI_INIT. Error message is MPI_ABORT and code is 197 Fatal error; unknown error handler May be MPI call before MPI_INIT. Error message is MPI_BARRIER and code is 197 ----------------------------------------------- where the NOT ENOUGH COMPUTATIONAL PROCESSES is a program message that indicates that you've specified to use more processors than available. The offending section of code is ----------------------------------------------- call MPI_INIT(ierr_mpi) call MPI_COMM_RANK(MPI_COMM_WORLD, npr, ierr_mpi) call MPI_COMM_SIZE(MPI_COMM_WORLD, nprtot, ierr_mpi) ----------------------------------------------- I modified this code to add a call to MPI_INITIALIZED after the MPI_INIT call which indicated that the MPI_INIT just plain was not working. If it makes any difference, I can run the Beowulf demos (like mpi-mandel or linpack) just fine on the multiple processors. What is going on here and how do we fix it? We're new to cluster computing, and this is getting over our heads. I've tried to supply the information I thought was relevant but as this project is proving to me what I think doesn't do me much good. Thanks in advance... 
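In case it helps anyone spot the problem, this is the sort of bare-bones C test I can build alongside the Fortran code (just a sketch, compiled and linked with the same mpirun/link line as above) -- if even this fails, the trouble is in the MPI installation or link flags rather than in COAMPS itself:
-----------------------------------------------
/* mpi_sanity.c - minimal check that MPI_Init and friends work at all */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* must be the first MPI call */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* same calls the model makes */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("process %d of %d is alive\n", rank, size);
    MPI_Finalize();
    return 0;
}
-----------------------------------------------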
Tim Whitcomb twhitcomb at apl.washington.edu University of Washington Applied Physics Laboratory _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bob at drzyzgula.org Thu Jul 10 18:13:59 2003 From: bob at drzyzgula.org (Bob Drzyzgula) Date: Thu, 10 Jul 2003 18:13:59 -0400 Subject: batch software In-Reply-To: <1057857552.73501@accufo.vwh.net> References: <1057857552.73501@accufo.vwh.net> Message-ID: <20030710181359.I14673@www2> Grid Engine. Free, open source. Binaries are available for Tru64. http://gridengine.sunsource.net/ --Bob Drzyzgula On Thu, Jul 10, 2003 at 11:19:13AM -0600, sfrolov at accufo.vwh.net wrote: > > Can anybody recommend a good (and cheap) batch software for an alpha cluster running true64 Unix? Unfortunately we cannot afford to spend more than $300 on this at the moment. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From sfrolov at accufo.vwh.net Thu Jul 10 13:19:13 2003 From: sfrolov at accufo.vwh.net (sfrolov at accufo.vwh.net) Date: Thu, 10 Jul 2003 11:19:13 -0600 (MDT) Subject: batch software Message-ID: <1057857552.73501@accufo.vwh.net> Can anybody recommend a good (and cheap) batch software for an alpha cluster running true64 Unix? Unfortunately we cannot afford to spend more than $300 on this at the moment. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From c00jsh00 at nchc.gov.tw Thu Jul 10 20:13:27 2003 From: c00jsh00 at nchc.gov.tw (Jyh-Shyong Ho) Date: Fri, 11 Jul 2003 08:13:27 +0800 Subject: queueing system for x86-64 Message-ID: <3F0E0127.8A50A8CB@nchc.gov.tw> Hi, I wonder if someone knows where can I find a queueing system like OpenPBS for x86-64 (AMD Opteron) ? Best Regards Jyh-Shyong Ho, PhD. Research Scientist National Center for High-performance Computing Hsinchu, Taiwan, ROC _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From andrewxwang at yahoo.com.tw Thu Jul 10 21:22:18 2003 From: andrewxwang at yahoo.com.tw (=?big5?q?Andrew=20Wang?=) Date: Fri, 11 Jul 2003 09:22:18 +0800 (CST) Subject: batch software In-Reply-To: <1057857552.73501@accufo.vwh.net> Message-ID: <20030711012218.72314.qmail@web16810.mail.tpe.yahoo.com> Sun's Gridengine is very good, it's free and opensource. http://gridengine.sunsource.net/ (IMO, I think it is even better than commercial software like PBSPro or LSF). Andrew. --- sfrolov at accufo.vwh.net ???? > Can anybody recommend a good (and cheap) batch > software for an alpha cluster running true64 Unix? > Unfortunately we cannot afford to spend more than > $300 on this at the moment. 
> _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or > unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ----------------------------------------------------------------- ??? Yahoo!?? ??????? - ???????????? http://fate.yahoo.com.tw/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From nordquist at geosci.uchicago.edu Thu Jul 10 18:05:48 2003 From: nordquist at geosci.uchicago.edu (Russell Nordquist) Date: Thu, 10 Jul 2003 17:05:48 -0500 (CDT) Subject: batch software In-Reply-To: <1057857552.73501@accufo.vwh.net> Message-ID: Take a look at Sun Grid Engine....there are binaries for True64 (or source) and it's free. You may want to look at running maui scheduler on top of it. http://www.supercluster.org/maui/ russell On Thu, 10 Jul 2003 at 11:19, sfrolov at accufo.vwh.net wrote: > Can anybody recommend a good (and cheap) batch software for an alpha cluster running true64 Unix? Unfortunately we cannot afford to spend more than $300 on this at the moment. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > - - - - - - - - - - - - Russell Nordquist UNIX Systems Administrator Geophysical Sciences Computing http://geosci.uchicago.edu/computing NSIT, University of Chicago - - - - - - - - - - - _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From andrewxwang at yahoo.com.tw Fri Jul 11 01:08:50 2003 From: andrewxwang at yahoo.com.tw (=?big5?q?Andrew=20Wang?=) Date: Fri, 11 Jul 2003 13:08:50 +0800 (CST) Subject: queueing system for x86-64 In-Reply-To: <3F0E0127.8A50A8CB@nchc.gov.tw> Message-ID: <20030711050850.30031.qmail@web16811.mail.tpe.yahoo.com> Has anyone tried Gridengine on Opteron? I think the existing x86 binary should work, binary download: http://gridengine.sunsource.net/project/gridengine/download.html If it doesn't, just subscribe to the users list, there are a lot of helpful people. http://gridengine.sunsource.net/project/gridengine/maillist.html Another reason I like SGE is because it has Chinese User/Admin manual: http://www.sun.com/products-n-solutions/hardware/docs/Software/Sun_Grid_Engine/ Andrew. --- Jyh-Shyong Ho ???? > Hi, > > I wonder if someone knows where can I find a > queueing system like > OpenPBS > for x86-64 (AMD Opteron) ? > > Best Regards > > Jyh-Shyong Ho, PhD. > Research Scientist > National Center for High-performance Computing > Hsinchu, Taiwan, ROC > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or > unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ----------------------------------------------------------------- ??? Yahoo!?? ??????? - ???????????? 
http://fate.yahoo.com.tw/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jeffrey.b.layton at lmco.com Fri Jul 11 12:13:08 2003 From: jeffrey.b.layton at lmco.com (Jeff Layton) Date: Fri, 11 Jul 2003 12:13:08 -0400 Subject: MPICH 1.2.5 failures (net_recv) Message-ID: <3F0EE214.6000602@lmco.com> Good afternoon! Our cluster has been recently upgraded (from a 2.2 kernel to a 2.4 kernel). I've built MPICH-1.2.5 on it using the PGI 4.1 compilers, with the following configuration: ./configure --prefix=/home/g593851/BIN/mpich-1.2.5/pgi \ --with-ARCH=LINUX \ --with-device=ch_p4 \ --without-romio --without-mpe \ -opt=-O2 \ -cc=/usr/pgi/linux86/bin/pgcc \ -fc=/usr/pgi/linux86/bin/pgf90 \ -clinker=/usr/pgi/linux86/bin/pgcc \ -flinker=/usr/pgi/linux86/bin/pgf90 \ -f90=/usr/pgi/linux86/bin/pgf90 \ -f90linker=/usr/pgi/linux86/bin/pgf90 \ -c++=/usr/pgi/linux86/bin/pgCC \ -c++linker=/usr/pgi/linux86/bin/pgCC I've built the 'cpi' and 'fpi' examples in the examples/basic directory and tried running them using the following mpirun line: /home/g593851/BIN/mpich-1.2.5/pgi/bin/mpirun -np 10 -machinefile PBS_NODEFILE cpi where PBS_NODEFILE is, penguin1 penguin1 penguin2 penguin2 penguin3 penguin3 penguin4 penguin4 penguin5 penguin5 (however, I'm testing outside of PBS). The code seems to hang fo quite a while and then I get the following: p0_14235: (935.961023) net_recv failed for fd = 10 p0_14235: p4_error: net_recv read, errno = : 110 p2_12406: (935.817898) net_send: could not write to fd=7, errno = 104 /home/g593851/BIN/mpich-1.2.5/pgi/bin/mpirun: line 1: 14235 Broken pipe /home/g593851/src/mpich-1.2.5/examples/basic/cpi -p4pg /home/g593851/src/mpich-1.2.5/examples/basic/PI13983 -p4wd /home/g593851/src/mpich-1.2.5/examples/basic More system details - It's a RH 7.1 OS, but with a stock 2.4.20 kernel. The interconnect is FastE through a Foundry switch and the NICS are Intel EEPro100 (using the eepro100 driver). Does anybody have any ideas? I've I searched around the net a bit and the results were inconclusive ("use LAM instead", may have bad NIC drivers, problematic TCP stack, etc.). TIA! Jeff -- Dr. Jeff Layton Chart Monkey - Aerodynamics and CFD Lockheed-Martin Aeronautical Company - Marietta _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From siegert at sfu.ca Fri Jul 11 13:11:07 2003 From: siegert at sfu.ca (Martin Siegert) Date: Fri, 11 Jul 2003 10:11:07 -0700 Subject: MPICH 1.2.5 failures (net_recv) In-Reply-To: <3F0EE214.6000602@lmco.com> References: <3F0EE214.6000602@lmco.com> Message-ID: <20030711171107.GA29718@stikine.ucs.sfu.ca> On Fri, Jul 11, 2003 at 12:13:08PM -0400, Jeff Layton wrote: > Good afternoon! > > Our cluster has been recently upgraded (from a 2.2 kernel to a 2.4 > kernel). 
I've built MPICH-1.2.5 on it using the PGI 4.1 compilers, > with the following configuration: > > ./configure --prefix=/home/g593851/BIN/mpich-1.2.5/pgi \ > --with-ARCH=LINUX \ > --with-device=ch_p4 \ > --without-romio --without-mpe \ > -opt=-O2 \ > -cc=/usr/pgi/linux86/bin/pgcc \ > -fc=/usr/pgi/linux86/bin/pgf90 \ > -clinker=/usr/pgi/linux86/bin/pgcc \ > -flinker=/usr/pgi/linux86/bin/pgf90 \ > -f90=/usr/pgi/linux86/bin/pgf90 \ > -f90linker=/usr/pgi/linux86/bin/pgf90 \ > -c++=/usr/pgi/linux86/bin/pgCC \ > -c++linker=/usr/pgi/linux86/bin/pgCC > > > I've built the 'cpi' and 'fpi' examples in the examples/basic directory > and tried running them using the following mpirun line: > > > /home/g593851/BIN/mpich-1.2.5/pgi/bin/mpirun -np 10 -machinefile > PBS_NODEFILE cpi > > > where PBS_NODEFILE is, > > penguin1 > penguin1 > penguin2 > penguin2 > penguin3 > penguin3 > penguin4 > penguin4 > penguin5 > penguin5 > > (however, I'm testing outside of PBS). The code seems to hang fo > quite a while and then I get the following: > > p0_14235: (935.961023) net_recv failed for fd = 10 > p0_14235: p4_error: net_recv read, errno = : 110 > p2_12406: (935.817898) net_send: could not write to fd=7, errno = 104 > /home/g593851/BIN/mpich-1.2.5/pgi/bin/mpirun: line 1: 14235 Broken > pipe /home/g593851/src/mpich-1.2.5/examples/basic/cpi -p4pg > /home/g593851/src/mpich-1.2.5/examples/basic/PI13983 -p4wd > /home/g593851/src/mpich-1.2.5/examples/basic > > > More system details - It's a RH 7.1 OS, but with a stock 2.4.20 > kernel. The interconnect is FastE through a Foundry switch and the > NICS are Intel EEPro100 (using the eepro100 driver). > Does anybody have any ideas? I've I searched around the net a bit and > the results were inconclusive ("use LAM instead", may have bad NIC > drivers, problematic TCP stack, etc.). I think you sent this to the wrong mailing list. As outlined on the MPICH home page problem reports should go to mpi-maint at mcs.anl.gov The folks at Argonne are usually extremly helpful with solving problems. Cheers, Martin -- Martin Siegert Manager, Research Services WestGrid Site Manager Academic Computing Services phone: (604) 291-4691 Simon Fraser University fax: (604) 291-4242 Burnaby, British Columbia email: siegert at sfu.ca Canada V5A 1S6 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at keyresearch.com Fri Jul 11 13:55:10 2003 From: lindahl at keyresearch.com (Greg Lindahl) Date: Fri, 11 Jul 2003 10:55:10 -0700 Subject: MPICH 1.2.5 failures (net_recv) In-Reply-To: <3F0EE214.6000602@lmco.com> References: <3F0EE214.6000602@lmco.com> Message-ID: <20030711175510.GA3185@greglaptop.greghome.keyresearch.com> On Fri, Jul 11, 2003 at 12:13:08PM -0400, Jeff Layton wrote: > p0_14235: (935.961023) net_recv failed for fd = 10 > p0_14235: p4_error: net_recv read, errno = : 110 It's a shame that so many programs don't print human-readable error messages. errno 110 is ETIMEDOUT. error 104 is ECONNRESET, but I would suspect that it's a secondary error generated by p0 exiting from the errno 110. 
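If you ever want the text rather than the number, strerror() will hand it to you -- a trivial sketch (the hard-coded default is just for illustration):

/* errname.c - print the human-readable message for an errno value */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int e = (argc > 1) ? atoi(argv[1]) : 110;   /* 110 = the ETIMEDOUT above */
    printf("errno %d: %s\n", e, strerror(e));
    return 0;
}

On Linux, './errname 110' prints "Connection timed out" and './errname 104' prints "Connection reset by peer", which is rather friendlier than what p4 gives you.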
greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From AlberT at SuperAlberT.it Fri Jul 11 06:35:21 2003 From: AlberT at SuperAlberT.it (AlberT) Date: Fri, 11 Jul 2003 12:35:21 +0200 Subject: PVM In-Reply-To: <3F0D30A0.D572627A@nchc.gov.tw> References: <3F0D30A0.D572627A@nchc.gov.tw> Message-ID: <200307111235.21746.AlberT@SuperAlberT.it> On Thursday 10 July 2003 11:23, Jyh-Shyong Ho wrote: > Hi, > > I installed pvm-3.4.4-190.x86_64.rpm on my dual Opteron box > > running SLSE8 for AMD64, I got the following message: > > pvm > > libpvm [pid1483]: mxfer() mxinput bad return on pvmd sock > libpvm [pid1483] mksocs() connect: No such file or directory > libpvm [pid1483] socket address tried: /tmp/pvmtmp001485.0 > libpvm [pid1483]: Console: Can't contact local daemon > > I wonder if someone knows what is the reason causes this problem? > Thanks for any suggestion and help. are ou sure pvmd is running ??? check it using ps -axu | grep pvm -- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From exa at kablonet.com.tr Fri Jul 11 05:17:58 2003 From: exa at kablonet.com.tr (Eray Ozkural) Date: Fri, 11 Jul 2003 12:17:58 +0300 Subject: gentoo cluster In-Reply-To: <9FD878E4-B284-11D7-96C6-000393DC6E46@math.umd.edu> References: <9FD878E4-B284-11D7-96C6-000393DC6E46@math.umd.edu> Message-ID: <200307111217.58060.exa@kablonet.com.tr> On Thursday 10 July 2003 06:14, P. Aaron Lott wrote: > Hi, > > Our group is interested in building a beowulf cluster using gentoo > linux as the OS. Has anyone on the list had experience with this or > know anyone who has experience with this? We're trying to figure out > the best way to spawn nodes once we have configured one machine > properly. Any suggestions such as pseudo kickstart methods would be > greatly appreciated. I investigated this a while ago. It turns out that gentoo isn't really geared towards cluster use, but once you've customized it it can be pretty easy to use a system replication tool. I guess gentoo could benefit from a standardized HPC clustering solution, including parallel system libraries and tools. Thanks, -- Eray Ozkural (exa) Comp. Sci. Dept., Bilkent University, Ankara KDE Project: http://www.kde.org www: http://www.cs.bilkent.edu.tr/~erayo Malfunction: http://mp3.com/ariza GPG public key fingerprint: 360C 852F 88B0 A745 F31B EA0F 7C07 AE16 874D 539C _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jakob at unthought.net Sun Jul 13 15:17:42 2003 From: jakob at unthought.net (Jakob Oestergaard) Date: Sun, 13 Jul 2003 21:17:42 +0200 Subject: NIS problem .. In-Reply-To: References: Message-ID: <20030713191742.GA10670@unthought.net> On Tue, Jul 08, 2003 at 02:25:11PM +0530, Rakesh Gupta wrote: > > > Hi, > I am setting up a small 8 node cluster .. I have installed RedHat 9.0 > on all the nodes. > Now I want to setup NIS .. I have ypserv , portmap, ypbind running on > one of the nodes (The server) on the others I have ypbind and portmap. > > The NIS Domain is also set in /etc/sysconfig networkk .. > > Now when I do /var/yp/make .. 
an error of the following form comes > > " failed to send 'clear' to local ypserv: RPC: Unknown HostUpdating > passwd.byuid " > > and a sequence of such messages follow.. > > can anyone please help me with this. What's in your /var/yp/ypservers file? Does it include the NIS server? Are you sure that whatever hostname(s) you have there is resolvable? Do you have 'localhost' (and the name for the local host used in the ypservers file) in your /etc/hosts file? Are you sure you don't have any fancy firewalling enabled by accident? I'm shooting in the dark here... I haven't seen that particular problem on a NIS server before. It just looks like somehow it cannot contact the local host, which is weird... As a last resort, I would suggest looking thru the makefile, to see exactly which command fails. Once you have isolated the single command to run to get the error message you see, try running it under "strace". Then it should be pretty clear exactly which system call fails, and from there on you might be able to guess why it attempts to make that call. I haven't needed to go thru that routine with a NIS server yet... Usually turning on debugging information, and double-checking the configuration files should do it. My NIS server and slave is on Debian 3 now though, and I don't know if there are any particular oddities in the RedHat 9 setup. -- ................................................................ : jakob at unthought.net : And I see the elder races, : :.........................: putrid forms of man : : Jakob ?stergaard : See him rise and claim the earth, : : OZ9ABN : his downfall is at hand. : :.........................:............{Konkhra}...............: _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From msnitzer at lnxi.com Mon Jul 14 16:03:33 2003 From: msnitzer at lnxi.com (Mike Snitzer) Date: Mon, 14 Jul 2003 14:03:33 -0600 Subject: MPICH 1.2.5 failures (net_recv) In-Reply-To: <3F0EE214.6000602@lmco.com>; from jeffrey.b.layton@lmco.com on Fri, Jul 11, 2003 at 12:13:08PM -0400 References: <3F0EE214.6000602@lmco.com> Message-ID: <20030714140333.A10106@lnxi.com> On Fri, Jul 11 2003 at 10:13, Jeff Layton wrote: > Good afternoon! > > Our cluster has been recently upgraded (from a 2.2 kernel to a 2.4 > kernel). I've built MPICH-1.2.5 on it using the PGI 4.1 compilers, > with the following configuration: ... > Does anybody have any ideas? I've I searched around the net a bit and > the results were inconclusive ("use LAM instead", may have bad NIC > drivers, problematic TCP stack, etc.). Hey jeff, you might try compiling mpich with gcc to eliminate PGI as a potential source of error. This would at least allow you to verify the integrity of the drivers, tcp stack, nic, etc. PGI should be perfectly fine given the minimal mpich configure you provided but the compiler is one variable that is easy enough to eliminate as a potential problem. If you see the same problem with gcc compiled mpich then there is a deeper issue. You might confine the mpirun to use only 2 nodes and then scale up accordingly. 
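As a concrete starting point, a gcc-only build of the same tree could be configured something like this (untested sketch -- adjust the prefix for your site; g77 stands in for pgf90, so it only covers the C and F77 examples such as cpi and fpi):

./configure --prefix=/home/g593851/BIN/mpich-1.2.5/gcc \
            --with-ARCH=LINUX \
            --with-device=ch_p4 \
            --without-romio --without-mpe \
            -opt=-O2 \
            -cc=gcc -clinker=gcc \
            -fc=g77 -flinker=g77

If cpi runs cleanly across all five nodes with that build, the PGI toolchain goes back on the suspect list; if it still times out, look harder at the eepro100 driver and the switch.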
regards, mike -- Mike Snitzer msnitzer at lnxi.com Linux Networx http://www.lnxi.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From msnitzer at lnxi.com Mon Jul 14 16:35:41 2003 From: msnitzer at lnxi.com (Mike Snitzer) Date: Mon, 14 Jul 2003 14:35:41 -0600 Subject: queueing system for x86-64 In-Reply-To: <3F0E0127.8A50A8CB@nchc.gov.tw>; from c00jsh00@nchc.gov.tw on Fri, Jul 11, 2003 at 08:13:27AM +0800 References: <3F0E0127.8A50A8CB@nchc.gov.tw> Message-ID: <20030714143541.B10106@lnxi.com> On Thu, Jul 10 2003 at 18:13, Jyh-Shyong Ho wrote: > Hi, > > I wonder if someone knows where can I find a queueing system like > OpenPBS > for x86-64 (AMD Opteron) ? hello, If you'd like to use OpenPBS on x86-64 it works fine.. once you patch the buildutils/config.guess accordingly. An ia64 patch is available here: http://www.osc.edu/~troy/pbs/patches/config-ia64-2.3.12.diff you'll need to replace all instances of 'ia64' with 'x86_64' in the patch. fyi, you'll likely also need a patch to get gcc3.x to work with OpnePBS's makedepend-sh; search google with: makedepend openpbs gcc3 regards, mike -- Mike Snitzer msnitzer at lnxi.com Linux Networx http://www.lnxi.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From c00jsh00 at nchc.gov.tw Tue Jul 15 00:23:18 2003 From: c00jsh00 at nchc.gov.tw (Jyh-Shyong Ho) Date: Tue, 15 Jul 2003 12:23:18 +0800 Subject: PVM References: <3F0D30A0.D572627A@nchc.gov.tw> <200307111235.21746.AlberT@SuperAlberT.it> Message-ID: <3F1381B6.E423FA07@nchc.gov.tw> Hi, Thanks for the message. I checked and found that pvmd is not running, when I ran pvmd to initiate the daemon, it aborted immediately: c00jsh00 at Zephyr:~> pvmd /tmp/pvmtmp012493.0 Aborted Here are the environment variables: export PVM_ROOT=/usr/lib/pvm3 export PVM_ARCH=X86_64 export PVM_DPATH=$PVM_ROOT/lib/pvmd export PVM_TMP=/tmp export PVM=$PVM_ROOT/lib/pvm Perhaps someone knows what might be wrong. Jyh-Shyong Ho, PhD. Research Scientist National Center for High-Performance Computing Hsinchu, Taiwan, ROC AlberT wrote: > > On Thursday 10 July 2003 11:23, Jyh-Shyong Ho wrote: > > Hi, > > > > I installed pvm-3.4.4-190.x86_64.rpm on my dual Opteron box > > > > running SLSE8 for AMD64, I got the following message: > > > pvm > > > > libpvm [pid1483]: mxfer() mxinput bad return on pvmd sock > > libpvm [pid1483] mksocs() connect: No such file or directory > > libpvm [pid1483] socket address tried: /tmp/pvmtmp001485.0 > > libpvm [pid1483]: Console: Can't contact local daemon > > > > I wonder if someone knows what is the reason causes this problem? > > Thanks for any suggestion and help. > > are ou sure pvmd is running ??? > check it using ps -axu | grep pvm > -- > ' E-Mail: AlberT at SuperAlberT.it '."\n". > ' Web: http://SuperAlberT.it '."\n". 
> ' IRC: #php,#AES azzurra.com '."\n".'ICQ: 158591185'; ?> > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rene.storm at emplics.com Tue Jul 15 03:11:16 2003 From: rene.storm at emplics.com (Rene Storm) Date: Tue, 15 Jul 2003 09:11:16 +0200 Subject: Default user installed by Packages Message-ID: <29B376A04977B944A3D87D22C495FB23D52A@vertrieb.emplics.com> Hi Beowulfers, I'm working on a little Cluster Builder which bases on rsync. As I noticed, rsync change the owner of a file attribute via chown, if the owner is known by the system. Would you be so nice and take a look, if I have to expand my "default-known" user list on the pxe-environment ?. I would like the have it destribution independent. Some Suse and Debian lists would be nice. This list belongs to RH 7.3 # cat /etc/passwd | cut -d: -f1 | sort adm amanda apache bin daemon ftp games gdm gopher halt ident junkbust ldap lp mail mailman mailnull mysql named netdump news nfsnobody nobody nscd ntp operator pcap postfix postgres pvm radvd root rpc rpcuser rpm shutdown squid sync uucp vcsa xfs Thanks in advance Rene Storm __________________________ emplics AG _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From shewa at inel.gov Tue Jul 15 09:59:56 2003 From: shewa at inel.gov (Andrew Shewmaker) Date: Tue, 15 Jul 2003 07:59:56 -0600 Subject: PVM In-Reply-To: <3F1381B6.E423FA07@nchc.gov.tw> References: <3F0D30A0.D572627A@nchc.gov.tw> <200307111235.21746.AlberT@SuperAlberT.it> <3F1381B6.E423FA07@nchc.gov.tw> Message-ID: <3F1408DC.20606@inel.gov> Jyh-Shyong Ho wrote: > Hi, > > Thanks for the message. I checked and found that pvmd is not running, > when I ran pvmd to initiate the daemon, it aborted immediately: > > c00jsh00 at Zephyr:~> pvmd > /tmp/pvmtmp012493.0 > Aborted > > Here are the environment variables: > > export PVM_ROOT=/usr/lib/pvm3 > export PVM_ARCH=X86_64 > export PVM_DPATH=$PVM_ROOT/lib/pvmd > export PVM_TMP=/tmp > export PVM=$PVM_ROOT/lib/pvm > > Perhaps someone knows what might be wrong. Do you have a /tmp/pvmd* file? They can be left after a pvm crash and prevent future instances from starting. Also, do you really mean to execute pvmd directly and without arguments? Andrew -- Andrew Shewmaker, Associate Engineer Phone: 1-208-526-1276 Idaho National Eng. and Environmental Lab. P.0. Box 1625, M.S. 3605 Idaho Falls, Idaho 83415-3605 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From tod at gust.sr.unh.edu Tue Jul 15 11:48:05 2003 From: tod at gust.sr.unh.edu (Tod Hagan) Date: 15 Jul 2003 11:48:05 -0400 Subject: When are diskless compute nodes inappropriate? Message-ID: <1058284085.17543.12.camel@haze.sr.unh.edu> Okay, I'm convinced by the arguments in favor of diskless compute nodes, including cost savings applicable elsewhere, reduced power consumption, and increased reliability through the elimination of moving parts. 
With all the arguments against disks, what are the arguments in favor of diskful compute nodes? In particular, what are the situations or types of jobs for which a cluster with a high percentage of diskless nodes is contraindicated? I look forward to learning from the list's collective wisdom. Thanks. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From henken at seas.upenn.edu Tue Jul 15 12:27:18 2003 From: henken at seas.upenn.edu (Nicholas Henke) Date: 15 Jul 2003 12:27:18 -0400 Subject: When are diskless compute nodes inappropriate? In-Reply-To: <1058284085.17543.12.camel@haze.sr.unh.edu> References: <1058284085.17543.12.camel@haze.sr.unh.edu> Message-ID: <1058286438.16784.20.camel@roughneck.liniac.upenn.edu> On Tue, 2003-07-15 at 11:48, Tod Hagan wrote: > > With all the arguments against disks, what are the arguments in favor > of diskful compute nodes? In particular, what are the situations or > types of jobs for which a cluster with a high percentage of diskless > nodes is contraindicated? Anytime that accessing the data locally is faster than via NFS/OtherFS. The other case is when you are routinely using swap for memory. The one 'practical' situation we see here is on our Genomics cluster, where they are running BLAST on very large data sets. It makes an extremely large difference to copy the data to a local drive and use that than to access the data via NFS. HTH, Nic -- Nicholas Henke Penguin Herder & Linux Cluster System Programmer Liniac Project - Univ. of Pennsylvania _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From tod at gust.sr.unh.edu Tue Jul 15 12:28:25 2003 From: tod at gust.sr.unh.edu (Tod Hagan) Date: 15 Jul 2003 12:28:25 -0400 Subject: Default user installed by Packages In-Reply-To: <29B376A04977B944A3D87D22C495FB23D52A@vertrieb.emplics.com> References: <29B376A04977B944A3D87D22C495FB23D52A@vertrieb.emplics.com> Message-ID: <1058286507.17543.19.camel@haze.sr.unh.edu> On Tue, 2003-07-15 at 03:11, Rene Storm wrote: > Some Suse and Debian lists would be nice. >From my Debian stable (woody) system: backup bin daemon games gdm gnats identd irc list lp mail man news nobody operator postgres proxy root sshd sync sys uucp www-data _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From landman at scalableinformatics.com Tue Jul 15 12:53:45 2003 From: landman at scalableinformatics.com (Joseph Landman) Date: 15 Jul 2003 12:53:45 -0400 Subject: When are diskless compute nodes inappropriate? In-Reply-To: <1058284085.17543.12.camel@haze.sr.unh.edu> References: <1058284085.17543.12.camel@haze.sr.unh.edu> Message-ID: <1058288025.3280.102.camel@protein.scalableinformatics.com> When you do lots of disk IO to large blocks, sequential reads/writes. 
Remote disk will bottleneck you either at the network port of the compute node (~10 MB/s for 100 Base T, or ~80 MB/s for gigabit), or at the network port(s) of the file server (even if you multihome it, N clients distributed over M ports all heavily utilizing the file system will slow down the whole system if the requested bandwidth exceeds what the server is able to provide out its port(s)). Or even at the disk of the server. Local IO to a single spindle IDE disk can get you 30(50) MB/s write(read) performance. RaidO (using Linux MD device) can get you 60(80) MB/s write(read) performance. Sure, this is less than a 200 MB/s fibre channel, but it is also not shared like the 200 MB/s fibre channel (which becomes effectively (200/M) MB/s fibre channel for M requestors using lots of bandwidth). The aggregate IO when you get many writers/readers utilizing lots of bandwidth is a win for local disk over shared disk. From a cost perspective this is far better bang per US$ than shared disk for the heavy IO applications. At about $60 for a 40 GB IDE (ATA 100, 7200 RPM), the price isn't significant compared to the cost of an individual compute node. That is, unless you go SCSI for compute nodes. If you go diskless on the OS, just have a local scratch disk space for your heavy IO jobs. On Tue, 2003-07-15 at 11:48, Tod Hagan wrote: > Okay, I'm convinced by the arguments in favor of diskless compute > nodes, including cost savings applicable elsewhere, reduced power > consumption, and increased reliability through the elimination of > moving parts. > > With all the arguments against disks, what are the arguments in favor > of diskful compute nodes? In particular, what are the situations or > types of jobs for which a cluster with a high percentage of diskless > nodes is contraindicated? > > I look forward to learning from the list's collective wisdom. > > Thanks. -- Joseph Landman, Ph.D Scalable Informatics LLC email: landman at scalableinformatics.com web: http://scalableinformatics.com phone: +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From landman at scalableinformatics.com Tue Jul 15 13:11:25 2003 From: landman at scalableinformatics.com (Joseph Landman) Date: 15 Jul 2003 13:11:25 -0400 Subject: When are diskless compute nodes inappropriate? In-Reply-To: <1058286438.16784.20.camel@roughneck.liniac.upenn.edu> References: <1058284085.17543.12.camel@haze.sr.unh.edu> <1058286438.16784.20.camel@roughneck.liniac.upenn.edu> Message-ID: <1058289085.3280.120.camel@protein.scalableinformatics.com> On Tue, 2003-07-15 at 12:27, Nicholas Henke wrote: [...] > The one 'practical' situation we see here is on our Genomics cluster, > where they are running BLAST on very large data sets. It makes an > extremely large difference to copy the data to a local drive and use > that than to access the data via NFS. One thing that you can do is to segment the databases (use the -v switch on formatdb) or if you don't care about the absolute E-values being correct relative to your real database size, you could pre-segment the database using a tool such as our segment.pl at http://scalableinformatics.com/downloads/segment.pl . The large cost of disk access for the large BLAST jobs comes from the way it mmaps the indices, in case they overflow available memory. 
If they do overflow memory, then you spend your time in disk IO bringing the indices into memory as you walk through them. This lowers your overall absolute performance. Regardless of the segmentation, it is rarely a good idea (except in the case of very small databases) to keep them on NFS for the computation. Even if they are small, you are going to suffer network congestion very quickly for a reasonable number of compute nodes. Of course this gets into the problem of moving the databases out to the compute nodes. We are working on a neat solution to the data motion problem (specifically the database transport problem to the compute nodes). To avoid annoying everyone, please go offlist if you want to speak to us about it. Email/phone in .sig. -- Joseph Landman, Ph.D Scalable Informatics LLC email: landman at scalableinformatics.com web: http://scalableinformatics.com phone: +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From edwardsa at plk.af.mil Tue Jul 15 17:16:43 2003 From: edwardsa at plk.af.mil (Arthur H. Edwards) Date: Tue, 15 Jul 2003 15:16:43 -0600 Subject: When are diskless compute nodes inappropriate? In-Reply-To: <1058284085.17543.12.camel@haze.sr.unh.edu> References: <1058284085.17543.12.camel@haze.sr.unh.edu> Message-ID: <20030715211643.GA23118@plk.af.mil> If you are running large numbers of jobs that read and write to disk, local disk can be much more stable. We have been running an essentially serial application on many nodes and in both cases where we were writing to a parallel file system, the app would consistently crash. Art Edwards On Tue, Jul 15, 2003 at 11:48:05AM -0400, Tod Hagan wrote: > Okay, I'm convinced by the arguments in favor of diskless compute > nodes, including cost savings applicable elsewhere, reduced power > consumption, and increased reliability through the elimination of > moving parts. > > With all the arguments against disks, what are the arguments in favor > of diskful compute nodes? In particular, what are the situations or > types of jobs for which a cluster with a high percentage of diskless > nodes is contraindicated? > > I look forward to learning from the list's collective wisdom. > > Thanks. > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Art Edwards Senior Research Physicist Air Force Research Laboratory Electronics Foundations Branch KAFB, New Mexico (505) 853-6042 (v) (505) 846-2290 (f) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From markgw at sgi.com Wed Jul 16 02:31:23 2003 From: markgw at sgi.com (Mark Goodwin) Date: Wed, 16 Jul 2003 16:31:23 +1000 (EST) Subject: [ANNOUNCE] SGI Performance Co-Pilot 2.3.1 now available Message-ID: SGI is pleased to announce the new version of Performance Co-Pilot (PCP) open source (version 2.3.1-4) is now available for download from ftp://oss.sgi.com/projects/pcp/download This release contains mostly bug fixes following several months of testing the "dev" releases (most recent was version 2.3.0-17). 
A list of changes since the last major open source release (which was version 2.3.0-14) is in /usr/doc/pcp-2.3.1/CHANGELOG after installation, or at http://oss.sgi.com/projects/pcp/latest.html There are re-built RPMs for i386 and ia64 platforms in the above ftp directory. Other platforms will need to build RPMs from either the SRPM or from the tarball, e.g. : # tar xvzf pcp-2.3.1-4.src.tar.gz # cd pcp-2.3.1 # ./Makepkgs PCP is an extensible system monitoring package with a client/server architecture. It provides a distributed unifying abstraction for all interesting performance statistics in /proc and assorted applications (e.g. Apache). The PCP library APIs are robust and well documented, supporting rapid deployment of new and diverse sources of performance data and the development of sophisticated performance monitoring tools. The PCP homepage is at http://oss.sgi.com/projects/pcp and you can join the PCP mailing list via http://oss.sgi.com/projects/pcp/mail.html SGI would like to thank those who contributed to this and earlier releases. Thanks -- Mark Goodwin SGI Engineering _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lange at informatik.Uni-Koeln.DE Wed Jul 16 05:34:03 2003 From: lange at informatik.Uni-Koeln.DE (Thomas Lange) Date: Wed, 16 Jul 2003 11:34:03 +0200 Subject: Default user installed by Packages In-Reply-To: <1058286507.17543.19.camel@haze.sr.unh.edu> References: <29B376A04977B944A3D87D22C495FB23D52A@vertrieb.emplics.com> <1058286507.17543.19.camel@haze.sr.unh.edu> Message-ID: <16149.7179.554250.882661@informatik.uni-koeln.de> >>>>> On 15 Jul 2003 12:28:25 -0400, Tod Hagan said: > On Tue, 2003-07-15 at 03:11, Rene Storm wrote: >> Some Suse and Debian lists would be nice. These are the packages that are defined in the class Beowulf used in FAI (fully automatic installation for Debian) for a Beowulf computing node. # packages for Beowulf clients PACKAGES install fping jmon rsh-client rsh-server rstat-client rstatd rusers rusersd autofs dsh update-cluster-hosts update-cluster etherwake PACKAGES taskinst c-dev PACKAGES install lam-runtime lam3 lam3-dev libpvm3 pvm-dev mpich scalapack-mpich-dev -- regards Thomas _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From franz.marini at mi.infn.it Wed Jul 16 07:04:57 2003 From: franz.marini at mi.infn.it (Franz Marini) Date: Wed, 16 Jul 2003 13:04:57 +0200 (CEST) Subject: Global Shared Memory and SCI/Dolphin Message-ID: Hello, being in the process of deciding which net infrastructure to use for our next cluster (Myrinet, SCI/Dolphin or Quadrics), I was looking at the specs for the different types of hw. Provided that SCI/Dolphin implements RDMA, I was wondering why so little effort seems to be put into implementing a GSM solution for x86 clusters. The only (maybe big, maybe not) problem I see in the Dolphin hw is the lack of support for cache coherency. I think that having GSM support in (almost) commodity clusters would be a really-nice-thing(tm). I know that the Altix family implements GSM, but the price point of even a really small system (4 x Itanium2 procs, 4 Gb ram, 36 Gb HD) is really high, compared to an (performance wise) equivalent commodity cluster. 
And I can really see that SGI had a nice ccNUMA hw already developed, and so the software effort to implement GSM has (probably) been less massive than the effort a Dolphin GSM solution would need. Nonetheless, I still can't quite understand why so little effort is being put into developing a GSM solution for commodity clusters (even with Myrinet or Quadrics, I'm thinking about SCI/Dolphin only because of the hw support for RDMA operations). Any idea, comment or whatever? Have a nice day everyone, Franz --------------------------------------------------------- Franz Marini Sys Admin and Software Analyst, Dept. of Physics, University of Milan, Italy. email : franz.marini at mi.infn.it --------------------------------------------------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From joachim at ccrl-nece.de Wed Jul 16 09:16:09 2003 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Wed, 16 Jul 2003 15:16:09 +0200 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: References: Message-ID: <200307161516.09818.joachim@ccrl-nece.de> Franz Marini: > being in the process of deciding which net infrastructure to use for our > next cluster (Myrinet, SCI/Dolphin or Quadrics), I was looking at the > specs for the different types of hw. > > Provided that SCI/Dolphin implements RDMA, I was wondering why so little > effort seems to be put into implementing a GSM solution for x86 clusters. Because MPI is what most people want to achieve code- and performance-portability. > The only (maybe big, maybe not) problem I see in the Dolphin hw is the > lack of support for cache coherency. > > I think that having GSM support in (almost) commodity clusters would be > a really-nice-thing(tm). Martin Schulz (formerly TU München, now Cornell Theory Center) has developed exactly the thing you are looking for. See http://wwwbode.cs.tum.edu/Par/arch/smile/software/shmem/ . You will also find his PhD thesis there which describes the complete software. I do not know about the exact status of the SW right now (his approach required some patches to the SCI driver, and it will probably be necessary to apply them to the current drivers). Very interesting approach, though. Other, non-SCI approaches like MOSIX and the various DSM/SVM libraries also offer you some sort of global shared memory - but most only use TCP/IP for communication. Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From fmahr at gmx.de Wed Jul 16 10:13:44 2003 From: fmahr at gmx.de (Ferdinand Mahr) Date: Wed, 16 Jul 2003 16:13:44 +0200 Subject: Global Shared Memory and SCI/Dolphin References: <200307161516.09818.joachim@ccrl-nece.de> Message-ID: <3F155D98.7CB8BE90@gmx.de> Joachim Worringen wrote: > Other, non-SCI approaches like MOSIX and the various DSM/SVM libraries also > offer you some sort of global shared memory - but most only use TCP/IP for > communication. Unfortunately, MOSIX (so far) does not offer global shared memory. The node with the largest installed RAM is the restriction, since MOSIX cannot use the memory of more than one node for one process.
The MOSIX team seems to work on DSM, but there are no official results so far. Regards, Ferdinand _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jcownie at etnus.com Wed Jul 16 11:36:23 2003 From: jcownie at etnus.com (James Cownie) Date: Wed, 16 Jul 2003 16:36:23 +0100 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: Your message of "Wed, 16 Jul 2003 18:28:33 +0400." <200307161428.SAA28224@nocserv.free.net> Message-ID: <19coKN-5n4-00@etnus.com> > > Because MPI is what most people want to achieve code- and > > peformance-portability. > Partially I may agree, partially - not: MPI is not the best in the > sense of portability (for example, optimiziation requires knowledge > of interconnect topology, which may vary from cluster to cluster, > and of course from MPP to MPP computer). MPI has specific support for this in Rolf Hempel's topology code, which is intended to allow you to have the system help you to choose a good mapping of your processes onto the processors in the system. This seems to me to be _more_ than you have in a portable way on the ccNUMA machines, where you have to worry about 1) where every page of data lives, not just how close each process is to another one (and you have more pages than processes/threads to worry about !) 2) the scheduler choosing to move your processes/threads around the machine. > I think that if there is relative cheap and effective way to build > ccNUMA system from cluster - it may have success. Which is, of course, what SCI was _intended_ to be, and we saw how well that succeeded :-( -- Jim James Cownie Etnus, LLC. +44 117 9071438 http://www.etnus.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From c00jsh00 at nchc.gov.tw Wed Jul 16 05:12:42 2003 From: c00jsh00 at nchc.gov.tw (Jyh-Shyong Ho) Date: Wed, 16 Jul 2003 17:12:42 +0800 Subject: NFS problem Message-ID: <3F15170A.22D968E4@nchc.gov.tw> Hi, I set up a small cluster of 4+1 nodes, directories /home, /usr/local, /opt and /workraid of the master node are exported to slave nodes. With /etc/fstab defined as nfs file system on slave nodes and file /etc/exports defined in the master node, the NFS should work. However, not all of these directories are mounted when these slave nodes are rebooted, I always get the message when the system tries to mount the NFS directories: RPC portmapper failure: unable to receive When the system is up, I can mount these directories manually. The booting message does include the line: Starting RPC portmap daemon.....done Could anyone point out what might be wrong or where to check? Jyh-Shyong Ho, PhD. 
Research Scientist National Center for High-Performance Computing Hsinchu, Taiwan, ROC
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From bill at math.ucdavis.edu Thu Jul 17 02:45:58 2003 From: bill at math.ucdavis.edu (Bill Broadley) Date: Wed, 16 Jul 2003 23:45:58 -0700 Subject: P4 dual vs P4C vs Opteron Message-ID: <20030717064558.GA10800@sphere.math.ucdavis.edu>
I have been evaluating price/performance with a locally written earthquake simulation code written in C, mostly floating point, and not very cache friendly. I thought people might be interested in the performance numbers I collected. Gcc-3.2.2 was used in all cases with the -O3 flag (compiled on the machine it ran).
Dual p4-3.0/533 Mhz, no HT machine 1 process took 86.43 seconds. 2 processes in parallel took 156.9 seconds Scaling efficiency =~ 10% (2 processes run at the same time have 10% greater throughput than a single process on a single cpu)
Dual Opteron 240-1.4 Ghz/333 MHz 1 process took 97.87 seconds. 2 processes in parallel took 99.79 seconds Scaling efficiency =~ 96% (2 processes run at the same time have 96% greater throughput than a single process on a single cpu)
Single P4C-2.6 Ghz/800 Mhz FSB with HT enabled. 1 process took 81.22 seconds. 2 processes in parallel took 137.59 seconds Scaling efficiency =~ 18% (2 processes run at the same time have 18% greater throughput than a single process on a single cpu)
I'd also like to do a performance-per-watt comparison. Anyone have a >= 2.6 Ghz dual P4, 533 Mhz FSB, a rackmount motherboard, and a kill-a-watt? Unfortunately my dual p4 has a fast 3d card which would throw off my performance-per-watt calculations. I found it amusing that Hyperthreading scaled somewhat poorly, but still managed to outscale and outperform the dual p4, despite a significantly slower clock. So the P4C-2.6 is the fastest for a single job and the opteron (the slowest model sold) is the fastest for 2 jobs. For the curious I'm seeing around 1.8 amps @ 110V running the dual opteron with 2 busy CPUs. -- Bill Broadley Mathematics UC Davis
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From jcownie at etnus.com Thu Jul 17 05:01:37 2003 From: jcownie at etnus.com (James Cownie) Date: Thu, 17 Jul 2003 10:01:37 +0100 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: Message from Mikhail Kuzminsky of "Wed, 16 Jul 2003 22:31:15 +0400." <200307161831.WAA02082@nocserv.free.net> Message-ID: <19d4dt-1F6-00@etnus.com>
> > > Partially I may agree, partially - not: MPI is not the best in the > > > sense of portability (for example, optimization requires knowledge > > > of interconnect topology, which may vary from cluster to cluster, > > > and of course from MPP to MPP computer). > > > MPI has specific support for this in Rolf Hempel's topology code, > > which is intended to allow you to have the system help you to choose a > > good mapping of your processes onto the processors in the system. > > Unfortunately I do not know about that codes :-( but for the best > optimization I'll re-build the algorithm itself to "fit" for target > topology.
Since it's a standard part of MPI it seems a bit unfair of you to be saying that MPI doesn't support optimisation based on topology, when all you mean is "I didn't RTFM so I don't know about that part of the MPI standard". See (for instance) chapter 6 in "MPI The Complete Reference" which discusses the MPI topology routines at some length. This is all MPI-1 stuff too, so it's not as if it's new ;-) Of course it may well be that none of the vendors has bothered actually to implement the topology routines in any way which gives you a benefit. However it still seems unfair to blame the MPI _standard_ for failings in MPI _implementations_. After all the MPI forum spent time arguing about this, so we were aware of the issue, and trying to give you a solution to the problem. > > This seems to me to be _more_ than you have in a portable way on the > > ccNUMA machines, where you have to worry about > > > > 1) where every page of data lives, not just how close each process is > > to another one (and you have more pages than processes/threads to > > worry about !) > > > > 2) the scheduler choosing to move your processes/threads around the > > machine. > > Yes, but "by default" I believe that they are the tasks of > operating system, or, as maximum, the information I'm supplying to > OS, *after* translation and linking of the program. Having seen the effect which layout has, and the contortions people go to to try to get their SMP codes to work efficiently in non-portable ways (re-coding to make "first touch" happen on the "right" processor, use of machine specific system calls for page affinity control and so on), I remain unconvinced. -- Jim James Cownie Etnus, LLC. +44 117 9071438 http://www.etnus.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From janfrode at parallab.no Thu Jul 17 05:04:54 2003 From: janfrode at parallab.no (Jan-Frode Myklebust) Date: Thu, 17 Jul 2003 11:04:54 +0200 Subject: bad job distribution with MPICH Message-ID: <20030717090453.GB23226@ii.uib.no> Hi, we're running MPICH 1.2.4 on a 32 node dual cpu linux cluster (fast ethernet), and are having some problems with the mpich job distribution. An example from today: The PBS job: ---------------------------------------- #PBS -l nodes=4:ppn=2,walltime=100:00:00 # mpirun -np `wc -l < $PBS_NODEFILE` -machinefile $PBS_NODEFILE mfix.exe ---------------------------------------- is assigned to nodes: node17/0+node15/0+node14/0+node11/0+node17/1+node15/1+node14/1+node11/1 PBS generates a PBS_NODEFILE containing: ----------------------------- node17 node15 node14 node11 node17 node15 node14 node11 ----------------------------- And this command is started in node 17: mpirun -np 8 -machinefile /var/spool/PBS/aux/20996.fire executable And then when I look over the nodes, there's 1 executable running on node17, 3 on node15, 2 on node14 and 2 on node11. Anybody seen something like this, and maybe have an idea of what might be causing it? -jf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hahn at physics.mcmaster.ca Thu Jul 17 13:39:04 2003 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Thu, 17 Jul 2003 13:39:04 -0400 (EDT) Subject: When are diskless compute nodes inappropriate? 
In-Reply-To: <1058288025.3280.102.camel@protein.scalableinformatics.com> Message-ID: as everyone said: local disks suck for reliability, but are simply necessary if you're doing any kind of sigificant file IO, especially checkpoints. IMO, that means diskless net-booting with local swap/scratch. > write(read) performance. RaidO (using Linux MD device) can get you > 60(80) MB/s write(read) performance. Sure, this is less than a 200 MB/s of course, MD can give you much higher raid0 if you use more than two disks; it's not hard to hit 200 MB/s. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Daniel.Kidger at quadrics.com Thu Jul 17 07:15:56 2003 From: Daniel.Kidger at quadrics.com (Daniel Kidger) Date: Thu, 17 Jul 2003 12:15:56 +0100 Subject: Global Shared Memory and SCI/Dolphin Message-ID: <010C86D15E4D1247B9A5DD312B7F5AA78DE01F@stegosaurus.bristol.quadrics.com> Franz Marini wrote: > Nonetheless, I still can't quite understand why so little effort is >being put in developing a GSM solution for commodity cluster (even with >Myrinet or Quadrics, I'm thinking about SCI/Dolphin only because of the hw >support for RDMA operations). The Quadrics Interconnect also does hardware RDMA, and yes a significant percentage of people do use Global Shared Memory programming models rather than message passing. In fact I thought all four of SCALI/Quadrics/Myrinet/Infiniband could do RDMA ?? Yours, Daniel. -------------------------------------------------------------- Dr. Dan Kidger, Quadrics Ltd. daniel.kidger at quadrics.com One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 ----------------------- www.quadrics.com -------------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From koz at urbi.com.br Thu Jul 17 01:09:12 2003 From: koz at urbi.com.br (Alexandre M.) Date: Thu, 17 Jul 2003 02:09:12 -0300 Subject: NFS problem References: <3F15170A.22D968E4@nchc.gov.tw> Message-ID: <000801c34c21$903eeaa0$5901020a@nhg4bx71qabh4t> Hi, One problem that's common is trying to mount the NFS dir while the network is not ready yet during boot. You could see if this is the case by placing a "sleep 5" in the NFS service bootup script just before the mount command. ----- Original Message ----- From: "Jyh-Shyong Ho" To: Sent: Wednesday, July 16, 2003 6:12 AM Subject: NFS problem > Hi, > > I set up a small cluster of 4+1 nodes, directories /home, /usr/local, > /opt and /workraid > of the master node are exported to slave nodes. With /etc/fstab defined > as nfs file system > on slave nodes and file /etc/exports defined in the master node, the NFS > should work. > However, not all of these directories are mounted when these slave nodes > are rebooted, > I always get the message when the system tries to mount the NFS > directories: > > RPC portmapper failure: unable to receive > > When the system is up, I can mount these directories manually. The > booting message does > include the line: > > Starting RPC portmap daemon.....done > > Could anyone point out what might be wrong or where to check? > > Jyh-Shyong Ho, PhD. 
> Research Scientist > National Center for High-Performance Computing > Hsinchu, Taiwan, ROC > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bill at math.ucdavis.edu Thu Jul 17 16:42:55 2003 From: bill at math.ucdavis.edu (Bill Broadley) Date: Thu, 17 Jul 2003 13:42:55 -0700 Subject: Dual Opteron-1.4 power usage Message-ID: <20030717204255.GA15891@sphere.math.ucdavis.edu> I figured this might be handy for those planning Power, UPS, or airconditioning budgets. Tyan dual opteron motherboard 4 1GB dimms (ECC registered) enlight 8950 case Sparkle 550 watt power supply. No PCI cards. Measured with a kill-a-watt. 163 watts idle 192 watts with 2 distributed.net OGR crunchers running. 194 watts with 2 earthquake sims 196 watts Bonnie++ and 2*OGR 198 watts Bonnie++ and 2 earthquake sims 208 watts bonnie++ and pstream (2 threads banging main memory sequentially) 212 watts pstream (2 threads banging main memory sequentially) -- Bill Broadley Mathematics UC Davis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rbw at ahpcrc.org Thu Jul 17 16:40:10 2003 From: rbw at ahpcrc.org (Richard Walsh) Date: Thu, 17 Jul 2003 15:40:10 -0500 Subject: Global Shared Memory and SCI/Dolphin Message-ID: <200307172040.h6HKeAm29015@mycroft.ahpcrc.org> Dan Kidger wrote: >The Quadrics Interconnect also does hardware RDMA, and yes a significant >percentage of people do use Global Shared Memory programming models rather >than message passing. > >In fact I thought all four of SCALI/Quadrics/Myrinet/Infiniband could do >RDMA ?? Does this support run all the way up the stack to the MPI-2 "one-sided" communications stuff? Anyone working on supporting the implicit DSM language constructs of CAF and/or UPC with their RDMA capability? Comments on any/all interconnects mentioned are welcome. Thanks, rbw _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bill at math.ucdavis.edu Thu Jul 17 16:48:38 2003 From: bill at math.ucdavis.edu (Bill Broadley) Date: Thu, 17 Jul 2003 13:48:38 -0700 Subject: When are diskless compute nodes inappropriate? In-Reply-To: <1058284085.17543.12.camel@haze.sr.unh.edu> References: <1058284085.17543.12.camel@haze.sr.unh.edu> Message-ID: <20030717204838.GB15891@sphere.math.ucdavis.edu> On Tue, Jul 15, 2003 at 11:48:05AM -0400, Tod Hagan wrote: > Okay, I'm convinced by the arguments in favor of diskless compute > nodes, including cost savings applicable elsewhere, reduced power > consumption 5-10 watts. >, and increased reliability through the elimination of > moving parts. Indeed. Although similar reliability can be had if you can survive a disk failure. > With all the arguments against disks, what are the arguments in favor > of diskful compute nodes? In particular, what are the situations or Swap, and high speed disk I/O. 35 MB/sec of sequential I/O to a local disk is very hard to centralize. 
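For illustration, here is a minimal sketch of one way to do the kind of boot-time local swap/scratch setup described just below; the device name (/dev/hda), the 1 GB swap size, the ext2 scratch filesystem and the use of a swap partition rather than a swap file are assumptions for the sketch, not a description of any particular cluster:

#!/bin/sh
# Re-create local swap and scratch from the ramdisk at every boot,
# so the local disk carries no persistent state.
DISK=/dev/hda                 # assumed IDE disk; adjust for your hardware

# partition 1: ~1 GB Linux swap (type 82), partition 2: the rest (type 83)
sfdisk -uM $DISK <<EOF
,1024,82
,,83
EOF

mkswap ${DISK}1               # fresh swap every boot
swapon ${DISK}1

mke2fs -q ${DISK}2            # fresh scratch filesystem every boot
mkdir -p /scratch
mount ${DISK}2 /scratch

If a disk dies, the script simply fails and the node keeps running diskless; a replacement disk is picked up on the next reboot with no per-node installation step.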
If you can make do with much less then it's not too much of a big deal. For our 32 node cluster, on boot we: netboot a kernel, the kernel loads a ramdisk, the disk is partitioned, the disk is mkswap'ed, and /scratch and /swap are mounted. So this leaves ZERO state on the hard disk: if a disk dies you just reboot and the node works (but doesn't have /swap and /scratch), and if you pull a disk off a shelf and stick it in a node you just reboot. Very nice to minimize the administrative costs of managing, patching, backing up, troubleshooting etc. of N nodes, with possibly different images, and of course any state.
My central fileserver is a dual-p4, dual PC1600 memory bus, 133 Mhz/64 bit PCI, and several U160 channels full of 5 disks each. I see 200-300 MB/sec sustained for large sequential file reads/writes. Granted the central fileserver cannot keep up with 32 nodes wanting to read/write at 35 MB/sec, but it's enough to usually not be a bottleneck. -- Bill Broadley Mathematics UC Davis
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From lindahl at keyresearch.com Thu Jul 17 17:13:01 2003 From: lindahl at keyresearch.com (Greg Lindahl) Date: Thu, 17 Jul 2003 14:13:01 -0700 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: <010C86D15E4D1247B9A5DD312B7F5AA78DE01F@stegosaurus.bristol.quadrics.com> References: <010C86D15E4D1247B9A5DD312B7F5AA78DE01F@stegosaurus.bristol.quadrics.com> Message-ID: <20030717211301.GA4929@greglaptop.internal.keyresearch.com>
On Thu, Jul 17, 2003 at 12:15:56PM +0100, Daniel Kidger wrote: > The Quadrics Interconnect also does hardware RDMA, and yes a significant > percentage of people do use Global Shared Memory programming models rather > than message passing. > > In fact I thought all four of SCALI/Quadrics/Myrinet/Infiniband could do > RDMA ??
There's a terminology problem here: Some people mean cache-coherent shared memory, like that on an SGI Origin. Another term for non-cache-coherent but globally addressable and accessible memory is SALC: Shared address, local consistency. And yes, all 4 of Scali/Quadrics/Myrinet/Infiniband support the non-cache-coherent kind of shared memory. Programming models in this area are: * UPC: Unified Parallel C * CoArray Fortran * MPI-2 one-sided operations * Global Arrays from PNL * The Cray SHMEM library -- greg
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From eccf at super.unam.mx Thu Jul 17 17:11:55 2003 From: eccf at super.unam.mx (Eduardo Cesar Cabrera Flores) Date: Thu, 17 Jul 2003 16:11:55 -0500 (CDT) Subject: bad job distribution with MPICH In-Reply-To: <200307171904.h6HJ4Lw25122@NewBlue.Scyld.com> Message-ID:
You should try mpiexec cafe
Hi, we're running MPICH 1.2.4 on a 32 node dual cpu linux cluster (fast ethernet), and are having some problems with the mpich job distribution.
An example from today:
The PBS job: ---------------------------------------- #PBS -l nodes=4:ppn=2,walltime=100:00:00 # mpirun -np `wc -l < $PBS_NODEFILE` -machinefile $PBS_NODEFILE mfix.exe ----------------------------------------
is assigned to nodes: node17/0+node15/0+node14/0+node11/0+node17/1+node15/1+node14/1+node11/1
PBS generates a PBS_NODEFILE containing: ----------------------------- node17 node15 node14 node11 node17 node15 node14 node11 -----------------------------
And this command is started in node 17: mpirun -np 8 -machinefile /var/spool/PBS/aux/20996.fire executable
And then when I look over the nodes, there's 1 executable running on node17, 3 on node15, 2 on node14 and 2 on node11. Anybody seen something like this, and maybe have an idea of what might
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From andrewxwang at yahoo.com.tw Thu Jul 17 23:20:25 2003 From: andrewxwang at yahoo.com.tw (=?big5?q?Andrew=20Wang?=) Date: Fri, 18 Jul 2003 11:20:25 +0800 (CST) Subject: SGE 5.3p4 released (was: queueing system for x86-64) In-Reply-To: <20030714143541.B10106@lnxi.com> Message-ID: <20030718032025.1909.qmail@web16813.mail.tpe.yahoo.com>
I was trying to install SGE on an x86-64 cluster, and found that I need SGE 5.3p4 to get the resource limit set correctly. http://gridengine.sunsource.net/project/gridengine/news/SGE53p4-announce.html I will try to install SGE on x86-64 next week, and I will tell everyone on this list my experience. Andrew. ----------------------------------------------------------------- http://fate.yahoo.com.tw/
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From jcownie at etnus.com Fri Jul 18 04:31:45 2003 From: jcownie at etnus.com (James Cownie) Date: Fri, 18 Jul 2003 09:31:45 +0100 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: Message from Richard Walsh of "Thu, 17 Jul 2003 15:40:10 CDT." <200307172040.h6HKeAm29015@mycroft.ahpcrc.org> Message-ID: <19dQeX-1LH-00@etnus.com>
> Does this support run all the way up the stack to the MPI-2 > "one-sided" communications stuff? Anyone working on supporting the > implicit DSM language constructs of CAF and/or UPC with their RDMA > capability? Comments on any/all interconnects mentioned are > welcome.
Compaq UPC (from HP) on their SC machines directly targets the Quadrics' Elan processors. See http://h30097.www3.hp.com/upc/ for details of the Compaq UPC product. -- Jim James Cownie Etnus, LLC. +44 117 9071438 http://www.etnus.com
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From jcownie at etnus.com Fri Jul 18 04:41:43 2003 From: jcownie at etnus.com (James Cownie) Date: Fri, 18 Jul 2003 09:41:43 +0100 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: Message from Greg Lindahl of "Thu, 17 Jul 2003 14:13:01 PDT."
<20030717211301.GA4929@greglaptop.internal.keyresearch.com> Message-ID: <19dQoB-1LO-00@etnus.com> > > In fact I thought all four of SCALI/Quadrics/Myrinet/Infiniband could do > > RDMA ?? > > There's a terminology problem here: Some people mean cache-coherent > shared memory, like that on an SGI Origin. > > Another term for non-cache-coherent but globally addressable and > accessible memory is SALC: Shared address, local consistency. > > And yes, all 4 of Scali/Quadrics/Myrinet/Infiniband support the > non-cache-coherent kind of shared memory. Programming models in this > area are: > > * UPC: Unified Parallel C > * CoArray Fortran > * MPI-2 one-sided operations > * Global Arrays from PNL > * The Cray SHMEM library However there's another axis to the classification which you haven't mentioned, and which is also extremeley important, which is whether the remote access is "punned" onto a normal load/store instruction, or requires a different explicit operation. I like to refer to the Quadrics' model as "explicit remote store access", since it requires special accesses to (process mapped) device registers to cause remote operations to happen; therefore the process making a remote access has to know that that's what it wants to do. It can't just follow a chain of pointers and end up doing remote accesses transparently. Note, also, that AFAIK the explicit remote store accesses in the Quadrics' implementation are cache coherent at both ends, so they are not SALC. (Both because there isn't a shared address space, and because they are consistent at both ends !). As I understand it the Quadrics' model is that there are multiple processes each with their own address space, but that by explicit operations a process can read or write data in a cache coherent fashion and without co-operation from its owner in any of the address spaces. (At least that's how it worked back at Meiko ;-) I suppose you could view the {process-id, address} tuple as a shared address space, but it seems a bit of a stretch to me. -- Jim James Cownie Etnus, LLC. +44 117 9071438 http://www.etnus.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From franz.marini at mi.infn.it Fri Jul 18 04:52:20 2003 From: franz.marini at mi.infn.it (Franz Marini) Date: Fri, 18 Jul 2003 10:52:20 +0200 (CEST) Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: <20030717211301.GA4929@greglaptop.internal.keyresearch.com> References: <010C86D15E4D1247B9A5DD312B7F5AA78DE01F@stegosaurus.bristol.quadrics.com> <20030717211301.GA4929@greglaptop.internal.keyresearch.com> Message-ID: On Thu, 17 Jul 2003, Greg Lindahl wrote: > There's a terminology problem here: Some people mean cache-coherent > shared memory, like that on an SGI Origin. I maybe wrong but I think that all the SGI machines (including the Altix) implement c-c shared mem. > Another term for non-cache-coherent but globally addressable and > accessible memory is SALC: Shared address, local consistency. > > And yes, all 4 of Scali/Quadrics/Myrinet/Infiniband support the > non-cache-coherent kind of shared memory. Programming models in this > area are: > > * UPC: Unified Parallel C > * CoArray Fortran > * MPI-2 one-sided operations > * Global Arrays from PNL > * The Cray SHMEM library And this should testify to the fact that the shmem programming paradigm is all but rarely used. 
As long as I can tell there is a *lot* of code out there that uses, e.g. the Cray SHMEM lib (btw, this is one of the things that makes the Scali/Dolphin solution interesting to us). But, still, whereas, e.g. the SHMEM lib has been implemented under Scali (and maybe under Quadrics/Myrinet/Infiniband, not sure about it), what I think it'd be interesting and usefull is the support (at the OS level) for a GSM/single system image, providing support for POSIX threads across the nodes. I may be dreaming here, I know, but still... :) Btw, on a side note, does anyone know if there is some compiler (both C and F90/HPF) out there supporting some kind of auto parallelization via, e.g. the SHMEM lib (I'm not asking for a MPI-enabled compiler, I'm not *so* crazy ;)) ? --------------------------------------------------------- Franz Marini Sys Admin and Software Analyst, Dept. of Physics, University of Milan, Italy. email : franz.marini at mi.infn.it --------------------------------------------------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From sp at scali.com Thu Jul 17 17:58:24 2003 From: sp at scali.com (Steffen Persvold) Date: Thu, 17 Jul 2003 23:58:24 +0200 (CEST) Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: <20030717211301.GA4929@greglaptop.internal.keyresearch.com> Message-ID: On Thu, 17 Jul 2003, Greg Lindahl wrote: > On Thu, Jul 17, 2003 at 12:15:56PM +0100, Daniel Kidger wrote: > > > The Quadrics Interconnect also does hardware RDMA, and yes a significant > > percentage of people do use Global Shared Memory programming models rather > > than message passing. > > > > In fact I thought all four of SCALI/Quadrics/Myrinet/Infiniband could do > > RDMA ?? > > There's a terminology problem here: Some people mean cache-coherent > shared memory, like that on an SGI Origin. > > Another term for non-cache-coherent but globally addressable and > accessible memory is SALC: Shared address, local consistency. > > And yes, all 4 of Scali/Quadrics/Myrinet/Infiniband support the > non-cache-coherent kind of shared memory. Programming models in this > area are: Just to clarify; Scali makes software, not hardware. So putting Scali in the same group as Quadrics, Myrinet and Infiniband is kinda wrong. It should have been Dolphin (as in the SCI card vendor) I guess. Our message passing software may run on all four interconnects (and ethernet). 
Regards, -- Steffen Persvold ,,, mailto: sp at scali.com Senior Software Engineer (o-o) http://www.scali.com -----------------------------oOO-(_)-OOo----------------------------- Scali AS, PObox 150, Oppsal, N-0619 Oslo, Norway, Tel: +4792484511 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From ashley at pittman.co.uk Fri Jul 18 07:45:01 2003 From: ashley at pittman.co.uk (Ashley Pittman) Date: 18 Jul 2003 12:45:01 +0100 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: <200307172040.h6HKeAm29015@mycroft.ahpcrc.org> References: <200307172040.h6HKeAm29015@mycroft.ahpcrc.org> Message-ID: <1058528701.21031.57.camel@ashley> On Thu, 2003-07-17 at 21:40, Richard Walsh wrote: > Dan Kidger wrote: > > >The Quadrics Interconnect also does hardware RDMA, and yes a significant > >percentage of people do use Global Shared Memory programming models rather > >than message passing. > > > >In fact I thought all four of SCALI/Quadrics/Myrinet/Infiniband could do > >RDMA ?? > > Does this support run all the way up the stack to the MPI-2 "one-sided" > communications stuff? Anyone working on supporting the implicit DSM > language constructs of CAF and/or UPC with their RDMA capability? Comments > on any/all interconnects mentioned are welcome. Yes it does, we support both Cray SHMEM and MPI-2 "one-sided" which are essentially simple wrappers around the DMA engine. Because it's truly one-sided it's lower latency than Send/Recv. I've included some pallas figures from one of the machines here. There are two UPC implementations which work over Quadrics hardware, one of which is open source, check out http://upc.nersc.gov/ Ashley, #--------------------------------------------------- # Benchmarking Unidir_Put # ( #processes = 2 ) #--------------------------------------------------- # # MODE: AGGREGATE # #bytes #repetitions t[usec] Mbytes/sec 0 4096 0.07 0.00 4 4096 1.67 2.28 8 4096 1.68 4.55 16 4096 1.72 8.86 32 4096 2.19 13.95 64 4096 2.55 23.89 128 4096 2.77 44.06 256 4096 3.19 76.60 512 4096 4.14 118.06 1024 4096 5.76 169.42 2048 4096 8.95 218.30 4096 4096 15.32 254.92 8192 4096 28.00 279.04 16384 2560 53.40 292.63 32768 1280 104.10 300.19 65536 640 207.56 301.12 131072 320 412.33 303.15 262144 160 821.94 304.16 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at keyresearch.com Fri Jul 18 12:14:05 2003 From: lindahl at keyresearch.com (Greg Lindahl) Date: Fri, 18 Jul 2003 09:14:05 -0700 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: <19dQoB-1LO-00@etnus.com> References: <20030717211301.GA4929@greglaptop.internal.keyresearch.com> <19dQoB-1LO-00@etnus.com> Message-ID: <20030718161405.GA13859@greglaptop.greghome.keyresearch.com> On Fri, Jul 18, 2003 at 09:41:43AM +0100, James Cownie wrote: > Note, also, that AFAIK the explicit remote store accesses in the > Quadrics' implementation are cache coherent at both ends, so they are > not SALC. (Both because there isn't a shared address space, and > because they are consistent at both ends !). In both cases you're using different terminology than the SALC folks do. 
-- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From johnh at sjgeophysics.com Fri Jul 18 15:04:52 2003 From: johnh at sjgeophysics.com (John Harrop) Date: 18 Jul 2003 12:04:52 -0700 Subject: Empty passwords vs ssh-agent? Message-ID: <1058555100.10220.33.camel@orion-2> I'm currently switching our system from using r-commands to ssh. We have a fairly small system with 27 nodes. The only two options I can see with ssh are empty passwords and ssh-agent. The first looks like it isn't much better for security than r commands. (We do have ssh with passwords and known hosts on a portal machine.) Using ssh-agent on a cluster looks like a potentially big hassle. Or am I mistaken about the last impression? After all, we have nodes that are almost hitting up time of 400 days so ssh-add would only have been run once for each cluster user. What are people using as the clusters get bigger? Thanks is advance for your comments and thought! Cheers, John Harrop _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rodmur at maybe.org Fri Jul 18 16:26:50 2003 From: rodmur at maybe.org (Dale Harris) Date: Fri, 18 Jul 2003 13:26:50 -0700 Subject: Empty passwords vs ssh-agent? In-Reply-To: <1058555100.10220.33.camel@orion-2> References: <1058555100.10220.33.camel@orion-2> Message-ID: <20030718202650.GI24530@maybe.org> On Fri, Jul 18, 2003 at 12:04:52PM -0700, John Harrop elucidated: > I'm currently switching our system from using r-commands to ssh. We > have a fairly small system with 27 nodes. The only two options I can > see with ssh are empty passwords and ssh-agent. The first looks like it > isn't much better for security than r commands. (We do have ssh with > passwords and known hosts on a portal machine.) Using ssh-agent on a > cluster looks like a potentially big hassle. Or am I mistaken about the > last impression? After all, we have nodes that are almost hitting up > time of 400 days so ssh-add would only have been run once for each > cluster user. > > What are people using as the clusters get bigger? > > Thanks is advance for your comments and thought! > > Cheers, > > John Harrop > I've have the same questions, too. Is this something you're just doing for administrative purposes? Or are the users going to need use ssh to authenticate themselves as well? -- Dale Harris rodmur at maybe.org /.-) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From xyzzy at speakeasy.org Fri Jul 18 17:10:45 2003 From: xyzzy at speakeasy.org (Trent Piepho) Date: Fri, 18 Jul 2003 14:10:45 -0700 (PDT) Subject: Empty passwords vs ssh-agent? In-Reply-To: <20030718202650.GI24530@maybe.org> Message-ID: On Fri, 18 Jul 2003, Dale Harris wrote: > On Fri, Jul 18, 2003 at 12:04:52PM -0700, John Harrop elucidated: > > I'm currently switching our system from using r-commands to ssh. We > > have a fairly small system with 27 nodes. The only two options I can > > see with ssh are empty passwords and ssh-agent. The first looks like it You can use RSA host based authentication. 
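As a rough sketch, host-based authentication with OpenSSH usually amounts to something like the following; the node names are made up, and the exact option names and the need for the setuid ssh-keysign helper vary with OpenSSH version, so check your sshd_config/ssh_config man pages:

# Server side (every node), in /etc/ssh/sshd_config:
#     HostbasedAuthentication yes
# and list the trusted cluster hosts, one per line, in /etc/ssh/shosts.equiv.
#
# Client side (every node), in /etc/ssh/ssh_config:
#     HostbasedAuthentication yes
#     EnableSSHKeysign yes
#
# Every node's host key must appear in /etc/ssh/ssh_known_hosts everywhere;
# ssh-keyscan can collect them (node01..node03 are placeholder names):
for n in node01 node02 node03; do
    ssh-keyscan -t rsa $n
done >> /etc/ssh/ssh_known_hosts
# then restart sshd and test with e.g.:  ssh node02 hostname

With that in place, intra-cluster ssh needs neither empty passphrases nor a per-user agent, while logins from outside the cluster still use whatever authentication you normally require.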
This is the same style as the r commands, except instead of only using what the remote host claims as its IP address, a RSA/DSA key check is done. This way you can do non-interactive ssh just among your cluster nodes, but still have passwords for extra-cluster connections. ssh-agent also works well. Users can start the agent once and leave it running, only having to type in their password once per reboot. A nifty thing would be if login could check for ssh-agent, and if it finds one, setup the env variables (already can be done from the shell dot-files). If it doesn't find one, it starts it and runs ssh-add using the password supplied for login. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at keyresearch.com Fri Jul 18 17:06:15 2003 From: lindahl at keyresearch.com (Greg Lindahl) Date: Fri, 18 Jul 2003 14:06:15 -0700 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: References: <010C86D15E4D1247B9A5DD312B7F5AA78DE01F@stegosaurus.bristol.quadrics.com> <20030717211301.GA4929@greglaptop.internal.keyresearch.com> Message-ID: <20030718210615.GA2096@greglaptop.internal.keyresearch.com> On Fri, Jul 18, 2003 at 10:52:20AM +0200, Franz Marini wrote: > Btw, on a side note, does anyone know if there is some compiler (both C > and F90/HPF) out there supporting some kind of auto parallelization via, > e.g. the SHMEM lib (I'm not asking for a MPI-enabled compiler, I'm not > *so* crazy ;)) ? PGI's HPF compiler can compile down to fortran + MPI calls. No doubt they have other options. It's not going to get you to a very high level of parallelism, though. greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From stiehr at admiral.umsl.edu Fri Jul 18 17:18:26 2003 From: stiehr at admiral.umsl.edu (Gary Stiehr) Date: Fri, 18 Jul 2003 16:18:26 -0500 Subject: bad job distribution with MPICH In-Reply-To: <20030717090453.GB23226@ii.uib.no> References: <20030717090453.GB23226@ii.uib.no> Message-ID: <3F186422.5030309@admiral.umsl.edu> Hi, Try to use "mpirun -nolocal -np ....". I think if you don't specify the "-nolocal" option, the job will start one process on node17 and then that process will start the other 7 processes on the remaining 6 processors not in node17; thus resulting in three processes on node15. Apparently if you use -nolocal, it will use all of the processors. I'm not sure why this is, however, adding "-nolocal" to the mpirun command may help you. HTH, Gary Jan-Frode Myklebust wrote: >Hi, > >we're running MPICH 1.2.4 on a 32 node dual cpu linux cluster (fast >ethernet), and are having some problems with the mpich job distribution. 
>An example from today: > >The PBS job: > >---------------------------------------- >#PBS -l nodes=4:ppn=2,walltime=100:00:00 ># >mpirun -np `wc -l < $PBS_NODEFILE` -machinefile $PBS_NODEFILE mfix.exe >---------------------------------------- > >is assigned to nodes: > > node17/0+node15/0+node14/0+node11/0+node17/1+node15/1+node14/1+node11/1 > >PBS generates a PBS_NODEFILE containing: > >----------------------------- >node17 >node15 >node14 >node11 >node17 >node15 >node14 >node11 >----------------------------- > >And this command is started in node 17: > > mpirun -np 8 -machinefile /var/spool/PBS/aux/20996.fire executable > >And then when I look over the nodes, there's 1 executable running on >node17, 3 on node15, 2 on node14 and 2 on node11. > >Anybody seen something like this, and maybe have an idea of what might >be causing it? > > > -jf >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From shewa at inel.gov Fri Jul 18 18:12:12 2003 From: shewa at inel.gov (Andrew Shewmaker) Date: Fri, 18 Jul 2003 16:12:12 -0600 Subject: Empty passwords vs ssh-agent? In-Reply-To: <1058555100.10220.33.camel@orion-2> References: <1058555100.10220.33.camel@orion-2> Message-ID: <3F1870BC.6030409@inel.gov> John Harrop wrote: > I'm currently switching our system from using r-commands to ssh. We > have a fairly small system with 27 nodes. The only two options I can > see with ssh are empty passwords and ssh-agent. The first looks like it > isn't much better for security than r commands. (We do have ssh with > passwords and known hosts on a portal machine.) Using ssh-agent on a > cluster looks like a potentially big hassle. Or am I mistaken about the > last impression? After all, we have nodes that are almost hitting up > time of 400 days so ssh-add would only have been run once for each > cluster user. > > What are people using as the clusters get bigger? > > Thanks is advance for your comments and thought! > > Cheers, > > John Harrop Have you heard of Keychain? http://www.gentoo.org/proj/en/keychain.xml "It acts as a front-end to ssh-agent, allowing you to easily have one long-running ssh-agent process per system, rather than per login session." I have used this before and it worked well, but I've been meaning to switch to the pam_ssh module. Does anybody use the pam_ssh module to automatically start agents on login? I saw it when I was looking up pam documentation on modules. Download through cvs http://sourceforge.net/cvs/?group_id=16000 Andrew -- Andrew Shewmaker, Associate Engineer Phone: 1-208-526-1276 Idaho National Eng. and Environmental Lab. P.0. Box 1625, M.S. 3605 Idaho Falls, Idaho 83415-3605 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From kblair at uidaho.edu Fri Jul 18 17:41:06 2003 From: kblair at uidaho.edu (Kenneth Blair) Date: Fri, 18 Jul 2003 14:41:06 -0700 Subject: monte boot fail Message-ID: <1058564466.1164.28.camel@eagle2> Having problems installing some nodes to an existing scyld cluster. 
Scyld Beowulf release 27bz-7 (based on Red Hat Linux 6.2) I run # beoboot-install 62 /dev/hda Creating boot images... Building phase 1 file system image in /tmp/beoboot.22389... ram disk image size (uncompressed): 2116K compressing...done ram disk image size (compressed): 792K Kernel image is: "/tmp/beoboot.22389". Initial ramdisk is: "/tmp/beoboot.22389.initrd". Kernel image is: "/tmp/.beoboot-install.22388". Initial ramdisk is: "/tmp/.beoboot-install.22388.initrd". Installing beoboot on partition 1 of /dev/hda. mke2fs 1.18, 11-Nov-1999 for EXT2 FS 0.5b, 95/08/09 /dev/hda1: 11/25584 files (0.0% non-contiguous), 3250/102280 blocks Done Added kernel * Beoboot installed on node 62
BUT..... when I reboot the box, it fails on the phase 1 load with a "mote_boot fail invalid argument" Has anyone seen this before??? thanks -ken -- Kenneth D. Blair Initiative for Bioinformatics and Evolutionary STudies College of Engineering (Computer Science) University of Idaho Phone: 208-885-9830 Cell: 408-888-3579 -------------- next part -------------- An HTML attachment was scrubbed... URL:
From rouds at servihoo.com Sat Jul 19 00:52:34 2003 From: rouds at servihoo.com (RoUdY) Date: Sat, 19 Jul 2003 08:52:34 +0400 Subject: Beowulf digest, Vol 1 #1382 - 12 msgs In-Reply-To: <200307181901.h6IJ1aw22843@NewBlue.Scyld.com> Message-ID:
hello everybody. I'm Roudy and I am new in making a cluster of 4-1 node. Well, I am writing to you all in a hope to hear from you very soon. The coming Monday I will need to go to the University to build this cluster. Please send me the step to undergo so that it is a success. Thanks Roudy (Mauritius) -------------------------------------------------- Get your free email address from Servihoo.com! http://www.servihoo.com The Portal of Mauritius
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From janfrode at parallab.no Sat Jul 19 07:32:56 2003 From: janfrode at parallab.no (Jan-Frode Myklebust) Date: Sat, 19 Jul 2003 13:32:56 +0200 Subject: bad job distribution with MPICH In-Reply-To: <3F186422.5030309@admiral.umsl.edu> References: <20030717090453.GB23226@ii.uib.no> <3F186422.5030309@admiral.umsl.edu> Message-ID: <20030719113256.GA23631@ii.uib.no>
On Fri, Jul 18, 2003 at 04:18:26PM -0500, Gary Stiehr wrote: > > Try to use "mpirun -nolocal -np ....".
Yes, that seems to fix it. Thanks! I also got a nice explanation in private from George Sigut explaining what MPICH was doing when not given the '-nolocal' flag.
" I seem to remember something about mpirun starting distributing the jobs NOT on the first node (i.e. in your case node17) and continuing in the circular fashion: given: 17 15 14 11 17 15 14 11 expected: 17 15 14 11 17 15 14 11 getting: | 15 14 11 17 15 14 11 (instead of 1st 17, twice 15) -> 15 "
Looks like without the '-nolocal' MPICH is reserving the first node in the machinefile for job management. -jf
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From rouds at servihoo.com Sun Jul 20 00:38:32 2003 From: rouds at servihoo.com (RoUdY) Date: Sun, 20 Jul 2003 08:38:32 +0400 Subject: configure a cluster of 4-1 node In-Reply-To: <200307191901.h6JJ1bw22768@NewBlue.Scyld.com> Message-ID:
hello everybody, Can someone mail me the step how to configure a cluster of 4-1 node using the platform Linux. Thanks Roudy -------------------------------------------------- Get your free email address from Servihoo.com! http://www.servihoo.com The Portal of Mauritius
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From dlane at ap.stmarys.ca Sun Jul 20 08:19:57 2003 From: dlane at ap.stmarys.ca (Dave Lane) Date: Sun, 20 Jul 2003 09:19:57 -0300 Subject: configure a cluster of 4-1 node In-Reply-To: References: <200307191901.h6JJ1bw22768@NewBlue.Scyld.com> Message-ID: <5.2.0.9.0.20030720091400.02585ea8@crux.stmarys.ca>
At 08:38 AM 7/20/2003 +0400, RoUdY wrote: >hello everybody, >Can someone mail me the step how to configure a cluster of 4-1 node using >the platform Linux.
Roudy, This is not a simple question that can be answered in an e-mail. I suggest you read at least some of Robert Brown's online book (a continuous work in progress for him, so it's up-to-date) at: http://www.phy.duke.edu/brahma/Resources/beowulf_book.php That will tell you everything you need to know. You may also want to look at one or more of the cluster software distributions such as: Rocks - http://www.rocksclusters.org/Rocks/ Oscar - http://oscar.sourceforge.net/ Good luck ...
Dave _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From shin at solarider.org Sun Jul 20 18:02:11 2003 From: shin at solarider.org (Shin) Date: Sun, 20 Jul 2003 23:02:11 +0100 Subject: Clusters Vs Grids Message-ID: <20030720220211.GC16662@gre.ac.uk> Hi, I got a few queries about the exact differences between clusters and grids and as I couldn't really find a general purpose grid list to post on and because this list is normally a fountain of knowledge I thought I'd ask here. However if there is somewhere more appropriate to ask then please push me in that direction. Broadly (very broadly) as I understand it a cluster is a collection of machines that will run parallel jobs for codes that require high performance - they might be connected by a high speed interconnect (ie Myrinet, SCI, etc) or via a normal ethernet type connections. The former are described as closely or tightly coupled and the latter as loosely coupled? Hopefully I'm correct so far. A cluster will normally (always?) be located at one specific location. A grid is also a collection of computing resources (cpu's, storage) that will run parallel jobs for codes that also require high performance (or perhaps very long run times?). However these resources might be distributed over a department, campus or even further afield in other organisations, in different parts of the world? As such a grid cam not be closely coupled and any codes that are developed for a grid will have to take the very high latency overheads of a grid into consideration. Not sure what the bandwidth of a grid would be like? On the other hand, a grid potentially makes more raw computing power available to a user who does not have a local adequately specced cluster available. So I was wondering just how all those coders out there who are developing codes on clusters connected with fast interconnects are going to convert their codes to use on a grid - or is there even the concept of a highly coupled grid - ie grid components that are connected via fast interconnections (10Gb ethernet perhaps?) or is that still very low in terms of what closely coupled clusters are capable of. Or are people making their clusters available as components of a grid, call it a ClusterGrid and in the same way that a grid app would specify certain resoure requirements - it could specify that it should look for an available cluster on a grid. However I can't see why establishments who have spent a lot of money developing their clusters would then make them available on a grid for others to use - when they could just create an account for the user on their cluster to run their code on. I could understand the use of single machines that are mostly idle being made available for a grid - but presumably most clusters are in constant demand and use from users. So I was just looking to see if I have my terminology above correct for grids and clusters and whether there was any concept of a tightly coupled grid or even a ClusterGrid. And if there was any useful cross over between clusters and grids - or are the two so completely different architecurally that they will never meet; or not for the near future at least. I was also curious about all these codes that use MPI across tightly coupled systems and how they would adapt to use on loosely coupled grid. 
I'm having a hard time marrying the 2 concept of a cluster and a grid together; but I'm sure much finer brains than mine have already considered all this and ruled it out/in/not-yet. Thanks for any clarity and information you can provide. Oh and if anyone has any comments on the following comment from a colleague I'd appreciate that as well; "grids - hmmm - there're just the latest computing fad - real high performance scientists won't use them and grids will be just so much hype for many years to come". Thanks Shin _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at keyresearch.com Sun Jul 20 20:31:54 2003 From: lindahl at keyresearch.com (Greg Lindahl) Date: Sun, 20 Jul 2003 17:31:54 -0700 Subject: Clusters Vs Grids In-Reply-To: <20030720220211.GC16662@gre.ac.uk> References: <20030720220211.GC16662@gre.ac.uk> Message-ID: <20030721003154.GA16512@greglaptop.greghome.keyresearch.com> > I got a few queries about the exact differences between clusters and > grids and as I couldn't really find a general purpose grid list to > post on and because this list is normally a fountain of knowledge I > thought I'd ask here. There's an IEEE Task Force on Cluster Computing that has an open mailing list. But this is reasonably on-topic. A grid deals with machines separated by significant physical distance, and that usually cross into several administrative domains. Grids have a lot more frequent failures than clusters. A cluster is usually close and administered as one system. > So I was wondering just how all those coders out there who are > developing codes on clusters connected with fast interconnects are > going to convert their codes to use on a grid The speed of light is the only thing that does not scale with Moore's Law. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rouds at servihoo.com Mon Jul 21 01:40:03 2003 From: rouds at servihoo.com (RoUdY) Date: Mon, 21 Jul 2003 09:40:03 +0400 Subject: thank Dave In-Reply-To: <200307201902.h6KJ2Dw20695@NewBlue.Scyld.com> Message-ID: Hello Dave Thanks for all roudy -------------------------------------------------- Get your free email address from Servihoo.com! http://www.servihoo.com The Portal of Mauritius _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rouds at servihoo.com Mon Jul 21 01:40:03 2003 From: rouds at servihoo.com (RoUdY) Date: Mon, 21 Jul 2003 09:40:03 +0400 Subject: thank Dave In-Reply-To: <200307201902.h6KJ2Dw20695@NewBlue.Scyld.com> Message-ID: Hello Dave Thanks for all roudy -------------------------------------------------- Get your free email address from Servihoo.com! 
http://www.servihoo.com The Portal of Mauritius _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From landman at scalableinformatics.com Mon Jul 21 02:58:13 2003 From: landman at scalableinformatics.com (Joseph Landman) Date: 21 Jul 2003 02:58:13 -0400 Subject: New version of the sge_mpiblast tool Message-ID: <1058770692.3285.13.camel@protein.scalableinformatics.com> Hi Folks: We completely rewrote our sge_mpiblast execution tool into a real program that allows you to run the excellent mpiBLAST (http://mpiblast.lanl.gov) code within the SGE queuing system on a bio-cluster. The new code is named run_mpiblast and is available from our download page (http://scalableinformatics.com/downloads/). Documentation is in process, and the source is heavily commented. The principal differences between the old and new versions are . error detection and problem reporting . file staging . rewritten in a real programming language, no more shell script . works within SGE, or from the command line . uses config files . run isolation . debugging and verbosity controls This is a merge between an internal project and the ideas behind the original code. Please give it a try and let us know how it behaves. The link to the information page is http://scalableinformatics.com/sge_mpiblast.html . Joe -- Joseph Landman, Ph.D Scalable Informatics LLC email: landman at scalableinformatics.com web: http://scalableinformatics.com phone: +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From nixon at nsc.liu.se Mon Jul 21 04:18:15 2003 From: nixon at nsc.liu.se (nixon at nsc.liu.se) Date: Mon, 21 Jul 2003 10:18:15 +0200 Subject: Clusters Vs Grids In-Reply-To: <20030720220211.GC16662@gre.ac.uk> (shin@solarider.org's message of "Sun, 20 Jul 2003 23:02:11 +0100") References: <20030720220211.GC16662@gre.ac.uk> Message-ID: Shin writes: > Broadly (very broadly) as I understand it a cluster is a collection > of machines that will run parallel jobs for codes that require high > performance - they might be connected by a high speed interconnect > (ie Myrinet, SCI, etc) or via a normal ethernet type connections. > The former are described as closely or tightly coupled and the > latter as loosely coupled? Hopefully I'm correct so far. You're basically correct, except that a cluster doesn't necessarily run parallel jobs. A common situation is that you have lots and lots of non-interdependent, single-CPU jobs that you want to run as quickly as possible. > A grid is also a collection of computing resources (cpu's, storage) > that will run parallel jobs for codes that also require high > performance (or perhaps very long run times?). However these > resources might be distributed over a department, campus or even > further afield in other organisations, in different parts of the > world? Again, basically correct, except for the same point as above. I think the key issues about a grid is that the resources are: a) possibly distributed over large geographical distances, b) possibly belonging to different organizations with different policies; there is no centralized administrative control over them. 
> As such a grid cam not be closely coupled and any codes that are > developed for a grid will have to take the very high latency > overheads of a grid into consideration. Not sure what the bandwidth > of a grid would be like? That only depends on how fat pipes you put in. In Nordugrid there is gigabit-class bandwidth between (most of) the resources. The latency, on the other hand, is harder to do anything about. > So I was wondering just how all those coders out there who are > developing codes on clusters connected with fast interconnects are > going to convert their codes to use on a grid - or is there even the > concept of a highly coupled grid - ie grid components that are > connected via fast interconnections (10Gb ethernet perhaps?) or is > that still very low in terms of what closely coupled clusters are > capable of. There are MPI implementations that run in grid environments, but of course you might get horrible latency if you have processes running at different sites. > Or are people making their clusters available as components of a > grid, call it a ClusterGrid and in the same way that a grid app > would specify certain resoure requirements - it could specify that > it should look for an available cluster on a grid. That is a much more likely scenario for running parallel applications on a grid, yes. > However I can't see why establishments who have spent a lot of money > developing their clusters would then make them available on a grid > for others to use - when they could just create an account for the > user on their cluster to run their code on. It is partly a question of administrative overhead. In an non-grid situation, if a user gets resources allocated to him at n computing sites, he typically needs to go through n different account activation processes. Now, consider a large project like LHC at CERN, where you have dozens and dozens of participating computing sites and a large number of users - it's just not feasible to have individual accounts at individual sites. Another part is resource location; if you have dozens and dozens of potential job submission sites, you really don't want to manually keep track of the current load at the different sites. In a grid situation, you just need your grid identity, which is a member of the project virtual organization. You only need to submit your job to the grid, and it will automatically be scheduled on the least loaded site where your project VO has been granted resources. (In theory at least. I'm not aware of many grid projects that have gotten this far. Nordugrid is one, though.) > So I was just looking to see if I have my terminology above correct > for grids and clusters and whether there was any concept of a > tightly coupled grid or even a ClusterGrid. And if there was any > useful cross over between clusters and grids - or are the two so > completely different architecurally that they will never meet; or > not for the near future at least. Think of the grid as a generalized way of locating and getting access to resources in a fluffy, vague "network cloud" of computing resources. Clusters are just one type of resource that can be present in the cloud. Certain types of applications run best on clusters with high-speed interconnects - well, then you can use the grid to locate and get access to suitable clusters. 
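To put a rough number on that latency point (back-of-envelope only, and the 1000 km is just an illustrative distance): light in fibre covers roughly 200,000 km/s, so two sites 1000 km apart are at least 5 ms apart one way, 10 ms per round trip, before a single router or protocol stack has added anything. A cluster interconnect sits in the handful-of-microseconds range, so you are three to four orders of magnitude worse off, and that part never improves no matter how fat the pipes get. The arithmetic is simply

   1000 km / 200000 km/s = 0.005 s = 5 ms

which is why tightly coupled MPI codes and wide-area grids do not mix well.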
-- Leif Nixon Systems expert ------------------------------------------------------------ National Supercomputer Centre Linkoping University ------------------------------------------------------------ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jcownie at etnus.com Mon Jul 21 05:30:12 2003 From: jcownie at etnus.com (James Cownie) Date: Mon, 21 Jul 2003 10:30:12 +0100 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: Message from Greg Lindahl of "Fri, 18 Jul 2003 09:14:05 PDT." <20030718161405.GA13859@greglaptop.greghome.keyresearch.com> Message-ID: <19eWzk-260-00@etnus.com> > In both cases you're using different terminology than the SALC folks > do. Perhaps you could give us a reference to the real definition of SALC then ? Google shows up a selection of _different_ versions of the acronym Shared Address Local Copy Shared Address Local Cache and you used Shared Address Local Consistency Since the "Shared Address Local Copy" is in a paper by Bob Numrich, I think this is likely the right one ? If we can't even agree what the acronym stands for it's a bit hard to decide what it means :-( -- Jim James Cownie Etnus, LLC. +44 117 9071438 http://www.etnus.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rbw at ahpcrc.org Mon Jul 21 10:15:04 2003 From: rbw at ahpcrc.org (Richard Walsh) Date: Mon, 21 Jul 2003 09:15:04 -0500 Subject: Global Shared Memory and SCI/Dolphin Message-ID: <200307211415.h6LEF4m20454@mycroft.ahpcrc.org> Steffen Persvold wrote: >Our message passing software may runs on all four interconnects (and ethernet). But the one-sided features of the (cray-like) SHMEM and MPI-2 libraries need underlying hardware support to perform. You must be saying that the Scali implements the MPI-2 one-sided routines and they can be called even over Ethernet, but are actually two-sided emulations with two-sided performance on latency (on Ethernet), right? Regards, rbw #--------------------------------------------------- # Richard Walsh # Project Manager, Cluster Computing, Computational # Chemistry and Finance # netASPx, Inc. # 1200 Washington Ave. So. # Minneapolis, MN 55415 # VOX: 612-337-3467 # FAX: 612-337-3400 # EMAIL: rbw at networkcs.com, richard.walsh at netaspx.com # rbw at ahpcrc.org # #--------------------------------------------------- # "Without mystery, there can be no authority." # -Charles DeGaulle #--------------------------------------------------- # "Why waste time learning when ignornace is # instantaneous?" 
-Thomas Hobbes #--------------------------------------------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at keyresearch.com Mon Jul 21 17:36:44 2003 From: lindahl at keyresearch.com (Greg Lindahl) Date: Mon, 21 Jul 2003 14:36:44 -0700 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: <19eWzk-260-00@etnus.com> References: <20030718161405.GA13859@greglaptop.greghome.keyresearch.com> <19eWzk-260-00@etnus.com> Message-ID: <20030721213644.GA1635@greglaptop.internal.keyresearch.com> On Mon, Jul 21, 2003 at 10:30:12AM +0100, James Cownie wrote: > Perhaps you could give us a reference to the real definition of SALC > then ? > > Google shows up a selection of _different_ versions of the acronym > > Shared Address Local Copy > Shared Address Local Cache > and you used > Shared Address Local Consistency What makes you think that the 1st and 3rd are actually different? They aren't. I've never heard the 2nd. As for what it *means*, it's exactly the model provided by the SHMEM library, or that provided by UPC or CoArray Fortran. It is not the model supported by ccNuma or MPI-1. Is this not clear? -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at keyresearch.com Mon Jul 21 21:20:11 2003 From: lindahl at keyresearch.com (Greg Lindahl) Date: Mon, 21 Jul 2003 18:20:11 -0700 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: <1058770692.3285.13.camel@protein.scalableinformatics.com> References: <1058770692.3285.13.camel@protein.scalableinformatics.com> Message-ID: <20030722012011.GA2127@greglaptop.internal.keyresearch.com> p.s. it would also help if you could explain what is different from the last time we had this same discussion, about SALC, on this very list, in the year 2000. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hahn at physics.mcmaster.ca Tue Jul 22 00:07:17 2003 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Tue, 22 Jul 2003 00:07:17 -0400 (EDT) Subject: Clusters Vs Grids In-Reply-To: <20030720220211.GC16662@gre.ac.uk> Message-ID: > I'm having a hard time marrying the 2 concept of a cluster and a > grid together; but I'm sure much finer brains than mine have already "grid" is just a marketing term stemming from the fallacy that networks are getting a lot faster/better/cheaper. without those amazing crooks at worldcom, I figure grid would never have accumulated as much attention as it has. I don't know about you, but my wide-area networking experience has improved by about a factor of 10 over the past 10-15 years. network bandwidth and latency is *not* on an exponential curve, but CPU power is. (as is disk density - not surprising when you consider that CPUs and disks are both *areal* devices, unlike networks.) so we should expect it to fall further behind, meaning that for a poorly-networked cluster (aka grid), you'll need even looser-coupled programs than today. 
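a crude way to see why (all the numbers here are invented round figures, not measurements): shipping a 10 GB dataset across a WAN that sustains 1 MB/s takes the better part of three hours, while a cheap local disk at 30 MB/s reads the same data in under six minutes -

   awk 'BEGIN { mb = 10*1024;
                printf "WAN  at  1 MB/s: %.0f s\n", mb/1;
                printf "disk at 30 MB/s: %.0f s\n", mb/30 }'

- so unless you are doing an enormous amount of computation per byte, it is cheaper to move the computation to the data than the data to the computation.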
YOU MUST READ THIS: http://www.clustercomputing.org/content/tfcc-5-1-gray.html cycle scavenging is a wonderful thing, but it's about like having a compost heap in your back yard, or a neighborhood aluminum can collector ;) > I'd appreciate that as well; "grids - hmmm - there're just the > latest computing fad - real high performance scientists won't use > them and grids will be just so much hype for many years to come". my users are dramatically bifurcated into two sets: those who want 1K CPUs with 2GB/CPU and >500 MB/s, <5 us interconnect, versus those who want 100 CPUs with 200KB apiece and 10bT. the latter could be using a grid; it's a lot easier for them to grab a piece of the cluster pie, though. I wonder whether that's the fate of grids in general: not worth the trouble of setting up, except in extreme cases (seti at home, etc). _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Kim.Branson at csiro.au Tue Jul 22 04:18:20 2003 From: Kim.Branson at csiro.au (Kim Branson) Date: Tue, 22 Jul 2003 18:18:20 +1000 Subject: Clusters Vs Grids In-Reply-To: References: <20030720220211.GC16662@gre.ac.uk> Message-ID: <20030722181820.2e8522be.Kim.Branson@csiro.au> > my users are dramatically bifurcated into two sets: those who want > 1K CPUs with 2GB/CPU and >500 MB/s, <5 us interconnect, versus those > who want 100 CPUs with 200KB apiece and 10bT. the latter could be > using a grid; it's a lot easier for them to grab a piece of the > cluster pie, though. I wonder whether that's the fate of grids > in general: not worth the trouble of setting up, except in extreme > cases (seti at home, etc). Grids are great for my purposes, virtual screening of large chemical databases. We have lots of small independent jobs, some work i have done with the use of grids for virtual screening ( using the molecular docking program DOCK ) can be found at http://www.cs.mu.oz.au/~raj/vlab/index.html there are links to some publications off the site. This work was very much a test to see how grids and scheduling would perform. To my suprise i got better performance from my small local 64 node 1ghz athlon cluster than i did for the grid for most calculations. The use of the machines we were soaking time on and the time taken to run and return the calculations means the dedicated cluster is a better option. For very large datasets the grid does begin to win out, but it is dependent on the load on the grid machines. If you have no local resources a grid is a good option for these caclculations but a large dedicated machine is better for small jobs. The lack of data security means most of our data cannot be dispersed on a grid, and this is perhaps another point to consider when evaluating the usefullness of grids. Would you be happy if someone else could acess your calculation results and inputs? our powers that be certainly don't. 
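In case the mechanics are of interest, the pattern on the local cluster is nothing more elaborate than a large pile of independent docking runs pushed through the queueing system. A minimal sketch of the idea, written for SGE here simply because it has task arrays (the paths, the chunk count and the run_dock wrapper are placeholders, not our production setup):

  #!/bin/sh
  # dock_chunk.sh -- submit as:  qsub -t 1-640 dock_chunk.sh
  # SGE starts one copy per task and sets SGE_TASK_ID to 1..640;
  # each task docks one pre-split slice of the ligand database.
  CHUNK=/data/ligands/chunk.$SGE_TASK_ID
  ./run_dock $CHUNK > /data/results/$SGE_TASK_ID.out

A grid version has to do the same splitting and then stage each chunk (and the receptor) out to whichever remote machine picks the task up, which is a large part of the run-and-return overhead mentioned above.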
cheers kim -- ______________________________________________________________________ Dr Kim Branson Computational Drug Design Structural Biology CSIRO Health Sciences and Nutrition Walter and Eliza Hall Institute Royal Parade, Parkville, Melbourne, Victoria Ph 61 03 9662 7136 Email kbranson at wehi.edu.au ______________________________________________________________________ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From andrewxwang at yahoo.com.tw Tue Jul 22 07:48:42 2003 From: andrewxwang at yahoo.com.tw (=?big5?q?Andrew=20Wang?=) Date: Tue, 22 Jul 2003 19:48:42 +0800 (CST) Subject: New version of the sge_mpiblast tool In-Reply-To: <1058770692.3285.13.camel@protein.scalableinformatics.com> Message-ID: <20030722114842.50715.qmail@web16811.mail.tpe.yahoo.com> Somewhat related, Integrating BLAST with SGE: http://developers.sun.com/solaris/articles/integrating_blast.html Andrew. --- Joseph Landman ????> Hi Folks: > > We completely rewrote our sge_mpiblast execution > tool into a real > program that allows you to run the excellent > mpiBLAST > (http://mpiblast.lanl.gov) code within the SGE > queuing system on a > bio-cluster. The new code is named run_mpiblast and > is available from > our download page > (http://scalableinformatics.com/downloads/). > Documentation is in process, and the source is > heavily commented. > > The principal differences between the old and new > versions are > > . error detection and problem reporting > . file staging > . rewritten in a real programming language, no more > shell script > . works within SGE, or from the command line > . uses config files > . run isolation > . debugging and verbosity controls > > This is a merge between an internal project and > the ideas behind the > original code. Please give it a try and let us know > how it behaves. > The link to the information page is > http://scalableinformatics.com/sge_mpiblast.html . > > Joe > > -- > Joseph Landman, Ph.D > Scalable Informatics LLC > email: landman at scalableinformatics.com > web: http://scalableinformatics.com > phone: +1 734 612 4615 > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or > unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ----------------------------------------------------------------- ??? Yahoo!?? ??????? - ???????????? http://fate.yahoo.com.tw/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From flifson at cs.uct.ac.za Mon Jul 21 14:30:10 2003 From: flifson at cs.uct.ac.za (Farrel Lifson) Date: 21 Jul 2003 20:30:10 +0200 Subject: In need of Beowulf data Message-ID: <1058812210.4397.78.camel@asgard.cs.uct.ac.za> Hi there, As part of my M.Sc I hope to carry out a case study using Markov Reward Models of a large distributed system. Being a Linux fan, a Beowulf cluster was the obvious choice. Performance data seems to be quite readily available, however finding reliability data seems to be more of a challenge. Specifically I am looking for real word failure and repair rates for the various components of a Beowulf node (HDD, power supply, CPU, RAM) and the larger cluster (software failure, network, etc). 
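To make the intended use concrete: the simplest way to combine such numbers is the usual series-system assumption (the node is down as soon as any one component is down, failures independent and exponentially distributed), under which the component failure rates simply add. With purely illustrative MTBFs, say 500,000 h for a disk, 100,000 h for a power supply and 1,000,000 h each for CPU and RAM, none of them real vendor figures -

   lambda_node = 1/500000 + 1/100000 + 2/1000000 = 1.4e-5 per hour
   MTBF_node   = 1 / lambda_node  ~=  71,000 hours

- and over a 64-node cluster the expected time to the first node failure is roughly 71,000/64, about 1,100 hours, i.e. a dead node every six or seven weeks even with fairly generous component figures.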
While some components have a mean time to failure rating, this is sometimes underestimated by the manufacturer and I am interested in getting an as accurate as possible model of a real world Beowulf cluster. If anyone has any data they would be willing to share, or if you know of any papers or reports which list such data I would greatly appreciate any links or pointers to them. Thanks in advance, Farrel Lifson -- Data Network Architecture Research Lab mailto:flifson at cs.uct.ac.za Dept. of Computer Science http://people.cs.uct.ac.za/~flifson University of Cape Town +27-21-650-3127 -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From c00jsh00 at nchc.gov.tw Sat Jul 19 05:40:35 2003 From: c00jsh00 at nchc.gov.tw (Jyh-Shyong Ho) Date: Sat, 19 Jul 2003 17:40:35 +0800 Subject: channel bonding on SuSE Message-ID: <3F191213.A1898B95@nchc.gov.tw> Hi, Has anyone successfully set up channel bonding in SuSE? I tried and failed many times and I think it might be the time to ask for help. I am using SuSE Linux Enterprise Server 8 for AMD 64, and I tried to set up the channel bonding for the two Broadcom gigabit LAN ports on the HDAMA motherboard (for dual Opteron CPUs). I followed the instructions in .../Documentation/networking/bonding.txt: 1. modify file /etc/modules.conf to include the line: alias bond0 bonding probeall bond0 eth0 eth1 bonding 2. create ifenslave 3. create /etc/sysconfig/network/ifcfg-bond0 as DEVICE=bond0 IPADDR=192.168.3.60 NETMASK=255.255.255.0 NETWORK=192.168.3.0 BROADCAST=192.168.3.255 ONBOOT=yes STARTMODE='onboot' BOOTPROTO=none USERCTL=no and modify file ifcfg-eth0 as BROADCAST='192.168.3.255' IPADDR='192.168.3.10' NETMASK='255.255.255.0' NETWORK='192.168.3.0' REMOTE_IPADDR='' STARTMODE='onboot' UNIQUE='QOEa.mRtDs8d6UMD' WIRELESS='no' DEVICE='eth0' USERCTL='no' ONBOOT='yes' MASTER='bond0' SLAVE='yes' BOOTPROTO='none' and modify file ifcfg-eth1 as BROADCAST='192.168.3.255' IPADDR='192.168.3.40' NETMASK='255.255.255.0' NETWORK='192.168.3.0' REMOTE_IPADDR='' STARTMODE='onboot' UNIQUE='QOEa.mRtDs8d6UMD' WIRELESS='no' DEVICE='eth1' USERCTL='no' ONBOOT='yes' MASTER='bond0' SLAVE='yes' BOOTPROTO='none' 4. then I tried several ways to bring up the interface bond0: a. ifup bond0 this caused the system hang, and have to reboot the system b. /etc/init.d/network restart or reboot did not bring up bond0 c. /sbin/ifconfig bond0 192.168.3.60 netmask 255.255.255.0 \ broadcast 192.168.3.255 up this caused the system hang, and have to reboot the system I did make the kernel and made sure that Network Devices/bonding devices was made as a module. I have no idea how to proceed next, so if someone has the experience, please help. Regards Jyh-Shyong Ho, PhD. Research Scientist National Center for High-Performance Computing Hsinchu, Taiwan, ROC _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From gerry.creager at tamu.edu Tue Jul 22 07:40:08 2003 From: gerry.creager at tamu.edu (Gerry Creager N5JXS) Date: Tue, 22 Jul 2003 06:40:08 -0500 Subject: Clusters Vs Grids In-Reply-To: References: Message-ID: <3F1D2298.9080808@tamu.edu> I'd offer that we're going to see grids grow for at least the forseeable (?sp?; ?coffee?) future. 
I think we need to coin another term, however, for the applications that will run on them in the near term: "pathetically" parallel. We've seen the growth of clusters, especially in the NUMA/embarrassingly parallel regime. These have proven to work well. Across the 'Grid' we appreciate today, we either see parallellism that simply benefits from distribution due to the vast amount of data and thus benefits from cycle-stealing, or applications that are totally tolerant of disparate latency issues. But what does the future hold? I can foresee an application that uses distributed storage to preposition an entire input dataset so that all the distributed nodes can access it, and a version of the Logistical Backbone that queues data parcels for acquisition and processing and manages the reintegration of the returned results into an output queue. Along another line, I can envision an application prepositioning all the data across the distributed nodes and using an enhanced version of semaphores to to signal when a chunk is processed, then reintegrating the pieces later. Done correctly, both of these become grid-enabling mechanisms. They require atraditional thinking to overcome the non-exponential curve associated with network speed and latency. They will benefit from the introduction of some of the network protocols we've come to know and dream of, including MPLS and some real form of QoS agreement among various carriers, ISP, Universities and other endpoints. And they won't happen tomorrow. IPv6 may enable some of this; QoS is integrated into its very fabric, but agreement on QoS implementation is still far from universal. Worse, while carriers are looking at, or actually implementing IPv6 within their network cores, they are not necessarily bringing it to the edge. Unless you're in Japan or Europe. Oh, I'm sorry, this *IS* a globally distributed list. Is anyone from Level 3 or AT&T listening? The concept of grid computing has taken me a while to embrace, and I'm not sure I like it yet. Overall, I tend to agree with Mark's rather cynical assessment that it's a WorldCom marketting ploy that acquired a life of its own. gerry Mark Hahn wrote: >>I'm having a hard time marrying the 2 concept of a cluster and a >>grid together; but I'm sure much finer brains than mine have already > > > "grid" is just a marketing term stemming from the fallacy that networks > are getting a lot faster/better/cheaper. without those amazing crooks > at worldcom, I figure grid would never have accumulated as much attention > as it has. I don't know about you, but my wide-area networking experience > has improved by about a factor of 10 over the past 10-15 years. > > network bandwidth and latency is *not* on an exponential curve, > but CPU power is. (as is disk density - not surprising when you consider > that CPUs and disks are both *areal* devices, unlike networks.) so we should > expect it to fall further behind, meaning that for a poorly-networked cluster > (aka grid), you'll need even looser-coupled programs than today. > > YOU MUST READ THIS: > http://www.clustercomputing.org/content/tfcc-5-1-gray.html > > cycle scavenging is a wonderful thing, but it's about like having > a compost heap in your back yard, or a neighborhood aluminum > can collector ;) > > >>I'd appreciate that as well; "grids - hmmm - there're just the >>latest computing fad - real high performance scientists won't use >>them and grids will be just so much hype for many years to come". 
> > > my users are dramatically bifurcated into two sets: those who want > 1K CPUs with 2GB/CPU and >500 MB/s, <5 us interconnect, versus those > who want 100 CPUs with 200KB apiece and 10bT. the latter could be > using a grid; it's a lot easier for them to grab a piece of the > cluster pie, though. I wonder whether that's the fate of grids > in general: not worth the trouble of setting up, except in extreme > cases (seti at home, etc). > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Gerry Creager -- gerry.creager at tamu.edu Network Engineering -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.847.8578 Page: 979.228.0173 Office: 903A Eller Bldg, TAMU, College Station, TX 77843 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From sp at scali.com Mon Jul 21 10:54:31 2003 From: sp at scali.com (Steffen Persvold) Date: Mon, 21 Jul 2003 16:54:31 +0200 (CEST) Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: <200307211415.h6LEF4m20454@mycroft.ahpcrc.org> Message-ID: On Mon, 21 Jul 2003, Richard Walsh wrote: > > Steffen Persvold wrote: > > >Our message passing software may runs on all four interconnects (and ethernet). > > But the one-sided features of the (cray-like) SHMEM and MPI-2 libraries > need underlying hardware support to perform. You must be saying that the > Scali implements the MPI-2 one-sided routines and they can be called even > over Ethernet, but are actually two-sided emulations with two-sided performance > on latency (on Ethernet), right? We don't have MPI-2 one-sided, yet, but since we now run on several interconnects, when we implement it we will use the hardware RDMA features where we can and emulate it where we can't, yes. Regards, -- Steffen Persvold ,,, mailto: sp at scali.com Senior Software Engineer (o-o) http://www.scali.com -----------------------------oOO-(_)-OOo----------------------------- Scali AS, PObox 150, Oppsal, N-0619 Oslo, Norway, Tel: +4792484511 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From ktaka at clustcom.com Tue Jul 22 03:13:58 2003 From: ktaka at clustcom.com (Kimitoshi Takahashi) Date: Tue, 22 Jul 2003 16:13:58 +0900 Subject: MTU change on bonded device Message-ID: <200307220713.AA00264@grape3.clustcom.com> Hello, I'm a newbie in the cluster field. I wanted to use jumbo frame on channel bonded device. Any number larger than 1500 seems to be rejected. # ifconfig bond0 mtu 1501 SIOCSIFMTU: Invalid argument # ifconfig bond0 mtu 8000 SIOCSIFMTU: Invalid argument Does anyone know if the bonding driver support Jumbo Frame ? Or, am I doing all wrong ? 
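The only further experiment I can think of (a pure guess - I do not know whether the 2.4.20 bonding driver will accept a larger MTU under any circumstances) is to take the bond down, raise the MTU on both slaves first, and only then try bond0 itself:

  ifconfig bond0 down
  ifconfig eth1 mtu 7000
  ifconfig eth2 mtu 7000
  ifconfig bond0 mtu 7000
  ifconfig bond0 up

If that is known to be a dead end, or if the driver simply needs a newer version or a patch, I would be glad to hear it.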
I could change MTUs of enslaved devices, # ifconfig eth2 mtu 7000 # ifconfig eth2 eth2 Link encap:Ethernet HWaddr 00:02:B3:96:0A:16 inet addr:192.168.0.201 Bcast:192.168.0.255 Mask:255.255.255.0 UP BROADCAST RUNNING SLAVE MULTICAST MTU:7000 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:25 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:100 RX bytes:0 (0.0 b) TX bytes:4198 (4.0 Kb) Interrupt:16 Base address:0xd800 Memory:ff860000-ff880000 I use 2.4.20 stock kernel, with channel bonding enabled. The bonded devices are eth1(e1000) and eth2(e1000). Here is the relevant part of the ifconfig output, # ifconfig -a bond0 Link encap:Ethernet HWaddr 00:02:B3:96:0A:16 inet addr:192.168.0.201 Bcast:192.168.0.255 Mask:255.255.255.0 UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:47 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:0 (0.0 b) TX bytes:7305 (7.1 Kb) eth1 Link encap:Ethernet HWaddr 00:02:B3:96:0A:16 inet addr:192.168.0.201 Bcast:192.168.0.255 Mask:255.255.255.0 UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:24 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:100 RX bytes:0 (0.0 b) TX bytes:3625 (3.5 Kb) Interrupt:22 Base address:0xd880 Memory:ff8c0000-ff8e0000 eth2 Link encap:Ethernet HWaddr 00:02:B3:96:0A:16 inet addr:192.168.0.201 Bcast:192.168.0.255 Mask:255.255.255.0 UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:23 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:100 RX bytes:0 (0.0 b) TX bytes:3680 (3.5 Kb) Interrupt:16 Base address:0xd800 Memory:ff860000-ff880000 Thanks in advance. Kimitoshi Takahashi ktaka at clustcom.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at keyresearch.com Tue Jul 22 13:06:28 2003 From: lindahl at keyresearch.com (Greg Lindahl) Date: Tue, 22 Jul 2003 10:06:28 -0700 Subject: Clusters Vs Grids In-Reply-To: <3F1D2298.9080808@tamu.edu> References: <3F1D2298.9080808@tamu.edu> Message-ID: <20030722170628.GA1355@greglaptop.internal.keyresearch.com> On Tue, Jul 22, 2003 at 06:40:08AM -0500, Gerry Creager N5JXS wrote: > I'd offer that we're going to see grids grow for at least the forseeable > (?sp?; ?coffee?) future. I think we need to coin another term, however, > for the applications that will run on them in the near term: > "pathetically" parallel. The people who have been doing {distributed computing, metacomputing, p2p, grids, insert new trendy term here} for a long time have built systems which can run moderately data-intensive programs, not just SETI at home. In fact, a realistic assessment of the bandwidth needed for non-pathetic programs was the basis of the TeraGrid project. > But what does the future hold? I can foresee an application that uses > distributed storage to preposition an entire input dataset so that all > the distributed nodes can access it, Or, you could use existing systems that do exactly that, which were foreseen more than a decade ago, had multiple implementations 5 years ago, and are heading towards production use today. > Overall, I tend to agree with Mark's rather cynical assessment that > it's a WorldCom marketting ploy that acquired a life of its own. 
Which doesn't match up with the age of current grid efforts, which predate WorldCom buying UUNet. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From James.P.Lux at jpl.nasa.gov Tue Jul 22 13:49:48 2003 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Tue, 22 Jul 2003 10:49:48 -0700 Subject: In need of Beowulf data In-Reply-To: <1058812210.4397.78.camel@asgard.cs.uct.ac.za> Message-ID: <5.2.0.9.2.20030722104510.01899928@mailhost4.jpl.nasa.gov> At 08:30 PM 7/21/2003 +0200, Farrel Lifson wrote: >Hi there, > >As part of my M.Sc I hope to carry out a case study using Markov Reward >Models of a large distributed system. Being a Linux fan, a Beowulf >cluster was the obvious choice. > >Performance data seems to be quite readily available, however finding >reliability data seems to be more of a challenge. Specifically I am >looking for real word failure and repair rates for the various >components of a Beowulf node (HDD, power supply, CPU, RAM) and the >larger cluster (software failure, network, etc). > >While some components have a mean time to failure rating, this is >sometimes underestimated by the manufacturer and I am interested in >getting an as accurate as possible model of a real world Beowulf >cluster. I don't know that the manufacturer failure rate data is actually underestimated (they tend to pay pretty close attention to this, it being a legally enforceable specification), but, more probably, the data is being misinterpreted by the casual consumer of it. Take, for example, an MTBF rating for a disk drive. A typical rating might be 50,000 hrs. However, what temperature is that rating at (20C)? What temperature are you really running the drive at (40C?), What's the life derating for the 20C temperature rise? What sort of operation rate is presumed in that failure rate (constant seeks, or some smaller duty cycle)? What counts as a failure? How many power on/power off cycles are assumed? Most of the major manufacturers have very detailed writeups on the reliability of their components (i.e. go to Seagate's site, and there's many pages describing how they do life tests, what the results are, how to apply them, etc.) For "no-name" power supplies, though, you might have a bit more of a challenge. >If anyone has any data they would be willing to share, or if you know of >any papers or reports which list such data I would greatly appreciate >any links or pointers to them. > >Thanks in advance, >Farrel Lifson >-- >Data Network Architecture Research Lab mailto:flifson at cs.uct.ac.za >Dept. of Computer Science http://people.cs.uct.ac.za/~flifson >University of Cape Town +27-21-650-3127 James Lux, P.E. Spacecraft Telecommunications Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Tue Jul 22 18:56:17 2003 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue, 22 Jul 2003 18:56:17 -0400 (EDT) Subject: Clusters Vs Grids In-Reply-To: <3F1D2298.9080808@tamu.edu> Message-ID: On Tue, 22 Jul 2003, Gerry Creager N5JXS wrote: > "pathetically" parallel. We've seen the growth of clusters, especially Gerry, you're a genius. 
Pathetically parallel indeed. I'll have to work this into my next talk...:-) rgb (back from a fairly obvious, long, vacation:-) -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From dtj at uberh4x0r.org Tue Jul 22 21:46:33 2003 From: dtj at uberh4x0r.org (Dean Johnson) Date: 22 Jul 2003 20:46:33 -0500 Subject: Clusters Vs Grids In-Reply-To: References: Message-ID: <1058924793.1154.4.camel@terra> On Tue, 2003-07-22 at 17:56, Robert G. Brown wrote: > On Tue, 22 Jul 2003, Gerry Creager N5JXS wrote: > > > "pathetically" parallel. We've seen the growth of clusters, especially > > Gerry, you're a genius. Pathetically parallel indeed. I'll have to > work this into my next talk...:-) > > rgb > > (back from a fairly obvious, long, vacation:-) While I agree that there needs to be a term, I think "pathetically parallel" is ambiguous. We know what we are talking about, having been steeped in the world of parallelism, but others aren't. If I am pathetic at sports, it means that I am not very athletic, ie pathetically athletic. Perhaps "Frighteningly"... ah, nevermind. ;-) -- -Dean _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mitchel at navships.com Wed Jul 23 15:04:38 2003 From: mitchel at navships.com (Mitchel Kagawa) Date: Wed, 23 Jul 2003 09:04:38 -1000 Subject: Thermal Problems Message-ID: <002701c3514d$43af4f00$6f01a8c0@Navatek.local> I run a small 64 node cluster each with dual AMD MP2200's in a 1U chassis. I am having problems with some of the nodes overheating and shutting down. We are using Dynatron 1U CPU fans which are supposed to spin at 5400 rpm but I notice that a lot (25%) of the fans tend to freeze up or blow the bearings and spin at only 1000 RPM, which causes the cpu to overheat. After careful inspection I noticed that the heatsink and fan sit very close to the lid of the case. I was wondering how much clearance is needed between the lid and the fan that blown down onto the short copper heatsink? When I put the lid on the case it is almost as if the fan is working in a vaccum because it actually speeds up an aditional 600-700 rpm to over 6000 rpm... like there is no air resistance. Could this be why the fans are crapping out? I was thinking that a 60x60x10mm cpu fan that has air intakes on the side of the fan might work better but I have not seen any... have you? Also the vendor suggested that we sepetate the 1U cases because he belives that there is heat transfer between the nodeswhen they are stacked right on top of eachother. I thought that if one node is running at 50c and another node is running at 50c it wont generate a combined heatload of more than 50c right. Mitchel Kagawa Systems Admin. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Wed Jul 23 16:14:40 2003 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Wed, 23 Jul 2003 16:14:40 -0400 (EDT) Subject: Thermal Problems In-Reply-To: <002701c3514d$43af4f00$6f01a8c0@Navatek.local> Message-ID: On Wed, 23 Jul 2003, Mitchel Kagawa wrote: > I run a small 64 node cluster each with dual AMD MP2200's in a 1U chassis. > I am having problems with some of the nodes overheating and shutting down. > We are using Dynatron 1U CPU fans which are supposed to spin at 5400 rpm but > I notice that a lot (25%) of the fans tend to freeze up or blow the bearings > and spin at only 1000 RPM, which causes the cpu to overheat. After careful > inspection I noticed that the heatsink and fan sit very close to the lid of > the case. I was wondering how much clearance is needed between the lid and > the fan that blown down onto the short copper heatsink? When I put the lid > on the case it is almost as if the fan is working in a vaccum because it > actually speeds up an aditional 600-700 rpm to over 6000 rpm... like there > is no air resistance. Could this be why the fans are crapping out? I was > thinking that a 60x60x10mm cpu fan that has air intakes on the side of the > fan might work better but I have not seen any... have you? > > Also the vendor suggested that we sepetate the 1U cases because he belives > that there is heat transfer between the nodeswhen they are stacked right on > top of eachother. I thought that if one node is running at 50c and another > node is running at 50c it wont generate a combined heatload of more than 50c > right. AMD's really hate to run hot, and duals in 1U require some fairly careful engineering to run cool enough, stably. Who is your vendor? Did they do the node design or did you? If they did, you should be able to ask them to just plain fix it -- replace the fans or if necessary reengineer the whole case -- to make the problem go away. Issues like fan clearance and stacking and overall airflow through the case are indeed important. Sometimes things like using round instead of ribbon cables (which can turn sideways and interrupt airflow) makes a big difference. Keeping the room's ambient air "cold" (as opposed to "comfortable") helps. There is likely some heat transfer vertically between the 1U cases, but if you go to the length of separating them you might as well have used 2U cases in the first place. >From your description, it does sound like you have some bad fans. Whether they are bad (as in a bad design, poor vendor), or bad (as in installed "incorrectly" in a case/mobo with inadequate clearance causing them to fail), or bad (as in you just happened to get some fans from a bad production batch but replacements would probably work fine) it is very hard to say, and I don't envy you the debugging process of finding out which. We've been the route of replacing all of the fans once ourselves so it can certainly happen... rgb > > > Mitchel Kagawa > Systems Admin. > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mitchel at navships.com Wed Jul 23 16:33:26 2003 From: mitchel at navships.com (Mitchel Kagawa) Date: Wed, 23 Jul 2003 10:33:26 -1000 Subject: pfilter.conf Message-ID: <005001c35159$ab4a2c50$6f01a8c0@Navatek.local> I'm having problems finding out how to open a range of ports that are being filtered using the pfilter service. I am able to open a specific port by editing the /etc/pfilter.conf file with a line like 'open tcp 3389' but for the life of me I can't figure out how to open a range of ports like 30000 - 33000 and I have serached everywhere on the net can any of you help me out? thanks! Mitchel Kagawa Systems Administrator Mitchel Kagawa Systems Administrator _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From James.P.Lux at jpl.nasa.gov Wed Jul 23 18:19:00 2003 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed, 23 Jul 2003 15:19:00 -0700 Subject: Thermal Problems In-Reply-To: <002701c3514d$43af4f00$6f01a8c0@Navatek.local> Message-ID: <5.2.0.9.2.20030723145932.02fa56b0@mailhost4.jpl.nasa.gov> At 09:04 AM 7/23/2003 -1000, Mitchel Kagawa wrote: >I run a small 64 node cluster each with dual AMD MP2200's in a 1U chassis. >I am having problems with some of the nodes overheating and shutting down. >We are using Dynatron 1U CPU fans which are supposed to spin at 5400 rpm but >I notice that a lot (25%) of the fans tend to freeze up or blow the bearings >and spin at only 1000 RPM, which causes the cpu to overheat. After careful >inspection I noticed that the heatsink and fan sit very close to the lid of >the case. I was wondering how much clearance is needed between the lid and >the fan that blown down onto the short copper heatsink? To a first order, the area of the inlet should be comparable to the area of the outlet. A 60 mm diameter fan has an area of around 2800 mm^2. If you draw from around the entire periphery (which would be around 180 mm), you'd need a gap of around 15 mm (probably 20 mm would be a better idea) That's a fairly significant fraction of the 45 mm or so for 1 rack U. > When I put the lid >on the case it is almost as if the fan is working in a vaccum because it >actually speeds up an aditional 600-700 rpm to over 6000 rpm... like there >is no air resistance. Could this be why the fans are crapping out? I was >thinking that a 60x60x10mm cpu fan that has air intakes on the side of the >fan might work better but I have not seen any... have you? > >Also the vendor suggested that we sepetate the 1U cases because he belives >that there is heat transfer between the nodeswhen they are stacked right on >top of eachother. I thought that if one node is running at 50c and another >node is running at 50c it wont generate a combined heatload of more than 50c >right. So, your vendor essentially claims that his 1U case will work just fine as long as there is a 1U air gap above and below? Let's look at the problem with some simple calculations: Assume no heat transfer up or down (tightly packed), and that no heat transfers through the sides by conduction, as well, so all the heat has to go into airflow. 
Assume that you've got to move about 200W out of the box, and you can tolerate a 10C rise in temperature of the air moving through the box. The question is how much air do you need to move. Air has a density of about 1.13 kg/m^3 and a specific heat of about 1 kJ/kgK. 200W is 0.2 kJ/sec, so you need to move 0.02 kg of air every second (you get a 10 deg rise) is about 0.018 cubic meters/second. To relate this to more common fan specs: about 40 CFM or 65 cubic meters/hr. (I did a quick check on some smallish 60mm fans, and they only flow around 10-20 CFM into NO backpressure... http://www.papst.de/pdf_dat_d/Seite_13.pdf for instance) How fast is the air going to be moving through the vents? What's the vent area... say it's 10 square inches (1 inch high and 10 inches wide...).. 40 CFM through .07 square feet is 576 ft/min for the air flow (which is a reasonable speed.. 1000 ft/min is getting fast and noisy...) But here's the thing.. you've got 32 of these things in the rack... are you moving 1300 CFM through the rack, or are you blowing hot air from one chassis into the next. >Mitchel Kagawa >Systems Admin. > > >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf James Lux, P.E. Spacecraft Telecommunications Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hahn at physics.mcmaster.ca Wed Jul 23 18:32:33 2003 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed, 23 Jul 2003 18:32:33 -0400 (EDT) Subject: Thermal Problems In-Reply-To: <002701c3514d$43af4f00$6f01a8c0@Navatek.local> Message-ID: > We are using Dynatron 1U CPU fans which are supposed to spin at 5400 rpm but I don't think it makes much sense to use cpu-fans in 1U chassis - not only are cpu-fans *in*general* less reliable, but you'd constantly be facing this sort of problem. not to mention the fact that the overall airflow would be near-pessimal. far better is the kind of 1U chassis that has 1 or two fairly large, reliable centrifugal blowers forcing air past passive heatsinks on the CPUs. there are multiple vendors that sell this kind of design. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mitchel at navships.com Wed Jul 23 22:15:31 2003 From: mitchel at navships.com (Mitchel Kagawa) Date: Wed, 23 Jul 2003 16:15:31 -1000 Subject: Thermal Problems References: Message-ID: <000c01c35189$750cd310$6f01a8c0@Navatek.local> Here are a few pictures of the culprite. Any suggestions on how to fix it other than buying a whole new case would be appreciated http://neptune.navships.com/images/oscarnode-front.jpg http://neptune.navships.com/images/oscarnode-side.jpg http://neptune.navships.com/images/oscarnode-back.jpg You can also see how many I'm down... it should read 65 nodes (64 + 1 head node) http://neptune.navships.com/ganglia Mitchel Kagawa Systems Administrator ----- Original Message ----- From: "Robert G. 
Brown" To: "Mitchel Kagawa" Cc: Sent: Wednesday, July 23, 2003 10:14 AM Subject: Re: Thermal Problems > On Wed, 23 Jul 2003, Mitchel Kagawa wrote: > > > I run a small 64 node cluster each with dual AMD MP2200's in a 1U chassis. > > I am having problems with some of the nodes overheating and shutting down. > > We are using Dynatron 1U CPU fans which are supposed to spin at 5400 rpm but > > I notice that a lot (25%) of the fans tend to freeze up or blow the bearings > > and spin at only 1000 RPM, which causes the cpu to overheat. After careful > > inspection I noticed that the heatsink and fan sit very close to the lid of > > the case. I was wondering how much clearance is needed between the lid and > > the fan that blown down onto the short copper heatsink? When I put the lid > > on the case it is almost as if the fan is working in a vaccum because it > > actually speeds up an aditional 600-700 rpm to over 6000 rpm... like there > > is no air resistance. Could this be why the fans are crapping out? I was > > thinking that a 60x60x10mm cpu fan that has air intakes on the side of the > > fan might work better but I have not seen any... have you? > > > > Also the vendor suggested that we sepetate the 1U cases because he belives > > that there is heat transfer between the nodeswhen they are stacked right on > > top of eachother. I thought that if one node is running at 50c and another > > node is running at 50c it wont generate a combined heatload of more than 50c > > right. > > AMD's really hate to run hot, and duals in 1U require some fairly > careful engineering to run cool enough, stably. Who is your vendor? > Did they do the node design or did you? If they did, you should be able > to ask them to just plain fix it -- replace the fans or if necessary > reengineer the whole case -- to make the problem go away. > > Issues like fan clearance and stacking and overall airflow through the > case are indeed important. Sometimes things like using round instead of > ribbon cables (which can turn sideways and interrupt airflow) makes a > big difference. Keeping the room's ambient air "cold" (as opposed to > "comfortable") helps. There is likely some heat transfer vertically > between the 1U cases, but if you go to the length of separating them you > might as well have used 2U cases in the first place. > > From your description, it does sound like you have some bad fans. > Whether they are bad (as in a bad design, poor vendor), or bad (as in > installed "incorrectly" in a case/mobo with inadequate clearance causing > them to fail), or bad (as in you just happened to get some fans from a > bad production batch but replacements would probably work fine) it is > very hard to say, and I don't envy you the debugging process of finding > out which. We've been the route of replacing all of the fans once > ourselves so it can certainly happen... > > rgb > > > > > > > Mitchel Kagawa > > Systems Admin. > > > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > > Robert G. Brown http://www.phy.duke.edu/~rgb/ > Duke University Dept. of Physics, Box 90305 > Durham, N.C. 
27708-0305 > Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From salonj at hotmail.com Thu Jul 24 03:13:36 2003 From: salonj at hotmail.com (salon j) Date: Thu, 24 Jul 2003 07:13:36 +0000 Subject: open the graphic interface. Message-ID: i want t open the graphic interface on three machines of my clusters, which program with pvm, in my programme , i use gtk to program the graphic interface, i have add machines before i spawn, but after i use spawn -> filename, it shown pvm>[t80001] Cannot connect to X server t80001 is a task on the other machine ,not the machine which start up the pvm task. how can i do with this error? _________________________________________________________________ ??????????????? MSN Hotmail? http://www.hotmail.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mikee at mikee.ath.cx Thu Jul 24 08:05:58 2003 From: mikee at mikee.ath.cx (Mike Eggleston) Date: Thu, 24 Jul 2003 07:05:58 -0500 Subject: open the graphic interface. In-Reply-To: ; from salonj@hotmail.com on Thu, Jul 24, 2003 at 07:13:36AM +0000 References: Message-ID: <20030724070558.A14082@mikee.ath.cx> On Thu, 24 Jul 2003, salon j wrote: > i want t open the graphic interface on three machines of my clusters, > which program with pvm, in my programme , i use gtk to program the > graphic interface, i have add machines before i spawn, but after i use > spawn -> filename, it shown pvm>[t80001] Cannot connect to X server > t80001 is a task on the other machine ,not the machine which start up > the pvm task. how can i do with this error? There is a debugging option in one of the pvm shell scripts. Setting the debugging option will allow your programs to reach your X server. Mike _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Stephane.Martin at imag.fr Thu Jul 24 08:37:11 2003 From: Stephane.Martin at imag.fr (Stephane.Martin at imag.fr) Date: Thu, 24 Jul 2003 14:37:11 +0200 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED Message-ID: <3F1FD2F7.6FA2F2E3@imag.fr> Hello, We have recently received 48 Bi-xeon Dell 1600SC and we are performing some benchmarks to tests the cluster. Unfortunately we have very bad perfomance with the internal gigabit card (82540EM chipset). We have passed linux netperf test and we have only 33 Mo between 2 machines. We have changed the drivers for the last ones, installed procfgd and so on... Finally we had Win2000 installed and the last driver from intel installed : the results are identical... To go further we have installed a PCI-X 82540EM card and re-run the tests : in that way the results are much better : 66 Mo full duplex... So the question is : is there a well known problem with this DELL 1600SC concernig the 82540EM integration on the motherboard ???? As anyone already have (heard about) this problem ? 
Is there any solution ? thx for your help Regards, -- Stephane Martin Stephane.Martin at imag.fr http://icluster.imag.fr Tel: 04 76 61 20 31 Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 MONTBONNOT SAINT MARTIN _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jeffrey.b.layton at lmco.com Thu Jul 24 08:04:20 2003 From: jeffrey.b.layton at lmco.com (Jeff Layton) Date: Thu, 24 Jul 2003 08:04:20 -0400 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED In-Reply-To: <3F1FD2F7.6FA2F2E3@imag.fr> References: <3F1FD2F7.6FA2F2E3@imag.fr> Message-ID: <3F1FCB44.3010002@lmco.com> Stephane, What kind of switch (100 or 1000)? Have you looked at the switch ports? Are they connecting at full or half duplex? How about the NICs? You'll see bad performance with a duplex mismatch between the NICs and switch. Are you forcing the NICs or are they auto-negiotiating? Good Luck! Jeff > Hello, > > We have recently received 48 Bi-xeon Dell 1600SC and we are performing > some benchmarks to tests the cluster. > Unfortunately we have very bad perfomance with the internal gigabit > card (82540EM chipset). We have passed linux netperf test and we have > only 33 Mo > > between 2 machines. We have changed the drivers for the last ones, > installed procfgd and so on... Finally we had Win2000 installed and > the last driver > > from intel installed : the results are identical... To go further we > have installed a PCI-X 82540EM card and re-run the tests : in that way the > > results are much better : 66 Mo full duplex... > So the question is : is there a well known problem with this DELL > 1600SC concernig the 82540EM integration on the motherboard ???? > > As anyone already have (heard about) this problem ? > Is there any solution ? > > thx for your help > -- Dr. Jeff Layton Chart Monkey - Aerodynamics and CFD Lockheed-Martin Aeronautical Company - Marietta _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From canon at nersc.gov Thu Jul 24 10:36:53 2003 From: canon at nersc.gov (canon at nersc.gov) Date: Thu, 24 Jul 2003 07:36:53 -0700 Subject: Thermal Problems In-Reply-To: Your message of "Wed, 23 Jul 2003 15:19:00 PDT." <5.2.0.9.2.20030723145932.02fa56b0@mailhost4.jpl.nasa.gov> Message-ID: <200307241436.h6OEarX2002407@pookie.nersc.gov> We have a similar setup and have seen a similar problem. The vendor determined the fans weren't robust enough and sent replacements. With regards to adding gaps... We have considered (but haven't implemented) adding a gap every 10ish nodes. This would be primarily to reset the vertical temperature gradient. You can run your hand up the exhaust and feel the temperature difference between the top and the bottom. I suspect hot air rises. :-) The gap would allow us to "reset" the temperature gradient. This would only lose us 2 or 3U which isn't too bad if it helps the cooling. 
--Shane ------------------------------------------------------------------------ Shane Canon voice: 510-486-6981 PSDF Project Lead fax: 510-486-7520 National Energy Research Scientific Computing Center 1 Cyclotron Road Mailstop 943-256 Berkeley, CA 94720 canon at nersc.gov ------------------------------------------------------------------------ > At 09:04 AM 7/23/2003 -1000, Mitchel Kagawa wrote: > >I run a small 64 node cluster each with dual AMD MP2200's in a 1U chassis. > >I am having problems with some of the nodes overheating and shutting down. > >We are using Dynatron 1U CPU fans which are supposed to spin at 5400 rpm but > >I notice that a lot (25%) of the fans tend to freeze up or blow the bearings > >and spin at only 1000 RPM, which causes the cpu to overheat. After careful > >inspection I noticed that the heatsink and fan sit very close to the lid of > >the case. I was wondering how much clearance is needed between the lid and > >the fan that blown down onto the short copper heatsink? > > To a first order, the area of the inlet should be comparable to the area of > the outlet. A 60 mm diameter fan has an area of around 2800 mm^2. If you > draw from around the entire periphery (which would be around 180 mm), you'd > need a gap of around 15 mm (probably 20 mm would be a better idea) That's > a fairly significant fraction of the 45 mm or so for 1 rack U. > > > > > When I put the lid > >on the case it is almost as if the fan is working in a vaccum because it > >actually speeds up an aditional 600-700 rpm to over 6000 rpm... like there > >is no air resistance. Could this be why the fans are crapping out? I was > >thinking that a 60x60x10mm cpu fan that has air intakes on the side of the > >fan might work better but I have not seen any... have you? > > > >Also the vendor suggested that we sepetate the 1U cases because he belives > >that there is heat transfer between the nodeswhen they are stacked right on > >top of eachother. I thought that if one node is running at 50c and another > >node is running at 50c it wont generate a combined heatload of more than 50c > >right. > > So, your vendor essentially claims that his 1U case will work just fine as > long as there is a 1U air gap above and below? > > Let's look at the problem with some simple calculations: > > Assume no heat transfer up or down (tightly packed), and that no heat > transfers through the sides by conduction, as well, so all the heat has to > go into airflow. > Assume that you've got to move about 200W out of the box, and you can > tolerate a 10C rise in temperature of the air moving through the box. The > question is how much air do you need to move. Air has a density of about > 1.13 kg/m^3 and a specific heat of about 1 kJ/kgK. > 200W is 0.2 kJ/sec, so you need to move 0.02 kg of air every second (you > get a 10 deg rise) is about 0.018 cubic meters/second. To relate this to > more common fan specs: about 40 CFM or 65 cubic meters/hr. (I did a quick > check on some smallish 60mm fans, and they only flow around 10-20 CFM into > NO backpressure... http://www.papst.de/pdf_dat_d/Seite_13.pdf > for instance) > > How fast is the air going to be moving through the vents? What's the vent > area... say it's 10 square inches (1 inch high and 10 inches wide...).. 40 > CFM through .07 square feet is 576 ft/min for the air flow (which is a > reasonable speed.. 1000 ft/min is getting fast and noisy...) > > But here's the thing.. you've got 32 of these things in the rack... 
are you > moving 1300 CFM through the rack, or are you blowing hot air from one > chassis into the next. > > > > > > > >Mitchel Kagawa > >Systems Admin. > > > > > >_______________________________________________ > >Beowulf mailing list, Beowulf at beowulf.org > >To change your subscription (digest mode or unsubscribe) visit > >http://www.beowulf.org/mailman/listinfo/beowulf > > James Lux, P.E. > Spacecraft Telecommunications Section > Jet Propulsion Laboratory, Mail Stop 161-213 > 4800 Oak Grove Drive > Pasadena CA 91109 > tel: (818)354-2075 > fax: (818)393-6875 > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beo > wulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Thu Jul 24 10:09:15 2003 From: rgb at phy.duke.edu (Robert G. Brown) Date: Thu, 24 Jul 2003 10:09:15 -0400 (EDT) Subject: Thermal Problems In-Reply-To: <000c01c35189$750cd310$6f01a8c0@Navatek.local> Message-ID: On Wed, 23 Jul 2003, Mitchel Kagawa wrote: > Here are a few pictures of the culprite. Any suggestions on how to fix it > other than buying a whole new case would be appreciated > http://neptune.navships.com/images/oscarnode-front.jpg > http://neptune.navships.com/images/oscarnode-side.jpg > http://neptune.navships.com/images/oscarnode-back.jpg The case design doesn't look totally insane, although that depends a bit on the actual capacity of some of the fans. You've got a fairly large, clear aperture at the front, three fans pulling from it and blowing cool air over the memory and all three heatsinks, and a rotary/turbine fan in the rear corner to exhaust the heated air. The ribbon cables are off to the side where they don't appear to obstruct the airflow. The hard disk presumably has its own fan and pulls front to back over on the other side more or less independent of the case flow. At a guess, you're problem really is just the CPU coolers, which may not be optimal for 1U cases. A few minutes with google turns up a lot of alternatives, e.g.: http://www.buyextras.com/cojaiuracpuc.html which is engineered to pull air in through the copper (very good heat conductor) fins and exhaust it to the SIDE and not out the TOP. Another couple of things you can try are to contact AMD and find out what CPU cooler(s) THEY recommend for 1U systems or join one of the AMD hardware user support lists (I'll let you do the googling on this one, but they are out there) and see if somebody will give you a glowing testimonial on some particular brands for quality, reliability, effectiveness. The high end coolers aren't horribly cheap -- the one above is $20 (although the site also had a couple of coolers for $16 that might also be adequate). However, retrofitting fans is a lot cheaper than replacing 64 1U cases with 2U cases AND likely having to replace the CPU coolers anyway, as a cheap cooler is a cheap cooler and likely to fail. If you bought the cluster from a vendor selling "1U dual Athlon nodes" and they picked the hardware, they should replace all of the cheap fans with good fans at their cost, and they should do it right away as you're losing money by the bucketfull every time a node goes down and you have to mess with it. Downtime and your time are EXPENSIVE -- hardware is cheap. 
If they refuse to, please post their name on the list so the rest of us can avoid them plague-like (a thing I'm tempted to do anyway if their advice on "fixing" your cooling is to install your 1U node on a 2U spacing). If you picked the hardware and they just assembled it, well, tough luck, but they should still help out some -- perhaps take back the cheap fans and replace them with good fans at cost. However, even if they decide to do nothing at all for you and you're stuck doing it all yourself, you're better off spending $40 x 64 = $2560 and a couple of days of your time and ending up with a functional cluster than living with days/weeks of downtime fruitlessly cycling cheap replacement fans doomed to die in their turn. Also, eventually your CPUs will start to die and not just crash your systems, and that gets very expensive very quickly quite aside from the cost of downtime and labor. There are no free lunches, and it may be that going with expensive (but effective!) CPU cooler fans isn't enough to stabilize your systems. For example, if the rear exhaust fan doesn't have adequate capacity or the cooler fans can't be installed in such a way as to establish a clean airflow of cool air from the front, the CPU cooler fans will just end up blowing heated air around in a turbulent loop inside the case and even though the fans may not fail (as they won't be obstructed) the CPUs may run hotter than you'd like. You'll have no way of knowing without trying. If your vendor doesn't handle this for you I'd recommend that you immediately spring for a "sample" of the high end fans -- perhaps eight of them, perhaps sixteen -- and use them to repair your downed systems. Run the nodes in their usual environment with the new fans and sample CPU core temperatures. I'd predict that the CPUs will run cooler than they do now in any event, but it is good to be sure. When you're confident that they will a) keep the CPUs cool and b) run reliably, given that they have unobstructed airflow you can either buy them as you need them and just repair nodes as the cheap fans die with the new ones or, if your cluster really needs to be up and stay up, spring for the complete set. BTW, you should check to make sure that the fan at the link above is actually correct for your CPUs -- it seems like it would be, but caveat emptor. Good luck, rgb > > You can also see how many I'm down... it should read 65 nodes (64 + 1 head > node) > http://neptune.navships.com/ganglia > > Mitchel Kagawa > Systems Administrator > > ----- Original Message ----- > From: "Robert G. Brown" > To: "Mitchel Kagawa" > Cc: > Sent: Wednesday, July 23, 2003 10:14 AM > Subject: Re: Thermal Problems > > > > On Wed, 23 Jul 2003, Mitchel Kagawa wrote: > > > > > I run a small 64 node cluster each with dual AMD MP2200's in a 1U > chassis. > > > I am having problems with some of the nodes overheating and shutting > down. > > > We are using Dynatron 1U CPU fans which are supposed to spin at 5400 rpm > but > > > I notice that a lot (25%) of the fans tend to freeze up or blow the > bearings > > > and spin at only 1000 RPM, which causes the cpu to overheat. After > careful > > > inspection I noticed that the heatsink and fan sit very close to the lid > of > > > the case. I was wondering how much clearance is needed between the lid > and > > > the fan that blown down onto the short copper heatsink? When I put the > lid > > > on the case it is almost as if the fan is working in a vaccum because it > > > actually speeds up an aditional 600-700 rpm to over 6000 rpm... 
like > there > > > is no air resistance. Could this be why the fans are crapping out? I > was > > > thinking that a 60x60x10mm cpu fan that has air intakes on the side of > the > > > fan might work better but I have not seen any... have you? > > > > > > Also the vendor suggested that we sepetate the 1U cases because he > belives > > > that there is heat transfer between the nodeswhen they are stacked right > on > > > top of eachother. I thought that if one node is running at 50c and > another > > > node is running at 50c it wont generate a combined heatload of more than > 50c > > > right. > > > > AMD's really hate to run hot, and duals in 1U require some fairly > > careful engineering to run cool enough, stably. Who is your vendor? > > Did they do the node design or did you? If they did, you should be able > > to ask them to just plain fix it -- replace the fans or if necessary > > reengineer the whole case -- to make the problem go away. > > > > Issues like fan clearance and stacking and overall airflow through the > > case are indeed important. Sometimes things like using round instead of > > ribbon cables (which can turn sideways and interrupt airflow) makes a > > big difference. Keeping the room's ambient air "cold" (as opposed to > > "comfortable") helps. There is likely some heat transfer vertically > > between the 1U cases, but if you go to the length of separating them you > > might as well have used 2U cases in the first place. > > > > From your description, it does sound like you have some bad fans. > > Whether they are bad (as in a bad design, poor vendor), or bad (as in > > installed "incorrectly" in a case/mobo with inadequate clearance causing > > them to fail), or bad (as in you just happened to get some fans from a > > bad production batch but replacements would probably work fine) it is > > very hard to say, and I don't envy you the debugging process of finding > > out which. We've been the route of replacing all of the fans once > > ourselves so it can certainly happen... > > > > rgb > > > > > > > > > > > Mitchel Kagawa > > > Systems Admin. > > > > > > > > > _______________________________________________ > > > Beowulf mailing list, Beowulf at beowulf.org > > > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > > > > > > Robert G. Brown http://www.phy.duke.edu/~rgb/ > > Duke University Dept. of Physics, Box 90305 > > Durham, N.C. 27708-0305 > > Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu > > > > > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Stephane.Martin at imag.fr Thu Jul 24 11:12:54 2003 From: Stephane.Martin at imag.fr (Stephane.Martin at imag.fr) Date: Thu, 24 Jul 2003 17:12:54 +0200 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED References: <3F1FD2F7.6FA2F2E3@imag.fr> <3F1FCB44.3010002@lmco.com> Message-ID: <3F1FF776.E586E244@imag.fr> Jeff Layton a ?crit : > > Stephane, > > What kind of switch (100 or 1000)? Have you looked > at the switch ports? Are they connecting at full or half > duplex? How about the NICs? You'll see bad performance > with a duplex mismatch between the NICs and switch. > Are you forcing the NICs or are they auto-negiotiating? > > Good Luck! > > Jeff > > > Hello, > > > > We have recently received 48 Bi-xeon Dell 1600SC and we are performing > > some benchmarks to tests the cluster. > > Unfortunately we have very bad perfomance with the internal gigabit > > card (82540EM chipset). We have passed linux netperf test and we have > > only 33 Mo > > > > between 2 machines. We have changed the drivers for the last ones, > > installed procfgd and so on... Finally we had Win2000 installed and > > the last driver > > > > from intel installed : the results are identical... To go further we > > have installed a PCI-X 82540EM card and re-run the tests : in that way the > > > > results are much better : 66 Mo full duplex... > > So the question is : is there a well known problem with this DELL > > 1600SC concernig the 82540EM integration on the motherboard ???? > > > > As anyone already have (heard about) this problem ? > > Is there any solution ? > > > > thx for your help > > > > -- > Dr. Jeff Layton > Chart Monkey - Aerodynamics and CFD > Lockheed-Martin Aeronautical Company - Marietta Hello, For our tests we are connected to a 4108GL (J4865A), we have done all necessary checks (maybe we've have forget something very very big ????) to ensure the validity of our mesures. The ports have been tested with auto neg on, then off and also forced. We have also the same mesures when connected to a J4898A. The negociation between the NIcs ans the two switches is working. When using a tyan motherboard with the 82540EM built-in and using the same benchs and switches ans the same procedures (drivers updates and compilations from Intel, various benchs, different OS) the results are correct (80 to 90Mo). All our tests tends to show that dell missed something in the integration of the 82540EM in the 1600SC series...if not we'll really really appreciate to know what we are missing there cause here we have a 150 000 dollars cluster said to be connected with a network gigabit having network perfs of three 100 card bonded (in full duplex it's even worse !!!!!). If the problem is not rapidly solved the 48 machines will be returned.... 
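(As a rough cross-check on figures like the 33 Mo/s above, independent of netperf and of the switch: the sketch below is not from the thread, just a minimal point-to-point TCP throughput test in Python. The port, transfer size, and example address are placeholders.)

import socket, sys, time

PORT = 5001                   # arbitrary unprivileged port (placeholder)
CHUNK = 64 * 1024             # 64 KB per send
TOTAL = 256 * 1024 * 1024     # move 256 MB per run

def receiver():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    got = 0
    while True:
        data = conn.recv(CHUNK)
        if not data:
            break
        got += len(data)
    conn.close()
    print("received %d MB" % (got // (1024 * 1024)))

def sender(host):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((host, PORT))
    payload = b"x" * CHUNK
    sent = 0
    start = time.time()
    while sent < TOTAL:
        sock.sendall(payload)
        sent += len(payload)
    sock.close()
    print("%.1f MB/s" % (sent / (1024.0 * 1024.0) / (time.time() - start)))

if __name__ == "__main__":
    if len(sys.argv) > 1:
        sender(sys.argv[1])    # e.g. python tcpcheck.py 192.168.1.2
    else:
        receiver()             # start this side first, on the other node

Running it back-to-back (crossover cable or direct link) and then again through the switch separates NIC/driver problems from switch problems.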
thx a lot for your concern, regards -- Stephane Martin Stephane.Martin at imag.fr http://icluster.imag.fr Tel: 04 76 61 20 31 Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 MONTBONNOT SAINT MARTIN _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From fant at pobox.com Thu Jul 24 11:17:00 2003 From: fant at pobox.com (Andrew Fant) Date: Thu, 24 Jul 2003 11:17:00 -0400 (EDT) Subject: Comparing MPI Implementations Message-ID: <20030724111221.Y73094-100000@net.bluemoon.net> Does anyone have any experience comparing MPI implementations for Linux? In particular, I am interested in people's views of the relative merits of MPICH, LAM, and MPIPro. I currently have MPICH installed on our production cluster, but that decision came about mostly by default, rather than from any serious study. Thanks in advance, Andy Andrew Fant | This | "If I could walk THAT way... Molecular Geek | Space | I wouldn't need the talcum powder!" fant at pobox.com | For | G. Marx (apropos of Aerosmith) Boston, MA USA | Hire | http://www.pharmawulf.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From enrico341 at hotmail.com Thu Jul 24 11:43:07 2003 From: enrico341 at hotmail.com (Eric Uren) Date: Thu, 24 Jul 2003 10:43:07 -0500 Subject: Project Help Message-ID: To whom it may concern, I work at a company called AT Systems. I was recently assigned the task of using up thirty extra SBCs that we have. My boss told me that he wants to link all of the SBCs together, plop them in a tower, and donate them to a college or university as a tax write-off. We have a factory attached to our engineering department, which contains a turret, multiple workstations, and so on. So getting a hold of a custom tower, power supply, etc. is not a problem. I just need to create a way to use these thirty extra boards we have. All thirty of them contain: a P266 processor, 128 MB of RAM, 128 IDE, a CompactFlash drive, and Ethernet and USB ports. Any diagrams, sites, comments, or suggestions would be greatly appreciated. Thanks. 
Eric Uren AT Systems _________________________________________________________________ MSN 8 with e-mail virus protection service: 2 months FREE* http://join.msn.com/?page=features/virus _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From seth at hogg.org Thu Jul 24 11:48:28 2003 From: seth at hogg.org (Simon Hogg) Date: Thu, 24 Jul 2003 16:48:28 +0100 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED In-Reply-To: <3F1FF776.E586E244@imag.fr> References: <3F1FD2F7.6FA2F2E3@imag.fr> <3F1FCB44.3010002@lmco.com> Message-ID: <4.3.2.7.2.20030724164701.00b1ca40@pop.freeuk.net> At 17:12 24/07/03 +0200, Stephane.Martin at imag.fr wrote: >All our tests tends to show that dell missed something in the integration >of the 82540EM in the 1600SC series...if not we'll really really appreciate >to know what we are missing there cause here we have a 150 000 dollars >cluster said to be connected with a network gigabit having network perfs of >three 100 card bonded (in full duplex it's even worse !!!!!). If the >problem is not rapidly solved the 48 machines will be returned.... > >thx a lot for your concern, Sorry I can't help you, but I wonder what response you have had from Dell? 
-- Simon _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Stephane.Martin at imag.fr Thu Jul 24 13:47:14 2003 From: Stephane.Martin at imag.fr (Stephane.Martin at imag.fr) Date: Thu, 24 Jul 2003 19:47:14 +0200 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED References: <3F1FD2F7.6FA2F2E3@imag.fr> <3F1FCB44.3010002@lmco.com> <4.3.2.7.2.20030724164701.00b1ca40@pop.freeuk.net> Message-ID: <3F201BA2.F7371A4@imag.fr> Simon Hogg a ?crit : > > At 17:12 24/07/03 +0200, Stephane.Martin at imag.fr wrote: > > >All our tests tends to show that dell missed something in the integration > >of the 82540EM in the 1600SC series...if not we'll really really appreciate > >to know what we are missing there cause here we have a 150 000 dollars > >cluster said to be connected with a network gigabit having network perfs of > >three 100 card bonded (in full duplex it's even worse !!!!!). If the > >problem is not rapidly solved the 48 machines will be returned.... > > > >thx a lot for your concern, > > Sorry I can't help you, but I wonder what response you have had from Dell? > > -- > Simon hello, I can't really answer to this question....hummm....first the technician sent us a link to a web page talking about another network chipset and another machine saying that they have similar network integration (personnaly I would never compare network results between a biPIII and a BI-Xeon...but....). It was really unuselful, the technician was arguing that it was a test of the 82540EM...not really serious; The worse of all is that he said that those results were correct ones (because it was the same result in his link... furthemore he didnt' react much when I told him that such a poor performance will certainely lead to a reject of all the cluster). So, for him all is all right !!!!!. I decided to go one level up and had a similar response (I ve been sent a internal report, benchmarking again ANOTHER configuration : thats to say this time I had numbers concerning a card plugged in the PCI-X bus !!!!!!! I've already done those test by myself...). so what to say ? I'm not sure they really feel concerned about my (their) problems... My boss said : if no solution tomorow, the cluster is going to be sent back... thx for your concerns, -- Stephane Martin Stephane.Martin at imag.fr http://icluster.imag.fr Tel: 04 76 61 20 31 Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 MONTBONNOT SAINT MARTIN _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From math at sizone.org Wed Jul 23 19:58:14 2003 From: math at sizone.org (Ken Chase) Date: Wed, 23 Jul 2003 19:58:14 -0400 Subject: cold rooms & machines Message-ID: <20030723235814.GA11248@velocet.ca> A group I know wants to put a cluster in their labs, but they dont have any facilities for cooling _EXCEPT_ a cold room to store chemicals and conduct experiments at 5C (its largely unused and could probably be set to any temp up to 10C, really - even -10C if desired ;) The chillers in there are pretty underworked and might be able to handle the 3000W odd of heat that would be radiating out of the machines. 
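(A back-of-the-envelope check, not from the original post: the same heat-to-airflow arithmetic James Lux walks through elsewhere in this archive, applied to the ~3000 W figure above. The 10 C allowable temperature rise is an assumed number; adjust it to your own setup.)

RHO_AIR = 1.13                 # kg/m^3, from the earlier thread post
CP_AIR = 1000.0                # J/(kg*K), ditto
M3S_PER_CFM = 0.000471947      # one cubic foot per minute in m^3/s

def airflow_for(watts, rise_c):
    """Airflow needed to carry `watts` of heat at a `rise_c` temperature rise."""
    mass_flow = watts / (CP_AIR * rise_c)      # kg of air per second
    vol_flow = mass_flow / RHO_AIR             # m^3/s
    return vol_flow * 3600.0, vol_flow / M3S_PER_CFM   # (m^3/h, CFM)

m3h, cfm = airflow_for(3000.0, 10.0)           # 3000 W load, assumed 10 C rise
print("need roughly %.0f m^3/h (%.0f CFM) of airflow" % (m3h, cfm))
# i.e. about 950 m^3/h (~560 CFM) circulating past the nodes, plus ~3 kW
# of continuous chiller capacity on top of the room's existing losses.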
What other criteria should we be looking at - non-condensing environment I would guess is one - is this just a function of the %RH in the room? What should it be set to? Any other concerns? /kc -- Ken Chase, math at sizone.org _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jim at ks.uiuc.edu Thu Jul 24 14:21:58 2003 From: jim at ks.uiuc.edu (Jim Phillips) Date: Thu, 24 Jul 2003 13:21:58 -0500 (CDT) Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED In-Reply-To: <3F1FF776.E586E244@imag.fr> Message-ID: Hi, The 82540EM is a low-cost 32-bit "desktop" NIC, so it's hard to get full gigabit bandwidth, particularly if you're running at 33 MHz (look at /proc/net/PRO_LAN_Adapters/eth0/PCI_Bus_Speed to find out). There are no 82540EM-based PCI-X cards, AFAIK; are you sure it wasn't a 64-bit 82545EM card? Intel distinguishes their 32-bit 33/66 MHz PCI PRO/1000 MT Desktop cards that use 82540EM from their 64-bit PCI-X PRO/1000 MT Server cards that use the 82545EM (and have full gigabit performance). -Jim On Thu, 24 Jul 2003 Stephane.Martin at imag.fr wrote: > > > Hello, > > > > > > We have recently received 48 Bi-xeon Dell 1600SC and we are performing > > > some benchmarks to tests the cluster. > > > Unfortunately we have very bad perfomance with the internal gigabit > > > card (82540EM chipset). We have passed linux netperf test and we have > > > only 33 Mo > > > > > > between 2 machines. We have changed the drivers for the last ones, > > > installed procfgd and so on... Finally we had Win2000 installed and > > > the last driver > > > > > > from intel installed : the results are identical... To go further we > > > have installed a PCI-X 82540EM card and re-run the tests : in that way the > > > > > > results are much better : 66 Mo full duplex... > > > So the question is : is there a well known problem with this DELL > > > 1600SC concernig the 82540EM integration on the motherboard ???? > > > > > > As anyone already have (heard about) this problem ? > > > Is there any solution ? > > > > > > thx for your help > > > > > > > -- > > Dr. Jeff Layton > > Chart Monkey - Aerodynamics and CFD > > Lockheed-Martin Aeronautical Company - Marietta > > Hello, > > For our tests we are connected to a 4108GL (J4865A), we have done all necessary checks (maybe we've have forget something very very big ????) to > ensure the validity of our mesures. The ports have been tested with auto neg on, then off and also forced. We have also the same mesures when > connected to a J4898A. The negociation between the NIcs ans the two switches is working. > > When using a tyan motherboard with the 82540EM built-in and using the same benchs and switches ans the same procedures (drivers updates and > compilations from Intel, various benchs, different OS) the results are correct (80 to 90Mo). > > All our tests tends to show that dell missed something in the integration of the 82540EM in the 1600SC series...if not we'll really really appreciate > to know what we are missing there cause here we have a 150 000 dollars cluster said to be connected with a network gigabit having network perfs of > three 100 card bonded (in full duplex it's even worse !!!!!). If the problem is not rapidly solved the 48 machines will be returned.... 
> > thx a lot for your concern, > > regards > > > -- > Stephane Martin Stephane.Martin at imag.fr > http://icluster.imag.fr > Tel: 04 76 61 20 31 > Informatique et distribution Web: http://www-id.imag.fr > ENSIMAG - Antenne de Montbonnot > ZIRST - 51, avenue Jean Kuntzmann > 38330 MONTBONNOT SAINT MARTIN > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mprinkey at aeolusresearch.com Thu Jul 24 09:33:09 2003 From: mprinkey at aeolusresearch.com (Michael T. Prinkey) Date: Thu, 24 Jul 2003 09:33:09 -0400 (EDT) Subject: Thermal Problems In-Reply-To: <000c01c35189$750cd310$6f01a8c0@Navatek.local> Message-ID: On Wed, 23 Jul 2003, Mitchel Kagawa wrote: > Here are a few pictures of the culprite. Any suggestions on how to fix it > other than buying a whole new case would be appreciated > http://neptune.navships.com/images/oscarnode-front.jpg > http://neptune.navships.com/images/oscarnode-side.jpg > http://neptune.navships.com/images/oscarnode-back.jpg > > You can also see how many I'm down... it should read 65 nodes (64 + 1 head > node) > http://neptune.navships.com/ganglia > > Mitchel Kagawa > Systems Administrator > The Intel Xeon ships with an interesting heat sink/fan/shroud system. For an normal case, you can mount the fan on the top of the shroud which makes it work much like a "normal" heat sink/fan...the air comes in the top and blows down onto the CPU. But, for low-profile installations (mine were 2U), the fan attaches to the side of the shroud to form a "wind tunnel." Maybe a similar solution would exist in your case, i.e., taller heat sinks (~1") with one or two fans mounted on the side blowing across the heat sink. I did a quick search online, but couldn't find a vendor for this type heat sink. Sorry. You might be able to experiment. Fans are usually only held in place with oversized screws that go easily into soft heat sinks. You can probably build a pair of test heat sinks in 10 minutues. The flow from the fan should be aligned with the fins. Depending on the type of heatsink you start with, you might be able to direct the output flow in any direction you choose. From the photos, I would recommend that you place the fans on the side of the heat sink near the front of the case so the exhaust is directed to the vents at the rear of the case. Good luck, Mike Prinkey Aeolus Research, Inc. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From gerry.creager at tamu.edu Thu Jul 24 09:47:22 2003 From: gerry.creager at tamu.edu (Gerry Creager N5JXS) Date: Thu, 24 Jul 2003 08:47:22 -0500 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED In-Reply-To: <3F1FCB44.3010002@lmco.com> References: <3F1FD2F7.6FA2F2E3@imag.fr> <3F1FCB44.3010002@lmco.com> Message-ID: <3F1FE36A.30905@tamu.edu> And for the 802.3u impaired, you need to A) either set speed and duplex settings on your switch AND NIC to fixed values (preferably matching each other) or B) leave them all at Auto/Auto for switch and NIC(s). 
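(A small illustrative wrapper, not part of Gerry's message, showing the two options above via ethtool. It assumes ethtool is installed, root privileges, and a placeholder interface name of eth0; note that many gigabit-copper drivers refuse a forced speed of 1000 with autonegotiation off, in which case option B is the only workable choice for GigE.)

import subprocess

def show_link(iface="eth0"):
    # Print what the NIC thinks it negotiated (speed, duplex, link).
    subprocess.run(["ethtool", iface], check=True)

def force_full(iface="eth0", speed="100"):
    # Option A: pin speed/duplex with autonegotiation off. The switch port
    # must be pinned to the same values, or you get the half-duplex
    # fallback described below.
    subprocess.run(["ethtool", "-s", iface, "speed", speed,
                    "duplex", "full", "autoneg", "off"], check=True)

def back_to_auto(iface="eth0"):
    # Option B: let both ends autonegotiate.
    subprocess.run(["ethtool", "-s", iface, "autoneg", "on"], check=True)

if __name__ == "__main__":
    show_link("eth0")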
For those who haven't wandered past the negotiation between switch and NIC recently, if you fix any value, negotiation will fail and the devices will go to default settings, ie., something resembling a consistent speed between the 2 and half-duplex. But not even that is guaranteed. Note that I've also received recent reports of horrid GBE performance on Serverworks botherboards with the internal E-1000 NIC. I've not been able to identify a cause (Don? Thoughts? Definitive info?) but I've been able to reproduce it. gerry Jeff Layton wrote: > Stephane, > > What kind of switch (100 or 1000)? Have you looked > at the switch ports? Are they connecting at full or half > duplex? How about the NICs? You'll see bad performance > with a duplex mismatch between the NICs and switch. > Are you forcing the NICs or are they auto-negiotiating? > > Good Luck! > > Jeff > > >> Hello, >> >> We have recently received 48 Bi-xeon Dell 1600SC and we are performing >> some benchmarks to tests the cluster. >> Unfortunately we have very bad perfomance with the internal gigabit >> card (82540EM chipset). We have passed linux netperf test and we have >> only 33 Mo >> >> between 2 machines. We have changed the drivers for the last ones, >> installed procfgd and so on... Finally we had Win2000 installed and >> the last driver >> >> from intel installed : the results are identical... To go further we >> have installed a PCI-X 82540EM card and re-run the tests : in that way >> the >> >> results are much better : 66 Mo full duplex... >> So the question is : is there a well known problem with this DELL >> 1600SC concernig the 82540EM integration on the motherboard ???? >> >> As anyone already have (heard about) this problem ? >> Is there any solution ? >> >> thx for your help >> > -- Gerry Creager -- gerry.creager at tamu.edu Network Engineering -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.847.8578 Page: 979.228.0173 Office: 903A Eller Bldg, TAMU, College Station, TX 77843 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bari at onelabs.com Thu Jul 24 14:36:50 2003 From: bari at onelabs.com (Bari Ari) Date: Thu, 24 Jul 2003 13:36:50 -0500 Subject: Thermal Problems In-Reply-To: <000c01c35189$750cd310$6f01a8c0@Navatek.local> References: <000c01c35189$750cd310$6f01a8c0@Navatek.local> Message-ID: <3F202742.5010107@onelabs.com> Mitchel Kagawa wrote: >Here are a few pictures of the culprite. Any suggestions on how to fix it >other than buying a whole new case would be appreciated >http://neptune.navships.com/images/oscarnode-front.jpg >http://neptune.navships.com/images/oscarnode-side.jpg >http://neptune.navships.com/images/oscarnode-back.jpg > > > The fans tied to the cpu heat sinks may be too close to the top cover for effective air flow/cooling. Measure the air temp at various places inside the case when closed and the cpu's operating. Try to get an idea of how much airflow is actually moving through the case vs just around the inside of the case. Try placing tangential (cross flow) fans in the empty drive bays and up against the front panel and opening up the rear of the case. http://www.airvac.se/products.htm The power supply has fans at its front and rear to move air through it. The centrifugal blower in the rear corner may not be helping much to draw air across the cpu's. The same principle applies to the enclosure. 
Try to move more air through it vs just around the inside. The cooler the components the lower the failure rate. Bari _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Thu Jul 24 15:12:25 2003 From: rgb at phy.duke.edu (Robert G. Brown) Date: Thu, 24 Jul 2003 15:12:25 -0400 (EDT) Subject: cold rooms & machines In-Reply-To: <20030723235814.GA11248@velocet.ca> Message-ID: On Wed, 23 Jul 2003, Ken Chase wrote: > A group I know wants to put a cluster in their labs, but they > dont have any facilities for cooling _EXCEPT_ a cold room to store > chemicals and conduct experiments at 5C (its largely unused and could > probably be set to any temp up to 10C, really - even -10C if desired > ;) > > The chillers in there are pretty underworked and might be able to > handle the 3000W odd of heat that would be radiating out of the > machines. > > What other criteria should we be looking at - non-condensing > environment I would guess is one - is this just a function of the %RH > in the room? What should it be set to? Any other concerns? Air circulation. The room needs to have a circulation pattern that delivers cool air to the intake/front of the cluster and delivers warmed air from the exhaust rear to the air return. A cold room might or might not have adequate airflow or chiller capacity, as it isn't really engineered for active sources within the space but rather for removing ambient heat from objects placed therein a single time, plus dealing with heat bleeding through its (usually copious) insulation. There are lots of (bad) things that could happen if the air circulation isn't engineered right -- coils can freeze up, humidity can condense and leak, cluster nodes can feed back heated air outside the cooled air circulation and overheat. I'd have them contact an AC engineer to go over the space and see whether it can work, and if so what modifications are required. rgb > > /kc > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rocky at atipa.com Thu Jul 24 11:42:04 2003 From: rocky at atipa.com (Rocky McGaugh) Date: Thu, 24 Jul 2003 10:42:04 -0500 (CDT) Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED In-Reply-To: <3F1FF776.E586E244@imag.fr> Message-ID: On Thu, 24 Jul 2003 Stephane.Martin at imag.fr wrote: > Hello, > > For our tests we are connected to a 4108GL (J4865A), we have done all > necessary checks (maybe we've have forget something very very big ????) > to ensure the validity of our mesures. The ports have been tested with > auto neg on, then off and also forced. We have also the same mesures > when connected to a J4898A. The negociation between the NIcs ans the two > switches is working. > > When using a tyan motherboard with the 82540EM built-in and using the > same benchs and switches ans the same procedures (drivers updates and > compilations from Intel, various benchs, different OS) the results are > correct (80 to 90Mo). 
> > All our tests tends to show that dell missed something in the > integration of the 82540EM in the 1600SC series...if not we'll really > really appreciate to know what we are missing there cause here we have a > 150 000 dollars cluster said to be connected with a network gigabit > having network perfs of three 100 card bonded (in full duplex it's even > worse !!!!!). If the problem is not rapidly solved the 48 machines will > be returned.... I'd totally remove the switch from the situation first. See what you can get back-to-back by directly connecting one node to another first. While the 4108GL is great for management networks, it is not a high performance switch. Wait till you fire up all 48 with PMB. Your bisectional bandwidth is not going to be great, but you should still be able to hit decent numbers with a limited number of machines. It's possible that broadcast and multicast traffic are interfering with your runs. So first remove the switch. If you get the performance you are looking for point-to-point, then you can focus your efforts on the switch. Twice i've had 4108GL's that would experience a severe performance hit when doing any traffic with a certain blade. The first time it was a fast ethernet blade in slot "C". Any network traffic that hit a port on this blade was severely degraded. We swapped blades with a different slot and the problem did not follow the blade. A firmware update solved the issue. The second time it was with a gig-E blade in slot "F". Again, any network traffic that hit a port on this blade was severely degraded (similar to what you're seeing now). This time, a firmware update did not fix it, but swapping it with another gig-E blade from another 4108GL worked fine. The "problem" blade also worked fine in the other 4108. Targeting Pallas PMB to run on specific nodes based on the topology of the switch can sure tell one a lot about a switch...:) Good luck, -- Rocky McGaugh Atipa Technologies rocky at atipatechnologies.com rmcgaugh at atipa.com 1-785-841-9513 x3110 http://67.8450073/ perl -e 'print unpack(u, ".=W=W+F%T:7\!A+F-O;0H`");' _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From enrico341 at hotmail.com Thu Jul 24 14:58:02 2003 From: enrico341 at hotmail.com (Eric Uren) Date: Thu, 24 Jul 2003 13:58:02 -0500 Subject: Hubs Message-ID: To whomever it may concern, I am trying to link together 30 boards through Ethernet. What would be your recomendation for how many and what type of Hubs I should use to connect them all together. Any imput is appreciated. Eric Uren AT Systems _________________________________________________________________ Protect your PC - get McAfee.com VirusScan Online http://clinic.mcafee.com/clinic/ibuy/campaign.asp?cid=3963 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Thu Jul 24 15:40:37 2003 From: rgb at phy.duke.edu (Robert G. Brown) Date: Thu, 24 Jul 2003 15:40:37 -0400 (EDT) Subject: Hubs In-Reply-To: Message-ID: On Thu, 24 Jul 2003, Eric Uren wrote: > > To whomever it may concern, > > I am trying to link together 30 boards through Ethernet. What > would be your recomendation for how many and what type of Hubs I should use > to connect them all together. Any imput is appreciated. 
Any hint as to what you're going to be doing with the 30 boards? The obvious choice is a cheap 48 port 10/100BT switch from any name-brand vendor. However, there are circumstances where you'd want more expensive switches, 1000BT switches, or a different network altogether. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From klight at appliedthermalsciences.com Thu Jul 24 16:16:54 2003 From: klight at appliedthermalsciences.com (Ken Light) Date: Thu, 24 Jul 2003 16:16:54 -0400 Subject: Thermal Problems Message-ID: I think there are a lot of compromises in this layout. The centrifugal blower in the back looks like it is helping mostly the power supply, not the CPUs. The CPU fans doesn't look like they are being very effective when the top of the case goes on and the little muffin fans near the memory are notoriously inefficient when you present them with any kind of flow restriction like that duct. I would be tempted to experiment with different CPU heat sinks and a bigger blower on front to move air over them. The following links show some views of a pretty good Xeon setup. Maybe you can get some ideas of things to try (by the way, the CPUs are under the paper). The case is custom from Microway Inc. and is pretty deep, but the extra space makes for a good layout. Good luck. http://www.clusters.umaine.edu/xeon/ -Ken > -----Original Message----- > From: Michael T. Prinkey [mailto:mprinkey at aeolusresearch.com] > Sent: Thursday, July 24, 2003 9:33 AM > To: Mitchel Kagawa > Cc: beowulf at beowulf.org > Subject: Re: Thermal Problems > > > On Wed, 23 Jul 2003, Mitchel Kagawa wrote: > > > Here are a few pictures of the culprite. Any suggestions > on how to fix it > > other than buying a whole new case would be appreciated > > http://neptune.navships.com/images/oscarnode-front.jpg > > http://neptune.navships.com/images/oscarnode-side.jpg > > http://neptune.navships.com/images/oscarnode-back.jpg > > > > You can also see how many I'm down... it should read 65 > nodes (64 + 1 head > > node) > > http://neptune.navships.com/ganglia > > > > Mitchel Kagawa > > Systems Administrator > > > > The Intel Xeon ships with an interesting heat sink/fan/shroud > system. > For an normal case, you can mount the fan on the top of the > shroud which > makes it work much like a "normal" heat sink/fan...the air > comes in the > top and blows down onto the CPU. But, for low-profile > installations (mine > were 2U), the fan attaches to the side of the shroud to form a "wind > tunnel." Maybe a similar solution would exist in your case, > i.e., taller > heat sinks (~1") with one or two fans mounted on the side > blowing across > the heat sink. I did a quick search online, but couldn't > find a vendor > for this type heat sink. Sorry. > > You might be able to experiment. Fans are usually only held > in place with > oversized screws that go easily into soft heat sinks. You > can probably > build a pair of test heat sinks in 10 minutues. The flow from the fan > should be aligned with the fins. Depending on the type of > heatsink you > start with, you might be able to direct the output flow in > any direction > you choose. 
From the photos, I would recommend that you > place the fans on > the side of the heat sink near the front of the case so the exhaust is > directed to the vents at the rear of the case. > > Good luck, > > Mike Prinkey > Aeolus Research, Inc. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) > visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From deadline at plogic.com Thu Jul 24 16:13:01 2003 From: deadline at plogic.com (Douglas Eadline) Date: Thu, 24 Jul 2003 16:13:01 -0400 (EDT) Subject: Comparing MPI Implementations In-Reply-To: <20030724111221.Y73094-100000@net.bluemoon.net> Message-ID: On Thu, 24 Jul 2003, Andrew Fant wrote: > > Does anyone have any experiences comparing MPI implementations for Linux? > In particular, I am interested in people's views of the relative merits of > Mpich, LAM, and MPIPro. I currently have Mpich installed on our > production cluster, but this decision came mostly out of default, rather > than by any serious study. One easy way to compare is to use the NAS test suite in the Beowulf Performance Suite. You can very easily run the NAS suite with MPICH, LAM, and MPI-PRO, (and compilers, numbers of cpus, and test size) The suite does not include the MPI versions. Have a look at: www.cluster-rant.com/article.pl?sid=03/03/17/1838236 for links and example output. I have not had a chance to post some recent results, but I can say the following: Given the same hardware for all MPI's: - it depends on the application - it depends if you are using dual nodes running two copies of your program. - it depends on the version you use How is that for a simple answer. Doug > > Thanks in advance, > Andy > > Andrew Fant | This | "If I could walk THAT way... > Molecular Geek | Space | I wouldn't need the talcum powder!" > fant at pobox.com | For | G. Marx (apropos of Aerosmith) > Boston, MA USA | Hire | http://www.pharmawulf.com > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- ------------------------------------------------------------------- Paralogic, Inc. | PEAK | Voice:+610.814.2800 130 Webster Street | PARALLEL | Fax:+610.814.5844 Bethlehem, PA 18015 USA | PERFORMANCE | http://www.plogic.com ------------------------------------------------------------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Stephane.Martin at imag.fr Thu Jul 24 17:52:02 2003 From: Stephane.Martin at imag.fr (Stephane.Martin at imag.fr) Date: Thu, 24 Jul 2003 23:52:02 +0200 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED References: Message-ID: <3F205502.A2E197D3@imag.fr> Jim Phillips a ?crit : > > Hi, > > The 82540EM is a low-cost 32-bit "desktop" NIC, so it's hard to get full > gigabit bandwidth, particularly if you're running at 33 MHz (look at > /proc/net/PRO_LAN_Adapters/eth0/PCI_Bus_Speed to find out). There are no > 82540EM-based PCI-X cards, AFAIK; are you sure it wasn't a 64-bit 82545EM > card? 
Intel distinguishes their 32-bit 33/66 MHz PCI PRO/1000 MT Desktop > cards that use 82540EM from their 64-bit PCI-X PRO/1000 MT Server cards > that use the 82545EM (and have full gigabit performance). > > -Jim > > On Thu, 24 Jul 2003 Stephane.Martin at imag.fr wrote: > > > > > Hello, > > > > > > > > We have recently received 48 Bi-xeon Dell 1600SC and we are performing > > > > some benchmarks to tests the cluster. > > > > Unfortunately we have very bad perfomance with the internal gigabit > > > > card (82540EM chipset). We have passed linux netperf test and we have > > > > only 33 Mo > > > > > > > > between 2 machines. We have changed the drivers for the last ones, > > > > installed procfgd and so on... Finally we had Win2000 installed and > > > > the last driver > > > > > > > > from intel installed : the results are identical... To go further we > > > > have installed a PCI-X 82540EM card and re-run the tests : in that way the > > > > > > > > results are much better : 66 Mo full duplex... > > > > So the question is : is there a well known problem with this DELL > > > > 1600SC concernig the 82540EM integration on the motherboard ???? > > > > > > > > As anyone already have (heard about) this problem ? > > > > Is there any solution ? > > > > > > > > thx for your help > > > > > > > > > > -- > > > Dr. Jeff Layton > > > Chart Monkey - Aerodynamics and CFD > > > Lockheed-Martin Aeronautical Company - Marietta > > > > Hello, > > > > For our tests we are connected to a 4108GL (J4865A), we have done all necessary checks (maybe we've have forget something very very big ????) to > > ensure the validity of our mesures. The ports have been tested with auto neg on, then off and also forced. We have also the same mesures when > > connected to a J4898A. The negociation between the NIcs ans the two switches is working. > > > > When using a tyan motherboard with the 82540EM built-in and using the same benchs and switches ans the same procedures (drivers updates and > > compilations from Intel, various benchs, different OS) the results are correct (80 to 90Mo). > > > > All our tests tends to show that dell missed something in the integration of the 82540EM in the 1600SC series...if not we'll really really appreciate > > to know what we are missing there cause here we have a 150 000 dollars cluster said to be connected with a network gigabit having network perfs of > > three 100 card bonded (in full duplex it's even worse !!!!!). If the problem is not rapidly solved the 48 machines will be returned.... > > > > thx a lot for your concern, > > > > regards > > > > > > -- > > Stephane Martin Stephane.Martin at imag.fr > > http://icluster.imag.fr > > Tel: 04 76 61 20 31 > > Informatique et distribution Web: http://www-id.imag.fr > > ENSIMAG - Antenne de Montbonnot > > ZIRST - 51, avenue Jean Kuntzmann > > 38330 MONTBONNOT SAINT MARTIN > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > I'm going to re re re re check it... thx a lot for your concern ! 
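(A sketch of how that re-check can be scripted, following Jim's pointers above; it was not posted in the thread. It greps plain lspci output for the Intel part names and, where the e1000 driver exposes its proc interface, reads the reported PCI bus speed. Paths and match strings are best-effort assumptions.)

import glob
import subprocess

def intel_gige_lines():
    # One line per PCI function; keep the Intel gigabit parts in question.
    out = subprocess.run(["lspci"], capture_output=True, text=True, check=True)
    return [l for l in out.stdout.splitlines()
            if "82540" in l or "82545" in l or "82546" in l]

def e1000_bus_speeds():
    # Present only when the Intel e1000 driver exposes its proc interface.
    speeds = {}
    for path in glob.glob("/proc/net/PRO_LAN_Adapters/*/PCI_Bus_Speed"):
        iface = path.split("/")[-2]
        with open(path) as f:
            speeds[iface] = f.read().strip()
    return speeds

if __name__ == "__main__":
    for line in intel_gige_lines():
        print(line)                 # e.g. "... 82540EM Gigabit Ethernet ..."
    for iface, speed in sorted(e1000_bus_speeds().items()):
        print(iface, speed)         # e.g. "eth0 33MHz" on a 32-bit slot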
-- Stephane Martin Stephane.Martin at imag.fr http://icluster.imag.fr Tel: 04 76 61 20 31 Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 MONTBONNOT SAINT MARTIN _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From alvin at Mail.Linux-Consulting.com Thu Jul 24 21:36:43 2003 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Thu, 24 Jul 2003 18:36:43 -0700 (PDT) Subject: Thermal Problems In-Reply-To: <3F202742.5010107@onelabs.com> Message-ID: hi ya any system where the cpu is next to the power supply is a doomed box if the airflow in the chassis is done right ... there should be minimal temp difference between the system running with covers and without covers cpu fans above the cpu heatsink is worthless in a 1U case .. throw it away ( unless there is a really good fan blade design to pull air and move air ( in 0.25" of space between the heatsink bottom and the cover just just ( above the fan blade lots of fun playing with air :-) blowers in the back of the power supply doesnt do anything - most power supply exhaust air out the back y its power cord and should NOT be blocked or have cross air flow from other fans like in an indented power supply ( inside the chassis ) c ya alvin On Thu, 24 Jul 2003, Bari Ari wrote: > Mitchel Kagawa wrote: > > >Here are a few pictures of the culprite. Any suggestions on how to fix it > >other than buying a whole new case would be appreciated > >http://neptune.navships.com/images/oscarnode-front.jpg > >http://neptune.navships.com/images/oscarnode-side.jpg > >http://neptune.navships.com/images/oscarnode-back.jpg > > > > > > > The fans tied to the cpu heat sinks may be too close to the top cover > for effective air flow/cooling. Measure the air temp at various places > inside the case when closed and the cpu's operating. Try to get an idea > of how much airflow is actually moving through the case vs just around > the inside of the case. > > Try placing tangential (cross flow) fans in the empty drive bays and up > against the front panel and opening up the rear of the case. > > http://www.airvac.se/products.htm > > The power supply has fans at its front and rear to move air through it. > The centrifugal blower in the rear corner may not be helping much to > draw air across the cpu's. The same principle applies to the enclosure. > Try to move more air through it vs just around the inside. The cooler > the components the lower the failure rate. > > Bari > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Matthew_Wygant at dell.com Thu Jul 24 22:08:02 2003 From: Matthew_Wygant at dell.com (Matthew_Wygant at dell.com) Date: Thu, 24 Jul 2003 21:08:02 -0500 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED Message-ID: <6CB36426C6B9D541A8B1D2022FEA7FC10800A7@ausx2kmpc108.aus.amer.dell.com> Desktop or server quality, I do not know, but the 1600sc does have the 82540 chip, dmseg should show that much. 
It is on a 33MHz bus and does rate as a 10/100/1000 nic. I was curious which driver you were using, e1000 or eepro1000? The latter has known slow transfer problems, but just as mentioned, hard-setting all network devices should yield the best performance. Hope that helps. 1600sc servers are not the best for clusters with their size and power consumption, but I would recommend the 650 or 1650s. -matt -----Original Message----- From: Stephane.Martin at imag.fr [mailto:Stephane.Martin at imag.fr] Sent: Thursday, July 24, 2003 4:52 PM To: Jim Phillips Cc: boewulf Subject: Re: Dell 1600SC + 82540EM poor performance..HELP NEEDED Jim Phillips a ?crit : > > Hi, > > The 82540EM is a low-cost 32-bit "desktop" NIC, so it's hard to get > full gigabit bandwidth, particularly if you're running at 33 MHz (look > at /proc/net/PRO_LAN_Adapters/eth0/PCI_Bus_Speed to find out). There > are no 82540EM-based PCI-X cards, AFAIK; are you sure it wasn't a > 64-bit 82545EM card? Intel distinguishes their 32-bit 33/66 MHz PCI > PRO/1000 MT Desktop cards that use 82540EM from their 64-bit PCI-X > PRO/1000 MT Server cards that use the 82545EM (and have full gigabit > performance). > > -Jim > > On Thu, 24 Jul 2003 Stephane.Martin at imag.fr wrote: > > > > > Hello, > > > > > > > > We have recently received 48 Bi-xeon Dell 1600SC and we are > > > > performing some benchmarks to tests the cluster. Unfortunately > > > > we have very bad perfomance with the internal gigabit card > > > > (82540EM chipset). We have passed linux netperf test and we have > > > > only 33 Mo > > > > > > > > between 2 machines. We have changed the drivers for the last > > > > ones, installed procfgd and so on... Finally we had Win2000 > > > > installed and the last driver > > > > > > > > from intel installed : the results are identical... To go > > > > further we have installed a PCI-X 82540EM card and re-run the > > > > tests : in that way the > > > > > > > > results are much better : 66 Mo full duplex... > > > > So the question is : is there a well known problem with this > > > > DELL 1600SC concernig the 82540EM integration on the motherboard > > > > ???? > > > > > > > > As anyone already have (heard about) this problem ? > > > > Is there any solution ? > > > > > > > > thx for your help > > > > > > > > > > -- > > > Dr. Jeff Layton > > > Chart Monkey - Aerodynamics and CFD > > > Lockheed-Martin Aeronautical Company - Marietta > > > > Hello, > > > > For our tests we are connected to a 4108GL (J4865A), we have done > > all necessary checks (maybe we've have forget something very very > > big ????) to ensure the validity of our mesures. The ports have been > > tested with auto neg on, then off and also forced. We have also the > > same mesures when connected to a J4898A. The negociation between the > > NIcs ans the two switches is working. > > > > When using a tyan motherboard with the 82540EM built-in and using > > the same benchs and switches ans the same procedures (drivers > > updates and compilations from Intel, various benchs, different OS) > > the results are correct (80 to 90Mo). > > > > All our tests tends to show that dell missed something in the > > integration of the 82540EM in the 1600SC series...if not we'll > > really really appreciate to know what we are missing there cause > > here we have a 150 000 dollars cluster said to be connected with a > > network gigabit having network perfs of three 100 card bonded (in > > full duplex it's even worse !!!!!). If the problem is not rapidly > > solved the 48 machines will be returned.... 
> > > > thx a lot for your concern, > > > > regards > > > > > > -- > > Stephane Martin Stephane.Martin at imag.fr > > http://icluster.imag.fr > > Tel: 04 76 61 20 31 > > Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - > > Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann > > 38330 MONTBONNOT SAINT MARTIN > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > I'm going to re re re re check it... thx a lot for your concern ! -- Stephane Martin Stephane.Martin at imag.fr http://icluster.imag.fr Tel: 04 76 61 20 31 Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 MONTBONNOT SAINT MARTIN _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jeff.cheung at nixdsl.com Fri Jul 25 04:02:24 2003 From: jeff.cheung at nixdsl.com (Jeff Cheung) Date: Fri, 25 Jul 2003 16:02:24 +0800 Subject: Xoen Prefermence Message-ID: Hello Does anyone know where can I find the Linpack and NASA Parallel Benchmarks on a dual P4 Xeon 2.8GHz 533FSB with 2GB RAM Jeff Cheung _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Stephane.Martin at imag.fr Fri Jul 25 05:22:41 2003 From: Stephane.Martin at imag.fr (Stephane.Martin at imag.fr) Date: Fri, 25 Jul 2003 11:22:41 +0200 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED References: <6CB36426C6B9D541A8B1D2022FEA7FC10800A7@ausx2kmpc108.aus.amer.dell.com> Message-ID: <3F20F6E1.346DF1CD@imag.fr> Matthew_Wygant at Dell.com a ?crit : > > Desktop or server quality, I do not know, but the 1600sc does have the 82540 > chip, dmseg should show that much. It is on a 33MHz bus and does rate as a > 10/100/1000 nic. I was curious which driver you were using, e1000 or > eepro1000? The latter has known slow transfer problems, but just as > mentioned, hard-setting all network devices should yield the best > performance. Hope that helps. 1600sc servers are not the best for clusters > with their size and power consumption, but I would recommend the 650 or > 1650s. > > -matt > > -----Original Message----- > From: Stephane.Martin at imag.fr [mailto:Stephane.Martin at imag.fr] > Sent: Thursday, July 24, 2003 4:52 PM > To: Jim Phillips > Cc: boewulf > Subject: Re: Dell 1600SC + 82540EM poor performance..HELP NEEDED > > Jim Phillips a ?crit : > > > > Hi, > > > > The 82540EM is a low-cost 32-bit "desktop" NIC, so it's hard to get > > full gigabit bandwidth, particularly if you're running at 33 MHz (look > > at /proc/net/PRO_LAN_Adapters/eth0/PCI_Bus_Speed to find out). There > > are no 82540EM-based PCI-X cards, AFAIK; are you sure it wasn't a > > 64-bit 82545EM card? Intel distinguishes their 32-bit 33/66 MHz PCI > > PRO/1000 MT Desktop cards that use 82540EM from their 64-bit PCI-X > > PRO/1000 MT Server cards that use the 82545EM (and have full gigabit > > performance). 
> > > > -Jim > > > > On Thu, 24 Jul 2003 Stephane.Martin at imag.fr wrote: > > > > > > > Hello, > > > > > > > > > > We have recently received 48 Bi-xeon Dell 1600SC and we are > > > > > performing some benchmarks to tests the cluster. Unfortunately > > > > > we have very bad perfomance with the internal gigabit card > > > > > (82540EM chipset). We have passed linux netperf test and we have > > > > > only 33 Mo > > > > > > > > > > between 2 machines. We have changed the drivers for the last > > > > > ones, installed procfgd and so on... Finally we had Win2000 > > > > > installed and the last driver > > > > > > > > > > from intel installed : the results are identical... To go > > > > > further we have installed a PCI-X 82540EM card and re-run the > > > > > tests : in that way the > > > > > > > > > > results are much better : 66 Mo full duplex... > > > > > So the question is : is there a well known problem with this > > > > > DELL 1600SC concernig the 82540EM integration on the motherboard > > > > > ???? > > > > > > > > > > As anyone already have (heard about) this problem ? > > > > > Is there any solution ? > > > > > > > > > > thx for your help > > > > > > > > > > > > > -- > > > > Dr. Jeff Layton > > > > Chart Monkey - Aerodynamics and CFD > > > > Lockheed-Martin Aeronautical Company - Marietta > > > > > > Hello, > > > > > > For our tests we are connected to a 4108GL (J4865A), we have done > > > all necessary checks (maybe we've have forget something very very > > > big ????) to ensure the validity of our mesures. The ports have been > > > tested with auto neg on, then off and also forced. We have also the > > > same mesures when connected to a J4898A. The negociation between the > > > NIcs ans the two switches is working. > > > > > > When using a tyan motherboard with the 82540EM built-in and using > > > the same benchs and switches ans the same procedures (drivers > > > updates and compilations from Intel, various benchs, different OS) > > > the results are correct (80 to 90Mo). > > > > > > All our tests tends to show that dell missed something in the > > > integration of the 82540EM in the 1600SC series...if not we'll > > > really really appreciate to know what we are missing there cause > > > here we have a 150 000 dollars cluster said to be connected with a > > > network gigabit having network perfs of three 100 card bonded (in > > > full duplex it's even worse !!!!!). If the problem is not rapidly > > > solved the 48 machines will be returned.... > > > > > > thx a lot for your concern, > > > > > > regards > > > > > > > > > -- > > > Stephane Martin Stephane.Martin at imag.fr > > > http://icluster.imag.fr > > > Tel: 04 76 61 20 31 > > > Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - > > > Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann > > > 38330 MONTBONNOT SAINT MARTIN > > > _______________________________________________ > > > Beowulf mailing list, Beowulf at beowulf.org > > > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > > > > I'm going to re re re re check it... > > thx a lot for your concern ! 
> > -- > Stephane Martin Stephane.Martin at imag.fr > http://icluster.imag.fr > Tel: 04 76 61 20 31 > Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne > de Montbonnot > ZIRST - 51, avenue Jean Kuntzmann > 38330 MONTBONNOT SAINT MARTIN > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf hello, The driver used is the e1000 one; last src from intel... We are on the way of a commercial issue to get "not on board" good gb NICs at low low cost... Which one is the best ? (broadcom ? intel ? other ?) I've check (by myself this time ;) the ID of the PCI card added : YOU ARE RIGHT it's 82545EM : our fault !!! good news ! BUT, I've also re checked the number on the tyan motherboard and this this time it's really a 82540EM ! bad news ! So the pb is still there : why on a tyan mb we get twice the perfs in comparaison with a dell mb ? (same os install, same bench, same network) BTW we are going to get a card on the 64 bit PCI-X bus as the onbaord is not suitable for high performance usage. thx all for your concerns. regards -- Stephane Martin Stephane.Martin at imag.fr http://icluster.imag.fr Tel: 04 76 61 20 31 Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 MONTBONNOT SAINT MARTIN _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Matthew_Wygant at dell.com Fri Jul 25 07:31:23 2003 From: Matthew_Wygant at dell.com (Matthew_Wygant at dell.com) Date: Fri, 25 Jul 2003 06:31:23 -0500 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED Message-ID: <6CB36426C6B9D541A8B1D2022FEA7FC1BD64DD@ausx2kmpc108.aus.amer.dell.com> I would stick to intel, I would not use a Broadcom at all... -----Original Message----- From: Stephane.Martin at imag.fr [mailto:Stephane.Martin at imag.fr] Sent: Friday, July 25, 2003 4:23 AM To: Matthew_Wygant at exchange.dell.com Cc: beowulf at beowulf.org Subject: Re: Dell 1600SC + 82540EM poor performance..HELP NEEDED Matthew_Wygant at Dell.com a ?crit : > > Desktop or server quality, I do not know, but the 1600sc does have the > 82540 chip, dmseg should show that much. It is on a 33MHz bus and > does rate as a 10/100/1000 nic. I was curious which driver you were > using, e1000 or eepro1000? The latter has known slow transfer > problems, but just as mentioned, hard-setting all network devices > should yield the best performance. Hope that helps. 1600sc servers > are not the best for clusters with their size and power consumption, > but I would recommend the 650 or 1650s. > > -matt > > -----Original Message----- > From: Stephane.Martin at imag.fr [mailto:Stephane.Martin at imag.fr] > Sent: Thursday, July 24, 2003 4:52 PM > To: Jim Phillips > Cc: boewulf > Subject: Re: Dell 1600SC + 82540EM poor performance..HELP NEEDED > > Jim Phillips a ?crit : > > > > Hi, > > > > The 82540EM is a low-cost 32-bit "desktop" NIC, so it's hard to get > > full gigabit bandwidth, particularly if you're running at 33 MHz > > (look at /proc/net/PRO_LAN_Adapters/eth0/PCI_Bus_Speed to find out). > > There are no 82540EM-based PCI-X cards, AFAIK; are you sure it > > wasn't a 64-bit 82545EM card? 
Intel distinguishes their 32-bit > > 33/66 MHz PCI PRO/1000 MT Desktop cards that use 82540EM from their > > 64-bit PCI-X PRO/1000 MT Server cards that use the 82545EM (and have > > full gigabit performance). > > > > -Jim > > > > On Thu, 24 Jul 2003 Stephane.Martin at imag.fr wrote: > > > > > > > Hello, > > > > > > > > > > We have recently received 48 Bi-xeon Dell 1600SC and we are > > > > > performing some benchmarks to tests the cluster. Unfortunately > > > > > we have very bad perfomance with the internal gigabit card > > > > > (82540EM chipset). We have passed linux netperf test and we > > > > > have only 33 Mo > > > > > > > > > > between 2 machines. We have changed the drivers for the last > > > > > ones, installed procfgd and so on... Finally we had Win2000 > > > > > installed and the last driver > > > > > > > > > > from intel installed : the results are identical... To go > > > > > further we have installed a PCI-X 82540EM card and re-run the > > > > > tests : in that way the > > > > > > > > > > results are much better : 66 Mo full duplex... > > > > > So the question is : is there a well known problem with this > > > > > DELL 1600SC concernig the 82540EM integration on the > > > > > motherboard ???? > > > > > > > > > > As anyone already have (heard about) this problem ? Is there > > > > > any solution ? > > > > > > > > > > thx for your help > > > > > > > > > > > > > -- > > > > Dr. Jeff Layton > > > > Chart Monkey - Aerodynamics and CFD > > > > Lockheed-Martin Aeronautical Company - Marietta > > > > > > Hello, > > > > > > For our tests we are connected to a 4108GL (J4865A), we have done > > > all necessary checks (maybe we've have forget something very very > > > big ????) to ensure the validity of our mesures. The ports have > > > been tested with auto neg on, then off and also forced. We have > > > also the same mesures when connected to a J4898A. The negociation > > > between the NIcs ans the two switches is working. > > > > > > When using a tyan motherboard with the 82540EM built-in and using > > > the same benchs and switches ans the same procedures (drivers > > > updates and compilations from Intel, various benchs, different OS) > > > the results are correct (80 to 90Mo). > > > > > > All our tests tends to show that dell missed something in the > > > integration of the 82540EM in the 1600SC series...if not we'll > > > really really appreciate to know what we are missing there cause > > > here we have a 150 000 dollars cluster said to be connected with a > > > network gigabit having network perfs of three 100 card bonded (in > > > full duplex it's even worse !!!!!). If the problem is not rapidly > > > solved the 48 machines will be returned.... > > > > > > thx a lot for your concern, > > > > > > regards > > > > > > > > > -- > > > Stephane Martin Stephane.Martin at imag.fr > > > http://icluster.imag.fr > > > Tel: 04 76 61 20 31 > > > Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - > > > Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 > > > MONTBONNOT SAINT MARTIN > > > _______________________________________________ > > > Beowulf mailing list, Beowulf at beowulf.org > > > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > > > > I'm going to re re re re check it... > > thx a lot for your concern ! 
> > -- > Stephane Martin Stephane.Martin at imag.fr > http://icluster.imag.fr > Tel: 04 76 61 20 31 > Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - > Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann > 38330 MONTBONNOT SAINT MARTIN > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf hello, The driver used is the e1000 one; last src from intel... We are on the way of a commercial issue to get "not on board" good gb NICs at low low cost... Which one is the best ? (broadcom ? intel ? other ?) I've check (by myself this time ;) the ID of the PCI card added : YOU ARE RIGHT it's 82545EM : our fault !!! good news ! BUT, I've also re checked the number on the tyan motherboard and this this time it's really a 82540EM ! bad news ! So the pb is still there : why on a tyan mb we get twice the perfs in comparaison with a dell mb ? (same os install, same bench, same network) BTW we are going to get a card on the 64 bit PCI-X bus as the onbaord is not suitable for high performance usage. thx all for your concerns. regards -- Stephane Martin Stephane.Martin at imag.fr http://icluster.imag.fr Tel: 04 76 61 20 31 Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 MONTBONNOT SAINT MARTIN _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Stephane.Martin at imag.fr Fri Jul 25 08:50:06 2003 From: Stephane.Martin at imag.fr (Stephane.Martin at imag.fr) Date: Fri, 25 Jul 2003 14:50:06 +0200 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED References: <6CB36426C6B9D541A8B1D2022FEA7FC1BD64DD@ausx2kmpc108.aus.amer.dell.com> Message-ID: <3F21277E.D0932B89@imag.fr> Matthew_Wygant at Dell.com a ?crit : > > I would stick to intel, I would not use a Broadcom at all... > > -----Original Message----- > From: Stephane.Martin at imag.fr [mailto:Stephane.Martin at imag.fr] > Sent: Friday, July 25, 2003 4:23 AM > To: Matthew_Wygant at exchange.dell.com > Cc: beowulf at beowulf.org > Subject: Re: Dell 1600SC + 82540EM poor performance..HELP NEEDED > > Matthew_Wygant at Dell.com a ?crit : > > > > Desktop or server quality, I do not know, but the 1600sc does have the > > 82540 chip, dmseg should show that much. It is on a 33MHz bus and > > does rate as a 10/100/1000 nic. I was curious which driver you were > > using, e1000 or eepro1000? The latter has known slow transfer > > problems, but just as mentioned, hard-setting all network devices > > should yield the best performance. Hope that helps. 1600sc servers > > are not the best for clusters with their size and power consumption, > > but I would recommend the 650 or 1650s. > > > > -matt > > > > -----Original Message----- > > From: Stephane.Martin at imag.fr [mailto:Stephane.Martin at imag.fr] > > Sent: Thursday, July 24, 2003 4:52 PM > > To: Jim Phillips > > Cc: boewulf > > Subject: Re: Dell 1600SC + 82540EM poor performance..HELP NEEDED > > > > Jim Phillips a ?crit : > > > > > > Hi, > > > > > > The 82540EM is a low-cost 32-bit "desktop" NIC, so it's hard to get > > > full gigabit bandwidth, particularly if you're running at 33 MHz > > > (look at /proc/net/PRO_LAN_Adapters/eth0/PCI_Bus_Speed to find out). 
> > > There are no 82540EM-based PCI-X cards, AFAIK; are you sure it > > > wasn't a 64-bit 82545EM card? Intel distinguishes their 32-bit > > > 33/66 MHz PCI PRO/1000 MT Desktop cards that use 82540EM from their > > > 64-bit PCI-X PRO/1000 MT Server cards that use the 82545EM (and have > > > full gigabit performance). > > > > > > -Jim > > > > > > On Thu, 24 Jul 2003 Stephane.Martin at imag.fr wrote: > > > > > > > > > Hello, > > > > > > > > > > > > We have recently received 48 Bi-xeon Dell 1600SC and we are > > > > > > performing some benchmarks to tests the cluster. Unfortunately > > > > > > we have very bad perfomance with the internal gigabit card > > > > > > (82540EM chipset). We have passed linux netperf test and we > > > > > > have only 33 Mo > > > > > > > > > > > > between 2 machines. We have changed the drivers for the last > > > > > > ones, installed procfgd and so on... Finally we had Win2000 > > > > > > installed and the last driver > > > > > > > > > > > > from intel installed : the results are identical... To go > > > > > > further we have installed a PCI-X 82540EM card and re-run the > > > > > > tests : in that way the > > > > > > > > > > > > results are much better : 66 Mo full duplex... > > > > > > So the question is : is there a well known problem with this > > > > > > DELL 1600SC concernig the 82540EM integration on the > > > > > > motherboard ???? > > > > > > > > > > > > As anyone already have (heard about) this problem ? Is there > > > > > > any solution ? > > > > > > > > > > > > thx for your help > > > > > > > > > > > > > > > > -- > > > > > Dr. Jeff Layton > > > > > Chart Monkey - Aerodynamics and CFD > > > > > Lockheed-Martin Aeronautical Company - Marietta > > > > > > > > Hello, > > > > > > > > For our tests we are connected to a 4108GL (J4865A), we have done > > > > all necessary checks (maybe we've have forget something very very > > > > big ????) to ensure the validity of our mesures. The ports have > > > > been tested with auto neg on, then off and also forced. We have > > > > also the same mesures when connected to a J4898A. The negociation > > > > between the NIcs ans the two switches is working. > > > > > > > > When using a tyan motherboard with the 82540EM built-in and using > > > > the same benchs and switches ans the same procedures (drivers > > > > updates and compilations from Intel, various benchs, different OS) > > > > the results are correct (80 to 90Mo). > > > > > > > > All our tests tends to show that dell missed something in the > > > > integration of the 82540EM in the 1600SC series...if not we'll > > > > really really appreciate to know what we are missing there cause > > > > here we have a 150 000 dollars cluster said to be connected with a > > > > network gigabit having network perfs of three 100 card bonded (in > > > > full duplex it's even worse !!!!!). If the problem is not rapidly > > > > solved the 48 machines will be returned.... 
> > > > > > > > thx a lot for your concern, > > > > > > > > regards > > > > > > > > > > > > -- > > > > Stephane Martin Stephane.Martin at imag.fr > > > > http://icluster.imag.fr > > > > Tel: 04 76 61 20 31 > > > > Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - > > > > Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 > > > > MONTBONNOT SAINT MARTIN > > > > _______________________________________________ > > > > Beowulf mailing list, Beowulf at beowulf.org > > > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > > > > > > > I'm going to re re re re check it... > > > > thx a lot for your concern ! > > > > -- > > Stephane Martin Stephane.Martin at imag.fr > > http://icluster.imag.fr > > Tel: 04 76 61 20 31 > > Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - > > Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann > > 38330 MONTBONNOT SAINT MARTIN > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > hello, > > The driver used is the e1000 one; last src from intel... > We are on the way of a commercial issue to get "not on board" good gb NICs > at low low cost... Which one is the best ? (broadcom ? intel ? other ?) I've > check (by myself this time ;) the ID of the PCI card added : YOU ARE RIGHT > it's 82545EM : our fault !!! good news ! BUT, I've also re checked the > number on the tyan motherboard and this this time it's really a 82540EM ! > bad news ! So the pb is still there : why on a tyan mb we get twice the > perfs in comparaison with a dell mb ? (same os install, same bench, same > network) BTW we are going to get a card on the 64 bit PCI-X bus as the > onbaord is not suitable for high performance usage. > > thx all for your concerns. > > regards > > -- > Stephane Martin Stephane.Martin at imag.fr > http://icluster.imag.fr > Tel: 04 76 61 20 31 > Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne > de Montbonnot > ZIRST - 51, avenue Jean Kuntzmann > 38330 MONTBONNOT SAINT MARTIN As someone tested those two cards ????... those papers are not helping much ;) http://www.veritest.com/clients/reports/intel/intel_pro1000_mt_desktop_adapter.pdf http://www.etestinglabs.com/clients/reports/broadcom/broadcom_5703.pdf thx for your help regards, -- Stephane Martin Stephane.Martin at imag.fr http://icluster.imag.fr Tel: 04 76 61 20 31 Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 MONTBONNOT SAINT MARTIN _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bogdan.costescu at iwr.uni-heidelberg.de Fri Jul 25 10:13:12 2003 From: bogdan.costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Fri, 25 Jul 2003 16:13:12 +0200 (CEST) Subject: cold rooms & machines In-Reply-To: <20030723235814.GA11248@velocet.ca> Message-ID: On Wed, 23 Jul 2003, Ken Chase wrote: > _EXCEPT_ a cold room to store chemicals and conduct experiments at 5C > (its largely unused If by this you mean that computers and chemicals will share the room, I'd advise against it. Especially if the chemicals include some acids or volatile substances... 
Giving that on my university diploma it's written "biochemist" I think that I know what I'm talking about :-) Even with non-dangerous substances, if some of them are obtained commercially they might cost an arm and a leg and even something extra, so the owners should know what can happen if the cooling installation fails for some reason... -- Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From joelja at darkwing.uoregon.edu Fri Jul 25 10:25:39 2003 From: joelja at darkwing.uoregon.edu (Joel Jaeggli) Date: Fri, 25 Jul 2003 07:25:39 -0700 (PDT) Subject: Thermal Problems In-Reply-To: Message-ID: larger passive heatsinks... low-profile dimm modules in the angled dimm sockets... The fact that the power-supply is essentially exhausting into case despite the blower is worrysome... joelja On Thu, 24 Jul 2003, Alvin Oga wrote: > > hi ya > > any system where the cpu is next to the power supply is a doomed box > > if the airflow in the chassis is done right ... there should be > minimal temp difference between the system running with covers > and without covers > > cpu fans above the cpu heatsink is worthless in a 1U case .. throw it away > ( unless there is a really good fan blade design to pull air and move air > ( in 0.25" of space between the heatsink bottom and the cover just just > ( above the fan blade > > lots of fun playing with air :-) > > blowers in the back of the power supply doesnt do anything > - most power supply exhaust air out the back y its power cord > and should NOT be blocked or have cross air flow from other fans > like in an indented power supply ( inside the chassis ) > > c ya > alvin > > On Thu, 24 Jul 2003, Bari Ari wrote: > > > Mitchel Kagawa wrote: > > > > >Here are a few pictures of the culprite. Any suggestions on how to fix it > > >other than buying a whole new case would be appreciated > > >http://neptune.navships.com/images/oscarnode-front.jpg > > >http://neptune.navships.com/images/oscarnode-side.jpg > > >http://neptune.navships.com/images/oscarnode-back.jpg > > > > > > > > > > > The fans tied to the cpu heat sinks may be too close to the top cover > > for effective air flow/cooling. Measure the air temp at various places > > inside the case when closed and the cpu's operating. Try to get an idea > > of how much airflow is actually moving through the case vs just around > > the inside of the case. > > > > Try placing tangential (cross flow) fans in the empty drive bays and up > > against the front panel and opening up the rear of the case. > > > > http://www.airvac.se/products.htm > > > > The power supply has fans at its front and rear to move air through it. > > The centrifugal blower in the rear corner may not be helping much to > > draw air across the cpu's. The same principle applies to the enclosure. > > Try to move more air through it vs just around the inside. The cooler > > the components the lower the failure rate. 
> > > > Bari > > > > > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- -------------------------------------------------------------------------- Joel Jaeggli Academic User Services joelja at darkwing.uoregon.edu -- PGP Key Fingerprint: 1DE9 8FCA 51FB 4195 B42A 9C32 A30D 121E -- In Dr. Johnson's famous dictionary patriotism is defined as the last resort of the scoundrel. With all due respect to an enlightened but inferior lexicographer I beg to submit that it is the first. -- Ambrose Bierce, "The Devil's Dictionary" _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jim at ks.uiuc.edu Fri Jul 25 10:47:40 2003 From: jim at ks.uiuc.edu (Jim Phillips) Date: Fri, 25 Jul 2003 09:47:40 -0500 (CDT) Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED In-Reply-To: <3F20F6E1.346DF1CD@imag.fr> Message-ID: Hi again, If the Dell has an 82540 on 33 MHz but the Tyan has it on 66 MHz, I would expect the Tyan to have twice the performance, but still less than that of a 64-bit 82545 at 66 MHz (or 133 MHz on PCI-X). -Jim On Fri, 25 Jul 2003 Stephane.Martin at imag.fr wrote: > Matthew_Wygant at Dell.com a ?crit : > > > > Desktop or server quality, I do not know, but the 1600sc does have the 82540 > > chip, dmseg should show that much. It is on a 33MHz bus and does rate as a > > 10/100/1000 nic. I was curious which driver you were using, e1000 or > > eepro1000? The latter has known slow transfer problems, but just as > > mentioned, hard-setting all network devices should yield the best > > performance. Hope that helps. 1600sc servers are not the best for clusters > > with their size and power consumption, but I would recommend the 650 or > > 1650s. > > > > -matt > > > > -----Original Message----- > > From: Stephane.Martin at imag.fr [mailto:Stephane.Martin at imag.fr] > > Sent: Thursday, July 24, 2003 4:52 PM > > To: Jim Phillips > > Cc: boewulf > > Subject: Re: Dell 1600SC + 82540EM poor performance..HELP NEEDED > > > > Jim Phillips a ?crit : > > > > > > Hi, > > > > > > The 82540EM is a low-cost 32-bit "desktop" NIC, so it's hard to get > > > full gigabit bandwidth, particularly if you're running at 33 MHz (look > > > at /proc/net/PRO_LAN_Adapters/eth0/PCI_Bus_Speed to find out). There > > > are no 82540EM-based PCI-X cards, AFAIK; are you sure it wasn't a > > > 64-bit 82545EM card? Intel distinguishes their 32-bit 33/66 MHz PCI > > > PRO/1000 MT Desktop cards that use 82540EM from their 64-bit PCI-X > > > PRO/1000 MT Server cards that use the 82545EM (and have full gigabit > > > performance). > > > > > > -Jim > > > > > The driver used is the e1000 one; last src from intel... > We are on the way of a commercial issue to get "not on board" good gb NICs at low low cost... > Which one is the best ? (broadcom ? intel ? other ?) > I've check (by myself this time ;) the ID of the PCI card added : YOU ARE RIGHT it's 82545EM : our fault !!! good news ! > BUT, I've also re checked the number on the tyan motherboard and this this time it's really a 82540EM ! bad news ! 
> So the pb is still there : why on a tyan mb we get twice the perfs in comparaison with a dell mb ? (same os install, same bench, same network) > BTW we are going to get a card on the 64 bit PCI-X bus as the onbaord is not suitable for high performance usage. > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Matthew_Wygant at dell.com Fri Jul 25 10:52:36 2003 From: Matthew_Wygant at dell.com (Matthew_Wygant at dell.com) Date: Fri, 25 Jul 2003 09:52:36 -0500 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED Message-ID: <6CB36426C6B9D541A8B1D2022FEA7FC1BD64DE@ausx2kmpc108.aus.amer.dell.com> A good place to go for these Dell related things are the linux-poweredge at dell.com lists... Thanks. -----Original Message----- From: Jim Phillips [mailto:jim at ks.uiuc.edu] Sent: Friday, July 25, 2003 9:48 AM To: Stephane.Martin at imag.fr Cc: Matthew_Wygant at exchange.dell.com; beowulf at beowulf.org Subject: Re: Dell 1600SC + 82540EM poor performance..HELP NEEDED This message uses a character set that is not supported by the Internet Service. To view the original message content, open the attached message. If the text doesn't display correctly, save the attachment to disk, and then open it using a viewer that can display the original character set. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mathog at mendel.bio.caltech.edu Fri Jul 25 13:29:54 2003 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Fri, 25 Jul 2003 10:29:54 -0700 Subject: Top node hotter thanothers? Message-ID: We have a 20 x 2U rack and I've noticed that the top node is always a step hotter than the other nodes. Why? There is a slight gradient going up the rack (see below, 01 is on the bottom, 20 on the top) but it doesn't explain the jump at the top node. At first I thought it might be due to hot air moving from the back of the rack, over the top of the highest node, and being sucked in by it. However no temperature change resulted when all side vents were blocked and cardboard pasted up the front of the rack so that only the same cold air as the other nodes could enter. The only other difference between this node and the others is that there's hot air above 20 (two empty rack slots), but another node above all the others. So maybe all that hot air heats the top node's case and that couples the heat in? I don't have an insulating panel handy to test that hypothesis. node case cpu 01 +34?C +43?C 02 +35?C +44?C 03 +37?C +48?C 04 +42?C +50?C 05 +38?C +48?C 06 +37?C +50?C 07 +36?C +45?C 08 +38?C +48?C 09 +38?C +48?C 10 +38?C +48?C 11 +36?C +44?C 12 +38?C +48?C 13 +38?C +48?C 14 +40?C +49?C 15 +38?C +46?C 16 +36?C +46?C 17 +39?C +51?C 18 +39?C +48?C 19 +39?C +49?C 20 +44?C +54?C Temperatures were measured using "sensors" on these tyan S2466 motherboards (1 CPU on each currently.) The case value is the temperature reading by the diode under the socket of the absent 2nd CPU. The temperatures jump around a degree or two. 
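A quick way to pull the same readings off all 20 nodes at once is to loop over them and grep the lm_sensors output; a minimal sketch, assuming the nodes answer to node01..node20 over ssh and that "sensors" labels the readings "CPU Temp" and "Sys Temp" (the labels depend on your sensors.conf):

  for n in $(seq -w 1 20); do
    printf "node%s: " "$n"
    ssh node$n sensors | egrep 'CPU Temp|Sys Temp' | awk '{printf "%s ", $3}'
    echo
  done
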
Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From john152 at libero.it Fri Jul 25 13:17:20 2003 From: john152 at libero.it (john152 at libero.it) Date: Fri, 25 Jul 2003 19:17:20 +0200 Subject: Problems with 3Com card... Message-ID: Hi all, i'd like to use a 3Com905-TX instead of Realtek RTL-8139 i used before, but i have problems with mii-diag software in detecting the link status. With Realtek card all was Ok, infact i had: at start (cable connected): 18:54:36.592 Baseline value of MII BMSR (basic mode status register) is 782d. disconnecting the link: 18:55:01.632 MII BMSR now 7809: no link, NWay busy, No Jabber (0000). 18:55:01.637 Baseline value of MII BMSR basic mode status register) is 7809. connecting the link: 18:55:06.722 MII BMSR now 782d: Good link, NWay done, No Jabber (45e1). 18:55:06.728 Baseline value of MII BMSR (basic mode status register) is 782d. . . Now i have the following output lines with 3Com: at start (cable connected): 18:42:46.073 Baseline value of MII BMSR (basic mode status register) is 782d. disconnecting the link: 18:42:50.779 MII BMSR now 7829: no link, NWay done, No Jabber (0000). 18:49:38.524 Baseline value of MII BMSR (basic mode status register) is 7809. connecting the link: 18:52:15.887 MII BMSR now 7829: no link, NWay done, No Jabber (41e1). 18:52:15.895 Baseline value of MII BMSR (basic mode status register) is 782d. . . With 3Com, the Baseline value of MII BMSR is 782d with Link Good and 7809 with no Link (and it seems like the Realtek). When the function 'monitor_mii' starts, in the baseline_1 variable i see a correct value, instead in the following loop while (continue_monitor)..., there is new_1 variable that is always wrong: 7829. (Correctly the loop ends, but i have the output "no link" wrong!) new_1 is the return value of mdio_read(ioaddr, phy_id, 1) and should be the same values of baseline_1 (782d or 7809), shouldn' t it? Can you help me? Thanks in advance for your kind answers. Giovanni di Giacomo _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hunting at ix.netcom.com Fri Jul 25 14:03:17 2003 From: hunting at ix.netcom.com (Michael Huntingdon) Date: Fri, 25 Jul 2003 11:03:17 -0700 Subject: Top node hotter than others? In-Reply-To: Message-ID: <3.0.3.32.20030725110317.013f6b60@popd.ix.netcom.com> David Do the systems have anything similar to Insight Manager to indicate the rate of your fans? In a rack where space is tight and systems are running hot, a slight variance in the movement of air can be significant. Do the cabinets have fans overhead to draw the warm air out? Less expensive cabinets are not necessarily engineered to ensure consistent airflow under demanding conditions, typical with clusters like this. Are all 20 nodes purely compute or do you have head nodes somewhere in the mix? As clusters become larger and more dense there is a great deal of research going on in various labs, to ensure stability of temperatures not just within cabinets, but across entire computer rooms. "Hot Spots" are a growing issue. 
Have you dealt with any of the major manufactures specific to this or any other concerns as your research clusters grow? My Best Michael At 10:29 AM 7/25/2003 -0700, David Mathog wrote: >We have a 20 x 2U rack and I've noticed that the >top node is always a step hotter than the other nodes. > >Why? > >There is a slight gradient going up the rack (see >below, 01 is on the bottom, 20 on the top) but it >doesn't explain the jump at the top node. At first >I thought it might be due to hot air moving from >the back of the rack, over the top of the highest >node, and being sucked in by it. >However no temperature change resulted when all >side vents were blocked and cardboard pasted up >the front of the rack so that only the same cold >air as the other nodes could enter. The only other >difference between this node and the others is >that there's hot air above 20 (two empty rack slots), >but another node above all the others. So maybe all >that hot air heats the top node's case and that >couples the heat in? I don't have an insulating >panel handy to test that hypothesis. > >node case cpu >01 +34?C +43?C >02 +35?C +44?C >03 +37?C +48?C >04 +42?C +50?C >05 +38?C +48?C >06 +37?C +50?C >07 +36?C +45?C >08 +38?C +48?C >09 +38?C +48?C >10 +38?C +48?C >11 +36?C +44?C >12 +38?C +48?C >13 +38?C +48?C >14 +40?C +49?C >15 +38?C +46?C >16 +36?C +46?C >17 +39?C +51?C >18 +39?C +48?C >19 +39?C +49?C >20 +44?C +54?C > >Temperatures were measured using "sensors" on these >tyan S2466 motherboards (1 CPU on each currently.) >The case value is the temperature reading by the >diode under the socket of the absent 2nd CPU. >The temperatures jump around a degree or two. > >Regards, > >David Mathog >mathog at caltech.edu >Manager, Sequence Analysis Facility, Biology Division, Caltech >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Fri Jul 25 14:15:40 2003 From: rgb at phy.duke.edu (Robert G. Brown) Date: Fri, 25 Jul 2003 14:15:40 -0400 (EDT) Subject: Top node hotter thanothers? In-Reply-To: Message-ID: On Fri, 25 Jul 2003, David Mathog wrote: > We have a 20 x 2U rack and I've noticed that the > top node is always a step hotter than the other nodes. > > Why? > > There is a slight gradient going up the rack (see > below, 01 is on the bottom, 20 on the top) but it > doesn't explain the jump at the top node. At first > I thought it might be due to hot air moving from > the back of the rack, over the top of the highest > node, and being sucked in by it. > However no temperature change resulted when all > side vents were blocked and cardboard pasted up > the front of the rack so that only the same cold > air as the other nodes could enter. The only other > difference between this node and the others is > that there's hot air above 20 (two empty rack slots), > but another node above all the others. So maybe all > that hot air heats the top node's case and that > couples the heat in? I don't have an insulating > panel handy to test that hypothesis. What happens if the top node is turned off? Does the second from the top become the hot node? What happens when the top node is swapped with the bottom node? 
It could just be that the top node's CPU cooler fan has a piece of lint stuck on it and is running hotter, or even that its sensor itsn't calibrated right. It could be some sort of loopback of heated air as you describe, but if you put a small fan and set it to blow across the top node you should break up the circulation pattern if any such pattern exists. I don't have as much faith in cardboard used to block vents, since that can also heat up the node by impeding circulation. rgb > > node case cpu > 01 +34?C +43?C > 02 +35?C +44?C > 03 +37?C +48?C > 04 +42?C +50?C > 05 +38?C +48?C > 06 +37?C +50?C > 07 +36?C +45?C > 08 +38?C +48?C > 09 +38?C +48?C > 10 +38?C +48?C > 11 +36?C +44?C > 12 +38?C +48?C > 13 +38?C +48?C > 14 +40?C +49?C > 15 +38?C +46?C > 16 +36?C +46?C > 17 +39?C +51?C > 18 +39?C +48?C > 19 +39?C +49?C > 20 +44?C +54?C > > Temperatures were measured using "sensors" on these > tyan S2466 motherboards (1 CPU on each currently.) > The case value is the temperature reading by the > diode under the socket of the absent 2nd CPU. > The temperatures jump around a degree or two. > > Regards, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mas at ucla.edu Fri Jul 25 14:37:19 2003 From: mas at ucla.edu (Michael Stein) Date: Fri, 25 Jul 2003 11:37:19 -0700 Subject: Top node hotter thanothers? In-Reply-To: ; from mathog@mendel.bio.caltech.edu on Fri, Jul 25, 2003 at 10:29:54AM -0700 References: Message-ID: <20030725113719.A5315@mas1.ats.ucla.edu> > node case cpu > 01 +34?C +43?C > 02 +35?C +44?C > 03 +37?C +48?C > 04 +42?C +50?C > 05 +38?C +48?C > 06 +37?C +50?C > 07 +36?C +45?C > 08 +38?C +48?C > 09 +38?C +48?C > 10 +38?C +48?C > 11 +36?C +44?C > 12 +38?C +48?C > 13 +38?C +48?C > 14 +40?C +49?C > 15 +38?C +46?C > 16 +36?C +46?C > 17 +39?C +51?C > 18 +39?C +48?C > 19 +39?C +49?C > 20 +44?C +54?C It's not clear to me that there is an actual difference going toward the top. 04 is +42? Assuming the input air temperature is reasonably uniform over the machines, I'd guess that you're seeing a combination of different sensor calibration and different heat dissipation (or different fan capabilities). Ignoring sensor error, the hotter machines must have either higher power input or less air flow (assuming similar input air temperature). There is a tolerance on CPU (and other chips) heat/power usage -- some are bound to run hotter than others. Or check what's running on each machine. This can make a huge difference. I've seen output air on one machine go from 81 F to 99 F (27 C to 37 C) from unloaded to full load (dual Xeon, 2.4 Ghz, multiple burnP6+burnMMX). This was with 72 F input air (22 C). 
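A quick way to rule sensor calibration in or out is to load two of the suspect nodes identically and compare the rise rather than the absolute reading; a rough sketch, assuming the cpuburn package is installed and one CPU per board as above:

  # on each node under test
  burnP6 &              # worst-case CPU load
  watch -n 10 sensors   # watch the CPU/case temperatures climb
  # let it soak for 20-30 minutes, then: kill %1

If two nodes show the same rise but different baselines, suspect the sensors; if one climbs well past the other under the same load and airflow, suspect the heatsink, fan, or the part itself.
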
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mathog at mendel.bio.caltech.edu Fri Jul 25 14:45:21 2003 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Fri, 25 Jul 2003 11:45:21 -0700 Subject: Top node hotter than others? Message-ID: > Do the systems have anything similar to Insight Manager > to indicate the rate of your fans? "sensors" shows that. The CPU and two chassis fans in the various systems are within a few percent of each other. I can't read the power supply fans though. Example: node cpu Fan1 Fan2 19 4720 4425 4474 20 4720 4377 4377 > Do the cabinets have fans overhead to draw the warm air out? Not installed but there's a panel that comes off where one could be put in. When that panel is removed there's not much metal holding heat on the top of the system, but the top node only cooled off about 1 degree and no effect at all on the other nodes. There's a hole in the bottom of the case where cool air can go in. The front is currently completely open, and the back is open but it's about 8" from a wall. It's about 4 feet from the top of the top node to the acoustical tile, and there's a return vent only 4 feet away, off to one side. (Yes, I've thought about moving that return vent directly over the rack.) I think the hot air is rising, but not very fast, so that it lingers around the top of the rack no matter what. You are probably correct that a fan to pull it off faster would help. I'm beginning to think of the rack as a sort of poorly designed chimney - the kind that doesn't "pull" well and results in a smokey fireplace. > > Are all 20 nodes purely compute yes, the master node is across the room. > As clusters become larger and more dense there is a great deal of > research going on in various labs, to ensure stability of > temperatures not just within cabinets, but across entire > computer rooms. Racks should probably plug into chimneys - take all that heat and vent it straight out of the building. Heck of a lot cheaper than running A/C to cool it in place. We've got old fume hood ducts somewhere up above the acoustic ceiling that go straight to the roof, but the A/C guys didn't like my chimney idea much because apparently it would screw up airflow in the building. Plus a bit of negative pressure could suck the output from another lab's fume hood back into my area, which isn't an attractive prospect. > growing issue. Have you dealt with any of the majo > manufactures specific > to this or any other concerns as your research clusters grow? The cluster is big enough for now. Growth is pretty limited in any case by available power, A/C capacity, my tolerance for noise since I have to work in the same room, and of course, $$$. Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From enrico341 at hotmail.com Thu Jul 24 11:24:56 2003 From: enrico341 at hotmail.com (Eric Uren) Date: Thu, 24 Jul 2003 10:24:56 -0500 Subject: HELP! Message-ID: To whomever it may concern, I work at a company called AT systems. We recently aquired thirty SBC's. I was assigned to develop a way to link all of the boards together, and place them in a tower. 
We will then donate it to a local college, and use it as a tax write-off. The boards contain: P266 Mhz, 128 MB of RAM, 128 IDE, Compac Flash Drive, Ethernet and USB ports. I am stationed in the same building as our factory. We have a turret, so developing the tower, power supply, etc. is not a problem. My task is just to find out a way to use all these boards up. Any site, diagrams, or suggestions would be greatly appreciated. Thanks. Eric Uren AT Systems _________________________________________________________________ Add photos to your messages with MSN 8. Get 2 months FREE*. http://join.msn.com/?page=features/featuredemail _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From law at acm.org Fri Jul 25 16:44:02 2003 From: law at acm.org (lynn wilkins) Date: Fri, 25 Jul 2003 13:44:02 -0700 Subject: Hubs In-Reply-To: References: Message-ID: <0307251344020A.22708@maggie> Hi, Also, some switches use "store and forward" switching. Some don't. Is "store and forward" a "good thing" or should we avoid it? (Other things being equal, such as 100baseT, full duplex, etc.) -law On Thursday 24 July 2003 12:40, you wrote: > On Thu, 24 Jul 2003, Eric Uren wrote: > > To whomever it may concern, > > > > I am trying to link together 30 boards through Ethernet. What > > would be your recomendation for how many and what type of Hubs I should > > use to connect them all together. Any imput is appreciated. > > Any hint as to what you're going to be doing with the 30 boards? The > obvious choice is a cheap 48 port 10/100BT switch from any name-brand > vendor. However, there are circumstances where you'd want more > expensive switches, 1000BT switches, or a different network altogether. > > rgb _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From joelja at darkwing.uoregon.edu Fri Jul 25 17:58:15 2003 From: joelja at darkwing.uoregon.edu (Joel Jaeggli) Date: Fri, 25 Jul 2003 14:58:15 -0700 (PDT) Subject: Project Help In-Reply-To: Message-ID: varius ee or cs embeded computing projects would probably happily take them off your hands as is... joelja On Thu, 24 Jul 2003, Eric Uren wrote: > > > To whomever it may concern, > > I work at a company called AT systems. I was recently assigned > the task of using up thirty extra SBC's that we have. My boss told me that > he wants to link all of the SBC's together, and plop them in a tower, and > donate them to a college or university as a tax write-off. We have a factory > attached to our engineering department, which contains a turret, multiple > work stations, and so on. So getting a hold of a custom tower, power supply, > etc. is not a problem. I just need to create a way to use these thirty extra > board we have. All thirty of them contain: a P266 processor, 128 MB of RAM, > 128 IDE, Compac Flash Drive, and Ethernet and USB ports. Any diagrams, > sites, comments, or suggestions would be greatly appreciated. Thanks. 
> > Eric Uren > AT Systems > > _________________________________________________________________ > MSN 8 with e-mail virus protection service: 2 months FREE* > http://join.msn.com/?page=features/virus > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- -------------------------------------------------------------------------- Joel Jaeggli Academic User Services joelja at darkwing.uoregon.edu -- PGP Key Fingerprint: 1DE9 8FCA 51FB 4195 B42A 9C32 A30D 121E -- In Dr. Johnson's famous dictionary patriotism is defined as the last resort of the scoundrel. With all due respect to an enlightened but inferior lexicographer I beg to submit that it is the first. -- Ambrose Bierce, "The Devil's Dictionary" _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From seth at hogg.org Sat Jul 26 10:28:21 2003 From: seth at hogg.org (Simon Hogg) Date: Sat, 26 Jul 2003 15:28:21 +0100 Subject: UK only? Power Meters Message-ID: <4.3.2.7.2.20030726151139.00a86f00@pop.freeuk.net> Some of the list members may remember a recent discussion of the usefulness of power meters. I have just seen some for sale in Lidl[1] (of all places!) in the UK (with a UK 3-pin plug-through arrangement). They were UKP 6.99 (equivalent to about US$10) and had a little lcd display. Measurements performed were Current, Peak Current (poss. with High Current warning?), Power, Peak Power, total kWh and Power Factor. I have no details of performance, etc. (since I didn't buy one) but the price is certainly very attractive compared even the the much feted 'kill-a-watt'. If anyone wants one and can't find a Lidl you can contact me off-list, and I will get on my trusty bicycle down to the shops. -- Simon [1] www.lidl.com (www.lidl.de) German-based trans-European discount retailer. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From award at andorra.ad Sun Jul 27 04:03:14 2003 From: award at andorra.ad (Alan Ward) Date: Sun, 27 Jul 2003 10:03:14 +0200 Subject: Infiniband: cost-effective switchless configurations References: <200307251655.UAA08132@nocserv.free.net> Message-ID: <3F238742.1060408@andorra.ad> If I understand correctly, you need all-to-all connectivity? Do all the nodes need to access the whole data set, or only share part of the data set between a few nodes each time? I had a case where I wanted to share the whole data set between all nodes, using point-to-point Ethernet connections (no broadcast). I put them in a ring, so that with e.g. four nodes: A -----> B -----> C -----> D ^ | | | -------------------------- Node A sends its data, plus C's and D's to node B. Node B sends its data, plus D's and A's to node C. Node C sends its data, plus A's and B's to node D Node D sends its data, plus B's and C's to node A. Data that has done (N-1) hops is no longer forwarded. 
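The forwarding decision itself is tiny; a minimal sketch of the per-block bookkeeping in Java (the class and field names are invented for illustration, and all of the socket plumbing is left out):

  // Each block of data carries the id of the node that produced it
  // and the number of hops it has made around the ring so far.
  class Block {
      final int originId;
      final byte[] payload;
      int hops;                       // starts at 0 on the producing node

      Block(int originId, byte[] payload) {
          this.originId = originId;
          this.payload = payload;
      }
  }

  class RingNode {
      final int nNodes;               // ring size, e.g. 4 in the picture above

      RingNode(int nNodes) { this.nNodes = nNodes; }

      // Keep forwarding a block until it has visited the N-1 other nodes.
      boolean shouldForward(Block b) { return b.hops < nNodes - 1; }

      // Called for every incoming block.
      void onReceive(Block b) {
          b.hops++;                   // one more hop completed
          // ...hand the payload to the calculation code, then, if the
          // block is still live, queue it for the next node downstream...
      }

      public static void main(String[] args) {
          RingNode node = new RingNode(4);
          Block b = new Block(0, new byte[0]);
          for (int hop = 1; hop < 4; hop++) {
              b.hops = hop;
              System.out.println("hops=" + hop + " forward=" + node.shouldForward(b));
          }
      }
  }

With four nodes a block is forwarded after hops 1 and 2 and dropped after hop 3, which is exactly the (N-1)-hop rule above.
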
We used a single Java program with 3 threads on each node: - one to receive data and place it in a local array - one to forward finished data to the next node - one to perform calculations The main drawback is that you need a smart algorithm to determine which pieces of data are "new" and which are "used"; i.e. have been used for calculation and been forwarded to the next node, and can be chucked out to make space. Ours wasn't smart enough :-( Alan Ward En/na Mikhail Kuzminsky ha escrit: > It's possible to build 3-nodes switchless Infiniband-connected > cluster w/following topology (I assume one 2-ports Mellanox HCA card > per node): > > node2 -------IB------Central node-----IB-----node1 > ! ! > ! ! > ----------------------IB----------------------- > > It gives complete nodes connectivity and I assume to have > 3 separate subnets w/own subnet manager for each. But I think that > in the case if MPI broadcasting must use hardware multicasting, > MPI broadcast will not work from nodes 1,2 (is it right ?). > > OK. But may be it's possible also to build the following topology > (I assume 2 x 2-ports Mellanox HCAs per node, and it gives also > complete connectivity of nodes) ? : > > > node 2----IB-------- C e n t r a l n o d e -----IB------node1 > \ / \ / > \ / \ / > \ / \ / > \--node3 node4-- > > and I establish also additional IB links (2-1, 2-4, 3-1, 3-4, not > presenetd in the "picture") which gives me complete nodes connectivity. > Sorry, is it possible (I don't think about changes in device drivers)? > If yes, it's good way to build very small > and cost effective IB-based switchless clusters ! > > BTW, if I will use IPoIB service, is it possible to use netperf > and/or netpipe tools for measurements of TCP/IP performance ? > > Yours > Mikhail Kuzminsky > Zelinsky Institute of Organic Chemistry > Moscow > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jhearns at micromuse.com Mon Jul 28 18:04:53 2003 From: jhearns at micromuse.com (John Hearns) Date: 28 Jul 2003 23:04:53 +0100 Subject: UK power meters Message-ID: <1059429893.1415.5.camel@harwood> I bought two of the power meters from LIDL. The Clapham Junction branch has dozens. Seems to work fine! My mini-ITX system is running at 45 watts. -- John Hearns Micromuse _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From gary at lerhaupt.com Sat Jul 26 12:44:30 2003 From: gary at lerhaupt.com (Gary Lerhaupt) Date: 26 Jul 2003 11:44:30 -0500 Subject: Dell Linux mailing list Message-ID: <1059237870.6969.3.camel@localhost.localdomain> For ample amounts of help with your Dell / Linux equipment, please check out the Linux-Poweredge mailing list at http://lists.us.dell.com/mailman/listinfo/linux-poweredge. 
Gary _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jnellis at mtcrossroads.org Sun Jul 27 20:31:01 2003 From: jnellis at mtcrossroads.org (Joe Nellis) Date: Sun, 27 Jul 2003 17:31:01 -0700 Subject: Neighbor table overflow References: <200307251655.UAA08132@nocserv.free.net> <3F238742.1060408@andorra.ad> Message-ID: <001c01c3549f$93bbd680$8800a8c0@joe> Greetings, I am running scyld 27bz version. I recently started getting "neighbor table overflow" messages on the last boot stage on one of my nodes though nothing has changed. Can anyone explain this message. The node just hangs with this message repeating every 30 seconds or so. Sincerely, Joe. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From alvin at Mail.Linux-Consulting.com Mon Jul 28 18:33:59 2003 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Mon, 28 Jul 2003 15:33:59 -0700 (PDT) Subject: Dell Linux mailing list In-Reply-To: <1059237870.6969.3.camel@localhost.localdomain> Message-ID: hi ya i cant resist... On 26 Jul 2003, Gary Lerhaupt wrote: > For ample amounts of help with your Dell / Linux equipment, please check > out the Linux-Poweredge mailing list at > http://lists.us.dell.com/mailman/listinfo/linux-poweredge. if dell machines needs so much "help"... something else is wrong with the box ... and yes, i've been going around to fix/replace lots of broken dell boxes a good box works out of the crate ( outof the box ) and keeps working for years and years.. and keeps working even if you open the covers and fiddle with the insides c ya alvin _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From John.Hearns at micromuse.com Mon Jul 28 07:32:14 2003 From: John.Hearns at micromuse.com (John Hearns) Date: Mon, 28 Jul 2003 12:32:14 +0100 Subject: Power meters at LIDL Message-ID: <027901c354fb$e82d4030$8461cdc2@DREAD> Thanks to Simon Hogg. I have got some cheap cycling gear from LIDL, but I never thought of buying Beowulf bits from there! I have a couple nearby me, so if anyone else in the UK wants one I'll see if they are in stock and post one on if you provide name/address. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From angel at wolf.com Mon Jul 28 21:53:37 2003 From: angel at wolf.com (Angel Rivera) Date: Tue, 29 Jul 2003 01:53:37 GMT Subject: Dell Linux mailing list In-Reply-To: References: Message-ID: <20030729015337.23350.qmail@houston.wolf.com> Alvin Oga writes: > > a good box works out of the crate ( outof the box ) and keeps > working for years and years.. and keeps working even if you > open the covers and fiddle with the insides Sounds great on paper, but... When one buys hundreds of boxes at a whack, the major issue, besides the normal shipping ones, is going to be the firmware differences between the boxes which has a tendency to bite you that the most inopportune moment. Dell is no worse than some and a lot better than others. We drive a real production commercial cluster. 
I would NEVER open an in service production box. Messing up a production run results in serious money (and time) being lost.
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From alvin at Mail.Linux-Consulting.com Mon Jul 28 22:03:57 2003 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Mon, 28 Jul 2003 19:03:57 -0700 (PDT) Subject: Dell Linux mailing list In-Reply-To: <20030729015337.23350.qmail@houston.wolf.com> Message-ID: hi ya On Tue, 29 Jul 2003, Angel Rivera wrote: > > a good box works out of the crate ( outof the box ) and keeps > > working for years and years.. and keeps working even if you > > open the covers and fiddle with the insides > > Sounds great on paper, but... yup... and that is precisely why i dont use gateway, compaq, dell ... ( i wont be putting important data on those boxes ) i qa/qc my own boxes for production use ... and yes, never touch a box in production .. never ever .. no matter what well within reason ...if the production boxes are dying... fix it asap and methodically and documented and tested and qa'd and qc'd and foo-blessed c ya alvin > When one buys hundreds of boxes at a whack, the major issue, besides the > normal shipping ones, is going to be the firmware differences between the > boxes which has a tendency to bite you that the most inopportune moment. > Dell is no worse than some and a lot better than others. > > We drive a real production commercial cluster. I would NEVER open an in > service production box. Messing up a production run results in serious > money(and time)being lost.
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From mwheeler at startext.co.uk Tue Jul 29 05:57:29 2003 From: mwheeler at startext.co.uk (Martin WHEELER) Date: Tue, 29 Jul 2003 09:57:29 +0000 (UTC) Subject: Neighbor table overflow In-Reply-To: <001c01c3549f$93bbd680$8800a8c0@joe> Message-ID: On Sun, 27 Jul 2003, Joe Nellis wrote: > I am running scyld 27bz version. I recently started getting "neighbor table > overflow" messages on the last boot stage on one of my nodes though nothing > has changed. Can anyone explain this message. The node just hangs with > this message repeating every 30 seconds or so. Ah. The dreaded 'neighbour table overflow' message. I was plagued with this a couple of years ago. It usually means that your system is unable to resolve some of its component machines. But which? (In my case, usually localhost.) Check very carefully the contents of: * /etc/hosts * /etc/resolv.conf * /etc/network/interfaces Also check that you can ping every machine on the network. (Particularly localhost.) Then make sure that you have *explicitly* given correct addresses, netmasks, and gateway address in /etc/network/interfaces for both ethernet and local loopback connections. (see man interfaces for examples) What does ifconfig tell you? (You should see details of both ethernet and local loopback connections -- if not, you've got a problem.) If necessary, do an ifconfig lo 127.0.0.1 netmask 255.0.0.0 up to try to kick local loopback into life. (If it does, add the address and netmask info lines to the lo iface in your /etc/network/interfaces file.)
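For what it's worth, a minimal /etc/network/interfaces along those lines might look like the sketch below -- the 192.168.1.x addresses are only placeholders for illustration, so substitute your own subnet:

    # /etc/network/interfaces -- example only
    auto lo eth0

    iface lo inet loopback

    iface eth0 inet static
            address 192.168.1.2
            netmask 255.255.255.0
            gateway 192.168.1.1

Then do an ifdown -a && ifup -a (or reboot) and try pinging localhost and the head node again.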
HTH -- Martin Wheeler - StarTEXT / AVALONIX - Glastonbury - BA6 9PH - England mwheeler at startext.co.uk http://www.startext.co.uk/mwheeler/ GPG pub key : 01269BEB 6CAD BFFB DB11 653E B1B7 C62B AC93 0ED8 0126 9BEB - Share your knowledge. It's a way of achieving immortality. - _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From alvin at Mail.Linux-Consulting.com Tue Jul 29 18:41:23 2003 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Tue, 29 Jul 2003 15:41:23 -0700 (PDT) Subject: Dell Linux mailing list - testing In-Reply-To: <20030729021918.25594.qmail@houston.wolf.com> Message-ID: hi ya angel lets good ... i think i shall post a reply to the list .. On Tue, 29 Jul 2003, Angel Rivera wrote: > > i qa/qc my own boxes for production use ... ... > Normally, when we get our boxes, they have been burned in for at least 72 > hours by the vendor. yes... that's the "claim" ... if we say its been burnt in for 72 hrs... - they get a list of times and dates ... - i prefer to do infinite kernel compiles ( rm -rf /tmp/linux-2.x ; cp -par linux-2.x /tmp ; make bzImage ; date-stamp ) http://www.linux-1u.net/Diags/scripts/test.pl ( a dumb/simple/easy test that runs few standard operations ) > Then we beat them using our suit of programs for a > week. If there are any problems, the clock gets reset. yes... that is the trick .... to get a god set of test suites > Not always a very > popular way of doing things, but it keeps bad boxes to a very low roar. I keeping testing costs time down and "start testing process all over is key" testing and diags http://www.linux-1u.net/Diags/ and everybody has their own idea of what tests to do .. and "its considered tested" ... or the depth of the tests.. 1st tests should be visual .. - check the bios time stamps and version - check the batch levels of the pcb - check the manufacturer of the pcb and the chips on sdrams - blah ... dozens of things to inspect than the power up tests - run diags to read bios version numbers - run diags for various purposes - diagnostics and testing should be 100% automated including generating failure and warning notices - people tend to get lazy or go on vacation and most are not as meticulous about testing foo-stuff while the other guyz might care that bar-stuff works - testing is very very expensive ... - getting known good mb, cpu, mem, disk, fans ( repeatedly ) is the key ... - problem is some vendors discontinue their mb in 2 months so the whole testing clock start over again - in our case, its cheaper to find smaller distributors that have inventory of the previously tested known good mb that we like - if it aint broke... leave it alone .. if its doing its job :-) c ya alvin _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From angel at wolf.com Tue Jul 29 21:26:33 2003 From: angel at wolf.com (Angel Rivera) Date: Wed, 30 Jul 2003 01:26:33 GMT Subject: Dell Linux mailing list - testing In-Reply-To: References: Message-ID: <20030730012633.3897.qmail@houston.wolf.com> Alvin Oga writes: [snip] >> Then we beat them using our suit of programs for a >> week. If there are any problems, the clock gets reset. > > yes... that is the trick .... 
to get a god set of test suites We have set of jobs we call beater jobs that beat memory, cpu, drives, nfs etc. We have monitoring programs so we are always getting stats and when something goes wrong they notify us. We had a situation where a rack of angstroms (64 nodes 128 AMD procs and that means hot!) were all under testing. The heat blasting out the rear wa hot enough to triggered an alarm in the server room so they had to come take a look. > > > testing and diags > http://www.linux-1u.net/Diags/ > > and everybody has their own idea of what tests to do .. and "its > considered tested" ... or the depth of the tests.. > > 1st tests should be visual .. > - check the bios time stamps and version > - check the batch levels of the pcb > - check the manufacturer of the pcb and the chips on sdrams > - blah ... dozens of things to inspect > than the power up tests > - run diags to read bios version numbers > - run diags for various purposes This is really important when you get a demo box to test on for a month or so. The time between you getting that box and your order starts landing on the loading dock means there have been a lot of changes if you have a good vendor. We test and test before they go into production-cause once we turn them over we have a heck of time getting them off-line for anything less than a total failure. > > - diagnostics and testing should be 100% automated including > generating failure and warning notices > - people tend to get lazy or go on vacation > and most are not as meticulous about testing foo-stuff > while the other guyz might care that bar-stuff works > > - testing is very very expensive ... > - getting known good mb, cpu, mem, disk, fans > ( repeatedly ) is the key ... > > - problem is some vendors discontinue their mb in 2 months > so the whole testing clock start over again > > - in our case, its cheaper to find smaller distributors > that have inventory of the previously tested known good mb > that we like Ah, the voice of experience. We are very loathe to take a shortcut. Sometimes it is very hard. When we bought those 28TB of storage, the first thing we heard was that we can test it in production. Had we done that, we may have lost data-we lost a box. > > - if it aint broke... leave it alone .. if its doing its job :-) *LOL* Once it is live our entire time is spent not messing anything up. And that can be very hard w/ those angstroms where you have two computers in a 1U form factor and one goes doen. :) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From alvin at Mail.Linux-Consulting.com Tue Jul 29 21:52:43 2003 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Tue, 29 Jul 2003 18:52:43 -0700 (PDT) Subject: Dell Linux mailing list - testing In-Reply-To: <20030730012633.3897.qmail@houston.wolf.com> Message-ID: hi ya angel On Wed, 30 Jul 2003, Angel Rivera wrote: > We have set of jobs we call beater jobs that beat memory, cpu, drives, nfs > etc. We have monitoring programs so we are always getting stats and when > something goes wrong they notify us. yup... and hopefull there is say 90- 95% probability that the "notice of failure" as in fact correct ... :-) - i know people that ignore those pagers/emails becuase the notices are NOT real .. :-0 - i ignore some notices too ... 
its now treated as a "thats nice, that server is still alive" notices > We had a situation where a rack of angstroms (64 nodes 128 AMD procs and > that means hot!) were all under testing. The heat blasting out the rear wa > hot enough to triggered an alarm in the server room so they had to come take > a look. yes.. amd gets hot ... and ii think angstroms has that funky indented power supply and cpu fans on the side where the cpu and ps is fighting each other for the 4"x 4"x 1.75" air space .. pretty silly .. :-) > > testing and diags > > http://www.linux-1u.net/Diags/ > > > > and everybody has their own idea of what tests to do .. and "its > > considered tested" ... or the depth of the tests.. ... > This is really important when you get a demo box to test on for a month or > so. i like to treat all boxes as if it was never tested/seen before ... assuming time/budget allows for it .. > them over we have a heck of time getting them off-line for anything less > than a total failure. if something went bad... that was a bad choice for that system/parts ?? > > - testing is very very expensive ... .. > Ah, the voice of experience. We are very loathe to take a shortcut. short cuts have never paid off in the long run .. you usually wind up doing the same task 3x-5x instead of doing it once correctly ( take apart the old system, build new one, test new one ( and now we're back to the start ... and thats ignoring ( all the tests and changes before giving up on the old ( shortcut system > Sometimes it is very hard. When we bought those 28TB of storage, the first > thing we heard was that we can test it in production. Had we done that, we > may have lost data-we lost a box. i assume you have at least 3 identical 28TB storage mechanisms.. otherwise, old age tells me one day, 28TB will be lost.. no matter how good your raid and backup is - nobody takes time to build/tests the backup system from bare metal ... and confirm the new system is identical to the supposed/simulated crashed box including all data being processed during the "backup-restore" test period > > > > - if it aint broke... leave it alone .. if its doing its job :-) > > *LOL* Once it is live our entire time is spent not messing anything up. And > that can be very hard w/ those angstroms where you have two computers in a > 1U form factor and one goes doen. :) you have those boxes that have 2 systems that depend on eachother ?? - ie ..turn off 1 power supply and both systems go down ??? ( geez.. that $80 power supply shortcut is a bad mistake ( if the number of nodes is important - lots of ways to get 4 independent systems into one 1U shelf and with mini-itx, you can fit 8-16 independent 3GHz machines into one 1U shelf - that'd be a fun system to design/build/ship ... ( about 200-400 independent p4-3G cpu in one rack ) - i think mini-itx might very well take over the expensive blade market asumming certain "pull-n-replace" options in blade is not too important in mini-itx ( when you have 200-400 nodes anyway in a rack ) have fun alvin _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From gary at lerhaupt.com Mon Jul 28 18:50:01 2003 From: gary at lerhaupt.com (gary at lerhaupt.com) Date: Mon, 28 Jul 2003 17:50:01 -0500 Subject: Dell Linux mailing list Message-ID: <1059432601.3f25a899b1c9a@www.webmail.westhost.com> I agree and I think most of the stuff does work out of the box. 
However its at least comforting to know that if it doesn't or if it later develops problems, that list will get you exactly what you need to solve the problem. I happened to see people with problems here and wanted to make sure they knew of this great resource. Quoting Alvin Oga : > > hi ya > > i cant resist... > > On 26 Jul 2003, Gary Lerhaupt wrote: > > > For ample amounts of help with your Dell / Linux equipment, please check > > out the Linux-Poweredge mailing list at > > http://lists.us.dell.com/mailman/listinfo/linux-poweredge. > > if dell machines needs so much "help"... something else is > wrong with the box ... > > and yes, i've been going around to fix/replace lots of broken dell boxes > > a good box works out of the crate ( outof the box ) and keeps > working for years and years.. and keeps working even if you > open the covers and fiddle with the insides > > c ya > alvin > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Daniel.Kidger at quadrics.com Tue Jul 29 07:12:45 2003 From: Daniel.Kidger at quadrics.com (Daniel Kidger) Date: Tue, 29 Jul 2003 12:12:45 +0100 Subject: Power meters at LIDL Message-ID: <010C86D15E4D1247B9A5DD312B7F5AA78DE049@stegosaurus.bristol.quadrics.com> Thanks for the info Simon. I too went out and bought one from our local LIDL in Fishponds,Bristol. They has plenty in stock. Manufactured specially for LIDL by EMC see: http://www.lidl.co.uk/gb/index.nsf/pages/c.o.oow.20030724.p.Energy_Monitor One interesting extra feature this device has is that as well as the instantaneous power reading(W) and energy over time (KWh), it will also display the maximum power consumption(W) and the time/date it occured. This should be useful for those of us who want to stress test nodes to get a maximum power figure. Yours, Daniel. -------------------------------------------------------------- Dr. Dan Kidger, Quadrics Ltd. daniel.kidger at quadrics.com One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 ----------------------- www.quadrics.com -------------------- -----Original Message----- From: John Hearns [mailto:John.Hearns at micromuse.com] Sent: 28 July 2003 12:32 To: beowulf at beowulf.org Subject: Power meters at LIDL Thanks to Simon Hogg. I have got some cheap cycling gear from LIDL, but I never thought of buying Beowulf bits from there! I have a couple nearby me, so if anyone else in the UK wants one I'll see if they are in stock and post one on if you provide name/address. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jd89313 at hotmail.com Tue Jul 29 12:37:37 2003 From: jd89313 at hotmail.com (Jack Douglas) Date: Tue, 29 Jul 2003 16:37:37 +0000 Subject: Cisco switches for lam mpi Message-ID: Hi I wonder if someone can help me We have just installed a 32 Node Dual Xeon Cluster, with a Cisco Cataslyst 4003 Chassis with 48 1000Base-t ports. 
We are running LAM MPI over gigabit, but we seem to be experiencing bottlenecks within the switch. Typically, using the cisco, we only see CPU utilisation of around 30-40%. However, we experimented with a Foundry Switch, and were seeing cpu utilisation on the same job of around 80 - 90%. We know that there are commands to "open" the cisco, but the ones we have been advised don't seem to do the trick. Was the cisco a bad idea? If so, can someone recommend a good Gigabit switch for MPI? I have heard HP Procurves are supposed to be pretty good. Or does anyone know any other commands that will open the Cisco switch further, getting the performance up? Best Regards JD _________________________________________________________________ On the move? Get Hotmail on your mobile phone http://www.msn.co.uk/msnmobile
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From angel at wolf.com Wed Jul 30 08:33:14 2003 From: angel at wolf.com (Angel Rivera) Date: Wed, 30 Jul 2003 12:33:14 GMT Subject: Testing (Was: Re: Dell Linux mailing list - testing) In-Reply-To: References: Message-ID: <20030730123314.18107.qmail@houston.wolf.com> Alvin Oga writes: > > hi ya angel > > On Wed, 30 Jul 2003, Angel Rivera wrote: > >> We have set of jobs we call beater jobs that beat memory, cpu, drives, >> nfs etc. We have monitoring programs so we are always getting stats and >> when something goes wrong they notify us. > > yup... and hopefull there is say 90- 95% probability that the "notice of > failure" as in fact correct ... :-) > - i know people that ignore those pagers/emails becuase the > notices are NOT real .. :-0 We have very high confidence our emails and pages are real. Our problem is information overload. We need to work on a methodology to make sure the important ones are not lost in the forest of messages. > - i ignore some notices too ... its now treated as a "thats nice, > that server is still alive" notices I try and at least scan them. We are making changes to help us gain situational awareness without having to spend all our time hunched over the monitors. > >> We had a situation where a rack of angstroms (64 nodes 128 AMD procs and >> that means hot!) were all under testing. The heat blasting out the rear was hot enough to triggered an alarm in the server room so they had to come >> take a look. > > yes.. amd gets hot ... > > and ii think angstroms has that funky indented power supply and cpu > fans on the side where the cpu and ps is fighting each other for the > 4"x 4"x 1.75" air space .. pretty silly .. :-) each node has its own power supply. When everything is running right it's the bomb. When not, then you have to take down two nodes to work on one. Or, until you get used to how it is built, you have to be very careful that the reset button you hit is for the right node and not its neighbor. :) >> This is really important when you get a demo box to test on for a month >> or so. > > i like to treat all boxes as if it was never tested/seen before ... > assuming time/budget allows for it Before a purchase, we look at the top 2-3 choices and start testing them to see how fast and how we can tweak them. One of the problems is that between that time and the order coming in the door there can be enough changes that your build changes do not work properly. > i assume you have at least 3 identical 28TB storage mechanisms..
> otherwise, old age tells me one day, 28TB will be lost.. no matter > how good your raid and backup is > - nobody takes time to build/tests the backup system from > bare metal ... and confirm the new system is identical to the > supposed/simulated crashed box including all data being processed > during the "backup-restore" test period They are 10 - 2.8 (dual 1.4 3ware 7500 cards in a 6-1-1 configuration.) The vendor is right down the street. We keep on-site spares ready to do so we always have a hot spare on each card. We don't back up very much from the cluster. just two of the management nodes that keep our stats. It would be impossible to backup that much data in a timely manner. > you have those boxes that have 2 systems that depend on eachother ?? > - ie ..turn off 1 power supply and both systems go down ??? > > ( geez.. that $80 power supply shortcut is a bad mistake > ( if the number of nodes is important > > - lots of ways to get 4 independent systems into one 1U shelf > and with mini-itx, you can fit 8-16 independent 3GHz machines > into one 1U shelf > - that'd be a fun system to design/build/ship ... > ( about 200-400 independent p4-3G cpu in one rack ) > > - i think mini-itx might very well take over the expensive blade > market asumming certain "pull-n-replace" options in blade > is not too important in mini-itx ( when you have 200-400 nodes > anyway in a rack ) No they are two standalone boxes in a 1U with different everything. That means it is very compact in the back and power and reset buttons close together in the front-so you have to pay attention. But they rock as compute nodes. We are now going to explore blades now. Anyone have recommendations? _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From alvin at Mail.Linux-Consulting.com Wed Jul 30 08:46:41 2003 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Wed, 30 Jul 2003 05:46:41 -0700 (PDT) Subject: Testing - blades In-Reply-To: <20030730123314.18107.qmail@houston.wolf.com> Message-ID: hi ya angel On Wed, 30 Jul 2003, Angel Rivera wrote: > each node has it's own power supply. When everything is running right it's > the bomb. When not, then you have to take down two nodes to work on one. Or, thats the problem... take 2 down to fix 1... not good > They are 10 - 2.8 (dual 1.4 3ware 7500 cards in a 6-1-1 configuration.) The > vendor is right down the street. We keep on-site spares ready to do so we > always have a hot spare on each card. if you're near 3ware in sunnyvale, than i drive by you daily .. :-) > > - i think mini-itx might very well take over the expensive blade > > market asumming certain "pull-n-replace" options in blade > > is not too important in mini-itx ( when you have 200-400 nodes > > anyway in a rack ) > > No they are two standalone boxes in a 1U with different everything. That > means it is very compact in the back and power and reset buttons close > together in the front-so you have to pay attention. But they rock as compute > nodes. we do custom 1U boxes ... anything that is reasonable is done .. :-) > We are now going to explore blades now. Anyone have recommendations? blades.. http://www.linux-1u.net/1U_Others - towards the bottom of the page.. 
up about 2-3 sections c ya alvin _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Robin.Laing at drdc-rddc.gc.ca Wed Jul 30 11:09:47 2003 From: Robin.Laing at drdc-rddc.gc.ca (Robin Laing) Date: Wed, 30 Jul 2003 09:09:47 -0600 Subject: Interesting read - Canada's fastest computer... Message-ID: <3F27DFBB.9090103@drdc-rddc.gc.ca> Here is a link about Canada's fastest cluster. There is a link off of the "McKenzie's" home page that explains how they worked out some of the latency problems using low cost gig switches. A complete description of hardware is also included. http://www.newsandevents.utoronto.ca/bin5/030721a.asp The graphics of galaxy collisions are interesting as well. -- Robin Laing Instrumentation Technologist Voice: 1.403.544.4762 Military Engineering Section FAX: 1.403.544.4704 Defence R&D Canada - Suffield Email: Robin.Laing at DRDC-RDDC.gc.ca PO Box 4000, Station Main WWW:http://www.suffield.drdc-rddc.gc.ca Medicine Hat, AB, T1A 8K6 Canada _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From john152 at libero.it Wed Jul 30 15:01:14 2003 From: john152 at libero.it (john152 at libero.it) Date: Wed, 30 Jul 2003 21:01:14 +0200 Subject: Bug with 3com card? Message-ID: Hi all, i have problems with mii-diag software in detecting the link status ( -w option ). I'm using a 3Com905-TX card instead of Realtek RTL-8139 i used before. With Realtek card all was Ok, infact with mii-diag i had the following output: - at start (cable connected): 18:54:36.592 Baseline value of MII BMSR (basic mode status register) is 782d. - disconnecting the link: 18:55:01.632 MII BMSR now 7809: no link, NWay busy, No Jabber (0000). 18:55:01.637 Baseline value of MII BMSR basic mode status register) is 7809. - connecting again the link: 18:55:06.722 MII BMSR now 782d: Good link, NWay done, No Jabber (45e1). 18:55:06.728 Baseline value of MII BMSR (basic mode status register) is 782d. . . Now i have the following output lines with 3Com: - at start (cable connected): 18:42:46.073 Baseline value of MII BMSR (basic mode status register) is 782d. - disconnecting the link: 18:42:50.779 MII BMSR now 7829: no link, NWay done, No Jabber (0000). 18:49:38.524 Baseline value of MII BMSR (basic mode status register) is 7809. - connecting again the link: 18:52:15.887 MII BMSR now 7829: no link, NWay done, No Jabber (41e1). 18:52:15.895 Baseline value of MII BMSR (basic mode status register) is 782d. . . The Baseline value of MII BMSR is correct with each card, but i think there is an incorrect return value when written "...MII BMSR now 7829..." (monitor_mii function). I think that correct values of this new value are 782d or 7809, aren't they? Could it be a bug in the software or more simply this card is not supported? It seems that the function mdio_read(ioaddr, phy_id, 1) can return two different values even if the link status is the same! Infact at the status change, i see two outputs coming from the same call "mdio_read(ioaddr, phy_id, 1)" : a first output is 7829 ( i don't understand the why) and the second output is 782d or 7809 and it seems correct. Thanks in advance for your kind answers and observations. 
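P.S. For reference, if the standard MII BMSR bit assignments apply to this PHY (I am assuming they do), the three values differ only in two bits:

    0x782d = link up,   autonegotiation complete
    0x7829 = link down, autonegotiation complete
    0x7809 = link down, autonegotiation not yet complete

    bit 2 (0x0004) = link status, latched low
    bit 5 (0x0020) = autonegotiation complete

Because the link-status bit is latched low, the first read after a link change can still report the old (down) state and only the following read shows the current one, so two consecutive mdio_read(ioaddr, phy_id, 1) calls returning 7829 and then 782d could simply be that latching behaviour of the 3Com PHY rather than a bug in the software.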
Giovanni di Giacomo _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bill at math.ucdavis.edu Wed Jul 30 15:06:05 2003 From: bill at math.ucdavis.edu (Bill Broadley) Date: Wed, 30 Jul 2003 12:06:05 -0700 Subject: Interesting read - Canada's fastest computer... In-Reply-To: <3F27DFBB.9090103@drdc-rddc.gc.ca> References: <3F27DFBB.9090103@drdc-rddc.gc.ca> Message-ID: <20030730190605.GA2640@sphere.math.ucdavis.edu> On Wed, Jul 30, 2003 at 09:09:47AM -0600, Robin Laing wrote: > Here is a link about Canada's fastest cluster. There is a link off of > the "McKenzie's" home page that explains how they worked out some of > the latency problems using low cost gig switches. A complete > description of hardware is also included. > > http://www.newsandevents.utoronto.ca/bin5/030721a.asp > > The graphics of galaxy collisions are interesting as well. Anyone have any idea what range of latencies and bandwidths are observed on that machine (as visible to MPI)? -- Bill Broadley Mathematics UC Davis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From douglas at shore.net Wed Jul 30 16:44:58 2003 From: douglas at shore.net (Douglas O'Flaherty) Date: Wed, 30 Jul 2003 16:44:58 -0400 Subject: Cisco switches for lam mpi Message-ID: <3F282E4A.30301@shore.net> From: "Jack Douglas" > To: beowulf at beowulf.org Subject: Cisco switches for lam mpi Date: Tue, 29 Jul 2003 16:37:37 +0000 Hi I wonder if someone can help me We have just installed a 32 Node Dual Xeon Cluster, with a Cisco Cataslyst 4003 Chassis with 48 1000Base-t ports. We are running LAM MPI over gigabit, but we seem to be experiencing bottlenecks within the switch Typically, using the cisco, we only see CPU utilisation of around 30-40% Howver, we experimented with a Foundry Switch, and were seeing cpu utilisation on the same job of around 80 - 90%. We know that there are commands to "open" the cisco, but the ones we have been advised dont seem to do the trick. Was the cisco a bad idea? If so can someone recommend a good Gigabit switch for MPI? I have heard HP Procurves are supposed to be pretty good. Or does anyone know any other commands that will open the Cisco switch further getting the performance up Best Regards JD ============== Jack: Have you run Pallas' MPI benchmarks (http://www.pallas.com/e/products/pmb/) to quantify the differences between the two switches? The dramatic difference in system performance suggests you have something going wrong there. You should test under no load and under load. The difference may be illuminating. I'd start with an assumption you may have something wrong on the Cisco. And I'd call whomever you bought it form to come show otherwise. Make certain you check your counters on the switch (and a few systems) to see if you have collisions, overruns or any other issues. As noted on this list before, the Cisco's can have pathological problems with auto-negotiation. You should be certain to set the ports to Full Duplex to get the speed up. With GigE, Jumbo Frames increases performance by a bit. Depending on your set up, I'd also turn off spanning tree, eliminate any ACLs, SNMP counters etc. which may be on the switch and contributing to load. 
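On the Linux side of things, a quick loop such as the one below will show whether every port really negotiated 1000/Full and whether the error counters are climbing (the node names and the count of 32 are only an example, and it assumes ethtool is installed on the nodes):

    for n in $(seq -w 1 32); do
        echo "=== node$n ==="
        ssh node$n "ethtool eth0 | grep -E 'Speed|Duplex'; \
                    ifconfig eth0 | grep -E 'errors|dropped|collisions'"
    done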
Worst case would be being backplane constrained - you have 32 GigE nodes. The Supervisor Engine in the Cisco is listed as a 24-Gbps forwarding engine (18 million packets/sec) at peak. The Foundry NetIron 400 & 800 backplane is 32Gbps + and they say 90mpps peak. Notice the math to convert between packets and backplane speed doesn't work. My experience is that the Foundry is always faster and has lower latency. I have little experience with the HP pro curve switches. I've used them in data closets where backplane speed is not an issue. They've been reliable, but I've never considered them for a high speed network core. doug _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From tod at gust.sr.unh.edu Wed Jul 30 17:56:16 2003 From: tod at gust.sr.unh.edu (Tod Hagan) Date: 30 Jul 2003 17:56:16 -0400 Subject: Interesting read - Canada's fastest computer... In-Reply-To: <20030730190605.GA2640@sphere.math.ucdavis.edu> References: <3F27DFBB.9090103@drdc-rddc.gc.ca> <20030730190605.GA2640@sphere.math.ucdavis.edu> Message-ID: <1059602177.17090.81.camel@haze.sr.unh.edu> On Wed, 2003-07-30 at 15:06, Bill Broadley wrote: > Anyone have any idea what range of latencies and bandwidths are > observed on that machine (as visible to MPI)? There's a plot of the bandwidth tests they ran at the bottom of the Mckenzie Networking HOWTO: http://www.cita.utoronto.ca/webpages/mckenzie/tech/networking/index.html No latency info, though. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at keyresearch.com Wed Jul 30 18:20:54 2003 From: lindahl at keyresearch.com (Greg Lindahl) Date: Wed, 30 Jul 2003 15:20:54 -0700 Subject: Interesting read - Canada's fastest computer... In-Reply-To: <20030730190605.GA2640@sphere.math.ucdavis.edu> References: <3F27DFBB.9090103@drdc-rddc.gc.ca> <20030730190605.GA2640@sphere.math.ucdavis.edu> Message-ID: <20030730222054.GA2266@greglaptop.internal.keyresearch.com> On Wed, Jul 30, 2003 at 12:06:05PM -0700, Bill Broadley wrote: > Anyone have any idea what range of latencies and bandwidths are > observed on that machine (as visible to MPI)? A bisection bandwidth histrogram is at the bottom of: http://www.cita.utoronto.ca/webpages/mckenzie/tech/networking/index.html You can tell these guys are physicists: they didn't just print the average. I'd guess latency in the cube network isn't very good, because they're using Linux to forward packets. Given that, it's impressive how good the bisection bandwidth is. Eventually the price of 10gig trunking is going to fall to the point where it's better than this kind of setup... until the wheel of reincarnation turns again, and we're using 10 gig links to the nodes. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hahn at physics.mcmaster.ca Wed Jul 30 19:25:27 2003 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed, 30 Jul 2003 19:25:27 -0400 (EDT) Subject: Interesting read - Canada's fastest computer... 
In-Reply-To: <20030730190605.GA2640@sphere.math.ucdavis.edu> Message-ID: > Anyone have any idea what range of latencies and bandwidths are > observed on that machine (as visible to MPI)? see the bottom of http://www.cita.utoronto.ca/webpages/mckenzie/ the machine is build for very latency-tolerant aggregate-bandwidth-intensive codes. you can see from the histograms that their topology does a pretty good job of producing fast links, but the 40-ish MB/s is going to be significantly affected by other traffic on the machine. I guess the amount of interference would depend largely on how efficient is the kernel's routing code. for instance, is routing zero-copy? I believe these are all Intel 7500CW boards, so their NICs probably have checksum-offloading (or is that only done at endpoints?) latency is not going to be great, if you're thinking in terms of myrinet or even flat 1000bT nets, since most routes will wind up going through a small number of nodes. it would be very interesting to see similar histograms of latency or even just hop-count. if I understand the topology correctly, you ascend into the express-cube for 7/8ths of all possible random routes, and the weighted average of CDCC hops is 0*(1/8)+1*(4/8)+2*(3/8)=1.25 hops. without diagonals, the avg would be 1:3:3:1=1.5 hops, which isn't all that much worse. but I think bisection cuts 8 4x1000bT links: 4 GB/s; without express links, bisection would be half as much! I think I'm missing something about the eth1 (point-to-point) links... _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From award at andorra.ad Thu Jul 31 02:48:51 2003 From: award at andorra.ad (Alan Ward) Date: Thu, 31 Jul 2003 08:48:51 +0200 Subject: small home cluster Message-ID: <3F28BBD3.4040104@andorra.ad> Dear list-people, I just put the pictures of my home "civilized" cluster on the web: http://www.geocities.com/ward_a2003/ This is more play than work, as you can see from the Geocities address. Best regards, Alan Ward _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From tkonto at aegean.gr Thu Jul 31 11:04:00 2003 From: tkonto at aegean.gr (Kontogiannis Theophanis) Date: Thu, 31 Jul 2003 18:04:00 +0300 Subject: TEST --- IGNORE --- TEST -- IGNORE Message-ID: _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From fboudra at uxp.fr Thu Jul 31 11:04:48 2003 From: fboudra at uxp.fr (Fathi BOUDRA) Date: Thu, 31 Jul 2003 17:04:48 +0200 Subject: 82551ER eeprom Message-ID: <200307311704.48984.fboudra@uxp.fr> Hi, i try to program the 82551ER eeprom. When i receive the eeprom, his contents was : eepro100-diag -#2 -aaeem eepro100-diag.c:v2.12 4/15/2003 Donald Becker (becker at scyld.com) http://www.scyld.com/diag/index.html Index #2: Found a Intel 82559ER EtherExpressPro/100+ adapter at 0xe400. i82557 chip registers at 0xe400: 00000000 00000000 00000000 00080002 10000000 00000000 No interrupt sources are pending. The transmit unit state is 'Idle'. The receive unit state is 'Idle'. This status is unusual for an activated interface. 
EEPROM contents, size 64x16: 00: ffff ffff ffff ffff ffff ffff ffff ffff ________________ 0x08: ffff ffff fffd ffff ffff ffff ffff ffff ________________ 0x10: ffff ffff ffff ffff ffff ffff ffff ffff ________________ 0x18: ffff ffff ffff ffff ffff ffff ffff ffff ________________ 0x20: ffff ffff ffff ffff ffff ffff ffff ffff ________________ 0x28: ffff ffff ffff ffff ffff ffff ffff ffff ________________ 0x30: ffff ffff ffff ffff ffff ffff ffff ffff ________________ 0x38: ffff ffff ffff ffff ffff ffff ffff bafb ________________ The EEPROM checksum is correct. Intel EtherExpress Pro 10/100 EEPROM contents: Station address FF:FF:FF:FF:FF:FF. Board assembly ffffff-255, Physical connectors present: RJ45 BNC AUI MII Primary interface chip i82555 PHY #-1. Secondary interface chip i82555, PHY -1. I used the -H, -G parameters and changed the eeprom_id, subsystem_id and subsystem_vendor : eepro100-diag -#1 -aaeem eepro100-diag.c:v2.12 4/15/2003 Donald Becker (becker at scyld.com) http://www.scyld.com/diag/index.html Index #1: Found a Intel 82559ER EtherExpressPro/100+ adapter at 0xe800. i82557 chip registers at 0xe800: 00000000 00000000 00000000 00080002 10000000 00000000 No interrupt sources are pending. The transmit unit state is 'Idle'. The receive unit state is 'Idle'. This status is unusual for an activated interface. EEPROM contents, size 64x16: 00: 1100 3322 5544 0000 0000 0101 4401 0000 __"3DU_______D__ 0x08: 0000 0000 4000 1209 8086 0000 0000 0000 _____ at __________ ... 0x38: 0000 0000 0000 0000 0000 0000 0000 09c3 ________________ The EEPROM checksum is correct. Intel EtherExpress Pro 10/100 EEPROM contents: Station address 00:11:22:33:44:55. Receiver lock-up bug exists. (The driver work-around *is* implemented.) Board assembly 000000-000, Physical connectors present: RJ45 Primary interface chip DP83840 PHY #1. Transceiver-specific setup is required for the DP83840 transceiver. Primary transceiver is MII PHY #1. MII PHY #1 transceiver registers: 3000 7829 02a8 0154 05e1 45e1 0003 0000 0000 0000 0000 0000 0000 0000 0000 0000 0203 0000 0001 035e 0000 0003 0b74 0003 0000 0000 0000 0000 0010 0000 0000 0000. Basic mode control register 0x3000: Auto-negotiation enabled. Basic mode status register 0x7829 ... 782d. Link status: previously broken, but now reestablished. Capable of 100baseTx-FD 100baseTx 10baseT-FD 10baseT. Able to perform Auto-negotiation, negotiation complete. Vendor ID is 00:aa:00:--:--:--, model 21 rev. 4. No specific information is known about this transceiver type. I'm advertising 05e1: Flow-control 100baseTx-FD 100baseTx 10baseT-FD 10baseT Advertising no additional info pages. IEEE 802.3 CSMA/CD protocol. Link partner capability is 45e1: Flow-control 100baseTx-FD 100baseTx 10baseT-FD 10baseT. Negotiation completed. All these things doesn't work. I read the "online" 82551er datasheet but it doesn't help me (they explain only the words 00h to 02h and 0Ah to 0Ch). Someone know what i need to do or have a working 82551er eeprom ? thanks fbo _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rouds at servihoo.com Thu Jul 31 11:53:54 2003 From: rouds at servihoo.com (RoUdY) Date: Thu, 31 Jul 2003 19:53:54 +0400 Subject: NFS problem In-Reply-To: <200307301906.h6UJ6tw26647@NewBlue.Scyld.com> Message-ID: Hello dear friends, I am doing my beowulf cluster and I have a small problem when I test the NFS. 
the command I used was : " mount -t nfs node1:/home /home nfs " (where node1 is my master node) Well the output that I obtain is " RPC : Remote system error connection refused RPC not registered " But when I am on NOde2 and I ping to the master node that is node1 it's ok.. hope to hear from u very soon for HELP bye Roudy -------------------------------------------------- Get your free email address from Servihoo.com! http://www.servihoo.com The Portal of Mauritius
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From bropers at lsu.edu Thu Jul 31 12:35:09 2003 From: bropers at lsu.edu (Brian D. Ropers-Huilman) Date: Thu, 31 Jul 2003 11:35:09 -0500 (CDT) Subject: NFS problem In-Reply-To: References: Message-ID: Roudy, Do you have portmapper running on node1? Do you have nfsd running on node1? Does your /etc/exports file include /home? Is the /home export open to the client node? Do you have portmapper running on your client node? Do you have NFS support in your kernel or do you have a mount daemon running like rpciod or biod? Finally, do you have any firewalling on either of the nodes? The client and server must have all appropriate software running first and be properly configured before anything will work. Also, if any of those ports are blocked, at either end, things won't work. On Thu, 31 Jul 2003, RoUdY wrote: > Hello dear friends, > > I am doing my beowulf cluster and I have a small problem > when I test the NFS. > > the command I used was : > > " mount -t nfs node1:/home /home nfs " > > (where node1 is my master node) > > > Well the output that I obtain is > " > RPC : Remote system error > connection refused > RPC not registered " > > But when I am on NOde2 and I ping to the master node that > is node1 it's ok.. > > hope to hear from u very soon for HELP > > bye > > Roudy -- Brian D. Ropers-Huilman (225) 578-0461 (V) Systems Administrator AIX (225) 578-6400 (F) Office of Computing Services GNU Linux brian at ropers-huilman.net High Performance Computing .^. http://www.ropers-huilman.net/ Fred Frey Building, Rm. 201, E-1Q /V\ \o/ Louisiana State University (/ \) -- __o / | Baton Rouge, LA 70803-1900 ( ) --- `\<, / `\\, ^^-^^ O/ O / O/ O
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
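A quick way to work through that checklist from the shell, assuming the stock portmap and NFS utilities are installed (command names can differ slightly between distributions):

    # from node2: is the server answering RPC, and is /home exported to us?
    rpcinfo -p node1
    showmount -e node1

    # on node1: are the daemons actually running, and what does the exports file say?
    ps ax | grep -E '[n]fsd|[m]ountd|[p]ortmap'
    cat /etc/exports

"Connection refused" from rpcinfo usually means the portmapper itself is not running (or is firewalled) on node1.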
From louisr at aspsys.com Tue Jul 1 10:40:04 2003 From: louisr at aspsys.com (Louis J. Romero) Date: Tue, 1 Jul 2003 08:40:04 -0600 Subject: Hard link /etc/passwd In-Reply-To: <20030630032016.88507.qmail@web10607.mail.yahoo.com> References: <20030630032016.88507.qmail@web10607.mail.yahoo.com> Message-ID: <200307010840.04590.louisr@aspsys.com> hi Justin, Keep in mind that concurrent access is not a given. The last writer gets to update the file. All other edits will be lost. Louis On Sunday 29 June 2003 09:20 pm, Justin Cook wrote: > Good day, > I have an 11 node diskless cluster. All slave node > roots are under /tftpboot/node1 ... /tftpboot/node2 > ... so on. Is it safe to hard link the /etc/passwd > and /etc/group file to the server nodes for > consistency across the network? > > __________________________________ > Do you Yahoo!? > SBC Yahoo! DSL - Now only $29.95 per month! > http://sbc.yahoo.com > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Louis J. Romero Chief Software Architect Aspen Systems, Inc. 3900 Youngfield Street Wheat Ridge, Co 80033 Toll Free: (800) 992-9242 Tel +01 (303) 431-4606 Ext.
104 Cell +01 (303) 437-6168 Fax +01 (303) 431-7196 louisr at aspsys.com http://www.aspsys.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bob at drzyzgula.org Tue Jul 1 10:40:01 2003 From: bob at drzyzgula.org (Bob Drzyzgula) Date: Tue, 1 Jul 2003 10:40:01 -0400 Subject: Cluster over standard network In-Reply-To: References: Message-ID: <20030701104001.I1838@www2> On Tue, Jul 01, 2003 at 02:15:21PM +0100, Cannon, Andrew wrote: > > Hi all, > > Has anyone implemented a cluster over a normal office network using the PCs > on people's desks as part of the cluster? If so, what was the performance of > the cluster like? What sort of performance penalty was there for the > ordinary user and what was the network traffic like? > > Just a thought... > > TIA This is actually the way much of this stuff used to be done, before commodity computers became both powerful and inexpensive enough [1] to make it worth buying them just to place in a computing cluster. It was quite common in the early 1990s (and likely still is, in many organizations), for example, to have PVM running on production office and lab networks. However, one did have to be reasonably considerate. One didn't usually use these ad hoc clusters during business hours (or at least ran the jobs at idle priority if one did) and one usually asked permission of the person to whom the computer had been assigned before adding it to the cluster. One also had to be careful not to cause problems with other off-hours operations, such as filesystem backups. Of course this approach has disadvantages, and may not work well at all for certain types of network-intensive applications. But if one had, for example, a Monte Carlo simulation to run, and there was no hope of getting mo' better computers, it could make the difference between the the analysis being done or not done. --Bob [1] Or perhaps I should say before cast-off computers were powerful enough, since that's what the first Beowulf was made from, but that phase didn't last very long; it soon became obvious the the cluster idea was useful enough to justify the purchase of new machines, and cast-off machines had problems with reliability and power consumption that made them less than ideal for this application. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rene.storm at emplics.com Tue Jul 1 11:57:54 2003 From: rene.storm at emplics.com (Rene Storm) Date: Tue, 1 Jul 2003 17:57:54 +0200 Subject: WG: Cluster over standard network Message-ID: <29B376A04977B944A3D87D22C495FB23D50F@vertrieb.emplics.com> Hi Andy, think this depends on your desktop pc's. I already have installed such a "cluster", but the desktops were dual 2.4 Ghz PCs with 'own' giga ethernet. It all worked with dual boot on the boot loader and automatic switching of the boot-options in the evening and morning. But there some problems you should take a closer look at. What would you do if your job is still running in the morning and the employees are on the way to their offices ? Could your network bear up with the heavy traffic or woult it disturb things like eg backup server. (If you haven't a seperat backbone.) What if someone would like to impress the boss and do some overtime ? 
I would recommend that you use some of the diskless CDs or floppies out there (like Knoppix or mosix-on-floppy) to check your equipment against your demands. If your office PCs are already running Linux, you could/should take a look at openMosix. From openmosix.org: "Once you have installed openMosix, the nodes in the cluster start talking to one another and the cluster adapts itself to the workload. Processes originating from any one node, if that node is too busy compared to others, can migrate to any other node. openMosix continuously attempts to optimize the resource allocation." We are using openMosix on our clusters and on our servers as well. Works fine for non-parallel jobs. Greetings René ############################## Hi all, Has anyone implemented a cluster over a normal office network using the PCs on people's desks as part of the cluster? If so, what was the performance of the cluster like? What sort of performance penalty was there for the ordinary user and what was the network traffic like? Just a thought... TIA Andy Andrew Cannon, Nuclear Technology (J2), NNC Ltd, Booths Hall, Knutsford, Cheshire, WA16 8QZ. Telephone; +44 (0) 1565 843768 email: mailto:andrew.cannon at nnc.co.uk NNC website: http://www.nnc.co.uk *********************************************************************************** _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From siegert at sfu.ca Tue Jul 1 18:48:08 2003 From: siegert at sfu.ca (Martin Siegert) Date: Tue, 1 Jul 2003 15:48:08 -0700 Subject: Linux support for AMD Opteron with Broadcom NICs Message-ID: <20030701224808.GA15167@stikine.ucs.sfu.ca> Hello, I have a dual AMD Opteron for a week or so as a demo and try to install Linux on it - so far with little success. First of all: doing a google search for x86-64 Linux turns up a lot of press releases but not much more, particularly nothing one could download and install. Even a direct search on the SuSE and Mandrake sites shows only press releases. Sigh. Anyway: I found a few ftp sites that supply a Mandrake-9.0 x86_64 version. Thus I did an ftp installation which after (many) hiccups actually worked. However, that distribution does not support the onboard Broadcom 5704 NICs. I also could not get the driver from the Broadcom web site to work (insmod fails with "could not find MAC address in NVRAM"). Thus I tried to compile the 2.4.21 kernel which worked, but "insmod tg3" freezes the machine instantly. Thus, so far I am not impressed. For those of you who have such a box: which distribution are you using? Any advice on how to get those GigE Broadcom NICs to work?
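Not an answer, but a few generic commands that can help narrow down where a problem like this lives before fighting driver sources (nothing here is Opteron-specific; bcm5700 is Broadcom's own driver, tg3 the in-kernel one):

    # does the kernel see the 5704 at all?
    /sbin/lspci -v | grep -A2 -i broadcom
    dmesg | grep -i -e eth -e tg3 -e bcm

    # try one driver at a time and watch what it reports
    # (have console access handy -- tg3 reportedly hangs this particular box)
    /sbin/modprobe tg3 ; dmesg | tail -20
    /sbin/rmmod tg3
    /sbin/modprobe bcm5700 ; dmesg | tail -20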
Cheers, Martin -- Martin Siegert Manager, Research Services WestGrid Site Manager Academic Computing Services phone: (604) 291-4691 Simon Fraser University fax: (604) 291-4242 Burnaby, British Columbia email: siegert at sfu.ca Canada V5A 1S6 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From shewa at inel.gov Tue Jul 1 19:41:16 2003 From: shewa at inel.gov (Andrew Shewmaker) Date: Tue, 01 Jul 2003 17:41:16 -0600 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <20030701224808.GA15167@stikine.ucs.sfu.ca> References: <20030701224808.GA15167@stikine.ucs.sfu.ca> Message-ID: <3F021C1C.4050309@inel.gov> Martin Siegert wrote: > Hello, > > I have a dual AMD Opteron for a week or so as a demo and try to install > Linux on it - so far with little success. > > First of all: doing a google search for x86-64 Linux turns up a lot of > press releases but not much more, particularly nothing one could download > and install. Even a direct search on the SuSE and Mandrake sites shows > only press releases. Sigh. > > Anyway: I found a few ftp sites that supply a Mandrake-9.0 x86_64 version. > Thus I did a ftp installation which after (many) hickups actually worked. > However, that distribution does not support the onboard Broadcom 5704 > NICs. I also could not get the driver from the broadcom web site to work > (insmod fails with "could not find MAC address in NVRAM"). > > Thus I tried to compile the 2.4.21 kernel which worked, but > "insmod tg3" freezes the machine instantly. > > Thus, so far I am not impressed. > > For those of you who have such a box: which distribution are you using? > Any advice on how to get those GigE Broadcom NICs to work? > > Cheers, > Martin > The evaluation box I had an account on ran SuSE and Mark Hahn is running RedHat 9 without problems. Other than customizing a regular x86 distro, you'll probably have to buy an enterprise or cluster version for now. http://www.suse.com/us/business/products/server/sles/prices_amd64.html http://www.mandrakesoft.com/products/clustering It doesn't look like Debian is ready yet: https://alioth.debian.org/projects/debian-x86-64/ I couldn't find redhat's opteron pages. Andrew -- Andrew Shewmaker, Associate Engineer Phone: 1-208-526-1276 Idaho National Eng. and Environmental Lab. P.0. Box 1625, M.S. 3605 Idaho Falls, Idaho 83415-3605 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From nashif at planux.com Tue Jul 1 20:14:47 2003 From: nashif at planux.com (Anas Nashif) Date: Tue, 1 Jul 2003 20:14:47 -0400 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <20030701224808.GA15167@stikine.ucs.sfu.ca> References: <20030701224808.GA15167@stikine.ucs.sfu.ca> Message-ID: <200307012014.47101.nashif@planux.com> On July 1, 2003 06:48 pm, Martin Siegert wrote: > Hello, > > I have a dual AMD Opteron for a week or so as a demo and try to install > Linux on it - so far with little success. > > First of all: doing a google search for x86-64 Linux turns up a lot of > press releases but not much more, particularly nothing one could download > and install. Even a direct search on the SuSE and Mandrake sites shows > only press releases. Sigh. > > Anyway: I found a few ftp sites that supply a Mandrake-9.0 x86_64 version. 
> Thus I did a ftp installation which after (many) hickups actually worked. > However, that distribution does not support the onboard Broadcom 5704 > NICs. I also could not get the driver from the broadcom web site to work > (insmod fails with "could not find MAC address in NVRAM"). > > Thus I tried to compile the 2.4.21 kernel which worked, but > "insmod tg3" freezes the machine instantly. > > Thus, so far I am not impressed. > > For those of you who have such a box: which distribution are you using? SuSE SLES 8. > Any advice on how to get those GigE Broadcom NICs to work? Works out of the box with broadcom. (bcm5700 module, tg3 is not always recommended) Anas > > Cheers, > Martin _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jhearns at freesolutions.net Wed Jul 2 06:13:04 2003 From: jhearns at freesolutions.net (John Hearns) Date: Wed, 02 Jul 2003 11:13:04 +0100 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <4.3.2.7.2.20030702102245.00adc830@pop.freeuk.net> References: <20030701224808.GA15167@stikine.ucs.sfu.ca> <20030701224808.GA15167@stikine.ucs.sfu.ca> <4.3.2.7.2.20030702102245.00adc830@pop.freeuk.net> Message-ID: <3F02B030.1040305@freesolutions.net> Simon Hogg wrote: > > As you say, at the moment the best bet seems to be to *buy* the > enterprise editions. For those of us who are loathe to spend any > money or who 'just like' Debian, there is a bit of waiting still to > do. According to one developer; > > "There is work ongoing on a Debian port, but it will be a while yet - > the x86-64 really needs sub-architecture support for effective support > (to allow mixing of 32- and 64-bit things), and that is a big step for > us. Feel free to chip in and help! :-)". > > However, as far as I am aware, it should be possible to install a > vanilla x86-32 distribution and recompile everything for 64-bit (with > a recent GCC (3.3 is the best bet at the moment I suppose)). > That's how Gentoo does things. Anyone heard of Gentoo running on X86-64 ? Might be fun. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From seth at hogg.org Wed Jul 2 05:31:53 2003 From: seth at hogg.org (Simon Hogg) Date: Wed, 02 Jul 2003 10:31:53 +0100 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <3F021C1C.4050309@inel.gov> References: <20030701224808.GA15167@stikine.ucs.sfu.ca> <20030701224808.GA15167@stikine.ucs.sfu.ca> Message-ID: <4.3.2.7.2.20030702102245.00adc830@pop.freeuk.net> At 17:41 01/07/03 -0600, Andrew Shewmaker wrote: >Martin Siegert wrote: > >>Hello, >>I have a dual AMD Opteron for a week or so as a demo and try to install >>Linux on it - so far with little success. >>First of all: doing a google search for x86-64 Linux turns up a lot of >>press releases but not much more, particularly nothing one could download >>and install. Even a direct search on the SuSE and Mandrake sites shows >>only press releases. Sigh. >>Anyway: I found a few ftp sites that supply a Mandrake-9.0 x86_64 version. >>Thus I did a ftp installation which after (many) hickups actually worked. >>However, that distribution does not support the onboard Broadcom 5704 >>NICs. 
I also could not get the driver from the broadcom web site to work >>(insmod fails with "could not find MAC address in NVRAM"). >>Thus I tried to compile the 2.4.21 kernel which worked, but >>"insmod tg3" freezes the machine instantly. >>Thus, so far I am not impressed. >>For those of you who have such a box: which distribution are you using? >>Any advice on how to get those GigE Broadcom NICs to work? >>Cheers, >>Martin > >The evaluation box I had an account on ran SuSE and Mark Hahn is running >RedHat 9 without problems. Other than customizing a regular x86 distro, >you'll probably have to buy an enterprise or cluster version for now. As you say, at the moment the best bet seems to be to *buy* the enterprise editions. For those of us who are loathe to spend any money or who 'just like' Debian, there is a bit of waiting still to do. According to one developer; "There is work ongoing on a Debian port, but it will be a while yet - the x86-64 really needs sub-architecture support for effective support (to allow mixing of 32- and 64-bit things), and that is a big step for us. Feel free to chip in and help! :-)". However, as far as I am aware, it should be possible to install a vanilla x86-32 distribution and recompile everything for 64-bit (with a recent GCC (3.3 is the best bet at the moment I suppose)). However, your original problem seems not to be how to get it installed, but rather how to get your Broadcom GigE to work. I'm afraid I don't know the answer to that one! I know this doesn't answer your question, but hope it gives somebody some more impetus to get this darned Debian port finished :-) HTH (although probably won't). -- Simon _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From fgp at pcnet.ro Wed Jul 2 07:46:34 2003 From: fgp at pcnet.ro (Florian Gabriel) Date: Wed, 02 Jul 2003 14:46:34 +0300 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <20030701224808.GA15167@stikine.ucs.sfu.ca> References: <20030701224808.GA15167@stikine.ucs.sfu.ca> Message-ID: <3F02C61A.4060409@pcnet.ro> Martin Siegert wrote: >Hello, > >I have a dual AMD Opteron for a week or so as a demo and try to install >Linux on it - so far with little success. > >First of all: doing a google search for x86-64 Linux turns up a lot of >press releases but not much more, particularly nothing one could download >and install. Even a direct search on the SuSE and Mandrake sites shows >only press releases. Sigh. > >Anyway: I found a few ftp sites that supply a Mandrake-9.0 x86_64 version. >Thus I did a ftp installation which after (many) hickups actually worked. >However, that distribution does not support the onboard Broadcom 5704 >NICs. I also could not get the driver from the broadcom web site to work >(insmod fails with "could not find MAC address in NVRAM"). > >Thus I tried to compile the 2.4.21 kernel which worked, but >"insmod tg3" freezes the machine instantly. > >Thus, so far I am not impressed. > >For those of you who have such a box: which distribution are you using? >Any advice on how to get those GigE Broadcom NICs to work? 
> >Cheers, >Martin > > > You can try the preview distribution "gingin64" from here: http://ftp.redhat.com/pub/redhat/linux/preview/gingin64/en/iso/x86_64/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From wathey at salk.edu Wed Jul 2 11:01:25 2003 From: wathey at salk.edu (Jack Wathey) Date: Wed, 2 Jul 2003 08:01:25 -0700 (PDT) Subject: memory nightmare Message-ID: <20030702075618.D6562-100000@euler.salk.edu> I need some advice about how to handle some ambiguous results from memtest86. I also have some general questions about bios options related to ECC memory. First some background: I'm building a diskless cluster that will soon grow to 100 dual athlon nodes. At present it has 10 diskless nodes and a server. The boards are Gigabyte Technologies model GA7DPXDW-P, and the cpus are Athlon MP 2200+. In April I bought 69 1 gigabyte ecc registered ddr modules from a vendor who had twice before sold me reliable memory. This time, however, the memory was bad. Testing in batches of 3 sticks per motherboard, nearly 100% failed memtest86, and some machines crashed or would not even boot. They replaced all 69 sticks. Of this second batch, about 60% failed memtest86, and the longer I tested, the more would fail. I again returned them all. In both of these batches, the failures were numerous, often thousands or hundreds of thousands or even millions of errors. The errors were usually multibit errors, where the "fail bits" were things like 0f0f0f0f or ffffffff. The most commonly failing test seemed to be test number 6, but others failed, too. I am now testing the third batch of 69 sticks. I decided, more-or-less arbitrarily, that I would consider them good if they passed 48 hours of memtest86. Testing in batches of 3 per board, all but 6 groups of 3 sticks passed 48 hours of memtest86. I have been able to identify a single failing stick in 2 of the 6 failed batches by testing 1 stick per motherboard. I am still testing the others, 1 stick per board, but so far none has failed. So here is the problem: I have these 4 batches, of 3 sticks each, which failed memtest86 when tested in batches of 3. The failures did not occur on each pass of memtest's 16 tests. Instead the sticks would pass all of the tests for several passes. In one case the failure did not occur until after memtest86 had been running, without error, for 42 hours on that machine. That particular failure was in a single word in test 6. The worst of the 4 batches failed at 14 memory locations. I have now been testing 9 of these 12 suspect sticks, 1 stick per motherboard, for several days. Several have now passed more than 100 hours of memtest86 without error. Can I trust them? Should I keep them or return them? If I return them, how long must I run memtest86 on the replacements before I can trust those? Can I trust the 55 or so sticks that passed 48 hours of memtest86 in batches of 3? The vendor has been making a good-faith effort to solve the problem, and has even agreed to refund my money for the whole purchase if I'm not happy with it. What would you do in this situation? Those are the most urgent questions for which I need answers, but I have a few others of a more general nature: Is there a specific vendor or brand of memory that is much more reliable than others? Since the above-described ordeal, I've heard that Kingston has a good reputation. Anyone care to endorse or refute that? 
Any other good brands/vendors you care to mention? My understanding is that ECC can correct only single-bit errors, and so would not help with the kind of multibit errors that have been troubling me lately. But I have some basic questions on ECC that you might be able to answer (I've asked the motherboard maker's tech support, but to no avail!): In the bios for my GA7DPXDW-P motherboards, there are these 4 alternatives for the SDRAM ECC Setting: Disabled Check only Correct Errors Correct + scrub I'm pretty sure I understand what 'Disabled' does. Can anyone explain to me what the others do, and how they differ? Also, if ECC correction is enabled, does this slow down the machine in any way? Is there any disadvantage to having ECC correction enabled? TIA, Jack _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hahn at physics.mcmaster.ca Wed Jul 2 10:38:05 2003 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed, 2 Jul 2003 10:38:05 -0400 (EDT) Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <4.3.2.7.2.20030702102245.00adc830@pop.freeuk.net> Message-ID: > However, as far as I am aware, it should be possible to install a vanilla > x86-32 distribution it is. dual-opterons are very xeon-compatible. I was in a hurry to fiddle with one that came into my hands, so I just ripped the HD out of a crappy i815/PIII system (containing a basic RH9 install), and plugged it into the dual-opteron (MSI board). worked fine. I compiled a specific kernel for it, and it was even finer (I don't use modules, but the AMD Viper ide controller and broadcom gigabit drivers seemed to work perfectly fine.) the machine is now in day-to-day use as a workstation running Mandrake (ia32 version, I think, though probably also with a custom kernel). I did some basic testing, and was pleased with performance - about what I'd expect from a dual-xeon 2.6-2.8. none of that testing was with an x86-64 compiler/kernel/runtime, though - in fact, I was just using Intel's compilers ("scp -r xeon:/opt/intel /opt"!) do be certain that your dimms are arranged right - our whitebox vendor seemed to think that all the dimms should go in cpu0's bank first, with no inter-bank or inter-node interleaving. performance was ~30% better under Stream when the dimms were properly distributed and both kinds of interleaving enabled in bios. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From sgaudet at wildopensource.com Wed Jul 2 13:26:29 2003 From: sgaudet at wildopensource.com (Stephen Gaudet) Date: Wed, 02 Jul 2003 13:26:29 -0400 Subject: memory nightmare References: <20030702075618.D6562-100000@euler.salk.edu> Message-ID: <3F0315C5.6020401@wildopensource.com> Hello Jack, > So here is the problem: I have these 4 batches, of 3 sticks each, > which failed memtest86 when tested in batches of 3. The failures did > not occur on each pass of memtest's 16 tests. Instead the sticks would > pass all of the tests for several passes. In one case the failure > did not occur until after memtest86 had been running, without error, > for 42 hours on that machine. That particular failure was in a single > word in test 6. The worst of the 4 batches failed at 14 memory > locations. 
I have now been testing 9 of these 12 suspect sticks, > 1 stick per motherboard, for several days. Several have now passed > more than 100 hours of memtest86 without error. > > Can I trust them? > > Should I keep them or return them? > > If I return them, how long must I run memtest86 on the replacements > before I can trust those? > > Can I trust the 55 or so sticks that passed 48 hours of memtest86 in > batches of 3? > > The vendor has been making a good-faith effort to solve the problem, > and has even agreed to refund my money for the whole purchase if I'm > not happy with it. > > What would you do in this situation? First, I'd make sure the memory comes from a major supplier, Kingston, Crucial, Virtium, Ventura, Transend, etc... Next, make sure all the ram has the same chipset Samsung, Infineon, etc... If you have various sticks in these systems where the chip manufacture is different they sometime don't behave well. So try to make everything match. Last I check cooling. Do these systems have proper cooling? > Those are the most urgent questions for which I need answers, but I > have a few others of a more general nature: > > Is there a specific vendor or brand of memory that is much more > reliable than others? Since the above-described ordeal, I've heard > that Kingston has a good reputation. Anyone care to endorse or > refute that? Any other good brands/vendors you care to mention? See above. I personally never buy ram unless it's on Intel's approved list and comes with a lifetime warranty. I realize this is an AMD solution. However, anyone that is approved by Intel in most cases is a real supplier with technical depth and could of helped with this problem. When I had strange problems like this in the past with various systems, Virtium, Ventura and others took a system into their lab in order to fix the problem. > My understanding is that ECC can correct only single-bit errors, and > so would not help with the kind of multibit errors that have been > troubling me lately. But I have some basic questions on ECC that > you might be able to answer (I've asked the motherboard maker's tech > support, but to no avail!): > > In the bios for my GA7DPXDW-P motherboards, there are these 4 > alternatives for the SDRAM ECC Setting: > > Disabled > Check only > Correct Errors > Correct + scrub > > I'm pretty sure I understand what 'Disabled' does. Can anyone > explain to me what the others do, and how they differ? Also, if ECC > correction is enabled, does this slow down the machine in any way? > Is there any disadvantage to having ECC correction enabled? What's the motherboard manufacture call for? Cheers, and Happy 4th of July, Steve Gaudet Wild Open Source (home office) ---------------------- Bedford, NH 03110 pH:603-488-1599 cell:603-498-1600 http://www.wildopensource.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From wathey at salk.edu Wed Jul 2 14:02:36 2003 From: wathey at salk.edu (Jack Wathey) Date: Wed, 2 Jul 2003 11:02:36 -0700 (PDT) Subject: memory nightmare In-Reply-To: <3F0315C5.6020401@wildopensource.com> Message-ID: <20030702104756.M6682-100000@euler.salk.edu> On Wed, 2 Jul 2003, Stephen Gaudet wrote: > First, I'd make sure the memory comes from a major supplier, Kingston, > Crucial, Virtium, Ventura, Transend, etc... The supplier is not one of those you listed above. 
I've been dealing with them as well as with the vendor, and, at this point, I'd prefer not to disclose their name on the list. (Yes, I know, Steve: I should have just bought these sticks from you in the first place! Oh well. We live and learn.) > > Next, make sure all the ram has the same chipset Samsung, Infineon, > etc... If you have various sticks in these systems where the chip > manufacture is different they sometime don't behave well. So try to > make everything match. The latest batch of 69 sticks all used Samsung chips. > > Last I check cooling. Do these systems have proper cooling? > Yes, definitely. I monitor that closely. Ambient temperature around the motherboards never exceeded 77 deg F throughout these tests, and was less than 70F most of the time. I can't monitor cpu temperature directly when memtest86 is running, but, in the same enclosure, when I can monitor cpu temperatures, they are typically 55C or less. I've been experimenting with different heatsinks. Some of the boards have Thermalright sk6+/Delta 60X25mm coolers, which keep the cpus below 40C most of the time. > Thanks and best wishes, Jack _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From award at andorra.ad Wed Jul 2 13:41:02 2003 From: award at andorra.ad (Alan Ward) Date: Wed, 02 Jul 2003 19:41:02 +0200 Subject: sharing a power supply Message-ID: <3F03192E.4040904@andorra.ad> Dear listpeople, I am building a small beowulf with the following configuration: - 4 motherboards w/ onboard Ethernet - 1 hard disk - 1 (small) switch - 1 ATX power supply shared by all boards The intended boot sequence is the classical (1) master boots off hard disk; (2) after a suitable delay, slaves boot off master with dhcp and root nfs. I would appreciate comments on the following: a) A 450 W power supply should have ample power for all - but can it deliver on the crucial +5V and +3.3V lines? Has anybody got real-world intensity measurements on these lines for Athlons I can compare to the supply's specs? b) I hung two motherboards off a single ATX supply. When I hit the switch on either board, the supply goes on and both motherboards come to life. Does anybody know a way of keeping the slaves still until the master has gone through boot? e.g. Use the reset switch? Can one of the power lines control the PLL on the motherboard? Best regards, Alan Ward _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From becker at scyld.com Wed Jul 2 13:47:58 2003 From: becker at scyld.com (Donald Becker) Date: Wed, 2 Jul 2003 10:47:58 -0700 (PDT) Subject: memory nightmare In-Reply-To: <20030702075618.D6562-100000@euler.salk.edu> Message-ID: On Wed, 2 Jul 2003, Jack Wathey wrote: > I need some advice about how to handle some ambiguous results from > memtest86. I also have some general questions about bios options > related to ECC memory. .. > The boards are Gigabyte Technologies model GA7DPXDW-P, > ...Testing in batches of 3 sticks per motherboard, nearly 100% failed My immediate reaction is that you have a motherboard that has memory configuration restrictions. A typical restriction is that can only use two DIMMs when they are "double sided" (with two memory chips per signal line instead of one) or have larger-capacity memory chips. 
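One way to check how the DIMMs are actually organized, and what timings their SPD EEPROMs claim, without pulling modules is the decode-dimms.pl script shipped with the lm_sensors package; a rough sketch, with the caveat that the right SMBus driver depends on the chipset (i2c-amd756 is only a guess for this AMD 760MPX board, and the eeprom module comes from lm_sensors):

    /sbin/modprobe i2c-amd756   # chipset SMBus driver; 'sensors-detect' can suggest the right one
    /sbin/modprobe eeprom       # exposes the SPD EEPROM of each DIMM
    decode-dimms.pl | less      # rows, banks, sides, and the programmed CAS latency per stick

A stick whose SPD claims faster timings than its neighbours, or a different organization, is worth pulling out for a closer look.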
My second reaction is that you are running the chips too fast for ECC, either because the serial EEPROM has been reprogrammed to claim that the chips are faster or the BIOS settings have been tweaked. Remember than a ECC memory system is slower than the same chips without ECC! > In the bios for my GA7DPXDW-P motherboards, there are these 4 > alternatives for the SDRAM ECC Setting: > > Disabled > Check only As the memory read is happening, start checking the data. If the check fails, interrupt later. > Correct Errors When the memory read is started, check the data. Hold the result until the check passes or the data is corrected. > Correct + scrub Correct read data as above, holding the transaction and writing corrected data back to the DIMM if an error is found. > I'm pretty sure I understand what 'Disabled' does. Can anyone > explain to me what the others do, and how they differ? Also, if ECC > correction is enabled, does this slow down the machine in any way? Yes. The typical cost is one clock cycle of read latency. It might seem obviously easy to overlap the ECC check when it usually passes, but you can't really hide all of the cost. The memory-read path is always latency-critical. -- Donald Becker becker at scyld.com Scyld Computing Corporation http://www.scyld.com 914 Bay Ridge Road, Suite 220 Scyld Beowulf cluster system Annapolis MD 21403 410-990-9993 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From kuku at physik.rwth-aachen.de Tue Jul 1 02:11:22 2003 From: kuku at physik.rwth-aachen.de (Christoph P. Kukulies) Date: Tue, 1 Jul 2003 08:11:22 +0200 Subject: Hard link /etc/passwd In-Reply-To: References: <20030630032016.88507.qmail@web10607.mail.yahoo.com> Message-ID: <20030701061122.GA18433@gilberto.physik.rwth-aachen.de> On Mon, Jun 30, 2003 at 05:15:21PM -0400, William Dieter wrote: > You have to be careful when doing maintenance. For example, if you do: > > mv /etc/passwd /etc/passwd.bak > cp /etc/passwd.bak /etc/passwd > > all of the copies will be linked to the backup copy. Normally you > might not do this, but some text editors sometimes do similar things > silently... > > A symbolic link might be safer. But it won't work in his diskless environment. Symbolic links are not visible outside the chrooted environment of the specific diskless clients. It's gotta be hard links. > > >Good day, > >I have an 11 node diskless cluster. All slave node > >roots are under /tftpboot/node1 ... /tftpboot/node2 > >... so on. Is it safe to hard link the /etc/passwd > >and /etc/group file to the server nodes for > >consistency across the network? > -- Chris Christoph P. U. Kukulies kukulies (at) rwth-aachen.de _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From c00jsh00 at nchc.gov.tw Tue Jul 1 21:53:05 2003 From: c00jsh00 at nchc.gov.tw (Jyh-Shyong Ho) Date: Wed, 02 Jul 2003 09:53:05 +0800 Subject: Linux support for AMD Opteron with Broadcom NICs References: <20030701224808.GA15167@stikine.ucs.sfu.ca> Message-ID: <3F023B01.2706C3A0@nchc.gov.tw> Hi, We installed SuSE Enterprise 8 for AMD64 on our dual AMD Opteron box, it works fine for the on-board Broadcom NICs. SuSE Enterprise 8 for AMD64 is not free, however. 
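For anyone building their own kernel rather than taking SuSE's, a hedged sketch of wiring up Broadcom's bcm5700 driver (suggested earlier in the thread as the safer choice than tg3 on these boards) on a 2.4 system; the interface name and paths are assumptions:

    # after building and installing the module from Broadcom's driver source:
    /sbin/depmod -a
    /sbin/modprobe bcm5700 && dmesg | tail -5   # should report the MAC address

    # make it the default driver for eth0 in /etc/modules.conf
    cat >> /etc/modules.conf <<'EOF'
    alias eth0 bcm5700
    EOF

If tg3 is compiled into the distribution kernel it will still grab the device first; leaving it out, or building it only as a module, avoids that.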
It uses a special 2.4.19Suse kernel which SuSE has done a lot of works to make sure most drivers behave normally. We tried kernel 2.4.21 but it failed for Realtek NICs. At the moment, there are not so many drivers supported in kernel 2.4.21 for Opteron. Jyh-Shyong Ho, PhD. Research Scientist National Center for High-Performance Computing Hsinchu, Taiwan, ROC Martin Siegert wrote: > > Hello, > > I have a dual AMD Opteron for a week or so as a demo and try to install > Linux on it - so far with little success. > > First of all: doing a google search for x86-64 Linux turns up a lot of > press releases but not much more, particularly nothing one could download > and install. Even a direct search on the SuSE and Mandrake sites shows > only press releases. Sigh. > > Anyway: I found a few ftp sites that supply a Mandrake-9.0 x86_64 version. > Thus I did a ftp installation which after (many) hickups actually worked. > However, that distribution does not support the onboard Broadcom 5704 > NICs. I also could not get the driver from the broadcom web site to work > (insmod fails with "could not find MAC address in NVRAM"). > > Thus I tried to compile the 2.4.21 kernel which worked, but > "insmod tg3" freezes the machine instantly. > > Thus, so far I am not impressed. > > For those of you who have such a box: which distribution are you using? > Any advice on how to get those GigE Broadcom NICs to work? > > Cheers, > Martin > > -- > Martin Siegert > Manager, Research Services > WestGrid Site Manager > Academic Computing Services phone: (604) 291-4691 > Simon Fraser University fax: (604) 291-4242 > Burnaby, British Columbia email: siegert at sfu.ca > Canada V5A 1S6 > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From johnt at quadrics.com Wed Jul 2 06:01:43 2003 From: johnt at quadrics.com (John Taylor) Date: Wed, 2 Jul 2003 11:01:43 +0100 Subject: interconnect latency, dissected. Message-ID: <010C86D15E4D1247B9A5DD312B7F5AA7CCDC9D@stegosaurus.bristol.quadrics.com> I agree with Joachim et al on the merit of the paper - it raises some important issues relating to the overall efficacy of MPI in certain circumstances. In relation to IB there has been some work at Ohio State, comparing Myrinet and QsNet. The latter however only discusses MPI, where as the UPC group in the former discuss lower level APIs that suit better some algorithms as well as being the target of specific compiler environments. On the paper specifically at Berkeley my only concern is that there is no mention on the influence of the PCI-Bridge implementation, not withstanding its specification. For instance the system at ORNL is based on ES40 which on a similar system gives an 8byte latency so... prun -N2 mping 0 8 1 pinged 0: 0 bytes 7.76 uSec 0.00 MB/s 1 pinged 0: 1 bytes 8.11 uSec 0.12 MB/s 1 pinged 0: 2 bytes 8.06 uSec 0.25 MB/s 1 pinged 0: 4 bytes 8.35 uSec 0.48 MB/s 1 pinged 0: 8 bytes 8.20 uSec 0.98 MB/s . . . 1 pinged 0: 524288 bytes 2469.61 uSec 212.30 MB/s 1 pinged 0: 1048576 bytes 4955.28 uSec 211.61 MB/s similar to the latency and bandwidth achieved for the author's benchmark. 
whereas the same code on the same Quadrics hardware running on a Xeon (GC-LE) platform gives prun -N2 mping 0 8 1 pinged 0: 0 bytes 4.31 uSec 0.00 MB/s 1 pinged 0: 1 bytes 4.40 uSec 0.23 MB/s 1 pinged 0: 2 bytes 4.40 uSec 0.45 MB/s 1 pinged 0: 4 bytes 4.39 uSec 0.91 MB/s 1 pinged 0: 8 bytes 4.38 uSec 1.83 MB/s . . . 1 pinged 0: 524288 bytes 1632.61 uSec 321.13 MB/s 1 pinged 0: 1048576 bytes 3252.28 uSec 322.41 MB/s It may also be the case that the Myrinet performance could also be improved (it is stated as PCI 32/66 in the paper) based on benchmarking a more recent PCI-bridge. These current performance measurements may lead to differing conclusions w.r.t latency although there is still the issue of the two-sided nature. For completeness here is the shmem_put performance on a new bridge. prun -N2 sping -f put -b 1000 0 8 1: 4 bytes 1.60 uSec 2.50 MB/s 1: 8 bytes 1.60 uSec 5.00 MB/s 1: 16 bytes 1.58 uSec 10.11 MB/s John Taylor Quadrics Limited http://www.quadrics.com > -----Original Message----- > From: Joachim Worringen [mailto:joachim at ccrl-nece.de] > Sent: 01 July 2003 09:03 > To: Beowulf mailinglist > Subject: Re: interconnect latency, dissected. > > > James Cownie: > > Mark Hahn wrote: > > > does anyone have references handy for recent work on interconnect > > > latency? > > > > Try http://www.cs.berkeley.edu/~bonachea/upc/netperf.pdf > > > > It doesn't have Inifinband, but does have Quadrics, Myrinet > 2000, GigE and > > IBM. > > Nice paper showing interesting properties. But some metrics > seem a little bit > dubious to me: in 5.2, they seem to see an advantage if the "overlap > potential" is higher (when they compare Quadrics and Myrinet) > - which usually > just results in higher MPI latencies, as this potiential (on > small messages) > can not be exploited. Even with overlapping mulitple communication > operations, the faster interconnect remains faster. This is > especially true > for small-message latency. > > From the contemporary (cluster) interconnects, SCI is missing next to > Infiniband. It would have been interesting to see the results > for SCI as it > has a very different communication model than most of the > other interconnects > (most resembling the T3E one). > > Joachim > > -- > Joachim Worringen - NEC C&C research lab St.Augustin > fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) > visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From sgaudet at wildopensource.com Wed Jul 2 14:24:42 2003 From: sgaudet at wildopensource.com (Stephen Gaudet) Date: Wed, 02 Jul 2003 14:24:42 -0400 Subject: memory nightmare References: <20030702104756.M6682-100000@euler.salk.edu> Message-ID: <3F03236A.3050106@wildopensource.com> Hello Jack, Jack Wathey wrote: > > On Wed, 2 Jul 2003, Stephen Gaudet wrote: > > >>First, I'd make sure the memory comes from a major supplier, Kingston, >>Crucial, Virtium, Ventura, Transend, etc... > > > The supplier is not one of those you listed above. I've been dealing with > them as well as with the vendor, and, at this point, I'd prefer not to > disclose their name on the list. (Yes, I know, Steve: I should have just > bought these sticks from you in the first place! Oh well. 
We live and > learn.) > > >>Next, make sure all the ram has the same chipset Samsung, Infineon, >>etc... If you have various sticks in these systems where the chip >>manufacture is different they sometime don't behave well. So try to >>make everything match. > > > The latest batch of 69 sticks all used Samsung chips. Same part number and speed? What does the motherboard manufacture call for in regards to cas latency 2 or 3? Best is usually 2. >>Last I check cooling. Do these systems have proper cooling? Ok. > Yes, definitely. I monitor that closely. Ambient temperature around the > motherboards never exceeded 77 deg F throughout these tests, and was > less than 70F most of the time. I can't monitor cpu temperature directly > when memtest86 is running, but, in the same enclosure, when I can monitor > cpu temperatures, they are typically 55C or less. I've been experimenting > with different heatsinks. Some of the boards have Thermalright sk6+/Delta > 60X25mm coolers, which keep the cpus below 40C most of the time. Don't rule out the motherboard or processors. I agree with you looks like ram. However, might turn out to be a bad series of motherboards, and or processors. Memtest86 also shows cache errors. My own system here at home had memmory errors and I though for sure it was the ram. Turned out to be the memory controller chip on the motherboard. Steve Gaudet Wild Open Source (home office) ---------------------- Bedford, NH 03110 pH:603-488-1599 cell:603-498-1600 http://www.wildopensource.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mprinkey at aeolusresearch.com Wed Jul 2 14:50:05 2003 From: mprinkey at aeolusresearch.com (Michael T. Prinkey) Date: Wed, 2 Jul 2003 14:50:05 -0400 (EDT) Subject: memory nightmare In-Reply-To: <3F0315C5.6020401@wildopensource.com> References: <3F0315C5.6020401@wildopensource.com> Message-ID: <46008.66.118.77.29.1057171805.squirrel@ra.aeolustec.com> > > First, I'd make sure the memory comes from a major supplier, Kingston, > Crucial, Virtium, Ventura, Transend, etc... > > Next, make sure all the ram has the same chipset Samsung, Infineon, > etc... If you have various sticks in these systems where the chip > manufacture is different they sometime don't behave well. So try to > make everything match. > > Last I check cooling. Do these systems have proper cooling? > I would add only to verify that you have sufficient and consistent power. I have seen many more "memory" errors caused by malfunctioning power supplies than by bad memory modules. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From erwan at mandrakesoft.com Tue Jul 1 10:52:36 2003 From: erwan at mandrakesoft.com (Erwan Velu) Date: 01 Jul 2003 16:52:36 +0200 Subject: Cluster over standard network In-Reply-To: References: Message-ID: <1057071156.9954.15.camel@revolution.mandrakesoft.com> Le mar 01/07/2003 ? 15:15, Cannon, Andrew a ?crit : > Hi all, > > Has anyone implemented a cluster over a normal office network using the PCs > on people's desks as part of the cluster? If so, what was the performance of > the cluster like? What sort of performance penalty was there for the > ordinary user and what was the network traffic like? 
You may have a look on the quite "old" Icluster initiative http://www-id.imag.fr/Grappes/icluster/description.html. They did it and you can see their benchmarks.. It was a 200 E-PC cluster using an ethernet network. It was in top500 ! -- Erwan Velu Linux Cluster Distribution Project Manager MandrakeSoft 43 rue d'aboukir 75002 Paris Phone Number : +33 (0) 1 40 41 17 94 Fax Number : +33 (0) 1 40 41 92 00 Web site : http://www.mandrakesoft.com OpenPGP key : http://www.mandrakesecure.net/cks/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From cgdethe at yahoo.com Wed Jul 2 00:55:45 2003 From: cgdethe at yahoo.com (chandrashekhar dethe) Date: Tue, 1 Jul 2003 21:55:45 -0700 (PDT) Subject: help Message-ID: <20030702045545.65035.qmail@web10806.mail.yahoo.com> Hello, Myself Prof.C.G.Dethe, Asst. Professor in the department of electronics and Tele. SSGM, College of Engg. Shegaon (M.S.) India. I wish to set up an experimental high performance linux cluster in our lab. I want to begin with simply 8 nodes. This will be given as an project to PG student. I wish to write a proposal for this purpose to Dept. of Science and Tech. Govt. of India. Pl. let us know the hardware + software requirements for this cluster which will be used for research work mainly. with regards, -cgdethe Prof.C.G.Dethe SSGM College of Engg. Shegaon 444 203 Dist. Buldhana State: Maharashtra. INDIA. ===== with regards, - C.G.DETHE. __________________________________ Do you Yahoo!? SBC Yahoo! DSL - Now only $29.95 per month! http://sbc.yahoo.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From tim.carlson at pnl.gov Wed Jul 2 14:47:40 2003 From: tim.carlson at pnl.gov (Tim Carlson) Date: Wed, 02 Jul 2003 11:47:40 -0700 (PDT) Subject: [Rocks-Discuss]Dual Itanium2 performance In-Reply-To: <0258E449E0019844924F40FE68D15B2D5FFE8F@ictxchp02.rac.ray.com> Message-ID: On Wed, 2 Jul 2003, Leonard Chvilicek wrote: > I was reading in some of the mailing lists that the AMD Opteron dual > processor system was getting around 80-90% efficiency on the second > processor. I was wondering if that holds true to the Itanium2 platform? > I looked through some of the archives and did not find any benchmarks or > statistics on this. I found lots of dual Xeons but no dual Itaniums. You are not going to be able to beat a dual Itanium in terms of efficiency if you are talking about a linpack benchmark. Close to 98% efficient. Tim Tim Carlson Voice: (509) 376 3423 Email: Tim.Carlson at pnl.gov EMSL UNIX System Support _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From wathey at salk.edu Wed Jul 2 14:31:10 2003 From: wathey at salk.edu (Jack Wathey) Date: Wed, 2 Jul 2003 11:31:10 -0700 (PDT) Subject: memory nightmare In-Reply-To: Message-ID: <20030702111109.X6682-100000@euler.salk.edu> On Wed, 2 Jul 2003, Donald Becker wrote: > > My immediate reaction is that you have a motherboard that has memory > configuration restrictions. 
A typical restriction is that can only use > two DIMMs when they are "double sided" (with two memory chips per signal > line instead of one) or have larger-capacity memory chips. I'll look into that. I doubt this is the problem, though, because last December I got a batch of 30 1-gig sticks from the same vendor that pass memtest86 just fine in batches of 3 per board, on the very same motherboards. The batch from December used Nanya chips and were high-profile. The latest batch are Samsung low-profile. I don't know if these are "double-sided" or not. The only restriction I know of, from the motherboard manual, is that the memory must be "registered ECC ddr", which these are. Also, most of the failing sticks I've seen fail when tested one stick per board. > > My second reaction is that you are running the chips too fast for ECC, > either because the serial EEPROM has been reprogrammed to claim that the > chips are faster or the BIOS settings have been tweaked. Remember than > a ECC memory system is slower than the same chips without ECC! ECC was turned off during the memtest86 runs. I'm using the default bios settings for memory timing parameters. > > > In the bios for my GA7DPXDW-P motherboards, there are these 4 > > alternatives for the SDRAM ECC Setting: > > > > Disabled > > Check only > > As the memory read is happening, start checking the data. If the check > fails, interrupt later. > > > Correct Errors > > When the memory read is started, check the data. Hold the result > until the check passes or the data is corrected. > > > Correct + scrub > > Correct read data as above, holding the transaction and writing > corrected data back to the DIMM if an error is found. > > > I'm pretty sure I understand what 'Disabled' does. Can anyone > > explain to me what the others do, and how they differ? Also, if ECC > > correction is enabled, does this slow down the machine in any way? > > Yes. The typical cost is one clock cycle of read latency. > It might seem obviously easy to overlap the ECC check when it usually > passes, but you can't really hide all of the cost. The memory-read path is > always latency-critical. Thanks, Don! That helps a lot. Best wishes, Jack > > -- > Donald Becker becker at scyld.com > Scyld Computing Corporation http://www.scyld.com > 914 Bay Ridge Road, Suite 220 Scyld Beowulf cluster system > Annapolis MD 21403 410-990-9993 > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From dtj at uberh4x0r.org Wed Jul 2 14:24:33 2003 From: dtj at uberh4x0r.org (Dean Johnson) Date: 02 Jul 2003 13:24:33 -0500 Subject: sharing a power supply In-Reply-To: <3F03192E.4040904@andorra.ad> References: <3F03192E.4040904@andorra.ad> Message-ID: <1057170273.26434.57.camel@terra> On Wed, 2003-07-02 at 12:41, Alan Ward wrote: > Dear listpeople, > > I am building a small beowulf with the following configuration: > > - 4 motherboards w/ onboard Ethernet > - 1 hard disk > - 1 (small) switch > - 1 ATX power supply shared by all boards > > The intended boot sequence is the classical (1) master boots off > hard disk; (2) after a suitable delay, slaves boot off master > with dhcp and root nfs. > > I would appreciate comments on the following: > > a) A 450 W power supply should have ample power for all - > but can it deliver on the crucial +5V and +3.3V lines? 
Has anybody > got real-world intensity measurements on these lines for Athlons > I can compare to the supply's specs? > > b) I hung two motherboards off a single ATX supply. When I hit > the switch on either board, the supply goes on and both motherboards > come to life. Does anybody know a way of keeping the slaves still > until the master has gone through boot? e.g. Use the reset switch? > Can one of the power lines control the PLL on the motherboard? > Use two power supplies, one for the master, one for the slaves. Not an optimal solution. How long will PXE sit around waiting? Is it settable? If it will wait long enough, it won't matter how long it takes for the master to boot. -- -Dean _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From leonard_chvilicek at rac.ray.com Wed Jul 2 13:47:42 2003 From: leonard_chvilicek at rac.ray.com (Leonard Chvilicek) Date: Wed, 2 Jul 2003 12:47:42 -0500 Subject: Dual Itanium2 performance Message-ID: <0258E449E0019844924F40FE68D15B2D5FFE8F@ictxchp02.rac.ray.com> Hello, I was reading in some of the mailing lists that the AMD Opteron dual processor system was getting around 80-90% efficiency on the second processor. I was wondering if that holds true to the Itanium2 platform? I looked through some of the archives and did not find any benchmarks or statistics on this. I found lots of dual Xeons but no dual Itaniums. Thanks in advance .... Leonard Chvilicek Senior IT Strategist I Raytheon Aircraft _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From James.P.Lux at jpl.nasa.gov Wed Jul 2 15:01:57 2003 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed, 02 Jul 2003 12:01:57 -0700 Subject: memory nightmare In-Reply-To: <20030702075618.D6562-100000@euler.salk.edu> Message-ID: <5.2.0.9.2.20030702115149.018931d0@mailhost4.jpl.nasa.gov> At 08:01 AM 7/2/2003 -0700, Jack Wathey wrote: >I need some advice about how to handle some ambiguous results from >memtest86. I also have some general questions about bios options >related to ECC memory. >My understanding is that ECC can correct only single-bit errors, and >so would not help with the kind of multibit errors that have been >troubling me lately. But I have some basic questions on ECC that >you might be able to answer (I've asked the motherboard maker's tech >support, but to no avail!): First off... you're correct that ECC (or, EDAC (error detection and correction)) corrects single bit errors, and detects double bit errors. It's designed to deal with occasional bit flips, usually from radiation (neutrons resulting from cosmic rays, background radiation from the packaging, etc.), and really only addresses errors in the actual memory cells. If you have errors in the data going to and from the memory, ECC does nothing, since the bus itself doesn't have EDAC. The probability of a single bit flip (or upset) is fairly low (I'd be surprised at more than 1 a day), the probability of multiple errors is vanishingly small. One rate I have seen referenced is around 2E-12 upsets/bit/hr. (remember that you won't see an upset in a bit if you don't read it).. There are some other statistics that show an upset occurs in a typical PC-like computer with 256MB of RAM about once a month. 
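As a rough back-of-the-envelope check on those two figures: 256 MB is about 2.1e9 bits, and 2.1e9 bits x 2e-12 upsets/bit/hr is roughly 4e-3 upsets/hr, i.e. about 0.1 per day or a few per month - the same order of magnitude as the once-a-month number quoted above.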
Fermilab has a system called ACPMAPS with 156 Gbit of memory, and they saw about 2.5 upsets/day (7E-13 upset/bit/hr) Lots of interesting information at http://www.boeing.com/assocproducts/radiationlab/publications/SEU_at_Ground_Level.pdf and, of course, the origingal papers from IBM (Ziegler, May and Woods) On all systems I've worked on over the last 20 years that used ECC, multiple bit errors were always a timing or bus problem, i.e. electrical interfaces. If you're getting so many problems, it's indicative of some fundamental misconfiguration or mismatch between what the system wants to see and what your parts actually do. Maybe wait states, voltages, etc. are incorrectly set up? >James Lux, P.E. Spacecraft Telecommunications Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From wathey at salk.edu Wed Jul 2 15:46:37 2003 From: wathey at salk.edu (Jack Wathey) Date: Wed, 2 Jul 2003 12:46:37 -0700 (PDT) Subject: memory nightmare In-Reply-To: <5.2.0.9.2.20030702115149.018931d0@mailhost4.jpl.nasa.gov> Message-ID: <20030702124310.D6682-100000@euler.salk.edu> On Wed, 2 Jul 2003, Jim Lux wrote: > At 08:01 AM 7/2/2003 -0700, Jack Wathey wrote: > > > On all systems I've worked on over the last 20 years that used ECC, > multiple bit errors were always a timing or bus problem, i.e. electrical > interfaces. If you're getting so many problems, it's indicative of some > fundamental misconfiguration or mismatch between what the system wants to > see and what your parts actually do. Maybe wait states, voltages, etc. are > incorrectly set up? > Thanks, Jim. That's most enlightening. Several other respondents alluded to incorrect timing parameters, too. I'll look into this. Best wishes, Jack _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jeffrey.b.layton at lmco.com Wed Jul 2 12:33:44 2003 From: jeffrey.b.layton at lmco.com (Jeff Layton) Date: Wed, 02 Jul 2003 12:33:44 -0400 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: References: Message-ID: <3F030968.7030100@lmco.com> Mark Hahn wrote: > do be certain that your dimms are arranged right - our whitebox vendor > seemed to think that all the dimms should go in cpu0's bank first, > with no inter-bank or inter-node interleaving. performance was ~30% > better under Stream when the dimms were properly distributed and > both kinds of interleaving enabled in bios. > Care to post from Stream numbers as well as the hardware configuration? :) TIA! Jeff -- Jeff Layton Senior Engineer - Aerodynamics and CFD Lockheed-Martin Aeronautical Company - Marietta "Is it possible to overclock a cattle prod?" 
- Irv Mullins _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From wathey at salk.edu Wed Jul 2 15:35:42 2003 From: wathey at salk.edu (Jack Wathey) Date: Wed, 2 Jul 2003 12:35:42 -0700 (PDT) Subject: memory nightmare In-Reply-To: <46008.66.118.77.29.1057171805.squirrel@ra.aeolustec.com> Message-ID: <20030702123131.T6682-100000@euler.salk.edu> On Wed, 2 Jul 2003, Michael T. Prinkey wrote: > I would add only to verify that you have sufficient and consistent power. I > have seen many more "memory" errors caused by malfunctioning power supplies > than by bad memory modules. Good point, but not likely to be the culprit here. Most of the nodes in these tests use 300W pfc power supplies from PC Power & Cooling. They're diskless nodes with no floppy, no cdrom, and no PCI cards except for the video cards, which are there only when I'm running memtest86. Jack _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From wathey at salk.edu Wed Jul 2 14:54:23 2003 From: wathey at salk.edu (Jack Wathey) Date: Wed, 2 Jul 2003 11:54:23 -0700 (PDT) Subject: memory nightmare In-Reply-To: <3F03236A.3050106@wildopensource.com> Message-ID: <20030702114239.R6682-100000@euler.salk.edu> On Wed, 2 Jul 2003, Stephen Gaudet wrote: > Same part number and speed? What does the motherboard manufacture call > for in regards to cas latency 2 or 3? Best is usually 2. I'm pretty sure they're all the same part number and speed, because the supplier fabricated them all at the same time for me. I don't know what the MB maker recommends for cas latency. They recommend setting DDR timing to "Auto" in the bios, which causes the bios to set the timing parameters automatically. That's how I have them set. If that parameter is set to manual, then a whole bunch of parameters, including cas latency, become accessible in the bios menu, but I have never tinkered with those, and the MB manual has no recommended values for them. > Don't rule out the motherboard or processors. I agree with you looks > like ram. However, might turn out to be a bad series of motherboards, > and or processors. Memtest86 also shows cache errors. My own system > here at home had memmory errors and I though for sure it was the ram. > Turned out to be the memory controller chip on the motherboard. > I suppose it's remotely possible, but not likely. All of the boards will run memtest86 for many days, and my number-crunching code for many weeks, with no problems at all, when I use memory from the batch I bought last December. Most of the failing sticks I've encountered since April will fail consistently, whether tested alone or with other sticks, whether tested on my Gigabyte GA7DPXDW-P boards or the Asus A7M266D board that I use in my server. It's only a few sticks in the most recent batch of 69 that are failing in this rare and intermittent way that I can't seem to reproduce when the sticks are tested one per motherboard. 
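One way to chase intermittent sticks like these from inside Linux, so that many boards can soak in parallel and log somewhere central, is a userspace tester such as memtester; a rough sketch, assuming the nodes answer to node01..node10, that memtester and passwordless ssh are set up, and noting that its size-argument conventions differ between versions:

    #!/bin/sh
    # soak-test: run a userspace memory test on every node, collect logs centrally.
    # Node names, log directory and the ~900 MB size are made-up examples.
    mkdir -p /var/log/soak
    for n in node01 node02 node03 node04 node05 node06 node07 node08 node09 node10; do
        ssh "$n" "nice -n 19 memtester 900 3" > "/var/log/soak/$n.log" 2>&1 &
    done
    wait
    grep -l -i fail /var/log/soak/*.log   # exact failure string depends on the memtester version

It will never be as thorough as memtest86, since the kernel keeps part of the address space to itself, but it can run unattended for days on otherwise idle nodes.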
Jack

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From wathey at salk.edu  Wed Jul  2 16:43:03 2003
From: wathey at salk.edu (Jack Wathey)
Date: Wed, 2 Jul 2003 13:43:03 -0700 (PDT)
Subject: sharing a power supply
In-Reply-To: <3F03192E.4040904@andorra.ad>
Message-ID: <20030702130036.J6682-100000@euler.salk.edu>

On Wed, 2 Jul 2003, Alan Ward wrote:

> I would appreciate comments on the following:
>
> a) A 450 W power supply should have ample power for all -
> but can it deliver on the crucial +5V and +3.3V lines? Has anybody
> got real-world intensity measurements on these lines for Athlons
> I can compare to the supply's specs?

I made these measurements for my diskless dual-Athlon nodes. They are Gigabyte Technologies GA7DPXDW-P, with MP2200+ processors. They have on-board NIC, which I use, but otherwise they are stripped down to the bare essentials: just motherboard, 2 cpus with coolers, and memory. No video card, no pci cards of any kind, no floppy, no cdrom, etc. They have 2 power connectors: the standard 20-pin ATX connector and a square 4-pin connector that supplies 12V to the board.

I did the measurements by putting a 0.005 ohm precision resistor (www.mouser.com, part #71-WSR-2-0.005) in series with each of the 5V, 3.3V and 12V lines, and then measuring the voltage across that. Rather than cut up the wires of a power supply, I cut up the wires of extension cables:

http://www.cablesamerica.com/product.asp?cat%5Fid=604&sku=22998
http://www.cablesamerica.com/product.asp?cat%5Fid=604&sku=27314

There are multiple wires in these cables for each voltage. Obviously you need to be careful to cut and solder together the right ones. A motherboard manual should give you the pinout details.

Here are the results I got for my nodes:

cpus     memory installed    voltage line    current drawn
------   ----------------    ------------    -------------
idle     2GB (2 sticks)      +5V             13.1A
loaded   2GB (2 sticks)      +5V             17.1A
idle     2GB (2 sticks)      +3.3V           0.34A
loaded   2GB (2 sticks)      +3.3V           0.34A
idle     2GB (2 sticks)      +12V            4.2A
loaded   2GB (2 sticks)      +12V            5.3A
idle     4GB (4 sticks)      +5V             15.3A
loaded   4GB (4 sticks)      +5V             19.7A
idle     4GB (4 sticks)      +3.3V           0.34A
loaded   4GB (4 sticks)      +3.3V           0.34A
idle     4GB (4 sticks)      +12V            4.2A
loaded   4GB (4 sticks)      +12V            5.3A

For my stripped-down nodes, only the +5V line turns out to be crucial. You might want to repeat the measurements yourself, especially if your nodes have more hardware plugged into them than mine.

Hope this helps,
Jack

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From johnt at quadrics.com  Tue Jul  1 11:18:19 2003
From: johnt at quadrics.com (John Taylor)
Date: Tue, 1 Jul 2003 16:18:19 +0100
Subject: interconnect latency, dissected.
Message-ID: <010C86D15E4D1247B9A5DD312B7F5AA7CCDC96@stegosaurus.bristol.quadrics.com>

I agree with Joachim et al on the merit of the paper. In relation to IB there has been some work at Ohio State, comparing Myrinet and QsNet. The latter however only discusses MPI, where the UPC group in the former, quite correctly IMHO, discuss lower level APIs that suit better some applications and algorithms as well as being the target of specific compiler environments.
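(For reference: the mping and sping listings quoted in this thread come from Quadrics' own tools. A rough MPI-level equivalent on any interconnect is the usual ping-pong loop between two ranks; the sketch below is a minimal, generic example, nothing Quadrics- or Myrinet-specific, and its numbers will include whatever overhead the MPI library adds on top of the low-level API.)

/*
 * pingpong.c - minimal MPI ping-pong latency sketch.
 * Reports half of the average round-trip time for a given message size.
 * Build: mpicc -O2 -o pingpong pingpong.c    Run on exactly 2 ranks.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i, iters = 1000, bytes = 8;
    char *buf;
    double t0 = 0.0, t1;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Finalize();
        return 1;
    }
    if (argc > 1) bytes = atoi(argv[1]);
    buf = calloc(bytes > 0 ? bytes : 1, 1);

    /* 100 warm-up round trips, then the timed loop */
    for (i = 0; i < iters + 100; i++) {
        if (i == 100) {
            MPI_Barrier(MPI_COMM_WORLD);
            t0 = MPI_Wtime();
        }
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d bytes: one-way latency ~ %.2f usec\n",
               bytes, (t1 - t0) / iters / 2.0 * 1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}

Half the round trip is the conventional way these one-way figures are quoted; bandwidth at larger sizes is just bytes divided by the same one-way time.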
On the paper specifically at Berkeley my only concern is that there is no mention on the influence of the PCI-Bridge implementation, not withstanding its specification. For instance the system at ORNL is based on ES40 which on a similar system gives an 8byte latency so... prun -N2 mping 0 8 1 pinged 0: 0 bytes 7.76 uSec 0.00 MB/s 1 pinged 0: 1 bytes 8.11 uSec 0.12 MB/s 1 pinged 0: 2 bytes 8.06 uSec 0.25 MB/s 1 pinged 0: 4 bytes 8.35 uSec 0.48 MB/s 1 pinged 0: 8 bytes 8.20 uSec 0.98 MB/s . . . 1 pinged 0: 524288 bytes 2469.61 uSec 212.30 MB/s 1 pinged 0: 1048576 bytes 4955.28 uSec 211.61 MB/s similar to the latency and bandwidth achieved for the author's benchmark. whereas the same code on the same Quadrics hardware running on a Xeon (GC-LE) platform gives prun -N2 mping 0 8 1 pinged 0: 0 bytes 4.31 uSec 0.00 MB/s 1 pinged 0: 1 bytes 4.40 uSec 0.23 MB/s 1 pinged 0: 2 bytes 4.40 uSec 0.45 MB/s 1 pinged 0: 4 bytes 4.39 uSec 0.91 MB/s 1 pinged 0: 8 bytes 4.38 uSec 1.83 MB/s . . . 1 pinged 0: 524288 bytes 1632.61 uSec 321.13 MB/s 1 pinged 0: 1048576 bytes 3252.28 uSec 322.41 MB/s It may also be the case that the Myrinet performance could also be improved (it is stated as PCI 32/66 in the paper) based on benchmarking a more recent PCI-bridge. John Taylor Quadrics Limited http://www.quadrics.com > -----Original Message----- > From: Joachim Worringen [mailto:joachim at ccrl-nece.de] > Sent: 01 July 2003 09:03 > To: Beowulf mailinglist > Subject: Re: interconnect latency, dissected. > > > James Cownie: > > Mark Hahn wrote: > > > does anyone have references handy for recent work on interconnect > > > latency? > > > > Try http://www.cs.berkeley.edu/~bonachea/upc/netperf.pdf > > > > It doesn't have Inifinband, but does have Quadrics, Myrinet > 2000, GigE and > > IBM. > > Nice paper showing interesting properties. But some metrics > seem a little bit > dubious to me: in 5.2, they seem to see an advantage if the "overlap > potential" is higher (when they compare Quadrics and Myrinet) > - which usually > just results in higher MPI latencies, as this potiential (on > small messages) > can not be exploited. Even with overlapping mulitple communication > operations, the faster interconnect remains faster. This is > especially true > for small-message latency. > > From the contemporary (cluster) interconnects, SCI is missing next to > Infiniband. It would have been interesting to see the results > for SCI as it > has a very different communication model than most of the > other interconnects > (most resembling the T3E one). > > Joachim > > -- > Joachim Worringen - NEC C&C research lab St.Augustin > fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) > visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bruno at rocksclusters.org Wed Jul 2 14:27:05 2003 From: bruno at rocksclusters.org (Greg Bruno) Date: Wed, 2 Jul 2003 11:27:05 -0700 Subject: [Rocks-Discuss]Dual Itanium2 performance In-Reply-To: <0258E449E0019844924F40FE68D15B2D5FFE8F@ictxchp02.rac.ray.com> Message-ID: > I was reading in some of the mailing lists that the AMD Opteron dual > processor system was getting around 80-90% efficiency on the second > processor. 
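(Two different percentages circulate in the exchange that follows, and they are easy to conflate: "efficiency on the second processor" is a scaling figure - one-CPU time divided by twice the two-CPU time - while the linpack results quoted below are fractions of theoretical peak. A small sketch of both formulas, with placeholder inputs rather than anyone's measured values:)

/*
 * efficiency.c - the two "efficiency" figures discussed in this thread.
 * All input numbers below are hypothetical placeholders.
 */
#include <stdio.h>

int main(void)
{
    /* scaling efficiency: what the 2nd cpu adds to a fixed job */
    double t1 = 100.0;          /* wall time on 1 cpu, seconds (placeholder) */
    double t2 = 55.0;           /* wall time on 2 cpus, seconds (placeholder) */
    double scaling = t1 / (2.0 * t2);

    /* fraction of theoretical peak, as in the linpack numbers */
    double peak_per_cpu = 4.0;  /* GFLOP/s per cpu (placeholder) */
    double measured = 6.9;      /* GFLOP/s measured on 2 cpus (placeholder) */
    double of_peak = measured / (2.0 * peak_per_cpu);

    printf("scaling efficiency on the 2nd cpu: %.0f%%\n", 100.0 * scaling);
    printf("fraction of 2-cpu peak:            %.0f%%\n", 100.0 * of_peak);
    return 0;
}

A code can scale at 90% and still sit far below peak, which is why the production-code caveat raised later in the digest matters.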
just curious -- what benchmark was being used? > I was wondering if that holds true to the Itanium2 platform? > I looked through some of the archives and did not find any benchmarks > or > statistics on this. I found lots of dual Xeons but no dual Itaniums. running linpack and linking against the goto blas (http://www.cs.utexas.edu/users/flame/goto/), a two-cpu opteron achieved 87% of peak. a two-cpu itanium 2 achieved 98% of peak. - gb _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jducom at nd.edu Wed Jul 2 17:56:42 2003 From: jducom at nd.edu (Jean-Christophe Ducom) Date: Wed, 02 Jul 2003 16:56:42 -0500 Subject: 3ware Escalade 8500 Serial ATA RAID Message-ID: <3F03551A.8030608@nd.edu> Did anybody try this card? What are the performances compared to the parallel ATA? How stable is the driver on Linux? Thank you JC _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From alvin at Mail.Linux-Consulting.com Wed Jul 2 17:51:44 2003 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Wed, 2 Jul 2003 14:51:44 -0700 (PDT) Subject: sharing a power supply In-Reply-To: <3F03192E.4040904@andorra.ad> Message-ID: hi ya hang a 100uf or 1000uf ( +50v or +100v ) electrolytic capacitor across the mb power-on switch to slow down its power-on signal ... or do a extra resistor-capacitor circuit .. -- dont run 4 mb off one power supply.. you'd probably exceed the current output of the power supply - it will work.. it will just run hot and soon die ( 1/2 life rule for every 10C increase in temp ) c ya alvin On Wed, 2 Jul 2003, Alan Ward wrote: > Dear listpeople, > > I am building a small beowulf with the following configuration: > > - 4 motherboards w/ onboard Ethernet > - 1 hard disk > - 1 (small) switch > - 1 ATX power supply shared by all boards > > The intended boot sequence is the classical (1) master boots off > hard disk; (2) after a suitable delay, slaves boot off master > with dhcp and root nfs. > > I would appreciate comments on the following: > > a) A 450 W power supply should have ample power for all - > but can it deliver on the crucial +5V and +3.3V lines? Has anybody > got real-world intensity measurements on these lines for Athlons > I can compare to the supply's specs? > > b) I hung two motherboards off a single ATX supply. When I hit > the switch on either board, the supply goes on and both motherboards > come to life. Does anybody know a way of keeping the slaves still > until the master has gone through boot? e.g. Use the reset switch? > Can one of the power lines control the PLL on the motherboard? _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hahn at physics.mcmaster.ca Wed Jul 2 19:04:03 2003 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed, 2 Jul 2003 19:04:03 -0400 (EDT) Subject: 3ware Escalade 8500 Serial ATA RAID In-Reply-To: <3F03551A.8030608@nd.edu> Message-ID: > Did anybody try this card? What are the performances compared to the parallel > ATA? How stable is the driver on Linux? it's just their 7500 card with sata translators on the ports; I can't see how pata/sata would make any difference. 
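(On the performance half of the 8500 question: for a first-order comparison between controllers, a streaming-read timer - or tools like hdparm -t or bonnie - is usually enough to see whether the SATA translators cost anything. Below is a minimal sketch; the path and size are arguments, and reading anything smaller than RAM mostly measures the page cache rather than the array.)

/*
 * readbw.c - rough sequential read bandwidth sketch.
 * Reads a file or block device in 1 MB chunks and reports MB/s.
 * Build: gcc -O2 -o readbw readbw.c
 * Usage: ./readbw <path> [megabytes]     (default 1024 MB)
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    const size_t chunk = 1 << 20;
    char *buf = malloc(chunk);
    long long want, got = 0;
    int fd;
    struct timeval t0, t1;
    double secs, mb;

    if (argc < 2 || buf == NULL) {
        fprintf(stderr, "usage: %s <path> [megabytes]\n", argv[0]);
        return 1;
    }
    want = ((argc > 2) ? atoll(argv[2]) : 1024) << 20;

    fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    gettimeofday(&t0, NULL);
    while (got < want) {
        ssize_t r = read(fd, buf, chunk);
        if (r <= 0) break;              /* EOF or error */
        got += r;
    }
    gettimeofday(&t1, NULL);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    mb = got / (1024.0 * 1024.0);
    printf("read %.0f MB in %.2f s: %.1f MB/s\n", mb, secs, mb / secs);
    close(fd);
    free(buf);
    return 0;
}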
I've had good luck with my 7500-8, but have heard others both complain and praise them. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From leonard_chvilicek at rac.ray.com Wed Jul 2 16:20:25 2003 From: leonard_chvilicek at rac.ray.com (Leonard Chvilicek) Date: Wed, 2 Jul 2003 15:20:25 -0500 Subject: [Rocks-Discuss]Dual Itanium2 performance Message-ID: <0258E449E0019844924F40FE68D15B2D5FFE90@ictxchp02.rac.ray.com> The code that they were using was a CFD code called TAU and they were getting over 90% efficiency on the 2nd processor on the Dual Opteron system. Thanks for your information Tim & Greg Have a great 4th of July! Leonard -----Original Message----- From: Greg Bruno [mailto:bruno at rocksclusters.org] Sent: Wednesday, July 02, 2003 1:27 PM To: Leonard Chvilicek Cc: beowulf at beowulf.org; npaci-rocks-discussion at sdsc.edu Subject: Re: [Rocks-Discuss]Dual Itanium2 performance > I was reading in some of the mailing lists that the AMD Opteron dual > processor system was getting around 80-90% efficiency on the second > processor. just curious -- what benchmark was being used? > I was wondering if that holds true to the Itanium2 platform? I looked > through some of the archives and did not find any benchmarks or > statistics on this. I found lots of dual Xeons but no dual Itaniums. running linpack and linking against the goto blas (http://www.cs.utexas.edu/users/flame/goto/), a two-cpu opteron achieved 87% of peak. a two-cpu itanium 2 achieved 98% of peak. - gb _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jcookeman at yahoo.com Wed Jul 2 20:36:52 2003 From: jcookeman at yahoo.com (Justin Cook) Date: Wed, 2 Jul 2003 17:36:52 -0700 (PDT) Subject: SuSE 8.2 and LAM-MPI 7.0 Message-ID: <20030703003652.1234.qmail@web10606.mail.yahoo.com> Gents and Ladies, I am new to the Beowulf arena. I am trying to get a diskless cluster up with SuSE 8.2 and LAM-MPI 7.0. I plan on using nfs-root and nfs for all of the mount points. If I do a minimal install with gcc and install lam-mpi for my slave-node images am I on the right track? Does anyone have a better solution for me? Justin __________________________________ Do you Yahoo!? SBC Yahoo! DSL - Now only $29.95 per month! http://sbc.yahoo.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From anand at novaglobal.com.sg Wed Jul 2 22:01:02 2003 From: anand at novaglobal.com.sg (Anand Vaidya) Date: Thu, 3 Jul 2003 10:01:02 +0800 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <20030701224808.GA15167@stikine.ucs.sfu.ca> References: <20030701224808.GA15167@stikine.ucs.sfu.ca> Message-ID: <200307031001.05515.anand@novaglobal.com.sg> I have tested a Dual Opteron with Mandrake and RedHat Linux. (MSI board with 4GB, and Avant 1U) Mandrake did not have ISO images (when I downloaded) so I had to download the files & install via NFS. There were lot of problems though. 
Download it from ftp://ftp.leo.org/pub/comp/os/unix/linux/Mandrake/Mandrake/9.0/x86_64

RedHat GinGin which is RH's version of RHL for Opteron (64bit) can be downloaded from ftp://ftp.redhat.com/pub/redhat/linux/preview/gingin64/en/iso/x86_64/ as ISO images.

RH installed and ran extremely well. We did run some benchmarks (smp jobs). Pretty impressive!

HTH
-Anand

On Wednesday 02 July 2003 06:48 am, Martin Siegert wrote:
> Hello,
>
> I have a dual AMD Opteron for a week or so as a demo and try to install
> Linux on it - so far with little success.
>
> First of all: doing a google search for x86-64 Linux turns up a lot of
> press releases but not much more, particularly nothing one could download
> and install. Even a direct search on the SuSE and Mandrake sites shows
> only press releases. Sigh.
>
> Anyway: I found a few ftp sites that supply a Mandrake-9.0 x86_64 version.
> Thus I did a ftp installation which after (many) hickups actually worked.
> However, that distribution does not support the onboard Broadcom 5704
> NICs. I also could not get the driver from the broadcom web site to work
> (insmod fails with "could not find MAC address in NVRAM").
>
> Thus I tried to compile the 2.4.21 kernel which worked, but
> "insmod tg3" freezes the machine instantly.
>
> Thus, so far I am not impressed.
>
> For those of you who have such a box: which distribution are you using?
> Any advice on how to get those GigE Broadcom NICs to work?
>
> Cheers,
> Martin

--
------------------------------------------------------------------------------
Regards,
Anand Vaidya
Technical Manager
NovaGlobal Pte Ltd
Tel: (65) 6238 6400
Fax: (65) 6238 6401
Mo: (65) 9615 7317
http://www.novaglobal.com.sg/
------------------------------------------------------------------------------
Fortune Cookie for today:
------------------------------------------------------------------------------

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From torsten at howard.cc  Wed Jul  2 20:37:59 2003
From: torsten at howard.cc (torsten)
Date: Wed, 2 Jul 2003 20:37:59 -0400
Subject: Kickstart Help
Message-ID: <20030702203759.1232970b.torsten@howard.cc>

Hello All,

RedHat 9.0, headless node

I'm working on a bootable-CD-ROM (NFS-mounted distro) kickstart installation method. When the computer boots, it gives me the
boot:
prompt, and waits. I have to type in
linux ks=cdrom:/ks.cfg
to get it going. Is there any way to make this automatic?

During the install, it gets through to the aspell-ca-somevesrsion package and stops, saying it is corrupt. I haven't checked if it is corrupt, or even exists, because I only copied one CD-ROM (disc1). How do I control which packages are installed (since only a bare minimum are needed, as this is a headless node)?

Thanks for any pointers.

Torsten

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From csheer at hotmail.com  Thu Jul  3 03:22:27 2003
From: csheer at hotmail.com (John Shea)
Date: Thu, 03 Jul 2003 00:22:27 -0700
Subject: Java Beowulf Cluster
Message-ID:

For those who are interested in building beowulf cluster using Java, here is a great software package you can try out at: http://www.GreenTeaTech.com.

John

-----------------------------------------------------------------------------------------------------------------------------------
Build your own GreenTea Network Computer at home, in the office, or on the Internet. Check it all out at http://www.GreenTeaTech.com
----------------------------------------------------------------------------------------------------------------------------------

_________________________________________________________________
MSN 8 with e-mail virus protection service: 2 months FREE* http://join.msn.com/?page=features/virus

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From joachim at ccrl-nece.de  Thu Jul  3 04:22:08 2003
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Thu, 3 Jul 2003 10:22:08 +0200
Subject: interconnect latency, dissected.
In-Reply-To: <010C86D15E4D1247B9A5DD312B7F5AA7CCDC9D@stegosaurus.bristol.quadrics.com>
References: <010C86D15E4D1247B9A5DD312B7F5AA7CCDC9D@stegosaurus.bristol.quadrics.com>
Message-ID: <200307031022.08268.joachim@ccrl-nece.de>

John Taylor:
> For completeness here is the shmem_put performance on a new bridge.
>
> prun -N2 sping -f put -b 1000 0 8
> 1:          4 bytes     1.60 uSec     2.50 MB/s
> 1:          8 bytes     1.60 uSec     5.00 MB/s
> 1:         16 bytes     1.58 uSec    10.11 MB/s

The latency decrease is impressive for this bridge - which one is it? Can you tell?
Joachim

--
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From joachim at ccrl-nece.de  Thu Jul  3 07:08:03 2003
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Thu, 3 Jul 2003 13:08:03 +0200
Subject: interconnect latency, dissected.
In-Reply-To: <010C86D15E4D1247B9A5DD312B7F5AA7CCDCC1@stegosaurus.bristol.quadrics.com>
References: <010C86D15E4D1247B9A5DD312B7F5AA7CCDCC1@stegosaurus.bristol.quadrics.com>
Message-ID: <200307031308.03813.joachim@ccrl-nece.de>

John Taylor:
> This result was achieved on a ServerWorks GC-LE within a HP Proliant DL380
> G3.

Hmm, this is not really a "new" bridge - or is it modified for HP? The other numbers (4.4us for Xeon) that you gave were also achieved on a GC-LE system. Where's the difference?

Joachim

--
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From johnt at quadrics.com  Thu Jul  3 06:39:21 2003
From: johnt at quadrics.com (John Taylor)
Date: Thu, 3 Jul 2003 11:39:21 +0100
Subject: interconnect latency, dissected.
Message-ID: <010C86D15E4D1247B9A5DD312B7F5AA7CCDCC1@stegosaurus.bristol.quadrics.com>

This result was achieved on a ServerWorks GC-LE within a HP Proliant DL380 G3.

> -----Original Message-----
> From: Joachim Worringen [mailto:joachim at ccrl-nece.de]
> Sent: 03 July 2003 09:22
> To: John Taylor; 'beowulf at beowulf.org'
> Subject: Re: interconnect latency, dissected.
>
> John Taylor:
> > For completeness here is the shmem_put performance on a new bridge.
> >
> > prun -N2 sping -f put -b 1000 0 8
> > 1:          4 bytes     1.60 uSec     2.50 MB/s
> > 1:          8 bytes     1.60 uSec     5.00 MB/s
> > 1:         16 bytes     1.58 uSec    10.11 MB/s
>
> The latency decrease is impressive for this bridge - which
> one is it? Can you
> tell?
> > Joachim > > -- > Joachim Worringen - NEC C&C research lab St.Augustin > fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From robert.crosbie at tchpc.tcd.ie Thu Jul 3 07:56:55 2003 From: robert.crosbie at tchpc.tcd.ie (Robert bobb Crosbie) Date: Thu, 3 Jul 2003 12:56:55 +0100 Subject: Kickstart Help In-Reply-To: <20030703011206.6d22b1b6.torsten@howard.cc> References: <20030703011206.6d22b1b6.torsten@howard.cc> Message-ID: <20030703115655.GB6647@tchpc01.tcd.ie> torsten hath declared on Thursday the 03 day of July 2003 :-: > Hello All, > > RedHat 9.0, headless node > > I'm working on a bootable-CD-ROM (NFS-mounted distro) kickstart > installation method. When the computer boots, it give me the > boot: > prompt, and waits. I havae to type in > linux ks=cdrom:/ks.cfg > to get it going. Is there any way to make this automatic? I have done this with a bootnet floppy on the 7.x series a number of times. mount the bootnet.img on a loopback ``mount -o loop bootnet.img /mnt'' then edit /mnt/syslinux.cfg and added: label ksfloppy kernel vmlinuz append "ks=floppy" initrd=initrd.img lang= lowres devfs=nomount ramdisk_size=8192 Then set "ksfloppy" to the default with: default ksfloppy We generally get the ks.cfg over nfs which might be handier if your going to be booting from cdrom, with something like the following: label ksnfs kernel vmlinuz append "ks=nfs:11.22.33.44:/kickstart/7.3/" initrd=initrd.img lang=lowres devfs=nomount ramdisk_size=8192 (Installing a machine with the IP 4.3.2.1 will then look for the file "/kickstart/7.3/4.3.2.1-kickstart" on the nfs server, we just use symlinks). Then umount /mnt and dd the image to floppy. I presume you could do something similar by mounting the ISO and editing /mnt/isolinux/isolinux.cfg, although I have never tried it. > During the install, it gets through to aspell-ca-somevesrsion package > and stops, saying it is corrupt. I haven't checked if it is corrupt, or > even exists, because I only copied one CD-ROM (disc1). How do I control > which packages are installed Under the "%packages" section of the ks.cfg you can specify either package collections "Software Developement" or individual packages "gcc" to be installed. A snippit from our ks.cfg for 7.3 workstation installs looks like: %packages --resolvedeps @Classic X Window System @GNOME @Software Development [...etc...] ntp vim-enhanced vim-X11 xemacs gv [...etc...] > (since only a bare minimum are needed, as this is a headless node)? Getting the package list setup is a little bit of trial and error, but you get there in the end :) HTH, - bobb -- Robert "bobb" Crosbie. Trinity Centre for High Performance Computing, O'Reilly Institute,Trinity College Dublin. Tel: +353 1 608 3725 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jlb17 at duke.edu Thu Jul 3 08:14:36 2003 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Thu, 3 Jul 2003 08:14:36 -0400 (EDT) Subject: Kickstart Help In-Reply-To: <20030703010006.65ab487a.torsten@howard.cc> Message-ID: On Thu, 3 Jul 2003 at 1:00am, torsten wrote > I'm working on a bootable-CD-ROM (NFS-mounted distro) kickstart > installation method. 
When the computer boots, it give me the > boot: > prompt, and waits. I havae to type in > linux ks=cdrom:/ks.cfg > to get it going. Is there any way to make this automatic? Modify syslinux.cfg to have the default be your ks entry. Also, crank down the timeout. > During the install, it gets through to aspell-ca-somevesrsion package > and stops, saying it is corrupt. I haven't checked if it is corrupt, or > even exists, because I only copied one CD-ROM (disc1). How do I control > which packages are installed (since only a bare minimum are needed, as > this is a headless node)? You control the packages in the, err, %packages section of the ks.cfg. You can specify families and individual packages in there, as well as specifying packages not to install. Kickstart is pretty well documented. All the options are listed here: http://www.redhat.com/docs/manuals/linux/RHL-9-Manual/custom-guide/s1-kickstart2-options.html -- Joshua Baker-LePain Department of Biomedical Engineering Duke University _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From wrankin at ee.duke.edu Thu Jul 3 08:27:55 2003 From: wrankin at ee.duke.edu (Bill Rankin) Date: 03 Jul 2003 08:27:55 -0400 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <200307030459.h634x5Y12821@NewBlue.Scyld.com> References: <200307030459.h634x5Y12821@NewBlue.Scyld.com> Message-ID: <1057235275.2186.22.camel@rohgun.cse.duke.edu> Anand Vaidya : > RedHat GinGin which is RH's version of RHL for Opteron (64bit) can be > downloaded from > ftp://ftp.redhat.com/pub/redhat/linux/preview/gingin64/en/iso/x86_64/ > as ISO images. > > RH installed and ran extremely well. We did run some benchmarks (smp jobs). > Pretty impressive! I am also running Gingin64 on a Penguin Computing dual Opteron which uses the Broadcom NICs. It is running fine at this moment with no complaints. The only issues were: 1 - No floppy boot/install image. Must boot from CD or (in my case) PXE boot and install. 2 - IIRC, the Broadcom NIC was not properly recognized, but using the one Broadcom NIC entry in the install list (forgot the model number) works fine. Do a google for "gingin64" and it should get you the links. There is a mailing list on Redhat for AMD64 https://listman.redhat.com/mailman/listinfo/amd64-list Performance wise, using the stock 64 bit gcc on my molecular dynamics codes shows overall performance of the 1.4 GHz Opteron 240 to be on par with Xeon 2.4s. YMMV. - bill _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bvds at bvds.geneva.edu Thu Jul 3 08:42:29 2003 From: bvds at bvds.geneva.edu (bvds at bvds.geneva.edu) Date: Thu, 3 Jul 2003 08:42:29 -0400 Subject: Linux support for AMD Opteron with Broadcom NICs Message-ID: <200307031242.h63CgTn02594@bvds.geneva.edu> Simon Hogg wrote: >However, as far as I am aware, it should be possible to install a vanilla >x86-32 distribution and recompile everything for 64-bit (with a recent GCC >(3.3 is the best bet at the moment I suppose)). I attempted this: start with 32-bit RedHat 9 and gradually move up to 64 bit. It proved to be rather difficult since you need to compile a 64-bit kernel and you need to install gcc as a cross-compiler to do this. 
And then you would need to figure out how to handle the 32- and 64-bit libraries, yuck! I found it much easier to start over with gingin64 (which has worked well for me). I found no advantage to installing a 32-bit OS. Brett van de Sande _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From angel at wolf.com Thu Jul 3 09:27:03 2003 From: angel at wolf.com (Angel Rivera) Date: Thu, 03 Jul 2003 13:27:03 GMT Subject: 3ware Escalade 8500 Serial ATA RAID In-Reply-To: References: Message-ID: <20030703132703.24703.qmail@houston.wolf.com> Mark Hahn writes: >> Did anybody try this card? What are the performances compared to the parallel >> ATA? How stable is the driver on Linux? > > it's just their 7500 card with sata translators on the ports; > I can't see how pata/sata would make any difference. > > I've had good luck with my 7500-8, but have heard others both > complain and praise them. We are using the 7500-8 to the tune of 20 of them in 10 boxes (28TB) in one rack and we are rather impressed with the card. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From award at andorra.ad Thu Jul 3 09:21:35 2003 From: award at andorra.ad (Alan Ward) Date: Thu, 03 Jul 2003 15:21:35 +0200 Subject: sharing a power supply References: <3F03192E.4040904@andorra.ad> Message-ID: <3F042DDF.9000700@andorra.ad> Thanks to everybody for the help. My final set-up will probably look like: - master node on a 300W supply - three slaves on a 450W supply. I am counting on the following maximum draws for each motherboard (Duron at 1300 + 512 MB RAM): 15A / 5V <1A / 3.3V 5A / 12V This is _just_ inside the 450W supply's specs - I hope they were not overly optimistic. On the other hand, a good 350W supply can power up a dual with 1GB RAM ... Best regards, Alan Ward _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From timm at fnal.gov Thu Jul 3 09:58:34 2003 From: timm at fnal.gov (Steven Timm) Date: Thu, 3 Jul 2003 08:58:34 -0500 (CDT) Subject: interconnect latency, dissected. In-Reply-To: <200307031308.03813.joachim@ccrl-nece.de> Message-ID: We also saw streams numbers that were much higher than expected while using a HP Proliant DL360 (compared to machines from other vendors that were supposedly using the exact same chipset, memory, and CPU speed.) HP didn't have an explanation for the increase. Steve ------------------------------------------------------------------ Steven C. Timm (630) 840-8525 timm at fnal.gov http://home.fnal.gov/~timm/ Fermilab Computing Division/Core Support Services Dept. Assistant Group Leader, Scientific Computing Support Group Lead of Computing Farms Team On Thu, 3 Jul 2003, Joachim Worringen wrote: > John Taylor: > > This result was achieved on a ServerWorks GC-LE within a HP Proliant DL380 > > G3. > > Hmm, this is not really a "new" bridge - or is it modified for HP? The other > numbers (4.4us for Xeon) that you gave where also achieved on a GC-LE system. > Where's the difference? 
> Joachim
> --
> Joachim Worringen - NEC C&C research lab St.Augustin
> fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe)
> visit http://www.beowulf.org/mailman/listinfo/beowulf
>

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From becker at scyld.com  Thu Jul  3 09:29:14 2003
From: becker at scyld.com (Donald Becker)
Date: Thu, 3 Jul 2003 06:29:14 -0700 (PDT)
Subject: Java Beowulf Cluster

On Thu, 3 Jul 2003, John Shea wrote:

> Date: Thu, 03 Jul 2003 00:22:27 -0700
> From: John Shea
> To: beowulf at beowulf.org
> Subject: Java Beowulf Cluster
>
> For those who are interested in building beowulf cluster using Java, here is
> a great software
> package you can try out at: http://www.--GreenTeaTech.com.

Sorry about this obvious no-content marketing shill... This person subscribed and immediately posted this message. A quick search shows the same type of marketing on many other mailing lists, usually posing as a unrelated user e.g.
http://webnews.kornet.net/view.cgi?group=comp.parallel.pvm&msgid=9875
https://mailer.csit.fsu.edu/pipermail/java-for-cse/2001/000013.html

BTW Greg, this person is actually Chris Xie, a marketing person at the company.

--
Donald Becker                           becker at scyld.com
Scyld Computing Corporation             http://www.scyld.com
914 Bay Ridge Road, Suite 220           Scyld Beowulf cluster system
Annapolis MD 21403                      410-990-9993

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From jeffrey.b.layton at lmco.com  Thu Jul  3 11:51:59 2003
From: jeffrey.b.layton at lmco.com (Jeff Layton)
Date: Thu, 03 Jul 2003 11:51:59 -0400
Subject: Opteron benchmark numbers
Message-ID: <3F04511F.8030903@lmco.com>

Hello,

I don't know if everyone has seen these results yet, but here's a link to some Opteron numbers for a small (4 node of dual) cluster:

http://mpc.uci.edu/opteron.html

Enjoy!

Jeff

--
Jeff Layton
Chart Monkey - Aerodynamics and CFD
Lockheed-Martin Aeronautical Company - Marietta

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From erwan at mandrakesoft.com  Thu Jul  3 03:41:35 2003
From: erwan at mandrakesoft.com (Erwan Velu)
Date: 03 Jul 2003 09:41:35 +0200
Subject: Linux support for AMD Opteron with Broadcom NICs
In-Reply-To: <20030701224808.GA15167@stikine.ucs.sfu.ca>
References: <20030701224808.GA15167@stikine.ucs.sfu.ca>
Message-ID: <1057218095.2268.19.camel@revolution.mandrakesoft.com>

> Anyway: I found a few ftp sites that supply a Mandrake-9.0 x86_64 version.
> Thus I did a ftp installation which after (many) hickups actually worked.
> However, that distribution does not support the onboard Broadcom 5704
> NICs. I also could not get the driver from the broadcom web site to work
> (insmod fails with "could not find MAC address in NVRAM").

I will have a look on that point because MandrakeLinux for opteron owns the bcm5700 driver. Could you send me the PCI-ID of your card ?

> For those of you who have such a box: which distribution are you using?

The MandrakeClustering product (http://www.mandrakeclustering.com) has been shown during ISC2003 at Heidelberg (www.isc2003.org) on dual opteron systems. People who want to test it can contact me directly.
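(For the PCI-ID request: "lspci -n" from pciutils prints it directly. On a stripped-down node without pciutils, the same numbers can be pulled out of /proc; the sketch below assumes the 2.4-kernel /proc/bus/pci/devices layout, in which the first hex field packs bus and devfn and the second packs vendor and device ID into one word.)

/*
 * pciids.c - print PCI vendor:device IDs without pciutils (sketch).
 * Assumes the 2.4-kernel /proc/bus/pci/devices format: first hex field
 * is bus<<8|devfn, second is vendor<<16|device.  "lspci -n" gives the
 * same information.
 */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/bus/pci/devices", "r");
    unsigned int dfn, ids;
    char rest[512];

    if (f == NULL) {
        perror("/proc/bus/pci/devices");
        return 1;
    }
    while (fscanf(f, "%x %x", &dfn, &ids) == 2) {
        printf("bus %02x dev %02x.%x  id %04x:%04x\n",
               dfn >> 8, (dfn >> 3) & 0x1f, dfn & 0x7,
               ids >> 16, ids & 0xffff);
        if (fgets(rest, sizeof(rest), f) == NULL)   /* skip rest of line */
            break;
    }
    fclose(f);
    return 0;
}

A Broadcom NIC should show up with vendor ID 14e4; the second half of that id field is the device ID Erwan is asking for.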
Best regards, -- Erwan Velu Linux Cluster Distribution Project Manager MandrakeSoft 43 rue d'aboukir 75002 Paris Phone Number : +33 (0) 1 40 41 17 94 Fax Number : +33 (0) 1 40 41 92 00 Web site : http://www.mandrakesoft.com OpenPGP key : http://www.mandrakesecure.net/cks/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From atp at piskorski.com Thu Jul 3 14:00:22 2003 From: atp at piskorski.com (Andrew Piskorski) Date: Thu, 3 Jul 2003 14:00:22 -0400 Subject: sharing a power supply In-Reply-To: <200307031624.h63GOMY26657@NewBlue.Scyld.com> References: <200307031624.h63GOMY26657@NewBlue.Scyld.com> Message-ID: <20030703180022.GA66577@piskorski.com> On Thu, Jul 03, 2003 at 03:21:35PM +0200, Alan Ward wrote: > My final set-up will probably look like: > > - master node on a 300W supply > - three slaves on a 450W supply. Alan, how did you go about attaching three motherboard connectors to that one 450W supply? Where'd you buy the connectors, and did you have to solder them on or is there some sort of Y type splitter cable available? Also, did you do anything to get the three slaves to power on sequentially rather than all at once? Or are you just hoping that the supply will be able to handle the peak load on startup? In my limited experience with Athlons, I've seen cheap power supplies cause memory errors. (In my case, only while also spinning a hard drive while compiling the Linux kernel; memtest86 did not cach the problem.) So I'd definitely be inclined to try using one high quality supply rather than three cheap ones. But until your emails to the list though I hadn't heard of anyone doing it. -- Andrew Piskorski http://www.piskorski.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From torsten at howard.cc Thu Jul 3 14:49:35 2003 From: torsten at howard.cc (torsten) Date: Thu, 3 Jul 2003 14:49:35 -0400 Subject: Kickstart Help - Thanks! In-Reply-To: <20030703115655.GB6647@tchpc01.tcd.ie> References: <20030703011206.6d22b1b6.torsten@howard.cc> <20030703115655.GB6647@tchpc01.tcd.ie> Message-ID: <20030703144935.12bf170f.torsten@howard.cc> Thanks for the help. Redhat 9.0 uses "isolinux" for the boot dist, so the old "syslinux.cfg" is now "isolinux.cfg". Getting the packages right is indeed trial and error. I'm down to about 500MB, and reducing them one-by-one. Torsten _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From egan at sense.net Thu Jul 3 16:53:21 2003 From: egan at sense.net (Egan Ford) Date: Thu, 3 Jul 2003 14:53:21 -0600 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <20030701224808.GA15167@stikine.ucs.sfu.ca> Message-ID: <002e01c341a5$23e9a5b0$27b358c7@titan> > For those of you who have such a box: which distribution are > you using? > Any advice on how to get those GigE Broadcom NICs to work? I have 2 boxes with 2 Opterons and 2 onboard Broadcoms NICs and have had very minor but expected problems installing: SLES8 x86_64 SLES8 x86 RH 7.3 Issues: SLES8 x86_64 recognized the NIC in reverse order than that of RH73 and SLES8 x64. Adding netdevice=eth1 to Autoyast network installer was the work around. 
FYI, Autoyast is like kickstart but for SuSE distros.

SLES8 x86 needed a minor tweak to the network boot image to find the BCM5700s. But the module was just fine.

RH 7.3 needed a new module and pcitable entry in the network boot image for installation. I also had to update the runtime bcm5700 support. HINT: RH7.3 installs the athlon kernel. I'd love to know how to tell kickstart to force i686. I used version 6.2.11 from broadcom.com.

I am too lazy to do CD installs so I only tested network installing. My demo machines came with IDE drives; I suspect that if I had SCSI, RH7.3 would have needed that updated as well in the installer.

I just downloaded gingin64, but have not tested it yet. I suspect that it will work just fine. Anyone know what gingin64 is? RH8, RH9, RH10,...?

I am impressed with SLES8 x86_64. The updated NUMA kernel with the numactl command is very nice. You can peg a process and its children to a processor and memory bus, or the threads of an OMP application to the memory of the processor each thread is running on. Helps with benchmarks like STREAM and SPECfp on multiprocessor systems. Now if someone will add it as an option to mpirun...

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From alvin at Mail.Linux-Consulting.com  Thu Jul  3 19:08:05 2003
From: alvin at Mail.Linux-Consulting.com (Alvin Oga)
Date: Thu, 3 Jul 2003 16:08:05 -0700 (PDT)
Subject: sharing a power supply
In-Reply-To: <3F042DDF.9000700@andorra.ad>

hi ya

On Thu, 3 Jul 2003, Alan Ward wrote:

> I am counting on the following maximum draws for
> each motherboard (Duron at 1300 + 512 MB RAM):
>
> 15A / 5V
> <1A / 3.3V
> 5A / 12V
>
> This is _just_ inside the 450W supply's specs -
> I hope they were not overly optimistic.

if you're connecting 3 systems .. that's 45A that the power supply has to deliver ... -- double that for current spikes and optimal/normal performance and reliability of the power supply

if the ps can't deliver that current, then you're degrading your power supply and motherboard down to irreparable damage over time

450W power supply doesn't mean anything ... it's the total amps per each delivered voltage that you should be looking at and how well you want it regulated ... there's not much room for noise on the +3.3v power lines and it uses lots of current on some of the memory sticks

if the idea of hooking up 4 systems to one ps was to reduce heat and increase reliability, i think using multiple systems on a ps designed for one fully loaded mb/system will give you the opposite reliability effect

i think 2 minimal-systems per power supply is the max for any power supply .. most ps and cases are designed for a fully loaded case

fun stuff ... lots of smoke tests ...
( bad idea to let the blue smoke out...
( for some reason, the systems always stop working
( after you let out the blue smoke
( and blue smoke smells funny too

have fun
alvin

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From alorant at octigabay.com  Fri Jul  4 01:08:34 2003
From: alorant at octigabay.com (Adam Lorant)
Date: Thu, 3 Jul 2003 22:08:34 -0700
Subject: GigE PCI-X NIC Cards
Message-ID: <001201c341ea$54e9d870$0300a8c0@Adam>

Hi folks.
Do any of you have any recommendations for a high performance Gigabit Ethernet NIC card for PCI-X slots? Are there any that I should stay away from? My primary application is NAS access.

Much appreciated,
Adam.

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From maurice at harddata.com  Fri Jul  4 02:37:00 2003
From: maurice at harddata.com (Maurice Hilarius)
Date: Fri, 04 Jul 2003 00:37:00 -0600
Subject: [Rocks-Discuss]Dual Itanium2 performance
In-Reply-To: <200307021908.h62J8UY09280@NewBlue.Scyld.com>
Message-ID: <5.1.1.6.2.20030704003523.033deaa0@mail.harddata.com>

With regards to your message at 01:08 PM 7/2/03, beowulf-request at scyld.com. Where you stated:

>On Wed, 2 Jul 2003, Leonard Chvilicek wrote:
>
> > I was reading in some of the mailing lists that the AMD Opteron dual
> > processor system was getting around 80-90% efficiency on the second
> > processor. I was wondering if that holds true to the Itanium2 platform?
> > I looked through some of the archives and did not find any benchmarks or
> > statistics on this. I found lots of dual Xeons but no dual Itaniums.
>
>You are not going to be able to beat a dual Itanium in terms of efficiency
>if you are talking about a linpack benchmark. Close to 98% efficient.
>
>Tim

Perhaps, but as linpack is not what most people actually run on their machines for production, I think it is more useful to consider what efficiency on SMP you get on real production code.

With our best regards,

Maurice W. Hilarius       Telephone: 01-780-456-9771
Hard Data Ltd.            FAX: 01-780-456-9772
11060 - 166 Avenue        mailto:maurice at harddata.com
Edmonton, AB, Canada      http://www.harddata.com/
T5X 1Y3

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From maurice at harddata.com  Fri Jul  4 02:43:12 2003
From: maurice at harddata.com (Maurice Hilarius)
Date: Fri, 04 Jul 2003 00:43:12 -0600
Subject: memory nightmare
In-Reply-To: <200307030459.h634xIY12831@NewBlue.Scyld.com>
Message-ID: <5.1.1.6.2.20030704004114.033e1a00@mail.harddata.com>

With regards to your message:

>From: Jack Wathey
>To: Stephen Gaudet
>cc: beowulf at beowulf.org
>Subject: Re: memory nightmare
>
>I suppose it's remotely possible, but not likely. All of the boards will
>run memtest86 for many days, and my number-crunching code for many weeks,
>with no problems at all, when I use memory from the batch I bought last
>December. Most of the failing sticks I've encountered since April will
>fail consistently, whether tested alone or with other sticks, whether
>tested on my Gigabyte GA7DPXDW-P boards or the Asus A7M266D board that I
>use in my server. It's only a few sticks in the most recent batch of 69
>that are failing in this rare and intermittent way that I can't seem to
>reproduce when the sticks are tested one per motherboard.
>
>Jack

Have you tried raising the memory voltage level on the motherboards to 2.7V? I see characteristics of failure like you have described on many cheap motherboards. Works fine with 1 stick, errors with 3 sticks of RAM.

With our best regards,

Maurice W. Hilarius       Telephone: 01-780-456-9771
Hard Data Ltd.
FAX: 01-780-456-9772
11060 - 166 Avenue        mailto:maurice at harddata.com
Edmonton, AB, Canada      http://www.harddata.com/
T5X 1Y3

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From alvin at Mail.Linux-Consulting.com  Fri Jul  4 03:19:39 2003
From: alvin at Mail.Linux-Consulting.com (Alvin Oga)
Date: Fri, 4 Jul 2003 00:19:39 -0700 (PDT)
Subject: memory nightmare
In-Reply-To: <5.1.1.6.2.20030704004114.033e1a00@mail.harddata.com>

hi ya

On Fri, 4 Jul 2003, Maurice Hilarius wrote:

> With regards to your message:
> >From: Jack Wathey
> >To: Stephen Gaudet
> >cc: beowulf at beowulf.org
> >Subject: Re: memory nightmare
> >
> >I suppose it's remotely possible, but not likely. All of the boards will
> >run memtest86 for many days, and my number-crunching code for many weeks,
> >with no problems at all, when I use memory from the batch I bought last
> >December. Most of the failing sticks I've encountered since April will
> >fail consistently, whether tested alone or with other sticks, whether
> >tested on my Gigabyte GA7DPXDW-P boards or the Asus A7M266D board that I
> >use in my server. It's only a few sticks in the most recent batch of 69
> >that are failing in this rare and intermittent way that I can't seem to
> >reproduce when the sticks are tested one per motherboard.

ditto that ... all the generic 1GB mem sticks ( ddr-2100 ) work fine by themselves but fail big time with 2 of um in the same mb ...
( wasted about a month of productivity during the random failures
( and no failures since using 4x 512MB sticks

we wound up replacing the cheap asus mb with intel D845/D865 series and changed to 4x 512MB sticks instead and it worked fine

similarly, for finicky mb, we used name brand memory 256MB ddr-2100, and it worked fine ...

> Have you tried raising the memory voltage level on the motherboards to 2.7V?
> I see characteristics of failure like you have described on many cheap
> motherboards.
> Works fine with 1 stick, errors with 3 sticks of RAM.

forgetful memory is not a good thing

c ya
alvin

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From award at andorra.ad  Fri Jul  4 03:53:42 2003
From: award at andorra.ad (Alan Ward)
Date: Fri, 04 Jul 2003 09:53:42 +0200
Subject: sharing a power supply
Message-ID: <3F053286.1090804@andorra.ad>

Hi Alvin

Alvin Oga wrote:
(snip)
> 450W power supply doesn't mean anything ...
> it's the total amps per each delivered voltage
> that you should be looking at and how well you
> want it regulated ... there's not much room
> for noise on the +3.3v power lines and it uses
> lots of current on some of the memory sticks

I am. As has been noted, it looks like there's very little draw on 3.3V; we are way above specs. You are right about 5V and spikes, though. Have to try and see. Luckily, I have no other 5V devices in the box (I think :-).

This 450W is given for 45A/5V and 25A/3.3V, with a 250W limit across these two lines.

> if the idea of hooking up 4 systems to one ps was
> to reduce heat and increase reliability, i think
> using multiple systems on a ps designed for one
> fully loaded mb/system will give you the opposite
> reliability effect

This is a small mobile console type system, on wheels.
The idea is to move it around from one desk to another, so different people can literally get their hands on it. Having little noise (thus fans) is about as important as pure computing power at this stage - I need to have them buy the concept first. The design isn't too bad; the pics will be on the web ASAP.

Best regards,
Alan

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From award at andorra.ad  Fri Jul  4 03:53:52 2003
From: award at andorra.ad (Alan Ward)
Date: Fri, 04 Jul 2003 09:53:52 +0200
Subject: sharing a power supply
References: <200307031624.h63GOMY26657@NewBlue.Scyld.com> <20030703180022.GA66577@piskorski.com>
Message-ID: <3F053290.50800@andorra.ad>

Hi.

Andrew Piskorski wrote:

> Alan, how did you go about attaching three motherboard connectors to
> that one 450W supply? Where'd you buy the connectors, and did you
> have to solder them on or is there some sort of Y type splitter cable
> available?

I started with dominoes, and when I was sure it worked soldered them.

Jack Wathey posted the following:

>> Rather than cut up the wires
>> of a power supply, I cut up the wires of extension cables:
>>
>> http://www.cablesamerica.com/product.asp?cat%5Fid=604&sku=22998
>> http://www.cablesamerica.com/product.asp?cat%5Fid=604&sku=27314

Being in southern Europe, there's no hope of getting these here. But busted power supplies (for parts) are easy to find :-(

> Also, did you do anything to get the three slaves to power on
> sequentially rather than all at once? Or are you just hoping that the
> supply will be able to handle the peak load on startup?

Can't do anything about that. When the supply goes on, it powers the boards, and they start up, period. Maybe a breaker on the 5V and 3.3V lines would be a solution.

However, I reason the following: power-on spikes come from capacitors, and there is a lot more capacitance in the power supplies than on the motherboards - at the very least a factor of 100 more. So I expect the spikes on the AC circuit as the supply is getting charged up, rather than on the DC part. (Comments, Alvin, Jack?)

> In my limited experience with Athlons, I've seen cheap power supplies
> cause memory errors. (In my case, only while also spinning a hard
> drive while compiling the Linux kernel; memtest86 did not cach the
> problem.) So I'd definitely be inclined to try using one high quality
> supply rather than three cheap ones. But until your emails to the
> list though I hadn't heard of anyone doing it.

There seem to be two-stage power supplies for racks: a general 230V / 12V converter for the whole rack, plus a simplified low-voltage supply for each box. I've never even seen any of these around here, though. What I'm doing is not strictly COTS. I lose the advantage of just plugging the hardware in and worrying *only* about the software...
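(To put numbers on the concern in this thread: with Alan's estimate of 15 A at +5 V per board, three boards already sit at the 45 A rating of that one rail before any start-up surge, which is exactly the derating point Alvin raised; Jack's measured dual-Athlon currents earlier in the thread are in the same range. Below is a small sketch of the budget arithmetic - the per-board figures are the estimates quoted in the thread, and the board count is an argument.)

/*
 * railbudget.c - rough +5V/+3.3V/+12V budget for several boards on one
 * ATX supply.  Per-board currents below are the estimates from this
 * thread (Duron 1300 + 512 MB); edit them for your own boards.
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int boards = (argc > 1) ? atoi(argv[1]) : 3;

    /* per-board draw (amps), from the figures quoted in the thread */
    double a5 = 15.0, a33 = 1.0, a12 = 5.0;

    /* ratings quoted for the 450W unit discussed in this thread */
    double max5 = 45.0, max33 = 25.0, max_5_33_watts = 250.0;

    double t5 = boards * a5, t33 = boards * a33, t12 = boards * a12;
    double w = t5 * 5.0 + t33 * 3.3 + t12 * 12.0;

    printf("%d boards: %.1f A @5V (rated %.0f A), %.1f A @3.3V (rated %.0f A)\n",
           boards, t5, max5, t33, max33);
    printf("combined 5V+3.3V load: %.0f W (limit %.0f W); total DC: %.0f W\n",
           t5 * 5.0 + t33 * 3.3, max_5_33_watts, w);
    return 0;
}

For 3 boards it reports 45 A on +5 V with no headroom, about 235 W against the combined 250 W 5V/3.3V limit, and roughly 415 W total DC - which is why the 450 W label on the box is the least informative number of the lot.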
Best regards, Alan _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bobb at tchpc.tcd.ie Fri Jul 4 04:28:06 2003 From: bobb at tchpc.tcd.ie (bobb) Date: Fri, 4 Jul 2003 09:28:06 +0100 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <002e01c341a5$23e9a5b0$27b358c7@titan> References: <20030701224808.GA15167@stikine.ucs.sfu.ca> <002e01c341a5$23e9a5b0$27b358c7@titan> Message-ID: <20030704082806.GA32158@tchpc01.tcd.ie> Egan Ford hath declared on Thursday the 03 day of July 2003 :-: > I just downloaded gingin64, but have not tested it yet. I suspect that it > will work just fine. Anyone know what gingin64 is? RH8, RH9, RH10,...? According to the release notes its 8.0.95. http://ftp.redhat.com/pub/redhat/linux/preview/gingin64/en/os/x86_64/RELEASE-NOTES - bobb -- Robert "bobb" Crosbie. Trinity Centre for High Performance Computing, O'Reilly Institute,Trinity College Dublin. Tel: +353 1 608 3725 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From daniel at labtie.mmt.upc.es Fri Jul 4 12:08:31 2003 From: daniel at labtie.mmt.upc.es (Daniel Fernandez) Date: 04 Jul 2003 18:08:31 +0200 Subject: Small PCs cluster Message-ID: <1057334911.3814.28.camel@qeldroma.cttc.org> Hi there, I just started how to mantain a cluster, I mean monitoring activity/temperature, finding/replacing damaged components and user control. Recently we are planning here to add more nodes... but there's a great problem, space. So we bought recently a Small Form Factor PC to test it, It's a Shuttle SN41G2 equipped with a nForce2 chipset, It was a bit tricky at install process because our older PCs were equipped with 3Com cards and installed via BOOTP but that damn nVidia integrated ethernet only boots via PXE, well, that's relatively easy to solve. And after installing nVidia drivers seemed to work flawlessly. It's obvious that we'll gain space but on the other hand heat dissipation will be more difficult because will be more dissipated watts per cubic-meter, that small PC case has a nice Heat-pipe for cooling the main cpu though. ? Are there experiences ( successful or not ) about installing and managing clusters with Small Form Factor PCs ? I'm not talking only about heat but instability problems with integrated ethernet ( under high activity ) as well. -- Daniel Fernandez Laboratori de Termot?cnia i Energia - CTTC ( Heat and Mass Transfer Center ) Universitat Polit?cnica de Catalunya _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From tsyang at iesinet.com Fri Jul 4 13:25:17 2003 From: tsyang at iesinet.com (T.-S. Yang) Date: Fri, 04 Jul 2003 10:25:17 -0700 Subject: Small PCs cluster In-Reply-To: <1057334911.3814.28.camel@qeldroma.cttc.org> References: <1057334911.3814.28.camel@qeldroma.cttc.org> Message-ID: <3F05B87D.9070108@iesinet.com> Daniel Fernandez wrote: > .. > ? Are there experiences ( successful or not ) about installing and > managing clusters with Small Form Factor PCs ? I'm not talking only > about heat but instability problems with integrated ethernet ( under > high activity ) as well. 
> Your cluster is similar to the Space Simulator Cluster http://space-simulator.lanl.gov/ There is a helpful paper in PDF format. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From James.P.Lux at jpl.nasa.gov Fri Jul 4 13:55:34 2003 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Fri, 4 Jul 2003 10:55:34 -0700 Subject: sharing a power supply References: <3F053286.1090804@andorra.ad> Message-ID: <001d01c34255$77eed4e0$02a8a8c0@office1> If quiet and compact is your goal, then maybe getting some standard smaller supplies and doing some repackaging might be a better solution. Pull the fans out of the small supplies, mount them with some ducting and use 1 or 2 larger diameter fans. In general a larger diameter fan will move more air, more quietly, than a small diameter fan. You're already straying into non-standard application of the parts, so opening up the power supplies is hardly a big deal. You might find that using 3 small 200W supplies might be a better way to go than 1 monster 450W supply. There are also conduction cooled power supplies available (no fans at all) ----- Original Message ----- From: "Alan Ward" To: "Alvin Oga" Cc: Sent: Friday, July 04, 2003 12:53 AM Subject: Re: sharing a power supply > Hi Alvin > > > En/na Alvin Oga ha escrit: > (snip) > > 450W power supply doesnt mean anything ... > > its the total amps per each delivered voltages > > that yoou should be looking at and how well you > > want it regulated ... there's not much room > > for noise on the +3.3v power lines and it uses > > lots of current on some of the memory sticks > > I am. As has been noted, it looks like there's very > little draw on 3.3V; we are way above specs. > You are right about 5V and spikes, though. Have to > try and see. Luckily, I have no other 5V devices > in the box (I think :-). > > This 450W is given for 45A/5v and 25A/3.3V, with a > 250W limit across these two lines. > > > if the idea of hooking up 4 systems to one ps was > > to reduce heat and increase reliability, i think > > using multiple systems on a ps designed for one > > fully loaded mb/system will give you the opposite > > reliability effect > > This is a small mobile console type system, on wheels. > The idea is to move it around from one desk to another, > so different people can litteraly get their hands on it. > Having little noise (thus fans) is about as important > as pure computing power at this stage - I need to have > them buy the concept first. The design isn't too bad; > the pics will be on the web ASAP. 
> > > Best regards, > Alan > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From gerry.creager at tamu.edu Fri Jul 4 13:29:16 2003 From: gerry.creager at tamu.edu (Gerry Creager N5JXS) Date: Fri, 04 Jul 2003 12:29:16 -0500 Subject: Small PCs cluster In-Reply-To: <1057334911.3814.28.camel@qeldroma.cttc.org> References: <1057334911.3814.28.camel@qeldroma.cttc.org> Message-ID: <3F05B96C.6040801@tamu.edu> Relatively speaking the Shuttle cases, while small for a P4 or Athelon processor class machine, are pretty big compared to the Mini-ITX systems. However, the heat-pipes seem to do a pretty good job of off-loading heat and making the heat-exchanger available to ambient air. I've not built a cluster so far using this sort of case, but I've got a lot of past heat-pipe experience. I'd be tring to maintain a low inlet temperature to the rack, and a fairly high, and (uncharacteristically) non-laminar airflow through the rack. The idea is to get as much airflow incident to the heat-pipe heat exchanger as possible. We did a fair bit of heat-pipe work while I was at NASA. We found cood radiative characteristics in heat-pipe heat exchangers (the heat-pipes wouldn't have worked otherwise!) but they work best when they combine both convective and radiative modes and use a cool-air transport. I've got a number of isolated small-form-factor PCs now running. I've seen no instability with the integrated components in any of these. gerry Daniel Fernandez wrote: > Hi there, > > I just started how to mantain a cluster, I mean monitoring > activity/temperature, finding/replacing damaged components and user > control. Recently we are planning here to add more nodes... but there's > a great problem, space. > > So we bought recently a Small Form Factor PC to test it, It's a Shuttle > SN41G2 equipped with a nForce2 chipset, It was a bit tricky at install > process because our older PCs were equipped with 3Com cards and > installed via BOOTP but that damn nVidia integrated ethernet only boots > via PXE, well, that's relatively easy to solve. And after installing > nVidia drivers seemed to work flawlessly. > > It's obvious that we'll gain space but on the other hand heat > dissipation will be more difficult because will be more dissipated watts > per cubic-meter, that small PC case has a nice Heat-pipe for cooling the > main cpu though. > > ? Are there experiences ( successful or not ) about installing and > managing clusters with Small Form Factor PCs ? I'm not talking only > about heat but instability problems with integrated ethernet ( under > high activity ) as well. 
> > > -- Gerry Creager -- gerry.creager at tamu.edu Network Engineering -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.847.8578 Page: 979.228.0173 Office: 903A Eller Bldg, TAMU, College Station, TX 77843 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From torsten at howard.cc Fri Jul 4 16:41:45 2003 From: torsten at howard.cc (torsten) Date: Fri, 4 Jul 2003 16:41:45 -0400 Subject: Kickstart ks.cfg file example for headless node Message-ID: <20030704164145.1e8be175.torsten@howard.cc> Hello, Does anyone have a kickstart file (ks.cfg) that they use for a very minimal install on a headless node? Thanks, Torsten _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From derek.richardson at pgs.com Fri Jul 4 18:12:27 2003 From: derek.richardson at pgs.com (Derek Richardson) Date: Fri, 04 Jul 2003 17:12:27 -0500 Subject: Kickstart ks.cfg file example for headless node In-Reply-To: <20030704164145.1e8be175.torsten@howard.cc> References: <20030704164145.1e8be175.torsten@howard.cc> Message-ID: <3F05FBCB.9080408@pgs.com> Torsten, If using redhat, try their kickstart configurator for a basic configuration. Here's a list of packages I use for compute nodes on a redhat 7.1 cluster : %packages @ Networked Workstation @ Kernel Development @ Development @ Network Management Workstation @ Utilities autofs dialog lsof ORBit XFree86 audiofile control-panel dialog esound gnome-audio gnome-libs gtk+ imlib kaffe linuxconf libungif modemtool netcfg pythonlib tcl timetool tix tk tkinter tksysv wu-ftpd ntp pdksh ncurses ncurses-devel ncurses4 compat-egcs compat-egcs-c++ compat-egcs-g77 compat-egcs-objc compat-glibc compat-libs compat-libstdc++ xosview quota expect uucp I can't send you the entire kickstart, since it contains information relevant to the company I work for ( not to mention everyone would hate me for filling their inbox... ). This list would probably need to be updated for what version you're using. I'll send you ( off-list ) a kickstart that I use for redhat9 workstations that doesn't contain anything sensitive, it contains some examples of scripting post-install configuration and whatnot. Oh, redhat maintains excellent documentation : http://www.redhat.com/docs/manuals/linux/RHL-9-Manual/custom-guide/ Regards, Derek R. torsten wrote: >Hello, > >Does anyone have a kickstart file (ks.cfg) that they >use for a very minimal install on a headless node? > >Thanks, >Torsten >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > -- Linux Administrator derek.derekson at pgs.com derek.derekson at ieee.org Office 713-781-4000 Cell 713-817-1197 A list is only as strong as its weakest link. 
-- Don Knuth _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From torsten at howard.cc Fri Jul 4 18:43:54 2003 From: torsten at howard.cc (torsten) Date: Fri, 4 Jul 2003 18:43:54 -0400 Subject: Kickstart ks.cfg file example for headless node In-Reply-To: <3F05FBCB.9080408@pgs.com> References: <20030704164145.1e8be175.torsten@howard.cc> <3F05FBCB.9080408@pgs.com> Message-ID: <20030704184354.61bed075.torsten@howard.cc> > I'll send you ( off-list ) a kickstart that I use for >redhat9 workstations that doesn't contain anything sensitive, it >contains some examples of scripting post-install configuration >and whatnot. Oh, redhat maintains excellent documentation : >http://www.redhat.com/docs/manuals/linux/RHL-9-Manual/custom-guide/ Thanks for the info. I'm most interested in %packages. The manual talks about package selection. In order to reduce the install size, I select no additional packages. I just want a base (40-50M) system. My current installed system turns out to be huge (700M+). I read in the manual, it says "The Package Selection window allows you to choose which package groups to install." I understand this to mean that choosing a package installs that package, in addition to the base system. Have I misread? By selecting no packages, is kickstart installing all packages by default? If I select "@ base", will this only install the base and skip the rest? My goal is a very small, very quick network install. Thanks to everyone for their help and patience. Extra thanks to Derek for sending me an excellent ks.cfg example. Torsten _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From seth at hogg.org Sat Jul 5 04:31:59 2003 From: seth at hogg.org (Simon Hogg) Date: Sat, 05 Jul 2003 09:31:59 +0100 Subject: OT? Opteron suppliers in UK? Message-ID: <4.3.2.7.2.20030705092404.00aa0de0@pop.freeuk.net> Attn: Any Opteron users in the UK I'm looking for an Opteron-based system supplier (nice white-box assembler) in the UK. Can any UK users recommend any suppliers (off-list!) The prices I have seen so far seem a bit steep compared to our American cousins. Thanks in advance, and apologies for the off-topic(?) post (but it is the weekend and just after 4th July, so list traffic is low :-) Simon _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From alvin at Mail.Linux-Consulting.com Sat Jul 5 21:44:09 2003 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Sat, 5 Jul 2003 18:44:09 -0700 (PDT) Subject: Small PCs cluster In-Reply-To: <3F05B96C.6040801@tamu.edu> Message-ID: hi ya On Fri, 4 Jul 2003, Gerry Creager N5JXS wrote: > Relatively speaking the Shuttle cases, while small for a P4 or Athelon > processor class machine, are pretty big compared to the Mini-ITX > systems. However, the heat-pipes seem to do a pretty good job of > off-loading heat and making the heat-exchanger available to ambient air. the folks at mini-box.com has cdrom-sized chassis (1.75" tall) running off +12v DC input ... and we have a mini-itx 1u chassis w/ 2 hd .. good up to p4-3Ghz ( noisier than ?? 
but keeps the cpu nice and cool ) c ya alvin _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From c00jsh00 at nchc.gov.tw Sun Jul 6 05:43:21 2003 From: c00jsh00 at nchc.gov.tw (Jyh-Shyong Ho) Date: Sun, 06 Jul 2003 17:43:21 +0800 Subject: GinGin64 on Opteron References: <20030624032259.48447.qmail@web16809.mail.tpe.yahoo.com> <3EF85B85.1090200@inel.gov> Message-ID: <3F07EF39.7D7110F7@nchc.gov.tw> Hi, This afternoon I tried to install RedHat's GinGig64 on our dual Opteron box (Riowork HDAMA motherboard with 8GB RAM) and found that the installation script failed at the initiation stage of system checking, the installation script only works normally when the memory size is reduced to 4GB (4 1GB RAM). I wonder if anyone has tried this and has the similar finding. On the other hand, SuSE Linux Enterprise Server 8 for AMD64 works fine for system with 8GB RAM. However, Unlike RedHat, SuSE SLES8 does not load 3w-xxxx driver before initiating the installation, so the installation script does not recognize device such as /dev/sda, /dev/sdb, etc, created by 3Ware RAID card earlier. I suspect that part of the reason might be caused by the power supply on my system is not large enough (460W for 9 120GB hard disks, a dual opteron motherboard, and 8GB RAM). I'll replace the power supply and try again next week. Jyh-Shyong Ho, PhD. Research Scientist National Center for High-Performance Computing Hsinchu, Taiwan, ROC > > Andrew Wang wrote: > > > How well the existing tools run on Opteron machines? > > > > Does LAM-MPI or MPICH run in 64-bit mode? Also, has > > anyone tried Gridengine or PBS on it? > > > > Lastly, is there an opensource Opteron compile farm > > that I can access? I would like to see if my code > > really runs correctly on them before buying! > > > > Andrew. > > Most vendors will give you a remote account or send you > an evaluation unit. I imagine you'll probably be > contacted off-list by several of them. > > I've compiled a 64-bit MPICH, GROMACS, and a few other > codes with a GCC 3.3 prerelease. I have also used the > beta PGI compiler with good results. Some build > scripts require slight modification to recognize > x86-64 as an architecture, but most porting is trivial. > GROMACS has some optimized assembly that didn't come > out quite right, but I bet they have it fixed by now. > > All my testing was a couple of weeks before the release, > but I haven't gotten any in yet unfortunately. > > Andrew > > -- > Andrew Shewmaker, Associate Engineer > Phone: 1-208-526-1276 > Idaho National Eng. and Environmental Lab. > P.0. Box 1625, M.S. 3605 > Idaho Falls, Idaho 83415-3605 > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mbosma at atipa.com Mon Jul 7 16:11:32 2003 From: mbosma at atipa.com (Mark Bosma) Date: 07 Jul 2003 15:11:32 -0500 Subject: GinGin64 on Opteron Message-ID: <1057608692.11660.38.camel@atipa-dp> We noticed the same behavior on a dual opteron machine last week that was the same setup as yours - the install script would only work with 4 or less gigs of RAM. 
Once installation was complete, the full 8 gigs could be installed and the OS seemed to recognize it all. So I've had similar findings, but I haven't had time to find the cause yet. I'd be interested to hear if someone else has. Mark Bosma Atipa Technologies _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hahn at physics.mcmaster.ca Mon Jul 7 16:55:47 2003 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Mon, 7 Jul 2003 16:55:47 -0400 (EDT) Subject: GinGin64 on Opteron In-Reply-To: <1057608692.11660.38.camel@atipa-dp> Message-ID: > similar findings, but I haven't had time to find the cause yet. I'd be > interested to hear if someone else has. I'd guess that that boots and runs the installer simply isn't configured right, perhaps even just an ia32 one). does the installer work on a >4G machine if you simply give it a mem=4G argument? I'd guess the installer has no use for even 2G of ram... _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From siegert at sfu.ca Mon Jul 7 17:24:50 2003 From: siegert at sfu.ca (Martin Siegert) Date: Mon, 7 Jul 2003 14:24:50 -0700 Subject: GinGin64 on Opteron In-Reply-To: References: <1057608692.11660.38.camel@atipa-dp> Message-ID: <20030707212450.GA14775@stikine.ucs.sfu.ca> On Mon, Jul 07, 2003 at 04:55:47PM -0400, Mark Hahn wrote: > > similar findings, but I haven't had time to find the cause yet. I'd be > > interested to hear if someone else has. > > I'd guess that that boots and runs the installer simply > isn't configured right, perhaps even just an ia32 one). > > does the installer work on a >4G machine if you simply give it a mem=4G > argument? I'd guess the installer has no use for even 2G of ram... I tried GigGin64 on my demo box and it hung almost immediately: the last thing the installer displayed was running /sbin/loader ... Martin _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From adm35 at georgetown.edu Mon Jul 7 18:56:09 2003 From: adm35 at georgetown.edu (Arnold Miles) Date: Mon, 07 Jul 2003 18:56:09 -0400 Subject: Free 3-day seminar in using Beowulf clusters and programming MPI in Washington DC Message-ID: <40ebbe40b61f.40b61f40ebbe@georgetown.edu> All: Georgetown University in Washington DC is hosting a free 3-day workshop/ seminar on High Performance Computing, High Throughput Computing and Distributed Computing on August 11, 12, and 13. The main emphasis of this workshop is using Beowulf cluster and writing algorithms and programs for Beowulf clusters using MPI. Information can be found at: http://www.georgetown.edu/research/arc/workshop2.html The first day is general information, and is aimed at anyone with any interest in Beowulf clusters and their use. We encourage project managers, administrators, researchers, faculty, and students to attend, as well as programmers who want to get started using their clusters. The second day will be split beetween lectures and labs on the use of Jini in distributed computing (Track 1), and parallel programming (Track 2). There will also be a session on using Beowulf clusters as a high throughput tool using Condor. 
The third day will be an all day lab in parallel programming with MPI. Track 2 assumes a knowledge of either C, C++ or Fortran. Best of all, this seminar is fully funded by Georgetown University's Information Systems department, so there is no cost to attend this year! Seating for day 2 and day 3 is limited. Contact Arnie Miles at adm35 at georgetown.edu or Steve Moore at moores at georgetown.edu. Hope to see you there. Arnie Miles Systems Administrator: Advanced Research Computing Adjunct Faculty: Computer Science 202.687.9379 168 Reiss Science Building http://www.georgetown.edu/users/adm35 http://www.guppi.arc.georgetown.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From c00jsh00 at nchc.gov.tw Tue Jul 8 00:57:49 2003 From: c00jsh00 at nchc.gov.tw (Jyh-Shyong Ho) Date: Tue, 08 Jul 2003 12:57:49 +0800 Subject: etherchannel Message-ID: <3F0A4F4D.FF742BC4@nchc.gov.tw> Hi, Does anyone know how to set up and configure etherchannel on Linux system? I have a motherboard has two Broadcom gigabit ports, and a 24-port SMC Gigabit TigerSwitch which also has Broadcom chip on it. Both support IEEE 802.3ad protocol which allows to combine two physical LAN ports into a logical one and double the bandwitch.There are several name for such feature, etherchannel is just one of them. I wonder if anyone has try this on a Linux system, say SuSE Enterprise Server 8 or RedHat 9 ? any help or suggestion will be appreciated. Best Regards Jyh-Shyong Ho, PhD. Research Scientist National Center for High-Performance Computing Hsinchu, Taiwan, ROC _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bvds at bvds.geneva.edu Mon Jul 7 23:13:46 2003 From: bvds at bvds.geneva.edu (bvds at bvds.geneva.edu) Date: Mon, 7 Jul 2003 23:13:46 -0400 Subject: semaphore problem with mpich-1.2.5 Message-ID: <200307080313.h683Dk722726@bvds.geneva.edu> I have an Opteron system running GinGin64 with a 2.4.21 kernel and gcc-3.3. I compiled mpich-1.2.5 with --with-comm=shared, but mpirun crashes with the error: semget failed for setnum = 0 This is a known problem with mpich (see http://www-unix.mcs.anl.gov/mpi/mpich/buglist-tbl.html). Has anyone else seen this error? I found a discussion, reprinted below, by Douglas Roberts at LANL (http://www.bohnsack.com/lists/archives/xcat-user/1275.html) His fix worked for me. Does anyone know of a "real" solution? Brett van de Sande ******************************************************************** I think the reason we get sem_get errors is that the operating system is not releasing inter-process communication resources (e.g. semaphores) when a job is finished. It's possible to do this manually. ... I wrote the following script, which removes all the shared memory and semaphore resources held by the user: #! 
/bin/csh foreach id (`ipcs -m | gawk 'NR>4 {print $2}'`) ipcrm shm $id end foreach id (`ipcs -s | gawk 'NR>4 {print $2}'`) ipcrm sem $id end ******************************************************************** _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rgupta at cse.iitkgp.ernet.in Tue Jul 8 04:55:11 2003 From: rgupta at cse.iitkgp.ernet.in (Rakesh Gupta) Date: Tue, 8 Jul 2003 14:25:11 +0530 (IST) Subject: NIS problem .. Message-ID: Hi, I am setting up a small 8 node cluster .. I have installed RedHat 9.0 on all the nodes. Now I want to setup NIS .. I have ypserv , portmap, ypbind running on one of the nodes (The server) on the others I have ypbind and portmap. The NIS Domain is also set in /etc/sysconfig networkk .. Now when I do /var/yp/make .. an error of the following form comes " failed to send 'clear' to local ypserv: RPC: Unknown HostUpdating passwd.byuid " and a sequence of such messages follow.. can anyone please help me with this. Regards Rakesh -- ---------------------------------------------------------------------- Rakesh Gupta Research Consultant Computer Science and Engineering Department IIT Kharagpur West Bengal India - 721302 URL: http://www.crx.iitkgp.ernet.in/~rakesh/ Phone: 09832117500 -------------------------------------------------------------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rene.storm at emplics.com Tue Jul 8 06:42:16 2003 From: rene.storm at emplics.com (Rene Storm) Date: Tue, 8 Jul 2003 12:42:16 +0200 Subject: AW: etherchannel Message-ID: <29B376A04977B944A3D87D22C495FB2301276B@vertrieb.emplics.com> Hi, Take a look at /usr/share/doc/kernel-doc-2.4.18/networking/bonding.txt (at RH 7.3, don't know for higher versions) You will have to recompile ifenslave for network-trunking. This will result in a higher bandwidth, but your latency will grow (don't do that for mpich jobs, won't perform). Before starting to configure I would do some benches (ping, Pallas), cause latency gets really worse. greetings Rene ######################################################################## To install ifenslave.c, do: # gcc -Wall -Wstrict-prototypes -O -I/usr/src/linux/include ifenslave.c -o ifenslave # cp ifenslave /sbin/ifenslave 3) Configure your system ------------------------ Also see the following section on the module parameters. You will need to add at least the following line to /etc/conf.modules (or /etc/modules.conf): alias bond0 bonding Use standard distribution techniques to define bond0 network interface. For example, on modern RedHat distributions, create ifcfg-bond0 file in /etc/sysconfig/network-scripts directory that looks like this: DEVICE=bond0 IPADDR=192.168.1.1 NETMASK=255.255.255.0 NETWORK=192.168.1.0 BROADCAST=192.168.1.255 ONBOOT=yes BOOTPROTO=none USERCTL=no (put the appropriate values for you network instead of 192.168.1). All interfaces that are part of the trunk, should have SLAVE and MASTER definitions. For example, in the case of RedHat, if you wish to make eth0 and eth1 (or other interfaces) a part of the bonding interface bond0, their config files (ifcfg-eth0, ifcfg-eth1, etc.) 
should look like this: DEVICE=eth0 USERCTL=no ONBOOT=yes MASTER=bond0 SLAVE=yes BOOTPROTO=none (use DEVICE=eth1 for eth1 and MASTER=bond1 for bond1 if you have configured second bonding interface). Restart the networking subsystem or just bring up the bonding device if your administration tools allow it. Otherwise, reboot. (For the case of RedHat distros, you can do `ifup bond0' or `/etc/rc.d/init.d/network restart'.) If the administration tools of your distribution do not support master/slave notation in configuration of network interfaces, you will need to configure the bonding device with the following commands manually: # /sbin/ifconfig bond0 192.168.1.1 up # /sbin/ifenslave bond0 eth0 # /sbin/ifenslave bond0 eth1 ##################################################### -----Urspr?ngliche Nachricht----- Von: Jyh-Shyong Ho [mailto:c00jsh00 at nchc.gov.tw] Gesendet: Dienstag, 8. Juli 2003 06:58 An: beowulf at beowulf.org Betreff: etherchannel Hi, Does anyone know how to set up and configure etherchannel on Linux system? I have a motherboard has two Broadcom gigabit ports, and a 24-port SMC Gigabit TigerSwitch which also has Broadcom chip on it. Both support IEEE 802.3ad protocol which allows to combine two physical LAN ports into a logical one and double the bandwitch.There are several name for such feature, etherchannel is just one of them. I wonder if anyone has try this on a Linux system, say SuSE Enterprise Server 8 or RedHat 9 ? any help or suggestion will be appreciated. Best Regards Jyh-Shyong Ho, PhD. Research Scientist National Center for High-Performance Computing Hsinchu, Taiwan, ROC _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From siegert at sfu.ca Tue Jul 8 15:09:34 2003 From: siegert at sfu.ca (Martin Siegert) Date: Tue, 8 Jul 2003 12:09:34 -0700 Subject: Linux support for AMD Opteron with Broadcom NICs In-Reply-To: <20030701224808.GA15167@stikine.ucs.sfu.ca> References: <20030701224808.GA15167@stikine.ucs.sfu.ca> Message-ID: <20030708190934.GA16851@stikine.ucs.sfu.ca> On Tue, Jul 01, 2003 at 03:48:08PM -0700, Martin Siegert wrote: > I have a dual AMD Opteron for a week or so as a demo and try to install > Linux on it - so far with little success. > > For those of you who have such a box: which distribution are you using? > Any advice on how to get those GigE Broadcom NICs to work? Thanks to all of you who have responded with suggestions and pointers. In the end this did turn out to be a hardware problem (this NICs plainly did not work) and had nothing to do with the drivers and the distributions that I tried. I am going to get another Opteron box and then will try once more. 
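For anyone hitting the same symptom, checks along these lines help separate a dead port from a driver problem. The module and interface names are assumptions for a 2.4-era box with onboard Broadcom 57xx gigabit, where both the in-kernel tg3 driver and Broadcom's bcm5700 driver were common choices:

-----------------------------------------------------------------
/sbin/lspci | grep -i broadcom        # is the device visible on the PCI bus at all?
/sbin/modprobe tg3 || /sbin/modprobe bcm5700
dmesg | tail -20                      # did the driver attach and find the PHY?
/sbin/ifconfig eth1 up
/sbin/mii-tool -v eth1                # link beat / negotiation, if the PHY responds
-----------------------------------------------------------------

If the device never shows up in lspci, or no driver will attach, no amount of distribution-swapping will help.
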
Cheers, Martin -- Martin Siegert Manager, Research Services WestGrid Site Manager Academic Computing Services phone: (604) 291-4691 Simon Fraser University fax: (604) 291-4242 Burnaby, British Columbia email: siegert at sfu.ca Canada V5A 1S6 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From math at velocet.ca Tue Jul 8 17:15:18 2003 From: math at velocet.ca (Ken Chase) Date: Tue, 8 Jul 2003 17:15:18 -0400 Subject: lopsisded draw on power supplies Message-ID: <20030708171518.A27289@velocet.ca> So, what's people's experience with PC power supplies and power draw on various voltage lines? We have a buncha old but large SCSI drives here that are somewhat hefty, and we want to power them with as few ATX supplies as possible. We have no motherboard involved (yes, we have to find a hack to get the power on with a signal, but I think its just shorting a couple of the pins in the mobo connector for a sec -- anyone got info on that?). The thing is we'd only be drawing +5 and +12V out of the thing for the drives. Im not sure how much of each really, during operation, but the drives are all listed as max 1.1A +5V and 1.1 or 1.7A +12V (latter for bigger of the 2 types of drives). Even the 300W non-enermax cheapo power supply says it supplies 22A of +12V, which is the limiting factor for # of drives. (It gives 36A of +5V). The 650W enermax monster we have gives 46 +5V and 24 +12V strangely enough (strange because its only 2 more amps of 12 for such a big supply.) Im wondering what will happen if we have a load on only one type of voltage because of no motherboard or other perifs. Is this a lopsided load that we should beef up the power supply for? I dont think we should use a 300W for like 16 odd drives, but perhaps a 400 is enough? Should we go 650? Is it necessary? We'll certainly use enermax for this, with 2 fans in it. How close to the rated max should we go? We're looking at 16 drives here, which is short of the 22 or 24A listed on the supplies. Thanks. /kc -- Ken Chase, math at velocet.ca * Velocet Communications Inc. * Toronto, Canada Wiznet Velocet DSL.ca Datavaults 24/7: 416-967-4414 tollfree: 1-866-353-0363 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From deadline at plogic.com Wed Jul 9 13:12:23 2003 From: deadline at plogic.com (Douglas Eadline) Date: Wed, 9 Jul 2003 13:12:23 -0400 (EDT) Subject: Informal Survey Message-ID: I am curious where everyone gets information on clusters. Obviously this list is one source, but what about other sources. In addition, what kind of information do people most want/need about clusters. Please comment on the following questions if you have the time. You can respond to me directly and I will summarize the results for the list. 1. Where do you find "howto" information on clusters (besides this list) a) Google b) Vendor c) Trade Show d) News Sites (what news sites are there?) e) Other 2. If there were a subscription print/web magazine on clusters, what kind of coverage would you want? 
a) howto information b) new products c) case studies d) benchmarks e) other Thanks, Doug _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mohamed.siddiqu at wipro.com Tue Jul 8 04:45:16 2003 From: mohamed.siddiqu at wipro.com (Mohamed Abubakkar Siddiqu) Date: Tue, 8 Jul 2003 14:15:16 +0530 Subject: etherchannel Message-ID: <6353EB090D04484B9AFF8E257A4BF84D3D5F68@blrhomx2.wipro.co.in> Hi.. U can try Channel Bonding. Check Bonding Documentation from the Kernel source Siddiqu.T -----Original Message----- From: Jyh-Shyong Ho [mailto:c00jsh00 at nchc.gov.tw] Sent: Tuesday, July 08, 2003 10:28 AM To: beowulf at beowulf.org Subject: etherchannel Hi, Does anyone know how to set up and configure etherchannel on Linux system? I have a motherboard has two Broadcom gigabit ports, and a 24-port SMC Gigabit TigerSwitch which also has Broadcom chip on it. Both support IEEE 802.3ad protocol which allows to combine two physical LAN ports into a logical one and double the bandwitch.There are several name for such feature, etherchannel is just one of them. I wonder if anyone has try this on a Linux system, say SuSE Enterprise Server 8 or RedHat 9 ? any help or suggestion will be appreciated. Best Regards Jyh-Shyong Ho, PhD. Research Scientist National Center for High-Performance Computing Hsinchu, Taiwan, ROC _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf **************************Disclaimer************************************ Information contained in this E-MAIL being proprietary to Wipro Limited is 'privileged' and 'confidential' and intended for use only by the individual or entity to which it is addressed. You are notified that any use, copying or dissemination of the information contained in the E-MAIL in any manner whatsoever is strictly prohibited. *************************************************************************** _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From torsten at howard.cc Wed Jul 9 23:21:19 2003 From: torsten at howard.cc (torsten) Date: Wed, 9 Jul 2003 23:21:19 -0400 Subject: Realtek 8139 Message-ID: <20030709232119.5a0a378b.torsten@howard.cc> Hello All, This is an FYI, followed by a request for ethernet card suggestions. My secondary ethernet for my Beowulf cluster is a Realtek 8139 chip D-Link 530TX. I also have this chipset on the motherboard itself. The chipset on the MB works, it seems, my suspicions are because it is only 10MBit. On the subnet, a 100MBit net, it is falling over itself. First, I started getting NFS problems. I google'd and found out that A. The NFS "buffer" is overflowing, or not being cleared adequately. B. The ethernet card is misconfigured. C. The driver is poor or does not match the card. D. The card is defective. I also tried ftp, and after a few megs are transfered, the chip fails to be able to transfer more. I found many mentions of this chipset being the low of the low, and it is driving me nuts. Interestingly, I can IP masq the subnet and connect to the internet, seemingly ok. Just NFS and FTP are dying. Blah. I'm going to purchase some new network cards. 
I'm leaning towards 3Com 3c905C-TXM cards because they are cheap enough ($20 pricewatch), PCI, 100MBit, and have PXE roms, and, most of all, are known stable and working under Linux. I would like to solicit ethernet card recommendations before I purchase another mistake. Thanks, Torsten _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From palott at math.umd.edu Wed Jul 9 23:14:31 2003 From: palott at math.umd.edu (P. Aaron Lott) Date: Wed, 9 Jul 2003 23:14:31 -0400 Subject: gentoo cluster Message-ID: <9FD878E4-B284-11D7-96C6-000393DC6E46@math.umd.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi, Our group is interested in building a beowulf cluster using gentoo linux as the OS. Has anyone on the list had experience with this or know anyone who has experience with this? We're trying to figure out the best way to spawn nodes once we have configured one machine properly. Any suggestions such as pseudo kickstart methods would be greatly appreciated. Thanks, Aaron palott at math.umd.edu http://www.lcv.umd.edu/~palott LCV: IPST 4364A (301)405-4865 Office: IPST 4364D (301)405-4843 Fax: (301)314-0827 P. Aaron Lott 1301 Mathematics Building University of Maryland College Park, MD 20742-4015 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.1 (Darwin) iD8DBQE/DNoizzvfVkBO8H4RAhquAJ0XVKDjkHxE6W52eZGNO80YKDJKdwCfSZqP d6iwjdalKhqGI4xHGH4d678= =QcSo -----END PGP SIGNATURE----- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From kpodesta at redbrick.dcu.ie Thu Jul 10 05:17:34 2003 From: kpodesta at redbrick.dcu.ie (Karl Podesta) Date: Thu, 10 Jul 2003 10:17:34 +0100 Subject: gentoo cluster In-Reply-To: <9FD878E4-B284-11D7-96C6-000393DC6E46@math.umd.edu> References: <9FD878E4-B284-11D7-96C6-000393DC6E46@math.umd.edu> Message-ID: <20030710091733.GD1661@prodigy.Redbrick.DCU.IE> On Wed, Jul 09, 2003 at 11:14:31PM -0400, P. Aaron Lott wrote: > Hi, > > Our group is interested in building a beowulf cluster using gentoo > linux as the OS. Has anyone on the list had experience with this or > know anyone who has experience with this? We're trying to figure out > the best way to spawn nodes once we have configured one machine > properly. Any suggestions such as pseudo kickstart methods would be > greatly appreciated. > > Thanks, > > Aaron Not gentoo-specific, but there was a thread a few weeks back where people posted up various (mostly similar) methods they use to clone nodes etc. On an old 23-node beowulf we have, we use a few small homegrown collected perl scripts written by the university networking society. Once configuring a machine, we make an image of it (simple gzip/tar, stores itself on the head node, takes 2 mins), then register the other nodes to 'clone' from this image we've just made, reboot the nodes from a floppy, and they clone themselves from the network at about 2 minutes a piece, takes about 5-10 mins maybe to clone all 23 nodes! Surprisingly quick for a simple ftp/un-tgz over standard ethernet from a single head node. We use the etherboot package to create a boot floppy which we use to boot the nodes, and our scripts modify the DHCP conf file to say which nodes should then be subsequently picked up and which linux kernel they should use to load up. 
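The image-capture half of that scheme is little more than tar over ssh. A sketch, with the host name, image path and exclude list as placeholders, assuming GNU tar on the golden node:

-----------------------------------------------------------------
#!/bin/sh
# capture a hand-configured node into a tarball on the head node;
# host name, image path and exclude list are placeholders
GOLDEN=node01
IMAGE=/export/images/node-`date +%Y%m%d`.tgz

ssh $GOLDEN "tar czf - --exclude=/proc --exclude=/tmp --exclude=/mnt /" > $IMAGE
ls -lh $IMAGE
-----------------------------------------------------------------
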
The startup scripts that load after the linux kernel ftp the node image down from the head node, un-gzip the image, and un-tar it onto the machine. Hey presto, etc. You could probably write something small yourself using etherboot/DHCP/targz and some alteration of config files, or you could use cloning software like g4u (which I found really slow? It took like 30 minutes to clone a node compared to 2 for our own scripts?), or you could use cluster software like ROCKS. Depends on your time and/or inclination! I'm not sure that simple tar'ing of a filesystem is the completely correct way to go about it, but we don't have many actively live users (at least not when I decide I'm going to clone nodes...), plus it's fast and dirty. So works for us, for now.. Something more 'proper' might require a dd'ing of the disk, or something? Kp -- Karl Podesta + School of Computing, Dublin City University, Ireland + National Institute for Cellular Biotechnology, Ireland _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From daniel at labtie.mmt.upc.es Thu Jul 10 08:14:42 2003 From: daniel at labtie.mmt.upc.es (Daniel Fernandez) Date: 10 Jul 2003 14:14:42 +0200 Subject: Small PCs cluster In-Reply-To: <3F05B96C.6040801@tamu.edu> References: <1057334911.3814.28.camel@qeldroma.cttc.org> <3F05B96C.6040801@tamu.edu> Message-ID: <1057839282.764.20.camel@qeldroma.cttc.org> Hi again, Thanks for the answers, we also checked the Mini-ITX mainboard, but C3 processors don't offer enough FPU raw speed. On the other hand, the integrated nVidia ethernet controller is in fact a Realtek 8201BL, this is our last trouble before we decide what to purchase. Our actual cluster is equipped with 3Com 3c905CX-TX-M ethernet controllers, our doubt is about that Realtek controller because I suspect that Realtek ethernet nics put more load onto the main CPU ? can anyone confirm this ? I suppose that the NIC for cluster of choice is 3Com around there, but... ? how about Realtek NICs under heavy load? If doesn't work well, we can afford an extra 3Com NIC of course. -- Daniel Fernandez Laboratori de Termot?cnia i Energia - CTTC > On Fri, 2003-07-04 at 19:29, Gerry Creager N5JXS wrote: > Relatively speaking the Shuttle cases, while small for a P4 or Athelon > processor class machine, are pretty big compared to the Mini-ITX > systems. However, the heat-pipes seem to do a pretty good job of > off-loading heat and making the heat-exchanger available to ambient air. > > I've not built a cluster so far using this sort of case, but I've got a > lot of past heat-pipe experience. I'd be tring to maintain a low inlet > temperature to the rack, and a fairly high, and (uncharacteristically) > non-laminar airflow through the rack. The idea is to get as much > airflow incident to the heat-pipe heat exchanger as possible. > > We did a fair bit of heat-pipe work while I was at NASA. We found cood > radiative characteristics in heat-pipe heat exchangers (the heat-pipes > wouldn't have worked otherwise!) but they work best when they combine > both convective and radiative modes and use a cool-air transport. > > I've got a number of isolated small-form-factor PCs now running. I've > seen no instability with the integrated components in any of these. 
> > gerry > > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From nashif at planux.com Thu Jul 10 10:25:19 2003 From: nashif at planux.com (Anas Nashif) Date: Thu, 10 Jul 2003 10:25:19 -0400 Subject: SuSE 8.2 for AMD64 Download Message-ID: <3F0D774F.4010908@planux.com> Hi, 8.2 for AMD64 is available on the FTP server: ftp://ftp.suse.com/pub/suse/x86-64/8.2-beta/ Press Release in german: http://www.suse.de/de/company/press/press_releases/archive03/82_x86_64_beta.html Anas _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From becker at scyld.com Thu Jul 10 11:04:44 2003 From: becker at scyld.com (Donald Becker) Date: Thu, 10 Jul 2003 11:04:44 -0400 (EDT) Subject: Small PCs cluster In-Reply-To: <1057839282.764.20.camel@qeldroma.cttc.org> Message-ID: On 10 Jul 2003, Daniel Fernandez wrote: > Thanks for the answers, we also checked the Mini-ITX mainboard, but C3 > processors don't offer enough FPU raw speed. On the other hand, the > integrated nVidia ethernet controller is in fact a Realtek 8201BL, this > is our last trouble before we decide what to purchase. The nVidia Ethernet NIC uses the rtl8201BL _transceiver_. Don't confuse this with the rtl8139 NIC chip, which has the transceiver integrated on the same chip with the NIC. There have been several reports of mediocre preformance and kernel problems from using the proprietary, binary-only nVidia driver. It's likely more efficient than the standard rtl8139 interface (before the C+), but it's difficult to know without the driver source. > Our actual cluster is equipped with 3Com 3c905CX-TX-M ethernet controllers, > our doubt is about that Realtek controller because I suspect that Realtek > ethernet nics put more load onto the main CPU ? can anyone confirm this ? The 3c905C is one of the best Fast Ethernet NICs available. It does well with everything but multicast filtering. -- Donald Becker becker at scyld.com Scyld Computing Corporation http://www.scyld.com 914 Bay Ridge Road, Suite 220 Scyld Beowulf cluster system Annapolis MD 21403 410-990-9993 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From fant at pobox.com Thu Jul 10 10:25:59 2003 From: fant at pobox.com (Andrew Fant) Date: Thu, 10 Jul 2003 10:25:59 -0400 (EDT) Subject: gentoo cluster In-Reply-To: <9FD878E4-B284-11D7-96C6-000393DC6E46@math.umd.edu> Message-ID: <20030710100848.N15741-100000@net.bluemoon.net> I am in the closing stages of a project to build a 64 CPU Xeon cluster that is using gentoo as it's base os. For installation and the like, I am using Systemimager. It's not perfect, but it has the decided advantage of not depending on any particular packaging system to handle the installs. You will probably want a http proxy on a head node to simplify the installation process. I just did a manual install of the O/S on the head nodes and on one of the compute nodes, and cloned from there, though if you want further automation, there is a gentoo installer project on sourceforge, iirc, or you can script most of it in sh, of course. 
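On the head-node proxy point: Portage can take its proxy settings from /etc/make.conf, so the compute nodes never talk to the outside world directly. A sketch only; the proxy host, port and mirror URL are placeholders:

-----------------------------------------------------------------
# /etc/make.conf fragment on each node -- sketch of the head-node
# proxy idea; proxy host/port and mirror URL are placeholders
http_proxy="http://headnode:3128"
ftp_proxy="http://headnode:3128"
GENTOO_MIRRORS="http://distfiles.gentoo.org"
-----------------------------------------------------------------

Alternatively, DISTDIR can be NFS-exported from the head node so each source tarball is fetched only once, and building binary packages on the golden node (emerge --buildpkg there, emerge --usepkg on the others) keeps every compute node from compiling the same things.
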
Are you planning to run commercial apps on this cluster, or will it be primarily user developed code? I have found that most commercial apps can be coerced into running under gentoo, but modifying their installed scripts may be something of a PITA, and you almost certainly will get to be good friends with rpm2targz. One last caveat. Depending on how "production" you are going to make this cluster, you may need to be a little less agressive about updating ebuilds and which versions of packages you install. A good regression test suite is good to have if you have layered software to install which isn't part of an ebuild to start. I'd be glad to talk to anyone else who has an interest in gentoo-based beowulfish clusters. In spite of the extra engineering work, I am pleased with the results. Andy Andrew Fant | This | "If I could walk THAT way... Molecular Geek | Space | I wouldn't need the talcum powder!" fant at pobox.com | For | G. Marx (apropos of Aerosmith) Boston, MA USA | Hire | http://www.pharmawulf.com On Wed, 9 Jul 2003, P. Aaron Lott wrote: > Our group is interested in building a beowulf cluster using gentoo > linux as the OS. Has anyone on the list had experience with this or > know anyone who has experience with this? We're trying to figure out > the best way to spawn nodes once we have configured one machine > properly. Any suggestions such as pseudo kickstart methods would be > greatly appreciated. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From c00jsh00 at nchc.gov.tw Thu Jul 10 05:23:44 2003 From: c00jsh00 at nchc.gov.tw (Jyh-Shyong Ho) Date: Thu, 10 Jul 2003 17:23:44 +0800 Subject: PVM Message-ID: <3F0D30A0.D572627A@nchc.gov.tw> Hi, I installed pvm-3.4.4-190.x86_64.rpm on my dual Opteron box running SLSE8 for AMD64, I got the following message: > pvm libpvm [pid1483]: mxfer() mxinput bad return on pvmd sock libpvm [pid1483] mksocs() connect: No such file or directory libpvm [pid1483] socket address tried: /tmp/pvmtmp001485.0 libpvm [pid1483]: Console: Can't contact local daemon I wonder if someone knows what is the reason causes this problem? Thanks for any suggestion and help. Best Regards Jyh-Shyong Ho, PhD. Research Scientist National Center for High-Performance Computing Hsinchu, Taiwan, ROC _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jducom at nd.edu Thu Jul 10 11:49:32 2003 From: jducom at nd.edu (Jean-Christophe Ducom) Date: Thu, 10 Jul 2003 10:49:32 -0500 Subject: etherchannel References: <6353EB090D04484B9AFF8E257A4BF84D3D5F68@blrhomx2.wipro.co.in> Message-ID: <3F0D8B0C.40209@nd.edu> Or you can have a look at: http://www.st.rim.or.jp/~yumo/ JC Mohamed Abubakkar Siddiqu wrote: > Hi.. > > > > U can try Channel Bonding. Check Bonding Documentation from the Kernel source > > Siddiqu.T > > > > -----Original Message----- > From: Jyh-Shyong Ho [mailto:c00jsh00 at nchc.gov.tw] > Sent: Tuesday, July 08, 2003 10:28 AM > To: beowulf at beowulf.org > Subject: etherchannel > > > Hi, > > Does anyone know how to set up and configure etherchannel > on Linux system? > > I have a motherboard has two Broadcom gigabit ports, and > a 24-port SMC Gigabit TigerSwitch which also has Broadcom > chip on it. 
Both support IEEE 802.3ad protocol which allows > to combine two physical LAN ports into a logical one and > double the bandwitch.There are several name for such feature, > etherchannel is just one of them. > > I wonder if anyone has try this on a Linux system, say > SuSE Enterprise Server 8 or RedHat 9 ? any help or suggestion > will be appreciated. > > Best Regards > > Jyh-Shyong Ho, PhD. > Research Scientist > National Center for High-Performance Computing > Hsinchu, Taiwan, ROC > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > **************************Disclaimer************************************ > > Information contained in this E-MAIL being proprietary to Wipro Limited is > 'privileged' and 'confidential' and intended for use only by the individual > or entity to which it is addressed. You are notified that any use, copying > or dissemination of the information contained in the E-MAIL in any manner > whatsoever is strictly prohibited. > > *************************************************************************** > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From nordquist at geosci.uchicago.edu Thu Jul 10 01:37:48 2003 From: nordquist at geosci.uchicago.edu (Russell Nordquist) Date: Thu, 10 Jul 2003 00:37:48 -0500 (CDT) Subject: gentoo cluster In-Reply-To: <9FD878E4-B284-11D7-96C6-000393DC6E46@math.umd.edu> Message-ID: On Wed, 9 Jul 2003 at 23:14, P. Aaron Lott wrote: > > Hi, > > Our group is interested in building a beowulf cluster using gentoo > linux as the OS. Has anyone on the list had experience with this or > know anyone who has experience with this? We're trying to figure out > the best way to spawn nodes once we have configured one machine > properly. Any suggestions such as pseudo kickstart methods would be > greatly appreciated. > If all the nodes are identical hw wise, systemimager (with network boot) is an easy way to go for any flavor of linux. come to think of it, they may not need to be that identical as long as your kernel support the hardware. a search for "cloning" on freshmeat gives a few others. i'd be interested in how you gentoo-beowulf goes...i'm sure someone else is running one, but i don't know of any. russell > Thanks, > > Aaron > > > > palott at math.umd.edu > http://www.lcv.umd.edu/~palott > LCV: IPST 4364A (301)405-4865 > Office: IPST 4364D (301)405-4843 > Fax: (301)314-0827 > > P. 
Aaron Lott > 1301 Mathematics Building > University of Maryland > College Park, MD 20742-4015 > > > - - - - - - - - - - - - Russell Nordquist UNIX Systems Administrator Geophysical Sciences Computing http://geosci.uchicago.edu/computing NSIT, University of Chicago - - - - - - - - - - - _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From nordquist at geosci.uchicago.edu Thu Jul 10 12:03:30 2003 From: nordquist at geosci.uchicago.edu (Russell Nordquist) Date: Thu, 10 Jul 2003 11:03:30 -0500 (CDT) Subject: Small PCs cluster In-Reply-To: Message-ID: On Thu, 10 Jul 2003 at 11:04, Donald Becker wrote: > On 10 Jul 2003, Daniel Fernandez wrote: > > > The 3c905C is one of the best Fast Ethernet NICs available. > It does well with everything but multicast filtering. Could you elaborate on it's issues with multicast filtering (or point me somewhere)? I am having some problems with multicast on a multihomed box with these NICs and this is the first I have heard of this. thanks russell > > -- > Donald Becker becker at scyld.com > Scyld Computing Corporation http://www.scyld.com > 914 Bay Ridge Road, Suite 220 Scyld Beowulf cluster system > Annapolis MD 21403 410-990-9993 > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > - - - - - - - - - - - - Russell Nordquist UNIX Systems Administrator Geophysical Sciences Computing http://geosci.uchicago.edu/computing NSIT, University of Chicago - - - - - - - - - - - _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From vanne at venda.uku.fi Thu Jul 10 10:06:35 2003 From: vanne at venda.uku.fi (Antti Vanne) Date: Thu, 10 Jul 2003 17:06:35 +0300 (EEST) Subject: kernel level ip-config and nic driver as a module Message-ID: Hi, I'm building my second beowulf cluster and ran into trouble with 3com 940 network interface chip that is embedded in the mobo. DHCP works fine, client gets IP, but tftp won't load the pxelinux.0, it tries twice (according to the in.tftpd's log), but the client doesn't try to look for pxelinux.cfg/C0... config files. I have one similar setup working using the Intel e1000, and according to http://syslinux.zytor.com/hardware.php there's been trouble with 3com cards, so I figure the fault is not in the config but in the network chip. The best option would be PXE (anyone have a working pxe setup with 3c940?), but since it seems impossible, I'm trying to boot clients from floppy and use nfsroot: however the driver for 3c940 is available (from www.asus.com) only as kernel module, and unfortunately kernel runs ip-config before loading the module from initrd?!? How is this fixed? I'm not really a kernel hacker, obviously one could browse the kernel source and look for ip-config and module loading, but isn't there any easier way to change the boot sequence so that network module would be loaded before running ip-config? Any help would be greatly appreciated. If there is no easy way to change the order, what would be the next thing to do? Have minimal root filesystem on the floppy and then nfs-mount /usr etc. from the server? 
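The usual way around the "ip-config runs before the module loads" ordering on 2.4 kernels is to build the NIC driver into the kernel and let the kernel itself do DHCP and mount the NFS root, rather than relying on an initrd. A sketch of the relevant pieces; the sk98lin choice for the 3c940, the server address and the export path are all assumptions to verify locally:

-----------------------------------------------------------------
# kernel .config fragment for the nfsroot approach: build the NIC
# driver into the kernel so it exists before kernel-level ip-config
# runs (sk98lin is reported to cover the 3c940 in recent 2.4
# kernels -- verify against your kernel version)
CONFIG_SK98LIN=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
CONFIG_NFS_FS=y
CONFIG_ROOT_NFS=y

# matching append line in the syslinux/etherboot config; server
# address and export path are placeholders
#   append root=/dev/nfs nfsroot=192.168.0.1:/export/nfsroot ip=dhcp
-----------------------------------------------------------------
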
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From samhdaniel at earthlink.net Thu Jul 10 13:33:50 2003 From: samhdaniel at earthlink.net (Sam Daniel) Date: 10 Jul 2003 13:33:50 -0400 Subject: ClusterWorld Message-ID: <1057858430.4664.4.camel@wulf> Didn't anyone attend? Doesn't anyone have anything to say about it? How were the sessions? Will there be any Proceedings available? Etc., etc., etc.... If not on this list, then where? -- Sam Come out in the open with Linux. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From twhitcomb at apl.washington.edu Thu Jul 10 16:52:50 2003 From: twhitcomb at apl.washington.edu (Timothy R. Whitcomb) Date: Thu, 10 Jul 2003 13:52:50 -0700 (PDT) Subject: help! MPI Calls not responding... Message-ID: We are trying to run the Navy's COAMPS atmospheric model on a Scyld Beowulf cluster, using the Portland Group FORTRAN compiler. The cluster is comprised of five nodes, each with dual AMD processors. After some modification to the supplied Makefile, the software now compiles and fully links. The makefile was modified to use the following options for the compiler ----------------------------------------------- "EXTRALIBS= -L/usr/lib -lmpi -lmpich -lpmpich -lbproc -lbpsh -lpvfs -lbeomap -lbeostat -ldl -llapack -lblas -lparpack_LINUX -L/usr/coamps3/lib -lfnoc -L/usr/lib/gcc-lib/i386-redhat-linux/2.96 -lg2c" ----------------------------------------------- However, when we try to run the code using mpirun -allcpus atmos_forecast.exe or mpprun -allcpus atmos_forecast.exe in a Perl script, it gives the following error: ----------------------------------------------- Fatal error; unknown error handler May be MPI call before MPI_INIT. Error message is MPI_INIT and code is 208 Fatal error; unknown error handler May be MPI call before MPI_INIT. Error message is MPI_COMM_RANK and code is 197 Fatal error; unknown error handler May be MPI call before MPI_INIT. Error message is MPI_COMM_SIZE and code is 197 NOT ENOUGH COMPUTATIONAL PROCESSES Fatal error; unknown error handler May be MPI call before MPI_INIT. Error message is MPI_ABORT and code is 197 Fatal error; unknown error handler May be MPI call before MPI_INIT. Error message is MPI_BARRIER and code is 197 ----------------------------------------------- where the NOT ENOUGH COMPUTATIONAL PROCESSES is a program message that indicates that you've specified to use more processors than available. The offending section of code is ----------------------------------------------- call MPI_INIT(ierr_mpi) call MPI_COMM_RANK(MPI_COMM_WORLD, npr, ierr_mpi) call MPI_COMM_SIZE(MPI_COMM_WORLD, nprtot, ierr_mpi) ----------------------------------------------- I modified this code to add a call to MPI_INITIALIZED after the MPI_INIT call which indicated that the MPI_INIT just plain was not working. If it makes any difference, I can run the Beowulf demos (like mpi-mandel or linpack) just fine on the multiple processors. What is going on here and how do we fix it? We're new to cluster computing, and this is getting over our heads. I've tried to supply the information I thought was relevant but as this project is proving to me what I think doesn't do me much good. Thanks in advance... 
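One thing that may be worth ruling out, purely a guess from the link line above: EXTRALIBS pulls in -lmpi, -lmpich and -lpmpich at the same time, and mixing MPI libraries can produce exactly this kind of "MPI call before MPI_INIT" failure if the Fortran wrappers resolve against a different library than the C core. A quick, non-invasive check (assuming the executable is not stripped):

# which MPI shared libraries did the binary actually pick up (empty output means static)?
ldd atmos_forecast.exe | grep -i mpi
# which MPI_Init symbols does it reference?
nm atmos_forecast.exe | grep -i mpi_init

If cpi or fpi is relinked with that same EXTRALIBS line and starts failing the same way, the link line rather than the cluster setup is the likely culprit.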
Tim Whitcomb twhitcomb at apl.washington.edu University of Washington Applied Physics Laboratory _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bob at drzyzgula.org Thu Jul 10 18:13:59 2003 From: bob at drzyzgula.org (Bob Drzyzgula) Date: Thu, 10 Jul 2003 18:13:59 -0400 Subject: batch software In-Reply-To: <1057857552.73501@accufo.vwh.net> References: <1057857552.73501@accufo.vwh.net> Message-ID: <20030710181359.I14673@www2> Grid Engine. Free, open source. Binaries are available for Tru64. http://gridengine.sunsource.net/ --Bob Drzyzgula On Thu, Jul 10, 2003 at 11:19:13AM -0600, sfrolov at accufo.vwh.net wrote: > > Can anybody recommend a good (and cheap) batch software for an alpha cluster running true64 Unix? Unfortunately we cannot afford to spend more than $300 on this at the moment. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From sfrolov at accufo.vwh.net Thu Jul 10 13:19:13 2003 From: sfrolov at accufo.vwh.net (sfrolov at accufo.vwh.net) Date: Thu, 10 Jul 2003 11:19:13 -0600 (MDT) Subject: batch software Message-ID: <1057857552.73501@accufo.vwh.net> Can anybody recommend a good (and cheap) batch software for an alpha cluster running true64 Unix? Unfortunately we cannot afford to spend more than $300 on this at the moment. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From c00jsh00 at nchc.gov.tw Thu Jul 10 20:13:27 2003 From: c00jsh00 at nchc.gov.tw (Jyh-Shyong Ho) Date: Fri, 11 Jul 2003 08:13:27 +0800 Subject: queueing system for x86-64 Message-ID: <3F0E0127.8A50A8CB@nchc.gov.tw> Hi, I wonder if someone knows where can I find a queueing system like OpenPBS for x86-64 (AMD Opteron) ? Best Regards Jyh-Shyong Ho, PhD. Research Scientist National Center for High-performance Computing Hsinchu, Taiwan, ROC _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From andrewxwang at yahoo.com.tw Thu Jul 10 21:22:18 2003 From: andrewxwang at yahoo.com.tw (=?big5?q?Andrew=20Wang?=) Date: Fri, 11 Jul 2003 09:22:18 +0800 (CST) Subject: batch software In-Reply-To: <1057857552.73501@accufo.vwh.net> Message-ID: <20030711012218.72314.qmail@web16810.mail.tpe.yahoo.com> Sun's Gridengine is very good, it's free and opensource. http://gridengine.sunsource.net/ (IMO, I think it is even better than commercial software like PBSPro or LSF). Andrew. --- sfrolov at accufo.vwh.net ???? > Can anybody recommend a good (and cheap) batch > software for an alpha cluster running true64 Unix? > Unfortunately we cannot afford to spend more than > $300 on this at the moment. 
> _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or > unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ----------------------------------------------------------------- [Yahoo! Taiwan mail footer] http://fate.yahoo.com.tw/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From nordquist at geosci.uchicago.edu Thu Jul 10 18:05:48 2003 From: nordquist at geosci.uchicago.edu (Russell Nordquist) Date: Thu, 10 Jul 2003 17:05:48 -0500 (CDT) Subject: batch software In-Reply-To: <1057857552.73501@accufo.vwh.net> Message-ID: Take a look at Sun Grid Engine... there are binaries for Tru64 (or source) and it's free. You may want to look at running maui scheduler on top of it. http://www.supercluster.org/maui/ russell On Thu, 10 Jul 2003 at 11:19, sfrolov at accufo.vwh.net wrote: > Can anybody recommend a good (and cheap) batch software for an alpha cluster running true64 Unix? Unfortunately we cannot afford to spend more than $300 on this at the moment. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > - - - - - - - - - - - - Russell Nordquist UNIX Systems Administrator Geophysical Sciences Computing http://geosci.uchicago.edu/computing NSIT, University of Chicago - - - - - - - - - - - _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From andrewxwang at yahoo.com.tw Fri Jul 11 01:08:50 2003 From: andrewxwang at yahoo.com.tw (=?big5?q?Andrew=20Wang?=) Date: Fri, 11 Jul 2003 13:08:50 +0800 (CST) Subject: queueing system for x86-64 In-Reply-To: <3F0E0127.8A50A8CB@nchc.gov.tw> Message-ID: <20030711050850.30031.qmail@web16811.mail.tpe.yahoo.com> Has anyone tried Gridengine on Opteron? I think the existing x86 binary should work, binary download: http://gridengine.sunsource.net/project/gridengine/download.html If it doesn't, just subscribe to the users list, there are a lot of helpful people. http://gridengine.sunsource.net/project/gridengine/maillist.html Another reason I like SGE is because it has a Chinese User/Admin manual: http://www.sun.com/products-n-solutions/hardware/docs/Software/Sun_Grid_Engine/ Andrew. --- Jyh-Shyong Ho wrote: > Hi, > > I wonder if someone knows where can I find a > queueing system like > OpenPBS > for x86-64 (AMD Opteron) ? > > Best Regards > > Jyh-Shyong Ho, PhD. > Research Scientist > National Center for High-performance Computing > Hsinchu, Taiwan, ROC > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or > unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ----------------------------------------------------------------- [Yahoo! Taiwan mail footer]
http://fate.yahoo.com.tw/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jeffrey.b.layton at lmco.com Fri Jul 11 12:13:08 2003 From: jeffrey.b.layton at lmco.com (Jeff Layton) Date: Fri, 11 Jul 2003 12:13:08 -0400 Subject: MPICH 1.2.5 failures (net_recv) Message-ID: <3F0EE214.6000602@lmco.com> Good afternoon! Our cluster has been recently upgraded (from a 2.2 kernel to a 2.4 kernel). I've built MPICH-1.2.5 on it using the PGI 4.1 compilers, with the following configuration: ./configure --prefix=/home/g593851/BIN/mpich-1.2.5/pgi \ --with-ARCH=LINUX \ --with-device=ch_p4 \ --without-romio --without-mpe \ -opt=-O2 \ -cc=/usr/pgi/linux86/bin/pgcc \ -fc=/usr/pgi/linux86/bin/pgf90 \ -clinker=/usr/pgi/linux86/bin/pgcc \ -flinker=/usr/pgi/linux86/bin/pgf90 \ -f90=/usr/pgi/linux86/bin/pgf90 \ -f90linker=/usr/pgi/linux86/bin/pgf90 \ -c++=/usr/pgi/linux86/bin/pgCC \ -c++linker=/usr/pgi/linux86/bin/pgCC I've built the 'cpi' and 'fpi' examples in the examples/basic directory and tried running them using the following mpirun line: /home/g593851/BIN/mpich-1.2.5/pgi/bin/mpirun -np 10 -machinefile PBS_NODEFILE cpi where PBS_NODEFILE is, penguin1 penguin1 penguin2 penguin2 penguin3 penguin3 penguin4 penguin4 penguin5 penguin5 (however, I'm testing outside of PBS). The code seems to hang fo quite a while and then I get the following: p0_14235: (935.961023) net_recv failed for fd = 10 p0_14235: p4_error: net_recv read, errno = : 110 p2_12406: (935.817898) net_send: could not write to fd=7, errno = 104 /home/g593851/BIN/mpich-1.2.5/pgi/bin/mpirun: line 1: 14235 Broken pipe /home/g593851/src/mpich-1.2.5/examples/basic/cpi -p4pg /home/g593851/src/mpich-1.2.5/examples/basic/PI13983 -p4wd /home/g593851/src/mpich-1.2.5/examples/basic More system details - It's a RH 7.1 OS, but with a stock 2.4.20 kernel. The interconnect is FastE through a Foundry switch and the NICS are Intel EEPro100 (using the eepro100 driver). Does anybody have any ideas? I've I searched around the net a bit and the results were inconclusive ("use LAM instead", may have bad NIC drivers, problematic TCP stack, etc.). TIA! Jeff -- Dr. Jeff Layton Chart Monkey - Aerodynamics and CFD Lockheed-Martin Aeronautical Company - Marietta _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From siegert at sfu.ca Fri Jul 11 13:11:07 2003 From: siegert at sfu.ca (Martin Siegert) Date: Fri, 11 Jul 2003 10:11:07 -0700 Subject: MPICH 1.2.5 failures (net_recv) In-Reply-To: <3F0EE214.6000602@lmco.com> References: <3F0EE214.6000602@lmco.com> Message-ID: <20030711171107.GA29718@stikine.ucs.sfu.ca> On Fri, Jul 11, 2003 at 12:13:08PM -0400, Jeff Layton wrote: > Good afternoon! > > Our cluster has been recently upgraded (from a 2.2 kernel to a 2.4 > kernel). 
I've built MPICH-1.2.5 on it using the PGI 4.1 compilers, > with the following configuration: > > ./configure --prefix=/home/g593851/BIN/mpich-1.2.5/pgi \ > --with-ARCH=LINUX \ > --with-device=ch_p4 \ > --without-romio --without-mpe \ > -opt=-O2 \ > -cc=/usr/pgi/linux86/bin/pgcc \ > -fc=/usr/pgi/linux86/bin/pgf90 \ > -clinker=/usr/pgi/linux86/bin/pgcc \ > -flinker=/usr/pgi/linux86/bin/pgf90 \ > -f90=/usr/pgi/linux86/bin/pgf90 \ > -f90linker=/usr/pgi/linux86/bin/pgf90 \ > -c++=/usr/pgi/linux86/bin/pgCC \ > -c++linker=/usr/pgi/linux86/bin/pgCC > > > I've built the 'cpi' and 'fpi' examples in the examples/basic directory > and tried running them using the following mpirun line: > > > /home/g593851/BIN/mpich-1.2.5/pgi/bin/mpirun -np 10 -machinefile > PBS_NODEFILE cpi > > > where PBS_NODEFILE is, > > penguin1 > penguin1 > penguin2 > penguin2 > penguin3 > penguin3 > penguin4 > penguin4 > penguin5 > penguin5 > > (however, I'm testing outside of PBS). The code seems to hang fo > quite a while and then I get the following: > > p0_14235: (935.961023) net_recv failed for fd = 10 > p0_14235: p4_error: net_recv read, errno = : 110 > p2_12406: (935.817898) net_send: could not write to fd=7, errno = 104 > /home/g593851/BIN/mpich-1.2.5/pgi/bin/mpirun: line 1: 14235 Broken > pipe /home/g593851/src/mpich-1.2.5/examples/basic/cpi -p4pg > /home/g593851/src/mpich-1.2.5/examples/basic/PI13983 -p4wd > /home/g593851/src/mpich-1.2.5/examples/basic > > > More system details - It's a RH 7.1 OS, but with a stock 2.4.20 > kernel. The interconnect is FastE through a Foundry switch and the > NICS are Intel EEPro100 (using the eepro100 driver). > Does anybody have any ideas? I've I searched around the net a bit and > the results were inconclusive ("use LAM instead", may have bad NIC > drivers, problematic TCP stack, etc.). I think you sent this to the wrong mailing list. As outlined on the MPICH home page problem reports should go to mpi-maint at mcs.anl.gov The folks at Argonne are usually extremly helpful with solving problems. Cheers, Martin -- Martin Siegert Manager, Research Services WestGrid Site Manager Academic Computing Services phone: (604) 291-4691 Simon Fraser University fax: (604) 291-4242 Burnaby, British Columbia email: siegert at sfu.ca Canada V5A 1S6 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at keyresearch.com Fri Jul 11 13:55:10 2003 From: lindahl at keyresearch.com (Greg Lindahl) Date: Fri, 11 Jul 2003 10:55:10 -0700 Subject: MPICH 1.2.5 failures (net_recv) In-Reply-To: <3F0EE214.6000602@lmco.com> References: <3F0EE214.6000602@lmco.com> Message-ID: <20030711175510.GA3185@greglaptop.greghome.keyresearch.com> On Fri, Jul 11, 2003 at 12:13:08PM -0400, Jeff Layton wrote: > p0_14235: (935.961023) net_recv failed for fd = 10 > p0_14235: p4_error: net_recv read, errno = : 110 It's a shame that so many programs don't print human-readable error messages. errno 110 is ETIMEDOUT. error 104 is ECONNRESET, but I would suspect that it's a secondary error generated by p0 exiting from the errno 110. 
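A quick way to turn the raw p4 errno values into readable text, assuming perl is on the node (it is on most Linux installs):

perl -e 'for (110, 104) { $! = $_; print "$_: $!\n" }'
# on Linux this prints:
# 110: Connection timed out
# 104: Connection reset by peer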
greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From AlberT at SuperAlberT.it Fri Jul 11 06:35:21 2003 From: AlberT at SuperAlberT.it (AlberT) Date: Fri, 11 Jul 2003 12:35:21 +0200 Subject: PVM In-Reply-To: <3F0D30A0.D572627A@nchc.gov.tw> References: <3F0D30A0.D572627A@nchc.gov.tw> Message-ID: <200307111235.21746.AlberT@SuperAlberT.it> On Thursday 10 July 2003 11:23, Jyh-Shyong Ho wrote: > Hi, > > I installed pvm-3.4.4-190.x86_64.rpm on my dual Opteron box > > running SLSE8 for AMD64, I got the following message: > > pvm > > libpvm [pid1483]: mxfer() mxinput bad return on pvmd sock > libpvm [pid1483] mksocs() connect: No such file or directory > libpvm [pid1483] socket address tried: /tmp/pvmtmp001485.0 > libpvm [pid1483]: Console: Can't contact local daemon > > I wonder if someone knows what is the reason causes this problem? > Thanks for any suggestion and help. are ou sure pvmd is running ??? check it using ps -axu | grep pvm -- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From exa at kablonet.com.tr Fri Jul 11 05:17:58 2003 From: exa at kablonet.com.tr (Eray Ozkural) Date: Fri, 11 Jul 2003 12:17:58 +0300 Subject: gentoo cluster In-Reply-To: <9FD878E4-B284-11D7-96C6-000393DC6E46@math.umd.edu> References: <9FD878E4-B284-11D7-96C6-000393DC6E46@math.umd.edu> Message-ID: <200307111217.58060.exa@kablonet.com.tr> On Thursday 10 July 2003 06:14, P. Aaron Lott wrote: > Hi, > > Our group is interested in building a beowulf cluster using gentoo > linux as the OS. Has anyone on the list had experience with this or > know anyone who has experience with this? We're trying to figure out > the best way to spawn nodes once we have configured one machine > properly. Any suggestions such as pseudo kickstart methods would be > greatly appreciated. I investigated this a while ago. It turns out that gentoo isn't really geared towards cluster use, but once you've customized it it can be pretty easy to use a system replication tool. I guess gentoo could benefit from a standardized HPC clustering solution, including parallel system libraries and tools. Thanks, -- Eray Ozkural (exa) Comp. Sci. Dept., Bilkent University, Ankara KDE Project: http://www.kde.org www: http://www.cs.bilkent.edu.tr/~erayo Malfunction: http://mp3.com/ariza GPG public key fingerprint: 360C 852F 88B0 A745 F31B EA0F 7C07 AE16 874D 539C _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jakob at unthought.net Sun Jul 13 15:17:42 2003 From: jakob at unthought.net (Jakob Oestergaard) Date: Sun, 13 Jul 2003 21:17:42 +0200 Subject: NIS problem .. In-Reply-To: References: Message-ID: <20030713191742.GA10670@unthought.net> On Tue, Jul 08, 2003 at 02:25:11PM +0530, Rakesh Gupta wrote: > > > Hi, > I am setting up a small 8 node cluster .. I have installed RedHat 9.0 > on all the nodes. > Now I want to setup NIS .. I have ypserv , portmap, ypbind running on > one of the nodes (The server) on the others I have ypbind and portmap. > > The NIS Domain is also set in /etc/sysconfig networkk .. > > Now when I do /var/yp/make .. 
an error of the following form comes > > " failed to send 'clear' to local ypserv: RPC: Unknown HostUpdating > passwd.byuid " > > and a sequence of such messages follow.. > > can anyone please help me with this. What's in your /var/yp/ypservers file? Does it include the NIS server? Are you sure that whatever hostname(s) you have there is resolvable? Do you have 'localhost' (and the name for the local host used in the ypservers file) in your /etc/hosts file? Are you sure you don't have any fancy firewalling enabled by accident? I'm shooting in the dark here... I haven't seen that particular problem on a NIS server before. It just looks like somehow it cannot contact the local host, which is weird... As a last resort, I would suggest looking thru the makefile, to see exactly which command fails. Once you have isolated the single command to run to get the error message you see, try running it under "strace". Then it should be pretty clear exactly which system call fails, and from there on you might be able to guess why it attempts to make that call. I haven't needed to go thru that routine with a NIS server yet... Usually turning on debugging information, and double-checking the configuration files should do it. My NIS server and slave is on Debian 3 now though, and I don't know if there are any particular oddities in the RedHat 9 setup. -- ................................................................ : jakob at unthought.net : And I see the elder races, : :.........................: putrid forms of man : : Jakob ?stergaard : See him rise and claim the earth, : : OZ9ABN : his downfall is at hand. : :.........................:............{Konkhra}...............: _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From msnitzer at lnxi.com Mon Jul 14 16:03:33 2003 From: msnitzer at lnxi.com (Mike Snitzer) Date: Mon, 14 Jul 2003 14:03:33 -0600 Subject: MPICH 1.2.5 failures (net_recv) In-Reply-To: <3F0EE214.6000602@lmco.com>; from jeffrey.b.layton@lmco.com on Fri, Jul 11, 2003 at 12:13:08PM -0400 References: <3F0EE214.6000602@lmco.com> Message-ID: <20030714140333.A10106@lnxi.com> On Fri, Jul 11 2003 at 10:13, Jeff Layton wrote: > Good afternoon! > > Our cluster has been recently upgraded (from a 2.2 kernel to a 2.4 > kernel). I've built MPICH-1.2.5 on it using the PGI 4.1 compilers, > with the following configuration: ... > Does anybody have any ideas? I've I searched around the net a bit and > the results were inconclusive ("use LAM instead", may have bad NIC > drivers, problematic TCP stack, etc.). Hey jeff, you might try compiling mpich with gcc to eliminate PGI as a potential source of error. This would at least allow you to verify the integrity of the drivers, tcp stack, nic, etc. PGI should be perfectly fine given the minimal mpich configure you provided but the compiler is one variable that is easy enough to eliminate as a potential problem. If you see the same problem with gcc compiled mpich then there is a deeper issue. You might confine the mpirun to use only 2 nodes and then scale up accordingly. 
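In case it saves some typing, a gcc/g77 build along those lines could look like the following. It simply mirrors the earlier PGI configure; the prefix is only an example, and there is no Fortran 90 front end with gcc here, but cpi and fpi need only C and Fortran 77:

./configure --prefix=/home/g593851/BIN/mpich-1.2.5/gcc \
            --with-ARCH=LINUX \
            --with-device=ch_p4 \
            --without-romio --without-mpe \
            -opt=-O2 \
            -cc=gcc -clinker=gcc \
            -fc=g77 -flinker=g77
make && make install
# then repeat the small test, starting with 2 processes as suggested:
/home/g593851/BIN/mpich-1.2.5/gcc/bin/mpirun -np 2 -machinefile PBS_NODEFILE cpi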
regards, mike -- Mike Snitzer msnitzer at lnxi.com Linux Networx http://www.lnxi.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From msnitzer at lnxi.com Mon Jul 14 16:35:41 2003 From: msnitzer at lnxi.com (Mike Snitzer) Date: Mon, 14 Jul 2003 14:35:41 -0600 Subject: queueing system for x86-64 In-Reply-To: <3F0E0127.8A50A8CB@nchc.gov.tw>; from c00jsh00@nchc.gov.tw on Fri, Jul 11, 2003 at 08:13:27AM +0800 References: <3F0E0127.8A50A8CB@nchc.gov.tw> Message-ID: <20030714143541.B10106@lnxi.com> On Thu, Jul 10 2003 at 18:13, Jyh-Shyong Ho wrote: > Hi, > > I wonder if someone knows where can I find a queueing system like > OpenPBS > for x86-64 (AMD Opteron) ? hello, If you'd like to use OpenPBS on x86-64 it works fine.. once you patch the buildutils/config.guess accordingly. An ia64 patch is available here: http://www.osc.edu/~troy/pbs/patches/config-ia64-2.3.12.diff you'll need to replace all instances of 'ia64' with 'x86_64' in the patch. fyi, you'll likely also need a patch to get gcc3.x to work with OpnePBS's makedepend-sh; search google with: makedepend openpbs gcc3 regards, mike -- Mike Snitzer msnitzer at lnxi.com Linux Networx http://www.lnxi.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From c00jsh00 at nchc.gov.tw Tue Jul 15 00:23:18 2003 From: c00jsh00 at nchc.gov.tw (Jyh-Shyong Ho) Date: Tue, 15 Jul 2003 12:23:18 +0800 Subject: PVM References: <3F0D30A0.D572627A@nchc.gov.tw> <200307111235.21746.AlberT@SuperAlberT.it> Message-ID: <3F1381B6.E423FA07@nchc.gov.tw> Hi, Thanks for the message. I checked and found that pvmd is not running, when I ran pvmd to initiate the daemon, it aborted immediately: c00jsh00 at Zephyr:~> pvmd /tmp/pvmtmp012493.0 Aborted Here are the environment variables: export PVM_ROOT=/usr/lib/pvm3 export PVM_ARCH=X86_64 export PVM_DPATH=$PVM_ROOT/lib/pvmd export PVM_TMP=/tmp export PVM=$PVM_ROOT/lib/pvm Perhaps someone knows what might be wrong. Jyh-Shyong Ho, PhD. Research Scientist National Center for High-Performance Computing Hsinchu, Taiwan, ROC AlberT wrote: > > On Thursday 10 July 2003 11:23, Jyh-Shyong Ho wrote: > > Hi, > > > > I installed pvm-3.4.4-190.x86_64.rpm on my dual Opteron box > > > > running SLSE8 for AMD64, I got the following message: > > > pvm > > > > libpvm [pid1483]: mxfer() mxinput bad return on pvmd sock > > libpvm [pid1483] mksocs() connect: No such file or directory > > libpvm [pid1483] socket address tried: /tmp/pvmtmp001485.0 > > libpvm [pid1483]: Console: Can't contact local daemon > > > > I wonder if someone knows what is the reason causes this problem? > > Thanks for any suggestion and help. > > are ou sure pvmd is running ??? > check it using ps -axu | grep pvm > -- > ' E-Mail: AlberT at SuperAlberT.it '."\n". > ' Web: http://SuperAlberT.it '."\n". 
> ' IRC: #php,#AES azzurra.com '."\n".'ICQ: 158591185'; ?> > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rene.storm at emplics.com Tue Jul 15 03:11:16 2003 From: rene.storm at emplics.com (Rene Storm) Date: Tue, 15 Jul 2003 09:11:16 +0200 Subject: Default user installed by Packages Message-ID: <29B376A04977B944A3D87D22C495FB23D52A@vertrieb.emplics.com> Hi Beowulfers, I'm working on a little Cluster Builder which bases on rsync. As I noticed, rsync change the owner of a file attribute via chown, if the owner is known by the system. Would you be so nice and take a look, if I have to expand my "default-known" user list on the pxe-environment ?. I would like the have it destribution independent. Some Suse and Debian lists would be nice. This list belongs to RH 7.3 # cat /etc/passwd | cut -d: -f1 | sort adm amanda apache bin daemon ftp games gdm gopher halt ident junkbust ldap lp mail mailman mailnull mysql named netdump news nfsnobody nobody nscd ntp operator pcap postfix postgres pvm radvd root rpc rpcuser rpm shutdown squid sync uucp vcsa xfs Thanks in advance Rene Storm __________________________ emplics AG _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From shewa at inel.gov Tue Jul 15 09:59:56 2003 From: shewa at inel.gov (Andrew Shewmaker) Date: Tue, 15 Jul 2003 07:59:56 -0600 Subject: PVM In-Reply-To: <3F1381B6.E423FA07@nchc.gov.tw> References: <3F0D30A0.D572627A@nchc.gov.tw> <200307111235.21746.AlberT@SuperAlberT.it> <3F1381B6.E423FA07@nchc.gov.tw> Message-ID: <3F1408DC.20606@inel.gov> Jyh-Shyong Ho wrote: > Hi, > > Thanks for the message. I checked and found that pvmd is not running, > when I ran pvmd to initiate the daemon, it aborted immediately: > > c00jsh00 at Zephyr:~> pvmd > /tmp/pvmtmp012493.0 > Aborted > > Here are the environment variables: > > export PVM_ROOT=/usr/lib/pvm3 > export PVM_ARCH=X86_64 > export PVM_DPATH=$PVM_ROOT/lib/pvmd > export PVM_TMP=/tmp > export PVM=$PVM_ROOT/lib/pvm > > Perhaps someone knows what might be wrong. Do you have a /tmp/pvmd* file? They can be left after a pvm crash and prevent future instances from starting. Also, do you really mean to execute pvmd directly and without arguments? Andrew -- Andrew Shewmaker, Associate Engineer Phone: 1-208-526-1276 Idaho National Eng. and Environmental Lab. P.0. Box 1625, M.S. 3605 Idaho Falls, Idaho 83415-3605 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From tod at gust.sr.unh.edu Tue Jul 15 11:48:05 2003 From: tod at gust.sr.unh.edu (Tod Hagan) Date: 15 Jul 2003 11:48:05 -0400 Subject: When are diskless compute nodes inappropriate? Message-ID: <1058284085.17543.12.camel@haze.sr.unh.edu> Okay, I'm convinced by the arguments in favor of diskless compute nodes, including cost savings applicable elsewhere, reduced power consumption, and increased reliability through the elimination of moving parts. 
With all the arguments against disks, what are the arguments in favor of diskful compute nodes? In particular, what are the situations or types of jobs for which a cluster with a high percentage of diskless nodes is contraindicated? I look forward to learning from the list's collective wisdom. Thanks. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From henken at seas.upenn.edu Tue Jul 15 12:27:18 2003 From: henken at seas.upenn.edu (Nicholas Henke) Date: 15 Jul 2003 12:27:18 -0400 Subject: When are diskless compute nodes inappropriate? In-Reply-To: <1058284085.17543.12.camel@haze.sr.unh.edu> References: <1058284085.17543.12.camel@haze.sr.unh.edu> Message-ID: <1058286438.16784.20.camel@roughneck.liniac.upenn.edu> On Tue, 2003-07-15 at 11:48, Tod Hagan wrote: > > With all the arguments against disks, what are the arguments in favor > of diskful compute nodes? In particular, what are the situations or > types of jobs for which a cluster with a high percentage of diskless > nodes is contraindicated? Anytime that accessing the data locally is faster than via NFS/OtherFS. The other case is when you are routinely using swap for memory. The one 'practical' situation we see here is on our Genomics cluster, where they are running BLAST on very large data sets. It makes an extremely large difference to copy the data to a local drive and use that than to access the data via NFS. HTH, Nic -- Nicholas Henke Penguin Herder & Linux Cluster System Programmer Liniac Project - Univ. of Pennsylvania _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From tod at gust.sr.unh.edu Tue Jul 15 12:28:25 2003 From: tod at gust.sr.unh.edu (Tod Hagan) Date: 15 Jul 2003 12:28:25 -0400 Subject: Default user installed by Packages In-Reply-To: <29B376A04977B944A3D87D22C495FB23D52A@vertrieb.emplics.com> References: <29B376A04977B944A3D87D22C495FB23D52A@vertrieb.emplics.com> Message-ID: <1058286507.17543.19.camel@haze.sr.unh.edu> On Tue, 2003-07-15 at 03:11, Rene Storm wrote: > Some Suse and Debian lists would be nice. >From my Debian stable (woody) system: backup bin daemon games gdm gnats identd irc list lp mail man news nobody operator postgres proxy root sshd sync sys uucp www-data _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From landman at scalableinformatics.com Tue Jul 15 12:53:45 2003 From: landman at scalableinformatics.com (Joseph Landman) Date: 15 Jul 2003 12:53:45 -0400 Subject: When are diskless compute nodes inappropriate? In-Reply-To: <1058284085.17543.12.camel@haze.sr.unh.edu> References: <1058284085.17543.12.camel@haze.sr.unh.edu> Message-ID: <1058288025.3280.102.camel@protein.scalableinformatics.com> When you do lots of disk IO to large blocks, sequential reads/writes. 
Remote disk will bottleneck you either at the network port of the compute node (~10 MB/s for 100 Base T, or ~80 MB/s for gigabit), or at the network port(s) of the file server (even if you multihome it, N clients distributed over M ports all heavily utilizing the file system will slow down the whole system if the requested bandwidth exceeds what the server is able to provide out its port(s)). Or even at the disk of the server. Local IO to a single spindle IDE disk can get you 30(50) MB/s write(read) performance. RaidO (using Linux MD device) can get you 60(80) MB/s write(read) performance. Sure, this is less than a 200 MB/s fibre channel, but it is also not shared like the 200 MB/s fibre channel (which becomes effectively (200/M) MB/s fibre channel for M requestors using lots of bandwidth). The aggregate IO when you get many writers/readers utilizing lots of bandwidth is a win for local disk over shared disk. From a cost perspective this is far better bang per US$ than shared disk for the heavy IO applications. At about $60 for a 40 GB IDE (ATA 100, 7200 RPM), the price isn't significant compared to the cost of an individual compute node. That is, unless you go SCSI for compute nodes. If you go diskless on the OS, just have a local scratch disk space for your heavy IO jobs. On Tue, 2003-07-15 at 11:48, Tod Hagan wrote: > Okay, I'm convinced by the arguments in favor of diskless compute > nodes, including cost savings applicable elsewhere, reduced power > consumption, and increased reliability through the elimination of > moving parts. > > With all the arguments against disks, what are the arguments in favor > of diskful compute nodes? In particular, what are the situations or > types of jobs for which a cluster with a high percentage of diskless > nodes is contraindicated? > > I look forward to learning from the list's collective wisdom. > > Thanks. -- Joseph Landman, Ph.D Scalable Informatics LLC email: landman at scalableinformatics.com web: http://scalableinformatics.com phone: +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From landman at scalableinformatics.com Tue Jul 15 13:11:25 2003 From: landman at scalableinformatics.com (Joseph Landman) Date: 15 Jul 2003 13:11:25 -0400 Subject: When are diskless compute nodes inappropriate? In-Reply-To: <1058286438.16784.20.camel@roughneck.liniac.upenn.edu> References: <1058284085.17543.12.camel@haze.sr.unh.edu> <1058286438.16784.20.camel@roughneck.liniac.upenn.edu> Message-ID: <1058289085.3280.120.camel@protein.scalableinformatics.com> On Tue, 2003-07-15 at 12:27, Nicholas Henke wrote: [...] > The one 'practical' situation we see here is on our Genomics cluster, > where they are running BLAST on very large data sets. It makes an > extremely large difference to copy the data to a local drive and use > that than to access the data via NFS. One thing that you can do is to segment the databases (use the -v switch on formatdb) or if you don't care about the absolute E-values being correct relative to your real database size, you could pre-segment the database using a tool such as our segment.pl at http://scalableinformatics.com/downloads/segment.pl . The large cost of disk access for the large BLAST jobs comes from the way it mmaps the indices, in case they overflow available memory. 
If they do overflow memory, then you spend your time in disk IO bringing the indices into memory as you walk through them. This lowers your overall absolute performance. Regardless of the segmentation, it is rarely a good idea (except in the case of very small databases) to keep them on NFS for the computation. Even if they are small, you are going to suffer network congestion very quickly for a reasonable number of compute nodes. Of course this gets into the problem of moving the databases out to the compute nodes. We are working on a neat solution to the data motion problem (specifically the database transport problem to the compute nodes). To avoid annoying everyone, please go offlist if you want to speak to us about it. Email/phone in .sig. -- Joseph Landman, Ph.D Scalable Informatics LLC email: landman at scalableinformatics.com web: http://scalableinformatics.com phone: +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From edwardsa at plk.af.mil Tue Jul 15 17:16:43 2003 From: edwardsa at plk.af.mil (Arthur H. Edwards) Date: Tue, 15 Jul 2003 15:16:43 -0600 Subject: When are diskless compute nodes inappropriate? In-Reply-To: <1058284085.17543.12.camel@haze.sr.unh.edu> References: <1058284085.17543.12.camel@haze.sr.unh.edu> Message-ID: <20030715211643.GA23118@plk.af.mil> If you are running large numbers of jobs that read and write to disk, local disk can be much more stable. We have been running an essentially serial application on many nodes and in both cases where we were writing to a parallel file system, the app would consistently crash. Art Edwards On Tue, Jul 15, 2003 at 11:48:05AM -0400, Tod Hagan wrote: > Okay, I'm convinced by the arguments in favor of diskless compute > nodes, including cost savings applicable elsewhere, reduced power > consumption, and increased reliability through the elimination of > moving parts. > > With all the arguments against disks, what are the arguments in favor > of diskful compute nodes? In particular, what are the situations or > types of jobs for which a cluster with a high percentage of diskless > nodes is contraindicated? > > I look forward to learning from the list's collective wisdom. > > Thanks. > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Art Edwards Senior Research Physicist Air Force Research Laboratory Electronics Foundations Branch KAFB, New Mexico (505) 853-6042 (v) (505) 846-2290 (f) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From markgw at sgi.com Wed Jul 16 02:31:23 2003 From: markgw at sgi.com (Mark Goodwin) Date: Wed, 16 Jul 2003 16:31:23 +1000 (EST) Subject: [ANNOUNCE] SGI Performance Co-Pilot 2.3.1 now available Message-ID: SGI is pleased to announce the new version of Performance Co-Pilot (PCP) open source (version 2.3.1-4) is now available for download from ftp://oss.sgi.com/projects/pcp/download This release contains mostly bug fixes following several months of testing the "dev" releases (most recent was version 2.3.0-17). 
A list of changes since the last major open source release (which was version 2.3.0-14) is in /usr/doc/pcp-2.3.1/CHANGELOG after installation, or at http://oss.sgi.com/projects/pcp/latest.html There are re-built RPMs for i386 and ia64 platforms in the above ftp directory. Other platforms will need to build RPMs from either the SRPM or from the tarball, e.g. : # tar xvzf pcp-2.3.1-4.src.tar.gz # cd pcp-2.3.1 # ./Makepkgs PCP is an extensible system monitoring package with a client/server architecture. It provides a distributed unifying abstraction for all interesting performance statistics in /proc and assorted applications (e.g. Apache). The PCP library APIs are robust and well documented, supporting rapid deployment of new and diverse sources of performance data and the development of sophisticated performance monitoring tools. The PCP homepage is at http://oss.sgi.com/projects/pcp and you can join the PCP mailing list via http://oss.sgi.com/projects/pcp/mail.html SGI would like to thank those who contributed to this and earlier releases. Thanks -- Mark Goodwin SGI Engineering _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lange at informatik.Uni-Koeln.DE Wed Jul 16 05:34:03 2003 From: lange at informatik.Uni-Koeln.DE (Thomas Lange) Date: Wed, 16 Jul 2003 11:34:03 +0200 Subject: Default user installed by Packages In-Reply-To: <1058286507.17543.19.camel@haze.sr.unh.edu> References: <29B376A04977B944A3D87D22C495FB23D52A@vertrieb.emplics.com> <1058286507.17543.19.camel@haze.sr.unh.edu> Message-ID: <16149.7179.554250.882661@informatik.uni-koeln.de> >>>>> On 15 Jul 2003 12:28:25 -0400, Tod Hagan said: > On Tue, 2003-07-15 at 03:11, Rene Storm wrote: >> Some Suse and Debian lists would be nice. These are the packages that are defined in the class Beowulf used in FAI (fully automatic installation for Debian) for a Beowulf computing node. # packages for Beowulf clients PACKAGES install fping jmon rsh-client rsh-server rstat-client rstatd rusers rusersd autofs dsh update-cluster-hosts update-cluster etherwake PACKAGES taskinst c-dev PACKAGES install lam-runtime lam3 lam3-dev libpvm3 pvm-dev mpich scalapack-mpich-dev -- regards Thomas _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From franz.marini at mi.infn.it Wed Jul 16 07:04:57 2003 From: franz.marini at mi.infn.it (Franz Marini) Date: Wed, 16 Jul 2003 13:04:57 +0200 (CEST) Subject: Global Shared Memory and SCI/Dolphin Message-ID: Hello, being in the process of deciding which net infrastructure to use for our next cluster (Myrinet, SCI/Dolphin or Quadrics), I was looking at the specs for the different types of hw. Provided that SCI/Dolphin implements RDMA, I was wondering why so little effort seems to be put into implementing a GSM solution for x86 clusters. The only (maybe big, maybe not) problem I see in the Dolphin hw is the lack of support for cache coherency. I think that having GSM support in (almost) commodity clusters would be a really-nice-thing(tm). I know that the Altix family implements GSM, but the price point of even a really small system (4 x Itanium2 procs, 4 Gb ram, 36 Gb HD) is really high, compared to an (performance wise) equivalent commodity cluster. 
And I can really see that SGI had a nice ccNUMA hw already developed, and so the software effort to implement GSM has (probabily) been less massive than the effort a Dolphin GSM solution would need. Nonetheless, I still can't quite understand why so little effort is being put in developing a GSM solution for commodity cluster (even with Myrinet or Quadrics, I'm thinking about SCI/Dolphin only because of the hw support for RDMA operations). Any idea, comment or whatever ? Have a nice day everyone, Franz --------------------------------------------------------- Franz Marini Sys Admin and Software Analyst, Dept. of Physics, University of Milan, Italy. email : franz.marini at mi.infn.it --------------------------------------------------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From joachim at ccrl-nece.de Wed Jul 16 09:16:09 2003 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Wed, 16 Jul 2003 15:16:09 +0200 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: References: Message-ID: <200307161516.09818.joachim@ccrl-nece.de> Franz Marini: > being in the process of deciding which net infrastructure to use for our > next cluster (Myrinet, SCI/Dolphin or Quadrics), I was looking at the > specs for the different types of hw. > > Provided that SCI/Dolphin implements RDMA, I was wondering why so little > effort seems to be put into implementing a GSM solution for x86 clusters. Because MPI is what most people want to achieve code- and peformance-portability. > The only (maybe big, maybe not) problem I see in the Dolphin hw is the > lack of support for cache coherency. > > I think that having GSM support in (almost) commodity clusters would be > a really-nice-thing(tm). Martin Schulz (formerly TU M?nchen, now Cornell Theory Center) has developed exactly the thing you are looking for. See http://wwwbode.cs.tum.edu/Par/arch/smile/software/shmem/ . You will also find his PhD thesis there which describes the complete software. I do not know about the exact status of the SW right now (his approach required some patches to the SCI driver, and it will probably be necessary to apply them to the current drivers). Very interesting approach, though. Other, non SCI approaches like MOSIX and the various DSM/SVM libraries also offer you some sort of global shared memory - but most do only use TCP/IP for communication. Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From fmahr at gmx.de Wed Jul 16 10:13:44 2003 From: fmahr at gmx.de (Ferdinand Mahr) Date: Wed, 16 Jul 2003 16:13:44 +0200 Subject: Global Shared Memory and SCI/Dolphin References: <200307161516.09818.joachim@ccrl-nece.de> Message-ID: <3F155D98.7CB8BE90@gmx.de> Joachim Worringen wrote: > Other, non SCI approaches like MOSIX and the various DSM/SVM libraries also > offer you some sort of global shared memory - but most do only use TCP/IP for > communication. Unfortunately, MOSIX (so far) does not offer global shared memory. The node with the largest installed RAM is the restriction, since MOSIX cannot use the memory of more than one node for one process. 
The MOSIX team seems to work on DSM, but there are no official results so far. Regards, Ferdinand _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jcownie at etnus.com Wed Jul 16 11:36:23 2003 From: jcownie at etnus.com (James Cownie) Date: Wed, 16 Jul 2003 16:36:23 +0100 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: Your message of "Wed, 16 Jul 2003 18:28:33 +0400." <200307161428.SAA28224@nocserv.free.net> Message-ID: <19coKN-5n4-00@etnus.com> > > Because MPI is what most people want to achieve code- and > > peformance-portability. > Partially I may agree, partially - not: MPI is not the best in the > sense of portability (for example, optimiziation requires knowledge > of interconnect topology, which may vary from cluster to cluster, > and of course from MPP to MPP computer). MPI has specific support for this in Rolf Hempel's topology code, which is intended to allow you to have the system help you to choose a good mapping of your processes onto the processors in the system. This seems to me to be _more_ than you have in a portable way on the ccNUMA machines, where you have to worry about 1) where every page of data lives, not just how close each process is to another one (and you have more pages than processes/threads to worry about !) 2) the scheduler choosing to move your processes/threads around the machine. > I think that if there is relative cheap and effective way to build > ccNUMA system from cluster - it may have success. Which is, of course, what SCI was _intended_ to be, and we saw how well that succeeded :-( -- Jim James Cownie Etnus, LLC. +44 117 9071438 http://www.etnus.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From c00jsh00 at nchc.gov.tw Wed Jul 16 05:12:42 2003 From: c00jsh00 at nchc.gov.tw (Jyh-Shyong Ho) Date: Wed, 16 Jul 2003 17:12:42 +0800 Subject: NFS problem Message-ID: <3F15170A.22D968E4@nchc.gov.tw> Hi, I set up a small cluster of 4+1 nodes, directories /home, /usr/local, /opt and /workraid of the master node are exported to slave nodes. With /etc/fstab defined as nfs file system on slave nodes and file /etc/exports defined in the master node, the NFS should work. However, not all of these directories are mounted when these slave nodes are rebooted, I always get the message when the system tries to mount the NFS directories: RPC portmapper failure: unable to receive When the system is up, I can mount these directories manually. The booting message does include the line: Starting RPC portmap daemon.....done Could anyone point out what might be wrong or where to check? Jyh-Shyong Ho, PhD. 
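A few quick checks that might narrow this down (only a sketch; <server> stands for the master node's hostname):

rpcinfo -p localhost                    # is the local portmapper really answering by the time the mounts run?
rpcinfo -p <server>                     # is the server's portmapper reachable from the client at boot?
chkconfig --list | egrep 'portmap|network|nfs'   # do portmap and the network come up before the nfs mounts?

If the mounts are simply attempted before the network or portmap is fully up, deferring them to a late boot script (or retrying them there) usually clears this message.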
Research Scientist National Center for High-Performance Computing Hsinchu, Taiwan, ROC _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bill at math.ucdavis.edu Thu Jul 17 02:45:58 2003 From: bill at math.ucdavis.edu (Bill Broadley) Date: Wed, 16 Jul 2003 23:45:58 -0700 Subject: P4 dual vs P4C vs Opteron Message-ID: <20030717064558.GA10800@sphere.math.ucdavis.edu> I have been evaluating price/performance with a locally written earthquake simulation code written in C, mostly floating point, and not very cache friendly. I thought people might be interested in the performance numbers I collected. Gcc-3.2.2 was used in all cases with the -O3 flag (compiled on the machine it ran). Dual p4-3.0/533 Mhz, no HT mahcine 1 process took 86.43 seconds. 2 proccesses in parallel took 156.9 seconds Scaling efficiency =~ 10% (2 processes run at the same time have 10% greather throughput then a single process on a single cpu) Dual Opteron 240-1.4 Ghz/333 MHz 1 process took 97.87 seconds. 2 proccesses in parallel took 99.79 seconds Scaling efficiency =~ 96% (2 processes run at the same time have 97% greather throughput then a single process on a single cpu) Single P4C-2.6 Ghz/800 Mhz FSB with HT enabled. 1 process took 81.22 seconds. 2 proccesses in parallel took 137.59 seconds Scaling efficiency =~ 18% (2 processes run at the same time have 18% greather throughput then a single process on a single cpu) I'd also like to do a performance per watt. Anyone have a >= 2.6 Ghz dual P4, 533 Mhz FSB, a rackmount motherboard, and a kill-a-watt? Unfortunately my dual p4 has a fast 3d card which would throw my performance per watt calculations. I found it amusing that Hyperthreading scaled somewhat poorly, but still managed to outscale and outperform the dual p4, despite a significantly slower clock. So the P4C-2.6 is the fastest for a single job and the opteron (the slowest model sold) is the fastest for 2 jobs. For the curious I'm seeing around 1.8 amps @ 110V running the dual opteron with 2 busy CPUs. -- Bill Broadley Mathematics UC Davis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jcownie at etnus.com Thu Jul 17 05:01:37 2003 From: jcownie at etnus.com (James Cownie) Date: Thu, 17 Jul 2003 10:01:37 +0100 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: Message from Mikhail Kuzminsky of "Wed, 16 Jul 2003 22:31:15 +0400." <200307161831.WAA02082@nocserv.free.net> Message-ID: <19d4dt-1F6-00@etnus.com> > > > Partially I may agree, partially - not: MPI is not the best in the > > > sense of portability (for example, optimiziation requires knowledge > > > of interconnect topology, which may vary from cluster to cluster, > > > and of course from MPP to MPP computer). > > > MPI has specific support for this in Rolf Hempel's topology code, > > which is intended to allow you to have the system help you to choose a > > good mapping of your processes onto the processors in the system. > > Unfortunately I do not know about that codes :-( but for the best > optimization I'll re-build the algorithm itself to "fit" for target > topology. 
Since it's a standard part of MPI it seems a bit unfair of you to be saying that MPI doesn't support optimisation based on topology, when all you mean is "I didn't RTFM so I don't know about that part of the MPI standard". See (for instance) chapter 6 in "MPI The Complete Reference" which discusses the MPI topology routines at some length. This is all MPI-1 stuff too, so it's not as if it's new ;-) Of course it may well be that none of the vendors has bothered actually to implement the topology routines in any way which gives you a benefit. However it still seems unfair to blame the MPI _standard_ for failings in MPI _implementations_. After all the MPI forum spent time arguing about this, so we were aware of the issue, and trying to give you a solution to the problem. > > This seems to me to be _more_ than you have in a portable way on the > > ccNUMA machines, where you have to worry about > > > > 1) where every page of data lives, not just how close each process is > > to another one (and you have more pages than processes/threads to > > worry about !) > > > > 2) the scheduler choosing to move your processes/threads around the > > machine. > > Yes, but "by default" I believe that they are the tasks of > operating system, or, as maximum, the information I'm supplying to > OS, *after* translation and linking of the program. Having seen the effect which layout has, and the contortions people go to to try to get their SMP codes to work efficiently in non-portable ways (re-coding to make "first touch" happen on the "right" processor, use of machine specific system calls for page affinity control and so on), I remain unconvinced. -- Jim James Cownie Etnus, LLC. +44 117 9071438 http://www.etnus.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From janfrode at parallab.no Thu Jul 17 05:04:54 2003 From: janfrode at parallab.no (Jan-Frode Myklebust) Date: Thu, 17 Jul 2003 11:04:54 +0200 Subject: bad job distribution with MPICH Message-ID: <20030717090453.GB23226@ii.uib.no> Hi, we're running MPICH 1.2.4 on a 32 node dual cpu linux cluster (fast ethernet), and are having some problems with the mpich job distribution. An example from today: The PBS job: ---------------------------------------- #PBS -l nodes=4:ppn=2,walltime=100:00:00 # mpirun -np `wc -l < $PBS_NODEFILE` -machinefile $PBS_NODEFILE mfix.exe ---------------------------------------- is assigned to nodes: node17/0+node15/0+node14/0+node11/0+node17/1+node15/1+node14/1+node11/1 PBS generates a PBS_NODEFILE containing: ----------------------------- node17 node15 node14 node11 node17 node15 node14 node11 ----------------------------- And this command is started in node 17: mpirun -np 8 -machinefile /var/spool/PBS/aux/20996.fire executable And then when I look over the nodes, there's 1 executable running on node17, 3 on node15, 2 on node14 and 2 on node11. Anybody seen something like this, and maybe have an idea of what might be causing it? -jf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hahn at physics.mcmaster.ca Thu Jul 17 13:39:04 2003 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Thu, 17 Jul 2003 13:39:04 -0400 (EDT) Subject: When are diskless compute nodes inappropriate? 
In-Reply-To: <1058288025.3280.102.camel@protein.scalableinformatics.com> Message-ID: as everyone said: local disks suck for reliability, but are simply necessary if you're doing any kind of sigificant file IO, especially checkpoints. IMO, that means diskless net-booting with local swap/scratch. > write(read) performance. RaidO (using Linux MD device) can get you > 60(80) MB/s write(read) performance. Sure, this is less than a 200 MB/s of course, MD can give you much higher raid0 if you use more than two disks; it's not hard to hit 200 MB/s. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Daniel.Kidger at quadrics.com Thu Jul 17 07:15:56 2003 From: Daniel.Kidger at quadrics.com (Daniel Kidger) Date: Thu, 17 Jul 2003 12:15:56 +0100 Subject: Global Shared Memory and SCI/Dolphin Message-ID: <010C86D15E4D1247B9A5DD312B7F5AA78DE01F@stegosaurus.bristol.quadrics.com> Franz Marini wrote: > Nonetheless, I still can't quite understand why so little effort is >being put in developing a GSM solution for commodity cluster (even with >Myrinet or Quadrics, I'm thinking about SCI/Dolphin only because of the hw >support for RDMA operations). The Quadrics Interconnect also does hardware RDMA, and yes a significant percentage of people do use Global Shared Memory programming models rather than message passing. In fact I thought all four of SCALI/Quadrics/Myrinet/Infiniband could do RDMA ?? Yours, Daniel. -------------------------------------------------------------- Dr. Dan Kidger, Quadrics Ltd. daniel.kidger at quadrics.com One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 ----------------------- www.quadrics.com -------------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From koz at urbi.com.br Thu Jul 17 01:09:12 2003 From: koz at urbi.com.br (Alexandre M.) Date: Thu, 17 Jul 2003 02:09:12 -0300 Subject: NFS problem References: <3F15170A.22D968E4@nchc.gov.tw> Message-ID: <000801c34c21$903eeaa0$5901020a@nhg4bx71qabh4t> Hi, One problem that's common is trying to mount the NFS dir while the network is not ready yet during boot. You could see if this is the case by placing a "sleep 5" in the NFS service bootup script just before the mount command. ----- Original Message ----- From: "Jyh-Shyong Ho" To: Sent: Wednesday, July 16, 2003 6:12 AM Subject: NFS problem > Hi, > > I set up a small cluster of 4+1 nodes, directories /home, /usr/local, > /opt and /workraid > of the master node are exported to slave nodes. With /etc/fstab defined > as nfs file system > on slave nodes and file /etc/exports defined in the master node, the NFS > should work. > However, not all of these directories are mounted when these slave nodes > are rebooted, > I always get the message when the system tries to mount the NFS > directories: > > RPC portmapper failure: unable to receive > > When the system is up, I can mount these directories manually. The > booting message does > include the line: > > Starting RPC portmap daemon.....done > > Could anyone point out what might be wrong or where to check? > > Jyh-Shyong Ho, PhD. 
> Research Scientist > National Center for High-Performance Computing > Hsinchu, Taiwan, ROC > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bill at math.ucdavis.edu Thu Jul 17 16:42:55 2003 From: bill at math.ucdavis.edu (Bill Broadley) Date: Thu, 17 Jul 2003 13:42:55 -0700 Subject: Dual Opteron-1.4 power usage Message-ID: <20030717204255.GA15891@sphere.math.ucdavis.edu> I figured this might be handy for those planning Power, UPS, or airconditioning budgets. Tyan dual opteron motherboard 4 1GB dimms (ECC registered) enlight 8950 case Sparkle 550 watt power supply. No PCI cards. Measured with a kill-a-watt. 163 watts idle 192 watts with 2 distributed.net OGR crunchers running. 194 watts with 2 earthquake sims 196 watts Bonnie++ and 2*OGR 198 watts Bonnie++ and 2 earthquake sims 208 watts bonnie++ and pstream (2 threads banging main memory sequentially) 212 watts pstream (2 threads banging main memory sequentially) -- Bill Broadley Mathematics UC Davis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rbw at ahpcrc.org Thu Jul 17 16:40:10 2003 From: rbw at ahpcrc.org (Richard Walsh) Date: Thu, 17 Jul 2003 15:40:10 -0500 Subject: Global Shared Memory and SCI/Dolphin Message-ID: <200307172040.h6HKeAm29015@mycroft.ahpcrc.org> Dan Kidger wrote: >The Quadrics Interconnect also does hardware RDMA, and yes a significant >percentage of people do use Global Shared Memory programming models rather >than message passing. > >In fact I thought all four of SCALI/Quadrics/Myrinet/Infiniband could do >RDMA ?? Does this support run all the way up the stack to the MPI-2 "one-sided" communications stuff? Anyone working on supporting the implicit DSM language constructs of CAF and/or UPC with their RDMA capability? Comments on any/all interconnects mentioned are welcome. Thanks, rbw _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bill at math.ucdavis.edu Thu Jul 17 16:48:38 2003 From: bill at math.ucdavis.edu (Bill Broadley) Date: Thu, 17 Jul 2003 13:48:38 -0700 Subject: When are diskless compute nodes inappropriate? In-Reply-To: <1058284085.17543.12.camel@haze.sr.unh.edu> References: <1058284085.17543.12.camel@haze.sr.unh.edu> Message-ID: <20030717204838.GB15891@sphere.math.ucdavis.edu> On Tue, Jul 15, 2003 at 11:48:05AM -0400, Tod Hagan wrote: > Okay, I'm convinced by the arguments in favor of diskless compute > nodes, including cost savings applicable elsewhere, reduced power > consumption 5-10 watts. >, and increased reliability through the elimination of > moving parts. Indeed. Although similar reliability can be had if you can survive a disk failure. > With all the arguments against disks, what are the arguments in favor > of diskful compute nodes? In particular, what are the situations or Swap, and high speed disk I/O. 35 MB/sec of sequential I/O to a local disk is very hard to centralize. 
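For anyone wanting to sanity-check the local-disk numbers quoted in this thread, a rough timing loop like the sketch below is enough to estimate sequential write bandwidth on a node. It is only an illustration: the /scratch path, the 256 MB file size and the 1 MB block size are arbitrary assumptions, and for honest numbers the file should be much larger than RAM (or just use a real tool such as bonnie++).

/* streamw.c - rough sequential-write test for a local scratch disk.
 * Paths and sizes are placeholders; adjust to your node.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

int main(void)
{
    const size_t blk  = 1 << 20;      /* 1 MB per write()                 */
    const int    nblk = 256;          /* 256 MB total -- adjust as needed */
    char *buf = malloc(blk);
    struct timeval t0, t1;
    double sec;
    int fd, i;

    if (buf == NULL) { perror("malloc"); return 1; }
    memset(buf, 0xAA, blk);

    fd = open("/scratch/streamw.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    gettimeofday(&t0, NULL);
    for (i = 0; i < nblk; i++)
        if (write(fd, buf, blk) != (ssize_t) blk) { perror("write"); return 1; }
    fsync(fd);                        /* count the time to reach the disk */
    gettimeofday(&t1, NULL);

    close(fd);
    unlink("/scratch/streamw.tmp");

    sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d MB in %.2f s  =  %.1f MB/s\n", nblk, sec, nblk / sec);
    free(buf);
    return 0;
}

Compile with something like "gcc -O2 streamw.c -o streamw" and run it on an otherwise idle node.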
If you can make do with much less then it's not to much of a big deal. For our 32 node cluster on boot we: netboot a kernel kernel loads a ramdisk disk is partitioned disk is mkswaped /scratch and /swap are mounted. So this leave ZERO state on the hard disk, so if a disk dies just reboot and the node works (but doesn't have /swap and /scratch), if you pull a disk off a shelf and stick it in a node you just reboot. Very nice to minimize the administrative costs of managing, patching, backing up, troubleshooting etc of N nodes, with possibly different images, and of course any state. My central fileserver is a dual-p4, dual PC1600 memory bus, 133 Mhz/64 bit PCI, and several U160 channels full of 5 disks each. I see 200-300 MB/sec sustained for large sequential file reads/writes. Granted the central fileserver can not keep up with 32 nodes wanting to read/write at 35 MB/sec, but it's enough to usually not be a bottlneck. -- Bill Broadley Mathematics UC Davis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at keyresearch.com Thu Jul 17 17:13:01 2003 From: lindahl at keyresearch.com (Greg Lindahl) Date: Thu, 17 Jul 2003 14:13:01 -0700 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: <010C86D15E4D1247B9A5DD312B7F5AA78DE01F@stegosaurus.bristol.quadrics.com> References: <010C86D15E4D1247B9A5DD312B7F5AA78DE01F@stegosaurus.bristol.quadrics.com> Message-ID: <20030717211301.GA4929@greglaptop.internal.keyresearch.com> On Thu, Jul 17, 2003 at 12:15:56PM +0100, Daniel Kidger wrote: > The Quadrics Interconnect also does hardware RDMA, and yes a significant > percentage of people do use Global Shared Memory programming models rather > than message passing. > > In fact I thought all four of SCALI/Quadrics/Myrinet/Infiniband could do > RDMA ?? There's a terminology problem here: Some people mean cache-coherent shared memory, like that on an SGI Origin. Another term for non-cache-coherent but globally addressable and accessible memory is SALC: Shared address, local consistency. And yes, all 4 of Scali/Quadrics/Myrinet/Infiniband support the non-cache-coherent kind of shared memory. Programming models in this area are: * UPC: Unified Parallel C * CoArray Fortran * MPI-2 one-sided operations * Global Arrays from PNL * The Cray SHMEM library -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From eccf at super.unam.mx Thu Jul 17 17:11:55 2003 From: eccf at super.unam.mx (Eduardo Cesar Cabrera Flores) Date: Thu, 17 Jul 2003 16:11:55 -0500 (CDT) Subject: bad job distribution with MPICH In-Reply-To: <200307171904.h6HJ4Lw25122@NewBlue.Scyld.com> Message-ID: You should try mpiexec cafe Hi, we're running MPICH 1.2.4 on a 32 node dual cpu linux cluster (fast ethernet), and are having some problems with the mpich job distribution. 
An example from today:

The PBS job:

----------------------------------------
#PBS -l nodes=4:ppn=2,walltime=100:00:00
#
mpirun -np `wc -l < $PBS_NODEFILE` -machinefile $PBS_NODEFILE mfix.exe
----------------------------------------

is assigned to nodes:

 node17/0+node15/0+node14/0+node11/0+node17/1+node15/1+node14/1+node11/1

PBS generates a PBS_NODEFILE containing:

-----------------------------
node17
node15
node14
node11
node17
node15
node14
node11
-----------------------------

And this command is started in node 17:

 mpirun -np 8 -machinefile /var/spool/PBS/aux/20996.fire executable

And then when I look over the nodes, there's 1 executable running on node17, 3 on node15, 2 on node14 and 2 on node11.

Anybody seen something like this, and maybe have an idea of what might
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From andrewxwang at yahoo.com.tw Thu Jul 17 23:20:25 2003
From: andrewxwang at yahoo.com.tw (=?big5?q?Andrew=20Wang?=)
Date: Fri, 18 Jul 2003 11:20:25 +0800 (CST)
Subject: SGE 5.3p4 released (was: queueing system for x86-64)
In-Reply-To: <20030714143541.B10106@lnxi.com>
Message-ID: <20030718032025.1909.qmail@web16813.mail.tpe.yahoo.com>

I was trying to install SGE on a x86-64 cluster, and found that I need SGE 5.3p4 to get the resource limit set correctly.

http://gridengine.sunsource.net/project/gridengine/news/SGE53p4-announce.html

I will try to install SGE on x86-64 next week, and I will tell everyone on this list my experience.

Andrew.

-----------------------------------------------------------------
http://fate.yahoo.com.tw/
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From jcownie at etnus.com Fri Jul 18 04:31:45 2003
From: jcownie at etnus.com (James Cownie)
Date: Fri, 18 Jul 2003 09:31:45 +0100
Subject: Global Shared Memory and SCI/Dolphin
In-Reply-To: Message from Richard Walsh of "Thu, 17 Jul 2003 15:40:10 CDT." <200307172040.h6HKeAm29015@mycroft.ahpcrc.org>
Message-ID: <19dQeX-1LH-00@etnus.com>

> Does this support run all the way up the stack to the MPI-2
> "one-sided" communications stuff? Anyone working on supporting the
> implicit DSM language constructs of CAF and/or UPC with their RDMA
> capability? Comments on any/all interconnects mentioned are
> welcome.

Compaq UPC (from HP) on their SC machines directly targets the Quadrics' Elan processors.

See http://h30097.www3.hp.com/upc/ for details of the Compaq UPC product.

-- Jim

James Cownie
Etnus, LLC. +44 117 9071438
http://www.etnus.com
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From jcownie at etnus.com Fri Jul 18 04:41:43 2003
From: jcownie at etnus.com (James Cownie)
Date: Fri, 18 Jul 2003 09:41:43 +0100
Subject: Global Shared Memory and SCI/Dolphin
In-Reply-To: Message from Greg Lindahl of "Thu, 17 Jul 2003 14:13:01 PDT."
<20030717211301.GA4929@greglaptop.internal.keyresearch.com> Message-ID: <19dQoB-1LO-00@etnus.com> > > In fact I thought all four of SCALI/Quadrics/Myrinet/Infiniband could do > > RDMA ?? > > There's a terminology problem here: Some people mean cache-coherent > shared memory, like that on an SGI Origin. > > Another term for non-cache-coherent but globally addressable and > accessible memory is SALC: Shared address, local consistency. > > And yes, all 4 of Scali/Quadrics/Myrinet/Infiniband support the > non-cache-coherent kind of shared memory. Programming models in this > area are: > > * UPC: Unified Parallel C > * CoArray Fortran > * MPI-2 one-sided operations > * Global Arrays from PNL > * The Cray SHMEM library However there's another axis to the classification which you haven't mentioned, and which is also extremeley important, which is whether the remote access is "punned" onto a normal load/store instruction, or requires a different explicit operation. I like to refer to the Quadrics' model as "explicit remote store access", since it requires special accesses to (process mapped) device registers to cause remote operations to happen; therefore the process making a remote access has to know that that's what it wants to do. It can't just follow a chain of pointers and end up doing remote accesses transparently. Note, also, that AFAIK the explicit remote store accesses in the Quadrics' implementation are cache coherent at both ends, so they are not SALC. (Both because there isn't a shared address space, and because they are consistent at both ends !). As I understand it the Quadrics' model is that there are multiple processes each with their own address space, but that by explicit operations a process can read or write data in a cache coherent fashion and without co-operation from its owner in any of the address spaces. (At least that's how it worked back at Meiko ;-) I suppose you could view the {process-id, address} tuple as a shared address space, but it seems a bit of a stretch to me. -- Jim James Cownie Etnus, LLC. +44 117 9071438 http://www.etnus.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From franz.marini at mi.infn.it Fri Jul 18 04:52:20 2003 From: franz.marini at mi.infn.it (Franz Marini) Date: Fri, 18 Jul 2003 10:52:20 +0200 (CEST) Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: <20030717211301.GA4929@greglaptop.internal.keyresearch.com> References: <010C86D15E4D1247B9A5DD312B7F5AA78DE01F@stegosaurus.bristol.quadrics.com> <20030717211301.GA4929@greglaptop.internal.keyresearch.com> Message-ID: On Thu, 17 Jul 2003, Greg Lindahl wrote: > There's a terminology problem here: Some people mean cache-coherent > shared memory, like that on an SGI Origin. I maybe wrong but I think that all the SGI machines (including the Altix) implement c-c shared mem. > Another term for non-cache-coherent but globally addressable and > accessible memory is SALC: Shared address, local consistency. > > And yes, all 4 of Scali/Quadrics/Myrinet/Infiniband support the > non-cache-coherent kind of shared memory. Programming models in this > area are: > > * UPC: Unified Parallel C > * CoArray Fortran > * MPI-2 one-sided operations > * Global Arrays from PNL > * The Cray SHMEM library And this should testify to the fact that the shmem programming paradigm is all but rarely used. 
As long as I can tell there is a *lot* of code out there that uses, e.g. the Cray SHMEM lib (btw, this is one of the things that makes the Scali/Dolphin solution interesting to us). But, still, whereas, e.g. the SHMEM lib has been implemented under Scali (and maybe under Quadrics/Myrinet/Infiniband, not sure about it), what I think it'd be interesting and usefull is the support (at the OS level) for a GSM/single system image, providing support for POSIX threads across the nodes. I may be dreaming here, I know, but still... :) Btw, on a side note, does anyone know if there is some compiler (both C and F90/HPF) out there supporting some kind of auto parallelization via, e.g. the SHMEM lib (I'm not asking for a MPI-enabled compiler, I'm not *so* crazy ;)) ? --------------------------------------------------------- Franz Marini Sys Admin and Software Analyst, Dept. of Physics, University of Milan, Italy. email : franz.marini at mi.infn.it --------------------------------------------------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From sp at scali.com Thu Jul 17 17:58:24 2003 From: sp at scali.com (Steffen Persvold) Date: Thu, 17 Jul 2003 23:58:24 +0200 (CEST) Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: <20030717211301.GA4929@greglaptop.internal.keyresearch.com> Message-ID: On Thu, 17 Jul 2003, Greg Lindahl wrote: > On Thu, Jul 17, 2003 at 12:15:56PM +0100, Daniel Kidger wrote: > > > The Quadrics Interconnect also does hardware RDMA, and yes a significant > > percentage of people do use Global Shared Memory programming models rather > > than message passing. > > > > In fact I thought all four of SCALI/Quadrics/Myrinet/Infiniband could do > > RDMA ?? > > There's a terminology problem here: Some people mean cache-coherent > shared memory, like that on an SGI Origin. > > Another term for non-cache-coherent but globally addressable and > accessible memory is SALC: Shared address, local consistency. > > And yes, all 4 of Scali/Quadrics/Myrinet/Infiniband support the > non-cache-coherent kind of shared memory. Programming models in this > area are: Just to clarify; Scali makes software, not hardware. So putting Scali in the same group as Quadrics, Myrinet and Infiniband is kinda wrong. It should have been Dolphin (as in the SCI card vendor) I guess. Our message passing software may run on all four interconnects (and ethernet). 
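For readers following the one-sided discussion in this thread, the MPI-2 calls in question look roughly like the sketch below. It is a minimal illustration, not tied to Scali, Quadrics or any other product; whether MPI_Put ends up as true RDMA or as a two-sided emulation underneath is entirely up to the MPI implementation and the interconnect.

/* Minimal MPI-2 one-sided sketch: rank 0 puts a buffer into rank 1's
 * window.  Run with at least two processes, e.g. mpirun -np 2 ./a.out
 */
#include <stdio.h>
#include <mpi.h>

#define N 1024

int main(int argc, char **argv)
{
    int rank, i;
    double local[N], target[N];
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* every rank exposes 'target' as a window others may write into */
    MPI_Win_create(target, N * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        for (i = 0; i < N; i++)
            local[i] = (double) i;
        /* one-sided write into rank 1's window; rank 1 posts no receive */
        MPI_Put(local, N, MPI_DOUBLE, 1, 0, N, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);   /* the put is complete on both sides here */

    if (rank == 1)
        printf("rank 1 sees target[N-1] = %g\n", target[N - 1]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

The point of the fence/put/fence pattern is that the target never posts a matching receive; completion is tied to the synchronization calls, which is what lets an implementation map it onto hardware RDMA if it has any.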
Regards, -- Steffen Persvold ,,, mailto: sp at scali.com Senior Software Engineer (o-o) http://www.scali.com -----------------------------oOO-(_)-OOo----------------------------- Scali AS, PObox 150, Oppsal, N-0619 Oslo, Norway, Tel: +4792484511 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From ashley at pittman.co.uk Fri Jul 18 07:45:01 2003 From: ashley at pittman.co.uk (Ashley Pittman) Date: 18 Jul 2003 12:45:01 +0100 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: <200307172040.h6HKeAm29015@mycroft.ahpcrc.org> References: <200307172040.h6HKeAm29015@mycroft.ahpcrc.org> Message-ID: <1058528701.21031.57.camel@ashley> On Thu, 2003-07-17 at 21:40, Richard Walsh wrote: > Dan Kidger wrote: > > >The Quadrics Interconnect also does hardware RDMA, and yes a significant > >percentage of people do use Global Shared Memory programming models rather > >than message passing. > > > >In fact I thought all four of SCALI/Quadrics/Myrinet/Infiniband could do > >RDMA ?? > > Does this support run all the way up the stack to the MPI-2 "one-sided" > communications stuff? Anyone working on supporting the implicit DSM > language constructs of CAF and/or UPC with their RDMA capability? Comments > on any/all interconnects mentioned are welcome. Yes it does, we support both Cray SHMEM and MPI-2 "one-sided" which are essentially simple wrappers around the DMA engine. Because it's truly one-sided it's lower latency than Send/Recv. I've included some pallas figures from one of the machines here. There are two UPC implementations which work over Quadrics hardware, one of which is open source, check out http://upc.nersc.gov/ Ashley, #--------------------------------------------------- # Benchmarking Unidir_Put # ( #processes = 2 ) #--------------------------------------------------- # # MODE: AGGREGATE # #bytes #repetitions t[usec] Mbytes/sec 0 4096 0.07 0.00 4 4096 1.67 2.28 8 4096 1.68 4.55 16 4096 1.72 8.86 32 4096 2.19 13.95 64 4096 2.55 23.89 128 4096 2.77 44.06 256 4096 3.19 76.60 512 4096 4.14 118.06 1024 4096 5.76 169.42 2048 4096 8.95 218.30 4096 4096 15.32 254.92 8192 4096 28.00 279.04 16384 2560 53.40 292.63 32768 1280 104.10 300.19 65536 640 207.56 301.12 131072 320 412.33 303.15 262144 160 821.94 304.16 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at keyresearch.com Fri Jul 18 12:14:05 2003 From: lindahl at keyresearch.com (Greg Lindahl) Date: Fri, 18 Jul 2003 09:14:05 -0700 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: <19dQoB-1LO-00@etnus.com> References: <20030717211301.GA4929@greglaptop.internal.keyresearch.com> <19dQoB-1LO-00@etnus.com> Message-ID: <20030718161405.GA13859@greglaptop.greghome.keyresearch.com> On Fri, Jul 18, 2003 at 09:41:43AM +0100, James Cownie wrote: > Note, also, that AFAIK the explicit remote store accesses in the > Quadrics' implementation are cache coherent at both ends, so they are > not SALC. (Both because there isn't a shared address space, and > because they are consistent at both ends !). In both cases you're using different terminology than the SALC folks do. 
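To make the SALC terminology a little more concrete, the Cray SHMEM style of programming looks roughly like this. Treat it as a sketch only: the header name and the initialization call differ between the Cray, SGI and Quadrics libraries, so the exact spellings below are assumptions.

/* Sketch of the Cray SHMEM / SALC style of one-sided communication. */
#include <stdio.h>
#include <mpp/shmem.h>      /* header name varies by vendor */

long target[8];             /* symmetric: same address on every PE */

int main(void)
{
    long src[8];
    int me, npes, i;

    start_pes(0);           /* some implementations use shmem_init() */
    me   = _my_pe();
    npes = _num_pes();

    for (i = 0; i < 8; i++)
        src[i] = me * 100 + i;

    /* one-sided put into the next PE's copy of 'target';
     * the remote PE takes no part in the transfer */
    shmem_long_put(target, src, 8, (me + 1) % npes);

    shmem_barrier_all();    /* data only guaranteed visible after explicit
                               synchronization: local consistency */

    printf("PE %d of %d: target[0] = %ld\n", me, npes, target[0]);
    return 0;
}

The properties being debated above are all visible here: 'target' is a shared-address (symmetric) object, the put is one-sided, and nothing is guaranteed globally visible until the explicit barrier.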
-- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From johnh at sjgeophysics.com Fri Jul 18 15:04:52 2003 From: johnh at sjgeophysics.com (John Harrop) Date: 18 Jul 2003 12:04:52 -0700 Subject: Empty passwords vs ssh-agent? Message-ID: <1058555100.10220.33.camel@orion-2> I'm currently switching our system from using r-commands to ssh. We have a fairly small system with 27 nodes. The only two options I can see with ssh are empty passwords and ssh-agent. The first looks like it isn't much better for security than r commands. (We do have ssh with passwords and known hosts on a portal machine.) Using ssh-agent on a cluster looks like a potentially big hassle. Or am I mistaken about the last impression? After all, we have nodes that are almost hitting up time of 400 days so ssh-add would only have been run once for each cluster user. What are people using as the clusters get bigger? Thanks is advance for your comments and thought! Cheers, John Harrop _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rodmur at maybe.org Fri Jul 18 16:26:50 2003 From: rodmur at maybe.org (Dale Harris) Date: Fri, 18 Jul 2003 13:26:50 -0700 Subject: Empty passwords vs ssh-agent? In-Reply-To: <1058555100.10220.33.camel@orion-2> References: <1058555100.10220.33.camel@orion-2> Message-ID: <20030718202650.GI24530@maybe.org> On Fri, Jul 18, 2003 at 12:04:52PM -0700, John Harrop elucidated: > I'm currently switching our system from using r-commands to ssh. We > have a fairly small system with 27 nodes. The only two options I can > see with ssh are empty passwords and ssh-agent. The first looks like it > isn't much better for security than r commands. (We do have ssh with > passwords and known hosts on a portal machine.) Using ssh-agent on a > cluster looks like a potentially big hassle. Or am I mistaken about the > last impression? After all, we have nodes that are almost hitting up > time of 400 days so ssh-add would only have been run once for each > cluster user. > > What are people using as the clusters get bigger? > > Thanks is advance for your comments and thought! > > Cheers, > > John Harrop > I've have the same questions, too. Is this something you're just doing for administrative purposes? Or are the users going to need use ssh to authenticate themselves as well? -- Dale Harris rodmur at maybe.org /.-) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From xyzzy at speakeasy.org Fri Jul 18 17:10:45 2003 From: xyzzy at speakeasy.org (Trent Piepho) Date: Fri, 18 Jul 2003 14:10:45 -0700 (PDT) Subject: Empty passwords vs ssh-agent? In-Reply-To: <20030718202650.GI24530@maybe.org> Message-ID: On Fri, 18 Jul 2003, Dale Harris wrote: > On Fri, Jul 18, 2003 at 12:04:52PM -0700, John Harrop elucidated: > > I'm currently switching our system from using r-commands to ssh. We > > have a fairly small system with 27 nodes. The only two options I can > > see with ssh are empty passwords and ssh-agent. The first looks like it You can use RSA host based authentication. 
This is the same style as the r commands, except instead of only using what the remote host claims as its IP address, a RSA/DSA key check is done. This way you can do non-interactive ssh just among your cluster nodes, but still have passwords for extra-cluster connections. ssh-agent also works well. Users can start the agent once and leave it running, only having to type in their password once per reboot. A nifty thing would be if login could check for ssh-agent, and if it finds one, setup the env variables (already can be done from the shell dot-files). If it doesn't find one, it starts it and runs ssh-add using the password supplied for login. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at keyresearch.com Fri Jul 18 17:06:15 2003 From: lindahl at keyresearch.com (Greg Lindahl) Date: Fri, 18 Jul 2003 14:06:15 -0700 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: References: <010C86D15E4D1247B9A5DD312B7F5AA78DE01F@stegosaurus.bristol.quadrics.com> <20030717211301.GA4929@greglaptop.internal.keyresearch.com> Message-ID: <20030718210615.GA2096@greglaptop.internal.keyresearch.com> On Fri, Jul 18, 2003 at 10:52:20AM +0200, Franz Marini wrote: > Btw, on a side note, does anyone know if there is some compiler (both C > and F90/HPF) out there supporting some kind of auto parallelization via, > e.g. the SHMEM lib (I'm not asking for a MPI-enabled compiler, I'm not > *so* crazy ;)) ? PGI's HPF compiler can compile down to fortran + MPI calls. No doubt they have other options. It's not going to get you to a very high level of parallelism, though. greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From stiehr at admiral.umsl.edu Fri Jul 18 17:18:26 2003 From: stiehr at admiral.umsl.edu (Gary Stiehr) Date: Fri, 18 Jul 2003 16:18:26 -0500 Subject: bad job distribution with MPICH In-Reply-To: <20030717090453.GB23226@ii.uib.no> References: <20030717090453.GB23226@ii.uib.no> Message-ID: <3F186422.5030309@admiral.umsl.edu> Hi, Try to use "mpirun -nolocal -np ....". I think if you don't specify the "-nolocal" option, the job will start one process on node17 and then that process will start the other 7 processes on the remaining 6 processors not in node17; thus resulting in three processes on node15. Apparently if you use -nolocal, it will use all of the processors. I'm not sure why this is, however, adding "-nolocal" to the mpirun command may help you. HTH, Gary Jan-Frode Myklebust wrote: >Hi, > >we're running MPICH 1.2.4 on a 32 node dual cpu linux cluster (fast >ethernet), and are having some problems with the mpich job distribution. 
>An example from today: > >The PBS job: > >---------------------------------------- >#PBS -l nodes=4:ppn=2,walltime=100:00:00 ># >mpirun -np `wc -l < $PBS_NODEFILE` -machinefile $PBS_NODEFILE mfix.exe >---------------------------------------- > >is assigned to nodes: > > node17/0+node15/0+node14/0+node11/0+node17/1+node15/1+node14/1+node11/1 > >PBS generates a PBS_NODEFILE containing: > >----------------------------- >node17 >node15 >node14 >node11 >node17 >node15 >node14 >node11 >----------------------------- > >And this command is started in node 17: > > mpirun -np 8 -machinefile /var/spool/PBS/aux/20996.fire executable > >And then when I look over the nodes, there's 1 executable running on >node17, 3 on node15, 2 on node14 and 2 on node11. > >Anybody seen something like this, and maybe have an idea of what might >be causing it? > > > -jf >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From shewa at inel.gov Fri Jul 18 18:12:12 2003 From: shewa at inel.gov (Andrew Shewmaker) Date: Fri, 18 Jul 2003 16:12:12 -0600 Subject: Empty passwords vs ssh-agent? In-Reply-To: <1058555100.10220.33.camel@orion-2> References: <1058555100.10220.33.camel@orion-2> Message-ID: <3F1870BC.6030409@inel.gov> John Harrop wrote: > I'm currently switching our system from using r-commands to ssh. We > have a fairly small system with 27 nodes. The only two options I can > see with ssh are empty passwords and ssh-agent. The first looks like it > isn't much better for security than r commands. (We do have ssh with > passwords and known hosts on a portal machine.) Using ssh-agent on a > cluster looks like a potentially big hassle. Or am I mistaken about the > last impression? After all, we have nodes that are almost hitting up > time of 400 days so ssh-add would only have been run once for each > cluster user. > > What are people using as the clusters get bigger? > > Thanks is advance for your comments and thought! > > Cheers, > > John Harrop Have you heard of Keychain? http://www.gentoo.org/proj/en/keychain.xml "It acts as a front-end to ssh-agent, allowing you to easily have one long-running ssh-agent process per system, rather than per login session." I have used this before and it worked well, but I've been meaning to switch to the pam_ssh module. Does anybody use the pam_ssh module to automatically start agents on login? I saw it when I was looking up pam documentation on modules. Download through cvs http://sourceforge.net/cvs/?group_id=16000 Andrew -- Andrew Shewmaker, Associate Engineer Phone: 1-208-526-1276 Idaho National Eng. and Environmental Lab. P.0. Box 1625, M.S. 3605 Idaho Falls, Idaho 83415-3605 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From kblair at uidaho.edu Fri Jul 18 17:41:06 2003 From: kblair at uidaho.edu (Kenneth Blair) Date: Fri, 18 Jul 2003 14:41:06 -0700 Subject: monte boot fail Message-ID: <1058564466.1164.28.camel@eagle2> Having problems installing some nodes to an existing scyld cluster. 
Scyld Beowulf release 27bz-7 (based on Red Hat Linux 6.2)

I run

# beoboot-install 62 /dev/hda
Creating boot images...
Building phase 1 file system image in /tmp/beoboot.22389...
ram disk image size (uncompressed): 2116K
compressing...done
ram disk image size (compressed): 792K
Kernel image is: "/tmp/beoboot.22389".
Initial ramdisk is: "/tmp/beoboot.22389.initrd".
Kernel image is: "/tmp/.beoboot-install.22388".
Initial ramdisk is: "/tmp/.beoboot-install.22388.initrd".
Installing beoboot on partition 1 of /dev/hda.
mke2fs 1.18, 11-Nov-1999 for EXT2 FS 0.5b, 95/08/09
/dev/hda1: 11/25584 files (0.0% non-contiguous), 3250/102280 blocks
Done
Added kernel
* Beoboot installed on node 62

BUT..... when I reboot the box, it fails on the phase 1 load with a "mote_boot fail invalid argument"

Has anyone seen this before???

thanks

-ken

--
Kenneth D. Blair
Initiative for Bioinformatics and Evolutionary STudies
College of Engineering (Computer Science)
University of Idaho
Phone: 208-885-9830
Cell: 408-888-3579
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From rouds at servihoo.com Sat Jul 19 00:52:34 2003
From: rouds at servihoo.com (RoUdY)
Date: Sat, 19 Jul 2003 08:52:34 +0400
Subject: Beowulf digest, Vol 1 #1382 - 12 msgs
In-Reply-To: <200307181901.h6IJ1aw22843@NewBlue.Scyld.com>
Message-ID: 

hello everybody. I'm Roudy and I am new in making a cluster of 4-1 node. Well, I am writing to you all in a hope to hear from you very soon. The coming Monday I will need to go to the University to build this cluster. Please send me the step to undergo so that it is a success.

Thanks

Roudy (Mauritius)
--------------------------------------------------
Get your free email address from Servihoo.com!
http://www.servihoo.com The Portal of Mauritius
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From janfrode at parallab.no Sat Jul 19 07:32:56 2003
From: janfrode at parallab.no (Jan-Frode Myklebust)
Date: Sat, 19 Jul 2003 13:32:56 +0200
Subject: bad job distribution with MPICH
In-Reply-To: <3F186422.5030309@admiral.umsl.edu>
References: <20030717090453.GB23226@ii.uib.no> <3F186422.5030309@admiral.umsl.edu>
Message-ID: <20030719113256.GA23631@ii.uib.no>

On Fri, Jul 18, 2003 at 04:18:26PM -0500, Gary Stiehr wrote:
> 
> Try to use "mpirun -nolocal -np ....".

Yes, that seems to fix it. Thanks!

I also got a nice explanation in private from George Sigut explaining what MPICH was doing when not given the '-nolocal' flag.
" I seem to remember something about mpirun starting distributing the jobs NOT on the first node (i.e. in your case node17) and continuing in the circular fashion: given: 17 15 14 11 17 15 14 11 expected: 17 15 14 11 17 15 14 11 getting: | 15 14 11 17 15 14 11 (instead of 1st 17, twice 15) -> 15 " Looks like without the '-nolocal' MPICH is reserving the first node in the machinefile for job management. -jf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rouds at servihoo.com Sun Jul 20 00:38:32 2003 From: rouds at servihoo.com (RoUdY) Date: Sun, 20 Jul 2003 08:38:32 +0400 Subject: configure a cluster of 4-1 node In-Reply-To: <200307191901.h6JJ1bw22768@NewBlue.Scyld.com> Message-ID: hello everybody, Can someone mail me the step how to configure a cluster of 4-1 node using the platform Linux. Thanks Roudy -------------------------------------------------- Get your free email address from Servihoo.com! http://www.servihoo.com The Portal of Mauritius _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rouds at servihoo.com Sun Jul 20 00:38:32 2003 From: rouds at servihoo.com (RoUdY) Date: Sun, 20 Jul 2003 08:38:32 +0400 Subject: configure a cluster of 4-1 node In-Reply-To: <200307191901.h6JJ1bw22768@NewBlue.Scyld.com> Message-ID: hello everybody, Can someone mail me the step how to configure a cluster of 4-1 node using the platform Linux. Thanks Roudy -------------------------------------------------- Get your free email address from Servihoo.com! http://www.servihoo.com The Portal of Mauritius _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From dlane at ap.stmarys.ca Sun Jul 20 08:19:57 2003 From: dlane at ap.stmarys.ca (Dave Lane) Date: Sun, 20 Jul 2003 09:19:57 -0300 Subject: configure a cluster of 4-1 node In-Reply-To: References: <200307191901.h6JJ1bw22768@NewBlue.Scyld.com> Message-ID: <5.2.0.9.0.20030720091400.02585ea8@crux.stmarys.ca> At 08:38 AM 7/20/2003 +0400, RoUdY wrote: >hello everybody, >Can someone mail me the step how to configure a cluster of 4-1 node using >the platform Linux. Roudy, This is not a simple answer that can be answered in an e-mail. I suggest you read at least some of Robert Brown's online book (a continuous work in progress for him, so its up-to-date) at: http://www.phy.duke.edu/brahma/Resources/beowulf_book.php That will tell you everything you need to know. You may also want to look at one or more of the cluster software distributions such as: Rocks - http://www.rocksclusters.org/Rocks/ Oscar - http://oscar.sourceforge.net/ Good luck ... 
Dave _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From shin at solarider.org Sun Jul 20 18:02:11 2003 From: shin at solarider.org (Shin) Date: Sun, 20 Jul 2003 23:02:11 +0100 Subject: Clusters Vs Grids Message-ID: <20030720220211.GC16662@gre.ac.uk> Hi, I got a few queries about the exact differences between clusters and grids and as I couldn't really find a general purpose grid list to post on and because this list is normally a fountain of knowledge I thought I'd ask here. However if there is somewhere more appropriate to ask then please push me in that direction. Broadly (very broadly) as I understand it a cluster is a collection of machines that will run parallel jobs for codes that require high performance - they might be connected by a high speed interconnect (ie Myrinet, SCI, etc) or via a normal ethernet type connections. The former are described as closely or tightly coupled and the latter as loosely coupled? Hopefully I'm correct so far. A cluster will normally (always?) be located at one specific location. A grid is also a collection of computing resources (cpu's, storage) that will run parallel jobs for codes that also require high performance (or perhaps very long run times?). However these resources might be distributed over a department, campus or even further afield in other organisations, in different parts of the world? As such a grid cam not be closely coupled and any codes that are developed for a grid will have to take the very high latency overheads of a grid into consideration. Not sure what the bandwidth of a grid would be like? On the other hand, a grid potentially makes more raw computing power available to a user who does not have a local adequately specced cluster available. So I was wondering just how all those coders out there who are developing codes on clusters connected with fast interconnects are going to convert their codes to use on a grid - or is there even the concept of a highly coupled grid - ie grid components that are connected via fast interconnections (10Gb ethernet perhaps?) or is that still very low in terms of what closely coupled clusters are capable of. Or are people making their clusters available as components of a grid, call it a ClusterGrid and in the same way that a grid app would specify certain resoure requirements - it could specify that it should look for an available cluster on a grid. However I can't see why establishments who have spent a lot of money developing their clusters would then make them available on a grid for others to use - when they could just create an account for the user on their cluster to run their code on. I could understand the use of single machines that are mostly idle being made available for a grid - but presumably most clusters are in constant demand and use from users. So I was just looking to see if I have my terminology above correct for grids and clusters and whether there was any concept of a tightly coupled grid or even a ClusterGrid. And if there was any useful cross over between clusters and grids - or are the two so completely different architecurally that they will never meet; or not for the near future at least. I was also curious about all these codes that use MPI across tightly coupled systems and how they would adapt to use on loosely coupled grid. 
I'm having a hard time marrying the 2 concept of a cluster and a grid together; but I'm sure much finer brains than mine have already considered all this and ruled it out/in/not-yet. Thanks for any clarity and information you can provide. Oh and if anyone has any comments on the following comment from a colleague I'd appreciate that as well; "grids - hmmm - there're just the latest computing fad - real high performance scientists won't use them and grids will be just so much hype for many years to come". Thanks Shin _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at keyresearch.com Sun Jul 20 20:31:54 2003 From: lindahl at keyresearch.com (Greg Lindahl) Date: Sun, 20 Jul 2003 17:31:54 -0700 Subject: Clusters Vs Grids In-Reply-To: <20030720220211.GC16662@gre.ac.uk> References: <20030720220211.GC16662@gre.ac.uk> Message-ID: <20030721003154.GA16512@greglaptop.greghome.keyresearch.com> > I got a few queries about the exact differences between clusters and > grids and as I couldn't really find a general purpose grid list to > post on and because this list is normally a fountain of knowledge I > thought I'd ask here. There's an IEEE Task Force on Cluster Computing that has an open mailing list. But this is reasonably on-topic. A grid deals with machines separated by significant physical distance, and that usually cross into several administrative domains. Grids have a lot more frequent failures than clusters. A cluster is usually close and administered as one system. > So I was wondering just how all those coders out there who are > developing codes on clusters connected with fast interconnects are > going to convert their codes to use on a grid The speed of light is the only thing that does not scale with Moore's Law. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rouds at servihoo.com Mon Jul 21 01:40:03 2003 From: rouds at servihoo.com (RoUdY) Date: Mon, 21 Jul 2003 09:40:03 +0400 Subject: thank Dave In-Reply-To: <200307201902.h6KJ2Dw20695@NewBlue.Scyld.com> Message-ID: Hello Dave Thanks for all roudy -------------------------------------------------- Get your free email address from Servihoo.com! http://www.servihoo.com The Portal of Mauritius _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rouds at servihoo.com Mon Jul 21 01:40:03 2003 From: rouds at servihoo.com (RoUdY) Date: Mon, 21 Jul 2003 09:40:03 +0400 Subject: thank Dave In-Reply-To: <200307201902.h6KJ2Dw20695@NewBlue.Scyld.com> Message-ID: Hello Dave Thanks for all roudy -------------------------------------------------- Get your free email address from Servihoo.com! 
http://www.servihoo.com The Portal of Mauritius _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From landman at scalableinformatics.com Mon Jul 21 02:58:13 2003 From: landman at scalableinformatics.com (Joseph Landman) Date: 21 Jul 2003 02:58:13 -0400 Subject: New version of the sge_mpiblast tool Message-ID: <1058770692.3285.13.camel@protein.scalableinformatics.com> Hi Folks: We completely rewrote our sge_mpiblast execution tool into a real program that allows you to run the excellent mpiBLAST (http://mpiblast.lanl.gov) code within the SGE queuing system on a bio-cluster. The new code is named run_mpiblast and is available from our download page (http://scalableinformatics.com/downloads/). Documentation is in process, and the source is heavily commented. The principal differences between the old and new versions are . error detection and problem reporting . file staging . rewritten in a real programming language, no more shell script . works within SGE, or from the command line . uses config files . run isolation . debugging and verbosity controls This is a merge between an internal project and the ideas behind the original code. Please give it a try and let us know how it behaves. The link to the information page is http://scalableinformatics.com/sge_mpiblast.html . Joe -- Joseph Landman, Ph.D Scalable Informatics LLC email: landman at scalableinformatics.com web: http://scalableinformatics.com phone: +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From nixon at nsc.liu.se Mon Jul 21 04:18:15 2003 From: nixon at nsc.liu.se (nixon at nsc.liu.se) Date: Mon, 21 Jul 2003 10:18:15 +0200 Subject: Clusters Vs Grids In-Reply-To: <20030720220211.GC16662@gre.ac.uk> (shin@solarider.org's message of "Sun, 20 Jul 2003 23:02:11 +0100") References: <20030720220211.GC16662@gre.ac.uk> Message-ID: Shin writes: > Broadly (very broadly) as I understand it a cluster is a collection > of machines that will run parallel jobs for codes that require high > performance - they might be connected by a high speed interconnect > (ie Myrinet, SCI, etc) or via a normal ethernet type connections. > The former are described as closely or tightly coupled and the > latter as loosely coupled? Hopefully I'm correct so far. You're basically correct, except that a cluster doesn't necessarily run parallel jobs. A common situation is that you have lots and lots of non-interdependent, single-CPU jobs that you want to run as quickly as possible. > A grid is also a collection of computing resources (cpu's, storage) > that will run parallel jobs for codes that also require high > performance (or perhaps very long run times?). However these > resources might be distributed over a department, campus or even > further afield in other organisations, in different parts of the > world? Again, basically correct, except for the same point as above. I think the key issues about a grid is that the resources are: a) possibly distributed over large geographical distances, b) possibly belonging to different organizations with different policies; there is no centralized administrative control over them. 
> As such a grid cam not be closely coupled and any codes that are > developed for a grid will have to take the very high latency > overheads of a grid into consideration. Not sure what the bandwidth > of a grid would be like? That only depends on how fat pipes you put in. In Nordugrid there is gigabit-class bandwidth between (most of) the resources. The latency, on the other hand, is harder to do anything about. > So I was wondering just how all those coders out there who are > developing codes on clusters connected with fast interconnects are > going to convert their codes to use on a grid - or is there even the > concept of a highly coupled grid - ie grid components that are > connected via fast interconnections (10Gb ethernet perhaps?) or is > that still very low in terms of what closely coupled clusters are > capable of. There are MPI implementations that run in grid environments, but of course you might get horrible latency if you have processes running at different sites. > Or are people making their clusters available as components of a > grid, call it a ClusterGrid and in the same way that a grid app > would specify certain resoure requirements - it could specify that > it should look for an available cluster on a grid. That is a much more likely scenario for running parallel applications on a grid, yes. > However I can't see why establishments who have spent a lot of money > developing their clusters would then make them available on a grid > for others to use - when they could just create an account for the > user on their cluster to run their code on. It is partly a question of administrative overhead. In an non-grid situation, if a user gets resources allocated to him at n computing sites, he typically needs to go through n different account activation processes. Now, consider a large project like LHC at CERN, where you have dozens and dozens of participating computing sites and a large number of users - it's just not feasible to have individual accounts at individual sites. Another part is resource location; if you have dozens and dozens of potential job submission sites, you really don't want to manually keep track of the current load at the different sites. In a grid situation, you just need your grid identity, which is a member of the project virtual organization. You only need to submit your job to the grid, and it will automatically be scheduled on the least loaded site where your project VO has been granted resources. (In theory at least. I'm not aware of many grid projects that have gotten this far. Nordugrid is one, though.) > So I was just looking to see if I have my terminology above correct > for grids and clusters and whether there was any concept of a > tightly coupled grid or even a ClusterGrid. And if there was any > useful cross over between clusters and grids - or are the two so > completely different architecurally that they will never meet; or > not for the near future at least. Think of the grid as a generalized way of locating and getting access to resources in a fluffy, vague "network cloud" of computing resources. Clusters are just one type of resource that can be present in the cloud. Certain types of applications run best on clusters with high-speed interconnects - well, then you can use the grid to locate and get access to suitable clusters. 
-- Leif Nixon Systems expert ------------------------------------------------------------ National Supercomputer Centre Linkoping University ------------------------------------------------------------ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jcownie at etnus.com Mon Jul 21 05:30:12 2003 From: jcownie at etnus.com (James Cownie) Date: Mon, 21 Jul 2003 10:30:12 +0100 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: Message from Greg Lindahl of "Fri, 18 Jul 2003 09:14:05 PDT." <20030718161405.GA13859@greglaptop.greghome.keyresearch.com> Message-ID: <19eWzk-260-00@etnus.com> > In both cases you're using different terminology than the SALC folks > do. Perhaps you could give us a reference to the real definition of SALC then ? Google shows up a selection of _different_ versions of the acronym Shared Address Local Copy Shared Address Local Cache and you used Shared Address Local Consistency Since the "Shared Address Local Copy" is in a paper by Bob Numrich, I think this is likely the right one ? If we can't even agree what the acronym stands for it's a bit hard to decide what it means :-( -- Jim James Cownie Etnus, LLC. +44 117 9071438 http://www.etnus.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rbw at ahpcrc.org Mon Jul 21 10:15:04 2003 From: rbw at ahpcrc.org (Richard Walsh) Date: Mon, 21 Jul 2003 09:15:04 -0500 Subject: Global Shared Memory and SCI/Dolphin Message-ID: <200307211415.h6LEF4m20454@mycroft.ahpcrc.org> Steffen Persvold wrote: >Our message passing software may runs on all four interconnects (and ethernet). But the one-sided features of the (cray-like) SHMEM and MPI-2 libraries need underlying hardware support to perform. You must be saying that the Scali implements the MPI-2 one-sided routines and they can be called even over Ethernet, but are actually two-sided emulations with two-sided performance on latency (on Ethernet), right? Regards, rbw #--------------------------------------------------- # Richard Walsh # Project Manager, Cluster Computing, Computational # Chemistry and Finance # netASPx, Inc. # 1200 Washington Ave. So. # Minneapolis, MN 55415 # VOX: 612-337-3467 # FAX: 612-337-3400 # EMAIL: rbw at networkcs.com, richard.walsh at netaspx.com # rbw at ahpcrc.org # #--------------------------------------------------- # "Without mystery, there can be no authority." # -Charles DeGaulle #--------------------------------------------------- # "Why waste time learning when ignornace is # instantaneous?" 
-Thomas Hobbes #--------------------------------------------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at keyresearch.com Mon Jul 21 17:36:44 2003 From: lindahl at keyresearch.com (Greg Lindahl) Date: Mon, 21 Jul 2003 14:36:44 -0700 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: <19eWzk-260-00@etnus.com> References: <20030718161405.GA13859@greglaptop.greghome.keyresearch.com> <19eWzk-260-00@etnus.com> Message-ID: <20030721213644.GA1635@greglaptop.internal.keyresearch.com> On Mon, Jul 21, 2003 at 10:30:12AM +0100, James Cownie wrote: > Perhaps you could give us a reference to the real definition of SALC > then ? > > Google shows up a selection of _different_ versions of the acronym > > Shared Address Local Copy > Shared Address Local Cache > and you used > Shared Address Local Consistency What makes you think that the 1st and 3rd are actually different? They aren't. I've never heard the 2nd. As for what it *means*, it's exactly the model provided by the SHMEM library, or that provided by UPC or CoArray Fortran. It is not the model supported by ccNuma or MPI-1. Is this not clear? -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at keyresearch.com Mon Jul 21 21:20:11 2003 From: lindahl at keyresearch.com (Greg Lindahl) Date: Mon, 21 Jul 2003 18:20:11 -0700 Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: <1058770692.3285.13.camel@protein.scalableinformatics.com> References: <1058770692.3285.13.camel@protein.scalableinformatics.com> Message-ID: <20030722012011.GA2127@greglaptop.internal.keyresearch.com> p.s. it would also help if you could explain what is different from the last time we had this same discussion, about SALC, on this very list, in the year 2000. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hahn at physics.mcmaster.ca Tue Jul 22 00:07:17 2003 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Tue, 22 Jul 2003 00:07:17 -0400 (EDT) Subject: Clusters Vs Grids In-Reply-To: <20030720220211.GC16662@gre.ac.uk> Message-ID: > I'm having a hard time marrying the 2 concept of a cluster and a > grid together; but I'm sure much finer brains than mine have already "grid" is just a marketing term stemming from the fallacy that networks are getting a lot faster/better/cheaper. without those amazing crooks at worldcom, I figure grid would never have accumulated as much attention as it has. I don't know about you, but my wide-area networking experience has improved by about a factor of 10 over the past 10-15 years. network bandwidth and latency is *not* on an exponential curve, but CPU power is. (as is disk density - not surprising when you consider that CPUs and disks are both *areal* devices, unlike networks.) so we should expect it to fall further behind, meaning that for a poorly-networked cluster (aka grid), you'll need even looser-coupled programs than today. 
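to put rough numbers on the speed-of-light point: assume two grid sites 1000 km apart and signal propagation in fibre at about two thirds of c, and compare with the ~5 us interconnect latency mentioned above (the 1000 km figure is just an assumed example):

  t_{\mathrm{wire}} \gtrsim \frac{d}{\tfrac{2}{3}c}
     = \frac{1000\ \mathrm{km}}{2\times 10^{5}\ \mathrm{km/s}}
     = 5\ \mathrm{ms},
  \qquad
  \frac{t_{\mathrm{grid}}}{t_{\mathrm{cluster}}}
     \approx \frac{5\ \mathrm{ms}}{5\ \mu\mathrm{s}} = 10^{3}.

three orders of magnitude in latency is not something fatter pipes can buy back, which is why tightly coupled codes don't simply migrate onto geographically distributed resources.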
YOU MUST READ THIS: http://www.clustercomputing.org/content/tfcc-5-1-gray.html cycle scavenging is a wonderful thing, but it's about like having a compost heap in your back yard, or a neighborhood aluminum can collector ;) > I'd appreciate that as well; "grids - hmmm - there're just the > latest computing fad - real high performance scientists won't use > them and grids will be just so much hype for many years to come". my users are dramatically bifurcated into two sets: those who want 1K CPUs with 2GB/CPU and >500 MB/s, <5 us interconnect, versus those who want 100 CPUs with 200KB apiece and 10bT. the latter could be using a grid; it's a lot easier for them to grab a piece of the cluster pie, though. I wonder whether that's the fate of grids in general: not worth the trouble of setting up, except in extreme cases (seti at home, etc). _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Kim.Branson at csiro.au Tue Jul 22 04:18:20 2003 From: Kim.Branson at csiro.au (Kim Branson) Date: Tue, 22 Jul 2003 18:18:20 +1000 Subject: Clusters Vs Grids In-Reply-To: References: <20030720220211.GC16662@gre.ac.uk> Message-ID: <20030722181820.2e8522be.Kim.Branson@csiro.au> > my users are dramatically bifurcated into two sets: those who want > 1K CPUs with 2GB/CPU and >500 MB/s, <5 us interconnect, versus those > who want 100 CPUs with 200KB apiece and 10bT. the latter could be > using a grid; it's a lot easier for them to grab a piece of the > cluster pie, though. I wonder whether that's the fate of grids > in general: not worth the trouble of setting up, except in extreme > cases (seti at home, etc). Grids are great for my purposes, virtual screening of large chemical databases. We have lots of small independent jobs, some work i have done with the use of grids for virtual screening ( using the molecular docking program DOCK ) can be found at http://www.cs.mu.oz.au/~raj/vlab/index.html there are links to some publications off the site. This work was very much a test to see how grids and scheduling would perform. To my suprise i got better performance from my small local 64 node 1ghz athlon cluster than i did for the grid for most calculations. The use of the machines we were soaking time on and the time taken to run and return the calculations means the dedicated cluster is a better option. For very large datasets the grid does begin to win out, but it is dependent on the load on the grid machines. If you have no local resources a grid is a good option for these caclculations but a large dedicated machine is better for small jobs. The lack of data security means most of our data cannot be dispersed on a grid, and this is perhaps another point to consider when evaluating the usefullness of grids. Would you be happy if someone else could acess your calculation results and inputs? our powers that be certainly don't. 
cheers kim -- ______________________________________________________________________ Dr Kim Branson Computational Drug Design Structural Biology CSIRO Health Sciences and Nutrition Walter and Eliza Hall Institute Royal Parade, Parkville, Melbourne, Victoria Ph 61 03 9662 7136 Email kbranson at wehi.edu.au ______________________________________________________________________ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From andrewxwang at yahoo.com.tw Tue Jul 22 07:48:42 2003 From: andrewxwang at yahoo.com.tw (Andrew Wang) Date: Tue, 22 Jul 2003 19:48:42 +0800 (CST) Subject: New version of the sge_mpiblast tool In-Reply-To: <1058770692.3285.13.camel@protein.scalableinformatics.com> Message-ID: <20030722114842.50715.qmail@web16811.mail.tpe.yahoo.com> Somewhat related, Integrating BLAST with SGE: http://developers.sun.com/solaris/articles/integrating_blast.html Andrew. --- Joseph Landman wrote: > Hi Folks: > > We completely rewrote our sge_mpiblast execution > tool into a real > program that allows you to run the excellent > mpiBLAST > (http://mpiblast.lanl.gov) code within the SGE > queuing system on a > bio-cluster. The new code is named run_mpiblast and > is available from > our download page > (http://scalableinformatics.com/downloads/). > Documentation is in process, and the source is > heavily commented. > > The principal differences between the old and new > versions are > > . error detection and problem reporting > . file staging > . rewritten in a real programming language, no more > shell script > . works within SGE, or from the command line > . uses config files > . run isolation > . debugging and verbosity controls > > This is a merge between an internal project and > the ideas behind the > original code. Please give it a try and let us know > how it behaves. > The link to the information page is > http://scalableinformatics.com/sge_mpiblast.html . > > Joe > > -- > Joseph Landman, Ph.D > Scalable Informatics LLC > email: landman at scalableinformatics.com > web: http://scalableinformatics.com > phone: +1 734 612 4615 > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or > unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From flifson at cs.uct.ac.za Mon Jul 21 14:30:10 2003 From: flifson at cs.uct.ac.za (Farrel Lifson) Date: 21 Jul 2003 20:30:10 +0200 Subject: In need of Beowulf data Message-ID: <1058812210.4397.78.camel@asgard.cs.uct.ac.za> Hi there, As part of my M.Sc. I hope to carry out a case study using Markov Reward Models of a large distributed system. Being a Linux fan, a Beowulf cluster was the obvious choice. Performance data seems to be quite readily available; reliability data, however, seems to be more of a challenge. Specifically I am looking for real-world failure and repair rates for the various components of a Beowulf node (HDD, power supply, CPU, RAM) and the larger cluster (software failure, network, etc).
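(For anyone unfamiliar with the modelling side of this, the simplest version treats each node as a two-state Markov chain with failure rate lambda = 1/MTBF and repair rate mu = 1/MTTR; steady-state availability is then A = mu/(lambda + mu), and for N roughly independent nodes the expected number down at any instant is N*(1 - A). With purely illustrative numbers, not real field data: MTBF = 10,000 h and MTTR = 24 h give A = (1/24)/((1/10000)+(1/24)) ~ 0.998, so a 64-node cluster would average about 0.15 nodes down and see on the order of 64*8760/10000 ~ 56 component failures a year. Real clusters violate the independence and constant-rate assumptions, which is exactly why measured data of the kind asked for here is so valuable.)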
While some components have a mean time to failure rating, this is sometimes underestimated by the manufacturer and I am interested in getting an as accurate as possible model of a real world Beowulf cluster. If anyone has any data they would be willing to share, or if you know of any papers or reports which list such data I would greatly appreciate any links or pointers to them. Thanks in advance, Farrel Lifson -- Data Network Architecture Research Lab mailto:flifson at cs.uct.ac.za Dept. of Computer Science http://people.cs.uct.ac.za/~flifson University of Cape Town +27-21-650-3127 -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From c00jsh00 at nchc.gov.tw Sat Jul 19 05:40:35 2003 From: c00jsh00 at nchc.gov.tw (Jyh-Shyong Ho) Date: Sat, 19 Jul 2003 17:40:35 +0800 Subject: channel bonding on SuSE Message-ID: <3F191213.A1898B95@nchc.gov.tw> Hi, Has anyone successfully set up channel bonding in SuSE? I tried and failed many times and I think it might be the time to ask for help. I am using SuSE Linux Enterprise Server 8 for AMD 64, and I tried to set up the channel bonding for the two Broadcom gigabit LAN ports on the HDAMA motherboard (for dual Opteron CPUs). I followed the instructions in .../Documentation/networking/bonding.txt: 1. modify file /etc/modules.conf to include the line: alias bond0 bonding probeall bond0 eth0 eth1 bonding 2. create ifenslave 3. create /etc/sysconfig/network/ifcfg-bond0 as DEVICE=bond0 IPADDR=192.168.3.60 NETMASK=255.255.255.0 NETWORK=192.168.3.0 BROADCAST=192.168.3.255 ONBOOT=yes STARTMODE='onboot' BOOTPROTO=none USERCTL=no and modify file ifcfg-eth0 as BROADCAST='192.168.3.255' IPADDR='192.168.3.10' NETMASK='255.255.255.0' NETWORK='192.168.3.0' REMOTE_IPADDR='' STARTMODE='onboot' UNIQUE='QOEa.mRtDs8d6UMD' WIRELESS='no' DEVICE='eth0' USERCTL='no' ONBOOT='yes' MASTER='bond0' SLAVE='yes' BOOTPROTO='none' and modify file ifcfg-eth1 as BROADCAST='192.168.3.255' IPADDR='192.168.3.40' NETMASK='255.255.255.0' NETWORK='192.168.3.0' REMOTE_IPADDR='' STARTMODE='onboot' UNIQUE='QOEa.mRtDs8d6UMD' WIRELESS='no' DEVICE='eth1' USERCTL='no' ONBOOT='yes' MASTER='bond0' SLAVE='yes' BOOTPROTO='none' 4. then I tried several ways to bring up the interface bond0: a. ifup bond0 this caused the system hang, and have to reboot the system b. /etc/init.d/network restart or reboot did not bring up bond0 c. /sbin/ifconfig bond0 192.168.3.60 netmask 255.255.255.0 \ broadcast 192.168.3.255 up this caused the system hang, and have to reboot the system I did make the kernel and made sure that Network Devices/bonding devices was made as a module. I have no idea how to proceed next, so if someone has the experience, please help. Regards Jyh-Shyong Ho, PhD. Research Scientist National Center for High-Performance Computing Hsinchu, Taiwan, ROC _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From gerry.creager at tamu.edu Tue Jul 22 07:40:08 2003 From: gerry.creager at tamu.edu (Gerry Creager N5JXS) Date: Tue, 22 Jul 2003 06:40:08 -0500 Subject: Clusters Vs Grids In-Reply-To: References: Message-ID: <3F1D2298.9080808@tamu.edu> I'd offer that we're going to see grids grow for at least the forseeable (?sp?; ?coffee?) future. 
I think we need to coin another term, however, for the applications that will run on them in the near term: "pathetically" parallel. We've seen the growth of clusters, especially in the NUMA/embarrassingly parallel regime. These have proven to work well. Across the 'Grid' we appreciate today, we either see parallellism that simply benefits from distribution due to the vast amount of data and thus benefits from cycle-stealing, or applications that are totally tolerant of disparate latency issues. But what does the future hold? I can foresee an application that uses distributed storage to preposition an entire input dataset so that all the distributed nodes can access it, and a version of the Logistical Backbone that queues data parcels for acquisition and processing and manages the reintegration of the returned results into an output queue. Along another line, I can envision an application prepositioning all the data across the distributed nodes and using an enhanced version of semaphores to to signal when a chunk is processed, then reintegrating the pieces later. Done correctly, both of these become grid-enabling mechanisms. They require atraditional thinking to overcome the non-exponential curve associated with network speed and latency. They will benefit from the introduction of some of the network protocols we've come to know and dream of, including MPLS and some real form of QoS agreement among various carriers, ISP, Universities and other endpoints. And they won't happen tomorrow. IPv6 may enable some of this; QoS is integrated into its very fabric, but agreement on QoS implementation is still far from universal. Worse, while carriers are looking at, or actually implementing IPv6 within their network cores, they are not necessarily bringing it to the edge. Unless you're in Japan or Europe. Oh, I'm sorry, this *IS* a globally distributed list. Is anyone from Level 3 or AT&T listening? The concept of grid computing has taken me a while to embrace, and I'm not sure I like it yet. Overall, I tend to agree with Mark's rather cynical assessment that it's a WorldCom marketting ploy that acquired a life of its own. gerry Mark Hahn wrote: >>I'm having a hard time marrying the 2 concept of a cluster and a >>grid together; but I'm sure much finer brains than mine have already > > > "grid" is just a marketing term stemming from the fallacy that networks > are getting a lot faster/better/cheaper. without those amazing crooks > at worldcom, I figure grid would never have accumulated as much attention > as it has. I don't know about you, but my wide-area networking experience > has improved by about a factor of 10 over the past 10-15 years. > > network bandwidth and latency is *not* on an exponential curve, > but CPU power is. (as is disk density - not surprising when you consider > that CPUs and disks are both *areal* devices, unlike networks.) so we should > expect it to fall further behind, meaning that for a poorly-networked cluster > (aka grid), you'll need even looser-coupled programs than today. > > YOU MUST READ THIS: > http://www.clustercomputing.org/content/tfcc-5-1-gray.html > > cycle scavenging is a wonderful thing, but it's about like having > a compost heap in your back yard, or a neighborhood aluminum > can collector ;) > > >>I'd appreciate that as well; "grids - hmmm - there're just the >>latest computing fad - real high performance scientists won't use >>them and grids will be just so much hype for many years to come". 
> > > my users are dramatically bifurcated into two sets: those who want > 1K CPUs with 2GB/CPU and >500 MB/s, <5 us interconnect, versus those > who want 100 CPUs with 200KB apiece and 10bT. the latter could be > using a grid; it's a lot easier for them to grab a piece of the > cluster pie, though. I wonder whether that's the fate of grids > in general: not worth the trouble of setting up, except in extreme > cases (seti at home, etc). > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Gerry Creager -- gerry.creager at tamu.edu Network Engineering -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.847.8578 Page: 979.228.0173 Office: 903A Eller Bldg, TAMU, College Station, TX 77843 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From sp at scali.com Mon Jul 21 10:54:31 2003 From: sp at scali.com (Steffen Persvold) Date: Mon, 21 Jul 2003 16:54:31 +0200 (CEST) Subject: Global Shared Memory and SCI/Dolphin In-Reply-To: <200307211415.h6LEF4m20454@mycroft.ahpcrc.org> Message-ID: On Mon, 21 Jul 2003, Richard Walsh wrote: > > Steffen Persvold wrote: > > >Our message passing software may runs on all four interconnects (and ethernet). > > But the one-sided features of the (cray-like) SHMEM and MPI-2 libraries > need underlying hardware support to perform. You must be saying that the > Scali implements the MPI-2 one-sided routines and they can be called even > over Ethernet, but are actually two-sided emulations with two-sided performance > on latency (on Ethernet), right? We don't have MPI-2 one-sided, yet, but since we now run on several interconnects, when we implement it we will use the hardware RDMA features where we can and emulate it where we can't, yes. Regards, -- Steffen Persvold ,,, mailto: sp at scali.com Senior Software Engineer (o-o) http://www.scali.com -----------------------------oOO-(_)-OOo----------------------------- Scali AS, PObox 150, Oppsal, N-0619 Oslo, Norway, Tel: +4792484511 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From ktaka at clustcom.com Tue Jul 22 03:13:58 2003 From: ktaka at clustcom.com (Kimitoshi Takahashi) Date: Tue, 22 Jul 2003 16:13:58 +0900 Subject: MTU change on bonded device Message-ID: <200307220713.AA00264@grape3.clustcom.com> Hello, I'm a newbie in the cluster field. I wanted to use jumbo frame on channel bonded device. Any number larger than 1500 seems to be rejected. # ifconfig bond0 mtu 1501 SIOCSIFMTU: Invalid argument # ifconfig bond0 mtu 8000 SIOCSIFMTU: Invalid argument Does anyone know if the bonding driver support Jumbo Frame ? Or, am I doing all wrong ? 
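(For both this jumbo-frame question and the SuSE bonding hang a few messages back, it is worth trying a purely manual bring-up first, outside the distribution's init scripts, since that separates driver problems from configuration problems. A minimal sketch, with interface names and addresses as examples only:

modprobe bonding
ifconfig bond0 192.168.0.201 netmask 255.255.255.0 up
ifenslave bond0 eth1 eth2
ifconfig eth1 mtu 7000
ifconfig eth2 mtu 7000
ifconfig bond0 mtu 7000
ping -s 6000 192.168.0.202

If this sequence also hangs or rejects the MTU, the problem is in the bonding driver itself. In particular, whether "ifconfig bond0 mtu 7000" is accepted at all depends on whether the bonding module in your kernel implements an MTU-change handler; if it does not, no ifcfg file will get the master above 1500 and the slaves' larger MTUs simply go unused.)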
I could change MTUs of enslaved devices, # ifconfig eth2 mtu 7000 # ifconfig eth2 eth2 Link encap:Ethernet HWaddr 00:02:B3:96:0A:16 inet addr:192.168.0.201 Bcast:192.168.0.255 Mask:255.255.255.0 UP BROADCAST RUNNING SLAVE MULTICAST MTU:7000 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:25 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:100 RX bytes:0 (0.0 b) TX bytes:4198 (4.0 Kb) Interrupt:16 Base address:0xd800 Memory:ff860000-ff880000 I use 2.4.20 stock kernel, with channel bonding enabled. The bonded devices are eth1(e1000) and eth2(e1000). Here is the relevant part of the ifconfig output, # ifconfig -a bond0 Link encap:Ethernet HWaddr 00:02:B3:96:0A:16 inet addr:192.168.0.201 Bcast:192.168.0.255 Mask:255.255.255.0 UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:47 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:0 (0.0 b) TX bytes:7305 (7.1 Kb) eth1 Link encap:Ethernet HWaddr 00:02:B3:96:0A:16 inet addr:192.168.0.201 Bcast:192.168.0.255 Mask:255.255.255.0 UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:24 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:100 RX bytes:0 (0.0 b) TX bytes:3625 (3.5 Kb) Interrupt:22 Base address:0xd880 Memory:ff8c0000-ff8e0000 eth2 Link encap:Ethernet HWaddr 00:02:B3:96:0A:16 inet addr:192.168.0.201 Bcast:192.168.0.255 Mask:255.255.255.0 UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:23 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:100 RX bytes:0 (0.0 b) TX bytes:3680 (3.5 Kb) Interrupt:16 Base address:0xd800 Memory:ff860000-ff880000 Thanks in advance. Kimitoshi Takahashi ktaka at clustcom.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at keyresearch.com Tue Jul 22 13:06:28 2003 From: lindahl at keyresearch.com (Greg Lindahl) Date: Tue, 22 Jul 2003 10:06:28 -0700 Subject: Clusters Vs Grids In-Reply-To: <3F1D2298.9080808@tamu.edu> References: <3F1D2298.9080808@tamu.edu> Message-ID: <20030722170628.GA1355@greglaptop.internal.keyresearch.com> On Tue, Jul 22, 2003 at 06:40:08AM -0500, Gerry Creager N5JXS wrote: > I'd offer that we're going to see grids grow for at least the forseeable > (?sp?; ?coffee?) future. I think we need to coin another term, however, > for the applications that will run on them in the near term: > "pathetically" parallel. The people who have been doing {distributed computing, metacomputing, p2p, grids, insert new trendy term here} for a long time have built systems which can run moderately data-intensive programs, not just SETI at home. In fact, a realistic assessment of the bandwidth needed for non-pathetic programs was the basis of the TeraGrid project. > But what does the future hold? I can foresee an application that uses > distributed storage to preposition an entire input dataset so that all > the distributed nodes can access it, Or, you could use existing systems that do exactly that, which were foreseen more than a decade ago, had multiple implementations 5 years ago, and are heading towards production use today. > Overall, I tend to agree with Mark's rather cynical assessment that > it's a WorldCom marketting ploy that acquired a life of its own. 
Which doesn't match up with the age of current grid efforts, which predate WorldCom buying UUNet. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From James.P.Lux at jpl.nasa.gov Tue Jul 22 13:49:48 2003 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Tue, 22 Jul 2003 10:49:48 -0700 Subject: In need of Beowulf data In-Reply-To: <1058812210.4397.78.camel@asgard.cs.uct.ac.za> Message-ID: <5.2.0.9.2.20030722104510.01899928@mailhost4.jpl.nasa.gov> At 08:30 PM 7/21/2003 +0200, Farrel Lifson wrote: >Hi there, > >As part of my M.Sc I hope to carry out a case study using Markov Reward >Models of a large distributed system. Being a Linux fan, a Beowulf >cluster was the obvious choice. > >Performance data seems to be quite readily available, however finding >reliability data seems to be more of a challenge. Specifically I am >looking for real word failure and repair rates for the various >components of a Beowulf node (HDD, power supply, CPU, RAM) and the >larger cluster (software failure, network, etc). > >While some components have a mean time to failure rating, this is >sometimes underestimated by the manufacturer and I am interested in >getting an as accurate as possible model of a real world Beowulf >cluster. I don't know that the manufacturer failure rate data is actually underestimated (they tend to pay pretty close attention to this, it being a legally enforceable specification), but, more probably, the data is being misinterpreted by the casual consumer of it. Take, for example, an MTBF rating for a disk drive. A typical rating might be 50,000 hrs. However, what temperature is that rating at (20C)? What temperature are you really running the drive at (40C?), What's the life derating for the 20C temperature rise? What sort of operation rate is presumed in that failure rate (constant seeks, or some smaller duty cycle)? What counts as a failure? How many power on/power off cycles are assumed? Most of the major manufacturers have very detailed writeups on the reliability of their components (i.e. go to Seagate's site, and there's many pages describing how they do life tests, what the results are, how to apply them, etc.) For "no-name" power supplies, though, you might have a bit more of a challenge. >If anyone has any data they would be willing to share, or if you know of >any papers or reports which list such data I would greatly appreciate >any links or pointers to them. > >Thanks in advance, >Farrel Lifson >-- >Data Network Architecture Research Lab mailto:flifson at cs.uct.ac.za >Dept. of Computer Science http://people.cs.uct.ac.za/~flifson >University of Cape Town +27-21-650-3127 James Lux, P.E. Spacecraft Telecommunications Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Tue Jul 22 18:56:17 2003 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue, 22 Jul 2003 18:56:17 -0400 (EDT) Subject: Clusters Vs Grids In-Reply-To: <3F1D2298.9080808@tamu.edu> Message-ID: On Tue, 22 Jul 2003, Gerry Creager N5JXS wrote: > "pathetically" parallel. We've seen the growth of clusters, especially Gerry, you're a genius. 
Pathetically parallel indeed. I'll have to work this into my next talk...:-) rgb (back from a fairly obvious, long, vacation:-) -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From dtj at uberh4x0r.org Tue Jul 22 21:46:33 2003 From: dtj at uberh4x0r.org (Dean Johnson) Date: 22 Jul 2003 20:46:33 -0500 Subject: Clusters Vs Grids In-Reply-To: References: Message-ID: <1058924793.1154.4.camel@terra> On Tue, 2003-07-22 at 17:56, Robert G. Brown wrote: > On Tue, 22 Jul 2003, Gerry Creager N5JXS wrote: > > > "pathetically" parallel. We've seen the growth of clusters, especially > > Gerry, you're a genius. Pathetically parallel indeed. I'll have to > work this into my next talk...:-) > > rgb > > (back from a fairly obvious, long, vacation:-) While I agree that there needs to be a term, I think "pathetically parallel" is ambiguous. We know what we are talking about, having been steeped in the world of parallelism, but others aren't. If I am pathetic at sports, it means that I am not very athletic, i.e. pathetically athletic. Perhaps "Frighteningly"... ah, nevermind. ;-) -- -Dean _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mitchel at navships.com Wed Jul 23 15:04:38 2003 From: mitchel at navships.com (Mitchel Kagawa) Date: Wed, 23 Jul 2003 09:04:38 -1000 Subject: Thermal Problems Message-ID: <002701c3514d$43af4f00$6f01a8c0@Navatek.local> I run a small 64-node cluster, each node with dual AMD MP2200s in a 1U chassis. I am having problems with some of the nodes overheating and shutting down. We are using Dynatron 1U CPU fans which are supposed to spin at 5400 rpm, but I notice that a lot (25%) of the fans tend to freeze up or blow the bearings and spin at only 1000 RPM, which causes the CPU to overheat. After careful inspection I noticed that the heatsink and fan sit very close to the lid of the case. I was wondering how much clearance is needed between the lid and the fan that blows down onto the short copper heatsink? When I put the lid on the case it is almost as if the fan is working in a vacuum, because it actually speeds up an additional 600-700 rpm to over 6000 rpm... like there is no air resistance. Could this be why the fans are crapping out? I was thinking that a 60x60x10mm CPU fan that has air intakes on the side of the fan might work better, but I have not seen any... have you? Also, the vendor suggested that we separate the 1U cases because he believes that there is heat transfer between the nodes when they are stacked right on top of each other. I thought that if one node is running at 50C and another node is running at 50C it won't generate a combined heat load of more than 50C, right? Mitchel Kagawa Systems Admin. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Wed Jul 23 16:14:40 2003 From: rgb at phy.duke.edu (Robert G.
Brown) Date: Wed, 23 Jul 2003 16:14:40 -0400 (EDT) Subject: Thermal Problems In-Reply-To: <002701c3514d$43af4f00$6f01a8c0@Navatek.local> Message-ID: On Wed, 23 Jul 2003, Mitchel Kagawa wrote: > I run a small 64 node cluster each with dual AMD MP2200's in a 1U chassis. > I am having problems with some of the nodes overheating and shutting down. > We are using Dynatron 1U CPU fans which are supposed to spin at 5400 rpm but > I notice that a lot (25%) of the fans tend to freeze up or blow the bearings > and spin at only 1000 RPM, which causes the cpu to overheat. After careful > inspection I noticed that the heatsink and fan sit very close to the lid of > the case. I was wondering how much clearance is needed between the lid and > the fan that blown down onto the short copper heatsink? When I put the lid > on the case it is almost as if the fan is working in a vaccum because it > actually speeds up an aditional 600-700 rpm to over 6000 rpm... like there > is no air resistance. Could this be why the fans are crapping out? I was > thinking that a 60x60x10mm cpu fan that has air intakes on the side of the > fan might work better but I have not seen any... have you? > > Also the vendor suggested that we sepetate the 1U cases because he belives > that there is heat transfer between the nodeswhen they are stacked right on > top of eachother. I thought that if one node is running at 50c and another > node is running at 50c it wont generate a combined heatload of more than 50c > right. AMD's really hate to run hot, and duals in 1U require some fairly careful engineering to run cool enough, stably. Who is your vendor? Did they do the node design or did you? If they did, you should be able to ask them to just plain fix it -- replace the fans or if necessary reengineer the whole case -- to make the problem go away. Issues like fan clearance and stacking and overall airflow through the case are indeed important. Sometimes things like using round instead of ribbon cables (which can turn sideways and interrupt airflow) makes a big difference. Keeping the room's ambient air "cold" (as opposed to "comfortable") helps. There is likely some heat transfer vertically between the 1U cases, but if you go to the length of separating them you might as well have used 2U cases in the first place. >From your description, it does sound like you have some bad fans. Whether they are bad (as in a bad design, poor vendor), or bad (as in installed "incorrectly" in a case/mobo with inadequate clearance causing them to fail), or bad (as in you just happened to get some fans from a bad production batch but replacements would probably work fine) it is very hard to say, and I don't envy you the debugging process of finding out which. We've been the route of replacing all of the fans once ourselves so it can certainly happen... rgb > > > Mitchel Kagawa > Systems Admin. > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mitchel at navships.com Wed Jul 23 16:33:26 2003 From: mitchel at navships.com (Mitchel Kagawa) Date: Wed, 23 Jul 2003 10:33:26 -1000 Subject: pfilter.conf Message-ID: <005001c35159$ab4a2c50$6f01a8c0@Navatek.local> I'm having problems finding out how to open a range of ports that are being filtered using the pfilter service. I am able to open a specific port by editing the /etc/pfilter.conf file with a line like 'open tcp 3389' but for the life of me I can't figure out how to open a range of ports like 30000 - 33000 and I have serached everywhere on the net can any of you help me out? thanks! Mitchel Kagawa Systems Administrator Mitchel Kagawa Systems Administrator _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From James.P.Lux at jpl.nasa.gov Wed Jul 23 18:19:00 2003 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed, 23 Jul 2003 15:19:00 -0700 Subject: Thermal Problems In-Reply-To: <002701c3514d$43af4f00$6f01a8c0@Navatek.local> Message-ID: <5.2.0.9.2.20030723145932.02fa56b0@mailhost4.jpl.nasa.gov> At 09:04 AM 7/23/2003 -1000, Mitchel Kagawa wrote: >I run a small 64 node cluster each with dual AMD MP2200's in a 1U chassis. >I am having problems with some of the nodes overheating and shutting down. >We are using Dynatron 1U CPU fans which are supposed to spin at 5400 rpm but >I notice that a lot (25%) of the fans tend to freeze up or blow the bearings >and spin at only 1000 RPM, which causes the cpu to overheat. After careful >inspection I noticed that the heatsink and fan sit very close to the lid of >the case. I was wondering how much clearance is needed between the lid and >the fan that blown down onto the short copper heatsink? To a first order, the area of the inlet should be comparable to the area of the outlet. A 60 mm diameter fan has an area of around 2800 mm^2. If you draw from around the entire periphery (which would be around 180 mm), you'd need a gap of around 15 mm (probably 20 mm would be a better idea) That's a fairly significant fraction of the 45 mm or so for 1 rack U. > When I put the lid >on the case it is almost as if the fan is working in a vaccum because it >actually speeds up an aditional 600-700 rpm to over 6000 rpm... like there >is no air resistance. Could this be why the fans are crapping out? I was >thinking that a 60x60x10mm cpu fan that has air intakes on the side of the >fan might work better but I have not seen any... have you? > >Also the vendor suggested that we sepetate the 1U cases because he belives >that there is heat transfer between the nodeswhen they are stacked right on >top of eachother. I thought that if one node is running at 50c and another >node is running at 50c it wont generate a combined heatload of more than 50c >right. So, your vendor essentially claims that his 1U case will work just fine as long as there is a 1U air gap above and below? Let's look at the problem with some simple calculations: Assume no heat transfer up or down (tightly packed), and that no heat transfers through the sides by conduction, as well, so all the heat has to go into airflow. 
Assume that you've got to move about 200W out of the box, and you can tolerate a 10C rise in temperature of the air moving through the box. The question is how much air do you need to move. Air has a density of about 1.13 kg/m^3 and a specific heat of about 1 kJ/kgK. 200W is 0.2 kJ/sec, so you need to move 0.02 kg of air every second (you get a 10 deg rise) is about 0.018 cubic meters/second. To relate this to more common fan specs: about 40 CFM or 65 cubic meters/hr. (I did a quick check on some smallish 60mm fans, and they only flow around 10-20 CFM into NO backpressure... http://www.papst.de/pdf_dat_d/Seite_13.pdf for instance) How fast is the air going to be moving through the vents? What's the vent area... say it's 10 square inches (1 inch high and 10 inches wide...).. 40 CFM through .07 square feet is 576 ft/min for the air flow (which is a reasonable speed.. 1000 ft/min is getting fast and noisy...) But here's the thing.. you've got 32 of these things in the rack... are you moving 1300 CFM through the rack, or are you blowing hot air from one chassis into the next. >Mitchel Kagawa >Systems Admin. > > >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf James Lux, P.E. Spacecraft Telecommunications Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hahn at physics.mcmaster.ca Wed Jul 23 18:32:33 2003 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed, 23 Jul 2003 18:32:33 -0400 (EDT) Subject: Thermal Problems In-Reply-To: <002701c3514d$43af4f00$6f01a8c0@Navatek.local> Message-ID: > We are using Dynatron 1U CPU fans which are supposed to spin at 5400 rpm but I don't think it makes much sense to use cpu-fans in 1U chassis - not only are cpu-fans *in*general* less reliable, but you'd constantly be facing this sort of problem. not to mention the fact that the overall airflow would be near-pessimal. far better is the kind of 1U chassis that has 1 or two fairly large, reliable centrifugal blowers forcing air past passive heatsinks on the CPUs. there are multiple vendors that sell this kind of design. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mitchel at navships.com Wed Jul 23 22:15:31 2003 From: mitchel at navships.com (Mitchel Kagawa) Date: Wed, 23 Jul 2003 16:15:31 -1000 Subject: Thermal Problems References: Message-ID: <000c01c35189$750cd310$6f01a8c0@Navatek.local> Here are a few pictures of the culprite. Any suggestions on how to fix it other than buying a whole new case would be appreciated http://neptune.navships.com/images/oscarnode-front.jpg http://neptune.navships.com/images/oscarnode-side.jpg http://neptune.navships.com/images/oscarnode-back.jpg You can also see how many I'm down... it should read 65 nodes (64 + 1 head node) http://neptune.navships.com/ganglia Mitchel Kagawa Systems Administrator ----- Original Message ----- From: "Robert G. 
Brown" To: "Mitchel Kagawa" Cc: Sent: Wednesday, July 23, 2003 10:14 AM Subject: Re: Thermal Problems > On Wed, 23 Jul 2003, Mitchel Kagawa wrote: > > > I run a small 64 node cluster each with dual AMD MP2200's in a 1U chassis. > > I am having problems with some of the nodes overheating and shutting down. > > We are using Dynatron 1U CPU fans which are supposed to spin at 5400 rpm but > > I notice that a lot (25%) of the fans tend to freeze up or blow the bearings > > and spin at only 1000 RPM, which causes the cpu to overheat. After careful > > inspection I noticed that the heatsink and fan sit very close to the lid of > > the case. I was wondering how much clearance is needed between the lid and > > the fan that blown down onto the short copper heatsink? When I put the lid > > on the case it is almost as if the fan is working in a vaccum because it > > actually speeds up an aditional 600-700 rpm to over 6000 rpm... like there > > is no air resistance. Could this be why the fans are crapping out? I was > > thinking that a 60x60x10mm cpu fan that has air intakes on the side of the > > fan might work better but I have not seen any... have you? > > > > Also the vendor suggested that we sepetate the 1U cases because he belives > > that there is heat transfer between the nodeswhen they are stacked right on > > top of eachother. I thought that if one node is running at 50c and another > > node is running at 50c it wont generate a combined heatload of more than 50c > > right. > > AMD's really hate to run hot, and duals in 1U require some fairly > careful engineering to run cool enough, stably. Who is your vendor? > Did they do the node design or did you? If they did, you should be able > to ask them to just plain fix it -- replace the fans or if necessary > reengineer the whole case -- to make the problem go away. > > Issues like fan clearance and stacking and overall airflow through the > case are indeed important. Sometimes things like using round instead of > ribbon cables (which can turn sideways and interrupt airflow) makes a > big difference. Keeping the room's ambient air "cold" (as opposed to > "comfortable") helps. There is likely some heat transfer vertically > between the 1U cases, but if you go to the length of separating them you > might as well have used 2U cases in the first place. > > From your description, it does sound like you have some bad fans. > Whether they are bad (as in a bad design, poor vendor), or bad (as in > installed "incorrectly" in a case/mobo with inadequate clearance causing > them to fail), or bad (as in you just happened to get some fans from a > bad production batch but replacements would probably work fine) it is > very hard to say, and I don't envy you the debugging process of finding > out which. We've been the route of replacing all of the fans once > ourselves so it can certainly happen... > > rgb > > > > > > > Mitchel Kagawa > > Systems Admin. > > > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > > Robert G. Brown http://www.phy.duke.edu/~rgb/ > Duke University Dept. of Physics, Box 90305 > Durham, N.C. 
27708-0305 > Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From salonj at hotmail.com Thu Jul 24 03:13:36 2003 From: salonj at hotmail.com (salon j) Date: Thu, 24 Jul 2003 07:13:36 +0000 Subject: open the graphic interface. Message-ID: i want t open the graphic interface on three machines of my clusters, which program with pvm, in my programme , i use gtk to program the graphic interface, i have add machines before i spawn, but after i use spawn -> filename, it shown pvm>[t80001] Cannot connect to X server t80001 is a task on the other machine ,not the machine which start up the pvm task. how can i do with this error? _________________________________________________________________ ??????????????? MSN Hotmail? http://www.hotmail.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mikee at mikee.ath.cx Thu Jul 24 08:05:58 2003 From: mikee at mikee.ath.cx (Mike Eggleston) Date: Thu, 24 Jul 2003 07:05:58 -0500 Subject: open the graphic interface. In-Reply-To: ; from salonj@hotmail.com on Thu, Jul 24, 2003 at 07:13:36AM +0000 References: Message-ID: <20030724070558.A14082@mikee.ath.cx> On Thu, 24 Jul 2003, salon j wrote: > i want t open the graphic interface on three machines of my clusters, > which program with pvm, in my programme , i use gtk to program the > graphic interface, i have add machines before i spawn, but after i use > spawn -> filename, it shown pvm>[t80001] Cannot connect to X server > t80001 is a task on the other machine ,not the machine which start up > the pvm task. how can i do with this error? There is a debugging option in one of the pvm shell scripts. Setting the debugging option will allow your programs to reach your X server. Mike _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Stephane.Martin at imag.fr Thu Jul 24 08:37:11 2003 From: Stephane.Martin at imag.fr (Stephane.Martin at imag.fr) Date: Thu, 24 Jul 2003 14:37:11 +0200 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED Message-ID: <3F1FD2F7.6FA2F2E3@imag.fr> Hello, We have recently received 48 Bi-xeon Dell 1600SC and we are performing some benchmarks to tests the cluster. Unfortunately we have very bad perfomance with the internal gigabit card (82540EM chipset). We have passed linux netperf test and we have only 33 Mo between 2 machines. We have changed the drivers for the last ones, installed procfgd and so on... Finally we had Win2000 installed and the last driver from intel installed : the results are identical... To go further we have installed a PCI-X 82540EM card and re-run the tests : in that way the results are much better : 66 Mo full duplex... So the question is : is there a well known problem with this DELL 1600SC concernig the 82540EM integration on the motherboard ???? As anyone already have (heard about) this problem ? 
Is there any solution ? thx for your help Regards, -- Stephane Martin Stephane.Martin at imag.fr http://icluster.imag.fr Tel: 04 76 61 20 31 Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 MONTBONNOT SAINT MARTIN _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jeffrey.b.layton at lmco.com Thu Jul 24 08:04:20 2003 From: jeffrey.b.layton at lmco.com (Jeff Layton) Date: Thu, 24 Jul 2003 08:04:20 -0400 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED In-Reply-To: <3F1FD2F7.6FA2F2E3@imag.fr> References: <3F1FD2F7.6FA2F2E3@imag.fr> Message-ID: <3F1FCB44.3010002@lmco.com> Stephane, What kind of switch (100 or 1000)? Have you looked at the switch ports? Are they connecting at full or half duplex? How about the NICs? You'll see bad performance with a duplex mismatch between the NICs and switch. Are you forcing the NICs or are they auto-negiotiating? Good Luck! Jeff > Hello, > > We have recently received 48 Bi-xeon Dell 1600SC and we are performing > some benchmarks to tests the cluster. > Unfortunately we have very bad perfomance with the internal gigabit > card (82540EM chipset). We have passed linux netperf test and we have > only 33 Mo > > between 2 machines. We have changed the drivers for the last ones, > installed procfgd and so on... Finally we had Win2000 installed and > the last driver > > from intel installed : the results are identical... To go further we > have installed a PCI-X 82540EM card and re-run the tests : in that way the > > results are much better : 66 Mo full duplex... > So the question is : is there a well known problem with this DELL > 1600SC concernig the 82540EM integration on the motherboard ???? > > As anyone already have (heard about) this problem ? > Is there any solution ? > > thx for your help > -- Dr. Jeff Layton Chart Monkey - Aerodynamics and CFD Lockheed-Martin Aeronautical Company - Marietta _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From canon at nersc.gov Thu Jul 24 10:36:53 2003 From: canon at nersc.gov (canon at nersc.gov) Date: Thu, 24 Jul 2003 07:36:53 -0700 Subject: Thermal Problems In-Reply-To: Your message of "Wed, 23 Jul 2003 15:19:00 PDT." <5.2.0.9.2.20030723145932.02fa56b0@mailhost4.jpl.nasa.gov> Message-ID: <200307241436.h6OEarX2002407@pookie.nersc.gov> We have a similar setup and have seen a similar problem. The vendor determined the fans weren't robust enough and sent replacements. With regards to adding gaps... We have considered (but haven't implemented) adding a gap every 10ish nodes. This would be primarily to reset the vertical temperature gradient. You can run your hand up the exhaust and feel the temperature difference between the top and the bottom. I suspect hot air rises. :-) The gap would allow us to "reset" the temperature gradient. This would only lose us 2 or 3U which isn't too bad if it helps the cooling. 
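(Jim's arithmetic, quoted below, is worth boiling down to a rule of thumb you can apply per box or per rack. With air density of about 1.13 kg/m^3 and specific heat of about 1 kJ/kg.K, the flow needed to carry away P watts with a temperature rise of dT degrees C is roughly P/(1130*dT) m^3/s, i.e. CFM ~ 1.9*P/dT. So one 200 W dual node held to a 10 C rise needs ~40 CFM through the chassis, and a rack of 32 of them needs on the order of 1200 CFM of cool air delivered to the front and exhausted from the back -- which is the real point: room and rack airflow has to scale with the node count, not just the per-chassis fans.)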
--Shane ------------------------------------------------------------------------ Shane Canon voice: 510-486-6981 PSDF Project Lead fax: 510-486-7520 National Energy Research Scientific Computing Center 1 Cyclotron Road Mailstop 943-256 Berkeley, CA 94720 canon at nersc.gov ------------------------------------------------------------------------ > At 09:04 AM 7/23/2003 -1000, Mitchel Kagawa wrote: > >I run a small 64 node cluster each with dual AMD MP2200's in a 1U chassis. > >I am having problems with some of the nodes overheating and shutting down. > >We are using Dynatron 1U CPU fans which are supposed to spin at 5400 rpm but > >I notice that a lot (25%) of the fans tend to freeze up or blow the bearings > >and spin at only 1000 RPM, which causes the cpu to overheat. After careful > >inspection I noticed that the heatsink and fan sit very close to the lid of > >the case. I was wondering how much clearance is needed between the lid and > >the fan that blown down onto the short copper heatsink? > > To a first order, the area of the inlet should be comparable to the area of > the outlet. A 60 mm diameter fan has an area of around 2800 mm^2. If you > draw from around the entire periphery (which would be around 180 mm), you'd > need a gap of around 15 mm (probably 20 mm would be a better idea) That's > a fairly significant fraction of the 45 mm or so for 1 rack U. > > > > > When I put the lid > >on the case it is almost as if the fan is working in a vaccum because it > >actually speeds up an aditional 600-700 rpm to over 6000 rpm... like there > >is no air resistance. Could this be why the fans are crapping out? I was > >thinking that a 60x60x10mm cpu fan that has air intakes on the side of the > >fan might work better but I have not seen any... have you? > > > >Also the vendor suggested that we sepetate the 1U cases because he belives > >that there is heat transfer between the nodeswhen they are stacked right on > >top of eachother. I thought that if one node is running at 50c and another > >node is running at 50c it wont generate a combined heatload of more than 50c > >right. > > So, your vendor essentially claims that his 1U case will work just fine as > long as there is a 1U air gap above and below? > > Let's look at the problem with some simple calculations: > > Assume no heat transfer up or down (tightly packed), and that no heat > transfers through the sides by conduction, as well, so all the heat has to > go into airflow. > Assume that you've got to move about 200W out of the box, and you can > tolerate a 10C rise in temperature of the air moving through the box. The > question is how much air do you need to move. Air has a density of about > 1.13 kg/m^3 and a specific heat of about 1 kJ/kgK. > 200W is 0.2 kJ/sec, so you need to move 0.02 kg of air every second (you > get a 10 deg rise) is about 0.018 cubic meters/second. To relate this to > more common fan specs: about 40 CFM or 65 cubic meters/hr. (I did a quick > check on some smallish 60mm fans, and they only flow around 10-20 CFM into > NO backpressure... http://www.papst.de/pdf_dat_d/Seite_13.pdf > for instance) > > How fast is the air going to be moving through the vents? What's the vent > area... say it's 10 square inches (1 inch high and 10 inches wide...).. 40 > CFM through .07 square feet is 576 ft/min for the air flow (which is a > reasonable speed.. 1000 ft/min is getting fast and noisy...) > > But here's the thing.. you've got 32 of these things in the rack... 
are you > moving 1300 CFM through the rack, or are you blowing hot air from one > chassis into the next. > > > > > > > >Mitchel Kagawa > >Systems Admin. > > > > > >_______________________________________________ > >Beowulf mailing list, Beowulf at beowulf.org > >To change your subscription (digest mode or unsubscribe) visit > >http://www.beowulf.org/mailman/listinfo/beowulf > > James Lux, P.E. > Spacecraft Telecommunications Section > Jet Propulsion Laboratory, Mail Stop 161-213 > 4800 Oak Grove Drive > Pasadena CA 91109 > tel: (818)354-2075 > fax: (818)393-6875 > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beo > wulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Thu Jul 24 10:09:15 2003 From: rgb at phy.duke.edu (Robert G. Brown) Date: Thu, 24 Jul 2003 10:09:15 -0400 (EDT) Subject: Thermal Problems In-Reply-To: <000c01c35189$750cd310$6f01a8c0@Navatek.local> Message-ID: On Wed, 23 Jul 2003, Mitchel Kagawa wrote: > Here are a few pictures of the culprite. Any suggestions on how to fix it > other than buying a whole new case would be appreciated > http://neptune.navships.com/images/oscarnode-front.jpg > http://neptune.navships.com/images/oscarnode-side.jpg > http://neptune.navships.com/images/oscarnode-back.jpg The case design doesn't look totally insane, although that depends a bit on the actual capacity of some of the fans. You've got a fairly large, clear aperture at the front, three fans pulling from it and blowing cool air over the memory and all three heatsinks, and a rotary/turbine fan in the rear corner to exhaust the heated air. The ribbon cables are off to the side where they don't appear to obstruct the airflow. The hard disk presumably has its own fan and pulls front to back over on the other side more or less independent of the case flow. At a guess, you're problem really is just the CPU coolers, which may not be optimal for 1U cases. A few minutes with google turns up a lot of alternatives, e.g.: http://www.buyextras.com/cojaiuracpuc.html which is engineered to pull air in through the copper (very good heat conductor) fins and exhaust it to the SIDE and not out the TOP. Another couple of things you can try are to contact AMD and find out what CPU cooler(s) THEY recommend for 1U systems or join one of the AMD hardware user support lists (I'll let you do the googling on this one, but they are out there) and see if somebody will give you a glowing testimonial on some particular brands for quality, reliability, effectiveness. The high end coolers aren't horribly cheap -- the one above is $20 (although the site also had a couple of coolers for $16 that might also be adequate). However, retrofitting fans is a lot cheaper than replacing 64 1U cases with 2U cases AND likely having to replace the CPU coolers anyway, as a cheap cooler is a cheap cooler and likely to fail. If you bought the cluster from a vendor selling "1U dual Athlon nodes" and they picked the hardware, they should replace all of the cheap fans with good fans at their cost, and they should do it right away as you're losing money by the bucketfull every time a node goes down and you have to mess with it. Downtime and your time are EXPENSIVE -- hardware is cheap. 
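(A practical aside on the temperature sampling suggested below: if lm_sensors happens to be installed and configured on the nodes -- an assumption, since the original post doesn't say -- a crude log is enough to show whether replacement fans actually help. Something like

while true; do
    date >> /var/log/cputemp.log
    sensors | grep -i temp >> /var/log/cputemp.log
    sleep 300
done &

run on a loaded node before and after the swap gives you numbers instead of guesses.)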
If they refuse to, please post their name on the list so the rest of us can avoid them plague-like (a thing I'm tempted to do anyway if their advice on "fixing" your cooling is to install your 1U node on a 2U spacing). If you picked the hardware and they just assembled it, well, tough luck, but they should still help out some -- perhaps take back the cheap fans and replace them with good fans at cost. However, even if they decide to do nothing at all for you and you're stuck doing it all yourself, you're better off spending $40 x 64 = $2560 and a couple of days of your time and ending up with a functional cluster than living with days/weeks of downtime fruitlessly cycling cheap replacement fans doomed to die in their turn. Also, eventually your CPUs will start to die and not just crash your systems, and that gets very expensive very quickly quite aside from the cost of downtime and labor. There are no free lunches, and it may be that going with expensive (but effective!) CPU cooler fans isn't enough to stabilize your systems. For example, if the rear exhaust fan doesn't have adequate capacity or the cooler fans can't be installed in such a way as to establish a clean airflow of cool air from the front, the CPU cooler fans will just end up blowing heated air around in a turbulent loop inside the case and even though the fans may not fail (as they won't be obstructed) the CPUs may run hotter than you'd like. You'll have no way of knowing without trying. If your vendor doesn't handle this for you I'd recommend that you immediately spring for a "sample" of the high end fans -- perhaps eight of them, perhaps sixteen -- and use them to repair your downed systems. Run the nodes in their usual environment with the new fans and sample CPU core temperatures. I'd predict that the CPUs will run cooler than they do now in any event, but it is good to be sure. When you're confident that they will a) keep the CPUs cool and b) run reliably, given that they have unobstructed airflow you can either buy them as you need them and just repair nodes as the cheap fans die with the new ones or, if your cluster really needs to be up and stay up, spring for the complete set. BTW, you should check to make sure that the fan at the link above is actually correct for your CPUs -- it seems like it would be, but caveat emptor. Good luck, rgb > > You can also see how many I'm down... it should read 65 nodes (64 + 1 head > node) > http://neptune.navships.com/ganglia > > Mitchel Kagawa > Systems Administrator > > ----- Original Message ----- > From: "Robert G. Brown" > To: "Mitchel Kagawa" > Cc: > Sent: Wednesday, July 23, 2003 10:14 AM > Subject: Re: Thermal Problems > > > > On Wed, 23 Jul 2003, Mitchel Kagawa wrote: > > > > > I run a small 64 node cluster each with dual AMD MP2200's in a 1U > chassis. > > > I am having problems with some of the nodes overheating and shutting > down. > > > We are using Dynatron 1U CPU fans which are supposed to spin at 5400 rpm > but > > > I notice that a lot (25%) of the fans tend to freeze up or blow the > bearings > > > and spin at only 1000 RPM, which causes the cpu to overheat. After > careful > > > inspection I noticed that the heatsink and fan sit very close to the lid > of > > > the case. I was wondering how much clearance is needed between the lid > and > > > the fan that blown down onto the short copper heatsink? When I put the > lid > > > on the case it is almost as if the fan is working in a vaccum because it > > > actually speeds up an aditional 600-700 rpm to over 6000 rpm... 
like > there > > > is no air resistance. Could this be why the fans are crapping out? I > was > > > thinking that a 60x60x10mm cpu fan that has air intakes on the side of > the > > > fan might work better but I have not seen any... have you? > > > > > > Also the vendor suggested that we sepetate the 1U cases because he > belives > > > that there is heat transfer between the nodeswhen they are stacked right > on > > > top of eachother. I thought that if one node is running at 50c and > another > > > node is running at 50c it wont generate a combined heatload of more than > 50c > > > right. > > > > AMD's really hate to run hot, and duals in 1U require some fairly > > careful engineering to run cool enough, stably. Who is your vendor? > > Did they do the node design or did you? If they did, you should be able > > to ask them to just plain fix it -- replace the fans or if necessary > > reengineer the whole case -- to make the problem go away. > > > > Issues like fan clearance and stacking and overall airflow through the > > case are indeed important. Sometimes things like using round instead of > > ribbon cables (which can turn sideways and interrupt airflow) makes a > > big difference. Keeping the room's ambient air "cold" (as opposed to > > "comfortable") helps. There is likely some heat transfer vertically > > between the 1U cases, but if you go to the length of separating them you > > might as well have used 2U cases in the first place. > > > > From your description, it does sound like you have some bad fans. > > Whether they are bad (as in a bad design, poor vendor), or bad (as in > > installed "incorrectly" in a case/mobo with inadequate clearance causing > > them to fail), or bad (as in you just happened to get some fans from a > > bad production batch but replacements would probably work fine) it is > > very hard to say, and I don't envy you the debugging process of finding > > out which. We've been the route of replacing all of the fans once > > ourselves so it can certainly happen... > > > > rgb > > > > > > > > > > > Mitchel Kagawa > > > Systems Admin. > > > > > > > > > _______________________________________________ > > > Beowulf mailing list, Beowulf at beowulf.org > > > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > > > > > > Robert G. Brown http://www.phy.duke.edu/~rgb/ > > Duke University Dept. of Physics, Box 90305 > > Durham, N.C. 27708-0305 > > Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu > > > > > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Stephane.Martin at imag.fr Thu Jul 24 11:12:54 2003 From: Stephane.Martin at imag.fr (Stephane.Martin at imag.fr) Date: Thu, 24 Jul 2003 17:12:54 +0200 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED References: <3F1FD2F7.6FA2F2E3@imag.fr> <3F1FCB44.3010002@lmco.com> Message-ID: <3F1FF776.E586E244@imag.fr> Jeff Layton wrote: > > Stephane, > > What kind of switch (100 or 1000)? Have you looked > at the switch ports? Are they connecting at full or half > duplex? How about the NICs? You'll see bad performance > with a duplex mismatch between the NICs and switch. > Are you forcing the NICs or are they auto-negiotiating? > > Good Luck! > > Jeff > > > Hello, > > > > We have recently received 48 Bi-xeon Dell 1600SC and we are performing > > some benchmarks to tests the cluster. > > Unfortunately we have very bad perfomance with the internal gigabit > > card (82540EM chipset). We have passed linux netperf test and we have > > only 33 Mo > > > > between 2 machines. We have changed the drivers for the last ones, > > installed procfgd and so on... Finally we had Win2000 installed and > > the last driver > > > > from intel installed : the results are identical... To go further we > > have installed a PCI-X 82540EM card and re-run the tests : in that way the > > > > results are much better : 66 Mo full duplex... > > So the question is : is there a well known problem with this DELL > > 1600SC concernig the 82540EM integration on the motherboard ???? > > > > As anyone already have (heard about) this problem ? > > Is there any solution ? > > > > thx for your help > > > > -- > Dr. Jeff Layton > Chart Monkey - Aerodynamics and CFD > Lockheed-Martin Aeronautical Company - Marietta Hello, For our tests we are connected to a 4108GL (J4865A), and we have done all the necessary checks (unless we have forgotten something very big) to ensure the validity of our measurements. The ports have been tested with auto-negotiation on, then off, and also forced. We get the same measurements when connected to a J4898A. The negotiation between the NICs and the two switches is working. When using a Tyan motherboard with the 82540EM built in, with the same benchmarks, switches and procedures (driver updates and compilation from Intel, various benchmarks, different OSes), the results are correct (80 to 90 MB/s). All our tests tend to show that Dell missed something in the integration of the 82540EM in the 1600SC series... if not, we would really appreciate knowing what we are missing, because here we have a 150 000 dollar cluster that was sold with a gigabit network and is delivering the performance of three bonded 100 Mbit cards (in full duplex it's even worse!). If the problem is not rapidly solved, the 48 machines will be returned.
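A quick way to cross-check numbers like these is a bare TCP stream test run back to back between two nodes, with the switch taken out of the path entirely; if the back-to-back figure is also around 33 MB/s, the switch is off the hook and the onboard NIC/driver/bus is the place to dig. The sketch below is only an illustration -- the port, chunk size and 256 MB transfer are arbitrary choices, and it is no substitute for netperf -- but it needs nothing beyond Python on the two nodes:

  # tcp_stream_check.py -- rough point-to-point TCP throughput check.
  # Run "python tcp_stream_check.py server" on one node, then
  # "python tcp_stream_check.py client <server_ip>" on the other.
  import socket, sys, time

  PORT = 5001                  # arbitrary test port
  CHUNK = 64 * 1024            # 64 KB writes
  TOTAL = 256 * 1024 * 1024    # move 256 MB per run

  def server():
      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      s.bind(("", PORT))
      s.listen(1)
      conn, addr = s.accept()
      n = 0
      while True:
          data = conn.recv(CHUNK)
          if not data:
              break
          n += len(data)
      print("received %d bytes from %s" % (n, addr[0]))

  def client(host):
      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      s.connect((host, PORT))
      buf = b"x" * CHUNK
      sent = 0
      start = time.time()
      while sent < TOTAL:
          s.sendall(buf)
          sent += len(buf)
      s.close()
      print("%.1f MB/s" % (sent / (time.time() - start) / 1e6))

  if __name__ == "__main__":
      if sys.argv[1] == "server":
          server()
      else:
          client(sys.argv[2])

(The client-side figure slightly flatters the link, since it only times the sends, but it is more than accurate enough to tell 33 MB/s from 90 MB/s.)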
thx a lot for your concern, regards -- Stephane Martin Stephane.Martin at imag.fr http://icluster.imag.fr Tel: 04 76 61 20 31 Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 MONTBONNOT SAINT MARTIN _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From fant at pobox.com Thu Jul 24 11:17:00 2003 From: fant at pobox.com (Andrew Fant) Date: Thu, 24 Jul 2003 11:17:00 -0400 (EDT) Subject: Comparing MPI Implementations Message-ID: <20030724111221.Y73094-100000@net.bluemoon.net> Does anyone have any experiences comparing MPI implementations for Linux? In particular, I am interested in people's views of the relative merits of Mpich, LAM, and MPIPro. I currently have Mpich installed on our production cluster, but this decision came mostly out of default, rather than by any serious study. Thanks in advance, Andy Andrew Fant | This | "If I could walk THAT way... Molecular Geek | Space | I wouldn't need the talcum powder!" fant at pobox.com | For | G. Marx (apropos of Aerosmith) Boston, MA USA | Hire | http://www.pharmawulf.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From enrico341 at hotmail.com Thu Jul 24 11:43:07 2003 From: enrico341 at hotmail.com (Eric Uren) Date: Thu, 24 Jul 2003 10:43:07 -0500 Subject: Project Help Message-ID: To whomever it may concern, I work at a company called AT systems. I was recently assigned the task of using up thirty extra SBC's that we have. My boss told me that he wants to link all of the SBC's together, and plop them in a tower, and donate them to a college or university as a tax write-off. We have a factory attached to our engineering department, which contains a turret, multiple work stations, and so on. So getting a hold of a custom tower, power supply, etc. is not a problem. I just need to create a way to use these thirty extra board we have. All thirty of them contain: a P266 processor, 128 MB of RAM, 128 IDE, Compac Flash Drive, and Ethernet and USB ports. Any diagrams, sites, comments, or suggestions would be greatly appreciated. Thanks. 
Eric Uren AT Systems _________________________________________________________________ MSN 8 with e-mail virus protection service: 2 months FREE* http://join.msn.com/?page=features/virus _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From seth at hogg.org Thu Jul 24 11:48:28 2003 From: seth at hogg.org (Simon Hogg) Date: Thu, 24 Jul 2003 16:48:28 +0100 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED In-Reply-To: <3F1FF776.E586E244@imag.fr> References: <3F1FD2F7.6FA2F2E3@imag.fr> <3F1FCB44.3010002@lmco.com> Message-ID: <4.3.2.7.2.20030724164701.00b1ca40@pop.freeuk.net> At 17:12 24/07/03 +0200, Stephane.Martin at imag.fr wrote: >All our tests tends to show that dell missed something in the integration >of the 82540EM in the 1600SC series...if not we'll really really appreciate >to know what we are missing there cause here we have a 150 000 dollars >cluster said to be connected with a network gigabit having network perfs of >three 100 card bonded (in full duplex it's even worse !!!!!). If the >problem is not rapidly solved the 48 machines will be returned.... > >thx a lot for your concern, Sorry I can't help you, but I wonder what response you have had from Dell? -- Simon _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From Stephane.Martin at imag.fr Thu Jul 24 13:47:14 2003 From: Stephane.Martin at imag.fr (Stephane.Martin at imag.fr) Date: Thu, 24 Jul 2003 19:47:14 +0200 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED References: <3F1FD2F7.6FA2F2E3@imag.fr> <3F1FCB44.3010002@lmco.com> <4.3.2.7.2.20030724164701.00b1ca40@pop.freeuk.net> Message-ID: <3F201BA2.F7371A4@imag.fr> Simon Hogg wrote: > > At 17:12 24/07/03 +0200, Stephane.Martin at imag.fr wrote: > > >All our tests tends to show that dell missed something in the integration > >of the 82540EM in the 1600SC series...if not we'll really really appreciate > >to know what we are missing there cause here we have a 150 000 dollars > >cluster said to be connected with a network gigabit having network perfs of > >three 100 card bonded (in full duplex it's even worse !!!!!). If the > >problem is not rapidly solved the 48 machines will be returned.... > > > >thx a lot for your concern, > > Sorry I can't help you, but I wonder what response you have had from Dell? > > -- > Simon hello, I can't really answer that question yet... hmmm... First, the technician sent us a link to a web page about a different network chipset on a different machine, claiming the network integration was similar (personally I would never compare network results between a bi-PIII and a bi-Xeon, but...). It was not very useful; the technician argued that it was a test of the 82540EM... not really serious. Worst of all, he said those results were correct (because they matched the ones in his link), and he didn't react much when I told him that such poor performance will certainly lead to the whole cluster being rejected. So as far as he is concerned, everything is fine! I decided to go one level up and got a similar response (I was sent an internal report benchmarking yet ANOTHER configuration: this time the numbers were for a card plugged into the PCI-X bus, a test I had already done myself...). So what can I say? I'm not sure they really feel concerned about my (their) problems... My boss said: if there is no solution by tomorrow, the cluster is going to be sent back... thx for your concerns, -- Stephane Martin Stephane.Martin at imag.fr http://icluster.imag.fr Tel: 04 76 61 20 31 Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 MONTBONNOT SAINT MARTIN _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From math at sizone.org Wed Jul 23 19:58:14 2003 From: math at sizone.org (Ken Chase) Date: Wed, 23 Jul 2003 19:58:14 -0400 Subject: cold rooms & machines Message-ID: <20030723235814.GA11248@velocet.ca> A group I know wants to put a cluster in their labs, but they don't have any facilities for cooling _EXCEPT_ a cold room used to store chemicals and conduct experiments at 5C (it's largely unused and could probably be set to any temp up to 10C, really - even -10C if desired ;) The chillers in there are pretty underworked and might be able to handle the 3000W or so of heat that would be radiating out of the machines.
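For a rough sense of scale, a steady 3000 W of compute load is a real ask of a chiller that was sized for a chemical cold room rather than for active heat sources inside it. The conversion below uses only standard constants (the 3000 W figure is the estimate above; nothing here is specific to that particular room) and puts the load in the BTU/hr and "tons of refrigeration" units an HVAC engineer would quote:

  # Rough sizing arithmetic for a steady 3000 W heat load.
  WATTS = 3000.0
  BTU_PER_HR_PER_WATT = 3.412     # 1 W = 3.412 BTU/hr
  BTU_PER_HR_PER_TON  = 12000.0   # 1 ton of refrigeration = 12,000 BTU/hr

  btu_hr = WATTS * BTU_PER_HR_PER_WATT
  tons = btu_hr / BTU_PER_HR_PER_TON
  print("%.0f BTU/hr, about %.2f tons of cooling" % (btu_hr, tons))
  # -> roughly 10,200 BTU/hr, i.e. ~0.85 ton, running continuously, on top
  #    of whatever already leaks in through the walls and door openings.

That is within what many walk-in chillers can deliver, but it is a continuous duty cycle the room was probably never specified for, which is why the airflow and condensation questions that follow matter.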
What other criteria should we be looking at - non-condensing environment I would guess is one - is this just a function of the %RH in the room? What should it be set to? Any other concerns? /kc -- Ken Chase, math at sizone.org _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jim at ks.uiuc.edu Thu Jul 24 14:21:58 2003 From: jim at ks.uiuc.edu (Jim Phillips) Date: Thu, 24 Jul 2003 13:21:58 -0500 (CDT) Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED In-Reply-To: <3F1FF776.E586E244@imag.fr> Message-ID: Hi, The 82540EM is a low-cost 32-bit "desktop" NIC, so it's hard to get full gigabit bandwidth, particularly if you're running at 33 MHz (look at /proc/net/PRO_LAN_Adapters/eth0/PCI_Bus_Speed to find out). There are no 82540EM-based PCI-X cards, AFAIK; are you sure it wasn't a 64-bit 82545EM card? Intel distinguishes their 32-bit 33/66 MHz PCI PRO/1000 MT Desktop cards that use 82540EM from their 64-bit PCI-X PRO/1000 MT Server cards that use the 82545EM (and have full gigabit performance). -Jim On Thu, 24 Jul 2003 Stephane.Martin at imag.fr wrote: > > > Hello, > > > > > > We have recently received 48 Bi-xeon Dell 1600SC and we are performing > > > some benchmarks to tests the cluster. > > > Unfortunately we have very bad perfomance with the internal gigabit > > > card (82540EM chipset). We have passed linux netperf test and we have > > > only 33 Mo > > > > > > between 2 machines. We have changed the drivers for the last ones, > > > installed procfgd and so on... Finally we had Win2000 installed and > > > the last driver > > > > > > from intel installed : the results are identical... To go further we > > > have installed a PCI-X 82540EM card and re-run the tests : in that way the > > > > > > results are much better : 66 Mo full duplex... > > > So the question is : is there a well known problem with this DELL > > > 1600SC concernig the 82540EM integration on the motherboard ???? > > > > > > As anyone already have (heard about) this problem ? > > > Is there any solution ? > > > > > > thx for your help > > > > > > > -- > > Dr. Jeff Layton > > Chart Monkey - Aerodynamics and CFD > > Lockheed-Martin Aeronautical Company - Marietta > > Hello, > > For our tests we are connected to a 4108GL (J4865A), we have done all necessary checks (maybe we've have forget something very very big ????) to > ensure the validity of our mesures. The ports have been tested with auto neg on, then off and also forced. We have also the same mesures when > connected to a J4898A. The negociation between the NIcs ans the two switches is working. > > When using a tyan motherboard with the 82540EM built-in and using the same benchs and switches ans the same procedures (drivers updates and > compilations from Intel, various benchs, different OS) the results are correct (80 to 90Mo). > > All our tests tends to show that dell missed something in the integration of the 82540EM in the 1600SC series...if not we'll really really appreciate > to know what we are missing there cause here we have a 150 000 dollars cluster said to be connected with a network gigabit having network perfs of > three 100 card bonded (in full duplex it's even worse !!!!!). If the problem is not rapidly solved the 48 machines will be returned.... 
> > thx a lot for your concern, > > regards > > > -- > Stephane Martin Stephane.Martin at imag.fr > http://icluster.imag.fr > Tel: 04 76 61 20 31 > Informatique et distribution Web: http://www-id.imag.fr > ENSIMAG - Antenne de Montbonnot > ZIRST - 51, avenue Jean Kuntzmann > 38330 MONTBONNOT SAINT MARTIN > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mprinkey at aeolusresearch.com Thu Jul 24 09:33:09 2003 From: mprinkey at aeolusresearch.com (Michael T. Prinkey) Date: Thu, 24 Jul 2003 09:33:09 -0400 (EDT) Subject: Thermal Problems In-Reply-To: <000c01c35189$750cd310$6f01a8c0@Navatek.local> Message-ID: On Wed, 23 Jul 2003, Mitchel Kagawa wrote: > Here are a few pictures of the culprite. Any suggestions on how to fix it > other than buying a whole new case would be appreciated > http://neptune.navships.com/images/oscarnode-front.jpg > http://neptune.navships.com/images/oscarnode-side.jpg > http://neptune.navships.com/images/oscarnode-back.jpg > > You can also see how many I'm down... it should read 65 nodes (64 + 1 head > node) > http://neptune.navships.com/ganglia > > Mitchel Kagawa > Systems Administrator > The Intel Xeon ships with an interesting heat sink/fan/shroud system. For an normal case, you can mount the fan on the top of the shroud which makes it work much like a "normal" heat sink/fan...the air comes in the top and blows down onto the CPU. But, for low-profile installations (mine were 2U), the fan attaches to the side of the shroud to form a "wind tunnel." Maybe a similar solution would exist in your case, i.e., taller heat sinks (~1") with one or two fans mounted on the side blowing across the heat sink. I did a quick search online, but couldn't find a vendor for this type heat sink. Sorry. You might be able to experiment. Fans are usually only held in place with oversized screws that go easily into soft heat sinks. You can probably build a pair of test heat sinks in 10 minutues. The flow from the fan should be aligned with the fins. Depending on the type of heatsink you start with, you might be able to direct the output flow in any direction you choose. From the photos, I would recommend that you place the fans on the side of the heat sink near the front of the case so the exhaust is directed to the vents at the rear of the case. Good luck, Mike Prinkey Aeolus Research, Inc. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From gerry.creager at tamu.edu Thu Jul 24 09:47:22 2003 From: gerry.creager at tamu.edu (Gerry Creager N5JXS) Date: Thu, 24 Jul 2003 08:47:22 -0500 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED In-Reply-To: <3F1FCB44.3010002@lmco.com> References: <3F1FD2F7.6FA2F2E3@imag.fr> <3F1FCB44.3010002@lmco.com> Message-ID: <3F1FE36A.30905@tamu.edu> And for the 802.3u impaired, you need to A) either set speed and duplex settings on your switch AND NIC to fixed values (preferably matching each other) or B) leave them all at Auto/Auto for switch and NIC(s). 
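Whichever of those two routes is taken, it is worth verifying what each node actually ended up negotiating rather than assuming it. A minimal sketch, assuming ethtool is installed (run it as root), that the interface is eth0, and that the driver reports "Speed:" and "Duplex:" lines the way e1000 does:

  # duplex_check.py -- report the link speed/duplex that ethtool sees.
  import os, sys

  iface = "eth0"
  if len(sys.argv) > 1:
      iface = sys.argv[1]

  out = os.popen("ethtool " + iface).read()

  speed = duplex = "unknown"
  for line in out.splitlines():
      line = line.strip()
      if line.startswith("Speed:"):
          speed = line.split(":", 1)[1].strip()
      elif line.startswith("Duplex:"):
          duplex = line.split(":", 1)[1].strip()

  print("%s: speed %s, duplex %s" % (iface, speed, duplex))
  if duplex.lower() != "full":
      print("WARNING: %s is not full duplex -- recheck switch and NIC settings" % iface)

Run across the cluster with an rsh/ssh loop, a half-duplex or 100 Mb/s link jumps out immediately, which is far cheaper than discovering it in the middle of a benchmark.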
For those who haven't wandered past the negotiation between switch and NIC recently, if you fix any value, negotiation will fail and the devices will go to default settings, ie., something resembling a consistent speed between the 2 and half-duplex. But not even that is guaranteed. Note that I've also received recent reports of horrid GBE performance on Serverworks botherboards with the internal E-1000 NIC. I've not been able to identify a cause (Don? Thoughts? Definitive info?) but I've been able to reproduce it. gerry Jeff Layton wrote: > Stephane, > > What kind of switch (100 or 1000)? Have you looked > at the switch ports? Are they connecting at full or half > duplex? How about the NICs? You'll see bad performance > with a duplex mismatch between the NICs and switch. > Are you forcing the NICs or are they auto-negiotiating? > > Good Luck! > > Jeff > > >> Hello, >> >> We have recently received 48 Bi-xeon Dell 1600SC and we are performing >> some benchmarks to tests the cluster. >> Unfortunately we have very bad perfomance with the internal gigabit >> card (82540EM chipset). We have passed linux netperf test and we have >> only 33 Mo >> >> between 2 machines. We have changed the drivers for the last ones, >> installed procfgd and so on... Finally we had Win2000 installed and >> the last driver >> >> from intel installed : the results are identical... To go further we >> have installed a PCI-X 82540EM card and re-run the tests : in that way >> the >> >> results are much better : 66 Mo full duplex... >> So the question is : is there a well known problem with this DELL >> 1600SC concernig the 82540EM integration on the motherboard ???? >> >> As anyone already have (heard about) this problem ? >> Is there any solution ? >> >> thx for your help >> > -- Gerry Creager -- gerry.creager at tamu.edu Network Engineering -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.847.8578 Page: 979.228.0173 Office: 903A Eller Bldg, TAMU, College Station, TX 77843 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bari at onelabs.com Thu Jul 24 14:36:50 2003 From: bari at onelabs.com (Bari Ari) Date: Thu, 24 Jul 2003 13:36:50 -0500 Subject: Thermal Problems In-Reply-To: <000c01c35189$750cd310$6f01a8c0@Navatek.local> References: <000c01c35189$750cd310$6f01a8c0@Navatek.local> Message-ID: <3F202742.5010107@onelabs.com> Mitchel Kagawa wrote: >Here are a few pictures of the culprite. Any suggestions on how to fix it >other than buying a whole new case would be appreciated >http://neptune.navships.com/images/oscarnode-front.jpg >http://neptune.navships.com/images/oscarnode-side.jpg >http://neptune.navships.com/images/oscarnode-back.jpg > > > The fans tied to the cpu heat sinks may be too close to the top cover for effective air flow/cooling. Measure the air temp at various places inside the case when closed and the cpu's operating. Try to get an idea of how much airflow is actually moving through the case vs just around the inside of the case. Try placing tangential (cross flow) fans in the empty drive bays and up against the front panel and opening up the rear of the case. http://www.airvac.se/products.htm The power supply has fans at its front and rear to move air through it. The centrifugal blower in the rear corner may not be helping much to draw air across the cpu's. The same principle applies to the enclosure. 
Try to move more air through it vs just around the inside. The cooler the components the lower the failure rate. Bari _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Thu Jul 24 15:12:25 2003 From: rgb at phy.duke.edu (Robert G. Brown) Date: Thu, 24 Jul 2003 15:12:25 -0400 (EDT) Subject: cold rooms & machines In-Reply-To: <20030723235814.GA11248@velocet.ca> Message-ID: On Wed, 23 Jul 2003, Ken Chase wrote: > A group I know wants to put a cluster in their labs, but they > dont have any facilities for cooling _EXCEPT_ a cold room to store > chemicals and conduct experiments at 5C (its largely unused and could > probably be set to any temp up to 10C, really - even -10C if desired > ;) > > The chillers in there are pretty underworked and might be able to > handle the 3000W odd of heat that would be radiating out of the > machines. > > What other criteria should we be looking at - non-condensing > environment I would guess is one - is this just a function of the %RH > in the room? What should it be set to? Any other concerns? Air circulation. The room needs to have a circulation pattern that delivers cool air to the intake/front of the cluster and delivers warmed air from the exhaust rear to the air return. A cold room might or might not have adequate airflow or chiller capacity, as it isn't really engineered for active sources within the space but rather for removing ambient heat from objects placed therein a single time, plus dealing with heat bleeding through its (usually copious) insulation. There are lots of (bad) things that could happen if the air circulation isn't engineered right -- coils can freeze up, humidity can condense and leak, cluster nodes can feed back heated air outside the cooled air circulation and overheat. I'd have them contact an AC engineer to go over the space and see whether it can work, and if so what modifications are required. rgb > > /kc > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rocky at atipa.com Thu Jul 24 11:42:04 2003 From: rocky at atipa.com (Rocky McGaugh) Date: Thu, 24 Jul 2003 10:42:04 -0500 (CDT) Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED In-Reply-To: <3F1FF776.E586E244@imag.fr> Message-ID: On Thu, 24 Jul 2003 Stephane.Martin at imag.fr wrote: > Hello, > > For our tests we are connected to a 4108GL (J4865A), we have done all > necessary checks (maybe we've have forget something very very big ????) > to ensure the validity of our mesures. The ports have been tested with > auto neg on, then off and also forced. We have also the same mesures > when connected to a J4898A. The negociation between the NIcs ans the two > switches is working. > > When using a tyan motherboard with the 82540EM built-in and using the > same benchs and switches ans the same procedures (drivers updates and > compilations from Intel, various benchs, different OS) the results are > correct (80 to 90Mo). 
> > All our tests tends to show that dell missed something in the > integration of the 82540EM in the 1600SC series...if not we'll really > really appreciate to know what we are missing there cause here we have a > 150 000 dollars cluster said to be connected with a network gigabit > having network perfs of three 100 card bonded (in full duplex it's even > worse !!!!!). If the problem is not rapidly solved the 48 machines will > be returned.... I'd totally remove the switch from the situation first. See what you can get back-to-back by directly connecting one node to another first. While the 4108GL is great for management networks, it is not a high performance switch. Wait till you fire up all 48 with PMB. Your bisectional bandwidth is not going to be great, but you should still be able to hit decent numbers with a limited number of machines. It's possible that broadcast and multicast traffic are interfering with your runs. So first remove the switch. If you get the performance you are looking for point-to-point, then you can focus your efforts on the switch. Twice i've had 4108GL's that would experience a severe performance hit when doing any traffic with a certain blade. The first time it was a fast ethernet blade in slot "C". Any network traffic that hit a port on this blade was severely degraded. We swapped blades with a different slot and the problem did not follow the blade. A firmware update solved the issue. The second time it was with a gig-E blade in slot "F". Again, any network traffic that hit a port on this blade was severely degraded (similar to what you're seeing now). This time, a firmware update did not fix it, but swapping it with another gig-E blade from another 4108GL worked fine. The "problem" blade also worked fine in the other 4108. Targeting Pallas PMB to run on specific nodes based on the topology of the switch can sure tell one a lot about a switch...:) Good luck, -- Rocky McGaugh Atipa Technologies rocky at atipatechnologies.com rmcgaugh at atipa.com 1-785-841-9513 x3110 http://67.8450073/ perl -e 'print unpack(u, ".=W=W+F%T:7\!A+F-O;0H`");' _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From enrico341 at hotmail.com Thu Jul 24 14:58:02 2003 From: enrico341 at hotmail.com (Eric Uren) Date: Thu, 24 Jul 2003 13:58:02 -0500 Subject: Hubs Message-ID: To whomever it may concern, I am trying to link together 30 boards through Ethernet. What would be your recomendation for how many and what type of Hubs I should use to connect them all together. Any imput is appreciated. Eric Uren AT Systems _________________________________________________________________ Protect your PC - get McAfee.com VirusScan Online http://clinic.mcafee.com/clinic/ibuy/campaign.asp?cid=3963 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Thu Jul 24 15:40:37 2003 From: rgb at phy.duke.edu (Robert G. Brown) Date: Thu, 24 Jul 2003 15:40:37 -0400 (EDT) Subject: Hubs In-Reply-To: Message-ID: On Thu, 24 Jul 2003, Eric Uren wrote: > > To whomever it may concern, > > I am trying to link together 30 boards through Ethernet. What > would be your recomendation for how many and what type of Hubs I should use > to connect them all together. Any imput is appreciated. 
Any hint as to what you're going to be doing with the 30 boards? The obvious choice is a cheap 48 port 10/100BT switch from any name-brand vendor. However, there are circumstances where you'd want more expensive switches, 1000BT switches, or a different network altogether. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From klight at appliedthermalsciences.com Thu Jul 24 16:16:54 2003 From: klight at appliedthermalsciences.com (Ken Light) Date: Thu, 24 Jul 2003 16:16:54 -0400 Subject: Thermal Problems Message-ID: I think there are a lot of compromises in this layout. The centrifugal blower in the back looks like it is helping mostly the power supply, not the CPUs. The CPU fans doesn't look like they are being very effective when the top of the case goes on and the little muffin fans near the memory are notoriously inefficient when you present them with any kind of flow restriction like that duct. I would be tempted to experiment with different CPU heat sinks and a bigger blower on front to move air over them. The following links show some views of a pretty good Xeon setup. Maybe you can get some ideas of things to try (by the way, the CPUs are under the paper). The case is custom from Microway Inc. and is pretty deep, but the extra space makes for a good layout. Good luck. http://www.clusters.umaine.edu/xeon/ -Ken > -----Original Message----- > From: Michael T. Prinkey [mailto:mprinkey at aeolusresearch.com] > Sent: Thursday, July 24, 2003 9:33 AM > To: Mitchel Kagawa > Cc: beowulf at beowulf.org > Subject: Re: Thermal Problems > > > On Wed, 23 Jul 2003, Mitchel Kagawa wrote: > > > Here are a few pictures of the culprite. Any suggestions > on how to fix it > > other than buying a whole new case would be appreciated > > http://neptune.navships.com/images/oscarnode-front.jpg > > http://neptune.navships.com/images/oscarnode-side.jpg > > http://neptune.navships.com/images/oscarnode-back.jpg > > > > You can also see how many I'm down... it should read 65 > nodes (64 + 1 head > > node) > > http://neptune.navships.com/ganglia > > > > Mitchel Kagawa > > Systems Administrator > > > > The Intel Xeon ships with an interesting heat sink/fan/shroud > system. > For an normal case, you can mount the fan on the top of the > shroud which > makes it work much like a "normal" heat sink/fan...the air > comes in the > top and blows down onto the CPU. But, for low-profile > installations (mine > were 2U), the fan attaches to the side of the shroud to form a "wind > tunnel." Maybe a similar solution would exist in your case, > i.e., taller > heat sinks (~1") with one or two fans mounted on the side > blowing across > the heat sink. I did a quick search online, but couldn't > find a vendor > for this type heat sink. Sorry. > > You might be able to experiment. Fans are usually only held > in place with > oversized screws that go easily into soft heat sinks. You > can probably > build a pair of test heat sinks in 10 minutues. The flow from the fan > should be aligned with the fins. Depending on the type of > heatsink you > start with, you might be able to direct the output flow in > any direction > you choose. 
From the photos, I would recommend that you > place the fans on > the side of the heat sink near the front of the case so the exhaust is > directed to the vents at the rear of the case. > > Good luck, > > Mike Prinkey > Aeolus Research, Inc. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) > visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From deadline at plogic.com Thu Jul 24 16:13:01 2003 From: deadline at plogic.com (Douglas Eadline) Date: Thu, 24 Jul 2003 16:13:01 -0400 (EDT) Subject: Comparing MPI Implementations In-Reply-To: <20030724111221.Y73094-100000@net.bluemoon.net> Message-ID: On Thu, 24 Jul 2003, Andrew Fant wrote: > > Does anyone have any experiences comparing MPI implementations for Linux? > In particular, I am interested in people's views of the relative merits of > Mpich, LAM, and MPIPro. I currently have Mpich installed on our > production cluster, but this decision came mostly out of default, rather > than by any serious study. One easy way to compare is to use the NAS test suite in the Beowulf Performance Suite. You can very easily run the NAS suite with MPICH, LAM, and MPI-PRO, (and compilers, numbers of cpus, and test size) The suite does not include the MPI versions. Have a look at: www.cluster-rant.com/article.pl?sid=03/03/17/1838236 for links and example output. I have not had a chance to post some recent results, but I can say the following: Given the same hardware for all MPI's: - it depends on the application - it depends if you are using dual nodes running two copies of your program. - it depends on the version you use How is that for a simple answer. Doug > > Thanks in advance, > Andy > > Andrew Fant | This | "If I could walk THAT way... > Molecular Geek | Space | I wouldn't need the talcum powder!" > fant at pobox.com | For | G. Marx (apropos of Aerosmith) > Boston, MA USA | Hire | http://www.pharmawulf.com > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- ------------------------------------------------------------------- Paralogic, Inc. | PEAK | Voice:+610.814.2800 130 Webster Street | PARALLEL | Fax:+610.814.5844 Bethlehem, PA 18015 USA | PERFORMANCE | http://www.plogic.com ------------------------------------------------------------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Stephane.Martin at imag.fr Thu Jul 24 17:52:02 2003 From: Stephane.Martin at imag.fr (Stephane.Martin at imag.fr) Date: Thu, 24 Jul 2003 23:52:02 +0200 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED References: Message-ID: <3F205502.A2E197D3@imag.fr> Jim Phillips a ?crit : > > Hi, > > The 82540EM is a low-cost 32-bit "desktop" NIC, so it's hard to get full > gigabit bandwidth, particularly if you're running at 33 MHz (look at > /proc/net/PRO_LAN_Adapters/eth0/PCI_Bus_Speed to find out). There are no > 82540EM-based PCI-X cards, AFAIK; are you sure it wasn't a 64-bit 82545EM > card? 
Intel distinguishes their 32-bit 33/66 MHz PCI PRO/1000 MT Desktop > cards that use 82540EM from their 64-bit PCI-X PRO/1000 MT Server cards > that use the 82545EM (and have full gigabit performance). > > -Jim > > On Thu, 24 Jul 2003 Stephane.Martin at imag.fr wrote: > > > > > Hello, > > > > > > > > We have recently received 48 Bi-xeon Dell 1600SC and we are performing > > > > some benchmarks to tests the cluster. > > > > Unfortunately we have very bad perfomance with the internal gigabit > > > > card (82540EM chipset). We have passed linux netperf test and we have > > > > only 33 Mo > > > > > > > > between 2 machines. We have changed the drivers for the last ones, > > > > installed procfgd and so on... Finally we had Win2000 installed and > > > > the last driver > > > > > > > > from intel installed : the results are identical... To go further we > > > > have installed a PCI-X 82540EM card and re-run the tests : in that way the > > > > > > > > results are much better : 66 Mo full duplex... > > > > So the question is : is there a well known problem with this DELL > > > > 1600SC concernig the 82540EM integration on the motherboard ???? > > > > > > > > As anyone already have (heard about) this problem ? > > > > Is there any solution ? > > > > > > > > thx for your help > > > > > > > > > > -- > > > Dr. Jeff Layton > > > Chart Monkey - Aerodynamics and CFD > > > Lockheed-Martin Aeronautical Company - Marietta > > > > Hello, > > > > For our tests we are connected to a 4108GL (J4865A), we have done all necessary checks (maybe we've have forget something very very big ????) to > > ensure the validity of our mesures. The ports have been tested with auto neg on, then off and also forced. We have also the same mesures when > > connected to a J4898A. The negociation between the NIcs ans the two switches is working. > > > > When using a tyan motherboard with the 82540EM built-in and using the same benchs and switches ans the same procedures (drivers updates and > > compilations from Intel, various benchs, different OS) the results are correct (80 to 90Mo). > > > > All our tests tends to show that dell missed something in the integration of the 82540EM in the 1600SC series...if not we'll really really appreciate > > to know what we are missing there cause here we have a 150 000 dollars cluster said to be connected with a network gigabit having network perfs of > > three 100 card bonded (in full duplex it's even worse !!!!!). If the problem is not rapidly solved the 48 machines will be returned.... > > > > thx a lot for your concern, > > > > regards > > > > > > -- > > Stephane Martin Stephane.Martin at imag.fr > > http://icluster.imag.fr > > Tel: 04 76 61 20 31 > > Informatique et distribution Web: http://www-id.imag.fr > > ENSIMAG - Antenne de Montbonnot > > ZIRST - 51, avenue Jean Kuntzmann > > 38330 MONTBONNOT SAINT MARTIN > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > I'm going to re re re re check it... thx a lot for your concern ! 
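The check Jim suggests is easy to script so it can be run on every node at once. A minimal sketch -- it assumes lspci is installed, and the proc entry below only exists when Intel's e1000 driver is loaded and the interface really is eth0:

  # which_nic.py -- what gigabit silicon is fitted, and on what bus?
  import os

  # 1. What the PCI bus reports (82540EM vs 82545EM shows up here).
  for line in os.popen("lspci").read().splitlines():
      if "Ethernet" in line:
          print(line)

  # 2. What bus speed the e1000 driver detected (33/66 MHz PCI, or PCI-X).
  proc = "/proc/net/PRO_LAN_Adapters/eth0/PCI_Bus_Speed"
  if os.path.exists(proc):
      print("PCI bus speed: " + open(proc).read().strip())
  else:
      print("no " + proc + " -- e1000 proc interface not available")

A 32-bit/33 MHz slot tops out well below full-duplex gigabit rates, so seeing the bus speed alongside the chip name usually settles the desktop-versus-server NIC question.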
-- Stephane Martin Stephane.Martin at imag.fr http://icluster.imag.fr Tel: 04 76 61 20 31 Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 MONTBONNOT SAINT MARTIN _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From alvin at Mail.Linux-Consulting.com Thu Jul 24 21:36:43 2003 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Thu, 24 Jul 2003 18:36:43 -0700 (PDT) Subject: Thermal Problems In-Reply-To: <3F202742.5010107@onelabs.com> Message-ID: hi ya any system where the cpu is next to the power supply is a doomed box if the airflow in the chassis is done right ... there should be minimal temp difference between the system running with covers and without covers cpu fans above the cpu heatsink is worthless in a 1U case .. throw it away ( unless there is a really good fan blade design to pull air and move air ( in 0.25" of space between the heatsink bottom and the cover just just ( above the fan blade lots of fun playing with air :-) blowers in the back of the power supply doesnt do anything - most power supply exhaust air out the back y its power cord and should NOT be blocked or have cross air flow from other fans like in an indented power supply ( inside the chassis ) c ya alvin On Thu, 24 Jul 2003, Bari Ari wrote: > Mitchel Kagawa wrote: > > >Here are a few pictures of the culprite. Any suggestions on how to fix it > >other than buying a whole new case would be appreciated > >http://neptune.navships.com/images/oscarnode-front.jpg > >http://neptune.navships.com/images/oscarnode-side.jpg > >http://neptune.navships.com/images/oscarnode-back.jpg > > > > > > > The fans tied to the cpu heat sinks may be too close to the top cover > for effective air flow/cooling. Measure the air temp at various places > inside the case when closed and the cpu's operating. Try to get an idea > of how much airflow is actually moving through the case vs just around > the inside of the case. > > Try placing tangential (cross flow) fans in the empty drive bays and up > against the front panel and opening up the rear of the case. > > http://www.airvac.se/products.htm > > The power supply has fans at its front and rear to move air through it. > The centrifugal blower in the rear corner may not be helping much to > draw air across the cpu's. The same principle applies to the enclosure. > Try to move more air through it vs just around the inside. The cooler > the components the lower the failure rate. > > Bari > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Matthew_Wygant at dell.com Thu Jul 24 22:08:02 2003 From: Matthew_Wygant at dell.com (Matthew_Wygant at dell.com) Date: Thu, 24 Jul 2003 21:08:02 -0500 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED Message-ID: <6CB36426C6B9D541A8B1D2022FEA7FC10800A7@ausx2kmpc108.aus.amer.dell.com> Desktop or server quality, I do not know, but the 1600sc does have the 82540 chip, dmseg should show that much. 
It is on a 33MHz bus and does rate as a 10/100/1000 nic. I was curious which driver you were using, e1000 or eepro1000? The latter has known slow transfer problems, but just as mentioned, hard-setting all network devices should yield the best performance. Hope that helps. 1600sc servers are not the best for clusters with their size and power consumption, but I would recommend the 650 or 1650s. -matt -----Original Message----- From: Stephane.Martin at imag.fr [mailto:Stephane.Martin at imag.fr] Sent: Thursday, July 24, 2003 4:52 PM To: Jim Phillips Cc: boewulf Subject: Re: Dell 1600SC + 82540EM poor performance..HELP NEEDED Jim Phillips a ?crit : > > Hi, > > The 82540EM is a low-cost 32-bit "desktop" NIC, so it's hard to get > full gigabit bandwidth, particularly if you're running at 33 MHz (look > at /proc/net/PRO_LAN_Adapters/eth0/PCI_Bus_Speed to find out). There > are no 82540EM-based PCI-X cards, AFAIK; are you sure it wasn't a > 64-bit 82545EM card? Intel distinguishes their 32-bit 33/66 MHz PCI > PRO/1000 MT Desktop cards that use 82540EM from their 64-bit PCI-X > PRO/1000 MT Server cards that use the 82545EM (and have full gigabit > performance). > > -Jim > > On Thu, 24 Jul 2003 Stephane.Martin at imag.fr wrote: > > > > > Hello, > > > > > > > > We have recently received 48 Bi-xeon Dell 1600SC and we are > > > > performing some benchmarks to tests the cluster. Unfortunately > > > > we have very bad perfomance with the internal gigabit card > > > > (82540EM chipset). We have passed linux netperf test and we have > > > > only 33 Mo > > > > > > > > between 2 machines. We have changed the drivers for the last > > > > ones, installed procfgd and so on... Finally we had Win2000 > > > > installed and the last driver > > > > > > > > from intel installed : the results are identical... To go > > > > further we have installed a PCI-X 82540EM card and re-run the > > > > tests : in that way the > > > > > > > > results are much better : 66 Mo full duplex... > > > > So the question is : is there a well known problem with this > > > > DELL 1600SC concernig the 82540EM integration on the motherboard > > > > ???? > > > > > > > > As anyone already have (heard about) this problem ? > > > > Is there any solution ? > > > > > > > > thx for your help > > > > > > > > > > -- > > > Dr. Jeff Layton > > > Chart Monkey - Aerodynamics and CFD > > > Lockheed-Martin Aeronautical Company - Marietta > > > > Hello, > > > > For our tests we are connected to a 4108GL (J4865A), we have done > > all necessary checks (maybe we've have forget something very very > > big ????) to ensure the validity of our mesures. The ports have been > > tested with auto neg on, then off and also forced. We have also the > > same mesures when connected to a J4898A. The negociation between the > > NIcs ans the two switches is working. > > > > When using a tyan motherboard with the 82540EM built-in and using > > the same benchs and switches ans the same procedures (drivers > > updates and compilations from Intel, various benchs, different OS) > > the results are correct (80 to 90Mo). > > > > All our tests tends to show that dell missed something in the > > integration of the 82540EM in the 1600SC series...if not we'll > > really really appreciate to know what we are missing there cause > > here we have a 150 000 dollars cluster said to be connected with a > > network gigabit having network perfs of three 100 card bonded (in > > full duplex it's even worse !!!!!). If the problem is not rapidly > > solved the 48 machines will be returned.... 
> > > > thx a lot for your concern, > > > > regards > > > > > > -- > > Stephane Martin Stephane.Martin at imag.fr > > http://icluster.imag.fr > > Tel: 04 76 61 20 31 > > Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - > > Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann > > 38330 MONTBONNOT SAINT MARTIN > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > I'm going to re re re re check it... thx a lot for your concern ! -- Stephane Martin Stephane.Martin at imag.fr http://icluster.imag.fr Tel: 04 76 61 20 31 Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 MONTBONNOT SAINT MARTIN _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jeff.cheung at nixdsl.com Fri Jul 25 04:02:24 2003 From: jeff.cheung at nixdsl.com (Jeff Cheung) Date: Fri, 25 Jul 2003 16:02:24 +0800 Subject: Xoen Prefermence Message-ID: Hello Does anyone know where can I find the Linpack and NASA Parallel Benchmarks on a dual P4 Xeon 2.8GHz 533FSB with 2GB RAM Jeff Cheung _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Stephane.Martin at imag.fr Fri Jul 25 05:22:41 2003 From: Stephane.Martin at imag.fr (Stephane.Martin at imag.fr) Date: Fri, 25 Jul 2003 11:22:41 +0200 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED References: <6CB36426C6B9D541A8B1D2022FEA7FC10800A7@ausx2kmpc108.aus.amer.dell.com> Message-ID: <3F20F6E1.346DF1CD@imag.fr> Matthew_Wygant at Dell.com a ?crit : > > Desktop or server quality, I do not know, but the 1600sc does have the 82540 > chip, dmseg should show that much. It is on a 33MHz bus and does rate as a > 10/100/1000 nic. I was curious which driver you were using, e1000 or > eepro1000? The latter has known slow transfer problems, but just as > mentioned, hard-setting all network devices should yield the best > performance. Hope that helps. 1600sc servers are not the best for clusters > with their size and power consumption, but I would recommend the 650 or > 1650s. > > -matt > > -----Original Message----- > From: Stephane.Martin at imag.fr [mailto:Stephane.Martin at imag.fr] > Sent: Thursday, July 24, 2003 4:52 PM > To: Jim Phillips > Cc: boewulf > Subject: Re: Dell 1600SC + 82540EM poor performance..HELP NEEDED > > Jim Phillips a ?crit : > > > > Hi, > > > > The 82540EM is a low-cost 32-bit "desktop" NIC, so it's hard to get > > full gigabit bandwidth, particularly if you're running at 33 MHz (look > > at /proc/net/PRO_LAN_Adapters/eth0/PCI_Bus_Speed to find out). There > > are no 82540EM-based PCI-X cards, AFAIK; are you sure it wasn't a > > 64-bit 82545EM card? Intel distinguishes their 32-bit 33/66 MHz PCI > > PRO/1000 MT Desktop cards that use 82540EM from their 64-bit PCI-X > > PRO/1000 MT Server cards that use the 82545EM (and have full gigabit > > performance). 
> > > > -Jim > > > > On Thu, 24 Jul 2003 Stephane.Martin at imag.fr wrote: > > > > > > > Hello, > > > > > > > > > > We have recently received 48 Bi-xeon Dell 1600SC and we are > > > > > performing some benchmarks to tests the cluster. Unfortunately > > > > > we have very bad perfomance with the internal gigabit card > > > > > (82540EM chipset). We have passed linux netperf test and we have > > > > > only 33 Mo > > > > > > > > > > between 2 machines. We have changed the drivers for the last > > > > > ones, installed procfgd and so on... Finally we had Win2000 > > > > > installed and the last driver > > > > > > > > > > from intel installed : the results are identical... To go > > > > > further we have installed a PCI-X 82540EM card and re-run the > > > > > tests : in that way the > > > > > > > > > > results are much better : 66 Mo full duplex... > > > > > So the question is : is there a well known problem with this > > > > > DELL 1600SC concernig the 82540EM integration on the motherboard > > > > > ???? > > > > > > > > > > As anyone already have (heard about) this problem ? > > > > > Is there any solution ? > > > > > > > > > > thx for your help > > > > > > > > > > > > > -- > > > > Dr. Jeff Layton > > > > Chart Monkey - Aerodynamics and CFD > > > > Lockheed-Martin Aeronautical Company - Marietta > > > > > > Hello, > > > > > > For our tests we are connected to a 4108GL (J4865A), we have done > > > all necessary checks (maybe we've have forget something very very > > > big ????) to ensure the validity of our mesures. The ports have been > > > tested with auto neg on, then off and also forced. We have also the > > > same mesures when connected to a J4898A. The negociation between the > > > NIcs ans the two switches is working. > > > > > > When using a tyan motherboard with the 82540EM built-in and using > > > the same benchs and switches ans the same procedures (drivers > > > updates and compilations from Intel, various benchs, different OS) > > > the results are correct (80 to 90Mo). > > > > > > All our tests tends to show that dell missed something in the > > > integration of the 82540EM in the 1600SC series...if not we'll > > > really really appreciate to know what we are missing there cause > > > here we have a 150 000 dollars cluster said to be connected with a > > > network gigabit having network perfs of three 100 card bonded (in > > > full duplex it's even worse !!!!!). If the problem is not rapidly > > > solved the 48 machines will be returned.... > > > > > > thx a lot for your concern, > > > > > > regards > > > > > > > > > -- > > > Stephane Martin Stephane.Martin at imag.fr > > > http://icluster.imag.fr > > > Tel: 04 76 61 20 31 > > > Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - > > > Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann > > > 38330 MONTBONNOT SAINT MARTIN > > > _______________________________________________ > > > Beowulf mailing list, Beowulf at beowulf.org > > > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > > > > I'm going to re re re re check it... > > thx a lot for your concern ! 
> > -- > Stephane Martin Stephane.Martin at imag.fr > http://icluster.imag.fr > Tel: 04 76 61 20 31 > Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne > de Montbonnot > ZIRST - 51, avenue Jean Kuntzmann > 38330 MONTBONNOT SAINT MARTIN > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf hello, The driver used is the e1000 one; last src from intel... We are on the way of a commercial issue to get "not on board" good gb NICs at low low cost... Which one is the best ? (broadcom ? intel ? other ?) I've check (by myself this time ;) the ID of the PCI card added : YOU ARE RIGHT it's 82545EM : our fault !!! good news ! BUT, I've also re checked the number on the tyan motherboard and this this time it's really a 82540EM ! bad news ! So the pb is still there : why on a tyan mb we get twice the perfs in comparaison with a dell mb ? (same os install, same bench, same network) BTW we are going to get a card on the 64 bit PCI-X bus as the onbaord is not suitable for high performance usage. thx all for your concerns. regards -- Stephane Martin Stephane.Martin at imag.fr http://icluster.imag.fr Tel: 04 76 61 20 31 Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 MONTBONNOT SAINT MARTIN _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Matthew_Wygant at dell.com Fri Jul 25 07:31:23 2003 From: Matthew_Wygant at dell.com (Matthew_Wygant at dell.com) Date: Fri, 25 Jul 2003 06:31:23 -0500 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED Message-ID: <6CB36426C6B9D541A8B1D2022FEA7FC1BD64DD@ausx2kmpc108.aus.amer.dell.com> I would stick to intel, I would not use a Broadcom at all... -----Original Message----- From: Stephane.Martin at imag.fr [mailto:Stephane.Martin at imag.fr] Sent: Friday, July 25, 2003 4:23 AM To: Matthew_Wygant at exchange.dell.com Cc: beowulf at beowulf.org Subject: Re: Dell 1600SC + 82540EM poor performance..HELP NEEDED Matthew_Wygant at Dell.com a ?crit : > > Desktop or server quality, I do not know, but the 1600sc does have the > 82540 chip, dmseg should show that much. It is on a 33MHz bus and > does rate as a 10/100/1000 nic. I was curious which driver you were > using, e1000 or eepro1000? The latter has known slow transfer > problems, but just as mentioned, hard-setting all network devices > should yield the best performance. Hope that helps. 1600sc servers > are not the best for clusters with their size and power consumption, > but I would recommend the 650 or 1650s. > > -matt > > -----Original Message----- > From: Stephane.Martin at imag.fr [mailto:Stephane.Martin at imag.fr] > Sent: Thursday, July 24, 2003 4:52 PM > To: Jim Phillips > Cc: boewulf > Subject: Re: Dell 1600SC + 82540EM poor performance..HELP NEEDED > > Jim Phillips a ?crit : > > > > Hi, > > > > The 82540EM is a low-cost 32-bit "desktop" NIC, so it's hard to get > > full gigabit bandwidth, particularly if you're running at 33 MHz > > (look at /proc/net/PRO_LAN_Adapters/eth0/PCI_Bus_Speed to find out). > > There are no 82540EM-based PCI-X cards, AFAIK; are you sure it > > wasn't a 64-bit 82545EM card? 
Intel distinguishes their 32-bit > > 33/66 MHz PCI PRO/1000 MT Desktop cards that use 82540EM from their > > 64-bit PCI-X PRO/1000 MT Server cards that use the 82545EM (and have > > full gigabit performance). > > > > -Jim > > > > On Thu, 24 Jul 2003 Stephane.Martin at imag.fr wrote: > > > > > > > Hello, > > > > > > > > > > We have recently received 48 Bi-xeon Dell 1600SC and we are > > > > > performing some benchmarks to tests the cluster. Unfortunately > > > > > we have very bad perfomance with the internal gigabit card > > > > > (82540EM chipset). We have passed linux netperf test and we > > > > > have only 33 Mo > > > > > > > > > > between 2 machines. We have changed the drivers for the last > > > > > ones, installed procfgd and so on... Finally we had Win2000 > > > > > installed and the last driver > > > > > > > > > > from intel installed : the results are identical... To go > > > > > further we have installed a PCI-X 82540EM card and re-run the > > > > > tests : in that way the > > > > > > > > > > results are much better : 66 Mo full duplex... > > > > > So the question is : is there a well known problem with this > > > > > DELL 1600SC concernig the 82540EM integration on the > > > > > motherboard ???? > > > > > > > > > > As anyone already have (heard about) this problem ? Is there > > > > > any solution ? > > > > > > > > > > thx for your help > > > > > > > > > > > > > -- > > > > Dr. Jeff Layton > > > > Chart Monkey - Aerodynamics and CFD > > > > Lockheed-Martin Aeronautical Company - Marietta > > > > > > Hello, > > > > > > For our tests we are connected to a 4108GL (J4865A), we have done > > > all necessary checks (maybe we've have forget something very very > > > big ????) to ensure the validity of our mesures. The ports have > > > been tested with auto neg on, then off and also forced. We have > > > also the same mesures when connected to a J4898A. The negociation > > > between the NIcs ans the two switches is working. > > > > > > When using a tyan motherboard with the 82540EM built-in and using > > > the same benchs and switches ans the same procedures (drivers > > > updates and compilations from Intel, various benchs, different OS) > > > the results are correct (80 to 90Mo). > > > > > > All our tests tends to show that dell missed something in the > > > integration of the 82540EM in the 1600SC series...if not we'll > > > really really appreciate to know what we are missing there cause > > > here we have a 150 000 dollars cluster said to be connected with a > > > network gigabit having network perfs of three 100 card bonded (in > > > full duplex it's even worse !!!!!). If the problem is not rapidly > > > solved the 48 machines will be returned.... > > > > > > thx a lot for your concern, > > > > > > regards > > > > > > > > > -- > > > Stephane Martin Stephane.Martin at imag.fr > > > http://icluster.imag.fr > > > Tel: 04 76 61 20 31 > > > Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - > > > Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 > > > MONTBONNOT SAINT MARTIN > > > _______________________________________________ > > > Beowulf mailing list, Beowulf at beowulf.org > > > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > > > > I'm going to re re re re check it... > > thx a lot for your concern ! 
> > -- > Stephane Martin Stephane.Martin at imag.fr > http://icluster.imag.fr > Tel: 04 76 61 20 31 > Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - > Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann > 38330 MONTBONNOT SAINT MARTIN > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf hello, The driver used is the e1000 one; last src from intel... We are on the way of a commercial issue to get "not on board" good gb NICs at low low cost... Which one is the best ? (broadcom ? intel ? other ?) I've check (by myself this time ;) the ID of the PCI card added : YOU ARE RIGHT it's 82545EM : our fault !!! good news ! BUT, I've also re checked the number on the tyan motherboard and this this time it's really a 82540EM ! bad news ! So the pb is still there : why on a tyan mb we get twice the perfs in comparaison with a dell mb ? (same os install, same bench, same network) BTW we are going to get a card on the 64 bit PCI-X bus as the onbaord is not suitable for high performance usage. thx all for your concerns. regards -- Stephane Martin Stephane.Martin at imag.fr http://icluster.imag.fr Tel: 04 76 61 20 31 Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 MONTBONNOT SAINT MARTIN _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Stephane.Martin at imag.fr Fri Jul 25 08:50:06 2003 From: Stephane.Martin at imag.fr (Stephane.Martin at imag.fr) Date: Fri, 25 Jul 2003 14:50:06 +0200 Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED References: <6CB36426C6B9D541A8B1D2022FEA7FC1BD64DD@ausx2kmpc108.aus.amer.dell.com> Message-ID: <3F21277E.D0932B89@imag.fr> Matthew_Wygant at Dell.com a ?crit : > > I would stick to intel, I would not use a Broadcom at all... > > -----Original Message----- > From: Stephane.Martin at imag.fr [mailto:Stephane.Martin at imag.fr] > Sent: Friday, July 25, 2003 4:23 AM > To: Matthew_Wygant at exchange.dell.com > Cc: beowulf at beowulf.org > Subject: Re: Dell 1600SC + 82540EM poor performance..HELP NEEDED > > Matthew_Wygant at Dell.com a ?crit : > > > > Desktop or server quality, I do not know, but the 1600sc does have the > > 82540 chip, dmseg should show that much. It is on a 33MHz bus and > > does rate as a 10/100/1000 nic. I was curious which driver you were > > using, e1000 or eepro1000? The latter has known slow transfer > > problems, but just as mentioned, hard-setting all network devices > > should yield the best performance. Hope that helps. 1600sc servers > > are not the best for clusters with their size and power consumption, > > but I would recommend the 650 or 1650s. > > > > -matt > > > > -----Original Message----- > > From: Stephane.Martin at imag.fr [mailto:Stephane.Martin at imag.fr] > > Sent: Thursday, July 24, 2003 4:52 PM > > To: Jim Phillips > > Cc: boewulf > > Subject: Re: Dell 1600SC + 82540EM poor performance..HELP NEEDED > > > > Jim Phillips a ?crit : > > > > > > Hi, > > > > > > The 82540EM is a low-cost 32-bit "desktop" NIC, so it's hard to get > > > full gigabit bandwidth, particularly if you're running at 33 MHz > > > (look at /proc/net/PRO_LAN_Adapters/eth0/PCI_Bus_Speed to find out). 
> > > There are no 82540EM-based PCI-X cards, AFAIK; are you sure it > > > wasn't a 64-bit 82545EM card? Intel distinguishes their 32-bit > > > 33/66 MHz PCI PRO/1000 MT Desktop cards that use 82540EM from their > > > 64-bit PCI-X PRO/1000 MT Server cards that use the 82545EM (and have > > > full gigabit performance). > > > > > > -Jim > > > > > > On Thu, 24 Jul 2003 Stephane.Martin at imag.fr wrote: > > > > > > > > > Hello, > > > > > > > > > > > > We have recently received 48 Bi-xeon Dell 1600SC and we are > > > > > > performing some benchmarks to tests the cluster. Unfortunately > > > > > > we have very bad perfomance with the internal gigabit card > > > > > > (82540EM chipset). We have passed linux netperf test and we > > > > > > have only 33 Mo > > > > > > > > > > > > between 2 machines. We have changed the drivers for the last > > > > > > ones, installed procfgd and so on... Finally we had Win2000 > > > > > > installed and the last driver > > > > > > > > > > > > from intel installed : the results are identical... To go > > > > > > further we have installed a PCI-X 82540EM card and re-run the > > > > > > tests : in that way the > > > > > > > > > > > > results are much better : 66 Mo full duplex... > > > > > > So the question is : is there a well known problem with this > > > > > > DELL 1600SC concernig the 82540EM integration on the > > > > > > motherboard ???? > > > > > > > > > > > > As anyone already have (heard about) this problem ? Is there > > > > > > any solution ? > > > > > > > > > > > > thx for your help > > > > > > > > > > > > > > > > -- > > > > > Dr. Jeff Layton > > > > > Chart Monkey - Aerodynamics and CFD > > > > > Lockheed-Martin Aeronautical Company - Marietta > > > > > > > > Hello, > > > > > > > > For our tests we are connected to a 4108GL (J4865A), we have done > > > > all necessary checks (maybe we've have forget something very very > > > > big ????) to ensure the validity of our mesures. The ports have > > > > been tested with auto neg on, then off and also forced. We have > > > > also the same mesures when connected to a J4898A. The negociation > > > > between the NIcs ans the two switches is working. > > > > > > > > When using a tyan motherboard with the 82540EM built-in and using > > > > the same benchs and switches ans the same procedures (drivers > > > > updates and compilations from Intel, various benchs, different OS) > > > > the results are correct (80 to 90Mo). > > > > > > > > All our tests tends to show that dell missed something in the > > > > integration of the 82540EM in the 1600SC series...if not we'll > > > > really really appreciate to know what we are missing there cause > > > > here we have a 150 000 dollars cluster said to be connected with a > > > > network gigabit having network perfs of three 100 card bonded (in > > > > full duplex it's even worse !!!!!). If the problem is not rapidly > > > > solved the 48 machines will be returned.... 
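As a back-of-the-envelope check (rough burst peaks only, not sustained figures, and the 32-bit bus is shared with other devices on these boards), the bus numbers alone show why an on-board 82540EM on 33 MHz PCI cannot be expected to reach gigabit wire speed, while 64-bit PCI-X has headroom to spare:

# Theoretical burst bandwidth of the bus the NIC sits on, in MB/s.
def pci_peak_mb_s(width_bits, clock_mhz):
    return width_bits / 8 * clock_mhz

print("32-bit / 33 MHz PCI   :", pci_peak_mb_s(32, 33), "MB/s")    # 132.0
print("32-bit / 66 MHz PCI   :", pci_peak_mb_s(32, 66), "MB/s")    # 264.0
print("64-bit / 133 MHz PCI-X:", pci_peak_mb_s(64, 133), "MB/s")   # 1064.0
print("GigE wire speed       :", 1000 / 8, "MB/s per direction")   # 125.0

Against those ceilings, the 33, 66 and 80-90 Mo/s figures quoted in this thread are at least the right order of magnitude for the buses involved.
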
> > > > > > > > thx a lot for your concern, > > > > > > > > regards > > > > > > > > > > > > -- > > > > Stephane Martin Stephane.Martin at imag.fr > > > > http://icluster.imag.fr > > > > Tel: 04 76 61 20 31 > > > > Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - > > > > Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 > > > > MONTBONNOT SAINT MARTIN > > > > _______________________________________________ > > > > Beowulf mailing list, Beowulf at beowulf.org > > > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > > > > > > > I'm going to re re re re check it... > > > > thx a lot for your concern ! > > > > -- > > Stephane Martin Stephane.Martin at imag.fr > > http://icluster.imag.fr > > Tel: 04 76 61 20 31 > > Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - > > Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann > > 38330 MONTBONNOT SAINT MARTIN > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > hello, > > The driver used is the e1000 one; last src from intel... > We are on the way of a commercial issue to get "not on board" good gb NICs > at low low cost... Which one is the best ? (broadcom ? intel ? other ?) I've > check (by myself this time ;) the ID of the PCI card added : YOU ARE RIGHT > it's 82545EM : our fault !!! good news ! BUT, I've also re checked the > number on the tyan motherboard and this this time it's really a 82540EM ! > bad news ! So the pb is still there : why on a tyan mb we get twice the > perfs in comparaison with a dell mb ? (same os install, same bench, same > network) BTW we are going to get a card on the 64 bit PCI-X bus as the > onbaord is not suitable for high performance usage. > > thx all for your concerns. > > regards > > -- > Stephane Martin Stephane.Martin at imag.fr > http://icluster.imag.fr > Tel: 04 76 61 20 31 > Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne > de Montbonnot > ZIRST - 51, avenue Jean Kuntzmann > 38330 MONTBONNOT SAINT MARTIN As someone tested those two cards ????... those papers are not helping much ;) http://www.veritest.com/clients/reports/intel/intel_pro1000_mt_desktop_adapter.pdf http://www.etestinglabs.com/clients/reports/broadcom/broadcom_5703.pdf thx for your help regards, -- Stephane Martin Stephane.Martin at imag.fr http://icluster.imag.fr Tel: 04 76 61 20 31 Informatique et distribution Web: http://www-id.imag.fr ENSIMAG - Antenne de Montbonnot ZIRST - 51, avenue Jean Kuntzmann 38330 MONTBONNOT SAINT MARTIN _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bogdan.costescu at iwr.uni-heidelberg.de Fri Jul 25 10:13:12 2003 From: bogdan.costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Fri, 25 Jul 2003 16:13:12 +0200 (CEST) Subject: cold rooms & machines In-Reply-To: <20030723235814.GA11248@velocet.ca> Message-ID: On Wed, 23 Jul 2003, Ken Chase wrote: > _EXCEPT_ a cold room to store chemicals and conduct experiments at 5C > (its largely unused If by this you mean that computers and chemicals will share the room, I'd advise against it. Especially if the chemicals include some acids or volatile substances... 
Giving that on my university diploma it's written "biochemist" I think that I know what I'm talking about :-) Even with non-dangerous substances, if some of them are obtained commercially they might cost an arm and a leg and even something extra, so the owners should know what can happen if the cooling installation fails for some reason... -- Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From joelja at darkwing.uoregon.edu Fri Jul 25 10:25:39 2003 From: joelja at darkwing.uoregon.edu (Joel Jaeggli) Date: Fri, 25 Jul 2003 07:25:39 -0700 (PDT) Subject: Thermal Problems In-Reply-To: Message-ID: larger passive heatsinks... low-profile dimm modules in the angled dimm sockets... The fact that the power-supply is essentially exhausting into case despite the blower is worrysome... joelja On Thu, 24 Jul 2003, Alvin Oga wrote: > > hi ya > > any system where the cpu is next to the power supply is a doomed box > > if the airflow in the chassis is done right ... there should be > minimal temp difference between the system running with covers > and without covers > > cpu fans above the cpu heatsink is worthless in a 1U case .. throw it away > ( unless there is a really good fan blade design to pull air and move air > ( in 0.25" of space between the heatsink bottom and the cover just just > ( above the fan blade > > lots of fun playing with air :-) > > blowers in the back of the power supply doesnt do anything > - most power supply exhaust air out the back y its power cord > and should NOT be blocked or have cross air flow from other fans > like in an indented power supply ( inside the chassis ) > > c ya > alvin > > On Thu, 24 Jul 2003, Bari Ari wrote: > > > Mitchel Kagawa wrote: > > > > >Here are a few pictures of the culprite. Any suggestions on how to fix it > > >other than buying a whole new case would be appreciated > > >http://neptune.navships.com/images/oscarnode-front.jpg > > >http://neptune.navships.com/images/oscarnode-side.jpg > > >http://neptune.navships.com/images/oscarnode-back.jpg > > > > > > > > > > > The fans tied to the cpu heat sinks may be too close to the top cover > > for effective air flow/cooling. Measure the air temp at various places > > inside the case when closed and the cpu's operating. Try to get an idea > > of how much airflow is actually moving through the case vs just around > > the inside of the case. > > > > Try placing tangential (cross flow) fans in the empty drive bays and up > > against the front panel and opening up the rear of the case. > > > > http://www.airvac.se/products.htm > > > > The power supply has fans at its front and rear to move air through it. > > The centrifugal blower in the rear corner may not be helping much to > > draw air across the cpu's. The same principle applies to the enclosure. > > Try to move more air through it vs just around the inside. The cooler > > the components the lower the failure rate. 
> > > > Bari > > > > > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- -------------------------------------------------------------------------- Joel Jaeggli Academic User Services joelja at darkwing.uoregon.edu -- PGP Key Fingerprint: 1DE9 8FCA 51FB 4195 B42A 9C32 A30D 121E -- In Dr. Johnson's famous dictionary patriotism is defined as the last resort of the scoundrel. With all due respect to an enlightened but inferior lexicographer I beg to submit that it is the first. -- Ambrose Bierce, "The Devil's Dictionary" _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jim at ks.uiuc.edu Fri Jul 25 10:47:40 2003 From: jim at ks.uiuc.edu (Jim Phillips) Date: Fri, 25 Jul 2003 09:47:40 -0500 (CDT) Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED In-Reply-To: <3F20F6E1.346DF1CD@imag.fr> Message-ID: Hi again, If the Dell has an 82540 on 33 MHz but the Tyan has it on 66 MHz, I would expect the Tyan to have twice the performance, but still less than that of a 64-bit 82545 at 66 MHz (or 133 MHz on PCI-X). -Jim On Fri, 25 Jul 2003 Stephane.Martin at imag.fr wrote: > Matthew_Wygant at Dell.com a ?crit : > > > > Desktop or server quality, I do not know, but the 1600sc does have the 82540 > > chip, dmseg should show that much. It is on a 33MHz bus and does rate as a > > 10/100/1000 nic. I was curious which driver you were using, e1000 or > > eepro1000? The latter has known slow transfer problems, but just as > > mentioned, hard-setting all network devices should yield the best > > performance. Hope that helps. 1600sc servers are not the best for clusters > > with their size and power consumption, but I would recommend the 650 or > > 1650s. > > > > -matt > > > > -----Original Message----- > > From: Stephane.Martin at imag.fr [mailto:Stephane.Martin at imag.fr] > > Sent: Thursday, July 24, 2003 4:52 PM > > To: Jim Phillips > > Cc: boewulf > > Subject: Re: Dell 1600SC + 82540EM poor performance..HELP NEEDED > > > > Jim Phillips a ?crit : > > > > > > Hi, > > > > > > The 82540EM is a low-cost 32-bit "desktop" NIC, so it's hard to get > > > full gigabit bandwidth, particularly if you're running at 33 MHz (look > > > at /proc/net/PRO_LAN_Adapters/eth0/PCI_Bus_Speed to find out). There > > > are no 82540EM-based PCI-X cards, AFAIK; are you sure it wasn't a > > > 64-bit 82545EM card? Intel distinguishes their 32-bit 33/66 MHz PCI > > > PRO/1000 MT Desktop cards that use 82540EM from their 64-bit PCI-X > > > PRO/1000 MT Server cards that use the 82545EM (and have full gigabit > > > performance). > > > > > > -Jim > > > > > The driver used is the e1000 one; last src from intel... > We are on the way of a commercial issue to get "not on board" good gb NICs at low low cost... > Which one is the best ? (broadcom ? intel ? other ?) > I've check (by myself this time ;) the ID of the PCI card added : YOU ARE RIGHT it's 82545EM : our fault !!! good news ! > BUT, I've also re checked the number on the tyan motherboard and this this time it's really a 82540EM ! bad news ! 
> So the problem is still there: why do we get twice the performance on a Tyan motherboard compared with a Dell motherboard? (same OS install, same bench, same network)
> BTW we are going to get a card on the 64-bit PCI-X bus, as the onboard one is not suitable for high-performance use.
>
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From Matthew_Wygant at dell.com Fri Jul 25 10:52:36 2003
From: Matthew_Wygant at dell.com (Matthew_Wygant at dell.com)
Date: Fri, 25 Jul 2003 09:52:36 -0500
Subject: Dell 1600SC + 82540EM poor performance..HELP NEEDED
Message-ID: <6CB36426C6B9D541A8B1D2022FEA7FC1BD64DE@ausx2kmpc108.aus.amer.dell.com>

A good place to go for these Dell-related things is the linux-poweredge at dell.com list... Thanks.

-----Original Message-----
From: Jim Phillips [mailto:jim at ks.uiuc.edu]
Sent: Friday, July 25, 2003 9:48 AM
To: Stephane.Martin at imag.fr
Cc: Matthew_Wygant at exchange.dell.com; beowulf at beowulf.org
Subject: Re: Dell 1600SC + 82540EM poor performance..HELP NEEDED

This message uses a character set that is not supported by the Internet Service. To view the original message content, open the attached message. If the text doesn't display correctly, save the attachment to disk, and then open it using a viewer that can display the original character set.
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From mathog at mendel.bio.caltech.edu Fri Jul 25 13:29:54 2003
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Fri, 25 Jul 2003 10:29:54 -0700
Subject: Top node hotter than others?
Message-ID:

We have a 20 x 2U rack and I've noticed that the top node is always a step hotter than the other nodes.

Why?

There is a slight gradient going up the rack (see below, 01 is on the bottom, 20 on the top) but it doesn't explain the jump at the top node. At first I thought it might be due to hot air moving from the back of the rack, over the top of the highest node, and being sucked in by it. However, no temperature change resulted when all side vents were blocked and cardboard was pasted up the front of the rack so that only the same cold air as the other nodes could enter. The only other difference between this node and the others is that there's hot air above node 20 (two empty rack slots), whereas every other node has another node directly above it. So maybe all that hot air heats the top node's case and couples the heat in? I don't have an insulating panel handy to test that hypothesis.

node  case   cpu
01   +34°C  +43°C
02   +35°C  +44°C
03   +37°C  +48°C
04   +42°C  +50°C
05   +38°C  +48°C
06   +37°C  +50°C
07   +36°C  +45°C
08   +38°C  +48°C
09   +38°C  +48°C
10   +38°C  +48°C
11   +36°C  +44°C
12   +38°C  +48°C
13   +38°C  +48°C
14   +40°C  +49°C
15   +38°C  +46°C
16   +36°C  +46°C
17   +39°C  +51°C
18   +39°C  +48°C
19   +39°C  +49°C
20   +44°C  +54°C

Temperatures were measured using "sensors" on these Tyan S2466 motherboards (1 CPU on each currently). The case value is the temperature read by the diode under the socket of the absent 2nd CPU. The temperatures jump around by a degree or two.
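To watch that gradient over time rather than spot-checking, something like the following can pull the same readings from every node. A sketch only: it assumes ssh access, lm_sensors on each node, and hypothetical node01..node20 hostnames, and the regular expression will need adjusting to whatever "sensors" actually prints on these boards:

import re, subprocess

NODES = ["node%02d" % n for n in range(1, 21)]   # hypothetical hostnames

def node_temps(node):
    out = subprocess.run(["ssh", node, "sensors"],
                         capture_output=True, text=True).stdout
    # Grab readings that look like "+43.0 C" / "+43 C", with an optional
    # degree sign before the C.
    return [float(t) for t in re.findall(r"\+(\d+(?:\.\d+)?)\s*.?C\b", out)]

for node in NODES:
    temps = node_temps(node)
    if temps:
        print("%s  hottest reading: %.0f C" % (node, max(temps)))
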
Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From john152 at libero.it Fri Jul 25 13:17:20 2003 From: john152 at libero.it (john152 at libero.it) Date: Fri, 25 Jul 2003 19:17:20 +0200 Subject: Problems with 3Com card... Message-ID: Hi all, i'd like to use a 3Com905-TX instead of Realtek RTL-8139 i used before, but i have problems with mii-diag software in detecting the link status. With Realtek card all was Ok, infact i had: at start (cable connected): 18:54:36.592 Baseline value of MII BMSR (basic mode status register) is 782d. disconnecting the link: 18:55:01.632 MII BMSR now 7809: no link, NWay busy, No Jabber (0000). 18:55:01.637 Baseline value of MII BMSR basic mode status register) is 7809. connecting the link: 18:55:06.722 MII BMSR now 782d: Good link, NWay done, No Jabber (45e1). 18:55:06.728 Baseline value of MII BMSR (basic mode status register) is 782d. . . Now i have the following output lines with 3Com: at start (cable connected): 18:42:46.073 Baseline value of MII BMSR (basic mode status register) is 782d. disconnecting the link: 18:42:50.779 MII BMSR now 7829: no link, NWay done, No Jabber (0000). 18:49:38.524 Baseline value of MII BMSR (basic mode status register) is 7809. connecting the link: 18:52:15.887 MII BMSR now 7829: no link, NWay done, No Jabber (41e1). 18:52:15.895 Baseline value of MII BMSR (basic mode status register) is 782d. . . With 3Com, the Baseline value of MII BMSR is 782d with Link Good and 7809 with no Link (and it seems like the Realtek). When the function 'monitor_mii' starts, in the baseline_1 variable i see a correct value, instead in the following loop while (continue_monitor)..., there is new_1 variable that is always wrong: 7829. (Correctly the loop ends, but i have the output "no link" wrong!) new_1 is the return value of mdio_read(ioaddr, phy_id, 1) and should be the same values of baseline_1 (782d or 7809), shouldn' t it? Can you help me? Thanks in advance for your kind answers. Giovanni di Giacomo _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hunting at ix.netcom.com Fri Jul 25 14:03:17 2003 From: hunting at ix.netcom.com (Michael Huntingdon) Date: Fri, 25 Jul 2003 11:03:17 -0700 Subject: Top node hotter than others? In-Reply-To: Message-ID: <3.0.3.32.20030725110317.013f6b60@popd.ix.netcom.com> David Do the systems have anything similar to Insight Manager to indicate the rate of your fans? In a rack where space is tight and systems are running hot, a slight variance in the movement of air can be significant. Do the cabinets have fans overhead to draw the warm air out? Less expensive cabinets are not necessarily engineered to ensure consistent airflow under demanding conditions, typical with clusters like this. Are all 20 nodes purely compute or do you have head nodes somewhere in the mix? As clusters become larger and more dense there is a great deal of research going on in various labs, to ensure stability of temperatures not just within cabinets, but across entire computer rooms. "Hot Spots" are a growing issue. 
Have you dealt with any of the major manufactures specific to this or any other concerns as your research clusters grow? My Best Michael At 10:29 AM 7/25/2003 -0700, David Mathog wrote: >We have a 20 x 2U rack and I've noticed that the >top node is always a step hotter than the other nodes. > >Why? > >There is a slight gradient going up the rack (see >below, 01 is on the bottom, 20 on the top) but it >doesn't explain the jump at the top node. At first >I thought it might be due to hot air moving from >the back of the rack, over the top of the highest >node, and being sucked in by it. >However no temperature change resulted when all >side vents were blocked and cardboard pasted up >the front of the rack so that only the same cold >air as the other nodes could enter. The only other >difference between this node and the others is >that there's hot air above 20 (two empty rack slots), >but another node above all the others. So maybe all >that hot air heats the top node's case and that >couples the heat in? I don't have an insulating >panel handy to test that hypothesis. > >node case cpu >01 +34?C +43?C >02 +35?C +44?C >03 +37?C +48?C >04 +42?C +50?C >05 +38?C +48?C >06 +37?C +50?C >07 +36?C +45?C >08 +38?C +48?C >09 +38?C +48?C >10 +38?C +48?C >11 +36?C +44?C >12 +38?C +48?C >13 +38?C +48?C >14 +40?C +49?C >15 +38?C +46?C >16 +36?C +46?C >17 +39?C +51?C >18 +39?C +48?C >19 +39?C +49?C >20 +44?C +54?C > >Temperatures were measured using "sensors" on these >tyan S2466 motherboards (1 CPU on each currently.) >The case value is the temperature reading by the >diode under the socket of the absent 2nd CPU. >The temperatures jump around a degree or two. > >Regards, > >David Mathog >mathog at caltech.edu >Manager, Sequence Analysis Facility, Biology Division, Caltech >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Fri Jul 25 14:15:40 2003 From: rgb at phy.duke.edu (Robert G. Brown) Date: Fri, 25 Jul 2003 14:15:40 -0400 (EDT) Subject: Top node hotter thanothers? In-Reply-To: Message-ID: On Fri, 25 Jul 2003, David Mathog wrote: > We have a 20 x 2U rack and I've noticed that the > top node is always a step hotter than the other nodes. > > Why? > > There is a slight gradient going up the rack (see > below, 01 is on the bottom, 20 on the top) but it > doesn't explain the jump at the top node. At first > I thought it might be due to hot air moving from > the back of the rack, over the top of the highest > node, and being sucked in by it. > However no temperature change resulted when all > side vents were blocked and cardboard pasted up > the front of the rack so that only the same cold > air as the other nodes could enter. The only other > difference between this node and the others is > that there's hot air above 20 (two empty rack slots), > but another node above all the others. So maybe all > that hot air heats the top node's case and that > couples the heat in? I don't have an insulating > panel handy to test that hypothesis. What happens if the top node is turned off? Does the second from the top become the hot node? What happens when the top node is swapped with the bottom node? 
It could just be that the top node's CPU cooler fan has a piece of lint stuck on it and is running hotter, or even that its sensor itsn't calibrated right. It could be some sort of loopback of heated air as you describe, but if you put a small fan and set it to blow across the top node you should break up the circulation pattern if any such pattern exists. I don't have as much faith in cardboard used to block vents, since that can also heat up the node by impeding circulation. rgb > > node case cpu > 01 +34?C +43?C > 02 +35?C +44?C > 03 +37?C +48?C > 04 +42?C +50?C > 05 +38?C +48?C > 06 +37?C +50?C > 07 +36?C +45?C > 08 +38?C +48?C > 09 +38?C +48?C > 10 +38?C +48?C > 11 +36?C +44?C > 12 +38?C +48?C > 13 +38?C +48?C > 14 +40?C +49?C > 15 +38?C +46?C > 16 +36?C +46?C > 17 +39?C +51?C > 18 +39?C +48?C > 19 +39?C +49?C > 20 +44?C +54?C > > Temperatures were measured using "sensors" on these > tyan S2466 motherboards (1 CPU on each currently.) > The case value is the temperature reading by the > diode under the socket of the absent 2nd CPU. > The temperatures jump around a degree or two. > > Regards, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mas at ucla.edu Fri Jul 25 14:37:19 2003 From: mas at ucla.edu (Michael Stein) Date: Fri, 25 Jul 2003 11:37:19 -0700 Subject: Top node hotter thanothers? In-Reply-To: ; from mathog@mendel.bio.caltech.edu on Fri, Jul 25, 2003 at 10:29:54AM -0700 References: Message-ID: <20030725113719.A5315@mas1.ats.ucla.edu> > node case cpu > 01 +34?C +43?C > 02 +35?C +44?C > 03 +37?C +48?C > 04 +42?C +50?C > 05 +38?C +48?C > 06 +37?C +50?C > 07 +36?C +45?C > 08 +38?C +48?C > 09 +38?C +48?C > 10 +38?C +48?C > 11 +36?C +44?C > 12 +38?C +48?C > 13 +38?C +48?C > 14 +40?C +49?C > 15 +38?C +46?C > 16 +36?C +46?C > 17 +39?C +51?C > 18 +39?C +48?C > 19 +39?C +49?C > 20 +44?C +54?C It's not clear to me that there is an actual difference going toward the top. 04 is +42? Assuming the input air temperature is reasonably uniform over the machines, I'd guess that you're seeing a combination of different sensor calibration and different heat dissipation (or different fan capabilities). Ignoring sensor error, the hotter machines must have either higher power input or less air flow (assuming similar input air temperature). There is a tolerance on CPU (and other chips) heat/power usage -- some are bound to run hotter than others. Or check what's running on each machine. This can make a huge difference. I've seen output air on one machine go from 81 F to 99 F (27 C to 37 C) from unloaded to full load (dual Xeon, 2.4 Ghz, multiple burnP6+burnMMX). This was with 72 F input air (22 C). 
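To put a rough number on that, here is a quick sketch using the case column from the table above (readings as posted, so all the calibration caveats still apply):

from statistics import mean, stdev

# Case temperatures for nodes 01..20, in degrees C, from the table above.
case = [34, 35, 37, 42, 38, 37, 36, 38, 38, 38,
        36, 38, 38, 40, 38, 36, 39, 39, 39, 44]

others, top = case[:-1], case[-1]
m, s = mean(others), stdev(others)
print("nodes 01-19: mean %.1f C, std dev %.1f C" % (m, s))
print("node 20 at %d C sits %.1f std devs above that mean" % (top, (top - m) / s))

On these numbers node 20 does stand a few standard deviations above the rest, but node 04 is also well above it, which fits the point about per-board variation and sensor calibration.
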
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mathog at mendel.bio.caltech.edu Fri Jul 25 14:45:21 2003 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Fri, 25 Jul 2003 11:45:21 -0700 Subject: Top node hotter than others? Message-ID: > Do the systems have anything similar to Insight Manager > to indicate the rate of your fans? "sensors" shows that. The CPU and two chassis fans in the various systems are within a few percent of each other. I can't read the power supply fans though. Example: node cpu Fan1 Fan2 19 4720 4425 4474 20 4720 4377 4377 > Do the cabinets have fans overhead to draw the warm air out? Not installed but there's a panel that comes off where one could be put in. When that panel is removed there's not much metal holding heat on the top of the system, but the top node only cooled off about 1 degree and no effect at all on the other nodes. There's a hole in the bottom of the case where cool air can go in. The front is currently completely open, and the back is open but it's about 8" from a wall. It's about 4 feet from the top of the top node to the acoustical tile, and there's a return vent only 4 feet away, off to one side. (Yes, I've thought about moving that return vent directly over the rack.) I think the hot air is rising, but not very fast, so that it lingers around the top of the rack no matter what. You are probably correct that a fan to pull it off faster would help. I'm beginning to think of the rack as a sort of poorly designed chimney - the kind that doesn't "pull" well and results in a smokey fireplace. > > Are all 20 nodes purely compute yes, the master node is across the room. > As clusters become larger and more dense there is a great deal of > research going on in various labs, to ensure stability of > temperatures not just within cabinets, but across entire > computer rooms. Racks should probably plug into chimneys - take all that heat and vent it straight out of the building. Heck of a lot cheaper than running A/C to cool it in place. We've got old fume hood ducts somewhere up above the acoustic ceiling that go straight to the roof, but the A/C guys didn't like my chimney idea much because apparently it would screw up airflow in the building. Plus a bit of negative pressure could suck the output from another lab's fume hood back into my area, which isn't an attractive prospect. > growing issue. Have you dealt with any of the majo > manufactures specific > to this or any other concerns as your research clusters grow? The cluster is big enough for now. Growth is pretty limited in any case by available power, A/C capacity, my tolerance for noise since I have to work in the same room, and of course, $$$. Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From enrico341 at hotmail.com Thu Jul 24 11:24:56 2003 From: enrico341 at hotmail.com (Eric Uren) Date: Thu, 24 Jul 2003 10:24:56 -0500 Subject: HELP! Message-ID: To whomever it may concern, I work at a company called AT systems. We recently aquired thirty SBC's. I was assigned to develop a way to link all of the boards together, and place them in a tower. 
We will then donate it to a local college, and use it as a tax write-off. The boards contain: P266 Mhz, 128 MB of RAM, 128 IDE, Compac Flash Drive, Ethernet and USB ports. I am stationed in the same building as our factory. We have a turret, so developing the tower, power supply, etc. is not a problem. My task is just to find out a way to use all these boards up. Any site, diagrams, or suggestions would be greatly appreciated. Thanks. Eric Uren AT Systems _________________________________________________________________ Add photos to your messages with MSN 8. Get 2 months FREE*. http://join.msn.com/?page=features/featuredemail _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From law at acm.org Fri Jul 25 16:44:02 2003 From: law at acm.org (lynn wilkins) Date: Fri, 25 Jul 2003 13:44:02 -0700 Subject: Hubs In-Reply-To: References: Message-ID: <0307251344020A.22708@maggie> Hi, Also, some switches use "store and forward" switching. Some don't. Is "store and forward" a "good thing" or should we avoid it? (Other things being equal, such as 100baseT, full duplex, etc.) -law On Thursday 24 July 2003 12:40, you wrote: > On Thu, 24 Jul 2003, Eric Uren wrote: > > To whomever it may concern, > > > > I am trying to link together 30 boards through Ethernet. What > > would be your recomendation for how many and what type of Hubs I should > > use to connect them all together. Any imput is appreciated. > > Any hint as to what you're going to be doing with the 30 boards? The > obvious choice is a cheap 48 port 10/100BT switch from any name-brand > vendor. However, there are circumstances where you'd want more > expensive switches, 1000BT switches, or a different network altogether. > > rgb _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From joelja at darkwing.uoregon.edu Fri Jul 25 17:58:15 2003 From: joelja at darkwing.uoregon.edu (Joel Jaeggli) Date: Fri, 25 Jul 2003 14:58:15 -0700 (PDT) Subject: Project Help In-Reply-To: Message-ID: varius ee or cs embeded computing projects would probably happily take them off your hands as is... joelja On Thu, 24 Jul 2003, Eric Uren wrote: > > > To whomever it may concern, > > I work at a company called AT systems. I was recently assigned > the task of using up thirty extra SBC's that we have. My boss told me that > he wants to link all of the SBC's together, and plop them in a tower, and > donate them to a college or university as a tax write-off. We have a factory > attached to our engineering department, which contains a turret, multiple > work stations, and so on. So getting a hold of a custom tower, power supply, > etc. is not a problem. I just need to create a way to use these thirty extra > board we have. All thirty of them contain: a P266 processor, 128 MB of RAM, > 128 IDE, Compac Flash Drive, and Ethernet and USB ports. Any diagrams, > sites, comments, or suggestions would be greatly appreciated. Thanks. 
> > Eric Uren > AT Systems > > _________________________________________________________________ > MSN 8 with e-mail virus protection service: 2 months FREE* > http://join.msn.com/?page=features/virus > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- -------------------------------------------------------------------------- Joel Jaeggli Academic User Services joelja at darkwing.uoregon.edu -- PGP Key Fingerprint: 1DE9 8FCA 51FB 4195 B42A 9C32 A30D 121E -- In Dr. Johnson's famous dictionary patriotism is defined as the last resort of the scoundrel. With all due respect to an enlightened but inferior lexicographer I beg to submit that it is the first. -- Ambrose Bierce, "The Devil's Dictionary" _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From seth at hogg.org Sat Jul 26 10:28:21 2003 From: seth at hogg.org (Simon Hogg) Date: Sat, 26 Jul 2003 15:28:21 +0100 Subject: UK only? Power Meters Message-ID: <4.3.2.7.2.20030726151139.00a86f00@pop.freeuk.net> Some of the list members may remember a recent discussion of the usefulness of power meters. I have just seen some for sale in Lidl[1] (of all places!) in the UK (with a UK 3-pin plug-through arrangement). They were UKP 6.99 (equivalent to about US$10) and had a little lcd display. Measurements performed were Current, Peak Current (poss. with High Current warning?), Power, Peak Power, total kWh and Power Factor. I have no details of performance, etc. (since I didn't buy one) but the price is certainly very attractive compared even the the much feted 'kill-a-watt'. If anyone wants one and can't find a Lidl you can contact me off-list, and I will get on my trusty bicycle down to the shops. -- Simon [1] www.lidl.com (www.lidl.de) German-based trans-European discount retailer. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From award at andorra.ad Sun Jul 27 04:03:14 2003 From: award at andorra.ad (Alan Ward) Date: Sun, 27 Jul 2003 10:03:14 +0200 Subject: Infiniband: cost-effective switchless configurations References: <200307251655.UAA08132@nocserv.free.net> Message-ID: <3F238742.1060408@andorra.ad> If I understand correctly, you need all-to-all connectivity? Do all the nodes need to access the whole data set, or only share part of the data set between a few nodes each time? I had a case where I wanted to share the whole data set between all nodes, using point-to-point Ethernet connections (no broadcast). I put them in a ring, so that with e.g. four nodes: A -----> B -----> C -----> D ^ | | | -------------------------- Node A sends its data, plus C's and D's to node B. Node B sends its data, plus D's and A's to node C. Node C sends its data, plus A's and B's to node D Node D sends its data, plus B's and C's to node A. Data that has done (N-1) hops is no longer forwarded. 
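In outline, the forwarding rule looks something like this (a Python sketch for illustration only, ignoring the socket and threading plumbing; the names are made up):

N = 4  # ring size in the A -> B -> C -> D example above

def on_receive(record, hops, local_store, outbox):
    # Every record carries the number of hops it has already made.
    local_store.append(record)          # keep a local copy for calculation
    if hops < N - 1:                    # some node downstream hasn't seen it yet
        outbox.append((record, hops + 1))
    # else: the record has done N-1 hops and is no longer forwarded

def publish(record, outbox):
    # A node's own data enters the ring counting as one hop when it
    # reaches the next neighbour.
    outbox.append((record, 1))
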
We used a single Java program with 3 threads on each node: - one to receive data and place it in a local array - one to forward finished data to the next node - one to perform calculations The main drawback is that you need a smart algorithm to determine which pieces of data are "new" and which are "used"; i.e. have been used for calculation and been forwarded to the next node, and can be chucked out to make space. Ours wasn't smart enough :-( Alan Ward En/na Mikhail Kuzminsky ha escrit: > It's possible to build 3-nodes switchless Infiniband-connected > cluster w/following topology (I assume one 2-ports Mellanox HCA card > per node): > > node2 -------IB------Central node-----IB-----node1 > ! ! > ! ! > ----------------------IB----------------------- > > It gives complete nodes connectivity and I assume to have > 3 separate subnets w/own subnet manager for each. But I think that > in the case if MPI broadcasting must use hardware multicasting, > MPI broadcast will not work from nodes 1,2 (is it right ?). > > OK. But may be it's possible also to build the following topology > (I assume 2 x 2-ports Mellanox HCAs per node, and it gives also > complete connectivity of nodes) ? : > > > node 2----IB-------- C e n t r a l n o d e -----IB------node1 > \ / \ / > \ / \ / > \ / \ / > \--node3 node4-- > > and I establish also additional IB links (2-1, 2-4, 3-1, 3-4, not > presenetd in the "picture") which gives me complete nodes connectivity. > Sorry, is it possible (I don't think about changes in device drivers)? > If yes, it's good way to build very small > and cost effective IB-based switchless clusters ! > > BTW, if I will use IPoIB service, is it possible to use netperf > and/or netpipe tools for measurements of TCP/IP performance ? > > Yours > Mikhail Kuzminsky > Zelinsky Institute of Organic Chemistry > Moscow > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jhearns at micromuse.com Mon Jul 28 18:04:53 2003 From: jhearns at micromuse.com (John Hearns) Date: 28 Jul 2003 23:04:53 +0100 Subject: UK power meters Message-ID: <1059429893.1415.5.camel@harwood> I bought two of the power meters from LIDL. The Clapham Junction branch has dozens. Seems to work fine! My mini-ITX system is running at 45 watts. -- John Hearns Micromuse _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From gary at lerhaupt.com Sat Jul 26 12:44:30 2003 From: gary at lerhaupt.com (Gary Lerhaupt) Date: 26 Jul 2003 11:44:30 -0500 Subject: Dell Linux mailing list Message-ID: <1059237870.6969.3.camel@localhost.localdomain> For ample amounts of help with your Dell / Linux equipment, please check out the Linux-Poweredge mailing list at http://lists.us.dell.com/mailman/listinfo/linux-poweredge. 
Gary _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jnellis at mtcrossroads.org Sun Jul 27 20:31:01 2003 From: jnellis at mtcrossroads.org (Joe Nellis) Date: Sun, 27 Jul 2003 17:31:01 -0700 Subject: Neighbor table overflow References: <200307251655.UAA08132@nocserv.free.net> <3F238742.1060408@andorra.ad> Message-ID: <001c01c3549f$93bbd680$8800a8c0@joe> Greetings, I am running scyld 27bz version. I recently started getting "neighbor table overflow" messages on the last boot stage on one of my nodes though nothing has changed. Can anyone explain this message. The node just hangs with this message repeating every 30 seconds or so. Sincerely, Joe. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From alvin at Mail.Linux-Consulting.com Mon Jul 28 18:33:59 2003 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Mon, 28 Jul 2003 15:33:59 -0700 (PDT) Subject: Dell Linux mailing list In-Reply-To: <1059237870.6969.3.camel@localhost.localdomain> Message-ID: hi ya i cant resist... On 26 Jul 2003, Gary Lerhaupt wrote: > For ample amounts of help with your Dell / Linux equipment, please check > out the Linux-Poweredge mailing list at > http://lists.us.dell.com/mailman/listinfo/linux-poweredge. if dell machines needs so much "help"... something else is wrong with the box ... and yes, i've been going around to fix/replace lots of broken dell boxes a good box works out of the crate ( outof the box ) and keeps working for years and years.. and keeps working even if you open the covers and fiddle with the insides c ya alvin _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From John.Hearns at micromuse.com Mon Jul 28 07:32:14 2003 From: John.Hearns at micromuse.com (John Hearns) Date: Mon, 28 Jul 2003 12:32:14 +0100 Subject: Power meters at LIDL Message-ID: <027901c354fb$e82d4030$8461cdc2@DREAD> Thanks to Simon Hogg. I have got some cheap cycling gear from LIDL, but I never thought of buying Beowulf bits from there! I have a couple nearby me, so if anyone else in the UK wants one I'll see if they are in stock and post one on if you provide name/address. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From angel at wolf.com Mon Jul 28 21:53:37 2003 From: angel at wolf.com (Angel Rivera) Date: Tue, 29 Jul 2003 01:53:37 GMT Subject: Dell Linux mailing list In-Reply-To: References: Message-ID: <20030729015337.23350.qmail@houston.wolf.com> Alvin Oga writes: > > a good box works out of the crate ( outof the box ) and keeps > working for years and years.. and keeps working even if you > open the covers and fiddle with the insides Sounds great on paper, but... When one buys hundreds of boxes at a whack, the major issue, besides the normal shipping ones, is going to be the firmware differences between the boxes which has a tendency to bite you that the most inopportune moment. Dell is no worse than some and a lot better than others. We drive a real production commercial cluster. 
I would NEVER open an in service production box. Messing up a production run results in serious money(and time)being lost. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From alvin at Mail.Linux-Consulting.com Mon Jul 28 22:03:57 2003 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Mon, 28 Jul 2003 19:03:57 -0700 (PDT) Subject: Dell Linux mailing list In-Reply-To: <20030729015337.23350.qmail@houston.wolf.com> Message-ID: hi ya On Tue, 29 Jul 2003, Angel Rivera wrote: > > a good box works out of the crate ( outof the box ) and keeps > > working for years and years.. and keeps working even if you > > open the covers and fiddle with the insides > > Sounds great on paper, but... yup... and that is precisely why i dont use gateway, compaq, dell ... ( i wont be putting important data on those boxes ) i qa/qc my own boxes for production use ... and yes, never touch a box in production .. never ever .. no matter what well within reason ...if the production boxes are dying... fix it asap and methodically and documented and tested and qa'd and qc'd and foo-blessed c ya alvin > When one buys hundreds of boxes at a whack, the major issue, besides the > normal shipping ones, is going to be the firmware differences between the > boxes which has a tendency to bite you that the most inopportune moment. > Dell is no worse than some and a lot better than others. > > We drive a real production commercial cluster. I would NEVER open an in > service production box. Messing up a production run results in serious > money(and time)being lost. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mwheeler at startext.co.uk Tue Jul 29 05:57:29 2003 From: mwheeler at startext.co.uk (Martin WHEELER) Date: Tue, 29 Jul 2003 09:57:29 +0000 (UTC) Subject: Neighbor table overflow In-Reply-To: <001c01c3549f$93bbd680$8800a8c0@joe> Message-ID: On Sun, 27 Jul 2003, Joe Nellis wrote: > I am running scyld 27bz version. I recently started getting "neighbor table > overflow" messages on the last boot stage on one of my nodes though nothing > has changed. Can anyone explain this message. The node just hangs with > this message repeating every 30 seconds or so. Ah. The dreaded 'neighbour table overflow' message. I was plagued with this a couple of years ago. It usually means that your system is unable to resolve some of its component machines. But which? (In my case, usually localhost.) Check very carefully the contents of: * /etc/hosts * /etc/resolv.conf * /etc/network/interfaces Also check that you can ping every machine on the network. (Particularly localhost.) Then make sure that you have *explicitly* given correct addresses, netmasks, and gateway address in /etc/network/interfaces for both ethernet and local loopback connections. (see man interfaces for examples) What does ifconfig tell you? (You should see details of both ethernet and local loopback connections -- if not, you've got a problem.) If necessary, do an ifconfig 127.0.0.1 netmask 255.0.0.0 up to try to kick local loopback into life. (If it does, add the address and netmask info lines to to the lo iface in your /etc/network/interfaces file.) 
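A quick way to run through those checks in one go is a small script like the one below (a sketch only; it assumes Linux ping and is best run on the node that hangs):

import socket, subprocess

# Walk /etc/hosts and check that every named host still resolves and
# answers a single ping; anything that fails here is a good place to
# start looking.
with open("/etc/hosts") as hosts:
    for line in hosts:
        line = line.split("#")[0].strip()     # drop comments and blank lines
        if not line:
            continue
        addr, *names = line.split()
        for name in names:
            try:
                resolved = socket.gethostbyname(name)
            except socket.gaierror:
                print("%-20s does NOT resolve" % name)
                continue
            ok = subprocess.run(["ping", "-c", "1", "-w", "2", name],
                                capture_output=True).returncode == 0
            print("%-20s -> %-15s ping %s" % (name, resolved,
                                              "ok" if ok else "FAILED"))

localhost should show up in that list and resolve to 127.0.0.1; if it doesn't, that is usually the place to start.
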
HTH -- Martin Wheeler - StarTEXT / AVALONIX - Glastonbury - BA6 9PH - England mwheeler at startext.co.uk http://www.startext.co.uk/mwheeler/ GPG pub key : 01269BEB 6CAD BFFB DB11 653E B1B7 C62B AC93 0ED8 0126 9BEB - Share your knowledge. It's a way of achieving immortality. - _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From alvin at Mail.Linux-Consulting.com Tue Jul 29 18:41:23 2003 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Tue, 29 Jul 2003 15:41:23 -0700 (PDT) Subject: Dell Linux mailing list - testing In-Reply-To: <20030729021918.25594.qmail@houston.wolf.com> Message-ID: hi ya angel lets good ... i think i shall post a reply to the list .. On Tue, 29 Jul 2003, Angel Rivera wrote: > > i qa/qc my own boxes for production use ... ... > Normally, when we get our boxes, they have been burned in for at least 72 > hours by the vendor. yes... that's the "claim" ... if we say its been burnt in for 72 hrs... - they get a list of times and dates ... - i prefer to do infinite kernel compiles ( rm -rf /tmp/linux-2.x ; cp -par linux-2.x /tmp ; make bzImage ; date-stamp ) http://www.linux-1u.net/Diags/scripts/test.pl ( a dumb/simple/easy test that runs few standard operations ) > Then we beat them using our suit of programs for a > week. If there are any problems, the clock gets reset. yes... that is the trick .... to get a god set of test suites > Not always a very > popular way of doing things, but it keeps bad boxes to a very low roar. I keeping testing costs time down and "start testing process all over is key" testing and diags http://www.linux-1u.net/Diags/ and everybody has their own idea of what tests to do .. and "its considered tested" ... or the depth of the tests.. 1st tests should be visual .. - check the bios time stamps and version - check the batch levels of the pcb - check the manufacturer of the pcb and the chips on sdrams - blah ... dozens of things to inspect than the power up tests - run diags to read bios version numbers - run diags for various purposes - diagnostics and testing should be 100% automated including generating failure and warning notices - people tend to get lazy or go on vacation and most are not as meticulous about testing foo-stuff while the other guyz might care that bar-stuff works - testing is very very expensive ... - getting known good mb, cpu, mem, disk, fans ( repeatedly ) is the key ... - problem is some vendors discontinue their mb in 2 months so the whole testing clock start over again - in our case, its cheaper to find smaller distributors that have inventory of the previously tested known good mb that we like - if it aint broke... leave it alone .. if its doing its job :-) c ya alvin _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From angel at wolf.com Tue Jul 29 21:26:33 2003 From: angel at wolf.com (Angel Rivera) Date: Wed, 30 Jul 2003 01:26:33 GMT Subject: Dell Linux mailing list - testing In-Reply-To: References: Message-ID: <20030730012633.3897.qmail@houston.wolf.com> Alvin Oga writes: [snip] >> Then we beat them using our suit of programs for a >> week. If there are any problems, the clock gets reset. > > yes... that is the trick .... 
to get a god set of test suites We have set of jobs we call beater jobs that beat memory, cpu, drives, nfs etc. We have monitoring programs so we are always getting stats and when something goes wrong they notify us. We had a situation where a rack of angstroms (64 nodes 128 AMD procs and that means hot!) were all under testing. The heat blasting out the rear wa hot enough to triggered an alarm in the server room so they had to come take a look. > > > testing and diags > http://www.linux-1u.net/Diags/ > > and everybody has their own idea of what tests to do .. and "its > considered tested" ... or the depth of the tests.. > > 1st tests should be visual .. > - check the bios time stamps and version > - check the batch levels of the pcb > - check the manufacturer of the pcb and the chips on sdrams > - blah ... dozens of things to inspect > than the power up tests > - run diags to read bios version numbers > - run diags for various purposes This is really important when you get a demo box to test on for a month or so. The time between you getting that box and your order starts landing on the loading dock means there have been a lot of changes if you have a good vendor. We test and test before they go into production-cause once we turn them over we have a heck of time getting them off-line for anything less than a total failure. > > - diagnostics and testing should be 100% automated including > generating failure and warning notices > - people tend to get lazy or go on vacation > and most are not as meticulous about testing foo-stuff > while the other guyz might care that bar-stuff works > > - testing is very very expensive ... > - getting known good mb, cpu, mem, disk, fans > ( repeatedly ) is the key ... > > - problem is some vendors discontinue their mb in 2 months > so the whole testing clock start over again > > - in our case, its cheaper to find smaller distributors > that have inventory of the previously tested known good mb > that we like Ah, the voice of experience. We are very loathe to take a shortcut. Sometimes it is very hard. When we bought those 28TB of storage, the first thing we heard was that we can test it in production. Had we done that, we may have lost data-we lost a box. > > - if it aint broke... leave it alone .. if its doing its job :-) *LOL* Once it is live our entire time is spent not messing anything up. And that can be very hard w/ those angstroms where you have two computers in a 1U form factor and one goes doen. :) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From alvin at Mail.Linux-Consulting.com Tue Jul 29 21:52:43 2003 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Tue, 29 Jul 2003 18:52:43 -0700 (PDT) Subject: Dell Linux mailing list - testing In-Reply-To: <20030730012633.3897.qmail@houston.wolf.com> Message-ID: hi ya angel On Wed, 30 Jul 2003, Angel Rivera wrote: > We have set of jobs we call beater jobs that beat memory, cpu, drives, nfs > etc. We have monitoring programs so we are always getting stats and when > something goes wrong they notify us. yup... and hopefull there is say 90- 95% probability that the "notice of failure" as in fact correct ... :-) - i know people that ignore those pagers/emails becuase the notices are NOT real .. :-0 - i ignore some notices too ... 
its now treated as a "thats nice, that server is still alive" notices > We had a situation where a rack of angstroms (64 nodes 128 AMD procs and > that means hot!) were all under testing. The heat blasting out the rear wa > hot enough to triggered an alarm in the server room so they had to come take > a look. yes.. amd gets hot ... and ii think angstroms has that funky indented power supply and cpu fans on the side where the cpu and ps is fighting each other for the 4"x 4"x 1.75" air space .. pretty silly .. :-) > > testing and diags > > http://www.linux-1u.net/Diags/ > > > > and everybody has their own idea of what tests to do .. and "its > > considered tested" ... or the depth of the tests.. ... > This is really important when you get a demo box to test on for a month or > so. i like to treat all boxes as if it was never tested/seen before ... assuming time/budget allows for it .. > them over we have a heck of time getting them off-line for anything less > than a total failure. if something went bad... that was a bad choice for that system/parts ?? > > - testing is very very expensive ... .. > Ah, the voice of experience. We are very loathe to take a shortcut. short cuts have never paid off in the long run .. you usually wind up doing the same task 3x-5x instead of doing it once correctly ( take apart the old system, build new one, test new one ( and now we're back to the start ... and thats ignoring ( all the tests and changes before giving up on the old ( shortcut system > Sometimes it is very hard. When we bought those 28TB of storage, the first > thing we heard was that we can test it in production. Had we done that, we > may have lost data-we lost a box. i assume you have at least 3 identical 28TB storage mechanisms.. otherwise, old age tells me one day, 28TB will be lost.. no matter how good your raid and backup is - nobody takes time to build/tests the backup system from bare metal ... and confirm the new system is identical to the supposed/simulated crashed box including all data being processed during the "backup-restore" test period > > > > - if it aint broke... leave it alone .. if its doing its job :-) > > *LOL* Once it is live our entire time is spent not messing anything up. And > that can be very hard w/ those angstroms where you have two computers in a > 1U form factor and one goes doen. :) you have those boxes that have 2 systems that depend on eachother ?? - ie ..turn off 1 power supply and both systems go down ??? ( geez.. that $80 power supply shortcut is a bad mistake ( if the number of nodes is important - lots of ways to get 4 independent systems into one 1U shelf and with mini-itx, you can fit 8-16 independent 3GHz machines into one 1U shelf - that'd be a fun system to design/build/ship ... ( about 200-400 independent p4-3G cpu in one rack ) - i think mini-itx might very well take over the expensive blade market asumming certain "pull-n-replace" options in blade is not too important in mini-itx ( when you have 200-400 nodes anyway in a rack ) have fun alvin _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From gary at lerhaupt.com Mon Jul 28 18:50:01 2003 From: gary at lerhaupt.com (gary at lerhaupt.com) Date: Mon, 28 Jul 2003 17:50:01 -0500 Subject: Dell Linux mailing list Message-ID: <1059432601.3f25a899b1c9a@www.webmail.westhost.com> I agree and I think most of the stuff does work out of the box. 
However its at least comforting to know that if it doesn't or if it later develops problems, that list will get you exactly what you need to solve the problem. I happened to see people with problems here and wanted to make sure they knew of this great resource. Quoting Alvin Oga : > > hi ya > > i cant resist... > > On 26 Jul 2003, Gary Lerhaupt wrote: > > > For ample amounts of help with your Dell / Linux equipment, please check > > out the Linux-Poweredge mailing list at > > http://lists.us.dell.com/mailman/listinfo/linux-poweredge. > > if dell machines needs so much "help"... something else is > wrong with the box ... > > and yes, i've been going around to fix/replace lots of broken dell boxes > > a good box works out of the crate ( outof the box ) and keeps > working for years and years.. and keeps working even if you > open the covers and fiddle with the insides > > c ya > alvin > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Daniel.Kidger at quadrics.com Tue Jul 29 07:12:45 2003 From: Daniel.Kidger at quadrics.com (Daniel Kidger) Date: Tue, 29 Jul 2003 12:12:45 +0100 Subject: Power meters at LIDL Message-ID: <010C86D15E4D1247B9A5DD312B7F5AA78DE049@stegosaurus.bristol.quadrics.com> Thanks for the info Simon. I too went out and bought one from our local LIDL in Fishponds,Bristol. They has plenty in stock. Manufactured specially for LIDL by EMC see: http://www.lidl.co.uk/gb/index.nsf/pages/c.o.oow.20030724.p.Energy_Monitor One interesting extra feature this device has is that as well as the instantaneous power reading(W) and energy over time (KWh), it will also display the maximum power consumption(W) and the time/date it occured. This should be useful for those of us who want to stress test nodes to get a maximum power figure. Yours, Daniel. -------------------------------------------------------------- Dr. Dan Kidger, Quadrics Ltd. daniel.kidger at quadrics.com One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 ----------------------- www.quadrics.com -------------------- -----Original Message----- From: John Hearns [mailto:John.Hearns at micromuse.com] Sent: 28 July 2003 12:32 To: beowulf at beowulf.org Subject: Power meters at LIDL Thanks to Simon Hogg. I have got some cheap cycling gear from LIDL, but I never thought of buying Beowulf bits from there! I have a couple nearby me, so if anyone else in the UK wants one I'll see if they are in stock and post one on if you provide name/address. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From jd89313 at hotmail.com Tue Jul 29 12:37:37 2003 From: jd89313 at hotmail.com (Jack Douglas) Date: Tue, 29 Jul 2003 16:37:37 +0000 Subject: Cisco switches for lam mpi Message-ID: Hi I wonder if someone can help me We have just installed a 32 Node Dual Xeon Cluster, with a Cisco Cataslyst 4003 Chassis with 48 1000Base-t ports. 
We are running LAM MPI over gigabit, but we seem to be experiencing bottlenecks within the switch Typically, using the cisco, we only see CPU utilisation of around 30-40% Howver, we experimented with a Foundry Switch, and were seeing cpu utilisation on the same job of around 80 - 90%. We know that there are commands to "open" the cisco, but the ones we have been advised dont seem to do the trick. Was the cisco a bad idea? If so can someone recommend a good Gigabit switch for MPI? I have heard HP Procurves are supposed to be pretty good. Or does anyone know any other commands that will open the Cisco switch further getting the performance up Best Regards JD _________________________________________________________________ On the move? Get Hotmail on your mobile phone http://www.msn.co.uk/msnmobile _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From angel at wolf.com Wed Jul 30 08:33:14 2003 From: angel at wolf.com (Angel Rivera) Date: Wed, 30 Jul 2003 12:33:14 GMT Subject: Testing (Was: Re: Dell Linux mailing list - testing) In-Reply-To: References: Message-ID: <20030730123314.18107.qmail@houston.wolf.com> Alvin Oga writes: > > hi ya angel > > On Wed, 30 Jul 2003, Angel Rivera wrote: > >> We have set of jobs we call beater jobs that beat memory, cpu, drives, >> nfs etc. We have monitoring programs so we are always getting stats and >> when something goes wrong they notify us. > > yup... and hopefull there is say 90- 95% probability that the "notice of > failure" as in fact correct ... :-) > - i know people that ignore those pagers/emails becuase the > notices are NOT real .. :-0 We have very high confidence our emails and pages are real. Our problem is information overload. We need to work on a methodology to make sure the important ones are not lost in the forest of messages. > - i ignore some notices too ... its now treated as a "thats nice, > that server is still alive" notices I try and at least scan them. We are making changes to help us gain situational awareness without having to spend all out time hunched over the monitors. > >> We had a situation where a rack of angstroms (64 nodes 128 AMD procs and >> that means hot!) were all under testing. The heat blasting out the rear was hot enough to triggered an alarm in the server room so they had to come >> take a look. > > yes.. amd gets hot ... > > and ii think angstroms has that funky indented power supply and cpu > fans on the side where the cpu and ps is fighting each other for the > 4"x 4"x 1.75" air space .. pretty silly .. :-) each node has it's own power supply. When everything is running right it's the bomb. When not, then you have to take down two nodes to work on one. Or, until you get used how it is built, you have to be very careful that the reset button you hit is for the right now and not its neighbor. :) >> This is really important when you get a demo box to test on for a month >> or so. > > i like to treat all boxes as if it was never tested/seen before ... > assuming time/budget allows for it Before a purchase, we look at the top 2-3 choices and start testing them to see how fast and how we can tweak them. One of the problems is that between that time and the order coming in the door there can be enough changes that your build changes do not work properly. > i assume you have at least 3 identical 28TB storage mechanisms.. 
> otherwise, old age tells me one day, 28TB will be lost.. no matter > how good your raid and backup is > - nobody takes time to build/tests the backup system from > bare metal ... and confirm the new system is identical to the > supposed/simulated crashed box including all data being processed > during the "backup-restore" test period They are 10 - 2.8 (dual 1.4 3ware 7500 cards in a 6-1-1 configuration.) The vendor is right down the street. We keep on-site spares ready to do so we always have a hot spare on each card. We don't back up very much from the cluster. just two of the management nodes that keep our stats. It would be impossible to backup that much data in a timely manner. > you have those boxes that have 2 systems that depend on eachother ?? > - ie ..turn off 1 power supply and both systems go down ??? > > ( geez.. that $80 power supply shortcut is a bad mistake > ( if the number of nodes is important > > - lots of ways to get 4 independent systems into one 1U shelf > and with mini-itx, you can fit 8-16 independent 3GHz machines > into one 1U shelf > - that'd be a fun system to design/build/ship ... > ( about 200-400 independent p4-3G cpu in one rack ) > > - i think mini-itx might very well take over the expensive blade > market asumming certain "pull-n-replace" options in blade > is not too important in mini-itx ( when you have 200-400 nodes > anyway in a rack ) No they are two standalone boxes in a 1U with different everything. That means it is very compact in the back and power and reset buttons close together in the front-so you have to pay attention. But they rock as compute nodes. We are now going to explore blades now. Anyone have recommendations? _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From alvin at Mail.Linux-Consulting.com Wed Jul 30 08:46:41 2003 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Wed, 30 Jul 2003 05:46:41 -0700 (PDT) Subject: Testing - blades In-Reply-To: <20030730123314.18107.qmail@houston.wolf.com> Message-ID: hi ya angel On Wed, 30 Jul 2003, Angel Rivera wrote: > each node has it's own power supply. When everything is running right it's > the bomb. When not, then you have to take down two nodes to work on one. Or, thats the problem... take 2 down to fix 1... not good > They are 10 - 2.8 (dual 1.4 3ware 7500 cards in a 6-1-1 configuration.) The > vendor is right down the street. We keep on-site spares ready to do so we > always have a hot spare on each card. if you're near 3ware in sunnyvale, than i drive by you daily .. :-) > > - i think mini-itx might very well take over the expensive blade > > market asumming certain "pull-n-replace" options in blade > > is not too important in mini-itx ( when you have 200-400 nodes > > anyway in a rack ) > > No they are two standalone boxes in a 1U with different everything. That > means it is very compact in the back and power and reset buttons close > together in the front-so you have to pay attention. But they rock as compute > nodes. we do custom 1U boxes ... anything that is reasonable is done .. :-) > We are now going to explore blades now. Anyone have recommendations? blades.. http://www.linux-1u.net/1U_Others - towards the bottom of the page.. 
up about 2-3 sections c ya alvin _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Robin.Laing at drdc-rddc.gc.ca Wed Jul 30 11:09:47 2003 From: Robin.Laing at drdc-rddc.gc.ca (Robin Laing) Date: Wed, 30 Jul 2003 09:09:47 -0600 Subject: Interesting read - Canada's fastest computer... Message-ID: <3F27DFBB.9090103@drdc-rddc.gc.ca> Here is a link about Canada's fastest cluster. There is a link off of the "McKenzie's" home page that explains how they worked out some of the latency problems using low cost gig switches. A complete description of hardware is also included. http://www.newsandevents.utoronto.ca/bin5/030721a.asp The graphics of galaxy collisions are interesting as well. -- Robin Laing Instrumentation Technologist Voice: 1.403.544.4762 Military Engineering Section FAX: 1.403.544.4704 Defence R&D Canada - Suffield Email: Robin.Laing at DRDC-RDDC.gc.ca PO Box 4000, Station Main WWW:http://www.suffield.drdc-rddc.gc.ca Medicine Hat, AB, T1A 8K6 Canada _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From john152 at libero.it Wed Jul 30 15:01:14 2003 From: john152 at libero.it (john152 at libero.it) Date: Wed, 30 Jul 2003 21:01:14 +0200 Subject: Bug with 3com card? Message-ID: Hi all, i have problems with mii-diag software in detecting the link status ( -w option ). I'm using a 3Com905-TX card instead of Realtek RTL-8139 i used before. With Realtek card all was Ok, infact with mii-diag i had the following output: - at start (cable connected): 18:54:36.592 Baseline value of MII BMSR (basic mode status register) is 782d. - disconnecting the link: 18:55:01.632 MII BMSR now 7809: no link, NWay busy, No Jabber (0000). 18:55:01.637 Baseline value of MII BMSR basic mode status register) is 7809. - connecting again the link: 18:55:06.722 MII BMSR now 782d: Good link, NWay done, No Jabber (45e1). 18:55:06.728 Baseline value of MII BMSR (basic mode status register) is 782d. . . Now i have the following output lines with 3Com: - at start (cable connected): 18:42:46.073 Baseline value of MII BMSR (basic mode status register) is 782d. - disconnecting the link: 18:42:50.779 MII BMSR now 7829: no link, NWay done, No Jabber (0000). 18:49:38.524 Baseline value of MII BMSR (basic mode status register) is 7809. - connecting again the link: 18:52:15.887 MII BMSR now 7829: no link, NWay done, No Jabber (41e1). 18:52:15.895 Baseline value of MII BMSR (basic mode status register) is 782d. . . The Baseline value of MII BMSR is correct with each card, but i think there is an incorrect return value when written "...MII BMSR now 7829..." (monitor_mii function). I think that correct values of this new value are 782d or 7809, aren't they? Could it be a bug in the software or more simply this card is not supported? It seems that the function mdio_read(ioaddr, phy_id, 1) can return two different values even if the link status is the same! Infact at the status change, i see two outputs coming from the same call "mdio_read(ioaddr, phy_id, 1)" : a first output is 7829 ( i don't understand the why) and the second output is 782d or 7809 and it seems correct. Thanks in advance for your kind answers and observations. 
Giovanni di Giacomo _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bill at math.ucdavis.edu Wed Jul 30 15:06:05 2003 From: bill at math.ucdavis.edu (Bill Broadley) Date: Wed, 30 Jul 2003 12:06:05 -0700 Subject: Interesting read - Canada's fastest computer... In-Reply-To: <3F27DFBB.9090103@drdc-rddc.gc.ca> References: <3F27DFBB.9090103@drdc-rddc.gc.ca> Message-ID: <20030730190605.GA2640@sphere.math.ucdavis.edu> On Wed, Jul 30, 2003 at 09:09:47AM -0600, Robin Laing wrote: > Here is a link about Canada's fastest cluster. There is a link off of > the "McKenzie's" home page that explains how they worked out some of > the latency problems using low cost gig switches. A complete > description of hardware is also included. > > http://www.newsandevents.utoronto.ca/bin5/030721a.asp > > The graphics of galaxy collisions are interesting as well. Anyone have any idea what range of latencies and bandwidths are observed on that machine (as visible to MPI)? -- Bill Broadley Mathematics UC Davis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From douglas at shore.net Wed Jul 30 16:44:58 2003 From: douglas at shore.net (Douglas O'Flaherty) Date: Wed, 30 Jul 2003 16:44:58 -0400 Subject: Cisco switches for lam mpi Message-ID: <3F282E4A.30301@shore.net> From: "Jack Douglas" > To: beowulf at beowulf.org Subject: Cisco switches for lam mpi Date: Tue, 29 Jul 2003 16:37:37 +0000 Hi I wonder if someone can help me We have just installed a 32 Node Dual Xeon Cluster, with a Cisco Cataslyst 4003 Chassis with 48 1000Base-t ports. We are running LAM MPI over gigabit, but we seem to be experiencing bottlenecks within the switch Typically, using the cisco, we only see CPU utilisation of around 30-40% Howver, we experimented with a Foundry Switch, and were seeing cpu utilisation on the same job of around 80 - 90%. We know that there are commands to "open" the cisco, but the ones we have been advised dont seem to do the trick. Was the cisco a bad idea? If so can someone recommend a good Gigabit switch for MPI? I have heard HP Procurves are supposed to be pretty good. Or does anyone know any other commands that will open the Cisco switch further getting the performance up Best Regards JD ============== Jack: Have you run Pallas' MPI benchmarks (http://www.pallas.com/e/products/pmb/) to quantify the differences between the two switches? The dramatic difference in system performance suggests you have something going wrong there. You should test under no load and under load. The difference may be illuminating. I'd start with an assumption you may have something wrong on the Cisco. And I'd call whomever you bought it form to come show otherwise. Make certain you check your counters on the switch (and a few systems) to see if you have collisions, overruns or any other issues. As noted on this list before, the Cisco's can have pathological problems with auto-negotiation. You should be certain to set the ports to Full Duplex to get the speed up. With GigE, Jumbo Frames increases performance by a bit. Depending on your set up, I'd also turn off spanning tree, eliminate any ACLs, SNMP counters etc. which may be on the switch and contributing to load. 
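One way to rule out host-side problems before blaming the switch fabric -- a minimal sketch only, assuming ethtool and the standard net-tools are installed on the nodes, and treating eth0 and the node names below as placeholders to substitute with your own:

    # Host-side sanity checks: link speed/duplex and error counters.
    # Assumes passwordless ssh to the nodes; eth0 is the cluster-facing
    # GigE interface and node01..node03 are placeholder hostnames.
    for node in node01 node02 node03 ; do
        echo "=== $node ==="
        ssh $node "ethtool eth0 | grep -E 'Speed|Duplex|Auto-negotiation'"
        ssh $node "ifconfig eth0 | grep -E 'errors|dropped|collisions'"
        ssh $node "netstat -s | grep -i retransmit"
    done

Anything other than 1000Mb/s full duplex, or error/retransmit counters that keep climbing while a job runs, points at the hosts or the cabling rather than the switch backplane.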
Worst case would be being backplane constrained - you have 32 GigE nodes. The Supervisor Engine in the Cisco is listed as a 24-Gbps forwarding engine (18 million packets/sec) at peak. The Foundry NetIron 400 & 800 backplane is 32Gbps + and they say 90mpps peak. Notice the math to convert between packets and backplane speed doesn't work. My experience is that the Foundry is always faster and has lower latency. I have little experience with the HP pro curve switches. I've used them in data closets where backplane speed is not an issue. They've been reliable, but I've never considered them for a high speed network core. doug _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From tod at gust.sr.unh.edu Wed Jul 30 17:56:16 2003 From: tod at gust.sr.unh.edu (Tod Hagan) Date: 30 Jul 2003 17:56:16 -0400 Subject: Interesting read - Canada's fastest computer... In-Reply-To: <20030730190605.GA2640@sphere.math.ucdavis.edu> References: <3F27DFBB.9090103@drdc-rddc.gc.ca> <20030730190605.GA2640@sphere.math.ucdavis.edu> Message-ID: <1059602177.17090.81.camel@haze.sr.unh.edu> On Wed, 2003-07-30 at 15:06, Bill Broadley wrote: > Anyone have any idea what range of latencies and bandwidths are > observed on that machine (as visible to MPI)? There's a plot of the bandwidth tests they ran at the bottom of the Mckenzie Networking HOWTO: http://www.cita.utoronto.ca/webpages/mckenzie/tech/networking/index.html No latency info, though. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at keyresearch.com Wed Jul 30 18:20:54 2003 From: lindahl at keyresearch.com (Greg Lindahl) Date: Wed, 30 Jul 2003 15:20:54 -0700 Subject: Interesting read - Canada's fastest computer... In-Reply-To: <20030730190605.GA2640@sphere.math.ucdavis.edu> References: <3F27DFBB.9090103@drdc-rddc.gc.ca> <20030730190605.GA2640@sphere.math.ucdavis.edu> Message-ID: <20030730222054.GA2266@greglaptop.internal.keyresearch.com> On Wed, Jul 30, 2003 at 12:06:05PM -0700, Bill Broadley wrote: > Anyone have any idea what range of latencies and bandwidths are > observed on that machine (as visible to MPI)? A bisection bandwidth histrogram is at the bottom of: http://www.cita.utoronto.ca/webpages/mckenzie/tech/networking/index.html You can tell these guys are physicists: they didn't just print the average. I'd guess latency in the cube network isn't very good, because they're using Linux to forward packets. Given that, it's impressive how good the bisection bandwidth is. Eventually the price of 10gig trunking is going to fall to the point where it's better than this kind of setup... until the wheel of reincarnation turns again, and we're using 10 gig links to the nodes. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hahn at physics.mcmaster.ca Wed Jul 30 19:25:27 2003 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed, 30 Jul 2003 19:25:27 -0400 (EDT) Subject: Interesting read - Canada's fastest computer... 
In-Reply-To: <20030730190605.GA2640@sphere.math.ucdavis.edu> Message-ID: > Anyone have any idea what range of latencies and bandwidths are > observed on that machine (as visible to MPI)? see the bottom of http://www.cita.utoronto.ca/webpages/mckenzie/ the machine is build for very latency-tolerant aggregate-bandwidth-intensive codes. you can see from the histograms that their topology does a pretty good job of producing fast links, but the 40-ish MB/s is going to be significantly affected by other traffic on the machine. I guess the amount of interference would depend largely on how efficient is the kernel's routing code. for instance, is routing zero-copy? I believe these are all Intel 7500CW boards, so their NICs probably have checksum-offloading (or is that only done at endpoints?) latency is not going to be great, if you're thinking in terms of myrinet or even flat 1000bT nets, since most routes will wind up going through a small number of nodes. it would be very interesting to see similar histograms of latency or even just hop-count. if I understand the topology correctly, you ascend into the express-cube for 7/8ths of all possible random routes, and the weighted average of CDCC hops is 0*(1/8)+1*(4/8)+2*(3/8)=1.25 hops. without diagonals, the avg would be 1:3:3:1=1.5 hops, which isn't all that much worse. but I think bisection cuts 8 4x1000bT links: 4 GB/s; without express links, bisection would be half as much! I think I'm missing something about the eth1 (point-to-point) links... _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From award at andorra.ad Thu Jul 31 02:48:51 2003 From: award at andorra.ad (Alan Ward) Date: Thu, 31 Jul 2003 08:48:51 +0200 Subject: small home cluster Message-ID: <3F28BBD3.4040104@andorra.ad> Dear list-people, I just put the pictures of my home "civilized" cluster on the web: http://www.geocities.com/ward_a2003/ This is more play than work, as you can see from the Geocities address. Best regards, Alan Ward _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From tkonto at aegean.gr Thu Jul 31 11:04:00 2003 From: tkonto at aegean.gr (Kontogiannis Theophanis) Date: Thu, 31 Jul 2003 18:04:00 +0300 Subject: TEST --- IGNORE --- TEST -- IGNORE Message-ID: _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From fboudra at uxp.fr Thu Jul 31 11:04:48 2003 From: fboudra at uxp.fr (Fathi BOUDRA) Date: Thu, 31 Jul 2003 17:04:48 +0200 Subject: 82551ER eeprom Message-ID: <200307311704.48984.fboudra@uxp.fr> Hi, i try to program the 82551ER eeprom. When i receive the eeprom, his contents was : eepro100-diag -#2 -aaeem eepro100-diag.c:v2.12 4/15/2003 Donald Becker (becker at scyld.com) http://www.scyld.com/diag/index.html Index #2: Found a Intel 82559ER EtherExpressPro/100+ adapter at 0xe400. i82557 chip registers at 0xe400: 00000000 00000000 00000000 00080002 10000000 00000000 No interrupt sources are pending. The transmit unit state is 'Idle'. The receive unit state is 'Idle'. This status is unusual for an activated interface. 
EEPROM contents, size 64x16: 00: ffff ffff ffff ffff ffff ffff ffff ffff ________________ 0x08: ffff ffff fffd ffff ffff ffff ffff ffff ________________ 0x10: ffff ffff ffff ffff ffff ffff ffff ffff ________________ 0x18: ffff ffff ffff ffff ffff ffff ffff ffff ________________ 0x20: ffff ffff ffff ffff ffff ffff ffff ffff ________________ 0x28: ffff ffff ffff ffff ffff ffff ffff ffff ________________ 0x30: ffff ffff ffff ffff ffff ffff ffff ffff ________________ 0x38: ffff ffff ffff ffff ffff ffff ffff bafb ________________ The EEPROM checksum is correct. Intel EtherExpress Pro 10/100 EEPROM contents: Station address FF:FF:FF:FF:FF:FF. Board assembly ffffff-255, Physical connectors present: RJ45 BNC AUI MII Primary interface chip i82555 PHY #-1. Secondary interface chip i82555, PHY -1. I used the -H, -G parameters and changed the eeprom_id, subsystem_id and subsystem_vendor : eepro100-diag -#1 -aaeem eepro100-diag.c:v2.12 4/15/2003 Donald Becker (becker at scyld.com) http://www.scyld.com/diag/index.html Index #1: Found a Intel 82559ER EtherExpressPro/100+ adapter at 0xe800. i82557 chip registers at 0xe800: 00000000 00000000 00000000 00080002 10000000 00000000 No interrupt sources are pending. The transmit unit state is 'Idle'. The receive unit state is 'Idle'. This status is unusual for an activated interface. EEPROM contents, size 64x16: 00: 1100 3322 5544 0000 0000 0101 4401 0000 __"3DU_______D__ 0x08: 0000 0000 4000 1209 8086 0000 0000 0000 _____ at __________ ... 0x38: 0000 0000 0000 0000 0000 0000 0000 09c3 ________________ The EEPROM checksum is correct. Intel EtherExpress Pro 10/100 EEPROM contents: Station address 00:11:22:33:44:55. Receiver lock-up bug exists. (The driver work-around *is* implemented.) Board assembly 000000-000, Physical connectors present: RJ45 Primary interface chip DP83840 PHY #1. Transceiver-specific setup is required for the DP83840 transceiver. Primary transceiver is MII PHY #1. MII PHY #1 transceiver registers: 3000 7829 02a8 0154 05e1 45e1 0003 0000 0000 0000 0000 0000 0000 0000 0000 0000 0203 0000 0001 035e 0000 0003 0b74 0003 0000 0000 0000 0000 0010 0000 0000 0000. Basic mode control register 0x3000: Auto-negotiation enabled. Basic mode status register 0x7829 ... 782d. Link status: previously broken, but now reestablished. Capable of 100baseTx-FD 100baseTx 10baseT-FD 10baseT. Able to perform Auto-negotiation, negotiation complete. Vendor ID is 00:aa:00:--:--:--, model 21 rev. 4. No specific information is known about this transceiver type. I'm advertising 05e1: Flow-control 100baseTx-FD 100baseTx 10baseT-FD 10baseT Advertising no additional info pages. IEEE 802.3 CSMA/CD protocol. Link partner capability is 45e1: Flow-control 100baseTx-FD 100baseTx 10baseT-FD 10baseT. Negotiation completed. All these things doesn't work. I read the "online" 82551er datasheet but it doesn't help me (they explain only the words 00h to 02h and 0Ah to 0Ch). Someone know what i need to do or have a working 82551er eeprom ? thanks fbo _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rouds at servihoo.com Thu Jul 31 11:53:54 2003 From: rouds at servihoo.com (RoUdY) Date: Thu, 31 Jul 2003 19:53:54 +0400 Subject: NFS problem In-Reply-To: <200307301906.h6UJ6tw26647@NewBlue.Scyld.com> Message-ID: Hello dear friends, I am doing my beowulf cluster and I have a small problem when I test the NFS. 
the command I used was : " mount -t nfs node1:/home /home nfs " (where node1 is my master node) Well the output that I obtain is " RPC : Remote system error connection refused RPC not registered " But when I am on NOde2 and I ping to the master node that is node1 it's ok.. hope to hear from u very soon for HELP bye Roudy -------------------------------------------------- Get your free email address from Servihoo.com! http://www.servihoo.com The Portal of Mauritius _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bropers at lsu.edu Thu Jul 31 12:35:09 2003 From: bropers at lsu.edu (Brian D. Ropers-Huilman) Date: Thu, 31 Jul 2003 11:35:09 -0500 (CDT) Subject: NFS problem In-Reply-To: References: Message-ID: Roudy, Do you have portmapper running on node1? Do you have nfsd running on node1? Does your /etc/exports file include /home? Is the /home export open to the client node? Do you have portmapper running on your client node? Do you have NFS support in your kernel or do you have a mount daemon running like rpciod or biod? Finally, do you have any firewalling on either of the nodes? The client and server must have all appropriate software running first and be properly configured before anything will work. Also, if any of those ports are blocked, at either end, things won't work. On Thu, 31 Jul 2003, RoUdY wrote: > Hello dear friends, > > I am doing my beowulf cluster and I have a small problem > when I test the NFS. > > the command I used was : > > " mount -t nfs node1:/home /home nfs " > > (where node1 is my master node) > > > Well the output that I obtain is > " > RPC : Remote system error > connection refused > RPC not registered " > > But when I am on NOde2 and I ping to the master node that > is node1 it's ok.. > > hope to hear from u very soon for HELP > > bye > > Roudy -- Brian D. Ropers-Huilman (225) 578-0461 (V) Systems Administrator AIX (225) 578-6400 (F) Office of Computing Services GNU Linux brian at ropers-huilman.net High Performance Computing .^. http://www.ropers-huilman.net/ Fred Frey Building, Rm. 201, E-1Q /V\ \o/ Louisiana State University (/ \) -- __o / | Baton Rouge, LA 70803-1900 ( ) --- `\<, / `\\, ^^-^^ O/ O / O/ O _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
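Brian's checklist, condensed into commands -- a minimal sketch only, assuming a typical Linux NFS setup of that era with node1 exporting /home to node2; init script names, export options and paths vary by distribution, and the exports line is just an example:

    # On the server (node1): are the RPC services actually registered?
    rpcinfo -p localhost        # should list portmapper, mountd and nfs
    cat /etc/exports            # e.g.  /home  node2(rw,sync)   <- example entry only
    exportfs -ra                # (re)export everything listed in /etc/exports
    /etc/init.d/nfs restart     # init script name differs between distributions

    # On the client (node2): can we reach the server's portmapper and see the export?
    rpcinfo -p node1            # "connection refused" here means portmapper is down or blocked on node1
    showmount -e node1          # should show /home in the export list
    mount -t nfs node1:/home /home

The trailing "nfs" in the original command is redundant once -t nfs is given, but the "RPC not registered" / "connection refused" error itself almost always means portmapper or the NFS daemons are not running, or are firewalled, on the server.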