[Beowulf] Errors on IBM e325

Joe Landman landman at scalableinformatics.com
Mon Jun 28 14:21:34 EDT 2004

On Fri, 2004-06-25 at 11:21, Jeff Layton wrote:
> Good morning,
>    We've got a shiny new IBM cluster with e325 nodes (Opteron).
> However, we're having some trouble with a number of nodes.
> We keep getting 'GART' errors showing up in the logs. Here is
> an example,
> Jun 21 07:07:42 c3n32.cluster kernel: Lost an northbridge error
> Jun 21 07:40:52 c1n4.cluster kernel: Lost an northbridge error
> Jun 21 07:07:42 c3n32.cluster kernel: GART error 3
> Jun 21 07:40:52 c1n4.cluster kernel: GART error 3
> Jun 21 14:03:49 c1n2.cluster kernel:     extended error chipkill ecc error
> Jun 21 14:03:50 c1n2.cluster kernel:     corrected ecc error

Does booting with iommu=off help?
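
If it helps to pick which nodes to try that on first, here is a rough
Python sketch that tallies the kernel messages above per node.  The
regular expression and the function name tally are my own assumptions,
based only on the six log lines quoted above; other syslog formats will
need a different pattern.

```python
import re
from collections import Counter

# Assumed syslog layout, taken from the quoted lines, e.g.:
#   Jun 21 07:07:42 c3n32.cluster kernel: GART error 3
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]{8}\s+(\S+)\s+kernel:\s+(.*)$")

# Substrings of the messages seen in the quoted log excerpt.
INTERESTING = ("GART error", "northbridge error", "ecc error")


def tally(lines):
    """Count (host, message) pairs for the GART/northbridge/ECC messages."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a kernel line in the assumed format
        host, msg = m.group(1), m.group(2).strip()
        if any(key in msg for key in INTERESTING):
            counts[(host, msg)] += 1
    return counts
```

Something like tally(open("/var/log/messages")).most_common() would then
list the noisiest nodes first; the log path is only a guess at where your
syslog collects these.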

>    Does anybody have any ideas what the cause might be?

The e325s have an onboard ATI VGA chip.  Last I checked it was PCI based
(I don't have a unit here to check).  There was some discussion of
GART-related issues on the Red Hat amd64-list in May:
https://www.redhat.com/archives/amd64-list/2004-May/date.html
Which kernel are you running, how much memory is in each node, and how is
it distributed?  I have noticed that some vendors do not configure the
memory on Opteron systems correctly, though I would expect the IBM folks
not to have a problem with this.
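
To answer the kernel and memory questions on each node, something along
these lines might do.  It is only a sketch: it assumes the kernel exposes
per-node memory under /sys/devices/system/node/node*/meminfo with lines
of the form "Node 0 MemTotal: ... kB", which may not be true of the
kernel you are running.  The function names are made up for illustration.

```python
import glob
import platform
import re

# Assumed line format in the per-node meminfo files:
#   Node 0 MemTotal:      1048576 kB
NODE_RE = re.compile(r"^Node\s+(\d+)\s+MemTotal:\s+(\d+)\s+kB", re.MULTILINE)


def node_memtotal_kb(text):
    """Map NUMA node number -> MemTotal in kB from meminfo-style text."""
    return {int(node): int(kb) for node, kb in NODE_RE.findall(text)}


def collect():
    """Return (kernel release, {node: MemTotal kB}) for the local machine."""
    per_node = {}
    # Assumed sysfs location; the glob is simply empty if it does not exist.
    for path in sorted(glob.glob("/sys/devices/system/node/node*/meminfo")):
        with open(path) as f:
            per_node.update(node_memtotal_kb(f.read()))
    return platform.release(), per_node
```

If collect() shows all of the memory on one node and none on the other,
that is the kind of lopsided layout I would look at first.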

There are also some BIOS settings on the e325 that directly affect
memory layout, NUMA use, etc.  Of course, I don't remember what they
are :(.


> Thanks!
> Jeff
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615
