From prentice at ias.edu Tue Dec 7 11:54:58 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Tue, 07 Dec 2010 11:54:58 -0500
Subject: [Beowulf] Memory stress testing tools.
Message-ID: <4CFE66E2.6030805@ias.edu>

Dear Beowulfers,

Can any of you recommend a good RAM stress testing tool?

I have a server with 128 GB of RAM that keeps reporting single-bit errors. Every time this happens, I reseat the DIMMs or swap them around, and then run some large MPI jobs which I hope stress the RAM. Sometimes this produces more SBEs, sometimes it doesn't. When the system seems stable, I let the users back on it, and sure enough, they get it to start reporting SBEs in short order.

--
Prentice

_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.

From mathog at caltech.edu Tue Dec 7 15:42:27 2010
From: mathog at caltech.edu (David Mathog)
Date: Tue, 07 Dec 2010 12:42:27 -0800
Subject: [Beowulf] Memory stress testing tools.
Message-ID: 

Prentice Bisbal wrote:

> When the system seems stable, I let the users back on it, and sure enough, they get it to start reporting SBEs in short order.

Sounds like you already have a good tool for triggering memory errors on that system - your users' code.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
From prentice at ias.edu Tue Dec 7 16:09:35 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Tue, 07 Dec 2010 16:09:35 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <09ED21B37E0F694688A2317C4FED9ED30495AF8404@azsmsx504.amr.corp.intel.com>
References: <4CFE66E2.6030805@ias.edu> <09ED21B37E0F694688A2317C4FED9ED30495AF8404@azsmsx504.amr.corp.intel.com>
Message-ID: <4CFEA28F.4010709@ias.edu>

That was the first thing I looked into. memtest86 supports up to 64 GB of RAM. My system has 128 GB. :(

I found prime95/GIMPS through a Wikipedia page. I'm giving it a go now.

http://www.mersenne.org/freesoft/#newusers

On 12/07/2010 01:05 PM, Mcmillan, Scott A wrote:
> memtest86
>
> -----Original Message-----
> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Prentice Bisbal
> Sent: Tuesday, December 07, 2010 10:55 AM
> To: Beowulf Mailing List
> Subject: [Beowulf] Memory stress testing tools.
>
> Dear Beowulfers,
>
> Can any of you recommend a good RAM stress testing tool?
>
> I have a server with 128GB of RAM that keeps reporting single-bit errors. Every time this happens, I reseat the DIMMs or swap them around, and then run some large MPI jobs which I hope stress the RAM. Sometimes this produces more SBEs, sometimes it doesn't. When the system seems stable, I let the users back on it, and sure enough, they get it to start reporting SBEs in short order.

From a.travis at abdn.ac.uk Thu Dec 9 07:16:34 2010
From: a.travis at abdn.ac.uk (Tony Travis)
Date: Thu, 09 Dec 2010 12:16:34 +0000
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <4CFE66E2.6030805@ias.edu>
References: <4CFE66E2.6030805@ias.edu>
Message-ID: <4D00C8A2.3080008@abdn.ac.uk>

On 07/12/10 16:54, Prentice Bisbal wrote:
> Dear Beowulfers,
>
> Can any of you recommend a good RAM stress testing tool?
>
> I have a server with 128GB of RAM that keeps reporting single-bit errors. Every time this happens, I reseat the DIMMs or swap them around, and then run some large MPI jobs which I hope stress the RAM. Sometimes this produces more SBEs, sometimes it doesn't. When the system seems stable, I let the users back on it, and sure enough, they get it to start reporting SBEs in short order.

Hi, Prentice.

Have you tried Charles Cazabon's user-space "memtester" program:

http://pyropus.ca/software/memtester/

It doesn't test *all* the memory, just what it can lock, but it does stress the memory sub-system in the same way that applications do...

Bye, Tony.

--
Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK
tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk
mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt

From prentice at ias.edu Thu Dec 9 10:59:16 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu, 09 Dec 2010 10:59:16 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: 
References: 
Message-ID: <4D00FCD4.6020203@ias.edu>

On 12/07/2010 04:35 PM, David Mathog wrote:
>> True, but this is a multi-user system, so I don't know which user's code is triggering the errors, nor do I know what usage pattern causes the errors, so I'm looking for something more consistent. Well, I hope it will be more consistent.
>
> Try setting up a script to take snapshots of the system every 15 seconds or so. Something like:
>
> while true; do
>     ( date; top -b -n 1 | head -10 ) >> "$LOGFILE"
>     sleep 15
> done
>
> Then using the memory error time stamps go back through those logs to find the most likely culprits.

That will identify the program, but not the problem size or data set being used that triggers the error. Using a stress test that I control removes this detective work.

I've decided to go with mprime from the GIMPS project, which has a stress test feature:

http://www.mersenne.org/

--
Prentice

From prentice at ias.edu Thu Dec 9 11:08:28 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu, 09 Dec 2010 11:08:28 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: 
References: <4CFE66E2.6030805@ias.edu>
Message-ID: <4D00FEFC.2080509@ias.edu>

On 12/08/2010 11:47 AM, Jason Clinton wrote:
> On Tue, Dec 7, 2010 at 10:54, Prentice Bisbal wrote:
>
>     Can any of you recommend a good RAM stress testing tool?
>
> We have an open source ISO/netboot image that can stress-test using the latest Linux kernel EDAC facilities and HPL as the test code. It's posted here: http://www.advancedclustering.com/software/breakin.html
>
> It's intended to be booted into.
> There's a beta of a slightly newer version posted at: http://lab.advancedclustering.com/bootimage/
>
> I would be interested in any feedback you have on either version.

Jason,

I know breakin well. I used it quite a bit in 2008 when I was stress-testing my then-new cluster, and sent some feedback to the developer at the time (last name Shoemaker, I think). I did find that I could run it for days on all my cluster nodes, and then a few days later, when running HPL as a single job across all the nodes, I'd get memory errors. I haven't used it since. Not because I don't like it, but I just haven't had a need for it since then.

I've also been testing this node by running a single HPL job across all 32 cores myself, and even after days of doing this, I couldn't trigger any errors, but a user program could trigger an error in only a couple of hours.

Based on these experiences, I don't think that HPL is good at stressing RAM. Has anyone else had similar experiences?

Since this system has 128 GB of RAM, I think it's a safe assumption that many programs might not use all of that RAM, so I need something memory-specific that I know will hit all 128 GB of RAM.

So far, mprime appears to be working. I was able to trigger an SBE in 21 hours the first time I ran it. I plan on running it repeatedly for the next few days to see how well it can repeat finding errors.

--
Prentice Bisbal

From prentice at ias.edu Thu Dec 9 11:36:32 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu, 09 Dec 2010 11:36:32 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: 
References: 
Message-ID: <4D010590.3030900@ias.edu>

On 12/07/2010 06:58 PM, David Mathog wrote:
> Try stressapptest.
>
> http://code.google.com/p/stressapptest/
>
> Note that it has a bizarre behavior where no matter how high you set N, the sum of the threads' CPU usage is always 100%, even though they are not all running on one core on a multi-core system. To saturate all of the cores one must force things; see this thread (where I talk to myself, the solution is at the end):
>
> http://groups.google.com/group/stressapptest-discuss/browse_thread/thread/882537e9f3f7d3f2

Thanks for the info. stressapptest looks like a great tool. I'm going to give it a try when I'm done trying out mprime. I want to see which one triggers the error quicker/more consistently.

--
Prentice

From jlforrest at berkeley.edu Thu Dec 9 12:00:08 2010
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Thu, 09 Dec 2010 09:00:08 -0800
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <4D00FEFC.2080509@ias.edu>
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu>
Message-ID: <4D010B18.4000305@berkeley.edu>

On 12/9/2010 8:08 AM, Prentice Bisbal wrote:

> So far, mprime appears to be working. I was able to trigger an SBE in 21 hours the first time I ran it. I plan on running it repeatedly for the next few days to see how well it can repeat finding errors.

After it finds an error how do you figure out which memory module to replace?
--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA 94720-1460
510-643-1032
jlforrest at berkeley.edu

From prentice at ias.edu Thu Dec 9 16:51:44 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu, 09 Dec 2010 16:51:44 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: 
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu>
Message-ID: <4D014F70.4000103@ias.edu>

Jason Clinton wrote:
> On Thu, Dec 9, 2010 at 10:08, Prentice Bisbal wrote:
>
>     I know breakin well. I used it quite a bit in 2008 when I was stress-testing my then-new cluster, and sent some feedback to the developer at the time (last name Shoemaker, I think). I did find that I could run it for days on all my cluster nodes, and then a few days later, when running HPL as a single job across all the nodes, I'd get memory errors. I haven't used it since. Not because I don't like it, but I just haven't had a need for it since then.
>
> Hum. It's possible that EDAC support for your chipset didn't exist at the time. AMD and Intel have been pretty good about landing EDAC for their chips in vanilla upstream kernels for the past year, and so that is why it is important to use a recent kernel. Or at least one with recent backports of that work.

At the time, I was using the latest version of Breakin available. I was testing on AMD Barcelona processors. I was using Breakin in September/October 2008, and the Barcelona processors came out in March - May of that year.
I would assume that would be enough time for support for the new processors to trickle down to breakin, but that's just an assumption; I can't confirm/prove that.

> > I've also been testing this node by running a single HPL job across all 32 cores myself, and even after days of doing this, I couldn't trigger any errors, but a user program could trigger an error in only a couple of hours.
> >
> > Based on these experiences, I don't think that HPL is good at stressing RAM. Has anyone else had similar experiences?
>
> HPL is among the most memory intensive workloads out there. This is why architectural changes in the past few years that have increased the aggregate memory bandwidth of the architecture have resulted in higher measured platform efficiency.
>
> My guess would be that the difference you've seen between the two would be statistical noise. How are you measuring errors? MCE events?

I don't think this is statistical noise. This system has consistently reported SBE errors since it was installed several months ago. I've probably tried to trigger SBEs with HPL dozens of times. I'll often run it 2-3 times in a row without triggering errors over a period of several days. When the users go back to using this server, they usually trigger errors in less time than that. I think HPL resulted in triggering the error only a couple of times.

The system is a Dell PowerEdge something or other. It has an LCD display that is normally blue. When a hardware error is detected, it turns orange and shows the error. I check that several times a day. Our central log server also e-mails any critical log errors that get sent to it, so even if I didn't check the display on the front of the server, I'll receive an e-mail shortly after the error is logged in my system logs. It's low tech, but it works.
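(A lower-effort complement to watching the LCD, if the running kernel has an EDAC driver loaded for this chipset, is reading the correctable-error counters straight out of sysfs. A minimal sketch; the csrow layout is what 2.6-era kernels expose, and exact paths vary by kernel and driver:)

```shell
#!/bin/sh
# Sketch: print per-memory-controller correctable-error counts from the
# EDAC sysfs tree. Assumes an EDAC driver (e.g. amd64_edac on an Opteron
# box) is loaded; if none is, the directory simply won't exist.
EDAC=/sys/devices/system/edac/mc
report=""
for f in "$EDAC"/mc*/csrow*/ce_count "$EDAC"/mc*/ce_count; do
    [ -e "$f" ] && report="$report$f: $(cat "$f")
"
done
if [ -n "$report" ]; then
    printf '%s' "$report"
else
    echo "no EDAC counters found (driver not loaded?)"
fi
```

Run from cron, this gives timestamped counts that can be compared against the LCD/syslog events.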
> > Since this system has 128 GB of RAM, I think it's a good assumption that many programs might not use all of that RAM, so I need something memory specific that I know will hit all 128 GB of RAM.
>
> Breakin uses the same algorithm at http://www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html to calculate the "N" size which will consume 90% of the RAM of a system using all cores (in as close to square grid as possible).
>
> > So far, mprime appears to be working. I was able to trigger an SBE in 21 hours the first time I ran it. I plan on running it repeatedly for the next few days to see how well it can repeat finding errors.
>
> I'm curious what kernel you're running that is giving you EDAC reporting. Or are you rebooting after an MCE and examining the system event logs?
>
> --
> Jason D. Clinton, Advanced Clustering Technologies
> 913-643-0306

--
Prentice

From prentice at ias.edu Thu Dec 9 16:54:57 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu, 09 Dec 2010 16:54:57 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <4D010B18.4000305@berkeley.edu>
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu>
Message-ID: <4D015031.70908@ias.edu>

Jon Forrest wrote:
> On 12/9/2010 8:08 AM, Prentice Bisbal wrote:
>
>> So far, mprime appears to be working. I was able to trigger an SBE in 21 hours the first time I ran it. I plan on running it repeatedly for the next few days to see how well it can repeat finding errors.
>
> After it finds an error how do you figure out which memory module to replace?
>

The LCD display on the front of the server tells me, with a message like this:

"SBE logging disabled on DIMM C3. Reseat DIMM"

I can also generate a report with Dell DSET that shows me a similar message. I'm sure there are other tools, but I usually have to create a DSET report to send to Dell, anyway.

--
Prentice

From david.t.kewley at gmail.com Thu Dec 9 21:14:41 2010
From: david.t.kewley at gmail.com (David Kewley)
Date: Thu, 9 Dec 2010 18:14:41 -0800
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <4D015031.70908@ias.edu>
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu> <4D015031.70908@ias.edu>
Message-ID: 

Prentice,

You only asked for memory testing programs, but I'm going to go a bit further, to make sure some background issues are covered, and to give you some ideas you might not yet have. Some of this is based on a lot of experience with Dell servers in HPC.

Some of my background thoughts on dealing with SBEs:

1) Complete and detailed historical records are important for correctly and efficiently resolving these types of errors, especially on larger clusters. Otherwise it's too easy to get confused about what happened when, and come to incorrect conclusions about problems and solutions. Treat it like a lab experiment -- keep a log book or equivalent, test your hypotheses against the data, and think broadly about what alternative hypotheses may exist.

2) The resolution process will be iterative, with physical manipulations (e.g.
moving DIMMs among slots) alternating with monitoring for SBEs and optionally running stress applications to attempt to trigger SBEs (a "reproducer" of the SBEs).

3) For efficient resolution, you want a quick, reliable reproducer, something that will trigger the SBEs quickly.

4) I've seen no evidence that SBEs materially affect performance or correctness on a server, so my practice has often been to leave affected servers in production as much as possible, taking them out of production (after draining jobs) only briefly to move DIMMs, replace DIMMs, etc.

Regarding (4), if anyone here has measurements or a URL to a study saying in what circumstances there's a significant material risk to performance or correctness of calculation with SBE correction, I'd love to see that. I'm not saying that SBE correction is completely free performance-wise -- I bet it takes a little time to do the correction, but I bet for normal SBE correction rates, that time is (nearly) unmeasurable.

Also, over a few thousand server-years, I've never or almost never seen SBE corrections morph into uncorrectable multi-bit errors. When uncorrectable errors have shown up (which itself has been rare in my experience, mostly in a single situation where there was a server bug that got corrected), they've shown up early on a server, not after a long period of only seeing SBEs.

Prentice, I believe you started this thread because you need something for (3), is that right? As David Mathog said, you already know what activity most reliably triggers SBE corrections: your users' code.

If I were in your shoes, and I had time and/or were concerned around issue (4) above, I'd a) identify which user, code, and specific runs trigger SBEs the most, then b) if possible, work with that user to get a single-node version of a similar run that you could use outside of production, to reproduce and resolve SBEs.
I'd then monitor for SBEs in production, and when they occur, drain jobs from those nodes, and take them out of production so I could use that single-node user job to satisfy (2) and (3) above.

If I were in your shoes and was NOT concerned about (4), I'd simply drain the running job, do a manipulation (2 above), and put the node back into production, waiting for the SBE to recur if it is going to. This is what I've often done. Or if you have a dev/test cluster, replace the entire production node with a tested, known-good node from the dev/test cluster, then test/fix the SBE server in the context of the dev/test cluster. I've also often done this.

My experience has been that long runs of single-node HPL were the best SBE trigger I ever found. Dell's mpmemory did not do as well. I believe memtest86{,+} also didn't find problems that HPL found, though I didn't test memtest86{,+} as much. It also was not immediately obvious how to gather the memory test results from mpmemory and memtest86{,+}, though it can probably be done, perhaps easily, with a bit of R&D.

But since you've found that HPL does not trigger SBEs as much as your user's code, I think you have a very good pointer that you should do stress tests with your user's code if at all possible. If you can share what the stressful app is, and any of the relevant run parameters, that would probably be interesting to folks on this list.

In my experience, usually SBEs are resolved by reseating or replacing the affected DIMM. However it can also be an issue on the motherboard (sockets or traces or something else), or possibly the CPU (because Intel and AMD now both have the memory controllers on-die), or possibly a BIOS issue (if a CPU- or memory-related parameter isn't set quite optimally by the BIOS you're running; BIOSes may set hardware parameters without your awareness or ability to tune them yourself).
Best practice may be:

A) Swap the DIMM where the SBE occurred with a neighbor that underwent similar stress but did not show any SBEs. Keep a permanent record of which DIMMs you swapped and when, as well as all error messages and their timing.

B) Re-stress either in production (if you believe my tentative assertion (in 4 above) that SBE corrections do not materially affect performance nor correctness), or using your reliable reproducer for an amount of time that you know should usually re-trigger the SBE if it is going to recur.

C) Assess the results and respond accordingly:

1) If the SBE messages do not recur, then either reseating resolved it, or it's so marginal that you will need to wait longer for it to show up; may as well leave it in production in this case.

2) If the SBE messages follow the DIMM when you swapped it with its neighbor, then it's very very likely the DIMM (especially if the SBE occurred quickly upon stressing it, both before and after the DIMM move). Present this evidence to Dell Support and ask them to send you a replacement DIMM. KEEP IN MIND that although the replacement DIMM will usually resolve the issue, it has never before been stressed in your setup, and it's possible for your stress patterns to elicit SBEs even in this replacement DIMM. So if the error recurs in that DIMM slot, it's possible that the replacement DIMM also needs to be replaced. You again need to do a neighbor swap to check whether it really is the replacement DIMM.

3) If the SBE stays with the slot after you did the neighbor swap, take this evidence to Dell Support, and see what they say. I would guess they'd have the motherboard and/or CPU swapped. Alternatively, you may wish (use your best judgment) to gather more data by CAREFULLY! swapping CPUs 1 and 2 in that server and seeing whether the SBEs follow the CPU or stay with the slot.
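(To make the record-keeping in A) concrete, here is a minimal sketch of the kind of append-only per-node history I'd keep; the tab-separated layout and the `log_event` helper are my own convention, not a Dell or EDAC format:)

```shell
#!/bin/sh
# Sketch: append a timestamped record of every hardware action (reseat,
# swap, replace) and every observed SBE to a per-node history file, so
# "did the error follow the DIMM or stay with the slot?" can later be
# answered from data rather than from memory.
LOG=${LOG:-./odin-sbe-history.tsv}
log_event() {
    # log_event <kind> <detail>
    printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" >> "$LOG"
}
# Example entries (hypothetical):
log_event swap "DIMM C3 <-> DIMM C4"
log_event sbe  "SBE logging disabled on DIMM C3"
```

A plain-text file like this greps and sorts easily when building a case for Dell Support.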
Just as with DIMMs, it's not unheard of for replacement motherboards and CPUs to also have issues, so don't assume they're perfect -- usually the suitable replacement will resolve the issue fully, but you won't know for sure until you've stressed the system.

What model of PowerEdge are these servers?

PowerEdge systems keep a history of the messages that get printed on the LCD in the System Event Log (SEL), in older days also called the ESM log (embedded systems management, I believe). The SEL is maintained by the BMC or iDRAC. I believe the message you report below ("SBE logging disabled") will be in the SEL. I know the SEL logs messages that indicate that the SBE correction rate has exceeded two successive thresholds (warning and critical).

You can read and clear the SEL using a number of different methods. I'm sure DSET does it. You can also do it with ipmitool, omreport (part of the OpenManage Server Administration (OMSA) tools for the Linux command line), and during POST by hitting Ctrl-E (I think) to get into the BMC or iDRAC POST utility. I'm sure there are other ways; these are the ones I've found useful.

Normal, non-Dell-specific ipmitool will print the SEL records using 'ipmitool sel list', but it does not have the lookup tables and algorithms needed to tell you the name of the affected DIMM on Dell servers. You can also do 'ipmitool sel list -v', which will dump raw field values for each SEL record, and you can decode those raw values to figure out the affected DIMM -- with enough examples (and comparing e.g. to the DIMM names in the Ctrl-E POST SEL view), you might be able to figure out the decoding algorithm on your own, or Google might give you someone who has already figured out the decoding for your specific PowerEdge model. That is the downside of using standard ipmitool. The upside of ipmitool, though, is that it's quite lightweight, and can be used both on localhost and across the network (using IPMI over LAN, if you have it configured appropriately).
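(As a sketch of the lightweight-ipmitool route: poll the SEL on a schedule and keep only the memory-related records, so SBE timestamps can be lined up with job activity later. 'ipmitool sel list' is the standard subcommand; the grep pattern is a guess at what the records contain and may need adjusting per platform:)

```shell
#!/bin/sh
# Sketch: append memory-related SEL records to a local log. Run it from
# cron on the node, or point it at the BMC over the network by adding
# '-H <bmc-host> -U <user> -P <password>' to the ipmitool call.
LOG=${LOG:-./sel-memory.log}
if command -v ipmitool >/dev/null 2>&1; then
    ipmitool sel list 2>/dev/null | grep -iE 'ECC|memory|DIMM' >> "$LOG" || true
    status="polled SEL"
else
    status="ipmitool not installed"
fi
echo "$status"
```
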
The good news is that there's a Dell-specific version of ipmitool available, which adds some Dell-specific capabilities, including the ability to decode DIMM names. This works at least for current PowerEdge R and M servers, as well as older PowerEdge models like the 1950, and probably a few generations older than that. I think it simply supports all models that the corresponding version of OpenManage supports; this does not include older SC servers or current C servers. If you have a model that OpenManage does not support, it may be worth trying, in case it does the right thing for you.

You can get the 'delloem' version of ipmitool from the OpenManage Management Station package. The current latest URL is ftp://ftp.dell.com/sysman/OM-MgmtStat-Dell-Web-LX-6.4.0-1401_A01.tar.gz

Then unpack it and look in ./linux/bmc/ipmitool/ for your OS or a compatible one. For example, looking in the RHEL5_x86_64 subdirectory, the rpm OpenIPMI-tools-2.0.16-99.dell.1.99.1.el5.x86_64.rpm has /usr/bin/ipmitool with 'delloem' appearing as a string internally. (I'm not able to test it right now.)

Once you've installed the appropriate package, do 'ipmitool delloem'; this should tell you what the secondary options are. I believe 'ipmitool delloem sel' will decode the SEL including the correct DIMM names.

If you install OpenManage appropriately, you can also get the SEL decoded, as well as get alerts automatically and immediately sent to syslog. The command line to print a decoded SEL is 'omreport system esmlog'. OpenManage is pretty heavy-weight, though. Some people do install it and leave it running on HPC compute nodes; some people would never do that on a production node. Your mention of getting log messages about the SBEs makes me think you do have OMSA installed and its daemons running -- is that correct? Try 'omreport system esmlog' if so.

Finally, during POST, Ctrl-E at the prompted moment will get you into the BMC or iDRAC POST menu system, in which you can view and optionally clear the SEL.
I do not think this is easily scriptable, but if all else fails, that is one way to view the SEL, with proper decoding.

I know that's long, and I hope that helps you and possibly others.

David

On Thu, Dec 9, 2010 at 1:54 PM, Prentice Bisbal wrote:
> Jon Forrest wrote:
> > On 12/9/2010 8:08 AM, Prentice Bisbal wrote:
> >
> >> So far, mprime appears to be working. I was able to trigger an SBE in 21 hours the first time I ran it. I plan on running it repeatedly for the next few days to see how well it can repeat finding errors.
> >
> > After it finds an error how do you figure out which memory module to replace?
>
> The LCD display on the front of the server tells me, with a message like this:
>
> "SBE logging disabled on DIMM C3. Reseat DIMM"
>
> I can also generate a report with Dell DSET that shows me a similar message. I'm sure there are other tools, but I usually have to create a DSET report to send to Dell, anyway.
>
> --
> Prentice

From prentice at ias.edu Fri Dec 10 08:45:25 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Fri, 10 Dec 2010 08:45:25 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: 
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu> <4D015031.70908@ias.edu>
Message-ID: <4D022EF5.1080801@ias.edu>

David,

--
Prentice

From prentice at ias.edu Fri Dec 10 09:24:30 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Fri, 10 Dec 2010 09:24:30 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: 
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu> <4D015031.70908@ias.edu>
Message-ID: <4D02381E.5090307@ias.edu>

David,

Thanks for the e-mail. Due to its length, I'm not including it in my reply, which I know is normally bad mailing list etiquette.

The server is a Dell PowerEdge R815 with four 8-core AMD processors and 128 GB of RAM. I installed two identical servers at the same time, named frigga and odin (husband and wife in Norse mythology, if you're curious). These nodes are not part of a beowulf cluster, but this is the best forum I know of to discuss problems like this.

Odin is the system with errors, and it started reporting SBE errors almost immediately, even when the system was completely idle. They started within hours of operating system installation, before users were even able to log in to the system.

As you pointed out, I don't think SBE errors are fatal, but I like to address all system errors I identify, no matter how trivial. I find when you get used to ignoring "harmless" errors, you eventually end up ignoring all errors. So, you are right that I'm looking for a tool to quickly and reliably reproduce SBEs so that I can quickly resolve this problem with Dell.
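(One way to turn such runs into comparable numbers across DIMM swaps is to bracket each fixed-length stress run with the kernel's total correctable-error count. A sketch only: it assumes stressapptest, discussed earlier in this thread -- any reproducer, e.g. mprime's stress test, could be substituted -- plus a loaded EDAC driver; without one the counts read as 0:)

```shell
#!/bin/sh
# Sketch: sample the total correctable-error count before and after a
# one-hour stress run, yielding an "errors per run" number that can be
# compared across DIMM swaps and reseats.
count_ce() {
    cat /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null |
        awk '{ s += $1 } END { print s + 0 }'
}
before=$(count_ce)
if command -v stressapptest >/dev/null 2>&1; then
    # -s: seconds to run; -M: megabytes to test (leave the OS some
    # headroom out of the 128 GB)
    stressapptest -s 3600 -M 110000
else
    echo "stressapptest not installed; skipping stress phase" >&2
fi
after=$(count_ce)
echo "correctable errors during run: $((after - before))"
```
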
For reasons I can't discuss here, working with the user is not an option. Due to the nature of my institution, users are only here for a couple of years, anyway, and I'm looking for a tool that I can use long after this user (and his code) are gone.

I have been keeping detailed logs of exactly when the SBE errors occur. And I have been reseating and swapping DIMMS to see if the errors move with the DIMM or stay with the slot, to determine whether it's a bad DIMM or a bad motherboard. On the first occasion, the error did move with the DIMM, and I replaced the DIMM. Since then, the errors have been moving from DIMM to DIMM, even across banks of DIMMS. Since each bank corresponds to a socket, this would indicate that it's not a bad on-chip memory controller -- or they're all bad.

My goal is to find a tool that can reproduce SBE errors in a finite time frame, so that I can run it repeatedly and collect data on where these SBEs occur. I suspect it's a bad motherboard, but unless I have overwhelming data showing that, Dell will just keep replacing the DIMMs, and I'm pretty confident it's not bad DIMMs in this case.

As stated earlier, HPL wasn't reliable for me in this capacity. I'm now using mprime's stress test mode, and will also test stressapptest.

-- Prentice

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From mdidomenico4 at gmail.com Fri Dec 10 10:15:16 2010
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Fri, 10 Dec 2010 15:15:16 +0000
Subject: [Beowulf] tesla benchmarking
Message-ID:

does anyone know of easy-to-run code that would swallow the cpu/memory on the chassis but also a tesla card?
A lot of the tools I typically used in the past that have been ported to GPUs don't seem to use much of the memory, or to keep the whole GPU busy. I'm running through NAMD at the moment, which does seem to make pretty good use of the GPU processor, but it doesn't seem to use much, if any, of the memory. CUDA-Linpack seems to cough up an error at runtime, but hopefully I'll get that going. Still, I'm curious whether there's anything else I don't know about.

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From david.t.kewley at gmail.com Fri Dec 10 14:47:20 2010
From: david.t.kewley at gmail.com (David Kewley)
Date: Fri, 10 Dec 2010 11:47:20 -0800
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <4D02381E.5090307@ias.edu>
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu> <4D015031.70908@ias.edu> <4D02381E.5090307@ias.edu>
Message-ID:

Prentice,

Thanks for filling in some details. What you say makes complete sense to me.

Is it the case that frigga has seen similar stress with no SBE errors? If so, I agree it seems like something else is going on besides bad DIMMs. To test that, if you can schedule simultaneous downtime on the two boxes, you might swap all DIMMs between odin and frigga.

If you do a few DIMM replacements but continue to have the sense that DIMM replacements aren't really solving the problem, and you have good evidence why you think that, I encourage you to make sure Dell Support hears and understands that, and make sure they're looking more holistically than individual DIMMs. They may look more broadly on their own, or you may need to nudge them.
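Keeping score across swaps like that is easy to fumble by hand. A small tally over a running incident log can show whether errors cluster on a slot (board suspect) or follow a serial number (DIMM suspect); the "date slot serial" log format used here is invented for illustration.

```shell
# Tally logged SBEs per slot and per DIMM serial number, to see whether
# the errors stay with the slot (motherboard suspect) or follow the
# part (DIMM suspect).  The "date slot serial" log format is
# hypothetical; adjust the field numbers to whatever you record.
tally_sbe_log() {
    awk '{ slot[$2]++; dimm[$3]++ }
         END {
             for (s in slot) printf "slot %s %d\n", s, slot[s]
             for (d in dimm) printf "dimm %s %d\n", d, dimm[d]
         }' "$@"
}
```

Feeding it the whole history after each swap (e.g. `tally_sbe_log sbe-history.log`) makes the "errors move with the DIMM or stay with the slot" question a matter of reading two columns.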
David On Fri, Dec 10, 2010 at 6:24 AM, Prentice Bisbal wrote: > David, > > Thanks for the e-mail due to it's length, I'm not including it in my reply, > which I know is normally bad mailing list etiquette. > > The server is a Dell PowerEdge R815 with 4 8-Core AMD processors and 128 GB > of RAM. > > I installed two identical servers at the same time, named frigga and odin > (husband and wife in Norse mythology, if your curious). These nodes are not > part of a beowulf cluster, but this is the best forum I know of to discuss > problems like this. > > Odin is the system with errors, and it started reporting SBE errors almost > immediately, even when the system was completely idle. They started within > hours of operating system installation, before users were even able to login > to the system. > > As you pointed out, I don't think SBE errors are fatal, but I like to > address all system errors I identify, no matter how trivial. I find when you > get used to ignoring a "harmless" errors, you eventually end up ignoring all > errors. > > So, you are right that I'm looking for a tool to quickly and reliably > reproduce SBEs so that I can quickly resolve this problem with Dell. For > reasons I can't discuss here, working with the user is not an option. Due to > the nature of my institution, users are only here for a couple of years, > anyway, and I'm looking for a tool that I can use long after this user (and > his code) are gone. > > I have been keeping detailed logs of exactly when the SBE errors occur. And > I have been reseating and swapping DIMMS to see of the errors move with the > DIMM or stay with the slot to determine whether it's a bad DIMM, or a bad > motherboard. In the first occasion, the error did move with the DIMM, and I > replaced the DIMM. Since then, the errors have been moving from DIMM to > DIMM, even across banks of DIMMS. 
Since each bank corresponds to a socket, > this would indicate that it's not a bad on-chip memory controller, or > they're all bad. > > My goal is to find a tool that I can run repeatedly to reproduce SBE errors > in a finite time frame, and then run it repeatedly and collect data on where > these SBEs occur. I suspect it's a bad motherboard, but unless I have > overwhelming data showing that, Dell will just keep replacing the DIMMs, and > I'm pretty confident it's not bad DIMMs in this case. > > As stated earlier, HPL wasn't reliable for me in this capacity. I'm now > using mprime's stress test mode, and will also test stressapptest. > > > -- > Prentice > > -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mathog at caltech.edu Fri Dec 10 16:18:02 2010 From: mathog at caltech.edu (David Mathog) Date: Fri, 10 Dec 2010 13:18:02 -0800 Subject: [Beowulf] Memory stress testing tools Message-ID: Prentice Bisbal wrote: > The server is a Dell PowerEdge R815 with 4 8-Core AMD processors and 128 > GB of RAM. If the erroneous memory locations are moving around in memory without correlation to the DIMMs then the next most likely culprits are a marginal power supply, CPU, or motherboard, in pretty much that order. (OK, kind of a toss up for CPU vs. motherboard, but since you have 32 cores in the system I put it first.) If you have access to an oscilloscope look closely at the voltages on the two machines. No need to cut in anywhere, just measure +5 and +12V on an unused disk or fan connector. 
If the machine prone to memory errors is significantly noisier than the one that is not, that could be the problem. I have seen this exactly once - all PS testers said it was good, and a multimeter had it pegged at the right voltages, but there was a ton of high frequency noise coming out of the power supply. If you can disable CPUs through the BIOS on that machine, running for a while under each CPU alone might narrow the issue down to 1 of the 4. You wouldn't be done then though, because it could be the socket and not the CPU itself. Still, if you can get it down to 1 CPU then you could swap that with another and see if the issue moves with it. You probably already did this, but be sure both machines have the same BIOS release. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From samuel at unimelb.edu.au Mon Dec 13 17:43:19 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 14 Dec 2010 09:43:19 +1100 Subject: [Beowulf] IB symbol error thresholds for health check scripts ? Message-ID: <4D06A187.2080201@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi folks, We run a bunch of health checks [1] on a compute node through Torque [2] and if they fail the node gets knocked offline. One of the checks we do is to check that there are no symbol errors on the IB link. However, I'm wondering if simply saying a single error is too brutal for this - what do other people do about these ? cheers! 
Chris [1] - for the record we check things like - amount of RAM, failed DIMMs (via IPMI on IBM or memlog on SGI), number of cores, number and speed of CPUs, LDAP OK, home directories accessible, etc. [2] - checks run prior to a job start, after a job exits and every 7.5 minutes (every 10 mom intervals). - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk0GoYcACgkQO2KABBYQAh9w1gCgh19IOhXa5BWOmC3+qyZaDDr/ MrYAn1at4YwaaNkmmZpNAVNHBF0OIH0V =/gDC -----END PGP SIGNATURE----- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From stuartb at 4gh.net Wed Dec 29 13:29:21 2010 From: stuartb at 4gh.net (Stuart Barkley) Date: Wed, 29 Dec 2010 13:29:21 -0500 (EST) Subject: [Beowulf] IB symbol error thresholds for health check scripts ? In-Reply-To: <4D06A187.2080201@unimelb.edu.au> References: <4D06A187.2080201@unimelb.edu.au> Message-ID: On Mon, 13 Dec 2010 at 17:43 -0000, Christopher Samuel wrote: > We run a bunch of health checks [1] on a compute node through Torque > [2] and if they fail the node gets knocked offline. Can you share these scripts? I'm needing to get something started along these lines (torque, Moab, Infiniband, IBM system x, xCAT). I'm sure I'll find things needing adaption to our environment. > One of the checks we do is to check that there are no symbol errors > on the IB link. 
However, I'm wondering if simply saying a single
> error is too brutal for this - what do other people do about these ?

I'm looking at Infiniband problems currently and have been watching our SymbolErrorCounter values. I'm told a "small number" of these errors are okay. I don't know the definition of "small" or over how long a time period.

Over the last week, 24 of our nodes have shown at least two errors. Of these, 6 nodes are showing over 400 errors (450-30000), and these nodes need attention (I've manually downed them until I can get to the hardware). The remaining nodes are all < 50 errors, with half of those < 10.

I'm planning to do more proactive monitoring of the Infiniband fabric. The current toolset is very awkward to use for monitoring. There is an updated Infiniband Fabric Suite from QLogic which appears to improve this significantly. It should be possible to do the Infiniband monitoring completely off node so as to not perturb the computations too much.

> [1] - for the record we check things like - amount of RAM, failed
> DIMMs (via IPMI on IBM or memlog on SGI), number of cores, number
> and speed of CPUs, LDAP OK, home directories accessible, etc.

All things we need to check. I manually found several of our nodes running with one disabled RAM stick.

> [2] - checks run prior to a job start, after a job exits and every
> 7.5 minutes (every 10 mom intervals).

Also when the node comes up, before mom starts, I assume?

Stuart Barkley
--
I've never been lost; I was once bewildered for three days, but never lost!
-- Daniel Boone

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
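That manual threshold (over ~400 symbol errors in a week means the node needs attention) is simple to automate once per-node counter deltas are dumped to text. The "node count" input format below is an assumption; produce it however your fabric monitoring exports counters (e.g. by post-processing perfquery output).

```shell
# Flag nodes whose weekly SymbolErrorCounter delta exceeds a limit, so
# they can be marked offline for hardware checks.  The "node count"
# input format is hypothetical -- generate it however your fabric
# monitoring dumps per-port counters.
flag_noisy_nodes() {
    awk -v limit="${1:-400}" '$2 + 0 > limit { print $1, $2 }'
}
```

If you trust the threshold, piping the first column into your scheduler's offline command (e.g. `pbsnodes -o` under Torque) would automate the manual downing as well.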
From a.travis at abdn.ac.uk Thu Dec 9 07:16:34 2010
From: a.travis at abdn.ac.uk (Tony Travis)
Date: Thu, 09 Dec 2010 12:16:34 +0000
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <4CFE66E2.6030805@ias.edu> References: <4CFE66E2.6030805@ias.edu> Message-ID: <4D00C8A2.3080008@abdn.ac.uk> On 07/12/10 16:54, Prentice Bisbal wrote: > Dear Beowulfers, > > Can any of you recommend a good RAM stress testing tool? > > I have a server with 128GB of RAM that keeps reporting single-bit > errors. Every time this happens, I reseat the DIMMS or swap them around, > and then run some large MPI jobs with I hope stress the RAM. Sometimes > this produces more SBEs, sometimes it doesn't. When the system seems > stable, I let the users back on it, and sure enough, they get it to > start reporting SBEs in short order. Hi, Prentice. Have you tried Charles Cazabon's user-space "memtester" program: http://pyropus.ca/software/memtester/ It doesn't test *all* the memory, just what it can lock, but it does stress the memory sub-system in the same way that applications do... Bye, Tony. -- Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Thu Dec 9 10:59:16 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 09 Dec 2010 10:59:16 -0500 Subject: [Beowulf] Memory stress testing tools. 
In-Reply-To: References:
Message-ID: <4D00FCD4.6020203@ias.edu>

On 12/07/2010 04:35 PM, David Mathog wrote:
>> True, but this is a multi-user system, so I don't know which user's code
>> is triggering the errors, nor do I know what usage pattern causes the
>> errors, so I'm looking for something more consistent. Well, I hope it
>> will be more consistent.
>
> Try setting up a script to take snapshots of the system every 15 seconds
> or so. Something like:
>
> while true; do
>     ( date; top -b -n 1 | head -10 ) >> "$LOGFILE"
>     sleep 15
> done
>
> Then using the memory error time stamps go back through those logs to
> find the most likely culprits.

That will identify the program, but not the problem size or data set being used that triggers the error. Using a stress test that I control removes this detective work. I've decided to go with mprime from the gimps project, which has a stress test feature: http://www.mersenne.org/

-- Prentice

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From prentice at ias.edu Thu Dec 9 11:08:28 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu, 09 Dec 2010 11:08:28 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: References: <4CFE66E2.6030805@ias.edu>
Message-ID: <4D00FEFC.2080509@ias.edu>

On 12/08/2010 11:47 AM, Jason Clinton wrote:
> On Tue, Dec 7, 2010 at 10:54, Prentice Bisbal
> wrote:
>
> Can any of you recommend a good RAM stress testing tool?
>
>
> We have an open source ISO/netboot image that can stress-test using the
> latest Linux kernel EDAC facilities and HPL as the test code. It's
> posted here: http://www.advancedclustering.com/software/breakin.html
>
> It's intended to be booted into.
>
> There's a beta of a slightly newer version posted at:
> http://lab.advancedclustering.com/bootimage/
>
> I would be interested in any feedback you have on either version.

Jason,

I know breakin well. I used it quite a bit in 2008 when I was stress-testing my then-new cluster, and sent some feedback to the developer at the time (last name Shoemaker, I think). I did find that I could run it for days on all my cluster nodes, and then a few days later, when running HPL as a single job across all the nodes, I'd get memory errors. I haven't used it since. Not because I don't like it, but I just haven't had a need for it since then.

I've also been testing this node by running a single HPL job across all 32 cores myself, and even after days of doing this, I couldn't trigger any errors, but a user program could trigger an error in only a couple of hours.

Based on these experiences, I don't think that HPL is good at stressing RAM. Has anyone else had similar experiences?

Since this system has 128 GB of RAM, I think it's a good assumption that many programs might not use all of that RAM, so I need something memory-specific that I know will hit all 128 GB of RAM.

So far, mprime appears to be working. I was able to trigger an SBE in 21 hours the first time I ran it. I plan on running it repeatedly for the next few days to see how well it can repeat finding errors.

-- Prentice Bisbal

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From prentice at ias.edu Thu Dec 9 11:36:32 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu, 09 Dec 2010 11:36:32 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: References:
Message-ID: <4D010590.3030900@ias.edu>

On 12/07/2010 06:58 PM, David Mathog wrote:
> Try stressapptest.
>
> http://code.google.com/p/stressapptest/
>
> Note that it has a bizarre behavior where no matter how high you set N,
> the sum of their CPU usage is always 100%, even though they are not
> all running on one core on a multi-core system. To saturate all of the
> cores one must force things; see this thread (where I talk to myself,
> the solution is at the end):
>
> http://groups.google.com/group/stressapptest-discuss/browse_thread/thread/882537e9f3f7d3f2
>

Thanks for the info. stressapptest looks like a great tool. I'm going to give it a try when I'm done trying out mprime. I want to see which one triggers the error quicker/more consistently.

-- Prentice

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From jlforrest at berkeley.edu Thu Dec 9 12:00:08 2010
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Thu, 09 Dec 2010 09:00:08 -0800
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <4D00FEFC.2080509@ias.edu>
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu>
Message-ID: <4D010B18.4000305@berkeley.edu>

On 12/9/2010 8:08 AM, Prentice Bisbal wrote:
> So far, mprime appears to be working. I was able to trigger an SBE in 21
> hours the first time I ran it. I plan on running it repeatedly for the
> next few days to see how well it can repeat finding errors.

After it finds an error how do you figure out which memory module to replace?
-- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Thu Dec 9 16:51:44 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 09 Dec 2010 16:51:44 -0500 Subject: [Beowulf] Memory stress testing tools. In-Reply-To: References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> Message-ID: <4D014F70.4000103@ias.edu> Jason Clinton wrote: > On Thu, Dec 9, 2010 at 10:08, Prentice Bisbal > wrote: > > I know breakin well. I used it a quite a bit a in 2008 when I was > stress-testing my then-new cluster, and sent some feedback to the > developer at the time (last name Shoemaker, I think). I did find that I > could run it for days on all my cluster nodes, and then a few days > later, when running a HPL as a single job across all the nodes, I'd get > memory errors. I haven't used it since. Not because I don't like it, but > I just haven't had a need for it since then. > > > Hum. It's possible that EDAC support for your chipset didn't exist at > the time. AMD and Intel have been pretty good about landing EDAC for > their chips in vanilla upstream kernels for the past year and so that is > why it is important to use a recent kernel. Or at least one with recent > backports of that work. At the time, I was using the latest version of Breakin available. I was testing on AMD Barcelona processors. I was using Breakin in September/October 2008, and the Barcelona processors came out in March - May of that year. 
I would assume that would be enough time for support for the new processors to trickle down to breakin, but that's just an assumption; I can't confirm/prove that.

>
>
> I've also been testing this node by running a single HPL job across all
> 32 cores myself, and even after days of doing this, I couldn't trigger
> any errors, but a user program could trigger an error in only a couple
> of hours.
>
> Based on these experiences, I don't think that HPL is good at stressing
> RAM. Has anyone else had similar experiences?
>
>
> HPL is among the most memory intensive workloads out there. This is why
> architectural changes in the past few years that have increased the
> aggregate memory bandwidth of the architecture have resulted in higher
> measured platform efficiency.
>
> My guess would be that the difference you've seen between the two would
> be statistical noise. How are you measuring errors? MCE events?

I don't think this is statistical noise. This system has consistently reported SBE errors since it was installed several months ago. I've probably tried to trigger SBEs with HPL dozens of times. I'll often run it 2-3 times in a row without triggering errors over a period of several days. When the users go back to using this server, they usually trigger errors in less time than that. I think HPL resulted in triggering the error only a couple of times.

The system is a Dell PowerEdge something or other. It has an LCD display that is normally blue. When a hardware error is detected, it turns orange and shows the error. I check that several times a day. Our central log server also e-mails any critical log errors that get sent to it, so even if I didn't check the display on the front of the server, I'll receive an e-mail shortly after the error is logged in my system logs. It's low tech, but it works.
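A similarly low-tech check works without the front-panel LCD or a central log server: grep the node's own log for ECC-looking lines. The pattern and default path below are guesses to adapt, since EDAC/MCE message wording varies across kernels, drivers, and syslog configurations.

```shell
# Pull ECC/EDAC-looking lines out of a log file as a crude SBE alert.
# The regex and the default log location are assumptions: EDAC and MCE
# message wording varies between kernels, drivers, and syslog setups.
scan_for_ecc() {
    grep -Ei 'EDAC|ECC|single-bit' "${1:-/var/log/messages}"
}

# Example use from cron (mail command and paths are illustrative):
#   scan_for_ecc /var/log/messages | mail -s "SBE report" root
```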
> > > > Since this system has 128 GB of RAM, I think it's a good assumption that > many programs might not use all of that RAM, so I need something memory > specific that I know will hit all 128 GB of RAM. > > > Breakin uses the same algorithm at > http://www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html > to calculate the "N" size which will consume 90% of the RAM of a system > using all cores (in as close to square grid as possible). > > > > So far, mprime appears to be working. I was able to trigger an SBE in 21 > hours the first time I ran it. I plan on running it repeatedly for the > next few days to see how well it can repeat finding errors. > > > I'm curious what kernel you're running that is giving you EDAC > reporting. Or are you rebooting after an MCE and examining the system > event logs? > > > -- > Jason D. Clinton, Advanced Clustering Technologies > 913-643-0306 -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Thu Dec 9 16:54:57 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 09 Dec 2010 16:54:57 -0500 Subject: [Beowulf] Memory stress testing tools. In-Reply-To: <4D010B18.4000305@berkeley.edu> References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu> Message-ID: <4D015031.70908@ias.edu> Jon Forrest wrote: > On 12/9/2010 8:08 AM, Prentice Bisbal wrote: > >> So far, mprime appears to be working. I was able to trigger an SBE in 21 >> hours the first time I ran it. I plan on running it repeatedly for the >> next few days to see how well it can repeat finding errors. > > After it finds an error how do you > figure out which memory module to > replace? 
>

The LCD display on the front of the server tells me, with a message like this:

"SBE logging disabled on DIMM C3. Reseat DIMM"

I can also generate a report with DELL DSET that shows me a similar message. I'm sure there are other tools, but I usually have to create a DSET report to send to Dell, anyway.

-- Prentice

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From david.t.kewley at gmail.com Thu Dec 9 21:14:41 2010
From: david.t.kewley at gmail.com (David Kewley)
Date: Thu, 9 Dec 2010 18:14:41 -0800
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <4D015031.70908@ias.edu>
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu> <4D015031.70908@ias.edu>
Message-ID:

Prentice,

You only asked for memory testing programs, but I'm going to go a bit further, to make sure some background issues are covered, and to give you some ideas you might not yet have. Some of this is based on a lot of experience with Dell servers in HPC.

Some of my background thoughts on dealing with SBEs:

1) Complete and detailed historical records are important for correctly and efficiently resolving these types of errors, especially on larger clusters. Otherwise it's too easy to get confused about what happened when, and come to incorrect conclusions about problems and solutions. Treat it like a lab experiment -- keep a log book or equivalent, test your hypotheses against the data, and think broadly about what alternative hypotheses may exist.

2) The resolution process will be iterative, with physical manipulations (e.g.
moving DIMMs among slots) alternating with monitoring for SBEs and
optionally running stress applications to attempt to trigger SBEs (a
"reproducer" of the SBEs).

3) For efficient resolution, you want a quick, reliable reproducer,
something that will trigger the SBEs quickly.

4) I've seen no evidence that SBEs materially affect performance or
correctness on a server, so my practice has often been to leave affected
servers in production as much as possible, taking them out of production
(after draining jobs) only briefly to move DIMMs, replace DIMMs, etc.

Regarding (4), if anyone here has measurements or a URL to a study saying
in what circumstances there's a significant material risk to performance
or correctness of calculation with SBE correction, I'd love to see that.
I'm not saying that SBE correction is completely free performance-wise --
I bet it takes a little time to do the correction, but I bet that for
normal SBE correction rates, that time is (nearly) unmeasurable.

Also, over a few thousand server-years, I've never or almost never seen
SBE corrections morph into uncorrectable multi-bit errors. When
uncorrectable errors have shown up (which itself has been rare in my
experience, mostly in a single situation where there was a server bug
that got corrected), they've shown up early in a server's life, not after
a long period of only seeing SBEs.

Prentice, I believe you started this thread because you need something
for (3), is that right? As David Mathog said, you already know what
activity most reliably triggers SBE corrections: your users' code.

If I were in your shoes, and I had time and/or were concerned about issue
(4) above, I'd a) identify which user, code, and specific runs trigger
SBEs the most, then b) if possible, work with that user to get a
single-node version of a similar run that you could use outside
production to reproduce and resolve SBEs.
I'd then monitor for SBEs in production, and when they occur, drain jobs
from those nodes and take them out of production so I could use that
single-node user job to satisfy (2) and (3) above.

If I were in your shoes and was NOT concerned about (4), I'd simply drain
the running job, do a manipulation (2 above), and put the node back into
production, waiting for the SBE to recur if it is going to. This is what
I've often done. Or if you have a dev/test cluster, replace the entire
production node with a tested, known-good node from the dev/test cluster,
then test/fix the SBE server in the context of the dev/test cluster. I've
also often done this.

My experience has been that long runs of single-node HPL were the best
SBE trigger I ever found. Dell's mpmemory did not do as well. I believe
memtest86{,+} also didn't find problems that HPL found, though I didn't
test memtest86{,+} as much. It also was not immediately obvious how to
gather the memory test results from mpmemory and memtest86{,+}, though it
can probably be done, perhaps easily, with a bit of R&D.

But since you've found that HPL does not trigger SBEs as much as your
user's code, I think you have a very good pointer that you should do
stress tests with your user's code if at all possible. If you can share
what the stressful app is, and any of the relevant run parameters, that
would probably be interesting to folks on this list.

In my experience, SBEs are usually resolved by reseating or replacing the
affected DIMM. However, it can also be an issue on the motherboard
(sockets or traces or something else), or possibly the CPU (because Intel
and AMD now both have the memory controllers on-die), or possibly a BIOS
issue (if a CPU- or memory-related parameter isn't set quite optimally by
the BIOS you're running; BIOSes may set hardware parameters without your
awareness or the ability to tune them yourself).
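As an aside, the "N that consumes 90% of RAM" rule for single-node HPL
runs, mentioned earlier in the thread in connection with Breakin and the
Advanced Clustering HPL.dat tuning page, can be sketched like this. HPL
factors an N x N matrix of 8-byte doubles, so solve 0.9 * mem = 8 * N^2
for N and round down to a multiple of the block size; NB=192 here is just
an illustrative choice, not a recommendation.

```python
import math

def hpl_problem_size(mem_bytes, fraction=0.90, nb=192):
    """Largest N (aligned to block size nb) whose N x N matrix of
    8-byte doubles fits in `fraction` of the given memory."""
    n = int(math.sqrt(fraction * mem_bytes / 8))
    return (n // nb) * nb   # align N down to a multiple of NB

# For the 128 GB server discussed in this thread:
n = hpl_problem_size(128 * 1024**3)   # -> 124224 with NB=192
```

The alignment step matters in practice: HPL performs best when N is a
multiple of the block size NB chosen in HPL.dat.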
Best practice may be:

A) swap the DIMM where the SBE occurred with a neighbor that underwent
similar stress but did not show any SBEs. Keep a permanent record of
which DIMMs you swapped and when, as well as all error messages and their
timing.

B) re-stress either in production (if you believe my tentative assertion
(in 4 above) that SBE corrections do not materially affect performance or
correctness), or using your reliable reproducer for an amount of time
that you know should usually re-trigger the SBE if it is going to recur.

C) assess the results and respond accordingly:

1) if the SBE messages do not recur, then either reseating resolved it,
or it's so marginal that you will need to wait longer for it to show up;
may as well leave it in production in this case

2) if the SBE messages follow the DIMM when you swapped it with its
neighbor, then it's very, very likely the DIMM (especially if the SBE
occurred quickly upon stressing it, both before and after the DIMM move).
Present this evidence to Dell Support and ask them to send you a
replacement DIMM.

KEEP IN MIND that although the replacement DIMM will usually resolve the
issue, it has never before been stressed in your setup, and it's possible
for your stress patterns to elicit SBEs even in this replacement DIMM. So
if the error recurs in that DIMM slot, it's possible that the replacement
DIMM also needs to be replaced. You again need to do a neighbor swap to
check whether it really is the replacement DIMM.

3) If the SBE stays with the slot after you did the neighbor swap, take
this evidence to Dell Support, and see what they say. I would guess
they'd have the motherboard and/or CPU swapped. Alternatively, you may
wish (use your best judgment) to gather more data by CAREFULLY! swapping
CPUs 1 and 2 in that server and seeing whether the SBEs follow the CPU or
stay with the slot.
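The neighbor-swap bookkeeping above can be reduced to a small script:
log each SBE with the DIMM's serial number and the slot it occupied at
the time, then see whether errors cluster by serial (suspect the DIMM) or
by slot (suspect the board, socket, or CPU). The serials and slots below
are hypothetical examples, not real data.

```python
from collections import Counter

def localize_sbes(events):
    """events: iterable of (dimm_serial, slot) tuples, one per SBE report.
    Returns the (serial, count) and (slot, count) with the most errors."""
    by_dimm = Counter(serial for serial, _ in events)
    by_slot = Counter(slot for _, slot in events)
    return by_dimm.most_common(1)[0], by_slot.most_common(1)[0]

# Before a swap, DIMM "S123" sat in slot C3; after the swap it sat in C4.
events = [("S123", "C3"), ("S123", "C3"), ("S123", "C4")]
top_dimm, top_slot = localize_sbes(events)
# All three errors follow serial S123, so the DIMM is the prime suspect.
```

This is only as good as the log book behind it, which is David's point
(1): record the physical serial-to-slot mapping at every swap.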
Just as with DIMMs, it's not unheard of for replacement motherboards and
CPUs to also have issues, so don't assume they're perfect -- usually the
suitable replacement will resolve the issue fully, but you won't know for
sure until you've stressed the system.

What model of PowerEdge are these servers?

PowerEdge systems keep a history of the messages that get printed on the
LCD in the System Event Log (SEL), in older days also called the ESM log
(embedded systems management, I believe). The SEL is maintained by the
BMC or iDRAC. I believe the message you report below ("SBE logging
disabled") will be in the SEL. I know the SEL logs messages that indicate
that the SBE correction rate has exceeded two successive thresholds
(warning and critical).

You can read and clear the SEL using a number of different methods. I'm
sure DSET does it. You can also do it with ipmitool, omreport (part of
the OpenManage Server Administrator (OMSA) tools for the Linux command
line), and during POST by hitting Ctrl-E (I think) to get into the BMC or
iDRAC POST utility. I'm sure there are other ways; these are the ones
I've found useful.

Normal, non-Dell-specific ipmitool will print the SEL records using
'ipmitool sel list', but it does not have the lookup tables and
algorithms needed to tell you the name of the affected DIMM on Dell
servers. You can also do 'ipmitool sel list -v', which will dump raw
field values for each SEL record, and you can decode those raw values to
figure out the affected DIMM -- with enough examples (and comparing e.g.
to the DIMM names in the Ctrl-E POST SEL view), you might be able to
figure out the decoding algorithm on your own, or Google might give you
someone who has already figured out the decoding for your specific
PowerEdge model. That is the downside of using standard ipmitool. The
upside of ipmitool, though, is that it's quite lightweight, and can be
used both on localhost and across the network (using IPMI over LAN, if
you have it configured appropriately).
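As a sketch of the kind of monitoring described above, correctable-ECC
records can be tallied straight out of standard 'ipmitool sel list'
output. The record layout assumed here (pipe-separated fields with a
"Correctable ECC" event description) is an approximation of typical
output and may differ by platform, so treat the pattern as a starting
point, not gospel.

```shell
#!/bin/sh
# Count correctable-ECC (SBE) records in 'ipmitool sel list'-style output
# read from stdin. The leading '| ' in the pattern keeps it from also
# matching "Uncorrectable ECC" records.
count_sbe() {
    grep -ci '| correctable ecc'
}

# Typical usage on a live node (requires ipmitool and BMC access):
#   ipmitool sel list | count_sbe
```

Run periodically from cron and diffed against the previous count, this
gives a crude SBE-rate monitor without installing OpenManage.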
The good news is that there's a Dell-specific version of ipmitool
available, which adds some Dell-specific capabilities, including decoding
DIMM names. This works at least for current PowerEdge R and M servers, as
well as older PowerEdge models like the 1950, and probably a few
generations older than that. I think it simply supports all models that
the corresponding version of OpenManage supports; this does not include
older SC servers or current C servers. If you have a model that
OpenManage does not support, it may be worth trying, in case it does the
right thing for you.

You can get the 'delloem' version of ipmitool from the OpenManage
Management Station package. The current URL is
ftp://ftp.dell.com/sysman/OM-MgmtStat-Dell-Web-LX-6.4.0-1401_A01.tar.gz

Then unpack it and look in ./linux/bmc/ipmitool/ for your OS or a
compatible one. For example, looking in the RHEL5_x86_64 subdirectory,
the rpm OpenIPMI-tools-2.0.16-99.dell.1.99.1.el5.x86_64.rpm has
/usr/bin/ipmitool with 'delloem' appearing as a string internally. (I'm
not able to test it right now.)

Once you've installed the appropriate package, do 'ipmitool delloem';
this should tell you what the secondary options are. I believe 'ipmitool
delloem sel' will decode the SEL, including the correct DIMM names.

If you install OpenManage appropriately, you can also get the SEL
decoded, as well as get alerts automatically and immediately sent to
syslog. The command line to print a decoded SEL is 'omreport system
esmlog'. OpenManage is pretty heavyweight, though. Some people do install
it and leave it running on HPC compute nodes; some people would never do
that on a production node.

Your mention of getting log messages about the SBEs makes me think you do
have OMSA installed and its daemons running -- is that correct? Try
'omreport system esmlog' if so.

Finally, during POST, Ctrl-E at the prompted moment will get you into the
BMC or iDRAC POST menu system, in which you can view and optionally clear
the SEL.
I do not think this is easily scriptable, but if all else fails, that is
one way to view the SEL, with proper decoding.

I know that's long, and I hope that helps you and possibly others.

David

On Thu, Dec 9, 2010 at 1:54 PM, Prentice Bisbal wrote:
> Jon Forrest wrote:
> > On 12/9/2010 8:08 AM, Prentice Bisbal wrote:
> >
> >> So far, mprime appears to be working. I was able to trigger an SBE in 21
> >> hours the first time I ran it. I plan on running it repeatedly for the
> >> next few days to see how well it can repeat finding errors.
> >
> > After it finds an error how do you
> > figure out which memory module to
> > replace?
>
> The LCD display on the front of the server tells me, with a message like
> this:
>
> "SBE logging disabled on DIMM C3. Reseat DIMM"
>
> I can also generate a report with Dell DSET that shows me a similar
> message. I'm sure there are other tools, but I usually have to
> create a DSET report to send to Dell, anyway.
>
> --
> Prentice

From prentice at ias.edu Fri Dec 10 08:45:25 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Fri, 10 Dec 2010 08:45:25 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To:
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu> <4D015031.70908@ias.edu>
Message-ID: <4D022EF5.1080801@ias.edu>

David,

--
Prentice

From prentice at ias.edu Fri Dec 10 09:24:30 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Fri, 10 Dec 2010 09:24:30 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To:
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu> <4D015031.70908@ias.edu>
Message-ID: <4D02381E.5090307@ias.edu>

David,

Thanks for the e-mail. Due to its length, I'm not including it in my
reply, which I know is normally bad mailing list etiquette.

The server is a Dell PowerEdge R815 with 4 8-core AMD processors and 128
GB of RAM.

I installed two identical servers at the same time, named frigga and
odin (husband and wife in Norse mythology, if you're curious). These
nodes are not part of a beowulf cluster, but this is the best forum I
know of to discuss problems like this.

Odin is the system with errors, and it started reporting SBEs almost
immediately, even when the system was completely idle. They started
within hours of operating system installation, before users were even
able to log in to the system.

As you pointed out, I don't think SBEs are fatal, but I like to address
all system errors I identify, no matter how trivial. I find that when
you get used to ignoring "harmless" errors, you eventually end up
ignoring all errors.

So, you are right that I'm looking for a tool to quickly and reliably
reproduce SBEs so that I can quickly resolve this problem with Dell.
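For a repeatable run of the kind Prentice describes, something along
these lines might work with stressapptest (its -M flag takes megabytes
of memory to test and -s the runtime in seconds, per its documentation).
Sizing the test region to ~90% of RAM is an assumption chosen to cover
most of memory without inviting the OOM killer.

```shell
#!/bin/sh
# Sketch of a repeatable SBE-hunting run with stressapptest.
# mem_to_test_mb: given MemTotal in kB (as reported by /proc/meminfo),
# return ~90% of it in MB, leaving headroom for the OS.
mem_to_test_mb() {
    kb=$1
    echo $(( kb * 9 / 10 / 1024 ))
}

# Typical usage on the 128 GB node (requires the stressapptest binary):
#   kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
#   stressapptest -M "$(mem_to_test_mb "$kb")" -s 86400   # 24-hour run
```

Logging each run's start time alongside the SEL makes it easy to
correlate triggered SBEs with specific test passes.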
For reasons I can't discuss here, working with the user is not an
option. Due to the nature of my institution, users are only here for a
couple of years, anyway, and I'm looking for a tool that I can use long
after this user (and his code) are gone.

I have been keeping detailed logs of exactly when the SBEs occur. And I
have been reseating and swapping DIMMs to see if the errors move with
the DIMM or stay with the slot, to determine whether it's a bad DIMM or
a bad motherboard. On the first occasion, the error did move with the
DIMM, and I replaced the DIMM. Since then, the errors have been moving
from DIMM to DIMM, even across banks of DIMMs. Since each bank
corresponds to a socket, this would indicate that it's not a bad on-chip
memory controller, or they're all bad.

My goal is to find a tool that I can run repeatedly to reproduce SBEs in
a finite time frame, and then run it repeatedly and collect data on
where these SBEs occur. I suspect it's a bad motherboard, but unless I
have overwhelming data showing that, Dell will just keep replacing the
DIMMs, and I'm pretty confident it's not bad DIMMs in this case.

As stated earlier, HPL wasn't reliable for me in this capacity. I'm now
using mprime's stress test mode, and will also test stressapptest.

--
Prentice

From mdidomenico4 at gmail.com Fri Dec 10 10:15:16 2010
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Fri, 10 Dec 2010 15:15:16 +0000
Subject: [Beowulf] tesla benchmarking
Message-ID:

does anyone know of easy to run code that would swallow the cpu/memory
on the chassis but also a tesla card?
A lot of the tools I typically used in the past that have been ported to
GPUs don't seem to use up much of the memory, or to use all of the GPU
constantly.

I'm running through NAMD at the moment, which does seem to make pretty
good use of the GPU processor, but it doesn't seem to use much, if any,
of the memory.

CUDA-Linpack seems to cough up an error at runtime, but hopefully I'll
get that going. I'm curious whether there's anything else I didn't know
about.

From david.t.kewley at gmail.com Fri Dec 10 14:47:20 2010
From: david.t.kewley at gmail.com (David Kewley)
Date: Fri, 10 Dec 2010 11:47:20 -0800
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <4D02381E.5090307@ias.edu>
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu> <4D015031.70908@ias.edu> <4D02381E.5090307@ias.edu>
Message-ID:

Prentice,

Thanks for filling in some details. What you say makes complete sense to
me.

Is it the case that frigga has seen similar stress with no SBEs? If so,
I agree it seems like something else is going on besides bad DIMMs. To
test that, if you can schedule simultaneous downtime on the two boxes,
you might swap all DIMMs between odin and frigga.

If you do a few DIMM replacements, but continue to have the sense that
DIMM replacements aren't really solving the problem, and you have good
evidence why you think that, I encourage you to make sure Dell Support
hears and understands that, and make sure they're looking more
holistically than at individual DIMMs. They may look more broadly on
their own, or you may need to nudge them.
David

On Fri, Dec 10, 2010 at 6:24 AM, Prentice Bisbal wrote:
> David,
>
> Thanks for the e-mail. Due to its length, I'm not including it in my
> reply, which I know is normally bad mailing list etiquette.
>
> The server is a Dell PowerEdge R815 with 4 8-core AMD processors and
> 128 GB of RAM.
>
> I installed two identical servers at the same time, named frigga and
> odin (husband and wife in Norse mythology, if you're curious). These
> nodes are not part of a beowulf cluster, but this is the best forum I
> know of to discuss problems like this.
>
> Odin is the system with errors, and it started reporting SBEs almost
> immediately, even when the system was completely idle. They started
> within hours of operating system installation, before users were even
> able to log in to the system.
>
> As you pointed out, I don't think SBEs are fatal, but I like to address
> all system errors I identify, no matter how trivial. I find that when
> you get used to ignoring "harmless" errors, you eventually end up
> ignoring all errors.
>
> So, you are right that I'm looking for a tool to quickly and reliably
> reproduce SBEs so that I can quickly resolve this problem with Dell.
> For reasons I can't discuss here, working with the user is not an
> option. Due to the nature of my institution, users are only here for a
> couple of years, anyway, and I'm looking for a tool that I can use long
> after this user (and his code) are gone.
>
> I have been keeping detailed logs of exactly when the SBEs occur. And I
> have been reseating and swapping DIMMs to see if the errors move with
> the DIMM or stay with the slot, to determine whether it's a bad DIMM or
> a bad motherboard. On the first occasion, the error did move with the
> DIMM, and I replaced the DIMM. Since then, the errors have been moving
> from DIMM to DIMM, even across banks of DIMMs.
> Since each bank corresponds to a socket, this would indicate that it's
> not a bad on-chip memory controller, or they're all bad.
>
> My goal is to find a tool that I can run repeatedly to reproduce SBEs
> in a finite time frame, and then run it repeatedly and collect data on
> where these SBEs occur. I suspect it's a bad motherboard, but unless I
> have overwhelming data showing that, Dell will just keep replacing the
> DIMMs, and I'm pretty confident it's not bad DIMMs in this case.
>
> As stated earlier, HPL wasn't reliable for me in this capacity. I'm now
> using mprime's stress test mode, and will also test stressapptest.
>
> --
> Prentice

From mathog at caltech.edu Fri Dec 10 16:18:02 2010
From: mathog at caltech.edu (David Mathog)
Date: Fri, 10 Dec 2010 13:18:02 -0800
Subject: [Beowulf] Memory stress testing tools
Message-ID:

Prentice Bisbal wrote:

> The server is a Dell PowerEdge R815 with 4 8-Core AMD processors and
> 128 GB of RAM.

If the erroneous memory locations are moving around in memory without
correlation to the DIMMs, then the next most likely culprits are a
marginal power supply, CPU, or motherboard, in pretty much that order.
(OK, kind of a toss-up for CPU vs. motherboard, but since you have 32
cores in the system I put it first.)

If you have access to an oscilloscope, look closely at the voltages on
the two machines. No need to cut in anywhere; just measure +5 and +12V
on an unused disk or fan connector.
If the machine prone to memory errors is significantly noisier than the
one that is not, that could be the problem. I have seen this exactly
once: all PS testers said it was good, and a multimeter had it pegged at
the right voltages, but there was a ton of high-frequency noise coming
out of the power supply.

If you can disable CPUs through the BIOS on that machine, running for a
while under each CPU alone might narrow the issue down to one of the
four. You wouldn't be done then, though, because it could be the socket
and not the CPU itself. Still, if you can get it down to one CPU, then
you could swap that with another and see if the issue moves with it.

You probably already did this, but be sure both machines have the same
BIOS release.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From samuel at unimelb.edu.au Mon Dec 13 17:43:19 2010
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Tue, 14 Dec 2010 09:43:19 +1100
Subject: [Beowulf] IB symbol error thresholds for health check scripts ?
Message-ID: <4D06A187.2080201@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi folks,

We run a bunch of health checks [1] on a compute node through Torque [2]
and if they fail the node gets knocked offline.

One of the checks we do is to check that there are no symbol errors on
the IB link. However, I'm wondering if simply saying a single error is
too brutal for this - what do other people do about these?

cheers!
Chris

[1] - for the record we check things like - amount of RAM, failed DIMMs
(via IPMI on IBM or memlog on SGI), number of cores, number and speed of
CPUs, LDAP OK, home directories accessible, etc.

[2] - checks run prior to a job start, after a job exits and every 7.5
minutes (every 10 mom intervals).

- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computational Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk0GoYcACgkQO2KABBYQAh9w1gCgh19IOhXa5BWOmC3+qyZaDDr/
MrYAn1at4YwaaNkmmZpNAVNHBF0OIH0V
=/gDC
-----END PGP SIGNATURE-----

From stuartb at 4gh.net Wed Dec 29 13:29:21 2010
From: stuartb at 4gh.net (Stuart Barkley)
Date: Wed, 29 Dec 2010 13:29:21 -0500 (EST)
Subject: [Beowulf] IB symbol error thresholds for health check scripts ?
In-Reply-To: <4D06A187.2080201@unimelb.edu.au>
References: <4D06A187.2080201@unimelb.edu.au>
Message-ID:

On Mon, 13 Dec 2010 at 17:43 -0000, Christopher Samuel wrote:

> We run a bunch of health checks [1] on a compute node through Torque
> [2] and if they fail the node gets knocked offline.

Can you share these scripts? I'm needing to get something started along
these lines (torque, Moab, Infiniband, IBM System x, xCAT). I'm sure
I'll find things needing adaptation to our environment.

> One of the checks we do is to check that there are no symbol errors
> on the IB link.
> However, I'm wondering if simply saying a single
> error is too brutal for this - what do other people do about these?

I'm looking at Infiniband problems currently and have been watching our
SymbolErrorCounter values. I'm told a "small number" of these errors is
okay. I don't know the definition of "small" or over how long a time
period.

Over the last week, 24 of our nodes have shown at least two errors. Of
these, 6 nodes are showing over 400 errors (450-30000), and these nodes
need attention (I've manually downed them until I can get to the
hardware). The remaining nodes are all < 50 errors, with half of those
< 10.

I'm planning to do more proactive monitoring of the Infiniband fabric.
The current toolset is very awkward to use for monitoring. There is an
updated Infiniband Fabric Suite from QLogic which appears to improve
this significantly. It should be possible to do the Infiniband
monitoring completely off-node so as not to perturb the computations too
much.

> [1] - for the record we check things like - amount of RAM, failed
> DIMMs (via IPMI on IBM or memlog on SGI), number of cores, number
> and speed of CPUs, LDAP OK, home directories accessible, etc.

All things we need to check. I manually found several of our nodes
running with one disabled RAM stick.

> [2] - checks run prior to a job start, after a job exits and every
> 7.5 minutes (every 10 mom intervals).

Also when the node comes up, before mom starts, I assume?

Stuart Barkley
--
I've never been lost; I was once bewildered for three days, but never
lost! -- Daniel Boone
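One way to make the "single symbol error is too brutal" check less
brittle, along the lines Stuart describes, is to save the counter at each
check and flag the node only when the delta over the interval exceeds a
threshold. The threshold of 10 per check interval here is an arbitrary
placeholder, and actually reading SymbolErrorCounter (e.g. with
perfquery) is left out of this sketch.

```python
def ib_link_healthy(prev_count, curr_count, threshold=10):
    """True if the SymbolErrorCounter delta since the last check
    is within the tolerated threshold."""
    delta = curr_count - prev_count
    if delta < 0:            # counter was cleared (or wrapped) between checks
        delta = curr_count
    return delta <= threshold

# A node that accumulated 3 new errors since the last check passes;
# one that accumulated 450 gets knocked offline.
```

The reset handling matters in practice, since admins often clear port
counters while debugging, and a naive delta would then go negative.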