From prentice at ias.edu Tue Dec 7 11:54:58 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Tue, 07 Dec 2010 11:54:58 -0500
Subject: [Beowulf] Memory stress testing tools.
Message-ID: <4CFE66E2.6030805@ias.edu>

Dear Beowulfers,

Can any of you recommend a good RAM stress testing tool?

I have a server with 128 GB of RAM that keeps reporting single-bit errors. Every time this happens, I reseat the DIMMs or swap them around, and then run some large MPI jobs which I hope stress the RAM. Sometimes this produces more SBEs, sometimes it doesn't. When the system seems stable, I let the users back on it, and sure enough, they get it to start reporting SBEs in short order.

--
Prentice

_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.

From mathog at caltech.edu Tue Dec 7 15:42:27 2010
From: mathog at caltech.edu (David Mathog)
Date: Tue, 07 Dec 2010 12:42:27 -0800
Subject: [Beowulf] Memory stress testing tools.
Message-ID: 

Prentice Bisbal wrote:

> When the system seems stable, I let the users back on it, and sure enough, they get it to start reporting SBEs in short order.

Sounds like you already have a good tool for triggering memory errors on that system - your users' code.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
From prentice at ias.edu Tue Dec 7 16:09:35 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Tue, 07 Dec 2010 16:09:35 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <09ED21B37E0F694688A2317C4FED9ED30495AF8404@azsmsx504.amr.corp.intel.com>
References: <4CFE66E2.6030805@ias.edu> <09ED21B37E0F694688A2317C4FED9ED30495AF8404@azsmsx504.amr.corp.intel.com>
Message-ID: <4CFEA28F.4010709@ias.edu>

That was the first thing I looked into. memtest86 supports up to 64 GB of RAM. My system has 128 GB. :(

I found prime95/GIMPS through a Wikipedia page. I'm giving it a go now.

http://www.mersenne.org/freesoft/#newusers

On 12/07/2010 01:05 PM, Mcmillan, Scott A wrote:
> memtest86
>
> -----Original Message-----
> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Prentice Bisbal
> Sent: Tuesday, December 07, 2010 10:55 AM
> To: Beowulf Mailing List
> Subject: [Beowulf] Memory stress testing tools.
>
> Dear Beowulfers,
>
> Can any of you recommend a good RAM stress testing tool?
>
> I have a server with 128GB of RAM that keeps reporting single-bit errors. Every time this happens, I reseat the DIMMs or swap them around, and then run some large MPI jobs which I hope stress the RAM. Sometimes this produces more SBEs, sometimes it doesn't. When the system seems stable, I let the users back on it, and sure enough, they get it to start reporting SBEs in short order.

From a.travis at abdn.ac.uk Thu Dec 9 07:16:34 2010
From: a.travis at abdn.ac.uk (Tony Travis)
Date: Thu, 09 Dec 2010 12:16:34 +0000
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <4CFE66E2.6030805@ias.edu>
References: <4CFE66E2.6030805@ias.edu>
Message-ID: <4D00C8A2.3080008@abdn.ac.uk>

On 07/12/10 16:54, Prentice Bisbal wrote:
> Dear Beowulfers,
>
> Can any of you recommend a good RAM stress testing tool?
>
> I have a server with 128GB of RAM that keeps reporting single-bit errors. Every time this happens, I reseat the DIMMs or swap them around, and then run some large MPI jobs which I hope stress the RAM. Sometimes this produces more SBEs, sometimes it doesn't. When the system seems stable, I let the users back on it, and sure enough, they get it to start reporting SBEs in short order.

Hi, Prentice.

Have you tried Charles Cazabon's user-space "memtester" program:

http://pyropus.ca/software/memtester/

It doesn't test *all* the memory, just what it can lock, but it does stress the memory sub-system in the same way that applications do...

Bye, Tony.

--
Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK
tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk
mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt

From prentice at ias.edu Thu Dec 9 10:59:16 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu, 09 Dec 2010 10:59:16 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: 
References: 
Message-ID: <4D00FCD4.6020203@ias.edu>

On 12/07/2010 04:35 PM, David Mathog wrote:
>> True, but this is a multi-user system, so I don't know which user's code is triggering the errors, nor do I know what usage pattern causes the errors, so I'm looking for something more consistent. Well, I hope it will be more consistent.
>
> Try setting up a script to take snapshots of the system every 15 seconds or so. Something like:
>
> while true; do
>     ( date; top -b -n 1 | head -10 ) >> "$LOGFILE"
>     sleep 15
> done
>
> Then using the memory error time stamps go back through those logs to find the most likely culprits.

That will identify the program, but not the problem size or data set being used that triggers the error. Using a stress test that I control removes this detective work.

I've decided to go with mprime from the GIMPS project, which has a stress test feature:

http://www.mersenne.org/

--
Prentice

From prentice at ias.edu Thu Dec 9 11:08:28 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu, 09 Dec 2010 11:08:28 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: 
References: <4CFE66E2.6030805@ias.edu>
Message-ID: <4D00FEFC.2080509@ias.edu>

On 12/08/2010 11:47 AM, Jason Clinton wrote:
> On Tue, Dec 7, 2010 at 10:54, Prentice Bisbal wrote:
>
>     Can any of you recommend a good RAM stress testing tool?
>
> We have an open source ISO/netboot image that can stress-test using the latest Linux kernel EDAC facilities and HPL as the test code. It's posted here: http://www.advancedclustering.com/software/breakin.html
>
> It's intended to be booted into.
> There's a beta of a slightly newer version posted at: http://lab.advancedclustering.com/bootimage/
>
> I would be interested in any feedback you have on either version.

Jason,

I know breakin well. I used it quite a bit in 2008 when I was stress-testing my then-new cluster, and sent some feedback to the developer at the time (last name Shoemaker, I think). I did find that I could run it for days on all my cluster nodes, and then a few days later, when running HPL as a single job across all the nodes, I'd get memory errors. I haven't used it since. Not because I don't like it, but I just haven't had a need for it since then.

I've also been testing this node by running a single HPL job across all 32 cores myself, and even after days of doing this, I couldn't trigger any errors, but a user program could trigger an error in only a couple of hours.

Based on these experiences, I don't think that HPL is good at stressing RAM. Has anyone else had similar experiences?

Since this system has 128 GB of RAM, I think it's a safe assumption that many programs might not use all of that RAM, so I need something memory-specific that I know will hit all 128 GB of RAM.

So far, mprime appears to be working. I was able to trigger an SBE in 21 hours the first time I ran it. I plan on running it repeatedly for the next few days to see how well it can repeat finding errors.

--
Prentice Bisbal

From prentice at ias.edu Thu Dec 9 11:36:32 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu, 09 Dec 2010 11:36:32 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: 
References: 
Message-ID: <4D010590.3030900@ias.edu>

On 12/07/2010 06:58 PM, David Mathog wrote:
> Try stressapptest.
>
> http://code.google.com/p/stressapptest/
>
> Note that it has a bizarre behavior where no matter how high you set N, the sum of the threads' CPU usage is always 100%, even though they are not all running on one core on a multi-core system. To saturate all of the cores one must force things; see this thread (where I talk to myself, the solution is at the end):
>
> http://groups.google.com/group/stressapptest-discuss/browse_thread/thread/882537e9f3f7d3f2

Thanks for the info. stressapptest looks like a great tool. I'm going to give it a try when I'm done trying out mprime. I want to see which one triggers the error quicker/more consistently.

--
Prentice

From jlforrest at berkeley.edu Thu Dec 9 12:00:08 2010
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Thu, 09 Dec 2010 09:00:08 -0800
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <4D00FEFC.2080509@ias.edu>
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu>
Message-ID: <4D010B18.4000305@berkeley.edu>

On 12/9/2010 8:08 AM, Prentice Bisbal wrote:

> So far, mprime appears to be working. I was able to trigger an SBE in 21 hours the first time I ran it. I plan on running it repeatedly for the next few days to see how well it can repeat finding errors.

After it finds an error how do you figure out which memory module to replace?
--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA 94720-1460
510-643-1032
jlforrest at berkeley.edu

From prentice at ias.edu Thu Dec 9 16:51:44 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu, 09 Dec 2010 16:51:44 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: 
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu>
Message-ID: <4D014F70.4000103@ias.edu>

Jason Clinton wrote:
> On Thu, Dec 9, 2010 at 10:08, Prentice Bisbal wrote:
>
>     I know breakin well. I used it quite a bit in 2008 when I was stress-testing my then-new cluster, and sent some feedback to the developer at the time (last name Shoemaker, I think). I did find that I could run it for days on all my cluster nodes, and then a few days later, when running HPL as a single job across all the nodes, I'd get memory errors. I haven't used it since. Not because I don't like it, but I just haven't had a need for it since then.
>
> Hum. It's possible that EDAC support for your chipset didn't exist at the time. AMD and Intel have been pretty good about landing EDAC for their chips in vanilla upstream kernels for the past year, and so that is why it is important to use a recent kernel. Or at least one with recent backports of that work.

At the time, I was using the latest version of Breakin available. I was testing on AMD Barcelona processors. I was using Breakin in September/October 2008, and the Barcelona processors came out in March - May of that year.
I would assume that would be enough time for support for the new processors to trickle down to breakin, but that's just an assumption; I can't confirm/prove that.

> > I've also been testing this node by running a single HPL job across all 32 cores myself, and even after days of doing this, I couldn't trigger any errors, but a user program could trigger an error in only a couple of hours.
> >
> > Based on these experiences, I don't think that HPL is good at stressing RAM. Has anyone else had similar experiences?
>
> HPL is among the most memory intensive workloads out there. This is why architectural changes in the past few years that have increased the aggregate memory bandwidth of the architecture have resulted in higher measured platform efficiency.
>
> My guess would be that the difference you've seen between the two would be statistical noise. How are you measuring errors? MCE events?

I don't think this is statistical noise. This system has consistently reported SBE errors since it was installed several months ago. I've probably tried to trigger SBEs with HPL dozens of times. I'll often run it 2-3 times in a row without triggering errors over a period of several days. When the users go back to using this server, they usually trigger errors in less time than that. I think HPL resulted in triggering the error only a couple of times.

The system is a Dell PowerEdge something or other. It has an LCD display that is normally blue. When a hardware error is detected, it turns orange and shows the error. I check that several times a day. Our central log server also e-mails any critical log errors that get sent to it, so even if I didn't check the display on the front of the server, I'll receive an e-mail shortly after the error is logged in my system logs. It's low tech, but it works.
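(A lower-effort complement to watching the LCD, if the running kernel has an EDAC driver loaded for this chipset, is reading the correctable-error counters straight out of sysfs. A minimal sketch; the csrow layout is what 2.6-era kernels expose, and exact paths vary by kernel and driver:)

```shell
#!/bin/sh
# Sketch: print per-memory-controller correctable-error counts from the
# EDAC sysfs tree. Assumes an EDAC driver (e.g. amd64_edac on an Opteron
# box) is loaded; if none is, the directory simply won't exist.
EDAC=/sys/devices/system/edac/mc
report=""
for f in "$EDAC"/mc*/csrow*/ce_count "$EDAC"/mc*/ce_count; do
    [ -e "$f" ] && report="$report$f: $(cat "$f")
"
done
if [ -n "$report" ]; then
    printf '%s' "$report"
else
    echo "no EDAC counters found (driver not loaded?)"
fi
```

Run from cron, this gives timestamped counts that can be compared against the LCD/syslog events.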
> > Since this system has 128 GB of RAM, I think it's a good assumption that many programs might not use all of that RAM, so I need something memory specific that I know will hit all 128 GB of RAM.
>
> Breakin uses the same algorithm at http://www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html to calculate the "N" size which will consume 90% of the RAM of a system using all cores (in as close to square grid as possible).
>
> > So far, mprime appears to be working. I was able to trigger an SBE in 21 hours the first time I ran it. I plan on running it repeatedly for the next few days to see how well it can repeat finding errors.
>
> I'm curious what kernel you're running that is giving you EDAC reporting. Or are you rebooting after an MCE and examining the system event logs?
>
> --
> Jason D. Clinton, Advanced Clustering Technologies
> 913-643-0306

--
Prentice

From prentice at ias.edu Thu Dec 9 16:54:57 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu, 09 Dec 2010 16:54:57 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <4D010B18.4000305@berkeley.edu>
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu>
Message-ID: <4D015031.70908@ias.edu>

Jon Forrest wrote:
> On 12/9/2010 8:08 AM, Prentice Bisbal wrote:
>
>> So far, mprime appears to be working. I was able to trigger an SBE in 21 hours the first time I ran it. I plan on running it repeatedly for the next few days to see how well it can repeat finding errors.
>
> After it finds an error how do you figure out which memory module to replace?
>

The LCD display on the front of the server tells me, with a message like this:

"SBE logging disabled on DIMM C3. Reseat DIMM"

I can also generate a report with Dell DSET that shows me a similar message. I'm sure there are other tools, but I usually have to create a DSET report to send to Dell, anyway.

--
Prentice

From david.t.kewley at gmail.com Thu Dec 9 21:14:41 2010
From: david.t.kewley at gmail.com (David Kewley)
Date: Thu, 9 Dec 2010 18:14:41 -0800
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <4D015031.70908@ias.edu>
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu> <4D015031.70908@ias.edu>
Message-ID: 

Prentice,

You only asked for memory testing programs, but I'm going to go a bit further, to make sure some background issues are covered, and to give you some ideas you might not yet have. Some of this is based on a lot of experience with Dell servers in HPC.

Some of my background thoughts on dealing with SBEs:

1) Complete and detailed historical records are important for correctly and efficiently resolving these types of errors, especially on larger clusters. Otherwise it's too easy to get confused about what happened when, and come to incorrect conclusions about problems and solutions. Treat it like a lab experiment -- keep a log book or equivalent, test your hypotheses against the data, and think broadly about what alternative hypotheses may exist.

2) The resolution process will be iterative, with physical manipulations (e.g.
moving DIMMs among slots) alternating with monitoring for SBEs and optionally running stress applications to attempt to trigger SBEs (a "reproducer" of the SBEs).

3) For efficient resolution, you want a quick, reliable reproducer, something that will trigger the SBEs quickly.

4) I've seen no evidence that SBEs materially affect performance or correctness on a server, so my practice has often been to leave affected servers in production as much as possible, taking them out of production (after draining jobs) only briefly to move DIMMs, replace DIMMs, etc.

Regarding (4), if anyone here has measurements or a URL to a study saying in what circumstances there's a significant material risk to performance or correctness of calculation with SBE correction, I'd love to see that. I'm not saying that SBE correction is completely free performance-wise -- I bet it takes a little time to do the correction, but I bet for normal SBE correction rates, that time is (nearly) unmeasurable.

Also, over a few thousand server-years, I've never or almost never seen SBE corrections morph into uncorrectable multi-bit errors. When uncorrectable errors have shown up (which itself has been rare in my experience, mostly in a single situation where there was a server bug that got corrected), they've shown up early on a server, not after a long period of only seeing SBEs.

Prentice, I believe you started this thread because you need something for (3), is that right? As David Mathog said, you already know what activity most reliably triggers SBE corrections: your users' code.

If I were in your shoes, and I had time and/or were concerned around issue (4) above, I'd a) identify which user, code, and specific runs trigger SBEs the most, then b) if possible, work with that user to get a single-node version of a similar run that you could use outside of production, to reproduce and resolve SBEs.
I'd then monitor for SBEs in production, and when they occur, drain jobs from those nodes, and take them out of production so I could use that single-node user job to satisfy (2) and (3) above.

If I were in your shoes and was NOT concerned about (4), I'd simply drain the running job, do a manipulation (2 above), and put the node back into production, waiting for the SBE to recur if it is going to. This is what I've often done. Or if you have a dev/test cluster, replace the entire production node with a tested, known-good node from the dev/test cluster, then test/fix the SBE server in the context of the dev/test cluster. I've also often done this.

My experience has been that long runs of single-node HPL were the best SBE trigger I ever found. Dell's mpmemory did not do as well. I believe memtest86{,+} also didn't find problems that HPL found, though I didn't test memtest86{,+} as much. It also was not immediately obvious how to gather the memory test results from mpmemory and memtest86{,+}, though it can probably be done, perhaps easily, with a bit of R&D.

But since you've found that HPL does not trigger SBEs as much as your user's code, I think you have a very good pointer that you should do stress tests with your user's code if at all possible. If you can share what the stressful app is, and any of the relevant run parameters, that would probably be interesting to folks on this list.

In my experience, usually SBEs are resolved by reseating or replacing the affected DIMM. However it can also be an issue on the motherboard (sockets or traces or something else), or possibly the CPU (because Intel and AMD now both have the memory controllers on-die), or possibly a BIOS issue (if a CPU- or memory-related parameter isn't set quite optimally by the BIOS you're running; BIOSes may set hardware parameters without your awareness or ability to tune them yourself).
Best practice may be:

A) Swap the DIMM where the SBE occurred with a neighbor that underwent similar stress but did not show any SBEs. Keep a permanent record of which DIMMs you swapped and when, as well as all error messages and their timing.

B) Re-stress either in production (if you believe my tentative assertion (in 4 above) that SBE corrections do not materially affect performance nor correctness), or using your reliable reproducer for an amount of time that you know should usually re-trigger the SBE if it is going to recur.

C) Assess the results and respond accordingly:

1) If the SBE messages do not recur, then either reseating resolved it, or it's so marginal that you will need to wait longer for it to show up; may as well leave it in production in this case.

2) If the SBE messages follow the DIMM when you swapped it with its neighbor, then it's very very likely the DIMM (especially if the SBE occurred quickly upon stressing it, both before and after the DIMM move). Present this evidence to Dell Support and ask them to send you a replacement DIMM. KEEP IN MIND that although the replacement DIMM will usually resolve the issue, it has never before been stressed in your setup, and it's possible for your stress patterns to elicit SBEs even in this replacement DIMM. So if the error recurs in that DIMM slot, it's possible that the replacement DIMM also needs to be replaced. You again need to do a neighbor swap to check whether it really is the replacement DIMM.

3) If the SBE stays with the slot after you did the neighbor swap, take this evidence to Dell Support, and see what they say. I would guess they'd have the motherboard and/or CPU swapped. Alternatively, you may wish (use your best judgment) to gather more data by CAREFULLY! swapping CPUs 1 and 2 in that server and seeing whether the SBEs follow the CPU or stay with the slot.
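(To make the record-keeping in A) concrete, here is a minimal sketch of the kind of append-only per-node history I'd keep; the tab-separated layout and the `log_event` helper are my own convention, not a Dell or EDAC format:)

```shell
#!/bin/sh
# Sketch: append a timestamped record of every hardware action (reseat,
# swap, replace) and every observed SBE to a per-node history file, so
# "did the error follow the DIMM or stay with the slot?" can later be
# answered from data rather than from memory.
LOG=${LOG:-./odin-sbe-history.tsv}
log_event() {
    # log_event <kind> <detail>
    printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" >> "$LOG"
}
# Example entries (hypothetical):
log_event swap "DIMM C3 <-> DIMM C4"
log_event sbe  "SBE logging disabled on DIMM C3"
```

A plain-text file like this greps and sorts easily when building a case for Dell Support.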
Just as with DIMMs, it's not unheard of for replacement motherboards and CPUs to also have issues, so don't assume they're perfect -- usually the suitable replacement will resolve the issue fully, but you won't know for sure until you've stressed the system.

What model of PowerEdge are these servers?

PowerEdge systems keep a history of the messages that get printed on the LCD in the System Event Log (SEL), in older days also called the ESM log (embedded systems management, I believe). The SEL is maintained by the BMC or iDRAC. I believe the message you report below ("SBE logging disabled") will be in the SEL. I know the SEL logs messages that indicate that the SBE correction rate has exceeded two successive thresholds (warning and critical).

You can read and clear the SEL using a number of different methods. I'm sure DSET does it. You can also do it with ipmitool, omreport (part of the OpenManage Server Administration (OMSA) tools for the Linux command line), and during POST by hitting Ctrl-E (I think) to get into the BMC or iDRAC POST utility. I'm sure there are other ways; these are the ones I've found useful.

Normal, non-Dell-specific ipmitool will print the SEL records using 'ipmitool sel list', but it does not have the lookup tables and algorithms needed to tell you the name of the affected DIMM on Dell servers. You can also do 'ipmitool sel list -v', which will dump raw field values for each SEL record, and you can decode those raw values to figure out the affected DIMM -- with enough examples (and comparing e.g. to the DIMM names in the Ctrl-E POST SEL view), you might be able to figure out the decoding algorithm on your own, or Google might give you someone who has already figured out the decoding for your specific PowerEdge model. That is the downside of using standard ipmitool. The upside of ipmitool, though, is that it's quite lightweight, and can be used both on localhost and across the network (using IPMI over LAN, if you have it configured appropriately).
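(As a sketch of the lightweight-ipmitool route: poll the SEL on a schedule and keep only the memory-related records, so SBE timestamps can be lined up with job activity later. 'ipmitool sel list' is the standard subcommand; the grep pattern is a guess at what the records contain and may need adjusting per platform:)

```shell
#!/bin/sh
# Sketch: append memory-related SEL records to a local log. Run it from
# cron on the node, or point it at the BMC over the network by adding
# '-H <bmc-host> -U <user> -P <password>' to the ipmitool call.
LOG=${LOG:-./sel-memory.log}
if command -v ipmitool >/dev/null 2>&1; then
    ipmitool sel list 2>/dev/null | grep -iE 'ECC|memory|DIMM' >> "$LOG" || true
    status="polled SEL"
else
    status="ipmitool not installed"
fi
echo "$status"
```
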
The good news is that there's a Dell-specific version of ipmitool available, which adds some Dell-specific capabilities, including the ability to decode DIMM names. This works at least for current PowerEdge R and M servers, as well as older PowerEdge models like the 1950, and probably a few generations older than that. I think it simply supports all models that the corresponding version of OpenManage supports; this does not include older SC servers or current C servers. If you have a model that OpenManage does not support, it may be worth trying, in case it does the right thing for you.

You can get the 'delloem' version of ipmitool from the OpenManage Management Station package. The current latest URL is ftp://ftp.dell.com/sysman/OM-MgmtStat-Dell-Web-LX-6.4.0-1401_A01.tar.gz

Then unpack it and look in ./linux/bmc/ipmitool/ for your OS or a compatible one. For example, looking in the RHEL5_x86_64 subdirectory, the rpm OpenIPMI-tools-2.0.16-99.dell.1.99.1.el5.x86_64.rpm has /usr/bin/ipmitool with 'delloem' appearing as a string internally. (I'm not able to test it right now.)

Once you've installed the appropriate package, do 'ipmitool delloem'; this should tell you what the secondary options are. I believe 'ipmitool delloem sel' will decode the SEL including the correct DIMM names.

If you install OpenManage appropriately, you can also get the SEL decoded, as well as get alerts automatically and immediately sent to syslog. The command line to print a decoded SEL is 'omreport system esmlog'. OpenManage is pretty heavy-weight, though. Some people do install it and leave it running on HPC compute nodes; some people would never do that on a production node. Your mention of getting log messages about the SBEs makes me think you do have OMSA installed and its daemons running -- is that correct? Try 'omreport system esmlog' if so.

Finally, during POST, Ctrl-E at the prompted moment will get you into the BMC or iDRAC POST menu system, in which you can view and optionally clear the SEL.
I do not think this is easily scriptable, but if all else fails, that is one way to view the SEL, with proper decoding.

I know that's long, and I hope that helps you and possibly others.

David

On Thu, Dec 9, 2010 at 1:54 PM, Prentice Bisbal wrote:
> Jon Forrest wrote:
> > On 12/9/2010 8:08 AM, Prentice Bisbal wrote:
> >
> >> So far, mprime appears to be working. I was able to trigger an SBE in 21 hours the first time I ran it. I plan on running it repeatedly for the next few days to see how well it can repeat finding errors.
> >
> > After it finds an error how do you figure out which memory module to replace?
>
> The LCD display on the front of the server tells me, with a message like this:
>
> "SBE logging disabled on DIMM C3. Reseat DIMM"
>
> I can also generate a report with Dell DSET that shows me a similar message. I'm sure there are other tools, but I usually have to create a DSET report to send to Dell, anyway.
>
> --
> Prentice

From prentice at ias.edu Fri Dec 10 08:45:25 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Fri, 10 Dec 2010 08:45:25 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: 
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu> <4D015031.70908@ias.edu>
Message-ID: <4D022EF5.1080801@ias.edu>

David,

--
Prentice

From prentice at ias.edu Fri Dec 10 09:24:30 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Fri, 10 Dec 2010 09:24:30 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: 
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu> <4D015031.70908@ias.edu>
Message-ID: <4D02381E.5090307@ias.edu>

David,

Thanks for the e-mail. Due to its length, I'm not including it in my reply, which I know is normally bad mailing list etiquette.

The server is a Dell PowerEdge R815 with four 8-core AMD processors and 128 GB of RAM. I installed two identical servers at the same time, named frigga and odin (husband and wife in Norse mythology, if you're curious). These nodes are not part of a beowulf cluster, but this is the best forum I know of to discuss problems like this.

Odin is the system with errors, and it started reporting SBE errors almost immediately, even when the system was completely idle. They started within hours of operating system installation, before users were even able to log in to the system.

As you pointed out, I don't think SBE errors are fatal, but I like to address all system errors I identify, no matter how trivial. I find when you get used to ignoring "harmless" errors, you eventually end up ignoring all errors. So, you are right that I'm looking for a tool to quickly and reliably reproduce SBEs so that I can quickly resolve this problem with Dell.
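(One way to turn such runs into comparable numbers across DIMM swaps is to bracket each fixed-length stress run with the kernel's total correctable-error count. A sketch only: it assumes stressapptest, discussed earlier in this thread -- any reproducer, e.g. mprime's stress test, could be substituted -- plus a loaded EDAC driver; without one the counts read as 0:)

```shell
#!/bin/sh
# Sketch: sample the total correctable-error count before and after a
# one-hour stress run, yielding an "errors per run" number that can be
# compared across DIMM swaps and reseats.
count_ce() {
    cat /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null |
        awk '{ s += $1 } END { print s + 0 }'
}
before=$(count_ce)
if command -v stressapptest >/dev/null 2>&1; then
    # -s: seconds to run; -M: megabytes to test (leave the OS some
    # headroom out of the 128 GB)
    stressapptest -s 3600 -M 110000
else
    echo "stressapptest not installed; skipping stress phase" >&2
fi
after=$(count_ce)
echo "correctable errors during run: $((after - before))"
```
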
For reasons I can't discuss here, working with the user is not an option. Due to the nature of my institution, users are only here for a couple of years, anyway, and I'm looking for a tool that I can use long after this user (and his code) are gone.

I have been keeping detailed logs of exactly when the SBE errors occur. And I have been reseating and swapping DIMMS to see if the errors move with the DIMM or stay with the slot, to determine whether it's a bad DIMM or a bad motherboard. On the first occasion, the error did move with the DIMM, and I replaced the DIMM. Since then, the errors have been moving from DIMM to DIMM, even across banks of DIMMS. Since each bank corresponds to a socket, this would indicate that it's not a bad on-chip memory controller -- or they're all bad.

My goal is to find a tool that can reproduce SBE errors in a finite time frame, so that I can run it repeatedly and collect data on where these SBEs occur. I suspect it's a bad motherboard, but unless I have overwhelming data showing that, Dell will just keep replacing the DIMMs, and I'm pretty confident it's not bad DIMMs in this case.

As stated earlier, HPL wasn't reliable for me in this capacity. I'm now using mprime's stress test mode, and will also test stressapptest.

-- Prentice

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From mdidomenico4 at gmail.com Fri Dec 10 10:15:16 2010
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Fri, 10 Dec 2010 15:15:16 +0000
Subject: [Beowulf] tesla benchmarking
Message-ID:

does anyone know of easy-to-run code that would swallow the cpu/memory on the chassis but also a tesla card?
A lot of the tools I typically used in the past that have been ported to GPUs don't seem to use much of the memory, or to keep the whole GPU busy. I'm running through NAMD at the moment, which does seem to make pretty good use of the GPU processor, but it doesn't seem to use much, if any, of the memory. CUDA-Linpack seems to cough up an error at runtime, but hopefully I'll get that going. Still, I'm curious whether there's anything else I don't know about.

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From david.t.kewley at gmail.com Fri Dec 10 14:47:20 2010
From: david.t.kewley at gmail.com (David Kewley)
Date: Fri, 10 Dec 2010 11:47:20 -0800
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <4D02381E.5090307@ias.edu>
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu> <4D015031.70908@ias.edu> <4D02381E.5090307@ias.edu>
Message-ID:

Prentice,

Thanks for filling in some details. What you say makes complete sense to me.

Is it the case that frigga has seen similar stress with no SBE errors? If so, I agree it seems like something else is going on besides bad DIMMs. To test that, if you can schedule simultaneous downtime on the two boxes, you might swap all DIMMs between odin and frigga.

If you do a few DIMM replacements but continue to have the sense that DIMM replacements aren't really solving the problem, and you have good evidence why you think that, I encourage you to make sure Dell Support hears and understands that, and make sure they're looking more holistically than individual DIMMs. They may look more broadly on their own, or you may need to nudge them.
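Keeping score across swaps like that is easy to fumble by hand. A small tally over a running incident log can show whether errors cluster on a slot (board suspect) or follow a serial number (DIMM suspect); the "date slot serial" log format used here is invented for illustration.

```shell
# Tally logged SBEs per slot and per DIMM serial number, to see whether
# the errors stay with the slot (motherboard suspect) or follow the
# part (DIMM suspect).  The "date slot serial" log format is
# hypothetical; adjust the field numbers to whatever you record.
tally_sbe_log() {
    awk '{ slot[$2]++; dimm[$3]++ }
         END {
             for (s in slot) printf "slot %s %d\n", s, slot[s]
             for (d in dimm) printf "dimm %s %d\n", d, dimm[d]
         }' "$@"
}
```

Feeding it the whole history after each swap (e.g. `tally_sbe_log sbe-history.log`) makes the "errors move with the DIMM or stay with the slot" question a matter of reading two columns.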
David On Fri, Dec 10, 2010 at 6:24 AM, Prentice Bisbal wrote: > David, > > Thanks for the e-mail due to it's length, I'm not including it in my reply, > which I know is normally bad mailing list etiquette. > > The server is a Dell PowerEdge R815 with 4 8-Core AMD processors and 128 GB > of RAM. > > I installed two identical servers at the same time, named frigga and odin > (husband and wife in Norse mythology, if your curious). These nodes are not > part of a beowulf cluster, but this is the best forum I know of to discuss > problems like this. > > Odin is the system with errors, and it started reporting SBE errors almost > immediately, even when the system was completely idle. They started within > hours of operating system installation, before users were even able to login > to the system. > > As you pointed out, I don't think SBE errors are fatal, but I like to > address all system errors I identify, no matter how trivial. I find when you > get used to ignoring a "harmless" errors, you eventually end up ignoring all > errors. > > So, you are right that I'm looking for a tool to quickly and reliably > reproduce SBEs so that I can quickly resolve this problem with Dell. For > reasons I can't discuss here, working with the user is not an option. Due to > the nature of my institution, users are only here for a couple of years, > anyway, and I'm looking for a tool that I can use long after this user (and > his code) are gone. > > I have been keeping detailed logs of exactly when the SBE errors occur. And > I have been reseating and swapping DIMMS to see of the errors move with the > DIMM or stay with the slot to determine whether it's a bad DIMM, or a bad > motherboard. In the first occasion, the error did move with the DIMM, and I > replaced the DIMM. Since then, the errors have been moving from DIMM to > DIMM, even across banks of DIMMS. 
Since each bank corresponds to a socket, > this would indicate that it's not a bad on-chip memory controller, or > they're all bad. > > My goal is to find a tool that I can run repeatedly to reproduce SBE errors > in a finite time frame, and then run it repeatedly and collect data on where > these SBEs occur. I suspect it's a bad motherboard, but unless I have > overwhelming data showing that, Dell will just keep replacing the DIMMs, and > I'm pretty confident it's not bad DIMMs in this case. > > As stated earlier, HPL wasn't reliable for me in this capacity. I'm now > using mprime's stress test mode, and will also test stressapptest. > > > -- > Prentice > > -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mathog at caltech.edu Fri Dec 10 16:18:02 2010 From: mathog at caltech.edu (David Mathog) Date: Fri, 10 Dec 2010 13:18:02 -0800 Subject: [Beowulf] Memory stress testing tools Message-ID: Prentice Bisbal wrote: > The server is a Dell PowerEdge R815 with 4 8-Core AMD processors and 128 > GB of RAM. If the erroneous memory locations are moving around in memory without correlation to the DIMMs then the next most likely culprits are a marginal power supply, CPU, or motherboard, in pretty much that order. (OK, kind of a toss up for CPU vs. motherboard, but since you have 32 cores in the system I put it first.) If you have access to an oscilloscope look closely at the voltages on the two machines. No need to cut in anywhere, just measure +5 and +12V on an unused disk or fan connector. 
If the machine prone to memory errors is significantly noisier than the one that is not, that could be the problem. I have seen this exactly once - all PS testers said it was good, and a multimeter had it pegged at the right voltages, but there was a ton of high frequency noise coming out of the power supply. If you can disable CPUs through the BIOS on that machine, running for a while under each CPU alone might narrow the issue down to 1 of the 4. You wouldn't be done then though, because it could be the socket and not the CPU itself. Still, if you can get it down to 1 CPU then you could swap that with another and see if the issue moves with it. You probably already did this, but be sure both machines have the same BIOS release. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From samuel at unimelb.edu.au Mon Dec 13 17:43:19 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 14 Dec 2010 09:43:19 +1100 Subject: [Beowulf] IB symbol error thresholds for health check scripts ? Message-ID: <4D06A187.2080201@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi folks, We run a bunch of health checks [1] on a compute node through Torque [2] and if they fail the node gets knocked offline. One of the checks we do is to check that there are no symbol errors on the IB link. However, I'm wondering if simply saying a single error is too brutal for this - what do other people do about these ? cheers! 
Chris [1] - for the record we check things like - amount of RAM, failed DIMMs (via IPMI on IBM or memlog on SGI), number of cores, number and speed of CPUs, LDAP OK, home directories accessible, etc. [2] - checks run prior to a job start, after a job exits and every 7.5 minutes (every 10 mom intervals). - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk0GoYcACgkQO2KABBYQAh9w1gCgh19IOhXa5BWOmC3+qyZaDDr/ MrYAn1at4YwaaNkmmZpNAVNHBF0OIH0V =/gDC -----END PGP SIGNATURE----- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From stuartb at 4gh.net Wed Dec 29 13:29:21 2010 From: stuartb at 4gh.net (Stuart Barkley) Date: Wed, 29 Dec 2010 13:29:21 -0500 (EST) Subject: [Beowulf] IB symbol error thresholds for health check scripts ? In-Reply-To: <4D06A187.2080201@unimelb.edu.au> References: <4D06A187.2080201@unimelb.edu.au> Message-ID: On Mon, 13 Dec 2010 at 17:43 -0000, Christopher Samuel wrote: > We run a bunch of health checks [1] on a compute node through Torque > [2] and if they fail the node gets knocked offline. Can you share these scripts? I'm needing to get something started along these lines (torque, Moab, Infiniband, IBM system x, xCAT). I'm sure I'll find things needing adaption to our environment. > One of the checks we do is to check that there are no symbol errors > on the IB link. 
However, I'm wondering if simply saying a single
> error is too brutal for this - what do other people do about these ?

I'm looking at Infiniband problems currently and have been watching our SymbolErrorCounter values. I'm told a "small number" of these errors are okay. I don't know the definition of "small" or over how long a time period.

Over the last week, 24 of our nodes have shown at least two errors. Of these, 6 nodes are showing over 400 errors (450-30000), and these nodes need attention (I've manually downed them until I can get to the hardware). The remaining nodes are all < 50 errors, with half of those < 10.

I'm planning to do more proactive monitoring of the Infiniband fabric. The current toolset is very awkward to use for monitoring. There is an updated Infiniband Fabric Suite from QLogic which appears to improve this significantly. It should be possible to do the Infiniband monitoring completely off node so as to not perturb the computations too much.

> [1] - for the record we check things like - amount of RAM, failed
> DIMMs (via IPMI on IBM or memlog on SGI), number of cores, number
> and speed of CPUs, LDAP OK, home directories accessible, etc.

All things we need to check. I manually found several of our nodes running with one disabled RAM stick.

> [2] - checks run prior to a job start, after a job exits and every
> 7.5 minutes (every 10 mom intervals).

Also when the node comes up, before mom starts, I assume?

Stuart Barkley
--
I've never been lost; I was once bewildered for three days, but never lost!
-- Daniel Boone

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
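That manual threshold (over ~400 symbol errors in a week means the node needs attention) is simple to automate once per-node counter deltas are dumped to text. The "node count" input format below is an assumption; produce it however your fabric monitoring exports counters (e.g. by post-processing perfquery output).

```shell
# Flag nodes whose weekly SymbolErrorCounter delta exceeds a limit, so
# they can be marked offline for hardware checks.  The "node count"
# input format is hypothetical -- generate it however your fabric
# monitoring dumps per-port counters.
flag_noisy_nodes() {
    awk -v limit="${1:-400}" '$2 + 0 > limit { print $1, $2 }'
}
```

If you trust the threshold, piping the first column into your scheduler's offline command (e.g. `pbsnodes -o` under Torque) would automate the manual downing as well.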
From a.travis at abdn.ac.uk Thu Dec 9 07:16:34 2010
From: a.travis at abdn.ac.uk (Tony Travis)
Date: Thu, 09 Dec 2010 12:16:34 +0000
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <4CFE66E2.6030805@ias.edu> References: <4CFE66E2.6030805@ias.edu> Message-ID: <4D00C8A2.3080008@abdn.ac.uk> On 07/12/10 16:54, Prentice Bisbal wrote: > Dear Beowulfers, > > Can any of you recommend a good RAM stress testing tool? > > I have a server with 128GB of RAM that keeps reporting single-bit > errors. Every time this happens, I reseat the DIMMS or swap them around, > and then run some large MPI jobs with I hope stress the RAM. Sometimes > this produces more SBEs, sometimes it doesn't. When the system seems > stable, I let the users back on it, and sure enough, they get it to > start reporting SBEs in short order. Hi, Prentice. Have you tried Charles Cazabon's user-space "memtester" program: http://pyropus.ca/software/memtester/ It doesn't test *all* the memory, just what it can lock, but it does stress the memory sub-system in the same way that applications do... Bye, Tony. -- Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Thu Dec 9 10:59:16 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 09 Dec 2010 10:59:16 -0500 Subject: [Beowulf] Memory stress testing tools. 
In-Reply-To: References:
Message-ID: <4D00FCD4.6020203@ias.edu>

On 12/07/2010 04:35 PM, David Mathog wrote:
>> True, but this is a multi-user system, so I don't know which user's code
>> is triggering the errors, nor do I know what usage pattern causes the
>> errors, so I'm looking for something more consistent. Well, I hope it
>> will be more consistent.
>
> Try setting up a script to take snapshots of the system every 15 seconds
> or so. Something like:
>
> while true; do
>     ( date; top -b -n 1 | head -10 ) >> "$LOGFILE"
>     sleep 15
> done
>
> Then using the memory error time stamps go back through those logs to
> find the most likely culprits.

That will identify the program, but not the problem size or data set being used that triggers the error. Using a stress test that I control removes this detective work. I've decided to go with mprime from the gimps project, which has a stress test feature: http://www.mersenne.org/

-- Prentice

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From prentice at ias.edu Thu Dec 9 11:08:28 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu, 09 Dec 2010 11:08:28 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: References: <4CFE66E2.6030805@ias.edu>
Message-ID: <4D00FEFC.2080509@ias.edu>

On 12/08/2010 11:47 AM, Jason Clinton wrote:
> On Tue, Dec 7, 2010 at 10:54, Prentice Bisbal
> wrote:
>
> Can any of you recommend a good RAM stress testing tool?
>
>
> We have an open source ISO/netboot image that can stress-test using the
> latest Linux kernel EDAC facilities and HPL as the test code. It's
> posted here: http://www.advancedclustering.com/software/breakin.html
>
> It's intended to be booted into.
>
> There's a beta of a slightly newer version posted at:
> http://lab.advancedclustering.com/bootimage/
>
> I would be interested in any feedback you have on either version.

Jason,

I know breakin well. I used it quite a bit in 2008 when I was stress-testing my then-new cluster, and sent some feedback to the developer at the time (last name Shoemaker, I think). I did find that I could run it for days on all my cluster nodes, and then a few days later, when running HPL as a single job across all the nodes, I'd get memory errors. I haven't used it since. Not because I don't like it, but I just haven't had a need for it since then.

I've also been testing this node by running a single HPL job across all 32 cores myself, and even after days of doing this, I couldn't trigger any errors, but a user program could trigger an error in only a couple of hours.

Based on these experiences, I don't think that HPL is good at stressing RAM. Has anyone else had similar experiences?

Since this system has 128 GB of RAM, I think it's a good assumption that many programs might not use all of that RAM, so I need something memory-specific that I know will hit all 128 GB of RAM.

So far, mprime appears to be working. I was able to trigger an SBE in 21 hours the first time I ran it. I plan on running it repeatedly for the next few days to see how well it can repeat finding errors.

-- Prentice Bisbal

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From prentice at ias.edu Thu Dec 9 11:36:32 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Thu, 09 Dec 2010 11:36:32 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: References:
Message-ID: <4D010590.3030900@ias.edu>

On 12/07/2010 06:58 PM, David Mathog wrote:
> Try stressapptest.
>
> http://code.google.com/p/stressapptest/
>
> Note that it has a bizarre behavior where no matter how high you set N,
> the sum of their CPU usage is always 100%, even though they are not
> all running on one core on a multi-core system. To saturate all of the
> cores one must force things; see this thread (where I talk to myself,
> the solution is at the end):
>
> http://groups.google.com/group/stressapptest-discuss/browse_thread/thread/882537e9f3f7d3f2
>

Thanks for the info. stressapptest looks like a great tool. I'm going to give it a try when I'm done trying out mprime. I want to see which one triggers the error quicker/more consistently.

-- Prentice

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From jlforrest at berkeley.edu Thu Dec 9 12:00:08 2010
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Thu, 09 Dec 2010 09:00:08 -0800
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <4D00FEFC.2080509@ias.edu>
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu>
Message-ID: <4D010B18.4000305@berkeley.edu>

On 12/9/2010 8:08 AM, Prentice Bisbal wrote:
> So far, mprime appears to be working. I was able to trigger an SBE in 21
> hours the first time I ran it. I plan on running it repeatedly for the
> next few days to see how well it can repeat finding errors.

After it finds an error how do you figure out which memory module to replace?
-- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Thu Dec 9 16:51:44 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 09 Dec 2010 16:51:44 -0500 Subject: [Beowulf] Memory stress testing tools. In-Reply-To: References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> Message-ID: <4D014F70.4000103@ias.edu> Jason Clinton wrote: > On Thu, Dec 9, 2010 at 10:08, Prentice Bisbal > wrote: > > I know breakin well. I used it a quite a bit a in 2008 when I was > stress-testing my then-new cluster, and sent some feedback to the > developer at the time (last name Shoemaker, I think). I did find that I > could run it for days on all my cluster nodes, and then a few days > later, when running a HPL as a single job across all the nodes, I'd get > memory errors. I haven't used it since. Not because I don't like it, but > I just haven't had a need for it since then. > > > Hum. It's possible that EDAC support for your chipset didn't exist at > the time. AMD and Intel have been pretty good about landing EDAC for > their chips in vanilla upstream kernels for the past year and so that is > why it is important to use a recent kernel. Or at least one with recent > backports of that work. At the time, I was using the latest version of Breakin available. I was testing on AMD Barcelona processors. I was using Breakin in September/October 2008, and the Barcelona processors came out in March - May of that year. 
I would assume that would be enough time for support for the new processors to trickle down to breakin, but that's just an assumption; I can't confirm/prove that.

>
>
> I've also been testing this node by running a single HPL job across all
> 32 cores myself, and even after days of doing this, I couldn't trigger
> any errors, but a user program could trigger an error in only a couple
> of hours.
>
> Based on these experiences, I don't think that HPL is good at stressing
> RAM. Has anyone else had similar experiences?
>
>
> HPL is among the most memory intensive workloads out there. This is why
> architectural changes in the past few years that have increased the
> aggregate memory bandwidth of the architecture have resulted in higher
> measured platform efficiency.
>
> My guess would be that the difference you've seen between the two would
> be statistical noise. How are you measuring errors? MCE events?

I don't think this is statistical noise. This system has consistently reported SBE errors since it was installed several months ago. I've probably tried to trigger SBEs with HPL dozens of times. I'll often run it 2-3 times in a row without triggering errors over a period of several days. When the users go back to using this server, they usually trigger errors in less time than that. I think HPL resulted in triggering the error only a couple of times.

The system is a Dell PowerEdge something or other. It has an LCD display that is normally blue. When a hardware error is detected, it turns orange and shows the error. I check that several times a day. Our central log server also e-mails any critical log errors that get sent to it, so even if I didn't check the display on the front of the server, I'll receive an e-mail shortly after the error is logged in my system logs. It's low tech, but it works.
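A similarly low-tech check works without the front-panel LCD or a central log server: grep the node's own log for ECC-looking lines. The pattern and default path below are guesses to adapt, since EDAC/MCE message wording varies across kernels, drivers, and syslog configurations.

```shell
# Pull ECC/EDAC-looking lines out of a log file as a crude SBE alert.
# The regex and the default log location are assumptions: EDAC and MCE
# message wording varies between kernels, drivers, and syslog setups.
scan_for_ecc() {
    grep -Ei 'EDAC|ECC|single-bit' "${1:-/var/log/messages}"
}

# Example use from cron (mail command and paths are illustrative):
#   scan_for_ecc /var/log/messages | mail -s "SBE report" root
```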
> > > > Since this system has 128 GB of RAM, I think it's a good assumption that > many programs might not use all of that RAM, so I need something memory > specific that I know will hit all 128 GB of RAM. > > > Breakin uses the same algorithm at > http://www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html > to calculate the "N" size which will consume 90% of the RAM of a system > using all cores (in as close to square grid as possible). > > > > So far, mprime appears to be working. I was able to trigger an SBE in 21 > hours the first time I ran it. I plan on running it repeatedly for the > next few days to see how well it can repeat finding errors. > > > I'm curious what kernel you're running that is giving you EDAC > reporting. Or are you rebooting after an MCE and examining the system > event logs? > > > -- > Jason D. Clinton, Advanced Clustering Technologies > 913-643-0306 -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Thu Dec 9 16:54:57 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 09 Dec 2010 16:54:57 -0500 Subject: [Beowulf] Memory stress testing tools. In-Reply-To: <4D010B18.4000305@berkeley.edu> References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu> Message-ID: <4D015031.70908@ias.edu> Jon Forrest wrote: > On 12/9/2010 8:08 AM, Prentice Bisbal wrote: > >> So far, mprime appears to be working. I was able to trigger an SBE in 21 >> hours the first time I ran it. I plan on running it repeatedly for the >> next few days to see how well it can repeat finding errors. > > After it finds an error how do you > figure out which memory module to > replace? 
>

The LCD display on the front of the server tells me, with a message like this:

"SBE logging disabled on DIMM C3. Reseat DIMM"

I can also generate a report with DELL DSET that shows me a similar message. I'm sure there are other tools, but I usually have to create a DSET report to send to Dell, anyway.

-- Prentice

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From david.t.kewley at gmail.com Thu Dec 9 21:14:41 2010
From: david.t.kewley at gmail.com (David Kewley)
Date: Thu, 9 Dec 2010 18:14:41 -0800
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <4D015031.70908@ias.edu>
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu> <4D015031.70908@ias.edu>
Message-ID:

Prentice,

You only asked for memory testing programs, but I'm going to go a bit further, to make sure some background issues are covered, and to give you some ideas you might not yet have. Some of this is based on a lot of experience with Dell servers in HPC.

Some of my background thoughts on dealing with SBEs:

1) Complete and detailed historical records are important for correctly and efficiently resolving these types of errors, especially on larger clusters. Otherwise it's too easy to get confused about what happened when, and come to incorrect conclusions about problems and solutions. Treat it like a lab experiment -- keep a log book or equivalent, test your hypotheses against the data, and think broadly about what alternative hypotheses may exist.

2) The resolution process will be iterative, with physical manipulations (e.g.
moving DIMMs among slots) alternating with monitoring for SBEs and
optionally running stress applications to attempt to trigger SBEs (a
"reproducer" of the SBEs).

3) For efficient resolution, you want a quick, reliable reproducer,
something that will trigger the SBEs quickly.

4) I've seen no evidence that SBEs materially affect performance or
correctness on a server, so my practice has often been to leave affected
servers in production as much as possible, taking them out of production
(after draining jobs) only briefly to move DIMMs, replace DIMMs, etc.

Regarding (4), if anyone here has measurements or a URL to a study saying
in what circumstances there's a significant material risk to performance
or correctness of calculation with SBE correction, I'd love to see that.
I'm not saying that SBE correction is completely free performance-wise --
I bet it takes a little time to do the correction, but I bet that for
normal SBE correction rates, that time is (nearly) unmeasurable.

Also, over a few thousand server-years, I've never or almost never seen
SBE corrections morph into uncorrectable multi-bit errors. When
uncorrectable errors have shown up (which itself has been rare in my
experience, mostly in a single situation where there was a server bug
that got corrected), they've shown up early in a server's life, not after
a long period of only seeing SBEs.

Prentice, I believe you started this thread because you need something
for (3), is that right? As David Mathog said, you already know what
activity most reliably triggers SBE corrections: your users' code.

If I were in your shoes, and I had time and/or were concerned about issue
(4) above, I'd a) identify which user, code, and specific runs trigger
SBEs the most, then b) if possible, work with that user to get a
single-node version of a similar run that you could use outside
production to reproduce and resolve SBEs.
I'd then monitor for SBEs in production, and when they occur, drain jobs
from those nodes and take them out of production so I could use that
single-node user job to satisfy (2) and (3) above.

If I were in your shoes and was NOT concerned about (4), I'd simply drain
the running job, do a manipulation (2 above), and put the node back into
production, waiting for the SBE to recur if it is going to. This is what
I've often done. Or if you have a dev/test cluster, replace the entire
production node with a tested, known-good node from the dev/test cluster,
then test/fix the SBE server in the context of the dev/test cluster. I've
also often done this.

My experience has been that long runs of single-node HPL were the best
SBE trigger I ever found. Dell's mpmemory did not do as well. I believe
memtest86{,+} also didn't find problems that HPL found, though I didn't
test memtest86{,+} as much. It also was not immediately obvious how to
gather the memory test results from mpmemory and memtest86{,+}, though it
can probably be done, perhaps easily, with a bit of R&D.

But since you've found that HPL does not trigger SBEs as much as your
user's code, I think you have a very good pointer that you should do
stress tests with your user's code if at all possible. If you can share
what the stressful app is, and any of the relevant run parameters, that
would probably be interesting to folks on this list.

In my experience, SBEs are usually resolved by reseating or replacing the
affected DIMM. However, it can also be an issue on the motherboard
(sockets or traces or something else), or possibly the CPU (because Intel
and AMD now both have the memory controllers on-die), or possibly a BIOS
issue (if a CPU- or memory-related parameter isn't set quite optimally by
the BIOS you're running; BIOSes may set hardware parameters without your
awareness or the ability to tune them yourself).
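As an aside, the "N that consumes 90% of RAM" rule for single-node HPL
runs, mentioned earlier in the thread in connection with Breakin and the
Advanced Clustering HPL.dat tuning page, can be sketched like this. HPL
factors an N x N matrix of 8-byte doubles, so solve 0.9 * mem = 8 * N^2
for N and round down to a multiple of the block size; NB=192 here is just
an illustrative choice, not a recommendation.

```python
import math

def hpl_problem_size(mem_bytes, fraction=0.90, nb=192):
    """Largest N (aligned to block size nb) whose N x N matrix of
    8-byte doubles fits in `fraction` of the given memory."""
    n = int(math.sqrt(fraction * mem_bytes / 8))
    return (n // nb) * nb   # align N down to a multiple of NB

# For the 128 GB server discussed in this thread:
n = hpl_problem_size(128 * 1024**3)   # -> 124224 with NB=192
```

The alignment step matters in practice: HPL performs best when N is a
multiple of the block size NB chosen in HPL.dat.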
Best practice may be:

A) swap the DIMM where the SBE occurred with a neighbor that underwent
similar stress but did not show any SBEs. Keep a permanent record of
which DIMMs you swapped and when, as well as all error messages and their
timing.

B) re-stress either in production (if you believe my tentative assertion
(in 4 above) that SBE corrections do not materially affect performance or
correctness), or using your reliable reproducer for an amount of time
that you know should usually re-trigger the SBE if it is going to recur.

C) assess the results and respond accordingly:

1) if the SBE messages do not recur, then either reseating resolved it,
or it's so marginal that you will need to wait longer for it to show up;
may as well leave it in production in this case

2) if the SBE messages follow the DIMM when you swapped it with its
neighbor, then it's very, very likely the DIMM (especially if the SBE
occurred quickly upon stressing it, both before and after the DIMM move).
Present this evidence to Dell Support and ask them to send you a
replacement DIMM.

KEEP IN MIND that although the replacement DIMM will usually resolve the
issue, it has never before been stressed in your setup, and it's possible
for your stress patterns to elicit SBEs even in this replacement DIMM. So
if the error recurs in that DIMM slot, it's possible that the replacement
DIMM also needs to be replaced. You again need to do a neighbor swap to
check whether it really is the replacement DIMM.

3) If the SBE stays with the slot after you did the neighbor swap, take
this evidence to Dell Support, and see what they say. I would guess
they'd have the motherboard and/or CPU swapped. Alternatively, you may
wish (use your best judgment) to gather more data by CAREFULLY! swapping
CPUs 1 and 2 in that server and seeing whether the SBEs follow the CPU or
stay with the slot.
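The neighbor-swap bookkeeping above can be reduced to a small script:
log each SBE with the DIMM's serial number and the slot it occupied at
the time, then see whether errors cluster by serial (suspect the DIMM) or
by slot (suspect the board, socket, or CPU). The serials and slots below
are hypothetical examples, not real data.

```python
from collections import Counter

def localize_sbes(events):
    """events: iterable of (dimm_serial, slot) tuples, one per SBE report.
    Returns the (serial, count) and (slot, count) with the most errors."""
    by_dimm = Counter(serial for serial, _ in events)
    by_slot = Counter(slot for _, slot in events)
    return by_dimm.most_common(1)[0], by_slot.most_common(1)[0]

# Before a swap, DIMM "S123" sat in slot C3; after the swap it sat in C4.
events = [("S123", "C3"), ("S123", "C3"), ("S123", "C4")]
top_dimm, top_slot = localize_sbes(events)
# All three errors follow serial S123, so the DIMM is the prime suspect.
```

This is only as good as the log book behind it, which is David's point
(1): record the physical serial-to-slot mapping at every swap.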
Just as with DIMMs, it's not unheard of for replacement motherboards and
CPUs to also have issues, so don't assume they're perfect -- usually the
suitable replacement will resolve the issue fully, but you won't know for
sure until you've stressed the system.

What model of PowerEdge are these servers?

PowerEdge systems keep a history of the messages that get printed on the
LCD in the System Event Log (SEL), in older days also called the ESM log
(embedded systems management, I believe). The SEL is maintained by the
BMC or iDRAC. I believe the message you report below ("SBE logging
disabled") will be in the SEL. I know the SEL logs messages that indicate
that the SBE correction rate has exceeded two successive thresholds
(warning and critical).

You can read and clear the SEL using a number of different methods. I'm
sure DSET does it. You can also do it with ipmitool, omreport (part of
the OpenManage Server Administrator (OMSA) tools for the Linux command
line), and during POST by hitting Ctrl-E (I think) to get into the BMC or
iDRAC POST utility. I'm sure there are other ways; these are the ones
I've found useful.

Normal, non-Dell-specific ipmitool will print the SEL records using
'ipmitool sel list', but it does not have the lookup tables and
algorithms needed to tell you the name of the affected DIMM on Dell
servers. You can also do 'ipmitool sel list -v', which will dump raw
field values for each SEL record, and you can decode those raw values to
figure out the affected DIMM -- with enough examples (and comparing e.g.
to the DIMM names in the Ctrl-E POST SEL view), you might be able to
figure out the decoding algorithm on your own, or Google might give you
someone who has already figured out the decoding for your specific
PowerEdge model. That is the downside of using standard ipmitool. The
upside of ipmitool, though, is that it's quite lightweight, and can be
used both on localhost and across the network (using IPMI over LAN, if
you have it configured appropriately).
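As a sketch of the kind of monitoring described above, correctable-ECC
records can be tallied straight out of standard 'ipmitool sel list'
output. The record layout assumed here (pipe-separated fields with a
"Correctable ECC" event description) is an approximation of typical
output and may differ by platform, so treat the pattern as a starting
point, not gospel.

```shell
#!/bin/sh
# Count correctable-ECC (SBE) records in 'ipmitool sel list'-style output
# read from stdin. The leading '| ' in the pattern keeps it from also
# matching "Uncorrectable ECC" records.
count_sbe() {
    grep -ci '| correctable ecc'
}

# Typical usage on a live node (requires ipmitool and BMC access):
#   ipmitool sel list | count_sbe
```

Run periodically from cron and diffed against the previous count, this
gives a crude SBE-rate monitor without installing OpenManage.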
The good news is that there's a Dell-specific version of ipmitool
available, which adds some Dell-specific capabilities, including decoding
DIMM names. This works at least for current PowerEdge R and M servers, as
well as older PowerEdge models like the 1950, and probably a few
generations older than that. I think it simply supports all models that
the corresponding version of OpenManage supports; this does not include
older SC servers or current C servers. If you have a model that
OpenManage does not support, it may be worth trying, in case it does the
right thing for you.

You can get the 'delloem' version of ipmitool from the OpenManage
Management Station package. The current URL is
ftp://ftp.dell.com/sysman/OM-MgmtStat-Dell-Web-LX-6.4.0-1401_A01.tar.gz

Then unpack it and look in ./linux/bmc/ipmitool/ for your OS or a
compatible one. For example, looking in the RHEL5_x86_64 subdirectory,
the rpm OpenIPMI-tools-2.0.16-99.dell.1.99.1.el5.x86_64.rpm has
/usr/bin/ipmitool with 'delloem' appearing as a string internally. (I'm
not able to test it right now.)

Once you've installed the appropriate package, do 'ipmitool delloem';
this should tell you what the secondary options are. I believe 'ipmitool
delloem sel' will decode the SEL, including the correct DIMM names.

If you install OpenManage appropriately, you can also get the SEL
decoded, as well as get alerts automatically and immediately sent to
syslog. The command line to print a decoded SEL is 'omreport system
esmlog'. OpenManage is pretty heavyweight, though. Some people do install
it and leave it running on HPC compute nodes; some people would never do
that on a production node.

Your mention of getting log messages about the SBEs makes me think you do
have OMSA installed and its daemons running -- is that correct? Try
'omreport system esmlog' if so.

Finally, during POST, Ctrl-E at the prompted moment will get you into the
BMC or iDRAC POST menu system, in which you can view and optionally clear
the SEL.
I do not think this is easily scriptable, but if all else fails, that is
one way to view the SEL, with proper decoding.

I know that's long, and I hope that helps you and possibly others.

David

On Thu, Dec 9, 2010 at 1:54 PM, Prentice Bisbal wrote:
> Jon Forrest wrote:
> > On 12/9/2010 8:08 AM, Prentice Bisbal wrote:
> >
> >> So far, mprime appears to be working. I was able to trigger an SBE in 21
> >> hours the first time I ran it. I plan on running it repeatedly for the
> >> next few days to see how well it can repeat finding errors.
> >
> > After it finds an error how do you
> > figure out which memory module to
> > replace?
>
> The LCD display on the front of the server tells me, with a message like
> this:
>
> "SBE logging disabled on DIMM C3. Reseat DIMM"
>
> I can also generate a report with Dell DSET that shows me a similar
> message. I'm sure there are other tools, but I usually have to
> create a DSET report to send to Dell, anyway.
>
> --
> Prentice

From prentice at ias.edu Fri Dec 10 08:45:25 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Fri, 10 Dec 2010 08:45:25 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To:
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu> <4D015031.70908@ias.edu>
Message-ID: <4D022EF5.1080801@ias.edu>

David,

--
Prentice

From prentice at ias.edu Fri Dec 10 09:24:30 2010
From: prentice at ias.edu (Prentice Bisbal)
Date: Fri, 10 Dec 2010 09:24:30 -0500
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To:
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu> <4D015031.70908@ias.edu>
Message-ID: <4D02381E.5090307@ias.edu>

David,

Thanks for the e-mail. Due to its length, I'm not including it in my
reply, which I know is normally bad mailing list etiquette.

The server is a Dell PowerEdge R815 with 4 8-core AMD processors and 128
GB of RAM.

I installed two identical servers at the same time, named frigga and
odin (husband and wife in Norse mythology, if you're curious). These
nodes are not part of a beowulf cluster, but this is the best forum I
know of to discuss problems like this.

Odin is the system with errors, and it started reporting SBEs almost
immediately, even when the system was completely idle. They started
within hours of operating system installation, before users were even
able to log in to the system.

As you pointed out, I don't think SBEs are fatal, but I like to address
all system errors I identify, no matter how trivial. I find that when
you get used to ignoring "harmless" errors, you eventually end up
ignoring all errors.

So, you are right that I'm looking for a tool to quickly and reliably
reproduce SBEs so that I can quickly resolve this problem with Dell.
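For a repeatable run of the kind Prentice describes, something along
these lines might work with stressapptest (its -M flag takes megabytes
of memory to test and -s the runtime in seconds, per its documentation).
Sizing the test region to ~90% of RAM is an assumption chosen to cover
most of memory without inviting the OOM killer.

```shell
#!/bin/sh
# Sketch of a repeatable SBE-hunting run with stressapptest.
# mem_to_test_mb: given MemTotal in kB (as reported by /proc/meminfo),
# return ~90% of it in MB, leaving headroom for the OS.
mem_to_test_mb() {
    kb=$1
    echo $(( kb * 9 / 10 / 1024 ))
}

# Typical usage on the 128 GB node (requires the stressapptest binary):
#   kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
#   stressapptest -M "$(mem_to_test_mb "$kb")" -s 86400   # 24-hour run
```

Logging each run's start time alongside the SEL makes it easy to
correlate triggered SBEs with specific test passes.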
For reasons I can't discuss here, working with the user is not an
option. Due to the nature of my institution, users are only here for a
couple of years, anyway, and I'm looking for a tool that I can use long
after this user (and his code) are gone.

I have been keeping detailed logs of exactly when the SBEs occur. And I
have been reseating and swapping DIMMs to see if the errors move with
the DIMM or stay with the slot, to determine whether it's a bad DIMM or
a bad motherboard. On the first occasion, the error did move with the
DIMM, and I replaced the DIMM. Since then, the errors have been moving
from DIMM to DIMM, even across banks of DIMMs. Since each bank
corresponds to a socket, this would indicate that it's not a bad on-chip
memory controller, or they're all bad.

My goal is to find a tool that I can run repeatedly to reproduce SBEs in
a finite time frame, and then run it repeatedly and collect data on
where these SBEs occur. I suspect it's a bad motherboard, but unless I
have overwhelming data showing that, Dell will just keep replacing the
DIMMs, and I'm pretty confident it's not bad DIMMs in this case.

As stated earlier, HPL wasn't reliable for me in this capacity. I'm now
using mprime's stress test mode, and will also test stressapptest.

--
Prentice

From mdidomenico4 at gmail.com Fri Dec 10 10:15:16 2010
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Fri, 10 Dec 2010 15:15:16 +0000
Subject: [Beowulf] tesla benchmarking
Message-ID:

does anyone know of easy to run code that would swallow the cpu/memory
on the chassis but also a tesla card?
A lot of the tools I typically used in the past that have been ported to
GPUs don't seem to use up much of the memory, or to use all of the GPU
constantly.

I'm running through NAMD at the moment, which does seem to make pretty
good use of the GPU processor, but it doesn't seem to use much, if any,
of the memory.

CUDA-Linpack seems to cough up an error at runtime, but hopefully I'll
get that going. I'm curious whether there's anything else I didn't know
about.

From david.t.kewley at gmail.com Fri Dec 10 14:47:20 2010
From: david.t.kewley at gmail.com (David Kewley)
Date: Fri, 10 Dec 2010 11:47:20 -0800
Subject: [Beowulf] Memory stress testing tools.
In-Reply-To: <4D02381E.5090307@ias.edu>
References: <4CFE66E2.6030805@ias.edu> <4D00FEFC.2080509@ias.edu> <4D010B18.4000305@berkeley.edu> <4D015031.70908@ias.edu> <4D02381E.5090307@ias.edu>
Message-ID:

Prentice,

Thanks for filling in some details. What you say makes complete sense to
me.

Is it the case that frigga has seen similar stress with no SBEs? If so,
I agree it seems like something else is going on besides bad DIMMs. To
test that, if you can schedule simultaneous downtime on the two boxes,
you might swap all DIMMs between odin and frigga.

If you do a few DIMM replacements, but continue to have the sense that
DIMM replacements aren't really solving the problem, and you have good
evidence why you think that, I encourage you to make sure Dell Support
hears and understands that, and make sure they're looking more
holistically than at individual DIMMs. They may look more broadly on
their own, or you may need to nudge them.
David

On Fri, Dec 10, 2010 at 6:24 AM, Prentice Bisbal wrote:
> David,
>
> Thanks for the e-mail. Due to its length, I'm not including it in my
> reply, which I know is normally bad mailing list etiquette.
>
> The server is a Dell PowerEdge R815 with 4 8-core AMD processors and
> 128 GB of RAM.
>
> I installed two identical servers at the same time, named frigga and
> odin (husband and wife in Norse mythology, if you're curious). These
> nodes are not part of a beowulf cluster, but this is the best forum I
> know of to discuss problems like this.
>
> Odin is the system with errors, and it started reporting SBEs almost
> immediately, even when the system was completely idle. They started
> within hours of operating system installation, before users were even
> able to log in to the system.
>
> As you pointed out, I don't think SBEs are fatal, but I like to address
> all system errors I identify, no matter how trivial. I find that when
> you get used to ignoring "harmless" errors, you eventually end up
> ignoring all errors.
>
> So, you are right that I'm looking for a tool to quickly and reliably
> reproduce SBEs so that I can quickly resolve this problem with Dell.
> For reasons I can't discuss here, working with the user is not an
> option. Due to the nature of my institution, users are only here for a
> couple of years, anyway, and I'm looking for a tool that I can use long
> after this user (and his code) are gone.
>
> I have been keeping detailed logs of exactly when the SBEs occur. And I
> have been reseating and swapping DIMMs to see if the errors move with
> the DIMM or stay with the slot, to determine whether it's a bad DIMM or
> a bad motherboard. On the first occasion, the error did move with the
> DIMM, and I replaced the DIMM. Since then, the errors have been moving
> from DIMM to DIMM, even across banks of DIMMs.
> Since each bank corresponds to a socket, this would indicate that it's
> not a bad on-chip memory controller, or they're all bad.
>
> My goal is to find a tool that I can run repeatedly to reproduce SBEs
> in a finite time frame, and then run it repeatedly and collect data on
> where these SBEs occur. I suspect it's a bad motherboard, but unless I
> have overwhelming data showing that, Dell will just keep replacing the
> DIMMs, and I'm pretty confident it's not bad DIMMs in this case.
>
> As stated earlier, HPL wasn't reliable for me in this capacity. I'm now
> using mprime's stress test mode, and will also test stressapptest.
>
> --
> Prentice

From mathog at caltech.edu Fri Dec 10 16:18:02 2010
From: mathog at caltech.edu (David Mathog)
Date: Fri, 10 Dec 2010 13:18:02 -0800
Subject: [Beowulf] Memory stress testing tools
Message-ID:

Prentice Bisbal wrote:

> The server is a Dell PowerEdge R815 with 4 8-Core AMD processors and
> 128 GB of RAM.

If the erroneous memory locations are moving around in memory without
correlation to the DIMMs, then the next most likely culprits are a
marginal power supply, CPU, or motherboard, in pretty much that order.
(OK, kind of a toss-up for CPU vs. motherboard, but since you have 32
cores in the system I put it first.)

If you have access to an oscilloscope, look closely at the voltages on
the two machines. No need to cut in anywhere; just measure +5 and +12V
on an unused disk or fan connector.
If the machine prone to memory errors is significantly noisier than the
one that is not, that could be the problem. I have seen this exactly
once: all PS testers said it was good, and a multimeter had it pegged at
the right voltages, but there was a ton of high-frequency noise coming
out of the power supply.

If you can disable CPUs through the BIOS on that machine, running for a
while under each CPU alone might narrow the issue down to one of the
four. You wouldn't be done then, though, because it could be the socket
and not the CPU itself. Still, if you can get it down to one CPU, then
you could swap that with another and see if the issue moves with it.

You probably already did this, but be sure both machines have the same
BIOS release.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From samuel at unimelb.edu.au Mon Dec 13 17:43:19 2010
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Tue, 14 Dec 2010 09:43:19 +1100
Subject: [Beowulf] IB symbol error thresholds for health check scripts ?
Message-ID: <4D06A187.2080201@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi folks,

We run a bunch of health checks [1] on a compute node through Torque [2]
and if they fail the node gets knocked offline.

One of the checks we do is to check that there are no symbol errors on
the IB link. However, I'm wondering if simply saying a single error is
too brutal for this - what do other people do about these?

cheers!
Chris

[1] - for the record we check things like - amount of RAM, failed DIMMs
(via IPMI on IBM or memlog on SGI), number of cores, number and speed of
CPUs, LDAP OK, home directories accessible, etc.

[2] - checks run prior to a job start, after a job exits and every 7.5
minutes (every 10 mom intervals).

- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computational Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk0GoYcACgkQO2KABBYQAh9w1gCgh19IOhXa5BWOmC3+qyZaDDr/
MrYAn1at4YwaaNkmmZpNAVNHBF0OIH0V
=/gDC
-----END PGP SIGNATURE-----

From stuartb at 4gh.net Wed Dec 29 13:29:21 2010
From: stuartb at 4gh.net (Stuart Barkley)
Date: Wed, 29 Dec 2010 13:29:21 -0500 (EST)
Subject: [Beowulf] IB symbol error thresholds for health check scripts ?
In-Reply-To: <4D06A187.2080201@unimelb.edu.au>
References: <4D06A187.2080201@unimelb.edu.au>
Message-ID:

On Mon, 13 Dec 2010 at 17:43 -0000, Christopher Samuel wrote:

> We run a bunch of health checks [1] on a compute node through Torque
> [2] and if they fail the node gets knocked offline.

Can you share these scripts? I'm needing to get something started along
these lines (torque, Moab, Infiniband, IBM System x, xCAT). I'm sure
I'll find things needing adaptation to our environment.

> One of the checks we do is to check that there are no symbol errors
> on the IB link.
> However, I'm wondering if simply saying a single
> error is too brutal for this - what do other people do about these?

I'm looking at Infiniband problems currently and have been watching our
SymbolErrorCounter values. I'm told a "small number" of these errors is
okay. I don't know the definition of "small" or over how long a time
period.

Over the last week, 24 of our nodes have shown at least two errors. Of
these, 6 nodes are showing over 400 errors (450-30000), and these nodes
need attention (I've manually downed them until I can get to the
hardware). The remaining nodes are all < 50 errors, with half of those
< 10.

I'm planning to do more proactive monitoring of the Infiniband fabric.
The current toolset is very awkward to use for monitoring. There is an
updated Infiniband Fabric Suite from QLogic which appears to improve
this significantly. It should be possible to do the Infiniband
monitoring completely off-node so as not to perturb the computations too
much.

> [1] - for the record we check things like - amount of RAM, failed
> DIMMs (via IPMI on IBM or memlog on SGI), number of cores, number
> and speed of CPUs, LDAP OK, home directories accessible, etc.

All things we need to check. I manually found several of our nodes
running with one disabled RAM stick.

> [2] - checks run prior to a job start, after a job exits and every
> 7.5 minutes (every 10 mom intervals).

Also when the node comes up, before mom starts, I assume?

Stuart Barkley
--
I've never been lost; I was once bewildered for three days, but never
lost! -- Daniel Boone
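One way to make the "single symbol error is too brutal" check less
brittle, along the lines Stuart describes, is to save the counter at each
check and flag the node only when the delta over the interval exceeds a
threshold. The threshold of 10 per check interval here is an arbitrary
placeholder, and actually reading SymbolErrorCounter (e.g. with
perfquery) is left out of this sketch.

```python
def ib_link_healthy(prev_count, curr_count, threshold=10):
    """True if the SymbolErrorCounter delta since the last check
    is within the tolerated threshold."""
    delta = curr_count - prev_count
    if delta < 0:            # counter was cleared (or wrapped) between checks
        delta = curr_count
    return delta <= threshold

# A node that accumulated 3 new errors since the last check passes;
# one that accumulated 450 gets knocked offline.
```

The reset handling matters in practice, since admins often clear port
counters while debugging, and a naive delta would then go negative.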