From eugen at leitl.org Wed May 4 06:33:46 2011 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 4 May 2011 12:33:46 +0200 Subject: [Beowulf] Chinese Chip Wins Energy-Efficiency Crown Message-ID: <20110504103346.GH23560@leitl.org> http://spectrum.ieee.org/semiconductors/processors/chinese-chip-wins-energyefficiency-crown?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+IeeeSpectrum+%28IEEE+Spectrum%29&utm_content=Bloglines Chinese Chip Wins Energy-Efficiency Crown Though slower than competitors, the energy-saving Godson-3B is destined for the next Chinese supercomputer By Joseph Calamia / May 2011 The Dawning 6000 supercomputer, which Chinese researchers expect to unveil in the third quarter of 2011, will have something quite different under its hood. Unlike its forerunners, which employed American-born chips, this machine will harness the country's homegrown high-end processor, the Godson-3B. With a peak frequency of 1.05 gigahertz, the Godson is slower than its competitors' wares, at least one of which operates at more than 5 GHz, but the chip still turns heads with its record-breaking energy efficiency. It can execute 128 billion floating-point operations per second using just 40 watts, double or more the performance per watt of competitors. The Godson has an eccentric interconnect structure (for relaying messages among multiple processor cores) that also garners attention. While Intel and IBM are commercializing chips that will shuttle communications between cores merry-go-round style on a "ring interconnect," the Godson connects cores using a modified version of the gridlike interconnect system called a mesh network. The processor's designers, led by Weiwu Hu at the Chinese Academy of Sciences, in Beijing, seem to be placing their bets on a new kind of layout for future high-end computer processors. A mesh design goes hand in hand with saving energy, says Matthew Mattina, chief architect at the San Jose, Calif.-based Tilera Corp., a chipmaker now shipping 36- and 64-core processors using on-chip mesh interconnects. Imagine a ring interconnect as a traffic roundabout. Getting to some exits requires you to drive nearly around the entire circle. Traveling away from your destination before getting there, says Mattina, requires more transistor switching and therefore consumes more energy. A mesh network is more like a city's crisscrossed streets. "In a mesh, you always traverse the minimum amount of wire; you're never going the wrong way," he says. On the 8-core Godson chip, 4 cores form a tightly bound unit: each core sits on a corner of a square of interconnects, as in a usual mesh. Godson researchers have also connected each corner to its opposite, using a pair of diagonal interconnects to form an X through the square's center. A "crossbar" interconnect then serves as an overpass, linking this 4-core neighborhood to a similar 4-core setup nearby. Godson developers believe that their modified mesh's scalability will prove a key advantage, as chip designers cram more cores onto future chips. Yunji Chen, a Godson architect, says that competitors' ring interconnects may have trouble squeezing in more than 32 cores. Indeed, one of the ring's benefits could prove its future liability. Linking new cores to a ring is fairly easy, says K.C. Smith, an emeritus professor of electrical and computer engineering at the University of Toronto. After all, there's only one path to send information, or two in a bidirectional ring. 
But sharing a common communication path also means that each additional core adds to the length of wire that messages must travel and increases the demand for that path. With a large number of cores, "the timing around this ring just gets out of hand," Smith says. "You can't get service when you need it." Of course, adding more cores in a mesh also stresses the system. Even if you have a grid of paths providing multiple communication channels, more cores increase the demand for the network, and more demand makes traveling long distances difficult: Try driving across New York City at rush hour. Still, the bandwidth scaling of a mesh interconnect is superior to that of a ring, Tilera's Mattina says. He notes that the total bandwidth available with a mesh interconnect increases as you add cores, but with a ring interconnect, the total bandwidth remains constant even as the core count increases. Latency (the time it takes to get a message from one core to another) is also more favorable in a mesh design, Chen says. In a ring interconnect, latency increases linearly with the core count, he says, while in a mesh design it increases with the square root of the number of cores. Reid Riedlinger, a principal engineer at Intel, points out that a ring interconnect has its own scalability benefits. Intel's recently unveiled 8-core Poulson design employs a ring not only to add more cores but also to add easy-to-access on-chip memory, or cache. As long as the chip has the power and the space, Riedlinger says, a ring makes it easy to add each core and cache as a module, a move that would require more complicated validity studies and logic modification in a mesh. "Adding the additional ring stop has a very small impact on latency, and the additional cache capacity will provide performance benefits for many applications," he says. For those who are not building a national supercomputer, Riedlinger also points out that a ring setup is more easily scalable in a different direction. "You might start with an 8-core design," he says, "and then, to suit a different market segment, you might chop 4 cores out of the middle and sell it as a different product." This article originally appeared in print as "China's Godson Gamble". _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From cbergstrom at pathscale.com Wed May 4 06:39:45 2011 From: cbergstrom at pathscale.com (Christopher Bergström) Date: Wed, 4 May 2011 17:39:45 +0700 Subject: [Beowulf] Chinese Chip Wins Energy-Efficiency Crown In-Reply-To: <20110504103346.GH23560@leitl.org> References: <20110504103346.GH23560@leitl.org> Message-ID: On Wed, May 4, 2011 at 5:33 PM, Eugen Leitl wrote: > > http://spectrum.ieee.org/semiconductors/processors/chinese-chip-wins-energyefficiency-crown?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+IeeeSpectrum+%28IEEE+Spectrum%29&utm_content=Bloglines > > Chinese Chip Wins Energy-Efficiency Crown > > Though slower than competitors, the energy-saving Godson-3B is destined > for the next Chinese supercomputer > > By Joseph Calamia / May 2011 > > The Dawning 6000 supercomputer, which Chinese researchers expect to unveil in > the third quarter of 2011, will have something quite different under its > hood. 
> Unlike its forerunners, which employed American-born chips, this > machine will harness the country's homegrown high-end processor, the > Godson-3B. With a peak frequency of 1.05 gigahertz, the Godson is slower than > its competitors' wares, at least one of which operates at more than 5 GHz, > but the chip still turns heads with its record-breaking energy efficiency. It > can execute 128 billion floating-point operations per second using just 40 > watts, double or more the performance per watt of competitors. *cough* Wow.. they've brought SiCortex back to life... _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Wed May 4 09:50:50 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 04 May 2011 09:50:50 -0400 Subject: [Beowulf] Chinese Chip Wins Energy-Efficiency Crown In-Reply-To: References: <20110504103346.GH23560@leitl.org> Message-ID: <4DC159BA.8090005@ias.edu> On 05/04/2011 06:39 AM, Christopher Bergström wrote: > On Wed, May 4, 2011 at 5:33 PM, Eugen Leitl wrote: >> >> http://spectrum.ieee.org/semiconductors/processors/chinese-chip-wins-energyefficiency-crown?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+IeeeSpectrum+%28IEEE+Spectrum%29&utm_content=Bloglines >> >> Chinese Chip Wins Energy-Efficiency Crown >> >> Though slower than competitors, the energy-saving Godson-3B is destined >> for the next Chinese supercomputer >> >> By Joseph Calamia / May 2011 >> >> The Dawning 6000 supercomputer, which Chinese researchers expect to unveil in >> the third quarter of 2011, will have something quite different under its >> hood. Unlike its forerunners, which employed American-born chips, this >> machine will harness the country's homegrown high-end processor, the >> Godson-3B. With a peak frequency of 1.05 gigahertz, the Godson is slower than >> its competitors' wares, at least one of which operates at more than 5 GHz, >> but the chip still turns heads with its record-breaking energy efficiency. It >> can execute 128 billion floating-point operations per second using just 40 >> watts, double or more the performance per watt of competitors. > > *cough* > > Wow.. they've brought SiCortex back to life... Oh, snap! -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Tue May 10 01:37:33 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Mon, 9 May 2011 22:37:33 -0700 Subject: [Beowulf] How InfiniBand gained confusing bandwidth numbers Message-ID: <20110510053733.GB12826@bx9.net> http://dilbert.com/strips/comic/2011-05-10/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
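The scaling claim in the Godson article above (ring latency grows roughly linearly with the core count, mesh latency with roughly its square root) is easy to sanity-check by computing average hop counts for idealized topologies. A toy sketch in bash and awk; it models only hop distance, not routers, link widths, or contention:

#!/bin/bash
# Rough check of the ring-vs-mesh latency scaling: average hop count
# between two cores on a bidirectional ring vs. a k x k 2D mesh,
# comparing equal core counts. Idealized distances only.
for k in 4 6 8 12 16 32; do
    awk -v k="$k" 'BEGIN {
        n = k * k
        # bidirectional ring of n cores: distance is the shorter way around
        rsum = 0
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++) {
                d = i - j; if (d < 0) d = -d
                if (n - d < d) d = n - d
                rsum += d
            }
        # k x k mesh: Manhattan distance between grid positions
        msum = 0
        for (a = 0; a < n; a++)
            for (b = 0; b < n; b++) {
                dx = int(a / k) - int(b / k); if (dx < 0) dx = -dx
                dy = a % k - b % k; if (dy < 0) dy = -dy
                msum += dx + dy
            }
        pairs = n * n
        printf "%5d cores: ring avg %6.2f hops, mesh avg %6.2f hops\n", n, rsum / pairs, msum / pairs
    }'
done

For 1,024 cores this prints a ring average of roughly 256 hops against a mesh average of about 21, which is the linear-versus-square-root behaviour Chen describes.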
From mdidomenico4 at gmail.com Wed May 11 20:57:57 2011 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed, 11 May 2011 20:57:57 -0400 Subject: [Beowulf] EPCC and DIR cluster Message-ID: Is there anyone on the list associated with EPCC or knows someone at EPCC? If so, i recently saw an article in Scientific Computing magazine, where there was a blurb about a smallish cluster built a EPCC utilizing Atom chips/GPU's and HDD's, whereby the design was more amdahl balanced for "data intensive research". I can't seem to locate anything on the web about it, but I'm interested in the spec's/design for the machine and how it performs thanks _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From landman at scalableinformatics.com Fri May 20 00:35:25 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 20 May 2011 00:35:25 -0400 Subject: [Beowulf] Curious about ECC vs non-ECC in practice Message-ID: <4DD5EF8D.8070909@scalableinformatics.com> Hi folks Does anyone run a large-ish cluster without ECC ram? Or with ECC turned off at the motherboard level? I am curious if there are numbers of these, and what issues people encounter. I have some of my own data from smaller collections of systems, I am wondering about this for larger systems. Thanks! Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Fri May 20 01:45:01 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 19 May 2011 22:45:01 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <4DD5EF8D.8070909@scalableinformatics.com> References: <4DD5EF8D.8070909@scalableinformatics.com> Message-ID: <20110520054501.GE16676@bx9.net> On Fri, May 20, 2011 at 12:35:25AM -0400, Joe Landman wrote: > Does anyone run a large-ish cluster without ECC ram? Or with ECC > turned off at the motherboard level? I am curious if there are numbers > of these, and what issues people encounter. I have some of my own data > from smaller collections of systems, I am wondering about this for > larger systems. I don't think anyone's done the experiment with a 'larger system' since "Big Mac" had to replace all of their servers with ones that had ECC. Still, any cluster that can manipulate the BIOS appropriately could easily do the experiment. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
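One cheap way to put numbers on this without turning ECC off is to poll the kernel's EDAC counters for corrected and uncorrected errors across the cluster. A sketch; the sysfs paths follow the mainline EDAC layout and the node names are placeholders:

#!/bin/bash
# Report corrected (ce_count) and uncorrected (ue_count) ECC error totals
# per memory controller on each node; requires the edac_* kernel modules.
NODES="node001 node002 node003"    # placeholder node list
for node in $NODES; do
    ssh "$node" 'for mc in /sys/devices/system/edac/mc/mc*; do
        [ -d "$mc" ] || continue
        echo "$(hostname) $(basename $mc): ce=$(cat $mc/ce_count) ue=$(cat $mc/ue_count)"
    done'
done

A steadily climbing ce_count on an otherwise healthy node is exactly the kind of event that passes silently on non-ECC hardware.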
From gmpc at sanger.ac.uk Fri May 20 04:58:59 2011 From: gmpc at sanger.ac.uk (Guy Coates) Date: Fri, 20 May 2011 09:58:59 +0100 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <20110520054501.GE16676@bx9.net> References: <4DD5EF8D.8070909@scalableinformatics.com> <20110520054501.GE16676@bx9.net> Message-ID: <4DD62D53.6030806@sanger.ac.uk> On 20/05/11 06:45, Greg Lindahl wrote: > On Fri, May 20, 2011 at 12:35:25AM -0400, Joe Landman wrote: > >> Does anyone run a large-ish cluster without ECC ram? Or with ECC >> turned off at the motherboard level? I am curious if there are numbers >> of these, and what issues people encounter. I have some of my own data >> from smaller collections of systems, I am wondering about this for >> larger systems. We did, circa 2003. Never again. When we were lucky, the uncorrected errors happened in memory in use by the kernel or application code, and we got hard machine crashes or code seg-faulting. Those were easy to spot. When we were unlucky, the errors happened in page cache, resulting in data being randomly transmuted. Most of the code we were running at the time did minimal input sanity checking. It was quite instructive to see just how much genomic analysis code would quite happily compute on DNA sequences that contained things other than ATGC. The duff runs would eventually get picked up by the various sanity-checks that happened at the end of our analysis pipelines, but it involved quite a bit of developer & sysadmin effort to track down and re-run all of the possibly affected jobs. Cheers, Guy -- Dr. Guy Coates, Informatics Systems Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From a.travis at abdn.ac.uk Fri May 20 11:35:45 2011 From: a.travis at abdn.ac.uk (Tony Travis) Date: Fri, 20 May 2011 16:35:45 +0100 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <4DD5EF8D.8070909@scalableinformatics.com> References: <4DD5EF8D.8070909@scalableinformatics.com> Message-ID: <4DD68A51.70605@abdn.ac.uk> On 20/05/11 05:35, Joe Landman wrote: > Hi folks > > Does anyone run a large-ish cluster without ECC ram? Or with ECC > turned off at the motherboard level? I am curious if there are numbers > of these, and what issues people encounter. I have some of my own data > from smaller collections of systems, I am wondering about this for > larger systems. Hi, Joe. I ran a small cluster of ~100 32-bit nodes with non-ECC memory and it was a nightmare, as Guy described in his email, until I pre-emptively tested the memory in user-space, using Charles Cazabon's "memtester": http://pyropus.ca/software/memtester Prior to this, *all* the RAM had passed Memtest86+. I had a strict policy that if a system crashed, for any reason, it was re-tested with Memtest86+, then 100 passes of "memtester" before being allowed to re-join the Beowulf cluster. 
This made the Beowulf much more stable running openMosix. However, I've scrapped all our non-ECC nodes now because the real worry is not knowing if an error has occurred... Apparently this is still a big issue for computers in space, using non-ECC RAM for solid-state storage on grounds of cost for imaging. They, apparently, use RAM background SoftECC 'scrubbers' like this: http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf Bye, Tony. -- Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Fri May 20 11:52:43 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 20 May 2011 08:52:43 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <4DD68A51.70605@abdn.ac.uk> Message-ID: On 5/20/11 8:35 AM, "Tony Travis" wrote: >On 20/05/11 05:35, Joe Landman wrote: >> Hi folks >> >> Does anyone run a large-ish cluster without ECC ram? Or with ECC >> turned off at the motherboard level? I am curious if there are numbers >> of these, and what issues people encounter. I have some of my own data >> from smaller collections of systems, I am wondering about this for >> larger systems. > >Hi, Joe. > >Apparently this is still a big issue for computers in space, using >non-ECC RAM for solid-state storage on grounds of cost for imaging. >They, apparently, use RAM background SoftECC 'scrubbers' like this: > >http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng >.pdf > > Yes, it's a big tradeoff in the space world. Not only does ECC require extra memory, but the EDAC logic consumes power and, typically, slows down the bus speed (I.e. You need an extra bus cycle to handle the EDAC logic propagation delay). There's also a practical detail that the upset rate might be low enough that it is ok to just tolerate the upsets, because they'll get handled at some higher level of the process. For instance, if you have a RAM buffer in a communication link handling the FEC coded bits, then there's not much difference between a bit flip in RAM and a bit error on the comm link, so you might as well just let the comm FEC code take care of the bit errors. We tend to use a lot of checksum strategies. Rather than an EDAC strategy which corrects errors, it's good enough to just know that an error occurred, and retry. This is particularly effective on Flash memory, which has transient read errors: read it again and it works ok. Another example is doing an FFT. There are some strategies which allow you to do a second fast computation that essentially provides a "check" on the results of the FFT (e.g. The mean of the input data should match the "DC term" in the FFT) We might also keep triple copies of key variables. You read all three values and compare them before starting the computation. Software Triple Redundancy, as it were. 
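At the job-script level the same "detect an error and retry" idea needs nothing more than a wrapper: run the (deterministic) computation twice and accept the result only if the two outputs agree bit for bit. A toy sketch; the solver name, flags, and file names are made up for illustration:

#!/bin/bash
# Accept a result only when two independent runs produce identical output.
# Assumes the application is deterministic (bit-identical reruns).
run_once() {
    ./solver --input case.in --output "$1"    # hypothetical application
}
for attempt in 1 2 3; do
    run_once result.a && run_once result.b || { echo "solver failed" >&2; exit 1; }
    if cmp -s result.a result.b; then
        mv result.a result.dat
        echo "runs agree on attempt $attempt"
        exit 0
    fi
    echo "outputs differ on attempt $attempt, retrying" >&2
done
echo "no two runs agreed; suspect the hardware" >&2
exit 1

Doubling the compute is crude, but it trades cheap cycles for the silent page-cache corruption Guy described earlier.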
A lot of times, the probability of an error occurring "during" the computation is sufficiently low, compared to the probability of an error occurring during the very long waiting time between operating on the data. There's also the whole question of whether EDAC main memory buys you much, when all the (ever larger) cache isn't protected. Again, it comes down to a probability analysis. My own personal theory on this is that you are much more likely to have a miscalculation due to a software bug than due to an upset. Further, it's impossible to get all the bugs out in finite time/money, so you might as well design your whole system to be fault tolerant, not in a "oh my gosh, we had an error, let's do extensive fault recovery", but a "we assume the computations are always a bit wonky, so we factor that into our design". That is, design so that retries and self checks are just part of the overhead. Kind of like how a decent experiment or engineering design takes into account measurement uncertainty stack-up. As hardware gets smaller and faster and lower power, the "cost" to provide extra computational resources to implement a strategy like this gets smaller, relative to the ever increasing human labor cost to try and make it perfect. (and, of course, this *is* how humans actually do stuff.. You don't precompute all of your control inputs to the car.. You basically set a general goal, and continuously adjust to drive towards that goal.) Jim Lux > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From deadline at eadline.org Fri May 20 12:35:26 2011 From: deadline at eadline.org (Douglas Eadline) Date: Fri, 20 May 2011 12:35:26 -0400 (EDT) Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <4DD5EF8D.8070909@scalableinformatics.com> References: <4DD5EF8D.8070909@scalableinformatics.com> Message-ID: <50574.192.168.93.213.1305909326.squirrel@mail.eadline.org> Joe While this is somewhat anecdotal, it may be helpful. Not a large-ish cluster, but as you may guess, I wondered about this for Limulus (http://limulus.basement-supercomputing.com) I wrote a script (will post it if anyone is interested) that runs memtester until you stop it or it finds an error. I ran it on several Core2 Duo systems with Kingston DDR2-800 PC2-6400 memory. As I recall, I ran it on 2-3 systems, only one showed an error. I stopped the others after about three weeks. Here is an example of the script output when it fails (it logs the memtest output). There was an error, inspect memtest-1178 Start Date was: Mon Apr 20 16:04:35 EDT 2009 Failure Date was: Fri May 8 17:55:43 EDT 2009 Test ran 1178 times failing after 1561868 Seconds (26031 Minutes or 433 Hours or 18 Days) My experience in running small clusters without ECC has been very good. IMO it is also a question of the quality of the memory vendor. I never had an issue when running tests and benchmarks, which I do quite a bit on new hardware e.g. goo.gl/YoBaz -- Doug > Hi folks > > Does anyone run a large-ish cluster without ECC ram? Or with ECC > turned off at the motherboard level? I am curious if there are numbers > of these, and what issues people encounter. 
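Doug's script isn't reproduced here, but a minimal sketch of a wrapper along the lines he describes (loop memtester, log each pass, stop and report on the first failure) could look like this; it assumes memtester returns a nonzero exit status on a failed pass, and the size and file names are arbitrary:

#!/bin/bash
# Soak test: run memtester repeatedly, logging each pass, until one fails.
SIZE=1024M                 # amount of RAM to exercise per pass
START=$(date)
run=0
while :; do
    run=$((run + 1))
    if ! memtester "$SIZE" 1 > "memtest-$run.log" 2>&1; then
        echo "There was an error, inspect memtest-$run.log"
        echo "Start Date was:   $START"
        echo "Failure Date was: $(date)"
        exit 1
    fi
done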
> I have some of my own data > from smaller collections of systems, I am wondering about this for > larger systems. > > Thanks! > > Joe > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics, Inc. > email: landman at scalableinformatics.com > web : http://scalableinformatics.com > http://scalableinformatics.com/sicluster > phone: +1 734 786 8423 x121 > fax : +1 866 888 3112 > cell : +1 734 612 4615 > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From james.p.lux at jpl.nasa.gov Fri May 20 13:21:12 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 20 May 2011 10:21:12 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <50574.192.168.93.213.1305909326.squirrel@mail.eadline.org> Message-ID: On 5/20/11 9:35 AM, "Douglas Eadline" wrote: >Joe > >While this is somewhat anecdotal, it may be helpful. > >Not a large-ish cluster, but as you may guess, I wondered >about this for Limulus >(http://limulus.basement-supercomputing.com) > >I wrote a script (will post it if anyone is interested) >that runs memtester until you stop it or it finds >an error. I ran it on several Core2 Duo systems >with Kingston DDR2-800 PC2-6400 memory. > >My experience in running small clusters >without ECC has been very good. IMO it is also >a question of the quality of the memory vendor. >I never had an issue when running tests and >benchmarks, which I do quite a bit on new >hardware e.g. I'm going to guess that it's highly idiosyncratic. The timing margins on all the signals between CPU, memory, and peripherals are tight, they're temperature dependent and process dependent, so you could have the exact same design with very similar RAM and one will get errors and the other won't. Folks who design PCI bus interfaces for a living earn their pay, especially if they have to make it work with lots of different mfrs: just because all the parts meet their databook specs doesn't mean that the system will play nice together. Consider that for memory, you have 64 odd data lines and 20 or so address lines and some strobes that ALL have to switch together. A data sensitive pattern where a bunch of lines move at the same time, and induce a bit of a voltage into an adjacent trace, which is a bit slower or faster than the rest, and you've got the makings of a challenging hunt for the problem. PC board trace lengths all have to be carefully matched, loads have to be carefully matched, etc. 66 MHz -> 15 ns, but modern DDR RAMs do batches of words separated by a few ns. 1 ns is about 10-15 cm of trace length, but it's the loading, terminations, and other stuff that causes a problem. Hang a 1 pF capacitor off that 100 ohm line, and there's a tenth of a ns time constant right there. You could also have EMI/EMC issues that cause problems. 
That same ragged edge timing margin might be fine with 8 tower cases sitting on a shelf, but not so good with the exact same mobo and memory stacked into 1-2U cases in a 19" rack. Power cords and ethernet cables also carry EMI around. In a large cluster these things will all be aggravated: you've got more machines running, so you increase the error probability right there. You've got more electrical noise on the power carried between machines. You've typically got denser packaging. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Fri May 20 14:26:31 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 20 May 2011 14:26:31 -0400 (EDT) Subject: [Beowulf] Execution time measurements - clarification Message-ID: From: Mikhail Kuzminsky Subject: [Beowulf] Execution time measurements - clarification Dear Mark, could you pls forward my message to beowulf at beowulf.org (because, as before, my messages can't be delivered to the mailing list)? It's a clarification of my previous question here. Mikhail --------------------- I have strange execution time measurements for CPU-bound jobs (to be exact, Gaussian-09 DFT frequency calculations). Results are strange for *SEQUENTIAL* calculations! Executions were performed on a dual-socket Opteron 2350 (quad-core) server running OpenSuSE Linux 10.3. When I run 2 identical examples of the same batch job simultaneously, the execution time of *each* job is LOWER than for a single job run! I thought that G09's own reported times might be wrong, so I checked them via time: (time g09 *pair1.com) >&testt1 & etc. But this confirms the strange results: For a pair of simultaneously running sequential jobs 88801.141u 52.475s 24:40:57.58 99.9% 0+0k 0+0io 1221pf+0w 88901.996u 13.472s 24:41:53.97 100.0% 0+0k 0+0io 0pf+0w For a run of 1 example of the same sequential job 100365.236u 27.297s 27:53:13.53 99.9% 0+0k 0+0io 1pf+0w Are there any ideas why this might happen? Mikhail Kuzminsky _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Fri May 20 20:29:10 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 20 May 2011 17:29:10 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: References: <4DD68A51.70605@abdn.ac.uk> Message-ID: <20110521002910.GB14350@bx9.net> On Fri, May 20, 2011 at 08:52:43AM -0700, Lux, Jim (337C) wrote: > As hardware gets smaller and faster and lower power, the "cost" to provide > extra computational resources to implement a strategy like this gets > smaller, relative to the ever increasing human labor cost to try and make > it perfect. The cost is teaching users to add checks to their codes, and to any off-the-shelf codes they start using. In hydrodynamics (CFD), often you have quantities which are explicitly conserved by the equations, and others which are conserved by physics but not by the particular numerical method you're using. The latter were quite handy for finding bugs. 
I managed to discover several numerical accuracy bugs in pre-release versions of the PathScale compilers that way. "Yes, it's a bug if the 12th decimal place changes." -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Fri May 20 20:32:27 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 20 May 2011 17:32:27 -0700 Subject: [Beowulf] Execution time measurements - clarification Message-ID: <20110521003227.GD14350@bx9.net> On Fri, May 20, 2011 at 02:26:31PM -0400, Mark Hahn forwarded a message: > When I run 2 identical examples of the same batch job simultaneously, execution time of *each* job is > LOWER than for single job run ! I'd try locking these sequential jobs to a single core, you can get quite weird effects when you don't. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mathog at caltech.edu Mon May 23 12:40:13 2011 From: mathog at caltech.edu (David Mathog) Date: Mon, 23 May 2011 09:40:13 -0700 Subject: [Beowulf] Execution time measurements - clarification Message-ID: > On Fri, May 20, 2011 at 02:26:31PM -0400, Mark Hahn forwarded a message: > > > When I run 2 identical examples of the same batch job simultaneously, execution time of *each* job is > > LOWER than for single job run ! Disk caching could cause that. Normally if the data read in isn't too big you see an effect where: run 1: 30 sec <-- 50% disk IO/ 50% CPU run 2: 15 sec <-- ~100% CPU where the first run loaded the data into disk cache, and the second run read it there, saving a lot of real disk IO. Under some very peculiar conditions, on a multicore system, if run 1 and 2 are "simultaneous" they could seesaw back and forth for the "lead", so they end up taking turns doing the actual disk IO, with the total run time for each ending up between the times for the two runs above. Note that they wouldn't have to be started at exactly the same time for this to happen, because the job that starts second is going to be reading from cache, so it will tend to catch up to the job that started first. Once they are close then noise in the scheduling algorithms could cause the second to pass the first. (If it didn't pass, then this couldn't happen, because the second would always be waiting for the first to pull data in from disk.) Of course, you also need to be sure that run 1 isn't interfering with run 2. They might, for instance, save/retrieve intermediate values to the same filename, so that they really cannot be run safely at the same time. That is, they run faster together, but they run incorrectly. 
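As a concrete way to follow the earlier suggestion of locking each job to a single core (so scheduler migration and shared-core effects are ruled out), something along these lines could be tried; the core numbers and input names are placeholders, and the cores should be chosen on different sockets per /proc/cpuinfo:

#!/bin/bash
# Run the two identical jobs pinned to fixed cores and time each one.
# numactl --physcpubind=0 --membind=0 g09 pair1.com   # alternative that also pins memory
( /usr/bin/time taskset -c 0 g09 pair1.com > pair1.log 2>&1 ) &
( /usr/bin/time taskset -c 4 g09 pair2.com > pair2.log 2>&1 ) &
wait
tail -n 3 pair1.log pair2.log    # /usr/bin/time appends its summary to each log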
Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mathog at caltech.edu Mon May 23 15:32:33 2011 From: mathog at caltech.edu (David Mathog) Date: Mon, 23 May 2011 12:32:33 -0700 Subject: [Beowulf] Execution time measurements Message-ID: Mikhail Kuzminsky sent this to me and asked that it be posted: BEGIN FORWARD On Mon, 23 May 2011 09:40:13 -0700, "David Mathog" wrote: > > On Fri, May 20, 2011 at 02:26:31PM -0400, Mark Hahn forwarded a message: > > > When I run 2 identical examples of the same batch job > simultaneously, execution time of *each* job is > > > LOWER than for single job run ! > > Disk caching could cause that. Normally if the data read in isn't too > big you see an effect where: > > run 1: 30 sec <-- 50% disk IO/ 50% CPU > run 2: 15 sec <-- ~100% CPU I believe that the jobs are CPU-bound: top says that they use 100% of the CPU, and there is no swap activity. iostat /dev/sda3 (where the IO is performed) typically says something like: Linux 2.6.22.5-31-default (c6ws1) 05/25/2011 avg-cpu: %user %nice %system %iowait %steal %idle 1.12 0.00 0.03 0.01 0.00 98.84 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda3 0.01 0.01 8.47 20720 16845881 > > Of course, you also need to be sure that run 1 isn't interfering with > run 2. They might, for instance, save/retrieve intermediate values > to the same filename, so that they really cannot be run safely at the > same time. That is, they run faster together, but they run incorrectly. The file names used for IO are unique. I also thought about CPU frequency variations, but I think that the empty output of lsmod|grep freq is enough to show that the CPU frequency is fixed. END FORWARD OK, so not disk caching. Regarding the frequencies, better to use cat /proc/cpuinfo | grep MHz while the processes are running. Did you verify that the results for each of the two simultaneous runs are both correct? Ideally, tweak some parameter so they are slightly different from each other. 
Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From landman at scalableinformatics.com Tue May 24 11:44:15 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 24 May 2011 11:44:15 -0400 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: References: Message-ID: <4DDBD24F.7080608@scalableinformatics.com> On 05/24/2011 11:41 AM, David Mathog wrote: > Joe Landman wrote: > >> I am wondering about this for larger systems. > > Your post makes me wonder about ECC in much smaller systems, like > dedicated single computers controlling machinery or medical devices. > Some really nasty things could result from "move cutting head in X > (int32 value) mm" after the most significant bit in the int32 value has > flipped. Some bits are more important than others ... Basically I was looking for anecdotal evidence that this is a "bad thing" (TM). I have it now, and it helped me make the case I needed to make. Thanks to everyone for this, it was really helpful! -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Tue May 24 13:06:15 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 24 May 2011 10:06:15 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: References: Message-ID: This *is* a big problem. I suggest reading some of what Nancy Leveson has written. http://sunnyday.mit.edu/ "Professor Leveson started a new area of research, software safety, which is concerned with the problems of building software for real-time systems where failures can result in loss of life or property." Two popular papers you might find interesting and fun to read: "High-Pressure Steam Engines and Computer Software" (Postscript) or (PDF). This paper started as a keynote address at the International Conference on Software Engineering in Melbourne, Australia) and later was published in IEEE Software, October 1994. "The Therac-25 Accidents" (Postscript ) or (PDF). This paper is an updated version of the original IEEE Computer (July 1993) article. It also appears in the appendix of my book. There is a generic problem with complex systems, as well. "Normal Accidents" by Charles Perrow is a good work (if a bit frightening in some ways... not in a senseless fear-mongering way, but because he lays out the fundamental reasons why these things are inevitable) Marais, Dulac, and Leveson argue that the world isn't as bad as Perrow says, though. 
http://esd.mit.edu/symposium/pdfs/papers/marais-b.pdf Jim Lux +1(818)354-2075 > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of David Mathog > Sent: Tuesday, May 24, 2011 8:42 AM > To: beowulf at beowulf.org > Subject: Re: [Beowulf] Curious about ECC vs non-ECC in practice > > Joe Landman wrote: > > > I am wondering about this for larger systems. > > Your post makes me wonder about ECC in much smaller systems, like > dedicated single computers controlling machinery or medical devices. > Some really nasty things could result from "move cutting head in X > (int32 value) mm" after the most significant bit in the int32 value has > flipped. > > Regards, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mathog at caltech.edu Tue May 24 14:27:23 2011 From: mathog at caltech.edu (David Mathog) Date: Tue, 24 May 2011 11:27:23 -0700 Subject: [Beowulf] Execution time measurements Message-ID: Another message from Mikhail Kuzminsky, who for some reason or other cannot currently post directly to the list: BEGIN FORWARD First of all, I should mention that the effect is observed only for the Opteron 2350/OpenSuSE 10.3. Execution of the same job w/the same binaries on Nehalem E5520/OpenSuSE 11.1 gives the same time for 1 and 2 simultaneously running jobs. On Mon, 23 May 2011 12:32:33 -0700, "David Mathog" wrote: > On Mon, 23 May 2011 09:40:13 -0700, "David Mathog" wrote: > > > On Fri, May 20, 2011 at 02:26:31PM -0400, Mark Hahn forwarded a message: > > > > When I run 2 identical examples of the same batch job > > simultaneously, execution time of *each* job is > > > > LOWER than for single job run ! > I also thought about CPU frequency variations, but I think that the empty output > of > lsmod|grep freq > is enough to show that the CPU frequency is fixed. > > END FORWARD > Regarding the frequencies, better to use > cat /proc/cpuinfo | grep MHz 
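A small loop makes that check continuous instead of manual; a sketch (the interval and log file name are arbitrary):

#!/bin/bash
# Log per-core clock speed once a minute while the jobs run; stop with Ctrl-C.
while sleep 60; do
    echo "== $(date)"
    grep MHz /proc/cpuinfo
done >> cpu-mhz.log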
Try running your test job using: > > numactl --cpunodebind=0 --membind=0 g98 numactl w/all things bound to node 1 gives "big" execution time ( 1 day 4 hours; 2 simultaneous jobs run faster), for forcing different nodes for cpu and memory - execution time is even higher (+1 h). Therefore effect observed don't looks as result of numa allocations :-( Mikhail END FORWARD My point about the two different parameter sets on the jobs was to determine if the two were truly independent, or if they might not be interacting with each other through checkpoint files or shared memory, or the like. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mathog at caltech.edu Tue May 24 14:37:30 2011 From: mathog at caltech.edu (David Mathog) Date: Tue, 24 May 2011 11:37:30 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice Message-ID: Jim Lux posted: > "The Therac-25 Accidents" (Postscript ) or (PDF). This paper is an updated version of the original IEEE Computer (July 1993) article. It also appears in the appendix of my book. Well that was really horrible. Are car computers ECC? When all they did was engine management a memory glitch wouldn't have been too terrible, but now that some of them control automatic parking and other "higher" functions, and with around 100M units in circulation just in the USA, if they aren't ECC then memory glitches in running vehicles would have to be happening every day. David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Tue May 24 15:07:10 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 24 May 2011 12:07:10 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: References: Message-ID: > -----Original Message----- > From: David Mathog [mailto:mathog at caltech.edu] > Sent: Tuesday, May 24, 2011 11:38 AM > To: Lux, Jim (337C); beowulf at beowulf.org > Subject: RE: [Beowulf] Curious about ECC vs non-ECC in practice > > Jim Lux posted: > > > "The Therac-25 Accidents" (Postscript ) or (PDF). This paper is an > updated version of the original IEEE Computer (July 1993) article. It > also appears in the appendix of my book. > > Well that was really horrible. > > Are car computers ECC? When all they did was engine management a memory > glitch wouldn't have been too terrible, but now that some of them > control automatic parking and other "higher" functions, and with around > 100M units in circulation just in the USA, if they aren't ECC then > memory glitches in running vehicles would have to be happening every day. Car controllers tend to have mask ROM for their software which is pretty upset immune. 
The "PROM" (which today might be flash or EEPROM) holds all the coefficients for things like the fuel injection/timing, but doesn't hold the code for controlling, say, the ABS. I would imagine (but do not know) that they do things similar to what we do in spacecraft controllers: store critical data multiple times, lots of self checks on algorithm operation, etc. The report on the Toyota Throttle controller said this: "The Main and Sub-CPUs use two types of memory: non-volatile ROM for software code and volatile Static Ram (SRAM). The SRAM is protected by a single error detect and correct and a double error detect hardware function performed by error detection and correction (EDAC) logic." There's a whole reliability of software community out there with everything from certifiable processes to coding standards (MISRA) designed to make it easy to inspect and verify that the code is doing what you think, and that it handles off-nominal cases. I haven't read the whole report, but there was an analysis of the software in the Toyota controllers recently. http://www.nhtsa.gov/staticfiles/nvs/pdf/NASA-UA_report.pdf "The NESC team examined the software code (more than 280,000 lines) for paths that might initiate such a UA, but none were identified" (UA-Unintended Acceleration) The team examined the VOQ vehicles for signs of electrical faults, and subjected these vehicles to electro-magnetic interference (EMI) radiated and conducted test levels significantly above certification levels. The EMI testing did not produce any UAs, but in some cases caused the engine to slow and/or stall. (That's probably closest to what you'd see from a memory upset) Section 6.5, page 64 of the report, is "System Fail-Safe Architecture" It's pretty sophisticated, with multiple parallel schemes to prevent runaway or failure. I'm impressed at the level of thought they gave to not just shutting down the engine, but in leaving an adequate limp-home capability when one or more parts in the chain fails (e.g. if the throttle plate actually sticks, it can control the engine by turning on and off the fuel injectors). There's also an independent mechanism that detects if the pedal isn't pressed (or the redundant pedal position sensors have failed), in which case the engine cannot exceed 2500RPM, if it does, the fuel turns off, and then turns back on when the speed drops below 1100RPM And, since we Beowulfers are for the most part software weenies.. The ECM for the 2005 Camry uses a NEC V850 E1 processor. The software is in ANSI C, and compiled with Greenhills compiler. There are 256kSLOC of non-comments (along with 241kSLOC of comments) in .c files and another 40kSLOC (noncomment) in various .h files. They ran it through Coverity and CodeSonar (both of which we use at JPL), as well as SPIN (using SWARM to run it on a cluster.. now how about that) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From bill at Princeton.EDU Thu May 26 09:18:10 2011 From: bill at Princeton.EDU (Bill Wichser) Date: Thu, 26 May 2011 09:18:10 -0400 Subject: [Beowulf] Infiniband: MPI and I/O? Message-ID: <4DDE5312.9070901@princeton.edu> Wondering if anyone out there is doing both I/O to storage as well as MPI over the same IB fabric. 
Following along in the Mellanox User's Guide, I see a section on how to implement the QOS for both MPI and my lustre storage. I am curious though as to what might happen to the performance of the MPI traffic when high I/O loads are placed on the storage. In our current implementation, we are using blades which are 50% blocking (2:1 oversubscribed) when moving from a 16 blade chassis to other nodes. Would trying to do storage on top dictate moving to a totally non-blocking fabric? Thanks, Bill _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Thu May 26 12:18:18 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 26 May 2011 12:18:18 -0400 (EDT) Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: <4DDE5312.9070901@princeton.edu> References: <4DDE5312.9070901@princeton.edu> Message-ID: > Wondering if anyone out there is doing both I/O to storage as well as > MPI over the same IB fabric. I would say that is the norm. we certainly connect local storage (Lustre) to nodes via the same fabric as MPI. gigabit is completely inadequate for modern nodes, so the only alternatives would be 10G or a secondary IB fabric, both quite expensive propositions, no? I suppose if your cluster does nothing but IO-light serial/EP jobs, you might think differently. > Following along in the Mellanox User's > Guide, I see a section on how to implement the QOS for both MPI and my > lustre storage. I am curious though as to what might happen to the > performance of the MPI traffic when high I/O loads are placed on the > storage. to me, the real question is whether your IB fabric is reasonably close to full-bisection (and/or whether your storage nodes are sensibly placed, topologically.) > In our current implementation, we are using blades which are 50% > blocking (2:1 oversubscribed) when moving from a 16 blade chassis to > other nodes. Would trying to do storage on top dictate moving to a > totally non-blocking fabric? how much inter-chassis MPI do you do? how much IO do you do? IB has a small MTU, so I don't really see why mixed traffic would be a big problem. of course, IB also doesn't do all that wonderfully with hotspots. but isn't this mostly an empirical question you can answer by direct measurement? regards, mark hahn. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From Shainer at Mellanox.com Thu May 26 12:50:17 2011 From: Shainer at Mellanox.com (Gilad Shainer) Date: Thu, 26 May 2011 09:50:17 -0700 Subject: [Beowulf] Infiniband: MPI and I/O? References: <4DDE5312.9070901@princeton.edu> Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F03ACAD7B@mtiexch01.mti.com> > Wondering if anyone out there is doing both I/O to storage as well as > MPI over the same IB fabric. Following along in the Mellanox User's > Guide, I see a section on how to implement the QOS for both MPI and my > lustre storage. 
I am curious though as to what might happen to the > performance of the MPI traffic when high I/O loads are placed on the > storage. I am doing it in my lab -have build my own Lustre solution and am running it on the same network as the MPI jobs. At the end it all depends on how much bandwidth do you need for the MPI and the storage, and if you can cover both, you can do it. Today the QoS solution for IB is out there, and you can set max BW and min latency as parameters for the different traffics. > In our current implementation, we are using blades which are 50% > blocking (2:1 oversubscribed) when moving from a 16 blade chassis to > other nodes. Would trying to do storage on top dictate moving to a > totally non-blocking fabric? IB congestion control is being released now (finally), so this can help you here. Gilad _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From bill at Princeton.EDU Thu May 26 15:20:19 2011 From: bill at Princeton.EDU (Bill Wichser) Date: Thu, 26 May 2011 15:20:19 -0400 Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: References: <4DDE5312.9070901@princeton.edu> Message-ID: <4DDEA7F3.8050703@princeton.edu> Mark Hahn wrote: >> Wondering if anyone out there is doing both I/O to storage as well as >> MPI over the same IB fabric. >> > > I would say that is the norm. we certainly connect local storage > (Lustre) to nodes via the same fabric as MPI. gigabit is completely > inadequate for modern nodes, so the only alternatives would be 10G > or a secondary IB fabric, both quite expensive propositions, no? > > I suppose if your cluster does nothing but IO-light serial/EP jobs, > you might think differently. > Really? I'm surprised by that statement. Perhaps I'm just way behind on the curve though. It is typical here to have local node storage, local lustre/pvfs storage, local NFS storage, and global GPFS storage running over the GigE network. Depending on I/O loads users can make use of the storage at the right layer. Yes, users fill the 1Gbps pipe to the storage per node. But as we now implement all new clusters with IB I'm hoping to increase that bandwidth even more. If you and everyone else is doing this already, that's a good sign! Lol! As we move closer to making this happen, perhaps there will be plenty of answers then for any QOS setup questions I may have. > >> Following along in the Mellanox User's >> Guide, I see a section on how to implement the QOS for both MPI and my >> lustre storage. I am curious though as to what might happen to the >> performance of the MPI traffic when high I/O loads are placed on the >> storage. >> > > to me, the real question is whether your IB fabric is reasonably close > to full-bisection (and/or whether your storage nodes are sensibly placed, > topologically.) > > >> In our current implementation, we are using blades which are 50% >> blocking (2:1 oversubscribed) when moving from a 16 blade chassis to >> other nodes. Would trying to do storage on top dictate moving to a >> totally non-blocking fabric? >> > > how much inter-chassis MPI do you do? how much IO do you do? > IB has a small MTU, so I don't really see why mixed traffic would > be a big problem. of course, IB also doesn't do all that wonderfully > with hotspots. 
but isn't this mostly an empirical question you can > answer by direct measurement? > How would I measure by direct measurement? I don't have the switching infrastructure to compare a 2:1 versus a 1:1 unless you're talking about inside a chassis. But since my storage would connect into the switching infrastructure how and what would I compare? Jobs are not scheduled to run on a single chassis, or at least they try to but are not placed on hold for more than 10 minutes waiting. So there are lots of wide jobs running between chassis. Some don't even fit on a chassis. As for the question of how much data, I don't have answer. I know that a 10Gbps pipe hits 4Gbps for sustained periods to our central storage from the cluster. I also know that I can totally overwhelm a 10G connected OSS which is currently I/O bound. My question really was twofold: 1) is anyone doing this successfully and 2) does anyone have an idea of how loudly my users will scream when their MPI jobs suddenly degrade. You've answered #1 and seem to believe that for #2, no one will notice. Thanks! Bill > regards, mark hahn. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From greg at keller.net Thu May 26 15:29:07 2011 From: greg at keller.net (Greg Keller) Date: Thu, 26 May 2011 14:29:07 -0500 Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: References: Message-ID: <4DDEAA03.2040206@keller.net> Date: Thu, 26 May 2011 12:18:18 -0400 (EDT) > From: Mark Hahn > Subject: Re: [Beowulf] Infiniband: MPI and I/O? > To: Bill Wichser > Cc: Beowulf Mailing List > Message-ID: > > Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed > >> Wondering if anyone out there is doing both I/O to storage as well as >> MPI over the same IB fabric. > I would say that is the norm. we certainly connect local storage > (Lustre) to nodes via the same fabric as MPI. gigabit is completely > inadequate for modern nodes, so the only alternatives would be 10G > or a secondary IB fabric, both quite expensive propositions, no? > > I suppose if your cluster does nothing but IO-light serial/EP jobs, > you might think differently. > Agreed. Just finished telling another vendor, "It's not high speed storage unless it has an IB/RDMA interface". They love that. Except for some really edge cases, I can't imagine running IO over GbE for anything more than trivial IO loads. I am Curious if anyone is doing IO over IB to SRP targets or some similar "Block Device" approach. The Integration into the filesystem by Lustre/GPFS and others may be the best way to go, but we are not 100% convinced yet. Any stories to share? Cheers! Greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
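To make the QoS discussion above concrete: with OpenSM as the subnet manager, separating MPI and Lustre traffic comes down to giving each class its own service level (SL), mapping SLs to virtual lanes, and weighting the VL arbitration. A minimal sketch follows; the option names are taken from a stock opensm.conf, but the SL/VL choices and weights are purely illustrative assumptions and should be checked against the Mellanox User's Guide for your OFED release.

    # /etc/opensm/opensm.conf (fragment) -- illustrative values only
    qos TRUE
    qos_max_vls 2
    qos_high_limit 255
    # VL arbitration: give VL0 (MPI) roughly 3x the weight of VL1 (Lustre)
    qos_vlarb_high 0:192,1:64
    qos_vlarb_low 0:64,1:32
    # SL0 -> VL0, SL1 -> VL1, everything else back to VL0
    qos_sl2vl 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0

    # each traffic class then has to be told which SL to use; for example
    # Open MPI's openib BTL exposes this as an MCA parameter (check your
    # MPI and Lustre/o2iblnd versions for the equivalent knob):
    mpirun --mca btl_openib_ib_service_level 0 ...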
From landman at scalableinformatics.com Thu May 26 15:35:35 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 26 May 2011 15:35:35 -0400 Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: <4DDEAA03.2040206@keller.net> References: <4DDEAA03.2040206@keller.net> Message-ID: <4DDEAB87.40203@scalableinformatics.com> On 05/26/2011 03:29 PM, Greg Keller wrote: > Agreed. Just finished telling another vendor, "It's not high speed > storage unless it has an IB/RDMA interface". They love that. Except Heh ... love it! > for some really edge cases, I can't imagine running IO over GbE for > anything more than trivial IO loads. Lots of our customers do, when they have a large legacy GbE network, and upgrading is expensive. We can have a very large fan in to our units, but IB (even SDR!) is really nice to move data over for storage. > I am Curious if anyone is doing IO over IB to SRP targets or some > similar "Block Device" approach. The Integration into the filesystem by Both block and file targets. SRPT on our units, and fronted by OSSes for Lustre and similar like things. Can do iSCSI as well (over IB using iSER, or over 10GbE ... works really nicely in either case). > Lustre/GPFS and others may be the best way to go, but we are not 100% > convinced yet. Any stories to share? If you do this with Lustre, make sure your OSSes are in HA pairs using pacemaker/ucarp, and use DRBD between backend units, or MD on the OSS to mirror the storage. Unfortunately IB doesn't virtualize well (last I checked), so these have to be physical OSSes. I presume something similar on GPFS. GlusterFS, PVFS2/OrangeFS, etc. go fine without the block devices, and Gluster does mirroring at the file level. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Thu May 26 16:13:07 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 26 May 2011 16:13:07 -0400 (EDT) Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: <4DDEA7F3.8050703@princeton.edu> References: <4DDE5312.9070901@princeton.edu> <4DDEA7F3.8050703@princeton.edu> Message-ID: >>> Wondering if anyone out there is doing both I/O to storage as well as >>> MPI over the same IB fabric. >>> >> >> I would say that is the norm. we certainly connect local storage (Lustre) >> to nodes via the same fabric as MPI. gigabit is completely >> inadequate for modern nodes, so the only alternatives would be 10G >> or a secondary IB fabric, both quite expensive propositions, no? >> >> I suppose if your cluster does nothing but IO-light serial/EP jobs, >> you might think differently. >> > Really? I'm surprised by that statement. Perhaps I'm just way behind on the > curve though. It is typical here to have local node storage, local > lustre/pvfs storage, local NFS storage, and global GPFS storage running over > the GigE network. sure, we use Gb as well, but only as a crutch, since it's so slow. or does each node have, say, a 4x bonded Gb for this traffic? 
or are we disagreeing on whether Gb is "slow"? 80-ish MB/s seems pretty slow to me, considering that's less than any single disk on the market... >> how much inter-chassis MPI do you do? how much IO do you do? >> IB has a small MTU, so I don't really see why mixed traffic would be a big >> problem. of course, IB also doesn't do all that wonderfully >> with hotspots. but isn't this mostly an empirical question you can >> answer by direct measurement? >> > How would I measure by direct measurement? I meant collecting the byte counters from nics and/or switches while real workloads are running. that tells you the actual data rates, and should show how close you are to creating hotspots. > My question really was twofold: 1) is anyone doing this successfully and 2) > does anyone have an idea of how loudly my users will scream when their MPI > jobs suddenly degrade. You've answered #1 and seem to believe that for #2, > no one will notice. we've always done it, though our main experience is with clusters that have full-bisection fabrics. our two more recent clusters have half-bisection fabrics, but I suspect that most users are not looking closely enough at performance to notice and/or complain. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Thu May 26 17:23:30 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 26 May 2011 17:23:30 -0400 (EDT) Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: <4DDEAA03.2040206@keller.net> References: <4DDEAA03.2040206@keller.net> Message-ID: > Agreed. Just finished telling another vendor, "It's not high speed > storage unless it has an IB/RDMA interface". They love that. Except what does RDMA have to do with anything? why would straight 10G ethernet not qualify? I suspect you're really saying that you want an efficient interface, as well as enough bandwidth, but that doesn't necessitate RDMA. > for some really edge cases, I can't imagine running IO over GbE for > anything more than trivial IO loads. well, it's a balance issue. if someone was using lots of Atom boards lashed into a cluster, 1Gb apiece might be pretty reasonable. but for fat nodes (let's say 48 cores), even 1 QDR IB pipe doesn't seem all that generous. as an interesting case in point, SeaMicro was in the news again with a 512 atom system: either 64 Gb links or 16 10G links. the former (.128 Gb/core) seems low even for atoms, but .3 Gb/core might be reasonable. > I am Curious if anyone is doing IO over IB to SRP targets or some > similar "Block Device" approach. The Integration into the filesystem by > Lustre/GPFS and others may be the best way to go, but we are not 100% > convinced yet. Any stories to share? you mean you _like_ block storage? how do you make a shared FS namespace out of it, manage locking, etc? regards, mark hahn. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
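On "direct measurement": the per-port traffic counters are easy to get at without any switch-side tooling. A rough sketch, assuming a Mellanox HCA exposing the usual sysfs counters (the device name mlx4_0 is just a placeholder); note that port_xmit_data/port_rcv_data count 4-byte words, and on older hardware they are 32-bit and wrap quickly at QDR rates, in which case perfquery -x and the extended 64-bit counters are the better source.

    #!/bin/bash
    # crude IB bandwidth sampler: prints average MB/s over each interval
    DEV=${1:-mlx4_0}; PORT=${2:-1}; INT=${3:-10}
    C=/sys/class/infiniband/$DEV/ports/$PORT/counters
    while true; do
        tx0=$(cat $C/port_xmit_data); rx0=$(cat $C/port_rcv_data)
        sleep "$INT"
        tx1=$(cat $C/port_xmit_data); rx1=$(cat $C/port_rcv_data)
        # counters tick in 4-byte units; a negative delta means a wrap
        echo "$(date +%T) tx $(( (tx1-tx0)*4/INT/1048576 )) MB/s rx $(( (rx1-rx0)*4/INT/1048576 )) MB/s"
    done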
From greg at keller.net Thu May 26 18:30:52 2011 From: greg at keller.net (Greg Keller) Date: Thu, 26 May 2011 17:30:52 -0500 Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: References: <4DDEAA03.2040206@keller.net> Message-ID: <4DDED49C.6070509@keller.net> On 5/26/2011 4:23 PM, Mark Hahn wrote: >> Agreed. Just finished telling another vendor, "It's not high speed >> storage unless it has an IB/RDMA interface". They love that. Except > > what does RDMA have to do with anything? why would straight 10G ethernet > not qualify? I suspect you're really saying that you want an efficient > interface, as well as enough bandwidth, but that doesn't necessitate > RDMA. > RDMA over IB is definitely a nice feature. Not required, but IP over IB has enough limits that we prefer to avoid it. >> for some really edge cases, I can't imagine running IO over GbE for >> anything more than trivial IO loads. > > well, it's a balance issue. if someone was using lots of Atom boards > lashed into a cluster, 1Gb apiece might be pretty reasonable. but for > fat nodes (let's say 48 cores), even 1 QDR IB pipe doesn't seem all > that generous. > > as an interesting case in point, SeaMicro was in the news again with a > 512 > atom system: either 64 Gb links or 16 10G links. the former (.128 > Gb/core) > seems low even for atoms, but .3 Gb/core might be reasonable. > agreed >> I am Curious if anyone is doing IO over IB to SRP targets or some >> similar "Block Device" approach. The Integration into the filesystem by >> Lustre/GPFS and others may be the best way to go, but we are not 100% >> convinced yet. Any stories to share? > > you mean you _like_ block storage? how do you make a shared FS namespace > out of it, manage locking, etc? Well, it's a use case issue for us. You don't make a shared FS on the block devices (well, maybe you could just not in a scalable way)... but we envision leasing block devices to customers with known capacity/performance capability. Then the customer can make the call if they want to use it for a CIFS/NFS backend, possibly even lashed together via MD, through a single server. They can also lease multiple block devices and create a lustre type system. The flexibility is if they disappear and come back they may not get the same compute/storage nodes, but they can attach any server to their dedicated block storage devices. There are also some multi-tenancy security options that can be more definitively handled if they have absolute control over a block device. So in this case, they would semi-permanently lease the block devices, and then fire up front end storage nodes and compute nodes on an "as needed / as available" basis anywhere in our compute farm. Effectively we get the benefits of a massive Fibre Channel type SAN over the IB infrastructure we have to every node. If we can get the performance and cost of the block storage right, it will be compelling for some of our customers. We are still prototyping how it would work and characterizing performance options... but it's interesting to us. Cheers! Greg > > regards, mark hahn. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
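For anyone curious what the block-device approach Greg describes looks like from the client side, a sketch is below. It assumes the ib_srp initiator and srptools are installed and that the leased targets show up as two LUNs, /dev/sdb and /dev/sdc; all device names, GUID strings and mount points here are placeholders, and running srp_daemon as a service is the more robust way to keep the logins alive.

    # one-shot SRP login: ibsrpdm prints target descriptors in the form
    # the kernel's add_target file expects
    ibsrpdm -c -d /dev/infiniband/umad0 | while read tgt; do
        echo "$tgt" > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target
    done
    # the LUNs appear as plain SCSI disks; mirror them on the client and
    # put a local filesystem (or Lustre OSTs, or an NFS/CIFS export) on top
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.xfs /dev/md0
    mount /dev/md0 /export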
From hahn at mcmaster.ca Fri May 27 23:59:42 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 27 May 2011 23:59:42 -0400 (EDT) Subject: [Beowulf] 512 atoms in a box Message-ID: I was thinking about the seamicro box - 512 atoms, 64 disks and either 64 Gb ports or 16 10G ports. it would be interesting to look at what the most appropriate "balance" is for mips/flops of cpu power compared to interconnect bandwidth. maybe the seamicro box is more intended to be a giant memcached server - that is, the question is memory bandwidth/capacity versus IC bandwidth. in any case, you have to ponder where the amazing value-add is - compactness? I'm not sure it competes all that well compared to 48 core-per-U conventional servers (whether mips/flops or memory-based). here's an idea, more commodity-oriented (hence beowulf): suppose you design a tiny widget that gets all its power via POE. maybe Atom or ARM-based - you've got 15-20W, which is quite a bit these days. for packaging, you need space for a cpu, nic and sodimm. maybe some leds. plug them into a commodity 1U 48-port Gb switch, then stack 10 of them and you've got a penny-pincher's approximation of a Seamicro SM100000! not going to win top500, but... _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From eugen at leitl.org Sat May 28 04:26:25 2011 From: eugen at leitl.org (Eugen Leitl) Date: Sat, 28 May 2011 10:26:25 +0200 Subject: [Beowulf] 512 atoms in a box In-Reply-To: References: Message-ID: <20110528082625.GB19622@leitl.org> On Fri, May 27, 2011 at 11:59:42PM -0400, Mark Hahn wrote: > here's an idea, more commodity-oriented (hence beowulf): suppose you > design a tiny widget that gets all its power via POE. maybe Atom or > ARM-based - you've got 15-20W, which is quite a bit these days. > for packaging, you need space for a cpu, nic and sodimm. maybe some leds. > > plug them into a commodity 1U 48-port Gb switch, then stack 10 of them > and you've got a penny-pincher's approximation of a Seamicro SM100000! > > not going to win top500, but... I was planning to do something similar with rooted Apple TV, once it's bumped up to A5 in the next generation. The devices would need spacers, a baffle and a few fans, if packed closely. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From award at uda.ad Sat May 28 06:08:50 2011 From: award at uda.ad (Alan Ward) Date: Sat, 28 May 2011 12:08:50 +0200 Subject: [Beowulf] RS: 512 atoms in a box References: Message-ID: <9CC6BA5ACC8E7C489A97B01C2DA791063D04BD@serpens.ua.ad> Just found this: http://www.raspberrypi.org/ The ARM11 does not pack much punch, there is no networking (though it should not be too difficult to add) and it is not even in production yet. But it does seem fun. 
Plus, $1000 would get you 40 units ... Cheers, -Alan -----Original Message----- From: beowulf-bounces at beowulf.org on behalf of Mark Hahn Sent: Sat 28/05/2011 05:59 To: Beowulf Mailing List Subject: [Beowulf] 512 atoms in a box I was thinking about the seamicro box - 512 atoms, 64 disks and either 64 Gb ports or 16 10G ports. it would be interesting to look at what the most appropriate "balance" is for mips/flops of cpu power compared to interconnect bandwidth. maybe the seamicro box is more intended to be a giant memcached server - that is, the question is memory bandwidth/capacity versus IC bandwidth. in any case, you have to ponder where the amazing value-add is - compactness? I'm not sure it competes all that well compared to 48 core-per-U conventional servers (whether mips/flops or memory-based). here's an idea, more commodity-oriented (hence beowulf): suppose you design a tiny widget that gets all its power via POE. maybe Atom or ARM-based - you've got 15-20W, which is quite a bit these days. for packaging, you need space for a cpu, nic and sodimm. maybe some leds. plug them into a commodity 1U 48-port Gb switch, then stack 10 of them and you've got a penny-pincher's approximation of a Seamicro SM100000! not going to win top500, but... _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.
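The back-of-the-envelope per-core bandwidth figures in this thread are easy to reproduce; they come out to roughly the .13 and .3 Gb/core quoted (bc is only used to keep the fractions readable):

    echo "scale=3; 64*1/512" | bc    # 64 x 1 GbE uplinks / 512 cores  = .125 Gb/core
    echo "scale=3; 16*10/512" | bc   # 16 x 10 GbE uplinks / 512 cores = .312 Gb/core
    echo "scale=3; 48*1/48" | bc     # PoE-widget sketch: one Gb port per widget = 1 Gb/widget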
From cbergstrom at pathscale.com Wed May 4 06:39:45 2011 From: cbergstrom at pathscale.com (=?ISO-8859-1?Q?Christopher_Bergstr=F6m?=) Date: Wed, 4 May 2011 17:39:45 +0700 Subject: [Beowulf] Chinese Chip Wins Energy-Efficiency Crown In-Reply-To: <20110504103346.GH23560@leitl.org> References: <20110504103346.GH23560@leitl.org> Message-ID: On Wed, May 4, 2011 at 5:33 PM, Eugen Leitl wrote: > > http://spectrum.ieee.org/semiconductors/processors/chinese-chip-wins-energyefficiency-crown?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+IeeeSpectrum+%28IEEE+Spectrum%29&utm_content=Bloglines > > Chinese Chip Wins Energy-Efficiency Crown > > Though slower than competitors, the energy-saving Godson-3B is destined > for the next Chinese supercomputer > > By Joseph Calamia / May 2011 > > The Dawning 6000 supercomputer, which Chinese researchers expect to unveil in > the third quarter of 2011, will have something quite different under its > hood. Unlike its forerunners, which employed American-born chips, this > machine will harness the country's homegrown high-end processor, the > Godson-3B. With a peak frequency of 1.05 gigahertz, the Godson is slower than > its competitors' wares, at least one of which operates at more than 5 GHz, > but the chip still turns heads with its record-breaking energy efficiency. It > can execute 128 billion floating-point operations per second using just 40 > watts - double or more the performance per watt of competitors. *cough* Wow.. they've brought SiCortex back to life... _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.
From prentice at ias.edu Wed May 4 09:50:50 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 04 May 2011 09:50:50 -0400 Subject: [Beowulf] Chinese Chip Wins Energy-Efficiency Crown In-Reply-To: References: <20110504103346.GH23560@leitl.org> Message-ID: <4DC159BA.8090005@ias.edu> On 05/04/2011 06:39 AM, Christopher Bergström wrote: > On Wed, May 4, 2011 at 5:33 PM, Eugen Leitl wrote: >> >> http://spectrum.ieee.org/semiconductors/processors/chinese-chip-wins-energyefficiency-crown?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+IeeeSpectrum+%28IEEE+Spectrum%29&utm_content=Bloglines >> >> Chinese Chip Wins Energy-Efficiency Crown >> >> Though slower than competitors, the energy-saving Godson-3B is destined >> for the next Chinese supercomputer >> >> By Joseph Calamia / May 2011 >> >> The Dawning 6000 supercomputer, which Chinese researchers expect to unveil in >> the third quarter of 2011, will have something quite different under its >> hood. Unlike its forerunners, which employed American-born chips, this >> machine will harness the country's homegrown high-end processor, the >> Godson-3B. With a peak frequency of 1.05 gigahertz, the Godson is slower than >> its competitors' wares, at least one of which operates at more than 5 GHz, >> but the chip still turns heads with its record-breaking energy efficiency. It >> can execute 128 billion floating-point operations per second using just 40 >> watts - double or more the performance per watt of competitors. > > *cough* > > Wow.. they've brought SiCortex back to life... Oh, snap! -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Tue May 10 01:37:33 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Mon, 9 May 2011 22:37:33 -0700 Subject: [Beowulf] How InfiniBand gained confusing bandwidth numbers Message-ID: <20110510053733.GB12826@bx9.net> http://dilbert.com/strips/comic/2011-05-10/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mdidomenico4 at gmail.com Wed May 11 20:57:57 2011 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed, 11 May 2011 20:57:57 -0400 Subject: [Beowulf] EPCC and DIR cluster Message-ID: Is there anyone on the list associated with EPCC or knows someone at EPCC? If so, I recently saw an article in Scientific Computing magazine, where there was a blurb about a smallish cluster built at EPCC utilizing Atom chips/GPUs and HDDs, whereby the design was more Amdahl-balanced for "data intensive research".
I can't seem to locate anything on the web about it, but I'm interested in the spec's/design for the machine and how it performs thanks _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From landman at scalableinformatics.com Fri May 20 00:35:25 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 20 May 2011 00:35:25 -0400 Subject: [Beowulf] Curious about ECC vs non-ECC in practice Message-ID: <4DD5EF8D.8070909@scalableinformatics.com> Hi folks Does anyone run a large-ish cluster without ECC ram? Or with ECC turned off at the motherboard level? I am curious if there are numbers of these, and what issues people encounter. I have some of my own data from smaller collections of systems, I am wondering about this for larger systems. Thanks! Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Fri May 20 01:45:01 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 19 May 2011 22:45:01 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <4DD5EF8D.8070909@scalableinformatics.com> References: <4DD5EF8D.8070909@scalableinformatics.com> Message-ID: <20110520054501.GE16676@bx9.net> On Fri, May 20, 2011 at 12:35:25AM -0400, Joe Landman wrote: > Does anyone run a large-ish cluster without ECC ram? Or with ECC > turned off at the motherboard level? I am curious if there are numbers > of these, and what issues people encounter. I have some of my own data > from smaller collections of systems, I am wondering about this for > larger systems. I don't think anyone's done the experiment with a 'larger system' since "Big Mac" had to replace all of their servers with ones that had ECC. Still, any cluster that can manipulate the BIOS appropriately could easily do the experiment. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From gmpc at sanger.ac.uk Fri May 20 04:58:59 2011 From: gmpc at sanger.ac.uk (Guy Coates) Date: Fri, 20 May 2011 09:58:59 +0100 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <20110520054501.GE16676@bx9.net> References: <4DD5EF8D.8070909@scalableinformatics.com> <20110520054501.GE16676@bx9.net> Message-ID: <4DD62D53.6030806@sanger.ac.uk> On 20/05/11 06:45, Greg Lindahl wrote: > On Fri, May 20, 2011 at 12:35:25AM -0400, Joe Landman wrote: > >> Does anyone run a large-ish cluster without ECC ram? Or with ECC >> turned off at the motherboard level? 
I am curious if there are numbers >> of these, and what issues people encounter. I have some of my own data >> from smaller collections of systems, I am wondering about this for >> larger systems. We did, circa 2003. Never again. When we were lucky, the uncorrected errors happened in memory in use by the kernel or application code, and we got hard machine crashes or code seg-faulting. Those were easy to spot. When we were unlucky, the errors happened in page cache, resulting in data being randomly transmuted. Most of the code we were running at the time did minimal input sanity checking. It was quite instructive to see just how much genomic analysis code would quite happily compute on DNA sequences that contained things other than ATGC. The duff runs would eventually get picked up by the various sanity-checks that happened at the end of our analysis pipelines, but it involved quite a bit of developer & sysadmin effort to track down and re-run all of the possibly affected jobs. Cheers, Guy -- Dr. Guy Coates, Informatics Systems Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From a.travis at abdn.ac.uk Fri May 20 11:35:45 2011 From: a.travis at abdn.ac.uk (Tony Travis) Date: Fri, 20 May 2011 16:35:45 +0100 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <4DD5EF8D.8070909@scalableinformatics.com> References: <4DD5EF8D.8070909@scalableinformatics.com> Message-ID: <4DD68A51.70605@abdn.ac.uk> On 20/05/11 05:35, Joe Landman wrote: > Hi folks > > Does anyone run a large-ish cluster without ECC ram? Or with ECC > turned off at the motherboard level? I am curious if there are numbers > of these, and what issues people encounter. I have some of my own data > from smaller collections of systems, I am wondering about this for > larger systems. Hi, Joe. I ran a small cluster of ~100 32-bit nodes witn non-ECC memory and it was a nightmare, as Guy described in his email, until I pre-emptively tested the memory in user-space, using Chlarles Cazabon's "memtester": http://pyropus.ca/software/memtester Prior to this, *all* the RAM had passed Memtest86+. I had a strict policy that if a system crashed, for any reason, it was re-tested with Memtest86+, then 100 passes of "memtester" before being allowed to re-join the Beowulf cluster. This made the Beowulf much more stable running openMosix. However, I've scrapped all our non-ECC nodes now because the real worry is not knowing if an error has occurred... Apparently this is still a big issue for computers in space, using non-ECC RAM for solid-state storage on grounds of cost for imaging. They, apparently, use RAM background SoftECC 'scrubbers' like this: http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf Bye, Tony. -- Dr. 
A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Fri May 20 11:52:43 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 20 May 2011 08:52:43 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <4DD68A51.70605@abdn.ac.uk> Message-ID: On 5/20/11 8:35 AM, "Tony Travis" wrote: >On 20/05/11 05:35, Joe Landman wrote: >> Hi folks >> >> Does anyone run a large-ish cluster without ECC ram? Or with ECC >> turned off at the motherboard level? I am curious if there are numbers >> of these, and what issues people encounter. I have some of my own data >> from smaller collections of systems, I am wondering about this for >> larger systems. > >Hi, Joe. > >Apparently this is still a big issue for computers in space, using >non-ECC RAM for solid-state storage on grounds of cost for imaging. >They, apparently, use RAM background SoftECC 'scrubbers' like this: > >http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng >.pdf > > Yes, it's a big tradeoff in the space world. Not only does ECC require extra memory, but the EDAC logic consumes power and, typically, slows down the bus speed (I.e. You need an extra bus cycle to handle the EDAC logic propagation delay). There's also a practical detail that the upset rate might be low enough that it is ok to just tolerate the upsets, because they'll get handled at some higher level of the process. For instance, if you have a RAM buffer in a communication link handling the FEC coded bits, then there's not much difference between a bit flip in RAM and a bit error on the comm link, so you might as well just let the comm FEC code take care of the bit errors. We tend to use a lot of checksum strategies. Rather than an EDAC strategy which corrects errors, it's good enough to just know that an error occurred, and retry. This is particularly effective on Flash memory, which has transient read errors: read it again and it works ok. Another example is doing an FFT. There are some strategies which allow you to do a second fast computation that essentially provides a "check" on the results of the FFT (e.g. The mean of the input data should match the "DC term" in the FFT) We might also keep triple copies of key variables. You read all three values and compare them before starting the computation. Software Triple Redundancy, as it were. A lot of times, the probability of an error occurring "during" the computation is sufficiently low, compared to the probability of an error occurring during the very long waiting time between operating on the data. There's also the whole question of whether EDAC main memory buys you much, when all the (ever larger) cache isn't protected. Again, it comes down to a probability analysis. My own personal theory on this is that you are much more likely to have a miscalculation due to a software bug than due to an upset. 
Further, it's impossible to get all the bugs out in finite time/money, so you might as well design your whole system to be fault tolerant, not in a "oh my gosh, we had an error, let's do extensive fault recovery", but a "we assume the computations are always a bit wonky, so we factor that into our design". That is, design so that retries and self checks are just part of the overhead. Kind of like how a decent experiment or engineering design takes into account measurement uncertainty stack-up. As hardware gets smaller and faster and lower power, the "cost" to provide extra computational resources to implement a strategy like this gets smaller, relative to the ever increasing human labor cost to try and make it perfect. (and, of course, this *is* how humans actually do stuff.. You don't precompute all of your control inputs to the car.. You basically set a general goal, and continuously adjust to drive towards that goal.) Jim Lux > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From deadline at eadline.org Fri May 20 12:35:26 2011 From: deadline at eadline.org (Douglas Eadline) Date: Fri, 20 May 2011 12:35:26 -0400 (EDT) Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <4DD5EF8D.8070909@scalableinformatics.com> References: <4DD5EF8D.8070909@scalableinformatics.com> Message-ID: <50574.192.168.93.213.1305909326.squirrel@mail.eadline.org> Joe While this is somewhat anecdotal, it may be helpful. Not a large-ish cluster, but as you may guess, I wondered about this for Limulus (http://limulus.basement-supercomputing.com) I wrote a script (will post it if anyone interested) that runs memtester until you stop it or it finds a error. I ran it on several Core2 Duo systems with Kingston DDR2-800 PC2-6400 memory. As I recall, I ran it on 2-3 systems, only one showed an error. I stopped the others after about three weeks. Here is an example of the script output when it fails (it logs the memtest output). There was an error, inspect memtest-1178 Start Date was: Mon Apr 20 16:04:35 EDT 2009 Failure Date was: Fri May 8 17:55:43 EDT 2009 Test ran 1178 times failing after 1561868 Seconds (26031 Minutes or 433 Hours or 18 Days) My experience in running small clusters without ECC has been very good. IMO it is also a question of the quality of the memory vendor. I never had an issue when running tests and benchmarks, which I do quite a bit on new hardware e.g. goo.gl/YoBaz -- Doug > Hi folks > > Does anyone run a large-ish cluster without ECC ram? Or with ECC > turned off at the motherboard level? I am curious if there are numbers > of these, and what issues people encounter. I have some of my own data > from smaller collections of systems, I am wondering about this for > larger systems. > > Thanks! > > Joe > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics, Inc. 
> email: landman at scalableinformatics.com > web : http://scalableinformatics.com > http://scalableinformatics.com/sicluster > phone: +1 734 786 8423 x121 > fax : +1 866 888 3112 > cell : +1 734 612 4615 > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From james.p.lux at jpl.nasa.gov Fri May 20 13:21:12 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 20 May 2011 10:21:12 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <50574.192.168.93.213.1305909326.squirrel@mail.eadline.org> Message-ID: On 5/20/11 9:35 AM, "Douglas Eadline" wrote: >Joe > >While this is somewhat anecdotal, it may be helpful. > >Not a large-ish cluster, but as you may guess, I wondered >about this for Limulus >(http://limulus.basement-supercomputing.com) > >I wrote a script (will post it if anyone interested) >that runs memtester until you stop it or it finds >a error. I ran it on several Core2 Duo systems >with Kingston DDR2-800 PC2-6400 memory. > >My experience in running small clusters >without ECC has been very good. IMO it is also >a question of the quality of the memory vendor. >I never had an issue when running tests and >benchmarks, which I do quite a bit on new >hardware e.g. I'm going to guess that it's highly idiosyncratic. The timing margins on all the signals between CPU, memory, and perhipherals are tight, they're temperature dependent and process dependent, so you could have the exact same design with very similar RAM and one will get errors and the other won't. Folks who design PCI bus interfaces for a living earn their pay, especially if they have to make it work with lots of different mfrs: just because all the parts meet their databook specs doesn't mean that the system will play nice together. Consider that for memory, you have 64 odd data lines and 20 or so address lines and some strobes that ALL have to switch together. A data sensitive pattern where a bunch of lines move at the same time, and induce a bit of a voltage into an adjacent trace, which is a bit slower or faster than the rest, and you've got the makings of a challenging hunt for the problem. PC board trace lengths all have to be carefully matched, loads have to be carefully matched, etc. 66 Mhz -> 15 ns, but modern DDR rams do batches of words separated by a few ns. 1 cm is about 10-15 cm of tracelength, but it's the loading, terminations, and other stuff that causes a problem. Hang a 1 pf capacitor off that 100 ohm line, and there's a tenth of a ns time constant right there. You could also have EMI/EMC issues that cause problems. That same ragged edge timing margin might be fine with 8 tower cases sitting on a shelf, but not so good with the exact same mobo and memory stacked into 1-2U cases in a 19" rack. Power cords and ethernet cables also carry EMI around. 
In a large cluster these things will all be aggravated: you've got more machines running, so you increase the error probability right there. You've got more electrical noise on the power carried between machines. You've typically got denser packaging. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Fri May 20 14:26:31 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 20 May 2011 14:26:31 -0400 (EDT) Subject: [Beowulf] Execution time measurements - clarification Message-ID: From: Mikhail Kuzminsky Subject: [Beowulf] Execution time measurements - clarification Dear Mark, could you pls forward my message to beowulf at beowulf.org (because my messages as before can't be delivered to maillist) ? It's clarification of my previous question here. Mikhail --------------------- I have strange execution time measurements for CPU-bound jobs (to be exact, Gaussian-09 DFT frequency calculations). Results are strange for *SEQENTIAL* calculations ! Executions were performed on dual socket Opteron 2350 (Quad core) server worked under Open SuSE Linux 10.3. When I run 2 identical examples of the same batch job simultaneously, execution time of *each* job is LOWER than for single job run ! I thought that it may be wrong "own for G09" times, so I checked their via time: (time g09 *pair1.com) >&testt1 & etc But this confirms strange results: For pair of simultaneously running sequential jobs 88801.141u 52.475s 24:40:57.58 99.9% 0+0k 0+0io 1221pf+0w 88901.996u 13.472s 24:41:53.97 100.0% 0+0k 0+0io 0pf+0w For run of 1 example of the same sequential job 100365.236u 27.297s 27:53:13.53 99.9% 0+0k 0+0io 1pf+0w Is there any ideas why this situation might be ? Mikhail Kuzminsky _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Fri May 20 20:29:10 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 20 May 2011 17:29:10 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: References: <4DD68A51.70605@abdn.ac.uk> Message-ID: <20110521002910.GB14350@bx9.net> On Fri, May 20, 2011 at 08:52:43AM -0700, Lux, Jim (337C) wrote: > As hardware gets smaller and faster and lower power, the "cost" to provide > extra computational resources to implement a strategy like this gets > smaller, relative to the ever increasing human labor cost to try and make > it perfect. The cost is teaching users to add checks to their codes, and to any off-the-shelf codes they start using. In hyrodynamics (cfd), often you have quantities which are explicitly conserved by the equations, and others which are conserved by physics but not by the particular numerical method you're using. The latter were quite handy for finding bugs. I managed to discover several numerical accuracy bugs in pre-release versions of the PathScale compilers that way. "Yes, it's a bug if the 12th decimal place changes." 
-- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Fri May 20 20:32:27 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 20 May 2011 17:32:27 -0700 Subject: [Beowulf] Execution time measurements - clarification Message-ID: <20110521003227.GD14350@bx9.net> On Fri, May 20, 2011 at 02:26:31PM -0400, Mark Hahn forwarded a message: > When I run 2 identical examples of the same batch job simultaneously, execution time of *each* job is > LOWER than for single job run ! I'd try locking these sequential jobs to a single core, you can get quite weird effects when you don't. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mathog at caltech.edu Mon May 23 12:40:13 2011 From: mathog at caltech.edu (David Mathog) Date: Mon, 23 May 2011 09:40:13 -0700 Subject: [Beowulf] Execution time measurements - clarification Message-ID: > On Fri, May 20, 2011 at 02:26:31PM -0400, Mark Hahn forwarded a message: > > > When I run 2 identical examples of the same batch job simultaneously, execution time of *each* job is > > LOWER than for single job run ! Disk caching could cause that. Normally if the data read in isn't too big you see an effect where: run 1: 30 sec <-- 50% disk IO/ 50% CPU run 2: 15 sec <-- ~100% CPU where the first run loaded the data into disk cache, and the second run read it there, saving a lot of real disk IO. Under some very peculiar conditions, on a multicore system, if run 1 and 2 are "simultaneous" they could seesaw back and forth for the "lead", so they end up taking turns doing the actual disk IO, with the total run time for each ending up between the times for the two runs above. Note that they wouldn't have to be started at exactly the same time for this to happen, because the job that starts second is going to be reading from cache, so it will tend to catch up to the job that started first. Once they are close then noise in the scheduling algorithms could cause the second to pass the first. (If it didn't pass, then this couldn't happen, because the second would always be waiting for the first to pull data in from disk.) Of course, you also need to be sure that run 1 isn't interfering with run 2. They might, for instance, save/retrieve intermediate values to the same filename, so that they really cannot be run safely at the same time. That is, they run faster together, but they run incorrectly. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
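Two cheap controls for the timing puzzle in this thread, combining Greg's core-pinning suggestion with the page-cache question above. Paths, core numbers and input names are placeholders in the style used elsewhere in the thread; drop_caches needs root and exists on 2.6.16 and later kernels, so it is available on the 2.6.22 kernel mentioned below.

    # pin each sequential run to one core and its local memory node, so the
    # scheduler cannot migrate it mid-run
    /usr/bin/time numactl --physcpubind=0 --membind=0 g09 job1.com > job1.log 2>&1 &
    /usr/bin/time numactl --physcpubind=4 --membind=1 g09 job2.com > job2.log 2>&1 &

    # and to rule the page cache in or out, flush it before a timed single run
    sync && echo 3 > /proc/sys/vm/drop_caches   # root only
    /usr/bin/time g09 job1.com > job1.single.log 2>&1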
From mathog at caltech.edu Mon May 23 15:32:33 2011 From: mathog at caltech.edu (David Mathog) Date: Mon, 23 May 2011 12:32:33 -0700 Subject: [Beowulf] Execution time measurements Message-ID: Mikhail Kuzminsky sent this to me and asked that it be posted: BEGIN FORWARD On Mon, 23 May 2011 09:40:13 -0700, "David Mathog" wrote: > > On Fri, May 20, 2011 at 02:26:31PM -0400, Mark Hahn forwarded a message: > > > When I run 2 identical examples of the same batch job > simultaneously, execution time of *each* job is > > > LOWER than for single job run ! > > Disk caching could cause that. Normally if the data read in isn't too > big you see an effect where: > > run 1: 30 sec <-- 50% disk IO/ 50% CPU > run 2: 15 sec <-- ~100% CPU I believe that jobs are CPU-bound: top says that they use 100% of CPU, and no swap activity. iostat /dev/sda3 (where IO is performed) says typically something like:
Linux 2.6.22.5-31-default (c6ws1)   05/25/2011
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.12    0.00    0.03    0.01    0.00   98.84
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda3              0.01         0.01         8.47      20720   16845881
> > Of course, you also need to be sure that run 1 isn't interfering with > run 2. They might, for instance, save/retrieve intermediate values > to the same filename, so that they really cannot be run safely at the > same time. That is, they run faster together, but they run incorrectly. File names used for IO are unique. I thought also about cpus frequency variations, but I think that null output of lsmod|grep freq is enough for fixed CPU frequency. END FORWARD OK, so not disk caching. Regarding the frequencies, better to use cat /proc/cpuinfo | grep MHz while the processes are running. Did you verify that the results for each of the two simultaneous runs are both correct? Ideally, tweak some parameter so they are slightly different from each other. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mathog at caltech.edu Tue May 24 11:41:32 2011 From: mathog at caltech.edu (David Mathog) Date: Tue, 24 May 2011 08:41:32 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice Message-ID: Joe Landman wrote: > I am wondering about this for larger systems. Your post makes me wonder about ECC in much smaller systems, like dedicated single computers controlling machinery or medical devices. Some really nasty things could result from "move cutting head in X (int32 value) mm" after the most significant bit in the int32 value has flipped. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.
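The cutting-head example is easy to put numbers on. Treating the move distance as a signed 32-bit integer, one flipped high bit turns a 10 mm move into a full-scale negative one (bash arithmetic is 64-bit, so the third line just re-interprets the result as int32):

    x=10                                  # commanded move, mm
    flip=$(( x ^ (1 << 31) ))             # flip the most significant bit
    (( flip > 2147483647 )) && flip=$(( flip - 4294967296 ))
    echo "$x mm becomes $flip mm"         # -> 10 mm becomes -2147483638 mm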
From landman at scalableinformatics.com Tue May 24 11:44:15 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 24 May 2011 11:44:15 -0400 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: References: Message-ID: <4DDBD24F.7080608@scalableinformatics.com> On 05/24/2011 11:41 AM, David Mathog wrote: > Joe Landman wrote: > >> I am wondering about this for larger systems. > > Your post makes me wonder about ECC in much smaller systems, like > dedicated single computers controlling machinery or medical devices. > Some really nasty things could result from "move cutting head in X > (int32 value) mm" after the most significant bit in the int32 value has > flipped. Some bits are more important than others ... Basically I was looking for anecdotal evidence that this is a "bad thing" (TM). I have it now, and it helped me make the case I needed to make. Thanks to everyone for this, it was really helpful! -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Tue May 24 13:06:15 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 24 May 2011 10:06:15 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: References: Message-ID: This *is* a big problem. I suggest reading some of what Nancy Leveson has written. http://sunnyday.mit.edu/ "Professor Leveson started a new area of research, software safety, which is concerned with the problems of building software for real-time systems where failures can result in loss of life or property." Two popular papers you might find interesting and fun to read: "High-Pressure Steam Engines and Computer Software" (Postscript) or (PDF). This paper started as a keynote address at the International Conference on Software Engineering in Melbourne, Australia) and later was published in IEEE Software, October 1994. "The Therac-25 Accidents" (Postscript ) or (PDF). This paper is an updated version of the original IEEE Computer (July 1993) article. It also appears in the appendix of my book. There is a generic problem with complex systems, as well. "Normal Accidents" by Charles Perrow is a good work (if a bit frightening in some ways... not in a senseless fear-mongering way, but because he lays out the fundamental reasons why these things are inevitable) Marais, Dulac, and Leveson argue that the world isn't as bad as Perrow says, though. http://esd.mit.edu/symposium/pdfs/papers/marais-b.pdf Jim Lux +1(818)354-2075 > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of David Mathog > Sent: Tuesday, May 24, 2011 8:42 AM > To: beowulf at beowulf.org > Subject: Re: [Beowulf] Curious about ECC vs non-ECC in practice > > Joe Landman wrote: > > > I am wondering about this for larger systems. > > Your post makes me wonder about ECC in much smaller systems, like > dedicated single computers controlling machinery or medical devices. 
> Some really nasty things could result from "move cutting head in X
> (int32 value) mm" after the most significant bit in the int32 value has
> flipped.
>
> Regards,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.

From mathog at caltech.edu Tue May 24 14:27:23 2011
From: mathog at caltech.edu (David Mathog)
Date: Tue, 24 May 2011 11:27:23 -0700
Subject: [Beowulf] Execution time measurements
Message-ID:

Another message from Mikhail Kuzminsky, who for some reason or other cannot currently post directly to the list:

BEGIN FORWARD

First of all, I should mention that the effect is observed only on the Opteron 2350/OpenSuSE 10.3. Execution of the same job with the same binaries on a Nehalem E5520/OpenSuSE 11.1 gives the same time for 1 and 2 simultaneously running jobs.

On Mon, 23 May 2011 12:32:33 -0700, "David Mathog" wrote:
> On Mon, 23 May 2011 09:40:13 -0700, "David Mathog" wrote:
> > > On Fri, May 20, 2011 at 02:26:31PM -0400, Mark Hahn forwarded a message:
> > > > When I run 2 identical examples of the same batch job
> > > > simultaneously, execution time of *each* job is
> > > > LOWER than for single job run !
>
> I thought also about cpus frequency variations, but I think that null output
> of
>   lsmod|grep freq
> is enough for fixed CPU frequency.
>
> END FORWARD
>
> Regarding the frequencies, better to use
>   cat /proc/cpuinfo | grep MHz

I looked at cpuinfo, but only manually, a few times (i.e. I didn't run any script periodically checking the CPU frequencies). All the core frequencies were fixed.

> Did you verify that the results for each of the two simultaneous runs
> are both correct?

Yes, the results are the same. I also looked at the number of iterations, etc., but I'll check the outputs again.

> Ideally, tweak some parameter so they are slightly
> different from each other.

But I don't understand - if I change some of the input parameters slightly, what would that show?

> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech

Fri, 20 May 2011 20:11:15 -0400 message from Serguei Patchkovskii:
> Suse 10.3 is quite old; it uses a kernel which is less than perfect at scheduling jobs and allocating resources for
> NUMA systems. Try running your test job using:
>
> numactl --cpunodebind=0 --membind=0 g98

numactl with everything bound to node 1 gives a "big" execution time (1 day 4 hours; 2 simultaneous jobs run faster), and forcing different nodes for CPU and memory makes the execution time even higher (+1 h). Therefore the effect observed doesn't look like a result of NUMA allocation :-(

Mikhail

END FORWARD

My point about the two different parameter sets on the jobs was to determine if the two were truly independent, or if they might not be interacting with each other through checkpoint files or shared memory, or the like.
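One way to check for that kind of hidden coupling without changing the input decks is to watch which files each copy actually opens for writing. A rough sketch, assuming strace is installed; ./myjob again stands in for the real binary:

  # record the files each copy opens (with -f to follow any child processes)
  strace -f -e trace=open,openat,creat -o job_a.trace ./myjob input1 &
  strace -f -e trace=open,openat,creat -o job_b.trace ./myjob input2 &
  wait

  # paths opened with write access by both jobs are the suspects
  grep 'O_WRONLY\|O_RDWR' job_a.trace | grep -o '"[^"]*"' | sort -u > a.files
  grep 'O_WRONLY\|O_RDWR' job_b.trace | grep -o '"[^"]*"' | sort -u > b.files
  comm -12 a.files b.files

For the shared-memory possibility, ipcs -m lists any System V segments in use. Note that strace adds overhead of its own, so timings from a traced run should not be compared with the earlier numbers; this is only a test of independence.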
Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mathog at caltech.edu Tue May 24 14:37:30 2011 From: mathog at caltech.edu (David Mathog) Date: Tue, 24 May 2011 11:37:30 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice Message-ID: Jim Lux posted: > "The Therac-25 Accidents" (Postscript ) or (PDF). This paper is an updated version of the original IEEE Computer (July 1993) article. It also appears in the appendix of my book. Well that was really horrible. Are car computers ECC? When all they did was engine management a memory glitch wouldn't have been too terrible, but now that some of them control automatic parking and other "higher" functions, and with around 100M units in circulation just in the USA, if they aren't ECC then memory glitches in running vehicles would have to be happening every day. David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Tue May 24 15:07:10 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 24 May 2011 12:07:10 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: References: Message-ID: > -----Original Message----- > From: David Mathog [mailto:mathog at caltech.edu] > Sent: Tuesday, May 24, 2011 11:38 AM > To: Lux, Jim (337C); beowulf at beowulf.org > Subject: RE: [Beowulf] Curious about ECC vs non-ECC in practice > > Jim Lux posted: > > > "The Therac-25 Accidents" (Postscript ) or (PDF). This paper is an > updated version of the original IEEE Computer (July 1993) article. It > also appears in the appendix of my book. > > Well that was really horrible. > > Are car computers ECC? When all they did was engine management a memory > glitch wouldn't have been too terrible, but now that some of them > control automatic parking and other "higher" functions, and with around > 100M units in circulation just in the USA, if they aren't ECC then > memory glitches in running vehicles would have to be happening every day. Car controllers tend to have mask ROM for their software which is pretty upset immune. The "PROM" (which today might be flash or EEPROM) holds all the coefficients for things like the fuel injection/timing, but doesn't hold the code for controlling, say, the ABS. I would imagine (but do not know) that they do things similar to what we do in spacecraft controllers: store critical data multiple times, lots of self checks on algorithm operation, etc. The report on the Toyota Throttle controller said this: "The Main and Sub-CPUs use two types of memory: non-volatile ROM for software code and volatile Static Ram (SRAM). The SRAM is protected by a single error detect and correct and a double error detect hardware function performed by error detection and correction (EDAC) logic." 
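Back on the cluster-hardware side of the same ECC question: Linux exposes the memory controller's error-correction activity through the EDAC subsystem, so one cheap way to see whether a node is actually catching correctable errors is to read those counters. A sketch only; it assumes an EDAC driver for the chipset is loaded, which will not be the case on boards without ECC support:

  # correctable (fixed) and uncorrectable error counts per memory controller
  grep . /sys/devices/system/edac/mc/mc*/ce_count
  grep . /sys/devices/system/edac/mc/mc*/ue_count

  # per-row breakdown, useful for spotting a single failing DIMM
  grep . /sys/devices/system/edac/mc/mc*/csrow*/ce_count

A steadily climbing ce_count is exactly the case where ECC is silently saving the day; on a non-ECC board the same events would simply be invisible bit flips.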
There's a whole reliability of software community out there with everything from certifiable processes to coding standards (MISRA) designed to make it easy to inspect and verify that the code is doing what you think, and that it handles off-nominal cases. I haven't read the whole report, but there was an analysis of the software in the Toyota controllers recently. http://www.nhtsa.gov/staticfiles/nvs/pdf/NASA-UA_report.pdf "The NESC team examined the software code (more than 280,000 lines) for paths that might initiate such a UA, but none were identified" (UA-Unintended Acceleration) The team examined the VOQ vehicles for signs of electrical faults, and subjected these vehicles to electro-magnetic interference (EMI) radiated and conducted test levels significantly above certification levels. The EMI testing did not produce any UAs, but in some cases caused the engine to slow and/or stall. (That's probably closest to what you'd see from a memory upset) Section 6.5, page 64 of the report, is "System Fail-Safe Architecture" It's pretty sophisticated, with multiple parallel schemes to prevent runaway or failure. I'm impressed at the level of thought they gave to not just shutting down the engine, but in leaving an adequate limp-home capability when one or more parts in the chain fails (e.g. if the throttle plate actually sticks, it can control the engine by turning on and off the fuel injectors). There's also an independent mechanism that detects if the pedal isn't pressed (or the redundant pedal position sensors have failed), in which case the engine cannot exceed 2500RPM, if it does, the fuel turns off, and then turns back on when the speed drops below 1100RPM And, since we Beowulfers are for the most part software weenies.. The ECM for the 2005 Camry uses a NEC V850 E1 processor. The software is in ANSI C, and compiled with Greenhills compiler. There are 256kSLOC of non-comments (along with 241kSLOC of comments) in .c files and another 40kSLOC (noncomment) in various .h files. They ran it through Coverity and CodeSonar (both of which we use at JPL), as well as SPIN (using SWARM to run it on a cluster.. now how about that) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From bill at Princeton.EDU Thu May 26 09:18:10 2011 From: bill at Princeton.EDU (Bill Wichser) Date: Thu, 26 May 2011 09:18:10 -0400 Subject: [Beowulf] Infiniband: MPI and I/O? Message-ID: <4DDE5312.9070901@princeton.edu> Wondering if anyone out there is doing both I/O to storage as well as MPI over the same IB fabric. Following along in the Mellanox User's Guide, I see a section on how to implement the QOS for both MPI and my lustre storage. I am curious though as to what might happen to the performance of the MPI traffic when high I/O loads are placed on the storage. In our current implementation, we are using blades which are 50% blocking (2:1 oversubscribed) when moving from a 16 blade chassis to other nodes. Would trying to do storage on top dictate moving to a totally non-blocking fabric? 
Thanks, Bill _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Thu May 26 12:18:18 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 26 May 2011 12:18:18 -0400 (EDT) Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: <4DDE5312.9070901@princeton.edu> References: <4DDE5312.9070901@princeton.edu> Message-ID: > Wondering if anyone out there is doing both I/O to storage as well as > MPI over the same IB fabric. I would say that is the norm. we certainly connect local storage (Lustre) to nodes via the same fabric as MPI. gigabit is completely inadequate for modern nodes, so the only alternatives would be 10G or a secondary IB fabric, both quite expensive propositions, no? I suppose if your cluster does nothing but IO-light serial/EP jobs, you might think differently. > Following along in the Mellanox User's > Guide, I see a section on how to implement the QOS for both MPI and my > lustre storage. I am curious though as to what might happen to the > performance of the MPI traffic when high I/O loads are placed on the > storage. to me, the real question is whether your IB fabric is reasonably close to full-bisection (and/or whether your storage nodes are sensibly placed, topologically.) > In our current implementation, we are using blades which are 50% > blocking (2:1 oversubscribed) when moving from a 16 blade chassis to > other nodes. Would trying to do storage on top dictate moving to a > totally non-blocking fabric? how much inter-chassis MPI do you do? how much IO do you do? IB has a small MTU, so I don't really see why mixed traffic would be a big problem. of course, IB also doesn't do all that wonderfully with hotspots. but isn't this mostly an empirical question you can answer by direct measurement? regards, mark hahn. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From Shainer at Mellanox.com Thu May 26 12:50:17 2011 From: Shainer at Mellanox.com (Gilad Shainer) Date: Thu, 26 May 2011 09:50:17 -0700 Subject: [Beowulf] Infiniband: MPI and I/O? References: <4DDE5312.9070901@princeton.edu> Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F03ACAD7B@mtiexch01.mti.com> > Wondering if anyone out there is doing both I/O to storage as well as > MPI over the same IB fabric. Following along in the Mellanox User's > Guide, I see a section on how to implement the QOS for both MPI and my > lustre storage. I am curious though as to what might happen to the > performance of the MPI traffic when high I/O loads are placed on the > storage. I am doing it in my lab -have build my own Lustre solution and am running it on the same network as the MPI jobs. At the end it all depends on how much bandwidth do you need for the MPI and the storage, and if you can cover both, you can do it. Today the QoS solution for IB is out there, and you can set max BW and min latency as parameters for the different traffics. 
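As a concrete illustration of steering MPI and storage onto different classes of service: with Open MPI's openib BTL the service level used for MPI traffic can be chosen at run time, and the subnet manager's QoS policy can then map that SL to its own virtual lane and bandwidth share. This is a sketch only - the SL value 1 is arbitrary, and the parameter name should be verified against the local Open MPI build with ompi_info:

  # confirm this Open MPI installation exposes the parameter
  ompi_info --param btl openib | grep service_level

  # run the MPI job on a non-default InfiniBand service level
  mpirun -np 64 --mca btl openib,self,sm \
         --mca btl_openib_ib_service_level 1 ./mpi_app

The Lustre/o2ib side and the opensm qos-policy file that ties SLs to virtual-lane arbitration are what the Mellanox guide Bill mentions covers; the point here is only that the MPI side amounts to a one-line change.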
> In our current implementation, we are using blades which are 50% > blocking (2:1 oversubscribed) when moving from a 16 blade chassis to > other nodes. Would trying to do storage on top dictate moving to a > totally non-blocking fabric? IB congestion control is being released now (finally), so this can help you here. Gilad _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From bill at Princeton.EDU Thu May 26 15:20:19 2011 From: bill at Princeton.EDU (Bill Wichser) Date: Thu, 26 May 2011 15:20:19 -0400 Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: References: <4DDE5312.9070901@princeton.edu> Message-ID: <4DDEA7F3.8050703@princeton.edu> Mark Hahn wrote: >> Wondering if anyone out there is doing both I/O to storage as well as >> MPI over the same IB fabric. >> > > I would say that is the norm. we certainly connect local storage > (Lustre) to nodes via the same fabric as MPI. gigabit is completely > inadequate for modern nodes, so the only alternatives would be 10G > or a secondary IB fabric, both quite expensive propositions, no? > > I suppose if your cluster does nothing but IO-light serial/EP jobs, > you might think differently. > Really? I'm surprised by that statement. Perhaps I'm just way behind on the curve though. It is typical here to have local node storage, local lustre/pvfs storage, local NFS storage, and global GPFS storage running over the GigE network. Depending on I/O loads users can make use of the storage at the right layer. Yes, users fill the 1Gbps pipe to the storage per node. But as we now implement all new clusters with IB I'm hoping to increase that bandwidth even more. If you and everyone else is doing this already, that's a good sign! Lol! As we move closer to making this happen, perhaps there will be plenty of answers then for any QOS setup questions I may have. > >> Following along in the Mellanox User's >> Guide, I see a section on how to implement the QOS for both MPI and my >> lustre storage. I am curious though as to what might happen to the >> performance of the MPI traffic when high I/O loads are placed on the >> storage. >> > > to me, the real question is whether your IB fabric is reasonably close > to full-bisection (and/or whether your storage nodes are sensibly placed, > topologically.) > > >> In our current implementation, we are using blades which are 50% >> blocking (2:1 oversubscribed) when moving from a 16 blade chassis to >> other nodes. Would trying to do storage on top dictate moving to a >> totally non-blocking fabric? >> > > how much inter-chassis MPI do you do? how much IO do you do? > IB has a small MTU, so I don't really see why mixed traffic would > be a big problem. of course, IB also doesn't do all that wonderfully > with hotspots. but isn't this mostly an empirical question you can > answer by direct measurement? > How would I measure by direct measurement? I don't have the switching infrastructure to compare a 2:1 versus a 1:1 unless you're talking about inside a chassis. But since my storage would connect into the switching infrastructure how and what would I compare? Jobs are not scheduled to run on a single chassis, or at least they try to but are not placed on hold for more than 10 minutes waiting. 
So there are lots of wide jobs running between chassis. Some don't even fit on a chassis. As for the question of how much data, I don't have answer. I know that a 10Gbps pipe hits 4Gbps for sustained periods to our central storage from the cluster. I also know that I can totally overwhelm a 10G connected OSS which is currently I/O bound. My question really was twofold: 1) is anyone doing this successfully and 2) does anyone have an idea of how loudly my users will scream when their MPI jobs suddenly degrade. You've answered #1 and seem to believe that for #2, no one will notice. Thanks! Bill > regards, mark hahn. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From greg at keller.net Thu May 26 15:29:07 2011 From: greg at keller.net (Greg Keller) Date: Thu, 26 May 2011 14:29:07 -0500 Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: References: Message-ID: <4DDEAA03.2040206@keller.net> Date: Thu, 26 May 2011 12:18:18 -0400 (EDT) > From: Mark Hahn > Subject: Re: [Beowulf] Infiniband: MPI and I/O? > To: Bill Wichser > Cc: Beowulf Mailing List > Message-ID: > > Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed > >> Wondering if anyone out there is doing both I/O to storage as well as >> MPI over the same IB fabric. > I would say that is the norm. we certainly connect local storage > (Lustre) to nodes via the same fabric as MPI. gigabit is completely > inadequate for modern nodes, so the only alternatives would be 10G > or a secondary IB fabric, both quite expensive propositions, no? > > I suppose if your cluster does nothing but IO-light serial/EP jobs, > you might think differently. > Agreed. Just finished telling another vendor, "It's not high speed storage unless it has an IB/RDMA interface". They love that. Except for some really edge cases, I can't imagine running IO over GbE for anything more than trivial IO loads. I am Curious if anyone is doing IO over IB to SRP targets or some similar "Block Device" approach. The Integration into the filesystem by Lustre/GPFS and others may be the best way to go, but we are not 100% convinced yet. Any stories to share? Cheers! Greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From landman at scalableinformatics.com Thu May 26 15:35:35 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 26 May 2011 15:35:35 -0400 Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: <4DDEAA03.2040206@keller.net> References: <4DDEAA03.2040206@keller.net> Message-ID: <4DDEAB87.40203@scalableinformatics.com> On 05/26/2011 03:29 PM, Greg Keller wrote: > Agreed. Just finished telling another vendor, "It's not high speed > storage unless it has an IB/RDMA interface". They love that. Except Heh ... love it! 
> for some really edge cases, I can't imagine running IO over GbE for > anything more than trivial IO loads. Lots of our customers do, when they have a large legacy GbE network, and upgrading is expensive. We can have a very large fan in to our units, but IB (even SDR!) is really nice to move data over for storage. > I am Curious if anyone is doing IO over IB to SRP targets or some > similar "Block Device" approach. The Integration into the filesystem by Both block and file targets. SRPT on our units, and fronted by OSSes for Lustre and similar like things. Can do iSCSI as well (over IB using iSER, or over 10GbE ... works really nicely in either case). > Lustre/GPFS and others may be the best way to go, but we are not 100% > convinced yet. Any stories to share? If you do this with Lustre, make sure your OSSes are in HA pairs using pacemaker/ucarp, and use DRBD between backend units, or MD on the OSS to mirror the storage. Unfortunately IB doesn't virtualize well (last I checked), so these have to be physical OSSes. I presume something similar on GPFS. GlusterFS, PVFS2/OrangeFS, etc. go fine without the block devices, and Gluster does mirroring at the file level. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Thu May 26 16:13:07 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 26 May 2011 16:13:07 -0400 (EDT) Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: <4DDEA7F3.8050703@princeton.edu> References: <4DDE5312.9070901@princeton.edu> <4DDEA7F3.8050703@princeton.edu> Message-ID: >>> Wondering if anyone out there is doing both I/O to storage as well as >>> MPI over the same IB fabric. >>> >> >> I would say that is the norm. we certainly connect local storage (Lustre) >> to nodes via the same fabric as MPI. gigabit is completely >> inadequate for modern nodes, so the only alternatives would be 10G >> or a secondary IB fabric, both quite expensive propositions, no? >> >> I suppose if your cluster does nothing but IO-light serial/EP jobs, >> you might think differently. >> > Really? I'm surprised by that statement. Perhaps I'm just way behind on the > curve though. It is typical here to have local node storage, local > lustre/pvfs storage, local NFS storage, and global GPFS storage running over > the GigE network. sure, we use Gb as well, but only as a crutch, since it's so slow. or does each node have, say, a 4x bonded Gb for this traffic? or are we disagreeing on whether Gb is "slow"? 80-ish MB/s seems pretty slow to me, considering that's less than any single disk on the market... >> how much inter-chassis MPI do you do? how much IO do you do? >> IB has a small MTU, so I don't really see why mixed traffic would be a big >> problem. of course, IB also doesn't do all that wonderfully >> with hotspots. but isn't this mostly an empirical question you can >> answer by direct measurement? >> > How would I measure by direct measurement? 
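For Bill's question about direct measurement, the per-port byte counters every IB HCA keeps are the obvious data source, which is what the reply below suggests. A rough sketch of sampling them on one node, assuming a Mellanox HCA that shows up as mlx4_0 and the standard sysfs counter files:

  P=/sys/class/infiniband/mlx4_0/ports/1/counters
  a=$(cat $P/port_xmit_data)
  sleep 10
  b=$(cat $P/port_xmit_data)
  # PortXmitData counts 32-bit words, so multiply by 4 to get bytes
  echo "TX rate: $(( (b - a) * 4 / 10 )) bytes/s"

These basic counters are only 32 bits wide and top out quickly at QDR rates; where the hardware supports it, perfquery -x (from infiniband-diags) reads the 64-bit extended counters instead. Sampling this on the storage servers and on a few compute nodes during a busy period gives the actual MPI-versus-I/O mix before any QoS tuning is attempted.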
I meant collecting the byte counters from nics and/or switches while real workloads are running. that tells you the actual data rates, and should show how close you are to creating hotspots. > My question really was twofold: 1) is anyone doing this successfully and 2) > does anyone have an idea of how loudly my users will scream when their MPI > jobs suddenly degrade. You've answered #1 and seem to believe that for #2, > no one will notice. we've always done it, though our main experience is with clusters that have full-bisection fabrics. our two more recent clusters have half-bisection fabrics, but I suspect that most users are not looking closely enough at performance to notice and/or complain. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Thu May 26 17:23:30 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 26 May 2011 17:23:30 -0400 (EDT) Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: <4DDEAA03.2040206@keller.net> References: <4DDEAA03.2040206@keller.net> Message-ID: > Agreed. Just finished telling another vendor, "It's not high speed > storage unless it has an IB/RDMA interface". They love that. Except what does RDMA have to do with anything? why would straight 10G ethernet not qualify? I suspect you're really saying that you want an efficient interface, as well as enough bandwidth, but that doesn't necessitate RDMA. > for some really edge cases, I can't imagine running IO over GbE for > anything more than trivial IO loads. well, it's a balance issue. if someone was using lots of Atom boards lashed into a cluster, 1Gb apiece might be pretty reasonable. but for fat nodes (let's say 48 cores), even 1 QDR IB pipe doesn't seem all that generous. as an interesting case in point, SeaMicro was in the news again with a 512 atom system: either 64 Gb links or 16 10G links. the former (.128 Gb/core) seems low even for atoms, but .3 Gb/core might be reasonable. > I am Curious if anyone is doing IO over IB to SRP targets or some > similar "Block Device" approach. The Integration into the filesystem by > Lustre/GPFS and others may be the best way to go, but we are not 100% > convinced yet. Any stories to share? you mean you _like_ block storage? how do you make a shared FS namespace out of it, manage locking, etc? regards, mark hahn. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From greg at keller.net Thu May 26 18:30:52 2011 From: greg at keller.net (Greg Keller) Date: Thu, 26 May 2011 17:30:52 -0500 Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: References: <4DDEAA03.2040206@keller.net> Message-ID: <4DDED49C.6070509@keller.net> On 5/26/2011 4:23 PM, Mark Hahn wrote: >> Agreed. Just finished telling another vendor, "It's not high speed >> storage unless it has an IB/RDMA interface". They love that. Except > > what does RDMA have to do with anything? why would straight 10G ethernet > not qualify? 
I suspect you're really saying that you want an efficient > interface, as well as enough bandwidth, but that doesn't necessitate > RDMA. > RDMA over IB is definitely a nice feature. Not required, but IP over IB has enough limits that we prefer to avoid it. >> for some really edge cases, I can't imagine running IO over GbE for >> anything more than trivial IO loads. > > well, it's a balance issue. if someone was using lots of Atom boards > lashed into a cluster, 1Gb apiece might be pretty reasonable. but for > fat nodes (let's say 48 cores), even 1 QDR IB pipe doesn't seem all > that generous. > > as an interesting case in point, SeaMicro was in the news again with a > 512 > atom system: either 64 Gb links or 16 10G links. the former (.128 > Gb/core) > seems low even for atoms, but .3 Gb/core might be reasonable. > agreed >> I am Curious if anyone is doing IO over IB to SRP targets or some >> similar "Block Device" approach. The Integration into the filesystem by >> Lustre/GPFS and others may be the best way to go, but we are not 100% >> convinced yet. Any stories to share? > > you mean you _like_ block storage? how do you make a shared FS namespace > out of it, manage locking, etc? Well, it's a use case issue for us. You don't make a shared FS on the block devices (well, maybe you could just not in a scalable way)... but we envision leasing block devices to customers with known capacity/performance capability. Then the customer can make the call if they want to use it for a CIFS/NFS backend, possibly even lashed together via MD, through a single server. They can also lease multiple block devices and create a lustre type system. The flexibility is if they disappear and come back they may not get the same compute/storage nodes, but they can attach any server to their dedicated block storage devices. There are also some multi-tenancy security options that can be more definitively handled if they have absolute control over a block device. So in this case, they would semi-permanently lease the block devices, and then fire up front end storage nodes and compute nodes on an "as needed / as available" basis anywhere in our compute farm. Effectively we get the benefits of a massive Fibre Channel type SAN over the IB infrastructure we have to every node. If we can get the performance and cost of the block storage right, it will be compelling for some of our customers. We are still prototyping how it would work and characterizing performance options... but it's interesting to us. Cheers! Greg > > regards, mark hahn. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Fri May 27 23:59:42 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 27 May 2011 23:59:42 -0400 (EDT) Subject: [Beowulf] 512 atoms in a box Message-ID: I was thinking about the seamicro box - 512 atoms, 64 disks and either 64 Gb ports or 16 10G ports. it would be interesting to look at what the most appropriate "balance" is for mips/flops of cpu power compared to interconnect bandwidth. maybe the seamicro box is more intended to be a giant memcached server - that is, the question is memory bandwidth/capacity versus IC bandwidth. in any case, you have to ponder where the amazing value-add is - compactness? 
I'm not sure it competes all that well compared to 48 core-per-U conventional servers (whether mips/flops or memory-based). here's an idea, more commodity-oriented (hence beowulf): suppose you design a tiny widget that gets all its power via POE. maybe Atom or ARM-based - you've got 15-20W, which is quite a bit these days. for packaging, you need space for a cpu, nic and sodimm. maybe some leds. plug them into a commodity 1U 48-port Gb switch, then stack 10 of them and you've got a penny-pincher's approximation of a Seamicro SM100000! not going to win top500, but... _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From eugen at leitl.org Sat May 28 04:26:25 2011 From: eugen at leitl.org (Eugen Leitl) Date: Sat, 28 May 2011 10:26:25 +0200 Subject: [Beowulf] 512 atoms in a box In-Reply-To: References: Message-ID: <20110528082625.GB19622@leitl.org> On Fri, May 27, 2011 at 11:59:42PM -0400, Mark Hahn wrote: > here's an idea, more commodity-oriented (hence beowulf): suppose you > design a tiny widget that gets all its power via POE. maybe Atom or > ARM-based - you've got 15-20W, which is quite a bit these days. > for packaging, you need space for a cpu, nic and sodimm. maybe some leds. > > plug them into a commodity 1U 48-port Gb switch, then stack 10 of them > and you've got a penny-pincher's approximation of a Seamicro SM100000! > > not going to win top500, but... I was planning to do something similar with rooted Apple TV, once it's bumped up to A5 in the next generation. The devices would need spacers, a baffle and a few fans, if packed closely. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From award at uda.ad Sat May 28 06:08:50 2011 From: award at uda.ad (Alan Ward) Date: Sat, 28 May 2011 12:08:50 +0200 Subject: [Beowulf] RS: 512 atoms in a box References: Message-ID: <9CC6BA5ACC8E7C489A97B01C2DA791063D04BD@serpens.ua.ad> Just found this: http://www.raspberrypi.org/ The ARM11 does not pack much punch, there is no networking (though it should not be too difficult to add) and it is not even in production yet. But it does seem fun. Plus, $1000 would get you 40 units ... Cheers, -Alan -----Missatge original----- De: beowulf-bounces at beowulf.org en nom de Mark Hahn Enviat el: ds. 28/05/2011 05:59 Per a: Beowulf Mailing List Tema: [Beowulf] 512 atoms in a box I was thinking about the seamicro box - 512 atoms, 64 disks and either 64 Gb ports or 16 10G ports. it would be interesting to look at what the most appropriate "balance" is for mips/flops of cpu power compared to interconnect bandwidth. maybe the seamicro box is more intended to be a giant memcached server - that is, the question is memory bandwidth/capacity versus IC bandwidth. 
in any case, you have to ponder where the amazing value-add is - compactness? I'm not sure it competes all that well compared to 48 core-per-U conventional servers (whether mips/flops or memory-based). here's an idea, more commodity-oriented (hence beowulf): suppose you design a tiny widget that gets all its power via POE. maybe Atom or ARM-based - you've got 15-20W, which is quite a bit these days. for packaging, you need space for a cpu, nic and sodimm. maybe some leds. plug them into a commodity 1U 48-port Gb switch, then stack 10 of them and you've got a penny-pincher's approximation of a Seamicro SM100000! not going to win top500, but...

_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf