From eugen at leitl.org Wed May 4 06:33:46 2011 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 4 May 2011 12:33:46 +0200 Subject: [Beowulf] Chinese Chip Wins Energy-Efficiency Crown Message-ID: <20110504103346.GH23560@leitl.org> http://spectrum.ieee.org/semiconductors/processors/chinese-chip-wins-energyefficiency-crown?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+IeeeSpectrum+%28IEEE+Spectrum%29&utm_content=Bloglines Chinese Chip Wins Energy-Efficiency Crown Though slower than competitors, the energy-saving Godson-3B is destined for the next Chinese supercomputer By Joseph Calamia / May 2011 The Dawning 6000 supercomputer, which Chinese researchers expect to unveil in the third quarter of 2011, will have something quite different under its hood. Unlike its forerunners, which employed American-born chips, this machine will harness the country's homegrown high-end processor, the Godson-3B. With a peak frequency of 1.05 gigahertz, the Godson is slower than its competitors' wares, at least one of which operates at more than 5 GHz, but the chip still turns heads with its record-breaking energy efficiency. It can execute 128 billion floating-point operations per second using just 40 watts, double or more the performance per watt of competitors. The Godson has an eccentric interconnect structure (for relaying messages among multiple processor cores) that also garners attention. While Intel and IBM are commercializing chips that will shuttle communications between cores merry-go-round style on a "ring interconnect," the Godson connects cores using a modified version of the gridlike interconnect system called a mesh network. The processor's designers, led by Weiwu Hu at the Chinese Academy of Sciences, in Beijing, seem to be placing their bets on a new kind of layout for future high-end computer processors. A mesh design goes hand in hand with saving energy, says Matthew Mattina, chief architect at the San Jose, Calif.-based Tilera Corp., a chipmaker now shipping 36- and 64-core processors using on-chip mesh interconnects. Imagine a ring interconnect as a traffic roundabout. Getting to some exits requires you to drive nearly around the entire circle. Traveling away from your destination before getting there, says Mattina, requires more transistor switching and therefore consumes more energy. A mesh network is more like a city's crisscrossed streets. "In a mesh, you always traverse the minimum amount of wire; you're never going the wrong way," he says. On the 8-core Godson chip, 4 cores form a tightly bound unit: each core sits on a corner of a square of interconnects, as in a usual mesh. Godson researchers have also connected each corner to its opposite, using a pair of diagonal interconnects to form an X through the square's center. A "crossbar" interconnect then serves as an overpass, linking this 4-core neighborhood to a similar 4-core setup nearby. Godson developers believe that their modified mesh's scalability will prove a key advantage, as chip designers cram more cores onto future chips. Yunji Chen, a Godson architect, says that competitors' ring interconnects may have trouble squeezing in more than 32 cores. Indeed, one of the ring's benefits could prove its future liability. Linking new cores to a ring is fairly easy, says K.C. Smith, an emeritus professor of electrical and computer engineering at the University of Toronto. After all, there's only one path to send information, or two in a bidirectional ring. 
But sharing a common communication path also means that each additional core adds to the length of wire that messages must travel and increases the demand for that path. With a large number of cores, "the timing around this ring just gets out of hand," Smith says. "You can't get service when you need it." Of course, adding more cores in a mesh also stresses the system. Even if you have a grid of paths providing multiple communication channels, more cores increase the demand for the network, and more demand makes traveling long distances difficult: Try driving across New York City at rush hour. Still, the bandwidth scaling of a mesh interconnect is superior to that of a ring, Tilera's Mattina says. He notes that the total bandwidth available with a mesh interconnect increases as you add cores, but with a ring interconnect, the total bandwidth remains constant even as the core count increases. Latency (the time it takes to get a message from one core to another) is also more favorable in a mesh design, Chen says. In a ring interconnect, latency increases linearly with the core count, he says, while in a mesh design it increases with the square root of the number of cores. Reid Riedlinger, a principal engineer at Intel, points out that a ring interconnect has its own scalability benefits. Intel's recently unveiled 8-core Poulson design employs a ring not only to add more cores but also to add easy-to-access on-chip memory, or cache. As long as the chip has the power and the space, Riedlinger says, a ring makes it easy to add each core and cache as a module, a move that would require more complicated validity studies and logic modification in a mesh. "Adding the additional ring stop has a very small impact on latency, and the additional cache capacity will provide performance benefits for many applications," he says. For those who are not building a national supercomputer, Riedlinger also points out that a ring setup is more easily scalable in a different direction. "You might start with an 8-core design," he says, "and then, to suit a different market segment, you might chop 4 cores out of the middle and sell it as a different product." This article originally appeared in print as "China's Godson Gamble". _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From cbergstrom at pathscale.com Wed May 4 06:39:45 2011 From: cbergstrom at pathscale.com (Christopher Bergström) Date: Wed, 4 May 2011 17:39:45 +0700 Subject: [Beowulf] Chinese Chip Wins Energy-Efficiency Crown In-Reply-To: <20110504103346.GH23560@leitl.org> References: <20110504103346.GH23560@leitl.org> Message-ID: On Wed, May 4, 2011 at 5:33 PM, Eugen Leitl wrote: > > http://spectrum.ieee.org/semiconductors/processors/chinese-chip-wins-energyefficiency-crown?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+IeeeSpectrum+%28IEEE+Spectrum%29&utm_content=Bloglines > > Chinese Chip Wins Energy-Efficiency Crown > > Though slower than competitors, the energy-saving Godson-3B is destined > for the next Chinese supercomputer > > By Joseph Calamia / May 2011 > > The Dawning 6000 supercomputer, which Chinese researchers expect to unveil in > the third quarter of 2011, will have something quite different under its > hood. 
> Unlike its forerunners, which employed American-born chips, this > machine will harness the country's homegrown high-end processor, the > Godson-3B. With a peak frequency of 1.05 gigahertz, the Godson is slower than > its competitors' wares, at least one of which operates at more than 5 GHz, > but the chip still turns heads with its record-breaking energy efficiency. It > can execute 128 billion floating-point operations per second using just 40 > watts, double or more the performance per watt of competitors. *cough* Wow.. they've brought SiCortex back to life... _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Wed May 4 09:50:50 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 04 May 2011 09:50:50 -0400 Subject: [Beowulf] Chinese Chip Wins Energy-Efficiency Crown In-Reply-To: References: <20110504103346.GH23560@leitl.org> Message-ID: <4DC159BA.8090005@ias.edu> On 05/04/2011 06:39 AM, Christopher Bergström wrote: > On Wed, May 4, 2011 at 5:33 PM, Eugen Leitl wrote: >> >> http://spectrum.ieee.org/semiconductors/processors/chinese-chip-wins-energyefficiency-crown?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+IeeeSpectrum+%28IEEE+Spectrum%29&utm_content=Bloglines >> >> Chinese Chip Wins Energy-Efficiency Crown >> >> Though slower than competitors, the energy-saving Godson-3B is destined >> for the next Chinese supercomputer >> >> By Joseph Calamia / May 2011 >> >> The Dawning 6000 supercomputer, which Chinese researchers expect to unveil in >> the third quarter of 2011, will have something quite different under its >> hood. Unlike its forerunners, which employed American-born chips, this >> machine will harness the country's homegrown high-end processor, the >> Godson-3B. With a peak frequency of 1.05 gigahertz, the Godson is slower than >> its competitors' wares, at least one of which operates at more than 5 GHz, >> but the chip still turns heads with its record-breaking energy efficiency. It >> can execute 128 billion floating-point operations per second using just 40 >> watts, double or more the performance per watt of competitors. > > *cough* > > Wow.. they've brought SiCortex back to life... Oh, snap! -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Tue May 10 01:37:33 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Mon, 9 May 2011 22:37:33 -0700 Subject: [Beowulf] How InfiniBand gained confusing bandwidth numbers Message-ID: <20110510053733.GB12826@bx9.net> http://dilbert.com/strips/comic/2011-05-10/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
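The scaling claim in the Godson article above (ring latency grows roughly linearly with the core count, mesh latency with roughly its square root) is easy to sanity-check by computing average hop counts for idealized topologies. A toy sketch in bash and awk; it models only hop distance, not routers, link widths, or contention:

#!/bin/bash
# Rough check of the ring-vs-mesh latency scaling: average hop count
# between two cores on a bidirectional ring vs. a k x k 2D mesh,
# comparing equal core counts. Idealized distances only.
for k in 4 6 8 12 16 32; do
    awk -v k="$k" 'BEGIN {
        n = k * k
        # bidirectional ring of n cores: distance is the shorter way around
        rsum = 0
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++) {
                d = i - j; if (d < 0) d = -d
                if (n - d < d) d = n - d
                rsum += d
            }
        # k x k mesh: Manhattan distance between grid positions
        msum = 0
        for (a = 0; a < n; a++)
            for (b = 0; b < n; b++) {
                dx = int(a / k) - int(b / k); if (dx < 0) dx = -dx
                dy = a % k - b % k; if (dy < 0) dy = -dy
                msum += dx + dy
            }
        pairs = n * n
        printf "%5d cores: ring avg %6.2f hops, mesh avg %6.2f hops\n", n, rsum / pairs, msum / pairs
    }'
done

For 1,024 cores this prints a ring average of roughly 256 hops against a mesh average of about 21, which is the linear-versus-square-root behaviour Chen describes.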
From mdidomenico4 at gmail.com Wed May 11 20:57:57 2011 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed, 11 May 2011 20:57:57 -0400 Subject: [Beowulf] EPCC and DIR cluster Message-ID: Is there anyone on the list associated with EPCC or knows someone at EPCC? If so, i recently saw an article in Scientific Computing magazine, where there was a blurb about a smallish cluster built a EPCC utilizing Atom chips/GPU's and HDD's, whereby the design was more amdahl balanced for "data intensive research". I can't seem to locate anything on the web about it, but I'm interested in the spec's/design for the machine and how it performs thanks _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From landman at scalableinformatics.com Fri May 20 00:35:25 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 20 May 2011 00:35:25 -0400 Subject: [Beowulf] Curious about ECC vs non-ECC in practice Message-ID: <4DD5EF8D.8070909@scalableinformatics.com> Hi folks Does anyone run a large-ish cluster without ECC ram? Or with ECC turned off at the motherboard level? I am curious if there are numbers of these, and what issues people encounter. I have some of my own data from smaller collections of systems, I am wondering about this for larger systems. Thanks! Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Fri May 20 01:45:01 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 19 May 2011 22:45:01 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <4DD5EF8D.8070909@scalableinformatics.com> References: <4DD5EF8D.8070909@scalableinformatics.com> Message-ID: <20110520054501.GE16676@bx9.net> On Fri, May 20, 2011 at 12:35:25AM -0400, Joe Landman wrote: > Does anyone run a large-ish cluster without ECC ram? Or with ECC > turned off at the motherboard level? I am curious if there are numbers > of these, and what issues people encounter. I have some of my own data > from smaller collections of systems, I am wondering about this for > larger systems. I don't think anyone's done the experiment with a 'larger system' since "Big Mac" had to replace all of their servers with ones that had ECC. Still, any cluster that can manipulate the BIOS appropriately could easily do the experiment. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
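One cheap way to put numbers on this without turning ECC off is to poll the kernel's EDAC counters for corrected and uncorrected errors across the cluster. A sketch; the sysfs paths follow the mainline EDAC layout and the node names are placeholders:

#!/bin/bash
# Report corrected (ce_count) and uncorrected (ue_count) ECC error totals
# per memory controller on each node; requires the edac_* kernel modules.
NODES="node001 node002 node003"    # placeholder node list
for node in $NODES; do
    ssh "$node" 'for mc in /sys/devices/system/edac/mc/mc*; do
        [ -d "$mc" ] || continue
        echo "$(hostname) $(basename $mc): ce=$(cat $mc/ce_count) ue=$(cat $mc/ue_count)"
    done'
done

A steadily climbing ce_count on an otherwise healthy node is exactly the kind of event that passes silently on non-ECC hardware.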
From gmpc at sanger.ac.uk Fri May 20 04:58:59 2011 From: gmpc at sanger.ac.uk (Guy Coates) Date: Fri, 20 May 2011 09:58:59 +0100 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <20110520054501.GE16676@bx9.net> References: <4DD5EF8D.8070909@scalableinformatics.com> <20110520054501.GE16676@bx9.net> Message-ID: <4DD62D53.6030806@sanger.ac.uk> On 20/05/11 06:45, Greg Lindahl wrote: > On Fri, May 20, 2011 at 12:35:25AM -0400, Joe Landman wrote: > >> Does anyone run a large-ish cluster without ECC ram? Or with ECC >> turned off at the motherboard level? I am curious if there are numbers >> of these, and what issues people encounter. I have some of my own data >> from smaller collections of systems, I am wondering about this for >> larger systems. We did, circa 2003. Never again. When we were lucky, the uncorrected errors happened in memory in use by the kernel or application code, and we got hard machine crashes or code seg-faulting. Those were easy to spot. When we were unlucky, the errors happened in page cache, resulting in data being randomly transmuted. Most of the code we were running at the time did minimal input sanity checking. It was quite instructive to see just how much genomic analysis code would quite happily compute on DNA sequences that contained things other than ATGC. The duff runs would eventually get picked up by the various sanity-checks that happened at the end of our analysis pipelines, but it involved quite a bit of developer & sysadmin effort to track down and re-run all of the possibly affected jobs. Cheers, Guy -- Dr. Guy Coates, Informatics Systems Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From a.travis at abdn.ac.uk Fri May 20 11:35:45 2011 From: a.travis at abdn.ac.uk (Tony Travis) Date: Fri, 20 May 2011 16:35:45 +0100 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <4DD5EF8D.8070909@scalableinformatics.com> References: <4DD5EF8D.8070909@scalableinformatics.com> Message-ID: <4DD68A51.70605@abdn.ac.uk> On 20/05/11 05:35, Joe Landman wrote: > Hi folks > > Does anyone run a large-ish cluster without ECC ram? Or with ECC > turned off at the motherboard level? I am curious if there are numbers > of these, and what issues people encounter. I have some of my own data > from smaller collections of systems, I am wondering about this for > larger systems. Hi, Joe. I ran a small cluster of ~100 32-bit nodes with non-ECC memory and it was a nightmare, as Guy described in his email, until I pre-emptively tested the memory in user-space, using Charles Cazabon's "memtester": http://pyropus.ca/software/memtester Prior to this, *all* the RAM had passed Memtest86+. I had a strict policy that if a system crashed, for any reason, it was re-tested with Memtest86+, then 100 passes of "memtester" before being allowed to re-join the Beowulf cluster. 
This made the Beowulf much more stable running openMosix. However, I've scrapped all our non-ECC nodes now because the real worry is not knowing if an error has occurred... Apparently this is still a big issue for computers in space, using non-ECC RAM for solid-state storage on grounds of cost for imaging. They, apparently, use RAM background SoftECC 'scrubbers' like this: http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf Bye, Tony. -- Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Fri May 20 11:52:43 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 20 May 2011 08:52:43 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <4DD68A51.70605@abdn.ac.uk> Message-ID: On 5/20/11 8:35 AM, "Tony Travis" wrote: >On 20/05/11 05:35, Joe Landman wrote: >> Hi folks >> >> Does anyone run a large-ish cluster without ECC ram? Or with ECC >> turned off at the motherboard level? I am curious if there are numbers >> of these, and what issues people encounter. I have some of my own data >> from smaller collections of systems, I am wondering about this for >> larger systems. > >Hi, Joe. > >Apparently this is still a big issue for computers in space, using >non-ECC RAM for solid-state storage on grounds of cost for imaging. >They, apparently, use RAM background SoftECC 'scrubbers' like this: > >http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng >.pdf > > Yes, it's a big tradeoff in the space world. Not only does ECC require extra memory, but the EDAC logic consumes power and, typically, slows down the bus speed (I.e. You need an extra bus cycle to handle the EDAC logic propagation delay). There's also a practical detail that the upset rate might be low enough that it is ok to just tolerate the upsets, because they'll get handled at some higher level of the process. For instance, if you have a RAM buffer in a communication link handling the FEC coded bits, then there's not much difference between a bit flip in RAM and a bit error on the comm link, so you might as well just let the comm FEC code take care of the bit errors. We tend to use a lot of checksum strategies. Rather than an EDAC strategy which corrects errors, it's good enough to just know that an error occurred, and retry. This is particularly effective on Flash memory, which has transient read errors: read it again and it works ok. Another example is doing an FFT. There are some strategies which allow you to do a second fast computation that essentially provides a "check" on the results of the FFT (e.g. The mean of the input data should match the "DC term" in the FFT) We might also keep triple copies of key variables. You read all three values and compare them before starting the computation. Software Triple Redundancy, as it were. 
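At the job-script level the same "detect an error and retry" idea needs nothing more than a wrapper: run the (deterministic) computation twice and accept the result only if the two outputs agree bit for bit. A toy sketch; the solver name, flags, and file names are made up for illustration:

#!/bin/bash
# Accept a result only when two independent runs produce identical output.
# Assumes the application is deterministic (bit-identical reruns).
run_once() {
    ./solver --input case.in --output "$1"    # hypothetical application
}
for attempt in 1 2 3; do
    run_once result.a && run_once result.b || { echo "solver failed" >&2; exit 1; }
    if cmp -s result.a result.b; then
        mv result.a result.dat
        echo "runs agree on attempt $attempt"
        exit 0
    fi
    echo "outputs differ on attempt $attempt, retrying" >&2
done
echo "no two runs agreed; suspect the hardware" >&2
exit 1

Doubling the compute is crude, but it trades cheap cycles for the silent page-cache corruption Guy described earlier.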
A lot of times, the probability of an error occurring "during" the computation is sufficiently low, compared to the probability of an error occurring during the very long waiting time between operating on the data. There's also the whole question of whether EDAC main memory buys you much, when all the (ever larger) cache isn't protected. Again, it comes down to a probability analysis. My own personal theory on this is that you are much more likely to have a miscalculation due to a software bug than due to an upset. Further, it's impossible to get all the bugs out in finite time/money, so you might as well design your whole system to be fault tolerant, not in a "oh my gosh, we had an error, let's do extensive fault recovery", but a "we assume the computations are always a bit wonky, so we factor that into our design". That is, design so that retries and self checks are just part of the overhead. Kind of like how a decent experiment or engineering design takes into account measurement uncertainty stack-up. As hardware gets smaller and faster and lower power, the "cost" to provide extra computational resources to implement a strategy like this gets smaller, relative to the ever increasing human labor cost to try and make it perfect. (and, of course, this *is* how humans actually do stuff.. You don't precompute all of your control inputs to the car.. You basically set a general goal, and continuously adjust to drive towards that goal.) Jim Lux > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From deadline at eadline.org Fri May 20 12:35:26 2011 From: deadline at eadline.org (Douglas Eadline) Date: Fri, 20 May 2011 12:35:26 -0400 (EDT) Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <4DD5EF8D.8070909@scalableinformatics.com> References: <4DD5EF8D.8070909@scalableinformatics.com> Message-ID: <50574.192.168.93.213.1305909326.squirrel@mail.eadline.org> Joe While this is somewhat anecdotal, it may be helpful. Not a large-ish cluster, but as you may guess, I wondered about this for Limulus (http://limulus.basement-supercomputing.com) I wrote a script (will post it if anyone is interested) that runs memtester until you stop it or it finds an error. I ran it on several Core2 Duo systems with Kingston DDR2-800 PC2-6400 memory. As I recall, I ran it on 2-3 systems, only one showed an error. I stopped the others after about three weeks. Here is an example of the script output when it fails (it logs the memtest output). There was an error, inspect memtest-1178 Start Date was: Mon Apr 20 16:04:35 EDT 2009 Failure Date was: Fri May 8 17:55:43 EDT 2009 Test ran 1178 times failing after 1561868 Seconds (26031 Minutes or 433 Hours or 18 Days) My experience in running small clusters without ECC has been very good. IMO it is also a question of the quality of the memory vendor. I never had an issue when running tests and benchmarks, which I do quite a bit on new hardware e.g. goo.gl/YoBaz -- Doug > Hi folks > > Does anyone run a large-ish cluster without ECC ram? Or with ECC > turned off at the motherboard level? I am curious if there are numbers > of these, and what issues people encounter. 
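Doug's script isn't reproduced here, but a minimal sketch of a wrapper along the lines he describes (loop memtester, log each pass, stop and report on the first failure) could look like this; it assumes memtester returns a nonzero exit status on a failed pass, and the size and file names are arbitrary:

#!/bin/bash
# Soak test: run memtester repeatedly, logging each pass, until one fails.
SIZE=1024M                 # amount of RAM to exercise per pass
START=$(date)
run=0
while :; do
    run=$((run + 1))
    if ! memtester "$SIZE" 1 > "memtest-$run.log" 2>&1; then
        echo "There was an error, inspect memtest-$run.log"
        echo "Start Date was:   $START"
        echo "Failure Date was: $(date)"
        exit 1
    fi
done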
> I have some of my own data > from smaller collections of systems, I am wondering about this for > larger systems. > > Thanks! > > Joe > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics, Inc. > email: landman at scalableinformatics.com > web : http://scalableinformatics.com > http://scalableinformatics.com/sicluster > phone: +1 734 786 8423 x121 > fax : +1 866 888 3112 > cell : +1 734 612 4615 > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From james.p.lux at jpl.nasa.gov Fri May 20 13:21:12 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 20 May 2011 10:21:12 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <50574.192.168.93.213.1305909326.squirrel@mail.eadline.org> Message-ID: On 5/20/11 9:35 AM, "Douglas Eadline" wrote: >Joe > >While this is somewhat anecdotal, it may be helpful. > >Not a large-ish cluster, but as you may guess, I wondered >about this for Limulus >(http://limulus.basement-supercomputing.com) > >I wrote a script (will post it if anyone is interested) >that runs memtester until you stop it or it finds >an error. I ran it on several Core2 Duo systems >with Kingston DDR2-800 PC2-6400 memory. > >My experience in running small clusters >without ECC has been very good. IMO it is also >a question of the quality of the memory vendor. >I never had an issue when running tests and >benchmarks, which I do quite a bit on new >hardware e.g. I'm going to guess that it's highly idiosyncratic. The timing margins on all the signals between CPU, memory, and peripherals are tight, they're temperature dependent and process dependent, so you could have the exact same design with very similar RAM and one will get errors and the other won't. Folks who design PCI bus interfaces for a living earn their pay, especially if they have to make it work with lots of different mfrs: just because all the parts meet their databook specs doesn't mean that the system will play nice together. Consider that for memory, you have 64 odd data lines and 20 or so address lines and some strobes that ALL have to switch together. A data sensitive pattern where a bunch of lines move at the same time, and induce a bit of a voltage into an adjacent trace, which is a bit slower or faster than the rest, and you've got the makings of a challenging hunt for the problem. PC board trace lengths all have to be carefully matched, loads have to be carefully matched, etc. 66 MHz -> 15 ns, but modern DDR RAMs do batches of words separated by a few ns. 1 ns is about 10-15 cm of trace length, but it's the loading, terminations, and other stuff that causes a problem. Hang a 1 pF capacitor off that 100 ohm line, and there's a tenth of a ns time constant right there. You could also have EMI/EMC issues that cause problems. 
That same ragged edge timing margin might be fine with 8 tower cases sitting on a shelf, but not so good with the exact same mobo and memory stacked into 1-2U cases in a 19" rack. Power cords and ethernet cables also carry EMI around. In a large cluster these things will all be aggravated: you've got more machines running, so you increase the error probability right there. You've got more electrical noise on the power carried between machines. You've typically got denser packaging. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Fri May 20 14:26:31 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 20 May 2011 14:26:31 -0400 (EDT) Subject: [Beowulf] Execution time measurements - clarification Message-ID: From: Mikhail Kuzminsky Subject: [Beowulf] Execution time measurements - clarification Dear Mark, could you pls forward my message to beowulf at beowulf.org (because, as before, my messages can't be delivered to the mailing list)? It's a clarification of my previous question here. Mikhail --------------------- I have strange execution time measurements for CPU-bound jobs (to be exact, Gaussian-09 DFT frequency calculations). Results are strange for *SEQUENTIAL* calculations! Executions were performed on a dual-socket Opteron 2350 (quad-core) server running OpenSuSE Linux 10.3. When I run 2 identical examples of the same batch job simultaneously, the execution time of *each* job is LOWER than for a single job run! I thought that G09's own reported times might be wrong, so I checked them via time: (time g09 *pair1.com) >&testt1 & etc. But this confirms the strange results: For a pair of simultaneously running sequential jobs 88801.141u 52.475s 24:40:57.58 99.9% 0+0k 0+0io 1221pf+0w 88901.996u 13.472s 24:41:53.97 100.0% 0+0k 0+0io 0pf+0w For a run of 1 example of the same sequential job 100365.236u 27.297s 27:53:13.53 99.9% 0+0k 0+0io 1pf+0w Are there any ideas why this might happen? Mikhail Kuzminsky _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Fri May 20 20:29:10 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 20 May 2011 17:29:10 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: References: <4DD68A51.70605@abdn.ac.uk> Message-ID: <20110521002910.GB14350@bx9.net> On Fri, May 20, 2011 at 08:52:43AM -0700, Lux, Jim (337C) wrote: > As hardware gets smaller and faster and lower power, the "cost" to provide > extra computational resources to implement a strategy like this gets > smaller, relative to the ever increasing human labor cost to try and make > it perfect. The cost is teaching users to add checks to their codes, and to any off-the-shelf codes they start using. In hydrodynamics (CFD), often you have quantities which are explicitly conserved by the equations, and others which are conserved by physics but not by the particular numerical method you're using. The latter were quite handy for finding bugs. 
I managed to discover several numerical accuracy bugs in pre-release versions of the PathScale compilers that way. "Yes, it's a bug if the 12th decimal place changes." -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Fri May 20 20:32:27 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 20 May 2011 17:32:27 -0700 Subject: [Beowulf] Execution time measurements - clarification Message-ID: <20110521003227.GD14350@bx9.net> On Fri, May 20, 2011 at 02:26:31PM -0400, Mark Hahn forwarded a message: > When I run 2 identical examples of the same batch job simultaneously, execution time of *each* job is > LOWER than for single job run ! I'd try locking these sequential jobs to a single core, you can get quite weird effects when you don't. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mathog at caltech.edu Mon May 23 12:40:13 2011 From: mathog at caltech.edu (David Mathog) Date: Mon, 23 May 2011 09:40:13 -0700 Subject: [Beowulf] Execution time measurements - clarification Message-ID: > On Fri, May 20, 2011 at 02:26:31PM -0400, Mark Hahn forwarded a message: > > > When I run 2 identical examples of the same batch job simultaneously, execution time of *each* job is > > LOWER than for single job run ! Disk caching could cause that. Normally if the data read in isn't too big you see an effect where: run 1: 30 sec <-- 50% disk IO/ 50% CPU run 2: 15 sec <-- ~100% CPU where the first run loaded the data into disk cache, and the second run read it there, saving a lot of real disk IO. Under some very peculiar conditions, on a multicore system, if run 1 and 2 are "simultaneous" they could seesaw back and forth for the "lead", so they end up taking turns doing the actual disk IO, with the total run time for each ending up between the times for the two runs above. Note that they wouldn't have to be started at exactly the same time for this to happen, because the job that starts second is going to be reading from cache, so it will tend to catch up to the job that started first. Once they are close then noise in the scheduling algorithms could cause the second to pass the first. (If it didn't pass, then this couldn't happen, because the second would always be waiting for the first to pull data in from disk.) Of course, you also need to be sure that run 1 isn't interfering with run 2. They might, for instance, save/retrieve intermediate values to the same filename, so that they really cannot be run safely at the same time. That is, they run faster together, but they run incorrectly. 
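As a concrete way to follow the earlier suggestion of locking each job to a single core (so scheduler migration and shared-core effects are ruled out), something along these lines could be tried; the core numbers and input names are placeholders, and the cores should be chosen on different sockets per /proc/cpuinfo:

#!/bin/bash
# Run the two identical jobs pinned to fixed cores and time each one.
# numactl --physcpubind=0 --membind=0 g09 pair1.com   # alternative that also pins memory
( /usr/bin/time taskset -c 0 g09 pair1.com > pair1.log 2>&1 ) &
( /usr/bin/time taskset -c 4 g09 pair2.com > pair2.log 2>&1 ) &
wait
tail -n 3 pair1.log pair2.log    # /usr/bin/time appends its summary to each log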
Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mathog at caltech.edu Mon May 23 15:32:33 2011 From: mathog at caltech.edu (David Mathog) Date: Mon, 23 May 2011 12:32:33 -0700 Subject: [Beowulf] Execution time measurements Message-ID: Mikhail Kuzminsky sent this to me and asked that it be posted: BEGIN FORWARD On Mon, 23 May 2011 09:40:13 -0700, "David Mathog" wrote: > > On Fri, May 20, 2011 at 02:26:31PM -0400, Mark Hahn forwarded a message: > > > When I run 2 identical examples of the same batch job > simultaneously, execution time of *each* job is > > > LOWER than for single job run ! > > Disk caching could cause that. Normally if the data read in isn't too > big you see an effect where: > > run 1: 30 sec <-- 50% disk IO/ 50% CPU > run 2: 15 sec <-- ~100% CPU I believe that the jobs are CPU-bound: top says that they use 100% of the CPU, and there is no swap activity. iostat /dev/sda3 (where the IO is performed) typically says something like: Linux 2.6.22.5-31-default (c6ws1) 05/25/2011 avg-cpu: %user %nice %system %iowait %steal %idle 1.12 0.00 0.03 0.01 0.00 98.84 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda3 0.01 0.01 8.47 20720 16845881 > > Of course, you also need to be sure that run 1 isn't interfering with > run 2. They might, for instance, save/retrieve intermediate values > to the same filename, so that they really cannot be run safely at the > same time. That is, they run faster together, but they run incorrectly. The file names used for IO are unique. I also thought about CPU frequency variations, but I think that the empty output of lsmod|grep freq is enough to show that the CPU frequency is fixed. END FORWARD OK, so not disk caching. Regarding the frequencies, better to use cat /proc/cpuinfo | grep MHz while the processes are running. Did you verify that the results for each of the two simultaneous runs are both correct? Ideally, tweak some parameter so they are slightly different from each other. 
Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From landman at scalableinformatics.com Tue May 24 11:44:15 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 24 May 2011 11:44:15 -0400 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: References: Message-ID: <4DDBD24F.7080608@scalableinformatics.com> On 05/24/2011 11:41 AM, David Mathog wrote: > Joe Landman wrote: > >> I am wondering about this for larger systems. > > Your post makes me wonder about ECC in much smaller systems, like > dedicated single computers controlling machinery or medical devices. > Some really nasty things could result from "move cutting head in X > (int32 value) mm" after the most significant bit in the int32 value has > flipped. Some bits are more important than others ... Basically I was looking for anecdotal evidence that this is a "bad thing" (TM). I have it now, and it helped me make the case I needed to make. Thanks to everyone for this, it was really helpful! -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Tue May 24 13:06:15 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 24 May 2011 10:06:15 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: References: Message-ID: This *is* a big problem. I suggest reading some of what Nancy Leveson has written. http://sunnyday.mit.edu/ "Professor Leveson started a new area of research, software safety, which is concerned with the problems of building software for real-time systems where failures can result in loss of life or property." Two popular papers you might find interesting and fun to read: "High-Pressure Steam Engines and Computer Software" (Postscript) or (PDF). This paper started as a keynote address at the International Conference on Software Engineering in Melbourne, Australia) and later was published in IEEE Software, October 1994. "The Therac-25 Accidents" (Postscript ) or (PDF). This paper is an updated version of the original IEEE Computer (July 1993) article. It also appears in the appendix of my book. There is a generic problem with complex systems, as well. "Normal Accidents" by Charles Perrow is a good work (if a bit frightening in some ways... not in a senseless fear-mongering way, but because he lays out the fundamental reasons why these things are inevitable) Marais, Dulac, and Leveson argue that the world isn't as bad as Perrow says, though. 
http://esd.mit.edu/symposium/pdfs/papers/marais-b.pdf Jim Lux +1(818)354-2075 > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of David Mathog > Sent: Tuesday, May 24, 2011 8:42 AM > To: beowulf at beowulf.org > Subject: Re: [Beowulf] Curious about ECC vs non-ECC in practice > > Joe Landman wrote: > > > I am wondering about this for larger systems. > > Your post makes me wonder about ECC in much smaller systems, like > dedicated single computers controlling machinery or medical devices. > Some really nasty things could result from "move cutting head in X > (int32 value) mm" after the most significant bit in the int32 value has > flipped. > > Regards, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mathog at caltech.edu Tue May 24 14:27:23 2011 From: mathog at caltech.edu (David Mathog) Date: Tue, 24 May 2011 11:27:23 -0700 Subject: [Beowulf] Execution time measurements Message-ID: Another message from Mikhail Kuzminsky, who for some reason or other cannot currently post directly to the list: BEGIN FORWARD First of all, I should mention that the effect is observed only for the Opteron 2350/OpenSuSE 10.3. Execution of the same job w/the same binaries on Nehalem E5520/OpenSuSE 11.1 gives the same time for 1 and 2 simultaneously running jobs. On Mon, 23 May 2011 12:32:33 -0700, "David Mathog" wrote: > On Mon, 23 May 2011 09:40:13 -0700, "David Mathog" wrote: > > > On Fri, May 20, 2011 at 02:26:31PM -0400, Mark Hahn forwarded a message: > > > > When I run 2 identical examples of the same batch job > > simultaneously, execution time of *each* job is > > > > LOWER than for single job run ! > I also thought about CPU frequency variations, but I think that the empty output > of > lsmod|grep freq > is enough to show that the CPU frequency is fixed. > > END FORWARD > Regarding the frequencies, better to use > cat /proc/cpuinfo | grep MHz 
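A small loop makes that check continuous instead of manual; a sketch (the interval and log file name are arbitrary):

#!/bin/bash
# Log per-core clock speed once a minute while the jobs run; stop with Ctrl-C.
while sleep 60; do
    echo "== $(date)"
    grep MHz /proc/cpuinfo
done >> cpu-mhz.log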
Try running your test job using: > > numactl --cpunodebind=0 --membind=0 g98 numactl w/all things bound to node 1 gives "big" execution time ( 1 day 4 hours; 2 simultaneous jobs run faster), for forcing different nodes for cpu and memory - execution time is even higher (+1 h). Therefore effect observed don't looks as result of numa allocations :-( Mikhail END FORWARD My point about the two different parameter sets on the jobs was to determine if the two were truly independent, or if they might not be interacting with each other through checkpoint files or shared memory, or the like. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mathog at caltech.edu Tue May 24 14:37:30 2011 From: mathog at caltech.edu (David Mathog) Date: Tue, 24 May 2011 11:37:30 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice Message-ID: Jim Lux posted: > "The Therac-25 Accidents" (Postscript ) or (PDF). This paper is an updated version of the original IEEE Computer (July 1993) article. It also appears in the appendix of my book. Well that was really horrible. Are car computers ECC? When all they did was engine management a memory glitch wouldn't have been too terrible, but now that some of them control automatic parking and other "higher" functions, and with around 100M units in circulation just in the USA, if they aren't ECC then memory glitches in running vehicles would have to be happening every day. David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Tue May 24 15:07:10 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 24 May 2011 12:07:10 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: References: Message-ID: > -----Original Message----- > From: David Mathog [mailto:mathog at caltech.edu] > Sent: Tuesday, May 24, 2011 11:38 AM > To: Lux, Jim (337C); beowulf at beowulf.org > Subject: RE: [Beowulf] Curious about ECC vs non-ECC in practice > > Jim Lux posted: > > > "The Therac-25 Accidents" (Postscript ) or (PDF). This paper is an > updated version of the original IEEE Computer (July 1993) article. It > also appears in the appendix of my book. > > Well that was really horrible. > > Are car computers ECC? When all they did was engine management a memory > glitch wouldn't have been too terrible, but now that some of them > control automatic parking and other "higher" functions, and with around > 100M units in circulation just in the USA, if they aren't ECC then > memory glitches in running vehicles would have to be happening every day. Car controllers tend to have mask ROM for their software which is pretty upset immune. 
The "PROM" (which today might be flash or EEPROM) holds all the coefficients for things like the fuel injection/timing, but doesn't hold the code for controlling, say, the ABS. I would imagine (but do not know) that they do things similar to what we do in spacecraft controllers: store critical data multiple times, lots of self checks on algorithm operation, etc. The report on the Toyota Throttle controller said this: "The Main and Sub-CPUs use two types of memory: non-volatile ROM for software code and volatile Static Ram (SRAM). The SRAM is protected by a single error detect and correct and a double error detect hardware function performed by error detection and correction (EDAC) logic." There's a whole reliability of software community out there with everything from certifiable processes to coding standards (MISRA) designed to make it easy to inspect and verify that the code is doing what you think, and that it handles off-nominal cases. I haven't read the whole report, but there was an analysis of the software in the Toyota controllers recently. http://www.nhtsa.gov/staticfiles/nvs/pdf/NASA-UA_report.pdf "The NESC team examined the software code (more than 280,000 lines) for paths that might initiate such a UA, but none were identified" (UA-Unintended Acceleration) The team examined the VOQ vehicles for signs of electrical faults, and subjected these vehicles to electro-magnetic interference (EMI) radiated and conducted test levels significantly above certification levels. The EMI testing did not produce any UAs, but in some cases caused the engine to slow and/or stall. (That's probably closest to what you'd see from a memory upset) Section 6.5, page 64 of the report, is "System Fail-Safe Architecture" It's pretty sophisticated, with multiple parallel schemes to prevent runaway or failure. I'm impressed at the level of thought they gave to not just shutting down the engine, but in leaving an adequate limp-home capability when one or more parts in the chain fails (e.g. if the throttle plate actually sticks, it can control the engine by turning on and off the fuel injectors). There's also an independent mechanism that detects if the pedal isn't pressed (or the redundant pedal position sensors have failed), in which case the engine cannot exceed 2500RPM, if it does, the fuel turns off, and then turns back on when the speed drops below 1100RPM And, since we Beowulfers are for the most part software weenies.. The ECM for the 2005 Camry uses a NEC V850 E1 processor. The software is in ANSI C, and compiled with Greenhills compiler. There are 256kSLOC of non-comments (along with 241kSLOC of comments) in .c files and another 40kSLOC (noncomment) in various .h files. They ran it through Coverity and CodeSonar (both of which we use at JPL), as well as SPIN (using SWARM to run it on a cluster.. now how about that) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From bill at Princeton.EDU Thu May 26 09:18:10 2011 From: bill at Princeton.EDU (Bill Wichser) Date: Thu, 26 May 2011 09:18:10 -0400 Subject: [Beowulf] Infiniband: MPI and I/O? Message-ID: <4DDE5312.9070901@princeton.edu> Wondering if anyone out there is doing both I/O to storage as well as MPI over the same IB fabric. 
Following along in the Mellanox User's Guide, I see a section on how to implement the QOS for both MPI and my lustre storage. I am curious though as to what might happen to the performance of the MPI traffic when high I/O loads are placed on the storage. In our current implementation, we are using blades which are 50% blocking (2:1 oversubscribed) when moving from a 16 blade chassis to other nodes. Would trying to do storage on top dictate moving to a totally non-blocking fabric? Thanks, Bill _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Thu May 26 12:18:18 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 26 May 2011 12:18:18 -0400 (EDT) Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: <4DDE5312.9070901@princeton.edu> References: <4DDE5312.9070901@princeton.edu> Message-ID: > Wondering if anyone out there is doing both I/O to storage as well as > MPI over the same IB fabric. I would say that is the norm. we certainly connect local storage (Lustre) to nodes via the same fabric as MPI. gigabit is completely inadequate for modern nodes, so the only alternatives would be 10G or a secondary IB fabric, both quite expensive propositions, no? I suppose if your cluster does nothing but IO-light serial/EP jobs, you might think differently. > Following along in the Mellanox User's > Guide, I see a section on how to implement the QOS for both MPI and my > lustre storage. I am curious though as to what might happen to the > performance of the MPI traffic when high I/O loads are placed on the > storage. to me, the real question is whether your IB fabric is reasonably close to full-bisection (and/or whether your storage nodes are sensibly placed, topologically.) > In our current implementation, we are using blades which are 50% > blocking (2:1 oversubscribed) when moving from a 16 blade chassis to > other nodes. Would trying to do storage on top dictate moving to a > totally non-blocking fabric? how much inter-chassis MPI do you do? how much IO do you do? IB has a small MTU, so I don't really see why mixed traffic would be a big problem. of course, IB also doesn't do all that wonderfully with hotspots. but isn't this mostly an empirical question you can answer by direct measurement? regards, mark hahn. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From Shainer at Mellanox.com Thu May 26 12:50:17 2011 From: Shainer at Mellanox.com (Gilad Shainer) Date: Thu, 26 May 2011 09:50:17 -0700 Subject: [Beowulf] Infiniband: MPI and I/O? References: <4DDE5312.9070901@princeton.edu> Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F03ACAD7B@mtiexch01.mti.com> > Wondering if anyone out there is doing both I/O to storage as well as > MPI over the same IB fabric. Following along in the Mellanox User's > Guide, I see a section on how to implement the QOS for both MPI and my > lustre storage. 
I am curious though as to what might happen to the > performance of the MPI traffic when high I/O loads are placed on the > storage. I am doing it in my lab -have build my own Lustre solution and am running it on the same network as the MPI jobs. At the end it all depends on how much bandwidth do you need for the MPI and the storage, and if you can cover both, you can do it. Today the QoS solution for IB is out there, and you can set max BW and min latency as parameters for the different traffics. > In our current implementation, we are using blades which are 50% > blocking (2:1 oversubscribed) when moving from a 16 blade chassis to > other nodes. Would trying to do storage on top dictate moving to a > totally non-blocking fabric? IB congestion control is being released now (finally), so this can help you here. Gilad _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From bill at Princeton.EDU Thu May 26 15:20:19 2011 From: bill at Princeton.EDU (Bill Wichser) Date: Thu, 26 May 2011 15:20:19 -0400 Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: References: <4DDE5312.9070901@princeton.edu> Message-ID: <4DDEA7F3.8050703@princeton.edu> Mark Hahn wrote: >> Wondering if anyone out there is doing both I/O to storage as well as >> MPI over the same IB fabric. >> > > I would say that is the norm. we certainly connect local storage > (Lustre) to nodes via the same fabric as MPI. gigabit is completely > inadequate for modern nodes, so the only alternatives would be 10G > or a secondary IB fabric, both quite expensive propositions, no? > > I suppose if your cluster does nothing but IO-light serial/EP jobs, > you might think differently. > Really? I'm surprised by that statement. Perhaps I'm just way behind on the curve though. It is typical here to have local node storage, local lustre/pvfs storage, local NFS storage, and global GPFS storage running over the GigE network. Depending on I/O loads users can make use of the storage at the right layer. Yes, users fill the 1Gbps pipe to the storage per node. But as we now implement all new clusters with IB I'm hoping to increase that bandwidth even more. If you and everyone else is doing this already, that's a good sign! Lol! As we move closer to making this happen, perhaps there will be plenty of answers then for any QOS setup questions I may have. > >> Following along in the Mellanox User's >> Guide, I see a section on how to implement the QOS for both MPI and my >> lustre storage. I am curious though as to what might happen to the >> performance of the MPI traffic when high I/O loads are placed on the >> storage. >> > > to me, the real question is whether your IB fabric is reasonably close > to full-bisection (and/or whether your storage nodes are sensibly placed, > topologically.) > > >> In our current implementation, we are using blades which are 50% >> blocking (2:1 oversubscribed) when moving from a 16 blade chassis to >> other nodes. Would trying to do storage on top dictate moving to a >> totally non-blocking fabric? >> > > how much inter-chassis MPI do you do? how much IO do you do? > IB has a small MTU, so I don't really see why mixed traffic would > be a big problem. of course, IB also doesn't do all that wonderfully > with hotspots. 
but isn't this mostly an empirical question you can > answer by direct measurement? > How would I measure by direct measurement? I don't have the switching infrastructure to compare a 2:1 versus a 1:1 unless you're talking about inside a chassis. But since my storage would connect into the switching infrastructure how and what would I compare? Jobs are not scheduled to run on a single chassis, or at least they try to but are not placed on hold for more than 10 minutes waiting. So there are lots of wide jobs running between chassis. Some don't even fit on a chassis. As for the question of how much data, I don't have answer. I know that a 10Gbps pipe hits 4Gbps for sustained periods to our central storage from the cluster. I also know that I can totally overwhelm a 10G connected OSS which is currently I/O bound. My question really was twofold: 1) is anyone doing this successfully and 2) does anyone have an idea of how loudly my users will scream when their MPI jobs suddenly degrade. You've answered #1 and seem to believe that for #2, no one will notice. Thanks! Bill > regards, mark hahn. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From greg at keller.net Thu May 26 15:29:07 2011 From: greg at keller.net (Greg Keller) Date: Thu, 26 May 2011 14:29:07 -0500 Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: References: Message-ID: <4DDEAA03.2040206@keller.net> Date: Thu, 26 May 2011 12:18:18 -0400 (EDT) > From: Mark Hahn > Subject: Re: [Beowulf] Infiniband: MPI and I/O? > To: Bill Wichser > Cc: Beowulf Mailing List > Message-ID: > > Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed > >> Wondering if anyone out there is doing both I/O to storage as well as >> MPI over the same IB fabric. > I would say that is the norm. we certainly connect local storage > (Lustre) to nodes via the same fabric as MPI. gigabit is completely > inadequate for modern nodes, so the only alternatives would be 10G > or a secondary IB fabric, both quite expensive propositions, no? > > I suppose if your cluster does nothing but IO-light serial/EP jobs, > you might think differently. > Agreed. Just finished telling another vendor, "It's not high speed storage unless it has an IB/RDMA interface". They love that. Except for some really edge cases, I can't imagine running IO over GbE for anything more than trivial IO loads. I am Curious if anyone is doing IO over IB to SRP targets or some similar "Block Device" approach. The Integration into the filesystem by Lustre/GPFS and others may be the best way to go, but we are not 100% convinced yet. Any stories to share? Cheers! Greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
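To make the QoS discussion above concrete: with OpenSM as the subnet manager, separating MPI and Lustre traffic comes down to giving each class its own service level (SL), mapping SLs to virtual lanes, and weighting the VL arbitration. A minimal sketch follows; the option names are taken from a stock opensm.conf, but the SL/VL choices and weights are purely illustrative assumptions and should be checked against the Mellanox User's Guide for your OFED release.

    # /etc/opensm/opensm.conf (fragment) -- illustrative values only
    qos TRUE
    qos_max_vls 2
    qos_high_limit 255
    # VL arbitration: give VL0 (MPI) roughly 3x the weight of VL1 (Lustre)
    qos_vlarb_high 0:192,1:64
    qos_vlarb_low 0:64,1:32
    # SL0 -> VL0, SL1 -> VL1, everything else back to VL0
    qos_sl2vl 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0

    # each traffic class then has to be told which SL to use; for example
    # Open MPI's openib BTL exposes this as an MCA parameter (check your
    # MPI and Lustre/o2iblnd versions for the equivalent knob):
    mpirun --mca btl_openib_ib_service_level 0 ...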
From landman at scalableinformatics.com Thu May 26 15:35:35 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 26 May 2011 15:35:35 -0400 Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: <4DDEAA03.2040206@keller.net> References: <4DDEAA03.2040206@keller.net> Message-ID: <4DDEAB87.40203@scalableinformatics.com> On 05/26/2011 03:29 PM, Greg Keller wrote: > Agreed. Just finished telling another vendor, "It's not high speed > storage unless it has an IB/RDMA interface". They love that. Except Heh ... love it! > for some really edge cases, I can't imagine running IO over GbE for > anything more than trivial IO loads. Lots of our customers do, when they have a large legacy GbE network, and upgrading is expensive. We can have a very large fan in to our units, but IB (even SDR!) is really nice to move data over for storage. > I am Curious if anyone is doing IO over IB to SRP targets or some > similar "Block Device" approach. The Integration into the filesystem by Both block and file targets. SRPT on our units, and fronted by OSSes for Lustre and similar like things. Can do iSCSI as well (over IB using iSER, or over 10GbE ... works really nicely in either case). > Lustre/GPFS and others may be the best way to go, but we are not 100% > convinced yet. Any stories to share? If you do this with Lustre, make sure your OSSes are in HA pairs using pacemaker/ucarp, and use DRBD between backend units, or MD on the OSS to mirror the storage. Unfortunately IB doesn't virtualize well (last I checked), so these have to be physical OSSes. I presume something similar on GPFS. GlusterFS, PVFS2/OrangeFS, etc. go fine without the block devices, and Gluster does mirroring at the file level. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Thu May 26 16:13:07 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 26 May 2011 16:13:07 -0400 (EDT) Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: <4DDEA7F3.8050703@princeton.edu> References: <4DDE5312.9070901@princeton.edu> <4DDEA7F3.8050703@princeton.edu> Message-ID: >>> Wondering if anyone out there is doing both I/O to storage as well as >>> MPI over the same IB fabric. >>> >> >> I would say that is the norm. we certainly connect local storage (Lustre) >> to nodes via the same fabric as MPI. gigabit is completely >> inadequate for modern nodes, so the only alternatives would be 10G >> or a secondary IB fabric, both quite expensive propositions, no? >> >> I suppose if your cluster does nothing but IO-light serial/EP jobs, >> you might think differently. >> > Really? I'm surprised by that statement. Perhaps I'm just way behind on the > curve though. It is typical here to have local node storage, local > lustre/pvfs storage, local NFS storage, and global GPFS storage running over > the GigE network. sure, we use Gb as well, but only as a crutch, since it's so slow. or does each node have, say, a 4x bonded Gb for this traffic? 
or are we disagreeing on whether Gb is "slow"? 80-ish MB/s seems pretty slow to me, considering that's less than any single disk on the market... >> how much inter-chassis MPI do you do? how much IO do you do? >> IB has a small MTU, so I don't really see why mixed traffic would be a big >> problem. of course, IB also doesn't do all that wonderfully >> with hotspots. but isn't this mostly an empirical question you can >> answer by direct measurement? >> > How would I measure by direct measurement? I meant collecting the byte counters from nics and/or switches while real workloads are running. that tells you the actual data rates, and should show how close you are to creating hotspots. > My question really was twofold: 1) is anyone doing this successfully and 2) > does anyone have an idea of how loudly my users will scream when their MPI > jobs suddenly degrade. You've answered #1 and seem to believe that for #2, > no one will notice. we've always done it, though our main experience is with clusters that have full-bisection fabrics. our two more recent clusters have half-bisection fabrics, but I suspect that most users are not looking closely enough at performance to notice and/or complain. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Thu May 26 17:23:30 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 26 May 2011 17:23:30 -0400 (EDT) Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: <4DDEAA03.2040206@keller.net> References: <4DDEAA03.2040206@keller.net> Message-ID: > Agreed. Just finished telling another vendor, "It's not high speed > storage unless it has an IB/RDMA interface". They love that. Except what does RDMA have to do with anything? why would straight 10G ethernet not qualify? I suspect you're really saying that you want an efficient interface, as well as enough bandwidth, but that doesn't necessitate RDMA. > for some really edge cases, I can't imagine running IO over GbE for > anything more than trivial IO loads. well, it's a balance issue. if someone was using lots of Atom boards lashed into a cluster, 1Gb apiece might be pretty reasonable. but for fat nodes (let's say 48 cores), even 1 QDR IB pipe doesn't seem all that generous. as an interesting case in point, SeaMicro was in the news again with a 512 atom system: either 64 Gb links or 16 10G links. the former (.128 Gb/core) seems low even for atoms, but .3 Gb/core might be reasonable. > I am Curious if anyone is doing IO over IB to SRP targets or some > similar "Block Device" approach. The Integration into the filesystem by > Lustre/GPFS and others may be the best way to go, but we are not 100% > convinced yet. Any stories to share? you mean you _like_ block storage? how do you make a shared FS namespace out of it, manage locking, etc? regards, mark hahn. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
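On "direct measurement": the per-port traffic counters are easy to get at without any switch-side tooling. A rough sketch, assuming a Mellanox HCA exposing the usual sysfs counters (the device name mlx4_0 is just a placeholder); note that port_xmit_data/port_rcv_data count 4-byte words, and on older hardware they are 32-bit and wrap quickly at QDR rates, in which case perfquery -x and the extended 64-bit counters are the better source.

    #!/bin/bash
    # crude IB bandwidth sampler: prints average MB/s over each interval
    DEV=${1:-mlx4_0}; PORT=${2:-1}; INT=${3:-10}
    C=/sys/class/infiniband/$DEV/ports/$PORT/counters
    while true; do
        tx0=$(cat $C/port_xmit_data); rx0=$(cat $C/port_rcv_data)
        sleep "$INT"
        tx1=$(cat $C/port_xmit_data); rx1=$(cat $C/port_rcv_data)
        # counters tick in 4-byte units; a negative delta means a wrap
        echo "$(date +%T) tx $(( (tx1-tx0)*4/INT/1048576 )) MB/s rx $(( (rx1-rx0)*4/INT/1048576 )) MB/s"
    done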
From greg at keller.net Thu May 26 18:30:52 2011 From: greg at keller.net (Greg Keller) Date: Thu, 26 May 2011 17:30:52 -0500 Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: References: <4DDEAA03.2040206@keller.net> Message-ID: <4DDED49C.6070509@keller.net> On 5/26/2011 4:23 PM, Mark Hahn wrote: >> Agreed. Just finished telling another vendor, "It's not high speed >> storage unless it has an IB/RDMA interface". They love that. Except > > what does RDMA have to do with anything? why would straight 10G ethernet > not qualify? I suspect you're really saying that you want an efficient > interface, as well as enough bandwidth, but that doesn't necessitate > RDMA. > RDMA over IB is definitely a nice feature. Not required, but IP over IB has enough limits that we prefer to avoid it. >> for some really edge cases, I can't imagine running IO over GbE for >> anything more than trivial IO loads. > > well, it's a balance issue. if someone was using lots of Atom boards > lashed into a cluster, 1Gb apiece might be pretty reasonable. but for > fat nodes (let's say 48 cores), even 1 QDR IB pipe doesn't seem all > that generous. > > as an interesting case in point, SeaMicro was in the news again with a > 512 > atom system: either 64 Gb links or 16 10G links. the former (.128 > Gb/core) > seems low even for atoms, but .3 Gb/core might be reasonable. > agreed >> I am Curious if anyone is doing IO over IB to SRP targets or some >> similar "Block Device" approach. The Integration into the filesystem by >> Lustre/GPFS and others may be the best way to go, but we are not 100% >> convinced yet. Any stories to share? > > you mean you _like_ block storage? how do you make a shared FS namespace > out of it, manage locking, etc? Well, it's a use case issue for us. You don't make a shared FS on the block devices (well, maybe you could just not in a scalable way)... but we envision leasing block devices to customers with known capacity/performance capability. Then the customer can make the call if they want to use it for a CIFS/NFS backend, possibly even lashed together via MD, through a single server. They can also lease multiple block devices and create a lustre type system. The flexibility is if they disappear and come back they may not get the same compute/storage nodes, but they can attach any server to their dedicated block storage devices. There are also some multi-tenancy security options that can be more definitively handled if they have absolute control over a block device. So in this case, they would semi-permanently lease the block devices, and then fire up front end storage nodes and compute nodes on an "as needed / as available" basis anywhere in our compute farm. Effectively we get the benefits of a massive Fibre Channel type SAN over the IB infrastructure we have to every node. If we can get the performance and cost of the block storage right, it will be compelling for some of our customers. We are still prototyping how it would work and characterizing performance options... but it's interesting to us. Cheers! Greg > > regards, mark hahn. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
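For anyone curious what the block-device approach Greg describes looks like from the client side, a sketch is below. It assumes the ib_srp initiator and srptools are installed and that the leased targets show up as two LUNs, /dev/sdb and /dev/sdc; all device names, GUID strings and mount points here are placeholders, and running srp_daemon as a service is the more robust way to keep the logins alive.

    # one-shot SRP login: ibsrpdm prints target descriptors in the form
    # the kernel's add_target file expects
    ibsrpdm -c -d /dev/infiniband/umad0 | while read tgt; do
        echo "$tgt" > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target
    done
    # the LUNs appear as plain SCSI disks; mirror them on the client and
    # put a local filesystem (or Lustre OSTs, or an NFS/CIFS export) on top
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.xfs /dev/md0
    mount /dev/md0 /export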
From hahn at mcmaster.ca Fri May 27 23:59:42 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 27 May 2011 23:59:42 -0400 (EDT) Subject: [Beowulf] 512 atoms in a box Message-ID: I was thinking about the seamicro box - 512 atoms, 64 disks and either 64 Gb ports or 16 10G ports. it would be interesting to look at what the most appropriate "balance" is for mips/flops of cpu power compared to interconnect bandwidth. maybe the seamicro box is more intended to be a giant memcached server - that is, the question is memory bandwidth/capacity versus IC bandwidth. in any case, you have to ponder where the amazing value-add is - compactness? I'm not sure it competes all that well compared to 48 core-per-U conventional servers (whether mips/flops or memory-based). here's an idea, more commodity-oriented (hence beowulf): suppose you design a tiny widget that gets all its power via POE. maybe Atom or ARM-based - you've got 15-20W, which is quite a bit these days. for packaging, you need space for a cpu, nic and sodimm. maybe some leds. plug them into a commodity 1U 48-port Gb switch, then stack 10 of them and you've got a penny-pincher's approximation of a Seamicro SM100000! not going to win top500, but... _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From eugen at leitl.org Sat May 28 04:26:25 2011 From: eugen at leitl.org (Eugen Leitl) Date: Sat, 28 May 2011 10:26:25 +0200 Subject: [Beowulf] 512 atoms in a box In-Reply-To: References: Message-ID: <20110528082625.GB19622@leitl.org> On Fri, May 27, 2011 at 11:59:42PM -0400, Mark Hahn wrote: > here's an idea, more commodity-oriented (hence beowulf): suppose you > design a tiny widget that gets all its power via POE. maybe Atom or > ARM-based - you've got 15-20W, which is quite a bit these days. > for packaging, you need space for a cpu, nic and sodimm. maybe some leds. > > plug them into a commodity 1U 48-port Gb switch, then stack 10 of them > and you've got a penny-pincher's approximation of a Seamicro SM100000! > > not going to win top500, but... I was planning to do something similar with rooted Apple TV, once it's bumped up to A5 in the next generation. The devices would need spacers, a baffle and a few fans, if packed closely. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From award at uda.ad Sat May 28 06:08:50 2011 From: award at uda.ad (Alan Ward) Date: Sat, 28 May 2011 12:08:50 +0200 Subject: [Beowulf] RS: 512 atoms in a box References: Message-ID: <9CC6BA5ACC8E7C489A97B01C2DA791063D04BD@serpens.ua.ad> Just found this: http://www.raspberrypi.org/ The ARM11 does not pack much punch, there is no networking (though it should not be too difficult to add) and it is not even in production yet. But it does seem fun. 
Plus, $1000 would get you 40 units ... Cheers, -Alan -----Original Message----- From: beowulf-bounces at beowulf.org on behalf of Mark Hahn Sent: Sat 28/05/2011 05:59 To: Beowulf Mailing List Subject: [Beowulf] 512 atoms in a box I was thinking about the seamicro box - 512 atoms, 64 disks and either 64 Gb ports or 16 10G ports. it would be interesting to look at what the most appropriate "balance" is for mips/flops of cpu power compared to interconnect bandwidth. maybe the seamicro box is more intended to be a giant memcached server - that is, the question is memory bandwidth/capacity versus IC bandwidth. in any case, you have to ponder where the amazing value-add is - compactness? I'm not sure it competes all that well compared to 48 core-per-U conventional servers (whether mips/flops or memory-based). here's an idea, more commodity-oriented (hence beowulf): suppose you design a tiny widget that gets all its power via POE. maybe Atom or ARM-based - you've got 15-20W, which is quite a bit these days. for packaging, you need space for a cpu, nic and sodimm. maybe some leds. plug them into a commodity 1U 48-port Gb switch, then stack 10 of them and you've got a penny-pincher's approximation of a Seamicro SM100000! not going to win top500, but... _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.
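The back-of-the-envelope per-core bandwidth figures in this thread are easy to reproduce; they come out to roughly the .13 and .3 Gb/core quoted (bc is only used to keep the fractions readable):

    echo "scale=3; 64*1/512" | bc    # 64 x 1 GbE uplinks / 512 cores  = .125 Gb/core
    echo "scale=3; 16*10/512" | bc   # 16 x 10 GbE uplinks / 512 cores = .312 Gb/core
    echo "scale=3; 48*1/48" | bc     # PoE-widget sketch: one Gb port per widget = 1 Gb/widget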
From cbergstrom at pathscale.com Wed May 4 06:39:45 2011 From: cbergstrom at pathscale.com (=?ISO-8859-1?Q?Christopher_Bergstr=F6m?=) Date: Wed, 4 May 2011 17:39:45 +0700 Subject: [Beowulf] Chinese Chip Wins Energy-Efficiency Crown In-Reply-To: <20110504103346.GH23560@leitl.org> References: <20110504103346.GH23560@leitl.org> Message-ID: On Wed, May 4, 2011 at 5:33 PM, Eugen Leitl wrote: > > http://spectrum.ieee.org/semiconductors/processors/chinese-chip-wins-energyefficiency-crown?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+IeeeSpectrum+%28IEEE+Spectrum%29&utm_content=Bloglines > > Chinese Chip Wins Energy-Efficiency Crown > > Though slower than competitors, the energy-saving Godson-3B is destined > for the next Chinese supercomputer > > By Joseph Calamia / May 2011 > > The Dawning 6000 supercomputer, which Chinese researchers expect to unveil in > the third quarter of 2011, will have something quite different under its > hood. Unlike its forerunners, which employed American-born chips, this > machine will harness the country's homegrown high-end processor, the > Godson-3B. With a peak frequency of 1.05 gigahertz, the Godson is slower than > its competitors' wares, at least one of which operates at more than 5 GHz, > but the chip still turns heads with its record-breaking energy efficiency. It > can execute 128 billion floating-point operations per second using just 40 > watts - double or more the performance per watt of competitors. *cough* Wow.. they've brought SiCortex back to life... _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.
From prentice at ias.edu Wed May 4 09:50:50 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 04 May 2011 09:50:50 -0400 Subject: [Beowulf] Chinese Chip Wins Energy-Efficiency Crown In-Reply-To: References: <20110504103346.GH23560@leitl.org> Message-ID: <4DC159BA.8090005@ias.edu> On 05/04/2011 06:39 AM, Christopher Bergström wrote: > On Wed, May 4, 2011 at 5:33 PM, Eugen Leitl wrote: >> >> http://spectrum.ieee.org/semiconductors/processors/chinese-chip-wins-energyefficiency-crown?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+IeeeSpectrum+%28IEEE+Spectrum%29&utm_content=Bloglines >> >> Chinese Chip Wins Energy-Efficiency Crown >> >> Though slower than competitors, the energy-saving Godson-3B is destined >> for the next Chinese supercomputer >> >> By Joseph Calamia / May 2011 >> >> The Dawning 6000 supercomputer, which Chinese researchers expect to unveil in >> the third quarter of 2011, will have something quite different under its >> hood. Unlike its forerunners, which employed American-born chips, this >> machine will harness the country's homegrown high-end processor, the >> Godson-3B. With a peak frequency of 1.05 gigahertz, the Godson is slower than >> its competitors' wares, at least one of which operates at more than 5 GHz, >> but the chip still turns heads with its record-breaking energy efficiency. It >> can execute 128 billion floating-point operations per second using just 40 >> watts - double or more the performance per watt of competitors. > > *cough* > > Wow.. they've brought SiCortex back to life... Oh, snap! -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Tue May 10 01:37:33 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Mon, 9 May 2011 22:37:33 -0700 Subject: [Beowulf] How InfiniBand gained confusing bandwidth numbers Message-ID: <20110510053733.GB12826@bx9.net> http://dilbert.com/strips/comic/2011-05-10/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mdidomenico4 at gmail.com Wed May 11 20:57:57 2011 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed, 11 May 2011 20:57:57 -0400 Subject: [Beowulf] EPCC and DIR cluster Message-ID: Is there anyone on the list associated with EPCC or knows someone at EPCC? If so, I recently saw an article in Scientific Computing magazine, where there was a blurb about a smallish cluster built at EPCC utilizing Atom chips/GPUs and HDDs, whereby the design was more Amdahl-balanced for "data intensive research".
I can't seem to locate anything on the web about it, but I'm interested in the spec's/design for the machine and how it performs thanks _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From landman at scalableinformatics.com Fri May 20 00:35:25 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 20 May 2011 00:35:25 -0400 Subject: [Beowulf] Curious about ECC vs non-ECC in practice Message-ID: <4DD5EF8D.8070909@scalableinformatics.com> Hi folks Does anyone run a large-ish cluster without ECC ram? Or with ECC turned off at the motherboard level? I am curious if there are numbers of these, and what issues people encounter. I have some of my own data from smaller collections of systems, I am wondering about this for larger systems. Thanks! Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Fri May 20 01:45:01 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 19 May 2011 22:45:01 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <4DD5EF8D.8070909@scalableinformatics.com> References: <4DD5EF8D.8070909@scalableinformatics.com> Message-ID: <20110520054501.GE16676@bx9.net> On Fri, May 20, 2011 at 12:35:25AM -0400, Joe Landman wrote: > Does anyone run a large-ish cluster without ECC ram? Or with ECC > turned off at the motherboard level? I am curious if there are numbers > of these, and what issues people encounter. I have some of my own data > from smaller collections of systems, I am wondering about this for > larger systems. I don't think anyone's done the experiment with a 'larger system' since "Big Mac" had to replace all of their servers with ones that had ECC. Still, any cluster that can manipulate the BIOS appropriately could easily do the experiment. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From gmpc at sanger.ac.uk Fri May 20 04:58:59 2011 From: gmpc at sanger.ac.uk (Guy Coates) Date: Fri, 20 May 2011 09:58:59 +0100 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <20110520054501.GE16676@bx9.net> References: <4DD5EF8D.8070909@scalableinformatics.com> <20110520054501.GE16676@bx9.net> Message-ID: <4DD62D53.6030806@sanger.ac.uk> On 20/05/11 06:45, Greg Lindahl wrote: > On Fri, May 20, 2011 at 12:35:25AM -0400, Joe Landman wrote: > >> Does anyone run a large-ish cluster without ECC ram? Or with ECC >> turned off at the motherboard level? 
I am curious if there are numbers >> of these, and what issues people encounter. I have some of my own data >> from smaller collections of systems, I am wondering about this for >> larger systems. We did, circa 2003. Never again. When we were lucky, the uncorrected errors happened in memory in use by the kernel or application code, and we got hard machine crashes or code seg-faulting. Those were easy to spot. When we were unlucky, the errors happened in page cache, resulting in data being randomly transmuted. Most of the code we were running at the time did minimal input sanity checking. It was quite instructive to see just how much genomic analysis code would quite happily compute on DNA sequences that contained things other than ATGC. The duff runs would eventually get picked up by the various sanity-checks that happened at the end of our analysis pipelines, but it involved quite a bit of developer & sysadmin effort to track down and re-run all of the possibly affected jobs. Cheers, Guy -- Dr. Guy Coates, Informatics Systems Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From a.travis at abdn.ac.uk Fri May 20 11:35:45 2011 From: a.travis at abdn.ac.uk (Tony Travis) Date: Fri, 20 May 2011 16:35:45 +0100 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <4DD5EF8D.8070909@scalableinformatics.com> References: <4DD5EF8D.8070909@scalableinformatics.com> Message-ID: <4DD68A51.70605@abdn.ac.uk> On 20/05/11 05:35, Joe Landman wrote: > Hi folks > > Does anyone run a large-ish cluster without ECC ram? Or with ECC > turned off at the motherboard level? I am curious if there are numbers > of these, and what issues people encounter. I have some of my own data > from smaller collections of systems, I am wondering about this for > larger systems. Hi, Joe. I ran a small cluster of ~100 32-bit nodes witn non-ECC memory and it was a nightmare, as Guy described in his email, until I pre-emptively tested the memory in user-space, using Chlarles Cazabon's "memtester": http://pyropus.ca/software/memtester Prior to this, *all* the RAM had passed Memtest86+. I had a strict policy that if a system crashed, for any reason, it was re-tested with Memtest86+, then 100 passes of "memtester" before being allowed to re-join the Beowulf cluster. This made the Beowulf much more stable running openMosix. However, I've scrapped all our non-ECC nodes now because the real worry is not knowing if an error has occurred... Apparently this is still a big issue for computers in space, using non-ECC RAM for solid-state storage on grounds of cost for imaging. They, apparently, use RAM background SoftECC 'scrubbers' like this: http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf Bye, Tony. -- Dr. 
A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Fri May 20 11:52:43 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 20 May 2011 08:52:43 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <4DD68A51.70605@abdn.ac.uk> Message-ID: On 5/20/11 8:35 AM, "Tony Travis" wrote: >On 20/05/11 05:35, Joe Landman wrote: >> Hi folks >> >> Does anyone run a large-ish cluster without ECC ram? Or with ECC >> turned off at the motherboard level? I am curious if there are numbers >> of these, and what issues people encounter. I have some of my own data >> from smaller collections of systems, I am wondering about this for >> larger systems. > >Hi, Joe. > >Apparently this is still a big issue for computers in space, using >non-ECC RAM for solid-state storage on grounds of cost for imaging. >They, apparently, use RAM background SoftECC 'scrubbers' like this: > >http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng >.pdf > > Yes, it's a big tradeoff in the space world. Not only does ECC require extra memory, but the EDAC logic consumes power and, typically, slows down the bus speed (I.e. You need an extra bus cycle to handle the EDAC logic propagation delay). There's also a practical detail that the upset rate might be low enough that it is ok to just tolerate the upsets, because they'll get handled at some higher level of the process. For instance, if you have a RAM buffer in a communication link handling the FEC coded bits, then there's not much difference between a bit flip in RAM and a bit error on the comm link, so you might as well just let the comm FEC code take care of the bit errors. We tend to use a lot of checksum strategies. Rather than an EDAC strategy which corrects errors, it's good enough to just know that an error occurred, and retry. This is particularly effective on Flash memory, which has transient read errors: read it again and it works ok. Another example is doing an FFT. There are some strategies which allow you to do a second fast computation that essentially provides a "check" on the results of the FFT (e.g. The mean of the input data should match the "DC term" in the FFT) We might also keep triple copies of key variables. You read all three values and compare them before starting the computation. Software Triple Redundancy, as it were. A lot of times, the probability of an error occurring "during" the computation is sufficiently low, compared to the probability of an error occurring during the very long waiting time between operating on the data. There's also the whole question of whether EDAC main memory buys you much, when all the (ever larger) cache isn't protected. Again, it comes down to a probability analysis. My own personal theory on this is that you are much more likely to have a miscalculation due to a software bug than due to an upset. 
Further, it's impossible to get all the bugs out in finite time/money, so you might as well design your whole system to be fault tolerant, not in a "oh my gosh, we had an error, let's do extensive fault recovery", but a "we assume the computations are always a bit wonky, so we factor that into our design". That is, design so that retries and self checks are just part of the overhead. Kind of like how a decent experiment or engineering design takes into account measurement uncertainty stack-up. As hardware gets smaller and faster and lower power, the "cost" to provide extra computational resources to implement a strategy like this gets smaller, relative to the ever increasing human labor cost to try and make it perfect. (and, of course, this *is* how humans actually do stuff.. You don't precompute all of your control inputs to the car.. You basically set a general goal, and continuously adjust to drive towards that goal.) Jim Lux > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From deadline at eadline.org Fri May 20 12:35:26 2011 From: deadline at eadline.org (Douglas Eadline) Date: Fri, 20 May 2011 12:35:26 -0400 (EDT) Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <4DD5EF8D.8070909@scalableinformatics.com> References: <4DD5EF8D.8070909@scalableinformatics.com> Message-ID: <50574.192.168.93.213.1305909326.squirrel@mail.eadline.org> Joe While this is somewhat anecdotal, it may be helpful. Not a large-ish cluster, but as you may guess, I wondered about this for Limulus (http://limulus.basement-supercomputing.com) I wrote a script (will post it if anyone interested) that runs memtester until you stop it or it finds a error. I ran it on several Core2 Duo systems with Kingston DDR2-800 PC2-6400 memory. As I recall, I ran it on 2-3 systems, only one showed an error. I stopped the others after about three weeks. Here is an example of the script output when it fails (it logs the memtest output). There was an error, inspect memtest-1178 Start Date was: Mon Apr 20 16:04:35 EDT 2009 Failure Date was: Fri May 8 17:55:43 EDT 2009 Test ran 1178 times failing after 1561868 Seconds (26031 Minutes or 433 Hours or 18 Days) My experience in running small clusters without ECC has been very good. IMO it is also a question of the quality of the memory vendor. I never had an issue when running tests and benchmarks, which I do quite a bit on new hardware e.g. goo.gl/YoBaz -- Doug > Hi folks > > Does anyone run a large-ish cluster without ECC ram? Or with ECC > turned off at the motherboard level? I am curious if there are numbers > of these, and what issues people encounter. I have some of my own data > from smaller collections of systems, I am wondering about this for > larger systems. > > Thanks! > > Joe > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics, Inc. 
> email: landman at scalableinformatics.com > web : http://scalableinformatics.com > http://scalableinformatics.com/sicluster > phone: +1 734 786 8423 x121 > fax : +1 866 888 3112 > cell : +1 734 612 4615 > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From james.p.lux at jpl.nasa.gov Fri May 20 13:21:12 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 20 May 2011 10:21:12 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: <50574.192.168.93.213.1305909326.squirrel@mail.eadline.org> Message-ID: On 5/20/11 9:35 AM, "Douglas Eadline" wrote: >Joe > >While this is somewhat anecdotal, it may be helpful. > >Not a large-ish cluster, but as you may guess, I wondered >about this for Limulus >(http://limulus.basement-supercomputing.com) > >I wrote a script (will post it if anyone interested) >that runs memtester until you stop it or it finds >a error. I ran it on several Core2 Duo systems >with Kingston DDR2-800 PC2-6400 memory. > >My experience in running small clusters >without ECC has been very good. IMO it is also >a question of the quality of the memory vendor. >I never had an issue when running tests and >benchmarks, which I do quite a bit on new >hardware e.g. I'm going to guess that it's highly idiosyncratic. The timing margins on all the signals between CPU, memory, and perhipherals are tight, they're temperature dependent and process dependent, so you could have the exact same design with very similar RAM and one will get errors and the other won't. Folks who design PCI bus interfaces for a living earn their pay, especially if they have to make it work with lots of different mfrs: just because all the parts meet their databook specs doesn't mean that the system will play nice together. Consider that for memory, you have 64 odd data lines and 20 or so address lines and some strobes that ALL have to switch together. A data sensitive pattern where a bunch of lines move at the same time, and induce a bit of a voltage into an adjacent trace, which is a bit slower or faster than the rest, and you've got the makings of a challenging hunt for the problem. PC board trace lengths all have to be carefully matched, loads have to be carefully matched, etc. 66 Mhz -> 15 ns, but modern DDR rams do batches of words separated by a few ns. 1 cm is about 10-15 cm of tracelength, but it's the loading, terminations, and other stuff that causes a problem. Hang a 1 pf capacitor off that 100 ohm line, and there's a tenth of a ns time constant right there. You could also have EMI/EMC issues that cause problems. That same ragged edge timing margin might be fine with 8 tower cases sitting on a shelf, but not so good with the exact same mobo and memory stacked into 1-2U cases in a 19" rack. Power cords and ethernet cables also carry EMI around. 
In a large cluster these things will all be aggravated: you've got more machines running, so you increase the error probability right there. You've got more electrical noise on the power carried between machines. You've typically got denser packaging. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Fri May 20 14:26:31 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 20 May 2011 14:26:31 -0400 (EDT) Subject: [Beowulf] Execution time measurements - clarification Message-ID: From: Mikhail Kuzminsky Subject: [Beowulf] Execution time measurements - clarification Dear Mark, could you pls forward my message to beowulf at beowulf.org (because my messages as before can't be delivered to maillist) ? It's clarification of my previous question here. Mikhail --------------------- I have strange execution time measurements for CPU-bound jobs (to be exact, Gaussian-09 DFT frequency calculations). Results are strange for *SEQENTIAL* calculations ! Executions were performed on dual socket Opteron 2350 (Quad core) server worked under Open SuSE Linux 10.3. When I run 2 identical examples of the same batch job simultaneously, execution time of *each* job is LOWER than for single job run ! I thought that it may be wrong "own for G09" times, so I checked their via time: (time g09 *pair1.com) >&testt1 & etc But this confirms strange results: For pair of simultaneously running sequential jobs 88801.141u 52.475s 24:40:57.58 99.9% 0+0k 0+0io 1221pf+0w 88901.996u 13.472s 24:41:53.97 100.0% 0+0k 0+0io 0pf+0w For run of 1 example of the same sequential job 100365.236u 27.297s 27:53:13.53 99.9% 0+0k 0+0io 1pf+0w Is there any ideas why this situation might be ? Mikhail Kuzminsky _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Fri May 20 20:29:10 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 20 May 2011 17:29:10 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: References: <4DD68A51.70605@abdn.ac.uk> Message-ID: <20110521002910.GB14350@bx9.net> On Fri, May 20, 2011 at 08:52:43AM -0700, Lux, Jim (337C) wrote: > As hardware gets smaller and faster and lower power, the "cost" to provide > extra computational resources to implement a strategy like this gets > smaller, relative to the ever increasing human labor cost to try and make > it perfect. The cost is teaching users to add checks to their codes, and to any off-the-shelf codes they start using. In hyrodynamics (cfd), often you have quantities which are explicitly conserved by the equations, and others which are conserved by physics but not by the particular numerical method you're using. The latter were quite handy for finding bugs. I managed to discover several numerical accuracy bugs in pre-release versions of the PathScale compilers that way. "Yes, it's a bug if the 12th decimal place changes." 
-- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Fri May 20 20:32:27 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 20 May 2011 17:32:27 -0700 Subject: [Beowulf] Execution time measurements - clarification Message-ID: <20110521003227.GD14350@bx9.net> On Fri, May 20, 2011 at 02:26:31PM -0400, Mark Hahn forwarded a message: > When I run 2 identical examples of the same batch job simultaneously, execution time of *each* job is > LOWER than for single job run ! I'd try locking these sequential jobs to a single core, you can get quite weird effects when you don't. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mathog at caltech.edu Mon May 23 12:40:13 2011 From: mathog at caltech.edu (David Mathog) Date: Mon, 23 May 2011 09:40:13 -0700 Subject: [Beowulf] Execution time measurements - clarification Message-ID: > On Fri, May 20, 2011 at 02:26:31PM -0400, Mark Hahn forwarded a message: > > > When I run 2 identical examples of the same batch job simultaneously, execution time of *each* job is > > LOWER than for single job run ! Disk caching could cause that. Normally if the data read in isn't too big you see an effect where: run 1: 30 sec <-- 50% disk IO/ 50% CPU run 2: 15 sec <-- ~100% CPU where the first run loaded the data into disk cache, and the second run read it there, saving a lot of real disk IO. Under some very peculiar conditions, on a multicore system, if run 1 and 2 are "simultaneous" they could seesaw back and forth for the "lead", so they end up taking turns doing the actual disk IO, with the total run time for each ending up between the times for the two runs above. Note that they wouldn't have to be started at exactly the same time for this to happen, because the job that starts second is going to be reading from cache, so it will tend to catch up to the job that started first. Once they are close then noise in the scheduling algorithms could cause the second to pass the first. (If it didn't pass, then this couldn't happen, because the second would always be waiting for the first to pull data in from disk.) Of course, you also need to be sure that run 1 isn't interfering with run 2. They might, for instance, save/retrieve intermediate values to the same filename, so that they really cannot be run safely at the same time. That is, they run faster together, but they run incorrectly. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
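Two cheap controls for the timing puzzle in this thread, combining Greg's core-pinning suggestion with the page-cache question above. Paths, core numbers and input names are placeholders in the style used elsewhere in the thread; drop_caches needs root and exists on 2.6.16 and later kernels, so it is available on the 2.6.22 kernel mentioned below.

    # pin each sequential run to one core and its local memory node, so the
    # scheduler cannot migrate it mid-run
    /usr/bin/time numactl --physcpubind=0 --membind=0 g09 job1.com > job1.log 2>&1 &
    /usr/bin/time numactl --physcpubind=4 --membind=1 g09 job2.com > job2.log 2>&1 &

    # and to rule the page cache in or out, flush it before a timed single run
    sync && echo 3 > /proc/sys/vm/drop_caches   # root only
    /usr/bin/time g09 job1.com > job1.single.log 2>&1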
From mathog at caltech.edu Mon May 23 15:32:33 2011 From: mathog at caltech.edu (David Mathog) Date: Mon, 23 May 2011 12:32:33 -0700 Subject: [Beowulf] Execution time measurements Message-ID: Mikhail Kuzminsky sent this to me and asked that it be posted: BEGIN FORWARD On Mon, 23 May 2011 09:40:13 -0700, "David Mathog" wrote: > > On Fri, May 20, 2011 at 02:26:31PM -0400, Mark Hahn forwarded a message: > > > When I run 2 identical examples of the same batch job > simultaneously, execution time of *each* job is > > > LOWER than for single job run ! > > Disk caching could cause that. Normally if the data read in isn't too > big you see an effect where: > > run 1: 30 sec <-- 50% disk IO/ 50% CPU > run 2: 15 sec <-- ~100% CPU I believe that jobs are CPU-bound: top says that they use 100% of CPU, and no swap activity. iostat /dev/sda3 (where IO is performed) says typically something like:
Linux 2.6.22.5-31-default (c6ws1)   05/25/2011
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.12    0.00    0.03    0.01    0.00   98.84
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda3              0.01         0.01         8.47      20720   16845881
> > Of course, you also need to be sure that run 1 isn't interfering with > run 2. They might, for instance, save/retrieve intermediate values > to the same filename, so that they really cannot be run safely at the > same time. That is, they run faster together, but they run incorrectly. File names used for IO are unique. I thought also about cpus frequency variations, but I think that null output of lsmod|grep freq is enough for fixed CPU frequency. END FORWARD OK, so not disk caching. Regarding the frequencies, better to use cat /proc/cpuinfo | grep MHz while the processes are running. Did you verify that the results for each of the two simultaneous runs are both correct? Ideally, tweak some parameter so they are slightly different from each other. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mathog at caltech.edu Tue May 24 11:41:32 2011 From: mathog at caltech.edu (David Mathog) Date: Tue, 24 May 2011 08:41:32 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice Message-ID: Joe Landman wrote: > I am wondering about this for larger systems. Your post makes me wonder about ECC in much smaller systems, like dedicated single computers controlling machinery or medical devices. Some really nasty things could result from "move cutting head in X (int32 value) mm" after the most significant bit in the int32 value has flipped. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.
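The cutting-head example is easy to put numbers on. Treating the move distance as a signed 32-bit integer, one flipped high bit turns a 10 mm move into a full-scale negative one (bash arithmetic is 64-bit, so the third line just re-interprets the result as int32):

    x=10                                  # commanded move, mm
    flip=$(( x ^ (1 << 31) ))             # flip the most significant bit
    (( flip > 2147483647 )) && flip=$(( flip - 4294967296 ))
    echo "$x mm becomes $flip mm"         # -> 10 mm becomes -2147483638 mm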
From landman at scalableinformatics.com Tue May 24 11:44:15 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 24 May 2011 11:44:15 -0400 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: References: Message-ID: <4DDBD24F.7080608@scalableinformatics.com> On 05/24/2011 11:41 AM, David Mathog wrote: > Joe Landman wrote: > >> I am wondering about this for larger systems. > > Your post makes me wonder about ECC in much smaller systems, like > dedicated single computers controlling machinery or medical devices. > Some really nasty things could result from "move cutting head in X > (int32 value) mm" after the most significant bit in the int32 value has > flipped. Some bits are more important than others ... Basically I was looking for anecdotal evidence that this is a "bad thing" (TM). I have it now, and it helped me make the case I needed to make. Thanks to everyone for this, it was really helpful! -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Tue May 24 13:06:15 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 24 May 2011 10:06:15 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: References: Message-ID: This *is* a big problem. I suggest reading some of what Nancy Leveson has written. http://sunnyday.mit.edu/ "Professor Leveson started a new area of research, software safety, which is concerned with the problems of building software for real-time systems where failures can result in loss of life or property." Two popular papers you might find interesting and fun to read: "High-Pressure Steam Engines and Computer Software" (Postscript) or (PDF). This paper started as a keynote address at the International Conference on Software Engineering in Melbourne, Australia) and later was published in IEEE Software, October 1994. "The Therac-25 Accidents" (Postscript ) or (PDF). This paper is an updated version of the original IEEE Computer (July 1993) article. It also appears in the appendix of my book. There is a generic problem with complex systems, as well. "Normal Accidents" by Charles Perrow is a good work (if a bit frightening in some ways... not in a senseless fear-mongering way, but because he lays out the fundamental reasons why these things are inevitable) Marais, Dulac, and Leveson argue that the world isn't as bad as Perrow says, though. http://esd.mit.edu/symposium/pdfs/papers/marais-b.pdf Jim Lux +1(818)354-2075 > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of David Mathog > Sent: Tuesday, May 24, 2011 8:42 AM > To: beowulf at beowulf.org > Subject: Re: [Beowulf] Curious about ECC vs non-ECC in practice > > Joe Landman wrote: > > > I am wondering about this for larger systems. > > Your post makes me wonder about ECC in much smaller systems, like > dedicated single computers controlling machinery or medical devices. 
> Some really nasty things could result from "move cutting head in X
> (int32 value) mm" after the most significant bit in the int32 value has
> flipped.
>
> Regards,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.

From mathog at caltech.edu Tue May 24 14:27:23 2011
From: mathog at caltech.edu (David Mathog)
Date: Tue, 24 May 2011 11:27:23 -0700
Subject: [Beowulf] Execution time measurements
Message-ID:

Another message from Mikhail Kuzminsky, who for some reason or other cannot currently post directly to the list:

BEGIN FORWARD

First of all, I should mention that the effect is observed only on the Opteron 2350/OpenSuSE 10.3. Execution of the same job with the same binaries on a Nehalem E5520/OpenSuSE 11.1 gives the same time for 1 and 2 simultaneously running jobs.

On Mon, 23 May 2011 12:32:33 -0700, "David Mathog" wrote:
> On Mon, 23 May 2011 09:40:13 -0700, "David Mathog" wrote:
> > > On Fri, May 20, 2011 at 02:26:31PM -0400, Mark Hahn forwarded a message:
> > > > When I run 2 identical examples of the same batch job
> > > > simultaneously, execution time of *each* job is
> > > > LOWER than for single job run !
>
> I thought also about cpus frequency variations, but I think that null output
> of
>   lsmod|grep freq
> is enough for fixed CPU frequency.
>
> END FORWARD
>
> Regarding the frequencies, better to use
>   cat /proc/cpuinfo | grep MHz

I looked at cpuinfo, but only manually, a few times (i.e. I didn't run any script periodically checking the CPU frequencies). All the core frequencies were fixed.

> Did you verify that the results for each of the two simultaneous runs
> are both correct?

Yes, the results are the same. I also looked at the number of iterations, etc., but I'll check the outputs again.

> Ideally, tweak some parameter so they are slightly
> different from each other.

But I don't understand - if I change some of the input parameters slightly, what would that show?

> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech

Fri, 20 May 2011 20:11:15 -0400 message from Serguei Patchkovskii:
> Suse 10.3 is quite old; it uses a kernel which is less than perfect at scheduling jobs and allocating resources for
> NUMA systems. Try running your test job using:
>
> numactl --cpunodebind=0 --membind=0 g98

numactl with everything bound to node 1 gives a "big" execution time (1 day 4 hours; 2 simultaneous jobs run faster), and forcing different nodes for CPU and memory makes the execution time even higher (+1 h). Therefore the effect observed doesn't look like a result of NUMA allocation :-(

Mikhail

END FORWARD

My point about the two different parameter sets on the jobs was to determine if the two were truly independent, or if they might not be interacting with each other through checkpoint files or shared memory, or the like.
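One way to check for that kind of hidden coupling without changing the input decks is to watch which files each copy actually opens for writing. A rough sketch, assuming strace is installed; ./myjob again stands in for the real binary:

  # record the files each copy opens (with -f to follow any child processes)
  strace -f -e trace=open,openat,creat -o job_a.trace ./myjob input1 &
  strace -f -e trace=open,openat,creat -o job_b.trace ./myjob input2 &
  wait

  # paths opened with write access by both jobs are the suspects
  grep 'O_WRONLY\|O_RDWR' job_a.trace | grep -o '"[^"]*"' | sort -u > a.files
  grep 'O_WRONLY\|O_RDWR' job_b.trace | grep -o '"[^"]*"' | sort -u > b.files
  comm -12 a.files b.files

For the shared-memory possibility, ipcs -m lists any System V segments in use. Note that strace adds overhead of its own, so timings from a traced run should not be compared with the earlier numbers; this is only a test of independence.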
Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mathog at caltech.edu Tue May 24 14:37:30 2011 From: mathog at caltech.edu (David Mathog) Date: Tue, 24 May 2011 11:37:30 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice Message-ID: Jim Lux posted: > "The Therac-25 Accidents" (Postscript ) or (PDF). This paper is an updated version of the original IEEE Computer (July 1993) article. It also appears in the appendix of my book. Well that was really horrible. Are car computers ECC? When all they did was engine management a memory glitch wouldn't have been too terrible, but now that some of them control automatic parking and other "higher" functions, and with around 100M units in circulation just in the USA, if they aren't ECC then memory glitches in running vehicles would have to be happening every day. David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Tue May 24 15:07:10 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 24 May 2011 12:07:10 -0700 Subject: [Beowulf] Curious about ECC vs non-ECC in practice In-Reply-To: References: Message-ID: > -----Original Message----- > From: David Mathog [mailto:mathog at caltech.edu] > Sent: Tuesday, May 24, 2011 11:38 AM > To: Lux, Jim (337C); beowulf at beowulf.org > Subject: RE: [Beowulf] Curious about ECC vs non-ECC in practice > > Jim Lux posted: > > > "The Therac-25 Accidents" (Postscript ) or (PDF). This paper is an > updated version of the original IEEE Computer (July 1993) article. It > also appears in the appendix of my book. > > Well that was really horrible. > > Are car computers ECC? When all they did was engine management a memory > glitch wouldn't have been too terrible, but now that some of them > control automatic parking and other "higher" functions, and with around > 100M units in circulation just in the USA, if they aren't ECC then > memory glitches in running vehicles would have to be happening every day. Car controllers tend to have mask ROM for their software which is pretty upset immune. The "PROM" (which today might be flash or EEPROM) holds all the coefficients for things like the fuel injection/timing, but doesn't hold the code for controlling, say, the ABS. I would imagine (but do not know) that they do things similar to what we do in spacecraft controllers: store critical data multiple times, lots of self checks on algorithm operation, etc. The report on the Toyota Throttle controller said this: "The Main and Sub-CPUs use two types of memory: non-volatile ROM for software code and volatile Static Ram (SRAM). The SRAM is protected by a single error detect and correct and a double error detect hardware function performed by error detection and correction (EDAC) logic." 
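Back on the cluster-hardware side of the same ECC question: Linux exposes the memory controller's error-correction activity through the EDAC subsystem, so one cheap way to see whether a node is actually catching correctable errors is to read those counters. A sketch only; it assumes an EDAC driver for the chipset is loaded, which will not be the case on boards without ECC support:

  # correctable (fixed) and uncorrectable error counts per memory controller
  grep . /sys/devices/system/edac/mc/mc*/ce_count
  grep . /sys/devices/system/edac/mc/mc*/ue_count

  # per-row breakdown, useful for spotting a single failing DIMM
  grep . /sys/devices/system/edac/mc/mc*/csrow*/ce_count

A steadily climbing ce_count is exactly the case where ECC is silently saving the day; on a non-ECC board the same events would simply be invisible bit flips.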
There's a whole reliability of software community out there with everything from certifiable processes to coding standards (MISRA) designed to make it easy to inspect and verify that the code is doing what you think, and that it handles off-nominal cases. I haven't read the whole report, but there was an analysis of the software in the Toyota controllers recently. http://www.nhtsa.gov/staticfiles/nvs/pdf/NASA-UA_report.pdf "The NESC team examined the software code (more than 280,000 lines) for paths that might initiate such a UA, but none were identified" (UA-Unintended Acceleration) The team examined the VOQ vehicles for signs of electrical faults, and subjected these vehicles to electro-magnetic interference (EMI) radiated and conducted test levels significantly above certification levels. The EMI testing did not produce any UAs, but in some cases caused the engine to slow and/or stall. (That's probably closest to what you'd see from a memory upset) Section 6.5, page 64 of the report, is "System Fail-Safe Architecture" It's pretty sophisticated, with multiple parallel schemes to prevent runaway or failure. I'm impressed at the level of thought they gave to not just shutting down the engine, but in leaving an adequate limp-home capability when one or more parts in the chain fails (e.g. if the throttle plate actually sticks, it can control the engine by turning on and off the fuel injectors). There's also an independent mechanism that detects if the pedal isn't pressed (or the redundant pedal position sensors have failed), in which case the engine cannot exceed 2500RPM, if it does, the fuel turns off, and then turns back on when the speed drops below 1100RPM And, since we Beowulfers are for the most part software weenies.. The ECM for the 2005 Camry uses a NEC V850 E1 processor. The software is in ANSI C, and compiled with Greenhills compiler. There are 256kSLOC of non-comments (along with 241kSLOC of comments) in .c files and another 40kSLOC (noncomment) in various .h files. They ran it through Coverity and CodeSonar (both of which we use at JPL), as well as SPIN (using SWARM to run it on a cluster.. now how about that) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From bill at Princeton.EDU Thu May 26 09:18:10 2011 From: bill at Princeton.EDU (Bill Wichser) Date: Thu, 26 May 2011 09:18:10 -0400 Subject: [Beowulf] Infiniband: MPI and I/O? Message-ID: <4DDE5312.9070901@princeton.edu> Wondering if anyone out there is doing both I/O to storage as well as MPI over the same IB fabric. Following along in the Mellanox User's Guide, I see a section on how to implement the QOS for both MPI and my lustre storage. I am curious though as to what might happen to the performance of the MPI traffic when high I/O loads are placed on the storage. In our current implementation, we are using blades which are 50% blocking (2:1 oversubscribed) when moving from a 16 blade chassis to other nodes. Would trying to do storage on top dictate moving to a totally non-blocking fabric? 
Thanks, Bill _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Thu May 26 12:18:18 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 26 May 2011 12:18:18 -0400 (EDT) Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: <4DDE5312.9070901@princeton.edu> References: <4DDE5312.9070901@princeton.edu> Message-ID: > Wondering if anyone out there is doing both I/O to storage as well as > MPI over the same IB fabric. I would say that is the norm. we certainly connect local storage (Lustre) to nodes via the same fabric as MPI. gigabit is completely inadequate for modern nodes, so the only alternatives would be 10G or a secondary IB fabric, both quite expensive propositions, no? I suppose if your cluster does nothing but IO-light serial/EP jobs, you might think differently. > Following along in the Mellanox User's > Guide, I see a section on how to implement the QOS for both MPI and my > lustre storage. I am curious though as to what might happen to the > performance of the MPI traffic when high I/O loads are placed on the > storage. to me, the real question is whether your IB fabric is reasonably close to full-bisection (and/or whether your storage nodes are sensibly placed, topologically.) > In our current implementation, we are using blades which are 50% > blocking (2:1 oversubscribed) when moving from a 16 blade chassis to > other nodes. Would trying to do storage on top dictate moving to a > totally non-blocking fabric? how much inter-chassis MPI do you do? how much IO do you do? IB has a small MTU, so I don't really see why mixed traffic would be a big problem. of course, IB also doesn't do all that wonderfully with hotspots. but isn't this mostly an empirical question you can answer by direct measurement? regards, mark hahn. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From Shainer at Mellanox.com Thu May 26 12:50:17 2011 From: Shainer at Mellanox.com (Gilad Shainer) Date: Thu, 26 May 2011 09:50:17 -0700 Subject: [Beowulf] Infiniband: MPI and I/O? References: <4DDE5312.9070901@princeton.edu> Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F03ACAD7B@mtiexch01.mti.com> > Wondering if anyone out there is doing both I/O to storage as well as > MPI over the same IB fabric. Following along in the Mellanox User's > Guide, I see a section on how to implement the QOS for both MPI and my > lustre storage. I am curious though as to what might happen to the > performance of the MPI traffic when high I/O loads are placed on the > storage. I am doing it in my lab -have build my own Lustre solution and am running it on the same network as the MPI jobs. At the end it all depends on how much bandwidth do you need for the MPI and the storage, and if you can cover both, you can do it. Today the QoS solution for IB is out there, and you can set max BW and min latency as parameters for the different traffics. 
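As a concrete illustration of steering MPI and storage onto different classes of service: with Open MPI's openib BTL the service level used for MPI traffic can be chosen at run time, and the subnet manager's QoS policy can then map that SL to its own virtual lane and bandwidth share. This is a sketch only - the SL value 1 is arbitrary, and the parameter name should be verified against the local Open MPI build with ompi_info:

  # confirm this Open MPI installation exposes the parameter
  ompi_info --param btl openib | grep service_level

  # run the MPI job on a non-default InfiniBand service level
  mpirun -np 64 --mca btl openib,self,sm \
         --mca btl_openib_ib_service_level 1 ./mpi_app

The Lustre/o2ib side and the opensm qos-policy file that ties SLs to virtual-lane arbitration are what the Mellanox guide Bill mentions covers; the point here is only that the MPI side amounts to a one-line change.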
> In our current implementation, we are using blades which are 50% > blocking (2:1 oversubscribed) when moving from a 16 blade chassis to > other nodes. Would trying to do storage on top dictate moving to a > totally non-blocking fabric? IB congestion control is being released now (finally), so this can help you here. Gilad _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From bill at Princeton.EDU Thu May 26 15:20:19 2011 From: bill at Princeton.EDU (Bill Wichser) Date: Thu, 26 May 2011 15:20:19 -0400 Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: References: <4DDE5312.9070901@princeton.edu> Message-ID: <4DDEA7F3.8050703@princeton.edu> Mark Hahn wrote: >> Wondering if anyone out there is doing both I/O to storage as well as >> MPI over the same IB fabric. >> > > I would say that is the norm. we certainly connect local storage > (Lustre) to nodes via the same fabric as MPI. gigabit is completely > inadequate for modern nodes, so the only alternatives would be 10G > or a secondary IB fabric, both quite expensive propositions, no? > > I suppose if your cluster does nothing but IO-light serial/EP jobs, > you might think differently. > Really? I'm surprised by that statement. Perhaps I'm just way behind on the curve though. It is typical here to have local node storage, local lustre/pvfs storage, local NFS storage, and global GPFS storage running over the GigE network. Depending on I/O loads users can make use of the storage at the right layer. Yes, users fill the 1Gbps pipe to the storage per node. But as we now implement all new clusters with IB I'm hoping to increase that bandwidth even more. If you and everyone else is doing this already, that's a good sign! Lol! As we move closer to making this happen, perhaps there will be plenty of answers then for any QOS setup questions I may have. > >> Following along in the Mellanox User's >> Guide, I see a section on how to implement the QOS for both MPI and my >> lustre storage. I am curious though as to what might happen to the >> performance of the MPI traffic when high I/O loads are placed on the >> storage. >> > > to me, the real question is whether your IB fabric is reasonably close > to full-bisection (and/or whether your storage nodes are sensibly placed, > topologically.) > > >> In our current implementation, we are using blades which are 50% >> blocking (2:1 oversubscribed) when moving from a 16 blade chassis to >> other nodes. Would trying to do storage on top dictate moving to a >> totally non-blocking fabric? >> > > how much inter-chassis MPI do you do? how much IO do you do? > IB has a small MTU, so I don't really see why mixed traffic would > be a big problem. of course, IB also doesn't do all that wonderfully > with hotspots. but isn't this mostly an empirical question you can > answer by direct measurement? > How would I measure by direct measurement? I don't have the switching infrastructure to compare a 2:1 versus a 1:1 unless you're talking about inside a chassis. But since my storage would connect into the switching infrastructure how and what would I compare? Jobs are not scheduled to run on a single chassis, or at least they try to but are not placed on hold for more than 10 minutes waiting. 
So there are lots of wide jobs running between chassis. Some don't even fit on a chassis. As for the question of how much data, I don't have answer. I know that a 10Gbps pipe hits 4Gbps for sustained periods to our central storage from the cluster. I also know that I can totally overwhelm a 10G connected OSS which is currently I/O bound. My question really was twofold: 1) is anyone doing this successfully and 2) does anyone have an idea of how loudly my users will scream when their MPI jobs suddenly degrade. You've answered #1 and seem to believe that for #2, no one will notice. Thanks! Bill > regards, mark hahn. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From greg at keller.net Thu May 26 15:29:07 2011 From: greg at keller.net (Greg Keller) Date: Thu, 26 May 2011 14:29:07 -0500 Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: References: Message-ID: <4DDEAA03.2040206@keller.net> Date: Thu, 26 May 2011 12:18:18 -0400 (EDT) > From: Mark Hahn > Subject: Re: [Beowulf] Infiniband: MPI and I/O? > To: Bill Wichser > Cc: Beowulf Mailing List > Message-ID: > > Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed > >> Wondering if anyone out there is doing both I/O to storage as well as >> MPI over the same IB fabric. > I would say that is the norm. we certainly connect local storage > (Lustre) to nodes via the same fabric as MPI. gigabit is completely > inadequate for modern nodes, so the only alternatives would be 10G > or a secondary IB fabric, both quite expensive propositions, no? > > I suppose if your cluster does nothing but IO-light serial/EP jobs, > you might think differently. > Agreed. Just finished telling another vendor, "It's not high speed storage unless it has an IB/RDMA interface". They love that. Except for some really edge cases, I can't imagine running IO over GbE for anything more than trivial IO loads. I am Curious if anyone is doing IO over IB to SRP targets or some similar "Block Device" approach. The Integration into the filesystem by Lustre/GPFS and others may be the best way to go, but we are not 100% convinced yet. Any stories to share? Cheers! Greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From landman at scalableinformatics.com Thu May 26 15:35:35 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 26 May 2011 15:35:35 -0400 Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: <4DDEAA03.2040206@keller.net> References: <4DDEAA03.2040206@keller.net> Message-ID: <4DDEAB87.40203@scalableinformatics.com> On 05/26/2011 03:29 PM, Greg Keller wrote: > Agreed. Just finished telling another vendor, "It's not high speed > storage unless it has an IB/RDMA interface". They love that. Except Heh ... love it! 
> for some really edge cases, I can't imagine running IO over GbE for > anything more than trivial IO loads. Lots of our customers do, when they have a large legacy GbE network, and upgrading is expensive. We can have a very large fan in to our units, but IB (even SDR!) is really nice to move data over for storage. > I am Curious if anyone is doing IO over IB to SRP targets or some > similar "Block Device" approach. The Integration into the filesystem by Both block and file targets. SRPT on our units, and fronted by OSSes for Lustre and similar like things. Can do iSCSI as well (over IB using iSER, or over 10GbE ... works really nicely in either case). > Lustre/GPFS and others may be the best way to go, but we are not 100% > convinced yet. Any stories to share? If you do this with Lustre, make sure your OSSes are in HA pairs using pacemaker/ucarp, and use DRBD between backend units, or MD on the OSS to mirror the storage. Unfortunately IB doesn't virtualize well (last I checked), so these have to be physical OSSes. I presume something similar on GPFS. GlusterFS, PVFS2/OrangeFS, etc. go fine without the block devices, and Gluster does mirroring at the file level. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Thu May 26 16:13:07 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 26 May 2011 16:13:07 -0400 (EDT) Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: <4DDEA7F3.8050703@princeton.edu> References: <4DDE5312.9070901@princeton.edu> <4DDEA7F3.8050703@princeton.edu> Message-ID: >>> Wondering if anyone out there is doing both I/O to storage as well as >>> MPI over the same IB fabric. >>> >> >> I would say that is the norm. we certainly connect local storage (Lustre) >> to nodes via the same fabric as MPI. gigabit is completely >> inadequate for modern nodes, so the only alternatives would be 10G >> or a secondary IB fabric, both quite expensive propositions, no? >> >> I suppose if your cluster does nothing but IO-light serial/EP jobs, >> you might think differently. >> > Really? I'm surprised by that statement. Perhaps I'm just way behind on the > curve though. It is typical here to have local node storage, local > lustre/pvfs storage, local NFS storage, and global GPFS storage running over > the GigE network. sure, we use Gb as well, but only as a crutch, since it's so slow. or does each node have, say, a 4x bonded Gb for this traffic? or are we disagreeing on whether Gb is "slow"? 80-ish MB/s seems pretty slow to me, considering that's less than any single disk on the market... >> how much inter-chassis MPI do you do? how much IO do you do? >> IB has a small MTU, so I don't really see why mixed traffic would be a big >> problem. of course, IB also doesn't do all that wonderfully >> with hotspots. but isn't this mostly an empirical question you can >> answer by direct measurement? >> > How would I measure by direct measurement? 
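For Bill's question about direct measurement, the per-port byte counters every IB HCA keeps are the obvious data source, which is what the reply below suggests. A rough sketch of sampling them on one node, assuming a Mellanox HCA that shows up as mlx4_0 and the standard sysfs counter files:

  P=/sys/class/infiniband/mlx4_0/ports/1/counters
  a=$(cat $P/port_xmit_data)
  sleep 10
  b=$(cat $P/port_xmit_data)
  # PortXmitData counts 32-bit words, so multiply by 4 to get bytes
  echo "TX rate: $(( (b - a) * 4 / 10 )) bytes/s"

These basic counters are only 32 bits wide and top out quickly at QDR rates; where the hardware supports it, perfquery -x (from infiniband-diags) reads the 64-bit extended counters instead. Sampling this on the storage servers and on a few compute nodes during a busy period gives the actual MPI-versus-I/O mix before any QoS tuning is attempted.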
I meant collecting the byte counters from nics and/or switches while real workloads are running. that tells you the actual data rates, and should show how close you are to creating hotspots. > My question really was twofold: 1) is anyone doing this successfully and 2) > does anyone have an idea of how loudly my users will scream when their MPI > jobs suddenly degrade. You've answered #1 and seem to believe that for #2, > no one will notice. we've always done it, though our main experience is with clusters that have full-bisection fabrics. our two more recent clusters have half-bisection fabrics, but I suspect that most users are not looking closely enough at performance to notice and/or complain. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Thu May 26 17:23:30 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 26 May 2011 17:23:30 -0400 (EDT) Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: <4DDEAA03.2040206@keller.net> References: <4DDEAA03.2040206@keller.net> Message-ID: > Agreed. Just finished telling another vendor, "It's not high speed > storage unless it has an IB/RDMA interface". They love that. Except what does RDMA have to do with anything? why would straight 10G ethernet not qualify? I suspect you're really saying that you want an efficient interface, as well as enough bandwidth, but that doesn't necessitate RDMA. > for some really edge cases, I can't imagine running IO over GbE for > anything more than trivial IO loads. well, it's a balance issue. if someone was using lots of Atom boards lashed into a cluster, 1Gb apiece might be pretty reasonable. but for fat nodes (let's say 48 cores), even 1 QDR IB pipe doesn't seem all that generous. as an interesting case in point, SeaMicro was in the news again with a 512 atom system: either 64 Gb links or 16 10G links. the former (.128 Gb/core) seems low even for atoms, but .3 Gb/core might be reasonable. > I am Curious if anyone is doing IO over IB to SRP targets or some > similar "Block Device" approach. The Integration into the filesystem by > Lustre/GPFS and others may be the best way to go, but we are not 100% > convinced yet. Any stories to share? you mean you _like_ block storage? how do you make a shared FS namespace out of it, manage locking, etc? regards, mark hahn. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From greg at keller.net Thu May 26 18:30:52 2011 From: greg at keller.net (Greg Keller) Date: Thu, 26 May 2011 17:30:52 -0500 Subject: [Beowulf] Infiniband: MPI and I/O? In-Reply-To: References: <4DDEAA03.2040206@keller.net> Message-ID: <4DDED49C.6070509@keller.net> On 5/26/2011 4:23 PM, Mark Hahn wrote: >> Agreed. Just finished telling another vendor, "It's not high speed >> storage unless it has an IB/RDMA interface". They love that. Except > > what does RDMA have to do with anything? why would straight 10G ethernet > not qualify? 
I suspect you're really saying that you want an efficient > interface, as well as enough bandwidth, but that doesn't necessitate > RDMA. > RDMA over IB is definitely a nice feature. Not required, but IP over IB has enough limits that we prefer to avoid it. >> for some really edge cases, I can't imagine running IO over GbE for >> anything more than trivial IO loads. > > well, it's a balance issue. if someone was using lots of Atom boards > lashed into a cluster, 1Gb apiece might be pretty reasonable. but for > fat nodes (let's say 48 cores), even 1 QDR IB pipe doesn't seem all > that generous. > > as an interesting case in point, SeaMicro was in the news again with a > 512 > atom system: either 64 Gb links or 16 10G links. the former (.128 > Gb/core) > seems low even for atoms, but .3 Gb/core might be reasonable. > agreed >> I am Curious if anyone is doing IO over IB to SRP targets or some >> similar "Block Device" approach. The Integration into the filesystem by >> Lustre/GPFS and others may be the best way to go, but we are not 100% >> convinced yet. Any stories to share? > > you mean you _like_ block storage? how do you make a shared FS namespace > out of it, manage locking, etc? Well, it's a use case issue for us. You don't make a shared FS on the block devices (well, maybe you could just not in a scalable way)... but we envision leasing block devices to customers with known capacity/performance capability. Then the customer can make the call if they want to use it for a CIFS/NFS backend, possibly even lashed together via MD, through a single server. They can also lease multiple block devices and create a lustre type system. The flexibility is if they disappear and come back they may not get the same compute/storage nodes, but they can attach any server to their dedicated block storage devices. There are also some multi-tenancy security options that can be more definitively handled if they have absolute control over a block device. So in this case, they would semi-permanently lease the block devices, and then fire up front end storage nodes and compute nodes on an "as needed / as available" basis anywhere in our compute farm. Effectively we get the benefits of a massive Fibre Channel type SAN over the IB infrastructure we have to every node. If we can get the performance and cost of the block storage right, it will be compelling for some of our customers. We are still prototyping how it would work and characterizing performance options... but it's interesting to us. Cheers! Greg > > regards, mark hahn. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Fri May 27 23:59:42 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 27 May 2011 23:59:42 -0400 (EDT) Subject: [Beowulf] 512 atoms in a box Message-ID: I was thinking about the seamicro box - 512 atoms, 64 disks and either 64 Gb ports or 16 10G ports. it would be interesting to look at what the most appropriate "balance" is for mips/flops of cpu power compared to interconnect bandwidth. maybe the seamicro box is more intended to be a giant memcached server - that is, the question is memory bandwidth/capacity versus IC bandwidth. in any case, you have to ponder where the amazing value-add is - compactness? 
I'm not sure it competes all that well compared to 48 core-per-U conventional servers (whether mips/flops or memory-based). here's an idea, more commodity-oriented (hence beowulf): suppose you design a tiny widget that gets all its power via POE. maybe Atom or ARM-based - you've got 15-20W, which is quite a bit these days. for packaging, you need space for a cpu, nic and sodimm. maybe some leds. plug them into a commodity 1U 48-port Gb switch, then stack 10 of them and you've got a penny-pincher's approximation of a Seamicro SM100000! not going to win top500, but... _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From eugen at leitl.org Sat May 28 04:26:25 2011 From: eugen at leitl.org (Eugen Leitl) Date: Sat, 28 May 2011 10:26:25 +0200 Subject: [Beowulf] 512 atoms in a box In-Reply-To: References: Message-ID: <20110528082625.GB19622@leitl.org> On Fri, May 27, 2011 at 11:59:42PM -0400, Mark Hahn wrote: > here's an idea, more commodity-oriented (hence beowulf): suppose you > design a tiny widget that gets all its power via POE. maybe Atom or > ARM-based - you've got 15-20W, which is quite a bit these days. > for packaging, you need space for a cpu, nic and sodimm. maybe some leds. > > plug them into a commodity 1U 48-port Gb switch, then stack 10 of them > and you've got a penny-pincher's approximation of a Seamicro SM100000! > > not going to win top500, but... I was planning to do something similar with rooted Apple TV, once it's bumped up to A5 in the next generation. The devices would need spacers, a baffle and a few fans, if packed closely. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From award at uda.ad Sat May 28 06:08:50 2011 From: award at uda.ad (Alan Ward) Date: Sat, 28 May 2011 12:08:50 +0200 Subject: [Beowulf] RS: 512 atoms in a box References: Message-ID: <9CC6BA5ACC8E7C489A97B01C2DA791063D04BD@serpens.ua.ad> Just found this: http://www.raspberrypi.org/ The ARM11 does not pack much punch, there is no networking (though it should not be too difficult to add) and it is not even in production yet. But it does seem fun. Plus, $1000 would get you 40 units ... Cheers, -Alan -----Missatge original----- De: beowulf-bounces at beowulf.org en nom de Mark Hahn Enviat el: ds. 28/05/2011 05:59 Per a: Beowulf Mailing List Tema: [Beowulf] 512 atoms in a box I was thinking about the seamicro box - 512 atoms, 64 disks and either 64 Gb ports or 16 10G ports. it would be interesting to look at what the most appropriate "balance" is for mips/flops of cpu power compared to interconnect bandwidth. maybe the seamicro box is more intended to be a giant memcached server - that is, the question is memory bandwidth/capacity versus IC bandwidth. 
in any case, you have to ponder where the amazing value-add is - compactness? I'm not sure it competes all that well compared to 48 core-per-U conventional servers (whether mips/flops or memory-based). here's an idea, more commodity-oriented (hence beowulf): suppose you design a tiny widget that gets all its power via POE. maybe Atom or ARM-based - you've got 15-20W, which is quite a bit these days. for packaging, you need space for a cpu, nic and sodimm. maybe some leds. plug them into a commodity 1U 48-port Gb switch, then stack 10 of them and you've got a penny-pincher's approximation of a Seamicro SM100000! not going to win top500, but...

_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf