From eagles051387 at gmail.com  Fri Jul  1 04:26:51 2011
From: eagles051387 at gmail.com (Jonathan Aquilina)
Date: Fri, 01 Jul 2011 10:26:51 +0200
Subject: [Beowulf] Suggestions for used workhorse servers
In-Reply-To: <4E0CE127.9060605@ias.edu>
References: <4E0CDD91.2090405@cora.nwra.com> <4E0CE127.9060605@ias.edu>
Message-ID: <4E0D84CB.9010006@gmail.com>

I have a nice IBM 1U m3250 and it runs like a charm, no issues
whatsoever. The specs are rather nice and it's moderately priced. It has
a dual-core E7400 and can take up to 8 GB of RAM. Cost was about 1,300
euros.

On 30/06/2011 22:48, Prentice Bisbal wrote:
> I've always been pleased with the HP ProLiant systems, like the DL385
> models. They seemed pretty reliable to me. I'd trust one of those over a
> Dell PowerEdge, however I'm sure you'll get as many opinions as there
> are subscribers on this list.
>
> --
> Prentice
>
> On 06/30/2011 04:33 PM, Orion Poplawski wrote:
>> One can find some pretty inexpensive older servers on eBay that probably could
>> yield a decent $/flop ratio. I was wondering if people here had suggestions
>> for classic workhorse servers - basic 1U boxes that did/do pretty well but are
>> a couple years old at this point.
>>
>> Thanks!
>>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From eagles051387 at gmail.com  Fri Jul  1 04:28:18 2011
From: eagles051387 at gmail.com (Jonathan Aquilina)
Date: Fri, 01 Jul 2011 10:28:18 +0200
Subject: [Beowulf] Suggestions for used workhorse servers
In-Reply-To:
References:
Message-ID: <4E0D8522.4090403@gmail.com>

Jim, also don't forget that in the case of having 4-8 nodes you could
set up the master with solid state and the other nodes diskless. That
way all you would need to reconfigure would be the master node and not
all the slaves.

On 30/06/2011 23:15, Lux, Jim (337C) wrote:
> One might make a nice small cluster for learning purposes.  With 8 nodes,
> you could do a lot of experimenting.  Even 4 nodes works, but with 8, if
> your parallelization works, you get a pretty dramatic speedup.
>
> And, when you screw up, and need to reinstall all the software everywhere,
> 4-8 nodes is manageable by hand.
>
> You could also, if you have extra network cards, experiment with things
> like different interconnect architectures.
>
> There is significant value in a stack of boxes which you "own" and don't
> have to account for the use of (or lack), for that sort of "fooling
> around"
>
>
> For production purposes, you're probably better off buying newer
> computers:  Power consumption, hassles, etc.
>
>
> On 6/30/11 1:33 PM, "Orion Poplawski" wrote:
>
>> One can find some pretty inexpensive older servers on eBay that probably
>> could
>> yield a decent $/flop ratio.  I was wondering if people here had
>> suggestions
>> for classic workhorse servers - basic 1U boxes that did/do pretty well but
>> are
>> a couple years old at this point.
>>
>> Thanks!
>> >> -- >> Orion Poplawski >> Technical Manager 303-415-9701 x222 >> NWRA/CoRA Division FAX: 303-415-9702 >> 3380 Mitchell Lane orion at cora.nwra.com >> Boulder, CO 80301 http://www.cora.nwra.com >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From orion at cora.nwra.com Fri Jul 1 12:21:06 2011 From: orion at cora.nwra.com (Orion Poplawski) Date: Fri, 01 Jul 2011 10:21:06 -0600 Subject: [Beowulf] AVX In-Reply-To: <4E0CE34E.10807@scalableinformatics.com> References: <4E0CDD91.2090405@cora.nwra.com> <4E0CE34E.10807@scalableinformatics.com> Message-ID: <4E0DF3F2.3050206@cora.nwra.com> On 06/30/2011 02:57 PM, Joe Landman wrote: > We are building a number of more ... specialty ... type cluster things these > days. Its very possible to put together a pretty good 4 core 16GB ram > modern/fast Xeon unit (with AVX bits ... Sandy Bridge based unit) This sparked my curiosity as well. What are people's experience with AVX? What software uses it? Performance improvements? -- Orion Poplawski Technical Manager 303-415-9701 x222 NWRA/CoRA Division FAX: 303-415-9702 3380 Mitchell Lane orion at cora.nwra.com Boulder, CO 80301 http://www.cora.nwra.com _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From eugen at leitl.org Thu Jul 7 10:13:05 2011 From: eugen at leitl.org (Eugen Leitl) Date: Thu, 7 Jul 2011 16:13:05 +0200 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee Message-ID: <20110707141305.GJ16178@leitl.org> http://www.techeye.net/chips/one-million-arm-chips-challenge-intel-bumblebee One million ARM chips challenge Intel bumblebee Manchester works toward human brain 07 Jul 2011 13:25 | by Matthew Finnegan in London | posted in Chips One million ARM chips challenge Intel bumblebee - A project to replicate the workings of the human brain has received a boost with the delivery of one million ARM processors. While Intel has its sights set on reaching bumble bee brain level in the near future, it seems its rival is involved in one further. Scientists at the University of Manchester will link together the ARM chips as the system architecture of a massive computer, dubbed SpiNNaker, or Spiking Neural Network architecture. Despite the mass of chips it will only be possible to recreate models of up to one percent of the human brain. The chips have arrived and are past functionality testing. 
A similar experiment was once attempted with a load of old Centrino chips found at the back of our stationery cupboard, though so far we haven't even managed to replicate the cranial workings of a particularly slow slug.

The work, headed up by Professor Steve Furber, has the potential to become a revolutionary tool for neuroscientists and psychologists in understanding how our brains work. SpiNNaker will attempt to replicate the workings of the 100 billion neurons and the 1,000 million connections that are used to create high connectivity in cells.

SpiNNaker will model the electric signals that neurons emit, with each impulse modelled as a "packet" of data, similar to the way that information is transferred over the internet. The packet is sent to other neurons, represented by small equations solved in real time by ARM processors.

The chips, designed in Manchester and built in Taiwan, each contain 18 ARM processors. The bespoke 18-core chips are able to provide the computing power of a personal computer in a fraction of the space, using just one watt of power. Now that the chips have arrived it will be possible to get cracking on building the model.

"The project revolves around getting the chips made, which has taken the past five years to get right," Professor Steve Furber told TechEye. "We will now be increasing the scale of the project over the next 18 months before it reaches its final form, with one million processors used. We already have the system working on a smaller scale, and we are able to look at fifty to sixty thousand neurons currently."

As well as offering possibilities as a scientific research tool, Furber hopes that the system will help pave the way for computational advancements too.

"It will help to analyse the intermediate levels of the brain, which are very difficult to focus on otherwise," he says. "Another area where this will help is in building more reliable computing systems. As chip manufacturers continue towards the end of Moore's Law, transistors will become increasingly unreliable. And computer systems are very susceptible to malfunctioning transistors."

Furber says biology works differently. "Biology, on the other hand, reacts to the malfunctioning of neurons very well, with it happening regularly with all brains, so this could help future chips become more reliable."

Of course, we also wanted to know how this all compares with Intel's famous bumblebee claims. Unfortunately, Professor Furber couldn't specifically help us with information about bumblebee brain processing. He was, however, able to reel off some details about the honeybee.

"The honeybee brain has around 850,000 neurons so we will be able to reach that level of processing in the next few months. Of course, we don't have a honeybee brain model to run, but we will have the computing power."

Over to you, Intel.

Read more: http://www.techeye.net/chips/one-million-arm-chips-challenge-intel-bumblebee#ixzz1RQe7GYoL

_______________________________________________
tt mailing list
tt at postbiota.org
http://postbiota.org/mailman/listinfo/tt
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
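To make the "each impulse modelled as a packet of data" idea concrete, here is a minimal event-driven sketch of a spiking network in Python. It is not SpiNNaker code, and the network size, weights and thresholds are made-up placeholders; it only illustrates the basic mechanism of neurons exchanging small spike "packets" rather than sharing memory, with a simple leaky integrate-and-fire update applied when a packet arrives. In SpiNNaker itself the per-neuron equations run on the ARM cores, as the article describes, while the fan-out is done by the chips' packet-routing fabric.

    # Toy spike-packet propagation (illustrative only, not SpiNNaker code).
    import random
    from collections import deque

    N = 1000                 # toy network size (SpiNNaker aims far higher)
    FANOUT = 10              # synapses per neuron (real brains: thousands)
    THRESHOLD = 1.0          # firing threshold, arbitrary units
    LEAK = 0.9               # decay of membrane potential per packet arrival
    WEIGHT = 0.6             # synaptic weight, placeholder value

    random.seed(0)
    targets = [random.sample(range(N), FANOUT) for _ in range(N)]  # static wiring
    potential = [0.0] * N

    # Seed some activity: a couple of hundred neurons fire at t=0.
    packets = deque((src, 0) for src in random.sample(range(N), 200))

    fired = 0
    while packets and fired < 10000:
        src, t = packets.popleft()           # spike packet: "neuron src fired at time t"
        for dst in targets[src]:             # fan the packet out to downstream synapses
            potential[dst] = potential[dst] * LEAK + WEIGHT
            if potential[dst] >= THRESHOLD:  # downstream neuron reaches threshold,
                potential[dst] = 0.0         # resets, and emits a packet of its own
                packets.append((dst, t + 1))
                fired += 1

    print("spikes propagated:", fired)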
From prentice at ias.edu Thu Jul 7 11:36:02 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 07 Jul 2011 11:36:02 -0400 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: <20110707141305.GJ16178@leitl.org> References: <20110707141305.GJ16178@leitl.org> Message-ID: <4E15D262.90209@ias.edu> On 07/07/2011 10:13 AM, Eugen Leitl wrote: > > http://www.techeye.net/chips/one-million-arm-chips-challenge-intel-bumblebee > > One million ARM chips challenge Intel bumblebee > Now say it like Dr. Evil: one MILLION processors. How long is it going to take to wire them all up? And how fast are they going to fail? If there's a MTBF of one million hours, that's still one failure per hour. Should be interesting. -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Thu Jul 7 12:31:24 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 7 Jul 2011 09:31:24 -0700 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: <4E15D262.90209@ias.edu> References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> Message-ID: > On 07/07/2011 10:13 AM, Eugen Leitl wrote: > > > > http://www.techeye.net/chips/one-million-arm-chips-challenge-intel-bumblebee > > > > One million ARM chips challenge Intel bumblebee > > > > Now say it like Dr. Evil: one MILLION processors. > > > How long is it going to take to wire them all up? And how fast are they > going to fail? If there's a MTBF of one million hours, that's still one > failure per hour. But this presents a very interesting design challenge.. when you get to this sort of scale, you have to assume that at any time, some of them are going to be dead or dying. Just like google's massively parallel database engines.. It's all about ultimate scalability. Anybody with a moderate competence (certainly anyone on this list) could devise a scheme to use 1000 perfect processors that never fail to do 1000 quanta of work in unit time. It's substantially more challenging to devise a scheme to do 1000 quanta of work in unit time on, say, 1500 processors with a 20% failure rate. Or even in 1.2*unit time. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From eugen at leitl.org Thu Jul 7 12:56:29 2011 From: eugen at leitl.org (Eugen Leitl) Date: Thu, 7 Jul 2011 18:56:29 +0200 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> Message-ID: <20110707165629.GL16178@leitl.org> On Thu, Jul 07, 2011 at 09:31:24AM -0700, Lux, Jim (337C) wrote: > > On 07/07/2011 10:13 AM, Eugen Leitl wrote: > > > > > > http://www.techeye.net/chips/one-million-arm-chips-challenge-intel-bumblebee > > > > > > One million ARM chips challenge Intel bumblebee > > > > > > > Now say it like Dr. Evil: one MILLION processors. > > > > > > How long is it going to take to wire them all up? 
And how fast are they > > going to fail? If there's a MTBF of one million hours, that's still one > > failure per hour. > > > But this presents a very interesting design challenge.. when you get to this sort of scale, you have to assume that at any time, some of them are going to be dead or dying. Just like google's massively parallel database engines.. > > It's all about ultimate scalability. Anybody with a moderate competence (certainly anyone on this list) could devise a scheme to use 1000 perfect processors that never fail to do 1000 quanta of work in unit time. It's substantially more challenging to devise a scheme to do 1000 quanta of work in unit time on, say, 1500 processors with a 20% failure rate. Or even in 1.2*unit time. They do address som of that in ftp://ftp.cs.man.ac.uk/pub/amulet/papers/SBF_ACSD09.pdf It's also specific to neural emulation. These should tolerate pretty huge error rates without fouling up the qualitative system behaviour they're trying to model. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From Daniel.Pfenniger at unige.ch Thu Jul 7 13:05:22 2011 From: Daniel.Pfenniger at unige.ch (Daniel Pfenniger) Date: Thu, 07 Jul 2011 19:05:22 +0200 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: <4E15D262.90209@ias.edu> References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> Message-ID: <4E15E752.8080907@unige.ch> Prentice Bisbal wrote: > > On 07/07/2011 10:13 AM, Eugen Leitl wrote: >> >> http://www.techeye.net/chips/one-million-arm-chips-challenge-intel-bumblebee >> >> One million ARM chips challenge Intel bumblebee >> > > Now say it like Dr. Evil: one MILLION processors. > > > How long is it going to take to wire them all up? And how fast are they > going to fail? If there's a MTBF of one million hours, that's still one > failure per hour. > > Should be interesting. The real challenge is they want to simulate a human brain (~10^11 neurons, 10^14-10^15 synapses) with so few processors. I guess in any real brain many neurons and even more synapses are permanently out of order... Dan _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
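The failure arithmetic in Prentice's post is worth writing out. Assuming a constant (exponential) failure rate, a pool of N parts fails at roughly N/MTBF per hour; a short sketch with the thread's numbers (the one-million-hour MTBF is the assumed figure from the post, not a measured one):

    # Back-of-the-envelope failure arithmetic for a large pool of parts,
    # assuming a constant failure rate: aggregate rate ~ N / MTBF.
    n_parts = 1_000_000        # one million processors
    mtbf_hours = 1_000_000     # per-part MTBF assumed in the thread

    failures_per_hour = n_parts / mtbf_hours    # ~1 failure per hour
    five_years = 5 * 365 * 24                   # 43,800 hours
    print("failures per hour :", failures_per_hour)
    print("failures in 5 yrs :", failures_per_hour * five_years)

Five years at that rate is roughly 43,800 failed processors, only a few percent of the machine, which is why the design has to treat dead cores as routine rather than exceptional.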
From james.p.lux at jpl.nasa.gov Thu Jul 7 13:05:55 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 7 Jul 2011 10:05:55 -0700 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: <4E15E752.8080907@unige.ch> References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> <4E15E752.8080907@unige.ch> Message-ID: > -----Original Message----- > Subject: Re: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee > > > The real challenge is they want to simulate a human brain > (~10^11 neurons, 10^14-10^15 synapses) with so few processors. > I guess in any real brain many neurons and even more synapses are > permanently out of order... Kind of depends on the time of day after the conference bird of a feather session gets started, eh? _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Thu Jul 7 13:17:30 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 07 Jul 2011 13:17:30 -0400 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> Message-ID: <4E15EA2A.9010200@ias.edu> On 07/07/2011 12:31 PM, Lux, Jim (337C) wrote: >> On 07/07/2011 10:13 AM, Eugen Leitl wrote: >>> >>> http://www.techeye.net/chips/one-million-arm-chips-challenge-intel-bumblebee >>> >>> One million ARM chips challenge Intel bumblebee >>> >> >> Now say it like Dr. Evil: one MILLION processors. >> >> >> How long is it going to take to wire them all up? And how fast are they >> going to fail? If there's a MTBF of one million hours, that's still one >> failure per hour. > > > But this presents a very interesting design challenge.. when you get to this sort of scale, you have to assume that at any time, some of them are going to be dead or dying. Just like google's massively parallel database engines.. > > It's all about ultimate scalability. Anybody with a moderate competence (certainly anyone on this list) could devise a scheme to use 1000 perfect processors that never fail to do 1000 quanta of work in unit time. It's substantially more challenging to devise a scheme to do 1000 quanta of work in unit time on, say, 1500 processors with a 20% failure rate. Or even in 1.2*unit time. > Just to be clear - I wasn't saying this was a bad idea. Scaling up to this size seems inevitable. I was just imagining the team of admins who would have to be working non-stop to replace dead processors! I wonder what the architecture for this system will be like. I imagine it will be built around small multi-socket blades that are hot-swappable to handle this. Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
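Jim's example of doing 1000 quanta of work on 1500 processors with a 20% failure rate can be made concrete with a toy work-queue simulation. This is only a sketch using the thread's made-up numbers; the scheme is the usual one of keeping a queue of outstanding quanta and re-issuing anything a failed worker never finished.

    # Toy model: 1000 quanta of work on 1500 unreliable workers.
    # Work assigned to a worker that dies is put back on the queue and retried.
    import random
    from collections import deque

    random.seed(1)
    QUANTA, WORKERS, FAIL_RATE = 1000, 1500, 0.20

    queue = deque(range(QUANTA))     # outstanding work items
    done = set()
    rounds = 0

    while len(done) < QUANTA:
        rounds += 1
        # hand one quantum to each available worker this round
        batch = [queue.popleft() for _ in range(min(WORKERS, len(queue)))]
        for item in batch:
            if random.random() < FAIL_RATE:
                queue.append(item)   # worker died mid-task: nothing lost, just retried
            else:
                done.add(item)

    print("finished", len(done), "quanta in", rounds, "round(s)")

Getting close to Jim's 1.2x-unit-time target would take a smarter scheme, for example speculatively issuing duplicates of the unfinished quanta to the 500 spare workers instead of waiting for a full retry round.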
From james.p.lux at jpl.nasa.gov Thu Jul 7 14:26:04 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 7 Jul 2011 11:26:04 -0700 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: <4E15EA2A.9010200@ias.edu> References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> <4E15EA2A.9010200@ias.edu> Message-ID: > > It's all about ultimate scalability. Anybody with a moderate competence (certainly anyone on this > list) could devise a scheme to use 1000 perfect processors that never fail to do 1000 quanta of work > in unit time. It's substantially more challenging to devise a scheme to do 1000 quanta of work in > unit time on, say, 1500 processors with a 20% failure rate. Or even in 1.2*unit time. > > > > Just to be clear - I wasn't saying this was a bad idea. Scaling up to > this size seems inevitable. I was just imagining the team of admins who > would have to be working non-stop to replace dead processors! > > I wonder what the architecture for this system will be like. I imagine > it will be built around small multi-socket blades that are hot-swappable > to handle this. I think that you just anticipate the failures and deal with them. It's challenging to write code to do this, but it's certainly a worthy objective. I can easily see a situation where the cost to replace dead units is so high that you just don't bother doing it: it's cheaper to just add more live ones to the "pool". _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From ellis at runnersroll.com Thu Jul 7 15:25:35 2011 From: ellis at runnersroll.com (Ellis H. Wilson III) Date: Thu, 07 Jul 2011 15:25:35 -0400 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> <4E15EA2A.9010200@ias.edu> Message-ID: <4E16082F.5060700@runnersroll.com> On 07/07/11 14:26, Lux, Jim (337C) wrote: >>> It's all about ultimate scalability. Anybody with a moderate competence (certainly anyone on this >> list) could devise a scheme to use 1000 perfect processors that never fail to do 1000 quanta of work >> in unit time. It's substantially more challenging to devise a scheme to do 1000 quanta of work in >> unit time on, say, 1500 processors with a 20% failure rate. Or even in 1.2*unit time. >>> >> >> Just to be clear - I wasn't saying this was a bad idea. Scaling up to >> this size seems inevitable. I was just imagining the team of admins who >> would have to be working non-stop to replace dead processors! >> >> I wonder what the architecture for this system will be like. I imagine >> it will be built around small multi-socket blades that are hot-swappable >> to handle this. > > I think that you just anticipate the failures and deal with them. It's challenging to write code to do this, but it's certainly a worthy objective. I can easily see a situation where the cost to replace dead units is so high that you just don't bother doing it: it's cheaper to just add more live ones to the "pool". Or rather than replace or add to the pool, perhaps just allow the ones that die to just, well, stay dead. 
The issue with things this scale is that unlike the individual or smallish business there are very few good reasons to not upgrade every, say, 3 to 5 years. The costs involved in having spare CPUs sitting around waiting to be swapped in, the maintenance of having administrators replacing stuff and any potential downtime replacements require seem at first glance to outweigh the elegance of "letting nature taking it's course" with the supercomputer. For instance, if Prentice's MTBF of 1 million hours is realistic (I personally have no idea if it is), then that's "only" 43,800 CPUs by the end of year 5. That's less than 5% of the total capacity - i.e. not a big deal if this system can truly tolerate and route around failures as our brains do. Perhaps they could study old and/or drug abusing bees at that stage, hehe. Just my 2 wampum, ellis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Thu Jul 7 15:38:34 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 07 Jul 2011 15:38:34 -0400 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> <4E15EA2A.9010200@ias.edu> Message-ID: <4E160B3A.1030009@ias.edu> On 07/07/2011 02:26 PM, Lux, Jim (337C) wrote: >>> It's all about ultimate scalability. Anybody with a moderate competence (certainly anyone on this >> list) could devise a scheme to use 1000 perfect processors that never fail to do 1000 quanta of work >> in unit time. It's substantially more challenging to devise a scheme to do 1000 quanta of work in >> unit time on, say, 1500 processors with a 20% failure rate. Or even in 1.2*unit time. >>> >> >> Just to be clear - I wasn't saying this was a bad idea. Scaling up to >> this size seems inevitable. I was just imagining the team of admins who >> would have to be working non-stop to replace dead processors! >> >> I wonder what the architecture for this system will be like. I imagine >> it will be built around small multi-socket blades that are hot-swappable >> to handle this. > > > > I think that you just anticipate the failures and deal with them. It's challenging to write code to do this, but it's certainly a worthy objective. I can easily see a situation where the cost to replace dead units is so high that you just don't bother doing it: it's cheaper to just add more live ones to the "pool". > Did you read the paper that someone else posted a link to? I just read the first half of it. A good part of this research is focused on fault-tolerance/resiliency of computer systems. They're not just interested in creating a computer to mimic the brain, they want to learn how to mimic the brain's fault-tolerance in computers. To paraphrase the paper, we lose a neuron a second in our brains for our entire lives, but we never notice any problems from that. This research hopes to learn how to duplicate with that this computer, so you could say hardware failures are desirable and necessary for this research. 
Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From ntmoore at gmail.com Thu Jul 7 18:22:47 2011 From: ntmoore at gmail.com (Nathan Moore) Date: Thu, 7 Jul 2011 17:22:47 -0500 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: <4E160B3A.1030009@ias.edu> References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> <4E15EA2A.9010200@ias.edu> <4E160B3A.1030009@ias.edu> Message-ID: Some of these "decay over time" questions have been worked on in the context of detector design in high energy physics. Everything in the big detectors needs to be radiation hard... On Thu, Jul 7, 2011 at 2:38 PM, Prentice Bisbal wrote: > On 07/07/2011 02:26 PM, Lux, Jim (337C) wrote: > >>> It's all about ultimate scalability. Anybody with a moderate > competence (certainly anyone on this > >> list) could devise a scheme to use 1000 perfect processors that never > fail to do 1000 quanta of work > >> in unit time. It's substantially more challenging to devise a scheme to > do 1000 quanta of work in > >> unit time on, say, 1500 processors with a 20% failure rate. Or even in > 1.2*unit time. > >>> > >> > >> Just to be clear - I wasn't saying this was a bad idea. Scaling up to > >> this size seems inevitable. I was just imagining the team of admins who > >> would have to be working non-stop to replace dead processors! > >> > >> I wonder what the architecture for this system will be like. I imagine > >> it will be built around small multi-socket blades that are hot-swappable > >> to handle this. > > > > > > > > I think that you just anticipate the failures and deal with them. It's > challenging to write code to do this, but it's certainly a worthy objective. > I can easily see a situation where the cost to replace dead units is so high > that you just don't bother doing it: it's cheaper to just add more live ones > to the "pool". > > > > Did you read the paper that someone else posted a link to? I just read > the first half of it. A good part of this research is focused on > fault-tolerance/resiliency of computer systems. They're not just > interested in creating a computer to mimic the brain, they want to learn > how to mimic the brain's fault-tolerance in computers. > > To paraphrase the paper, we lose a neuron a second in our brains for our > entire lives, but we never notice any problems from that. This research > hopes to learn how to duplicate with that this computer, so you could > say hardware failures are desirable and necessary for this research. > > Prentice > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- - - - - - - - - - - - - - - - - - - - - - Nathan Moore Associate Professor, Physics Winona State University - - - - - - - - - - - - - - - - - - - - - -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mathog at caltech.edu Fri Jul 8 17:33:27 2011 From: mathog at caltech.edu (David Mathog) Date: Fri, 08 Jul 2011 14:33:27 -0700 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee Message-ID: "Ellis H. Wilson III" wrote > For instance, if Prentice's MTBF of 1 million hours is realistic (I > personally have no idea if it is), then that's "only" 43,800 CPUs by the > end of year 5. That's less than 5% of the total capacity - i.e. not a > big deal if this system can truly tolerate and route around failures as > our brains do. Perhaps they could study old and/or drug abusing bees at > that stage, hehe. Serendipitously, 5 years is roughly the maximum life span of a queen bee. However, it is a pretty safe bet that if they are emulating a bee brain, it is a worker bee brain, and a worker bee is lucky if it lives a couple of months. Did it say anywhere that the emulation was real time? Very common for emulations to run orders of magnitude slower than real time, so processor loss could still be an issue during runs. David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From deadline at eadline.org Fri Jul 8 23:33:55 2011 From: deadline at eadline.org (Douglas Eadline) Date: Fri, 8 Jul 2011 23:33:55 -0400 (EDT) Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> <4E15EA2A.9010200@ias.edu> Message-ID: <58977.68.83.96.23.1310182435.squirrel@mail.eadline.org> >> > It's all about ultimate scalability. Anybody with a moderate >> competence (certainly anyone on this >> list) could devise a scheme to use 1000 perfect processors that never >> fail to do 1000 quanta of work >> in unit time. It's substantially more challenging to devise a scheme to >> do 1000 quanta of work in >> unit time on, say, 1500 processors with a 20% failure rate. Or even in >> 1.2*unit time. >> > >> >> Just to be clear - I wasn't saying this was a bad idea. Scaling up to >> this size seems inevitable. I was just imagining the team of admins who >> would have to be working non-stop to replace dead processors! >> >> I wonder what the architecture for this system will be like. I imagine >> it will be built around small multi-socket blades that are hot-swappable >> to handle this. > > > > I think that you just anticipate the failures and deal with them. It's > challenging to write code to do this, but it's certainly a worthy > objective. I can easily see a situation where the cost to replace dead > units is so high that you just don't bother doing it: it's cheaper to just > add more live ones to the "pool". I wrote about the programming issue in a series of three articles (conjecture, never really tried it, if only I had the time ...). The first article links (at the end) to the other two. 
http://www.clustermonkey.net//content/view/158/28/ And yes, disposable "nodes" just like a failed cable in a large cluster, route a new one, don't worry about unbundling a huge cable tree. I assume there will be a high level of integration so there may be "nodes" are left for dead which are integrated into a much larger blade. -- Doug > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From deadline at eadline.org Fri Jul 8 23:42:28 2011 From: deadline at eadline.org (Douglas Eadline) Date: Fri, 8 Jul 2011 23:42:28 -0400 (EDT) Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: <4E160B3A.1030009@ias.edu> References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> <4E15EA2A.9010200@ias.edu> <4E160B3A.1030009@ias.edu> Message-ID: <35078.68.83.96.23.1310182948.squirrel@mail.eadline.org> --snip-- > > Did you read the paper that someone else posted a link to? I just read > the first half of it. A good part of this research is focused on > fault-tolerance/resiliency of computer systems. They're not just > interested in creating a computer to mimic the brain, they want to learn > how to mimic the brain's fault-tolerance in computers. > > To paraphrase the paper, we lose a neuron a second in our brains for our > entire lives, but we never notice any problems from that. I know some people from the sixties that may beg to differ ;-) -- Doug > Prentice > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From raysonlogin at gmail.com Mon Jul 11 16:34:34 2011 From: raysonlogin at gmail.com (Rayson Ho) Date: Mon, 11 Jul 2011 16:34:34 -0400 Subject: [Beowulf] Grid Engine multi-core thread binding enhancement -pre-alpha release In-Reply-To: References: <26FE8986BC6B4E56B7E9EABB63E10A38@pccarlosf2> <4DA5E85D.4010801@ats.ucla.edu> Message-ID: We are (beta) releasing a drop-in package for SGE6.2u5, SGE6.2u5p1, and SGE6.2u5p2 for thread-binding: http://gridscheduler.sourceforge.net/projects/hwloc/GridEnginehwloc.html Mainly tested on Intel boxes -- would be great if AMD Magny-Cours server owners offer help with testing! (Play it safe -- setup a 1 or 2-node test cluster by using the non-standard SGE TCP ports). Thanks! 
Rayson On Mon, Apr 18, 2011 at 2:26 PM, Rayson Ho wrote: > For those who had issues with earlier version, please try the latest > loadcheck v4: > > http://gridscheduler.sourceforge.net/projects/hwloc/GridEnginehwloc.html > > I compiled the binary on Oracle Linux, which is compatible with RHEL > 5.x, Scientific Linux or Centos 5.x. I tested the binary on the > standard Red Hat kernel, and Oracle enhanced "Unbreakable Enterprise > Kernel", Fedora 13, Ubuntu 10.04 LTS. > > Optimizing for AMD's NUMA machine characteristics is on the ToDo list. > > Rayson > > > > On Wed, Apr 13, 2011 at 2:15 PM, Prakashan Korambath wrote: >> Hi Rayson, >> >> Do you have a statically linked version? Thanks. >> >> ./loadcheck: /lib64/libc.so.6: version `GLIBC_2.7' not found (required by >> ./loadcheck) >> >> Prakashan >> >> >> >> On 04/13/2011 09:21 AM, Rayson Ho wrote: >>> >>> Carlos, >>> >>> I notice that you have "lx24-amd64" instead of "lx26-amd64" for the >>> arch string, so I believe you are running the loadcheck from standard >>> Oracle Grid Engine, Sun Grid Engine, or one of the forks instead of >>> the one from the Open Grid Scheduler page. >>> >>> The existing Grid Engine (including the latest Open Grid Scheduler >>> releases: SGE 6.2u5p1& ?SGE 6.2u5p2, or Univa's fork) uses PLPA, and >>> it is known to be wrong on magny-cours. >>> >>> (i.e. SGE 6.2u5p1& ?SGE 6.2u5p2 from: >>> http://sourceforge.net/projects/gridscheduler/files/ ) >>> >>> >>> Chansup on the Grid Engine mailing list (it's the general purpose Grid >>> Engine mailing list for now) tested the version I uploaded last night, >>> and seems to work on a dual-socket magny-cours AMD machine. It prints: >>> >>> m_topology ? ? ?SCCCCCCCCCCCCSCCCCCCCCCCCC >>> >>> However, I am still fixing the processor, core id mapping code: >>> >>> http://gridengine.org/pipermail/users/2011-April/000629.html >>> http://gridengine.org/pipermail/users/2011-April/000628.html >>> >>> I compiled the hwloc enabled loadcheck on kernel 2.6.34& ?glibc 2.12, >>> so it may not work on machines running lower kernel or glibc versions, >>> you can download it from: >>> >>> http://gridscheduler.sourceforge.net/projects/hwloc/GridEnginehwloc.html >>> >>> Rayson >>> >>> >>> >>> On Wed, Apr 13, 2011 at 3:03 AM, Carlos Fernandez Sanchez >>> ?wrote: >>>> >>>> This is the output of a 2 sockets, 12 cores/socket (magny-cours) AMD >>>> system >>>> (and seems to be wrong!): >>>> >>>> arch ? ? ? ? ? ?lx24-amd64 >>>> num_proc ? ? ? ?24 >>>> m_socket ? ? ? ?2 >>>> m_core ? ? ? ? ?12 >>>> m_topology ? ? ?SCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTT >>>> load_short ? ? ?0.29 >>>> load_medium ? ? 0.13 >>>> load_long ? ? ? 0.04 >>>> mem_free ? ? ? ?26257.382812M >>>> swap_free ? ? ? 8191.992188M >>>> virtual_free ? ?34449.375000M >>>> mem_total ? ? ? 32238.328125M >>>> swap_total ? ? ?8191.992188M >>>> virtual_total ? 40430.320312M >>>> mem_used ? ? ? ?5980.945312M >>>> swap_used ? ? ? 0.000000M >>>> virtual_used ? ?5980.945312M >>>> cpu ? ? ? ? ? ? 0.0% >>>> >>>> >>>> Carlos Fernandez Sanchez >>>> Systems Manager >>>> CESGA >>>> Avda. de Vigo s/n. Campus Vida >>>> Tel.: (+34) 981569810, ext. 
232 >>>> 15705 - Santiago de Compostela >>>> SPAIN >>>> >>>> -------------------------------------------------- >>>> From: "Rayson Ho" >>>> Sent: Tuesday, April 12, 2011 10:31 PM >>>> To: "Beowulf List" >>>> Subject: [Beowulf] Grid Engine multi-core thread binding enhancement >>>> -pre-alpha release >>>> >>>>> If you are using the "Job to Core Binding" feature in SGE and running >>>>> SGE on newer hardware, then please give the new hwloc enabled >>>>> loadcheck a try. >>>>> >>>>> http://gridscheduler.sourceforge.net/projects/hwloc/GridEnginehwloc.html >>>>> >>>>> The current hardware topology discovery library (Portable Linux >>>>> Processor Affinity - PLPA) used by SGE was deprecated in 2009, and new >>>>> hardware topology may not be detected correctly by PLPA. >>>>> >>>>> If you are running SGE on AMD Magny-Cours servers, please post your >>>>> loadcheck output, as it is known to be wrong when handled by PLPA. >>>>> >>>>> The Open Grid Scheduler is migrating to hwloc -- we will ship hwloc >>>>> support in later releases of Grid Engine / Grid Scheduler. >>>>> >>>>> http://gridscheduler.sourceforge.net/ >>>>> >>>>> Thanks!! >>>>> Rayson >>>>> _______________________________________________ >>>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >>>>> To change your subscription (digest mode or unsubscribe) visit >>>>> http://www.beowulf.org/mailman/listinfo/beowulf >>>> >>>> >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >> > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Mon Jul 11 23:39:46 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 11 Jul 2011 23:39:46 -0400 (EDT) Subject: [Beowulf] Grid Engine multi-core thread binding enhancement -pre-alpha release In-Reply-To: References: <26FE8986BC6B4E56B7E9EABB63E10A38@pccarlosf2> <4DA5E85D.4010801@ats.ucla.edu> Message-ID: > http://gridscheduler.sourceforge.net/projects/hwloc/GridEnginehwloc.html since this isn't an SGE list, I don't want to pursue an off-topic too far, but out of curiosity, does this make the scheduler topology aware? that is, not just topo-aware binding, but topo-aware resource allocation? you know, avoid unnecessary resource contention among the threads belonging to multiple jobs that happen to be on the same node. large-memory processes not getting bound to a single memory node. packing both small and large-memory processes within a node. etc? thanks, mark hahn. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
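For anyone squinting at the m_topology strings above: Grid Engine encodes the detected layout as a string where, as I read it, 'S' starts a socket, 'C' a core, and 'T' marks a hardware thread on the preceding core. A small throwaway decoder (not part of Grid Engine) makes the difference between the two outputs obvious: the PLPA-based string reports 2 sockets of 6 two-threaded cores, while the hwloc-based string reports the 2 x 12 single-threaded cores you expect on a dual-socket Magny-Cours box.

    # Tiny decoder for Grid Engine m_topology strings (illustrative helper only).
    # Reading: 'S' = socket, 'C' = core, 'T' = one hardware thread on that core;
    # a bare 'C' with no trailing 'T's is taken as a single-threaded core.
    def decode(topo):
        sockets = []
        for ch in topo:
            if ch == 'S':
                sockets.append([])
            elif ch == 'C':
                sockets[-1].append(0)
            elif ch == 'T':
                sockets[-1][-1] += 1
        return [[threads or 1 for threads in cores] for cores in sockets]

    for topo in ("SCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTT",   # PLPA view (looks wrong for Magny-Cours)
                 "SCCCCCCCCCCCCSCCCCCCCCCCCC"):              # hwloc view (expected)
        sockets = decode(topo)
        print(topo)
        print("  sockets:", len(sockets),
              " cores/socket:", [len(c) for c in sockets],
              " threads/core:", [c[0] for c in sockets])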
From kilian.cavalotti.work at gmail.com Tue Jul 12 05:16:26 2011 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Tue, 12 Jul 2011 11:16:26 +0200 Subject: [Beowulf] HPC storage purchasing process Message-ID: An interesting insider view about the procurement process that occured for the purchase of a HPC storage system at CHPC http://www.hpcwire.com/hpcwire/2011-07-08/hpc_center_traces_storage_selection_experience.html Cheers, -- Kilian _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From raysonlogin at gmail.com Tue Jul 12 16:12:14 2011 From: raysonlogin at gmail.com (Rayson Ho) Date: Tue, 12 Jul 2011 16:12:14 -0400 Subject: [Beowulf] Grid Engine multi-core thread binding enhancement -pre-alpha release In-Reply-To: References: <26FE8986BC6B4E56B7E9EABB63E10A38@pccarlosf2> <4DA5E85D.4010801@ats.ucla.edu> Message-ID: On Mon, Jul 11, 2011 at 11:39 PM, Mark Hahn wrote: >> http://gridscheduler.sourceforge.net/projects/hwloc/GridEnginehwloc.html > > since this isn't an SGE list, I don't want to pursue an off-topic too far, Hi Mark, I think a lot of this will apply to non-SGE batch schedulers -- in fact Torque will support hwloc in a future release. And all mature batch systems (eg. LSF, SGE, SLURM) have some sort of CPU set support for many years, but now this feature is more important as the interaction of different hardware layers impacts the performance more as more cores are added per socket. > but out of curiosity, does this make the scheduler topology aware? > that is, not just topo-aware binding, but topo-aware resource allocation? > you know, avoid unnecessary resource contention among the threads belonging > to multiple jobs that happen to be on the same node. You can tell SGE (now: Grid Scheduler) how you want to allocate hardware resource, but then different hardware architectures & program behaviors can introduce interactions that will cause different performance impact. For example, a few years ago while I was still working for a large UNIX system vendor, I found that a few SPEC OMP benchmarks run faster when the threads are closer to each other (even when sharing the same core by running in SMT mode), while most benchmarks benefit from more L2/L3 caches & memory bandwidth (I'm talking about the same thread count for both cases). But it is hard even as a compiler developer to find out how to choose the optimal thread allocation -- even with high-level array access pattern information & memory bandwidth models available at compilation time. For batch systems, we don't have as much info as the compiler. While we can profile systems on the fly by PAPI, I doubt we will go that route in the near future. So, that means we need the job submitter to tell us what he wants -- in SGE/OGS, we have "qsub -binding striding::", which means you will need to benchmark the code and see how the code interacts with the hardware, and see whether it runs better with more L2/L3/memory bandwidth (meaning step-size >= 2), or "qsub -binding linear", which means the job will get the core by itself. http://wikis.sun.com/display/gridengine62u5/Using+Job+to+Core+Binding >?large-memory processes > not getting bound to a single memory node. ?packing both small and > large-memory processes within a node. 
?etc? For memory nodes, a call to numactl should be able to handle most use-cases. Rayson > > thanks, mark hahn. > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From john.hearns at mclaren.com Wed Jul 13 04:47:44 2011 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 13 Jul 2011 09:47:44 +0100 Subject: [Beowulf] Grid Engine multi-core thread binding enhancement -pre-alpha release References: <26FE8986BC6B4E56B7E9EABB63E10A38@pccarlosf2><4DA5E85D.4010801@ats.ucla.edu> Message-ID: <207BB2F60743C34496BE41039233A8090656ACFF@MRL-PWEXCHMB02.mil.tagmclarengroup.com> > > Hi Mark, > > I think a lot of this will apply to non-SGE batch schedulers -- in > fact Torque will support hwloc in a future release. > That sounds good to me! (Hint - if anyone from Altair is listening in it would be useful...) > And all mature batch systems (eg. LSF, SGE, SLURM) have some sort of > CPU set support for many years, but now this feature is more important > as the interaction of different hardware layers impacts the > performance more as more cores are added per socket. > I agree very much with what you say here. The Gridengine topology-aware scheduling sounds great, and as you say with multi-core architectures will be more and more useful. I'd like to mention cpuset support though - cpusets are of course vitally useful, however you can have situations where a topology-aware scheduler would allow you to allocate high core count jobs on a machine, using cores which are physically close to each other, yet also run small core count yet high memory jobs which access the memory of those cores. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Sat Jul 16 16:19:43 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Sat, 16 Jul 2011 16:19:43 -0400 (EDT) Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: <20110707165629.GL16178@leitl.org> References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> <20110707165629.GL16178@leitl.org> Message-ID: >>> How long is it going to take to wire them all up? And how fast are they >>> going to fail? If there's a MTBF of one million hours, that's still one >>> failure per hour. ... > They do address som of that in ftp://ftp.cs.man.ac.uk/pub/amulet/papers/SBF_ACSD09.pdf the 1m proc seems to be referring to cores, of which their current SOC has 20/chip, and there are 4 chips on their current test board: http://www.eetimes.com/electronics-news/4217842/SpiNNaker-ARM-chip-slide-show?pageNumber=4 hmm, that article says 18 cores (maybe reduced for yield). stacked dram, not sure what the other companion chip is on the test board. 
anyway, compare it to the K computer: 516096 compute cores, 64512 packages, versus 50k packages for Spinnaker. Spinnaker will obviously put more chips onto a single board (board links are more reliable than connectors, as well as more power-efficient.) Spinnaker has 6 links for a 2d toroidal mesh (not 3d for some reason) - K also uses a 6-link mesh. obviously, off-board links need a connector, but if I were designing either box I'd have each board plug into a per-rack backplane, again, to avoid dealing with cables. if you have a per-rack sub-mesh anyway, it should be 3d, shouldn't it? in abstract, it seems like Spinnaker would want a 3d mesh to better model the failure effect in the brain (which is certainly not 2d nearest-neighbor!) in fact, if you wanted to embrace brain-like topologies, I'd think a flat-network-neighborhood would be most realistic (albeit cable-intensive. but we're not afraid of failed cables, since the brain isn't!) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From deadline at eadline.org Mon Jul 18 10:18:41 2011 From: deadline at eadline.org (Douglas Eadline) Date: Mon, 18 Jul 2011 10:18:41 -0400 (EDT) Subject: [Beowulf] Multi-core benchmarking Message-ID: <32768.192.168.93.213.1310998721.squirrel@mail.eadline.org> I had recently run some tests on a few multi-core processors that I use in my Limulus Project Benchmarking A Multi-Core Processor For HPC http://www.clustermonkey.net//content/view/306/1/ I have always been interested in how well multi-core can support parallel codes. I used the following CPUs: * Intel Core2 Quad-core Q6600 running at 2.4GHz (Kentsfield) * AMD Phenom II X4 quad-core 910e running at 2.6GHz (Deneb) * Intel Core i5 Quad-core i5-2400S running at 2.5 GHz (Sandybridge) -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From deadline at eadline.org Mon Jul 18 13:22:57 2011 From: deadline at eadline.org (Douglas Eadline) Date: Mon, 18 Jul 2011 13:22:57 -0400 (EDT) Subject: [Beowulf] Multi-core benchmarking In-Reply-To: References: <32768.192.168.93.213.1310998721.squirrel@mail.eadline.org> Message-ID: <60878.192.168.93.213.1311009777.squirrel@mail.eadline.org> forward from Mikhail Kuzminsky: > Dear Douglas, > I can't send my messages directly to beowulf maillist, so could you pls > forward my answer manually to beowulf maillist ? > > 18 ???????? 2011, 18:20 ???? "Douglas Eadline" : > > I had recently run some tests on a few multi-core > processors that I use in my Limulus Project > > ??Benchmarking A Multi-Core Processor For HPC > ?? > http://www.clustermonkey.net//content/view/306/1/ > > I have always been interested in how well > multi-core can support parallel codes. > I used the following CPUs: > > ????* Intel Core2 Quad-core Q6600 running at 2.4GHz (Kentsfield) > ????* AMD Phenom II X4 quad-core 910e running at 2.6GHz (Deneb) > ????* Intel Core i5 Quad-core i5-2400S running at 2.5 GHz (Sandybridge) > > The compiler used was gfortran. 
Does it have AVX support ? For full AVX > extension support for i5 > ??we need gcc 4.6. 0 or higher. gfortran version 4.4.4, I don't think it has AVX. I was mostly interested in memory bandwidth in these tests. I will be testing compilers at some other point. -- Doug > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From vanallsburg at hope.edu Tue Jul 19 09:32:39 2011 From: vanallsburg at hope.edu (Paul Van Allsburg) Date: Tue, 19 Jul 2011 09:32:39 -0400 Subject: [Beowulf] initialize starting number for torque 2.5.5 Message-ID: <4E258777.4030508@hope.edu> Hi All, I'm in the process of replacing a cluster head node and I'd like to install torque with a starting job number of 100000. Is this possible or problematic? Thanks! Paul -- Paul Van Allsburg Scientific Computing Specialist Natural Sciences Division, Hope College 35 E. 12th St. Holland, Michigan 49423 616-395-7292 vanallsburg at hope.edu http://www.hope.edu/academic/csm/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hearnsj at googlemail.com Tue Jul 19 09:44:06 2011 From: hearnsj at googlemail.com (John Hearns) Date: Tue, 19 Jul 2011 14:44:06 +0100 Subject: [Beowulf] initialize starting number for torque 2.5.5 In-Reply-To: <4E258777.4030508@hope.edu> References: <4E258777.4030508@hope.edu> Message-ID: On 19 July 2011 14:32, Paul Van Allsburg wrote: > Hi All, > > I'm in the process of replacing a cluster head node and I'd like to install torque with a starting job number of 100000. Is this > possible or problematic? In SGE you would just edit the jobseqno file. In PBS that sequence number is held in the file serverdb, and I have had call to alter it once when we upgraded a cluster and I wanted to keep a continuity of job sequence numbers. As serverdb is an XML file in Torque, if I Google serves me well, so I would try installing then running the first job, stopping then editing serverdb? _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From vanallsburg at hope.edu Tue Jul 19 09:50:30 2011 From: vanallsburg at hope.edu (Paul Van Allsburg) Date: Tue, 19 Jul 2011 09:50:30 -0400 Subject: [Beowulf] initialize starting number for torque 2.5.5 In-Reply-To: References: <4E258777.4030508@hope.edu> Message-ID: <4E258BA6.8050806@hope.edu> Terrific! Thanks Joey, Paul On 7/19/2011 9:36 AM, Joey Jones wrote: > > Shouldn't be a problem at all: > > In qmgr: > > set server next_job_number = 100000 > > ____________________________________________ > Joey B. 
Jones > HPC2 Systems And Network Administration > Mississippi State University > > On Tue, 19 Jul 2011, Paul Van Allsburg wrote: > >> Hi All, >> >> I'm in the process of replacing a cluster head node and I'd like to install torque with a starting job number of 100000. Is this >> possible or problematic? >> >> Thanks! >> >> Paul >> >> -- >> Paul Van Allsburg >> Scientific Computing Specialist >> Natural Sciences Division, Hope College >> 35 E. 12th St. Holland, Michigan 49423 >> 616-395-7292 vanallsburg at hope.edu >> http://www.hope.edu/academic/csm/ >> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >> > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From eugen at leitl.org Thu Jul 21 04:18:36 2011 From: eugen at leitl.org (Eugen Leitl) Date: Thu, 21 Jul 2011 10:18:36 +0200 Subject: [Beowulf] PetaBytes on a budget, take 2 Message-ID: <20110721081836.GD16178@leitl.org> http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/ -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From deadline at eadline.org Thu Jul 21 11:45:28 2011 From: deadline at eadline.org (Douglas Eadline) Date: Thu, 21 Jul 2011 11:45:28 -0400 (EDT) Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <20110721081836.GD16178@leitl.org> References: <20110721081836.GD16178@leitl.org> Message-ID: <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> I'm curious, has anyone tried building one of these or know of anyone who has? Seems like a cheap solution for raw backup. -- Doug > > http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/ > > -- > Eugen* Leitl leitl http://leitl.org > ______________________________________________________________ > ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org > 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From eugen at leitl.org Thu Jul 21 12:09:51 2011 From: eugen at leitl.org (Eugen Leitl) Date: Thu, 21 Jul 2011 18:09:51 +0200 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> Message-ID: <20110721160951.GN16178@leitl.org> On Thu, Jul 21, 2011 at 11:45:28AM -0400, Douglas Eadline wrote: > I'm curious, has anyone tried building one of these or know > of anyone who has? > > Seems like a cheap solution for raw backup. We use quite a few of wire-shelved HP N36L with 8 GByte ECC DDR3 RAM and 4x 3 TByte Hitachi drives with zfs, running NexentaCore and napp-it. They export via NFS and CIFS, but in principle you could use these for a cluster FS. > -- > Doug > > > > > http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From ellis at runnersroll.com Thu Jul 21 12:28:00 2011 From: ellis at runnersroll.com (Ellis H. Wilson III) Date: Thu, 21 Jul 2011 12:28:00 -0400 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <20110721160951.GN16178@leitl.org> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> Message-ID: <4E285390.6050308@runnersroll.com> On 07/21/11 12:09, Eugen Leitl wrote: > On Thu, Jul 21, 2011 at 11:45:28AM -0400, Douglas Eadline wrote: >> I'm curious, has anyone tried building one of these or know >> of anyone who has? >> >> Seems like a cheap solution for raw backup. I have doubts about the manageability of such large data without complex software sitting above the spinning rust to enable scalability of performance and recovery of drive failures, which are inevitable at this scale. I mean, what is the actual value of this article? They really don't tell you "how" to build reliable storage at that scale, just a hand-waving description on how some of the items fit in the box and a few file-system specifics. THe SATA wiring diagram is probably the most detailed thing in the post and even that leaves a lot of questions to be answered. Either way, I think if someone were to foolishly just toss together >100TB of data into a box they would have a hell of a time getting anywhere near even 10% of the theoretical max performance-wise. Not to mention double-disk failures (not /that/ uncommon with same make,model,lot hdds) would just wreck all their data. Now for Backblaze (which is a pretty poor name choice IMHO), they manage all that data in-house so building cheap units makes sense since they can safely rely on the software stack they've built over a couple years. For traditional Beowulfers, spending a year or two developing custom software just to manage big data is likely not worth it. 
ellis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Thu Jul 21 14:29:56 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 21 Jul 2011 11:29:56 -0700 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <4E285390.6050308@runnersroll.com> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> Message-ID: <20110721182956.GA19104@bx9.net> On Thu, Jul 21, 2011 at 12:28:00PM -0400, Ellis H. Wilson III wrote: > For traditional Beowulfers, spending a year or two developing custom > software just to manage big data is likely not worth it. There are many open-souce packages for big data, HDFS being one file-oriented example in the Hadoop family. While they generally don't have the features you'd want for running with HPC programs, they do have sufficient features to do things like backups. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From ellis at runnersroll.com Thu Jul 21 14:55:30 2011 From: ellis at runnersroll.com (Ellis H. Wilson III) Date: Thu, 21 Jul 2011 14:55:30 -0400 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <20110721182956.GA19104@bx9.net> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> <20110721182956.GA19104@bx9.net> Message-ID: <4E287622.6010400@runnersroll.com> On 07/21/11 14:29, Greg Lindahl wrote: > On Thu, Jul 21, 2011 at 12:28:00PM -0400, Ellis H. Wilson III wrote: > >> For traditional Beowulfers, spending a year or two developing custom >> software just to manage big data is likely not worth it. > > There are many open-souce packages for big data, HDFS being one > file-oriented example in the Hadoop family. While they generally don't > have the features you'd want for running with HPC programs, they do > have sufficient features to do things like backups. I'm actually doing a bunch of work with Hadoop right now, so it's funny you mention it. My experience with and understanding of Hadoop/HDFS is that it is really more geared towards actually doing something with the data once you have it on storage, which is why it's based of off google fs (and undoubtedly why you mention it, being in the search arena yourself). As purely a backup solution it would be particularly clunky, especially in a setup like this one where there's a high HDD to CPU ratio. My personal experience with getting large amounts of data from local storage to HDFS has been suboptimal compared to something more raw, but perhaps I'm doing something wrong. Do you know of any distributed file-systems that are geared towards high-sequential-performance and resilient backup/restore? I think even for HPC (checkpoints), there's a pretty good desire to be able to push massive data down and get it back over wide pipes. 
Perhaps pNFS will fill this need? ellis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mm at yuhu.biz Thu Jul 21 17:58:16 2011 From: mm at yuhu.biz (Marian Marinov) Date: Fri, 22 Jul 2011 00:58:16 +0300 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <4E287622.6010400@runnersroll.com> References: <20110721081836.GD16178@leitl.org> <20110721182956.GA19104@bx9.net> <4E287622.6010400@runnersroll.com> Message-ID: <201107220058.20707.mm@yuhu.biz> On Thursday 21 July 2011 21:55:30 Ellis H. Wilson III wrote: > On 07/21/11 14:29, Greg Lindahl wrote: > > On Thu, Jul 21, 2011 at 12:28:00PM -0400, Ellis H. Wilson III wrote: > >> For traditional Beowulfers, spending a year or two developing custom > >> > >> software just to manage big data is likely not worth it. > > > > There are many open-souce packages for big data, HDFS being one > > file-oriented example in the Hadoop family. While they generally don't > > have the features you'd want for running with HPC programs, they do > > have sufficient features to do things like backups. > > I'm actually doing a bunch of work with Hadoop right now, so it's funny > you mention it. My experience with and understanding of Hadoop/HDFS is > that it is really more geared towards actually doing something with the > data once you have it on storage, which is why it's based of off google > fs (and undoubtedly why you mention it, being in the search arena > yourself). As purely a backup solution it would be particularly clunky, > especially in a setup like this one where there's a high HDD to CPU ratio. > > My personal experience with getting large amounts of data from local > storage to HDFS has been suboptimal compared to something more raw, but > perhaps I'm doing something wrong. Do you know of any distributed > file-systems that are geared towards high-sequential-performance and > resilient backup/restore? I think even for HPC (checkpoints), there's a > pretty good desire to be able to push massive data down and get it back > over wide pipes. Perhaps pNFS will fill this need? > I think that GlusterFS would fit perfectly in that place. HDFS is actually a very poor choice for such storages because its performance is not good. The article explaines that they have compared JFS, XFS and Ext4. When I was desiging my backup solution I also compared those 3 and GlusterFS on top of them. I also concluded that Ext4 was the way to go. And with utilizing LVM or having a software to handle the HW failures it actually prooves to be quite suitable for backups. The performance of Ext4 is far better then JFS and XFS, we also tested Ext3 but abondand that. However I'm not sure that this kind of storage is very good for anything else then backups. I believe that more random I/O may kill the performance and hardware of such systems. If you are doing only backups on these drives and you are keeping hot spares on the controler having a tripple failure is quite hard to achieve. And even in those situations if you lose only a single RAID6 array, not the whole storage node. Currently my servers are with 34TB capacity, and what these guys show me, is how I can rearange my current hardware and double the capacity of the backups. 
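For concreteness, the GlusterFS-on-ext4 setup described above amounts to only a few commands. This is a rough sketch, not a recipe: it assumes GlusterFS 3.x with its CLI and two hypothetical storage nodes (store1, store2) that each export one ext4-backed brick; the volume name, brick paths and mount point are all made up.

    # on store1: join the peers and build a 2-way replicated volume
    gluster peer probe store2
    gluster volume create backups replica 2 transport tcp \
        store1:/data/brick1 store2:/data/brick1
    gluster volume start backups

    # on the machine doing the backups: mount it like any other filesystem
    mount -t glusterfs store1:/backups /mnt/backups

Note that the client writes to both replicas, so the usable write bandwidth is roughly half the client's network bandwidth - acceptable for a backup target, less so for primary storage.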
So I'm extremely happy that they share this with the world. -- Best regards, Marian Marinov CEO of 1H Ltd. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 198 bytes Desc: This is a digitally signed message part. URL: -------------- next part -------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pbm.com Thu Jul 21 18:07:42 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 21 Jul 2011 15:07:42 -0700 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <4E287622.6010400@runnersroll.com> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> <20110721182956.GA19104@bx9.net> <4E287622.6010400@runnersroll.com> Message-ID: <20110721220742.GD22174@bx9.net> On Thu, Jul 21, 2011 at 02:55:30PM -0400, Ellis H. Wilson III wrote: > My personal experience with getting large amounts of data from local > storage to HDFS has been suboptimal compared to something more raw, If you're writing 3 copies of everything on 3 different nodes, then sure, it's a lot slower than writing 1 copy. The benefit you get from this extra up-front expense is resilience. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From ellis at runnersroll.com Thu Jul 21 20:03:58 2011 From: ellis at runnersroll.com (Ellis H. Wilson III) Date: Thu, 21 Jul 2011 20:03:58 -0400 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <20110721220742.GD22174@bx9.net> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> <20110721182956.GA19104@bx9.net> <4E287622.6010400@runnersroll.com> <20110721220742.GD22174@bx9.net> Message-ID: <4E28BE6E.5030800@runnersroll.com> On 07/21/11 18:07, Greg Lindahl wrote: > On Thu, Jul 21, 2011 at 02:55:30PM -0400, Ellis H. Wilson III wrote: >> My personal experience with getting large amounts of data from local >> storage to HDFS has been suboptimal compared to something more raw, > > If you're writing 3 copies of everything on 3 different nodes, then > sure, it's a lot slower than writing 1 copy. The benefit you get from > this extra up-front expense is resilience. Used in a backup solution, triplication won't get you much more resilience than RAID6 but will pay a much greater performance penalty to simply get your backup or checkpoint completed. Additionally, unless you have a ton of these boxes you won't get some of the important benefits of Hadoop such as rack-aware replication placement. Perhaps you could alter HDFS to handle triplication in the background once you get the local copy on-disk, but this isn't really what it was built for so again one is probably better off going with a more efficient, if less complex distributed file system. 
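For reference, the "triplication" in HDFS is not fixed: the replication factor is a per-file attribute with a cluster-wide default, so a pure backup tree can be held at two copies while hotter data stays at three. A minimal sketch, assuming a stock Hadoop 0.20-era install and a hypothetical /backups path:

    # cluster-wide default lives in conf/hdfs-site.xml:
    #   <property>
    #     <name>dfs.replication</name>
    #     <value>3</value>
    #   </property>

    # per-path override after the fact: drop a backup tree to 2 copies
    # (-R recurses, -w waits for the re-replication to finish)
    hadoop fs -setrep -R -w 2 /backups

    # sanity-check what you actually ended up with
    hadoop fsck /backups -files -blocks

Whether two or three whole-file copies or RAID6 parity is the better deal for a pure backup target is exactly the trade-off being argued here.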
ellis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Thu Jul 21 20:22:03 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 21 Jul 2011 17:22:03 -0700 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <4E28BE6E.5030800@runnersroll.com> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> <20110721182956.GA19104@bx9.net> <4E287622.6010400@runnersroll.com> <20110721220742.GD22174@bx9.net> <4E28BE6E.5030800@runnersroll.com> Message-ID: <20110722002203.GD1350@bx9.net> On Thu, Jul 21, 2011 at 08:03:58PM -0400, Ellis H. Wilson III wrote: > Used in a backup solution, triplication won't get you much more > resilience than RAID6 but will pay a much greater performance penalty to > simply get your backup or checkpoint completed. Hey, if you don't see any benefit from R3, then it's no surprise that you find the cost too high. Me, I don't like being woken up in the dead of the night to run to the colo to replace a disk. And I trust my raid vendor's code less than my replication code. > Additionally, unless you have a ton of these boxes you won't get > some of the important benefits of Hadoop such as rack-aware > replication placement. Most of the benefit is achieved from machine-aware replication placement: the number of PDU and switch failures is much smaller than the number of node failures, which is much smaller than the number of disk device failures. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Fri Jul 22 00:20:14 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 22 Jul 2011 00:20:14 -0400 (EDT) Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> Message-ID: > I'm curious, has anyone tried building one of these or know > of anyone who has? a guy here built one, and it seems to behave fine. > Seems like a cheap solution for raw backup. "raw"? I think the backblaze (v1) is used for rsync-based incremental/nightly-snapshots. but yeah, this is a lot of space at the end of pretty narrow pipes. at some level, though, backblaze is a far more sensible response to disk prices than conventional vendors. 3TB disks start at $140! how much expensive infrastructure do you want to add to the disks to make the space usable? it's absurd to think of using fiberchannel, for instance. bigname vendors still try to tough it out by pretending that their sticker on commodity disks makes them worth hundreds of dollars more - I always figure this is more to justify charging thousands for, say, a 12-disk enclosure. backblaze's approach is pretty gung-ho, though. if I were trying to do storage at that scale, I'd definitely consider using fewer parts. 
for instance, an all-in-one motherboard with 6 sata ports and disks in a 1U chassis. BB winds up being about $44/disk overhead, and I think the simpler approach could come close, maybe $50/disk. then again, if you only look at asymptotics, USB enclosures knock it down to maybe $25/disk ;) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Fri Jul 22 00:33:37 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 22 Jul 2011 00:33:37 -0400 (EDT) Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <4E285390.6050308@runnersroll.com> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> Message-ID: > Either way, I think if someone were to foolishly just toss together >> 100TB of data into a box they would have a hell of a time getting > anywhere near even 10% of the theoretical max performance-wise. storage isn't about performance any more. ok, hyperbole, a little. but even a cheap disk does > 100 MB/s, and in all honesty, there are not tons of people looking for bandwidth more than a small multiplier of that. sure, a QDR fileserver wants more than a couple disks, and if you're an iops-head, you're going flash anyway. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From landman at scalableinformatics.com Fri Jul 22 00:46:11 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 22 Jul 2011 00:46:11 -0400 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> Message-ID: <4E290093.1010504@scalableinformatics.com> On 07/22/2011 12:33 AM, Mark Hahn wrote: >> Either way, I think if someone were to foolishly just toss together >>> 100TB of data into a box they would have a hell of a time getting >> anywhere near even 10% of the theoretical max performance-wise. > > storage isn't about performance any more. ok, hyperbole, a little. > but even a cheap disk does> 100 MB/s, and in all honesty, there are > not tons of people looking for bandwidth more than a small multiplier > of that. sure, a QDR fileserver wants more than a couple disks, With all due respect, I beg to differ. The bigger you make your storage, the larger the pipes in you need, and the larger the pipes to the storage you need, lest you decide that tape is really cheaper after all. Tape does 100MB/s these days. And the media is relatively cheap (compared to some HD). If you don't care about access performance under load, you really can't beat its economics. More to the point, you need a really balanced architecture in terms of bandwidth. I think USB3 could be very interesting for small arrays, and pretty much expect to start seeing some as block targets pretty soon. 
I don't see enough aggregated USB3 ports together in a single machine to make this terribly interesting as a large scale storage medium, but it is a possible route. They are interesting boxen. We often ask customers if they'd consider non-enterprise drives. Failure rates similar to the enterprise as it turns out, modulo some ridiculous drive products. Most say no. Those who say yes don't see enhanced failure rates. > and if you're an iops-head, you're going flash anyway. This is more recent than you might have guessed ... at least outside of academia. We should have a fun machine to talk about next week, and show some benchies on. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Fri Jul 22 01:04:36 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 21 Jul 2011 22:04:36 -0700 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> Message-ID: <20110722050436.GB10994@bx9.net> On Fri, Jul 22, 2011 at 12:33:37AM -0400, Mark Hahn wrote: > storage isn't about performance any more. ok, hyperbole, a little. > but even a cheap disk does > 100 MB/s, and in all honesty, there are > not tons of people looking for bandwidth more than a small multiplier > of that. sure, a QDR fileserver wants more than a couple disks, > and if you're an iops-head, you're going flash anyway. Over in the big data world, we're all about disk bandwidth, because we take the computation to the data. When we're reading something for a Map/Reduce job, we can easily drive 800 MB/s off of 8 disks in a single node, and for many jobs the most expensive thing about the job is reading. Good thing we have 3 copies of every bit of data, that gives us 1/3 the runtime. Writing, not so happy. Network bandwidth is a lot more expensive than disk bandwidth. Some data manipulations in HPC are like Map/Reduce. For example, shooting a movie using saved state files is embarrassingly parallel. The first system I heard about which took computation to the data was from SDSC, long before GOOG was founded. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
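As an aside on the rsync-based nightly snapshots mentioned a few messages back: the usual trick on a big, cheap target like these pods is hard-link rotation, so each night costs only the changed files plus some inodes. A minimal sketch, assuming GNU rsync and coreutils on the pod, with hypothetical host names and paths:

    # nightly snapshot; unchanged files are hard-linked to yesterday's copy
    # (on the very first run --link-dest just warns and does a full copy)
    today=$(date +%Y-%m-%d)
    rsync -a --delete \
        --link-dest=/backups/host1/latest \
        host1:/export/data/ /backups/host1/$today/
    ln -sfn /backups/host1/$today /backups/host1/latest

Each dated directory then looks like a full backup, but only the deltas consume new space - a good match for a lot of capacity sitting behind a narrow pipe.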
From hahn at mcmaster.ca Fri Jul 22 01:44:56 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 22 Jul 2011 01:44:56 -0400 (EDT) Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <4E290093.1010504@scalableinformatics.com> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> <4E290093.1010504@scalableinformatics.com> Message-ID: >>> Either way, I think if someone were to foolishly just toss together >>>> 100TB of data into a box they would have a hell of a time getting >>> anywhere near even 10% of the theoretical max performance-wise. >> >> storage isn't about performance any more. ok, hyperbole, a little. >> but even a cheap disk does> 100 MB/s, and in all honesty, there are >> not tons of people looking for bandwidth more than a small multiplier >> of that. sure, a QDR fileserver wants more than a couple disks, > > With all due respect, I beg to differ. with which part? > The bigger you make your storage, the larger the pipes in you need, and hence the QDR comment. > the larger the pipes to the storage you need, lest you decide that tape > is really cheaper after all. people who like tape seem to like it precisely because it's offline. BB storage, although fairly bottlenecked, is very much online and thus constantly integrity-verifiable... > Tape does 100MB/s these days. And the media is relatively cheap > (compared to some HD). yes, "some" is my favorite weasel word too ;) I don't follow tape prices much - but LTO looks a little more expensive than desktop drives. drives still not cheap. my guess is that tape could make sense at very large sizes, with enough tapes to amortize the drives, and some kind of very large robot. but really my point was that commodity capacity and speed covers the vast majority of the market - at least I'm guessing that most data is stored in systems of under, say 1 PB. if tape could deliver 135 TB in 4U with 10ms random access, yes, I guess there wouldn't be any point to backblaze... > If you don't care about access performance under > load, you really can't beat its economics. am I the only one who doesn't trust tape? who thinks of integrity being a consequence of constant verifiability? > More to the point, you need a really balanced architecture in terms of > bandwidth. I think USB3 could be very interesting for small arrays, and > pretty much expect to start seeing some as block targets pretty soon. I > don't see enough aggregated USB3 ports together in a single machine to > make this terribly interesting as a large scale storage medium, but it > is a possible route. hard to imagine a sane way to distribute power to lots of external usb enclosers, let alone how to mount it. > They are interesting boxen. We often ask customers if they'd consider > non-enterprise drives. Failure rates similar to the enterprise as it > turns out, modulo some ridiculous drive products. Most say no. Those > who say yes don't see enhanced failure rates. old-fashioned thinking, from the days when disks were expensive. now the most expensive commodity disk you can buy is maybe $200, so you really have to think of it as a consumable. (yes, people do still buy mercedes and SAS/FC/etc disks, but that doesn't make them mass-market/commodity products.) >> and if you're an iops-head, you're going flash anyway. > > This is more recent than you might have guessed ... at least outside of > academia. 
We should have a fun machine to talk about next week, and > show some benchies on. to be honest, I don't understand what applications lead to focus on IOPS (rationally, not just aesthetic/ideologically). it also seems like battery-backed ram and logging to disks would deliver the same goods... _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Fri Jul 22 02:55:59 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 21 Jul 2011 23:55:59 -0700 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> <4E290093.1010504@scalableinformatics.com> Message-ID: <20110722065559.GD16925@bx9.net> On Fri, Jul 22, 2011 at 01:44:56AM -0400, Mark Hahn wrote: > to be honest, I don't understand what applications lead to focus on IOPS > (rationally, not just aesthetic/ideologically). it also seems like > battery-backed ram and logging to disks would deliver the same goods... In HPC, the metadata for your big parallel filesystem is a good example. SSD is much cheaper capacity at high IOPs than battery-backed RAM. (The RAM has higher IOPs than you need.) For Big Data, there's often data that's hotter than the rest. An example from the blekko search engine is our index; when you type a query on our website, most often all of the 'disk' access is SSD. Big Data systems generally don't have a metadata problem like HPC does; instead of 200 million files, we have a couple of dozen tables in our petabyte database. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From eugen at leitl.org Fri Jul 22 03:05:11 2011 From: eugen at leitl.org (Eugen Leitl) Date: Fri, 22 Jul 2011 09:05:11 +0200 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <20110722065559.GD16925@bx9.net> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> <4E290093.1010504@scalableinformatics.com> <20110722065559.GD16925@bx9.net> Message-ID: <20110722070511.GQ16178@leitl.org> On Thu, Jul 21, 2011 at 11:55:59PM -0700, Greg Lindahl wrote: > On Fri, Jul 22, 2011 at 01:44:56AM -0400, Mark Hahn wrote: > > > to be honest, I don't understand what applications lead to focus on IOPS > > (rationally, not just aesthetic/ideologically). it also seems like > > battery-backed ram and logging to disks would deliver the same goods... > > In HPC, the metadata for your big parallel filesystem is a good example. > SSD is much cheaper capacity at high IOPs than battery-backed RAM. (The > RAM has higher IOPs than you need.) Hybrid pools in zfs can make use both of SSD and real (battery-backed) RAM disks ( http://www.amazon.com/ACARD-ANS-9010-Dynamic-Module-including/dp/B001NDX6FE or http://www.ddrdrive.com/ ). 
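To make the hybrid-pool idea concrete: in zfs the SSDs or RAM drives simply become dedicated log (slog) and cache (L2ARC) vdevs in front of the spinning disks. A minimal sketch only, with hypothetical Solaris-style device names; a real box would carry more data vdevs:

    # raidz2 across six cheap SATA drives, synchronous-write log on a
    # mirrored pair of fast devices, read cache on a larger SSD
    zpool create tank \
        raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 \
        log mirror c1t0d0 c1t1d0 \
        cache c1t2d0
    zpool status tank

The slog only accelerates synchronous writes (NFS, databases), and the L2ARC only helps once the read working set outgrows RAM; neither raises the raw sequential bandwidth of the pool.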
http://bigip-blogs-adc.oracle.com/brendan/entry/test Additional advantage of zfs is that it can deal with the higher error rate of consumer or nearline SATA disks (though it can do nothing against enterprise disk's higher resistance to vibration), and also with silent bit rot with periodic scrubbing (you can make Linux RAID scrub, but you can't make it checksum). > For Big Data, there's often data that's hotter than the rest. An > example from the blekko search engine is our index; when you type a > query on our website, most often all of the 'disk' access is SSD. > > Big Data systems generally don't have a metadata problem like HPC > does; instead of 200 million files, we have a couple of dozen tables > in our petabyte database. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From landman at scalableinformatics.com Fri Jul 22 08:13:30 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 22 Jul 2011 08:13:30 -0400 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> <4E290093.1010504@scalableinformatics.com> Message-ID: <4E29696A.2030607@scalableinformatics.com> On 07/22/2011 01:44 AM, Mark Hahn wrote: >>>> Either way, I think if someone were to foolishly just toss together >>>>> 100TB of data into a box they would have a hell of a time getting >>>> anywhere near even 10% of the theoretical max performance-wise. >>> >>> storage isn't about performance any more. ok, hyperbole, a little. >>> but even a cheap disk does> 100 MB/s, and in all honesty, there are >>> not tons of people looking for bandwidth more than a small multiplier >>> of that. sure, a QDR fileserver wants more than a couple disks, >> >> With all due respect, I beg to differ. > > with which part? Mainly "storage isn't about performance any more" and "there are not tons of people looking for bandwidth more than a small multiplier of that". To haul out my old joke ... generalizations tend to be incorrect ... >> The bigger you make your storage, the larger the pipes in you need, and > > hence the QDR comment. Yeah, well not everyone likes IB. As much as we've tried to convince others that it is a good idea in some cases for their workloads, many customers still prefer 10GbE and GbE. I personally have to admit that 10GbE and its very simple driver model (and its "just works" concept) is incredibly attractive, and often far easier to support than IB. This said, we've seen/experienced some very bad IB implementations (board level, driver issues, switch issues, overall fabric, ...) that I am somewhat more jaded as to real achievable bandwidth with it these days than I've been in the past. Sorta like people throwing disks together into 6G backplanes. We run into this all the time in certain circles. People tend to think the nice label will automatically grant them more performance than before. 
So we see some of the most poorly designed ... hideous really ... designed units from a bandwidth/latency perspective. I guess what I am saying is that QDR (nor 10GbE) is a silver bullet. There are no silver bullets. You *still* have to start with balanced and reasonable designs to get a chance at good performance. > >> the larger the pipes to the storage you need, lest you decide that tape >> is really cheaper after all. > > people who like tape seem to like it precisely because it's offline. > BB storage, although fairly bottlenecked, is very much online and thus > constantly integrity-verifiable... Extremely bottlenecked. 100TB / 100 MB/s -> 100,000,000 MB / 100 MB/s = 1,000,000 s to read or write ... once. This is what we've been calling the storage bandwidth wall. The higher the wall, the colder and more inaccessible your data is. This is on the order of 12 days to read or write the data once. My point about these units is that it may be possible to expand the capacity so much (without growing the various internal bandwidths) that it becomes effectively impossible to utilize all the space, even a majority of the space, in a reasonable time. Which renders the utility of such devices moot. > >> Tape does 100MB/s these days. And the media is relatively cheap >> (compared to some HD). > > yes, "some" is my favorite weasel word too ;) Well ... if you're backing up to SSD drives ... No, seriously not weasel wording on this. Tape is relatively cheap in bulk for larger capacities. > I don't follow tape prices much - but LTO looks a little more expensive > than desktop drives. drives still not cheap. my guess is that tape could > make sense at very large sizes, with enough tapes to amortize the drives, > and some kind of very large robot. but really my point was that commodity > capacity and speed covers the vast majority of the market - at least I'm > guessing that most data is stored in systems of under, say 1 PB. Understand also that I share your view that commodity drives are the better option. Just pointing out that you can follow your asymptote to an extreme (tape) if you wish to keep pushing pricing per byte down. My biggest argument against tape is, that, while the tapes themselves may last 20 years or so ... the drives don't. I've had numerous direct experiences with drive failures that wound up resulting in inaccessible data. I fail to see how the longevity of the media matters in this case, if you can't read it, or cannot get replacement drives to read it. Yeah, that happened. [...] > am I the only one who doesn't trust tape? who thinks of integrity > being a consequence of constant verifiability? See above. [...] >> They are interesting boxen. We often ask customers if they'd consider >> non-enterprise drives. Failure rates similar to the enterprise as it >> turns out, modulo some ridiculous drive products. Most say no. Those >> who say yes don't see enhanced failure rates. > > old-fashioned thinking, from the days when disks were expensive. > now the most expensive commodity disk you can buy is maybe $200, > so you really have to think of it as a consumable. (yes, people do still > buy mercedes and SAS/FC/etc disks, but that doesn't make them > mass-market/commodity products.) heh ... I can see it now: Me: "But gee Mr/Ms Customer, thats really old fashioned thinking (and Mark told me so!) so you gots ta let me sell you dis cheaper disk ..." (watches as door closes in face) It will take time for the business consumer side of market to adapt and adopt. Some do, most don't. 
Aside from that, the drive manufacturers just love them margins on the enterprise units ... And do you see how willingly people pay large multiples of $1/GB for SSDs? Ok, they are now getting closer to $1/GB, but thats still more than 1 OOM worse in cost than spinning rust ... > >>> and if you're an iops-head, you're going flash anyway. >> >> This is more recent than you might have guessed ... at least outside of >> academia. We should have a fun machine to talk about next week, and >> show some benchies on. > > to be honest, I don't understand what applications lead to focus on IOPS > (rationally, not just aesthetic/ideologically). it also seems like > battery-backed ram and logging to disks would deliver the same goods... oh... many. RAM is expensive. 10TB ram is power hungry and very expensive. Bloody fast, but very expensive. Many apps want fast and cheap. As to your thesis, in the world we live in today, bandwidth and latency are becoming ever more important, not less important. Maybe for specific users this isn't the case, and BB is perfect for that use case. For the general case, we aren't getting people asking us if we can raise that storage bandwidth wall. They are all asking us to lower that barrier. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From ellis at runnersroll.com Fri Jul 22 09:05:52 2011 From: ellis at runnersroll.com (Ellis H. Wilson III) Date: Fri, 22 Jul 2011 09:05:52 -0400 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <4E29696A.2030607@scalableinformatics.com> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> <4E290093.1010504@scalableinformatics.com> <4E29696A.2030607@scalableinformatics.com> Message-ID: <4E2975B0.604@runnersroll.com> On 07/22/11 08:13, Joe Landman wrote: > On 07/22/2011 01:44 AM, Mark Hahn wrote: >>>>> Either way, I think if someone were to foolishly just toss together >>>>>> 100TB of data into a box they would have a hell of a time getting >>>>> anywhere near even 10% of the theoretical max performance-wise. >>>> >>>> storage isn't about performance any more. ok, hyperbole, a little. >>>> but even a cheap disk does> 100 MB/s, and in all honesty, there are >>>> not tons of people looking for bandwidth more than a small multiplier >>>> of that. sure, a QDR fileserver wants more than a couple disks, >>> >>> With all due respect, I beg to differ. >> >> with which part? > > Mainly "storage isn't about performance any more" and "there are not > tons of people looking for bandwidth more than a small multiplier of > that". > > To haul out my old joke ... generalizations tend to be incorrect ... It's pretty nice to wake up in the morning and have somebody else have said everything nearly exactly as I would have. Nice write-up Joe! 
And at Greg - we can talk semantics until we're blue in the face but the reality is that Hadoop/HDFS/R3 is just not an appropriate solution for basic backups, which is the topic of this thread. Period. It's a fabulous tool for actually /working/ on big data and I /really/ do like Hadoop, but it's a very poor tool when all you want to do is really high-bw sequential writes or reads. If you disagree - fine - it's my opinion and I'm sticking to it. Regarding trusting your vendor's raid code less than replication code, I mean, that's pretty obvious. I think we all can agree cp x 3 is a much less complex solution. Best, ellis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mathog at caltech.edu Fri Jul 22 16:46:06 2011 From: mathog at caltech.edu (David Mathog) Date: Fri, 22 Jul 2011 13:46:06 -0700 Subject: [Beowulf] PetaBytes on a budget, take 2 Message-ID: Joe Landman wrote: > My biggest argument against tape is, that, while the tapes themselves > may last 20 years or so ... the drives don't. I've had numerous direct > experiences with drive failures that wound up resulting in inaccessible > data. I fail to see how the longevity of the media matters in this > case, if you can't read it, or cannot get replacement drives to read it. > Yeah, that happened. 20 years is a long, long time for digital storage media. I expect that if one did have a collection of 20 year old backup disk drives it would be reasonably challenging to find a working computer with a compatible interface to plug them into. And that assumes that those very old drives still work. For all I know there are disk drive models whose spindle permanently freezes in place if it isn't used for 15 years - it's not like the drive manufacturers actually test store drives that long. While it is unquestionably true that a 20 year old tape drive is prone to mechanical failure, it would still be easier to find another drive of the same type to read the tape then it would be to repair the equivalent mechanical failure in the storage medium itself (ie, in a failed disk drive.) That is, the tape itself is not prone to storage related mechanical failure. All of which is a bit of a straw man. The best way to maintain archival data over long periods of time is to periodically migrate it to newer storage technology, and in between migrations, to test read the archives periodically so as to detect unforeseen longevity issues early, while there is still a chance to recover the data. David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
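On the "test read the archives periodically" point: even without a checksumming filesystem, a manifest written at archive time turns the periodic read into an actual integrity check rather than just exercise for the spindles. A minimal sketch, assuming GNU coreutils and a hypothetical archive directory:

    # at archive time: record a checksum manifest next to the data
    cd /archive/project1
    find . -type f -print0 | xargs -0 sha256sum > /archive/project1.sha256

    # at each periodic check (and before/after every media migration):
    # re-read everything and print only the failures
    cd /archive/project1
    sha256sum -c /archive/project1.sha256 | grep -v ': OK$'

The same manifest travels with the data to each new generation of media, so damage introduced by a migration shows up immediately instead of twenty years later.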
From gus at ldeo.columbia.edu Fri Jul 22 17:20:54 2011 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 22 Jul 2011 17:20:54 -0400 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: References: Message-ID: <4E29E9B6.3050505@ldeo.columbia.edu> Incidentally, we have a museum-type Data General 9-track tape reader, which was carefully preserved and is now dedicated to recover seismic reflection data of old research cruises from the 1970's and 80's. (IMHO the Fujitsu readers were even better, but nobody can find those anymore.) Short from writting on stone (and granite also has its weathering time scale), one must occasionally copy over data that is still of interest to new media, as David said. I guess the time scale of reassessing what is still of interest is what matters, and it needs to be compatible with the longevity of current media. Maybe forgetting is part of keeping sanity, and even of remembering. The problem here, maybe in other places too, is that funding for remembering what should be remembered, and forgetting what should be forgotten, is often forgotten. My two cents interjection. Gus Correa David Mathog wrote: > Joe Landman wrote: > >> My biggest argument against tape is, that, while the tapes themselves >> may last 20 years or so ... the drives don't. I've had numerous direct >> experiences with drive failures that wound up resulting in inaccessible >> data. I fail to see how the longevity of the media matters in this >> case, if you can't read it, or cannot get replacement drives to read it. >> Yeah, that happened. > > 20 years is a long, long time for digital storage media. > > I expect that if one did have a collection of 20 year old backup disk > drives it would be reasonably challenging to find a working computer > with a compatible interface to plug them into. And that assumes that > those very old drives still work. For all I know there are disk drive > models whose spindle permanently freezes in place if it isn't used for > 15 years - it's not like the drive manufacturers actually test store > drives that long. While it is unquestionably true that a 20 year old > tape drive is prone to mechanical failure, it would still be easier to > find another drive of the same type to read the tape then it would be to > repair the equivalent mechanical failure in the storage medium itself > (ie, in a failed disk drive.) That is, the tape itself is not prone to > storage related mechanical failure. > > All of which is a bit of a straw man. The best way to maintain archival > data over long periods of time is to periodically migrate it to newer > storage technology, and in between migrations, to test read the archives > periodically so as to detect unforeseen longevity issues early, while > there is still a chance to recover the data. > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
From lindahl at pbm.com Sat Jul 23 01:53:38 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 22 Jul 2011 22:53:38 -0700 Subject: [Beowulf] PetaBytes on a budget, take 2 Message-ID: <20110723055338.GC20531@bx9.net> On Fri, Jul 22, 2011 at 09:05:11AM +0200, Eugen Leitl wrote: > Additional advantage of zfs is that it can deal with the higher > error rate of consumer or nearline SATA disks (though it can do > nothing against enterprise disk's higher resistance to vibration), > and also with silent bit rot with periodic scrubbing (you can > make Linux RAID scrub, but you can't make it checksum). And you can have a single zfs filesystem over 100s of nodes with petabytes of data? This thread has had a lot of mixing of single-node filesystems with cluster filesystems, it leads to a lot of confusion. Hadoop has checksums and maybe scrubbing, and the NoSQL database that we wrote at blekko has both plus end-to-end checksums; it's hard to imagine anyone writing a modern storage system without those features. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From cbergstrom at pathscale.com Sat Jul 23 03:28:34 2011 From: cbergstrom at pathscale.com (=?ISO-8859-1?Q?=22C=2E_Bergstr=F6m=22?=) Date: Sat, 23 Jul 2011 14:28:34 +0700 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <20110723055338.GC20531@bx9.net> References: <20110723055338.GC20531@bx9.net> Message-ID: <4E2A7822.80607@pathscale.com> On 07/23/11 12:53 PM, Greg Lindahl wrote: > On Fri, Jul 22, 2011 at 09:05:11AM +0200, Eugen Leitl wrote: > >> Additional advantage of zfs is that it can deal with the higher >> error rate of consumer or nearline SATA disks (though it can do >> nothing against enterprise disk's higher resistance to vibration), >> and also with silent bit rot with periodic scrubbing (you can >> make Linux RAID scrub, but you can't make it checksum). > And you can have a single zfs filesystem over 100s of nodes with > petabytes of data? This thread has had a lot of mixing of single-node > filesystems with cluster filesystems, it leads to a lot of confusion. zfs + lustre would do that, but the OP was comparing against Linux filesystem + RAID. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
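For anyone who has not driven these knobs, the scrub-versus-checksum distinction above is visible directly in the commands. A minimal sketch, assuming a hypothetical zfs pool called tank and a Linux md array md0:

    # zfs: walk every allocated block, verify its checksum, and repair
    # from redundancy where possible; per-device CKSUM counts show damage
    zpool scrub tank
    zpool status -v tank

    # linux md: re-read all members and compare parity/mirror copies;
    # with no checksums it can count mismatches but cannot say which
    # copy was the good one
    echo check > /sys/block/md0/md/sync_action
    cat /sys/block/md0/md/mismatch_cnt

That gap is exactly why end-to-end checksums end up in the filesystem or the application layer on large commodity-disk systems.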
From eugen at leitl.org Sat Jul 23 07:21:44 2011 From: eugen at leitl.org (Eugen Leitl) Date: Sat, 23 Jul 2011 13:21:44 +0200 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <20110723055338.GC20531@bx9.net> References: <20110723055338.GC20531@bx9.net> Message-ID: <20110723112144.GL16178@leitl.org> On Fri, Jul 22, 2011 at 10:53:38PM -0700, Greg Lindahl wrote: > On Fri, Jul 22, 2011 at 09:05:11AM +0200, Eugen Leitl wrote: > > > Additional advantage of zfs is that it can deal with the higher > > error rate of consumer or nearline SATA disks (though it can do > > nothing against enterprise disk's higher resistance to vibration), > > and also with silent bit rot with periodic scrubbing (you can > > make Linux RAID scrub, but you can't make it checksum). > > And you can have a single zfs filesystem over 100s of nodes with > petabytes of data? This thread has had a lot of mixing of single-node I'm not sure how well pNFS (NFS 4.1) can do on top of zfs. Does anybode use this in production? > filesystems with cluster filesystems, it leads to a lot of confusion. > > Hadoop has checksums and maybe scrubbing, and the NoSQL database that > we wrote at blekko has both plus end-to-end checksums; it's hard to > imagine anyone writing a modern storage system without those features. Speaking of which, is there something easy and reliable open source for Linux that scales up to some 100 nodes, on GBit Ethernet? There's plenty mentioned on https://secure.wikimedia.org/wikipedia/en/wiki/List_of_file_systems#Distributed_parallel_fault-tolerant_file_systems but which of them fit above requirement? -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From bob at drzyzgula.org Sat Jul 23 09:13:50 2011 From: bob at drzyzgula.org (Bob Drzyzgula) Date: Sat, 23 Jul 2011 09:13:50 -0400 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <4E285390.6050308@runnersroll.com> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> Message-ID: <20110723131350.GB308@mx1.drzyzgula.org> Getting back to the original question, I will say that I, as I expect most of of did, of course considered these back when the first version came out. However, I rejected them based on a few specific criticisms: 1. The power supplies are not redundant. 2. The fans are nut redundant. 3. The drives are inaccessible without shutting down the system and pulling the whole chassis. For my application (I was building a NAS device, not a simple rsync target) I was also unhappy with the choice of motherboard and other I/O components, but that's a YMMV kind of thing and could easily be improved upon within the same chassis. 
FWIW, for chassis solutions that approach this level of density, but still offer redundant power & cooling as well as hot-swap drive access, Supermicro has a number of designs that are probably worth considering: http://www.supermicro.com/storage/ In the end we built a solution using the 24-drive-in-4U SC848A chassis; we didn't go to the 36-drive boxes because I didn't want to have to compete with the cabling on the back side of the rack to access the drives, and anyway our data center is cooling-constrained and thus we have rack units to spare. We put motherboards in half of them and use the other half in a JBOD configuration. We also used 2TB, 7200 rpm "Enterprise" SAS drives, which actually aren't all that much more expensive. Finally, we used Adaptec SSD-caching SAS controllers. All of this is of course more expensive than the parts in the Backblaze design, but that money all goes toward reliability, manageability and performance, and it still is tremendously cheaper than an enterprise SAN-based solution. Not to say that enterprise SANs don't have their place -- we use them for mission-critical production data -- but there are many applications for which their cost simply is not justified. On 21/07/11 12:28 -0400, Ellis H. Wilson III wrote: > > I have doubts about the manageability of such large data without complex > software sitting above the spinning rust to enable scalability of > performance and recovery of drive failures, which are inevitable at this > scale. Well, yes, from a software perspective this is true, and that's of course where most of the rest of this thread headed, which I did find interesting and useful. But if one assumes appropriate software layers, I think that this remains an interesting hardware design question. > I mean, what is the actual value of this article? They really don't > tell you "how" to build reliable storage at that scale, just a > hand-waving description on how some of the items fit in the box and a > few file-system specifics. The SATA wiring diagram is probably the most > detailed thing in the post and even that leaves a lot of questions to be > answered. Actually I'm not sure you read the whole blog post. They give extensive wiring diagrams for all of it, including detailed documentation of the custom harness for the power supplies. They also give a complete parts list -- down to the last screw -- and links to suppliers for unusual or custom parts as well as full CAD drawings of the chassis, in SolidWorks (a free viewer is available). Not quite sure what else you'd be looking for -- at least from a hardware perspective. I do think that this is an interesting exercise in finding exactly how little hardware you can wrap around some hard drives and still have a functional storage system. And as Backblaze seems to have built a going concern on top of the design it does seem to have its applications. However, I think one has to recognize its limitations and be very careful to not try to push it into applications where the lack of redundancy and manageability are going to come up and bite you on the behind. --Bob _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.
From james.p.lux at jpl.nasa.gov Sat Jul 23 11:12:29 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Sat, 23 Jul 2011 08:12:29 -0700 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <20110723131350.GB308@mx1.drzyzgula.org> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com>, <20110723131350.GB308@mx1.drzyzgula.org> Message-ID: ________________________________________ From: beowulf-bounces at beowulf.org [beowulf-bounces at beowulf.org] On Behalf Of Bob Drzyzgula [bob at drzyzgula.org] I do think that this is an interesting exercise in finding exactly how little hardware you can wrap around some hard drives and still have a functional storage system. And as Backblaze seems to have built a going concern on top of the design it does seem to have its applications. However, I think one has to recognize its limitations and be very careful to not try to push it into applications where the lack of redundancy and manageability are going to come up and bite you on the behind. --Bob _ Yes... Designing and using a giant system with perfectly reliable hardware (or something that simulates perfectly reliable hardware at some abstraction level) is a straightforward process. Designing something where you explicitly acknowledge failures and it still works, while not resorting to the software equivalent of Triple Modular Redundancy (and similar schemes), and which has good performance, is a much, much more interesting problem. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From john.hearns at mclaren.com Mon Jul 25 09:20:45 2011 From: john.hearns at mclaren.com (Hearns, John) Date: Mon, 25 Jul 2011 14:20:45 +0100 Subject: [Beowulf] PetaBytes on a budget, take 2 References: <20110721081836.GD16178@leitl.org><56629.192.168.93.213.1311263128.squirrel@mail.eadline.org><20110721160951.GN16178@leitl.org><4E285390.6050308@runnersroll.com>, <20110723131350.GB308@mx1.drzyzgula.org> Message-ID: <207BB2F60743C34496BE41039233A80906A53D5D@MRL-PWEXCHMB02.mil.tagmclarengroup.com> > _ > > Yes... Designing and using a giant system with perfectly reliable > hardware (or something that simulates perfectly reliable hardware at > some abstraction level) is a straightforward process. > I think that is well worth a debate. As Bob points out, these storage arrays do not have redundant hot-swap PSUs, or hot swap disks. They're made for the market for large volume online archiving - where you would imagine that there are two or more copies of data in geographically separate locations. Is it time to start looking at the cost/benefit analysis for 'prime' storage to have the same scheme? Don't get me wrong - I have experienced the benefits of N+1 hot swap PSUs and RAID hot swap disks on many an occasion, as has everyone here. There's nothing more satisfying than getting an email over the weekend about a popped disk on an array and being able to flop back onto the sofa. The contents of this email are confidential and for the exclusive use of the intended recipient.
If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From samuel at unimelb.edu.au Tue Jul 26 01:43:19 2011 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 26 Jul 2011 15:43:19 +1000 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: References: Message-ID: <4E2E53F7.3040208@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 23/07/11 06:46, David Mathog wrote: > All of which is a bit of a straw man. The best way to maintain archival > data over long periods of time is to periodically migrate it to newer > storage technology, and in between migrations, to test read the archives > periodically so as to detect unforeseen longevity issues early, while > there is still a chance to recover the data. If you are interested in how the National Archives of Australia handles digital archiving then there is information (primarily software & format focused) here: http://www.naa.gov.au/records-management/preserve/e-preservation/at-NAA/index.aspx But they do say on one of the pages below that: # When selecting the hardware and systems for its digital # preservation prototype, the National Archives avoided # relying on any single vendor or technology. In so doing, # we have enhanced our ability to deal with hardware # obsolescence. Technology is our enabler, but we don't want # it to be our driver. Conceptually, the digital repository # is one system, but it comprises two independent systems # running simultaneously, with different operating systems # and hardware. # # By operating with redundancy, we are future-proofing our # system. In the event of an operational flaw in any one # operating system, disk technology or vendor, an alternative # is available, so the risk to data is lower. - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk4uU/cACgkQO2KABBYQAh+t0gCfWKLfIv++/x7iK1XLVGgpDIpM QkkAoIbCKLB2uL4U0QimO1l7WYPOQ5gK =nC+k -----END PGP SIGNATURE----- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.
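[The advice quoted above -- periodically test-read the archives -- and the NAA's two-independent-systems approach both boil down to re-reading copies and comparing them. A small hypothetical sketch follows; the /archive/copy1 and /archive/copy2 paths are placeholders and this is not anything the NAA actually runs.]

# Walk two independent replicas, checksum every file, report gaps and mismatches.
import hashlib
import os

def manifest(root):
    """Map relative path -> SHA-256 for every file under root."""
    sums = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            h = hashlib.sha256()
            with open(full, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            sums[rel] = h.hexdigest()
    return sums

def compare_replicas(copy_a, copy_b):
    """Return files missing from one copy, and files that differ between copies."""
    a, b = manifest(copy_a), manifest(copy_b)
    missing = sorted(set(a) ^ set(b))
    differing = sorted(p for p in set(a) & set(b) if a[p] != b[p])
    return missing, differing

if __name__ == "__main__":
    missing, differing = compare_replicas("/archive/copy1", "/archive/copy2")
    print("missing from one replica:", missing)
    print("checksum mismatches:", differing)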
From eugen at leitl.org Thu Jul 7 10:13:05 2011 From: eugen at leitl.org (Eugen Leitl) Date: Thu, 7 Jul 2011 16:13:05 +0200 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee Message-ID: <20110707141305.GJ16178@leitl.org> http://www.techeye.net/chips/one-million-arm-chips-challenge-intel-bumblebee One million ARM chips challenge Intel bumblebee Manchester works toward human brain 07 Jul 2011 13:25 | by Matthew Finnegan in London | posted in Chips One million ARM chips challenge Intel bumblebee - A project to replicate the workings of the human brain has received a boost with the delivery of one million ARM processors. While Intel has its sights set on reaching bumble bee brain level in the near future, it seems its rival is involved in one further. Scientists at the University of Manchester will link together the ARM chips as the system architecture of a massive computer, dubbed SpiNNaker, or Spiking Neural Network architecture. Despite the mass of chips it will only be possible to recreate models of up to one percent of the human brain. The chips have arrived and are past functionality testing.
A similar experiment was once attempted with a load of old Centrino chips found at the back of our stationery cupboard, though so far we haven't even managed to replicate the cranial workings of a particularly slow slug. The work, headed up by Professor Steve Furber, has the potential to become a revolutionary tool for neuroscientists and psychologists in understanding how our brains work. SpiNNaker will attempt to replicate the workings of the 100 billion neurons and the 1,000 million connections that are used to create high connectivity in cells. SpiNNaker will model the electric signals that neurons emit, with each impulse modelled as a "packet" of data, similar to the way that information is transferred over the internet. The packet is sent to other neurons, represented by small equations solved in real time by ARM processors. The chips, designed in Manchester and built in Taiwan, each contain 18 ARM processors. The bespoke 18 core chips are able to provide the computing power of a personal computer in a fraction of the space, using just one watt of power. Now that the chips have arrived it will be possible to get cracking on building the model. "The project revolves around getting the chips made, which has taken the past five years to get right," Professor Steve Furber told TechEye. "We will now be increasing the scale of the project over the next 18 months before it reaches its final form, with one million processors used. We already have the system working on a smaller scale, and we are able to look at fifty to sixty thousand neurons currently." As well as offering possibilities as a scientific research tool, Furber hopes that the system will help pave the way for computational advancements too. "It will help to analyse the intermediate levels of the brain, which are very difficult to focus on otherwise," he says. "Another area where this will help is in building more reliable computing systems. As chip manufacturers continue towards the end of Moore's Law, transistors will become increasingly unreliable. And computer systems are very susceptible to malfunctioning transistors." Furber says biology works differently. "Biology, on the other hand, reacts to the malfunctioning of neurons very well, with it happening regularly with all brains, so this could help future chips become more reliable." Of course, we also wanted to know how this all compares with Intel's famous bumblebee claims. Unfortunately, Professor Furber couldn't specifically help us with information about bumblebee brain processing. He was, however, able to reel off some details about the honeybee. "The honeybee brain has around 850,000 neurons so we will be able to reach that level of processing in the next few months. Of course, we don't have a honeybee brain model to run, but we will have the computing power." Over to you, Intel. Read more: http://www.techeye.net/chips/one-million-arm-chips-challenge-intel-bumblebee#ixzz1RQe7GYoL _______________________________________________ tt mailing list tt at postbiota.org http://postbiota.org/mailman/listinfo/tt _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.
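[Purely for a sense of scale, the article's own figures can be lined up in a few lines of Python; nothing below is independently verified, it is just the numbers quoted above (a million cores in total, 18 cores and roughly one watt per chip, a honeybee at about 850,000 neurons), and the linear power scaling is an assumption.]

cores_total = 1000000        # "one million ARM processors" (cores)
cores_per_chip = 18          # "each contain 18 ARM processors"
watts_per_chip = 1.0         # "using just one watt of power" per chip (article's figure)

chips = cores_total / cores_per_chip
power_kw = chips * watts_per_chip / 1000.0   # assumes power scales linearly with chip count

human_neurons = 100e9        # "100 billion neurons"
fraction_modelled = 0.01     # "up to one percent of the human brain"
honeybee_neurons = 850000    # "around 850,000 neurons"

print("chips needed:          %.0f" % chips)        # roughly 55,600 chips
print("power at full scale:   %.1f kW" % power_kw)  # roughly 56 kW
print("neurons to be modelled: %.0e" % (human_neurons * fraction_modelled))
print("honeybee brain:         %d neurons" % honeybee_neurons)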
From prentice at ias.edu Thu Jul 7 11:36:02 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 07 Jul 2011 11:36:02 -0400 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: <20110707141305.GJ16178@leitl.org> References: <20110707141305.GJ16178@leitl.org> Message-ID: <4E15D262.90209@ias.edu> On 07/07/2011 10:13 AM, Eugen Leitl wrote: > > http://www.techeye.net/chips/one-million-arm-chips-challenge-intel-bumblebee > > One million ARM chips challenge Intel bumblebee > Now say it like Dr. Evil: one MILLION processors. How long is it going to take to wire them all up? And how fast are they going to fail? If there's a MTBF of one million hours, that's still one failure per hour. Should be interesting. -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Thu Jul 7 12:31:24 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 7 Jul 2011 09:31:24 -0700 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: <4E15D262.90209@ias.edu> References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> Message-ID: > On 07/07/2011 10:13 AM, Eugen Leitl wrote: > > > > http://www.techeye.net/chips/one-million-arm-chips-challenge-intel-bumblebee > > > > One million ARM chips challenge Intel bumblebee > > > > Now say it like Dr. Evil: one MILLION processors. > > > How long is it going to take to wire them all up? And how fast are they > going to fail? If there's a MTBF of one million hours, that's still one > failure per hour. But this presents a very interesting design challenge.. when you get to this sort of scale, you have to assume that at any time, some of them are going to be dead or dying. Just like google's massively parallel database engines.. It's all about ultimate scalability. Anybody with a moderate competence (certainly anyone on this list) could devise a scheme to use 1000 perfect processors that never fail to do 1000 quanta of work in unit time. It's substantially more challenging to devise a scheme to do 1000 quanta of work in unit time on, say, 1500 processors with a 20% failure rate. Or even in 1.2*unit time. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From eugen at leitl.org Thu Jul 7 12:56:29 2011 From: eugen at leitl.org (Eugen Leitl) Date: Thu, 7 Jul 2011 18:56:29 +0200 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> Message-ID: <20110707165629.GL16178@leitl.org> On Thu, Jul 07, 2011 at 09:31:24AM -0700, Lux, Jim (337C) wrote: > > On 07/07/2011 10:13 AM, Eugen Leitl wrote: > > > > > > http://www.techeye.net/chips/one-million-arm-chips-challenge-intel-bumblebee > > > > > > One million ARM chips challenge Intel bumblebee > > > > > > > Now say it like Dr. Evil: one MILLION processors. > > > > > > How long is it going to take to wire them all up? 
And how fast are they > > going to fail? If there's a MTBF of one million hours, that's still one > > failure per hour. > > > But this presents a very interesting design challenge.. when you get to this sort of scale, you have to assume that at any time, some of them are going to be dead or dying. Just like google's massively parallel database engines.. > > It's all about ultimate scalability. Anybody with a moderate competence (certainly anyone on this list) could devise a scheme to use 1000 perfect processors that never fail to do 1000 quanta of work in unit time. It's substantially more challenging to devise a scheme to do 1000 quanta of work in unit time on, say, 1500 processors with a 20% failure rate. Or even in 1.2*unit time. They do address some of that in ftp://ftp.cs.man.ac.uk/pub/amulet/papers/SBF_ACSD09.pdf It's also specific to neural emulation. These should tolerate pretty huge error rates without fouling up the qualitative system behaviour they're trying to model. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From Daniel.Pfenniger at unige.ch Thu Jul 7 13:05:22 2011 From: Daniel.Pfenniger at unige.ch (Daniel Pfenniger) Date: Thu, 07 Jul 2011 19:05:22 +0200 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: <4E15D262.90209@ias.edu> References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> Message-ID: <4E15E752.8080907@unige.ch> Prentice Bisbal wrote: > > On 07/07/2011 10:13 AM, Eugen Leitl wrote: >> >> http://www.techeye.net/chips/one-million-arm-chips-challenge-intel-bumblebee >> >> One million ARM chips challenge Intel bumblebee >> > > Now say it like Dr. Evil: one MILLION processors. > > > How long is it going to take to wire them all up? And how fast are they > going to fail? If there's a MTBF of one million hours, that's still one > failure per hour. > > Should be interesting. The real challenge is they want to simulate a human brain (~10^11 neurons, 10^14-10^15 synapses) with so few processors. I guess in any real brain many neurons and even more synapses are permanently out of order... Dan _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.
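[The one-failure-per-hour figure quoted above is just the pool-wide rate N/MTBF, under the usual simplification of a constant per-processor failure rate. A few lines of arithmetic also reproduce the five-year number that comes up later in the thread; the 1,000,000-hour MTBF is the thread's assumption, not a measured value.]

n_procs = 1000000          # one MILLION processors
mtbf_hours = 1000000.0     # assumed per-processor MTBF, from the post above

failures_per_hour = n_procs / mtbf_hours
hours_in_5_years = 5 * 365 * 24

print("expected failures per hour: %.1f" % failures_per_hour)                        # ~1.0
print("expected failures in 5 yrs: %.0f" % (failures_per_hour * hours_in_5_years))   # 43,800
# 43,800 is the same figure Ellis arrives at further down the thread,
# i.e. a bit under 5% of the million cores over five years.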
From james.p.lux at jpl.nasa.gov Thu Jul 7 13:05:55 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 7 Jul 2011 10:05:55 -0700 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: <4E15E752.8080907@unige.ch> References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> <4E15E752.8080907@unige.ch> Message-ID: > -----Original Message----- > Subject: Re: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee > > > The real challenge is they want to simulate a human brain > (~10^11 neurons, 10^14-10^15 synapses) with so few processors. > I guess in any real brain many neurons and even more synapses are > permanently out of order... Kind of depends on the time of day after the conference bird of a feather session gets started, eh? _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Thu Jul 7 13:17:30 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 07 Jul 2011 13:17:30 -0400 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> Message-ID: <4E15EA2A.9010200@ias.edu> On 07/07/2011 12:31 PM, Lux, Jim (337C) wrote: >> On 07/07/2011 10:13 AM, Eugen Leitl wrote: >>> >>> http://www.techeye.net/chips/one-million-arm-chips-challenge-intel-bumblebee >>> >>> One million ARM chips challenge Intel bumblebee >>> >> >> Now say it like Dr. Evil: one MILLION processors. >> >> >> How long is it going to take to wire them all up? And how fast are they >> going to fail? If there's a MTBF of one million hours, that's still one >> failure per hour. > > > But this presents a very interesting design challenge.. when you get to this sort of scale, you have to assume that at any time, some of them are going to be dead or dying. Just like google's massively parallel database engines.. > > It's all about ultimate scalability. Anybody with a moderate competence (certainly anyone on this list) could devise a scheme to use 1000 perfect processors that never fail to do 1000 quanta of work in unit time. It's substantially more challenging to devise a scheme to do 1000 quanta of work in unit time on, say, 1500 processors with a 20% failure rate. Or even in 1.2*unit time. > Just to be clear - I wasn't saying this was a bad idea. Scaling up to this size seems inevitable. I was just imagining the team of admins who would have to be working non-stop to replace dead processors! I wonder what the architecture for this system will be like. I imagine it will be built around small multi-socket blades that are hot-swappable to handle this. Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
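[Jim's challenge -- 1000 quanta of work on 1500 processors with a 20% failure rate -- can at least be played with in a toy simulation. The sketch below assumes an idealized central queue that re-issues lost work and ignores communication costs entirely, which is exactly the part that makes the real problem hard; it only shows how much redundant re-execution the failures cost, not how to hide it inside 1.2*unit time.]

import random

def run(quanta=1000, procs=1500, failure_rate=0.20, seed=1):
    """Re-issue lost work until everything finishes; return (rounds, executions)."""
    random.seed(seed)
    pending = list(range(quanta))
    rounds = executions = 0
    while pending:
        rounds += 1
        # processors that survive this round
        alive = [p for p in range(procs) if random.random() > failure_rate]
        # hand out at most one quantum per surviving processor
        issued, pending = pending[:len(alive)], pending[len(alive):]
        executions += len(issued)
        # a quantum whose processor dies mid-run goes back on the queue
        pending.extend(q for q in issued if random.random() < failure_rate)
    return rounds, executions

if __name__ == "__main__":
    rounds, executions = run()
    print("rounds of (re)issue:", rounds)                    # a handful, not one
    print("quanta executed, including rework:", executions)  # roughly 25% more than 1000

[With a 20% loss rate the extra work converges to about 1/(1-0.2) = 1.25x the original quanta; the spare 500 processors absorb most of it, and the genuinely hard part is overlapping the re-issued work so the extra rounds do not show up as extra wall-clock time.]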
From james.p.lux at jpl.nasa.gov Thu Jul 7 14:26:04 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 7 Jul 2011 11:26:04 -0700 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: <4E15EA2A.9010200@ias.edu> References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> <4E15EA2A.9010200@ias.edu> Message-ID: > > It's all about ultimate scalability. Anybody with a moderate competence (certainly anyone on this > list) could devise a scheme to use 1000 perfect processors that never fail to do 1000 quanta of work > in unit time. It's substantially more challenging to devise a scheme to do 1000 quanta of work in > unit time on, say, 1500 processors with a 20% failure rate. Or even in 1.2*unit time. > > > > Just to be clear - I wasn't saying this was a bad idea. Scaling up to > this size seems inevitable. I was just imagining the team of admins who > would have to be working non-stop to replace dead processors! > > I wonder what the architecture for this system will be like. I imagine > it will be built around small multi-socket blades that are hot-swappable > to handle this. I think that you just anticipate the failures and deal with them. It's challenging to write code to do this, but it's certainly a worthy objective. I can easily see a situation where the cost to replace dead units is so high that you just don't bother doing it: it's cheaper to just add more live ones to the "pool". _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From ellis at runnersroll.com Thu Jul 7 15:25:35 2011 From: ellis at runnersroll.com (Ellis H. Wilson III) Date: Thu, 07 Jul 2011 15:25:35 -0400 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> <4E15EA2A.9010200@ias.edu> Message-ID: <4E16082F.5060700@runnersroll.com> On 07/07/11 14:26, Lux, Jim (337C) wrote: >>> It's all about ultimate scalability. Anybody with a moderate competence (certainly anyone on this >> list) could devise a scheme to use 1000 perfect processors that never fail to do 1000 quanta of work >> in unit time. It's substantially more challenging to devise a scheme to do 1000 quanta of work in >> unit time on, say, 1500 processors with a 20% failure rate. Or even in 1.2*unit time. >>> >> >> Just to be clear - I wasn't saying this was a bad idea. Scaling up to >> this size seems inevitable. I was just imagining the team of admins who >> would have to be working non-stop to replace dead processors! >> >> I wonder what the architecture for this system will be like. I imagine >> it will be built around small multi-socket blades that are hot-swappable >> to handle this. > > I think that you just anticipate the failures and deal with them. It's challenging to write code to do this, but it's certainly a worthy objective. I can easily see a situation where the cost to replace dead units is so high that you just don't bother doing it: it's cheaper to just add more live ones to the "pool". Or rather than replace or add to the pool, perhaps just allow the ones that die to just, well, stay dead. 
The issue with things at this scale is that unlike the individual or smallish business there are very few good reasons to not upgrade every, say, 3 to 5 years. The costs involved in having spare CPUs sitting around waiting to be swapped in, the maintenance of having administrators replacing stuff and any potential downtime replacements require seem at first glance to outweigh the elegance of "letting nature take its course" with the supercomputer. For instance, if Prentice's MTBF of 1 million hours is realistic (I personally have no idea if it is), then that's "only" 43,800 failed CPUs by the end of year 5. That's less than 5% of the total capacity - i.e. not a big deal if this system can truly tolerate and route around failures as our brains do. Perhaps they could study old and/or drug-abusing bees at that stage, hehe. Just my 2 wampum, ellis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Thu Jul 7 15:38:34 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 07 Jul 2011 15:38:34 -0400 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> <4E15EA2A.9010200@ias.edu> Message-ID: <4E160B3A.1030009@ias.edu> On 07/07/2011 02:26 PM, Lux, Jim (337C) wrote: >>> It's all about ultimate scalability. Anybody with a moderate competence (certainly anyone on this >> list) could devise a scheme to use 1000 perfect processors that never fail to do 1000 quanta of work >> in unit time. It's substantially more challenging to devise a scheme to do 1000 quanta of work in >> unit time on, say, 1500 processors with a 20% failure rate. Or even in 1.2*unit time. >>> >> >> Just to be clear - I wasn't saying this was a bad idea. Scaling up to >> this size seems inevitable. I was just imagining the team of admins who >> would have to be working non-stop to replace dead processors! >> >> I wonder what the architecture for this system will be like. I imagine >> it will be built around small multi-socket blades that are hot-swappable >> to handle this. > > > > I think that you just anticipate the failures and deal with them. It's challenging to write code to do this, but it's certainly a worthy objective. I can easily see a situation where the cost to replace dead units is so high that you just don't bother doing it: it's cheaper to just add more live ones to the "pool". > Did you read the paper that someone else posted a link to? I just read the first half of it. A good part of this research is focused on fault-tolerance/resiliency of computer systems. They're not just interested in creating a computer to mimic the brain, they want to learn how to mimic the brain's fault-tolerance in computers. To paraphrase the paper, we lose a neuron a second in our brains for our entire lives, but we never notice any problems from that. This research hopes to learn how to duplicate that with this computer, so you could say hardware failures are desirable and necessary for this research.
Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From ntmoore at gmail.com Thu Jul 7 18:22:47 2011 From: ntmoore at gmail.com (Nathan Moore) Date: Thu, 7 Jul 2011 17:22:47 -0500 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: <4E160B3A.1030009@ias.edu> References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> <4E15EA2A.9010200@ias.edu> <4E160B3A.1030009@ias.edu> Message-ID: Some of these "decay over time" questions have been worked on in the context of detector design in high energy physics. Everything in the big detectors needs to be radiation hard... On Thu, Jul 7, 2011 at 2:38 PM, Prentice Bisbal wrote: > On 07/07/2011 02:26 PM, Lux, Jim (337C) wrote: > >>> It's all about ultimate scalability. Anybody with a moderate > competence (certainly anyone on this > >> list) could devise a scheme to use 1000 perfect processors that never > fail to do 1000 quanta of work > >> in unit time. It's substantially more challenging to devise a scheme to > do 1000 quanta of work in > >> unit time on, say, 1500 processors with a 20% failure rate. Or even in > 1.2*unit time. > >>> > >> > >> Just to be clear - I wasn't saying this was a bad idea. Scaling up to > >> this size seems inevitable. I was just imagining the team of admins who > >> would have to be working non-stop to replace dead processors! > >> > >> I wonder what the architecture for this system will be like. I imagine > >> it will be built around small multi-socket blades that are hot-swappable > >> to handle this. > > > > > > > > I think that you just anticipate the failures and deal with them. It's > challenging to write code to do this, but it's certainly a worthy objective. > I can easily see a situation where the cost to replace dead units is so high > that you just don't bother doing it: it's cheaper to just add more live ones > to the "pool". > > > > Did you read the paper that someone else posted a link to? I just read > the first half of it. A good part of this research is focused on > fault-tolerance/resiliency of computer systems. They're not just > interested in creating a computer to mimic the brain, they want to learn > how to mimic the brain's fault-tolerance in computers. > > To paraphrase the paper, we lose a neuron a second in our brains for our > entire lives, but we never notice any problems from that. This research > hopes to learn how to duplicate with that this computer, so you could > say hardware failures are desirable and necessary for this research. > > Prentice > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- - - - - - - - - - - - - - - - - - - - - - Nathan Moore Associate Professor, Physics Winona State University - - - - - - - - - - - - - - - - - - - - - -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mathog at caltech.edu Fri Jul 8 17:33:27 2011 From: mathog at caltech.edu (David Mathog) Date: Fri, 08 Jul 2011 14:33:27 -0700 Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee Message-ID: "Ellis H. Wilson III" wrote > For instance, if Prentice's MTBF of 1 million hours is realistic (I > personally have no idea if it is), then that's "only" 43,800 CPUs by the > end of year 5. That's less than 5% of the total capacity - i.e. not a > big deal if this system can truly tolerate and route around failures as > our brains do. Perhaps they could study old and/or drug abusing bees at > that stage, hehe. Serendipitously, 5 years is roughly the maximum life span of a queen bee. However, it is a pretty safe bet that if they are emulating a bee brain, it is a worker bee brain, and a worker bee is lucky if it lives a couple of months. Did it say anywhere that the emulation was real time? Very common for emulations to run orders of magnitude slower than real time, so processor loss could still be an issue during runs. David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From deadline at eadline.org Fri Jul 8 23:33:55 2011 From: deadline at eadline.org (Douglas Eadline) Date: Fri, 8 Jul 2011 23:33:55 -0400 (EDT) Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> <4E15EA2A.9010200@ias.edu> Message-ID: <58977.68.83.96.23.1310182435.squirrel@mail.eadline.org> >> > It's all about ultimate scalability. Anybody with a moderate >> competence (certainly anyone on this >> list) could devise a scheme to use 1000 perfect processors that never >> fail to do 1000 quanta of work >> in unit time. It's substantially more challenging to devise a scheme to >> do 1000 quanta of work in >> unit time on, say, 1500 processors with a 20% failure rate. Or even in >> 1.2*unit time. >> > >> >> Just to be clear - I wasn't saying this was a bad idea. Scaling up to >> this size seems inevitable. I was just imagining the team of admins who >> would have to be working non-stop to replace dead processors! >> >> I wonder what the architecture for this system will be like. I imagine >> it will be built around small multi-socket blades that are hot-swappable >> to handle this. > > > > I think that you just anticipate the failures and deal with them. It's > challenging to write code to do this, but it's certainly a worthy > objective. I can easily see a situation where the cost to replace dead > units is so high that you just don't bother doing it: it's cheaper to just > add more live ones to the "pool". I wrote about the programming issue in a series of three articles (conjecture, never really tried it, if only I had the time ...). The first article links (at the end) to the other two. 
http://www.clustermonkey.net//content/view/158/28/ And yes, disposable "nodes" just like a failed cable in a large cluster, route a new one, don't worry about unbundling a huge cable tree. I assume there will be a high level of integration so there may be "nodes" are left for dead which are integrated into a much larger blade. -- Doug > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From deadline at eadline.org Fri Jul 8 23:42:28 2011 From: deadline at eadline.org (Douglas Eadline) Date: Fri, 8 Jul 2011 23:42:28 -0400 (EDT) Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: <4E160B3A.1030009@ias.edu> References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> <4E15EA2A.9010200@ias.edu> <4E160B3A.1030009@ias.edu> Message-ID: <35078.68.83.96.23.1310182948.squirrel@mail.eadline.org> --snip-- > > Did you read the paper that someone else posted a link to? I just read > the first half of it. A good part of this research is focused on > fault-tolerance/resiliency of computer systems. They're not just > interested in creating a computer to mimic the brain, they want to learn > how to mimic the brain's fault-tolerance in computers. > > To paraphrase the paper, we lose a neuron a second in our brains for our > entire lives, but we never notice any problems from that. I know some people from the sixties that may beg to differ ;-) -- Doug > Prentice > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From raysonlogin at gmail.com Mon Jul 11 16:34:34 2011 From: raysonlogin at gmail.com (Rayson Ho) Date: Mon, 11 Jul 2011 16:34:34 -0400 Subject: [Beowulf] Grid Engine multi-core thread binding enhancement -pre-alpha release In-Reply-To: References: <26FE8986BC6B4E56B7E9EABB63E10A38@pccarlosf2> <4DA5E85D.4010801@ats.ucla.edu> Message-ID: We are (beta) releasing a drop-in package for SGE6.2u5, SGE6.2u5p1, and SGE6.2u5p2 for thread-binding: http://gridscheduler.sourceforge.net/projects/hwloc/GridEnginehwloc.html Mainly tested on Intel boxes -- would be great if AMD Magny-Cours server owners offer help with testing! (Play it safe -- setup a 1 or 2-node test cluster by using the non-standard SGE TCP ports). Thanks! 
Rayson On Mon, Apr 18, 2011 at 2:26 PM, Rayson Ho wrote: > For those who had issues with earlier version, please try the latest > loadcheck v4: > > http://gridscheduler.sourceforge.net/projects/hwloc/GridEnginehwloc.html > > I compiled the binary on Oracle Linux, which is compatible with RHEL > 5.x, Scientific Linux or Centos 5.x. I tested the binary on the > standard Red Hat kernel, and Oracle enhanced "Unbreakable Enterprise > Kernel", Fedora 13, Ubuntu 10.04 LTS. > > Optimizing for AMD's NUMA machine characteristics is on the ToDo list. > > Rayson > > > > On Wed, Apr 13, 2011 at 2:15 PM, Prakashan Korambath wrote: >> Hi Rayson, >> >> Do you have a statically linked version? Thanks. >> >> ./loadcheck: /lib64/libc.so.6: version `GLIBC_2.7' not found (required by >> ./loadcheck) >> >> Prakashan >> >> >> >> On 04/13/2011 09:21 AM, Rayson Ho wrote: >>> >>> Carlos, >>> >>> I notice that you have "lx24-amd64" instead of "lx26-amd64" for the >>> arch string, so I believe you are running the loadcheck from standard >>> Oracle Grid Engine, Sun Grid Engine, or one of the forks instead of >>> the one from the Open Grid Scheduler page. >>> >>> The existing Grid Engine (including the latest Open Grid Scheduler >>> releases: SGE 6.2u5p1& ?SGE 6.2u5p2, or Univa's fork) uses PLPA, and >>> it is known to be wrong on magny-cours. >>> >>> (i.e. SGE 6.2u5p1& ?SGE 6.2u5p2 from: >>> http://sourceforge.net/projects/gridscheduler/files/ ) >>> >>> >>> Chansup on the Grid Engine mailing list (it's the general purpose Grid >>> Engine mailing list for now) tested the version I uploaded last night, >>> and seems to work on a dual-socket magny-cours AMD machine. It prints: >>> >>> m_topology ? ? ?SCCCCCCCCCCCCSCCCCCCCCCCCC >>> >>> However, I am still fixing the processor, core id mapping code: >>> >>> http://gridengine.org/pipermail/users/2011-April/000629.html >>> http://gridengine.org/pipermail/users/2011-April/000628.html >>> >>> I compiled the hwloc enabled loadcheck on kernel 2.6.34& ?glibc 2.12, >>> so it may not work on machines running lower kernel or glibc versions, >>> you can download it from: >>> >>> http://gridscheduler.sourceforge.net/projects/hwloc/GridEnginehwloc.html >>> >>> Rayson >>> >>> >>> >>> On Wed, Apr 13, 2011 at 3:03 AM, Carlos Fernandez Sanchez >>> ?wrote: >>>> >>>> This is the output of a 2 sockets, 12 cores/socket (magny-cours) AMD >>>> system >>>> (and seems to be wrong!): >>>> >>>> arch ? ? ? ? ? ?lx24-amd64 >>>> num_proc ? ? ? ?24 >>>> m_socket ? ? ? ?2 >>>> m_core ? ? ? ? ?12 >>>> m_topology ? ? ?SCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTT >>>> load_short ? ? ?0.29 >>>> load_medium ? ? 0.13 >>>> load_long ? ? ? 0.04 >>>> mem_free ? ? ? ?26257.382812M >>>> swap_free ? ? ? 8191.992188M >>>> virtual_free ? ?34449.375000M >>>> mem_total ? ? ? 32238.328125M >>>> swap_total ? ? ?8191.992188M >>>> virtual_total ? 40430.320312M >>>> mem_used ? ? ? ?5980.945312M >>>> swap_used ? ? ? 0.000000M >>>> virtual_used ? ?5980.945312M >>>> cpu ? ? ? ? ? ? 0.0% >>>> >>>> >>>> Carlos Fernandez Sanchez >>>> Systems Manager >>>> CESGA >>>> Avda. de Vigo s/n. Campus Vida >>>> Tel.: (+34) 981569810, ext. 
232 >>>> 15705 - Santiago de Compostela >>>> SPAIN >>>> >>>> -------------------------------------------------- >>>> From: "Rayson Ho" >>>> Sent: Tuesday, April 12, 2011 10:31 PM >>>> To: "Beowulf List" >>>> Subject: [Beowulf] Grid Engine multi-core thread binding enhancement >>>> -pre-alpha release >>>> >>>>> If you are using the "Job to Core Binding" feature in SGE and running >>>>> SGE on newer hardware, then please give the new hwloc enabled >>>>> loadcheck a try. >>>>> >>>>> http://gridscheduler.sourceforge.net/projects/hwloc/GridEnginehwloc.html >>>>> >>>>> The current hardware topology discovery library (Portable Linux >>>>> Processor Affinity - PLPA) used by SGE was deprecated in 2009, and new >>>>> hardware topology may not be detected correctly by PLPA. >>>>> >>>>> If you are running SGE on AMD Magny-Cours servers, please post your >>>>> loadcheck output, as it is known to be wrong when handled by PLPA. >>>>> >>>>> The Open Grid Scheduler is migrating to hwloc -- we will ship hwloc >>>>> support in later releases of Grid Engine / Grid Scheduler. >>>>> >>>>> http://gridscheduler.sourceforge.net/ >>>>> >>>>> Thanks!! >>>>> Rayson >>>>> _______________________________________________ >>>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >>>>> To change your subscription (digest mode or unsubscribe) visit >>>>> http://www.beowulf.org/mailman/listinfo/beowulf >>>> >>>> >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >> > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Mon Jul 11 23:39:46 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 11 Jul 2011 23:39:46 -0400 (EDT) Subject: [Beowulf] Grid Engine multi-core thread binding enhancement -pre-alpha release In-Reply-To: References: <26FE8986BC6B4E56B7E9EABB63E10A38@pccarlosf2> <4DA5E85D.4010801@ats.ucla.edu> Message-ID: > http://gridscheduler.sourceforge.net/projects/hwloc/GridEnginehwloc.html since this isn't an SGE list, I don't want to pursue an off-topic too far, but out of curiosity, does this make the scheduler topology aware? that is, not just topo-aware binding, but topo-aware resource allocation? you know, avoid unnecessary resource contention among the threads belonging to multiple jobs that happen to be on the same node. large-memory processes not getting bound to a single memory node. packing both small and large-memory processes within a node. etc? thanks, mark hahn. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
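[The topology awareness being asked about here is, at the lowest level, information the kernel already exports. The sketch below is nothing like what SGE, PLPA or hwloc actually do -- it just reads the Linux sysfs topology files and pins the current process to one socket's cores with sched_setaffinity(), to show the raw ingredients a topology-aware scheduler builds on. It is Linux-only, and os.sched_setaffinity requires a reasonably recent Python.]

import os
from collections import defaultdict

def cores_by_package():
    """Group logical CPU ids by the physical socket they live on (via sysfs)."""
    packages = defaultdict(list)
    cpu_dir = "/sys/devices/system/cpu"
    for entry in sorted(os.listdir(cpu_dir)):
        topo = os.path.join(cpu_dir, entry, "topology", "physical_package_id")
        if entry.startswith("cpu") and entry[3:].isdigit() and os.path.exists(topo):
            with open(topo) as f:
                packages[int(f.read())].append(int(entry[3:]))
    return dict(packages)

def bind_to_one_package(pid=0):
    """Pin the calling process to all cores of the first socket (compact placement)."""
    packages = cores_by_package()
    first = packages[sorted(packages)[0]]
    os.sched_setaffinity(pid, set(first))   # pid 0 = the calling process
    return first

if __name__ == "__main__":
    print("socket -> cores:", cores_by_package())
    print("bound to cores:", bind_to_one_package())

[A real scheduler additionally has to decide *which* socket or cache domain to hand to which job -- compact versus spread placement, and memory-node binding via numactl or the NUMA syscalls -- which is the allocation question raised above, not just the binding mechanics.]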
From kilian.cavalotti.work at gmail.com Tue Jul 12 05:16:26 2011 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Tue, 12 Jul 2011 11:16:26 +0200 Subject: [Beowulf] HPC storage purchasing process Message-ID: An interesting insider view of the procurement process that occurred for the purchase of an HPC storage system at CHPC http://www.hpcwire.com/hpcwire/2011-07-08/hpc_center_traces_storage_selection_experience.html Cheers, -- Kilian _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From raysonlogin at gmail.com Tue Jul 12 16:12:14 2011 From: raysonlogin at gmail.com (Rayson Ho) Date: Tue, 12 Jul 2011 16:12:14 -0400 Subject: [Beowulf] Grid Engine multi-core thread binding enhancement -pre-alpha release In-Reply-To: References: <26FE8986BC6B4E56B7E9EABB63E10A38@pccarlosf2> <4DA5E85D.4010801@ats.ucla.edu> Message-ID: On Mon, Jul 11, 2011 at 11:39 PM, Mark Hahn wrote: >> http://gridscheduler.sourceforge.net/projects/hwloc/GridEnginehwloc.html > > since this isn't an SGE list, I don't want to pursue an off-topic too far, Hi Mark, I think a lot of this will apply to non-SGE batch schedulers -- in fact Torque will support hwloc in a future release. And all mature batch systems (e.g. LSF, SGE, SLURM) have had some sort of CPU set support for many years, but now this feature is more important, as the interaction of different hardware layers impacts performance more as more cores are added per socket. > but out of curiosity, does this make the scheduler topology aware? > that is, not just topo-aware binding, but topo-aware resource allocation? > you know, avoid unnecessary resource contention among the threads belonging > to multiple jobs that happen to be on the same node. You can tell SGE (now: Grid Scheduler) how you want to allocate hardware resources, but different hardware architectures & program behaviors can introduce interactions that have very different performance impacts. For example, a few years ago while I was still working for a large UNIX system vendor, I found that a few SPEC OMP benchmarks run faster when the threads are closer to each other (even when sharing the same core by running in SMT mode), while most benchmarks benefit from more L2/L3 cache & memory bandwidth (I'm talking about the same thread count for both cases). But it is hard even as a compiler developer to find out how to choose the optimal thread allocation -- even with high-level array access pattern information & memory bandwidth models available at compilation time. For batch systems, we don't have as much info as the compiler. While we could profile systems on the fly with PAPI, I doubt we will go that route in the near future. So, that means we need the job submitter to tell us what he wants -- in SGE/OGS, we have "qsub -binding striding::", which means you will need to benchmark the code, see how it interacts with the hardware, and decide whether it runs better with more L2/L3/memory bandwidth (meaning step-size >= 2), or "qsub -binding linear", which means the job will get the cores to itself. http://wikis.sun.com/display/gridengine62u5/Using+Job+to+Core+Binding > large-memory processes not getting bound to a single memory node. > packing both small and large-memory processes within a node. > etc. For memory nodes, a call to numactl should be able to handle most use-cases. Rayson > > thanks, mark hahn. > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From john.hearns at mclaren.com Wed Jul 13 04:47:44 2011 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 13 Jul 2011 09:47:44 +0100 Subject: [Beowulf] Grid Engine multi-core thread binding enhancement -pre-alpha release References: <26FE8986BC6B4E56B7E9EABB63E10A38@pccarlosf2><4DA5E85D.4010801@ats.ucla.edu> Message-ID: <207BB2F60743C34496BE41039233A8090656ACFF@MRL-PWEXCHMB02.mil.tagmclarengroup.com> > > Hi Mark, > > I think a lot of this will apply to non-SGE batch schedulers -- in > fact Torque will support hwloc in a future release. > That sounds good to me! (Hint - if anyone from Altair is listening in it would be useful...) > And all mature batch systems (e.g. LSF, SGE, SLURM) have had some sort of > CPU set support for many years, but now this feature is more important, > as the interaction of different hardware layers impacts > performance more as more cores are added per socket. > I agree very much with what you say here. The Gridengine topology-aware scheduling sounds great, and as you say it will become more and more useful with multi-core architectures. I'd like to mention cpuset support though - cpusets are of course vitally useful. However, you can have situations where a topology-aware scheduler would allow you to allocate high core-count jobs on a machine, using cores which are physically close to each other, yet also run small core-count but high-memory jobs which access the memory of those cores. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Sat Jul 16 16:19:43 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Sat, 16 Jul 2011 16:19:43 -0400 (EDT) Subject: [Beowulf] [tt] One million ARM chips challenge Intel bumblebee In-Reply-To: <20110707165629.GL16178@leitl.org> References: <20110707141305.GJ16178@leitl.org> <4E15D262.90209@ias.edu> <20110707165629.GL16178@leitl.org> Message-ID: >>> How long is it going to take to wire them all up? And how fast are they >>> going to fail? If there's an MTBF of one million hours, that's still one >>> failure per hour. ... > They do address some of that in ftp://ftp.cs.man.ac.uk/pub/amulet/papers/SBF_ACSD09.pdf the 1m proc seems to be referring to cores, of which their current SOC has 20/chip, and there are 4 chips on their current test board: http://www.eetimes.com/electronics-news/4217842/SpiNNaker-ARM-chip-slide-show?pageNumber=4 hmm, that article says 18 cores (maybe reduced for yield). stacked dram, not sure what the other companion chip is on the test board.
anyway, compare it to the K computer: 516096 compute cores, 64512 packages, versus 50k packages for Spinnaker. Spinnaker will obviously put more chips onto a single board (board links are more reliable than connectors, as well as more power-efficient.) Spinnaker has 6 links for a 2d toroidal mesh (not 3d for some reason) - K also uses a 6-link mesh. obviously, off-board links need a connector, but if I were designing either box I'd have each board plug into a per-rack backplane, again, to avoid dealing with cables. if you have a per-rack sub-mesh anyway, it should be 3d, shouldn't it? in abstract, it seems like Spinnaker would want a 3d mesh to better model the failure effect in the brain (which is certainly not 2d nearest-neighbor!) in fact, if you wanted to embrace brain-like topologies, I'd think a flat-network-neighborhood would be most realistic (albeit cable-intensive. but we're not afraid of failed cables, since the brain isn't!) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From deadline at eadline.org Mon Jul 18 10:18:41 2011 From: deadline at eadline.org (Douglas Eadline) Date: Mon, 18 Jul 2011 10:18:41 -0400 (EDT) Subject: [Beowulf] Multi-core benchmarking Message-ID: <32768.192.168.93.213.1310998721.squirrel@mail.eadline.org> I had recently run some tests on a few multi-core processors that I use in my Limulus Project Benchmarking A Multi-Core Processor For HPC http://www.clustermonkey.net//content/view/306/1/ I have always been interested in how well multi-core can support parallel codes. I used the following CPUs: * Intel Core2 Quad-core Q6600 running at 2.4GHz (Kentsfield) * AMD Phenom II X4 quad-core 910e running at 2.6GHz (Deneb) * Intel Core i5 Quad-core i5-2400S running at 2.5 GHz (Sandybridge) -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From deadline at eadline.org Mon Jul 18 13:22:57 2011 From: deadline at eadline.org (Douglas Eadline) Date: Mon, 18 Jul 2011 13:22:57 -0400 (EDT) Subject: [Beowulf] Multi-core benchmarking In-Reply-To: References: <32768.192.168.93.213.1310998721.squirrel@mail.eadline.org> Message-ID: <60878.192.168.93.213.1311009777.squirrel@mail.eadline.org> forward from Mikhail Kuzminsky: > Dear Douglas, > I can't send my messages directly to the beowulf maillist, so could you pls > forward my answer manually to the beowulf maillist? > > On 18 July 2011, 18:20, "Douglas Eadline" wrote: > > I had recently run some tests on a few multi-core > processors that I use in my Limulus Project > > Benchmarking A Multi-Core Processor For HPC > http://www.clustermonkey.net//content/view/306/1/ > > I have always been interested in how well > multi-core can support parallel codes. > I used the following CPUs: > > * Intel Core2 Quad-core Q6600 running at 2.4GHz (Kentsfield) > * AMD Phenom II X4 quad-core 910e running at 2.6GHz (Deneb) > * Intel Core i5 Quad-core i5-2400S running at 2.5 GHz (Sandybridge) > > The compiler used was gfortran. > Does it have AVX support? For full AVX extension support for i5 > we need gcc 4.6.0 or higher. gfortran version 4.4.4, I don't think it has AVX. I was mostly interested in memory bandwidth in these tests. I will be testing compilers at some other point. -- Doug > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From vanallsburg at hope.edu Tue Jul 19 09:32:39 2011 From: vanallsburg at hope.edu (Paul Van Allsburg) Date: Tue, 19 Jul 2011 09:32:39 -0400 Subject: [Beowulf] initialize starting number for torque 2.5.5 Message-ID: <4E258777.4030508@hope.edu> Hi All, I'm in the process of replacing a cluster head node and I'd like to install torque with a starting job number of 100000. Is this possible or problematic? Thanks! Paul -- Paul Van Allsburg Scientific Computing Specialist Natural Sciences Division, Hope College 35 E. 12th St. Holland, Michigan 49423 616-395-7292 vanallsburg at hope.edu http://www.hope.edu/academic/csm/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hearnsj at googlemail.com Tue Jul 19 09:44:06 2011 From: hearnsj at googlemail.com (John Hearns) Date: Tue, 19 Jul 2011 14:44:06 +0100 Subject: [Beowulf] initialize starting number for torque 2.5.5 In-Reply-To: <4E258777.4030508@hope.edu> References: <4E258777.4030508@hope.edu> Message-ID: On 19 July 2011 14:32, Paul Van Allsburg wrote: > Hi All, > > I'm in the process of replacing a cluster head node and I'd like to install torque with a starting job number of 100000. Is this > possible or problematic? In SGE you would just edit the jobseqno file. In PBS that sequence number is held in the file serverdb, and I have had call to alter it once when we upgraded a cluster and wanted to keep continuity of job sequence numbers. As serverdb is an XML file in Torque, if Google serves me well, I would try installing, running the first job, then stopping and editing serverdb? _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From vanallsburg at hope.edu Tue Jul 19 09:50:30 2011 From: vanallsburg at hope.edu (Paul Van Allsburg) Date: Tue, 19 Jul 2011 09:50:30 -0400 Subject: [Beowulf] initialize starting number for torque 2.5.5 In-Reply-To: References: <4E258777.4030508@hope.edu> Message-ID: <4E258BA6.8050806@hope.edu> Terrific! Thanks Joey, Paul On 7/19/2011 9:36 AM, Joey Jones wrote: > > Shouldn't be a problem at all: > > In qmgr: > > set server next_job_number = 100000 > > ____________________________________________ > Joey B.
Jones > HPC2 Systems And Network Administration > Mississippi State University > > On Tue, 19 Jul 2011, Paul Van Allsburg wrote: > >> Hi All, >> >> I'm in the process of replacing a cluster head node and I'd like to install torque with a starting job number of 100000. Is this >> possible or problematic? >> >> Thanks! >> >> Paul >> >> -- >> Paul Van Allsburg >> Scientific Computing Specialist >> Natural Sciences Division, Hope College >> 35 E. 12th St. Holland, Michigan 49423 >> 616-395-7292 vanallsburg at hope.edu >> http://www.hope.edu/academic/csm/ >> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >> > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From eugen at leitl.org Thu Jul 21 04:18:36 2011 From: eugen at leitl.org (Eugen Leitl) Date: Thu, 21 Jul 2011 10:18:36 +0200 Subject: [Beowulf] PetaBytes on a budget, take 2 Message-ID: <20110721081836.GD16178@leitl.org> http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/ -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From deadline at eadline.org Thu Jul 21 11:45:28 2011 From: deadline at eadline.org (Douglas Eadline) Date: Thu, 21 Jul 2011 11:45:28 -0400 (EDT) Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <20110721081836.GD16178@leitl.org> References: <20110721081836.GD16178@leitl.org> Message-ID: <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> I'm curious, has anyone tried building one of these or know of anyone who has? Seems like a cheap solution for raw backup. -- Doug > > http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/ > > -- > Eugen* Leitl leitl http://leitl.org > ______________________________________________________________ > ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org > 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From eugen at leitl.org Thu Jul 21 12:09:51 2011 From: eugen at leitl.org (Eugen Leitl) Date: Thu, 21 Jul 2011 18:09:51 +0200 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> Message-ID: <20110721160951.GN16178@leitl.org> On Thu, Jul 21, 2011 at 11:45:28AM -0400, Douglas Eadline wrote: > I'm curious, has anyone tried building one of these or know > of anyone who has? > > Seems like a cheap solution for raw backup. We use quite a few of wire-shelved HP N36L with 8 GByte ECC DDR3 RAM and 4x 3 TByte Hitachi drives with zfs, running NexentaCore and napp-it. They export via NFS and CIFS, but in principle you could use these for a cluster FS. > -- > Doug > > > > > http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From ellis at runnersroll.com Thu Jul 21 12:28:00 2011 From: ellis at runnersroll.com (Ellis H. Wilson III) Date: Thu, 21 Jul 2011 12:28:00 -0400 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <20110721160951.GN16178@leitl.org> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> Message-ID: <4E285390.6050308@runnersroll.com> On 07/21/11 12:09, Eugen Leitl wrote: > On Thu, Jul 21, 2011 at 11:45:28AM -0400, Douglas Eadline wrote: >> I'm curious, has anyone tried building one of these or know >> of anyone who has? >> >> Seems like a cheap solution for raw backup. I have doubts about the manageability of such large data without complex software sitting above the spinning rust to enable scalability of performance and recovery of drive failures, which are inevitable at this scale. I mean, what is the actual value of this article? They really don't tell you "how" to build reliable storage at that scale, just a hand-waving description on how some of the items fit in the box and a few file-system specifics. THe SATA wiring diagram is probably the most detailed thing in the post and even that leaves a lot of questions to be answered. Either way, I think if someone were to foolishly just toss together >100TB of data into a box they would have a hell of a time getting anywhere near even 10% of the theoretical max performance-wise. Not to mention double-disk failures (not /that/ uncommon with same make,model,lot hdds) would just wreck all their data. Now for Backblaze (which is a pretty poor name choice IMHO), they manage all that data in-house so building cheap units makes sense since they can safely rely on the software stack they've built over a couple years. For traditional Beowulfers, spending a year or two developing custom software just to manage big data is likely not worth it. 
ellis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Thu Jul 21 14:29:56 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 21 Jul 2011 11:29:56 -0700 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <4E285390.6050308@runnersroll.com> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> Message-ID: <20110721182956.GA19104@bx9.net> On Thu, Jul 21, 2011 at 12:28:00PM -0400, Ellis H. Wilson III wrote: > For traditional Beowulfers, spending a year or two developing custom > software just to manage big data is likely not worth it. There are many open-souce packages for big data, HDFS being one file-oriented example in the Hadoop family. While they generally don't have the features you'd want for running with HPC programs, they do have sufficient features to do things like backups. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From ellis at runnersroll.com Thu Jul 21 14:55:30 2011 From: ellis at runnersroll.com (Ellis H. Wilson III) Date: Thu, 21 Jul 2011 14:55:30 -0400 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <20110721182956.GA19104@bx9.net> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> <20110721182956.GA19104@bx9.net> Message-ID: <4E287622.6010400@runnersroll.com> On 07/21/11 14:29, Greg Lindahl wrote: > On Thu, Jul 21, 2011 at 12:28:00PM -0400, Ellis H. Wilson III wrote: > >> For traditional Beowulfers, spending a year or two developing custom >> software just to manage big data is likely not worth it. > > There are many open-souce packages for big data, HDFS being one > file-oriented example in the Hadoop family. While they generally don't > have the features you'd want for running with HPC programs, they do > have sufficient features to do things like backups. I'm actually doing a bunch of work with Hadoop right now, so it's funny you mention it. My experience with and understanding of Hadoop/HDFS is that it is really more geared towards actually doing something with the data once you have it on storage, which is why it's based of off google fs (and undoubtedly why you mention it, being in the search arena yourself). As purely a backup solution it would be particularly clunky, especially in a setup like this one where there's a high HDD to CPU ratio. My personal experience with getting large amounts of data from local storage to HDFS has been suboptimal compared to something more raw, but perhaps I'm doing something wrong. Do you know of any distributed file-systems that are geared towards high-sequential-performance and resilient backup/restore? I think even for HPC (checkpoints), there's a pretty good desire to be able to push massive data down and get it back over wide pipes. 
Perhaps pNFS will fill this need? ellis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mm at yuhu.biz Thu Jul 21 17:58:16 2011 From: mm at yuhu.biz (Marian Marinov) Date: Fri, 22 Jul 2011 00:58:16 +0300 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <4E287622.6010400@runnersroll.com> References: <20110721081836.GD16178@leitl.org> <20110721182956.GA19104@bx9.net> <4E287622.6010400@runnersroll.com> Message-ID: <201107220058.20707.mm@yuhu.biz> On Thursday 21 July 2011 21:55:30 Ellis H. Wilson III wrote: > On 07/21/11 14:29, Greg Lindahl wrote: > > On Thu, Jul 21, 2011 at 12:28:00PM -0400, Ellis H. Wilson III wrote: > >> For traditional Beowulfers, spending a year or two developing custom > >> > >> software just to manage big data is likely not worth it. > > > > There are many open-souce packages for big data, HDFS being one > > file-oriented example in the Hadoop family. While they generally don't > > have the features you'd want for running with HPC programs, they do > > have sufficient features to do things like backups. > > I'm actually doing a bunch of work with Hadoop right now, so it's funny > you mention it. My experience with and understanding of Hadoop/HDFS is > that it is really more geared towards actually doing something with the > data once you have it on storage, which is why it's based of off google > fs (and undoubtedly why you mention it, being in the search arena > yourself). As purely a backup solution it would be particularly clunky, > especially in a setup like this one where there's a high HDD to CPU ratio. > > My personal experience with getting large amounts of data from local > storage to HDFS has been suboptimal compared to something more raw, but > perhaps I'm doing something wrong. Do you know of any distributed > file-systems that are geared towards high-sequential-performance and > resilient backup/restore? I think even for HPC (checkpoints), there's a > pretty good desire to be able to push massive data down and get it back > over wide pipes. Perhaps pNFS will fill this need? > I think that GlusterFS would fit perfectly in that place. HDFS is actually a very poor choice for such storage because its performance is not good. The article explains that they compared JFS, XFS and Ext4. When I was designing my backup solution I also compared those 3, and GlusterFS on top of them. I also concluded that Ext4 was the way to go. And by utilizing LVM, or having software to handle the HW failures, it actually proves to be quite suitable for backups. The performance of Ext4 is far better than JFS and XFS; we also tested Ext3 but abandoned that. However, I'm not sure that this kind of storage is very good for anything other than backups. I believe that more random I/O may kill the performance and hardware of such systems. If you are doing only backups on these drives and you are keeping hot spares on the controller, a triple failure is quite hard to hit. And even in that situation you lose only a single RAID6 array, not the whole storage node. Currently my servers have 34TB of capacity, and what these guys have shown me is how I can rearrange my current hardware and double the capacity of the backups.
So I'm extremely happy that they share this with the world. -- Best regards, Marian Marinov CEO of 1H Ltd. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 198 bytes Desc: This is a digitally signed message part. URL: -------------- next part -------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pbm.com Thu Jul 21 18:07:42 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 21 Jul 2011 15:07:42 -0700 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <4E287622.6010400@runnersroll.com> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> <20110721182956.GA19104@bx9.net> <4E287622.6010400@runnersroll.com> Message-ID: <20110721220742.GD22174@bx9.net> On Thu, Jul 21, 2011 at 02:55:30PM -0400, Ellis H. Wilson III wrote: > My personal experience with getting large amounts of data from local > storage to HDFS has been suboptimal compared to something more raw, If you're writing 3 copies of everything on 3 different nodes, then sure, it's a lot slower than writing 1 copy. The benefit you get from this extra up-front expense is resilience. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From ellis at runnersroll.com Thu Jul 21 20:03:58 2011 From: ellis at runnersroll.com (Ellis H. Wilson III) Date: Thu, 21 Jul 2011 20:03:58 -0400 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <20110721220742.GD22174@bx9.net> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> <20110721182956.GA19104@bx9.net> <4E287622.6010400@runnersroll.com> <20110721220742.GD22174@bx9.net> Message-ID: <4E28BE6E.5030800@runnersroll.com> On 07/21/11 18:07, Greg Lindahl wrote: > On Thu, Jul 21, 2011 at 02:55:30PM -0400, Ellis H. Wilson III wrote: >> My personal experience with getting large amounts of data from local >> storage to HDFS has been suboptimal compared to something more raw, > > If you're writing 3 copies of everything on 3 different nodes, then > sure, it's a lot slower than writing 1 copy. The benefit you get from > this extra up-front expense is resilience. Used in a backup solution, triplication won't get you much more resilience than RAID6 but will pay a much greater performance penalty to simply get your backup or checkpoint completed. Additionally, unless you have a ton of these boxes you won't get some of the important benefits of Hadoop such as rack-aware replication placement. Perhaps you could alter HDFS to handle triplication in the background once you get the local copy on-disk, but this isn't really what it was built for so again one is probably better off going with a more efficient, if less complex distributed file system. 
ellis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Thu Jul 21 20:22:03 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 21 Jul 2011 17:22:03 -0700 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <4E28BE6E.5030800@runnersroll.com> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> <20110721182956.GA19104@bx9.net> <4E287622.6010400@runnersroll.com> <20110721220742.GD22174@bx9.net> <4E28BE6E.5030800@runnersroll.com> Message-ID: <20110722002203.GD1350@bx9.net> On Thu, Jul 21, 2011 at 08:03:58PM -0400, Ellis H. Wilson III wrote: > Used in a backup solution, triplication won't get you much more > resilience than RAID6 but will pay a much greater performance penalty to > simply get your backup or checkpoint completed. Hey, if you don't see any benefit from R3, then it's no surprise that you find the cost too high. Me, I don't like being woken up in the dead of the night to run to the colo to replace a disk. And I trust my raid vendor's code less than my replication code. > Additionally, unless you have a ton of these boxes you won't get > some of the important benefits of Hadoop such as rack-aware > replication placement. Most of the benefit is achieved from machine-aware replication placement: the number of PDU and switch failures is much smaller than the number of node failures, which is much smaller than the number of disk device failures. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Fri Jul 22 00:20:14 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 22 Jul 2011 00:20:14 -0400 (EDT) Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> Message-ID: > I'm curious, has anyone tried building one of these or know > of anyone who has? a guy here built one, and it seems to behave fine. > Seems like a cheap solution for raw backup. "raw"? I think the backblaze (v1) is used for rsync-based incremental/nightly-snapshots. but yeah, this is a lot of space at the end of pretty narrow pipes. at some level, though, backblaze is a far more sensible response to disk prices than conventional vendors. 3TB disks start at $140! how much expensive infrastructure do you want to add to the disks to make the space usable? it's absurd to think of using fiberchannel, for instance. bigname vendors still try to tough it out by pretending that their sticker on commodity disks makes them worth hundreds of dollars more - I always figure this is more to justify charging thousands for, say, a 12-disk enclosure. backblaze's approach is pretty gung-ho, though. if I were trying to do storage at that scale, I'd definitely consider using fewer parts. 
for instance, an all-in-one motherboard with 6 sata ports and disks in a 1U chassis. BB winds up being about $44/disk overhead, and I think the simpler approach could come close, maybe $50/disk. then again, if you only look at asymptotics, USB enclosures knock it down to maybe $25/disk ;) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Fri Jul 22 00:33:37 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 22 Jul 2011 00:33:37 -0400 (EDT) Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <4E285390.6050308@runnersroll.com> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> Message-ID: > Either way, I think if someone were to foolishly just toss together >> 100TB of data into a box they would have a hell of a time getting > anywhere near even 10% of the theoretical max performance-wise. storage isn't about performance any more. ok, hyperbole, a little. but even a cheap disk does > 100 MB/s, and in all honesty, there are not tons of people looking for bandwidth more than a small multiplier of that. sure, a QDR fileserver wants more than a couple disks, and if you're an iops-head, you're going flash anyway. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From landman at scalableinformatics.com Fri Jul 22 00:46:11 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 22 Jul 2011 00:46:11 -0400 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> Message-ID: <4E290093.1010504@scalableinformatics.com> On 07/22/2011 12:33 AM, Mark Hahn wrote: >> Either way, I think if someone were to foolishly just toss together >>> 100TB of data into a box they would have a hell of a time getting >> anywhere near even 10% of the theoretical max performance-wise. > > storage isn't about performance any more. ok, hyperbole, a little. > but even a cheap disk does> 100 MB/s, and in all honesty, there are > not tons of people looking for bandwidth more than a small multiplier > of that. sure, a QDR fileserver wants more than a couple disks, With all due respect, I beg to differ. The bigger you make your storage, the larger the pipes in you need, and the larger the pipes to the storage you need, lest you decide that tape is really cheaper after all. Tape does 100MB/s these days. And the media is relatively cheap (compared to some HD). If you don't care about access performance under load, you really can't beat its economics. More to the point, you need a really balanced architecture in terms of bandwidth. I think USB3 could be very interesting for small arrays, and pretty much expect to start seeing some as block targets pretty soon. 
I don't see enough aggregated USB3 ports together in a single machine to make this terribly interesting as a large scale storage medium, but it is a possible route. They are interesting boxen. We often ask customers if they'd consider non-enterprise drives. Failure rates similar to the enterprise as it turns out, modulo some ridiculous drive products. Most say no. Those who say yes don't see enhanced failure rates. > and if you're an iops-head, you're going flash anyway. This is more recent than you might have guessed ... at least outside of academia. We should have a fun machine to talk about next week, and show some benchies on. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Fri Jul 22 01:04:36 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 21 Jul 2011 22:04:36 -0700 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> Message-ID: <20110722050436.GB10994@bx9.net> On Fri, Jul 22, 2011 at 12:33:37AM -0400, Mark Hahn wrote: > storage isn't about performance any more. ok, hyperbole, a little. > but even a cheap disk does > 100 MB/s, and in all honesty, there are > not tons of people looking for bandwidth more than a small multiplier > of that. sure, a QDR fileserver wants more than a couple disks, > and if you're an iops-head, you're going flash anyway. Over in the big data world, we're all about disk bandwidth, because we take the computation to the data. When we're reading something for a Map/Reduce job, we can easily drive 800 MB/s off of 8 disks in a single node, and for many jobs the most expensive thing about the job is reading. Good thing we have 3 copies of every bit of data, that gives us 1/3 the runtime. Writing, not so happy. Network bandwidth is a lot more expensive than disk bandwidth. Some data manipulations in HPC are like Map/Reduce. For example, shooting a movie using saved state files is embarrassingly parallel. The first system I heard about which took computation to the data was from SDSC, long before GOOG was founded. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
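A rough way to sanity-check the per-disk and per-node figures being quoted in this thread (a cheap SATA disk at ~100 MB/s, ~800 MB/s off 8 disks in one node) is to time raw sequential reads directly. A minimal sketch, assuming a Linux box with hdparm and GNU dd available; /dev/sdb and the sdb..sde names are placeholders for whatever data disks the chassis actually exposes, and all of the reads are non-destructive:

    # single-disk sequential read rate, as reported by the drive itself
    hdparm -t /dev/sdb

    # the same check with dd, bypassing the page cache
    dd if=/dev/sdb of=/dev/null bs=1M count=4096 iflag=direct

    # aggregate read bandwidth with several disks streaming in parallel
    for d in sdb sdc sdd sde; do
        dd if=/dev/$d of=/dev/null bs=1M count=4096 iflag=direct &
    done
    wait

If each spindle really delivers on the order of 100 MB/s, the parallel run should scale nearly linearly until the controller, port multiplier or PCIe link saturates - which is exactly the balance question being argued above.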
From hahn at mcmaster.ca Fri Jul 22 01:44:56 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 22 Jul 2011 01:44:56 -0400 (EDT) Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <4E290093.1010504@scalableinformatics.com> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> <4E290093.1010504@scalableinformatics.com> Message-ID: >>> Either way, I think if someone were to foolishly just toss together >>>> 100TB of data into a box they would have a hell of a time getting >>> anywhere near even 10% of the theoretical max performance-wise. >> >> storage isn't about performance any more. ok, hyperbole, a little. >> but even a cheap disk does> 100 MB/s, and in all honesty, there are >> not tons of people looking for bandwidth more than a small multiplier >> of that. sure, a QDR fileserver wants more than a couple disks, > > With all due respect, I beg to differ. with which part? > The bigger you make your storage, the larger the pipes in you need, and hence the QDR comment. > the larger the pipes to the storage you need, lest you decide that tape > is really cheaper after all. people who like tape seem to like it precisely because it's offline. BB storage, although fairly bottlenecked, is very much online and thus constantly integrity-verifiable... > Tape does 100MB/s these days. And the media is relatively cheap > (compared to some HD). yes, "some" is my favorite weasel word too ;) I don't follow tape prices much - but LTO looks a little more expensive than desktop drives. drives still not cheap. my guess is that tape could make sense at very large sizes, with enough tapes to amortize the drives, and some kind of very large robot. but really my point was that commodity capacity and speed covers the vast majority of the market - at least I'm guessing that most data is stored in systems of under, say 1 PB. if tape could deliver 135 TB in 4U with 10ms random access, yes, I guess there wouldn't be any point to backblaze... > If you don't care about access performance under > load, you really can't beat its economics. am I the only one who doesn't trust tape? who thinks of integrity being a consequence of constant verifiability? > More to the point, you need a really balanced architecture in terms of > bandwidth. I think USB3 could be very interesting for small arrays, and > pretty much expect to start seeing some as block targets pretty soon. I > don't see enough aggregated USB3 ports together in a single machine to > make this terribly interesting as a large scale storage medium, but it > is a possible route. hard to imagine a sane way to distribute power to lots of external usb enclosers, let alone how to mount it. > They are interesting boxen. We often ask customers if they'd consider > non-enterprise drives. Failure rates similar to the enterprise as it > turns out, modulo some ridiculous drive products. Most say no. Those > who say yes don't see enhanced failure rates. old-fashioned thinking, from the days when disks were expensive. now the most expensive commodity disk you can buy is maybe $200, so you really have to think of it as a consumable. (yes, people do still buy mercedes and SAS/FC/etc disks, but that doesn't make them mass-market/commodity products.) >> and if you're an iops-head, you're going flash anyway. > > This is more recent than you might have guessed ... at least outside of > academia. 
We should have a fun machine to talk about next week, and > show some benchies on. to be honest, I don't understand what applications lead to focus on IOPS (rationally, not just aesthetic/ideologically). it also seems like battery-backed ram and logging to disks would deliver the same goods... _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Fri Jul 22 02:55:59 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 21 Jul 2011 23:55:59 -0700 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> <4E290093.1010504@scalableinformatics.com> Message-ID: <20110722065559.GD16925@bx9.net> On Fri, Jul 22, 2011 at 01:44:56AM -0400, Mark Hahn wrote: > to be honest, I don't understand what applications lead to focus on IOPS > (rationally, not just aesthetic/ideologically). it also seems like > battery-backed ram and logging to disks would deliver the same goods... In HPC, the metadata for your big parallel filesystem is a good example. SSD is much cheaper capacity at high IOPs than battery-backed RAM. (The RAM has higher IOPs than you need.) For Big Data, there's often data that's hotter than the rest. An example from the blekko search engine is our index; when you type a query on our website, most often all of the 'disk' access is SSD. Big Data systems generally don't have a metadata problem like HPC does; instead of 200 million files, we have a couple of dozen tables in our petabyte database. -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From eugen at leitl.org Fri Jul 22 03:05:11 2011 From: eugen at leitl.org (Eugen Leitl) Date: Fri, 22 Jul 2011 09:05:11 +0200 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <20110722065559.GD16925@bx9.net> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> <4E290093.1010504@scalableinformatics.com> <20110722065559.GD16925@bx9.net> Message-ID: <20110722070511.GQ16178@leitl.org> On Thu, Jul 21, 2011 at 11:55:59PM -0700, Greg Lindahl wrote: > On Fri, Jul 22, 2011 at 01:44:56AM -0400, Mark Hahn wrote: > > > to be honest, I don't understand what applications lead to focus on IOPS > > (rationally, not just aesthetic/ideologically). it also seems like > > battery-backed ram and logging to disks would deliver the same goods... > > In HPC, the metadata for your big parallel filesystem is a good example. > SSD is much cheaper capacity at high IOPs than battery-backed RAM. (The > RAM has higher IOPs than you need.) Hybrid pools in zfs can make use both of SSD and real (battery-backed) RAM disks ( http://www.amazon.com/ACARD-ANS-9010-Dynamic-Module-including/dp/B001NDX6FE or http://www.ddrdrive.com/ ). 
http://bigip-blogs-adc.oracle.com/brendan/entry/test Additional advantage of zfs is that it can deal with the higher error rate of consumer or nearline SATA disks (though it can do nothing against enterprise disk's higher resistance to vibration), and also with silent bit rot with periodic scrubbing (you can make Linux RAID scrub, but you can't make it checksum). > For Big Data, there's often data that's hotter than the rest. An > example from the blekko search engine is our index; when you type a > query on our website, most often all of the 'disk' access is SSD. > > Big Data systems generally don't have a metadata problem like HPC > does; instead of 200 million files, we have a couple of dozen tables > in our petabyte database. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From landman at scalableinformatics.com Fri Jul 22 08:13:30 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 22 Jul 2011 08:13:30 -0400 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> <4E290093.1010504@scalableinformatics.com> Message-ID: <4E29696A.2030607@scalableinformatics.com> On 07/22/2011 01:44 AM, Mark Hahn wrote: >>>> Either way, I think if someone were to foolishly just toss together >>>>> 100TB of data into a box they would have a hell of a time getting >>>> anywhere near even 10% of the theoretical max performance-wise. >>> >>> storage isn't about performance any more. ok, hyperbole, a little. >>> but even a cheap disk does> 100 MB/s, and in all honesty, there are >>> not tons of people looking for bandwidth more than a small multiplier >>> of that. sure, a QDR fileserver wants more than a couple disks, >> >> With all due respect, I beg to differ. > > with which part? Mainly "storage isn't about performance any more" and "there are not tons of people looking for bandwidth more than a small multiplier of that". To haul out my old joke ... generalizations tend to be incorrect ... >> The bigger you make your storage, the larger the pipes in you need, and > > hence the QDR comment. Yeah, well not everyone likes IB. As much as we've tried to convince others that it is a good idea in some cases for their workloads, many customers still prefer 10GbE and GbE. I personally have to admit that 10GbE and its very simple driver model (and its "just works" concept) is incredibly attractive, and often far easier to support than IB. This said, we've seen/experienced some very bad IB implementations (board level, driver issues, switch issues, overall fabric, ...) that I am somewhat more jaded as to real achievable bandwidth with it these days than I've been in the past. Sorta like people throwing disks together into 6G backplanes. We run into this all the time in certain circles. People tend to think the nice label will automatically grant them more performance than before. 
So we see some of the most poorly designed ... hideous really ... designed units from a bandwidth/latency perspective. I guess what I am saying is that QDR (nor 10GbE) is a silver bullet. There are no silver bullets. You *still* have to start with balanced and reasonable designs to get a chance at good performance. > >> the larger the pipes to the storage you need, lest you decide that tape >> is really cheaper after all. > > people who like tape seem to like it precisely because it's offline. > BB storage, although fairly bottlenecked, is very much online and thus > constantly integrity-verifiable... Extremely bottlenecked. 100TB / 100 MB/s -> 100,000,000 MB / 100 MB/s = 1,000,000 s to read or write ... once. This is what we've been calling the storage bandwidth wall. The higher the wall, the colder and more inaccessible your data is. This is on the order of 12 days to read or write the data once. My point about these units is that it may be possible to expand the capacity so much (without growing the various internal bandwidths) that it becomes effectively impossible to utilize all the space, even a majority of the space, in a reasonable time. Which renders the utility of such devices moot. > >> Tape does 100MB/s these days. And the media is relatively cheap >> (compared to some HD). > > yes, "some" is my favorite weasel word too ;) Well ... if you're backing up to SSD drives ... No, seriously not weasel wording on this. Tape is relatively cheap in bulk for larger capacities. > I don't follow tape prices much - but LTO looks a little more expensive > than desktop drives. drives still not cheap. my guess is that tape could > make sense at very large sizes, with enough tapes to amortize the drives, > and some kind of very large robot. but really my point was that commodity > capacity and speed covers the vast majority of the market - at least I'm > guessing that most data is stored in systems of under, say 1 PB. Understand also that I share your view that commodity drives are the better option. Just pointing out that you can follow your asymptote to an extreme (tape) if you wish to keep pushing pricing per byte down. My biggest argument against tape is, that, while the tapes themselves may last 20 years or so ... the drives don't. I've had numerous direct experiences with drive failures that wound up resulting in inaccessible data. I fail to see how the longevity of the media matters in this case, if you can't read it, or cannot get replacement drives to read it. Yeah, that happened. [...] > am I the only one who doesn't trust tape? who thinks of integrity > being a consequence of constant verifiability? See above. [...] >> They are interesting boxen. We often ask customers if they'd consider >> non-enterprise drives. Failure rates similar to the enterprise as it >> turns out, modulo some ridiculous drive products. Most say no. Those >> who say yes don't see enhanced failure rates. > > old-fashioned thinking, from the days when disks were expensive. > now the most expensive commodity disk you can buy is maybe $200, > so you really have to think of it as a consumable. (yes, people do still > buy mercedes and SAS/FC/etc disks, but that doesn't make them > mass-market/commodity products.) heh ... I can see it now: Me: "But gee Mr/Ms Customer, thats really old fashioned thinking (and Mark told me so!) so you gots ta let me sell you dis cheaper disk ..." (watches as door closes in face) It will take time for the business consumer side of market to adapt and adopt. Some do, most don't. 
Aside from that, the drive manufacturers just love them margins on the enterprise units ... And do you see how willingly people pay large multiples of $1/GB for SSDs? Ok, they are now getting closer to $1/GB, but thats still more than 1 OOM worse in cost than spinning rust ... > >>> and if you're an iops-head, you're going flash anyway. >> >> This is more recent than you might have guessed ... at least outside of >> academia. We should have a fun machine to talk about next week, and >> show some benchies on. > > to be honest, I don't understand what applications lead to focus on IOPS > (rationally, not just aesthetic/ideologically). it also seems like > battery-backed ram and logging to disks would deliver the same goods... oh... many. RAM is expensive. 10TB ram is power hungry and very expensive. Bloody fast, but very expensive. Many apps want fast and cheap. As to your thesis, in the world we live in today, bandwidth and latency are becoming ever more important, not less important. Maybe for specific users this isn't the case, and BB is perfect for that use case. For the general case, we aren't getting people asking us if we can raise that storage bandwidth wall. They are all asking us to lower that barrier. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From ellis at runnersroll.com Fri Jul 22 09:05:52 2011 From: ellis at runnersroll.com (Ellis H. Wilson III) Date: Fri, 22 Jul 2011 09:05:52 -0400 Subject: [Beowulf] PetaBytes on a budget, take 2 In-Reply-To: <4E29696A.2030607@scalableinformatics.com> References: <20110721081836.GD16178@leitl.org> <56629.192.168.93.213.1311263128.squirrel@mail.eadline.org> <20110721160951.GN16178@leitl.org> <4E285390.6050308@runnersroll.com> <4E290093.1010504@scalableinformatics.com> <4E29696A.2030607@scalableinformatics.com> Message-ID: <4E2975B0.604@runnersroll.com> On 07/22/11 08:13, Joe Landman wrote: > On 07/22/2011 01:44 AM, Mark Hahn wrote: >>>>> Either way, I think if someone were to foolishly just toss together >>>>>> 100TB of data into a box they would have a hell of a time getting >>>>> anywhere near even 10% of the theoretical max performance-wise. >>>> >>>> storage isn't about performance any more. ok, hyperbole, a little. >>>> but even a cheap disk does> 100 MB/s, and in all honesty, there are >>>> not tons of people looking for bandwidth more than a small multiplier >>>> of that. sure, a QDR fileserver wants more than a couple disks, >>> >>> With all due respect, I beg to differ. >> >> with which part? > > Mainly "storage isn't about performance any more" and "there are not > tons of people looking for bandwidth more than a small multiplier of > that". > > To haul out my old joke ... generalizations tend to be incorrect ... It's pretty nice to wake up in the morning and have somebody else have said everything nearly exactly as I would have. Nice write-up Joe! 
And at Greg - we can talk semantics until we're blue in the face but the reality is that Hadoop/HDFS/R3 is just not an appropriate solution for basic backups, which is the topic of this thread. Period. It's a fabulous tool for actually /working/ on big data and I /really/ do like Hadoop, but it's a very poor tool when all you want to do is really high-bw sequential writes or reads. If you disagree - fine - it's my opinion and I'm sticking to it. Regarding trusting your vendor's raid code less than replication code, I mean, that's pretty obvious. I think we all can agree cp x 3 is a much less complex solution. Best, ellis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mathog at caltech.edu Fri Jul 22 16:46:06 2011 From: mathog at caltech.edu (David Mathog) Date: Fri, 22 Jul 2011 13:46:06 -0700 Subject: [Beowulf] PetaBytes on a budget, take 2 Message-ID: Joe Landman wrote: > My biggest argument against tape is, that, while the tapes themselves > may last 20 years or so ... the drives don't. I've had numerous direct > experiences with drive failures that wound up resulting in inaccessible > data. I fail to see how the longevity of the media matters in this > case, if you can't read it, or cannot get replacement drives to read it. > Yeah, that happened. 20 years is a long, long time for digital storage media. I expect that if one did have a collection of 20 year old backup disk drives it would be reasonably challenging to find a working computer with a compatible interface to plug them into. And that assumes that those very old drives still work. For all I know there are disk drive models whose spindle permanently freezes in place if it isn't used for 15 years - it's not like the drive manufacturers actually test store drives that long. While it is unquestionably true that a 20 year old tape drive is prone to mechanical failure, it would still be easier to find another drive of the same type to read the tape then it would be to repair the equivalent mechanical failure in the storage medium itself (ie, in a failed disk drive.) That is, the tape itself is not prone to storage related mechanical failure. All of which is a bit of a straw man. The best way to maintain archival data over long periods of time is to periodically migrate it to newer storage technology, and in between migrations, to test read the archives periodically so as to detect unforeseen longevity issues early, while there is still a chance to recover the data. David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
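One low-tech way to do the periodic test-reads David describes is to keep a checksum manifest alongside the archive and re-verify it on a schedule. A minimal sketch, assuming a Linux system with GNU coreutils; /archive and /archive-manifests are placeholder paths:

    # when the archive is written, record a checksum for every file
    find /archive -type f -print0 | xargs -0 sha256sum > /archive-manifests/archive-20110722.sha256

    # months later (e.g. from cron), re-read everything and compare
    # against the stored manifest; FAILED lines or a non-zero exit
    # status flag bit rot, truncation or missing files while the data
    # can still be recovered from another copy
    sha256sum -c --quiet /archive-manifests/archive-20110722.sha256

The same manifest can travel with the data each time it is migrated to newer media, so the new copy is verified end to end instead of trusting the drive or tape to report its own errors.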
From gus at ldeo.columbia.edu  Fri Jul 22 17:20:54 2011
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Fri, 22 Jul 2011 17:20:54 -0400
Subject: [Beowulf] PetaBytes on a budget, take 2
In-Reply-To: 
References: 
Message-ID: <4E29E9B6.3050505@ldeo.columbia.edu>

Incidentally, we have a museum-type Data General 9-track tape reader,
which was carefully preserved and is now dedicated to recovering seismic
reflection data of old research cruises from the 1970's and 80's.
(IMHO the Fujitsu readers were even better, but nobody can find those
anymore.)

Short of writing on stone (and granite also has its weathering time
scale), one must occasionally copy over data that is still of interest to
new media, as David said.  I guess the time scale of reassessing what is
still of interest is what matters, and it needs to be compatible with the
longevity of current media.

Maybe forgetting is part of keeping sanity, and even of remembering.
The problem here, maybe in other places too, is that funding for
remembering what should be remembered, and forgetting what should be
forgotten, is often forgotten.

My two cents interjection.
Gus Correa

David Mathog wrote:
> Joe Landman wrote:
>
>> My biggest argument against tape is, that, while the tapes themselves
>> may last 20 years or so ... the drives don't.  I've had numerous direct
>> experiences with drive failures that wound up resulting in inaccessible
>> data.  I fail to see how the longevity of the media matters in this
>> case, if you can't read it, or cannot get replacement drives to read it.
>> Yeah, that happened.
>
> 20 years is a long, long time for digital storage media.
>
> I expect that if one did have a collection of 20 year old backup disk
> drives it would be reasonably challenging to find a working computer
> with a compatible interface to plug them into.  And that assumes that
> those very old drives still work.  For all I know there are disk drive
> models whose spindle permanently freezes in place if it isn't used for
> 15 years - it's not like the drive manufacturers actually test storing
> drives for that long.  While it is unquestionably true that a 20 year
> old tape drive is prone to mechanical failure, it would still be easier
> to find another drive of the same type to read the tape than it would be
> to repair the equivalent mechanical failure in the storage medium itself
> (ie, in a failed disk drive.)  That is, the tape itself is not prone to
> storage related mechanical failure.
>
> All of which is a bit of a straw man.  The best way to maintain archival
> data over long periods of time is to periodically migrate it to newer
> storage technology, and in between migrations, to test read the archives
> periodically so as to detect unforeseen longevity issues early, while
> there is still a chance to recover the data.
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.
From lindahl at pbm.com  Sat Jul 23 01:53:38 2011
From: lindahl at pbm.com (Greg Lindahl)
Date: Fri, 22 Jul 2011 22:53:38 -0700
Subject: [Beowulf] PetaBytes on a budget, take 2
Message-ID: <20110723055338.GC20531@bx9.net>

On Fri, Jul 22, 2011 at 09:05:11AM +0200, Eugen Leitl wrote:

> Additional advantage of zfs is that it can deal with the higher
> error rate of consumer or nearline SATA disks (though it can do
> nothing against enterprise disk's higher resistance to vibration),
> and also with silent bit rot with periodic scrubbing (you can
> make Linux RAID scrub, but you can't make it checksum).

And you can have a single zfs filesystem over 100s of nodes with
petabytes of data? This thread has had a lot of mixing of single-node
filesystems with cluster filesystems, which leads to a lot of confusion.

Hadoop has checksums and maybe scrubbing, and the NoSQL database that we
wrote at blekko has both plus end-to-end checksums; it's hard to imagine
anyone writing a modern storage system without those features.

-- greg

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.

From cbergstrom at pathscale.com  Sat Jul 23 03:28:34 2011
From: cbergstrom at pathscale.com ("C. Bergström")
Date: Sat, 23 Jul 2011 14:28:34 +0700
Subject: [Beowulf] PetaBytes on a budget, take 2
In-Reply-To: <20110723055338.GC20531@bx9.net>
References: <20110723055338.GC20531@bx9.net>
Message-ID: <4E2A7822.80607@pathscale.com>

On 07/23/11 12:53 PM, Greg Lindahl wrote:
> On Fri, Jul 22, 2011 at 09:05:11AM +0200, Eugen Leitl wrote:
>
>> Additional advantage of zfs is that it can deal with the higher
>> error rate of consumer or nearline SATA disks (though it can do
>> nothing against enterprise disk's higher resistance to vibration),
>> and also with silent bit rot with periodic scrubbing (you can
>> make Linux RAID scrub, but you can't make it checksum).
>
> And you can have a single zfs filesystem over 100s of nodes with
> petabytes of data? This thread has had a lot of mixing of single-node
> filesystems with cluster filesystems, which leads to a lot of confusion.

zfs + lustre would do that, but the OP was comparing against Linux
filesystem + RAID.

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.
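The scrub-versus-checksum point quoted above is easy to exercise from the
admin side.  A ZFS scrub re-reads every allocated block and verifies it
against its stored checksum, repairing from redundancy where it can; an
md "check" pass can only compare mirror copies or parity and count
mismatches, without knowing which copy is right.  The following is a
rough, hypothetical sketch, not anything posted in the thread: the pool
name "tank", the array name "md0" and the Python wrapper around what are
really one-line shell operations are all assumptions, and it needs root.

    #!/usr/bin/env python
    # Rough sketch: kick off integrity passes on a ZFS pool and a Linux md
    # array.  "tank" and "md0" are placeholders; both passes run in the
    # background.
    import subprocess

    def zfs_scrub(pool="tank"):
        # Re-reads every allocated block, verifies it against its checksum
        # and repairs from redundancy where it can; progress is reported by
        # "zpool status".
        subprocess.check_call(["zpool", "scrub", pool])
        return subprocess.check_output(["zpool", "status", pool]).decode()

    def md_check(dev="md0"):
        # md has no per-block checksums: "check" only compares copies or
        # parity, so it can count mismatches but not say which copy is good.
        with open("/sys/block/%s/md/sync_action" % dev, "w") as f:
            f.write("check\n")

    def md_mismatches(dev="md0"):
        # Only meaningful once the background check pass has finished.
        with open("/sys/block/%s/md/mismatch_cnt" % dev) as f:
            return int(f.read().strip())

    if __name__ == "__main__":
        print(zfs_scrub())
        md_check()
        print("md0 mismatch_cnt so far: %d" % md_mismatches())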
From eugen at leitl.org  Sat Jul 23 07:21:44 2011
From: eugen at leitl.org (Eugen Leitl)
Date: Sat, 23 Jul 2011 13:21:44 +0200
Subject: [Beowulf] PetaBytes on a budget, take 2
In-Reply-To: <20110723055338.GC20531@bx9.net>
References: <20110723055338.GC20531@bx9.net>
Message-ID: <20110723112144.GL16178@leitl.org>

On Fri, Jul 22, 2011 at 10:53:38PM -0700, Greg Lindahl wrote:
> On Fri, Jul 22, 2011 at 09:05:11AM +0200, Eugen Leitl wrote:
>
> > Additional advantage of zfs is that it can deal with the higher
> > error rate of consumer or nearline SATA disks (though it can do
> > nothing against enterprise disk's higher resistance to vibration),
> > and also with silent bit rot with periodic scrubbing (you can
> > make Linux RAID scrub, but you can't make it checksum).
>
> And you can have a single zfs filesystem over 100s of nodes with
> petabytes of data? This thread has had a lot of mixing of single-node

I'm not sure how well pNFS (NFS 4.1) can do on top of zfs.
Does anybody use this in production?

> filesystems with cluster filesystems, which leads to a lot of confusion.
>
> Hadoop has checksums and maybe scrubbing, and the NoSQL database that
> we wrote at blekko has both plus end-to-end checksums; it's hard to
> imagine anyone writing a modern storage system without those features.

Speaking of which, is there something easy, reliable and open source for
Linux that scales up to some 100 nodes, on GBit Ethernet?

There's plenty mentioned on

https://secure.wikimedia.org/wikipedia/en/wiki/List_of_file_systems#Distributed_parallel_fault-tolerant_file_systems

but which of them fit above requirement?

-- 
Eugen* Leitl leitl http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.

From bob at drzyzgula.org  Sat Jul 23 09:13:50 2011
From: bob at drzyzgula.org (Bob Drzyzgula)
Date: Sat, 23 Jul 2011 09:13:50 -0400
Subject: [Beowulf] PetaBytes on a budget, take 2
In-Reply-To: <4E285390.6050308@runnersroll.com>
References: <20110721081836.GD16178@leitl.org>
	<56629.192.168.93.213.1311263128.squirrel@mail.eadline.org>
	<20110721160951.GN16178@leitl.org>
	<4E285390.6050308@runnersroll.com>
Message-ID: <20110723131350.GB308@mx1.drzyzgula.org>

Getting back to the original question, I will say that I, as I expect
most of us did, of course considered these back when the first version
came out.  However, I rejected them based on a few specific criticisms:

1. The power supplies are not redundant.
2. The fans are not redundant.
3. The drives are inaccessible without shutting down the system and
   pulling the whole chassis.

For my application (I was building a NAS device, not a simple rsync
target) I was also unhappy with the choice of motherboard and other I/O
components, but that's a YMMV kind of thing and could easily be improved
upon within the same chassis.
FWIW, for chassis solutions that approach this level of density, but
still offer redundant power & cooling as well as hot-swap drive access,
Supermicro has a number of designs that are probably worth considering:

http://www.supermicro.com/storage/

In the end we built a solution using the 24-drive-in-4U SC848A chassis;
we didn't go to the 36-drive boxes because I didn't want to have to
compete with the cabling on the back side of the rack to access the
drives, and anyway our data center is cooling-constrained and thus we
have rack units to spare.  We put motherboards in half of them and use
the other half in a JBOD configuration.  We also used 2TB, 7200 rpm
"Enterprise" SAS drives, which actually aren't all that much more
expensive.  Finally, we used Adaptec SSD-caching SAS controllers.  All of
this is of course more expensive than the parts in the Backblaze design,
but that money all goes toward reliability, manageability and
performance, and it still is tremendously cheaper than an enterprise
SAN-based solution.  Not to say that enterprise SANs don't have their
place -- we use them for mission-critical production data -- but there
are many applications for which their cost simply is not justified.

On 21/07/11 12:28 -0400, Ellis H. Wilson III wrote:
>
> I have doubts about the manageability of such large data without complex
> software sitting above the spinning rust to enable scalability of
> performance and recovery of drive failures, which are inevitable at this
> scale.

Well, yes, from a software perspective this is true, and that's of course
where most of the rest of this thread headed, which I did find
interesting and useful.  But if one assumes appropriate software layers,
I think that this remains an interesting hardware design question.

> I mean, what is the actual value of this article?  They really don't
> tell you "how" to build reliable storage at that scale, just a
> hand-waving description on how some of the items fit in the box and a
> few file-system specifics.  The SATA wiring diagram is probably the most
> detailed thing in the post and even that leaves a lot of questions to be
> answered.

Actually I'm not sure you read the whole blog post.  They give extensive
wiring diagrams for all of it, including detailed documentation of the
custom harness for the power supplies.  They also give a complete parts
list -- down to the last screw -- and links to suppliers for unusual or
custom parts, as well as full CAD drawings of the chassis in SolidWorks
(a free viewer is available).  Not quite sure what else you'd be looking
for -- at least from a hardware perspective.

I do think that this is an interesting exercise in finding exactly how
little hardware you can wrap around some hard drives and still have a
functional storage system.  And as Backblaze seems to have built a going
concern on top of the design, it does seem to have its applications.
However, I think one has to recognize its limitations and be very careful
not to try to push it into applications where the lack of redundancy and
manageability are going to come up and bite you on the behind.

--Bob

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.
From james.p.lux at jpl.nasa.gov  Sat Jul 23 11:12:29 2011
From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C))
Date: Sat, 23 Jul 2011 08:12:29 -0700
Subject: [Beowulf] PetaBytes on a budget, take 2
In-Reply-To: <20110723131350.GB308@mx1.drzyzgula.org>
References: <20110721081836.GD16178@leitl.org>
	<56629.192.168.93.213.1311263128.squirrel@mail.eadline.org>
	<20110721160951.GN16178@leitl.org>
	<4E285390.6050308@runnersroll.com>,
	<20110723131350.GB308@mx1.drzyzgula.org>
Message-ID: 

________________________________________
From: beowulf-bounces at beowulf.org [beowulf-bounces at beowulf.org] On Behalf Of Bob Drzyzgula [bob at drzyzgula.org]

I do think that this is an interesting exercise in finding exactly how
little hardware you can wrap around some hard drives and still have a
functional storage system.  And as Backblaze seems to have built a going
concern on top of the design, it does seem to have its applications.
However, I think one has to recognize its limitations and be very careful
not to try to push it into applications where the lack of redundancy and
manageability are going to come up and bite you on the behind.

--Bob
_

Yes... Designing and using a giant system with perfectly reliable
hardware (or something that simulates perfectly reliable hardware at some
abstraction level) is a straightforward process.

Designing something where you explicitly acknowledge failures and it
still works, while not resorting to the software equivalent of Triple
Modular Redundancy (and similar schemes), and which has good performance,
is a much, much more interesting problem.

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.

From john.hearns at mclaren.com  Mon Jul 25 09:20:45 2011
From: john.hearns at mclaren.com (Hearns, John)
Date: Mon, 25 Jul 2011 14:20:45 +0100
Subject: [Beowulf] PetaBytes on a budget, take 2
References: <20110721081836.GD16178@leitl.org>
	<56629.192.168.93.213.1311263128.squirrel@mail.eadline.org>
	<20110721160951.GN16178@leitl.org>
	<4E285390.6050308@runnersroll.com>,
	<20110723131350.GB308@mx1.drzyzgula.org>
Message-ID: <207BB2F60743C34496BE41039233A80906A53D5D@MRL-PWEXCHMB02.mil.tagmclarengroup.com>

> _
>
> Yes... Designing and using a giant system with perfectly reliable
> hardware (or something that simulates perfectly reliable hardware at
> some abstraction level) is a straightforward process.
>

I think that is well worth a debate.
As Bob points out, these storage arrays do not have redundant hot-swap
PSUs, or hot swap disks.  They're made for the market for large volume
online archiving - where you would imagine that there are two or more
copies of data in geographically separate locations.

Is it time to start looking at the cost/benefit analysis for 'prime'
storage to have the same scheme?

Don't get me wrong - I have experienced the benefits of N+1 hot swap PSUs
and RAID hot swap disks on many an occasion, as has everyone here.
There's nothing more satisfying than getting an email over the weekend
about a popped disk on an array and being able to flop back onto the
sofa.

The contents of this email are confidential and for the exclusive use of
the intended recipient.
If you receive this email in error you should not copy it, retransmit it,
use it or disclose its contents but should return it to the sender
immediately and delete your copy.

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.

From samuel at unimelb.edu.au  Tue Jul 26 01:43:19 2011
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Tue, 26 Jul 2011 15:43:19 +1000
Subject: [Beowulf] PetaBytes on a budget, take 2
In-Reply-To: 
References: 
Message-ID: <4E2E53F7.3040208@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 23/07/11 06:46, David Mathog wrote:

> All of which is a bit of a straw man.  The best way to maintain archival
> data over long periods of time is to periodically migrate it to newer
> storage technology, and in between migrations, to test read the archives
> periodically so as to detect unforeseen longevity issues early, while
> there is still a chance to recover the data.

If you are interested in how the National Archives of Australia handles
digital archiving then there is information (primarily software & format
focused) here:

http://www.naa.gov.au/records-management/preserve/e-preservation/at-NAA/index.aspx

But they do say on one of the pages below that:

# When selecting the hardware and systems for its digital
# preservation prototype, the National Archives avoided
# relying on any single vendor or technology. In so doing,
# we have enhanced our ability to deal with hardware
# obsolescence. Technology is our enabler, but we don't want
# it to be our driver. Conceptually, the digital repository
# is one system, but it comprises two independent systems
# running simultaneously, with different operating systems
# and hardware.
#
# By operating with redundancy, we are future-proofing our
# system. In the event of an operational flaw in any one
# operating system, disk technology or vendor, an alternative
# is available, so the risk to data is lower.

- -- 
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk4uU/cACgkQO2KABBYQAh+t0gCfWKLfIv++/x7iK1XLVGgpDIpM
QkkAoIbCKLB2uL4U0QimO1l7WYPOQ5gK
=nC+k
-----END PGP SIGNATURE-----

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.