From deadline at eadline.org Mon Oct 3 08:25:06 2011
From: deadline at eadline.org (Douglas Eadline)
Date: Mon, 3 Oct 2011 08:25:06 -0400 (EDT)
Subject: [Beowulf] $1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud
In-Reply-To: <20110921110239.GR25711@leitl.org>
References: <20110921110239.GR25711@leitl.org>
Message-ID: <59677.192.168.93.213.1317644706.squirrel@mail.eadline.org>

Interesting and pragmatic HPC cloud presentation, worth watching (25 minutes):

http://insidehpc.com/2011/09/30/video-the-real-future-of-cloud-computing/

--
Doug

> http://arstechnica.com/business/news/2011/09/30000-core-cluster-built-on-amazon-ec2-cloud.ars
>
> $1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud
>
> By Jon Brodkin | Published September 20, 2011 10:49 AM
>
> Amazon EC2 and other cloud services are expanding the market for high-performance computing. Without access to a national lab or a supercomputer in your own data center, cloud computing lets businesses spin up temporary clusters at will and stop paying for them as soon as the computing needs are met.
>
> A vendor called Cycle Computing is on a mission to demonstrate the potential of Amazon's cloud by building increasingly large clusters on the Elastic Compute Cloud. Even with Amazon, building a cluster takes some work, but Cycle combines several technologies to ease the process and recently used them to create a 30,000-core cluster running CentOS Linux.
>
> The cluster, announced publicly this week, was created for an unnamed "Top 5 Pharma" customer, and ran for about seven hours at the end of July at a peak cost of $1,279 per hour, including the fees to Amazon and Cycle Computing. The details are impressive: 3,809 compute instances, each with eight cores and 7GB of RAM, for a total of 30,472 cores, 26.7TB of RAM and 2PB (petabytes) of disk space. Security was ensured with HTTPS, SSH and 256-bit AES encryption, and the cluster ran across data centers in three Amazon regions in the United States and Europe. The cluster was dubbed "Nekomata."
>
> Spreading the cluster across multiple continents was done partly for disaster recovery purposes, and also to guarantee that 30,000 cores could be provisioned. "We thought it would improve our probability of success if we spread it out," Cycle Computing's Dave Powers, manager of product engineering, told Ars. "Nobody really knows how many instances you can get at any one time from any one [Amazon] region."
>
> Amazon offers its own special cluster compute instances, at a higher cost than regular-sized virtual machines. These cluster instances provide 10 Gigabit Ethernet networking along with greater CPU and memory, but they weren't necessary to build the Cycle Computing cluster.
>
> The pharmaceutical company's job, related to molecular modeling, was "embarrassingly parallel" so a fast interconnect wasn't crucial. To further reduce costs, Cycle took advantage of Amazon's low-price "spot instances." To manage the cluster, Cycle Computing used its own management software as well as the Condor High-Throughput Computing software and Chef, an open source systems integration framework.
>
> Cycle demonstrated the power of the Amazon cloud earlier this year with a 10,000-core cluster built for a smaller pharma firm called Genentech. Now, 10,000 cores is a relatively easy task, says Powers. "We think we've mastered the small-scale environments," he said.
> 30,000 cores isn't the end game, either. Going forward, Cycle plans bigger, more complicated clusters, perhaps ones that will require Amazon's special cluster compute instances.
>
> The 30,000-core cluster may or may not be the biggest one run on EC2. Amazon isn't saying.
>
> "I can't share specific customer details, but can tell you that we do have businesses of all sizes running large-scale, high-performance computing workloads on AWS [Amazon Web Services], including distributed clusters like the Cycle Computing 30,000 core cluster to tightly-coupled clusters often used for science and engineering applications such as computational fluid dynamics and molecular dynamics simulation," an Amazon spokesperson told Ars.
>
> Amazon itself actually built a supercomputer on its own cloud that made it onto the list of the world's Top 500 supercomputers. With 7,000 cores, the Amazon cluster ranked number 232 in the world last November with speeds of 41.82 teraflops, falling to number 451 in June of this year. So far, Cycle Computing hasn't run the Linpack benchmark to determine the speed of its clusters relative to Top 500 sites.
>
> But Cycle's work is impressive no matter how you measure it. The job performed for the unnamed pharma company "would take well over a week for them to run internally," Powers says. In the end, the cluster performed the equivalent of 10.9 "compute years of work."
>
> The task of managing such large cloud-based clusters forced Cycle to step up its own game, with a new plug-in for Chef the company calls Grill.
>
> "There is no way that any mere human could keep track of all of the moving parts on a cluster of this scale," Cycle wrote in a blog post. "At Cycle, we've always been fans of extreme IT automation, but we needed to take this to the next level in order to monitor and manage every instance, volume, daemon, job, and so on in order for Nekomata to be an efficient 30,000 core tool instead of a big shiny on-demand paperweight."
>
> But problems did arise during the 30,000-core run.
>
> "You can be sure that when you run at massive scale, you are bound to run into some unexpected gotchas," Cycle notes. "In our case, one of the gotchas included such things as running out of file descriptors on the license server. In hindsight, we should have anticipated this would be an issue, but we didn't find that in our prelaunch testing, because we didn't test at full scale. We were able to quickly recover from this bump and keep moving along with the workload with minimal impact. The license server was able to keep up very nicely with this workload once we increased the number of file descriptors."
>
> Cycle also hit a speed bump related to volume and byte limits on Amazon's Elastic Block Store volumes. But the company is already planning bigger and better things.
>
> "We already have our next use-case identified and will be turning up the scale a bit more with the next run," the company says. But ultimately, "it's not about core counts or terabytes of RAM or petabytes of data. Rather, it's about how we are helping to transform how science is done."
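The file-descriptor "gotcha" described above is ordinary Linux resource-limit tuning rather than anything cloud-specific. A minimal sketch of how a server process can check and raise its own limit on open files, using Python's standard resource module (the 65536 target is an illustrative number, not Cycle's actual setting):

    import resource

    # Query the current soft/hard limits on open file descriptors.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("open-file limit: soft=%d hard=%d" % (soft, hard))

    # Raise the soft limit toward the hard limit; only root can raise the
    # hard limit itself, e.g. via /etc/security/limits.conf.
    target = 65536 if hard == resource.RLIM_INFINITY else min(65536, hard)
    if soft < target:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        print("raised soft limit to %d" % target)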
--
Doug

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From prentice at ias.edu Mon Oct 3 13:51:06 2011
From: prentice at ias.edu (Prentice Bisbal)
Date: Mon, 03 Oct 2011 13:51:06 -0400
Subject: [Beowulf] $1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud
In-Reply-To: <59677.192.168.93.213.1317644706.squirrel@mail.eadline.org>
References: <20110921110239.GR25711@leitl.org> <59677.192.168.93.213.1317644706.squirrel@mail.eadline.org>
Message-ID: <4E89F60A.4070801@ias.edu>

Doug,

Thanks for posting that video. It confirmed what I always suspected about clouds for HPC.

Prentice

On 10/03/2011 08:25 AM, Douglas Eadline wrote:
> Interesting and pragmatic HPC cloud presentation, worth watching (25 minutes)
>
> http://insidehpc.com/2011/09/30/video-the-real-future-of-cloud-computing/
>
> --
> Doug
>
--snip--
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From deadline at eadline.org Mon Oct 3 14:17:33 2011
From: deadline at eadline.org (Douglas Eadline)
Date: Mon, 3 Oct 2011 14:17:33 -0400 (EDT)
Subject: [Beowulf] $1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud
In-Reply-To: <4E89F60A.4070801@ias.edu>
References: <20110921110239.GR25711@leitl.org> <59677.192.168.93.213.1317644706.squirrel@mail.eadline.org> <4E89F60A.4070801@ias.edu>
Message-ID: <58756.192.168.93.213.1317665853.squirrel@mail.eadline.org>

I think everyone has similar thoughts, but the presentation provides some real data and experiences.

BTW, for those interested, I have a new poll on ClusterMonkey asking about clouds and HPC (http://www.clustermonkey.net/). The last poll was on GP-GPU use.

--
Doug

> Doug,
>
> Thanks for posting that video. It confirmed what I always suspected about clouds for HPC.
>
> Prentice
>
> On 10/03/2011 08:25 AM, Douglas Eadline wrote:
--snip--
--
Doug
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From raysonlogin at gmail.com Mon Oct 3 14:50:22 2011
From: raysonlogin at gmail.com (Rayson Ho)
Date: Mon, 3 Oct 2011 14:50:22 -0400
Subject: [Beowulf] $1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud
In-Reply-To: <20110921110239.GR25711@leitl.org>
References: <20110921110239.GR25711@leitl.org>
Message-ID:

There's a free & open source application called StarCluster that can do most (if not all?) of the EC2 provisioning & cluster setup for a High Throughput Computing cluster:

http://web.mit.edu/stardev/cluster/

StarCluster sets up NFS, SGE, the BLAS library, Open MPI, etc. automatically for the user in around 10-15 minutes. StarCluster is licensed under the LGPL, written in Python+Boto, and supports a lot of the new EC2 features (Cluster Compute Instances, Spot Instances, Cluster GPU Instances, etc). Support for launching higher node-count (100+ instance) clusters is even better with the new scalability enhancements in the latest version (0.92).

And there are some tutorials on YouTube:

- "StarCluster 0.91 Demo":
  http://www.youtube.com/watch?v=vC3lJcPq1FY

- "Launching a Cluster on Amazon Ec2 Spot Instances Using StarCluster":
  http://www.youtube.com/watch?v=2Ym7epCYnSk

Rayson

=================================
Grid Engine / Open Grid Scheduler
http://gridscheduler.sourceforge.net

On Wed, Sep 21, 2011 at 7:02 AM, Eugen Leitl wrote:
--snip--
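For anyone who has not scripted EC2 directly, the kind of provisioning StarCluster automates looks roughly like the boto sketch below. This is only an illustration of the general API, not StarCluster's own code; the region, bid price, AMI ID, key pair, and security group are placeholders.

    import boto.ec2

    # Credentials are read from the environment or ~/.boto.
    conn = boto.ec2.connect_to_region("us-east-1")

    # Ask for 8 spot instances at a maximum bid of $0.10/hour each.
    requests = conn.request_spot_instances(
        price="0.10",                    # assumed/illustrative bid
        image_id="ami-12345678",         # placeholder AMI
        count=8,
        key_name="mykey",                # placeholder key pair
        security_groups=["cluster-sg"],  # placeholder security group
        instance_type="m1.large",
    )

    # Poll the request state until the instances are fulfilled.
    for req in conn.get_all_spot_instance_requests([r.id for r in requests]):
        print(req.id, req.state)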
--
Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/

Wikimedia Commons
http://commons.wikimedia.org/wiki/User:Raysonho

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From rgb at phy.duke.edu Mon Oct 3 15:21:44 2011
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Mon, 3 Oct 2011 15:21:44 -0400 (EDT)
Subject: [Beowulf] $1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud
In-Reply-To:
References: <20110921110239.GR25711@leitl.org>
Message-ID:

On Mon, 3 Oct 2011, Rayson Ho wrote:

> There's a free & open source application called StarCluster that can do most (if not all?) of the EC2 provisioning & cluster setup for a High Throughput Computing cluster:

I will say that if anyone is going to make this work, it is going to be Amazon and/or Google -- they have the very, very big pile of computers needed to make it work. I would be very interested in seeing the detailed scaling of "fine grained parallel" applications on cloud resources -- one point that the talk made that I agree with is that embarrassingly parallel applications that require minimal I/O or IPCs will do well in a cloud where all that matters is how many instances you can run of jobs that don't talk to each other or need much access to data. But what of jobs that require synchronous high speed communications? What of jobs that require access to huge datasets?

Ultimately the problem comes down to this. Your choice is to rent time on somebody else's hardware or buy your own hardware. For many people, one can scale to infinity and beyond, so using "all" of the time/resource you have available either way is a given. In which case, no matter how you slice it, Amazon or Google have to make a profit above and beyond the cost of delivering the service. You don't (or rather, your "profit" is just the ability to run your jobs and get paid as usual to do your research either way). This means that it will always be cheaper to directly provision a lot of computing rather than run it in the cloud, or for that matter at an HPC center. Not all -- lots of nonlinearities and thresholds associated with infrastructure and admin and so on -- but a lot.
Enough that I don't see Amazon's Pinky OR the Brain ever taking over the (HPC) world...

   rgb

--snip--
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From raysonlogin at gmail.com Tue Oct 4 10:55:39 2011
From: raysonlogin at gmail.com (Rayson Ho)
Date: Tue, 4 Oct 2011 10:55:39 -0400
Subject: [Beowulf] $1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud
In-Reply-To:
References: <20110921110239.GR25711@leitl.org>
Message-ID:

On Mon, Oct 3, 2011 at 3:21 PM, Robert G. Brown wrote:
> I would be very interested in seeing the detailed scaling of "fine grained parallel" applications on cloud resources -- one point that the talk made that I agree with is that embarrassingly parallel applications that require minimal I/O or IPCs will do well in a cloud where all that matters is how many instances you can run of jobs that don't talk to each other or need much access to data. But what of jobs that require synchronous high speed communications?

Amazon (and I believe other cloud providers have something similar?) introduced Cluster Compute Instances with 10 Gb Ethernet. For traditional MPI workloads, the real advantage is actually from HVM (hardware virtualization), as it cuts the communication latency by quite a lot.

> What of jobs that require access to huge datasets?

Getting data in & out of the cloud is still a big problem, and the highest-bandwidth way of sending data to AWS is by FedEx. In fact, shipping drives is quite often the fastest way to send data from one data center to another when the data size is big.

And processing data on the cloud is easier (in terms of setup) with Amazon Elastic MapReduce (which recently added support for spot instances):

http://aws.amazon.com/elasticmapreduce/

> Ultimately the problem comes down to this. Your choice is to rent time on somebody else's hardware or buy your own hardware. For many people, one can scale to infinity and beyond, so using "all" of the time/resource you have available either way is a given. In which case no matter how you slice it, Amazon or Google have to make a profit above and beyond the cost of delivering the service.
> You don't (or rather, your "profit" is just the ability to run your jobs and get paid as usual to do your research either way). This means that it will always be cheaper to directly provision a lot of computing rather than run it in the cloud, or for that matter at an HPC center.

Provided that the machines are used 24x7. A lot of enterprise users do not have enough work to load up the machines. E.g., I worked with a client that has lots of data & numbers to crunch at night, while during the day most of the machines are idle. For traditional HPC centers the batch queue length is almost never 0, so there, agreed, the cloud wouldn't help and might even make the problem worse.

Rayson

=================================
Grid Engine / Open Grid Scheduler
http://gridscheduler.sourceforge.net

--snip--
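The utilization point above is easy to put numbers on. A back-of-the-envelope sketch, with deliberately made-up prices (an owned node amortized over three years versus an assumed on-demand hourly rate), shows how quickly ownership wins once the machines stay busy:

    # Illustrative prices only -- not actual 2011 AWS or hardware quotes.
    owned_node_cost = 4000.0              # purchase + power + admin over its life, USD
    node_lifetime_hours = 3 * 365 * 24    # three-year service life
    cloud_rate = 0.68                     # assumed USD per instance-hour

    # Effective cost per *useful* hour of an owned node at a given utilization.
    for utilization in (0.05, 0.25, 0.50, 0.90):
        busy_hours = node_lifetime_hours * utilization
        owned_per_hour = owned_node_cost / busy_hours
        print("utilization %3d%%: owned $%.2f/hr vs cloud $%.2f/hr"
              % (utilization * 100, owned_per_hour, cloud_rate))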
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From james.p.lux at jpl.nasa.gov Tue Oct 4 11:26:55 2011
From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C))
Date: Tue, 4 Oct 2011 08:26:55 -0700
Subject: [Beowulf] $1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud
In-Reply-To:
Message-ID:

On 10/4/11 7:55 AM, "Rayson Ho" wrote:

> On Mon, Oct 3, 2011 at 3:21 PM, Robert G. Brown wrote:
>> I would be very interested in seeing the detailed scaling of "fine grained parallel" applications on cloud resources -- one point that the talk made that I agree with is that embarrassingly parallel applications that require minimal I/O or IPCs will do well in a cloud where all that matters is how many instances you can run of jobs that don't talk to each other or need much access to data. But what of jobs that require synchronous high speed communications?
>
> Amazon (and I believe other cloud providers have something similar?) introduced Cluster Compute Instances with 10 Gb Ethernet. For traditional MPI workloads, the real advantage is actually from HVM (hardware virtualization), as it cuts the communication latency by quite a lot.
>
>> What of jobs that require access to huge datasets?
>
> Getting data in & out of the cloud is still a big problem, and the highest-bandwidth way of sending data to AWS is by FedEx. In fact, shipping drives is quite often the fastest way to send data from one data center to another when the data size is big.

The classic: nothing beats a station wagon full of tapes for bandwidth. (Today it's a minivan with terabyte hard drives, but that's the idea.)

>> Ultimately the problem comes down to this. Your choice is to rent time on somebody else's hardware or buy your own hardware. For many people, one can scale to infinity and beyond, so using "all" of the time/resource you have available either way is a given. In which case no matter how you slice it, Amazon or Google have to make a profit above and beyond the cost of delivering the service. You don't (or rather, your "profit" is just the ability to run your jobs and get paid as usual to do your research either way). This means that it will always be cheaper to directly provision a lot of computing rather than run it in the cloud, or for that matter at an HPC center.
>
> Provided that the machines are used 24x7. A lot of enterprise users do not have enough work to load up the machines. E.g., I worked with a client that has lots of data & numbers to crunch at night, while during the day most of the machines are idle.

In a situation where you've got an existing application and data, and you just want to crunch numbers, and you pay either cloud or in-house, then you make the choice based on the incremental cost. However, even at the smallest increment on a cloud/hosted scheme, you have to pay from CPU second #1 (plus the fixed overhead of getting the job ready to go). If you have a cluster in house, there is likely a way to get a test job run essentially for free (perhaps on an older non-production cluster). That test job provides the performance data and preliminary results that you use in preparing the proposal to get real money to pay for real computation.

This has been my argument for personal clusters... There's no accounting staff or administrative person watching over you to make sure you are effectively using the capital investment, in the same sense that most places don't care how much idle time there is on your desktop PC. If you've got an idea, and you're willing to put your own time (free?) into it, using the box that happens to be in your office or lab, nobody cares one way or another, as long as your primary job gets done. Notwithstanding that there ARE places that do cycle harvesting from desktop machines, the management and sysadmin hassles are so extreme (I've written software to DO such harvesting, in pre-Beowulf days) that those kinds of places go to thin clients and hosted VM instances eventually, I think.

Where an Amazon could do themselves a favor (maybe they do this already) is to provide a free downloadable version of their environment for your own computer, or some "low priority cycles" for free, to get people hooked. Sort of like IBM providing computers for cheap to universities in the 60s and 70s. Razors, razor blades. Kindles, e-books. Subsidized cellphones, 10 cent text messages. Give us your child 'til 7, and he's ours for life.
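The station-wagon rule of thumb above is easy to sanity-check. A quick back-of-the-envelope comparison, with made-up but plausible numbers (a box of 1 TB drives shipped overnight versus a saturated 1 Gb/s link):

    # Illustrative numbers only: 50 x 1 TB drives, 24-hour overnight shipment.
    drives = 50
    shipped_bytes = drives * 1.0e12
    transit_seconds = 24 * 3600.0

    sneakernet_gbps = shipped_bytes * 8 / transit_seconds / 1e9
    print("shipped drives: %.1f Gb/s effective" % sneakernet_gbps)     # ~4.6 Gb/s

    # The same data over a fully saturated 1 Gb/s link:
    link_days = shipped_bytes * 8 / 1e9 / 86400.0
    print("1 Gb/s link: %.1f days for the same transfer" % link_days)  # ~4.6 days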
Those kinds of places go to thin clients and hosted VM instances eventually, I think. Where an Amazon could do themselves a favor (maybe they do this already) is to provide a free downloadable version of their environment for your own computer, or some "low priority cycles" for free, to get people hooked. Sort of like IBM providing computers for cheap to universities in the 60s and 70s. Razors, razor blades. Kindles, e-books. Subsidized cellphones, 10 cent text messages. Give us your child 'til 7, and he's ours for life. > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From raysonlogin at gmail.com Tue Oct 4 11:58:12 2011 From: raysonlogin at gmail.com (Rayson Ho) Date: Tue, 4 Oct 2011 11:58:12 -0400 Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: On Tue, Oct 4, 2011 at 11:26 AM, Lux, Jim (337C) wrote: > The classic: nothing beats a station wagon full of tapes for bandwidth. > (today, it's minivan with terabyte hard drives, but that's the idea) BTW, I've heard horror stories related to routing errors with this method - truck drivers delivering wrong tapes or losing tapes (hopefully the data is properly encrypted). > Notwithstanding that there ARE places that do cycle harvesting from > desktop machines, but the management and sysadmin hassles are so extreme > (I've written software to DO such harvesting, in pre-Beowulf days). The technology part of cycle harvesting is solvable, the accounting part is (IMO) much harder. A few years ago I talked to a University HPC lab about deploying cycle harvesting in the libraries (it's a big University, so we are talking about 1000+ library desktops). The technology was there (BOINC client), but getting the software installed & maintained means extra work, which means an extra IT guy... and means no one wants to pay for this. I wonder how many University labs or Biotech companies are doing organization wide cycle harvesting these days, for example, with technologies like BOINC: http://boinc.berkeley.edu/ > Where an Amazon could do themselves a favor (maybe they do this already) > is to provide a free downloadable version of their environment for your > own computer, AMI is not private (in the end, it is IaaS, so the VM images are open). In fact, StarCluster has AMIs for download & install (mainly for developers who want to code for StarCluster locally): http://web.mit.edu/stardev/cluster/download_amis.html And one can roll a custom StarCluster AMI and upload it to AWS, such that the image settings are optimized to the needs: http://web.mit.edu/stardev/cluster/docs/0.91/create_new_ami.html > or some "low priority cycles" for free, to get people hooked. AWS Free Usage Tier -- (most people just use the free tier as free hosting): http://aws.amazon.com/free/ Rayson ================================= Grid Engine / Open Grid Scheduler http://gridscheduler.sourceforge.net > ?Sort of like IBM providing computers for cheap to universities in > the 60s and 70s. Razors, razor blades. Kindles, e-books. Subsidized > cellphones, 10 cent text messages. Give us your child 'til 7, and he's > ours for life. 
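The station-wagon/minivan point a few messages up is also easy to quantify. A rough Python sketch, with the drive count, capacity and shipping time all assumed round numbers:

    # Effective bandwidth of shipping drives vs. pushing bits over a wire.
    # Drive count, capacity and transit time are assumed round numbers.
    drives        = 20          # 2 TB drives in the box
    drive_tb      = 2.0
    transit_hours = 24.0        # overnight courier
    payload_bits  = drives * drive_tb * 1e12 * 8

    sneakernet_bps = payload_bits / (transit_hours * 3600)
    gige_bps       = 1e9        # a dedicated, fully utilized 1 Gb/s link

    print("sneakernet effective bandwidth: %.1f Gb/s" % (sneakernet_bps / 1e9))
    print("days to move the same data over GigE: %.1f" % (payload_bits / gige_bps / 86400.0))

The latency is a day, of course, and the payload spends that day in someone else's truck, which is why the tape-delivery horror stories above end with "hopefully the data is properly encrypted."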
> > >> > > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From deadline at eadline.org Tue Oct 4 13:08:11 2011 From: deadline at eadline.org (Douglas Eadline) Date: Tue, 4 Oct 2011 13:08:11 -0400 (EDT) Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: <53556.192.168.93.213.1317748091.squirrel@mail.eadline.org> --snip-- > > This has been my argument for personal clusters... There's no accounting > staff or administrative person watching over you to make sure you are > effectively using the capital investment, in the same sense that most > places don't care how much idle time there is on your desktop PC. If > you've got an idea, and you're willing to put your own time (free?) into > it, using the box that happens to be in your office or lab, nobody cares > one way or another, as long as your primary job gets done. > Notwithstanding that there ARE places that do cycle harvesting from > desktop machines, but the management and sysadmin hassles are so extreme > (I've written software to DO such harvesting, in pre-Beowulf days).. Those > kinds of places go to thin clients and hosted VM instances eventually, I > think. BTW, very soon prebuilt Limulus systems will be available (http://limulus.basement-supercomputing.com) with 16 cores (four i5-2500S processors), one power plug, cool, quiet, with cool blue lights to impress your co-workers. -- Doug > > > Where an Amazon could do themselves a favor (maybe they do this already) > is to provide a free downloadable version of their environment for your > own computer, or some "low priority cycles" for free, to get people > hooked. Sort of like IBM providing computers for cheap to universities in > the 60s and 70s. Razors, razor blades. Kindles, e-books. Subsidized > cellphones, 10 cent text messages. Give us your child 'til 7, and he's > ours for life. > > >> > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Tue Oct 4 14:39:20 2011 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue, 4 Oct 2011 14:39:20 -0400 (EDT) Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: On Tue, 4 Oct 2011, Lux, Jim (337C) wrote: > Notwithstanding that there ARE places that do cycle harvesting from > desktop machines, but the management and sysadmin hassles are so extreme > (I've written software to DO such harvesting, in pre-Beowulf days).. Those > kinds of places go to thin clients and hosted VM instances eventually, I > think. 
Condor (much improved from the old days, I think) actually makes this fairly easy nowadays. The physics department runs condor across lots of the low-rent desktop systems, creating a readily available compute farm for EP jobs. I don't do much of that sort of thing any more, alas. Mostly teaching, working on dieharder when I can, and writing textbooks at a furious pace. I will have a complete first year physics textbook -- the world's best, naturally;-) -- finished by the end of this semester (I'm within about four and a half chapters of finished already, and writing at least a chapter a week at this point). After that is done, and two other books that are partly finished (three if I get really inspired and try to finish the beowulf book) THEN I may have time to do more actual computing. > Where an Amazon could do themselves a favor (maybe they do this already) > is to provide a free downloadable version of their environment for your > own computer, or some "low priority cycles" for free, to get people > hooked. Sort of like IBM providing computers for cheap to universities in > the 60s and 70s. Razors, razor blades. Kindles, e-books. Subsidized > cellphones, 10 cent text messages. Give us your child 'til 7, and he's > ours for life. As I said, ultimately Amazon makes a profit. That is, they provide the cluster and some reasonable subset of cluster management in infrastructure provisioning, where they have to a) recoup the cost of the hardware, the infrastructure, and the management; b) make at LEAST 5-10% or better on the costs of all of this as profit, if not more like 40-50% or even 100% markup. Usually retail is 100% markup, but Amazon has scale efficiencies such that they can get by with less, whether or not they "like" to. So it ultimately comes down to whether or not you can provide similar efficiencies in your own local environment. Suppose it is a University. You have $100,000 for a compute resource that you expect to use over three years. There is typically no indirect cost charged to capital equipment. Often, but not always, housing, cooling, powering, and even managing the hardware is "free" to the researcher, absorbed into the ongoing costs of the server room and management staff already needed to run the department LAN and servers. Thus for your $100,000 you can buy (say) 100 dedicated function systems for $1000 each and everything else is paid out of opportunity cost labor or University provisioning that doesn't cost your grant anything -- out of that $100,000 (although of course your indirect costs elsewhere partly subsidize it). Even network ports may be free, or may not be if you need a higher end "cluster" network. If you rent from ANYBODY, you pay: * Slightly over 1/3 of the $100,000 up front for indirect costs. Duke, for example, would be perfectly happy to charge your grant $1 for every $2 that it pays out to a third party for cloud computing rental. For that fee they do all of the bookkeeping, basically -- most is pure profit, but prenegotiated with all of the granting agencies and that's just the way it is. * Your remaining (say) $63,000 has to pay for (a fraction of) the power, the housing, the cooling, the network. Unless Amazon subsidizes the cluster with different money altogether (e.g. using money from book sales to provide all of this at a loss) it will almost certainly not be as cheap as a University center for modest size clusters. 
When clusters grow to where people have to build new data centers just to house them, of course, this may not be true (but Amazon still doesn't gain much of a relative advantage even in this extreme case, not in the long run). Infrastructure costs are likely ballpark 10% of the cost of the hardware you are running on. * It has to pay for Amazon's sysadmins and management and security. These are humans that your money DIRECTLY supports, not humans that are directly supported to do something else and do admin for you on an opportunity cost basis "for free". Real salaries, (fractionally) paid from this income stream only. Even amortized in the friendliest most favorable way possible, admin cost are probably at least 10% of the hardware costs. * Profit. At least (say) $6300 is profit. Nobody makes a similar profit in the case of the DIY cluster. * The amortized cost of the hardware. The way I see it, you end up with roughly 50% of every dollar lost >>off the top<< of your $100,000. You ultimately buy (an amortized fraction of) the hardware the $100,000 as up-front capital equipment would cost you, and instead of being able to leverage pre-existing University infrastructure, avoid indirect costs, all as on a non-profit basis, you have to pay for infrastructure, indirect costs on the grant, management, AND A PROFIT on top of the hardware. The only real advantage is that -- maybe -- Amazon has market leverage and economy of scale on the hardware. But 50%? That's hard to make back. rgb > > >> > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From dag at sonsorol.org Tue Oct 4 15:29:28 2011 From: dag at sonsorol.org (Chris Dagdigian) Date: Tue, 04 Oct 2011 15:29:28 -0400 Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: <4E8B5E98.3090002@sonsorol.org> I'm largely with RGB on this one with the minor caveat that I think he might be undervaluing the insane economies of scale that IaaS providers like Amazon & Google can provide. At the scale that Amazon operates at, they can obtain and run infrastructure far, far more efficiently than most (if not all) of us can ourselves. These folks have exabytes of spinning disk, redundant data-centers (with insane PUE values) all over the world and they know how to manage hundreds of thousands of servers with high efficiency in a very hostile networking environment. Not only can they run bigger and more efficient than we can, they can charge a price that makes them a profit while still being (in many cases) far cheaper than my own costs should I be truly honest about the fully-loaded costs of maintaining HPC or IT services. AWS has a history of lowering prices as their own costs go down. 
You can see this via the EC2 pricing history as well as the now-down-to-zero cost of inbound data transit. AWS Spot market makes this even more interesting. I can currently run an m1.4xlarge 64bit server instance with 15GB RAM for about $.24 per hour - close to 50% cheaper than the published hourly price and that spot price can hold steady for weeks at a time in many cases. The biggest hangup is the economics. Even harder in an academic environment where researchers are used to seeing their funds vanish to "overhead" on their grant or they just assume that datacenters, bandwidth, power and hosting are all "free" to use. It's hard to do true cost comparisons but time and time again I've seen IaaS come out ahead when the fully-loaded costs are actually put down on paper. Here is a cliche example: Amazon S3 Before the S3 object storage service will even *acknowledge* a successful PUT request, your file is already at rest in at least three amazon facilities. So to "really" compare S3 against what you can do locally you at least have to factor in the cost of your organization being able to provide 3x multi-facility replication for whatever object store you choose to deploy... I don't want to be seen as a shill so I'll stop with that example. The results really are surprising once you start down the "true cost of IT services..." road. As for industry trends with HPC and IaaS ... I can assure you that in the super practical & cynical world of biotech and pharma there is already an HPC migration to IaaS platforms that is years old already. It's a lot easier to see where and how your money is being spent inside a biotech startup or pharma and that is (and has) shunted a decent amount of spending towards cloud platforms. The easy stuff is moving to IaaS platforms. The hard stuff, the custom stuff, the tightly bound stuff and the data/IO-bound stuff is staying local of course - but that still means lots of stuff is moving externally. The article that prompted this thread is a great example of this. The client company had a boatload of one-off molecular dynamics simulations to run. So much, in fact, that the problem was computationally infeasable to even consider doing inhouse. So they did it on AWS. 30,000 CPU cores. For ~$9,000 dollars. Amazing. It's a fun time to be in HPC actually. And getting my head around "IaaS" platforms turned me onto things (like opscode chef) that we are now bringing inhouse and integrating into our legacy clusters and grids. Sorry for rambling but I think there are 2 main drivers behind what I see moving HPC users and applications into IaaS cloud platforms ... (1) The economies of scale are real. IaaS providers can run better, bigger and cheaper than we can and they can still make a profit. This is real, not hype or sales BS. (as long as you are honest about your actual costs...) (2) The benefits of "scriptable everything" or "everything has an API". I'm so freaking sick of companies installing VMWare and excreting a press release calling themselves a "cloud provider". Virtual servers and virtual block storage on demand are boring, basic and pedestrian. That was clever in 2004. I need far more "glue" to build useful stuff in a virtual world and IaaS platforms deliver more products/services and "glue" options than anyone else out there. The "scriptable everything" nature of IaaS is enabling a lot of cool system and workflow building, much of which would be hard or almost impossible to do in-house with local resources. 
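Putting rough numbers on the spot-market and 30,000-core examples above: the sketch below infers the on-demand rate from the "close to 50% cheaper" remark, guesses a 4-core instance (the 64-bit/15GB description is close to what Amazon listed as m1.xlarge), and assumes a run length of a few hours for the pharma job, so treat every figure as an assumption rather than a published price:

    # Spot vs. on-demand, plus the 30,000-core run quoted above.
    # The on-demand rate, core count and run length are assumptions.
    spot_rate      = 0.24       # $/hr, from the message above
    ondemand_rate  = 0.48       # $/hr, assumed (~2x the spot price)
    cores_per_inst = 4          # assumed for an m1.xlarge-class instance

    month_hours = 30 * 24
    print("one instance for a month, spot     : $%.2f" % (spot_rate * month_hours))
    print("one instance for a month, on-demand: $%.2f" % (ondemand_rate * month_hours))
    print("spot cost per core-hour            : $%.3f" % (spot_rate / cores_per_inst))

    run_cost, run_cores, run_hours = 9000.0, 30000, 7.0   # run length assumed
    print("pharma run, cost per core-hour     : $%.3f" % (run_cost / (run_cores * run_hours)))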
My $.02 -Chris (corporate hat: chris at bioteam.net) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mathog at caltech.edu Tue Oct 4 16:07:21 2011 From: mathog at caltech.edu (mathog) Date: Tue, 04 Oct 2011 13:07:21 -0700 Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: > "Robert G. Brown" wrote: > Often, but not always, housing, cooling, powering, and even managing > the > hardware is "free" to the researcher, absorbed into the ongoing costs > of > the server room and management staff already needed to run the > department LAN and servers. Not always indeed. My little machine room houses a half dozen machines from other biology division people, and they are not charged to keep them there. However, putting a computer in the central campus machine rooms is not free. And new computer rooms, at least those of any size, do not get free power. After geology put in this monster: http://www.gps.caltech.edu/uploads/Image/Facilities/Beowulf.jpg the administration decided that when a computer room pretty much needs its own substation, it is well beyond the incidental overhead costs they are willing to pick up for average research labs. Along similar lines, I would guess that SLAC has to pay for its own power, rather than Stanford covering it out of overhead. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From rgb at phy.duke.edu Tue Oct 4 16:39:16 2011 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue, 4 Oct 2011 16:39:16 -0400 (EDT) Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: On Tue, 4 Oct 2011, Chi Chan wrote: > On Tue, Oct 4, 2011 at 11:58 AM, Rayson Ho wrote: >> BTW, I've heard horror stories related to routing errors with this >> method - truck drivers delivering wrong tapes or losing tapes >> (hopefully the data is properly encrypted). > > I just read this on Slashdot today, it is "very hard to encrypt a > backup tape" (really?): > > http://yro.slashdot.org/story/11/10/04/1815256/saic-loses-data-of-49-million-patients Not if it is encrypted with a stream cipher -- a stream cipher basically xors the data with a bitstream generated from a suitable key in a cryptographic-strength pseudorandom number generator (although there are variations on this theme). As a result, it can be quite fast -- as fast as generating pseudorandom numbers from the generator -- and it produces a file that is exactly the size of the original message in length. There are encryption schemes that expend extraordinary amounts of computational energy in generating the stream, and there are also block ciphers (which are indeed hard to implement for a streaming tape full of data, as they usually don't work so well for long messages). 
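To make the XOR-with-a-keystream idea concrete, here is a deliberately tiny Python 3 toy that stretches a key into a keystream by hashing key-plus-counter and XORs it against the data. It illustrates the mechanism only; a real tape backup would use an established stream cipher (or AES in a streaming mode) and never reuse a key/nonce pair:

    # Python 3 toy: XOR the data with a keystream derived from a key.
    # Illustration of the mechanism only -- not a vetted cipher.
    import hashlib, os

    def keystream(key, nbytes):
        """Hash key||counter in counter mode to stretch the key into a bitstream."""
        out, counter = bytearray(), 0
        while len(out) < nbytes:
            out.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
            counter += 1
        return bytes(out[:nbytes])

    def xor_stream(key, data):
        """Encrypt or decrypt: XOR is its own inverse, output length == input length."""
        return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

    key        = os.urandom(32)                 # the "suitable key"
    block      = b"backup tape block 000000001"
    ciphertext = xor_stream(key, block)
    assert xor_stream(key, ciphertext) == block
    print(len(block), len(ciphertext))          # identical lengths

Note the property described above: the ciphertext is exactly the length of the input, the same function decrypts, and throughput is limited only by how fast the keystream can be generated.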
But in the end no, it isn't that hard to encrypt a backup tape, provided that you are willing to accept the limitation that the speed of encrypting/decrypting the stream being written to the tape is basically limited by the speed of your RNG (which may well be slower than the speed of most fast networks). rgb > > --Chi > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Tue Oct 4 16:43:15 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 4 Oct 2011 13:43:15 -0700 Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of mathog > Sent: Tuesday, October 04, 2011 1:07 PM > To: beowulf at beowulf.org > Subject: Re: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud > > > "Robert G. Brown" wrote: > > > Often, but not always, housing, cooling, powering, and even managing > > the > > hardware is "free" to the researcher, absorbed into the ongoing costs > > of > > the server room and management staff already needed to run the > > department LAN and servers. > > Not always indeed. My little machine room houses a half dozen machines > from other biology division > people, and they are not charged to keep them there. However, putting > a computer in the central > campus machine rooms is not free. And new computer rooms, at least > those of any size, do not > get free power. After geology put in this monster: > > http://www.gps.caltech.edu/uploads/Image/Facilities/Beowulf.jpg > http://citerra.gps.caltech.edu/wiki/Public/Technology A mere 512 nodes, each with 8 cores. 670W power supply is standard, so let's say about 500 nodes at 700 watts each or 350kW... HVAC will add on top of that, but I doubt they're loaded to the max. Call it 400kW.. That's big, but not enormous. (e.g you can rent a trailer mounted generator for that kind of power for about $1000/day.. the bigger generators one sees on a movie set might be 200-300kW)) CalTrans will only pay $123/hr for a 500kW generator (and fuel cost comes out of that) But, if you were paying SoCalEdison for the juice..You'd be on (minimum) the TOU-GS-3 tariff.. On peak you'd be paying 0.02/kWh for delivery and 0.104/kWh for the power. (off peak would be 0.045/kWh) So call it 12c/kWh on peak. At 400kW, that's $48/hr, which isn't bad, operating expenses wise. Let's compare to the EC2.. $1300/hr for 30k cores. 23 core hours/$ The CITerra is $50/hr for 4000 cores. 80 core hours/$ Yes, one had to go out and BUY all those cores for CITerra. $5000/node, all in, including cabling racks, etc.? What's that, about $1.25M. Spread that out over 3 years at 2000 hrs/year (we only consider working in the daytime, etc. and you get about $210/hr for the capital cost (for all 500+ nodes..) So, the EC2 seems like a good solution when you need rapid scalability to huge sizes and you have a big expense budget and a small capital budget. 
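For anyone who wants to fiddle with the assumptions, the back-of-envelope above drops straight into a scratch script. The inputs are the figures from this message (power draw, tariff, node count, the $1,300/hr EC2 rate) plus the stated $1.25M all-in capital number; note that 512 nodes at $5,000 each would be nearer $2.5M, so the script follows the $1.25M total rather than the per-node figure:

    # The comparison above as a scratch script.  Inputs are the figures above;
    # capital follows the stated $1.25M total (512 x $5,000 would be ~$2.5M).
    nodes, cores_per_node = 512, 8
    power_kw   = 400.0            # estimated draw including HVAC
    tariff     = 0.12             # $/kWh on peak
    capital    = 1.25e6           # stated all-in capital cost
    work_hours = 3 * 2000         # three years of "working hours"
    all_hours  = 3 * 8760         # three years of 24x7 operation

    cores        = nodes * cores_per_node
    power_per_hr = power_kw * tariff   # ~$48/hr; rounding to 4000 cores and $50/hr gives ~80 below
    print("core-hours per $, power only       : %.0f" % (cores / power_per_hr))
    print("core-hours per $, power + capital  : %.0f"
          % (cores / (power_per_hr + capital / work_hours)))
    print("core-hours per $, 24x7 amortization: %.0f"
          % (cores / (power_per_hr + capital / all_hours)))

    ec2_rate, ec2_cores = 1300.0, 30000
    print("EC2 core-hours per $               : %.0f" % (ec2_cores / ec2_rate))

Amortized over working hours only, capital dominates and EC2 actually comes out ahead on core-hours per dollar; amortized over 24x7 operation the owned cluster pulls well in front, which is the utilization argument again.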
You could call up Amazon this afternoon and run that 30,000 core job tonight. And you'd pay substantially for that flexibility (which is how Amazon makes money, eh?) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From jlb17 at duke.edu Tue Oct 4 16:47:30 2011 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Tue, 4 Oct 2011 16:47:30 -0400 (EDT) Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: On Tue, 4 Oct 2011 at 4:39pm, Robert G. Brown wrote > On Tue, 4 Oct 2011, Chi Chan wrote: > >> On Tue, Oct 4, 2011 at 11:58 AM, Rayson Ho wrote: >>> BTW, I've heard horror stories related to routing errors with this >>> method - truck drivers delivering wrong tapes or losing tapes >>> (hopefully the data is properly encrypted). >> >> I just read this on Slashdot today, it is "very hard to encrypt a >> backup tape" (really?): >> >> http://yro.slashdot.org/story/11/10/04/1815256/saic-loses-data-of-49-million-patients > > Not if it is encrypted with a stream cipher -- a stream cipher basically > xors the data with a bitstream generated from a suitable key in a > cryptographic-strength pseudorandom number generator (although there are > variations on this theme). As a result, it can be quite fast -- as fast > as generating pseudorandom numbers from the generator -- and it produces > a file that is exactly the size of the original message in length. For added "no, it's not hard, they're apparently just not very bright" value, LTO4+ includes hardware AES encryption. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Tue Oct 4 16:48:00 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 4 Oct 2011 13:48:00 -0700 Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: > -----Original Message----- > From: Robert G. Brown [mailto:rgb at phy.duke.edu] > Sent: Tuesday, October 04, 2011 1:39 PM > To: Chi Chan > Cc: Rayson Ho; Lux, Jim (337C); tt at postbiota.org; jtriley at mit.edu; Beowulf List > Subject: Re: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud > > On Tue, 4 Oct 2011, Chi Chan wrote: > > > On Tue, Oct 4, 2011 at 11:58 AM, Rayson Ho wrote: > >> BTW, I've heard horror stories related to routing errors with this > >> method - truck drivers delivering wrong tapes or losing tapes > >> (hopefully the data is properly encrypted). 
> > > > I just read this on Slashdot today, it is "very hard to encrypt a > > backup tape" (really?): > > > > http://yro.slashdot.org/story/11/10/04/1815256/saic-loses-data-of-49-million-patients > > Not if it is encrypted with a stream cipher -- a stream cipher basically > xors the data with a bitstream generated from a suitable key in a > cryptographic-strength pseudorandom number generator (although there are > variations on this theme). As a result, it can be quite fast -- as fast > as generating pseudorandom numbers from the generator -- and it produces > a file that is exactly the size of the original message in length. > > There are encryption schemes that expend extraordinary amounts of > computational energy in generating the stream, and there are also block > ciphers (which are indeed hard to implement for a streaming tape full of > data, as they usually don't work so well for long messages). But in the > end no, it isn't that hard to encrypt a backup tape, provided that you > are willing to accept the limitation that the speed of > encrypting/decrypting the stream being written to the tape is basically > limited by the speed of your RNG (which may well be slower than the > speed of most fast networks). > The reason it wasn't encrypted is almost certainly not because it was difficult to do so for technology reasons. When you see a story about "data being lost or stolen from a car" it's because it was an ad hoc situation. Someone got a copy of the data to do some sort of analysis or to take it somewhere on a onetime basis, and "things went wrong". Any sort of regular process would normally deal with encryption or security as a matter of course: it's too easy to do it right. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Tue Oct 4 16:52:13 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 4 Oct 2011 13:52:13 -0700 Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: <4E8B5E98.3090002@sonsorol.org> References: <4E8B5E98.3090002@sonsorol.org> Message-ID: <20111004205213.GD14057@bx9.net> On Tue, Oct 04, 2011 at 03:29:28PM -0400, Chris Dagdigian wrote: > I'm largely with RGB on this one with the minor caveat that I think he > might be undervaluing the insane economies of scale that IaaS providers > like Amazon & Google can provide. You can rent that economy of scale if you're in the right part of the country. We weren't surprised to recently learn that our Silicon Valley datacenter rent is much lower than Moscow, but I was surprised to learn that we pay 1/3 less here than in Vegas, which allegedly has cheap land and power hence cheap datacenter rents. And with only 750 servers, we are already big enough to reap enough outright economy of scale to make leasing our own servers in a rented datacenter cheaper than renting everything from Amazon. The unique thing Amazon is providing is the ability to grow and shrink your cluster. Your example of a company which wanted to run a bunch of molecular dynamics computations in a short period of time is an illustration of that. BTW, Amazon has lowered prices since AWS was released, but not by as much as their costs have fallen. 
That's no surprise, given their dominant role in that market. -- greg (corporate hat: infrastructure at a search engine) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From rgb at phy.duke.edu Tue Oct 4 17:03:46 2011 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue, 4 Oct 2011 17:03:46 -0400 (EDT) Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: On Tue, 4 Oct 2011, Lux, Jim (337C) wrote: > > The reason it wasn't encrypted is almost certainly not because it > was difficult to do so for technology reasons. When you see a story > about "data being lost or stolen from a car" it's because it was an ad > hoc situation. Someone got a copy of the data to do some sort of > analysis or to take it somewhere on a onetime basis, and "things went > wrong". > > Any sort of regular process would normally deal with encryption or > security as a matter of course: it's too easy to do it right. The problem being that HIPAA is not amused by incompetence. The standard is pretty much show due diligence or be prepared to pay massive bucks out in lawsuits should the data you protect be compromised. It is really a most annoying standard -- I mean it is good that it is so flexible and makes the responsibility clear, but for most of HIPAA's existence it has provided no F***ing guidelines on how to make protected data secure. Consequently (and I say this as a modest consultant-level expert) your data and mine in the Electronic Medical Record of your choice is typically: a) Stored in flat, unencrypted plaintext or binary image in the base DB. b) Transmitted in flat, unencrypted plaintext between the server and any LAN-connected clients. In other words, it assumes that your local LAN is secure. c) Relies on third party e.g. VPN solutions to provide encryption for use across a WAN. Needless to say, the passwords and authentication schemes used in EMRs are typically a joke -- after all, the users are borderline incompetent users and cannot be expected to remember or quickly type in a user id or password much more complicated than their own initials. Many sites have one completely trivial password in use by all the physicians and nurses who use the system -- just enough to MAYBE keep patients out of the system while waiting in an examining room. I have had to convince the staff of at least one major EMR company that I will refrain from naming that no, I wasn't going to ship them a copy of an entire dataset exported from an old practice management system -- think of it as the names, addresses, SSNs and a few dozen other "protected" pieces of personal information -- to them as an unencrypted zip file over the internet, and had to finally grit my teeth and accept the use of zip's (not terribly good) built in encryption and cross my fingers and pray. Do not underestimate the sheer power of incompetence, in other words, especially incompetence in an environment almost completely lacking meaningful IT-level standards or oversight. It's really shameful, actually -- it would be so very easy to build in nearly bulletproof security schema that would make the need for third party VPNs passe. 
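On the unencrypted-zip anecdote above: wrapping an export in strong encryption before it leaves the building costs essentially nothing with stock tools. A minimal sketch, shelling out to GnuPG from Python; the file names are placeholders, gpg is assumed to be installed, and it will prompt for a passphrase on the terminal (a real exchange would prefer public-key encryption to the recipient over a shared passphrase):

    # Encrypt a data export with GnuPG before it goes anywhere.
    # "export.csv" is a placeholder; assumes gpg is installed and a terminal
    # is available for the passphrase prompt.
    import subprocess

    subprocess.check_call([
        "gpg", "--symmetric", "--cipher-algo", "AES256",
        "--output", "export.csv.gpg", "export.csv",
    ])
    # Receiving end:  gpg --output export.csv --decrypt export.csv.gpg

The passphrase still has to reach the other end out of band, but at least the payload is no longer protected by zip's legacy scheme, or by nothing at all.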
I don't know that ALL of the EMRs out there are STILL this bad, but I'd bet that 90% of them are. They certainly were 3-4 years ago, last time I looked in detail. So this is just par for the course. Doctors don't understand IT security. EMR creators should, but security is "expensive" and they don't bother because it isn't mandated. The end result is that everything from the DB to the physician's working screen is so horribly insecure that if any greed-driven cracker out there ever decided to exclusively target the weaknesses, they could compromise HIPAA and SSNs by the millions. Sigh. rgb > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From deadline at eadline.org Tue Oct 4 17:21:31 2011 From: deadline at eadline.org (Douglas Eadline) Date: Tue, 4 Oct 2011 17:21:31 -0400 (EDT) Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: <44854.192.168.93.213.1317763291.squirrel@mail.eadline.org> Several years ago I flippantly proposed what seems to be a simple way to ensure important consumer private data (medical, finance, etc.) was safe. Pass a law that says organization who collects or holds personal data must include the same data for organization's Board of Directors and officers (CEO, COO etc) in the database. At least the CEO might start taking security serious when someone in Bulgaria is buying jet skies with his AMX card. -- Doug > On Tue, 4 Oct 2011, Lux, Jim (337C) wrote: > >> >> The reason it wasn't encrypted is almost certainly not because it >> was difficult to do so for technology reasons. When you see a story >> about "data being lost or stolen from a car" it's because it was an ad >> hoc situation. Someone got a copy of the data to do some sort of >> analysis or to take it somewhere on a onetime basis, and "things went >> wrong". >> >> Any sort of regular process would normally deal with encryption or >> security as a matter of course: it's too easy to do it right. > > The problem being that HIPAA is not amused by incompetence. The > standard is pretty much show due diligence or be prepared to pay massive > bucks out in lawsuits should the data you protect be compromised. It is > really a most annoying standard -- I mean it is good that it is so > flexible and makes the responsibility clear, but for most of HIPAA's > existence it has provided no F***ing guidelines on how to make protected > data secure. > > Consequently (and I say this as a modest consultant-level expert) your > data and mine in the Electronic Medical Record of your choice is > typically: > > a) Stored in flat, unencrypted plaintext or binary image in the base > DB. > > b) Transmitted in flat, unencrypted plaintext between the server and > any LAN-connected clients. In other words, it assumes that your local > LAN is secure. > > c) Relies on third party e.g. 
VPN solutions to provide encryption for > use across a WAN. > > Needless to say, the passwords and authentication schemes used in EMRs > are typically a joke -- after all, the users are borderline incompetent > users and cannot be expected to remember or quickly type in a user id or > password much more complicated than their own initials. Many sites have > one completely trivial password in use by all the physicians and nurses > who use the system -- just enough to MAYBE keep patients out of the > system while waiting in an examining room. > > I have had to convince the staff of at least one major EMR company that > I will refrain from naming that no, I wasn't going to ship them a copy > of an entire dataset exported from an old practice management system -- > think of it as the names, addresses, SSNs and a few dozen other > "protected" pieces of personal information -- to them as an unencrypted > zip file over the internet, and had to finally grit my teeth and accept > the use of zip's (not terribly good) built in encryption and cross my > fingers and pray. > > Do not underestimate the sheer power of incompetence, in other words, > especially incompetence in an environment almost completely lacking > meaningful IT-level standards or oversight. It's really shameful, > actually -- it would be so very easy to build in nearly bulletproof > security schema that would make the need for third party VPNs passe. > > I don't know that ALL of the EMRs out there are STILL this bad, but I'd > bet that 90% of them are. They certainly were 3-4 years ago, last time > I looked in detail. > > So this is just par for the course. Doctors don't understand IT > security. EMR creators should, but security is "expensive" and they > don't bother because it isn't mandated. The end result is that > everything from the DB to the physician's working screen is so horribly > insecure that if any greed-driven cracker out there ever decided to > exclusively target the weaknesses, they could compromise HIPAA and SSNs > by the millions. > > Sigh. > > rgb > >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > Robert G. Brown http://www.phy.duke.edu/~rgb/ > Duke University Dept. of Physics, Box 90305 > Durham, N.C. 27708-0305 > Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mathog at caltech.edu Tue Oct 4 17:39:40 2011 From: mathog at caltech.edu (mathog) Date: Tue, 04 Oct 2011 14:39:40 -0700 Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: <1a4e05cecd44d8777737e6994d09b289@saf.bio.caltech.edu> On Tue, 4 Oct 2011 13:43:15 -0700, Lux, Jim (337C) wrote: > So call it 12c/kWh on peak. At 400kW, that's $48/hr, which isn't > bad, operating expenses wise. Well, yes and no. If they only turned it on once in a while it wouldn't be too bad, but I'm pretty sure it runs 100% of the time. At least I have never walked by when the racks were not lit up, so... $48 * 24 * 365 = $420480/year Versus the average lab at (waves hands) $150 in electricity a month = $1800/year? It will of course depend on what kind of work the lab does. The difference is two orders of magnitude. Anyway, last I looked we had around 300 professors, so that one facility used up, order of magnitude, as much juice as all the "normal" labs combined. (Certainly there are some other labs around which also use a lot of electricity.) Cooling water usage was probably also a sore point from the administration's perspective. Pretty much everything here runs AC off chilled water coming from a central plant. Either that cluster used up a whole lot of chilled water capacity at the central plant or they built a separate chiller somewhere. Dave Kewley who sometimes posts here used to run that system, so he would know. Regards David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From jlb17 at duke.edu Tue Oct 4 17:41:02 2011 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Tue, 4 Oct 2011 17:41:02 -0400 (EDT) Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: On Tue, 4 Oct 2011 at 5:03pm, Robert G. Brown wrote > Needless to say, the passwords and authentication schemes used in EMRs > are typically a joke -- after all, the users are borderline incompetent > users and cannot be expected to remember or quickly type in a user id or > password much more complicated than their own initials. Many sites have > one completely trivial password in use by all the physicians and nurses > who use the system -- just enough to MAYBE keep patients out of the > system while waiting in an examining room. My wife's experience here was somewhat the opposite of that. Within 2 days of starting her fellowship at UCSF she had acquired over 10 usernames and passwords (and one RSA hardware token) for all the various systems she needed to interact with. Each system, of course, had its own password aging and renewal rules. Determining how physicians manage their passwords in such an environment is left as an exercise for the reader...
-- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From rgb at phy.duke.edu Wed Oct 5 08:40:53 2011 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed, 5 Oct 2011 08:40:53 -0400 (EDT) Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: <44854.192.168.93.213.1317763291.squirrel@mail.eadline.org> References: <44854.192.168.93.213.1317763291.squirrel@mail.eadline.org> Message-ID: On Tue, 4 Oct 2011, Douglas Eadline wrote: > > Several years ago I flippantly proposed what seems to be > a simple way to ensure important consumer private data > (medical, finance, etc.) was safe. Pass a law that says > organization who collects or holds personal data must > include the same data for organization's Board of Directors and > officers (CEO, COO etc) in the database. At least > the CEO might start taking security serious when > someone in Bulgaria is buying jet skies with his AMX card. It wouldn't help. Physicians are too clueless to understand or care (mostly, not universally) and besides, what can they do? They don't write software. The companies that provide the software won't have their board's information in the DB under any circumstances, and they are the problem. Or rather, the unregulated nature of the business is the problem. The government is spending all sorts of energy specifying the detailed structure of the DB and ICD codes for every possible illness at a staggering degree of granularity so that they can eventually micro-specify compensation rates for fingering your left gonad during an exam but are leaving HIPAA -- a disaster from day one in so very many ways -- in place as the sole guardian of our medical privacy. HIPAA fails to specify IT security, and obscures precisely who will be held financially responsible for failures of security or what other sanctions might be applied. HIPAA has had the easily predictable side effect of placing enormous physical and financial obstacles in the path of medical research, to the point where I think it is safe to say that HIPAA alone has de fact killed thousands to tens of thousands of people simply by delaying discovery for years to decades (while costing us a modest fortune to perform such research as is now performed, with whole departments in any research setting devoted to managing the permissioning of the data). Finally, HIPAA's fundamental original purpose was to keep e.g. health insurance companies or employers from getting your health care records and using them to deny coverage or employment, and it didn't really succeed even in that because of the appalling state of deregulation in the insurance industry itself. It's really pretty amazing. It's hard to imagine how anyone could have come up with a piece of governance so diabolically well designed to be enormously expensive in money and lives while failing even to accomplish its own primary goals or the related goals that it SHOULD have tried to accomplish (such as mandating a certain -- high -- level of security and complete open-standard interoperability and data portability in emergent EMR/PM systems, at least at the DB level), even if they tried. 
However, we should never be hasty to ascribe to human evil that which can adequately be explained by mere incompetence and stupidity. But this is OT, and I'll return to my muttons now. Soap box out. rgb > > -- > Doug > > > > > >> On Tue, 4 Oct 2011, Lux, Jim (337C) wrote: >> >>> >>> The reason it wasn't encrypted is almost certainly not because it >>> was difficult to do so for technology reasons. When you see a story >>> about "data being lost or stolen from a car" it's because it was an ad >>> hoc situation. Someone got a copy of the data to do some sort of >>> analysis or to take it somewhere on a onetime basis, and "things went >>> wrong". >>> >>> Any sort of regular process would normally deal with encryption or >>> security as a matter of course: it's too easy to do it right. >> >> The problem being that HIPAA is not amused by incompetence. The >> standard is pretty much show due diligence or be prepared to pay massive >> bucks out in lawsuits should the data you protect be compromised. It is >> really a most annoying standard -- I mean it is good that it is so >> flexible and makes the responsibility clear, but for most of HIPAA's >> existence it has provided no F***ing guidelines on how to make protected >> data secure. >> >> Consequently (and I say this as a modest consultant-level expert) your >> data and mine in the Electronic Medical Record of your choice is >> typically: >> >> a) Stored in flat, unencrypted plaintext or binary image in the base >> DB. >> >> b) Transmitted in flat, unencrypted plaintext between the server and >> any LAN-connected clients. In other words, it assumes that your local >> LAN is secure. >> >> c) Relies on third party e.g. VPN solutions to provide encryption for >> use across a WAN. >> >> Needless to say, the passwords and authentication schemes used in EMRs >> are typically a joke -- after all, the users are borderline incompetent >> users and cannot be expected to remember or quickly type in a user id or >> password much more complicated than their own initials. Many sites have >> one completely trivial password in use by all the physicians and nurses >> who use the system -- just enough to MAYBE keep patients out of the >> system while waiting in an examining room. >> >> I have had to convince the staff of at least one major EMR company that >> I will refrain from naming that no, I wasn't going to ship them a copy >> of an entire dataset exported from an old practice management system -- >> think of it as the names, addresses, SSNs and a few dozen other >> "protected" pieces of personal information -- to them as an unencrypted >> zip file over the internet, and had to finally grit my teeth and accept >> the use of zip's (not terribly good) built in encryption and cross my >> fingers and pray. >> >> Do not underestimate the sheer power of incompetence, in other words, >> especially incompetence in an environment almost completely lacking >> meaningful IT-level standards or oversight. It's really shameful, >> actually -- it would be so very easy to build in nearly bulletproof >> security schema that would make the need for third party VPNs passe. >> >> I don't know that ALL of the EMRs out there are STILL this bad, but I'd >> bet that 90% of them are. They certainly were 3-4 years ago, last time >> I looked in detail. >> >> So this is just par for the course. Doctors don't understand IT >> security. EMR creators should, but security is "expensive" and they >> don't bother because it isn't mandated. 
The end result is that >> everything from the DB to the physician's working screen is so horribly >> insecure that if any greed-driven cracker out there ever decided to >> exclusively target the weaknesses, they could compromise HIPAA and SSNs >> by the millions. >> >> Sigh. >> >> rgb >> >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >>> >> >> Robert G. Brown http://www.phy.duke.edu/~rgb/ >> Duke University Dept. of Physics, Box 90305 >> Durham, N.C. 27708-0305 >> Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu >> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> >> -- >> This message has been scanned for viruses and >> dangerous content by MailScanner, and is >> believed to be clean. >> > > > -- > Doug > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From rgb at phy.duke.edu Wed Oct 5 08:45:02 2011 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed, 5 Oct 2011 08:45:02 -0400 (EDT) Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: On Tue, 4 Oct 2011, Joshua Baker-LePain wrote: > On Tue, 4 Oct 2011 at 5:03pm, Robert G. Brown wrote > >> Needless to say, the passwords and authentication schemes used in EMRs >> are typically a joke -- after all, the users are borderline incompetent >> users and cannot be expected to remember or quickly type in a user id or >> password much more complicated than their own initials. Many sites have >> one completely trivial password in use by all the physicians and nurses >> who use the system -- just enough to MAYBE keep patients out of the >> system while waiting in an examining room. > > My wife's experience here was somewhat the opposite of that. Within 2 > days of starting her fellowship at UCSF she had acquired over 10 usernames > and passwords (and one RSA hardware token) for all the various systems she > needed to interact with. Each system, of course, had its own password > aging and renewal rules. Determining how physicians manage their > passwords in such an environment is left as an exercise for the reader... Ah, yes, excellent. Ten of them AND an RSA e.g. SecureID -- wow, that takes some real brilliance. I know how MY physician wife would manage it... 
rgb > > -- > Joshua Baker-LePain > QB3 Shared Cluster Sysadmin > UCSF > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From ellis at runnersroll.com Wed Oct 5 09:42:28 2011 From: ellis at runnersroll.com (Ellis H. Wilson III) Date: Wed, 05 Oct 2011 09:42:28 -0400 Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: <20111004205213.GD14057@bx9.net> References: <4E8B5E98.3090002@sonsorol.org> <20111004205213.GD14057@bx9.net> Message-ID: <4E8C5EC4.9020101@runnersroll.com> On 10/04/11 16:52, Greg Lindahl wrote: > On Tue, Oct 04, 2011 at 03:29:28PM -0400, Chris Dagdigian wrote: >> I'm largely with RGB on this one with the minor caveat that I think he >> might be undervaluing the insane economies of scale that IaaS providers >> like Amazon & Google can provide. > > cheap land and power hence cheap datacenter rents. And with only 750 > servers, we are already big enough to reap enough outright economy of > scale to make leasing our own servers in a rented datacenter cheaper > than renting everything from Amazon. > > The unique thing Amazon is providing is the ability to grow and shrink > your cluster. Your example of a company which wanted to run a bunch of > molecular dynamics computations in a short period of time is an > illustration of that. On this note, does anyone know if there are prior works (either academic or publicly disclosed documentations of a company pursuing such a route) of people splitting their workload up into the "static" and "dynamic" portions and running them respectively on in-house and rented hardware? While I see this discussion time and time again go either one way or the other (google or amazon, if you will), I suspect for many companies if it were possible to "invisibly" extend their infrastructure into the cloud on an as-needed basis, it might be a pretty attractive solution. Put another way, there doesn't seem to be much sense in buying a couple more racks for just a short-term project that will result in those racks going silent afterwards. On the flipside, you probably have some fraction of the compute and data resources you need as it is, you just want it to run a little faster or need a little more scratch space/bandwidth. So renting an entire set of resources wouldn't be optimal either, since that will result in underutilization of the infrastructure at home. So just buy whatever fraction your missing from Amazon from a month and use some hacks to make it look like that hardware is right there next to your other stuff. Obviously this requires an embarrassingly parallel workload due to the locality dichotomy (or completely disjoint workloads). Another idea I had was just like solar energy, what if there was a way for you to build up credits for Amazon in the "day" and use them at "night"? I.E. 
put some Amazon software on your infrastructure that allows you them to use your servers as part of their "cloud" when you're not using your equipment at max, and when you do go peak it will automatically provision more and more Amazon leased resources on an as-needed basis and burn up those earned credits instead of "real money." Just some ideas I figured I'd put through the beo-blender to see if they hold any weight before actually pursuing them as research objectives. ellis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From jcownie at cantab.net Thu Oct 6 13:33:51 2011 From: jcownie at cantab.net (James Cownie) Date: Thu, 6 Oct 2011 18:33:51 +0100 Subject: [Beowulf] Beowulf Bash at SC11? Message-ID: <9DAC8FB2-067E-4F1B-ABBA-1AF995E62A33@cantab.net> SC approaches fast, but I've seen no mention of a Beowulf Bash. Has it died? Did I just miss an announcement? -- -- Jim -- James Cownie -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From prentice at ias.edu Fri Oct 7 09:45:29 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 07 Oct 2011 09:45:29 -0400 Subject: [Beowulf] Beowulf Bash at SC11? In-Reply-To: <9DAC8FB2-067E-4F1B-ABBA-1AF995E62A33@cantab.net> References: <9DAC8FB2-067E-4F1B-ABBA-1AF995E62A33@cantab.net> Message-ID: <4E8F0279.5070809@ias.edu> There's an announcement on beowulf.org for a Beowulf Bash... from 2009! Beowulf Bash: The 11th Annual Beowulf.org Meeting November 16, 2009 Portland OR Location: The Game, One Center Court, The Rose Quarter Sponsors: AMD Cluster Monkey InsideHPC Penguin Computing SiCorp TeraScala XAND Marketing On 10/06/2011 01:33 PM, James Cownie wrote: > SC approaches fast, but I've seen no mention of a Beowulf Bash. > > Has it died? > > Did I just miss an announcement? > > -- > > -- Jim > > -- > > James Cownie > > > > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From Glen.Beane at jax.org Fri Oct 7 10:21:41 2011 From: Glen.Beane at jax.org (Glen Beane) Date: Fri, 7 Oct 2011 14:21:41 +0000 Subject: [Beowulf] Beowulf Bash at SC11? 
In-Reply-To: <4E8F0279.5070809@ias.edu> References: <9DAC8FB2-067E-4F1B-ABBA-1AF995E62A33@cantab.net> <4E8F0279.5070809@ias.edu> Message-ID: <7514EA83-EDED-453C-8901-1C861D36C1B2@jax.org> I remember not hearing much about it last year in New Orleans until someone I knew from Penguin handed me a card Monday night at the opening gala On Oct 7, 2011, at 9:45 AM, Prentice Bisbal wrote: > There's an announcement on beowulf.org for a Beowulf Bash... from 2009! > > Beowulf Bash: The 11th Annual Beowulf.org Meeting > November 16, 2009 > Portland OR > Location: The Game, One Center Court, The Rose Quarter Sponsors: > AMD Cluster Monkey > InsideHPC > Penguin Computing > SiCorp TeraScala > XAND Marketing > > > On 10/06/2011 01:33 PM, James Cownie wrote: >> SC approaches fast, but I've seen no mention of a Beowulf Bash. >> >> Has it died? >> >> Did I just miss an announcement? >> >> -- >> >> -- Jim >> >> -- >> >> James Cownie > >> >> >> >> >> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Glen L. Beane Senior Software Engineer The Jackson Laboratory (207) 288-6153 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From deadline at eadline.org Fri Oct 7 17:19:52 2011 From: deadline at eadline.org (Douglas Eadline) Date: Fri, 7 Oct 2011 17:19:52 -0400 (EDT) Subject: [Beowulf] Beowulf Bash at SC11? In-Reply-To: <7514EA83-EDED-453C-8901-1C861D36C1B2@jax.org> References: <9DAC8FB2-067E-4F1B-ABBA-1AF995E62A33@cantab.net> <4E8F0279.5070809@ias.edu> <7514EA83-EDED-453C-8901-1C861D36C1B2@jax.org> Message-ID: <47582.192.168.93.213.1318022392.squirrel@mail.eadline.org> I always announce it on this list and on ClusterMonkey, it also will be announced on InsideHPC and some of the sponsor sites. -- Doug > I remember not hearing much about it last year in New Orleans until > someone I knew from Penguin handed me a card Monday night at the opening > gala > > > On Oct 7, 2011, at 9:45 AM, Prentice Bisbal wrote: > >> There's an announcement on beowulf.org for a Beowulf Bash... from 2009! >> >> Beowulf Bash: The 11th Annual Beowulf.org Meeting >> November 16, 2009 >> Portland OR >> Location: The Game, One Center Court, The Rose Quarter Sponsors: >> AMD Cluster Monkey >> InsideHPC >> Penguin Computing >> SiCorp TeraScala >> XAND Marketing >> >> >> On 10/06/2011 01:33 PM, James Cownie wrote: >>> SC approaches fast, but I've seen no mention of a Beowulf Bash. >>> >>> Has it died? >>> >>> Did I just miss an announcement? 
>>> >>> -- >>> >>> -- Jim >>> >>> -- >>> >>> James Cownie > >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >>> Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > > -- > Glen L. Beane > Senior Software Engineer > The Jackson Laboratory > (207) 288-6153 > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From kilian.cavalotti.work at gmail.com Tue Oct 11 11:21:32 2011 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Tue, 11 Oct 2011 17:21:32 +0200 Subject: [Beowulf] IBM to acquire Platform Computing Message-ID: http://www.platform.com/press-releases/2011/IBMtoAcquireSystemSoftwareCompanyPlatformComputingtoExtendReachofTechnicalComputing and http://www-03.ibm.com/systems/deepcomputing/platform.html Cheers, -- Kilian _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From dag at sonsorol.org Wed Oct 12 10:52:13 2011 From: dag at sonsorol.org (Chris Dagdigian) Date: Wed, 12 Oct 2011 10:52:13 -0400 Subject: [Beowulf] 10GbE topologies for small-ish clusters? Message-ID: <4E95A99D.9040703@sonsorol.org> First time I'm seriously pondering bringing 10GbE straight to compute nodes ... For 64 servers (32 to a cabinet) and an HPC system that spans two racks what would be the common 10 Gig networking topology be today? - One large core switch? - 48 port top-of-rack switches with trunking? - Something else? Regards, Chris _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From landman at scalableinformatics.com Wed Oct 12 10:58:58 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 12 Oct 2011 10:58:58 -0400 Subject: [Beowulf] 10GbE topologies for small-ish clusters? 
In-Reply-To: <4E95A99D.9040703@sonsorol.org> References: <4E95A99D.9040703@sonsorol.org> Message-ID: <4E95AB32.3030804@scalableinformatics.com> On 10/12/2011 10:52 AM, Chris Dagdigian wrote: > > First time I'm seriously pondering bringing 10GbE straight to compute > nodes ... > > For 64 servers (32 to a cabinet) and an HPC system that spans two racks > what would be the common 10 Gig networking topology be today? > > - One large core switch? > - 48 port top-of-rack switches with trunking? > - Something else? What's the use case? Low latency, or simplified high bandwidth connection? 10GbE with 40GbE uplinks won't be cheap. But it would be doable. Gnodal, Mellanox, and others would be able to do this. > > Regards, > Chris > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From i.n.kozin at googlemail.com Wed Oct 12 11:22:52 2011 From: i.n.kozin at googlemail.com (Igor Kozin) Date: Wed, 12 Oct 2011 16:22:52 +0100 Subject: [Beowulf] 10GbE topologies for small-ish clusters? In-Reply-To: <4E95A99D.9040703@sonsorol.org> References: <4E95A99D.9040703@sonsorol.org> Message-ID: Gnodal was probably the first to announce a 1U 72 port switch http://www.gnodal.com/docs/Gnodal%20GS7200%20datasheet.pdf Other vendors either have announced or will be probably announcing dense packaging too. On 12 October 2011 15:52, Chris Dagdigian wrote: > > First time I'm seriously pondering bringing 10GbE straight to compute > nodes ... > > For 64 servers (32 to a cabinet) and an HPC system that spans two racks > what would be the common 10 Gig networking topology be today? > > - One large core switch? > - 48 port top-of-rack switches with trunking? > - Something else? > > Regards, > Chris > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From john.hearns at mclaren.com Wed Oct 12 11:28:28 2011 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 12 Oct 2011 16:28:28 +0100 Subject: [Beowulf] 10GbE topologies for small-ish clusters? References: <4E95A99D.9040703@sonsorol.org> Message-ID: <207BB2F60743C34496BE41039233A80903FB49D5@MRL-PWEXCHMB02.mil.tagmclarengroup.com> First time I'm seriously pondering bringing 10GbE straight to compute nodes ... 
For 64 servers (32 to a cabinet) and an HPC system that spans two racks what would be the common 10 Gig networking topology be today? - One large core switch? - 48 port top-of-rack switches with trunking? - Something else? I was going to suggest two Gnodal rack top switches, linked by a 40Gbps link http://www.gnodal.com/ I see though that their GS7200 switch has 72 x 10Gbps ports - should do you just fine! The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From akshar.bhosale at gmail.com Wed Oct 12 12:28:57 2011 From: akshar.bhosale at gmail.com (akshar bhosale) Date: Wed, 12 Oct 2011 21:58:57 +0530 Subject: [Beowulf] refunding reserved amount in gold Message-ID: Hi, We are using PBS (torque 2.4.8) and gold version 2.1.7.1. One of the jobs went for execution and reserved the equivalent amount. The same job came out of execution and went in queue from execution. This happened 30 times for the same job. Every time job has reserved amount. Now finally there is very huge amount(30*charges for that single job) which is shown in reserved state.Job now does not exist. User can not submit the new job now because of neglegible amount balance in his account. We want to clear reserved amount. How to do that? -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Shainer at Mellanox.com Wed Oct 12 12:30:02 2011 From: Shainer at Mellanox.com (Gilad Shainer) Date: Wed, 12 Oct 2011 16:30:02 +0000 Subject: [Beowulf] 10GbE topologies for small-ish clusters? In-Reply-To: <207BB2F60743C34496BE41039233A80903FB49D5@MRL-PWEXCHMB02.mil.tagmclarengroup.com> References: <4E95A99D.9040703@sonsorol.org> <207BB2F60743C34496BE41039233A80903FB49D5@MRL-PWEXCHMB02.mil.tagmclarengroup.com> Message-ID: You can also check the Mellanox products - both for 40GigE and 10GigE switch fabric. Gilad -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Hearns, John Sent: Wednesday, October 12, 2011 8:31 AM To: dag at sonsorol.org; beowulf at beowulf.org Subject: Re: [Beowulf] 10GbE topologies for small-ish clusters? First time I'm seriously pondering bringing 10GbE straight to compute nodes ... For 64 servers (32 to a cabinet) and an HPC system that spans two racks what would be the common 10 Gig networking topology be today? - One large core switch? - 48 port top-of-rack switches with trunking? - Something else? I was going to suggest two Gnodal rack top switches, linked by a 40Gbps link http://www.gnodal.com/ I see though that their GS7200 switch has 72 x 10Gbps ports - should do you just fine! 
The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From scrusan at ur.rochester.edu Wed Oct 12 12:33:39 2011 From: scrusan at ur.rochester.edu (Steve Crusan) Date: Wed, 12 Oct 2011 12:33:39 -0400 Subject: [Beowulf] refunding reserved amount in gold In-Reply-To: References: Message-ID: <85631CC6-BFE0-44A2-B69E-42BB660AC632@ur.rochester.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I would suggest you post this to the Gold mailing list with a few more pieces of information: http://www.supercluster.org/mailman/listinfo/gold-users Regardless, you could probably use the grefund command... On Oct 12, 2011, at 12:28 PM, akshar bhosale wrote: > Hi, > > We are using PBS (torque 2.4.8) and gold version 2.1.7.1. One of the > jobs went for execution and reserved the equivalent amount. The same job > came out of execution and went in queue from execution. This happened 30 > times for the same job. Every time job has reserved amount. Now finally > there is very huge amount(30*charges for that single job) which is shown in > reserved state.Job now does not exist. User can not submit the new job now > because of neglegible amount balance in his account. We want to clear > reserved amount. How to do that? > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ---------------------- Steve Crusan System Administrator Center for Research Computing University of Rochester https://www.crc.rochester.edu/ -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.17 (Darwin) Comment: GPGTools - http://gpgtools.org iQEcBAEBAgAGBQJOlcFoAAoJENS19LGOpgqK1UIIAIFZj6fIZebQt9xQwmVBVxB9 MPwJMlw4C0F8bR/crGBWx7NUHElep1frROYohD15jN/8bFA2/bJ3xFdiH1bMNqHu MdB4EmRbs4nuNeN/ZayV4JXBVD3oPuwESYA65jVj0MfbVbzeRod6ZnNvpZOb/Juc 7dHCNPa2coLGLakGEQperOvOOCqsTbxSUdagXulW/1xH3iG+8UPNPJe7ATvO0tE3 FYOot3a3WgN8dsWUnsOKBnA17FA2zN0ac/QdEd2COSbpOjbpQp7BIlg0f0QIIkU6 pVq1C706jn5Cl4gKXsfC277Rrx3eLl3YPVA6XaL95PSXBH51L7Y3ViqMmVe9Coo= =cSUy -----END PGP SIGNATURE----- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Wed Oct 12 14:04:27 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 12 Oct 2011 11:04:27 -0700 Subject: [Beowulf] 10GbE topologies for small-ish clusters? 
In-Reply-To: <4E95A99D.9040703@sonsorol.org> References: <4E95A99D.9040703@sonsorol.org> <20111012180002.GC5039@bx9.net> Message-ID: <20111012180427.GD5039@bx9.net> We just bought a couple of 64-port 10g switches from Blade, for the middle of our networking infrastructure. They were the winner over all the others, lowest price and appropriate features. We also bought Blade top-of-rack switches. Now that they've been bought up by IBM you have to negotiate harder to get that low price, but you can still get it by threatening them with competing quotes. Gnodal looks very interesting for larger, multi-switch clusters, they were just a bit late to market for us. Arista really believes that their high prices are justified; we didn't. And if anyone would like to buy some used Mellanox 48-port 10ge switches, we have 2 extras we'd like to sell. -- greg On Wed, Oct 12, 2011 at 10:52:13AM -0400, Chris Dagdigian wrote: > > First time I'm seriously pondering bringing 10GbE straight to compute > nodes ... > > For 64 servers (32 to a cabinet) and an HPC system that spans two racks > what would be the common 10 Gig networking topology be today? > > - One large core switch? > - 48 port top-of-rack switches with trunking? > - Something else? > > Regards, > Chris _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From Shainer at Mellanox.com Wed Oct 12 14:11:04 2011 From: Shainer at Mellanox.com (Gilad Shainer) Date: Wed, 12 Oct 2011 18:11:04 +0000 Subject: [Beowulf] 10GbE topologies for small-ish clusters? In-Reply-To: <20111012180427.GD5039@bx9.net> References: <4E95A99D.9040703@sonsorol.org> <20111012180002.GC5039@bx9.net> <20111012180427.GD5039@bx9.net> Message-ID: The 48-ports are not Mellanox but previous company that Mellanox acquired, as the Mellanox ones are 36 x 40G or 64 x 10G in 1U (or bigger). But please don't let these small details hold you from re-living your history. Good luck selling. -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Greg Lindahl Sent: Wednesday, October 12, 2011 11:05 AM To: Chris Dagdigian Cc: Beowulf Mailing List Subject: Re: [Beowulf] 10GbE topologies for small-ish clusters? We just bought a couple of 64-port 10g switches from Blade, for the middle of our networking infrastructure. They were the winner over all the others, lowest price and appropriate features. We also bought Blade top-of-rack switches. Now that they've been bought up by IBM you have to negotiate harder to get that low price, but you can still get it by threatening them with competing quotes. Gnodal looks very interesting for larger, multi-switch clusters, they were just a bit late to market for us. Arista really believes that their high prices are justified; we didn't. And if anyone would like to buy some used Mellanox 48-port 10ge switches, we have 2 extras we'd like to sell. -- greg On Wed, Oct 12, 2011 at 10:52:13AM -0400, Chris Dagdigian wrote: > > First time I'm seriously pondering bringing 10GbE straight to compute > nodes ... > > For 64 servers (32 to a cabinet) and an HPC system that spans two > racks what would be the common 10 Gig networking topology be today? > > - One large core switch? > - 48 port top-of-rack switches with trunking? 
> - Something else?
>
> Regards,
> Chris

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From cap at nsc.liu.se Thu Oct 13 07:51:56 2011
From: cap at nsc.liu.se (Peter Kjellström)
Date: Thu, 13 Oct 2011 13:51:56 +0200
Subject: Re: [Beowulf] 10GbE topologies for small-ish clusters?
In-Reply-To: <4E95A99D.9040703@sonsorol.org>
References: <4E95A99D.9040703@sonsorol.org>
Message-ID: <201110131351.59977.cap@nsc.liu.se>

On Wednesday, October 12, 2011 04:52:13 PM Chris Dagdigian wrote:
> First time I'm seriously pondering bringing 10GbE straight to compute
> nodes ...
>
> For 64 servers (32 to a cabinet) and an HPC system that spans two racks
> what would be the common 10 Gig networking topology be today?

Both Arista and Blade (now IBM) have 64-port 1U single-ASIC switches (a few
ports will require QSFP-to-SFP+ breakout cables, afaict).

/Peter
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From prentice at ias.edu Fri Oct 21 09:10:18 2011
From: prentice at ias.edu (Prentice Bisbal)
Date: Fri, 21 Oct 2011 09:10:18 -0400
Subject: [Beowulf] Users abusing screen
Message-ID: <4EA16F3A.8080209@ias.edu>

Beowulfers,

I have a question that isn't directly related to clusters, but I suspect
it's an issue many of you are dealing with or have dealt with: users using
the screen command to stay logged in on systems and running long jobs
that they forget about. Have any of you experienced this, and how did
you deal with it?

Here's my scenario:

In addition to my cluster, we have a bunch of "computer servers" where
users can run their programs. These are "large" boxes with more cores
(24-32 cores) and more RAM (128 - 256 GB, ECC) than they'd have on a
desktop.

Periodically, when I have to shut down or reboot a system for maintenance,
I find a LOT of shells being run through the screen command by users
who aren't logged in. The majority are idle shells, but many are running
jobs that seem to be forgotten about. For example, I recently found
some jobs running since July or August under the account of someone who
hasn't even been here for months!

My opinion is that these are shared resources, and if you aren't
interactively using them, you should log out to free up resources for
others. If you have a job that can be run non-interactively, you should
submit it to the cluster.

Has anyone else here dealt with this problem?

I would like to remove screen from my environment entirely to prevent
this. My fellow sysadmins here agree. I'm expecting massive backlash
from the users.
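As a rough illustration (not part of the original post), here is a minimal sketch of how the situation can be audited before anything is removed: list detached screen sessions, and flag long-running processes whose owners have no active login. The script name, the 7-day threshold, and the report-only behaviour are assumptions for illustration, not policy.

    #!/bin/bash
    # report-detached.sh -- illustrative only: report screen master processes
    # and week-old processes owned by users with no active login. Nothing is killed.

    echo "== screen master processes (one per session) =="
    ps -eo user,pid,etime,args | grep '[S]CREEN'

    echo
    echo "== processes older than 7 days whose owner is not logged in =="
    logged_in=$(who | awk '{print $1}' | sort -u)
    ps -eo user,pid,etime,comm --no-headers | while read user pid etime comm; do
        # etime is [[dd-]hh:]mm:ss; a "-" means the process is at least a day old
        case "$etime" in
            *-*) days=${etime%%-*}
                 if [ "$days" -ge 7 ] && ! echo "$logged_in" | grep -qx "$user"; then
                     echo "$user $pid ${days}d $comm"
                 fi ;;
        esac
    done

Whether the report then triggers a nag mail, a kill, or nothing at all is a separate policy decision.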
-- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Fri Oct 21 12:07:27 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 21 Oct 2011 12:07:27 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <20111021134457.GA22748@grml> <4EA1854B.5090506@ias.edu> Message-ID: <4EA198BF.3030002@ias.edu> On 10/21/2011 11:06 AM, Kilian Cavalotti wrote: > Hi Prentice, > > On Fri, Oct 21, 2011 at 4:44 PM, Prentice Bisbal wrote: >>> Have you thought about queueing systems like condor or SGE? >> >> Yes, I have cluster that uses SGE, and we allow users to run serial jobs >> (non-MPI, etc.) there, so there is no need for them to use screen to >> execute long-running jobs. Hence my frustration. > > You could alias "screen" to "qlogin". :) Actually, I can't for reasons I can't get into here. But something like that was part of my original "master plan". -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Fri Oct 21 12:10:36 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 21 Oct 2011 12:10:36 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <7B82E572-588E-41A4-9B46-8A1A07360A30@staff.uni-marburg.de> References: <4EA16F3A.8080209@ias.edu> <7B82E572-588E-41A4-9B46-8A1A07360A30@staff.uni-marburg.de> Message-ID: <4EA1997C.70103@ias.edu> On 10/21/2011 11:24 AM, Reuti wrote: > Hi, > > Am 21.10.2011 um 15:10 schrieb Prentice Bisbal: > >> Beowulfers, >> >> I have a question that isn't directly related to clusters, but I suspect >> it's an issue many of you are dealing with are dealt with: users using >> the screen command to stay logged in on systems and running long jobs >> that they forget about. Have any of you experienced this, and how did >> you deal with it? >> >> Here's my scenario: >> >> In addition to my cluster, we have a bunch of "computer servers" where >> users can run the programs. These are "large" boxes with more cores >> (24-32 cores) and more RAM (128 - 256 GB, ECC) than they'd have on a >> desktop top. >> >> Periodically, when I have to shutdown/reboot a system for maintenance, >> I find a LOT of shells being run through the screen command for users >> who aren't logged in. The majority are idle shells, but many are running >> jobs, that seem to be forgotten about. For example, I recently found >> some jobs running since July or August that were running under the >> account of someone who hasn't even been here for months! >> >> My opinion is these these are shared resources, and if you aren't >> interactively using them, you should log out to free up resources for >> others. If you have a job that can be run non-interactively, you should >> submit it to the cluster. >> >> Has anyone else here dealt with the problem? >> >> I would like to remove screen from my environment entirely to prevent >> this. My fellow sysadmins here agree. 
I'm expecting massive backlash >> from the users. > > I disallow rsh to the machines and limit ssh to admin staff. Users who want to run something on a machine have to go through the queuing system to get access to a node granted by GridEngine (for the startup method you can use either the -builtin- or [in case you need X11 forwarding] by a different sshd_config and ssh [GridEngine will start one daemon per task], one additional step is necessary for a tight integration of ssh). > > For users just checking their jobs on a node I have a dedicated queue (where they can login always, but h_cpu limited to 60 seconds, i.e. they can't abuse it). > > -- Reuti > Reuti, That was EXACTLY my original plan, but for reasons I don't want to get into, I can't implement that. In fact, just yesterday I ripped out all the SGE queues I had configured to that. Why? because I was tired of seeing them and being reminded of what a good idea it was. :( -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Fri Oct 21 12:12:53 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 21 Oct 2011 12:12:53 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA19365.4030109@runnersroll.com> References: <4EA16F3A.8080209@ias.edu> <4EA19365.4030109@runnersroll.com> Message-ID: <4EA19A05.4000400@ias.edu> On 10/21/2011 11:44 AM, Ellis H. Wilson III wrote: > On 10/21/11 09:10, Prentice Bisbal wrote: >> Beowulfers, >> >> I have a question that isn't directly related to clusters, but I suspect >> it's an issue many of you are dealing with are dealt with: users using >> the screen command to stay logged in on systems and running long jobs >> that they forget about. Have any of you experienced this, and how did >> you deal with it? > > I think this is strongly tied to what kind of work the users are doing > (i.e. how interactive it is, how long jobs take, how likely failure is > to occur that they must react to). In my personal experience the jobs I > spawn aren't interactive, tend to take a long time, and because of point > 2 require me to react pretty quickly to their failure or I lose out on > valuable compute-time. However, they are cumbersome to execute via a > queuing manager (my work is in systems, so perhaps that area is an > exception). Therefore what I always do is just nohup myself a job, and > tail -f it if I need to watch it. I've adapted my ssh config such that > I don't get booted off after 5 or 10 minutes without any input from me > (I think the limit I set is like 2hours or something), so I can watch > output fly by to my hearts content. > > If I were you, I think the best way to avoid a user-uprising, but to > achieve your goal is to give instructions on how a user can nohup (yes, > just assume they don't know how) and how to configure ssh to not die > after a short time. This way they don't have to worry about getting > disconnected if they aren't constantly interacting (so they can watch > output), but they also aren't staying logged on indefinitely (since > presumably their laptops/desktops aren't on indefinitely). 
> > If you give them an alternative that is well defined with an example > (not just, "Oh you can use such-and-such instead.") I can hardly believe > they'll be all that upset. > Ellis, Using nohup was exactly the advice I gave to one of my users yesterday. Not sure if he'll use it. 'man' is a very difficult program to learn, from what I understand. Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From reuti at staff.uni-marburg.de Fri Oct 21 11:24:32 2011 From: reuti at staff.uni-marburg.de (Reuti) Date: Fri, 21 Oct 2011 17:24:32 +0200 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA16F3A.8080209@ias.edu> References: <4EA16F3A.8080209@ias.edu> Message-ID: <7B82E572-588E-41A4-9B46-8A1A07360A30@staff.uni-marburg.de> Hi, Am 21.10.2011 um 15:10 schrieb Prentice Bisbal: > Beowulfers, > > I have a question that isn't directly related to clusters, but I suspect > it's an issue many of you are dealing with are dealt with: users using > the screen command to stay logged in on systems and running long jobs > that they forget about. Have any of you experienced this, and how did > you deal with it? > > Here's my scenario: > > In addition to my cluster, we have a bunch of "computer servers" where > users can run the programs. These are "large" boxes with more cores > (24-32 cores) and more RAM (128 - 256 GB, ECC) than they'd have on a > desktop top. > > Periodically, when I have to shutdown/reboot a system for maintenance, > I find a LOT of shells being run through the screen command for users > who aren't logged in. The majority are idle shells, but many are running > jobs, that seem to be forgotten about. For example, I recently found > some jobs running since July or August that were running under the > account of someone who hasn't even been here for months! > > My opinion is these these are shared resources, and if you aren't > interactively using them, you should log out to free up resources for > others. If you have a job that can be run non-interactively, you should > submit it to the cluster. > > Has anyone else here dealt with the problem? > > I would like to remove screen from my environment entirely to prevent > this. My fellow sysadmins here agree. I'm expecting massive backlash > from the users. I disallow rsh to the machines and limit ssh to admin staff. Users who want to run something on a machine have to go through the queuing system to get access to a node granted by GridEngine (for the startup method you can use either the -builtin- or [in case you need X11 forwarding] by a different sshd_config and ssh [GridEngine will start one daemon per task], one additional step is necessary for a tight integration of ssh). For users just checking their jobs on a node I have a dedicated queue (where they can login always, but h_cpu limited to 60 seconds, i.e. they can't abuse it). -- Reuti _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
From bug at sas.upenn.edu Fri Oct 21 11:17:55 2011 From: bug at sas.upenn.edu (Gavin W. Burris) Date: Fri, 21 Oct 2011 11:17:55 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <20111021134457.GA22748@grml> <4EA1854B.5090506@ias.edu> Message-ID: <4EA18D23.4050501@sas.upenn.edu> On 10/21/2011 11:06 AM, Kilian Cavalotti wrote: > Hi Prentice, > > On Fri, Oct 21, 2011 at 4:44 PM, Prentice Bisbal wrote: >>> Have you thought about queueing systems like condor or SGE? >> >> Yes, I have cluster that uses SGE, and we allow users to run serial jobs >> (non-MPI, etc.) there, so there is no need for them to use screen to >> execute long-running jobs. Hence my frustration. > > You could alias "screen" to "qlogin". :) > > Cheers, I think we have a winner. :) -- Gavin W. Burris Senior Systems Programmer Information Security and Unix Systems School of Arts and Sciences University of Pennsylvania _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From ellis at runnersroll.com Fri Oct 21 11:44:37 2011 From: ellis at runnersroll.com (Ellis H. Wilson III) Date: Fri, 21 Oct 2011 11:44:37 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA16F3A.8080209@ias.edu> References: <4EA16F3A.8080209@ias.edu> Message-ID: <4EA19365.4030109@runnersroll.com> On 10/21/11 09:10, Prentice Bisbal wrote: > Beowulfers, > > I have a question that isn't directly related to clusters, but I suspect > it's an issue many of you are dealing with are dealt with: users using > the screen command to stay logged in on systems and running long jobs > that they forget about. Have any of you experienced this, and how did > you deal with it? I think this is strongly tied to what kind of work the users are doing (i.e. how interactive it is, how long jobs take, how likely failure is to occur that they must react to). In my personal experience the jobs I spawn aren't interactive, tend to take a long time, and because of point 2 require me to react pretty quickly to their failure or I lose out on valuable compute-time. However, they are cumbersome to execute via a queuing manager (my work is in systems, so perhaps that area is an exception). Therefore what I always do is just nohup myself a job, and tail -f it if I need to watch it. I've adapted my ssh config such that I don't get booted off after 5 or 10 minutes without any input from me (I think the limit I set is like 2hours or something), so I can watch output fly by to my hearts content. If I were you, I think the best way to avoid a user-uprising, but to achieve your goal is to give instructions on how a user can nohup (yes, just assume they don't know how) and how to configure ssh to not die after a short time. This way they don't have to worry about getting disconnected if they aren't constantly interacting (so they can watch output), but they also aren't staying logged on indefinitely (since presumably their laptops/desktops aren't on indefinitely). If you give them an alternative that is well defined with an example (not just, "Oh you can use such-and-such instead.") I can hardly believe they'll be all that upset. 
Best, ellis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From ellis at runnersroll.com Fri Oct 21 12:26:09 2011 From: ellis at runnersroll.com (Ellis H. Wilson III) Date: Fri, 21 Oct 2011 12:26:09 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA19A05.4000400@ias.edu> References: <4EA16F3A.8080209@ias.edu> <4EA19365.4030109@runnersroll.com> <4EA19A05.4000400@ias.edu> Message-ID: <4EA19D21.3090902@runnersroll.com> On 10/21/11 12:12, Prentice Bisbal wrote: >> If you give them an alternative that is well defined with an example >> (not just, "Oh you can use such-and-such instead.") I can hardly believe >> they'll be all that upset. >> > > Ellis, > > Using nohup was exactly the advice I gave to one of my users yesterday. > Not sure if he'll use it. 'man' is a very difficult program to learn, > from what I understand. Hahaha, I love your cynicism. Right up my alley, however, I think in all seriousness 'man' does fall short for many applications in terms of examples (there are exceptions to this, but most man docs don't have examples from my experience). Many users just want examples of it's use, and can derive their case faster from such than custom-creation of a set of parameters from man. So just take a few moments, cook up an example of 'nohup ./someapp &> out.txt &' usage and associated ways to kill and watch it's output and put it all into an email. Save that email away, and when you're ready just shoot it out to everyone. Or if you have an internal wiki setup, that's much, much better. Just forward a link to some new page on it. If you make even a half-assed effort to show you are providing a viable alternative and a low bar to entry, you'll cut the number of people complaining at least in half. Best, ellis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From reuti at staff.uni-marburg.de Fri Oct 21 11:26:57 2011 From: reuti at staff.uni-marburg.de (Reuti) Date: Fri, 21 Oct 2011 17:26:57 +0200 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <20111021134457.GA22748@grml> <4EA1854B.5090506@ias.edu> Message-ID: <46778F4F-95ED-4FC7-B936-F8221A759916@staff.uni-marburg.de> Am 21.10.2011 um 17:06 schrieb Kilian Cavalotti: > Hi Prentice, > > On Fri, Oct 21, 2011 at 4:44 PM, Prentice Bisbal wrote: >>> Have you thought about queueing systems like condor or SGE? >> >> Yes, I have cluster that uses SGE, and we allow users to run serial jobs >> (non-MPI, etc.) there, so there is no need for them to use screen to >> execute long-running jobs. Hence my frustration. > > You could alias "screen" to "qlogin". :) Isn't it to late at that point if I get it right? They login by ssh to an exechost and issue thereon screen to reconnect later. But they should already use qlogin to go to the exechost. 
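For readers who haven't set this up, a sketch of what that looks like in practice with GridEngine; the queue names and limits below are illustrative, not a site recipe:

    # user side: an interactive shell on a compute node, granted by the scheduler
    qlogin -q interactive.q -l h_rt=8:00:00

    # admin side: the relevant lines of a "check on my job" queue in the spirit
    # of the one described above, hard-limited to one minute of CPU per login
    # (edit with: qconf -mq check.q)
    qname      check.q
    hostlist   @allhosts
    h_cpu      0:1:0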
-- Reuti > Cheers, > -- > Kilian > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Fri Oct 21 12:45:38 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 21 Oct 2011 09:45:38 -0700 Subject: [Beowulf] about 'man' Re: Users abusing screen In-Reply-To: <4EA19A05.4000400@ias.edu> Message-ID: On 10/21/11 9:12 AM, "Prentice Bisbal" wrote: > >Ellis, > >Using nohup was exactly the advice I gave to one of my users yesterday. >Not sure if he'll use it. 'man' is a very difficult program to learn, >from what I understand. Well... 'man' is easy, but sometimes, you need decent examples and tutorials. Just knowing what all the switches are and the format is like giving someone a dictionary and saying: now write me a sonnet. This is especially so for the "swiss army knife" type utilities (grep, I'm looking at you!) > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Fri Oct 21 10:44:27 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 21 Oct 2011 10:44:27 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <20111021134457.GA22748@grml> References: <4EA16F3A.8080209@ias.edu> <20111021134457.GA22748@grml> Message-ID: <4EA1854B.5090506@ias.edu> On 10/21/2011 09:44 AM, Henning Fehrmann wrote: > Hi Prentice, > > On Fri, Oct 21, 2011 at 09:10:18AM -0400, Prentice Bisbal wrote: >> Beowulfers, >> >> I have a question that isn't directly related to clusters, but I suspect >> it's an issue many of you are dealing with are dealt with: users using >> the screen command to stay logged in on systems and running long jobs >> that they forget about. Have any of you experienced this, and how did >> you deal with it? >> >> Here's my scenario: >> >> In addition to my cluster, we have a bunch of "computer servers" where >> users can run the programs. These are "large" boxes with more cores >> (24-32 cores) and more RAM (128 - 256 GB, ECC) than they'd have on a >> desktop top. >> >> Periodically, when I have to shutdown/reboot a system for maintenance, >> I find a LOT of shells being run through the screen command for users >> who aren't logged in. The majority are idle shells, but many are running >> jobs, that seem to be forgotten about. For example, I recently found >> some jobs running since July or August that were running under the >> account of someone who hasn't even been here for months! >> >> My opinion is these these are shared resources, and if you aren't >> interactively using them, you should log out to free up resources for >> others. If you have a job that can be run non-interactively, you should >> submit it to the cluster. >> >> Has anyone else here dealt with the problem? 
>> >> I would like to remove screen from my environment entirely to prevent >> this. My fellow sysadmins here agree. I'm expecting massive backlash >> from the users. > > I wouldn't deinstall screen. It is a useful tool for many things and > there are alternatives doing the same. Instead one could enforce a > maximum CPU time a job can take by setting ulimits. > > Have you thought about queueing systems like condor or SGE? Yes, I have cluster that uses SGE, and we allow users to run serial jobs (non-MPI, etc.) there, so there is no need for them to use screen to execute long-running jobs. Hence my frustration. Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From kilian.cavalotti.work at gmail.com Fri Oct 21 11:06:11 2011 From: kilian.cavalotti.work at gmail.com (Kilian Cavalotti) Date: Fri, 21 Oct 2011 17:06:11 +0200 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA1854B.5090506@ias.edu> References: <4EA16F3A.8080209@ias.edu> <20111021134457.GA22748@grml> <4EA1854B.5090506@ias.edu> Message-ID: Hi Prentice, On Fri, Oct 21, 2011 at 4:44 PM, Prentice Bisbal wrote: >> Have you thought about queueing systems like condor or SGE? > > Yes, I have cluster that uses SGE, and we allow users to run serial jobs > (non-MPI, etc.) there, so there is no need for them to use screen to > execute long-running jobs. Hence my frustration. You could alias "screen" to "qlogin". :) Cheers, -- Kilian _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From atp at piskorski.com Fri Oct 21 15:14:01 2011 From: atp at piskorski.com (Andrew Piskorski) Date: Fri, 21 Oct 2011 15:14:01 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA16F3A.8080209@ias.edu> References: <4EA16F3A.8080209@ias.edu> Message-ID: <20111021191401.GA87390@piskorski.com> On Fri, Oct 21, 2011 at 09:10:18AM -0400, Prentice Bisbal wrote: > My opinion is these these are shared resources, and if you aren't > interactively using them, you should log out to free up resources for > others. "running under screen" != "non-interactive". > I would like to remove screen from my environment entirely to prevent > this. My fellow sysadmins here agree. I'm expecting massive backlash > from the users. No shit. If you allow users to login at all, then (IMNSHO) removing screen is insane. That's not a solution to your problem, that's creating a totally new problem and pretending it's a solution. I essentially always use screen whenever I ssh to any Linux box for any reason. If my sysadmin arbitrarily disabled screen because some other user was doing something dumb, I'd be pretty upset too. (Annoyed enough to maybe just build screen myself on that box.) 
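For anyone who hasn't adopted that habit, a minimal sketch of the screen workflow being defended here; the session name is arbitrary:

    screen -S analysis            # start a named session on the server
    ./long_job > job.log 2>&1     # run work inside it as usual
    # detach with Ctrl-a d, or simply lose the connection; the session survives

    screen -ls                    # from a later ssh login: list sessions
    screen -r analysis            # reattach exactly where you left off
    exit                          # close the shell when done so the session ends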
-- Andrew Piskorski http://www.piskorski.com/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From peter.st.john at gmail.com Fri Oct 21 22:18:19 2011 From: peter.st.john at gmail.com (Peter St. John) Date: Fri, 21 Oct 2011 22:18:19 -0400 Subject: [Beowulf] about 'man' Re: Users abusing screen In-Reply-To: References: <4EA19A05.4000400@ias.edu> Message-ID: I'm not a sysadmin, but I thought these days we were supposed to point [end]users at "help" or "doc" instead of man? Man is like sdb, it's great but not for everyone, you need context to appreciate it. I think in System V type derivatives it's usually "help"? peter On Fri, Oct 21, 2011 at 12:45 PM, Lux, Jim (337C) wrote: > > > On 10/21/11 9:12 AM, "Prentice Bisbal" wrote: > > > >Ellis, > > > >Using nohup was exactly the advice I gave to one of my users yesterday. > >Not sure if he'll use it. 'man' is a very difficult program to learn, > >from what I understand. > > Well... 'man' is easy, but sometimes, you need decent examples and > tutorials. Just knowing what all the switches are and the format is like > giving someone a dictionary and saying: now write me a sonnet. This is > especially so for the "swiss army knife" type utilities (grep, I'm looking > at you!) > > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From ellis at runnersroll.com Sat Oct 22 08:02:35 2011 From: ellis at runnersroll.com (Ellis H. Wilson III) Date: Sat, 22 Oct 2011 08:02:35 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <20111021191401.GA87390@piskorski.com> References: <4EA16F3A.8080209@ias.edu> <20111021191401.GA87390@piskorski.com> Message-ID: <4EA2B0DB.3040702@runnersroll.com> On 10/21/11 15:14, Andrew Piskorski wrote: > On Fri, Oct 21, 2011 at 09:10:18AM -0400, Prentice Bisbal wrote: > >> My opinion is these these are shared resources, and if you aren't >> interactively using them, you should log out to free up resources for >> others. > > "running under screen" != "non-interactive". What I think Prentice was pointing out here was more along the lines of: "non-interactive" >= "running under screen" <= interactive Where interactivity is more of a spectrum than a != or =. More pointedly, he stated his users are acting in a non-interactive manner, in some cases even after they leave, which is irresponsible at all levels. Obviously he has to balance a rule-set between the good users and the bad users, such that abuse isn't quite as easy. >> I would like to remove screen from my environment entirely to prevent >> this. My fellow sysadmins here agree. I'm expecting massive backlash >> from the users. 
> > No shit. If you allow users to login at all, then (IMNSHO) removing > screen is insane. That's not a solution to your problem, that's > creating a totally new problem and pretending it's a solution. Insane? I mean, I do a lot of work on a bunch of different distros and hardware types, and have found little use for screen /unless/ I was on a really, really poor internet connection that cut out on the minutes level. Can you give some examples regarding something you can do with screen you cannot do with nohup and tail? > I essentially always use screen whenever I ssh to any Linux box for > any reason. But why? Just leave a terminal open if you want interactivity, otherwise nohup something. Perhaps I've understated screen's usefulness, but I'm glad to be corrected/educated on it's efficacy in this area. Best, ellis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From skylar at cs.earlham.edu Sat Oct 22 13:24:02 2011 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Sat, 22 Oct 2011 10:24:02 -0700 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA2B0DB.3040702@runnersroll.com> References: <4EA16F3A.8080209@ias.edu> <20111021191401.GA87390@piskorski.com> <4EA2B0DB.3040702@runnersroll.com> Message-ID: <4EA2FC32.9000605@cs.earlham.edu> On 10/22/11 05:02, Ellis H. Wilson III wrote: > > Insane? I mean, I do a lot of work on a bunch of different distros and > hardware types, and have found little use for screen /unless/ I was on a > really, really poor internet connection that cut out on the minutes > level. Can you give some examples regarding something you can do with > screen you cannot do with nohup and tail? > > Here's a few I can think of: * Multiple shells off one login * Scroll buffer * Copy&paste w/o needing a mouse * Start session logging at any time, w/o needing to remember to use script or nohup I guess I'm with Andrew, where the first thing I do upon logging in is either connecting to an existing screen session or starting a fresh one. -- -- Skylar Thompson (skylar at cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 262 bytes Desc: OpenPGP digital signature URL: -------------- next part -------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From j.wender at science-computing.de Mon Oct 24 02:30:12 2011 From: j.wender at science-computing.de (Jan Wender) Date: Mon, 24 Oct 2011 08:30:12 +0200 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA16F3A.8080209@ias.edu> References: <4EA16F3A.8080209@ias.edu> Message-ID: <4EA505F4.7080007@science-computing.de> On 10/21/2011 03:10 PM, Prentice Bisbal wrote: > I have a question that isn't directly related to clusters, but I suspect > it's an issue many of you are dealing with are dealt with: users using > the screen command to stay logged in on systems and running long jobs > that they forget about. Have any of you experienced this, and how did > you deal with it? 
How about killing long-running (either elapsed or used time) processes not started through the batch system? You should be able to identify them by looking at the process tree. At least one cluster I know kills all user processes which have not been started from the queueing system. Cheerio, Jan -- ---- Company Information ---- Vorstand/Board of Management: Dr. Bernd Finkbeiner, Dr. Roland Niemeier, Dr. Arno Steitz, Dr. Ingrid Zech Vorsitzender des Aufsichtsrats/Chairman of the Supervisory Board: Philippe Miltin Sitz/Registered Office: Tuebingen Registergericht/Registration Court: Stuttgart Registernummer/Commercial Register No.: HRB 382196 -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. -------------- next part -------------- A non-text attachment was scrubbed... Name: j_wender.vcf Type: text/x-vcard Size: 338 bytes Desc: not available URL: -------------- next part -------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From greg.matthews at diamond.ac.uk Mon Oct 24 07:00:19 2011 From: greg.matthews at diamond.ac.uk (Gregory Matthews) Date: Mon, 24 Oct 2011 12:00:19 +0100 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA19A05.4000400@ias.edu> References: <4EA16F3A.8080209@ias.edu> <4EA19365.4030109@runnersroll.com> <4EA19A05.4000400@ias.edu> Message-ID: <4EA54543.5090908@diamond.ac.uk> Prentice Bisbal wrote: > Using nohup was exactly the advice I gave to one of my users yesterday. > Not sure if he'll use it. 'man' is a very difficult program to learn, > from what I understand. our experience of ppl using nohup without really thinking it through is eventually filling the partition with an enormous nohup.out file. GREG > > Prentice > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Greg Matthews 01235 778658 Senior Computer Systems Administrator Diamond Light Source, Oxfordshire, UK -- This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail. Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message. Diamond Light Source Limited (company no. 4375679). 
Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From reuti at staff.uni-marburg.de Mon Oct 24 07:20:02 2011 From: reuti at staff.uni-marburg.de (Reuti) Date: Mon, 24 Oct 2011 13:20:02 +0200 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA54543.5090908@diamond.ac.uk> References: <4EA16F3A.8080209@ias.edu> <4EA19365.4030109@runnersroll.com> <4EA19A05.4000400@ias.edu> <4EA54543.5090908@diamond.ac.uk> Message-ID: <9DA6F2A5-6736-457F-AE89-C5EC56735C09@staff.uni-marburg.de> Am 24.10.2011 um 13:00 schrieb Gregory Matthews: > Prentice Bisbal wrote: >> Using nohup was exactly the advice I gave to one of my users yesterday. >> Not sure if he'll use it. 'man' is a very difficult program to learn, >> from what I understand. > > our experience of ppl using nohup without really thinking it through is > eventually filling the partition with an enormous nohup.out file. It's possible to make an alias, so that "nohup" reads "nohup > /dev/null" The redirection doesn't need to be at the end of the command. Depends whether they need the output, and/or any output file is created by the application on its own anyway. -- Reuti > GREG > >> >> Prentice >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >> > > > -- > Greg Matthews 01235 778658 > Senior Computer Systems Administrator > Diamond Light Source, Oxfordshire, UK > > -- > This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail. > Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. > Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message. > Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
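A minimal sketch of the alias Reuti describes, assuming bash and users who genuinely do not need the output (the job script and log file names below are placeholders, not anything from this thread):

    # ~/.bashrc -- discard stdout by default; the shell accepts a redirection
    # before the command, so the alias still composes with any arguments
    alias nohup='nohup > /dev/null'

    # the "start it, log out, check later" pattern discussed above, for users
    # who do want the output; an explicit redirection overrides the alias
    nohup ./long_job.sh > long_job.log 2>&1 &
    tail -f long_job.log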
From prentice at ias.edu Mon Oct 24 09:42:23 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Mon, 24 Oct 2011 09:42:23 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA2B0DB.3040702@runnersroll.com> References: <4EA16F3A.8080209@ias.edu> <20111021191401.GA87390@piskorski.com> <4EA2B0DB.3040702@runnersroll.com> Message-ID: <4EA56B3F.3060404@ias.edu> On 10/22/2011 08:02 AM, Ellis H. Wilson III wrote: > On 10/21/11 15:14, Andrew Piskorski wrote: >> On Fri, Oct 21, 2011 at 09:10:18AM -0400, Prentice Bisbal wrote: >> >>> My opinion is these these are shared resources, and if you aren't >>> interactively using them, you should log out to free up resources for >>> others. >> "running under screen" != "non-interactive". > What I think Prentice was pointing out here was more along the lines of: > "non-interactive" >= "running under screen" <= interactive > Where interactivity is more of a spectrum than a != or =. More > pointedly, he stated his users are acting in a non-interactive manner, > in some cases even after they leave, which is irresponsible at all > levels. Obviously he has to balance a rule-set between the good users > and the bad users, such that abuse isn't quite as easy. Thanks for coming to my defense, Ellis. I don't think I could have explained it better myself. >>> I would like to remove screen from my environment entirely to prevent >>> this. My fellow sysadmins here agree. I'm expecting massive backlash >>> from the users. >> No shit. If you allow users to login at all, then (IMNSHO) removing >> screen is insane. That's not a solution to your problem, that's >> creating a totally new problem and pretending it's a solution. > Insane? I mean, I do a lot of work on a bunch of different distros and > hardware types, and have found little use for screen /unless/ I was on a > really, really poor internet connection that cut out on the minutes > level. Can you give some examples regarding something you can do with > screen you cannot do with nohup and tail? I agree. I've been a professional sys admin using Unix/Linux day in and day out for well over 10 years, and not one days has gone by where I saw a need for screen. >> I essentially always use screen whenever I ssh to any Linux box for >> any reason. > But why? Just leave a terminal open if you want interactivity, > otherwise nohup something. Perhaps I've understated screen's > usefulness, but I'm glad to be corrected/educated on it's efficacy in > this area. > > Best, > > ellis > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
From prentice at ias.edu Mon Oct 24 09:46:49 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Mon, 24 Oct 2011 09:46:49 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA505F4.7080007@science-computing.de> References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> Message-ID: <4EA56C49.9060204@ias.edu> On 10/24/2011 02:30 AM, Jan Wender wrote: > On 10/21/2011 03:10 PM, Prentice Bisbal wrote: >> I have a question that isn't directly related to clusters, but I suspect >> it's an issue many of you are dealing with are dealt with: users using >> the screen command to stay logged in on systems and running long jobs >> that they forget about. Have any of you experienced this, and how did >> you deal with it? > How about killing long-running (either elapsed or used time) processes not > started through the batch system? You should be able to identify them by looking > at the process tree. > At least one cluster I know kills all user processes which have not been started > from the queueing system. The systems where screen is being abused are not part of the batch system, and they will not /can not be for reasons I don't want to get into here. The problem with killing long-running programs is that there are often long running programs that are legitimate in my evironment. I can quickly scan 'ps' output and determine which is which, but I doubt that kind of intelligence could ever be built into a shell script. -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Mon Oct 24 10:22:50 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Mon, 24 Oct 2011 10:22:50 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> Message-ID: <4EA574BA.2050304@ias.edu> Anything is possible if you're a good enough programmer. Like I said earlier, there are some users legitimately running long jobs on the systems in question. Instead of developing a clever program to automatically kill long running screen jobs, I think it would be better to be up front with my users and remove screen, rather than let them use it, only to surprise them later by killing their jobs. On 10/24/2011 09:55 AM, geert geurts wrote: > > Hello Prentice, > > Screen is a essential app, for sure. > But as an answer to the initial question... > I'm not much of a programmer, but can't you replace the binary with a > custom compiled version which runs two threads? One with the initial > program, and one which sleeps for the maximum amount of time you're > willing to allow screen sessions to last, and kills the session when > the time runs out... > > Or maybe build some script around the actual binary to do the same.. > > > Regards, > Geert > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
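One way to sketch geert's "script around the actual binary" idea without recompiling anything is a wrapper that sets an inherited resource limit before exec'ing the real screen. The path and the 12-hour figure are assumptions for illustration only, and the cap is on CPU time per process, not on how long an idle session may sit around:

    #!/bin/bash
    # hypothetical /usr/bin/screen wrapper; assumes the admin has moved the
    # real binary aside to /usr/bin/screen.real (an invented path).
    # rlimits are inherited across fork/exec, so the detached SCREEN server
    # and every shell or job started inside it get the same per-process cap.
    ulimit -t $((12 * 3600))        # 12 CPU-hours of CPU time per process
    exec /usr/bin/screen.real "$@"

A forgotten idle session costs almost no CPU, so a wrapper like this only catches runaway compute jobs; the abandoned-login problem still needs something like the TMOUT or watcher approaches mentioned later in the thread.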
From samuel at unimelb.edu.au Mon Oct 24 18:48:44 2011 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 25 Oct 2011 09:48:44 +1100 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA16F3A.8080209@ias.edu> References: <4EA16F3A.8080209@ias.edu> Message-ID: <4EA5EB4C.3000809@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 22/10/11 00:10, Prentice Bisbal wrote: > I have a question that isn't directly related to clusters, but I suspect > it's an issue many of you are dealing with are dealt with: users using > the screen command to stay logged in on systems and running long jobs > that they forget about. Have any of you experienced this, and how did > you deal with it? Hmm, any way of making a local version of screen which puts all the processes into a cpuset or control group so you can easily distinguish between ones in screen and outside of it ? Perhaps even doing it with a wrapper if you didn't want to build a modified version ? That way you get to restrict the number of cores they can monopolise.. Of course a user could get around it by building their own copy, but at least then you'd be able to see that.. cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk6l60wACgkQO2KABBYQAh/YtwCfegBzvEpH/s4PtHnFlEwSqQLK UO8An3DK20lEVrT9WM8qln0wM7alKoU6 =oInQ -----END PGP SIGNATURE----- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Tue Oct 25 19:13:05 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 25 Oct 2011 16:13:05 -0700 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA56C49.9060204@ias.edu> References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> Message-ID: <20111025231305.GC9493@bx9.net> On Mon, Oct 24, 2011 at 09:46:49AM -0400, Prentice Bisbal wrote: > The systems where screen is being abused are not part of the batch > system, and they will not /can not be for reasons I don't want to get > into here. The problem with killing long-running programs is that there > are often long running programs that are legitimate in my evironment. I > can quickly scan 'ps' output and determine which is which, but I doubt > that kind of intelligence could ever be built into a shell script. I see that you didn't bother to check out the software proposed soon after you asked your question. If you don't check out potential answers because you doubt they will work, why should anyone bother to reply to you? The problem you have is a common issue in university environments, and the common solution is a script that accurately figures out long-running cpu-intensive programs and nices/kills them. I first ran into such a thing in, oh, 1992? It's not rocket science. 
-- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Wed Oct 26 10:31:56 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 26 Oct 2011 10:31:56 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <20111025231305.GC9493@bx9.net> References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> Message-ID: <4EA819DC.9090106@ias.edu> On 10/25/2011 07:13 PM, Greg Lindahl wrote: > On Mon, Oct 24, 2011 at 09:46:49AM -0400, Prentice Bisbal wrote: > >> The systems where screen is being abused are not part of the batch >> system, and they will not /can not be for reasons I don't want to get >> into here. The problem with killing long-running programs is that there >> are often long running programs that are legitimate in my evironment. I >> can quickly scan 'ps' output and determine which is which, but I doubt >> that kind of intelligence could ever be built into a shell script. > I see that you didn't bother to check out the software proposed soon > after you asked your question. If you don't check out potential > answers because you doubt they will work, why should anyone bother to > reply to you? Greg, I didn't realize I needed to log a detailed response to every suggestion made to me on this list. I've been a member of this list for quite sometime, and I've never seen a comment like yours before. You're out of line. People should bother to reply to me because I've been a participating member of this list for 4 years now, and often assist others when I can. I don't expect a response to every suggestion I provide to others. -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From bcostescu at gmail.com Wed Oct 26 11:41:50 2011 From: bcostescu at gmail.com (Bogdan Costescu) Date: Wed, 26 Oct 2011 17:41:50 +0200 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA16F3A.8080209@ias.edu> References: <4EA16F3A.8080209@ias.edu> Message-ID: On Fri, Oct 21, 2011 at 15:10, Prentice Bisbal wrote: > Periodically, when I have to shutdown/reboot a system for maintenance, > I find a LOT of shells being run through the screen command for users > who aren't logged in. The majority are idle shells, but many are running > jobs, that seem to be forgotten about. > ... > I would like to remove screen from my environment entirely to prevent > this. >From what I understand from your message, it's not screen per-se which upsets you, it's the way it is (ab)used by some users to start long running memory hogging jobs; you seem to be OK with idle shells found at maintenance time which are still started through screen. So why the backlash against screen ? Starting jobs in the background can be done directly through the shell, with no screen; if the job can be split in smaller pieces time-wise, they can be started by at/cron; screen can be installed by a user, possible under a different name... 
so many and surely other possibilities to still upset you even if you uninstall screen, because you focus on the wrong subject. To deal with forgotten long running jobs, you have various administrative (f.e. bill users/groups, even if in some kind of symbolic way) or technical (f.e. only allow 24h CPU time through system-wide limits or install a daemon which watches and warns and/or takes measures) means - some of these have been discussed on this very list in the past or have been mentioned earlier in this thread. Each situation is different (f.e. some legitimate jobs could run for more than 24h), so you should check all suggestions and apply the one(s) which fit(s) best. I know from my own experience that it's not easy to be on this side of the fence :-) Good luck! Bogdan _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From rgb at phy.duke.edu Wed Oct 26 12:22:31 2011 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed, 26 Oct 2011 12:22:31 -0400 (EDT) Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> Message-ID: OK, OK, I haven't participated in this discussion so far -- way too busy. But since it keeps on going, and going, and going, and since nobody has mentioned the obvious and permanent solution, I'm going to have to bring it up: >From "man 8 syslogd", which alas seems to no longer exist save in our hearts and memories, when confronted with any sort of persistent system abuse: 5. Use step 4 and if the problem persists and is not secondary to a rogue program/daemon get a 3.5 ft (approx. 1 meter) length of sucker rod* and have a chat with the user in question. * Sucker rod def. ? 3/4, 7/8 or 1in. hardened steel rod, male threaded on each end. Primary use in the oil industry in West- ern North Dakota and other locations to pump 'suck' oil from oil wells. Secondary uses are for the construction of cattle feed lots and for dealing with the occasional recalcitrant or bel- ligerent individual. I've found that the "sucker rod solution" is really the only one that ultimately works. Even if it is merely present when discussing the problem with the worst offenders, it marvelously focusses the mind on the severity of the issue. Otherwise (as has been pointed out repeatedly) it is rather trivial to write an e.g. cron script that reaps/kills ANYTHING undesireable on a public server. Invariably they will sooner or later kill something that shouldn't be killed in the sense that it is doing some sort of useful work, but screen isn't likely to be something in that category. Myself, I like the sucker rod approach. BANG down on the desk with it and say something ominous like "So, you've been cluttering up my server with unattended and abandoned sessions. Would you be so kind as to CEASE (bam) and DESIST (bam) from this antisocial activity?" Then mutter something about too much Jolt Cola and back away slowly. Don't worry too much about the divots you leave in the desk or the coffee mug that somehow got shattered. They'll be useful reminders the next time he or she considers walking way from a multiplexed screen session. rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From landman at scalableinformatics.com Wed Oct 26 12:42:50 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 26 Oct 2011 12:42:50 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> Message-ID: <4EA8388A.6060704@scalableinformatics.com> On 10/26/2011 12:22 PM, Robert G. Brown wrote: > Myself, I like the sucker rod approach. BANG down on the desk with it > and say something ominous like "So, you've been cluttering up my server > with unattended and abandoned sessions. Would you be so kind as to > CEASE (bam) and DESIST (bam) from this antisocial activity?" Then > mutter something about too much Jolt Cola and back away slowly. [donning his old New Yawk accent ... "Hey, we don't gots no accent ... you'se got an accent..."] "Thats a nice computer model you have there perfesser ... be a shame to have to run it over ... TCP over SLIP (serial line IP) ..." "So you like that 64 bit math, eh? Lets see how well you compute with a few less bits ..." [back to your regularly scheduled supercomputer cluster] -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Wed Oct 26 16:55:13 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 26 Oct 2011 16:55:13 -0400 (EDT) Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA819DC.9090106@ias.edu> References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> Message-ID: > sometime, and I've never seen a comment like yours before. You're out of > line. hah. Greg doesn't post all that much, but he's no stranger to the flame ;) seriously, your question seemed to be about a general problem, but your motive, ulterior or not, seemed to be to get rid of screen. IMO, getting rid of screen is BOFHishness of the first order. it's a tool that has valuable uses. it's not the cause of your problem. on our login nodes, we have some basic limits (/etc/security/limit.conf) that prevent large or long processes or numerous processes. * hard as 3000000 * hard cpu 60 * hard nproc 100 * hard maxlogins 20 these are very arguable, and actually pretty loose. our login nodes are intended for editing/compiling/submitting, maybe the occasional gnuplot/etc. there doesn't seem to be much resistance to the 3G as (vsz) limit, and it does definitely cut down on OOM problems. 60 cpu-minutes covers any possible compile/etc (though it has caused problems with people trying to do very large scp operations.) 
nproc could probably be much lower (20?) and maxlogins ought to be more like 5. we don't currently have an idle-process killer, though have thought of it. we only recently put a default TMOUT in place to cause a bit of gc on forgotten login sessions. we do have screen installed (I never use it myself.) regards, mark hahn. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From scrusan at ur.rochester.edu Wed Oct 26 17:14:13 2011 From: scrusan at ur.rochester.edu (Steve Crusan) Date: Wed, 26 Oct 2011 17:14:13 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Oct 26, 2011, at 4:55 PM, Mark Hahn wrote: >> sometime, and I've never seen a comment like yours before. You're out of >> line. > > hah. Greg doesn't post all that much, but he's no stranger to the flame ;) > > seriously, your question seemed to be about a general problem, > but your motive, ulterior or not, seemed to be to get rid of screen. > > IMO, getting rid of screen is BOFHishness of the first order. > it's a tool that has valuable uses. it's not the cause of your problem. I agree. - From reading this thread, the original machine(s) in question seem to be some sort of interactive or login node(s). If these nodes were large memory or SMP machines, we'd have our resource manager take care of long running processes or other abuses. > > on our login nodes, we have some basic limits (/etc/security/limit.conf) > that prevent large or long processes or numerous processes. > > * hard as 3000000 > * hard cpu 60 > * hard nproc 100 > * hard maxlogins 20 > > these are very arguable, and actually pretty loose. our login nodes are > intended for editing/compiling/submitting, maybe the occasional gnuplot/etc. > there doesn't seem to be much resistance to the 3G as (vsz) limit, and > it does definitely cut down on OOM problems. 60 cpu-minutes covers any > possible compile/etc (though it has caused problems with people trying to > do very large scp operations.) nproc could probably be much lower (20?) > and maxlogins ought to be more like 5. We actually just spinned up a graphical login node for our less saavy users whom are more apt to run matlab, comsol, gnuplot, and other 'EZ button' graphically based scientific software. This graphical login software (http://code.google.com/p/neatx/) has helped us a lot with novice users. It has session resumption, client software for any platforms, it's faster than xforwarding, and it's wrapped around SSH. The node itself is 'fairly' heavy (8 procs, 72GB of RAM), but we've implemented cgroups to stop abuses. Upon login (through SSH or NX) each user is added to his own control group, which has processor and memory limits. 
Since the user's processes are kept inside of control group process spaces, it's easy to work directly with their processes/process trees, whether it be dynamic throttling, or just killing processes. On our login nodes that don't use control groups, we just kill any heavy computational processes after a certain period of time, depending on whether or not it's a compilation step, gzip, etc. We state this in our documentation, and usually give the user a warning+grace period. We don't see this type of abuse anymore because the few users whom have done this quickly learned (and apologized, imagine that!), or they were using our cgroup setup login node, so their abuse didn't affect the system enough. If the issue is processes that run for far too long, and are abusing the system, cgroups or 'pushing' the users to use a batch system seems to work better than writing scripts to make decisions on killing processes. Most ISVs have methods to run computation in batch mode, so it's not necessary for matlab type users to have their applications running for 3 weeks in a screen session when they could be using the cluster. Either that, or using some sort of cpu/memory limits that were listed above, or cgroups. So a process can run forever, but it won't have enough CPU/memory shares to make a difference. Just my .02 > > we don't currently have an idle-process killer, though have thought of it. > we only recently put a default TMOUT in place to cause a bit of gc on > forgotten login sessions. > > we do have screen installed (I never use it myself.) > > regards, mark hahn. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ---------------------- Steve Crusan System Administrator Center for Research Computing University of Rochester https://www.crc.rochester.edu/ -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.17 (Darwin) Comment: GPGTools - http://gpgtools.org iQEcBAEBAgAGBQJOqHgzAAoJENS19LGOpgqKDHQH/AqfAefrt3nusElS/OBnxgBK Pf8tFuyjoJvLgt+3KX19ZL18r1b/BhdW3/1GZgSVVjQZcYkV6dtUq6VI545jqDag lRY9kvyIhudKfVhFwGa87DbXSzYv5oDImf3UejsIiJvo20Bzxf7mdpToT+AGJ4gA J2HzrZwjdZk/DYEJ7CpG9lfthDDq5mrTQTbzVCnFHvEiWpeoBvfd3gJOP94age0F 0ZQGLCgheRSJXLsOlq0y0vqr+7nzupSrLUk5A1YcUysSpk4Dc4mvUVJFE+QbStN6 dSiYHhKMxF5qJTXYOSAF4QDmIObyzlbFFmHCeTTWrCG7KeWtOZU4zUfN7TL3sO4= =M5Pw -----END PGP SIGNATURE----- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
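For anyone curious what the per-user control group Steve describes looks like in practice, here is a rough sketch using the libcgroup command-line tools; the group path, CPU share and memory figure are invented examples rather than Steve's actual settings, and a production setup would more likely hang off cgconfig/cgred or a PAM session hook:

    #!/bin/bash
    # cap_user.sh <username> -- illustrative login hook only
    user=$1
    # one cgroup per user under an invented "interactive" hierarchy
    cgcreate -g cpu,memory:interactive/"$user"
    cgset -r cpu.shares=256 interactive/"$user"
    cgset -r memory.limit_in_bytes=$((8 * 1024 * 1024 * 1024)) interactive/"$user"   # 8 GB
    # move the user's current login shells in; their children inherit the cgroup
    cgclassify -g cpu,memory:interactive/"$user" $(pgrep -u "$user" -x bash)

cgexec -g cpu,memory:interactive/$USER some_command is the other half of the same toolset, if wrapping individual commands is preferable to confining whole logins.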
From lindahl at pbm.com Thu Oct 27 01:41:47 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 26 Oct 2011 22:41:47 -0700 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> Message-ID: <20111027054147.GB29939@bx9.net> On Wed, Oct 26, 2011 at 05:14:13PM -0400, Steve Crusan wrote: > If the issue is processes that run for far too long, and are abusing > the system, cgroups or 'pushing' the users to use a batch system seems > to work better than writing scripts to make decisions on killing > processes. What I saw work well was nicing the process after a certain time, including an email, and then killing and emailing after a longer time. The emails can push the batch alternative. Users generally don't become angry if the limits are enforced by a script; they can only be surprised once, and that first time is just nicing the process. If they have a hard time predicting runtime (a common issue, especially for non-hardcore supercomputing types), it's not like they _intentionally_ are exceeding the limits... -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Thu Oct 27 10:49:51 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 27 Oct 2011 10:49:51 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <20111027054147.GB29939@bx9.net> References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> Message-ID: <4EA96F8F.1010207@ias.edu> On 10/27/2011 01:41 AM, Greg Lindahl wrote: > On Wed, Oct 26, 2011 at 05:14:13PM -0400, Steve Crusan wrote: > >> If the issue is processes that run for far too long, and are abusing >> the system, cgroups or 'pushing' the users to use a batch system seems >> to work better than writing scripts to make decisions on killing >> processes. > What I saw work well was nicing the process after a certain time, > including an email, and then killing and emailing after a longer > time. The emails can push the batch alternative. Users generally don't > become angry if the limits are enforced by a script; they can only be > surprised once, and that first time is just nicing the process. If > they have a hard time predicting runtime (a common issue, especially > for non-hardcore supercomputing types), it's not like they > _intentionally_ are exceeding the limits... Exactly. That's why I don't want to automate killing jobs longer than X days. Honestly, I can't believe how much controversy this discussion has created. I thought my OP would go unnoticed. Next time, I'll just ask which text editor I should use. 
;) -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From dnlombar at ichips.intel.com Thu Oct 27 12:04:21 2011 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Thu, 27 Oct 2011 09:04:21 -0700 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> Message-ID: <20111027160421.GA28306@nlxcldnl2.cl.intel.com> On Wed, Oct 26, 2011 at 02:55:13PM -0600, Mark Hahn wrote: > > sometime, and I've never seen a comment like yours before. You're out of > > line. > > hah. Greg doesn't post all that much, but he's no stranger to the flame ;) > > seriously, your question seemed to be about a general problem, > but your motive, ulterior or not, seemed to be to get rid of screen. > > IMO, getting rid of screen is BOFHishness of the first order. > it's a tool that has valuable uses. it's not the cause of your problem. Completely agree with this. If you get rid of screen, another tool will be used, perhaps even as simple as a private copy, or nohup and tail as others suggested. My primary use of screen is to do work across home and the office. Nohup only solves one of the potential scenarios. If screen were removed, my productivity would go down. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From glykos at mbg.duth.gr Thu Oct 27 15:19:37 2011 From: glykos at mbg.duth.gr (Nicholas M Glykos) Date: Thu, 27 Oct 2011 22:19:37 +0300 (EEST) Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA96F8F.1010207@ias.edu> References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> <4EA96F8F.1010207@ias.edu> Message-ID: > Exactly. That's why I don't want to automate killing jobs longer than X > days. Probably irrelevant after so many suggestions, but Caos NSA had this very nice 'pam_slurm' module which allows a user to login only to those nodes on which the said user has active jobs (allocated through slurm). The principal idea ["you are welcome to be bring your allocated node (and, thus, your job) to a halt if that's what you want"], sounds pedagogically attractive ... ;-) Nicholas -- Dr Nicholas M. 
Glykos, Department of Molecular Biology and Genetics, Democritus University of Thrace, University Campus, Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620, Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Thu Oct 27 15:33:18 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 27 Oct 2011 15:33:18 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> <4EA96F8F.1010207@ias.edu> Message-ID: <4EA9B1FE.8090903@ias.edu> On 10/27/2011 03:19 PM, Nicholas M Glykos wrote: > >> Exactly. That's why I don't want to automate killing jobs longer than X >> days. > Probably irrelevant after so many suggestions, but Caos NSA had this very > nice 'pam_slurm' module which allows a user to login only to those nodes > on which the said user has active jobs (allocated through slurm). The > principal idea ["you are welcome to be bring your allocated node (and, > thus, your job) to a halt if that's what you want"], sounds pedagogically > attractive ... ;-) > > This doesn't apply to my case, since access to the systems in question isn't controlled by a queuing system. That alone would fix the problem. I think there's a similar pam module for SGE, too. -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From reuti at staff.uni-marburg.de Thu Oct 27 15:43:59 2011 From: reuti at staff.uni-marburg.de (Reuti) Date: Thu, 27 Oct 2011 21:43:59 +0200 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA9B1FE.8090903@ias.edu> References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> <4EA96F8F.1010207@ias.edu> <4EA9B1FE.8090903@ias.edu> Message-ID: <94F21C03-C8BB-4DB4-AA3A-D1271524E43E@staff.uni-marburg.de> Am 27.10.2011 um 21:33 schrieb Prentice Bisbal: > On 10/27/2011 03:19 PM, Nicholas M Glykos wrote: >> >>> Exactly. That's why I don't want to automate killing jobs longer than X >>> days. >> Probably irrelevant after so many suggestions, but Caos NSA had this very >> nice 'pam_slurm' module which allows a user to login only to those nodes >> on which the said user has active jobs (allocated through slurm). The >> principal idea ["you are welcome to be bring your allocated node (and, >> thus, your job) to a halt if that's what you want"], sounds pedagogically >> attractive ... ;-) They use it in one cluster with Slurm I have access to. 
But it looks like you are never thrown out again once you are in. -- Reuti > This doesn't apply to my case, since access to the systems in question > isn't controlled by a queuing system. That alone would fix the problem. > > I think there's a similar pam module for SGE, too. > > -- > Prentice > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Thu Oct 27 19:37:29 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 27 Oct 2011 19:37:29 -0400 (EDT) Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> <4EA96F8F.1010207@ias.edu> Message-ID: > nice 'pam_slurm' module which allows a user to login only to those nodes > on which the said user has active jobs (allocated through slurm). The I think this is slightly BOFHish, too. do people actually have problems with users stealing cycles this way? the issue is actually stealing, and we simply tell our users not to steal. (actually, I don't think we even point it out, since it's so obvious!) that means we don't attempt to control (we had pam_slurm installed and actually removed it.) after all, just because a user's job is done, it doesn't mean the user has no reason to go onto that node (maybe there's a status file in /tmp, or a core dump or something.) if someone persisted in stealing cycles, we'd lock their account. regards, mark hahn. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From skylar at cs.earlham.edu Thu Oct 27 19:43:24 2011 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Thu, 27 Oct 2011 16:43:24 -0700 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> <4EA96F8F.1010207@ias.edu> Message-ID: <4EA9EC9C.9090307@cs.earlham.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 10/27/2011 04:37 PM, Mark Hahn wrote: >> nice 'pam_slurm' module which allows a user to login only to those nodes >> on which the said user has active jobs (allocated through slurm). The > > I think this is slightly BOFHish, too. do people actually have problems > with users stealing cycles this way? the issue is actually stealing, > and we simply tell our users not to steal. 
(actually, I don't think we > even point it out, since it's so obvious!) > > that means we don't attempt to control (we had pam_slurm installed and > actually removed it.) after all, just because a user's job is done, it > doesn't mean the user has no reason to go onto that node (maybe there's a > status file in /tmp, or a core dump or something.) > > if someone persisted in stealing cycles, we'd lock their account. > We do the equivalent with GE it if the end user requests it. We have some clusters that need to support a mix of critical jobs supporting data pipelines, and less-critical academic work. Our default stance, though, is to trust our users to do the right thing. Mostly it works, but sometimes we do need to bring out the LART stick. - -- - -- - -- Skylar Thompson (skylar at cs.earlham.edu) - -- http://www.cs.earlham.edu/~skylar/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk6p7JwACgkQsc4yyULgN4aRdgCbB3er3VI9OZEVSWO0GjL15rgU Z0sAoIZBKFsCeaYwA44uQT13JcdMN3dz =ervm -----END PGP SIGNATURE----- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From rgb at phy.duke.edu Fri Oct 28 14:04:02 2011 From: rgb at phy.duke.edu (Robert G. Brown) Date: Fri, 28 Oct 2011 14:04:02 -0400 (EDT) Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> <4EA96F8F.1010207@ias.edu> Message-ID: On Thu, 27 Oct 2011, Mark Hahn wrote: > if someone persisted in stealing cycles, we'd lock their account. Exactly. Or visit them with a sucker rod. Or have a department chair have a "talk" with them. Human to human interactions and controls work better than installing complex tools or automated constraints. Sure, sucker rods are a joke and no we don't actually bop users on the head or the desk or whomp them upside the head with a manual, but in most cases a stern talking to followed by locking their account unless/until they formally agree to change their ways is more than sufficient. rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
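For sites that do want the pam_slurm behaviour Nicholas mentioned (and Mark argues against), the usual recipe is a single account entry in the ssh PAM stack; the file name and the surrounding entries vary by distribution, so treat this as a sketch rather than a drop-in:

    # /etc/pam.d/sshd (location varies by distro)
    account    required     pam_slurm.so

With that in place a user with no job currently allocated on the node is refused an ssh login (root is normally exempt), which is exactly the trade-off Mark describes: no stray logins, but also no popping back in afterwards to collect a core file or a status file from /tmp.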
From sabujp at gmail.com Fri Oct 28 14:22:03 2011 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Fri, 28 Oct 2011 13:22:03 -0500 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> <4EA96F8F.1010207@ias.edu> Message-ID: > Human to human interactions and controls work better than installing > complex tools or automated constraints. ?Sure, sucker rods are a joke > and no we don't actually bop users on the head or the desk or whomp them > upside the head with a manual, but in most cases a stern talking to > followed by locking their account unless/until they formally agree to > change their ways is more than sufficient. Funny you should mentioned that, we've got such a device handy, passed down through the years from previous sysadmins: http://i.imgur.com/G0pjk.jpg It's also got a nice foam layer on the bopping side. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From beckerjes at mail.nih.gov Fri Oct 28 14:27:48 2011 From: beckerjes at mail.nih.gov (Jesse Becker) Date: Fri, 28 Oct 2011 14:27:48 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> <4EA96F8F.1010207@ias.edu> Message-ID: <20111028182748.GC41282@mail.nih.gov> On Fri, Oct 28, 2011 at 02:22:03PM -0400, Sabuj Pattanayek wrote: >http://i.imgur.com/G0pjk.jpg > >It's also got a nice foam layer on the bopping side. Then it's just a prop. What's the *real* one look like? -- Jesse Becker NHGRI Linux support (Digicon Contractor) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From sabujp at gmail.com Fri Oct 28 14:33:52 2011 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Fri, 28 Oct 2011 13:33:52 -0500 Subject: [Beowulf] Users abusing screen In-Reply-To: <20111028182748.GC41282@mail.nih.gov> References: <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> <4EA96F8F.1010207@ias.edu> <20111028182748.GC41282@mail.nih.gov> Message-ID: I don't know, maybe we drop this on their head: http://i.imgur.com/VWxyF.jpg or worse, switch out their linux workstation with it. On Fri, Oct 28, 2011 at 1:27 PM, Jesse Becker wrote: > On Fri, Oct 28, 2011 at 02:22:03PM -0400, Sabuj Pattanayek wrote: >> >> http://i.imgur.com/G0pjk.jpg >> >> It's also got a nice foam layer on the bopping side. > > Then it's just a prop. ?What's the *real* one look like? 
> > -- > Jesse Becker > NHGRI Linux support (Digicon Contractor) > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Fri Oct 28 14:58:33 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 28 Oct 2011 11:58:33 -0700 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> <4EA96F8F.1010207@ias.edu> <20111028182748.GC41282@mail.nih.gov> Message-ID: Google "Microsoft we share your pain" and look for the WSYP videos on youtube.. The three minute version is probably the one you want. Jim Lux +1(818)354-2075 > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Sabuj Pattanayek > Sent: Friday, October 28, 2011 11:34 AM > To: Beowulf Mailing List > Subject: Re: [Beowulf] Users abusing screen > > I don't know, maybe we drop this on their head: > > http://i.imgur.com/VWxyF.jpg > > or worse, switch out their linux workstation with it. > > On Fri, Oct 28, 2011 at 1:27 PM, Jesse Becker wrote: > > On Fri, Oct 28, 2011 at 02:22:03PM -0400, Sabuj Pattanayek wrote: > >> > >> http://i.imgur.com/G0pjk.jpg > >> > >> It's also got a nice foam layer on the bopping side. > > > > Then it's just a prop. ?What's the *real* one look like? > > > > -- > > Jesse Becker > > NHGRI Linux support (Digicon Contractor) > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From glykos at mbg.duth.gr Fri Oct 28 15:10:18 2011 From: glykos at mbg.duth.gr (Nicholas M Glykos) Date: Fri, 28 Oct 2011 22:10:18 +0300 (EEST) Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> <4EA96F8F.1010207@ias.edu> Message-ID: > > if someone persisted in stealing cycles, we'd lock their account. > > Exactly. Or visit them with a sucker rod. Or have a department chair > have a "talk" with them. > > Human to human interactions and controls work better than installing > complex tools or automated constraints. I can't, of course, even contemplate the possibility of disagreeing with RGB. 
Having said that, we (humans) do install complex tools and automated constraints on each and every technologically advanced piece of equipment, from cars and aircrafts, to computing machines (and we do not assume that proper training and human interaction suffices to guarantee proper operation of the said equipment). In this respect, methods like allocating (in a controlled manner) exclusive rights to compute nodes do appear sensible. I agree that installing restraints is a balancing act between crippling creativity (and making power users mad) and avoiding equipment misuse, but clearly, there are limits in the freedom of use (for example, you wouldn't add all cluster users to your sudo list). My twocents, Nicholas -- Dr Nicholas M. Glykos, Department of Molecular Biology and Genetics, Democritus University of Thrace, University Campus, Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620, Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Fri Oct 28 16:20:41 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 28 Oct 2011 16:20:41 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> <4EA96F8F.1010207@ias.edu> <20111028182748.GC41282@mail.nih.gov> Message-ID: <4EAB0E99.10407@ias.edu> I was still supporting those only 4 years ago. Much heavier than a Dell or HP workstation. Will fix 'layer 8' problems in a jiffy. -- Prentice On 10/28/2011 02:33 PM, Sabuj Pattanayek wrote: > I don't know, maybe we drop this on their head: > > http://i.imgur.com/VWxyF.jpg > > or worse, switch out their linux workstation with it. > > On Fri, Oct 28, 2011 at 1:27 PM, Jesse Becker wrote: >> On Fri, Oct 28, 2011 at 02:22:03PM -0400, Sabuj Pattanayek wrote: >>> http://i.imgur.com/G0pjk.jpg >>> >>> It's also got a nice foam layer on the bopping side. >> Then it's just a prop. What's the *real* one look like? >> >> -- >> Jesse Becker >> NHGRI Linux support (Digicon Contractor) >> > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From peter.st.john at gmail.com Fri Oct 28 16:56:49 2011 From: peter.st.john at gmail.com (Peter St. 
From peter.st.john at gmail.com  Fri Oct 28 16:56:49 2011
From: peter.st.john at gmail.com (Peter St. John)
Date: Fri, 28 Oct 2011 16:56:49 -0400
Subject: [Beowulf] Users abusing screen
In-Reply-To: <20111027054147.GB29939@bx9.net>
References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net>
Message-ID:

I think Greg is right on the money. Particularly at a place like IAS, where
resources are good and users may be errant but are doing great things, I'd
have a sequence of limits; first, a mail warning ("Your job PID 666 has
consumed one million core hours, and its priority will be decremented in
500,000 CH unless you call the sysadmin at 555-1212"), later nice (with
another email warning), and only then kill (with an email notification).

If they have opportunities to upscale the allocations to really important
jobs, and they are notified about automatic limitations ahead of time, they
have no reason to complain.

Peter

On Thu, Oct 27, 2011 at 1:41 AM, Greg Lindahl wrote:

> On Wed, Oct 26, 2011 at 05:14:13PM -0400, Steve Crusan wrote:
>
> > If the issue is processes that run for far too long, and are abusing
> > the system, cgroups or 'pushing' the users to use a batch system seems
> > to work better than writing scripts to make decisions on killing
> > processes.
>
> What I saw work well was nicing the process after a certain time,
> including an email, and then killing and emailing after a longer
> time. The emails can push the batch alternative. Users generally don't
> become angry if the limits are enforced by a script; they can only be
> surprised once, and that first time is just nicing the process. If
> they have a hard time predicting runtime (a common issue, especially
> for non-hardcore supercomputing types), it's not like they
> _intentionally_ are exceeding the limits...
>
> -- greg
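The warn-then-nice-then-kill escalation Peter and Greg describe can be sketched in a few dozen lines. This is an illustrative sketch only, not a script anyone on the list posted: the CPU-time thresholds, the localhost mail relay, the exempt-user set and the absence of any batch-scheduler awareness are all assumptions a real site would replace.

#!/usr/bin/env python3
"""Sketch of the warn -> renice -> kill escalation described above.

Everything here is an illustrative assumption, not list-endorsed policy:
the CPU-time thresholds, the localhost mail relay, the exempt users, and
the fact that batch-scheduled jobs are not treated specially. Run as root
from cron. A real version would also remember which PIDs were already
warned so people are not mailed on every pass.
"""
import os
import signal
import smtplib
import subprocess
from email.mime.text import MIMEText

WARN, NICE, KILL = 4 * 3600, 8 * 3600, 24 * 3600   # CPU-seconds thresholds
EXEMPT = {"root"}                                   # accounts never touched

def cpu_seconds(timefield):
    """Convert ps cputime format ([[dd-]hh:]mm:ss) to seconds."""
    days, _, rest = timefield.partition("-") if "-" in timefield else ("0", "", timefield)
    parts = [int(x) for x in rest.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)
    h, m, s = parts
    return int(days) * 86400 + h * 3600 + m * 60 + s

def processes():
    """Yield (pid, user, cpu_seconds, command) for every process."""
    out = subprocess.check_output(["ps", "-eo", "pid=,user=,time=,comm="],
                                  universal_newlines=True)
    for line in out.splitlines():
        pid, user, cputime, comm = line.split(None, 3)
        yield int(pid), user, cpu_seconds(cputime), comm

def mail(user, text):
    """Tell the owner what happened; assumes a local MTA and local mailboxes."""
    msg = MIMEText(text)
    msg["Subject"] = "[cluster policy] long-running process"
    msg["From"] = "root@localhost"
    msg["To"] = user + "@localhost"
    smtplib.SMTP("localhost").send_message(msg)

def enforce():
    for pid, user, secs, comm in processes():
        if user in EXEMPT or secs < WARN:
            continue
        try:
            if secs >= KILL:
                os.kill(pid, signal.SIGTERM)
                mail(user, "Killed pid %d (%s) after %d CPU-seconds." % (pid, comm, secs))
            elif secs >= NICE:
                os.setpriority(os.PRIO_PROCESS, pid, 19)   # drop to lowest priority
                mail(user, "Reniced pid %d (%s); it will be killed at %d CPU-seconds."
                           % (pid, comm, KILL))
            else:
                mail(user, "Warning: pid %d (%s) has used %d CPU-seconds." % (pid, comm, secs))
        except (OSError, smtplib.SMTPException):
            continue   # process vanished, permission denied, or mail failed

if __name__ == "__main__":
    enforce()

A production version would also want to skip children of the batch scheduler and keep state between runs, so that the first surprise a user gets is the renice, as Greg suggests.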
From prentice at ias.edu  Fri Oct 28 18:21:50 2011
From: prentice at ias.edu (Prentice Bisbal)
Date: Fri, 28 Oct 2011 18:21:50 -0400
Subject: [Beowulf] Users abusing screen
In-Reply-To:
References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net>
Message-ID: <4EAB2AFE.7000901@ias.edu>

On 10/28/2011 04:56 PM, Peter St. John wrote:
> I think Greg is right on the money. Particularly at a place like IAS,
> where resources are good and users may be errant but are doing great
> things,

Have you been a visitor, member or staff member at IAS?

--
Prentice

From peter.st.john at gmail.com  Fri Oct 28 19:16:44 2011
From: peter.st.john at gmail.com (Peter St. John)
Date: Fri, 28 Oct 2011 19:16:44 -0400
Subject: [Beowulf] Users abusing screen
In-Reply-To: <4EAB2AFE.7000901@ias.edu>
References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> <4EAB2AFE.7000901@ias.edu>
Message-ID:

Prentice,
No, I didn't mean to imply anything specific about e.g. your budget, but
IAS has a fantastic reputation. Say hi to Dima for me, he plays Go and is
an algebraic geometer visiting this year.
Peter

On Fri, Oct 28, 2011 at 6:21 PM, Prentice Bisbal wrote:

> On 10/28/2011 04:56 PM, Peter St. John wrote:
> > I think Greg is right on the money. Particularly at a place like IAS,
> > where resources are good and users may be errant but are doing great
> > things,
>
> Have you been a visitor, member or staff member at IAS?
>
> --
> Prentice
From prentice at ias.edu  Mon Oct 3 13:51:06 2011
From: prentice at ias.edu (Prentice Bisbal)
Date: Mon, 03 Oct 2011 13:51:06 -0400
Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud
In-Reply-To: <59677.192.168.93.213.1317644706.squirrel@mail.eadline.org>
References: <20110921110239.GR25711@leitl.org> <59677.192.168.93.213.1317644706.squirrel@mail.eadline.org>
Message-ID: <4E89F60A.4070801@ias.edu>

Doug,

Thanks for posting that video. It confirmed what I always suspected
about clouds for HPC.
Prentice

On 10/03/2011 08:25 AM, Douglas Eadline wrote:
> Interesting and pragmatic HPC cloud presentation, worth watching
> (25 minutes)
>
> http://insidehpc.com/2011/09/30/video-the-real-future-of-cloud-computing/
>
> --
> Doug
From deadline at eadline.org  Mon Oct 3 14:17:33 2011
From: deadline at eadline.org (Douglas Eadline)
Date: Mon, 3 Oct 2011 14:17:33 -0400 (EDT)
Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud
In-Reply-To: <4E89F60A.4070801@ias.edu>
References: <20110921110239.GR25711@leitl.org> <59677.192.168.93.213.1317644706.squirrel@mail.eadline.org> <4E89F60A.4070801@ias.edu>
Message-ID: <58756.192.168.93.213.1317665853.squirrel@mail.eadline.org>

I think everyone has similar thoughts, but the presentation provides
some real data and experiences. BTW, for those interested, I have a new
poll on ClusterMonkey asking about clouds and HPC
(http://www.clustermonkey.net/). The last poll was on GP-GPU use.

--
Doug

> Doug,
>
> Thanks for posting that video. It confirmed what I always suspected
> about clouds for HPC.
>
> Prentice
>
> On 10/03/2011 08:25 AM, Douglas Eadline wrote:
>> Interesting and pragmatic HPC cloud presentation, worth watching
>> (25 minutes)
>>
>> http://insidehpc.com/2011/09/30/video-the-real-future-of-cloud-computing/
>>
>> --
>> Doug
From raysonlogin at gmail.com  Mon Oct 3 14:50:22 2011
From: raysonlogin at gmail.com (Rayson Ho)
Date: Mon, 3 Oct 2011 14:50:22 -0400
Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud
In-Reply-To: <20110921110239.GR25711@leitl.org>
References: <20110921110239.GR25711@leitl.org>
Message-ID:

There's a free & opensource application called StarCluster that can do
most (if not all?) of the EC2 provisioning & cluster setup for a High
Throughput Computing cluster:

http://web.mit.edu/stardev/cluster/

StarCluster sets up NFS, SGE, BLAS library, Open MPI, etc
automatically for the user in around 10-15 mins.
StarCluster is licensed under LGPL, written in Python+Boto, and supports
a lot of the new EC2 features (Cluster Compute Instances, Spot Instances,
Cluster GPU Instances, etc). Support for launching higher node count
(100+ instances) clusters is even better with the new scalability
enhancements in the latest version (0.92).

And there are some tutorials on YouTube:

- "StarCluster 0.91 Demo":
http://www.youtube.com/watch?v=vC3lJcPq1FY

- "Launching a Cluster on Amazon Ec2 Spot Instances Using StarCluster":
http://www.youtube.com/watch?v=2Ym7epCYnSk

Rayson

=================================
Grid Engine / Open Grid Scheduler
http://gridscheduler.sourceforge.net

On Wed, Sep 21, 2011 at 7:02 AM, Eugen Leitl wrote:
>
> http://arstechnica.com/business/news/2011/09/30000-core-cluster-built-on-amazon-ec2-cloud.ars
>
> $1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud

--
Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/

Wikimedia Commons
http://commons.wikimedia.org/wiki/User:Raysonho
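For readers curious what the boto layer underneath StarCluster looks like, here is a bare editor's sketch of requesting EC2 spot instances directly with boto, in the spirit of the spot-instance support Rayson mentions. The region, bid price, AMI ID, key pair and security group are placeholders, no error handling or cleanup is shown, and StarCluster itself does considerably more than this.

#!/usr/bin/env python
"""Bare sketch of asking EC2 for spot instances with boto (the library
StarCluster is built on). The region, bid price, AMI ID, key pair and
security group are placeholders; credentials come from the usual boto
environment variables or config file.
"""
import time
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")

# Ask for 8 spot workers at a maximum bid of $0.10 per instance-hour.
requests = conn.request_spot_instances(
    price="0.10",
    image_id="ami-00000000",        # placeholder AMI (e.g. a CentOS image)
    count=8,
    instance_type="c1.xlarge",
    key_name="mykey",
    security_groups=["mycluster"])

ids = [r.id for r in requests]

# Poll until the requests leave the 'open' state (fulfilled or failed),
# or give up after ten minutes.
for _ in range(60):
    states = [r.state for r in conn.get_all_spot_instance_requests(ids)]
    if "open" not in states:
        break
    time.sleep(10)

print("spot request states: %s" % states)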
From rgb at phy.duke.edu  Mon Oct 3 15:21:44 2011
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Mon, 3 Oct 2011 15:21:44 -0400 (EDT)
Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud
In-Reply-To:
References: <20110921110239.GR25711@leitl.org>
Message-ID:

On Mon, 3 Oct 2011, Rayson Ho wrote:

> There's a free & opensource application called StarCluster that can do
> most (if not all?) of the EC2 provisioning & cluster setup for a High
> Throughput Computing cluster:

I will say that if anyone is going to make this work, it is going to be
Amazon and/or Google -- they have the very very big pile of computers
needed to make it work. I would be very interested in seeing the
detailed scaling of "fine grained parallel" applications on cloud
resources -- one point that the talk made that I agree with is that
embarrassingly parallel applications that require minimal I/O or IPCs
will do well in a cloud where all that matters is how many instances you
can run of jobs that don't talk to each other or need much access to
data. But what of jobs that require synchronous high speed
communications? What of jobs that require access to huge datasets?

Ultimately the problem comes down to this. Your choice is to rent time
on somebody else's hardware or buy your own hardware. For many people,
one can scale to infinity and beyond, so using "all" of the
time/resource you have available either way is a given. In which case
no matter how you slice it, Amazon or Google have to make a profit above
and beyond the cost of delivering the service. You don't (or rather,
your "profit" is just the ability to run your jobs and get paid as usual
to do your research either way). This means that it will always be
cheaper to directly provision a lot of computing rather than run it in
the cloud, or for that matter at an HPC center. Not all -- lots of
nonlinearities and thresholds associated with infrastructure and admin
and so on -- but a lot. Enough that I don't see Amazon's Pinky OR the
Brain ever taking over the (HPC) world...

   rgb

Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
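Robert's rent-versus-buy argument, and the utilization caveat that follows in the next message, are easy to put rough numbers on. In the editor's sketch below the only figure taken from this thread is the $1,279/hour peak price for the 30,472-core run; the capital cost, three-year lifetime and operating overhead of the hypothetical owned cluster are pure guesses, so the output shows only the shape of the break-even curve, not a real comparison.

"""Back-of-the-envelope rent-vs-buy comparison.

Only the $1,279/hour peak price and the 30,472-core size come from the
article quoted in this thread; the capital cost, lifetime and operating
overhead of the owned cluster are pure guesses.
"""

CLOUD_RATE = 1279.0        # $/hour for the 30,472-core Nekomata run (peak)
CORES = 30472
cloud_per_core_hour = CLOUD_RATE / CORES          # roughly $0.04

# Hypothetical owned cluster of the same core count -- guesses, not quotes.
CAPEX = 3.0e6              # purchase price, dollars
LIFETIME_HOURS = 3 * 365 * 24                     # three-year depreciation
OPEX_PER_HOUR = 60.0       # power, cooling, admin, space

def owned_per_core_hour(utilization):
    """Cost per *used* core-hour when the machine is busy this fraction of the time."""
    hourly = CAPEX / LIFETIME_HOURS + OPEX_PER_HOUR
    return hourly / (CORES * utilization)

if __name__ == "__main__":
    print("cloud: $%.4f per core-hour" % cloud_per_core_hour)
    for u in (1.0, 0.75, 0.5, 0.25, 0.1):
        print("owned at %3d%% utilization: $%.4f per core-hour"
              % (u * 100, owned_per_core_hour(u)))

On these made-up numbers the owned machine wins once it is kept more than roughly 15% busy, which is exactly the 24x7-loading question raised in the reply below.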
From raysonlogin at gmail.com  Tue Oct 4 10:55:39 2011
From: raysonlogin at gmail.com (Rayson Ho)
Date: Tue, 4 Oct 2011 10:55:39 -0400
Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud
In-Reply-To:
References: <20110921110239.GR25711@leitl.org>
Message-ID:

On Mon, Oct 3, 2011 at 3:21 PM, Robert G. Brown wrote:
> I would be very interested in seeing the
> detailed scaling of "fine grained parallel" applications on cloud
> resources -- one point that the talk made that I agree with is that
> embarrassingly parallel applications that require minimal I/O or IPCs
> will do well in a cloud where all that matters is how many instances you
> can run of jobs that don't talk to each other or need much access to
> data. But what of jobs that require synchronous high speed
> communications?

Amazon (and I believe other cloud providers have something similar?)
introduced Cluster Compute Instances with 10 Gb Ethernet. For
traditional MPI workloads, the real advantage is actually from HVM
(Hardware VM), as it cuts the communication latency by quite a lot.

> What of jobs that require access to huge datasets?

Getting data in & out of the cloud is still a big problem, and the
highest bandwidth way of sending data to AWS is by FedEx. In fact, it
is quite often the fastest way to send data from one data center to
another when the data size is big.

And processing data on the cloud is easier (in terms of setup) with
Amazon Elastic MapReduce (and it recently works with spot instances):

http://aws.amazon.com/elasticmapreduce/

> Ultimately the problem comes down to this. Your choice is to rent time
> on somebody else's hardware or buy your own hardware. For many people,
> one can scale to infinity and beyond, so using "all" of the
> time/resource you have available either way is a given. In which case
> no matter how you slice it, Amazon or Google have to make a profit above
> and beyond the cost of delivering the service. You don't (or rather,
> your "profit" is just the ability to run your jobs and get paid as usual
> to do your research either way). This means that it will always be
> cheaper to directly provision a lot of computing rather than run it in
> the cloud, or for that matter at an HPC center.
Provided that the machines are used 24x7. A lot of enterprise users do
not have enough work to load up the machines. E.g., I worked with a
client that has lots of data & numbers to crunch at night, and during
the day most of the machines are idle. For traditional HPC centers, the
batch queue length is almost never 0, so there, agreed, the cloud
wouldn't help, or might even make the problem worse.

Rayson

=================================
Grid Engine / Open Grid Scheduler
http://gridscheduler.sourceforge.net
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.

From james.p.lux at jpl.nasa.gov Tue Oct 4 11:26:55 2011
From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C))
Date: Tue, 4 Oct 2011 08:26:55 -0700
Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud
In-Reply-To:
Message-ID:

On 10/4/11 7:55 AM, "Rayson Ho" wrote:

>On Mon, Oct 3, 2011 at 3:21 PM, Robert G.
Brown wrote: >> I would be very interested in seeing the >> detailed scaling of "fine grained parallel" applications on cloud >> resources -- one point that the talk made that I agree with is that >> embarrassingly parallel applications that require minimal I/O or IPCs >> will do well in a cloud where all that matters is how many instances you >> can run of jobs that don't talk to each other or need much access to >> data. But what of jobs that require synchronous high speed >> communications? > >Amazon (and I believe other cloud providers have something similar?) >introduced Cluster Compute Instances with 10 Gb Ethernet. For >traditional MPI workloads, the real advantage is actually from HVM >(Hardware VM), as it cuts the communication latency by quite a lot. > > >> What of jobs that require access to huge datasets? > >Getting data in & out of the cloud is still a big problem, and the >highest bandwidth way of sending data to AWS is by FedEx. In fact, it >is quite often that the fastest way to send data from one data center >to another when the data size is big. The classic: nothing beats a station wagon full of tapes for bandwidth. (today, it's minivan with terabyte hard drives, but that's the idea) > > > >> Ultimately the problem comes down to this. Your choice is to rent time >> on somebody else's hardware or buy your own hardware. For many people, >> one can scale to infinity and beyond, so using "all" of the >> time/resource you have available either way is a given. In which case >> no matter how you slice it, Amazon or Google have to make a profit above >> and beyond the cost of delivering the service. You don't (or rather, >> your "profit" is just the ability to run your jobs and get paid as usual >> to do your research either way). This means that it will always be >> cheaper to directly provision a lot of computing rather than run it in >> the cloud, or for that matter at an HPC center. > >Provided that the machines are used 24x7. A lot of enterprise users do >not have enough work to load up the machines. Eg, I worked with a >client that has lots of data & numbers to crunch at night, and during >day time most of the machines are idle. In a situation where you've got an existing application and data, and you just want to crunch numbers, and you pay either cloud or in-house, then you make the choice based on the incremental cost. However, even at the smallest increment on a cloud/hosted scheme, you have to pay from CPU second #1 (plus the fixed overhead of getting the job ready to go). If you have a cluster in house, there is likely a way to get a test job run essentially for free (perhaps on an older non-production cluster). That test job provides the performance data and preliminary results that you use in preparing the proposal to get real money to pay for real computation. This has been my argument for personal clusters... There's no accounting staff or administrative person watching over you to make sure you are effectively using the capital investment, in the same sense that most places don't care how much idle time there is on your desktop PC. If you've got an idea, and you're willing to put your own time (free?) into it, using the box that happens to be in your office or lab, nobody cares one way or another, as long as your primary job gets done. Notwithstanding that there ARE places that do cycle harvesting from desktop machines, but the management and sysadmin hassles are so extreme (I've written software to DO such harvesting, in pre-Beowulf days).. 
Those kinds of places go to thin clients and hosted VM instances eventually, I think. Where an Amazon could do themselves a favor (maybe they do this already) is to provide a free downloadable version of their environment for your own computer, or some "low priority cycles" for free, to get people hooked. Sort of like IBM providing computers for cheap to universities in the 60s and 70s. Razors, razor blades. Kindles, e-books. Subsidized cellphones, 10 cent text messages. Give us your child 'til 7, and he's ours for life. > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From raysonlogin at gmail.com Tue Oct 4 11:58:12 2011 From: raysonlogin at gmail.com (Rayson Ho) Date: Tue, 4 Oct 2011 11:58:12 -0400 Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: On Tue, Oct 4, 2011 at 11:26 AM, Lux, Jim (337C) wrote: > The classic: nothing beats a station wagon full of tapes for bandwidth. > (today, it's minivan with terabyte hard drives, but that's the idea) BTW, I've heard horror stories related to routing errors with this method - truck drivers delivering wrong tapes or losing tapes (hopefully the data is properly encrypted). > Notwithstanding that there ARE places that do cycle harvesting from > desktop machines, but the management and sysadmin hassles are so extreme > (I've written software to DO such harvesting, in pre-Beowulf days). The technology part of cycle harvesting is solvable, the accounting part is (IMO) much harder. A few years ago I talked to a University HPC lab about deploying cycle harvesting in the libraries (it's a big University, so we are talking about 1000+ library desktops). The technology was there (BOINC client), but getting the software installed & maintained means extra work, which means an extra IT guy... and means no one wants to pay for this. I wonder how many University labs or Biotech companies are doing organization wide cycle harvesting these days, for example, with technologies like BOINC: http://boinc.berkeley.edu/ > Where an Amazon could do themselves a favor (maybe they do this already) > is to provide a free downloadable version of their environment for your > own computer, AMI is not private (in the end, it is IaaS, so the VM images are open). In fact, StarCluster has AMIs for download & install (mainly for developers who want to code for StarCluster locally): http://web.mit.edu/stardev/cluster/download_amis.html And one can roll a custom StarCluster AMI and upload it to AWS, such that the image settings are optimized to the needs: http://web.mit.edu/stardev/cluster/docs/0.91/create_new_ami.html > or some "low priority cycles" for free, to get people hooked. AWS Free Usage Tier -- (most people just use the free tier as free hosting): http://aws.amazon.com/free/ Rayson ================================= Grid Engine / Open Grid Scheduler http://gridscheduler.sourceforge.net > ?Sort of like IBM providing computers for cheap to universities in > the 60s and 70s. Razors, razor blades. Kindles, e-books. Subsidized > cellphones, 10 cent text messages. Give us your child 'til 7, and he's > ours for life. 
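To put rough numbers behind the "station wagon full of tapes" / "minivan with terabyte hard drives" point -- a back-of-the-envelope sketch only; the drive count, capacity, courier time and link speed below are assumed figures, not anyone's actual setup:

# Effective bandwidth of shipping disks vs. pushing the same bits over a WAN.
# All figures are assumptions chosen for illustration.
TB = 10**12                       # bytes
payload_bytes = 50 * 1 * TB       # 50 x 1 TB drives in the minivan
transit_s = 24 * 3600             # overnight courier
sneakernet_gbps = payload_bytes * 8 / transit_s / 1e9
wan_gbps = 1.0                    # a fully utilized 1 Gb/s link
wan_days = payload_bytes * 8 / (wan_gbps * 1e9) / 86400.0
print("sneakernet: about %.1f Gb/s effective" % sneakernet_gbps)   # ~4.6 Gb/s
print("1 Gb/s WAN: about %.1f days for the same data" % wan_days)  # ~4.6 days

Latency is a day either way you ship it, of course, but for bulk ingest the truck keeps winning until the network pipe reaches several gigabits per second.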
> > >> > > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From deadline at eadline.org Tue Oct 4 13:08:11 2011 From: deadline at eadline.org (Douglas Eadline) Date: Tue, 4 Oct 2011 13:08:11 -0400 (EDT) Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: <53556.192.168.93.213.1317748091.squirrel@mail.eadline.org> --snip-- > > This has been my argument for personal clusters... There's no accounting > staff or administrative person watching over you to make sure you are > effectively using the capital investment, in the same sense that most > places don't care how much idle time there is on your desktop PC. If > you've got an idea, and you're willing to put your own time (free?) into > it, using the box that happens to be in your office or lab, nobody cares > one way or another, as long as your primary job gets done. > Notwithstanding that there ARE places that do cycle harvesting from > desktop machines, but the management and sysadmin hassles are so extreme > (I've written software to DO such harvesting, in pre-Beowulf days).. Those > kinds of places go to thin clients and hosted VM instances eventually, I > think. BTW, very soon prebuilt Limulus systems will be available (http://limulus.basement-supercomputing.com) with 16 cores (four i5-2500S processors), one power plug, cool, quiet, with cool blue lights to impress your co-workers. -- Doug > > > Where an Amazon could do themselves a favor (maybe they do this already) > is to provide a free downloadable version of their environment for your > own computer, or some "low priority cycles" for free, to get people > hooked. Sort of like IBM providing computers for cheap to universities in > the 60s and 70s. Razors, razor blades. Kindles, e-books. Subsidized > cellphones, 10 cent text messages. Give us your child 'til 7, and he's > ours for life. > > >> > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Tue Oct 4 14:39:20 2011 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue, 4 Oct 2011 14:39:20 -0400 (EDT) Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: On Tue, 4 Oct 2011, Lux, Jim (337C) wrote: > Notwithstanding that there ARE places that do cycle harvesting from > desktop machines, but the management and sysadmin hassles are so extreme > (I've written software to DO such harvesting, in pre-Beowulf days).. Those > kinds of places go to thin clients and hosted VM instances eventually, I > think. 
Condor (much improved from the old days, I think) actually makes this fairly easy nowadays. The physics department runs condor across lots of the low-rent desktop systems, creating a readily available compute farm for EP jobs. I don't do much of that sort of thing any more, alas. Mostly teaching, working on dieharder when I can, and writing textbooks at a furious pace. I will have a complete first year physics textbook -- the world's best, naturally;-) -- finished by the end of this semester (I'm within about four and a half chapters of finished already, and writing at least a chapter a week at this point). After that is done, and two other books that are partly finished (three if I get really inspired and try to finish the beowulf book) THEN I may have time to do more actual computing. > Where an Amazon could do themselves a favor (maybe they do this already) > is to provide a free downloadable version of their environment for your > own computer, or some "low priority cycles" for free, to get people > hooked. Sort of like IBM providing computers for cheap to universities in > the 60s and 70s. Razors, razor blades. Kindles, e-books. Subsidized > cellphones, 10 cent text messages. Give us your child 'til 7, and he's > ours for life. As I said, ultimately Amazon makes a profit. That is, they provide the cluster and some reasonable subset of cluster management in infrastructure provisioning, where they have to a) recoup the cost of the hardware, the infrastructure, and the management; b) make at LEAST 5-10% or better on the costs of all of this as profit, if not more like 40-50% or even 100% markup. Usually retail is 100% markup, but Amazon has scale efficiencies such that they can get by with less, whether or not they "like" to. So it ultimately comes down to whether or not you can provide similar efficiencies in your own local environment. Suppose it is a University. You have $100,000 for a compute resource that you expect to use over three years. There is typically no indirect cost charged to capital equipment. Often, but not always, housing, cooling, powering, and even managing the hardware is "free" to the researcher, absorbed into the ongoing costs of the server room and management staff already needed to run the department LAN and servers. Thus for your $100,000 you can buy (say) 100 dedicated function systems for $1000 each and everything else is paid out of opportunity cost labor or University provisioning that doesn't cost your grant anything -- out of that $100,000 (although of course your indirect costs elsewhere partly subsidize it). Even network ports may be free, or may not be if you need a higher end "cluster" network. If you rent from ANYBODY, you pay: * Slightly over 1/3 of the $100,000 up front for indirect costs. Duke, for example, would be perfectly happy to charge your grant $1 for every $2 that it pays out to a third party for cloud computing rental. For that fee they do all of the bookkeeping, basically -- most is pure profit, but prenegotiated with all of the granting agencies and that's just the way it is. * Your remaining (say) $63,000 has to pay for (a fraction of) the power, the housing, the cooling, the network. Unless Amazon subsidizes the cluster with different money altogether (e.g. using money from book sales to provide all of this at a loss) it will almost certainly not be as cheap as a University center for modest size clusters. 
When clusters grow to where people have to build new data centers just to house them, of course, this may not be true (but Amazon still doesn't gain much of a relative advantage even in this extreme case, not in the long run). Infrastructure costs are likely ballpark 10% of the cost of the hardware you are running on. * It has to pay for Amazon's sysadmins and management and security. These are humans that your money DIRECTLY supports, not humans that are directly supported to do something else and do admin for you on an opportunity cost basis "for free". Real salaries, (fractionally) paid from this income stream only. Even amortized in the friendliest most favorable way possible, admin cost are probably at least 10% of the hardware costs. * Profit. At least (say) $6300 is profit. Nobody makes a similar profit in the case of the DIY cluster. * The amortized cost of the hardware. The way I see it, you end up with roughly 50% of every dollar lost >>off the top<< of your $100,000. You ultimately buy (an amortized fraction of) the hardware the $100,000 as up-front capital equipment would cost you, and instead of being able to leverage pre-existing University infrastructure, avoid indirect costs, all as on a non-profit basis, you have to pay for infrastructure, indirect costs on the grant, management, AND A PROFIT on top of the hardware. The only real advantage is that -- maybe -- Amazon has market leverage and economy of scale on the hardware. But 50%? That's hard to make back. rgb > > >> > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From dag at sonsorol.org Tue Oct 4 15:29:28 2011 From: dag at sonsorol.org (Chris Dagdigian) Date: Tue, 04 Oct 2011 15:29:28 -0400 Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: <4E8B5E98.3090002@sonsorol.org> I'm largely with RGB on this one with the minor caveat that I think he might be undervaluing the insane economies of scale that IaaS providers like Amazon & Google can provide. At the scale that Amazon operates at, they can obtain and run infrastructure far, far more efficiently than most (if not all) of us can ourselves. These folks have exabytes of spinning disk, redundant data-centers (with insane PUE values) all over the world and they know how to manage hundreds of thousands of servers with high efficiency in a very hostile networking environment. Not only can they run bigger and more efficient than we can, they can charge a price that makes them a profit while still being (in many cases) far cheaper than my own costs should I be truly honest about the fully-loaded costs of maintaining HPC or IT services. AWS has a history of lowering prices as their own costs go down. 
You can see this via the EC2 pricing history as well as the now-down-to-zero cost of inbound data transit. AWS Spot market makes this even more interesting. I can currently run an m1.4xlarge 64bit server instance with 15GB RAM for about $.24 per hour - close to 50% cheaper than the published hourly price and that spot price can hold steady for weeks at a time in many cases. The biggest hangup is the economics. Even harder in an academic environment where researchers are used to seeing their funds vanish to "overhead" on their grant or they just assume that datacenters, bandwidth, power and hosting are all "free" to use. It's hard to do true cost comparisons but time and time again I've seen IaaS come out ahead when the fully-loaded costs are actually put down on paper. Here is a cliche example: Amazon S3 Before the S3 object storage service will even *acknowledge* a successful PUT request, your file is already at rest in at least three amazon facilities. So to "really" compare S3 against what you can do locally you at least have to factor in the cost of your organization being able to provide 3x multi-facility replication for whatever object store you choose to deploy... I don't want to be seen as a shill so I'll stop with that example. The results really are surprising once you start down the "true cost of IT services..." road. As for industry trends with HPC and IaaS ... I can assure you that in the super practical & cynical world of biotech and pharma there is already an HPC migration to IaaS platforms that is years old already. It's a lot easier to see where and how your money is being spent inside a biotech startup or pharma and that is (and has) shunted a decent amount of spending towards cloud platforms. The easy stuff is moving to IaaS platforms. The hard stuff, the custom stuff, the tightly bound stuff and the data/IO-bound stuff is staying local of course - but that still means lots of stuff is moving externally. The article that prompted this thread is a great example of this. The client company had a boatload of one-off molecular dynamics simulations to run. So much, in fact, that the problem was computationally infeasable to even consider doing inhouse. So they did it on AWS. 30,000 CPU cores. For ~$9,000 dollars. Amazing. It's a fun time to be in HPC actually. And getting my head around "IaaS" platforms turned me onto things (like opscode chef) that we are now bringing inhouse and integrating into our legacy clusters and grids. Sorry for rambling but I think there are 2 main drivers behind what I see moving HPC users and applications into IaaS cloud platforms ... (1) The economies of scale are real. IaaS providers can run better, bigger and cheaper than we can and they can still make a profit. This is real, not hype or sales BS. (as long as you are honest about your actual costs...) (2) The benefits of "scriptable everything" or "everything has an API". I'm so freaking sick of companies installing VMWare and excreting a press release calling themselves a "cloud provider". Virtual servers and virtual block storage on demand are boring, basic and pedestrian. That was clever in 2004. I need far more "glue" to build useful stuff in a virtual world and IaaS platforms deliver more products/services and "glue" options than anyone else out there. The "scriptable everything" nature of IaaS is enabling a lot of cool system and workflow building, much of which would be hard or almost impossible to do in-house with local resources. 
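To make the "fully-loaded costs" point concrete, here is a toy calculation. It is a sketch only: the node price, cores per node, overhead fraction and utilization levels are assumptions (the $1000/node ballpark comes from earlier in this thread), and the cloud figure is the roughly $0.042 per core-hour implied by the $1,279/hour, 30,472-core run in the article:

# Toy fully-loaded cost per *useful* core-hour for an in-house node,
# compared against the EC2 run described in the article.
node_price = 1000.0   # assumed $ per node (ballpark used earlier in the thread)
cores = 8             # assumed cores per node
years = 3.0           # depreciation period
overhead = 0.20       # assumed power + cooling + admin, as a fraction of hardware
for utilization in (1.0, 0.5, 0.1):          # fraction of 24x7 the node is busy
    busy_hours = years * 8766 * utilization
    cost_per_core_hour = node_price * (1.0 + overhead) / (busy_hours * cores)
    print("%3d%% utilized: $%.3f per core-hour" % (utilization * 100, cost_per_core_hour))
# 100%: ~$0.006, 50%: ~$0.011, 10%: ~$0.057 per core-hour
# EC2 run in the article: $1279 / 30472 cores ~= $0.042 per core-hour
# i.e. a busy in-house node wins easily; a mostly idle one does not.

Indirect costs, floor space and networking are left out of the sketch; fold them into the overhead fraction to taste.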
My $.02 -Chris (corporate hat: chris at bioteam.net) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From mathog at caltech.edu Tue Oct 4 16:07:21 2011 From: mathog at caltech.edu (mathog) Date: Tue, 04 Oct 2011 13:07:21 -0700 Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: > "Robert G. Brown" wrote: > Often, but not always, housing, cooling, powering, and even managing > the > hardware is "free" to the researcher, absorbed into the ongoing costs > of > the server room and management staff already needed to run the > department LAN and servers. Not always indeed. My little machine room houses a half dozen machines from other biology division people, and they are not charged to keep them there. However, putting a computer in the central campus machine rooms is not free. And new computer rooms, at least those of any size, do not get free power. After geology put in this monster: http://www.gps.caltech.edu/uploads/Image/Facilities/Beowulf.jpg the administration decided that when a computer room pretty much needs its own substation, it is well beyond the incidental overhead costs they are willing to pick up for average research labs. Along similar lines, I would guess that SLAC has to pay for its own power, rather than Stanford covering it out of overhead. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From rgb at phy.duke.edu Tue Oct 4 16:39:16 2011 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue, 4 Oct 2011 16:39:16 -0400 (EDT) Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: On Tue, 4 Oct 2011, Chi Chan wrote: > On Tue, Oct 4, 2011 at 11:58 AM, Rayson Ho wrote: >> BTW, I've heard horror stories related to routing errors with this >> method - truck drivers delivering wrong tapes or losing tapes >> (hopefully the data is properly encrypted). > > I just read this on Slashdot today, it is "very hard to encrypt a > backup tape" (really?): > > http://yro.slashdot.org/story/11/10/04/1815256/saic-loses-data-of-49-million-patients Not if it is encrypted with a stream cipher -- a stream cipher basically xors the data with a bitstream generated from a suitable key in a cryptographic-strength pseudorandom number generator (although there are variations on this theme). As a result, it can be quite fast -- as fast as generating pseudorandom numbers from the generator -- and it produces a file that is exactly the size of the original message in length. There are encryption schemes that expend extraordinary amounts of computational energy in generating the stream, and there are also block ciphers (which are indeed hard to implement for a streaming tape full of data, as they usually don't work so well for long messages). 
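For concreteness, a toy sketch of that xor-with-a-keystream idea -- emphatically not production crypto. The "keystream" here is just SHA-256 run in counter mode; a real system would use a vetted cipher such as AES in CTR mode:

# Toy stream-cipher-style encryption: xor the data with a keystream
# derived from a key. The same function encrypts and decrypts, and the
# output is exactly the size of the input.
import hashlib
from itertools import count

def keystream(key, nbytes):
    out = bytearray()
    for counter in count():
        if len(out) >= nbytes:
            break
        out += hashlib.sha256(key + str(counter).encode()).digest()
    return bytes(out[:nbytes])

def xor_stream(key, data):
    ks = keystream(key, len(data))
    return bytes(b ^ k for b, k in zip(data, ks))

key = b"a suitably long random key"
block = b"patient record 12345 -- do not lose this tape"
enc = xor_stream(key, block)
dec = xor_stream(key, enc)                    # xor twice gets the data back
assert dec == block and len(enc) == len(block)

The speed caveat above applies: throughput is whatever the keystream generator can sustain.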
But in the end no, it isn't that hard to encrypt a backup tape, provided that you are willing to accept the limitation that the speed of encrypting/decrypting the stream being written to the tape is basically limited by the speed of your RNG (which may well be slower than the speed of most fast networks). rgb > > --Chi > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Tue Oct 4 16:43:15 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 4 Oct 2011 13:43:15 -0700 Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of mathog > Sent: Tuesday, October 04, 2011 1:07 PM > To: beowulf at beowulf.org > Subject: Re: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud > > > "Robert G. Brown" wrote: > > > Often, but not always, housing, cooling, powering, and even managing > > the > > hardware is "free" to the researcher, absorbed into the ongoing costs > > of > > the server room and management staff already needed to run the > > department LAN and servers. > > Not always indeed. My little machine room houses a half dozen machines > from other biology division > people, and they are not charged to keep them there. However, putting > a computer in the central > campus machine rooms is not free. And new computer rooms, at least > those of any size, do not > get free power. After geology put in this monster: > > http://www.gps.caltech.edu/uploads/Image/Facilities/Beowulf.jpg > http://citerra.gps.caltech.edu/wiki/Public/Technology A mere 512 nodes, each with 8 cores. 670W power supply is standard, so let's say about 500 nodes at 700 watts each or 350kW... HVAC will add on top of that, but I doubt they're loaded to the max. Call it 400kW.. That's big, but not enormous. (e.g you can rent a trailer mounted generator for that kind of power for about $1000/day.. the bigger generators one sees on a movie set might be 200-300kW)) CalTrans will only pay $123/hr for a 500kW generator (and fuel cost comes out of that) But, if you were paying SoCalEdison for the juice..You'd be on (minimum) the TOU-GS-3 tariff.. On peak you'd be paying 0.02/kWh for delivery and 0.104/kWh for the power. (off peak would be 0.045/kWh) So call it 12c/kWh on peak. At 400kW, that's $48/hr, which isn't bad, operating expenses wise. Let's compare to the EC2.. $1300/hr for 30k cores. 23 core hours/$ The CITerra is $50/hr for 4000 cores. 80 core hours/$ Yes, one had to go out and BUY all those cores for CITerra. $5000/node, all in, including cabling racks, etc.? What's that, about $1.25M. Spread that out over 3 years at 2000 hrs/year (we only consider working in the daytime, etc. and you get about $210/hr for the capital cost (for all 500+ nodes..) So, the EC2 seems like a good solution when you need rapid scalability to huge sizes and you have a big expense budget and a small capital budget. 
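Folding the capital number into the same core-hours-per-dollar metric -- a quick sketch; the dollar and core figures are the ones just above, and the two utilization cases (2000 hours/year vs. running 24x7) are assumptions:

# Core-hours per dollar, EC2 vs. an owned cluster, using the figures above.
ec2_cores, ec2_rate = 30000, 1300.0            # ~$1300/hr for ~30k cores
print("EC2: ~%.0f core-hours per dollar" % (ec2_cores / ec2_rate))        # ~23

cores = 512 * 8                                # CITerra: 512 nodes x 8 cores
power_rate = 48.0                              # $/hr electricity estimate above
capex, years = 1.25e6, 3.0
for hours_per_year in (2000.0, 8766.0):        # "business hours" vs. 24x7
    hourly_capital = capex / (years * hours_per_year)
    total_rate = power_rate + hourly_capital
    print("owned, %4.0f h/yr: ~%.0f core-hours per dollar"
          % (hours_per_year, cores / total_rate))
# ~16 core-hours/$ at 2000 h/yr, ~43 at 24x7 -- the "provided the machines
# are used 24x7" caveat from earlier in the thread, in one number.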
You could call up Amazon this afternoon and run that 30,000 core job tonight. And you'd pay substantially for that flexibility (which is how Amazon makes money, eh?) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From jlb17 at duke.edu Tue Oct 4 16:47:30 2011 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Tue, 4 Oct 2011 16:47:30 -0400 (EDT) Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: On Tue, 4 Oct 2011 at 4:39pm, Robert G. Brown wrote > On Tue, 4 Oct 2011, Chi Chan wrote: > >> On Tue, Oct 4, 2011 at 11:58 AM, Rayson Ho wrote: >>> BTW, I've heard horror stories related to routing errors with this >>> method - truck drivers delivering wrong tapes or losing tapes >>> (hopefully the data is properly encrypted). >> >> I just read this on Slashdot today, it is "very hard to encrypt a >> backup tape" (really?): >> >> http://yro.slashdot.org/story/11/10/04/1815256/saic-loses-data-of-49-million-patients > > Not if it is encrypted with a stream cipher -- a stream cipher basically > xors the data with a bitstream generated from a suitable key in a > cryptographic-strength pseudorandom number generator (although there are > variations on this theme). As a result, it can be quite fast -- as fast > as generating pseudorandom numbers from the generator -- and it produces > a file that is exactly the size of the original message in length. For added "no, it's not hard, they're apparently just not very bright" value, LTO4+ includes hardware AES encryption. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Tue Oct 4 16:48:00 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 4 Oct 2011 13:48:00 -0700 Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: > -----Original Message----- > From: Robert G. Brown [mailto:rgb at phy.duke.edu] > Sent: Tuesday, October 04, 2011 1:39 PM > To: Chi Chan > Cc: Rayson Ho; Lux, Jim (337C); tt at postbiota.org; jtriley at mit.edu; Beowulf List > Subject: Re: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud > > On Tue, 4 Oct 2011, Chi Chan wrote: > > > On Tue, Oct 4, 2011 at 11:58 AM, Rayson Ho wrote: > >> BTW, I've heard horror stories related to routing errors with this > >> method - truck drivers delivering wrong tapes or losing tapes > >> (hopefully the data is properly encrypted). 
> > > > I just read this on Slashdot today, it is "very hard to encrypt a > > backup tape" (really?): > > > > http://yro.slashdot.org/story/11/10/04/1815256/saic-loses-data-of-49-million-patients > > Not if it is encrypted with a stream cipher -- a stream cipher basically > xors the data with a bitstream generated from a suitable key in a > cryptographic-strength pseudorandom number generator (although there are > variations on this theme). As a result, it can be quite fast -- as fast > as generating pseudorandom numbers from the generator -- and it produces > a file that is exactly the size of the original message in length. > > There are encryption schemes that expend extraordinary amounts of > computational energy in generating the stream, and there are also block > ciphers (which are indeed hard to implement for a streaming tape full of > data, as they usually don't work so well for long messages). But in the > end no, it isn't that hard to encrypt a backup tape, provided that you > are willing to accept the limitation that the speed of > encrypting/decrypting the stream being written to the tape is basically > limited by the speed of your RNG (which may well be slower than the > speed of most fast networks). > The reason it wasn't encrypted is almost certainly not because it was difficult to do so for technology reasons. When you see a story about "data being lost or stolen from a car" it's because it was an ad hoc situation. Someone got a copy of the data to do some sort of analysis or to take it somewhere on a onetime basis, and "things went wrong". Any sort of regular process would normally deal with encryption or security as a matter of course: it's too easy to do it right. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Tue Oct 4 16:52:13 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 4 Oct 2011 13:52:13 -0700 Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: <4E8B5E98.3090002@sonsorol.org> References: <4E8B5E98.3090002@sonsorol.org> Message-ID: <20111004205213.GD14057@bx9.net> On Tue, Oct 04, 2011 at 03:29:28PM -0400, Chris Dagdigian wrote: > I'm largely with RGB on this one with the minor caveat that I think he > might be undervaluing the insane economies of scale that IaaS providers > like Amazon & Google can provide. You can rent that economy of scale if you're in the right part of the country. We weren't surprised to recently learn that our Silicon Valley datacenter rent is much lower than Moscow, but I was surprised to learn that we pay 1/3 less here than in Vegas, which allegedly has cheap land and power hence cheap datacenter rents. And with only 750 servers, we are already big enough to reap enough outright economy of scale to make leasing our own servers in a rented datacenter cheaper than renting everything from Amazon. The unique thing Amazon is providing is the ability to grow and shrink your cluster. Your example of a company which wanted to run a bunch of molecular dynamics computations in a short period of time is an illustration of that. BTW, Amazon has lowered prices since AWS was released, but not by as much as their costs have fallen. 
That's no surprise, given their dominant role in that market. -- greg (corporate hat: infrastructure at a search engine) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From rgb at phy.duke.edu Tue Oct 4 17:03:46 2011 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue, 4 Oct 2011 17:03:46 -0400 (EDT) Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: On Tue, 4 Oct 2011, Lux, Jim (337C) wrote: > > The reason it wasn't encrypted is almost certainly not because it > was difficult to do so for technology reasons. When you see a story > about "data being lost or stolen from a car" it's because it was an ad > hoc situation. Someone got a copy of the data to do some sort of > analysis or to take it somewhere on a onetime basis, and "things went > wrong". > > Any sort of regular process would normally deal with encryption or > security as a matter of course: it's too easy to do it right. The problem being that HIPAA is not amused by incompetence. The standard is pretty much show due diligence or be prepared to pay massive bucks out in lawsuits should the data you protect be compromised. It is really a most annoying standard -- I mean it is good that it is so flexible and makes the responsibility clear, but for most of HIPAA's existence it has provided no F***ing guidelines on how to make protected data secure. Consequently (and I say this as a modest consultant-level expert) your data and mine in the Electronic Medical Record of your choice is typically: a) Stored in flat, unencrypted plaintext or binary image in the base DB. b) Transmitted in flat, unencrypted plaintext between the server and any LAN-connected clients. In other words, it assumes that your local LAN is secure. c) Relies on third party e.g. VPN solutions to provide encryption for use across a WAN. Needless to say, the passwords and authentication schemes used in EMRs are typically a joke -- after all, the users are borderline incompetent users and cannot be expected to remember or quickly type in a user id or password much more complicated than their own initials. Many sites have one completely trivial password in use by all the physicians and nurses who use the system -- just enough to MAYBE keep patients out of the system while waiting in an examining room. I have had to convince the staff of at least one major EMR company that I will refrain from naming that no, I wasn't going to ship them a copy of an entire dataset exported from an old practice management system -- think of it as the names, addresses, SSNs and a few dozen other "protected" pieces of personal information -- to them as an unencrypted zip file over the internet, and had to finally grit my teeth and accept the use of zip's (not terribly good) built in encryption and cross my fingers and pray. Do not underestimate the sheer power of incompetence, in other words, especially incompetence in an environment almost completely lacking meaningful IT-level standards or oversight. It's really shameful, actually -- it would be so very easy to build in nearly bulletproof security schema that would make the need for third party VPNs passe. 
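For what it's worth, the transport-encryption half of that really is a few lines with stock tools. A minimal client-side sketch using Python's standard ssl module follows; the host name and port are placeholders, and a real deployment still needs certificate management, authentication, and encryption at rest:

# Wrap a plain TCP connection in TLS before any patient data crosses the LAN.
# emr.example.org:8443 is a placeholder endpoint, not any real product's API.
import socket, ssl

ctx = ssl.create_default_context()                 # verifies the server cert
raw = socket.create_connection(("emr.example.org", 8443))
conn = ctx.wrap_socket(raw, server_hostname="emr.example.org")
conn.sendall(b"lookup patient 12345\n")            # application traffic, now encrypted
reply = conn.recv(4096)
conn.close()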
I don't know that ALL of the EMRs out there are STILL this bad, but I'd bet that 90% of them are. They certainly were 3-4 years ago, last time I looked in detail. So this is just par for the course. Doctors don't understand IT security. EMR creators should, but security is "expensive" and they don't bother because it isn't mandated. The end result is that everything from the DB to the physician's working screen is so horribly insecure that if any greed-driven cracker out there ever decided to exclusively target the weaknesses, they could compromise HIPAA and SSNs by the millions. Sigh. rgb > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From deadline at eadline.org Tue Oct 4 17:21:31 2011 From: deadline at eadline.org (Douglas Eadline) Date: Tue, 4 Oct 2011 17:21:31 -0400 (EDT) Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: <44854.192.168.93.213.1317763291.squirrel@mail.eadline.org> Several years ago I flippantly proposed what seems to be a simple way to ensure important consumer private data (medical, finance, etc.) was safe. Pass a law that says organization who collects or holds personal data must include the same data for organization's Board of Directors and officers (CEO, COO etc) in the database. At least the CEO might start taking security serious when someone in Bulgaria is buying jet skies with his AMX card. -- Doug > On Tue, 4 Oct 2011, Lux, Jim (337C) wrote: > >> >> The reason it wasn't encrypted is almost certainly not because it >> was difficult to do so for technology reasons. When you see a story >> about "data being lost or stolen from a car" it's because it was an ad >> hoc situation. Someone got a copy of the data to do some sort of >> analysis or to take it somewhere on a onetime basis, and "things went >> wrong". >> >> Any sort of regular process would normally deal with encryption or >> security as a matter of course: it's too easy to do it right. > > The problem being that HIPAA is not amused by incompetence. The > standard is pretty much show due diligence or be prepared to pay massive > bucks out in lawsuits should the data you protect be compromised. It is > really a most annoying standard -- I mean it is good that it is so > flexible and makes the responsibility clear, but for most of HIPAA's > existence it has provided no F***ing guidelines on how to make protected > data secure. > > Consequently (and I say this as a modest consultant-level expert) your > data and mine in the Electronic Medical Record of your choice is > typically: > > a) Stored in flat, unencrypted plaintext or binary image in the base > DB. > > b) Transmitted in flat, unencrypted plaintext between the server and > any LAN-connected clients. In other words, it assumes that your local > LAN is secure. > > c) Relies on third party e.g. 
VPN solutions to provide encryption for > use across a WAN. > > Needless to say, the passwords and authentication schemes used in EMRs > are typically a joke -- after all, the users are borderline incompetent > users and cannot be expected to remember or quickly type in a user id or > password much more complicated than their own initials. Many sites have > one completely trivial password in use by all the physicians and nurses > who use the system -- just enough to MAYBE keep patients out of the > system while waiting in an examining room. > > I have had to convince the staff of at least one major EMR company that > I will refrain from naming that no, I wasn't going to ship them a copy > of an entire dataset exported from an old practice management system -- > think of it as the names, addresses, SSNs and a few dozen other > "protected" pieces of personal information -- to them as an unencrypted > zip file over the internet, and had to finally grit my teeth and accept > the use of zip's (not terribly good) built in encryption and cross my > fingers and pray. > > Do not underestimate the sheer power of incompetence, in other words, > especially incompetence in an environment almost completely lacking > meaningful IT-level standards or oversight. It's really shameful, > actually -- it would be so very easy to build in nearly bulletproof > security schema that would make the need for third party VPNs passe. > > I don't know that ALL of the EMRs out there are STILL this bad, but I'd > bet that 90% of them are. They certainly were 3-4 years ago, last time > I looked in detail. > > So this is just par for the course. Doctors don't understand IT > security. EMR creators should, but security is "expensive" and they > don't bother because it isn't mandated. The end result is that > everything from the DB to the physician's working screen is so horribly > insecure that if any greed-driven cracker out there ever decided to > exclusively target the weaknesses, they could compromise HIPAA and SSNs > by the millions. > > Sigh. > > rgb > >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > Robert G. Brown http://www.phy.duke.edu/~rgb/ > Duke University Dept. of Physics, Box 90305 > Durham, N.C. 27708-0305 > Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mathog at caltech.edu Tue Oct 4 17:39:40 2011 From: mathog at caltech.edu (mathog) Date: Tue, 04 Oct 2011 14:39:40 -0700 Subject: [Beowulf] =?utf-8?q?=241=2C_279-per-hour=2C_30=2C=09000-core_clus?= =?utf-8?q?ter_built_on_Amazon_EC2_cloud?= In-Reply-To: References: Message-ID: <1a4e05cecd44d8777737e6994d09b289@saf.bio.caltech.edu> On Tue, 4 Oct 2011 13:43:15 -0700, Lux, Jim (337C) wrote: > So call it 12c/kWh on peak. At 400kW, that's $48/hr, which isn't > bad, operating expenses wise. Well, yes and no. If they only turned it on once and a while it wouldn't be too bad, but I'm pretty sure it runs 100% of the time. At least I have never walked by when the racks were not lit up, so... $48 * 24 * 365 = $420480/year Versus the average lab at (waves hands) $150 in electricity a month = $1800/year? It will of course depend on what kind of work the lab does. The difference is two orders of magnitude. Anyway, last I looked we had around 300 professors, so that one facility used up, order of magnitude, as much juice as all the "normal" labs combined. (Certainly there are some other labs around which also use a lot of electricity.) Cooling water usage was probably also a sore point from the administration's perspective. Pretty much everything here runs AC off chilled water coming from a central plant. Either that cluster used up a whole lot of chilled water capacity at the central plant or they built a a separate chiller somewhere. Dave Kewley who sometimes posts here used to run that system, so he would know. Regards David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From jlb17 at duke.edu Tue Oct 4 17:41:02 2011 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Tue, 4 Oct 2011 17:41:02 -0400 (EDT) Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: On Tue, 4 Oct 2011 at 5:03pm, Robert G. Brown wrote > Needless to say, the passwords and authentication schemes used in EMRs > are typically a joke -- after all, the users are borderline incompetent > users and cannot be expected to remember or quickly type in a user id or > password much more complicated than their own initials. Many sites have > one completely trivial password in use by all the physicians and nurses > who use the system -- just enough to MAYBE keep patients out of the > system while waiting in an examining room. My wife's experience here was somewhat the opposite of that. Within 2 days of starting her fellowship at UCSF she had acquired over 10 usernames and passwords (and one RSA hardware token) for all the various systems she needed to interact with. Each system, of course, had its own password aging and renewal rules. Determining how physicians manage their passwords in such an environment is left as an exercise for the reader... 
-- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From rgb at phy.duke.edu Wed Oct 5 08:40:53 2011 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed, 5 Oct 2011 08:40:53 -0400 (EDT) Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: <44854.192.168.93.213.1317763291.squirrel@mail.eadline.org> References: <44854.192.168.93.213.1317763291.squirrel@mail.eadline.org> Message-ID: On Tue, 4 Oct 2011, Douglas Eadline wrote: > > Several years ago I flippantly proposed what seems to be > a simple way to ensure important consumer private data > (medical, finance, etc.) was safe. Pass a law that says > organization who collects or holds personal data must > include the same data for organization's Board of Directors and > officers (CEO, COO etc) in the database. At least > the CEO might start taking security serious when > someone in Bulgaria is buying jet skies with his AMX card. It wouldn't help. Physicians are too clueless to understand or care (mostly, not universally) and besides, what can they do? They don't write software. The companies that provide the software won't have their board's information in the DB under any circumstances, and they are the problem. Or rather, the unregulated nature of the business is the problem. The government is spending all sorts of energy specifying the detailed structure of the DB and ICD codes for every possible illness at a staggering degree of granularity so that they can eventually micro-specify compensation rates for fingering your left gonad during an exam but are leaving HIPAA -- a disaster from day one in so very many ways -- in place as the sole guardian of our medical privacy. HIPAA fails to specify IT security, and obscures precisely who will be held financially responsible for failures of security or what other sanctions might be applied. HIPAA has had the easily predictable side effect of placing enormous physical and financial obstacles in the path of medical research, to the point where I think it is safe to say that HIPAA alone has de fact killed thousands to tens of thousands of people simply by delaying discovery for years to decades (while costing us a modest fortune to perform such research as is now performed, with whole departments in any research setting devoted to managing the permissioning of the data). Finally, HIPAA's fundamental original purpose was to keep e.g. health insurance companies or employers from getting your health care records and using them to deny coverage or employment, and it didn't really succeed even in that because of the appalling state of deregulation in the insurance industry itself. It's really pretty amazing. It's hard to imagine how anyone could have come up with a piece of governance so diabolically well designed to be enormously expensive in money and lives while failing even to accomplish its own primary goals or the related goals that it SHOULD have tried to accomplish (such as mandating a certain -- high -- level of security and complete open-standard interoperability and data portability in emergent EMR/PM systems, at least at the DB level), even if they tried. 
However, we should never be hasty to ascribe to human evil that which can adequately be explained by mere incompetence and stupidity. But this is OT, and I'll return to my muttons now. Soap box out. rgb > > -- > Doug > > > > > >> On Tue, 4 Oct 2011, Lux, Jim (337C) wrote: >> >>> >>> The reason it wasn't encrypted is almost certainly not because it >>> was difficult to do so for technology reasons. When you see a story >>> about "data being lost or stolen from a car" it's because it was an ad >>> hoc situation. Someone got a copy of the data to do some sort of >>> analysis or to take it somewhere on a onetime basis, and "things went >>> wrong". >>> >>> Any sort of regular process would normally deal with encryption or >>> security as a matter of course: it's too easy to do it right. >> >> The problem being that HIPAA is not amused by incompetence. The >> standard is pretty much show due diligence or be prepared to pay massive >> bucks out in lawsuits should the data you protect be compromised. It is >> really a most annoying standard -- I mean it is good that it is so >> flexible and makes the responsibility clear, but for most of HIPAA's >> existence it has provided no F***ing guidelines on how to make protected >> data secure. >> >> Consequently (and I say this as a modest consultant-level expert) your >> data and mine in the Electronic Medical Record of your choice is >> typically: >> >> a) Stored in flat, unencrypted plaintext or binary image in the base >> DB. >> >> b) Transmitted in flat, unencrypted plaintext between the server and >> any LAN-connected clients. In other words, it assumes that your local >> LAN is secure. >> >> c) Relies on third party e.g. VPN solutions to provide encryption for >> use across a WAN. >> >> Needless to say, the passwords and authentication schemes used in EMRs >> are typically a joke -- after all, the users are borderline incompetent >> users and cannot be expected to remember or quickly type in a user id or >> password much more complicated than their own initials. Many sites have >> one completely trivial password in use by all the physicians and nurses >> who use the system -- just enough to MAYBE keep patients out of the >> system while waiting in an examining room. >> >> I have had to convince the staff of at least one major EMR company that >> I will refrain from naming that no, I wasn't going to ship them a copy >> of an entire dataset exported from an old practice management system -- >> think of it as the names, addresses, SSNs and a few dozen other >> "protected" pieces of personal information -- to them as an unencrypted >> zip file over the internet, and had to finally grit my teeth and accept >> the use of zip's (not terribly good) built in encryption and cross my >> fingers and pray. >> >> Do not underestimate the sheer power of incompetence, in other words, >> especially incompetence in an environment almost completely lacking >> meaningful IT-level standards or oversight. It's really shameful, >> actually -- it would be so very easy to build in nearly bulletproof >> security schema that would make the need for third party VPNs passe. >> >> I don't know that ALL of the EMRs out there are STILL this bad, but I'd >> bet that 90% of them are. They certainly were 3-4 years ago, last time >> I looked in detail. >> >> So this is just par for the course. Doctors don't understand IT >> security. EMR creators should, but security is "expensive" and they >> don't bother because it isn't mandated. 
The end result is that >> everything from the DB to the physician's working screen is so horribly >> insecure that if any greed-driven cracker out there ever decided to >> exclusively target the weaknesses, they could compromise HIPAA and SSNs >> by the millions. >> >> Sigh. >> >> rgb >> >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >>> >> >> Robert G. Brown http://www.phy.duke.edu/~rgb/ >> Duke University Dept. of Physics, Box 90305 >> Durham, N.C. 27708-0305 >> Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu >> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> >> -- >> This message has been scanned for viruses and >> dangerous content by MailScanner, and is >> believed to be clean. >> > > > -- > Doug > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From rgb at phy.duke.edu Wed Oct 5 08:45:02 2011 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed, 5 Oct 2011 08:45:02 -0400 (EDT) Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: References: Message-ID: On Tue, 4 Oct 2011, Joshua Baker-LePain wrote: > On Tue, 4 Oct 2011 at 5:03pm, Robert G. Brown wrote > >> Needless to say, the passwords and authentication schemes used in EMRs >> are typically a joke -- after all, the users are borderline incompetent >> users and cannot be expected to remember or quickly type in a user id or >> password much more complicated than their own initials. Many sites have >> one completely trivial password in use by all the physicians and nurses >> who use the system -- just enough to MAYBE keep patients out of the >> system while waiting in an examining room. > > My wife's experience here was somewhat the opposite of that. Within 2 > days of starting her fellowship at UCSF she had acquired over 10 usernames > and passwords (and one RSA hardware token) for all the various systems she > needed to interact with. Each system, of course, had its own password > aging and renewal rules. Determining how physicians manage their > passwords in such an environment is left as an exercise for the reader... Ah, yes, excellent. Ten of them AND an RSA e.g. SecureID -- wow, that takes some real brilliance. I know how MY physician wife would manage it... 
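And on the unencrypted-zip hand-off described further up the thread: the least-effort fix is to encrypt the export before it ever leaves the building rather than trusting zip's legacy crypto. A minimal sketch, assuming gpg is installed at both ends, with a made-up file name:

    gpg --symmetric --cipher-algo AES256 pm_export.db       # writes pm_export.db.gpg
    sha256sum pm_export.db.gpg > pm_export.db.gpg.sha256    # lets the receiver verify the copy
    # receiving side, passphrase exchanged out of band (phone, not email):
    sha256sum -c pm_export.db.gpg.sha256
    gpg --output pm_export.db --decrypt pm_export.db.gpg

A shared passphrase is still a weak link, but at least the bytes in transit are AES-256 rather than whatever the zip tool ships with.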
rgb > > -- > Joshua Baker-LePain > QB3 Shared Cluster Sysadmin > UCSF > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From ellis at runnersroll.com Wed Oct 5 09:42:28 2011 From: ellis at runnersroll.com (Ellis H. Wilson III) Date: Wed, 05 Oct 2011 09:42:28 -0400 Subject: [Beowulf] $1, 279-per-hour, 30, 000-core cluster built on Amazon EC2 cloud In-Reply-To: <20111004205213.GD14057@bx9.net> References: <4E8B5E98.3090002@sonsorol.org> <20111004205213.GD14057@bx9.net> Message-ID: <4E8C5EC4.9020101@runnersroll.com> On 10/04/11 16:52, Greg Lindahl wrote: > On Tue, Oct 04, 2011 at 03:29:28PM -0400, Chris Dagdigian wrote: >> I'm largely with RGB on this one with the minor caveat that I think he >> might be undervaluing the insane economies of scale that IaaS providers >> like Amazon & Google can provide. > > cheap land and power hence cheap datacenter rents. And with only 750 > servers, we are already big enough to reap enough outright economy of > scale to make leasing our own servers in a rented datacenter cheaper > than renting everything from Amazon. > > The unique thing Amazon is providing is the ability to grow and shrink > your cluster. Your example of a company which wanted to run a bunch of > molecular dynamics computations in a short period of time is an > illustration of that. On this note, does anyone know if there are prior works (either academic or publicly disclosed documentations of a company pursuing such a route) of people splitting their workload up into the "static" and "dynamic" portions and running them respectively on in-house and rented hardware? While I see this discussion time and time again go either one way or the other (google or amazon, if you will), I suspect for many companies if it were possible to "invisibly" extend their infrastructure into the cloud on an as-needed basis, it might be a pretty attractive solution. Put another way, there doesn't seem to be much sense in buying a couple more racks for just a short-term project that will result in those racks going silent afterwards. On the flipside, you probably have some fraction of the compute and data resources you need as it is, you just want it to run a little faster or need a little more scratch space/bandwidth. So renting an entire set of resources wouldn't be optimal either, since that will result in underutilization of the infrastructure at home. So just buy whatever fraction your missing from Amazon from a month and use some hacks to make it look like that hardware is right there next to your other stuff. Obviously this requires an embarrassingly parallel workload due to the locality dichotomy (or completely disjoint workloads). Another idea I had was just like solar energy, what if there was a way for you to build up credits for Amazon in the "day" and use them at "night"? I.E. 
put some Amazon software on your infrastructure that allows you them to use your servers as part of their "cloud" when you're not using your equipment at max, and when you do go peak it will automatically provision more and more Amazon leased resources on an as-needed basis and burn up those earned credits instead of "real money." Just some ideas I figured I'd put through the beo-blender to see if they hold any weight before actually pursuing them as research objectives. ellis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From jcownie at cantab.net Thu Oct 6 13:33:51 2011 From: jcownie at cantab.net (James Cownie) Date: Thu, 6 Oct 2011 18:33:51 +0100 Subject: [Beowulf] Beowulf Bash at SC11? Message-ID: <9DAC8FB2-067E-4F1B-ABBA-1AF995E62A33@cantab.net> SC approaches fast, but I've seen no mention of a Beowulf Bash. Has it died? Did I just miss an announcement? -- -- Jim -- James Cownie -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From prentice at ias.edu Fri Oct 7 09:45:29 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 07 Oct 2011 09:45:29 -0400 Subject: [Beowulf] Beowulf Bash at SC11? In-Reply-To: <9DAC8FB2-067E-4F1B-ABBA-1AF995E62A33@cantab.net> References: <9DAC8FB2-067E-4F1B-ABBA-1AF995E62A33@cantab.net> Message-ID: <4E8F0279.5070809@ias.edu> There's an announcement on beowulf.org for a Beowulf Bash... from 2009! Beowulf Bash: The 11th Annual Beowulf.org Meeting November 16, 2009 Portland OR Location: The Game, One Center Court, The Rose Quarter Sponsors: AMD Cluster Monkey InsideHPC Penguin Computing SiCorp TeraScala XAND Marketing On 10/06/2011 01:33 PM, James Cownie wrote: > SC approaches fast, but I've seen no mention of a Beowulf Bash. > > Has it died? > > Did I just miss an announcement? > > -- > > -- Jim > > -- > > James Cownie > > > > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From Glen.Beane at jax.org Fri Oct 7 10:21:41 2011 From: Glen.Beane at jax.org (Glen Beane) Date: Fri, 7 Oct 2011 14:21:41 +0000 Subject: [Beowulf] Beowulf Bash at SC11? 
In-Reply-To: <4E8F0279.5070809@ias.edu> References: <9DAC8FB2-067E-4F1B-ABBA-1AF995E62A33@cantab.net> <4E8F0279.5070809@ias.edu> Message-ID: <7514EA83-EDED-453C-8901-1C861D36C1B2@jax.org> I remember not hearing much about it last year in New Orleans until someone I knew from Penguin handed me a card Monday night at the opening gala On Oct 7, 2011, at 9:45 AM, Prentice Bisbal wrote: > There's an announcement on beowulf.org for a Beowulf Bash... from 2009! > > Beowulf Bash: The 11th Annual Beowulf.org Meeting > November 16, 2009 > Portland OR > Location: The Game, One Center Court, The Rose Quarter Sponsors: > AMD Cluster Monkey > InsideHPC > Penguin Computing > SiCorp TeraScala > XAND Marketing > > > On 10/06/2011 01:33 PM, James Cownie wrote: >> SC approaches fast, but I've seen no mention of a Beowulf Bash. >> >> Has it died? >> >> Did I just miss an announcement? >> >> -- >> >> -- Jim >> >> -- >> >> James Cownie > >> >> >> >> >> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Glen L. Beane Senior Software Engineer The Jackson Laboratory (207) 288-6153 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From deadline at eadline.org Fri Oct 7 17:19:52 2011 From: deadline at eadline.org (Douglas Eadline) Date: Fri, 7 Oct 2011 17:19:52 -0400 (EDT) Subject: [Beowulf] Beowulf Bash at SC11? In-Reply-To: <7514EA83-EDED-453C-8901-1C861D36C1B2@jax.org> References: <9DAC8FB2-067E-4F1B-ABBA-1AF995E62A33@cantab.net> <4E8F0279.5070809@ias.edu> <7514EA83-EDED-453C-8901-1C861D36C1B2@jax.org> Message-ID: <47582.192.168.93.213.1318022392.squirrel@mail.eadline.org> I always announce it on this list and on ClusterMonkey, it also will be announced on InsideHPC and some of the sponsor sites. -- Doug > I remember not hearing much about it last year in New Orleans until > someone I knew from Penguin handed me a card Monday night at the opening > gala > > > On Oct 7, 2011, at 9:45 AM, Prentice Bisbal wrote: > >> There's an announcement on beowulf.org for a Beowulf Bash... from 2009! >> >> Beowulf Bash: The 11th Annual Beowulf.org Meeting >> November 16, 2009 >> Portland OR >> Location: The Game, One Center Court, The Rose Quarter Sponsors: >> AMD Cluster Monkey >> InsideHPC >> Penguin Computing >> SiCorp TeraScala >> XAND Marketing >> >> >> On 10/06/2011 01:33 PM, James Cownie wrote: >>> SC approaches fast, but I've seen no mention of a Beowulf Bash. >>> >>> Has it died? >>> >>> Did I just miss an announcement? 
>>> >>> -- >>> >>> -- Jim >>> >>> -- >>> >>> James Cownie > >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >>> Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > > -- > Glen L. Beane > Senior Software Engineer > The Jackson Laboratory > (207) 288-6153 > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From kilian.cavalotti.work at gmail.com Tue Oct 11 11:21:32 2011 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Tue, 11 Oct 2011 17:21:32 +0200 Subject: [Beowulf] IBM to acquire Platform Computing Message-ID: http://www.platform.com/press-releases/2011/IBMtoAcquireSystemSoftwareCompanyPlatformComputingtoExtendReachofTechnicalComputing and http://www-03.ibm.com/systems/deepcomputing/platform.html Cheers, -- Kilian _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From dag at sonsorol.org Wed Oct 12 10:52:13 2011 From: dag at sonsorol.org (Chris Dagdigian) Date: Wed, 12 Oct 2011 10:52:13 -0400 Subject: [Beowulf] 10GbE topologies for small-ish clusters? Message-ID: <4E95A99D.9040703@sonsorol.org> First time I'm seriously pondering bringing 10GbE straight to compute nodes ... For 64 servers (32 to a cabinet) and an HPC system that spans two racks what would be the common 10 Gig networking topology be today? - One large core switch? - 48 port top-of-rack switches with trunking? - Something else? Regards, Chris _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From landman at scalableinformatics.com Wed Oct 12 10:58:58 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 12 Oct 2011 10:58:58 -0400 Subject: [Beowulf] 10GbE topologies for small-ish clusters? 
In-Reply-To: <4E95A99D.9040703@sonsorol.org> References: <4E95A99D.9040703@sonsorol.org> Message-ID: <4E95AB32.3030804@scalableinformatics.com> On 10/12/2011 10:52 AM, Chris Dagdigian wrote: > > First time I'm seriously pondering bringing 10GbE straight to compute > nodes ... > > For 64 servers (32 to a cabinet) and an HPC system that spans two racks > what would be the common 10 Gig networking topology be today? > > - One large core switch? > - 48 port top-of-rack switches with trunking? > - Something else? What's the use case? Low latency, or simplified high bandwidth connection? 10GbE with 40GbE uplinks won't be cheap. But it would be doable. Gnodal, Mellanox, and others would be able to do this. > > Regards, > Chris > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From i.n.kozin at googlemail.com Wed Oct 12 11:22:52 2011 From: i.n.kozin at googlemail.com (Igor Kozin) Date: Wed, 12 Oct 2011 16:22:52 +0100 Subject: [Beowulf] 10GbE topologies for small-ish clusters? In-Reply-To: <4E95A99D.9040703@sonsorol.org> References: <4E95A99D.9040703@sonsorol.org> Message-ID: Gnodal was probably the first to announce a 1U 72 port switch http://www.gnodal.com/docs/Gnodal%20GS7200%20datasheet.pdf Other vendors either have announced or will be probably announcing dense packaging too. On 12 October 2011 15:52, Chris Dagdigian wrote: > > First time I'm seriously pondering bringing 10GbE straight to compute > nodes ... > > For 64 servers (32 to a cabinet) and an HPC system that spans two racks > what would be the common 10 Gig networking topology be today? > > - One large core switch? > - 48 port top-of-rack switches with trunking? > - Something else? > > Regards, > Chris > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From john.hearns at mclaren.com Wed Oct 12 11:28:28 2011 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 12 Oct 2011 16:28:28 +0100 Subject: [Beowulf] 10GbE topologies for small-ish clusters? References: <4E95A99D.9040703@sonsorol.org> Message-ID: <207BB2F60743C34496BE41039233A80903FB49D5@MRL-PWEXCHMB02.mil.tagmclarengroup.com> First time I'm seriously pondering bringing 10GbE straight to compute nodes ... 
For 64 servers (32 to a cabinet) and an HPC system that spans two racks what would be the common 10 Gig networking topology be today? - One large core switch? - 48 port top-of-rack switches with trunking? - Something else? I was going to suggest two Gnodal rack top switches, linked by a 40Gbps link http://www.gnodal.com/ I see though that their GS7200 switch has 72 x 10Gbps ports - should do you just fine! The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From akshar.bhosale at gmail.com Wed Oct 12 12:28:57 2011 From: akshar.bhosale at gmail.com (akshar bhosale) Date: Wed, 12 Oct 2011 21:58:57 +0530 Subject: [Beowulf] refunding reserved amount in gold Message-ID: Hi, We are using PBS (torque 2.4.8) and gold version 2.1.7.1. One of the jobs went for execution and reserved the equivalent amount. The same job came out of execution and went in queue from execution. This happened 30 times for the same job. Every time job has reserved amount. Now finally there is very huge amount(30*charges for that single job) which is shown in reserved state.Job now does not exist. User can not submit the new job now because of neglegible amount balance in his account. We want to clear reserved amount. How to do that? -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Shainer at Mellanox.com Wed Oct 12 12:30:02 2011 From: Shainer at Mellanox.com (Gilad Shainer) Date: Wed, 12 Oct 2011 16:30:02 +0000 Subject: [Beowulf] 10GbE topologies for small-ish clusters? In-Reply-To: <207BB2F60743C34496BE41039233A80903FB49D5@MRL-PWEXCHMB02.mil.tagmclarengroup.com> References: <4E95A99D.9040703@sonsorol.org> <207BB2F60743C34496BE41039233A80903FB49D5@MRL-PWEXCHMB02.mil.tagmclarengroup.com> Message-ID: You can also check the Mellanox products - both for 40GigE and 10GigE switch fabric. Gilad -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Hearns, John Sent: Wednesday, October 12, 2011 8:31 AM To: dag at sonsorol.org; beowulf at beowulf.org Subject: Re: [Beowulf] 10GbE topologies for small-ish clusters? First time I'm seriously pondering bringing 10GbE straight to compute nodes ... For 64 servers (32 to a cabinet) and an HPC system that spans two racks what would be the common 10 Gig networking topology be today? - One large core switch? - 48 port top-of-rack switches with trunking? - Something else? I was going to suggest two Gnodal rack top switches, linked by a 40Gbps link http://www.gnodal.com/ I see though that their GS7200 switch has 72 x 10Gbps ports - should do you just fine! 
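Whichever vendor wins, it is worth a back-of-envelope check on the inter-rack trunk before settling on two top-of-rack switches. A rough sketch, assuming 32 nodes per rack with one 10GbE port each (the trunk widths below are only examples):

    per_rack_gbps=$((32 * 10))        # 320 Gb/s of edge bandwidth per rack
    echo "1 x 40G inter-rack link : $((per_rack_gbps / 40)):1 oversubscribed"    # 8:1
    echo "4 x 40G trunk           : $((per_rack_gbps / 160)):1 oversubscribed"   # 2:1

For embarrassingly parallel or mostly rack-local traffic 8:1 may be acceptable; a single 72-port switch sidesteps the question entirely for 64 nodes.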
The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From scrusan at ur.rochester.edu Wed Oct 12 12:33:39 2011 From: scrusan at ur.rochester.edu (Steve Crusan) Date: Wed, 12 Oct 2011 12:33:39 -0400 Subject: [Beowulf] refunding reserved amount in gold In-Reply-To: References: Message-ID: <85631CC6-BFE0-44A2-B69E-42BB660AC632@ur.rochester.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I would suggest you post this to the Gold mailing list with a few more pieces of information: http://www.supercluster.org/mailman/listinfo/gold-users Regardless, you could probably use the grefund command... On Oct 12, 2011, at 12:28 PM, akshar bhosale wrote: > Hi, > > We are using PBS (torque 2.4.8) and gold version 2.1.7.1. One of the > jobs went for execution and reserved the equivalent amount. The same job > came out of execution and went in queue from execution. This happened 30 > times for the same job. Every time job has reserved amount. Now finally > there is very huge amount(30*charges for that single job) which is shown in > reserved state.Job now does not exist. User can not submit the new job now > because of neglegible amount balance in his account. We want to clear > reserved amount. How to do that? > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ---------------------- Steve Crusan System Administrator Center for Research Computing University of Rochester https://www.crc.rochester.edu/ -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.17 (Darwin) Comment: GPGTools - http://gpgtools.org iQEcBAEBAgAGBQJOlcFoAAoJENS19LGOpgqK1UIIAIFZj6fIZebQt9xQwmVBVxB9 MPwJMlw4C0F8bR/crGBWx7NUHElep1frROYohD15jN/8bFA2/bJ3xFdiH1bMNqHu MdB4EmRbs4nuNeN/ZayV4JXBVD3oPuwESYA65jVj0MfbVbzeRod6ZnNvpZOb/Juc 7dHCNPa2coLGLakGEQperOvOOCqsTbxSUdagXulW/1xH3iG+8UPNPJe7ATvO0tE3 FYOot3a3WgN8dsWUnsOKBnA17FA2zN0ac/QdEd2COSbpOjbpQp7BIlg0f0QIIkU6 pVq1C706jn5Cl4gKXsfC277Rrx3eLl3YPVA6XaL95PSXBH51L7Y3ViqMmVe9Coo= =cSUy -----END PGP SIGNATURE----- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Wed Oct 12 14:04:27 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 12 Oct 2011 11:04:27 -0700 Subject: [Beowulf] 10GbE topologies for small-ish clusters? 
In-Reply-To: <4E95A99D.9040703@sonsorol.org> References: <4E95A99D.9040703@sonsorol.org> <20111012180002.GC5039@bx9.net> Message-ID: <20111012180427.GD5039@bx9.net> We just bought a couple of 64-port 10g switches from Blade, for the middle of our networking infrastructure. They were the winner over all the others, lowest price and appropriate features. We also bought Blade top-of-rack switches. Now that they've been bought up by IBM you have to negotiate harder to get that low price, but you can still get it by threatening them with competing quotes. Gnodal looks very interesting for larger, multi-switch clusters, they were just a bit late to market for us. Arista really believes that their high prices are justified; we didn't. And if anyone would like to buy some used Mellanox 48-port 10ge switches, we have 2 extras we'd like to sell. -- greg On Wed, Oct 12, 2011 at 10:52:13AM -0400, Chris Dagdigian wrote: > > First time I'm seriously pondering bringing 10GbE straight to compute > nodes ... > > For 64 servers (32 to a cabinet) and an HPC system that spans two racks > what would be the common 10 Gig networking topology be today? > > - One large core switch? > - 48 port top-of-rack switches with trunking? > - Something else? > > Regards, > Chris _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From Shainer at Mellanox.com Wed Oct 12 14:11:04 2011 From: Shainer at Mellanox.com (Gilad Shainer) Date: Wed, 12 Oct 2011 18:11:04 +0000 Subject: [Beowulf] 10GbE topologies for small-ish clusters? In-Reply-To: <20111012180427.GD5039@bx9.net> References: <4E95A99D.9040703@sonsorol.org> <20111012180002.GC5039@bx9.net> <20111012180427.GD5039@bx9.net> Message-ID: The 48-ports are not Mellanox but previous company that Mellanox acquired, as the Mellanox ones are 36 x 40G or 64 x 10G in 1U (or bigger). But please don't let these small details hold you from re-living your history. Good luck selling. -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Greg Lindahl Sent: Wednesday, October 12, 2011 11:05 AM To: Chris Dagdigian Cc: Beowulf Mailing List Subject: Re: [Beowulf] 10GbE topologies for small-ish clusters? We just bought a couple of 64-port 10g switches from Blade, for the middle of our networking infrastructure. They were the winner over all the others, lowest price and appropriate features. We also bought Blade top-of-rack switches. Now that they've been bought up by IBM you have to negotiate harder to get that low price, but you can still get it by threatening them with competing quotes. Gnodal looks very interesting for larger, multi-switch clusters, they were just a bit late to market for us. Arista really believes that their high prices are justified; we didn't. And if anyone would like to buy some used Mellanox 48-port 10ge switches, we have 2 extras we'd like to sell. -- greg On Wed, Oct 12, 2011 at 10:52:13AM -0400, Chris Dagdigian wrote: > > First time I'm seriously pondering bringing 10GbE straight to compute > nodes ... > > For 64 servers (32 to a cabinet) and an HPC system that spans two > racks what would be the common 10 Gig networking topology be today? > > - One large core switch? > - 48 port top-of-rack switches with trunking? 
> - Something else?
>
> Regards,
> Chris
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
-- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.

From cap at nsc.liu.se Thu Oct 13 07:51:56 2011
From: cap at nsc.liu.se (Peter Kjellström)
Date: Thu, 13 Oct 2011 13:51:56 +0200
Subject: [Beowulf] 10GbE topologies for small-ish clusters?
In-Reply-To: <4E95A99D.9040703@sonsorol.org>
References: <4E95A99D.9040703@sonsorol.org>
Message-ID: <201110131351.59977.cap@nsc.liu.se>

On Wednesday, October 12, 2011 04:52:13 PM Chris Dagdigian wrote:
> First time I'm seriously pondering bringing 10GbE straight to compute
> nodes ...
>
> For 64 servers (32 to a cabinet) and an HPC system that spans two racks
> what would be the common 10 Gig networking topology be today?

Both Arista and Blade (now IBM) have 64-port 1U single-ASIC switches (a few ports will require qsfp to sfp+ break out cables afaict).

/Peter
-------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 198 bytes Desc: This is a digitally signed message part. URL:
-------------- next part -------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From prentice at ias.edu Fri Oct 21 09:10:18 2011
From: prentice at ias.edu (Prentice Bisbal)
Date: Fri, 21 Oct 2011 09:10:18 -0400
Subject: [Beowulf] Users abusing screen
Message-ID: <4EA16F3A.8080209@ias.edu>

Beowulfers,

I have a question that isn't directly related to clusters, but I suspect it's an issue many of you are dealing with or have dealt with: users using the screen command to stay logged in on systems and running long jobs that they forget about. Have any of you experienced this, and how did you deal with it?

Here's my scenario:

In addition to my cluster, we have a bunch of "computer servers" where users can run their programs. These are "large" boxes with more cores (24-32 cores) and more RAM (128 - 256 GB, ECC) than they'd have on a desktop.

Periodically, when I have to shutdown/reboot a system for maintenance, I find a LOT of shells being run through the screen command for users who aren't logged in. The majority are idle shells, but many are running jobs that seem to be forgotten about. For example, I recently found some jobs running since July or August that were running under the account of someone who hasn't even been here for months!

My opinion is that these are shared resources, and if you aren't interactively using them, you should log out to free up resources for others. If you have a job that can be run non-interactively, you should submit it to the cluster.

Has anyone else here dealt with the problem?

I would like to remove screen from my environment entirely to prevent this. My fellow sysadmins here agree. I'm expecting massive backlash from the users.
-- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Fri Oct 21 12:07:27 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 21 Oct 2011 12:07:27 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <20111021134457.GA22748@grml> <4EA1854B.5090506@ias.edu> Message-ID: <4EA198BF.3030002@ias.edu> On 10/21/2011 11:06 AM, Kilian Cavalotti wrote: > Hi Prentice, > > On Fri, Oct 21, 2011 at 4:44 PM, Prentice Bisbal wrote: >>> Have you thought about queueing systems like condor or SGE? >> >> Yes, I have cluster that uses SGE, and we allow users to run serial jobs >> (non-MPI, etc.) there, so there is no need for them to use screen to >> execute long-running jobs. Hence my frustration. > > You could alias "screen" to "qlogin". :) Actually, I can't for reasons I can't get into here. But something like that was part of my original "master plan". -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Fri Oct 21 12:10:36 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 21 Oct 2011 12:10:36 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <7B82E572-588E-41A4-9B46-8A1A07360A30@staff.uni-marburg.de> References: <4EA16F3A.8080209@ias.edu> <7B82E572-588E-41A4-9B46-8A1A07360A30@staff.uni-marburg.de> Message-ID: <4EA1997C.70103@ias.edu> On 10/21/2011 11:24 AM, Reuti wrote: > Hi, > > Am 21.10.2011 um 15:10 schrieb Prentice Bisbal: > >> Beowulfers, >> >> I have a question that isn't directly related to clusters, but I suspect >> it's an issue many of you are dealing with are dealt with: users using >> the screen command to stay logged in on systems and running long jobs >> that they forget about. Have any of you experienced this, and how did >> you deal with it? >> >> Here's my scenario: >> >> In addition to my cluster, we have a bunch of "computer servers" where >> users can run the programs. These are "large" boxes with more cores >> (24-32 cores) and more RAM (128 - 256 GB, ECC) than they'd have on a >> desktop top. >> >> Periodically, when I have to shutdown/reboot a system for maintenance, >> I find a LOT of shells being run through the screen command for users >> who aren't logged in. The majority are idle shells, but many are running >> jobs, that seem to be forgotten about. For example, I recently found >> some jobs running since July or August that were running under the >> account of someone who hasn't even been here for months! >> >> My opinion is these these are shared resources, and if you aren't >> interactively using them, you should log out to free up resources for >> others. If you have a job that can be run non-interactively, you should >> submit it to the cluster. >> >> Has anyone else here dealt with the problem? >> >> I would like to remove screen from my environment entirely to prevent >> this. My fellow sysadmins here agree. 
I'm expecting massive backlash >> from the users. > > I disallow rsh to the machines and limit ssh to admin staff. Users who want to run something on a machine have to go through the queuing system to get access to a node granted by GridEngine (for the startup method you can use either the -builtin- or [in case you need X11 forwarding] by a different sshd_config and ssh [GridEngine will start one daemon per task], one additional step is necessary for a tight integration of ssh). > > For users just checking their jobs on a node I have a dedicated queue (where they can login always, but h_cpu limited to 60 seconds, i.e. they can't abuse it). > > -- Reuti > Reuti, That was EXACTLY my original plan, but for reasons I don't want to get into, I can't implement that. In fact, just yesterday I ripped out all the SGE queues I had configured to that. Why? because I was tired of seeing them and being reminded of what a good idea it was. :( -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Fri Oct 21 12:12:53 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 21 Oct 2011 12:12:53 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA19365.4030109@runnersroll.com> References: <4EA16F3A.8080209@ias.edu> <4EA19365.4030109@runnersroll.com> Message-ID: <4EA19A05.4000400@ias.edu> On 10/21/2011 11:44 AM, Ellis H. Wilson III wrote: > On 10/21/11 09:10, Prentice Bisbal wrote: >> Beowulfers, >> >> I have a question that isn't directly related to clusters, but I suspect >> it's an issue many of you are dealing with are dealt with: users using >> the screen command to stay logged in on systems and running long jobs >> that they forget about. Have any of you experienced this, and how did >> you deal with it? > > I think this is strongly tied to what kind of work the users are doing > (i.e. how interactive it is, how long jobs take, how likely failure is > to occur that they must react to). In my personal experience the jobs I > spawn aren't interactive, tend to take a long time, and because of point > 2 require me to react pretty quickly to their failure or I lose out on > valuable compute-time. However, they are cumbersome to execute via a > queuing manager (my work is in systems, so perhaps that area is an > exception). Therefore what I always do is just nohup myself a job, and > tail -f it if I need to watch it. I've adapted my ssh config such that > I don't get booted off after 5 or 10 minutes without any input from me > (I think the limit I set is like 2hours or something), so I can watch > output fly by to my hearts content. > > If I were you, I think the best way to avoid a user-uprising, but to > achieve your goal is to give instructions on how a user can nohup (yes, > just assume they don't know how) and how to configure ssh to not die > after a short time. This way they don't have to worry about getting > disconnected if they aren't constantly interacting (so they can watch > output), but they also aren't staying logged on indefinitely (since > presumably their laptops/desktops aren't on indefinitely). 
> > If you give them an alternative that is well defined with an example > (not just, "Oh you can use such-and-such instead.") I can hardly believe > they'll be all that upset. > Ellis, Using nohup was exactly the advice I gave to one of my users yesterday. Not sure if he'll use it. 'man' is a very difficult program to learn, from what I understand. Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From reuti at staff.uni-marburg.de Fri Oct 21 11:24:32 2011 From: reuti at staff.uni-marburg.de (Reuti) Date: Fri, 21 Oct 2011 17:24:32 +0200 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA16F3A.8080209@ias.edu> References: <4EA16F3A.8080209@ias.edu> Message-ID: <7B82E572-588E-41A4-9B46-8A1A07360A30@staff.uni-marburg.de> Hi, Am 21.10.2011 um 15:10 schrieb Prentice Bisbal: > Beowulfers, > > I have a question that isn't directly related to clusters, but I suspect > it's an issue many of you are dealing with are dealt with: users using > the screen command to stay logged in on systems and running long jobs > that they forget about. Have any of you experienced this, and how did > you deal with it? > > Here's my scenario: > > In addition to my cluster, we have a bunch of "computer servers" where > users can run the programs. These are "large" boxes with more cores > (24-32 cores) and more RAM (128 - 256 GB, ECC) than they'd have on a > desktop top. > > Periodically, when I have to shutdown/reboot a system for maintenance, > I find a LOT of shells being run through the screen command for users > who aren't logged in. The majority are idle shells, but many are running > jobs, that seem to be forgotten about. For example, I recently found > some jobs running since July or August that were running under the > account of someone who hasn't even been here for months! > > My opinion is these these are shared resources, and if you aren't > interactively using them, you should log out to free up resources for > others. If you have a job that can be run non-interactively, you should > submit it to the cluster. > > Has anyone else here dealt with the problem? > > I would like to remove screen from my environment entirely to prevent > this. My fellow sysadmins here agree. I'm expecting massive backlash > from the users. I disallow rsh to the machines and limit ssh to admin staff. Users who want to run something on a machine have to go through the queuing system to get access to a node granted by GridEngine (for the startup method you can use either the -builtin- or [in case you need X11 forwarding] by a different sshd_config and ssh [GridEngine will start one daemon per task], one additional step is necessary for a tight integration of ssh). For users just checking their jobs on a node I have a dedicated queue (where they can login always, but h_cpu limited to 60 seconds, i.e. they can't abuse it). -- Reuti _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
From bug at sas.upenn.edu Fri Oct 21 11:17:55 2011 From: bug at sas.upenn.edu (Gavin W. Burris) Date: Fri, 21 Oct 2011 11:17:55 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <20111021134457.GA22748@grml> <4EA1854B.5090506@ias.edu> Message-ID: <4EA18D23.4050501@sas.upenn.edu> On 10/21/2011 11:06 AM, Kilian Cavalotti wrote: > Hi Prentice, > > On Fri, Oct 21, 2011 at 4:44 PM, Prentice Bisbal wrote: >>> Have you thought about queueing systems like condor or SGE? >> >> Yes, I have cluster that uses SGE, and we allow users to run serial jobs >> (non-MPI, etc.) there, so there is no need for them to use screen to >> execute long-running jobs. Hence my frustration. > > You could alias "screen" to "qlogin". :) > > Cheers, I think we have a winner. :) -- Gavin W. Burris Senior Systems Programmer Information Security and Unix Systems School of Arts and Sciences University of Pennsylvania _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From ellis at runnersroll.com Fri Oct 21 11:44:37 2011 From: ellis at runnersroll.com (Ellis H. Wilson III) Date: Fri, 21 Oct 2011 11:44:37 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA16F3A.8080209@ias.edu> References: <4EA16F3A.8080209@ias.edu> Message-ID: <4EA19365.4030109@runnersroll.com> On 10/21/11 09:10, Prentice Bisbal wrote: > Beowulfers, > > I have a question that isn't directly related to clusters, but I suspect > it's an issue many of you are dealing with are dealt with: users using > the screen command to stay logged in on systems and running long jobs > that they forget about. Have any of you experienced this, and how did > you deal with it? I think this is strongly tied to what kind of work the users are doing (i.e. how interactive it is, how long jobs take, how likely failure is to occur that they must react to). In my personal experience the jobs I spawn aren't interactive, tend to take a long time, and because of point 2 require me to react pretty quickly to their failure or I lose out on valuable compute-time. However, they are cumbersome to execute via a queuing manager (my work is in systems, so perhaps that area is an exception). Therefore what I always do is just nohup myself a job, and tail -f it if I need to watch it. I've adapted my ssh config such that I don't get booted off after 5 or 10 minutes without any input from me (I think the limit I set is like 2hours or something), so I can watch output fly by to my hearts content. If I were you, I think the best way to avoid a user-uprising, but to achieve your goal is to give instructions on how a user can nohup (yes, just assume they don't know how) and how to configure ssh to not die after a short time. This way they don't have to worry about getting disconnected if they aren't constantly interacting (so they can watch output), but they also aren't staying logged on indefinitely (since presumably their laptops/desktops aren't on indefinitely). If you give them an alternative that is well defined with an example (not just, "Oh you can use such-and-such instead.") I can hardly believe they'll be all that upset. 
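Something along these lines would do, say (program name, log path and keepalive interval are only placeholders):

    # start the job so it survives logout, remember its PID, watch it on demand
    nohup ./long_job > long_job.log 2>&1 &
    echo $! > long_job.pid
    tail -f long_job.log              # Ctrl-C stops the tail, not the job
    kill "$(cat long_job.pid)"        # when the job is no longer wanted

    # client-side ~/.ssh/config, so idle sessions aren't dropped by firewalls/NAT
    # Host compute*
    #     ServerAliveInterval 60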
Best, ellis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From ellis at runnersroll.com Fri Oct 21 12:26:09 2011 From: ellis at runnersroll.com (Ellis H. Wilson III) Date: Fri, 21 Oct 2011 12:26:09 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA19A05.4000400@ias.edu> References: <4EA16F3A.8080209@ias.edu> <4EA19365.4030109@runnersroll.com> <4EA19A05.4000400@ias.edu> Message-ID: <4EA19D21.3090902@runnersroll.com> On 10/21/11 12:12, Prentice Bisbal wrote: >> If you give them an alternative that is well defined with an example >> (not just, "Oh you can use such-and-such instead.") I can hardly believe >> they'll be all that upset. >> > > Ellis, > > Using nohup was exactly the advice I gave to one of my users yesterday. > Not sure if he'll use it. 'man' is a very difficult program to learn, > from what I understand. Hahaha, I love your cynicism. Right up my alley, however, I think in all seriousness 'man' does fall short for many applications in terms of examples (there are exceptions to this, but most man docs don't have examples from my experience). Many users just want examples of it's use, and can derive their case faster from such than custom-creation of a set of parameters from man. So just take a few moments, cook up an example of 'nohup ./someapp &> out.txt &' usage and associated ways to kill and watch it's output and put it all into an email. Save that email away, and when you're ready just shoot it out to everyone. Or if you have an internal wiki setup, that's much, much better. Just forward a link to some new page on it. If you make even a half-assed effort to show you are providing a viable alternative and a low bar to entry, you'll cut the number of people complaining at least in half. Best, ellis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From reuti at staff.uni-marburg.de Fri Oct 21 11:26:57 2011 From: reuti at staff.uni-marburg.de (Reuti) Date: Fri, 21 Oct 2011 17:26:57 +0200 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <20111021134457.GA22748@grml> <4EA1854B.5090506@ias.edu> Message-ID: <46778F4F-95ED-4FC7-B936-F8221A759916@staff.uni-marburg.de> Am 21.10.2011 um 17:06 schrieb Kilian Cavalotti: > Hi Prentice, > > On Fri, Oct 21, 2011 at 4:44 PM, Prentice Bisbal wrote: >>> Have you thought about queueing systems like condor or SGE? >> >> Yes, I have cluster that uses SGE, and we allow users to run serial jobs >> (non-MPI, etc.) there, so there is no need for them to use screen to >> execute long-running jobs. Hence my frustration. > > You could alias "screen" to "qlogin". :) Isn't it to late at that point if I get it right? They login by ssh to an exechost and issue thereon screen to reconnect later. But they should already use qlogin to go to the exechost. 
-- Reuti > Cheers, > -- > Kilian > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From james.p.lux at jpl.nasa.gov Fri Oct 21 12:45:38 2011 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 21 Oct 2011 09:45:38 -0700 Subject: [Beowulf] about 'man' Re: Users abusing screen In-Reply-To: <4EA19A05.4000400@ias.edu> Message-ID: On 10/21/11 9:12 AM, "Prentice Bisbal" wrote: > >Ellis, > >Using nohup was exactly the advice I gave to one of my users yesterday. >Not sure if he'll use it. 'man' is a very difficult program to learn, >from what I understand. Well... 'man' is easy, but sometimes, you need decent examples and tutorials. Just knowing what all the switches are and the format is like giving someone a dictionary and saying: now write me a sonnet. This is especially so for the "swiss army knife" type utilities (grep, I'm looking at you!) > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Fri Oct 21 10:44:27 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 21 Oct 2011 10:44:27 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <20111021134457.GA22748@grml> References: <4EA16F3A.8080209@ias.edu> <20111021134457.GA22748@grml> Message-ID: <4EA1854B.5090506@ias.edu> On 10/21/2011 09:44 AM, Henning Fehrmann wrote: > Hi Prentice, > > On Fri, Oct 21, 2011 at 09:10:18AM -0400, Prentice Bisbal wrote: >> Beowulfers, >> >> I have a question that isn't directly related to clusters, but I suspect >> it's an issue many of you are dealing with are dealt with: users using >> the screen command to stay logged in on systems and running long jobs >> that they forget about. Have any of you experienced this, and how did >> you deal with it? >> >> Here's my scenario: >> >> In addition to my cluster, we have a bunch of "computer servers" where >> users can run the programs. These are "large" boxes with more cores >> (24-32 cores) and more RAM (128 - 256 GB, ECC) than they'd have on a >> desktop top. >> >> Periodically, when I have to shutdown/reboot a system for maintenance, >> I find a LOT of shells being run through the screen command for users >> who aren't logged in. The majority are idle shells, but many are running >> jobs, that seem to be forgotten about. For example, I recently found >> some jobs running since July or August that were running under the >> account of someone who hasn't even been here for months! >> >> My opinion is these these are shared resources, and if you aren't >> interactively using them, you should log out to free up resources for >> others. If you have a job that can be run non-interactively, you should >> submit it to the cluster. >> >> Has anyone else here dealt with the problem? 
>> >> I would like to remove screen from my environment entirely to prevent >> this. My fellow sysadmins here agree. I'm expecting massive backlash >> from the users. > > I wouldn't deinstall screen. It is a useful tool for many things and > there are alternatives doing the same. Instead one could enforce a > maximum CPU time a job can take by setting ulimits. > > Have you thought about queueing systems like condor or SGE? Yes, I have cluster that uses SGE, and we allow users to run serial jobs (non-MPI, etc.) there, so there is no need for them to use screen to execute long-running jobs. Hence my frustration. Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From kilian.cavalotti.work at gmail.com Fri Oct 21 11:06:11 2011 From: kilian.cavalotti.work at gmail.com (Kilian Cavalotti) Date: Fri, 21 Oct 2011 17:06:11 +0200 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA1854B.5090506@ias.edu> References: <4EA16F3A.8080209@ias.edu> <20111021134457.GA22748@grml> <4EA1854B.5090506@ias.edu> Message-ID: Hi Prentice, On Fri, Oct 21, 2011 at 4:44 PM, Prentice Bisbal wrote: >> Have you thought about queueing systems like condor or SGE? > > Yes, I have cluster that uses SGE, and we allow users to run serial jobs > (non-MPI, etc.) there, so there is no need for them to use screen to > execute long-running jobs. Hence my frustration. You could alias "screen" to "qlogin". :) Cheers, -- Kilian _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From atp at piskorski.com Fri Oct 21 15:14:01 2011 From: atp at piskorski.com (Andrew Piskorski) Date: Fri, 21 Oct 2011 15:14:01 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA16F3A.8080209@ias.edu> References: <4EA16F3A.8080209@ias.edu> Message-ID: <20111021191401.GA87390@piskorski.com> On Fri, Oct 21, 2011 at 09:10:18AM -0400, Prentice Bisbal wrote: > My opinion is these these are shared resources, and if you aren't > interactively using them, you should log out to free up resources for > others. "running under screen" != "non-interactive". > I would like to remove screen from my environment entirely to prevent > this. My fellow sysadmins here agree. I'm expecting massive backlash > from the users. No shit. If you allow users to login at all, then (IMNSHO) removing screen is insane. That's not a solution to your problem, that's creating a totally new problem and pretending it's a solution. I essentially always use screen whenever I ssh to any Linux box for any reason. If my sysadmin arbitrarily disabled screen because some other user was doing something dumb, I'd be pretty upset too. (Annoyed enough to maybe just build screen myself on that box.) 
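Which is maybe a ten-minute job anyway; roughly, assuming a compiler and the ncurses headers are on the box (the version number is just whatever happens to be current):

    wget https://ftp.gnu.org/gnu/screen/screen-4.0.3.tar.gz
    tar xzf screen-4.0.3.tar.gz && cd screen-4.0.3
    ./configure --prefix=$HOME/local && make && make install
    export PATH=$HOME/local/bin:$PATH    # and "screen" is back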
-- Andrew Piskorski http://www.piskorski.com/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From peter.st.john at gmail.com Fri Oct 21 22:18:19 2011 From: peter.st.john at gmail.com (Peter St. John) Date: Fri, 21 Oct 2011 22:18:19 -0400 Subject: [Beowulf] about 'man' Re: Users abusing screen In-Reply-To: References: <4EA19A05.4000400@ias.edu> Message-ID: I'm not a sysadmin, but I thought these days we were supposed to point [end]users at "help" or "doc" instead of man? Man is like sdb, it's great but not for everyone, you need context to appreciate it. I think in System V type derivatives it's usually "help"? peter On Fri, Oct 21, 2011 at 12:45 PM, Lux, Jim (337C) wrote: > > > On 10/21/11 9:12 AM, "Prentice Bisbal" wrote: > > > >Ellis, > > > >Using nohup was exactly the advice I gave to one of my users yesterday. > >Not sure if he'll use it. 'man' is a very difficult program to learn, > >from what I understand. > > Well... 'man' is easy, but sometimes, you need decent examples and > tutorials. Just knowing what all the switches are and the format is like > giving someone a dictionary and saying: now write me a sonnet. This is > especially so for the "swiss army knife" type utilities (grep, I'm looking > at you!) > > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From ellis at runnersroll.com Sat Oct 22 08:02:35 2011 From: ellis at runnersroll.com (Ellis H. Wilson III) Date: Sat, 22 Oct 2011 08:02:35 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <20111021191401.GA87390@piskorski.com> References: <4EA16F3A.8080209@ias.edu> <20111021191401.GA87390@piskorski.com> Message-ID: <4EA2B0DB.3040702@runnersroll.com> On 10/21/11 15:14, Andrew Piskorski wrote: > On Fri, Oct 21, 2011 at 09:10:18AM -0400, Prentice Bisbal wrote: > >> My opinion is these these are shared resources, and if you aren't >> interactively using them, you should log out to free up resources for >> others. > > "running under screen" != "non-interactive". What I think Prentice was pointing out here was more along the lines of: "non-interactive" >= "running under screen" <= interactive Where interactivity is more of a spectrum than a != or =. More pointedly, he stated his users are acting in a non-interactive manner, in some cases even after they leave, which is irresponsible at all levels. Obviously he has to balance a rule-set between the good users and the bad users, such that abuse isn't quite as easy. >> I would like to remove screen from my environment entirely to prevent >> this. My fellow sysadmins here agree. I'm expecting massive backlash >> from the users. 
> > No shit. If you allow users to login at all, then (IMNSHO) removing > screen is insane. That's not a solution to your problem, that's > creating a totally new problem and pretending it's a solution. Insane? I mean, I do a lot of work on a bunch of different distros and hardware types, and have found little use for screen /unless/ I was on a really, really poor internet connection that cut out on the minutes level. Can you give some examples regarding something you can do with screen you cannot do with nohup and tail? > I essentially always use screen whenever I ssh to any Linux box for > any reason. But why? Just leave a terminal open if you want interactivity, otherwise nohup something. Perhaps I've understated screen's usefulness, but I'm glad to be corrected/educated on it's efficacy in this area. Best, ellis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From skylar at cs.earlham.edu Sat Oct 22 13:24:02 2011 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Sat, 22 Oct 2011 10:24:02 -0700 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA2B0DB.3040702@runnersroll.com> References: <4EA16F3A.8080209@ias.edu> <20111021191401.GA87390@piskorski.com> <4EA2B0DB.3040702@runnersroll.com> Message-ID: <4EA2FC32.9000605@cs.earlham.edu> On 10/22/11 05:02, Ellis H. Wilson III wrote: > > Insane? I mean, I do a lot of work on a bunch of different distros and > hardware types, and have found little use for screen /unless/ I was on a > really, really poor internet connection that cut out on the minutes > level. Can you give some examples regarding something you can do with > screen you cannot do with nohup and tail? > > Here's a few I can think of: * Multiple shells off one login * Scroll buffer * Copy&paste w/o needing a mouse * Start session logging at any time, w/o needing to remember to use script or nohup I guess I'm with Andrew, where the first thing I do upon logging in is either connecting to an existing screen session or starting a fresh one. -- -- Skylar Thompson (skylar at cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 262 bytes Desc: OpenPGP digital signature URL: -------------- next part -------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From j.wender at science-computing.de Mon Oct 24 02:30:12 2011 From: j.wender at science-computing.de (Jan Wender) Date: Mon, 24 Oct 2011 08:30:12 +0200 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA16F3A.8080209@ias.edu> References: <4EA16F3A.8080209@ias.edu> Message-ID: <4EA505F4.7080007@science-computing.de> On 10/21/2011 03:10 PM, Prentice Bisbal wrote: > I have a question that isn't directly related to clusters, but I suspect > it's an issue many of you are dealing with are dealt with: users using > the screen command to stay logged in on systems and running long jobs > that they forget about. Have any of you experienced this, and how did > you deal with it? 
How about killing long-running (either elapsed or used time) processes not started through the batch system? You should be able to identify them by looking at the process tree. At least one cluster I know kills all user processes which have not been started from the queueing system. Cheerio, Jan -- ---- Company Information ---- Vorstand/Board of Management: Dr. Bernd Finkbeiner, Dr. Roland Niemeier, Dr. Arno Steitz, Dr. Ingrid Zech Vorsitzender des Aufsichtsrats/Chairman of the Supervisory Board: Philippe Miltin Sitz/Registered Office: Tuebingen Registergericht/Registration Court: Stuttgart Registernummer/Commercial Register No.: HRB 382196 -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. -------------- next part -------------- A non-text attachment was scrubbed... Name: j_wender.vcf Type: text/x-vcard Size: 338 bytes Desc: not available URL: -------------- next part -------------- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From greg.matthews at diamond.ac.uk Mon Oct 24 07:00:19 2011 From: greg.matthews at diamond.ac.uk (Gregory Matthews) Date: Mon, 24 Oct 2011 12:00:19 +0100 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA19A05.4000400@ias.edu> References: <4EA16F3A.8080209@ias.edu> <4EA19365.4030109@runnersroll.com> <4EA19A05.4000400@ias.edu> Message-ID: <4EA54543.5090908@diamond.ac.uk> Prentice Bisbal wrote: > Using nohup was exactly the advice I gave to one of my users yesterday. > Not sure if he'll use it. 'man' is a very difficult program to learn, > from what I understand. our experience of ppl using nohup without really thinking it through is eventually filling the partition with an enormous nohup.out file. GREG > > Prentice > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Greg Matthews 01235 778658 Senior Computer Systems Administrator Diamond Light Source, Oxfordshire, UK -- This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail. Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message. Diamond Light Source Limited (company no. 4375679). 
Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From reuti at staff.uni-marburg.de Mon Oct 24 07:20:02 2011 From: reuti at staff.uni-marburg.de (Reuti) Date: Mon, 24 Oct 2011 13:20:02 +0200 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA54543.5090908@diamond.ac.uk> References: <4EA16F3A.8080209@ias.edu> <4EA19365.4030109@runnersroll.com> <4EA19A05.4000400@ias.edu> <4EA54543.5090908@diamond.ac.uk> Message-ID: <9DA6F2A5-6736-457F-AE89-C5EC56735C09@staff.uni-marburg.de> Am 24.10.2011 um 13:00 schrieb Gregory Matthews: > Prentice Bisbal wrote: >> Using nohup was exactly the advice I gave to one of my users yesterday. >> Not sure if he'll use it. 'man' is a very difficult program to learn, >> from what I understand. > > our experience of ppl using nohup without really thinking it through is > eventually filling the partition with an enormous nohup.out file. It's possible to make an alias, so that "nohup" reads "nohup > /dev/null" The redirection doesn't need to be at the end of the command. Depends whether they need the output, and/or any output file is created by the application on its own anyway. -- Reuti > GREG > >> >> Prentice >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >> > > > -- > Greg Matthews 01235 778658 > Senior Computer Systems Administrator > Diamond Light Source, Oxfordshire, UK > > -- > This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail. > Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. > Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message. > Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
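[A note on Reuti's suggestion above: the alias works because a redirection may appear anywhere in a simple command, so the expanded "nohup > /dev/null command args &" is still valid shell. A minimal sketch of two alternative forms, assuming bash; "runbg" is an illustrative name, not an existing tool, and keeping stderr in a file is a variation on the suggestion rather than part of it.]

    # Reuti's alias: after expansion the redirection simply precedes the
    # command word, which the shell accepts, so stdout is discarded and
    # no nohup.out is ever written.
    alias nohup='nohup > /dev/null'

    # Function variant (illustrative): stdout is discarded, but stderr is
    # kept in a small per-command file so genuine failures stay visible.
    runbg () {
        local name
        name=$(basename "$1")
        nohup "$@" > /dev/null 2> "$HOME/${name}.err" &
        echo "started ${name} as PID $!"
    }

Typical use would be "runbg ./solver --input run1.dat" followed by an occasional "tail ~/solver.err".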
From prentice at ias.edu Mon Oct 24 09:42:23 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Mon, 24 Oct 2011 09:42:23 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA2B0DB.3040702@runnersroll.com> References: <4EA16F3A.8080209@ias.edu> <20111021191401.GA87390@piskorski.com> <4EA2B0DB.3040702@runnersroll.com> Message-ID: <4EA56B3F.3060404@ias.edu> On 10/22/2011 08:02 AM, Ellis H. Wilson III wrote: > On 10/21/11 15:14, Andrew Piskorski wrote: >> On Fri, Oct 21, 2011 at 09:10:18AM -0400, Prentice Bisbal wrote: >> >>> My opinion is these these are shared resources, and if you aren't >>> interactively using them, you should log out to free up resources for >>> others. >> "running under screen" != "non-interactive". > What I think Prentice was pointing out here was more along the lines of: > "non-interactive" >= "running under screen" <= interactive > Where interactivity is more of a spectrum than a != or =. More > pointedly, he stated his users are acting in a non-interactive manner, > in some cases even after they leave, which is irresponsible at all > levels. Obviously he has to balance a rule-set between the good users > and the bad users, such that abuse isn't quite as easy. Thanks for coming to my defense, Ellis. I don't think I could have explained it better myself. >>> I would like to remove screen from my environment entirely to prevent >>> this. My fellow sysadmins here agree. I'm expecting massive backlash >>> from the users. >> No shit. If you allow users to login at all, then (IMNSHO) removing >> screen is insane. That's not a solution to your problem, that's >> creating a totally new problem and pretending it's a solution. > Insane? I mean, I do a lot of work on a bunch of different distros and > hardware types, and have found little use for screen /unless/ I was on a > really, really poor internet connection that cut out on the minutes > level. Can you give some examples regarding something you can do with > screen you cannot do with nohup and tail? I agree. I've been a professional sys admin using Unix/Linux day in and day out for well over 10 years, and not one days has gone by where I saw a need for screen. >> I essentially always use screen whenever I ssh to any Linux box for >> any reason. > But why? Just leave a terminal open if you want interactivity, > otherwise nohup something. Perhaps I've understated screen's > usefulness, but I'm glad to be corrected/educated on it's efficacy in > this area. > > Best, > > ellis > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
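[For readers weighing Ellis's question against Skylar's list earlier in the thread, the two competing idioms look roughly like this. A minimal sketch, assuming bash and a stock GNU screen, with "longjob.sh" standing in for whatever is actually being run.]

    # Idiom 1: fire and forget with nohup, then watch the log.
    nohup ./longjob.sh > longjob.log 2>&1 &
    tail -f longjob.log          # Ctrl-C stops the tail, not the job

    # Idiom 2: a named, detachable session with screen.
    screen -S longjob            # start a session called "longjob"
    ./longjob.sh                 # run it interactively inside the session
    # ... detach with Ctrl-a d; the job keeps running ...
    screen -ls                   # the session shows up as "Detached"
    screen -r longjob            # reattach later, possibly from another machine

Skylar's other points (scrollback, copy and paste, multiple windows per login) come with idiom 2 for free, which is the crux of the disagreement.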
From prentice at ias.edu Mon Oct 24 09:46:49 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Mon, 24 Oct 2011 09:46:49 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA505F4.7080007@science-computing.de> References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> Message-ID: <4EA56C49.9060204@ias.edu> On 10/24/2011 02:30 AM, Jan Wender wrote: > On 10/21/2011 03:10 PM, Prentice Bisbal wrote: >> I have a question that isn't directly related to clusters, but I suspect >> it's an issue many of you are dealing with are dealt with: users using >> the screen command to stay logged in on systems and running long jobs >> that they forget about. Have any of you experienced this, and how did >> you deal with it? > How about killing long-running (either elapsed or used time) processes not > started through the batch system? You should be able to identify them by looking > at the process tree. > At least one cluster I know kills all user processes which have not been started > from the queueing system. The systems where screen is being abused are not part of the batch system, and they will not /can not be for reasons I don't want to get into here. The problem with killing long-running programs is that there are often long running programs that are legitimate in my evironment. I can quickly scan 'ps' output and determine which is which, but I doubt that kind of intelligence could ever be built into a shell script. -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Mon Oct 24 10:22:50 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Mon, 24 Oct 2011 10:22:50 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> Message-ID: <4EA574BA.2050304@ias.edu> Anything is possible if you're a good enough programmer. Like I said earlier, there are some users legitimately running long jobs on the systems in question. Instead of developing a clever program to automatically kill long running screen jobs, I think it would be better to be up front with my users and remove screen, rather than let them use it, only to surprise them later by killing their jobs. On 10/24/2011 09:55 AM, geert geurts wrote: > > Hello Prentice, > > Screen is a essential app, for sure. > But as an answer to the initial question... > I'm not much of a programmer, but can't you replace the binary with a > custom compiled version which runs two threads? One with the initial > program, and one which sleeps for the maximum amount of time you're > willing to allow screen sessions to last, and kills the session when > the time runs out... > > Or maybe build some script around the actual binary to do the same.. > > > Regards, > Geert > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
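[Geert's wrapper idea is harder to implement literally than it sounds, because the detached session manager separates itself from the invoking process, so a simple sleep-and-kill thread around the client would miss it. A variation that combines the wrapper with Henning's earlier ulimit suggestion is sketched below; it assumes the distribution binary has been renamed to /usr/bin/screen.real, and the 24-CPU-hour ceiling is an arbitrary example of site policy, not a recommendation.]

    #!/bin/bash
    # Illustrative wrapper installed as /usr/bin/screen, with the real
    # binary assumed to have been renamed to /usr/bin/screen.real.
    # Instead of killing sessions by wall-clock age, it lowers the
    # CPU-time limit that every process started inside the session will
    # inherit.
    REAL=/usr/bin/screen.real

    # 24 CPU-hours per process. RLIMIT_CPU is inherited across fork/exec
    # and a non-root user cannot raise it above this hard limit again.
    # Idle shells are untouched because they accumulate almost no CPU
    # time; forgotten compute jobs eventually receive SIGXCPU and die.
    ulimit -t $(( 24 * 3600 ))

    exec "$REAL" "$@"

Setting the same limit system-wide in /etc/security/limits.conf (as discussed further down the thread) achieves a similar effect without touching the screen binary at all.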
From samuel at unimelb.edu.au Mon Oct 24 18:48:44 2011 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 25 Oct 2011 09:48:44 +1100 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA16F3A.8080209@ias.edu> References: <4EA16F3A.8080209@ias.edu> Message-ID: <4EA5EB4C.3000809@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 22/10/11 00:10, Prentice Bisbal wrote: > I have a question that isn't directly related to clusters, but I suspect > it's an issue many of you are dealing with are dealt with: users using > the screen command to stay logged in on systems and running long jobs > that they forget about. Have any of you experienced this, and how did > you deal with it? Hmm, any way of making a local version of screen which puts all the processes into a cpuset or control group so you can easily distinguish between ones in screen and outside of it ? Perhaps even doing it with a wrapper if you didn't want to build a modified version ? That way you get to restrict the number of cores they can monopolise.. Of course a user could get around it by building their own copy, but at least then you'd be able to see that.. cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk6l60wACgkQO2KABBYQAh/YtwCfegBzvEpH/s4PtHnFlEwSqQLK UO8An3DK20lEVrT9WM8qln0wM7alKoU6 =oInQ -----END PGP SIGNATURE----- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Tue Oct 25 19:13:05 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 25 Oct 2011 16:13:05 -0700 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA56C49.9060204@ias.edu> References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> Message-ID: <20111025231305.GC9493@bx9.net> On Mon, Oct 24, 2011 at 09:46:49AM -0400, Prentice Bisbal wrote: > The systems where screen is being abused are not part of the batch > system, and they will not /can not be for reasons I don't want to get > into here. The problem with killing long-running programs is that there > are often long running programs that are legitimate in my evironment. I > can quickly scan 'ps' output and determine which is which, but I doubt > that kind of intelligence could ever be built into a shell script. I see that you didn't bother to check out the software proposed soon after you asked your question. If you don't check out potential answers because you doubt they will work, why should anyone bother to reply to you? The problem you have is a common issue in university environments, and the common solution is a script that accurately figures out long-running cpu-intensive programs and nices/kills them. I first ran into such a thing in, oh, 1992? It's not rocket science. 
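[The reaper Greg alludes to can indeed be small. The sketch below is not the script he is referring to, only an illustration of the nice-first, kill-later shape. It assumes a procps whose ps supports the "etimes" (elapsed seconds) output field, acts on wall-clock age rather than CPU consumed, omits the warning e-mail a production version would send, and uses arbitrary thresholds and whitelist entries.]

    #!/bin/bash
    # Illustrative reaper: renice long-running user processes first,
    # kill only much later. Run from root's cron on the shared machines.
    NICE_AFTER=$(( 24 * 3600 ))      # renice after 1 day of wall clock
    KILL_AFTER=$(( 7 * 24 * 3600 ))  # kill after 7 days
    WHITELIST='^(sshd|screen|SCREEN|bash|tcsh|tmux)$'

    ps -eo pid=,user=,etimes=,comm= | while read pid user elapsed comm; do
        [[ $user == root ]] && continue
        [[ $comm =~ $WHITELIST ]] && continue
        if (( elapsed > KILL_AFTER )); then
            logger -t reaper "killing $comm ($pid) of $user after ${elapsed}s"
            kill "$pid"
        elif (( elapsed > NICE_AFTER )); then
            renice -n 19 -p "$pid" > /dev/null
        fi
    done

A real version would also look at CPU actually consumed (so legitimately long-lived but idle daemons are spared) and mail the owner before touching anything, as suggested elsewhere in the thread.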
-- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Wed Oct 26 10:31:56 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 26 Oct 2011 10:31:56 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <20111025231305.GC9493@bx9.net> References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> Message-ID: <4EA819DC.9090106@ias.edu> On 10/25/2011 07:13 PM, Greg Lindahl wrote: > On Mon, Oct 24, 2011 at 09:46:49AM -0400, Prentice Bisbal wrote: > >> The systems where screen is being abused are not part of the batch >> system, and they will not /can not be for reasons I don't want to get >> into here. The problem with killing long-running programs is that there >> are often long running programs that are legitimate in my evironment. I >> can quickly scan 'ps' output and determine which is which, but I doubt >> that kind of intelligence could ever be built into a shell script. > I see that you didn't bother to check out the software proposed soon > after you asked your question. If you don't check out potential > answers because you doubt they will work, why should anyone bother to > reply to you? Greg, I didn't realize I needed to log a detailed response to every suggestion made to me on this list. I've been a member of this list for quite sometime, and I've never seen a comment like yours before. You're out of line. People should bother to reply to me because I've been a participating member of this list for 4 years now, and often assist others when I can. I don't expect a response to every suggestion I provide to others. -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From bcostescu at gmail.com Wed Oct 26 11:41:50 2011 From: bcostescu at gmail.com (Bogdan Costescu) Date: Wed, 26 Oct 2011 17:41:50 +0200 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA16F3A.8080209@ias.edu> References: <4EA16F3A.8080209@ias.edu> Message-ID: On Fri, Oct 21, 2011 at 15:10, Prentice Bisbal wrote: > Periodically, when I have to shutdown/reboot a system for maintenance, > I find a LOT of shells being run through the screen command for users > who aren't logged in. The majority are idle shells, but many are running > jobs, that seem to be forgotten about. > ... > I would like to remove screen from my environment entirely to prevent > this. >From what I understand from your message, it's not screen per-se which upsets you, it's the way it is (ab)used by some users to start long running memory hogging jobs; you seem to be OK with idle shells found at maintenance time which are still started through screen. So why the backlash against screen ? Starting jobs in the background can be done directly through the shell, with no screen; if the job can be split in smaller pieces time-wise, they can be started by at/cron; screen can be installed by a user, possible under a different name... 
so many and surely other possibilities to still upset you even if you uninstall screen, because you focus on the wrong subject. To deal with forgotten long running jobs, you have various administrative (f.e. bill users/groups, even if in some kind of symbolic way) or technical (f.e. only allow 24h CPU time through system-wide limits or install a daemon which watches and warns and/or takes measures) means - some of these have been discussed on this very list in the past or have been mentioned earlier in this thread. Each situation is different (f.e. some legitimate jobs could run for more than 24h), so you should check all suggestions and apply the one(s) which fit(s) best. I know from my own experience that it's not easy to be on this side of the fence :-) Good luck! Bogdan _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From rgb at phy.duke.edu Wed Oct 26 12:22:31 2011 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed, 26 Oct 2011 12:22:31 -0400 (EDT) Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> Message-ID: OK, OK, I haven't participated in this discussion so far -- way too busy. But since it keeps on going, and going, and going, and since nobody has mentioned the obvious and permanent solution, I'm going to have to bring it up: >From "man 8 syslogd", which alas seems to no longer exist save in our hearts and memories, when confronted with any sort of persistent system abuse: 5. Use step 4 and if the problem persists and is not secondary to a rogue program/daemon get a 3.5 ft (approx. 1 meter) length of sucker rod* and have a chat with the user in question. * Sucker rod def. ? 3/4, 7/8 or 1in. hardened steel rod, male threaded on each end. Primary use in the oil industry in West- ern North Dakota and other locations to pump 'suck' oil from oil wells. Secondary uses are for the construction of cattle feed lots and for dealing with the occasional recalcitrant or bel- ligerent individual. I've found that the "sucker rod solution" is really the only one that ultimately works. Even if it is merely present when discussing the problem with the worst offenders, it marvelously focusses the mind on the severity of the issue. Otherwise (as has been pointed out repeatedly) it is rather trivial to write an e.g. cron script that reaps/kills ANYTHING undesireable on a public server. Invariably they will sooner or later kill something that shouldn't be killed in the sense that it is doing some sort of useful work, but screen isn't likely to be something in that category. Myself, I like the sucker rod approach. BANG down on the desk with it and say something ominous like "So, you've been cluttering up my server with unattended and abandoned sessions. Would you be so kind as to CEASE (bam) and DESIST (bam) from this antisocial activity?" Then mutter something about too much Jolt Cola and back away slowly. Don't worry too much about the divots you leave in the desk or the coffee mug that somehow got shattered. They'll be useful reminders the next time he or she considers walking way from a multiplexed screen session. rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From landman at scalableinformatics.com Wed Oct 26 12:42:50 2011 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 26 Oct 2011 12:42:50 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> Message-ID: <4EA8388A.6060704@scalableinformatics.com> On 10/26/2011 12:22 PM, Robert G. Brown wrote: > Myself, I like the sucker rod approach. BANG down on the desk with it > and say something ominous like "So, you've been cluttering up my server > with unattended and abandoned sessions. Would you be so kind as to > CEASE (bam) and DESIST (bam) from this antisocial activity?" Then > mutter something about too much Jolt Cola and back away slowly. [donning his old New Yawk accent ... "Hey, we don't gots no accent ... you'se got an accent..."] "Thats a nice computer model you have there perfesser ... be a shame to have to run it over ... TCP over SLIP (serial line IP) ..." "So you like that 64 bit math, eh? Lets see how well you compute with a few less bits ..." [back to your regularly scheduled supercomputer cluster] -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Wed Oct 26 16:55:13 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 26 Oct 2011 16:55:13 -0400 (EDT) Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA819DC.9090106@ias.edu> References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> Message-ID: > sometime, and I've never seen a comment like yours before. You're out of > line. hah. Greg doesn't post all that much, but he's no stranger to the flame ;) seriously, your question seemed to be about a general problem, but your motive, ulterior or not, seemed to be to get rid of screen. IMO, getting rid of screen is BOFHishness of the first order. it's a tool that has valuable uses. it's not the cause of your problem. on our login nodes, we have some basic limits (/etc/security/limit.conf) that prevent large or long processes or numerous processes. * hard as 3000000 * hard cpu 60 * hard nproc 100 * hard maxlogins 20 these are very arguable, and actually pretty loose. our login nodes are intended for editing/compiling/submitting, maybe the occasional gnuplot/etc. there doesn't seem to be much resistance to the 3G as (vsz) limit, and it does definitely cut down on OOM problems. 60 cpu-minutes covers any possible compile/etc (though it has caused problems with people trying to do very large scp operations.) 
nproc could probably be much lower (20?) and maxlogins ought to be more like 5. we don't currently have an idle-process killer, though have thought of it. we only recently put a default TMOUT in place to cause a bit of gc on forgotten login sessions. we do have screen installed (I never use it myself.) regards, mark hahn. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From scrusan at ur.rochester.edu Wed Oct 26 17:14:13 2011 From: scrusan at ur.rochester.edu (Steve Crusan) Date: Wed, 26 Oct 2011 17:14:13 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Oct 26, 2011, at 4:55 PM, Mark Hahn wrote: >> sometime, and I've never seen a comment like yours before. You're out of >> line. > > hah. Greg doesn't post all that much, but he's no stranger to the flame ;) > > seriously, your question seemed to be about a general problem, > but your motive, ulterior or not, seemed to be to get rid of screen. > > IMO, getting rid of screen is BOFHishness of the first order. > it's a tool that has valuable uses. it's not the cause of your problem. I agree. - From reading this thread, the original machine(s) in question seem to be some sort of interactive or login node(s). If these nodes were large memory or SMP machines, we'd have our resource manager take care of long running processes or other abuses. > > on our login nodes, we have some basic limits (/etc/security/limit.conf) > that prevent large or long processes or numerous processes. > > * hard as 3000000 > * hard cpu 60 > * hard nproc 100 > * hard maxlogins 20 > > these are very arguable, and actually pretty loose. our login nodes are > intended for editing/compiling/submitting, maybe the occasional gnuplot/etc. > there doesn't seem to be much resistance to the 3G as (vsz) limit, and > it does definitely cut down on OOM problems. 60 cpu-minutes covers any > possible compile/etc (though it has caused problems with people trying to > do very large scp operations.) nproc could probably be much lower (20?) > and maxlogins ought to be more like 5. We actually just spinned up a graphical login node for our less saavy users whom are more apt to run matlab, comsol, gnuplot, and other 'EZ button' graphically based scientific software. This graphical login software (http://code.google.com/p/neatx/) has helped us a lot with novice users. It has session resumption, client software for any platforms, it's faster than xforwarding, and it's wrapped around SSH. The node itself is 'fairly' heavy (8 procs, 72GB of RAM), but we've implemented cgroups to stop abuses. Upon login (through SSH or NX) each user is added to his own control group, which has processor and memory limits. 
Since the user's processes are kept inside of control group process spaces, it's easy to work directly with their processes/process trees, whether it be dynamic throttling, or just killing processes. On our login nodes that don't use control groups, we just kill any heavy computational processes after a certain period of time, depending on whether or not it's a compilation step, gzip, etc. We state this in our documentation, and usually give the user a warning+grace period. We don't see this type of abuse anymore because the few users whom have done this quickly learned (and apologized, imagine that!), or they were using our cgroup setup login node, so their abuse didn't affect the system enough. If the issue is processes that run for far too long, and are abusing the system, cgroups or 'pushing' the users to use a batch system seems to work better than writing scripts to make decisions on killing processes. Most ISVs have methods to run computation in batch mode, so it's not necessary for matlab type users to have their applications running for 3 weeks in a screen session when they could be using the cluster. Either that, or using some sort of cpu/memory limits that were listed above, or cgroups. So a process can run forever, but it won't have enough CPU/memory shares to make a difference. Just my .02 > > we don't currently have an idle-process killer, though have thought of it. > we only recently put a default TMOUT in place to cause a bit of gc on > forgotten login sessions. > > we do have screen installed (I never use it myself.) > > regards, mark hahn. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ---------------------- Steve Crusan System Administrator Center for Research Computing University of Rochester https://www.crc.rochester.edu/ -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.17 (Darwin) Comment: GPGTools - http://gpgtools.org iQEcBAEBAgAGBQJOqHgzAAoJENS19LGOpgqKDHQH/AqfAefrt3nusElS/OBnxgBK Pf8tFuyjoJvLgt+3KX19ZL18r1b/BhdW3/1GZgSVVjQZcYkV6dtUq6VI545jqDag lRY9kvyIhudKfVhFwGa87DbXSzYv5oDImf3UejsIiJvo20Bzxf7mdpToT+AGJ4gA J2HzrZwjdZk/DYEJ7CpG9lfthDDq5mrTQTbzVCnFHvEiWpeoBvfd3gJOP94age0F 0ZQGLCgheRSJXLsOlq0y0vqr+7nzupSrLUk5A1YcUysSpk4Dc4mvUVJFE+QbStN6 dSiYHhKMxF5qJTXYOSAF4QDmIObyzlbFFmHCeTTWrCG7KeWtOZU4zUfN7TL3sO4= =M5Pw -----END PGP SIGNATURE----- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
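[Steve's per-user control groups can be reproduced with the stock libcgroup utilities. The sketch below is a minimal illustration, assuming cgroup v1 with the cpu and memory controllers mounted and assuming the script is invoked at login (for instance from a pam_exec hook) with the user name and shell PID as arguments; the group path and limits are examples, not his site's actual values.]

    #!/bin/bash
    # Illustrative per-user cgroup setup for a shared login node.
    # Arguments (assumed to be supplied by whatever login hook calls this):
    #   $1  login name      $2  PID of the user's login shell
    user=$1
    pid=$2

    # Create the group under both controllers, cap it, and move the login
    # shell into it; everything the user starts afterwards inherits it.
    cgcreate -g cpu,memory:/users/$user
    # Relative CPU weight (the default group weight is 1024).
    cgset -r cpu.shares=256 /users/$user
    # 8 GiB hard memory cap.
    cgset -r memory.limit_in_bytes=$(( 8 * 1024 * 1024 * 1024 )) /users/$user
    cgclassify -g cpu,memory:/users/$user $pid

Alternatively, cgrulesengd can do the classification automatically from /etc/cgrules.conf instead of a login hook, which may be closer to what an NX-based setup needs.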
From lindahl at pbm.com Thu Oct 27 01:41:47 2011 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 26 Oct 2011 22:41:47 -0700 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> Message-ID: <20111027054147.GB29939@bx9.net> On Wed, Oct 26, 2011 at 05:14:13PM -0400, Steve Crusan wrote: > If the issue is processes that run for far too long, and are abusing > the system, cgroups or 'pushing' the users to use a batch system seems > to work better than writing scripts to make decisions on killing > processes. What I saw work well was nicing the process after a certain time, including an email, and then killing and emailing after a longer time. The emails can push the batch alternative. Users generally don't become angry if the limits are enforced by a script; they can only be surprised once, and that first time is just nicing the process. If they have a hard time predicting runtime (a common issue, especially for non-hardcore supercomputing types), it's not like they _intentionally_ are exceeding the limits... -- greg _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Thu Oct 27 10:49:51 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 27 Oct 2011 10:49:51 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: <20111027054147.GB29939@bx9.net> References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> Message-ID: <4EA96F8F.1010207@ias.edu> On 10/27/2011 01:41 AM, Greg Lindahl wrote: > On Wed, Oct 26, 2011 at 05:14:13PM -0400, Steve Crusan wrote: > >> If the issue is processes that run for far too long, and are abusing >> the system, cgroups or 'pushing' the users to use a batch system seems >> to work better than writing scripts to make decisions on killing >> processes. > What I saw work well was nicing the process after a certain time, > including an email, and then killing and emailing after a longer > time. The emails can push the batch alternative. Users generally don't > become angry if the limits are enforced by a script; they can only be > surprised once, and that first time is just nicing the process. If > they have a hard time predicting runtime (a common issue, especially > for non-hardcore supercomputing types), it's not like they > _intentionally_ are exceeding the limits... Exactly. That's why I don't want to automate killing jobs longer than X days. Honestly, I can't believe how much controversy this discussion has created. I thought my OP would go unnoticed. Next time, I'll just ask which text editor I should use. 
;) -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From dnlombar at ichips.intel.com Thu Oct 27 12:04:21 2011 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Thu, 27 Oct 2011 09:04:21 -0700 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> Message-ID: <20111027160421.GA28306@nlxcldnl2.cl.intel.com> On Wed, Oct 26, 2011 at 02:55:13PM -0600, Mark Hahn wrote: > > sometime, and I've never seen a comment like yours before. You're out of > > line. > > hah. Greg doesn't post all that much, but he's no stranger to the flame ;) > > seriously, your question seemed to be about a general problem, > but your motive, ulterior or not, seemed to be to get rid of screen. > > IMO, getting rid of screen is BOFHishness of the first order. > it's a tool that has valuable uses. it's not the cause of your problem. Completely agree with this. If you get rid of screen, another tool will be used, perhaps even as simple as a private copy, or nohup and tail as others suggested. My primary use of screen is to do work across home and the office. Nohup only solves one of the potential scenarios. If screen were removed, my productivity would go down. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From glykos at mbg.duth.gr Thu Oct 27 15:19:37 2011 From: glykos at mbg.duth.gr (Nicholas M Glykos) Date: Thu, 27 Oct 2011 22:19:37 +0300 (EEST) Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA96F8F.1010207@ias.edu> References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> <4EA96F8F.1010207@ias.edu> Message-ID: > Exactly. That's why I don't want to automate killing jobs longer than X > days. Probably irrelevant after so many suggestions, but Caos NSA had this very nice 'pam_slurm' module which allows a user to login only to those nodes on which the said user has active jobs (allocated through slurm). The principal idea ["you are welcome to be bring your allocated node (and, thus, your job) to a halt if that's what you want"], sounds pedagogically attractive ... ;-) Nicholas -- Dr Nicholas M. 
Glykos, Department of Molecular Biology and Genetics, Democritus University of Thrace, University Campus, Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620, Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/ _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From prentice at ias.edu Thu Oct 27 15:33:18 2011 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 27 Oct 2011 15:33:18 -0400 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> <4EA96F8F.1010207@ias.edu> Message-ID: <4EA9B1FE.8090903@ias.edu> On 10/27/2011 03:19 PM, Nicholas M Glykos wrote: > >> Exactly. That's why I don't want to automate killing jobs longer than X >> days. > Probably irrelevant after so many suggestions, but Caos NSA had this very > nice 'pam_slurm' module which allows a user to login only to those nodes > on which the said user has active jobs (allocated through slurm). The > principal idea ["you are welcome to be bring your allocated node (and, > thus, your job) to a halt if that's what you want"], sounds pedagogically > attractive ... ;-) > > This doesn't apply to my case, since access to the systems in question isn't controlled by a queuing system. That alone would fix the problem. I think there's a similar pam module for SGE, too. -- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From reuti at staff.uni-marburg.de Thu Oct 27 15:43:59 2011 From: reuti at staff.uni-marburg.de (Reuti) Date: Thu, 27 Oct 2011 21:43:59 +0200 Subject: [Beowulf] Users abusing screen In-Reply-To: <4EA9B1FE.8090903@ias.edu> References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> <4EA96F8F.1010207@ias.edu> <4EA9B1FE.8090903@ias.edu> Message-ID: <94F21C03-C8BB-4DB4-AA3A-D1271524E43E@staff.uni-marburg.de> Am 27.10.2011 um 21:33 schrieb Prentice Bisbal: > On 10/27/2011 03:19 PM, Nicholas M Glykos wrote: >> >>> Exactly. That's why I don't want to automate killing jobs longer than X >>> days. >> Probably irrelevant after so many suggestions, but Caos NSA had this very >> nice 'pam_slurm' module which allows a user to login only to those nodes >> on which the said user has active jobs (allocated through slurm). The >> principal idea ["you are welcome to be bring your allocated node (and, >> thus, your job) to a halt if that's what you want"], sounds pedagogically >> attractive ... ;-) They use it in one cluster with Slurm I have access to. 
But it looks like you are never thrown out again once you are in. -- Reuti > This doesn't apply to my case, since access to the systems in question > isn't controlled by a queuing system. That alone would fix the problem. > > I think there's a similar pam module for SGE, too. > > -- > Prentice > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From hahn at mcmaster.ca Thu Oct 27 19:37:29 2011 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 27 Oct 2011 19:37:29 -0400 (EDT) Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> <4EA96F8F.1010207@ias.edu> Message-ID: > nice 'pam_slurm' module which allows a user to login only to those nodes > on which the said user has active jobs (allocated through slurm). The I think this is slightly BOFHish, too. do people actually have problems with users stealing cycles this way? the issue is actually stealing, and we simply tell our users not to steal. (actually, I don't think we even point it out, since it's so obvious!) that means we don't attempt to control (we had pam_slurm installed and actually removed it.) after all, just because a user's job is done, it doesn't mean the user has no reason to go onto that node (maybe there's a status file in /tmp, or a core dump or something.) if someone persisted in stealing cycles, we'd lock their account. regards, mark hahn. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From skylar at cs.earlham.edu Thu Oct 27 19:43:24 2011 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Thu, 27 Oct 2011 16:43:24 -0700 Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> <4EA96F8F.1010207@ias.edu> Message-ID: <4EA9EC9C.9090307@cs.earlham.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 10/27/2011 04:37 PM, Mark Hahn wrote: >> nice 'pam_slurm' module which allows a user to login only to those nodes >> on which the said user has active jobs (allocated through slurm). The > > I think this is slightly BOFHish, too. do people actually have problems > with users stealing cycles this way? the issue is actually stealing, > and we simply tell our users not to steal. 
(actually, I don't think we > even point it out, since it's so obvious!) > > that means we don't attempt to control (we had pam_slurm installed and > actually removed it.) after all, just because a user's job is done, it > doesn't mean the user has no reason to go onto that node (maybe there's a > status file in /tmp, or a core dump or something.) > > if someone persisted in stealing cycles, we'd lock their account. > We do the equivalent with GE it if the end user requests it. We have some clusters that need to support a mix of critical jobs supporting data pipelines, and less-critical academic work. Our default stance, though, is to trust our users to do the right thing. Mostly it works, but sometimes we do need to bring out the LART stick. - -- - -- - -- Skylar Thompson (skylar at cs.earlham.edu) - -- http://www.cs.earlham.edu/~skylar/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk6p7JwACgkQsc4yyULgN4aRdgCbB3er3VI9OZEVSWO0GjL15rgU Z0sAoIZBKFsCeaYwA44uQT13JcdMN3dz =ervm -----END PGP SIGNATURE----- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From rgb at phy.duke.edu Fri Oct 28 14:04:02 2011 From: rgb at phy.duke.edu (Robert G. Brown) Date: Fri, 28 Oct 2011 14:04:02 -0400 (EDT) Subject: [Beowulf] Users abusing screen In-Reply-To: References: <4EA16F3A.8080209@ias.edu> <4EA505F4.7080007@science-computing.de> <4EA56C49.9060204@ias.edu> <20111025231305.GC9493@bx9.net> <4EA819DC.9090106@ias.edu> <774_1319662643_4EA87433_774_78911_1_alpine.LFD.2.02.1110261647170.7933@coffee.psychology.mcmaster.ca> <20111027054147.GB29939@bx9.net> <4EA96F8F.1010207@ias.edu> Message-ID: On Thu, 27 Oct 2011, Mark Hahn wrote: > if someone persisted in stealing cycles, we'd lock their account. Exactly. Or visit them with a sucker rod. Or have a department chair have a "talk" with them. Human to human interactions and controls work better than installing complex tools or automated constraints. Sure, sucker rods are a joke and no we don't actually bop users on the head or the desk or whomp them upside the head with a manual, but in most cases a stern talking to followed by locking their account unless/until they formally agree to change their ways is more than sufficient. rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. 
From sabujp at gmail.com  Fri Oct 28 14:22:03 2011
From: sabujp at gmail.com (Sabuj Pattanayek)
Date: Fri, 28 Oct 2011 13:22:03 -0500
Subject: [Beowulf] Users abusing screen

> Human to human interactions and controls work better than installing
> complex tools or automated constraints. Sure, sucker rods are a joke
> and no we don't actually bop users on the head or the desk or whomp them
> upside the head with a manual, but in most cases a stern talking to
> followed by locking their account unless/until they formally agree to
> change their ways is more than sufficient.

Funny you should mention that, we've got such a device handy, passed
down through the years from previous sysadmins:

http://i.imgur.com/G0pjk.jpg

It's also got a nice foam layer on the bopping side.

From beckerjes at mail.nih.gov  Fri Oct 28 14:27:48 2011
From: beckerjes at mail.nih.gov (Jesse Becker)
Date: Fri, 28 Oct 2011 14:27:48 -0400
Subject: [Beowulf] Users abusing screen

On Fri, Oct 28, 2011 at 02:22:03PM -0400, Sabuj Pattanayek wrote:
> http://i.imgur.com/G0pjk.jpg
>
> It's also got a nice foam layer on the bopping side.

Then it's just a prop. What's the *real* one look like?

--
Jesse Becker
NHGRI Linux support (Digicon Contractor)

From sabujp at gmail.com  Fri Oct 28 14:33:52 2011
From: sabujp at gmail.com (Sabuj Pattanayek)
Date: Fri, 28 Oct 2011 13:33:52 -0500
Subject: [Beowulf] Users abusing screen

I don't know, maybe we drop this on their head:

http://i.imgur.com/VWxyF.jpg

or worse, switch out their linux workstation with it.

On Fri, Oct 28, 2011 at 1:27 PM, Jesse Becker wrote:
> Then it's just a prop. What's the *real* one look like?
From james.p.lux at jpl.nasa.gov  Fri Oct 28 14:58:33 2011
From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C))
Date: Fri, 28 Oct 2011 11:58:33 -0700
Subject: [Beowulf] Users abusing screen

Google "Microsoft we share your pain" and look for the WSYP videos on
YouTube. The three-minute version is probably the one you want.

Jim Lux
+1(818)354-2075

> -----Original Message-----
> From: Sabuj Pattanayek
> Sent: Friday, October 28, 2011 11:34 AM
> Subject: Re: [Beowulf] Users abusing screen
>
> I don't know, maybe we drop this on their head:
>
> http://i.imgur.com/VWxyF.jpg
>
> or worse, switch out their linux workstation with it.

From glykos at mbg.duth.gr  Fri Oct 28 15:10:18 2011
From: glykos at mbg.duth.gr (Nicholas M Glykos)
Date: Fri, 28 Oct 2011 22:10:18 +0300 (EEST)
Subject: [Beowulf] Users abusing screen

> > if someone persisted in stealing cycles, we'd lock their account.
>
> Exactly. Or visit them with a sucker rod. Or have a department chair
> have a "talk" with them.
>
> Human to human interactions and controls work better than installing
> complex tools or automated constraints.

I can't, of course, even contemplate the possibility of disagreeing with
RGB.
Having said that, we (humans) do install complex tools and automated
constraints on each and every technologically advanced piece of
equipment, from cars and aircraft to computing machines (and we do not
assume that proper training and human interaction suffice to guarantee
proper operation of the said equipment). In this respect, methods like
allocating (in a controlled manner) exclusive rights to compute nodes do
appear sensible. I agree that installing restraints is a balancing act
between crippling creativity (and making power users mad) and avoiding
equipment misuse, but clearly there are limits to the freedom of use
(for example, you wouldn't add all cluster users to your sudo list).

My two cents,
Nicholas

--
Dr Nicholas M. Glykos, Department of Molecular Biology and Genetics,
Democritus University of Thrace, University Campus, Dragana, 68100
Alexandroupolis, Greece, Tel/Fax (office) +302551030620, Ext.77620,
Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/

From prentice at ias.edu  Fri Oct 28 16:20:41 2011
From: prentice at ias.edu (Prentice Bisbal)
Date: Fri, 28 Oct 2011 16:20:41 -0400
Subject: [Beowulf] Users abusing screen

I was still supporting those only 4 years ago. Much heavier than a Dell
or HP workstation. Will fix 'layer 8' problems in a jiffy.

--
Prentice

On 10/28/2011 02:33 PM, Sabuj Pattanayek wrote:
> I don't know, maybe we drop this on their head:
>
> http://i.imgur.com/VWxyF.jpg
>
> or worse, switch out their linux workstation with it.
From peter.st.john at gmail.com  Fri Oct 28 16:56:49 2011
From: peter.st.john at gmail.com (Peter St. John)
Date: Fri, 28 Oct 2011 16:56:49 -0400
Subject: [Beowulf] Users abusing screen

I think Greg is right on the money. Particularly at a place like IAS,
where resources are good and users may be errant but are doing great
things, I'd have a sequence of limits: first, a mail warning ("Your job
PID 666 has consumed one million core-hours, and its priority will be
decremented in 500,000 CH unless you call the sysadmin at 555-1212"),
later a nice (with another email warning), and only then a kill (with an
email notification). If they have opportunities to upscale the
allocations for really important jobs, and they are notified about
automatic limitations ahead of time, they have no reason to complain.

Peter

On Thu, Oct 27, 2011 at 1:41 AM, Greg Lindahl wrote:
> On Wed, Oct 26, 2011 at 05:14:13PM -0400, Steve Crusan wrote:
>
> > If the issue is processes that run for far too long, and are abusing
> > the system, cgroups or 'pushing' the users to use a batch system seems
> > to work better than writing scripts to make decisions on killing
> > processes.
>
> What I saw work well was nicing the process after a certain time,
> including an email, and then killing and emailing after a longer
> time. The emails can push the batch alternative. Users generally don't
> become angry if the limits are enforced by a script; they can only be
> surprised once, and that first time is just nicing the process. If
> they have a hard time predicting runtime (a common issue, especially
> for non-hardcore supercomputing types), it's not like they
> _intentionally_ are exceeding the limits...
>
> -- greg
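The nice-then-kill policy described in Greg's quoted message is easy to script. The sketch below is a minimal illustration, not anything the posters above actually ran: it assumes the third-party psutil module, a local SMTP server, and purely illustrative thresholds, usernames and mail addresses, and it is meant to run from cron on an interactive or login node.

    #!/usr/bin/env python
    # Minimal sketch of a "warn and renice, then kill" watchdog.
    # All thresholds, exempt accounts and mail settings are assumptions;
    # psutil is a third-party module (pip install psutil).

    import smtplib
    import socket
    from email.mime.text import MIMEText

    import psutil

    NICE_AFTER = 4 * 3600        # CPU-seconds before renicing and warning
    KILL_AFTER = 24 * 3600       # CPU-seconds before killing
    EXEMPT = {"root", "daemon"}  # accounts the watchdog never touches
    MAIL_HOST = "localhost"
    MAIL_DOMAIN = "example.org"  # hypothetical; owners are mailed as user@MAIL_DOMAIN

    def mail(user, subject, body):
        """Send the process owner a short note about what just happened."""
        msg = MIMEText(body)
        msg["Subject"] = subject
        msg["From"] = "hpc-admin@" + MAIL_DOMAIN
        msg["To"] = "%s@%s" % (user, MAIL_DOMAIN)
        server = smtplib.SMTP(MAIL_HOST)
        server.sendmail(msg["From"], [msg["To"]], msg.as_string())
        server.quit()

    def sweep():
        host = socket.gethostname()
        for proc in psutil.process_iter():
            try:
                user = proc.username()
                if user in EXEMPT:
                    continue
                cpu = sum(proc.cpu_times()[:2])           # user + system CPU-seconds
                cmd = " ".join(proc.cmdline()[:3]) or proc.name()
                if cpu > KILL_AFTER:
                    proc.kill()
                    mail(user, "process killed on " + host,
                         "%s (pid %d) used %.1f CPU-hours outside the batch "
                         "system and has been killed." % (cmd, proc.pid, cpu / 3600.0))
                elif cpu > NICE_AFTER and proc.nice() < 19:
                    proc.nice(19)
                    mail(user, "process reniced on " + host,
                         "%s (pid %d) has used %.1f CPU-hours; it has been "
                         "reniced and will be killed at %.0f CPU-hours unless "
                         "you move it into the queueing system."
                         % (cmd, proc.pid, cpu / 3600.0, KILL_AFTER / 3600.0))
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue

    if __name__ == "__main__":
        sweep()    # run from cron every few minutes

Run frequently from cron, this gives users the one "surprise" Greg mentions (the renice and its warning mail) well before anything is killed.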
From prentice at ias.edu  Fri Oct 28 18:21:50 2011
From: prentice at ias.edu (Prentice Bisbal)
Date: Fri, 28 Oct 2011 18:21:50 -0400
Subject: [Beowulf] Users abusing screen

On 10/28/2011 04:56 PM, Peter St. John wrote:
> I think Greg is right on the money. Particularly at a place like IAS,
> where resources are good and users may be errant but are doing great
> things,

Have you been a visitor, member or staff member at IAS?

--
Prentice

From peter.st.john at gmail.com  Fri Oct 28 19:16:44 2011
From: peter.st.john at gmail.com (Peter St. John)
Date: Fri, 28 Oct 2011 19:16:44 -0400
Subject: [Beowulf] Users abusing screen

Prentice,

No, I didn't mean to imply anything specific about e.g. your budget, but
IAS has a fantastic reputation. Say hi to Dima for me; he plays Go and
is an algebraic geometer visiting this year.

Peter

On Fri, Oct 28, 2011 at 6:21 PM, Prentice Bisbal wrote:
> On 10/28/2011 04:56 PM, Peter St. John wrote:
> > I think Greg is right on the money. Particularly at a place like IAS,
> > where resources are good and users may be errant but are doing great
> > things,
>
> Have you been a visitor, member or staff member at IAS?
>
> --
> Prentice