The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this article I review some postings to the Beowulf list on the taxonomy of clusters and grids (somehow Larry the Cable Guy and his taxidermy skills seem appropriate for this subject) and remote console management. I think the discussion threads I present below provide some very useful information that can be used by everyone (despite the age of the postings).
Every once in a while there is an argument on the Beowulf list about what is a "cluster" and what is a "beowulf." These discussions are usually somewhat fun, always contentious, and occasionally hilarious (e.g. when Microsoft CCS enters into the discussion). However, on Sept. 16, 2005, Sebastian opened up the question even further. He asked what the difference was between a grid and a cluster (inquiring minds want to know!).
The opening salvo was launched by our beloved Robert Brown (rgb). He said that a grid is a specific kind of cluster that is usually designed to run embarrassingly parallel applications over a wide user base. He also wrote that many times, grids are just "clusters of clusters." He went on to say that, "A lot of grids are accessed and managed by web-based tools with authentication schema and shared storage (or not) to match."
Then rgb defined a cluster as a "collection" of systems, "... simultaneously used to solve a given task." However, he did distinguish between High Availability (HA) clusters and High Performance Computing (HPC) clusters. He did mention that the beowulf list is focused on HPC clusters (hint, hint). He split clusters into two distinct groups: "beowulfs" and "everything else." Typically HPC clusters have nodes on a flat network, and the nodes can either be dedicated to the cluster or be a collection of workstations.
The next salvo was fired by Egan Ford. For those of you that don't know Egan, he is one of the best cluster experts around. He is a bit on the quiet side on the beowulf mailing list though. Nevertheless, Egan gave a very succinct definition of clusters and grids that bears repeating:
Another very experienced cluster person, Mark Hahn, jumped into the fray with this post. His first comment was very cheekily (is that a real word?) built on Egan's comment about grids being more political by saying that, "I'd even go so far as to say that setting up a grid is basically a PR measure, whereas setting up a cluster is usually done for a specific set of practical reasons ;)" Remember that this is 2005 and grids were an extremely hot subject at the time (they have cooled off a bit since then). Mark went on to make a very good comment that I happen to agree with. He pointed out that people build grids and then expect their applications to work on them. Mark's comment was, "You just arrange the plumbing and the flops will flow to whereever they're needed." He went on to talk about how his organization has separate clusters with a single administrative domain but has not created a grid from them because each one is designed for a different set of codes. This is a very important point. Clusters can be designed for a certain type of code (in the old supercomputing days, you had to design the application around the hardware). With grid computing it may be difficult to create specialized clusters and have them act as a grid.
With these comments from Mark, the next broadside came from rgb. His opening statement in response to Mark's post says it all, "I would have to strongly disagree with both of these statements." Rgb put a more precise definition on what he thought a cluster was and what he thought a grid was. A cluster, in his opinion, can be one of the following:
He went on to say that in his opinion, it was the software that made the differentiation between a cluster and a grid. Grid software is a bit different from cluster software. He even made a list of challenges faced by grid developers.
There was more discussion about grids and how they work, with a little history of PVM thrown in by rgb. Then Joe Landman offered his ideas on the difference between a cluster and a grid. I thought his definition was worth repeating:
"In a nutshell, a grid defines a virtualized cloud of processing/data motion across one or more domains of control and authentication/authorization, while a cluster provides a virtualized cloud of processing/data motion across a single domain of control and authentication/authorization."
(Note: This definition disagrees slightly with Egan's definition). Joe dove down into some details about the ongoing discussion, but at the end, made a nice comment about how grids and clusters work.
"Most parallelization on clusters is the wide type: you distribute your run over as many nodes as practical for good performance. Parallelization on grids can either be trivial ala Monte Carlo, or pipeline based. Pipeline based parallelism is getting as much work done by creating the longest pipeline path practical keeping as much work done per unit time as possible (and keeping the communication costs down). Call this deep type parallelism. On some tasks, pipelines are very good for getting lots of work done. For other tasks they are not so good. There is an analogy with current CPU pipelines if you wish to make it." BTW, rgb wrote a very long post in response to Joe's comments. It is probably one of the longest rgb posts I've seen (and actually read!).
Just for good measure, our distinguished Head Monkey, Doug, chimed in with his definition of a grid and a cluster. His definition revolves around administrative domains with the added complexity of the network characteristics. He said,
"I think the main point is administrative domains. Basically, a cluster is run under a single administrative domain (i.e. you have an account and file space in this domain). A grid is where you combine multiple administrative domains to form a computing resource (i.e. you may use two clusters each of which have separate administrative domains which may or may not be under the control of your organization)."
While the posts in this thread ran long and the disagreement ran a bit high, I think it's a good thread to highlight because it makes you think about what you want in a grid (if you want to build one) and reminds you of the strength of clusters.
Another topic that seemed to generate a number of postings was remote console management. On Sept. 22, 2005, Bruce Allen posted to the beowulf list about remote console management. He was getting ready to build a large cluster and wanted to get some opinions on the subject. He mentioned that they were looking at IPMI to do the job, or using Cyclades serial consoles to gain access to the nodes, or a combination of both. BTW, Bruce is the lead developer on smartmontools, which is used to gain access to the SMART features of hard drives.
Surprisingly, asking this question generated a large number of replies. Stuart Midgley was the first to respond to Bruce's post. He said that he didn't have very good experiences in the area of remote console management. He said that serial consoles tend to be expensive and do not always provide access to the BIOS (one estimate he gave was $500-$1,000/node). He said that they use KVM (Keyboard/Video/Mouse) over Ethernet to gain access to the head nodes and don't bother with the compute nodes. Since Bruce is at a university, Stuart thought the cheapest option was, "... to give a PhD or grad student an extra $10k and get a small trolley with keyboard/monitor/mouse." Having been a Ph.D. student at one point, I can say that I would have relished the extra $10k, but I probably wouldn't have finished my degree.
Bruce responded with some interesting price numbers he found (remember that this is 2005). The nodes they were considering had serial BIOS access. The IPMI cards for the nodes were about $60-$80/node. The serial consoles they were looking at were about $83/node.
Joe Landman also replied, noting that doing KVM over IP would be very expensive. For remote power management, Joe used a power control unit from APC (no commercial intended here). For console access he prefers KVM over IP despite the cost (he talked about some pricing for such a configuration).
One reply was from our esteemed Monkey Leader (Gorilla?), and I think it lays out a very interesting philosophy. I think it's interesting enough to warrant its reproduction here.
This brings up an interesting point and I realize this does come down to a design philosophy, but cluster economics sometimes create non standard solutions. So here is another way to look at "out of band monitoring". Instead of adding layers of monitoring and control, why not take that cost and buy extra nodes? (but make sure you have a remote hard power cycle capability). If a node dies and cannot be rebooted, turn it off, and fix it later. Of course monitoring fans and temperatures is a good thing (tm), but if a node will not boot, and you have to play with the BIOS, then I would consider it broken.
Because you have "over capacity" in your cluster (you bought extra nodes) then this design philosophy does not impact the amount of work that needs to get done (if you don't lose too many nodes). Indeed, prior to the failure you can have the extra nodes working for you. You fully understand that at various times one or two nodes will be off line. They are taken out of the scheduler and there is no need to fix them right away. So if you don't lose nodes too often, then you will actually get more work from a cluster that has traded console access for nodes.
Doug went on to say,
This approach also depends on what you are doing with your cluster and the cost of nodes etc. In some cases out-of-band access is a good thing. In other cases, the "STONIH-AFIT" (shoot the other node in the head and fix it tomorrow) approach is also reasonable.
I love this approach because it's so simple and you can gain extra computing power for the same cost. You have remote power capability for those tough-to-reach nodes that are bent sideways and need to be power cycled. But rather than adding additional layers of control and management, you use that money to buy more nodes. If we take some of the numbers that were floating around, you could spend about $150/node on console access. If a node costs about $4k, then you can get about 3.75% more nodes ($150/$4,000). In the case of 1000 nodes, this amounts to an extra 37 nodes. For 100 nodes, this is an extra 3.75 nodes (maybe you can squeeze it to an extra 4 nodes). While this amount sounds small, if you have perhaps 90% of this extra capacity over a 3 year life of the cluster, then you can get a great deal of extra work done (for the same cost).
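For anyone who wants to plug in their own prices, here is a small sketch of that back-of-the-envelope calculation. The function names are my own invention for illustration, and the inputs ($150/node for console access, $4k/node, 90% availability, 3 year life) are just the round figures quoted above, not measured values. With exactly those inputs the ratio is 150/4000, or 3.75%.

```python
def extra_node_fraction(console_cost_per_node, node_cost):
    """Fraction of additional nodes you could buy by skipping console access."""
    return console_cost_per_node / node_cost


def extra_nodes(num_nodes, console_cost_per_node, node_cost):
    """Whole extra nodes purchasable with the money saved on console access."""
    saved_budget = num_nodes * console_cost_per_node
    return saved_budget // node_cost  # you cannot buy a fraction of a node


def extra_node_years(num_nodes, console_cost_per_node, node_cost,
                     availability=0.90, lifetime_years=3):
    """Extra node-years of compute, assuming the extra nodes are usable
    for the given fraction of the cluster's lifetime (an assumption)."""
    nodes = extra_nodes(num_nodes, console_cost_per_node, node_cost)
    return nodes * availability * lifetime_years


if __name__ == "__main__":
    for n in (100, 1000):
        print(n, "nodes ->", extra_nodes(n, 150, 4000), "extra nodes,",
              round(extra_node_years(n, 150, 4000), 1), "extra node-years")
```

Using integer division is a deliberate choice here: the leftover money that does not add up to a full node is simply ignored, so the result is a conservative count.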
I also really like Doug's comment about BIOS problems. If there is a BIOS problem, the node should be considered "broken" and sent back to the vendor for repair. This points out that the vendor needs to do their homework and check the BIOS for problems (this is another good reason for LinuxBIOS). If you have an entire cluster with a bad BIOS, then the vendor should come on-site and fix the problems (quickly), and I would think twice about doing business with this vendor since it is probably true that they didn't do their homework up front. This approach forces the vendors to really do testing on the nodes they offer. It means you get a better, more robust solution and the vendor has less to fix.
Unfortunately, Bruce couldn't use Doug's approach. He said that they stored data on the compute nodes, so they couldn't really afford to just give up on a node and send it back for repair. They needed to know what was wrong with the node and diagnose it with the vendor and get it back into production.
There was a great deal of discussion about whether certain motherboards could do serial over LAN (SOL) and which approach is better, etc. I have omitted those discussions since they are very specific. Please look at the Beowulf archives in September 2005 if you want to read more about the details.
However, I included this thread because I found it very interesting that alternative ideas were discussed. Do you take commodity hardware and add somewhat pricey monitoring to it? In the past when we had the dinosaur supercomputer hardware, adding monitoring to them was a cost effective solution (monitoring cost much less than the hardware itself). But now, we have commodity based supercomputing where the cost per node is cheap. But is it cheap enough to treat the nodes as though they are disposable? This is the point that Doug is making. Is it worthwhile economically to add layers of monitoring on top of commodity hardware? Or is it better to just buy as many nodes as you can, use them all, and send the ones that are broken to the vendor for repair (if the vendor can repair and return them in a timely manner)? I think Doug has a very valid point. However, as we've seen, it may not always be possible to treat the nodes as throw-aways. In that case, you need a cost effective console management approach. Please review the thread for more discussion about the details of remote console access.
Dr. Jeff Layton hopes to someday have a 20 TB file system in his home computer (donations gladly accepted) so he can store all of the postings to all of the mailing lists he monitors. He can sometimes be found lounging at a nearby Fry's, dreaming of hardware and drinking coffee (but never during working hours).