Issues, but no real answers
The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this article I review some postings to the Beowulf list on performance measurements and on Microsoft's foray into the HPC market (nice historical perspective here), which resulted in a good and timely discussion about Linux distributions for the HPC world. While these discussions are from 2004 and a bit on the older side (aren't we all?), they do provide some good insights into these continuing debates.
G5 Performance and Benchmarks
This Beowulf list thread started when Eugen Leitel posted a forward from another mailing list asking about Apple G5 performance, particularly on computational chemistry QM and MD codes (the original poster was Joe Leonard). Eugen also forwarded a response to Joe's initial posting from the other list. Mike McCallum had tested dual G5s and found them to be near the top in price/performance, especially when using the IBM compilers. He also found the built-in GigE to be quite good for NAMD and CHARMM, which according to Mike scale well over GigE (these conclusions were based on some benchmark numbers that Mike posted).
Bill Broadley was the first to reply, asking for clarification of some of the test results that Mike had posted. He also took issue with some of the conclusions that Mike had drawn regarding the scalability of NAMD and CHARMM on GigE vs. Myrinet. Michael Huntingdon then posted that he thought the Itanium 2 was a good solution based on the performance numbers.
Joe Landman took issue with the idea of using SPEC numbers as a "good" predictive indication of performance. He also pointed out that the benchmarks posted by Mike McCallum may have had some flaws because of the compilers and compile options used. Joe also had problems with recommending Itanium 2 processors for all HPC applications, pointing out that for bioinformatics applications, Itanium 2 processors don't compare well to other processors.
A frequent contributor, Mark Hahn, joined in to say that he had done a small analysis of the SpecFP benchmarks. His goal was to point out that the overall SpecFP scores of some machines are driven by a small subset of the individual benchmark scores. He sorted the scores for each machine, omitted the top scores, and plotted the results. Based on his analysis he concluded that the PowerPC 970 was good for cache-friendly codes and codes that could perform two vector multiply-add operations per cycle; the Itanium 2 is great if your code is cache friendly, your code is amenable to the pipelining required by EPIC (the Itanium 2 architecture), and you can afford them; and Opterons are great if your working set makes caches less effective.
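The sort-and-omit idea can be sketched in a few lines of Python. This is only an illustration, not Mark's actual analysis: the per-benchmark ratios below are made up, the machine names are placeholders, and the number of top scores to drop is an assumption. The one detail taken from SPEC itself is that the overall score is a geometric mean of the per-benchmark ratios.

```python
from math import prod


def geometric_mean(values):
    """Geometric mean, which is how SPEC combines per-benchmark ratios."""
    return prod(values) ** (1.0 / len(values))


def trimmed_scores(ratios, drop_top):
    """Sort the ratios in ascending order and omit the `drop_top` largest."""
    ordered = sorted(ratios)
    return ordered[:len(ordered) - drop_top]


def compare(machines, drop_top):
    """Overall score per machine after omitting the top `drop_top` ratios."""
    return {
        name: geometric_mean(trimmed_scores(ratios, drop_top))
        for name, ratios in machines.items()
    }


# Hypothetical ratios, invented purely for illustration (not real SPEC data).
machines = {
    "machine_a": [10.0, 10.0, 10.0, 80.0],  # one outlier benchmark
    "machine_b": [15.0, 15.0, 15.0, 15.0],  # uniform across benchmarks
}

full = compare(machines, drop_top=0)     # all benchmarks included
trimmed = compare(machines, drop_top=1)  # best benchmark omitted

for name in machines:
    print(f"{name}: full={full[name]:.2f} trimmed={trimmed[name]:.2f}")
```

With these invented numbers, machine_a ranks ahead of machine_b on the full score but falls behind once its single best benchmark is omitted, which is the kind of effect the analysis was meant to expose.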
Greg Lindahl posted a response to Mark's analysis. He didn't think the analysis was valid because the scores had not been normalized in a way that made their absolute values meaningful. Greg also took issue with the phrase, "...if you can afford them..." in regard to the Itanium 2. Greg's point is that you either find the best price/performance and buy those systems, or you buy the most performance for a given price. Neither approach depends on the cost of a single system, just on the performance and cost of the entire cluster.
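Greg's two buying strategies can be written down as a rough sketch. Everything here is hypothetical: the node names, prices, and performance figures are invented, and the model assumes cluster performance scales linearly with node count, which real applications rarely do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeOption:
    name: str
    price: float  # cost of one node (hypothetical)
    perf: float   # per-node performance in arbitrary units (hypothetical)


def best_price_performance(options):
    """Strategy 1: pick the option with the highest performance per dollar."""
    return max(options, key=lambda o: o.perf / o.price)


def cluster_perf_for_budget(option, budget):
    """Aggregate performance of as many whole nodes as the budget buys.

    Assumes perfect linear scaling, which is an idealization.
    """
    nodes = int(budget // option.price)
    return nodes * option.perf


def most_performance_for_budget(options, budget):
    """Strategy 2: pick the option that maximizes cluster performance."""
    return max(options, key=lambda o: cluster_perf_for_budget(o, budget))


# Invented numbers for illustration only.
options = [
    NodeOption("cheap_node", price=2000.0, perf=1.0),
    NodeOption("pricey_node", price=6000.0, perf=3.5),
]
budget = 100_000.0

print(best_price_performance(options).name)
print(most_performance_for_budget(options, budget).name)
```

In this made-up case the node that costs three times as much wins under both strategies, because only the cluster-level totals enter the comparison.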
Windows HPC Edition and Linux Clustering
As would be expected, this thread brought up a lot of opinions and insights from the list membership. On May 25, 2004 Eugen Leitel posted an article describing how Microsoft was creating an HPC version of Windows (now available as Windows Compute Cluster Server 2003). Naturally, there were immediate comments about seeing hundreds of nodes all getting the "blue screen of death" at the same time. There is always a chuckle about these comments, but there are some real issues behind Microsoft's entry into this space.
Robert Brown started off the discussion by writing a "few" comments about Microsoft's motives for entering the HPC market. Shortly thereafter Joe Landman jumped in to say that companies don't really have nefarious motives behind their efforts. In his opinion, most of their efforts are clearly discernible from their basic goals (usually involving making lots of money). Douglas Eadline, editor of ClusterMonkey, posted that he thought there were two reasons that Microsoft was entering this market. The first reason, as for any corporation, is profit. He noted that the article stated that Microsoft was focusing on the financial and cycle scavenging markets, where the return on investment is good. The second reason was competitive. He suggested that they were trying to limit "Linux creep" and that using Linux to build airplanes, find oil, and search the genome has added some legitimacy to Linux in the data center. His final comment, "I think we just got legit.", was of a positive nature because the entry of a big company often helps legitimize and solidify markets.
Robert Myers posted that he thought Microsoft would have a tough time convincing people to switch to Windows because so much of the HPC world is built on Unix or Unix-like software. He went on to suggest that Microsoft may want to consider telling the world that it's not out to make money on its HPC product, but rather to help solve the problems of the world by harvesting spare cycles. (Good thought. I wonder if Robert has a future in advertising?)
Of course, being the good pundit that I am, I had to jump into the fray by pointing out that these developments are a huge threat to using Linux in the production world. I talked about how the cost of server-level Linux distributions is now much higher than what companies pay for a license of a server version of Windows. This situation is going to push many IT managers into asking the question, "I thought Linux was cheaper than Windows?" In fact, I know of cases where this is happening. I finished my posting by asking Linux distribution companies to wake up and come up with a reasonable pricing model for HPC systems, cluster companies to support an open Linux distribution, and commercial software companies to support a kernel/glibc combination rather than a specific distribution. I know of one major distribution company that has already developed a much more reasonable HPC pricing model.
Roger Smith, a very experienced and knowledgeable cluster user, jumped in to say, "Amen!" (Thanks Roger!) Roger discussed how Red Hat was charging way too much for a distribution for his cluster (even with educational pricing) and how this would cause havoc because of the need for commercial compilers and applications that are only supported on these expensive distributions. He made the very valid point that he doesn't mind spending a little more on a distribution, but not the amount that Red Hat was asking.
Laurence Liew asked a very good question in response to Roger's comments. He asked what people are willing to spend on a per-node basis. He correctly pointed out that supporting an HPC distribution requires HPC-savvy engineers, who can be expensive. Laurence guessed that $50-$100 per node was what people might like to see. In a later post, Joe Landman asked a very similar question about what people are willing to pay for commercial applications and distributions. Joe made the very good observation that several companies take open-source applications, re-brand them as their own, and charge a fair amount for them. He was curious about what people would pay for support of open-source applications.
John Burton added some excellent observations. He made the point that regardless of the number of nodes in a cluster, the patches, updates, etc. are stored only on the master node. The compute nodes are updated, patched, etc. by the cluster administrators. John then asked, "Since we're handling the systems administration, installation, maintenance, etc, ourselves, what are we getting for our ($100 per node) money?" He also asked what the added cost of the distribution would give him if he expanded his cluster from 100 nodes to 200 nodes. He still wants the same thing from the distribution company despite the increased number of nodes: updates and patches for the server node.
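John's question is easy to put into numbers. The short sketch below uses only the figures from his post ($100 per node, growing from 100 to 200 nodes); the function name is just an illustrative helper, and the model assumes a flat per-node fee with no volume discount.

```python
def per_node_license_cost(nodes, fee_per_node):
    """Total distribution cost under a flat per-node fee (no discounts assumed)."""
    return nodes * fee_per_node


fee = 100  # dollars per node, the figure quoted in the post

cost_100 = per_node_license_cost(100, fee)
cost_200 = per_node_license_cost(200, fee)
added = cost_200 - cost_100

print(f"100 nodes: ${cost_100}, 200 nodes: ${cost_200}, added cost: ${added}")
```

Under this model, doubling the cluster doubles the bill, while what John says he needs from the vendor (updates and patches for the master node) stays the same.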
Robert Brown had a very long and interesting post in response to Laurence's question. In a quick summary, Robert suggested that cluster people devote a portion of their IT time to supporting an open distribution such as Fedora rather than spend a great deal of money supporting a pointless HPC pricing model from various Linux distribution vendors. Laurence did a good job extracting the major points from Bob's posting and came to the conclusion that Bob would like to see something like a $10-$20 cost per node for a distribution.
This thread was one of the better ones I've read on the Beowulf mailing list about what I would term non-technical things (i.e. not about hardware specs or code performance tweaks, etc.). I think the postings from various people point out the frustration that the cluster community is having with non-HPC pricing models. I also think that the commercial distribution companies would do well to pay attention to the opinions expressed by members of the list, as there were a number of good ideas, comments, and observations that would help them if they want to keep or build a presence in the HPC cluster market. At the same time, the thread also showed that the open distributions are still very viable in the HPC market and that there may be a need for companies to support these distributions.
| Sidebar One: Links Mentioned in Column |
| NAMD - NAnoscale Molecular Dynamics |
| CHARMM - Chemistry at HARvard Macromolecular Mechanics |
This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.
Jeff Layton has been a cluster enthusiast since 1997 and spends far too much time reading mailing lists. He can be found hanging around the Monkey Tree at ClusterMonkey.net (don't stick your arms through the bars though).