Buying the Best Linux Performance?

From the "How tall is a building?" department

The Beowulf Mailing List is a source of valuable information about high performance computing (HPC) Linux clusters. Conversations on the list apply not only to HPC, but also to Linux performance on any system. Recently (March 8, 2007) the following was posted to the Beowulf Mailing List:

I would like to know what server has the best performance for HPC systems between The Dell Poweredge 1950 (Xeon) And 1435SC (Opteron). Please send me suggestions...

Here are the complete specifications for both servers:
- Poweredge 1435SC, Dual Core AMD Opteron 2216 2.4GHz 3GB RAM 667MHz, 2x512MB and 2x1GB Single Ranked DIMMs
- Poweredge 1950, Dual Core Intel Xeon 5130 2.0GHz 2GB 533MHz (4x512MB), Single Ranked DIMMs

From your specifications, almost certainly the Opteron. For a variety of reasons, but higher clock certainly helps -- it would probably have been faster at equivalent clock anyway. Now that I've "answered", let me tell you why you shouldn't believe me and what you should actually do to answer your own question.

There is a standard litany we like to chant on the Beowulf list:

  • Your Mileage May Vary
  • A benchmark in hand is worth any number of anecdotal reports
  • The best benchmark is your own application
  • What do you plan to do with it?
  • It depends...

In particular, it depends on your application (mix), its memory, disk, and network requirements, the topology and type of your network, the communication and memory access pattern used by your application (mix), the compiler and library used, and a few dozen other variables, major and minor. That is why nobody is going to tell you one is always better than the other, even if they think it is true.

And then there is the cost -- the REAL question is which one has the better cost-benefit, not which one is the cheapest or fastest independently. Ask yourself the question: with a fixed budget to spend, which architecture lets me get the most done in the least time?
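To make the fixed-budget question concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Candidate` class, the function names, the prices, and the jobs-per-hour figures are hypothetical placeholders, not quotes or measurements for the servers above. It also assumes throughput scales with node count times a single `scaling_efficiency` factor, which is only reasonable for loosely coupled work.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    price_per_node: float  # hypothetical quote, in dollars
    jobs_per_hour: float   # per-node rate, measured with YOUR application


def nodes_affordable(candidate, budget):
    """Whole nodes that fit in the budget (ignores network, racks, power)."""
    return int(budget // candidate.price_per_node)


def cluster_throughput(candidate, budget, scaling_efficiency=1.0):
    """Jobs per hour for the whole cluster under a crude linear-scaling model."""
    nodes = nodes_affordable(candidate, budget)
    return nodes * candidate.jobs_per_hour * scaling_efficiency


def best_candidate(candidates, budget):
    """Pick the candidate that gets the most work done for the same money."""
    return max(candidates, key=lambda c: cluster_throughput(c, budget))


# Placeholder numbers only: node_b is faster per node but costs more.
candidates = [
    Candidate("node_a", price_per_node=2000.0, jobs_per_hour=10.0),
    Candidate("node_b", price_per_node=2500.0, jobs_per_hour=12.0),
]
budget = 100_000.0

for c in candidates:
    print(c.name, nodes_affordable(c, budget), cluster_throughput(c, budget))
print("best:", best_candidate(candidates, budget).name)
```

With these made-up numbers the slower, cheaper node wins on total throughput, which is the point: neither "fastest" nor "cheapest" answers the question on its own. A tightly coupled parallel code would need a smaller `scaling_efficiency` and a line item for the network.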

So I wouldn't be doing you a favor by telling you "definitely the Opteron", and only partly because it might not be true. If you believed me (because I sound so glib, and because you don't know that AMD once sent me a cool tee-shirt and Intel hasn't, although I do have a pair of those cool little contamination-suited Intel dude key chains that come close), then you might be tempted to skip the correct cluster engineering steps:

  1. Study your application (mix) -- figure out in at least general terms its (their) communication patterns, its (their) memory requirements (size and access pattern), its (their) CPU requirements. Some applications are "I/O bound" -- they run at a speed determined by the access speed of disk, for example. Some applications are "memory bound" -- they spend nearly all of their time fetching data from memory and relatively little actually doing something to it. Some applications (especially parallel cluster applications) are "network bound" and run at a rate that is determined by the latency or bandwidth of a network connection, further complicated in the case of real parallel code by the communication PATTERNS, which can cause bottlenecks outside of the system altogether. Some applications (the happiest ones, I tend to think:-) run at a rate that is limited by CPU clock (clock, clock and nothing but clock), although different CPU architectures (e.g. Xeon and Opteron, 32 or 64 bit) have a different BASE performance at any given clock.
  2. If at all possible, and it nearly always is possible, beg, borrow, steal, buy, or rent a system or two in your competing architectures, run Your Code compiled with Your Target Compiler on those systems, and just measure its performance. This is actually a whole lot easier than the stuff in step 1 and a lot more likely to be accurate, but I still don't advise skipping step 1. If you are planning on buying more than a handful of systems, it is often worth your while to buy one of each (or even three or four candidate systems), test them, and then buy the other 127 or however many nodes you plan to put in your winning cluster. The other option is possibly buying 128 of the wrong kind. You can always recycle the losers as servers, really powerful desktops, whatever. A really good vendor will often loan you systems (or network access to systems) to do this testing. A really good compiler vendor (e.g. Pathscale) will often lend you a compiler for a trial period to do the testing. Other compiler vendors such as Intel and Portland Group have free trial licenses you can use as well.
  3. Don't lock yourself in to a single distributor while looking over systems. I personally don't use some tier 1 vendors (e.g. IBM, Dell, HP), although I know people that do. I have found that some (not all) tier 1 hardware is not the most reliable, but their service plans tend to be very good, their cost is reasonable, and they aren't Linux-averse. I do think, however, that they're still working on becoming actively Linux-friendly. There are a number of tier 2 vendors that cater to the HPC crowd. Their hardware is often as good or better, with equally attractive prices and service deals. These include companies like Penguin Computing, Appro, Microway and others. Penguin Computing is my own personal favorite, largely because, with the exception of one DOA system out of a good size stack of Altus servers we've purchased (no doubt the one that "fell off the truck" and likely not Penguin's fault), I have yet to see an Altus fail in harness. That is pretty extraordinary, really, given that they have run at full load pretty much 24x7 for over a year. I've heard that their service is really good -- maybe one day I'll have a chance to find out. Penguin will almost certainly let you prototype on their systems.
  4. When you've done all your research above, then do the cost-benefit analysis. If your application is network bound, don't worry so much about system clock and speed; worry about getting a really high speed cluster network to match (which is expensive, so you may want to get cheaper, slower nodes if the application isn't CPU bound). If your application is memory bound, you may want to skip the dual cores and get two or four single-core processors instead -- otherwise you might just be using two cores at a time while the other cores are waiting in line to get at memory, wasting all the money you spent on the dual cores in the first place (although dual-core may be the default in the near future). If your application is disk bound, then look more closely at disk and less at CPU -- what kind of bus, what kind of disk subsystem, what are the bottlenecks (per system) and the costs of minimizing them.
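The "just measure it" advice in step 2 can be sketched as a small timing harness. This is a minimal illustration in Python, not a benchmarking tool: the function names `time_command` and `summarize` are hypothetical, and it measures only wall-clock time for a whole run, so it will not tell you by itself whether the run was CPU, memory, disk, or network bound.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, repeats=5):
    """Run `cmd` (a list of arguments) `repeats` times.

    Returns the wall-clock time of each run in seconds. Output from the
    command is discarded so that terminal I/O does not distort the timing.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,  # raise if the application itself fails
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Report min, median, and max; the spread shows run-to-run noise."""
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }
```

Run the same command, built with the same compiler and flags, on each candidate system and compare the medians; for example `summarize(time_command(["./my_app", "input.dat"]))`, where `./my_app` and `input.dat` are placeholders for your own binary and a representative input. Several repeats matter because a single run can be skewed by a cold disk cache or other activity on the node.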

    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.