
The following is a partial list of the more important variables in cluster design that can be rate-determining for various kinds of code.

  • CPU Architecture. Different architectures get different amounts of different kinds of work done per clock cycle.
  • CPU Clock. This is the "easy" one -- more seems like it would always be better. On a fixed budget, it isn't.
  • Memory Architecture and Speed. If your job involves large vector or matrix operations, memory speed is very likely to be the rate-limiting feature instead of CPU.
  • Disk Architecture and Speed. Sometimes this is completely irrelevant. Other times it is very important. It depends on how much and how often your task accesses disk, which tends to be very slow relative to everything else.
  • Network Architecture, Bandwidth and Latency. One day this will be a suitable topic for a whole article. For real parallel tasks with interprocessor communications distributed over the network, it is (as we have learned in earlier columns) the critical bottleneck, likely to be as important as, or more important than, all of the features above.
  • Miscellany. Operating system, compiler, libraries, administrative interface, parallel environment, task organization, and much more all can have a significant impact on task speed.

Each of the headings above can be further fractionated: integer instructions versus floating point instructions, 32 bit versus 64 bit floating point instructions, floating point broken down further into multiplication, addition, multiplication AND addition, division, transcendental function support, instruction pipelining, and the size, speed, and number of layers of cache memory. A particular task might well be bottlenecked by how efficiently one particular mix of these instructions is executed on a system, which itself might vary by a factor of three or more depending on the compiler and whether optimized libraries are used.

If the above makes you feel dizzy and a bit inadequate, don't worry -- it should. How are you supposed to figure out which arrangement will work best for your code without a degree of some sort in advanced computer science? Especially when even people who have a degree in computer science cannot always predict performance except in broad terms? Still, not to worry. As you might expect, there is a way to get by.

Optimizing Node Architecture

The best way to proceed to optimize your node architecture and cluster design is by studying your task and prototyping the task on as many distinct architectures as you can manage.

Studying your task is pretty obvious. A lot of computational tasks spend a lot of time in one "central" loop doing one particular series of chores over and over again. It is easiest to show you how to proceed by going over an example. Let me tell you in very rough terms about my code (a Monte Carlo simulation studying the critical behavior of O(3) symmetric magnets), so you can get a feel for what sort of thing to look for when studying or profiling your task.

In my code, the core loop contains a lot of general floating point instructions and a few transcendental calls (exponential, natural log, trig functions). Transcendental calls tend to be library functions and very slow on many architectures, but have microcode optimizations and associated CPU instructions on Intel and AMD processors, inherited from the days when the 8088 CPU on the original IBM PC was accompanied by the 8087 numerical co-processor for people who wanted a bit of floating point speed.

My code accesses memory in an irregular (not terribly vectorizable) pattern and hence is not particularly dependent on memory speed or architecture, although I did notice a bump moving from 32 bit memory buses to 64 bit buses because it is all double precision code. My code performs a fair number of integer operations -- indexing arithmetic and actual integer variable manipulation. My code doesn't do much disk I/O and is embarrassingly parallel the way I tend to run it, so it doesn't stress the network at all. From the above, one might guess that my code will be bottlenecked primarily by CPU clock (true), maybe by CPU architecture (less true), and a bit by memory (true, but in a way that doesn't gain much advantage from memory-optimizing architectures).

YMMV! Your code will likely be different.

For me the "best" cluster tends to be the one where I can buy the most distributed CPU clock for the least amount of money -- a relatively simple design with one really important "knob" that controls the amount of work I can do. For your task, the network might be more important, or the memory bus might be more important than CPU clock per se. How, then, can we determine which architectures do better on the critical parameters once we know roughly what they are?

I have found (like many before me) that standard benchmarks are relatively useless. LINPACK (a common benchmark that returns "MFLOPS" as a measure of floating point speed) is a relatively poor predictor of the performance of my job, because it mixes CPU speed and memory access speed in a particular pattern, and my job doesn't use that pattern. STREAM (another benchmark that focuses on memory bus speed and certain common operations) is also less useful, as it doesn't measure floating point division rates and my code (alas) has a fair amount of unavoidable division in it. SPEC rates are better, but not the overall rates, which average too much.

In the card game of bridge there is an old adage: "A peek is worth a thousand finesses". Don't guess -- measure. Use your own code as a benchmark! This is the only benchmark that really, truly matters. It is often useful to run a bunch of "standard" benchmarks on the system(s) you plan to test because with time and experience you may be able to learn a pattern that is moderately predictive of task completion times, and you will then be miraculously enabled in your search for the best possible deal.

If possible, benchmark and measure without actually having to buy the test systems. Most reputable cluster component vendors will loan you either a test system or access to one via the Internet. Compile your job on the test system and see how fast it runs in an environment as similar as possible to the one you are thinking of building. Vary task parameters, and as much of the environment as you can, and see how the completion time varies.

In a fairly short period of time you should have a good idea of how your task runs on at least the major commodity architectures, and how its performance varies when you vary some of the standard purchase-decision knobs -- CPU clock, size and kind of memory and so forth. The more experience you gain, the more measurements you perform, the greater your ability to engineer a cost-effective cluster with confidence.

While you are prototyping nodes, it is a great idea to also prototype networks, compilers, and possibly parallel libraries, especially if your task is a real parallel task that will use the network intensively. However, space and a philistine editor prevent me from going into any detail on this subject in this issue -- check back in next month and we'll talk a bit about networking before coming back to the cost part of the cost-benefit analysis of your alternative cluster designs and that all-important part of the process, shopping.

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Robert Brown, Ph.D., has written extensively about Linux clusters. You can find his work and much more on his home page.






©2005-2023 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.