We recently received an email from Joe Springer asking a common question about clusters:

Question: I want to run an application whose memory and disk requirements are larger than any one node. Could using a cluster allow me to run an application such that "memory" and disk needs are fulfilled by being distributed...?

The short answer is: It depends for memory, yes for storage, but there is more to it than that ...

The longer answer is a bit more involved. Let's look at memory first. Many applications that run on clusters would never fit on a single node. If a program that uses a large data set is run across a cluster, the data is usually sliced and diced across the nodes, and communication is done via MPI (Message Passing Interface). Note that MPI is essentially a memory copy operation, because each cluster node is like an island: it has its own memory and HDD (sometimes there is no HDD and a network file system is used instead). When a node needs information from another node, that information is sent across the network and placed into memory.
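To make the "sliced and diced" idea concrete, here is a minimal Python sketch of how a data set might be split into contiguous blocks, one per node. It does not use MPI at all: the function names `block_range` and `simulated_sum` are hypothetical, and the loop over ranks only stands in for what separate nodes would do, with the final `sum` playing the role of the message-passing step that combines partial results.

```python
def block_range(n_items, n_ranks, rank):
    """Return the half-open [start, stop) index range owned by `rank`.

    Items are split into contiguous blocks; when the division is uneven,
    the first `n_items % n_ranks` ranks each take one extra item.
    """
    base, extra = divmod(n_items, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def simulated_sum(data, n_ranks):
    """Sum `data` the way a cluster job might: each rank sums only its
    own slice, then the partial sums are combined.

    On a real cluster each rank would hold only its slice in memory and
    the combine step would travel over the network; here everything
    runs in one process purely for illustration.
    """
    partial_sums = []
    for rank in range(n_ranks):
        start, stop = block_range(len(data), n_ranks, rank)
        partial_sums.append(sum(data[start:stop]))
    return sum(partial_sums)


if __name__ == "__main__":
    data = list(range(10))
    for rank in range(4):
        print("rank", rank, "owns", block_range(len(data), 4, rank))
    print("total:", simulated_sum(data, 4))
```

The point of the sketch is that no single rank ever needs the whole data set, only its own slice plus whatever results it exchanges with the others.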

If you want to use a cluster to expand the memory on a node, then things get a bit more involved. Remember that moving data across a network is often an order of magnitude slower than accessing it on a motherboard. For this reason, attempts to create shared memory clusters have met with various levels of success, but there is no general solution. After all, you are still passing memory (messages) between nodes at the lowest level. There is one company, ScaleMP, that provides a software solution for a cluster-wide shared memory model; that is, your program might be able to use more memory than a single node has. I do not have experience with this software, however (and it requires InfiniBand). There are also "memory appliances" like the Violin Scalable Memory that can support up to 10 TBytes of DRAM.
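A rough back-of-the-envelope model shows why the network penalty matters so much for shared memory across nodes. The Python sketch below is only an illustration: the function `relative_slowdown` is hypothetical, the default penalty of 10 simply encodes the "order of magnitude" figure mentioned above, and real applications will behave differently depending on caching, access patterns, and the interconnect.

```python
def relative_slowdown(remote_fraction, remote_penalty=10.0):
    """Estimate average memory access cost relative to all-local access.

    remote_fraction: share of accesses that must cross the network (0 to 1).
    remote_penalty:  assumed cost of a remote access relative to a local
                     one (10.0 stands for "an order of magnitude slower").
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    local_fraction = 1.0 - remote_fraction
    return local_fraction * 1.0 + remote_fraction * remote_penalty


if __name__ == "__main__":
    for fraction in (0.0, 0.1, 0.5, 1.0):
        print(fraction, relative_slowdown(fraction))
```

Under these assumptions, a program that goes to another node for half of its memory accesses would see an average access cost about 5.5 times that of a purely local run, which is why keeping data local to the node that uses it is so important.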

In terms of storage, there are many solutions. A simple solution is some form of attached storage like the JackRabbit from Scalable Informatics. If you are considering parallel I/O (many nodes reading and writing from the same file/filesystem), that gets very application specific. Check out our File Systems pages for more information.

To answer your final question:

Which Linux project or distro would be best for such a situation?

The specifics of your application requirements will determine how you would use a cluster. I'm not sure I have enough information to give a complete answer. You may find it useful to look at our Learning About Clusters section and our Links page.

Finally, maybe some of our readers will supply their comments as well. Thanks for asking!