Martha would be proud. The creative side of the Beowulf mailing list.

The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this article I review some postings to the Beowulf list about Parallel Memory and packing in motherboards. I think the discussion threads presented below provide some very useful information despite the age of the postings. And another good use for cookie sheets!

Parallel Memory

On Oct. 18, 2005, Todd Henderson asked whether there were any tools, drivers, etc. that allowed distributed nodes to have their collective memory appear as shared memory: in essence, a "PVFS" for memory. Todd also mentioned that he wasn't worried about speed, just memory capacity (his application's memory usage scaled with the cube of the problem size). He wanted to know about any approaches to distributed shared memory before he embarked on a large MPI porting process.
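To get a feel for why cubic scaling makes capacity the bottleneck, here is a small Python sketch. The 8 bytes per cell and the function names are assumptions for illustration only; Todd's post did not give the constant factor for his application.

```python
def memory_bytes(n, bytes_per_cell=8):
    """Memory needed for problem size n, assuming usage grows as n**3.

    bytes_per_cell=8 is a made-up constant (one double per cell).
    """
    return bytes_per_cell * n ** 3


def largest_problem(capacity_bytes, bytes_per_cell=8):
    """Largest n whose n**3 cells fit in capacity_bytes (simple integer search)."""
    n = 0
    while memory_bytes(n + 1, bytes_per_cell) <= capacity_bytes:
        n += 1
    return n


if __name__ == "__main__":
    GIB = 2 ** 30
    for capacity_gib in (64, 128, 512):
        n = largest_problem(capacity_gib * GIB)
        print(f"{capacity_gib:4d} GiB -> largest n = {n}")
```

Under these assumptions, doubling the problem size costs eight times the memory, while doubling the memory from 64 GiB to 128 GiB only raises the largest problem size from 2048 to 2580, about 26%.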

The first one to reply was Mark Hahn (doesn't he ever sleep?). He said that there were some student projects around to do this kind of thing, but he didn't think it was too worthwhile, "... unless you have some pretty much completely sequential, high-locality access patterns." Mark also pointed out that a memory access on a node is on the order of 60 ns (nanoseconds), while fetching a page of memory over a network would be on the order of 80 microseconds. So the difference is roughly a factor of 1,000. One suggestion that Mark made was to look at Global Arrays.
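Mark's numbers can be turned into a rough back-of-the-envelope estimate. The Python sketch below uses only the two figures he quoted; the function name and the remote-access fractions are invented for illustration, and the model ignores caching, prefetching, and the fact that one page fetch brings in many words.

```python
LOCAL_NS = 60.0        # local memory access, from Mark's post
REMOTE_NS = 80_000.0   # remote page fetch (80 microseconds), from Mark's post


def average_access_ns(remote_fraction):
    """Average access time when remote_fraction of accesses go over the network."""
    return (1.0 - remote_fraction) * LOCAL_NS + remote_fraction * REMOTE_NS


if __name__ == "__main__":
    print(f"remote/local ratio: {REMOTE_NS / LOCAL_NS:.0f}x")
    for frac in (0.0, 0.001, 0.01, 0.1):
        avg = average_access_ns(frac)
        print(f"{frac:6.1%} remote -> {avg:9.1f} ns ({avg / LOCAL_NS:6.1f}x local)")
```

In this simple model, even one remote access in a thousand more than doubles the average access time, which is why Mark stressed sequential, high-locality access patterns.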

Paulo Afonso Lopes then suggested that Todd take a look at SSI (Single System Image) projects and DSM (Distributed Shared Memory) projects. One that he mentioned is Kerrighed.

Robert Brown then posted, mentioning a project at Duke called "Trapeze," though he wasn't sure if the project was still around. He then floated an idea: let the node where the code is running start swapping, but put the swap space on NFS and, on the NFS server, back it with a ramdisk rather than disks. Robert thought this would be an interesting experiment to try. Of course, your swap space would be limited to the largest ramdisk on the NFS server node (about 64 GB for current commodity hardware). So if you combined this with the memory of the node where the code is running, you could get about 128 GB of usable memory+swap. If you need to go larger, you could create swap files on various nodes using the same approach, so that the node where the code is running could swap to a number of swap files.

Bogdan Costescu responded to Robert's post about swapping over NFS to a ramdisk by saying that there had been a discussion on some of the heavy-duty Linux kernel mailing lists about swapping over NBD (Network Block Device) or iSCSI. He described a situation where you could get a deadlock: you need memory for the transfer, so what if you need to swap to do the swap? He therefore suggested that swapping over the network wasn't a good idea at the time (remember this is the end of 2005).

Robert replied to Bogdan's post with some discussion about the details of creating a ramdisk. He then came up with a heuristic about the largest possible ramdisk size (a corner case really). But this corner case showed that you really don't get more than about 50% of the system memory for creating a ramdisk (if you want to make sure this corner case can never happen). His finishing comment was that it's probably better to just write a simple parallel application with routines that do the data management for you (Global Arrays, mentioned earlier, does this).
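The capacity arithmetic in this part of the thread can be sketched in a few lines of Python. The function names and the example node sizes are hypothetical; the 0.5 default simply encodes Robert's rough "about 50%" heuristic and is not a measured limit.

```python
import math


def usable_memory_gb(local_gb, server_ram_gb, ramdisk_fraction=0.5):
    """Local RAM plus the ramdisk-backed swap exported by each server node.

    server_ram_gb is a list of RAM sizes, one entry per server node.
    """
    return local_gb + sum(ram * ramdisk_fraction for ram in server_ram_gb)


def servers_needed(target_gb, local_gb, server_gb, ramdisk_fraction=0.5):
    """Identical server nodes needed to reach target_gb of memory+swap."""
    shortfall = target_gb - local_gb
    if shortfall <= 0:
        return 0
    return math.ceil(shortfall / (server_gb * ramdisk_fraction))


if __name__ == "__main__":
    # One 64 GB compute node swapping to one 64 GB server node.
    print(usable_memory_gb(64, [64], ramdisk_fraction=1.0))  # whole server RAM as ramdisk
    print(usable_memory_gb(64, [64]))                        # with the ~50% heuristic
    print(servers_needed(256, 64, 64))
```

With the heuristic applied, a single 64 GB server adds only 32 GB of swap, so the 128 GB figure from the earlier post drops to 96 GB, and reaching 256 GB would take six such servers in this sketch.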

Ashley Pittman wrote that in the 2.2 version of the kernel there was the ability to swap over the network. It used sockets to communicate with a remote server. The whole code was in user-space, so it was probably simpler than using NFS. Michael Will chimed in that he used to swap over a 10/100 network to a remote ramdisk via NBD. He was using the swap to load what were, at that time, large GIMP images. He said, "Qualitative statement: It seemed faster than using the old IDE drive for swapping, maybe because the image data came from the IDE drive as well and so the extra 10MB/s channel via NBD was worth it."

Randy Wright wrote with a link to a paper he heard presented at Cluster 2005. He said that the authors had a large quicksort running 1.7 times slower than it did in local memory only, but up to 21 times faster than it did using a local disk. He said that on a good day it worked, but it was fairly flaky.

Richard Walsh also wrote with a suggestion to look at the UPC project for C codes or the Co-Array Fortran project for Fortran codes. These are languages that allow you to use memory from other nodes and/or to thread the application. He said that there were some libraries for common interconnects, allowing you to use memory on other nodes. He then went on to talk about some details of both UPC and Co-Array Fortran.

I like this discussion because once you get into clusters you eventually ask the same question Todd asked - Isn't there a good way to do distributed shared memory on clusters? While there were some good suggestions, I recommend using Global Arrays. It allows you to grab memory from distributed nodes to use locally and it handles everything very simply.
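To illustrate the general idea behind this kind of library, here is a toy Python sketch of a block-distributed array. It is not the Global Arrays API: the class and method names are invented, and plain Python lists stand in for memory on separate nodes. It only shows how a global index can be mapped to an owning node and a local offset, so the caller never handles that bookkeeping.

```python
class ToyGlobalArray:
    """Toy 1-D array split into contiguous blocks across pretend 'nodes'."""

    def __init__(self, length, num_nodes):
        if length < 0 or num_nodes < 1:
            raise ValueError("length must be >= 0 and num_nodes >= 1")
        self.length = length
        # Ceiling division: every node except possibly the last few gets a full block.
        self.block = -(-length // num_nodes)
        # Each inner list stands in for the memory held by one node.
        self.nodes = [
            [0.0] * max(0, min(self.block, length - rank * self.block))
            for rank in range(num_nodes)
        ]

    def locate(self, index):
        """Map a global index to (owner rank, local offset)."""
        if not 0 <= index < self.length:
            raise IndexError(index)
        return divmod(index, self.block)

    def put(self, index, value):
        rank, offset = self.locate(index)
        self.nodes[rank][offset] = value  # a real library would do a remote write here

    def get(self, index):
        rank, offset = self.locate(index)
        return self.nodes[rank][offset]   # a real library would do a remote read here


if __name__ == "__main__":
    ga = ToyGlobalArray(length=10, num_nodes=4)
    ga.put(7, 2.5)
    print(ga.locate(7), ga.get(7))
    print([len(chunk) for chunk in ga.nodes])
```

In a real distributed setting the `put` and `get` calls would turn into network transfers, so the latency concerns raised earlier in the thread still apply; the library only hides the bookkeeping.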

Cluster of Motherboards

Once you get into building your own clusters you also ask the question, what about building a cluster just using motherboards and no cases? Well, Fernando asked this question on Nov. 4, 2005.

Of course, the universal answer was - yes. Glen Gardner wrote with a link to a cluster he built using Mini-ITX boards. He said that the latest version of the cluster had 18 Mini-ITX motherboards in 12" of space, and that you could fit up to 18 Mini-ITX motherboards in a rack space 19" wide by 12" high by 26" deep.

It looks like Robert Brown wasn't immune to posting about gadgetry (he's a DIY kind of guy!). He said that using just motherboards has been done a number of times before. He mentioned a couple of ideas people have tried (e.g. directly mounting motherboards to shelving), and he also told Fernando to be careful when dealing with electricity. Robert then mentioned that the list's EE guru, Jim Lux, should comment on the subject.

This work is licensed under CC BY-NC-SA 4.0

©2005-2023 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.