Martha would be proud. The creative side of the Beowulf mailing list.

The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this article I review some postings to the Beowulf list about Parallel Memory and packing in motherboards. I think the discussion threads presented below provide some very useful information despite the age of the postings. And another good use for cookie sheets!

Parallel Memory

On Oct. 18, 2005, Todd Henderson posted a question asking whether there were any tools, drivers, etc. that allowed distributed nodes to have their collective memory appear as shared memory. In essence, a "PVFS" for memory. Todd also mentioned that he wasn't worried about speed but just memory capacity (his application had memory usage that scaled with the cube of the problem size). He wanted to know about any approaches to distributed shared memory before he embarked on a large MPI porting process.

The first one to reply was Mark Hahn (doesn't he ever sleep?). He said that there were some student projects around to do this kind of thing. But he didn't think it was too worthwhile, "... unless you have some pretty much completely sequential, high-locality access patterns." Mark also pointed out that a memory access on a node is on the order of 60 ns (nanoseconds), while fetching a page of memory over a network would be on the order of 80 microseconds. So the difference is about a factor of 1000. One suggestion that Mark made was to look at Global Arrays.
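To see why Mark cares so much about locality, it helps to plug his two numbers into a simple weighted average. The sketch below is only an illustration: the 60 ns and 80 microsecond figures come from his post, but the function names and the "fraction of accesses that go remote" model are my own simplifying assumptions.

```python
# Figures quoted from Mark's post; everything else is an illustrative assumption.
LOCAL_ACCESS_NS = 60.0            # local memory access, ~60 ns
REMOTE_PAGE_FETCH_NS = 80_000.0   # fetching a page over the network, ~80 microseconds


def average_access_ns(remote_fraction):
    """Average access time if `remote_fraction` of accesses need a remote page."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    local_part = (1.0 - remote_fraction) * LOCAL_ACCESS_NS
    remote_part = remote_fraction * REMOTE_PAGE_FETCH_NS
    return local_part + remote_part


def slowdown(remote_fraction):
    """How many times slower the average access is than a purely local one."""
    return average_access_ns(remote_fraction) / LOCAL_ACCESS_NS


if __name__ == "__main__":
    print(f"raw ratio: {REMOTE_PAGE_FETCH_NS / LOCAL_ACCESS_NS:.0f}x")
    for fraction in (0.0, 0.001, 0.01, 0.1):
        print(f"{fraction:>6.1%} remote -> {slowdown(fraction):7.1f}x slower")
```

Under this crude model, even a small share of remote accesses dominates the average, which is consistent with Mark's point that only highly sequential, high-locality access patterns stand a chance.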

Paulo Afonso Lopes then suggested that Todd take a look at SSI (Single System Image) projects and DSM (Distributed Shared Memory) projects. One that he mentioned is Kerrighed.

Robert Brown then posted, mentioning a project at Duke called "Trapeze", but he wasn't sure if the project was still around or not. He then went on with an idea: let the node where the code is running start swapping, but put the swap space on NFS and, on the NFS server, create a ramdisk rather than use disks. Robert thought this would be an interesting experiment to try. Of course your swap space would be limited to the largest ramdisk on the NFS server node (about 64 GB for current commodity hardware). So if you combined this with the memory of the node where the code is running, you could get about 128 GB of usable memory+swap. If you need to go larger, you could create swap files on various nodes using the same approach so that the node where the code is running could swap to a number of swap files.

Bogdan Costescu responded to Robert's post about swapping over NFS to a ramdisk by saying that there had been a discussion on some of the heavy duty Linux Kernel mailing lists about swapping over NBD (Network Block Device) or iSCSI. He discussed a situation where you could get a deadlock (you need memory for the transfer, so what if you need to swap to do the swap ...). So he suggested that swapping over the network wasn't a good idea at the time (remember this is the end of 2005).

Robert replied to Bogdan's post with some discussion about the details of creating a ramdisk. He then came up with a heuristic about the largest possible ramdisk size (a corner case really). But this corner case showed that you really don't get more than about 50% of the system memory for creating a ramdisk (if you want to make sure this corner case can never happen). His finishing comment was that it's probably better to just write a simple parallel application with routines that do the data management for you (Global Arrays, mentioned earlier, does this).
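Robert's numbers can be combined into a rough capacity estimate. The sketch below is a back-of-the-envelope helper of my own, not something posted to the list: it assumes each donor node can give up at most about 50% of its memory to a ramdisk (Robert's corner-case heuristic) and simply adds that to the RAM of the node running the code. It ignores kernel overhead and the deadlock problem Bogdan raised.

```python
def usable_memory_gb(local_ram_gb, donor_ram_gb, ramdisk_fraction=0.5):
    """Estimate local RAM plus network swap backed by remote ramdisks.

    local_ram_gb     -- RAM on the node running the code
    donor_ram_gb     -- list of RAM sizes for nodes exporting a ramdisk
    ramdisk_fraction -- share of each donor's RAM given to the ramdisk
                        (0.5 follows Robert's heuristic; an assumption here)
    """
    if not 0.0 <= ramdisk_fraction <= 1.0:
        raise ValueError("ramdisk_fraction must be between 0 and 1")
    swap_gb = sum(ram * ramdisk_fraction for ram in donor_ram_gb)
    return local_ram_gb + swap_gb


if __name__ == "__main__":
    # Hypothetical cluster: a 64 GB compute node plus two 64 GB donor nodes.
    print(usable_memory_gb(64, [64, 64]), "GB of memory+swap")
```

With the conservative 50% limit, this hypothetical setup needs two 64 GB donor nodes to reach the same 128 GB of memory+swap.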

Ashley Pittman wrote that in the 2.2 version of the kernel there was the ability to swap over the network. It used sockets to communicate to a remote server. The whole code was in user-space so it was probably simpler than using NFS. Michael Will chimed in that he used to swap over a 10/100 network to a remote ramdisk via NBD. He was using the swap to load what were, at that time, large GIMP images. He said, "Qualitative statement: It seemed faster than using the old IDE drive for swapping, maybe because the image data came from the IDE drive as well and so the extra 10MB/s channel via NBD was worth it."

Randy Wright wrote with a link to a paper he heard presented at Cluster 2005. He said that they had a large quicksort running 1.7 times slower than doing it in local memory only, but up to 21 times faster than using a local disk. He said that on a good day, it worked, but it was fairly flaky.

Richard Walsh also wrote with a suggestion to look at the UPC project for C codes or the Co-Array Fortran project for Fortran codes. These are languages that allow you to use memory from other nodes and/or to thread the application. He said that there were some libraries for common interconnects, allowing you to use memory on other nodes. He then went on to talk about some details of both UPC and Co-Array Fortran.

I like this discussion because once you get into clusters you eventually ask the same question Todd asked - Isn't there a good way to do distributed shared memory on clusters? While there were some good suggestions, I recommend using Global Arrays. It allows you to grab memory from distributed nodes to use locally and it handles everything very simply.

Cluster of Motherboards

Once you get into building your own clusters you also ask the question, what about building a cluster just using motherboards and no cases? Well, Fernando asked this question on Nov. 4, 2005.

Of course, the universal answer was - yes. Glen Gardner wrote with a link to a cluster he built using Mini-ITX boards. He said that the latest version of the cluster had 18 Mini-ITX motherboards in 12" of space, and that you could get up to 18 Mini-ITX motherboards in a rack space 19" wide by 12" high by 26" deep.

It looks like Robert Brown wasn't immune to posting about gadgetry (he's a DIY kind of guy!). He said that using just motherboards has been done a number of times before. He did mention a couple of ideas people have tried (e.g. directly mounting motherboards to shelving), and he also told Fernando to be careful when dealing with electricity. Robert then mentioned that the list's EE guru, Jim Lux, should comment on the subject.

Being the good cluster-er that he is, Jim promptly wrote to the list with some good comments. His opening comments were,

Sure it's possible. Your problems are going to be power, cooling, and structures (assuming you're not in an environment where people care about electrical codes, RF interference, etc.)

He then went on to explain each of these in a little detail with some warnings (e.g. watch for ground loops).

Jim's comments led Robert to kick into "Watch Out There" mode and offer some warnings about creating dangerous conditions (he didn't mention anything about running with scissors though). But he did make one comment that I've seen in the past several times,

Yeah, I think that Jim's observation that you should think carefully about the diminishing returns of building a free-form caseless cluster is very apropos -- you'll save a bit of money on space and cases -- maybe -- at the expense of more hands on work building the cluster and at the risk of having to resolve problems with shielding and so on.

He did offer some very good suggestions though. He ended with this comment.

If you go anywhere beyond this, I'd REALLY recommend that you only proceed if you completely understand electricity and electrical wiring and know what a ground loop IS and so on.

A fitting comment if I've ever read one. This should be posted in every electrical section in Home Depot and Lowe's. But then again the Darwin Awards just wouldn't be the same.

On a more serious note, Josip Loncaric wrote that it's possible to find cheap cases for about $20 and these can save you some work, but not necessarily shelf space.

Mark Hahn echoed these comments by explaining the lure of inexpensive hardware to make your cluster. He gave an interesting example of a 1,500 CPU cluster where you allow some money for the CPUs, motherboard, chassis, power supply, and memory, and found that the sum was about 20% of the cost of the real thing (this is tempting, isn't it?). He did mention that Google doesn't use cases. Rather, they have bare motherboards on trays, perhaps much like Fernando wants to do. Mark finished his comments with the following.

In summary, subtracting the chassis sounds smart, but really only makes sense if you follow through with the rest - cheap motherboard, cheap cpu, minimal cpu, minimal network, cheap labor, workload that is embarrassingly parallel, and not long-running...

In short, you get what you pay for. (I've been burned on cheap memory several times in the past - never again).

H. Vidal then made a quick post that I think is interesting.

What's remarkable to consider is that one of the very largest (if not the largest?) data cluster systems in the world is a bare motherboard system, strapped together with lots of simple screws and Velcro.

That's Google, in case you did not know. I was shocked to see this when I saw a presentation recently by one of the Google guns here in NYC (actually, the inventor of Froogle). He showed us pix of a bunch of nodes essentially sitting on some insulating material, screwed to a simple frame-style chassis with careful consideration of grounding and power. His point was to emphasize that google considers lots of very cheap, very simple nodes key to their growth, and cases are 'right out' when you go to this scale (he would not share the exact N of nodes with us, but alluded to something on the order of 100K, at that time, and this is *always* growing).

I had heard about this in 2005. I think it's fairly common knowledge now. But it's still very interesting.

After a brief discussion about Google, Jim Lux came back with some interesting back-of-the-envelope calculations. He was interested in the amount of time it takes to drill holes in a piece of sheet metal or aluminum as a base plate for a motherboard. Assuming that you could do about 12 plates at the same time, he ended up with an estimate that it takes about 30 minutes to drill and screw in a single motherboard. If you guess about $10/hour in labor costs and add the price of materials, that cheap $20 case looks pretty attractive. Jim then finished with a true "Tool Time" suggestion.

There IS a faster way, for a bare system approach. Use double sided sticky foam tape. Plenty strong, it will last 2 or 3 years.
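Jim's drilling estimate is easy to check with a few lines of Python. The sketch below uses only the numbers quoted in this summary (30 minutes per board, $10/hour in labor, a $20 case). The materials cost is a hypothetical parameter, since no figure for it is given here, and the comparison ignores whatever time it takes to install a board in a regular case.

```python
CASE_PRICE = 20.0  # the cheap case Josip mentioned, in dollars


def bare_mount_cost(minutes_per_board=30.0, hourly_rate=10.0, materials=0.0):
    """Cost per board of a drilled base plate: labor plus (assumed) materials."""
    labor = minutes_per_board / 60.0 * hourly_rate
    return labor + materials


def breakeven_materials(minutes_per_board=30.0, hourly_rate=10.0,
                        case_price=CASE_PRICE):
    """Materials budget per board at which the bare mount costs as much as a case."""
    labor = minutes_per_board / 60.0 * hourly_rate
    return case_price - labor


if __name__ == "__main__":
    print(f"labor only:           ${bare_mount_cost():.2f} per board")
    print(f"break-even materials: ${breakeven_materials():.2f} per board")
```

Labor alone eats a quarter of the price of the case, so once the plate, standoffs, and screws are added there is not much left to save, which is presumably why Jim reached for the foam tape.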

Then Doug Eadline weighed in and strongly recommended using a regular case. He mentioned our Kronos Project to build the fastest system we could for $2,500. At that time, we found a well engineered small case for about $40. Now you can find them for $30 or less. Doug followed up these comments with a slightly philosophical comment.

One of things I have learned when building clusters is to take advantage of mass produced anything (mostly hardware). Looking inside a diskless node, I often get the urge to build a better enclosure, but then realize that the cost and time to fuss around with everything is not worth it. As a hobby, sure, it might be fun, but my interest is software, a "good enough" solution that costs much less in both time and money always seems to win the day for me. YMMV

I think this is well said (although I can think of situations where a custom case is warranted). But, the siren song of commodity pricing is very hard to resist.

Right after Doug, Andrew Piskorski wrote that his favorite custom packaging scheme du jour was cookie sheets! He just uses basic cookie sheets and mounts the systems to the sheets. He said that there are ready made racks for these sheets. It's a long posting with lots of details, but he talked about how many micro-ATX motherboards he could get in a single rack (up to about 78) and how the density was more efficient than using standard cases. Since my wife's family is in the doughnut business, I think this is a great idea! This is really taking advantage of commodity components.

By the way, Andrew posted some links to bare bones motherboard systems. All of the links are still active. My favorite is the zBox.

These types of discussions are always fun. You get to see the creative side of various people come out and the contrasts are always fun as well (I still love the cookie sheet idea). Some of the ideas are worthwhile I think, but in many cases, it may be more effective to just use micro-ATX cases with micro-ATX boards.

Dr. Jeff Layton hopes to someday have a 20 TB file system in his home computer (donations gladly accepted) so he can store all of the postings to all of the mailing lists he monitors. He can sometimes be found lounging at a nearby Fry's, dreaming of hardware and drinking coffee (but never during working hours).