Martha would be proud. The creative
side of the Beowulf mailing list.
The Beowulf mailing list provides detailed discussions about
issues concerning Linux HPC clusters.
In this article I review some postings to the
Beowulf list about
Parallel Memory and packing in motherboards.
I think the discussion threads presented below provide some very
useful information despite the age
of the postings. And another good use for cookie sheets!
Parallel Memory
On Oct. 18, 2005, Todd Henderson
posted
a question about whether there were any tools, drivers, etc. that allow distributed
nodes to have their collective memory appear as shared memory: in essence,
a "PVFS" for memory. Todd also mentioned that he wasn't worried about
speed but just memory capacity (his application had memory usage that
scaled with the cube of the problem size). He wanted to know about any
approaches to distributed shared memory before he embarked on a large
MPI porting process.
The first one to reply was Mark Hahn (doesn't he ever sleep?). He
said
that there were some student projects around to do this kind of thing.
But he didn't think it was too worthwhile, "... unless you have some
pretty much completely sequential, high-locality access patterns."
Mark also pointed out that a memory access on a node is on the order
of 60 ns (nano-seconds), and to fetch a page of memory over a network
would be on the order of 80 micro-seconds. So the difference is about
a factor of 1000.
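As a rough check on Mark's numbers, the arithmetic can be sketched in a few lines of Python. The 60 ns and 80 microsecond figures are the ones quoted from his post; the function name `slowdown` is just an illustrative helper, not something from the thread.

```python
# Figures quoted from Mark Hahn's post (orders of magnitude, not measurements).
LOCAL_ACCESS_S = 60e-9   # ~60 ns for a memory access on the local node
REMOTE_PAGE_S = 80e-6    # ~80 microseconds to fetch a page over the network


def slowdown(local_s: float, remote_s: float) -> float:
    """Return how many times longer the remote fetch takes than a local access."""
    return remote_s / local_s


ratio = slowdown(LOCAL_ACCESS_S, REMOTE_PAGE_S)
print(f"A remote page fetch costs roughly {ratio:.0f}x a local access")
```

The exact ratio from these two figures is a bit over 1300, which is consistent with the "about a factor of 1000" estimate.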
One suggestion that Mark made was to look at
Global Arrays.
Paulo Afonso Lopes then
suggested
that Todd take a look at SSI (Single System Image) projects and DSM
(Distributed Shared Memory) projects. One that he mentioned is
Kerrighed.
Robert Brown then
posted
with a mention of a project at Duke called "Trapeze", but he wasn't
sure if the project was still around or not. He then went on with an
idea: let the node where the code is running start swapping, but
put the swap space on NFS and, on the NFS server, create a ramdisk
rather than use disks. Robert thought this would be an interesting
experiment to try. Of course your swap space would be
limited to the largest ramdisk on the NFS server node (about 64GB
for current commodity hardware). So if you combined this with the
memory of the node where the code is running, you could get about
128 GB of usable memory+swap. If you need to go larger you could
create swap files on various nodes using the same approach so that
the node where the code is running could swap to a number of swap
files.
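Robert's capacity estimate can be written out as a small Python sketch. It assumes, as the 128 GB figure implies, that the compute node itself has about 64 GB of RAM, and it ignores memory the kernel and NFS server need for themselves. The function name `usable_memory_gb` is hypothetical.

```python
from typing import Sequence


def usable_memory_gb(local_ram_gb: float, remote_ramdisks_gb: Sequence[float]) -> float:
    """Local RAM plus the total size of all remote ramdisk-backed swap files."""
    return local_ram_gb + sum(remote_ramdisks_gb)


# One NFS server exporting a 64 GB ramdisk, as in Robert's example.
single_server = usable_memory_gb(64, [64])

# Assumption for illustration: three such servers, each holding one swap file.
three_servers = usable_memory_gb(64, [64, 64, 64])

print(single_server, three_servers)
```

The point of the multi-server case is only that capacity grows linearly with the number of nodes donating a ramdisk; it says nothing about how well such a setup would perform.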
Bogdan Costescu
responded
to Robert's post about swapping over NFS to
a ramdisk by saying that there had been a discussion on some of
the heavy duty Linux Kernel mailing lists about swapping over NBD (Network
Block Device) or iSCSI. He discussed a situation where you could
get a deadlock (you need memory for the transfer, so what if you
need to swap to do the swap?). So he suggested that swapping over the network
wasn't a good idea at the time (remember, this was the end of 2005).
Robert
replied
to Bogdan's post with some discussion about the details of creating
a ramdisk. He then came up with a heuristic about the largest possible
ramdisk size (a corner case really). But this corner case showed that
you really don't get more than about 50% of the system memory for
creating a ramdisk (if you want to make sure this corner case can
never happen). His finishing comment was that it's probably better
to just write a simple parallel application with routines that do
the data management for you (Global Arrays, mentioned earlier, does
this).
Ashley Pittman
wrote
that in the 2.2 version of the kernel there was the ability to
swap over the network. It used sockets to communicate to a remote
server. The whole code was in user-space so it was probably
simpler than using NFS. Michael Will
chimed
in that he used to swap over a 10/100 network to a remote ramdisk
via NBD. He was using the swap to load what were, at that time, large GIMP
images. He said, "Qualitative statement: It seemed faster
than using the old IDE drive for swapping,
maybe because the image data came from the IDE drive as well and
so the extra 10MB/s channel via NBD was worth it."
Randy Wright
wrote
with a
link to a paper he
heard presented at Cluster 2005. He said that they had a large quicksort
running only 1.7 times slower than doing it in local memory,
but up to 21 times faster than using a local disk. He said that
on a good day, it worked, but it was fairly flaky.
Richard Walsh also
wrote
with a suggestion to look at the
UPC project for C codes or the
Co-Array Fortran project for
Fortran codes. These are languages that allow you to use memory from
other nodes and/or to thread the application. He said that there were
some libraries for common interconnects, allowing you to use
memory on other nodes. He then went on to talk about some details of
both UPC and Co-Array Fortran.
I like this discussion because once you get into clusters you eventually
ask the same question Todd asked - Isn't there a good way to do
distributed shared memory on clusters? While there were some good
suggestions, I recommend using Global Arrays. It allows you to
grab memory from distributed nodes to use locally and it handles
everything very simply.
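To make the idea concrete, here is a minimal single-process sketch of the bookkeeping that this kind of library hides: mapping a global array index to an owning node and a local offset. This is not the Global Arrays API. The class `BlockDistributedArray` and its methods are hypothetical, and the "nodes" are simulated with plain Python lists rather than real remote memory.

```python
class BlockDistributedArray:
    """Toy 1-D array split into contiguous blocks, one block per simulated node."""

    def __init__(self, length: int, num_nodes: int) -> None:
        if length <= 0 or num_nodes <= 0:
            raise ValueError("length and num_nodes must be positive")
        self.length = length
        # Ceiling division so that every element has an owning node.
        self.block = -(-length // num_nodes)
        # Each inner list stands in for memory held on a different node.
        self.nodes = [
            [0.0] * max(0, min(self.block, length - n * self.block))
            for n in range(num_nodes)
        ]

    def locate(self, i: int) -> tuple:
        """Translate a global index into (node number, offset within that node)."""
        if not 0 <= i < self.length:
            raise IndexError(i)
        return divmod(i, self.block)

    def put(self, i: int, value: float) -> None:
        node, offset = self.locate(i)
        self.nodes[node][offset] = value

    def get(self, i: int) -> float:
        node, offset = self.locate(i)
        return self.nodes[node][offset]


arr = BlockDistributedArray(length=10, num_nodes=4)
arr.put(7, 3.5)
print(arr.locate(7), arr.get(7))
```

In a real library the `put` and `get` calls would turn into network transfers when the owning node is not the local one, which is where the latency numbers discussed earlier come into play.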
Cluster of Motherboards
Once you get into building your own clusters you also ask the question,
what about building a cluster just using motherboards and no cases?
Well, Fernando
asked
this question on Nov. 4, 2005.
Of course, the universal answer was yes. Glen Gardner
wrote
with a
link to a cluster
he built using Mini-ITX boards. He also said that the latest version
of the cluster had 18 Mini-ITX motherboards in a 12" space, and that
you could get up to 18 Mini-ITX motherboards in a 19" wide by 12" high
by 26" deep rack space.
It looks like Robert Brown wasn't immune to
posting
about gadgetry (he's a DIY kind of guy!). He said that using just
motherboards has been done a number of times before. He did mention
a couple of ideas people have tried (e.g. directly mounting motherboards
to shelving), and he also told Fernando to be careful when dealing
with electricity. Robert then mentioned that the list's EE guru, Jim Lux,
should comment on the subject.