[Beowulf] Wireless clusters (Bridging 802.11a)
Robert G. Brown
rgb at phy.duke.edu
Thu Jan 15 08:31:42 EST 2004
On Wed, 14 Jan 2004, Jim Lux wrote:
> I'm building a cluster using wireless interconnects, and it is getting to
> be a royal pain to figure out what sort of adapters/interfaces to use. I'm
> casting some bread upon the waters here to see if anyone has fooled with
> this sort of thing.
> I want to go 802.11a because of spectrum usage, not because of data
> rate. If I go to 5GHz, I don't have to worry about accidentally connecting
> to JPL's 802.11b (2.4GHz) network infrastructure, for instance, which will
> keep netops off my back.
I don't think I can help you with anything below, but I can point out an
interesting and so far unexplained anomaly I've encountered with
802.11b. While working on a PVM article for CWM I often naturally end
up sitting in my den downstairs working on my laptop where I can keep an
eye on our neurotic and fey border collie puppy so he doesn't eat a
table or rip a hole in a wall or something. That's fine, I've got
switched 100BT systems a.k.a. "nodes" all over the house and when they
aren't running Diablo II on top of winex they even have respectable
I can run PVM on the laptop just fine. I can login with ssh
transparently (no password) to all the systems in the household just
fine. I can login to one of those systems, crank PVM on it, and add the
laptop (over the wireless connection) just fine.
BUT, when I run PVM on the laptop and try to add the SAME host that
worked fine the other way, I get:
rgb at lilith|T:125>pvm
pvm> add lucifer
libpvm [t40001]: pvm_addhosts(): Pvmd system error
This is both not terribly informative and frustrating. PVM has an
internal autodiagnostic mode these days that warns you of e.g. leftover
lockfiles, inability to ssh and so forth -- this doesn't even make it
that far. Running pvm -d0xffff (to get a debug trace) isn't much more
helpful although it might be to somebody who really knows the PVM
internals -- all the daemon's progress in setting up the connection
proceeds just as it ought up to the final
libpvm [t40001] mxinput() pkt src t80000000 len 36 ff 3
libpvm [t40001] mxinput() src t80000000 route t80000000 ctx 524286 tag TM_ADDHOST len 4
libpvm [t40001] mesg_input() src t80000000 ctx 524286 tag TM_ADDHOST len
libpvm [t40001] mxfer() txfp 0 gotem 1 tt_rxf 0
libpvm [t40001] msendrecv() from t80000000 tag TM_ADDHOST
libpvm [t40001]: pvm_addhosts(): Pvmd system error
and then it fails where in an identical trace going the other way it
succeeds, no indication why in either case.
At a guess it is either:
a) I'm doing something stupid and silly somewhere. Always a
possibility, although I >>have<< been using PVM since 1993 and generally
can make it behave, especially since they added all the nifty
diagnostics to it.
b) Something "odd" about the wireless networking stack or the PCMCIA
bus where (perhaps) PVM does something at a low level for
speed/efficiency that just won't work with wireless.
The latter is a bit disconcerting to me -- one would think that all
transactions occur on top of virtual devices that keep wireless from
being anything but "a network interface", although there COULD be either
timeouts or packet reliability problems, of course. And the wireless
network on the laptop works flawlessly otherwise as long as signal
strength is adequate, and of course PVM works TO the wireless system but
not FROM the wireless system.
Anyway, I offer this up not because I want or need help with it (it
would be silly to use my wireless laptop as a pvm master node in
anything BUT writing an article or an EP task of some sort and there are
other ways to accomplish both of these that are a lot less work than
hacking the PVM sources to debug the failure) but because YOU might
encounter similar problems and have to invest the extra time to debug
Moral of the story: don't assume that just because a wireless interface has
device support under linux and works perfectly with simple network tests
it will necessarily work with PVM, MPI, or other tools that access the
network stack at a low level for "efficiency" reasons and presume e.g.
ethernet.
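If you do hit this, one cheap thing to rule out before opening the PVM
sources is name resolution. This is an assumption on my part, not
something the trace above shows: as I understand it the slave pvmd has
to talk back to the master at whatever address the master advertises, so
if the master's own hostname resolves to a loopback address (some
installers put the machine name on the 127.0.0.1 line of /etc/hosts),
adding hosts could work in one direction and fail in the other. A
minimal Python sketch of that check follows. The function names are made
up for illustration; run it on both ends and compare the answers.

```python
import ipaddress
import socket
import sys


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def is_loopback(address):
    """True if address lies in 127.0.0.0/8."""
    return ipaddress.ip_address(address).is_loopback


def describe(hostname):
    """One-line report on how hostname resolves on this machine."""
    try:
        addresses = resolve_ipv4(hostname)
    except socket.gaierror as err:
        return f"{hostname}: does not resolve ({err})"
    bad = [a for a in addresses if is_loopback(a)]
    if bad:
        return f"{hostname}: resolves to loopback {bad}; other hosts cannot reach that"
    return f"{hostname}: resolves to {addresses}"


if __name__ == "__main__":
    # Check this machine's own name plus any hosts named on the command line.
    for name in [socket.gethostname()] + sys.argv[1:]:
        print(describe(name))
```

A clean result here does not prove PVM will work; it only removes one
of the "stupid and silly" candidates from guess a) above.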
> The processors in the cluster are Mini-ITX widgets with Compact Flash (CF)
> drives, and, while booting off the net might be nice, I'm going to boot off
> CF for now.
> Here are some issues that have come up:
> 1) there's two ways to get the node to talk to the net:
> via the ethernet connector and an external bridge
> via a PCI card with a 802.11a adapter (most likely a 802.11a/b/g, since
> that's what's available) (D-Link, Netgear, and Linksys all have them)
> In all cases, I'd have an "access point" of some sort to talk to my head
> node/NFS, etc.
> Ideally, I'd like to set up the network in "ad-hoc" mode, where any node
> can talk to any other directly, without having to be routed through an
> access point. In "infrastructure" mode, many clients can talk to the access
> point, but clients cannot talk to clients, except by going through the
> access point, creating a single failure point (probably not important for
> my initial work, but philosophically "bad").
> 2) It's unclear whether there are Linux drivers for any of the PCI based
> 802.11a cards. The mfrs don't seem to want to fool with that market, and,
> chipset mfrs are quite reticent about releasing the intellectual property
> needed to do a good job writing the drivers.
> 3) I could go with external bridging adapters (perhaps with integrated
> routers, in case I add another ethernet device to the node, or, to hook up
> a sniffer). Here the problem is that not all mfrs appear to support
> bridging, at least with more than 2 (i.e. they can set up a point to point
> bridge, but not a many to many bridge)
> From some reading of the various manuals, it appears that some "access
> points" can be set up to appear to be a "client" in infrastructure mode,
> however that's a problem philosophically (and in terms of latency).
> So, does anyone know which "access points" (i.e. a 802.11x to ethernet box)
> can look like a client in an ad-hoc network?
> (possible candidates: Netgear FWAG114, D-link DWL-774, DWL-7000AP, Linksys
> WAP54A*, WAP51AB, WRT51AB. *Linksys says that the WAP54A doesn't do bridging)
> Part 2 of the quest.................
> I'm also looking for suggestions on performance and timing tests to run on
> this cluster once it's assembled. Aside from the usual network throughput
> (benchmark program recommendations requested), I'm interested in techniques
> to look at latency, latency distribution, and dropped packets/retries,
> since I suspect that wireless networks will have very "unusual" statistics
> compared to the usual cluster interconnects.
> And, bearing in mind our recent lengthy thread on timing and clocks, you
> can be sure that I will do those sorts of tests too.
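For the latency distribution and dropped packet part, a small UDP echo
probe gets you a long way before reaching for a full benchmark suite.
What follows is a minimal Python sketch, not a calibrated tool: the
names (echo_server, probe, summarize), the 64 byte payload and the 0.5
second timeout are all arbitrary choices for illustration. It sends
numbered datagrams, times each round trip, and counts the ones that
never come back. Bear in mind that 802.11 retransmits unicast frames at
the link layer, so retries will mostly show up as a long tail in the
round-trip times rather than as lost datagrams, which is why the sketch
reports percentiles and not just a mean.

```python
import math
import socket
import statistics
import struct
import threading
import time


def echo_server(sock, stop):
    """Echo every datagram back to its sender until stop is set."""
    sock.settimeout(0.1)
    while not stop.is_set():
        try:
            data, sender = sock.recvfrom(2048)
        except socket.timeout:
            continue
        sock.sendto(data, sender)


def probe(target, count=100, timeout=0.5, payload=64):
    """Send count numbered datagrams to target; return (rtts_in_seconds, lost)."""
    rtts, lost = [], 0
    padding = b"\0" * max(0, payload - 4)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for seq in range(count):
            start = time.perf_counter()
            sock.sendto(struct.pack("!I", seq) + padding, target)
            try:
                while True:
                    data, _ = sock.recvfrom(2048)
                    # Skip late echoes of probes already counted as lost.
                    if len(data) >= 4 and struct.unpack("!I", data[:4])[0] == seq:
                        break
            except socket.timeout:
                lost += 1
                continue
            rtts.append(time.perf_counter() - start)
    return rtts, lost


def summarize(rtts, lost):
    """Loss fraction plus min/median/99th percentile/max round trip in ms."""
    sent = len(rtts) + lost
    stats = {"sent": sent, "lost": lost, "loss": lost / sent if sent else 0.0}
    if rtts:
        ordered = sorted(rtts)
        rank99 = max(0, math.ceil(0.99 * len(ordered)) - 1)  # nearest rank
        stats.update(
            min_ms=ordered[0] * 1e3,
            median_ms=statistics.median(ordered) * 1e3,
            p99_ms=ordered[rank99] * 1e3,
            max_ms=ordered[-1] * 1e3,
        )
    return stats


if __name__ == "__main__":
    # Loopback self-test; on a real cluster run echo_server on one node
    # and point probe() at that node's address from another.
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))
    stop = threading.Event()
    worker = threading.Thread(target=echo_server, args=(server, stop))
    worker.start()
    try:
        print(summarize(*probe(server.getsockname(), count=50)))
    finally:
        stop.set()
        worker.join()
        server.close()
```

Running it between every pair of nodes, in both directions, would also
show whether ad-hoc and infrastructure modes differ in the way you
suspect, since the extra hop through an access point should be visible
in the median.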
> James Lux, P.E.
> Spacecraft Telecommunications Section
> Jet Propulsion Laboratory, Mail Stop 161-213
> 4800 Oak Grove Drive
> Pasadena CA 91109
> tel: (818)354-2075
> fax: (818)393-6875
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu