[Beowulf] Newbie Question: Batching Versus Custom RPC

Robert G. Brown rgb at phy.duke.edu
Thu Feb 19 14:20:04 EST 2004


On Thu, 19 Feb 2004, Ryan Adams wrote:

> My question is basically this: is 2-5 seconds too small of a job to
> justify a batching system like *PBS or Gridengine?  It would seem that
> the overhead for a job that requires a few hours would be very
> insignificant, but what about a few seconds?  Certainly, one option
> would be to bundle sets of these chunks together for a larger effective
> job.  Am I wasting my time thinking about this?
> 
> I've been considering rolling my own scheduling system using some kind
> of RPC, but I've been around software development long enough to know
> that it is better to use something off-the-shelf if at all possible.
> 
> Thanks in advance...

I personally think that it is too small a task to use a batching system,
especially since you're likely not going to architect it as a true
batching system.

I think you have three primary options for ways to develop your code.
Well, four if you count NFS.

The SIMPLEST way is to put your data blocks in files on an NFS
crossmounted filesystem, and start jobs inside e.g. a simple perl script
loop that grabs "the next data file" and runs on it and writes out its
results back to the NFS file system for dealing with or accruing later.
You're basically using NFS as your transport mechanism.  Now, NFS isn't
horribly efficient relative to raw peak network speed, but neither is it
completely horrible -- at 100BT (say 9-10 MB/sec peak BW) you ought to
be able to get at least half of that on an NFS read of a big file.  At 5
MB/sec, your 1/2 MB file should take about 0.1 seconds to be read (plus a
latency hit) which is "small" (as you note) compared to a run time of
2-5 seconds so you should be able to get nice parallel speedup on four
or five hosts.  You can test your combined latency and bandwidth with a
simple perl script or binary that opens a dozen (different!) files
inside a loop.  Beware caching, which will give you insane numbers if
you aren't careful (as in don't run the test twice on the files without
modifying them on the server).
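
A minimal sketch of the NFS worker loop described above, written in
Python rather than perl.  It assumes a hypothetical layout of todo/,
claimed/ and results/ directories under one shared mount, input files
ending in .dat, and a placeholder process() standing in for the real
computation.  A worker claims a file by renaming it, so that only one
worker can end up owning a given file; how strictly that holds on a
particular NFS setup is worth verifying before relying on it.

```python
import os
from pathlib import Path


def claim_next(todo, claimed, worker_id):
    """Try to claim one pending file by renaming it; return its new path or None."""
    for candidate in sorted(todo.glob("*.dat")):
        target = claimed / (candidate.name + "." + worker_id)
        try:
            # Only one rename of a given source file can succeed.
            os.rename(candidate, target)
        except FileNotFoundError:
            continue  # another worker claimed it first; try the next file
        return target
    return None


def process(data):
    """Placeholder for the real 2-5 second computation (hypothetical)."""
    return str(len(data)).encode()


def worker_loop(root, worker_id):
    """Claim and process files until none are left; return how many were done."""
    todo = root / "todo"
    claimed = root / "claimed"
    results = root / "results"
    done = 0
    while True:
        path = claim_next(todo, claimed, worker_id)
        if path is None:
            return done
        out = results / (path.name + ".out")
        out.write_bytes(process(path.read_bytes()))
        done += 1
```

Each node would call worker_loop() on the shared directory with its own
worker id (host name plus process id would do).  The sketch has no
error handling: a worker that dies mid-block leaves its file sitting in
claimed/ for somebody to notice and requeue.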

The other three ways do it "properly" and permit you both finer control
(with the NFS method you'll have to work out file locking and work
distribution to make sure two nodes don't try to work on the same file
at the same time) and higher BW, close to the full bandwidth of the
network.  They'll ALSO require more programming.

  a) PVM

  b) MPI

  c) raw networking.

PVM is a message passing library.  There is a PVM program template on my
personal GPL software website:

 http://www.phy.duke.edu/~rgb/General/general.php

that might suffice to get you started -- it should just compile and run
a simple master/slave program, and you should be able to modify it
fairly simply to have the master distribute the next block of work to
the first worker/slave to finish.  If your CPUs are well balanced the
I/O transactions will antibunch (spread out in time instead of piling up
on the master at once) and communications will be very efficient.
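
The "next block to the first worker to finish" pattern can be sketched
without PVM at all.  The following Python uses threads and queues as
stand-ins for nodes and messages, purely to show the control flow.  It
is not the PVM template mentioned above, and the names master, worker
and compute are illustrative only.

```python
import queue
import threading

_STOP = object()  # sentinel message telling a worker to exit


def worker(rank, inbox, results, compute):
    """Receive blocks from the master until told to stop."""
    while True:
        block = inbox.get()
        if block is _STOP:
            return
        results.put((rank, compute(block)))


def master(blocks, n_workers, compute):
    """Hand each idle worker the next block; return results in completion order."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(rank, inboxes[rank], results, compute))
        for rank in range(n_workers)
    ]
    for t in threads:
        t.start()

    pending = iter(blocks)
    outstanding = 0

    def dispatch(rank):
        """Send the next block to this worker, or stop it if none remain."""
        nonlocal outstanding
        block = next(pending, _STOP)
        inboxes[rank].put(block)
        if block is not _STOP:
            outstanding += 1

    for rank in range(n_workers):
        dispatch(rank)  # prime every worker with one block

    collected = []
    while outstanding:
        rank, result = results.get()  # blocks until some worker finishes
        outstanding -= 1
        collected.append(result)
        dispatch(rank)  # that worker is idle now, so feed it again

    for t in threads:
        t.join()
    return collected
```

In a real PVM or MPI program the two queue operations become a receive
from any worker and a send to the worker that just reported in; the
bookkeeping in master() stays much the same.  Note that a compute()
that raises will hang this sketch, since nothing reports the failure
back.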

MPI is another message passing library.  I don't have an MPI template,
but there are example programs in the MPI distributions and on many
websites, and there are books (on both PVM and MPI) from e.g. MIT Press
that are quite excellent.  There is also a regular MPI column in Cluster
World Magazine that has been working through intro level MPI
applications, as did old columns by Forrest Hoffman in Linux Magazine.
At the very least, Google is your friend.

Both PVM and MPI are likely to be similar in ease of programming, hassle
of setting up a parallel environment, and speed, and both of them should
give you a very healthy fraction of wirespeed while shielding you from
having to directly manipulate the network.

Finally there are raw sockets (which it sounds like you are inclined
towards).  Now, I have nothing against raw socket programming (he says
having spent the day on xmlsysd/wulflogger/libwulf, a raw socket-based
monitoring program:-).  However, it is NOT trivial -- you have to invent
all sorts of wheels that are already invented for you and wrapped up in
simple library calls with PVM or MPI.  Its advantages are maximal speed
(you can't get faster than a point to point network connection), the
ability to thread the connection/I/O component and MAYBE let the NIC do
some of the work via DMA while the CPU is doing other work, and complete
control.  The disadvantages are that you'll be
responsible for determining e.g. message length, dealing with a dropped
connection without crashing everything, debugging your server daemon and
worker clients (or worker daemons and master client) in parallel when
they are running on different machines, and so forth.
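
As an illustration of the "message length" and "dropped connection"
chores, here is a minimal Python sketch of length-prefixed framing over
a stream socket.  The 4-byte header is an assumption made for the
example, not a protocol taken from any of the programs named above.

```python
import socket
import struct

# Assumed wire format: 4-byte unsigned length in network byte order,
# followed by exactly that many payload bytes.
HEADER = struct.Struct("!I")


def send_msg(sock, payload):
    """Send one framed message."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes; recv() may return fewer than requested."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            # An empty read means the peer closed the connection.
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock):
    """Receive one framed message and return its payload."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```

The caller still has to decide what to do with the ConnectionError
(requeue the block, mark the node dead, and so on), which is exactly
the sort of wheel PVM and MPI have already invented.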

I >>might<< be able to provide you with some applications that aren't
exactly templates but that illustrate how to get started on this
approach (and refer you to some key books) but if you really are a
networking novice it is only worth it if you WANT to write your own
application as an excuse to stop being a novice.
You'll need to be a much better and more skilled programmer altogether
in order to debug everything and check for the myriad of error
conditions that can occur and deal with them robustly.

There are really a few other approaches -- perl now supports threads so
you CAN use a perl script and ssh as a master/work distribution system
-- but raw sockets aren't much easier to manage in perl than they are in
C, and using ssh as a transport layer adds overhead at least equal to or
in excess of that of NFS, so you'd probably want to use NFS as transport and the
perl script to just manage task distribution (for which it is ideally
suited in this simple a context).  I have a nice example threaded perl
task distribution script (which I wrote for MY Cluster World Magazine column
some months ago:-) which I can put somewhere if this interests you.
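
For a rough idea of what a threaded task distributor looks like, here is
a Python analogue (not the perl script mentioned above).  It runs one
thread per host, and each thread pulls the next task name from a shared
queue and launches it over ssh.  The remote program ./process_block is
hypothetical, and the sketch assumes passwordless ssh plus task names
that are file paths every node can see on the shared NFS mount.

```python
import queue
import subprocess
import threading


def ssh_command(host, task):
    """Build the command for one task (./process_block is hypothetical)."""
    return ["ssh", host, "./process_block", task]


def run_on_hosts(hosts, tasks, make_command=ssh_command):
    """Run every task on some host, one task per host at a time.

    Returns a list of (host, task, returncode) tuples in completion order.
    """
    todo = queue.Queue()
    for task in tasks:
        todo.put(task)
    finished = []
    lock = threading.Lock()

    def drive(host):
        # Keep this host busy until the shared queue is empty.
        while True:
            try:
                task = todo.get_nowait()
            except queue.Empty:
                return
            proc = subprocess.run(make_command(host, task))
            with lock:
                finished.append((host, task, proc.returncode))

    threads = [threading.Thread(target=drive, args=(host,)) for host in hosts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return finished
```

Only the task name travels over ssh here; the data and results go
through the shared file system.  A nonzero return code is recorded but
not retried, so a production version would want to requeue failures.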

 HTH,

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





