[Beowulf] Hypothetical Situation

Erik Paulson epaulson at cs.wisc.edu
Thu Jan 22 11:21:06 EST 2004

On Thu, Jan 22, 2004 at 09:29:09AM -0600, Brent M. Clements wrote:
> We have a request by some of our research scientists to form a beowulf
> cluster that provides the following abilities:
> 1. It must be part of a shared computing facility controlled by a batch
> queuing system.
> 2. A normal user must be able to compile a customized kernel, then submit
> a job with a variable pointing to that kernel. The batch queuing system
> must then load that kernel onto the allocated nodes and reboot the nodes
> allocated.
> 2a. If the node doesn't come back up after rebooting, the job must be
> canceled and the node rebuilt automatically with a stable kernel/image.
> 3. When the job is finished, the node must be rebuilt automatically using
> a stable kernel/image.
> The admins here have come up with a name for this: "user
> provisioned os imaging". Our gut feeling is that this can be done but it
> will be a highly customized system. What are everyone's thoughts,
> experiences, etc. concerning doing something like this?

Auto rebuilding with custom images isn't difficult. The Utah folks probably 
have the best complete setup for this:


Read "An Integrated Experimental Environment for Distributed Systems and
Networks" from OSDI 2002; it's a good overview. Their setup is aimed more
at building a network-emulation testbed, but it will provision machines,
install the right OS, reboot nodes, muck with switches, and so on.
Tossing a batch system into that image wouldn't be hard. It'd probably be
a bit tricky with PBS, since it doesn't like nodes coming and going. Condor
would work fine, and I know nothing about LSF or SGE.
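
For what it's worth, the job lifecycle in the original request (items 2,
2a, and 3) can be sketched in a few lines of Python. This is only an
illustration under stated assumptions, not how the Utah setup or any batch
system actually does it: `Node`, `run_with_user_kernel`, `STABLE_IMAGE`, and
the `boot` and `run_job` callables are all hypothetical names. In a real
deployment `boot` would wrap whatever does the imaging, power cycling, and
"did it come back up" check with a timeout.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List

STABLE_IMAGE = "stable"  # hypothetical label for the site's known-good image


class Outcome(Enum):
    COMPLETED = "completed"
    CANCELED = "canceled"


@dataclass
class Node:
    name: str
    image: str = STABLE_IMAGE
    history: List[str] = field(default_factory=list)


def run_with_user_kernel(
    nodes: List[Node],
    user_image: str,
    boot: Callable[[Node, str], bool],
    run_job: Callable[[List[Node]], None],
) -> Outcome:
    """Boot the allocated nodes into user_image, run the job, then restore.

    boot(node, image) is assumed to return True only if the node came back
    up after the reboot (item 2a). The finally block restores the stable
    image on every allocated node, whether the job finished, was canceled,
    or raised (item 3).
    """
    try:
        for node in nodes:
            node.image = user_image
            node.history.append(f"boot:{user_image}")
            if not boot(node, user_image):
                # Node did not come back: cancel the job without running it.
                return Outcome.CANCELED
        run_job(nodes)
        return Outcome.COMPLETED
    finally:
        # Conservative choice: rebuild all allocated nodes, even ones that
        # were never switched to the user image.
        for node in nodes:
            node.image = STABLE_IMAGE
            node.history.append(f"rebuild:{STABLE_IMAGE}")


if __name__ == "__main__":
    allocated = [Node("node01"), Node("node02")]
    result = run_with_user_kernel(
        allocated,
        "custom-kernel",
        boot=lambda node, image: True,   # stand-in: pretend every boot works
        run_job=lambda ns: None,         # stand-in: the user's job
    )
    print(result, [n.history for n in allocated])
```

Putting the rebuild in a `finally` block is the main design point: the
stable image gets restored on every exit path, so a failed boot or a
crashing job can't leave a node running the user's kernel.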
