[Beowulf] HPC fault tolerance using virtualization

Egan Ford egan at sense.net
Tue Jun 16 12:08:23 EDT 2009

The good news...

We (IBM) demonstrated such a system at SC08 as a Cloud Computing demo.  The
setup was a combination of Moab, xCAT, and Xen.

xCAT is an open source provisioning system that can control/monitor
hardware, discover nodes, and provision stateful/stateless physical nodes
and virtual machines.  xCAT supports both KVM and Xen including live
migration.

Moab was used as a scheduler with xCAT as one of Moab's resource managers.
Moab uses xCAT's SSL/XML interface to query node state and to tell xCAT what
to do.

Some of the things you can do:

1.  Green computing.  Provision nodes on-demand as needed with any OS
(Windows too).  E.g. Torque command line:  qsub -l
nodes=10:ppn=8,walltime=10:00:00,os=rhimagea.  Idle rhimagea nodes will be
reused, other idle or off nodes will be provisioned with rhimagea.  When
Torque checks in, the job starts.  For this to be efficient, all node images,
including hypervisor images, should be stateless.  For Windows we use
preinstalled iSCSI images (xCAT uses gPXE to simulate iSCSI HW on any x86_64
node).  When nodes are idle for more than 10 minutes Moab instructs xCAT to
power off the nodes (unless something in the queue will use them soon).
Since it's stateless there is no need for cleanup.  I have this running on a
3780 diskless node system today.

2.  Route around problems.  If a dynamic provision fails, it will try
another node.  Moab can also query xCAT about the HW health of the machine
and opt to avoid using nodes that have an "amber" light.  Excessive ECC
errors, over-temperature conditions, etc. are events that our service
processors log.  If a
threshold is reached the node is marked "risky", or "doomed to fail".  Moab
policies can be set up to determine how to handle nodes in this state, e.g.
Local MPI jobs--no risky nodes.  Grid jobs from another University--ok to
use risky nodes.  Or, set up a reservation and email someone to fix it.

3.  Virtual machine balancing.  Since xCAT can live migrate Xen and KVM VMs
(and soon ESX4), and since it provides a programmable interface, Moab has no
problem moving VMs around based on policies.  Combine this with the above
two examples and you can move VMs if a HW warning is issued.  You can enable
green computing to consolidate VMs and power off nodes.  You can query xCAT
for node temp and do thermal balancing.
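
To make the policies in #1 and #2 concrete, here is a rough sketch of the
decision logic only.  Every name in it (Node, eligible_nodes,
nodes_to_power_off, the policy dictionary) is made up for illustration; this
is not Moab's or xCAT's actual interface, just the rules described above
written out in Python.

```python
from dataclasses import dataclass

# Hypothetical names throughout: not Moab's or xCAT's API, only the
# decision rules described in items 1 and 2.

IDLE_LIMIT_S = 10 * 60  # "idle for more than 10 minutes"


@dataclass
class Node:
    name: str
    health: str = "ok"      # "ok", or "risky" once an event threshold is hit
    powered_on: bool = True
    idle_seconds: int = 0   # 0 means the node is busy


def eligible_nodes(nodes, job_class, allow_risky):
    """Return the nodes a job of job_class may use under a per-class policy."""
    ok_risky = allow_risky.get(job_class, False)  # default: avoid risky nodes
    return [n for n in nodes if n.health == "ok" or ok_risky]


def nodes_to_power_off(nodes, needed_soon):
    """Return powered-on nodes idle past the limit that no queued job needs."""
    return [
        n for n in nodes
        if n.powered_on
        and n.idle_seconds > IDLE_LIMIT_S
        and n.name not in needed_soon
    ]


nodes = [
    Node("n1", idle_seconds=900),
    Node("n2", health="risky", idle_seconds=900),
    Node("n3", idle_seconds=60),
]
# Local MPI jobs: no risky nodes.  Grid jobs from elsewhere: risky is ok.
policy = {"local_mpi": False, "grid": True}

print([n.name for n in eligible_nodes(nodes, "local_mpi", policy)])
print([n.name for n in nodes_to_power_off(nodes, needed_soon={"n2"})])
```

In a real setup the health and idle data would come from the resource
manager rather than a hard-coded list, and the power-off step would be a
request to the provisioning system instead of a returned list.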

The above are just a few ideas that we are pursuing with our customers today.

The bad news...

I have no idea about the state of VMs on IB.  That can be an issue with MPI.
Believe it or not, most HPC sites do not use MPI.  They are all batch
systems where storage I/O is the bottleneck.  However, I have tested MPI
over IP with VMs and moved things around.  No problem.  Hint:  You will need
a large L2 network since the VMs retain their MAC and IP.  Yes there are
workarounds, but nothing as easy as a large L2.

Application performance may suffer in a VM.  Benchmark first.  If you just
use #1 and #2 above on the iron, you can decrease your risk of failure and
run faster.  And we all checkpoint, right?  :-)

Lastly, check out http://lxc.sourceforge.net/.  This is lightweight
virtualization.  It's not a new concept, but hopefully by next year automated
checkpoint/restart with MPI jobs over IB may be supported.  This may be a
better fit for HPC than full-on virtualization.

On Mon, Jun 15, 2009 at 11:59 AM, John Hearns <hearnsj at googlemail.com> wrote:

> I was doing a search on ganglia + ipmi (I'm looking at doing such a
> thing for temperature measurement) when I came across this paper:
> http://www.csm.ornl.gov/~engelman/publications/nagarajan07proactive.ppt.pdf
> Proactive Fault Tolerance for HPC using Xen virtualization
> It's something I've wanted to see working - doing a Xen live migration
> of a 'dodgy' compute node, and the job just keeps on trucking.
> Looks as if these guys have it working. Anyone else seen similar?
> John Hearns
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf