[Beowulf] backtraces
Craig Tierney
ctierney at hypermall.net
Mon Jun 11 23:54:28 EDT 2007
Mark Hahn wrote:
>> Sorry to start a flame war....
>
> what part do you think was inflamed?
It was when I was trying to say "Real codes have user-level
checkpointing implemented and no code should ever run for 7
days."
>
>> Make sure that your code generates the exact same answer with
>> debug/backtrace enabled and disabled,
>
> part of the point of my very simple backtrace.so is that it has zero
> runtime overhead and doesn't require any special compilation.
>
Does the Intel version have overhead? I never measured it before,
but I never thought it was much.
>> then you add user-level checkpointing so that you can
>
> I'm most curious to hear people's experience with checkpointing.
> all our more serious, established codes do checkpointing, but it's
> extremely foreign to people writing newish codes.
> and, of course, it's a lot of extra work. I'm not arguing against
> checkpointing, just acknowledging that although we _require_ it,
> we don't actually demand "proof-of-checkpointability".
>
I added checkpointing to an ocean model once. It was very easy,
but that was most likely because of how the code was organized
(Fortran 77, with most data structures shared).
I don't think that it is foreign to people writing new codes.
It is foreign to scientists. Software developers (who could be
scientists) would think of this from the beginning (I hope).
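
For anyone who has not done user-level checkpointing before, here is a minimal sketch of the idea in Python (not the Fortran 77 ocean model mentioned above). The file name, the `state` dictionary, and the arithmetic standing in for a model time step are all made up for illustration. The one real design point is writing to a temporary file and renaming it, so a crash in the middle of a write cannot destroy the previous good checkpoint.

```python
import os
import pickle
import tempfile

CHECKPOINT = "state.ckpt"  # hypothetical file name


def save_checkpoint(state, path=CHECKPOINT):
    """Write state to a temp file, then atomically rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path=CHECKPOINT):
    """Return the saved state, or None if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None


def run(total_steps, every, path=CHECKPOINT):
    """Resume from the last checkpoint if there is one, else start from step 0."""
    state = load_checkpoint(path) or {"step": 0, "value": 0.0}
    while state["step"] < total_steps:
        state["value"] += 0.5 * state["step"]  # stand-in for one model time step
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(state, path)
    return state
```

A run that is killed and restarted should end in the same state as one that ran straight through, which is also a cheap way to test that the checkpoint really captures everything.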
>> restart where you want. Then you
>> run up until the problem and restart with the last checkpoint.
>
> restarting from checkpoint is fine (the code in question could
> actually do it), but still means you have hours of running,
> presumably under a debugger.
>
>> Run for a week without checkpointing? Just begging for trouble.
>
> suppose you have 2k users, with ~300 active at any instant,
> and probably 200 unrelated codes running. while we do require
> checkpointing (I usually say "every 6-8 cpu hours"), I suspect that many
> users never do. how do you check/validate/encourage/support
> checkpointing?
>
Set your queue maximums to 6-8 hours. That prevents system hogging
and encourages checkpointing for long runs. Make sure your I/O system
can support the checkpointing, because it can create a lot of load.
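
With a hard queue limit, the code has to save its state before the scheduler kills the job. A rough sketch of one way to do that in Python follows. The class and function names, the limit, and the safety margin are assumptions for illustration; a real code would take the limit from the batch system and pick the margin from how long a checkpoint actually takes to write.

```python
import time


class WalltimeBudget:
    """Track elapsed time against a queue limit (all numbers are illustrative)."""

    def __init__(self, limit_s, margin_s, clock=time.monotonic):
        self.limit_s = limit_s
        self.margin_s = margin_s
        self.clock = clock
        self.start = clock()

    def should_stop(self):
        """True once the time left is no more than the safety margin."""
        return self.clock() - self.start >= self.limit_s - self.margin_s


def run_until_limit(step, checkpoint, budget, max_steps):
    """Call step(i) until finished or nearly out of time; always checkpoint before returning."""
    i = 0
    while i < max_steps and not budget.should_stop():
        step(i)
        i += 1
    checkpoint(i)  # hypothetical callback that writes the restart file
    return i
```

The job script can then resubmit itself whenever the returned step count is less than `max_steps`.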
> part of the reason I got a kick out of this simple backtrace.so
> is indeed that it's quite possible to conceive of a checkpoint.so
> which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent job
> of checkpointing at least serial codes non-intrusively.
>
BTW, I like your code. I had a script written for me in the past
(by Greg Lindahl in a galaxy far, far away). The one modification
I would make is to print out the MPI rank environment variable (MPI
flavors vary in how they set it). Then when it crashes, you know
which process actually died.
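
A sketch of that lookup, written in Python rather than as a change to backtrace.so itself. The list of variable names is an assumption: these are names I believe common MPI flavors and launchers set, but check what your own launcher exports before relying on any of them.

```python
import os

# Candidate variable names; which one is set depends on the MPI flavor and launcher.
# This list is an assumption for illustration, not a complete or guaranteed mapping.
RANK_VARS = (
    "OMPI_COMM_WORLD_RANK",  # Open MPI
    "PMI_RANK",              # MPICH-style launchers
    "MV2_COMM_WORLD_RANK",   # MVAPICH2
    "SLURM_PROCID",          # Slurm srun
)


def mpi_rank(environ=os.environ):
    """Return (variable_name, value) for the first rank variable found, else None."""
    for name in RANK_VARS:
        value = environ.get(name)
        if value is not None:
            return name, value
    return None


def crash_prefix(environ=os.environ):
    """Build a short tag to print ahead of a backtrace."""
    found = mpi_rank(environ)
    pid = os.getpid()
    if found is None:
        return f"[rank unknown, pid {pid}]"
    name, value = found
    return f"[rank {value} via {name}, pid {pid}]"
```

Printing the pid alongside the rank still identifies the process when none of the variables is set.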
Craig