Hi, has anyone used this? If so, does it work?
DMTCP is a distributed checkpointing system that can checkpoint not only sequential programs but also threaded programs (using pthreads), families of processes (created with fork), and processes distributed across machines (such as MPI jobs). You can find a reference for it here:
Would it be worthwhile to investigate DMTCP on the grid and see how well it works when wrapped around job execution boundaries? Thanks, Sander