Some Open Challenges in Checkpointing



One Line Summary

In this presentation, we will talk about some of the open challenges in checkpointing.


Historically, checkpoint-restart approaches have been divided between those based on kernel modifications (e.g., BLCR) and purely user-space approaches (e.g., DMTCP). CRIU represents an intermediate strategy: it bridges the kernel and the user process by taking advantage of data structures that the kernel exports to user space. In the same way, DMTCP represents a bridging strategy between the system representation of the process and the user application. Specifically, DMTCP plugins allow external resources to be virtualized transparently to the application, and they also offer DMTCP-aware extensions to the application.
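
The virtualization idea mentioned above can be illustrated with a small translation table. The following is a minimal sketch in Python, not DMTCP's plugin API: the class `VirtualIdTable` and its methods are hypothetical names introduced here for illustration only. It assumes that the application only ever sees a stable virtual ID, while a wrapper layer translates it to whatever real ID the system currently assigns.

```python
import itertools


class VirtualIdTable:
    """Hypothetical table mapping stable virtual IDs to current real IDs."""

    def __init__(self):
        # Arbitrary starting value for virtual IDs (an assumption of this sketch).
        self._next_virtual = itertools.count(1000)
        self._virt_to_real = {}

    def register(self, real_id):
        """Give the application a virtual ID that stands in for real_id."""
        virtual_id = next(self._next_virtual)
        self._virt_to_real[virtual_id] = real_id
        return virtual_id

    def to_real(self, virtual_id):
        """Translate a virtual ID to the real ID before calling the system."""
        return self._virt_to_real[virtual_id]

    def rebind(self, virtual_id, new_real_id):
        """After restart, point an existing virtual ID at the re-created resource."""
        if virtual_id not in self._virt_to_real:
            raise KeyError(virtual_id)
        self._virt_to_real[virtual_id] = new_real_id


table = VirtualIdTable()
vid = table.register(real_id=4242)   # before checkpoint
table.rebind(vid, new_real_id=977)   # after restart, the system assigned a new ID
```

Because the application holds only the virtual ID, it needs no change when the real resource is re-created under a different ID at restart.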

In this talk, we present a wish list that might allow these two bridging technologies (from the kernel to the process and from the process to the application) to interoperate better. We describe a number of challenges. As an example, we highlight the issue of NSCD (the Name Service Cache Daemon). It shares memory with the application, and so when the application is restarted, NSCD must be partially re-initialized to account for this. What can be done in a manner that is transparent to both the application and NSCD?


DMTCP, Checkpointing, Checkpoint-Restart


  • Biography

    Kapil Arya is a Distributed Systems Engineer at Mesosphere and works on Apache Mesos. He also contributes to the open source distributed checkpointing project DMTCP.

  • Gene Cooperman

    Northeastern University


    Professor Cooperman works in high-performance computing and scalable applications for computational algebra. He received his B.S. from the University of Michigan in 1974, and his Ph.D. from Brown University in 1978. He then spent six years in basic research at GTE Laboratories. He came to Northeastern University in 1986, and has been a full professor since 1992. In 2014, he was awarded a five-year IDEX Chair of Attractivity from the Université Fédérale Toulouse Midi-Pyrénées. Since 2004, he has led the DMTCP project (Distributed MultiThreaded CheckPointing). Prof. Cooperman also has a 15-year relationship with CERN, where his work on semi-automatic thread parallelization of task-oriented software is included in the million-line Geant4 high-energy physics simulator. His current research interests center on the limits of transparent checkpoint-restart. Some current domains of interest are supercomputing, cloud computing, GPU-accelerated graphics, GPGPU computing, and the Internet of Things.