Checkpoint/Restart for distributed applications


One Line Summary

Distributed checkpoint/restart protocols


Extending single-application checkpoint/restart to parallel applications running across distributed resources is a delicate operation, one that can easily incur a significant and undesirable performance impact. The talk addresses several checkpoint/restart challenges, from coordinating distributed processes to ensuring that no message is lost or duplicated. It covers different distributed checkpoint/restart protocols in detail and describes ongoing efforts to incorporate and optimize checkpoint/restart support in an existing MPI implementation, Open MPI.


resilience, soft error detection, soft error correction


  • George Bosilca

    University of Tennessee


    Research Director and Adjunct Assistant Professor at the Innovative Computing Laboratory at the University of Tennessee, Knoxville. Core developer of Open MPI and fervent supporter of resilience in distributed computing, in particular in MPI.