Checkpoint/Restart for distributed applications
One Line Summary
Distributed checkpoint/restart protocols
Abstract
Extending single-application checkpoint/restart to parallel applications running across distributed resources is a delicate operation, one that can easily incur a significant and undesirable performance impact. The talk addresses several checkpoint/restart challenges, from coordinating distributed processes to ensuring that no message is lost or duplicated. It covers different distributed checkpoint/restart protocols in detail and describes ongoing efforts to incorporate and optimize checkpoint/restart support in an existing MPI implementation, Open MPI.
Tags
resilience, soft error detection, soft error correction
Speaker
George Bosilca
University of Tennessee
Website: http://icl.cs.utk.edu/~bosilca/
Biography
Research Director and Adjunct Assistant Professor at the Innovative Computing Laboratory, University of Tennessee, Knoxville. Core developer of Open MPI and fervent supporter of resilience in distributed computing, particularly in MPI.