Checkpoint/Restart in Linux mainline

Talk
lpc2009-0039
Scheduled: Friday, September 25, 2009 from 11:30am – 12:15pm in Salon CD

Excerpt

Requirements and challenges in the implementation of Checkpoint/Restart in Linux mainline.

Description

Checkpoint/Restart (C/R) is the ability to save the entire state of a running application and later resume it, either on the same system or on a different system with a similar operating environment.

Transparent C/R requires no changes to the application, not even a rebuild.

C/R provides several benefits to system administrators and users. Critical applications can be checkpointed at regular intervals and restarted from a recent checkpoint after a software or hardware upgrade, or after an application or system failure. The checkpointed application may even be restarted on a second system with a similar operating environment during extended downtime of the original system.

C/R is especially useful to applications that have long start-up times or incur a significant performance penalty on startup due to parsing configuration information or populating a cache. Such applications can be checkpointed when running at optimal speed and resumed from the checkpoint whenever the application needs to be restarted.

C/R also enables significantly improved load-balancing and consolidation. When the load on a server goes up, some of its applications can be checkpointed, stopped on that server, and quickly resumed on a new server.

Similarly, when the load on several servers is low, the applications on those servers can be checkpointed and restarted on fewer servers, improving system utilization.

For all its benefits, implementing general-purpose, transparent Checkpoint/Restart presents several complex and interesting challenges due to the vast and varied application state that needs to be saved and restored. The application state includes process memory, CPU registers, open file state, task relationships, signal state, SYSV IPC, sockets, the network stack, device state, and more.
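To give a feel for how varied this state is, here is a minimal sketch that collects a small, user-visible slice of it (memory mappings and open file descriptors) from procfs. It is an illustration only, not the kernel implementation discussed in this talk: the names `TaskSnapshot` and `snapshot_task` are hypothetical, and it assumes a Linux system exposing `/proc/<pid>/maps` and `/proc/<pid>/fd`. Most of the state listed above (CPU registers, signal state, SYSV IPC, socket and device state) cannot be captured this way.

```python
import os
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TaskSnapshot:
    """Hypothetical container for a small, user-visible part of one task's state."""
    pid: int
    memory_regions: List[str] = field(default_factory=list)   # lines of /proc/<pid>/maps
    open_files: Dict[int, str] = field(default_factory=dict)  # fd number -> link target


def snapshot_task(pid: int, proc_root: str = "/proc") -> TaskSnapshot:
    """Collect memory mappings and open file descriptors for one task.

    proc_root is a parameter only so the function can be pointed at a
    different directory tree; on a real system it is /proc.
    """
    base = os.path.join(proc_root, str(pid))
    snap = TaskSnapshot(pid=pid)

    # Each non-empty line of "maps" describes one mapped memory region.
    with open(os.path.join(base, "maps")) as f:
        snap.memory_regions = [line.rstrip("\n") for line in f if line.strip()]

    # Each entry in "fd" is a symlink named after the descriptor number.
    fd_dir = os.path.join(base, "fd")
    for name in sorted(os.listdir(fd_dir), key=int):
        try:
            snap.open_files[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listing and reading it.
            continue
    return snap
```

On a Linux system, `snapshot_task(os.getpid())` returns a snapshot of the calling process. Even this partial view is racy while the task keeps running, which hints at why a real implementation has to stop the application before saving its state.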

We are actively seeking to implement C/R in mainline, based on two existing out-of-tree implementations: Zap and OpenVZ. We would like to briefly go over our current implementation and present some major requirements, challenges and design decisions for broader review.

More importantly, we would like to understand and prioritize the devices, resources and state of HPC, X, cluster and other applications that would need to be checkpointed and restarted. To that end, we would like to get input from administrators and users of such applications, as well as from administrators and users of other C/R implementations.

We would also like input from developers of the various subsystems in the Linux kernel on how C/R may impact their subsystems and how we could improve/stage the implementation.

Tags

Checkpoint, Restart

Speaker

  • Biography

    Sukadev Bhattiprolu has been working on Linux kernel development for over 5
    years. For the last two years he has been working on implementing container
    support in the Linux kernel, including kernel thread conversion, PID
    namespaces, and the devpts namespace. He also worked on cryo, a user-space
    checkpoint/restart prototype implementation. He is actively working on the
    implementation of C/R in the mainline Linux kernel.
