Checkpointing CUDA Applications with CRIU

Originally published at: https://developer.nvidia.com/blog/checkpointing-cuda-applications-with-criu/

Checkpoint and restore functionality for CUDA is exposed through a command-line utility called cuda-checkpoint. This utility can be used to transparently checkpoint and restore CUDA state within a running Linux process. Combine it with CRIU (Checkpoint/Restore in Userspace), an open-source checkpointing utility, to fully checkpoint CUDA applications.

Checkpointing overview

Transparent, per-process checkpointing offers a middle…
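For readers skimming the thread, the cuda-checkpoint plus CRIU combination described in the summary can be sketched as an ordered list of commands. This is a minimal sketch, not the article's exact listing: the flags shown (`--toggle`, `--pid`, `--shell-job`, `--images-dir`, `--tree`, `--restore-detached`) follow the commonly documented usage of the two tools, while the PID `12345`, the directory name `demo`, and the helper function names are hypothetical. The script only prints the commands; actually running them typically requires root and a supported driver.

```python
import shlex


def checkpoint_commands(pid: int, images_dir: str) -> list[list[str]]:
    """Build the commands that suspend CUDA state, then dump the process with CRIU."""
    return [
        # Toggle the CUDA state of the process from running to checkpointed.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Dump the (now CUDA-free) process tree to disk.
        ["criu", "dump", "--shell-job", "--images-dir", images_dir, "--tree", str(pid)],
    ]


def restore_commands(pid: int, images_dir: str) -> list[list[str]]:
    """Build the commands that restore the process with CRIU, then resume CUDA state."""
    return [
        # Restore the process from the image directory; CRIU reuses the original PID.
        ["criu", "restore", "--shell-job", "--restore-detached", "--images-dir", images_dir],
        # Toggle the CUDA state back from checkpointed to running.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


if __name__ == "__main__":
    # Hypothetical PID and image directory, for illustration only.
    for cmd in checkpoint_commands(12345, "demo") + restore_commands(12345, "demo"):
        print(shlex.join(cmd))
```

The ordering is the point of the sketch: CUDA state is toggled off before `criu dump` and toggled back on only after `criu restore` has completed.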


Hey, can anyone give an example that uses sbatch, and is this supported by Slurm?

Using the example you provided, I encountered a crash issue with CRIU. I am using the binary from the cuda-checkpoint repo, and my driver version is 580. Could you tell me which version of CRIU is required? Are there any requirements for the Linux version? Or is there a stable Docker environment available?

We would be happy to help. Could you please open a GitHub issue with more information in the checkpoint-restore/criu repository?