Originally published at: https://developer.nvidia.com/blog/checkpointing-cuda-applications-with-criu/
Checkpoint and restore functionality for CUDA is exposed through a command-line utility called cuda-checkpoint. This utility can be used to transparently checkpoint and restore CUDA state within a running Linux process. Combine it with CRIU (Checkpoint/Restore in Userspace), an open-source checkpointing utility, to fully checkpoint CUDA applications.

Checkpointing overview

Transparent, per-process checkpointing offers a middle…
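
As a rough illustration of how the two tools in the excerpt fit together, here is a minimal Python sketch that only builds the command lines for a checkpoint and a restore; it does not run them. The flags (`--toggle` and `--pid` for cuda-checkpoint; `--shell-job`, `--images-dir`, `--tree`, and `--restore-detached` for CRIU) are assumed from the cuda-checkpoint and CRIU documentation and may differ between versions. The helper names, the PID, and the `demo` directory are hypothetical.

```python
import shlex


def checkpoint_commands(pid: int, images_dir: str) -> list[list[str]]:
    """Build the commands to checkpoint a CUDA process (assumed flags)."""
    return [
        # Step 1: suspend CUDA and move device state into host memory.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Step 2: let CRIU dump the now CPU-only process tree to disk.
        ["criu", "dump", "--shell-job", "--images-dir", images_dir,
         "--tree", str(pid)],
    ]


def restore_commands(pid: int, images_dir: str) -> list[list[str]]:
    """Build the commands to restore the process (assumed flags)."""
    return [
        # Step 1: CRIU recreates the process from the saved images.
        ["criu", "restore", "--shell-job", "--restore-detached",
         "--images-dir", images_dir],
        # Step 2: toggle again so CUDA state returns to the GPU.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


if __name__ == "__main__":
    # Hypothetical PID and image directory, for illustration only.
    for argv in checkpoint_commands(1234, "demo") + restore_commands(1234, "demo"):
        print(shlex.join(argv))
```

Building the argument lists separately from executing them makes the ordering easy to inspect: CUDA state is toggled off before the CRIU dump and toggled back on after the CRIU restore. Running the commands typically requires root privileges and a driver that supports cuda-checkpoint.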
Hey, can anyone give an example using sbatch, and is this supported by Slurm?
Using the example you provided, I encountered a crash issue with CRIU. I am using the binary from the cuda-checkpoint repo, and my driver version is 580. Could you tell me which version of CRIU is required? Are there any requirements for the Linux version? Or is there a stable Docker environment available?
We would be happy to help. Could you please open a GitHub issue with more information in the checkpoint-restore/criu repository?