Originally published at: https://developer.nvidia.com/blog/checkpointing-cuda-applications-with-criu/
Checkpoint and restore functionality for CUDA is exposed through a command-line utility called cuda-checkpoint. This utility can be used to transparently checkpoint and restore CUDA state within a running Linux process. Combine it with CRIU (Checkpoint/Restore in Userspace), an open-source checkpointing utility, to fully checkpoint CUDA applications.

Checkpointing overview

Transparent, per-process checkpointing offers a middle…
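
As a rough illustration of how cuda-checkpoint and CRIU fit together, here is a minimal Python sketch that only builds (and optionally runs) the command lines. It assumes both `cuda-checkpoint` and `criu` are on the PATH, that the installed versions accept the flags shown (`--toggle`/`--pid` for cuda-checkpoint; `dump`, `restore`, `--images-dir`, `--tree`, `--shell-job`, `--restore-detached` for CRIU), and that the process was started from a shell. The helper names `checkpoint_commands`, `restore_commands`, and `run_all`, and the `demo` image directory, are hypothetical.

```python
import shlex
import subprocess


def checkpoint_commands(pid, images_dir):
    """Build the commands that suspend CUDA state, then dump the process."""
    return [
        # Toggle the CUDA state of the process from running to checkpointed.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Dump the (now CUDA-free) process tree to disk with CRIU.
        ["criu", "dump", "--shell-job", "--images-dir", images_dir,
         "--tree", str(pid)],
    ]


def restore_commands(pid, images_dir):
    """Build the commands that restore the process, then resume CUDA state."""
    return [
        # Restore the process from the CRIU images (CRIU reuses the same PID).
        ["criu", "restore", "--shell-job", "--restore-detached",
         "--images-dir", images_dir],
        # Toggle the CUDA state back from checkpointed to running.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # CRIU typically needs root privileges; this raises on failure.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    pid = 12345  # hypothetical PID of a running CUDA application
    run_all(checkpoint_commands(pid, "demo"))
    run_all(restore_commands(pid, "demo"))
```

With `dry_run=True` (the default) nothing is executed, so the sketch can be used to inspect the command sequence before trying it on a real process.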
Hey, can anyone give an example of using sbatch, and is this supported by Slurm?