We use nvidia-docker containers to run ML/AI workloads on our DGX machines, which run Ubuntu. Several of these machines are shared among many data scientists, and we try to share the available GPUs by isolating the GPU(s) each container runs on. Recently I realized that some GPUs had large chunks of their memory consumed by docker containers left running “until the user could get back to the work they had started” (nvidia-smi shows the GPU memory as consumed). However, no ML code has been running in those containers in the interim, which can be days (nvidia-smi shows 0% utilization). Consequently, the containers are “consuming” entire GPUs while doing nothing but faithfully waiting for their users to return.
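The symptom described above (memory held, utilization at 0) can be spotted from nvidia-smi's CSV query output. The following is a minimal sketch, assuming the standard `--query-gpu` fields `index`, `memory.used`, and `utilization.gpu`; the function name `find_idle_gpus` and the 1000 MiB threshold are illustrative choices, not anything from our setup. A single utilization sample is only a hint, since a job may simply be between batches.

```shell
#!/bin/sh
# Print the index of every GPU that holds more than MIN_MIB of memory
# while reporting 0% utilization. Reads nvidia-smi CSV rows on stdin:
#   index, memory.used, utilization.gpu
find_idle_gpus() {
    min_mib="${1:-1000}"   # hypothetical threshold in MiB
    awk -F', *' -v min="$min_mib" \
        '$2 + 0 > min + 0 && $3 + 0 == 0 { print $1 }'
}

# Usage on a machine with the NVIDIA driver installed:
#   nvidia-smi --query-gpu=index,memory.used,utilization.gpu \
#              --format=csv,noheader,nounits | find_idle_gpus 1000
```

Sampling this repeatedly over several hours would give a more reliable picture than one reading.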
I’m wondering if there is a way to suspend a docker container, release the resources it’s using, and subsequently restore them on command:
- store the CPU and GPU memory state and release the CPU/GPU memory while paused
- restore the CPU/GPU state when the container is resumed
I initially considered using docker pause CONTAINER and subsequently unpausing it. I’m not an IT engineer, but if I understand the docker pause documentation correctly, it only triggers a cgroup freeze and would not release and restore the GPU memory as needed. Those I consulted felt that if release and restore did happen, it would have to be performed by nvidia-docker or some other memory manager. If I pause a docker container started with nvidia-docker run..., should the GPUs be released and then restored when I unpause it?
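Whether pausing frees GPU memory can be checked empirically by comparing nvidia-smi readings taken before and during the pause. This is a minimal sketch, not a definitive test: the function names, the `NVIDIA_SMI` override, and the `PAUSE_WAIT` settle time are all hypothetical helpers introduced here, and CONTAINER stands for a real container name or ID.

```shell
#!/bin/sh
# Snapshot per-GPU memory use as "index, MiB" rows.
# NVIDIA_SMI is a hypothetical override so the command can be swapped out.
gpu_mem() {
    "${NVIDIA_SMI:-nvidia-smi}" --query-gpu=index,memory.used \
        --format=csv,noheader,nounits
}

# Pause a container, compare GPU memory before and during the pause,
# then unpause it. Prints "unchanged" or "changed".
pause_and_compare() {
    container="$1"
    before=$(gpu_mem)
    docker pause "$container" >/dev/null || return 1
    sleep "${PAUSE_WAIT:-5}"   # arbitrary settle time, in seconds
    after=$(gpu_mem)
    docker unpause "$container" >/dev/null || return 1
    if [ "$before" = "$after" ]; then
        echo "unchanged"
    else
        echo "changed"
    fi
}

# Usage: pause_and_compare CONTAINER
```

Since a frozen process still exists and keeps its allocations, I would expect "unchanged" here, but other activity on the same GPUs during the wait could also alter the reading.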
Beyond that, something like docker CRIU or docker checkpoint might provide a solution. Our systems don’t currently have docker’s experimental features enabled, so, for example, docker checkpoint ls reports that “a Docker daemon with experimental features enabled” is required.
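The daemon's experimental setting can be read directly rather than inferred from an error message. A minimal sketch, assuming the `{{.Server.Experimental}}` field of `docker version`; the function name `experimental_enabled` is illustrative.

```shell
#!/bin/sh
# Succeed (exit status 0) only if the Docker daemon reports that
# experimental features are enabled.
experimental_enabled() {
    [ "$(docker version --format '{{.Server.Experimental}}')" = "true" ]
}

# Usage:
#   if experimental_enabled; then
#       echo "experimental features are on"
#   else
#       echo "experimental features are off"
#   fi
```

Turning the setting on is normally done by adding `"experimental": true` to /etc/docker/daemon.json and restarting the daemon, and the checkpoint commands additionally need CRIU installed on the host.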
Does nvidia-docker support docker CRIU functionality? Should I expect that if I run docker checkpoint create on a container started by nvidia-docker run ... and then use docker start --checkpoint... or perhaps nvidia-docker start --checkpoint..., I’d end up with a docker container in the state it was in when the checkpoint was created?
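For reference, the plain docker sequence being asked about can be sketched as follows. This assumes an experimental-enabled daemon with CRIU on the host; the function names and the default checkpoint name `idle-snapshot` are hypothetical. It makes no claim that GPU state survives the round trip, which is exactly the open question.

```shell
#!/bin/sh
# Checkpoint a container under a given name.
# By default, docker stops the container once the checkpoint is written.
suspend_container() {
    container="$1"
    name="${2:-idle-snapshot}"   # hypothetical checkpoint name
    docker checkpoint create "$container" "$name"
}

# Start the stopped container again from the named checkpoint.
resume_container() {
    container="$1"
    name="${2:-idle-snapshot}"
    docker start --checkpoint "$name" "$container"
}

# Usage:
#   suspend_container CONTAINER
#   ... later ...
#   resume_container CONTAINER
```

Comparing nvidia-smi output after the suspend step would show whether the GPU memory was actually released.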