Suspending NVIDIA-Docker Container

sean.davern · December 9, 2020, 7:38am

We use nvidia-docker containers to run ML/AI work on our DGX machines, which run Ubuntu. A number of these machines are shared among many data scientists. We try to share the available GPUs by isolating the GPU(s) each container runs on. Recently I realized that some GPUs had large chunks of their memory consumed by docker containers left running “until the user could get back to the work they had started” (nvidia-smi shows the GPU memory as consumed). However, no ML code is actually running in those containers in the interim, which can be days (nvidia-smi shows utilization at 0). Consequently, the containers are “consuming” entire GPUs while doing nothing but faithfully waiting for their users to return.
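For reference, here is a minimal sketch of how GPUs in this state (memory consumed, utilization 0) could be flagged from nvidia-smi output. The function name find_idle_occupied_gpus and the 1024 MiB threshold are illustrative assumptions, not part of any NVIDIA tool, and a single sample showing 0% utilization does not prove a GPU has been idle for days, so in practice the check would need to be repeated over time.

```python
import csv
import io

# nvidia-smi query that produces one "index, memory.used, utilization.gpu"
# line per GPU, with no header row and no unit suffixes.
QUERY = (
    "nvidia-smi --query-gpu=index,memory.used,utilization.gpu "
    "--format=csv,noheader,nounits"
)


def find_idle_occupied_gpus(csv_text, min_used_mib=1024):
    """Return indices of GPUs holding memory while reporting 0% utilization.

    csv_text is the text printed by QUERY. min_used_mib is an assumed
    threshold for "a large chunk of memory"; tune it for your machines.
    """
    idle = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        try:
            index, used_mib, util_pct = (int(field.strip()) for field in row)
        except ValueError:
            continue  # skip values that are not plain integers
        if used_mib >= min_used_mib and util_pct == 0:
            idle.append(index)
    return idle


if __name__ == "__main__":
    import subprocess

    output = subprocess.run(
        QUERY.split(), capture_output=True, text=True, check=True
    ).stdout
    print(find_idle_occupied_gpus(output))
```

Run on a DGX host, this prints the indices of GPUs that currently match the pattern described above. It does not say which container owns the memory.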
I’m wondering if there is a way to suspend a docker container, release the resources it’s using, and restore them later on command:

- store the CPU and GPU memory state, and release the CPU/GPU memory while the container is paused
- restore the CPU/GPU state when the container is resumed

I initially considered using docker pause CONTAINER and subsequently unpausing it. I’m not an IT engineer, but if I understand the docker pause documentation correctly, it only triggers a cgroup freeze and would not release and restore the GPU memory as needed. Those I consulted felt that if a release did happen, it would have to be performed by nvidia-docker or some other memory manager. If I pause a docker container started with nvidia-docker run..., should the GPUs be released and then restored when I unpause it?
Further exploration suggested that something like CRIU or docker checkpoint might provide a solution. Our systems don’t currently have docker’s experimental features enabled, so, for example, docker checkpoint ls reports that “a Docker daemon with experimental features enabled” is required.
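As a sketch of what enabling those experimental features might involve, the helper below adds "experimental": true to the text of a Docker daemon configuration file while keeping existing entries such as an nvidia runtime. It assumes the configuration lives in /etc/docker/daemon.json (the usual Linux location), that the daemon is restarted afterwards, and that CRIU is installed on the host. The name enable_experimental is hypothetical, and nothing here implies that checkpointing preserves GPU state.

```python
import json


def enable_experimental(daemon_json_text):
    """Return daemon.json text with the experimental flag switched on.

    Existing keys (for example a "runtimes" section) are left untouched.
    An empty or whitespace-only input is treated as an empty configuration.
    """
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    config["experimental"] = True
    return json.dumps(config, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Example input only; a real file would be read from /etc/docker/daemon.json.
    current = '{"runtimes": {"nvidia": {"path": "nvidia-container-runtime"}}}'
    print(enable_experimental(current))
```

The function only transforms text, so the result can be reviewed before anyone writes it back to the host and restarts the daemon.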

Does nvidia-docker support docker’s CRIU functionality? If I run docker checkpoint create on a container started by nvidia-docker run ... and then use docker start --checkpoint... or perhaps nvidia-docker start --checkpoint..., should I expect to end up with a docker container in the state it was in when the checkpoint was created?
