GPU becomes unavailable after some time in Docker container


After updating the software on some of our workstations, we have the problem that GPUs become unavailable inside running Docker containers.

We first noticed this when PyTorch experiments failed on the second script called in the container with a RuntimeError: No CUDA GPUs are available.
While debugging, we noticed that simply starting a container with nvidia-docker run --rm -it nvidia/cuda:11.2.1-devel-ubuntu20.04 bash and running watch -n 1 nvidia-smi inside it also does not work as expected: at first the output is correct, but after some time (which varies between a few seconds and several hours) it changes to Failed to initialize NVML: Unknown Error.
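To pin down exactly when the failure occurs, a timestamped health-check loop can replace the plain watch. This is only a sketch; monitor_gpu is a hypothetical helper name, and inside the container you would call it with nvidia-smi:

```shell
# Hypothetical helper: poll a health-check command and report the time
# of the first failure. Inside the container, run:  monitor_gpu nvidia-smi 1
monitor_gpu() {
    local check_cmd="${1:-nvidia-smi}"   # command to poll
    local interval="${2:-1}"             # seconds between polls
    while "$check_cmd" > /dev/null 2>&1; do
        sleep "$interval"
    done
    # First failure: log a timestamp so it can be correlated with host logs.
    echo "$(date -Is): '$check_cmd' failed"
    return 1
}
```

Correlating the logged timestamp with the host's journalctl or dmesg output may narrow down which host-side event coincides with the container losing the GPU.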

We could reproduce the error with different Docker images, such as nvidia/cuda:11.2.1-devel-ubuntu20.04 and images based on pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime.

We have reproduced this bug on different workstations with completely different hardware and GPUs (GTX 1080 Ti and RTX 3090).

Setups that do NOT work (GTX 1080 Ti and RTX 3090 workstations) are:
Ubuntu 20.04 (nvidia-docker2 2.5.0-1):

  • linux-image-5.4.0-65-generic + nvidia-headless-450 450.102.04-0ubuntu0.20.04.1
  • linux-image-5.8.0-44-generic + nvidia-headless-460 460.39-0ubuntu0.20.04.1

A setup that DOES WORK (on the same GTX 1080 Ti machine) is:
Ubuntu 16.04 (nvidia-docker2 2.0.3+docker18.09.2-1):

  • linux-image-4.4.0-194-generic + nvidia-430 430.26-0ubuntu0~gpu16.04.1
    (Drivers for RTX 3090 not available for 16.04)

So we suspect that the problem lies in newer versions of the kernel, the driver, or nvidia-docker on the host machine.
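For comparing the working and broken hosts, it helps to snapshot the exact kernel and package versions on each machine in one go. A minimal sketch (collect_versions is a made-up helper name):

```shell
# Hypothetical helper: dump the version information relevant to this bug
# so working and broken hosts can be diffed side by side.
collect_versions() {
    echo "kernel: $(uname -r)"
    echo "nvidia/docker packages:"
    # dpkg may be absent on non-Debian systems; fall back gracefully.
    dpkg -l 2>/dev/null | grep -Ei 'nvidia|docker' || echo "  (none found)"
}
```

Running this on both the working 16.04 machine and a broken 20.04 machine and diffing the output isolates which component versions actually differ.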

We are looking for advice on how to debug this further and fix the problem.
Thanks for any help!


I am experiencing a similar issue. Did you manage to solve it?

Thanks for any help!

Hey, I got an answer in this GitHub issue.

In the end we found a working configuration by downgrading the machines to Ubuntu 18.04, which gave us the combination of the old, working versions of the NVIDIA container libraries we used under 16.04 and up-to-date driver packages.

# dpkg -l | grep nvidia
ii  libnvidia-cfg1-460:amd64               460.56-0ubuntu0.18.04.1                         amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-compute-460:amd64            460.56-0ubuntu0.18.04.1                         amd64        NVIDIA libcompute package
ii  libnvidia-container-tools              1.0.0-1                                         amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64             1.0.0-1                                         amd64        NVIDIA container runtime library
ii  nvidia-compute-utils-460               460.56-0ubuntu0.18.04.1                         amd64        NVIDIA compute utilities
ii  nvidia-container-runtime               2.0.0+docker18.09.2-1                           amd64        NVIDIA container runtime
ii  nvidia-container-runtime-hook          1.4.0-1                                         amd64        NVIDIA container runtime hook
ii  nvidia-dkms-460                        460.56-0ubuntu0.18.04.1                         amd64        NVIDIA DKMS package
ii  nvidia-docker2                         2.0.3+docker18.09.2-1                           all          nvidia-docker CLI wrapper
ii  nvidia-headless-460                    460.56-0ubuntu0.18.04.1                         amd64        NVIDIA headless metapackage
ii  nvidia-headless-no-dkms-460            460.56-0ubuntu0.18.04.1                         amd64        NVIDIA headless metapackage - no DKMS
ii  nvidia-kernel-common-460               460.56-0ubuntu0.18.04.1                         amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-460               460.56-0ubuntu0.18.04.1                         amd64        NVIDIA kernel source package
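To keep a downgraded configuration like this from being silently upgraded again, the container packages can be pinned. A hypothetical apt preferences fragment (the file path and version globs below are examples matching the listing above, not something from the thread):

```
# /etc/apt/preferences.d/nvidia-container-pin  (hypothetical path)
Package: nvidia-docker2 nvidia-container-runtime
Pin: version 2.0.*
Pin-Priority: 1001

Package: libnvidia-container1 libnvidia-container-tools
Pin: version 1.0.*
Pin-Priority: 1001
```

A Pin-Priority above 1000 even prevents apt from "upgrading" these packages when a newer version is the only candidate.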


Thanks a lot for the answer and the information. Downgrading would only be a last resort for us. Luckily, our affected server is not heavily used at the moment, so maybe we can wait until the issue is resolved by an update.

If I find another workaround, I’ll post it here.