Hello,
after updating the software on some of our workstations, we have the problem that GPUs become unavailable inside running Docker containers.
We first noticed this when PyTorch experiments failed on the second script called in the container with a RuntimeError: No CUDA GPUs are available.
While trying to debug this, we noticed that simply starting a container with nvidia-docker run --rm -it nvidia/cuda:11.2.1-devel-ubuntu20.04 bash and running watch -n 1 nvidia-smi inside it does not work as expected either. At first the output is normal, but after some time (which varies between a few seconds and several hours) it changes to Failed to initialize NVML: Unknown Error.
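For reference, the full reproduction is just these two steps (the commands exactly as we run them):
# on the host: start an interactive CUDA container with GPU access
nvidia-docker run --rm -it nvidia/cuda:11.2.1-devel-ubuntu20.04 bash
# inside the container: poll the GPUs once per second
watch -n 1 nvidia-smi
# works at first, later only prints: Failed to initialize NVML: Unknown Error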
We could reproduce the error with different Docker images, such as nvidia/cuda:11.2.1-devel-ubuntu20.04 and images based on nvcr.io/nvidia/pytorch:20.12-py3 and pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime.
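Since the time to failure varies so much, a small loop inside the container can log how long a given image survives before the error appears (just an illustrative sketch, not part of our setup):
# inside the container: wait until nvidia-smi starts failing and report the elapsed time
start=$(date +%s)
while nvidia-smi > /dev/null 2>&1; do
    sleep 1
done
echo "nvidia-smi started failing after $(( $(date +%s) - start )) seconds"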
We have reproduced this bug on different workstations with completely different hardware and GPUs (GTX 1080 Ti and RTX 3090).
Setups that do NOT work (GTX 1080 Ti and RTX 3090 workstations) are:
Ubuntu 20.04 (nvidia-docker2 2.5.0-1):
- linux-image-5.4.0-65-generic + nvidia-headless-450 450.102.04-0ubuntu0.20.04.1
- linux-image-5.8.0-44-generic + nvidia-headless-460 460.39-0ubuntu0.20.04.1
A setup that DOES work (on the same GTX 1080 Ti machine) is:
Ubuntu 16.04 (nvidia-docker2 2.0.3+docker18.09.2-1):
- linux-image-4.4.0-194-generic + nvidia-430 430.26-0ubuntu0~gpu16.04.1
(Drivers for RTX 3090 not available for 16.04)
So we suspect that the problem lies in newer versions of the kernel, the driver, or nvidia-docker on the host machine.
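For reference, the versions listed above were collected with the usual tools, roughly like this (the exact grep pattern is just illustrative):
# on the host: kernel, driver, and container runtime versions
uname -r
nvidia-smi --query-gpu=driver_version --format=csv,noheader
dpkg -l | grep -E 'nvidia-(docker|container)'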
We are looking for advice on how to debug this further and fix the problem.
Thanks for any help!
Hello,
I am experiencing a similar issue. Did you manage to solve it in the meantime?
Thanks for any help!
Hey, I got an answer in this GitHub issue.
In the end, we found a working configuration by downgrading the machines to Ubuntu 18.04, which gives us the old, working versions of the NVIDIA container libraries that we had used under 16.04, combined with up-to-date driver packages.
# dpkg -l | grep nvidia
ii libnvidia-cfg1-460:amd64 460.56-0ubuntu0.18.04.1 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-compute-460:amd64 460.56-0ubuntu0.18.04.1 amd64 NVIDIA libcompute package
ii libnvidia-container-tools 1.0.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.0.0-1 amd64 NVIDIA container runtime library
ii nvidia-compute-utils-460 460.56-0ubuntu0.18.04.1 amd64 NVIDIA compute utilities
ii nvidia-container-runtime 2.0.0+docker18.09.2-1 amd64 NVIDIA container runtime
ii nvidia-container-runtime-hook 1.4.0-1 amd64 NVIDIA container runtime hook
ii nvidia-dkms-460 460.56-0ubuntu0.18.04.1 amd64 NVIDIA DKMS package
ii nvidia-docker2 2.0.3+docker18.09.2-1 all nvidia-docker CLI wrapper
ii nvidia-headless-460 460.56-0ubuntu0.18.04.1 amd64 NVIDIA headless metapackage
ii nvidia-headless-no-dkms-460 460.56-0ubuntu0.18.04.1 amd64 NVIDIA headless metapackage - no DKMS
ii nvidia-kernel-common-460 460.56-0ubuntu0.18.04.1 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-460 460.56-0ubuntu0.18.04.1 amd64 NVIDIA kernel source package
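If you want to keep apt from upgrading the container runtime packages past these versions again, something like apt-mark hold should work (just a sketch; adjust the package list to your machine):
# hold the container runtime packages so a later upgrade does not replace them
apt-mark hold libnvidia-container1 libnvidia-container-tools \
    nvidia-container-runtime nvidia-container-runtime-hook nvidia-docker2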
Hey,
thanks a lot for the answer and the information. Downgrading would only be a last resort for us. Luckily, our affected server is not heavily used at the moment, so maybe we can wait until the issue is resolved by an update.
In case I find another workaround, I'll post it here.
Best,
Benedikt