After updating the software on some of our workstations, we have the problem that GPUs become unavailable inside a running Docker container.
We first noticed this when PyTorch experiments failed on the second script called in the container with a
RuntimeError: No CUDA GPUs are available.
While trying to debug this, we noticed that even just starting the container with
nvidia-docker run --rm -it nvidia/cuda:11.2.1-devel-ubuntu20.04 bash and running
watch -n 1 nvidia-smi inside the container does not work as expected. At first the output is correct, but after some time (which varies between a few seconds and several hours) it changes to
Failed to initialize NVML: Unknown Error.
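For reference, the full reproduction sequence is just the two commands above, run in order:

```bash
# Start a plain CUDA container (image as named above)
nvidia-docker run --rm -it nvidia/cuda:11.2.1-devel-ubuntu20.04 bash

# Inside the container: poll the GPU state once per second.
# The output is correct at first, then after seconds to hours flips to:
#   Failed to initialize NVML: Unknown Error
watch -n 1 nvidia-smi
```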
We could reproduce the error with different Docker images, such as
nvidia/cuda:11.2.1-devel-ubuntu20.04 and images based on it.
We have reproduced this bug on different workstations with completely different hardware and GPUs (GTX 1080 Ti and RTX 3090).
Setups that do NOT work (GTX 1080 Ti and RTX 3090 workstations) are:
Ubuntu 20.04 (nvidia-docker2 2.5.0-1):
- linux-image-5.4.0-65-generic + nvidia-headless-450 450.102.04-0ubuntu0.20.04.1
- linux-image-5.8.0-44-generic + nvidia-headless-460 460.39-0ubuntu0.20.04.1
A setup that DOES work (on the same GTX 1080 Ti machine) is:
Ubuntu 16.04 (nvidia-docker2 2.0.3+docker18.09.2-1):
- linux-image-4.4.0-194-generic + nvidia-430 430.26-0ubuntu0~gpu16.04.1
(Drivers for RTX 3090 not available for 16.04)
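For completeness, here is roughly how we collected the version information listed above on each machine (a minimal sketch; the grep patterns match the Ubuntu package names we use and may need adjusting for other setups):

```bash
# Kernel image currently booted
uname -r

# Installed NVIDIA driver packages and their versions
dpkg -l | grep -E 'nvidia-(headless|driver)'

# nvidia-docker2 / container runtime packages
dpkg -l | grep -E 'nvidia-docker2|nvidia-container'

# Driver version as seen by the loaded kernel module
cat /proc/driver/nvidia/version
```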
So we suspect that the problem lies in a newer version of the host machine's kernel, NVIDIA driver, or nvidia-docker.
We are looking for advice on how to debug this further and fix the problem.
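In case it helps, below is a minimal sketch of how one could pinpoint the exact moment NVML fails inside the container and then inspect host-side logs around that time. The container name nvml-test is just a placeholder we made up:

```bash
#!/usr/bin/env bash
# Keep a CUDA container alive in the background (same image as above).
nvidia-docker run -d --name nvml-test \
    nvidia/cuda:11.2.1-devel-ubuntu20.04 sleep infinity

# Poll nvidia-smi inside the container until it starts failing.
while docker exec nvml-test nvidia-smi > /dev/null 2>&1; do
    sleep 1
done

# Record when NVML broke, then look at host logs around that time;
# they may show what changed on the host just before the failure.
echo "NVML failed inside the container at: $(date --iso-8601=seconds)"
dmesg | tail -n 50
journalctl --since "5 min ago" --no-pager | tail -n 50

docker rm -f nvml-test
```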
Thanks for any help!