GPU becomes unavailable after some time in Docker container


After updating the software on some of our workstations, we have the problem that GPUs become unavailable inside running Docker containers.

We first noticed this when PyTorch experiments failed on the second script called in the container with a RuntimeError: No CUDA GPUs are available.
While debugging, we noticed that simply starting a container with nvidia-docker run --rm -it nvidia/cuda:11.2.1-devel-ubuntu20.04 bash and running watch -n 1 nvidia-smi inside it also does not work as expected: at first the output is correct, but after some time (which varies between a few seconds and several hours) it changes to Failed to initialize NVML: Unknown Error.
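To pin down exactly when the failure occurs, a timestamped health-check loop can replace the plain watch. This is only a sketch; monitor_gpu is a hypothetical helper name, and inside the container you would call it with nvidia-smi:

```shell
# Hypothetical helper: poll a health-check command and report the time
# of the first failure. Inside the container, run:  monitor_gpu nvidia-smi 1
monitor_gpu() {
    local check_cmd="${1:-nvidia-smi}"   # command to poll
    local interval="${2:-1}"             # seconds between polls
    while "$check_cmd" > /dev/null 2>&1; do
        sleep "$interval"
    done
    # First failure: log a timestamp so it can be correlated with host logs.
    echo "$(date -Is): '$check_cmd' failed"
    return 1
}
```

Correlating the logged timestamp with the host's journalctl or dmesg output may narrow down which host-side event coincides with the container losing the GPU.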

We could reproduce the error with different Docker images, such as nvidia/cuda:11.2.1-devel-ubuntu20.04 and images based on pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime.

We have reproduced this bug on different workstations with completely different hardware and GPUs (GTX 1080 Ti and RTX 3090).

Setups that do NOT work (GTX 1080 Ti and RTX 3090 workstations) are:
Ubuntu 20.04 (nvidia-docker2 2.5.0-1):

  • linux-image-5.4.0-65-generic + nvidia-headless-450 450.102.04-0ubuntu0.20.04.1
  • linux-image-5.8.0-44-generic + nvidia-headless-460 460.39-0ubuntu0.20.04.1

A setup that DOES WORK (on the same GTX 1080 Ti machine) is:
Ubuntu 16.04 (nvidia-docker2 2.0.3+docker18.09.2-1):

  • linux-image-4.4.0-194-generic + nvidia-430 430.26-0ubuntu0~gpu16.04.1
    (Drivers for RTX 3090 not available for 16.04)

So we suspect that the problem lies in newer versions of the kernel, the driver, or nvidia-docker on the host machine.
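For comparing the working and broken hosts, it helps to snapshot the exact kernel and package versions on each machine in one go. A minimal sketch (collect_versions is a made-up helper name):

```shell
# Hypothetical helper: dump the version information relevant to this bug
# so working and broken hosts can be diffed side by side.
collect_versions() {
    echo "kernel: $(uname -r)"
    echo "nvidia/docker packages:"
    # dpkg may be absent on non-Debian systems; fall back gracefully.
    dpkg -l 2>/dev/null | grep -Ei 'nvidia|docker' || echo "  (none found)"
}
```

Running this on both the working 16.04 machine and a broken 20.04 machine and diffing the output isolates which component versions actually differ.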

We are looking for advice on how to debug this further and fix the problem.
Thanks for any help!


I am experiencing a similar issue. Did you manage to solve it?

Thanks for any help!

Hey, I got an answer in this GitHub issue.

In the end we found a working configuration by downgrading the machines to Ubuntu 18.04, which gave us the combination of the old, working versions of the NVIDIA container libraries we used under 16.04 and up-to-date driver packages.

# dpkg -l | grep nvidia
ii  libnvidia-cfg1-460:amd64               460.56-0ubuntu0.18.04.1                         amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-compute-460:amd64            460.56-0ubuntu0.18.04.1                         amd64        NVIDIA libcompute package
ii  libnvidia-container-tools              1.0.0-1                                         amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64             1.0.0-1                                         amd64        NVIDIA container runtime library
ii  nvidia-compute-utils-460               460.56-0ubuntu0.18.04.1                         amd64        NVIDIA compute utilities
ii  nvidia-container-runtime               2.0.0+docker18.09.2-1                           amd64        NVIDIA container runtime
ii  nvidia-container-runtime-hook          1.4.0-1                                         amd64        NVIDIA container runtime hook
ii  nvidia-dkms-460                        460.56-0ubuntu0.18.04.1                         amd64        NVIDIA DKMS package
ii  nvidia-docker2                         2.0.3+docker18.09.2-1                           all          nvidia-docker CLI wrapper
ii  nvidia-headless-460                    460.56-0ubuntu0.18.04.1                         amd64        NVIDIA headless metapackage
ii  nvidia-headless-no-dkms-460            460.56-0ubuntu0.18.04.1                         amd64        NVIDIA headless metapackage - no DKMS
ii  nvidia-kernel-common-460               460.56-0ubuntu0.18.04.1                         amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-460               460.56-0ubuntu0.18.04.1                         amd64        NVIDIA kernel source package
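To keep a downgraded configuration like this from being silently upgraded again, the container packages can be pinned. A hypothetical apt preferences fragment (the file path and version globs below are examples matching the listing above, not something from the thread):

```
# /etc/apt/preferences.d/nvidia-container-pin  (hypothetical path)
Package: nvidia-docker2 nvidia-container-runtime
Pin: version 2.0.*
Pin-Priority: 1001

Package: libnvidia-container1 libnvidia-container-tools
Pin: version 1.0.*
Pin-Priority: 1001
```

A Pin-Priority above 1000 even prevents apt from "upgrading" these packages when a newer version is the only candidate.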


Thanks a lot for the answer and the information. Downgrading would only be a last resort for us. Luckily, our affected server is not heavily used at the moment, so maybe we can wait until the issue is resolved by an update.

If I find another workaround, I’ll post it here.