Rootless Docker; ERROR: No supported GPU(s) detected to run this container

sascha.saralajew · April 7, 2022, 5:07pm

System specification:

Debian 11
tested NVIDIA driver releases 510.47.03 and 510.60.02
Docker version 20.10.14
NVIDIA Docker 2.10.0

I encounter the following error when starting rootless Docker containers:

$ docker run --rm --gpus all -it  nvcr.io/nvidia/pytorch:22.03-py3

=============
== PyTorch ==
=============

NVIDIA Release 22.03 (build 33569136)
PyTorch Version 1.12.0a0+2c916ef

Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

[...]

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
ERROR: No supported GPU(s) detected to run this container

This happens across several images (e.g., TensorFlow, PyTorch) and versions. CUDA base images start fine, and nvidia-smi returns the expected results in every container. However, when I try to perform a GPU operation with, for example, PyTorch, the library reports that no GPU device is available. This also happens when I start a CUDA container such as nvidia/cuda:11.0-base (which starts without any error) and install PyTorch manually inside it.
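A minimal sketch of how the PyTorch symptom described above could be checked from the host, assuming the image's entrypoint passes the trailing command through to python. The function name check_torch_gpu is hypothetical and not part of any NVIDIA or Docker tooling.

```shell
#!/bin/sh
# Hypothetical helper: ask PyTorch inside a container whether it sees a CUDA device.
# The default image is the one used in this post; pass another image as $1.
check_torch_gpu() {
  image="${1:-nvcr.io/nvidia/pytorch:22.03-py3}"
  docker run --rm --gpus all "$image" \
    python -c 'import torch; print(torch.cuda.is_available())'
}

# Usage (left as a comment so that sourcing this file starts no container):
#   check_torch_gpu
#   check_torch_gpu nvcr.io/nvidia/pytorch:22.03-py3
```

In the failing state described here, the last line of output would be expected to read False even though nvidia-smi works in the same container.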

After a while I observed the following behavior that “fixes” the error mentioned above:

1. Start the Docker daemon in “rootful” mode.
2. Start any CUDA enabled container like sudo docker run --rm --gpus all -it nvcr.io/nvidia/pytorch:22.03-py3. The container will error, due to the known issue with cgroups. In particular, it raises the error ERROR: No supported GPU(s) detected to run this container and the error Failed to detect NVIDIA driver version.
3. Stop the “rootful” Docker daemon (or keep it running since it makes no difference).
4. Start a rootless docker container like docker run --rm --gpus all -it nvcr.io/nvidia/pytorch:22.03-py3 and the GPU error is gone.

After these steps, I can start any CUDA enabled rootless Docker container without any errors (e.g., PyTorch can run computations on the GPU).

I don’t know what is happening when I start a “rootful” container. My guess is that some CUDA/NVIDIA service or process is enabled/started and this process keeps running and is then used by the rootless container to function correctly.
I tried to identify whether an NVIDIA/CUDA process comes alive when I run a “rootful” container, but without success. To make sure the problem is not caused by my system, I tested it with two freshly installed Debian versions and was able to reproduce the error on both. Debian itself runs without any errors, and Docker and the other components were installed according to their manuals without raising any error.

I’ve searched for solutions in the NVIDIA developer forums (https://forums.developer.nvidia.com/) and in the NVIDIA/nvidia-docker GitHub repository, without any success. Because I am not sure whether this is an nvidia-docker issue (since nvidia-smi works), I followed the recommendation in the issue template and am posting the issue here.

Can anybody help?

sascha.saralajew · April 8, 2022, 3:47pm

The problem is that the device nodes are not created at boot (check the devices with ls -la /dev/nvidia*). To solve this, follow the solution in Section 7.4 of the Installation Guide for Linux in the CUDA Toolkit Documentation and create a script like the following, which is called during system startup:

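#!/bin/bash

# Load the NVIDIA kernel module; give up if that fails.
/sbin/modprobe nvidia || exit 1

# Count the number of NVIDIA controllers found.
NVDEVS=$(lspci | grep -i NVIDIA)
N3D=$(echo "$NVDEVS" | grep -c "3D controller")
NVGA=$(echo "$NVDEVS" | grep -c "VGA compatible controller")

# Create one device node per controller (major number 195),
# skipping nodes that already exist.
N=$((N3D + NVGA - 1))
for i in $(seq 0 "$N"); do
  [ -e "/dev/nvidia$i" ] || mknod -m 666 "/dev/nvidia$i" c 195 "$i"
done

[ -e /dev/nvidiactl ] || mknod -m 666 /dev/nvidiactl c 195 255

# Load the unified memory module; give up if that fails.
/sbin/modprobe nvidia-uvm || exit 1

# Find out the major device number used by the nvidia-uvm driver
D=$(grep nvidia-uvm /proc/devices | awk '{print $1}')

[ -e /dev/nvidia-uvm ] || mknod -m 666 /dev/nvidia-uvm c "$D" 0
# nvidia-uvm-tools uses minor number 1; minor 0 would duplicate /dev/nvidia-uvm.
[ -e /dev/nvidia-uvm-tools ] || mknod -m 666 /dev/nvidia-uvm-tools c "$D" 1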

If the script is named .nvidia-init.sh, it can be run at startup via a cron job.
Assuming the file is placed at /root/.nvidia-init.sh, run sudo crontab -e and add the line @reboot /root/.nvidia-init.sh at the end. Make sure the file has the correct permissions: chmod 770 .nvidia-init.sh.
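The crontab entry can also be added without opening an editor. The following is a minimal sketch rather than a tested procedure: with_reboot_entry is a hypothetical helper name, and the default path is the /root/.nvidia-init.sh location assumed above. It reads an existing crontab on stdin and appends the @reboot line only when it is not already present, so running it twice does not duplicate the entry.

```shell
#!/bin/sh
# Hypothetical helper: print the crontab read from stdin, adding the
# @reboot entry for the init script only if it is not already there.
with_reboot_entry() {
  entry="@reboot ${1:-/root/.nvidia-init.sh}"
  current=$(cat)
  # Re-emit the existing entries unchanged.
  if [ -n "$current" ]; then
    printf '%s\n' "$current"
  fi
  # Append the entry unless an identical line exists.
  if ! printf '%s\n' "$current" | grep -qxF -- "$entry"; then
    printf '%s\n' "$entry"
  fi
}

# Usage as root (left as a comment so that sourcing this file changes nothing):
#   crontab -l 2>/dev/null | with_reboot_entry | crontab -
```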

This should solve the error: the device nodes should be available after each reboot, and the rootless containers should run without errors.
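To confirm after a reboot that the nodes exist, a check along the following lines may help. It is a sketch under assumptions: check_nvidia_nodes is a hypothetical helper name, and the node list assumes a single GPU (nvidia0), so it would need extending for multi-GPU machines. It only tests that each path exists and does not verify major or minor numbers.

```shell
#!/bin/sh
# Hypothetical helper: report which NVIDIA device nodes exist under a
# directory (default /dev). Returns 1 if any expected node is missing.
check_nvidia_nodes() {
  dev_dir="${1:-/dev}"
  missing=0
  # Assumed node list for a single-GPU system.
  for node in nvidiactl nvidia0 nvidia-uvm; do
    if [ -e "$dev_dir/$node" ]; then
      echo "ok: $dev_dir/$node"
    else
      echo "missing: $dev_dir/$node"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (left as a comment):
#   check_nvidia_nodes || echo "device nodes were not created at boot"
```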
