- Debian 11
- NVIDIA driver releases 510.47.03 and 510.60.02 (both tested)
- Docker version 20.10.14
- NVIDIA Docker 2.10.0
I encounter the following error when starting rootless Docker containers:
$ docker run --rm --gpus all -it nvcr.io/nvidia/pytorch:22.03-py3

=============
== PyTorch ==
=============

NVIDIA Release 22.03 (build 33569136)

PyTorch Version 1.12.0a0+2c916ef

Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
[...]
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

ERROR: No supported GPU(s) detected to run this container
This happens across several images (e.g., TensorFlow, PyTorch) and versions. CUDA base images work fine, and nvidia-smi returns the expected output in every container. However, when I try to perform a GPU operation with, for example, PyTorch, the library reports that no GPU device is available. The same happens when I start a CUDA container such as nvidia/cuda:11.0-base (which starts without any error!) and install PyTorch manually inside the container.
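One detail that may help narrow this down (my own diagnostic sketch, not part of the original setup): nvidia-smi only talks to /dev/nvidiactl and the per-GPU device nodes, while CUDA applications such as PyTorch additionally need the nvidia-uvm nodes, so the two can disagree inside a container. Listing the relevant device nodes shows which case you are in:

```shell
# Which NVIDIA device nodes does this container actually have?
# nvidia-smi needs only nvidiactl and the per-GPU nodes; CUDA applications
# additionally need the nvidia-uvm nodes, so the two can disagree.
for node in /dev/nvidiactl /dev/nvidia0 /dev/nvidia-uvm /dev/nvidia-uvm-tools; do
  if [ -e "$node" ]; then
    echo "present: $node"
  else
    echo "missing: $node"
  fi
done
```

If the uvm nodes are missing while nvidia-smi works, that would match the symptom above.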
After some experimentation, I found the following steps that “fix” the error described above:
- Start the Docker daemon in “rootful” mode.
- Start any CUDA-enabled container, e.g. sudo docker run --rm --gpus all -it nvcr.io/nvidia/pytorch:22.03-py3. The container errors out due to the known cgroups issue; in particular, it prints ERROR: No supported GPU(s) detected to run this container followed by Failed to detect NVIDIA driver version.
- Stop the “rootful” Docker daemon (or keep it running; it makes no difference).
- Start a rootless Docker container, e.g. docker run --rm --gpus all -it nvcr.io/nvidia/pytorch:22.03-py3, and the GPU error is gone.
After these steps, I can start any CUDA enabled rootless Docker container without any errors (e.g., PyTorch can run computations on the GPU).
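Condensed, the workaround looks like this (same image as above; the first command is expected to fail and its error can be ignored):

```
$ sudo docker run --rm --gpus all -it nvcr.io/nvidia/pytorch:22.03-py3   # fails with "No supported GPU(s) detected", ignore
$ docker run --rm --gpus all -it nvcr.io/nvidia/pytorch:22.03-py3       # rootless run now detects the GPU
```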
I don’t know what happens when I start a “rootful” container. My guess is that the “rootful” run starts or enables some CUDA/NVIDIA service or process, which keeps running and is then used by the rootless container to function correctly.
I tried to identify whether an NVIDIA/CUDA process comes alive when I run a “rootful” container, but without success. To make sure the problem is not specific to my system, I reproduced the error on two freshly installed Debian systems. Debian itself runs without any errors, and Docker etc. were installed according to the official manuals without any problems.
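For reference, my checks amounted to something like the following sketch. It scans /proc directly so it also works in minimal environments without ps; nvidia-persistenced would be a typical candidate to show up here:

```shell
# List userspace processes whose name mentions "nvidia"...
found=0
for d in /proc/[0-9]*; do
  comm=$(cat "$d/comm" 2>/dev/null) || continue   # process may have exited
  case "$comm" in
    *[Nn][Vv][Ii][Dd][Ii][Aa]*) echo "process: ${d#/proc/} $comm"; found=1 ;;
  esac
done
[ "$found" -eq 1 ] || echo "no NVIDIA userspace processes found"

# ...and check the loaded kernel modules as well.
grep -i nvidia /proc/modules || echo "no nvidia kernel modules listed"
```

Running this before and after the “rootful” container start showed no difference I could spot.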
I’ve searched for solutions in the NVIDIA developer forums (https://forums.developer.nvidia.com/) and in the NVIDIA/nvidia-docker GitHub repository, without success. Because I am not sure this is an nvidia-docker issue (since nvidia-smi works), I am following the recommendation in the issue template and posting the issue here.
Can anybody help?