I receive the error below when trying to run the container. Looking for some guidance on diagnosing this. Thanks.
docker run --entrypoint /bin/bash --network=host -it --gpus all --rm nvcr.io/ea-reopt-member-zone/ea-cuopt
output…
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: failed to add device rules: open /sys/fs/cgroup/devices/user.slice/devices.allow: permission denied: unknown.
@ryanmelnick
I’m curious, since the error ends with “permission denied”: do you have sudo privileges in this environment? If you run the command with sudo, does it work?
sudo docker run --entrypoint /bin/bash --network=host -it --gpus all --rm nvcr.io/ea-reopt-member-zone/ea-cuopt
What kind of system are you running on? (OS, cloud platform or local, etc.)
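One other thing worth checking: the /sys/fs/cgroup/devices/… path in that error only exists under cgroup v1, so it may help to confirm which cgroup hierarchy your VM is using. A diagnostic sketch, not a fix:

```shell
# Report the filesystem type mounted at /sys/fs/cgroup:
#   "cgroup2fs" -> the host uses the unified cgroup v2 hierarchy
#   "tmpfs"     -> the legacy cgroup v1 hierarchy (where devices.allow lives)
stat -fc %T /sys/fs/cgroup/
```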
Running it on Ubuntu in Google Cloud Platform.
sudo doesn’t help; it has trouble finding the image. I’ve also given the file in question read and write access for all users, without any luck.
I was able to work around this issue. But thank you
@ryanmelnick glad to hear it! Can you elaborate on the fix, for the sake of other forum members that might hit this issue?
I’ll also post here what I just did setting up a fresh Ubuntu box on GCP, for the sake of others. Here are the steps I followed and links to the instructions:
System: Ubuntu 22.04 with a Tesla T4 GPU, 128 GB disk
Here is what I did, with the relevant command history included.
For the CUDA install I used the “deb (local)” method and cut-and-pasted the commands:
4 sudo apt install gcc
5 lspci | grep -i nvidia
11 wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
12 sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
13 wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-ubuntu2204-11-7-local_11.7.0-515.43.04-1_amd64.deb
14 sudo dpkg -i cuda-repo-ubuntu2204-11-7-local_11.7.0-515.43.04-1_amd64.deb
15 sudo cp /var/cuda-repo-ubuntu2204-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/
16 sudo apt-get update
17 sudo apt-get -y install cuda
Checked that I now had working drivers:
18 nvidia-smi
20 sudo apt-get update
21 sudo apt-get install -y apt-transport-https ca-certificates curl gnupg-agent software-properties-common
22 curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
23 sudo apt-key fingerprint 0EBFCD88
24 sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) \
stable"
27 sudo apt-get update
28 sudo apt-get install -y docker-ce docker-ce-cli containerd.io
29 sudo docker run hello-world
32 distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
33 curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
34 curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
35 sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
36 sudo systemctl restart docker
37 sudo docker run --gpus all nvidia/cuda:11.0-base nvidia-smi
Added my user to the docker group, logged into NGC, and ran cuOpt:
40 sudo usermod -aG docker $USER
# here you need to log out and log back in
43 docker login nvcr.io
44 docker run --entrypoint /bin/bash --network=host -it --gpus all --rm nvcr.io/ea-reopt-member-zone/ea-cuopt
Unfortunately I do not have the details of the fix.
But thank you for the detail, this is really excellent. I followed your instructions, which are very close to what we have. The only differences are that our system is:
ubuntu 20.04
Tesla T4
rootless docker (not docker-ce)
On line 32 my distribution comes out as ubuntu20.04, which is correct.
But on line 34 the distribution that ends up in my nvidia-docker.list is ubuntu18.04. Is that correct behavior?
Then, running line 37, I get the same driver error message:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled
@ryanmelnick
Okay, I’m trying a 20.04 instance as well to see if I can reproduce…
But yes, even on the 22.04 instance that worked, my nvidia-docker.list also shows ubuntu18.04.
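For what it’s worth, the $distribution value on line 32 is derived purely from /etc/os-release, independent of what the fetched list file contains. A minimal sketch (my guess is that the upstream repo simply reuses the ubuntu18.04 package list for newer releases, which would explain what we’re both seeing):

```shell
# Rebuild the variable from line 32. On Ubuntu 20.04, /etc/os-release sets
# ID=ubuntu and VERSION_ID="20.04", so this prints "ubuntu20.04"; that value
# is only used to pick the URL that nvidia-docker.list is fetched from.
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
echo "$distribution"
```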
@ryanmelnick
I tried on 20.04, with both docker and docker-ce (and both nvidia-container-toolkit and nvidia-docker2 packages), and I can’t reproduce :(
Here is the relevant history from my latest attempt, with docker and nvidia-docker2
1 wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
2 sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
3 wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-ubuntu2004-11-7-local_11.7.0-515.43.04-1_amd64.deb
4 sudo dpkg -i cuda-repo-ubuntu2004-11-7-local_11.7.0-515.43.04-1_amd64.deb
5 sudo cp /var/cuda-repo-ubuntu2004-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/
6 sudo apt-get update
7 sudo apt-get -y install cuda
8 nvidia-smi
11 sudo apt-get install docker.io
14 curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
15 curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
16 sudo apt-get update && sudo apt-get install -y nvidia-docker2
17 sudo systemctl restart docker
18 sudo docker run --gpus all nvidia/cuda:11.0-base nvidia-smi
21 sudo usermod -aG docker $USER
22 exit
23 docker login nvcr.io
24 docker run --entrypoint /bin/bash --network=host -it --gpus all --rm nvcr.io/ea-reopt-member-zone/ea-cuopt
Not sure what to try next.
Ah, nvidia-docker2 installs nvidia-container-toolkit, so it’s essentially the same situation.
The only ways I seem to be able to get the “could not select device driver "" with capabilities: [[gpu]]” error are:
- not installing the nvidia-container-toolkit package
- or installing the package, but failing to restart docker
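A quick way to tell which of those two states a machine is in (a diagnostic sketch; the exact output varies by install method, and nvidia-container-cli is the binary named in the original error):

```shell
# Check that the toolkit's CLI is installed, and ask docker which runtimes it
# knows about; after installing nvidia-container-toolkit/nvidia-docker2 and
# restarting docker, GPU support should be available. Guarded so it degrades
# gracefully on machines without docker.
cli_status=$(command -v nvidia-container-cli || echo "nvidia-container-toolkit not installed")
docker_runtimes=$(docker info --format '{{.Runtimes}}' 2>/dev/null || echo "docker not reachable")
echo "container CLI:   $cli_status"
echo "docker runtimes: $docker_runtimes"
```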
@ryanmelnick when you say rootless docker, do you mean something like this,
or simply adding the user to the docker group, as I did above?
Update: I switched to rootless docker using this page: How to do a Rootless Docker Installation?
At first I got the original error you reported (permission denied for /sys/fs/cgroup/devices/user.slice/devices.allow).
I fixed that by setting “no-cgroups = true” under [nvidia-container-cli] in /etc/nvidia-container-runtime/config.toml.
After that I was able to run the container using rootless docker.
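For reference, here is the fragment of /etc/nvidia-container-runtime/config.toml as I changed it (everything else in the file was left at its defaults; the comment is mine):

```toml
[nvidia-container-cli]
# Skip writing device rules into the cgroup hierarchy; rootless docker has no
# permission to modify /sys/fs/cgroup/devices/..., which caused the error above.
no-cgroups = true
```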