Error starting up CuOpt container

I receive an error when trying to run the container (see below). Looking for some guidance on diagnosing this. Thanks.

docker run --entrypoint /bin/bash  --network=host -it --gpus all --rm nvcr.io/ea-reopt-member-zone/ea-cuopt

output…
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: failed to add device rules: open /sys/fs/cgroup/devices/user.slice/devices.allow: permission denied: unknown.
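In case it helps with diagnosis, here is how the path named in the error can be inspected (generic commands, added for context only):

   # does the cgroup v1 devices path from the error exist, and who owns it?
   ls -l /sys/fs/cgroup/devices/user.slice/devices.allow
   # filesystem type of /sys/fs/cgroup: cgroup2fs means cgroup v2, tmpfs means cgroup v1
   stat -f -c %T /sys/fs/cgroup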

@ryanmelnick

I’m curious, since the error ends with “permission denied”. Do you have sudo privileges in this environment? If you run the command with “sudo” does it work?

sudo docker run --entrypoint /bin/bash  --network=host -it --gpus all --rm nvcr.io/ea-reopt-member-zone/ea-cuopt

What kind of system are you running on? (OS, cloud platform or local, etc)

Running it on Ubuntu in Google Cloud Platform.

Sudo is no better; it has issues finding the image. I’ve also given the file in question read and write access for all users, without any luck.
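For reference, the permission change I mean was roughly this, using the path from the error above (the exact command may have differed):

   sudo chmod a+rw /sys/fs/cgroup/devices/user.slice/devices.allow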

I was able to work around this issue. But thank you.

@ryanmelnick glad to hear it! Can you elaborate on the fix, for the sake of other forum members that might hit this issue?

I’ll also post what I just did here for others, setting up a fresh Ubuntu box on GCP. Here are the steps I followed and links to the instructions:

System: Ubuntu 22.04 with a Tesla T4 GPU, 128 GB disk

Here is what I did, with the relevant command history included:

Installed the CUDA toolkit based on instructions at CUDA Toolkit 11.7 Update 1 Downloads | NVIDIA Developer

I used the “deb (local)” method and cut-and-pasted the commands

    4  sudo apt install gcc
    5  lspci | grep -i nvidia
   11  wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
   12  sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
   13  wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-ubuntu2204-11-7-local_11.7.0-515.43.04-1_amd64.deb
   14  sudo dpkg -i cuda-repo-ubuntu2204-11-7-local_11.7.0-515.43.04-1_amd64.deb
   15  sudo cp /var/cuda-repo-ubuntu2204-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/
   16  sudo apt-get update
   17  sudo apt-get -y install cuda

Checked that I now had working drivers:

   18  nvidia-smi

Installed docker based on instructions at Installing Docker and The Docker Utility Engine for NVIDIA GPUs — NVIDIA AI Enterprise documentation (these instructions are also available elsewhere)

   20  sudo apt-get update
   21  sudo apt-get install -y     apt-transport-https     ca-certificates     curl     gnupg-agent     software-properties-common
   22  curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
   23  sudo apt-key fingerprint 0EBFCD88
   24  sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
   27  sudo apt-get update
   28  sudo apt-get install -y docker-ce docker-ce-cli containerd.io
   29  sudo docker run hello-world

Installed nvidia-container-toolkit following the additional instructions at Installing Docker and The Docker Utility Engine for NVIDIA GPUs — NVIDIA AI Enterprise documentation

   32  distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
   33  curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
   34  curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
   35  sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
   36  sudo systemctl restart docker
   37  sudo docker run --gpus all nvidia/cuda:11.0-base nvidia-smi
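As an optional extra check that the toolkit itself can see the driver (not part of the guide, just something I find useful; it may need sudo depending on device permissions):

   # should list the NVRM/driver version and the detected GPU(s)
   nvidia-container-cli info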

Added my user to the docker group, logged into NGC, and ran cuOpt:

   40  sudo usermod -aG docker $USER

   # here you need to log out and log back in

   43  docker login nvcr.io
   44  docker run --entrypoint /bin/bash  --network=host -it --gpus all --rm nvcr.io/ea-reopt-member-zone/ea-cuopt
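A small addition of my own, in case you want to confirm the group change took effect (not part of the original history):

   # after logging back in, 'docker' should appear in your group list
   id -nG
   # or pick up the new group in the current shell without logging out
   newgrp docker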

I do not have the details of the fix unfortunately.

But thank you for the detail; this is really excellent. I followed your instructions, which are very close to what we have. The only differences are that our system is…

Ubuntu 20.04
Tesla T4
rootless docker (not docker-ce)

On line 32 my distribution comes out as ubuntu20.04, which is correct.
But on line 34 the distribution that ends up in my nvidia-docker.list is ubuntu18.04. Is that correct behavior?
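For what it’s worth, this is how I checked both values (plain shell, nothing system-specific):

   echo $distribution
   cat /etc/apt/sources.list.d/nvidia-docker.list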

Then, running line 37, I get the same driver error message:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled 

@ryanmelnick

Okay, I’m trying a 20.04 instance as well to see if I can reproduce…

But yes, even on the 22.04 instance that worked, my nvidia-docker.list also shows ubuntu18.04.

@ryanmelnick

I tried on 20.04, with both docker and docker-ce (and both nvidia-container-toolkit and nvidia-docker2 packages), and I can’t reproduce :(

Here is the relevant history from my latest attempt, with docker.io and nvidia-docker2:

    1  wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
    2  sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
    3  wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-ubuntu2004-11-7-local_11.7.0-515.43.04-1_amd64.deb
    4  sudo dpkg -i cuda-repo-ubuntu2004-11-7-local_11.7.0-515.43.04-1_amd64.deb
    5  sudo cp /var/cuda-repo-ubuntu2004-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/
    6  sudo apt-get update
    7  sudo apt-get -y install cuda
    8  nvidia-smi
   11  sudo apt-get install docker.io
   14  curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
   15  curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
   16  sudo apt-get update && sudo apt-get install -y nvidia-docker2
   17  sudo systemctl restart docker
   18  sudo docker run --gpus all nvidia/cuda:11.0-base nvidia-smi
   21  sudo usermod -aG docker $USER
   22  exit
   23  docker login nvcr.io
   24  docker run --entrypoint /bin/bash  --network=host -it --gpus all --rm nvcr.io/ea-reopt-member-zone/ea-cuopt

Not sure what to try next.

Ah, nvidia-docker2 installs nvidia-container-toolkit… so it’s essentially the same situation.

The only ways I seem to be able to get the “could not select device driver "" with capabilities: [[gpu]]” error are (quick checks for both are sketched after this list):

  1. do not install the nvidia-container-toolkit package
  2. or, install the package, but fail to restart docker
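The checks for those two cases would look roughly like this (generic commands, not tied to any particular setup):

   # is the toolkit package actually installed?
   dpkg -l nvidia-container-toolkit
   # and has docker been restarted since it was installed?
   sudo systemctl restart docker
   systemctl status docker --no-pager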

@ryanmelnick when you say rootless docker, do you mean something like this

or do you mean simply adding the user to the docker group like I have above?

Update: I switched to rootless docker using this page: How to do a Rootless Docker Installation?
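For anyone following along, a rootless install generally looks something like this (a sketch based on Docker’s rootless install script, not a copy of that page; details may differ):

   # install rootless docker for the current (non-root) user
   curl -fsSL https://get.docker.com/rootless | sh
   # point the client at the rootless daemon, as the install script suggests
   export PATH=$HOME/bin:$PATH
   export DOCKER_HOST=unix://$XDG_RUNTIME_DIR/docker.sock
   systemctl --user start docker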

At first I got the original error you reported (permission denied for /sys/fs/cgroup/devices/user.slice/devices.allow).
I fixed that by setting “no-cgroups = true” in /etc/nvidia-container-runtime/config.toml under [nvidia-container-cli].
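For anyone else making that change, the edit and a quick verification look like this (the option lives under the [nvidia-container-cli] section; in a default install it may be present but commented out):

   # set no-cgroups = true under [nvidia-container-cli]
   sudo nano /etc/nvidia-container-runtime/config.toml
   # verify the setting took
   grep -n no-cgroups /etc/nvidia-container-runtime/config.toml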

After that I was able to run the container using rootless docker.