Help with rootless podman or rootless docker and nvidia GPU

I’m trying to run containers with rootless podman or rootless docker while accessing an NVIDIA GPU.
The instructions at Installation Guide — NVIDIA Cloud Native Technologies documentation got me to the point where I can run containers as root, but not as a regular user. When running as a regular user I see the error below; rootless docker produces the same error message.

$ podman run -it --rm  --runtime nvidia nvidia/cuda:11.0-base nvidia-smi
2022/03/10 09:38:07 Error running [/usr/bin/nvidia-container-runtime delete --force 2a82614297c63b515fded4af4b4dfdbc44fb61e5a6cbb0cf4805b29f246787b9]: error creating runtime: error constructing runc runtime: error locating runtime: no runtime binary found from candidate list: [docker-runc runc]
ERRO[0000] Error removing container 2a82614297c63b515fded4af4b4dfdbc44fb61e5a6cbb0cf4805b29f246787b9 from runtime after creation failed
Error: OCI runtime error: time="2022-03-10T09:38:13-05:00" level=warning msg="unable to get oom kill count" error="no directory specified for memory.oom_control"
time="2022-03-10T09:38:13-05:00" level=error msg="container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: privilege change failed: invalid argument\n"
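One thing worth checking first: the "delete --force" error above says no runtime binary was found from the candidate list [docker-runc runc]. A quick sketch for listing which OCI runtimes are actually on PATH (crun is podman's usual default on newer distros; runc/docker-runc are what nvidia-container-runtime was looking for here):

```shell
# The error above says nvidia-container-runtime found no runtime binary
# from the candidate list [docker-runc runc]; see what is actually on PATH.
status=""
for rt in runc crun docker-runc; do
  if command -v "$rt" >/dev/null 2>&1; then
    status="$status found:$rt"
  else
    status="$status missing:$rt"
  fi
done
echo "$status"
```

If runc is missing, installing it (or pointing the runtime at crun) may clear the first error, though it won't by itself explain the privilege failure.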

I’ve already made the cgroup change (no-cgroups = true) in /etc/nvidia-container-runtime/config.toml:

$ cat /etc/nvidia-container-runtime/config.toml 
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/"
load-kmods = true
#no-cgroups = false
no-cgroups = true 
#user = "root:video"
ldconfig = "@/sbin/ldconfig.real"

#debug = "/var/log/nvidia-container-runtime.log"
#debug = "~/.local/nvidia-container-runtime.log"
#debug = "/tmp/nvidia-container-runtime.log"
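For anyone following along, the rootless cgroup change above amounts to flipping no-cgroups in /etc/nvidia-container-runtime/config.toml. A small sketch of doing that with sed, run here against a temporary copy of the relevant lines so it is safe to try anywhere; on a real system you would target the actual config file with sudo:

```shell
# Apply the rootless cgroup change to a scratch copy of config.toml.
# On a real system the target is /etc/nvidia-container-runtime/config.toml.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
load-kmods = true
#no-cgroups = false
ldconfig = "@/sbin/ldconfig.real"
EOF
# Uncomment the key and flip it to true, which rootless setups need.
sed -i 's/^#no-cgroups = false/no-cgroups = true/' "$cfg"
result=$(grep '^no-cgroups' "$cfg")
echo "$result"
rm -f "$cfg"
```

This leaves exactly one active no-cgroups key, rather than the commented default plus an added line as in the config dump above.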

This is on Ubuntu 20.04.

Package versions:

$ apt list --installed '*nvidia*'
Listing... Done
libnvidia-cfg1-510/unknown,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
libnvidia-common-510/unknown,now 510.47.03-0ubuntu1 all [installed,automatic]
libnvidia-compute-510/unknown,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
libnvidia-container-tools/bionic,now 1.8.1-1 amd64 [installed,automatic]
libnvidia-container1/bionic,now 1.8.1-1 amd64 [installed,automatic]
libnvidia-decode-510/unknown,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
libnvidia-encode-510/unknown,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
libnvidia-extra-510/unknown,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
libnvidia-fbc1-510/unknown,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
libnvidia-gl-510/unknown,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
nvidia-compute-utils-510/unknown,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
nvidia-container-toolkit/bionic,now 1.8.1-1 amd64 [installed]
nvidia-dkms-510/unknown,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
nvidia-driver-510/unknown,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
nvidia-kernel-common-510/unknown,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
nvidia-kernel-source-510/unknown,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
nvidia-modprobe/unknown,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
nvidia-prime/focal-updates,now 0.8.16~ all [installed,automatic]
nvidia-settings/unknown,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
nvidia-utils-510/unknown,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
xserver-xorg-video-nvidia-510/unknown,now 510.47.03-0ubuntu1 amd64 [installed,automatic]

$ podman --version
podman version 3.4.2

Does anyone know how to resolve this privilege error?

I recently got this working on WSL2 and wrote up the steps here: GitHub - henrymai/podman_wsl2_cuda_rootless

The main difference I can see from your setup is that when I invoke podman run, I don't pass the "--runtime nvidia" argument; I just let the nvidia-container-toolkit hook do its thing instead.
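To make that concrete: with the toolkit's OCI hook installed, the invocation from the original post drops the runtime flag entirely. This is a sketch only; the hook JSON path below is an assumption that depends on how nvidia-container-toolkit was packaged on your distro, and running it obviously requires podman plus an NVIDIA GPU:

```shell
# Hook-based invocation (no --runtime nvidia): the prestart hook installed by
# nvidia-container-toolkit injects the GPU devices into the container.
# The hook path is an assumption -- verify where your package installed it.
ls /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
podman run -it --rm nvidia/cuda:11.0-base nvidia-smi
```

If the hook JSON isn't in podman's default hook directory, you can point at it explicitly with podman's --hook-dir option.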
