Permission issue with the CUDA driver in podman on AWS RHEL 8.2 after restart

Issue 1: After a reboot, podman can only run the nvidia/cuda:11.0-base container after nvidia-smi has been run on the host machine.

Step 1 - Input command:
podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ nvidia/cuda:11.0-base nvidia-smi
Result:
Error: OCI runtime error: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded
Step 2 - Input command:
nvidia-smi
Result:
nvidia-smi returns its normal output on the host
Step 3 - Input command:
podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ nvidia/cuda:11.0-base nvidia-smi
Result:
The container now runs and nvidia-smi returns its normal output
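
For diagnosis, a check like the following could be run on the host right after reboot (before any nvidia-smi call) to confirm whether the NVIDIA kernel module and device nodes are only created once nvidia-smi runs; the /dev/nvidia* names are just the standard ones, so treat this as a sketch:

# check whether the nvidia kernel module is already loaded
lsmod | grep nvidia

# check whether the device nodes have been created yet
ls -l /dev/nvidia*

If both commands come back empty until after Step 2, that would match the Step 1 vs Step 3 behaviour above.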


Issue 2: podman can only run the mirrorgooglecontainers/cuda-vector-add container rootless after it has first been run with sudo.

Step 1 - Input command:
podman run --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ --cap-drop=ALL docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1
Result:
Error: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: stat failed: /dev/nvidia-modeset: no such file or directory: OCI runtime attempted to invoke a command that was not found
Step 2 - Input command:
sudo podman run --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ --cap-drop=ALL docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1
Result:
Failed to allocate device vector A (error code no CUDA-capable device is detected)!
[Vector addition of 50000 elements]
Step 3 - Input command:
podman run --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ --cap-drop=ALL docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1
Result:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
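
Since the Issue 2 error points at /dev/nvidia-modeset, a similar sketch (again assuming the standard device node names) could show whether that node only appears after the sudo run, and what the NVIDIA hook expects to mount:

# does the modeset node (and the other NVIDIA nodes) exist before the sudo run?
ls -l /dev/nvidia-modeset /dev/nvidiactl /dev/nvidia-uvm

# devices and libraries the NVIDIA container hook will try to mount
nvidia-container-cli list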


What is the potential cause of the errors in Issue 1 and Issue 2, and how can they be fixed?

Here are the versions of software that we are using in our AWS environment:

podman -v

podman version 3.2.3

cat /etc/os-release

NAME="Red Hat Enterprise Linux"
VERSION="8.2 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.2"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.2 (Ootpa)"
ANSI_COLOR="0;31"

I followed these steps to install the NVIDIA Container Toolkit for podman:

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#id10
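
For reference, these are the checks I can run to confirm the toolkit pieces from the install guide are actually in place (the package names are the ones the guide installs, so adjust if yours differ):

# confirm the toolkit packages are installed
rpm -q nvidia-container-toolkit libnvidia-container-tools

# confirm the prestart hook that --hooks-dir points at is present
ls -l /usr/share/containers/oci/hooks.d/

# version of the CLI that the hook invokes
nvidia-container-cli --version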
