Issue 1: After a reboot, podman can run the nvidia/cuda:11.0-base container only after nvidia-smi has been run once on the host machine.
Step1 - Input command:
podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ nvidia/cuda:11.0-base nvidia-smi
Result:
Error: OCI runtime error: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded
Step2 - Input command:
nvidia-smi
Result:
The command returns its normal output.
Step3 - Input command:
podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ nvidia/cuda:11.0-base nvidia-smi
Result:
The container runs and nvidia-smi returns its normal output.
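To narrow down Issue 1, it may help to compare the host state right after a reboot, before and after running nvidia-smi. The sketch below is a hypothetical helper (the function name `report_nvidia_state` and its two optional arguments are illustrative and not part of podman or any NVIDIA tool). It assumes the usual NVIDIA Linux driver module names (nvidia, nvidia_uvm, nvidia_modeset) and device paths under /dev, and only reads state; it changes nothing.

```shell
#!/bin/sh
# Hypothetical diagnostic helper: report NVIDIA kernel modules and device nodes.
# $1: modules file (default /proc/modules), $2: device directory (default /dev).
report_nvidia_state() {
    modules_file="${1:-/proc/modules}"
    dev_dir="${2:-/dev}"

    # Each line of /proc/modules starts with the module name followed by a space.
    for mod in nvidia nvidia_uvm nvidia_modeset; do
        if grep -q "^${mod} " "$modules_file" 2>/dev/null; then
            echo "module ${mod}: loaded"
        else
            echo "module ${mod}: not loaded"
        fi
    done

    # Device nodes assumed for a single-GPU host (GPU minor number 0).
    for node in nvidiactl nvidia0 nvidia-uvm nvidia-modeset; do
        if [ -e "${dev_dir}/${node}" ]; then
            echo "device ${dev_dir}/${node}: present"
        else
            echo "device ${dev_dir}/${node}: missing"
        fi
    done
}

report_nvidia_state "$@"
```

Running it once immediately after a reboot and once after Step2 shows whether running nvidia-smi on the host is what loads the modules and creates the device nodes.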
Issue 2: podman can run the mirrorgooglecontainers/cuda-vector-add container without sudo only after the same command has been run once with sudo.
Step1 - Input command:
podman run --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ --cap-drop=ALL docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1
Result:
Error: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: stat failed: /dev/nvidia-modeset: no such file or directory: OCI runtime attempted to invoke a command that was not found
Step2 - Input command:
sudo podman run --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ --cap-drop=ALL docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1
Result:
Failed to allocate device vector A (error code no CUDA-capable device is detected)!
[Vector addition of 50000 elements]
Step3 - Input command:
podman run --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ --cap-drop=ALL docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1
Result:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
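Both issues look consistent with one hypothesis: after a reboot, the NVIDIA kernel modules and /dev/nvidia* device nodes do not exist until a privileged process touches the driver, and a rootless container hook cannot create them. This is an assumption, not a confirmed diagnosis. If it holds, one possible workaround is to create the nodes at boot with nvidia-modprobe. The sketch below assumes a single GPU with minor number 0 and that the options `-c`, `-u`, and `-m` behave as described in `nvidia-modprobe --help` on this driver version; the function names `run` and `create_nvidia_nodes` are hypothetical. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of a boot-time workaround; an assumption, not a confirmed fix.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

create_nvidia_nodes() {
    # Load the nvidia module; create /dev/nvidiactl and /dev/nvidia0 (minor 0 assumed).
    run nvidia-modprobe -c=0
    # Load nvidia-uvm and create its device file (/dev/nvidia-uvm).
    run nvidia-modprobe -u -c=0
    # Load nvidia-modeset and create /dev/nvidia-modeset.
    run nvidia-modprobe -m
}

DRY_RUN="${DRY_RUN:-1}"
create_nvidia_nodes
```

To apply it for real, the script would be run with DRY_RUN=0 from a boot-time unit or similar. Whether root is required depends on how nvidia-modprobe is installed on the host.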
What are the potential causes of the errors in Issue 1 and Issue 2, and how can they be fixed?