Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

Hi,

We installed the GPU Operator Helm chart (https://helm.ngc.nvidia.com/nvidia) v23.3.2.
The cluster is AWS EKS 1.29, the node type is g4dn.xlarge, and the OS is Ubuntu 20.04.

It works after the first setup, but after a period of time it fails with the error below:

$ kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS                  RESTARTS         AGE
gpu-feature-discovery-gj7v5                                       0/1     Init:0/1                0                3m31s
gpu-feature-discovery-zwpvt                                       0/1     Init:0/1                0                4m5s
gpu-operator-5f5589bb7c-4mpgw                                     1/1     Running                 0                13h
gpu-operator-gpu-operator-nvidia-node-feature-discovery-ma8pjr8   1/1     Running                 0                13h
gpu-operator-gpu-operator-nvidia-node-feature-discovery-wo6zbqv   1/1     Running                 0                48d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-wo7hz58   1/1     Running                 0                48d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-wopdzng   1/1     Running                 0                28d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-worbzxr   1/1     Running                 0                24d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-worzplr   1/1     Running                 0                23h
gpu-operator-gpu-operator-nvidia-node-feature-discovery-wovhh5r   1/1     Running                 0                46d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-woxg6rj   1/1     Running                 0                28d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-woxjpx6   1/1     Running                 0                23h
nvidia-container-toolkit-daemonset-2trdc                          0/1     Init:0/1                0                4m5s
nvidia-container-toolkit-daemonset-7z44h                          0/1     Init:0/1                0                3m31s
nvidia-dcgm-exporter-hlq27                                        0/1     Init:0/1                0                4m5s
nvidia-dcgm-exporter-z9v66                                        0/1     Init:0/1                0                3m31s
nvidia-device-plugin-daemonset-hsqnw                              0/1     Init:0/1                0                3m31s
nvidia-device-plugin-daemonset-xn5m8                              0/1     Init:0/1                0                4m5s
nvidia-driver-daemonset-ht7bc                                     0/1     Init:CrashLoopBackOff   92 (3m31s ago)   7h32m
nvidia-driver-daemonset-ngbdl                                     0/1     Init:CrashLoopBackOff   91 (4m5s ago)    7h29m
nvidia-operator-validator-8kzrf                                   0/1     Init:0/4                0                3m31s
nvidia-operator-validator-ms7k9                                   0/1     Init:0/4                0                4m5s

When we check the events in the namespace, we see an error like this:

27m         Warning   FailedCreatePodSandBox   pod/nvidia-operator-validator-xhqvx            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
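For anyone hitting the same message: it means containerd on the node has no runtime handler named "nvidia", which the GPU Operator's container-toolkit daemonset normally configures. A rough way to check (assuming SSH access to the node and a default containerd config path, which may differ on your AMI):

```shell
# On the affected node: does containerd know about an "nvidia" runtime?
# Path is an assumption for a default containerd install.
grep -A 3 'runtimes.nvidia' /etc/containerd/config.toml

# A healthy config contains a section roughly like:
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#   runtime_type = "io.containerd.runc.v2"
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#     BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

# From anywhere with kubectl: is the nvidia RuntimeClass present?
kubectl get runtimeclass nvidia -o yaml
```

If the grep finds nothing, the container-toolkit pod never (re)wrote the containerd config on that node, which matches the Init state of the toolkit daemonset pods above.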

Any suggestion to resolve this issue would be very much appreciated!

Thanks!


Did you manage to resolve this? I have a similar problem with the gpu-operator.

Hi @phene,

I checked the logs of one of the pods and saw a "cluster policy not found" error. According to some threads, this requires a clean uninstall.

I went to Argo CD, deleted all the resources, and let Argo CD redeploy them. After that, everything was running well.
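If you are not using Argo CD, a rough equivalent of that clean reinstall with Helm might look like the sketch below. The release name, namespace, and repo alias are assumptions; adjust them to match your install.

```shell
# Assumed release name "gpu-operator" in namespace "gpu-operator"
helm uninstall gpu-operator -n gpu-operator

# The ClusterPolicy custom resource can survive an uninstall;
# remove any leftover instances so the operator starts fresh
kubectl delete clusterpolicies.nvidia.com --all

# Reinstall from the NVIDIA chart repo (repo alias "nvidia" is an assumption)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --version v23.3.2
```

The key step is deleting the leftover ClusterPolicy, since the operator pods were complaining that it could not be found.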

Regards!
