Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

Hi,

We installed the GPU Operator Helm chart (https://helm.ngc.nvidia.com/nvidia) v23.3.2.
The cluster is AWS EKS 1.29, the node type is g4dn.xlarge, and the OS is Ubuntu 20.04.

It works after the first setup, but after a period of time it fails with the error below:

$ kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS                  RESTARTS         AGE
gpu-feature-discovery-gj7v5                                       0/1     Init:0/1                0                3m31s
gpu-feature-discovery-zwpvt                                       0/1     Init:0/1                0                4m5s
gpu-operator-5f5589bb7c-4mpgw                                     1/1     Running                 0                13h
gpu-operator-gpu-operator-nvidia-node-feature-discovery-ma8pjr8   1/1     Running                 0                13h
gpu-operator-gpu-operator-nvidia-node-feature-discovery-wo6zbqv   1/1     Running                 0                48d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-wo7hz58   1/1     Running                 0                48d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-wopdzng   1/1     Running                 0                28d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-worbzxr   1/1     Running                 0                24d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-worzplr   1/1     Running                 0                23h
gpu-operator-gpu-operator-nvidia-node-feature-discovery-wovhh5r   1/1     Running                 0                46d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-woxg6rj   1/1     Running                 0                28d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-woxjpx6   1/1     Running                 0                23h
nvidia-container-toolkit-daemonset-2trdc                          0/1     Init:0/1                0                4m5s
nvidia-container-toolkit-daemonset-7z44h                          0/1     Init:0/1                0                3m31s
nvidia-dcgm-exporter-hlq27                                        0/1     Init:0/1                0                4m5s
nvidia-dcgm-exporter-z9v66                                        0/1     Init:0/1                0                3m31s
nvidia-device-plugin-daemonset-hsqnw                              0/1     Init:0/1                0                3m31s
nvidia-device-plugin-daemonset-xn5m8                              0/1     Init:0/1                0                4m5s
nvidia-driver-daemonset-ht7bc                                     0/1     Init:CrashLoopBackOff   92 (3m31s ago)   7h32m
nvidia-driver-daemonset-ngbdl                                     0/1     Init:CrashLoopBackOff   91 (4m5s ago)    7h29m
nvidia-operator-validator-8kzrf                                   0/1     Init:0/4                0                3m31s
nvidia-operator-validator-ms7k9                                   0/1     Init:0/4                0                4m5s

When we check the events in the namespace, we see an error like this:

27m         Warning   FailedCreatePodSandBox   pod/nvidia-operator-validator-xhqvx            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
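For anyone hitting the same message: it means containerd on the node has no runtime handler named "nvidia", which the GPU Operator's container-toolkit daemonset normally configures. A rough way to check (assuming SSH access to the node and a default containerd config path, which may differ on your AMI):

```shell
# On the affected node: does containerd know about an "nvidia" runtime?
# Path is an assumption for a default containerd install.
grep -A 3 'runtimes.nvidia' /etc/containerd/config.toml

# A healthy config contains a section roughly like:
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#   runtime_type = "io.containerd.runc.v2"
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#     BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

# From anywhere with kubectl: is the nvidia RuntimeClass present?
kubectl get runtimeclass nvidia -o yaml
```

If the grep finds nothing, the container-toolkit pod never (re)wrote the containerd config on that node, which matches the Init state of the toolkit daemonset pods above.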

Any suggestion to resolve this issue would be very much appreciated!

Thanks!


Did you manage to resolve this? I have a similar problem with the gpu-operator.

Hi @phene,

I checked the logs of one of the pods and saw a "cluster policy not found" error. According to some threads, this requires a clean uninstall.

I went to Argo CD, deleted all the resources, and let Argo CD redeploy them. After that, everything was running well.
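If you are not using Argo CD, a rough equivalent of that clean reinstall with Helm might look like the sketch below. The release name, namespace, and repo alias are assumptions; adjust them to match your install.

```shell
# Assumed release name "gpu-operator" in namespace "gpu-operator"
helm uninstall gpu-operator -n gpu-operator

# The ClusterPolicy custom resource can survive an uninstall;
# remove any leftover instances so the operator starts fresh
kubectl delete clusterpolicies.nvidia.com --all

# Reinstall from the NVIDIA chart repo (repo alias "nvidia" is an assumption)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --version v23.3.2
```

The key step is deleting the leftover ClusterPolicy, since the operator pods were complaining that it could not be found.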

Regards!
