Hi,
We installed the NVIDIA GPU Operator Helm chart v23.3.2 from https://helm.ngc.nvidia.com/nvidia.
The cluster is AWS EKS 1.29. The node type is g4dn.xlarge and the OS is Ubuntu 20.04.
It works after the initial setup, but after some time it starts failing as shown below:
$ kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS                  RESTARTS         AGE
gpu-feature-discovery-gj7v5                                       0/1     Init:0/1                0                3m31s
gpu-feature-discovery-zwpvt                                       0/1     Init:0/1                0                4m5s
gpu-operator-5f5589bb7c-4mpgw                                     1/1     Running                 0                13h
gpu-operator-gpu-operator-nvidia-node-feature-discovery-ma8pjr8   1/1     Running                 0                13h
gpu-operator-gpu-operator-nvidia-node-feature-discovery-wo6zbqv   1/1     Running                 0                48d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-wo7hz58   1/1     Running                 0                48d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-wopdzng   1/1     Running                 0                28d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-worbzxr   1/1     Running                 0                24d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-worzplr   1/1     Running                 0                23h
gpu-operator-gpu-operator-nvidia-node-feature-discovery-wovhh5r   1/1     Running                 0                46d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-woxg6rj   1/1     Running                 0                28d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-woxjpx6   1/1     Running                 0                23h
nvidia-container-toolkit-daemonset-2trdc                          0/1     Init:0/1                0                4m5s
nvidia-container-toolkit-daemonset-7z44h                          0/1     Init:0/1                0                3m31s
nvidia-dcgm-exporter-hlq27                                        0/1     Init:0/1                0                4m5s
nvidia-dcgm-exporter-z9v66                                        0/1     Init:0/1                0                3m31s
nvidia-device-plugin-daemonset-hsqnw                              0/1     Init:0/1                0                3m31s
nvidia-device-plugin-daemonset-xn5m8                              0/1     Init:0/1                0                4m5s
nvidia-driver-daemonset-ht7bc                                     0/1     Init:CrashLoopBackOff   92 (3m31s ago)   7h32m
nvidia-driver-daemonset-ngbdl                                     0/1     Init:CrashLoopBackOff   91 (4m5s ago)    7h29m
nvidia-operator-validator-8kzrf                                   0/1     Init:0/4                0                3m31s
nvidia-operator-validator-ms7k9                                   0/1     Init:0/4                0                4m5s
When we check the events in the namespace, we see an error like this:
27m Warning FailedCreatePodSandBox pod/nvidia-operator-validator-xhqvx Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
Any suggestions for resolving this issue would be very much appreciated!
Thanks!