GPU Operator Validator Pods Are Failing

Hi Team,

Please find the details of my environment below:

OS: CentOS 8
GPU: Tesla T4
Driver Version: 515.65

Unfortunately, some of the GPU Operator pods keep crashing, mainly the operator validator and device plugin validator pods.

When I describe one of the validator pods, the events suggest it has something to do with the NVIDIA runtime config not being discovered by the operator. Is there any way to work around this problem?
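From what I understand, the toolkit container is supposed to register the NVIDIA runtime with the container runtime on the node. In case it helps, a quick check would be something like the following (a rough sketch, assuming a Docker-based node with the usual default paths; the JSON shown is the typical expected content, not output from my node):

```
# Check whether the NVIDIA runtime has been registered with Docker on the GPU node
cat /etc/docker/daemon.json
# Normally expected to contain something like:
# {
#   "default-runtime": "nvidia",
#   "runtimes": {
#     "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] }
#   }
# }

# Confirm which runtimes Docker actually reports
docker info | grep -i runtime
```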

Can you confirm you are using toolkit version v1.10.0-ubi8? Also, can you paste the complete output of “kubectl describe pod -l app=nvidia-device-plugin-daemonset -n gpu-operator” and “kubectl logs --all-containers -l app=nvidia-operator-validator -n gpu-operator”?
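If you are not sure which toolkit version is deployed, something like the following should show it (a sketch, assuming the operator runs in the gpu-operator namespace with the default ClusterPolicy and daemonset names; adjust if yours differ):

```
# Toolkit version configured in the ClusterPolicy custom resource (cluster-scoped)
kubectl get clusterpolicy cluster-policy -o jsonpath='{.spec.toolkit.version}{"\n"}'

# Image actually used by the container toolkit daemonset
kubectl get daemonset nvidia-container-toolkit-daemonset -n gpu-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
```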

No, the toolkit version I used was 1.7.1-centos8. Please find the attached outputs for your reference.
nv_device_plugin_ds.txt (12.7 KB)
nv_op_val.txt (4.0 KB)

Can you update the toolkit to the above-mentioned version and share the output of the failing validator pods via “kubectl logs --all-containers -n gpu-operator” and “kubectl describe pod -n gpu-operator”? The only difference between cuda-validation and plugin-validation is that for plugin-validation we ensure the GPU resources are advertised by the plugin to the kubelet, and we run “nvidia-smi” with an explicit resource request. The logs might indicate why that is failing.
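If you installed the operator with Helm, the toolkit version can be changed with a values override, and the plugin-validation step can be reproduced by hand with a pod that requests a GPU and runs nvidia-smi. A minimal sketch of both (the release name "gpu-operator", the "nvidia" Helm repo alias, and the CUDA image tag are assumptions; substitute whatever you actually use):

```
# 1. Bump the toolkit version on the existing release
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
  --reuse-values --set toolkit.version=v1.10.0-ubi8

# 2. Manually mimic plugin-validation: request a GPU from the device
#    plugin and run nvidia-smi inside the pod
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

kubectl logs nvidia-smi-test
```

If the test pod stays Pending, the plugin is not advertising nvidia.com/gpu to the kubelet; if it schedules but nvidia-smi fails, the problem is more likely in the runtime/toolkit layer.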

Please find the attached logs and output. I updated the toolkit to the aforementioned version.

desc_po.txt (136.0 KB)
validator_all_logs.txt (4.6 KB)