After running a describe on one of the validator pods, I found that the failure has something to do with the NVIDIA runtime config not being discovered by the operator. Is there any way to work around this problem?
Can you confirm that you are using toolkit version v1.10.0-ubi8? Also, can you paste the complete output of “kubectl describe pod -l app=nvidia-device-plugin-daemonset -n gpu-operator” and “kubectl logs --all-containers -l app=nvidia-operator-validator -n gpu-operator”?
Can you update the toolkit to the above-mentioned version and also share the output from the validator pods that are failing: “kubectl logs --all-containers -n gpu-operator” (pass the failing pod's name to the logs command) and “kubectl describe pod -n gpu-operator”. The only difference between cuda-validation and plugin-validation is that plugin-validation also ensures that GPU resources are advertised by the plugin to the kubelet, and it runs “nvidia-smi” with an explicit resource request. The logs might indicate why that is failing.
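If it helps to reproduce the plugin-validation step by hand, here is a rough sketch of the same idea: a pod that runs “nvidia-smi” with an explicit GPU resource request. This is not the validator's actual manifest. The function name gpu_test_pod_manifest, the pod name, and the image are all placeholders I made up for illustration; the only parts taken from the standard setup are the nvidia.com/gpu resource name exposed by the device plugin and the nvidia-smi command.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the GPU Operator): prints a pod manifest
# that requests one GPU and runs nvidia-smi once.
#   $1 = pod name (placeholder default: gpu-smoke-test)
#   $2 = container image; required, any CUDA-capable image you already use
gpu_test_pod_manifest() {
  local name="${1:-gpu-smoke-test}"
  local image="${2:?usage: gpu_test_pod_manifest NAME IMAGE}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: ${image}
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
}

# Example usage (assumes kubectl points at the affected cluster and
# CUDA_IMAGE is set to an image of your choice):
#   kubectl describe nodes | grep nvidia.com/gpu
#   gpu_test_pod_manifest gpu-smoke-test "$CUDA_IMAGE" | kubectl apply -f -
#   kubectl logs gpu-smoke-test
```

If the describe output shows no nvidia.com/gpu entry under the node's capacity, the plugin is not advertising GPUs to the kubelet, and the test pod will stay Pending. If the resource is listed but the pod fails to start, the runtime configuration is the more likely culprit.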