GPU-Operator 1.3.0 throws: nvidia-nvswitch: Version mismatch, kernel version 450.80.02 user version 450.51.06

I put together a “Virtualized NVIDIA EGX Stack with NGC GPU-Operator on Ubuntu 18.04.5” install page here: https://gus-gpu.gitlab.io/html/egx-on-vm.html
It worked beautifully with helm.ngc.nvidia.com/nvidia nvidia/gpu-operator last week.
I tried it with helm.ngc.nvidia.com/nvidia nvidia/gpu-operator this week (there are changes in the chart), and I’m getting:
nvidia-nvswitch: Version mismatch, kernel version 450.80.02 user version 450.51.06
This is interfering with the gpu-operator-resources pod/nvidia-dcgm-exporter-XXXX, which is stuck in a continuous CrashLoopBackOff.
Part of the install verifies that no NVIDIA drivers are installed before installing GPU Operator.
The drivers are installed via the helm chart:
sudo helm install --devel nvidia/gpu-operator --wait --generate-name

I can see in the helm chart where we are getting 450.80.02:
    tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists
    version: 450.80.02

Where is it pulling/using 450.51.06?
CUDA 10.2?
    validator:
      image: cuda-sample
      repository: nvcr.io/nvidia/k8s
      version: vectoradd-cuda10.2
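
One way to narrow down which component carries the older user-space version is to list the image each operand pod is actually running. This is only a sketch: the function name `list_operand_images` is made up for illustration, the namespace default is taken from the output in this post, and the jsonpath expression only covers regular containers (not init containers).

```shell
# Hypothetical helper: print "pod-name: image, image, ..." for each pod
# in the given namespace (defaults to gpu-operator-resources).
list_operand_images() {
  kubectl get pods -n "${1:-gpu-operator-resources}" \
    -o jsonpath='{range .items[*]}{"\n"}{.metadata.name}{":\t"}{range .spec.containers[*]}{.image}{", "}{end}{end}'
}
```

Running `list_operand_images` and comparing the driver and dcgm-exporter image tags against the values in the chart may show where the two versions diverge.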

gus@ubu18vm:~$ kubectl get pods -n gpu-operator-resources
NAME                                       READY   STATUS             RESTARTS   AGE
gpu-feature-discovery-svljv                1/1     Running            11         19h
nvidia-container-toolkit-daemonset-8cscj   1/1     Running            2          19h
nvidia-dcgm-exporter-gvjmw                 0/1     CrashLoopBackOff   57         19h
nvidia-device-plugin-daemonset-s54f8       1/1     Running            11         19h
nvidia-device-plugin-validation            0/1     Completed          0          103m
nvidia-driver-daemonset-24lrn              1/1     Running            6          19h
nvidia-driver-validation                   0/1     Completed          0          103m
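
When re-running the install on several VMs, it can help to reduce this listing to just the pods that are neither Running nor Completed. A small sketch, assuming the default `kubectl get pods` column order (NAME, READY, STATUS, RESTARTS, AGE); the function name `unhealthy_pods` is hypothetical.

```shell
# Hypothetical filter: read `kubectl get pods` output on stdin and print
# name, status and restart count for pods that are not Running/Completed.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3, $4 }'
}
```

Usage: `kubectl get pods -n gpu-operator-resources | unhealthy_pods`. Note that a pod with a high restart count can still show as Running, so the RESTARTS column in the full listing remains worth a look.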

gus@ubu18vm:~$ kubectl get deployments --show-labels
NAME                                                    READY   UP-TO-DATE   AVAILABLE   AGE   LABELS
gpu-operator                                            1/1     1            1           19h   app.kubernetes.io/component=gpu-operator,app.kubernetes.io/managed-by=Helm
gpu-operator-1604686334-node-feature-discovery-master   1/1     1            1           19h   app.kubernetes.io/component=master,app.kubernetes.io/instance=gpu-operator

The only reference I can find to this error:


This occurs with GPU driver versions later than 450.51.06. The version check occurs on
all DGX systems, but applies only to NVSwitch systems, so the message can be ignored
on non-NVSwitch systems such as the DGX Station or DGX-1.

I’ll bring up fresh Ubuntu 18 and 20 VMs, retry, and post the results. Is anyone else seeing this?

I brought up another 18.04 VM with the same results.
I brought up a 20.04.1 VM with the same results.
Both show issues with gpu-operator-resources pod/nvidia-dcgm-exporter
and throw: “nvidia-nvswitch: Version mismatch, kernel version 450.80.02 user version 450.51.06”
This install worked on Oct. 23, and has failed each time I’ve tried it since Nov. 7.
There are 9 references to DCGM in the 13 files that were changed on Oct 27 here:


This PDF, https://docs.nvidia.com/datacenter/tesla/pdf/NVIDIA_Data_Center_GPU_Driver_Release_Notes_450_v1.5.pdf,
states: This release (450.80.02) supports the following APIs:
‣ NVIDIA® CUDA® 11.0 for NVIDIA® Kepler™, Maxwell™, Pascal™, Volta™, Turing™ and
NVIDIA Ampere architecture GPUs
The helm chart appears to call CUDA 10.2…?
I also tried walking back one of the chart changes with --set:
helm install --debug nvidia/gpu-operator --generate-name --set dcgmExporter.version=2.0.10-2.1.0-rc.2-ubuntu20.04 --wait

Same results.
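
Besides overriding single values with --set, another way to walk back chart changes is to pin the whole chart to a version that is known to have worked. A minimal sketch, assuming the `nvidia` repo is already added; the function name `install_pinned_gpu_operator` is hypothetical, and the version argument is a placeholder to be filled in from the search output, not a value taken from this thread.

```shell
# List every published chart version, including pre-releases, so a
# known-good one can be picked:
#   helm search repo nvidia/gpu-operator --devel --versions

# Hypothetical helper: install the chart pinned to the version passed as $1.
# --devel is left out because helm ignores it when --version is set.
install_pinned_gpu_operator() {
  chart_version="$1"
  helm install nvidia/gpu-operator --version "$chart_version" --wait --generate-name
}
```

Usage would look like `install_pinned_gpu_operator <chart-version>`. Pinning only fixes the chart; if the chart references image tags that have since been re-pushed, results could still differ.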
After 30 min, gpu-operator-resources started crashing too.
kubectl describe -n gpu-operator-resources pod/nvidia-device-plugin-validation
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  28m (x2 over 28m)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
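
An "Insufficient nvidia.com/gpu" event means the scheduler found no node with a free `nvidia.com/gpu` resource at that moment. A quick sketch for checking what each node advertises; the function name `gpu_allocatable` is hypothetical, and an empty second column would suggest the device plugin has not registered any GPU on that node.

```shell
# Hypothetical helper: print "<node-name> <allocatable nvidia.com/gpu>" per node.
# The dot in the resource name is escaped so jsonpath treats it as one key.
gpu_allocatable() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
}
```

If the node does advertise a GPU, the next thing to look at would be whether another pod already holds it.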

Will continue digging.
Gus

I’ve moved this thread to the correct location:
https://github.com/NVIDIA/gpu-operator/issues/115
When I went to reproduce the issue and capture error descriptions, I found the version had changed to 1.4.0.
I installed 1.4.0 with “different” results:
https://github.com/NVIDIA/gpu-operator/issues/116