GPU-Operator 1.3.0 throws: nvidia-nvswitch: Version mismatch, kernel version 450.80.02 user version 450.51.06

gus.gilliland · November 7, 2020, 2:04pm

I put together a “Virtualized NVIDIA EGX Stack with NGC GPU-Operator on Ubuntu 18.04.5” install page here: https://gus-gpu.gitlab.io/html/egx-on-vm.html
It worked beautifully with helm.ngc.nvidia.com/nvidia nvidia/gpu-operator last week.
I tried it with helm.ngc.nvidia.com/nvidia nvidia/gpu-operator this week (there are changes in the chart), and I’m getting:
“nvidia-nvswitch: Version mismatch, kernel version 450.80.02 user version 450.51.06”
This is interfering with pod/nvidia-dcgm-exporter-XXXX in the gpu-operator-resources namespace, which is in a continuous CrashLoopBackOff.
Part of the install verifies there are no nvidia drivers installed prior to installing GPU Operator.
Drivers are installed via the helm chart:
sudo helm install --devel nvidia/gpu-operator --wait --generate-name

I can see in the helm chart where we are getting 450.80.02:
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  version: 450.80.02
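To see every version the chart pins in one pass, rather than reading through the values by eye, a small filter like the one below may help. This is only a sketch: `list_component_versions` is a made-up helper name, and it assumes the values are laid out as top-level sections (for example `driver:` and `validator:`) with a `version:` key indented by two spaces underneath. The `driver` section name is my assumption; the excerpt above does not show it.

```shell
#!/bin/sh
# Hypothetical helper: read a values.yaml on stdin and print
# "<top-level-section> <version>" for each "version:" key that sits
# directly (two-space indent) under a top-level section.
list_component_versions() {
  awk '
    /^[A-Za-z][A-Za-z0-9_-]*:/ { section = $1; sub(/:.*/, "", section); next }
    /^  version:/              { v = $2; gsub(/"/, "", v); print section, v }
  '
}

# Usage (assumes Helm 3 and the nvidia repo already added):
#   helm show values nvidia/gpu-operator --devel | list_component_versions
```

If 450.51.06 does not appear in that list, the older version is presumably baked into one of the container images rather than set in the chart values.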

Where is it pulling/using 450.51.06?
CUDA 10.2?
  validator:
    image: cuda-sample
    repository: nvcr.io/nvidia/k8s
    version: vectoradd-cuda10.2

gus@ubu18vm:~$ kubectl get pods -n gpu-operator-resources
NAME                                       READY   STATUS             RESTARTS   AGE
gpu-feature-discovery-svljv                1/1     Running            11         19h
nvidia-container-toolkit-daemonset-8cscj   1/1     Running            2          19h
nvidia-dcgm-exporter-gvjmw                 0/1     CrashLoopBackOff   57         19h
nvidia-device-plugin-daemonset-s54f8       1/1     Running            11         19h
nvidia-device-plugin-validation            0/1     Completed          0          103m
nvidia-driver-daemonset-24lrn              1/1     Running            6          19h
nvidia-driver-validation                   0/1     Completed          0          103m
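For anyone comparing their own cluster against the listing above, a filter along these lines would pick out only the pods that are neither Running nor Completed. It is a sketch: `unhealthy_pods` is a hypothetical name, and it assumes the default `kubectl get pods` column order (NAME, READY, STATUS, RESTARTS, AGE).

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pods` output on stdin and print
# the name, status and restart count of every pod whose STATUS is not
# Running or Completed. The header row (line 1) is skipped.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3, "restarts=" $4 }'
}

# Usage (assumes kubectl is configured for the cluster):
#   kubectl get pods -n gpu-operator-resources | unhealthy_pods
```

Run against the listing above, it should report only nvidia-dcgm-exporter-gvjmw.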

gus@ubu18vm:~$ kubectl get deployments --show-labels
NAME                                                    READY   UP-TO-DATE   AVAILABLE   AGE   LABELS
gpu-operator                                            1/1     1            1           19h   app.kubernetes.io/component=gpu-operator,app.kubernetes.io/managed-by=Helm
gpu-operator-1604686334-node-feature-discovery-master   1/1     1            1           19h   app.kubernetes.io/component=master,app.kubernetes.io/instance=gpu-operator

Only reference I can find to this error:

This occurs with GPU driver versions later than 450.51.06. The version check occurs on
all DGX systems, but applies only to NVSwitch systems, so the message can be ignored
on non-NVSwitch systems such as the DGX Station or DGX-1.
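To pull the two version numbers out of the message wherever it was captured, a sed filter like this is one option. It is a sketch only: `parse_nvswitch_mismatch` is a made-up name, and the pattern is based solely on the exact wording of the message quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read log text on stdin and, for each
# "nvidia-nvswitch: Version mismatch" line, print the kernel-side and
# user-side versions as "kernel=<v> user=<v>". Other lines are dropped.
parse_nvswitch_mismatch() {
  sed -n 's/.*nvidia-nvswitch: Version mismatch, kernel version \([0-9.]*\) user version \([0-9.]*\).*/kernel=\1 user=\2/p'
}

# Usage (pipe in whichever log actually contains the message), e.g.:
#   dmesg | parse_nvswitch_mismatch
#   kubectl logs -n gpu-operator-resources nvidia-dcgm-exporter-gvjmw --previous | parse_nvswitch_mismatch
```

The message itself labels 450.80.02 as the kernel side and 450.51.06 as the user side, so the component to track down is whatever userspace piece carries 450.51.06.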

I’ll bring up fresh Ubuntu 18 and 20 VMs and retry/post results. Wondering if this is being seen elsewhere?

gus.gilliland · December 7, 2020, 8:15pm

I brought up another 18.04 VM with the same results.
I brought up a 20.04.1 VM with the same results (see images).
Both show issues with gpu-operator-resources pod/nvidia-dcgm-exporter
and throw: “nvidia-nvswitch: Version mismatch, kernel version 450.80.02 user version 450.51.06”
This install worked on Oct. 23, and has failed each time I’ve tried it since Nov. 7.
There are 9 references to DCGM in the 13 files that were changed on Oct 27 here:

This PDF: https://docs.nvidia.com/datacenter/tesla/pdf/NVIDIA_Data_Center_GPU_Driver_Release_Notes_450_v1.5.pdf
states: This release (450.80.02) supports the following APIs:
‣ NVIDIA® CUDA® 11.0 for NVIDIA® Kepler™, Maxwell™, Pascal™, Volta™, Turing™ and
NVIDIA Ampere architecture GPUs
The helm chart appears to call CUDA 10.2…?
I also tried walking back one of the chart changes with --set

 helm install --debug nvidia/gpu-operator --generate-name --set dcgmExporter.version=2.0.10-2.1.0-rc.2-ubuntu20.04 --wait

Same results.
After 30 min, gpu-operator-resources started crashing too.
kubectl describe -n gpu-operator-resources pod/nvidia-device-plugin-validation
Events:
  Type     Reason            Age                From               Message
  Warning  FailedScheduling  28m (x2 over 28m)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
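"Insufficient nvidia.com/gpu" suggests the node is no longer advertising the GPU resource. One way to check is to read the Allocatable section of `kubectl describe node`. The helper below is a sketch with a made-up name (`gpu_allocatable`), and it assumes the usual describe layout where section headers start in column 0 and resources are indented beneath them.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl describe node` output on stdin and
# print the nvidia.com/gpu count from the "Allocatable:" section.
# Prints 0 if the resource is not listed there.
gpu_allocatable() {
  awk '
    /^Allocatable:/   { in_alloc = 1; next }   # entering the section
    /^[^[:space:]]/   { in_alloc = 0 }         # any other column-0 header ends it
    in_alloc && $1 == "nvidia.com/gpu:" { print $2; found = 1 }
    END { if (!found) print 0 }
  '
}

# Usage (node name taken from the shell prompt in this thread; adjust as needed):
#   kubectl describe node ubu18vm | gpu_allocatable
```

A result of 0 would point at the device plugin not registering the GPU, rather than at the validation pod itself.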

Will continue digging.
Gus

gus.gilliland · December 11, 2020, 8:12pm

I’ve moved this thread to the correct location:
https://github.com/NVIDIA/gpu-operator/issues/115
When I went to reproduce the issue and capture error descriptions, I found the version had changed to 1.4.0.
I installed 1.4.0 with “different” results:
https://github.com/NVIDIA/gpu-operator/issues/116
