GPU Operator deployment fails with nvidia-driver-daemonset pod crashing

We are testing GPUs on Kubernetes running on vSphere for AI workloads.
We have NVIDIA A40 GPUs with Display enabled in vSphere 8.0u1c with TKGS.
Presenting the vGPU to the TKC cluster appears to work (a quick check from inside a worker node is shown right after the version list below).
Problem: my worker nodes (VMs) fail to load the NVIDIA vGPU kernel module.

VC: 8.0 u1c
TKG: v1.25.7+vmware.3-fips.1-tkg.1 / ubuntu
NVAIE: 535.104.12
GPU Operator: v23.6.1, also tested with v23.3.2
GPU: NVIDIA A40 with Display enabled
VM vGPU profile: nvidia_a40-48q
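
A quick sanity check from inside a worker node shows the vGPU device is presented to the guest (lspci works even before any NVIDIA driver is installed):

lspci -nn | grep -i nvidia
# should list the A40 vGPU device, PCI ID 10de:2235 in our case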

[adm@esxi03-dc7:~] nvidia-smi
Fri Oct 20 15:28:33 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.06 Driver Version: 535.104.06 CUDA Version: N/A |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A40 On | 00000000:B1:00.0 Off | Off |
| 0% 25C P8 30W / 300W | 48512MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A40 On | 00000000:CA:00.0 Off | Off |
| 0% 26C P8 32W / 300W | 48512MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2106384 C+G …-a40-primary-r7np9-7cc99b745f-f6snh 48512MiB |
| 1 N/A N/A 2106381 C+G …-a40-primary-r7np9-7cc99b745f-pzh59 48512MiB |
+---------------------------------------------------------------------------------------+

[adm@esxi03-dc7:~] nvidia-smi -q |more

==============NVSMI LOG==============

Timestamp : Fri Oct 20 15:28:36 2023
Driver Version : 535.104.06
CUDA Version : Not Found
vGPU Driver Capability
Heterogenous Multi-vGPU : Supported

Attached GPUs : 2
GPU 00000000:B1:00.0
Product Name : NVIDIA A40
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled

k logs -f nvidia-driver-daemonset-t8swg -n gpu-operator
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-535.104.12
Verifying archive integrity… OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.104.12…

Kernel module compilation complete.
Kernel module load error: No such device
Kernel messages:
[154358.907198] nvidia-nvlink: Unregistered Nvlink Core, major device number 240
[154773.303613] nvidia-nvlink: Nvlink Core is being initialized, major device number 240
[154773.305991] nvidia 0000:02:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[154773.306416] NVRM: The NVIDIA GPU 0000:02:00.0 (PCI ID: 10de:2235)
                NVRM: installed in this system is not supported by the
                NVRM: NVIDIA 535.104.12 driver release.
                NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
                NVRM: in this release's README, available on the operating system
                NVRM: specific graphics driver download page at www.nvidia.com.
[154773.306621] nvidia: probe of 0000:02:00.0 failed with error -1
[154773.306640] NVRM: The NVIDIA probe routine failed for 1 device(s).
[154773.306640] NVRM: None of the NVIDIA devices were initialized.

ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

root@gpu-tkc01-nodepool-a40-primary-lfsnn-7c89bf7f9-7mjgk:~# lspci -s 0000:02:00.0 -v -xxx
02:00.0 VGA compatible controller: NVIDIA Corporation Device 2235 (rev a1) (prog-if 00 [VGA controller])
DeviceName: pciPassthru0
Subsystem: NVIDIA Corporation Device 14e0
Physical Slot: 32
Flags: fast Back2Back, 66MHz, user-definable features, ?? devsel
Memory at fa000000 (32-bit, non-prefetchable) [size=256K]
Memory at d0000000 (64-bit, prefetchable) [size=256M]
Memory at f8000000 (64-bit, non-prefetchable) [size=32M]
Capabilities: [d0] Vendor Specific Information: Len=1b <?>
Capabilities: [7c] MSI-X: Enable- Count=3 Masked-
Kernel modules: nvidiafb
00: de 10 35 22 02 03 ff 06 a1 00 00 03 00 00 00 00
10: 00 00 00 fa 0c 00 00 d0 00 00 00 00 04 00 00 f8
20: 00 00 00 00 00 00 00 00 00 00 00 00 de 10 e0 14
30: 00 00 00 00 d0 00 00 00 00 00 00 00 ff 00 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 01 00 00 00 01 00 00 00 ce d6 23 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 05 00 80 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 11 00 02 00
80: 00 00 01 00 00 00 02 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 09 7c 1b 56 46 00 16 35 33 35 2e 31 30 34 2e 30
e0: 36 72 35 33 37 5f 31 33 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Persistence Mode                      : Enabled

Looks like a vGPU setup, so you need to install the GRID (vGPU guest) driver in the VM, not the regular data center driver.

It is a vGPU setup…
The driver is auto-selected and installed by the gpu-operator…
I can only control the version.

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-vgpu.html#about-installing-the-operator-and-nvidia-vgpu
Did you set the correct version with "-grid" appended?
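
One quick way to see what the operator is currently configured with is to read the driver spec off the ClusterPolicy resource (a sketch; the chart creates a ClusterPolicy named cluster-policy, and the field names below assume its standard layout):

kubectl get clusterpolicy cluster-policy -o jsonpath='{.spec.driver.repository}/{.spec.driver.image}:{.spec.driver.version}{"\n"}'
# the tag of the running driver container additionally carries an OS suffix, e.g. -ubuntu20.04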

Going to reinstall the operator with the following parameters:

helm install --wait gpu-operator ./gpu-operator-v23.6.1.tgz -n gpu-operator \
  --set driver.imagePullPolicy=Always --set driver.version=535.104.12-grid \
  --set node-feature-discovery.worker.serviceAccount.name=node-feature-discovery
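
Then I'll watch the driver pod roll out with something like this (assuming the usual app=nvidia-driver-daemonset label on the driver pods; the pod name will change after the reinstall):

kubectl -n gpu-operator get pods -l app=nvidia-driver-daemonset -w
kubectl -n gpu-operator describe pod <driver-pod-name>    # the Events section shows image pulls and failures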

I'll let you know later…

In fact, none of our documentation mentions the -grid option…

No such driver found in the NVIDIA image repo:

Normal Pulling 8s (x4 over 101s) kubelet Pulling image "nvcr.io/nvidia/driver:535.104.12-grid-ubuntu20.04"
Warning Failed 7s (x4 over 99s) kubelet Failed to pull image "nvcr.io/nvidia/driver:535.104.12-grid-ubuntu20.04": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/driver:535.104.12-grid-ubuntu20.04": failed to resolve reference "nvcr.io/nvidia/driver:535.104.12-grid-ubuntu20.04": nvcr.io/nvidia/driver:535.104.12-grid-ubuntu20.04: not found
Warning Failed 7s (x4 over 99s) kubelet Error: ErrImagePull

The GRID driver is not available in public repos; it can only be downloaded from the NVIDIA Licensing Portal (vGPU customer portal), so you have to build and host the driver image yourself (a rough sketch follows the quoted excerpt below).
From the link:

  • Specify the Linux guest vGPU driver version that you downloaded from the NVIDIA Licensing Portal and append -grid:

export VGPU_DRIVER_VERSION=525.60.13-grid

The Operator automatically selects the compatible guest driver version from the drivers bundled with the driver image. If you disable the version check by specifying --build-arg DISABLE_VGPU_VERSION_CHECK=true when you build the driver image, then the VGPU_DRIVER_VERSION value is used as default.
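
For completeness, the flow described on that page is roughly the following; registry and file names are placeholders, and the exact set of build args (CUDA version etc.) depends on the operator release, so treat it as a sketch and follow the linked page for the authoritative steps:

# 1. Download the Linux guest vGPU (GRID) driver .run file from the NVIDIA Licensing Portal.
# 2. Build a private driver image that contains it:
git clone https://gitlab.com/nvidia/container-images/driver
cd driver/ubuntu20.04
cp <download-dir>/NVIDIA-Linux-x86_64-*-grid.run drivers/

export PRIVATE_REGISTRY=registry.example.com/nvidia   # placeholder
export VGPU_DRIVER_VERSION=535.104.12-grid
export OS_TAG=ubuntu20.04

docker build \
  --build-arg DRIVER_TYPE=vgpu \
  --build-arg DRIVER_VERSION=${VGPU_DRIVER_VERSION} \
  -t ${PRIVATE_REGISTRY}/driver:${VGPU_DRIVER_VERSION}-${OS_TAG} .
docker push ${PRIVATE_REGISTRY}/driver:${VGPU_DRIVER_VERSION}-${OS_TAG}

# 3. Point the operator at the private image instead of nvcr.io/nvidia/driver:
helm install --wait gpu-operator ./gpu-operator-v23.6.1.tgz -n gpu-operator \
  --set driver.repository=${PRIVATE_REGISTRY} \
  --set driver.version=${VGPU_DRIVER_VERSION}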