GPU Operator deployment fails with nvidia-driver-daemonset pod crashing

We are testing GPUs on Kubernetes running on vSphere for AI workloads.
We have NVIDIA A40 GPUs with Display enabled in vSphere 8.0u1c with TKGS.
Presenting the vGPU to the TKC cluster appears to work (a quick check from inside a worker node is shown right after the version list below).
Problem: my worker nodes (VMs) fail to load the NVIDIA vGPU kernel module.

VC: 8.0 u1c
TKG: v1.25.7+vmware.3-fips.1-tkg.1 / ubuntu
NVAIE: 535.104.12
GPU Operator: v23.6.1, also tested with v23.3.2
GPU: NVIDIA A40 with Display enabled
VM vGPU profile: nvidia_a40-48q
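
A quick sanity check from inside a worker node shows the vGPU device is presented to the guest (lspci works even before any NVIDIA driver is installed):

lspci -nn | grep -i nvidia
# should list the A40 vGPU device, PCI ID 10de:2235 in our case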

[adm@esxi03-dc7:~] nvidia-smi
Fri Oct 20 15:28:33 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.06 Driver Version: 535.104.06 CUDA Version: N/A |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A40 On | 00000000:B1:00.0 Off | Off |
| 0% 25C P8 30W / 300W | 48512MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A40 On | 00000000:CA:00.0 Off | Off |
| 0% 26C P8 32W / 300W | 48512MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2106384 C+G …-a40-primary-r7np9-7cc99b745f-f6snh 48512MiB |
| 1 N/A N/A 2106381 C+G …-a40-primary-r7np9-7cc99b745f-pzh59 48512MiB |
+---------------------------------------------------------------------------------------+

[adm@esxi03-dc7:~] nvidia-smi -q |more

==============NVSMI LOG==============

Timestamp : Fri Oct 20 15:28:36 2023
Driver Version : 535.104.06
CUDA Version : Not Found
vGPU Driver Capability
Heterogenous Multi-vGPU : Supported

Attached GPUs : 2
GPU 00000000:B1:00.0
Product Name : NVIDIA A40
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled

k logs -f nvidia-driver-daemonset-t8swg -n gpu-operator
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-535.104.12
Verifying archive integrity… OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.104.12…

Kernel module compilation complete.
Kernel module load error: No such device
Kernel messages:
[154358.907198] nvidia-nvlink: Unregistered Nvlink Core, major device number 240
[154773.303613] nvidia-nvlink: Nvlink Core is being initialized, major device number 240
[154773.305991] nvidia 0000:02:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[154773.306416] NVRM: The NVIDIA GPU 0000:02:00.0 (PCI ID: 10de:2235)
                NVRM: installed in this system is not supported by the
                NVRM: NVIDIA 535.104.12 driver release.
                NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
                NVRM: in this release's README, available on the operating system
                NVRM: specific graphics driver download page at www.nvidia.com.
[154773.306621] nvidia: probe of 0000:02:00.0 failed with error -1
[154773.306640] NVRM: The NVIDIA probe routine failed for 1 device(s).
[154773.306640] NVRM: None of the NVIDIA devices were initialized.

ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

root@gpu-tkc01-nodepool-a40-primary-lfsnn-7c89bf7f9-7mjgk:~# lspci -s 0000:02:00.0 -v -xxx
02:00.0 VGA compatible controller: NVIDIA Corporation Device 2235 (rev a1) (prog-if 00 [VGA controller])
DeviceName: pciPassthru0
Subsystem: NVIDIA Corporation Device 14e0
Physical Slot: 32
Flags: fast Back2Back, 66MHz, user-definable features, ?? devsel
Memory at fa000000 (32-bit, non-prefetchable) [size=256K]
Memory at d0000000 (64-bit, prefetchable) [size=256M]
Memory at f8000000 (64-bit, non-prefetchable) [size=32M]
Capabilities: [d0] Vendor Specific Information: Len=1b <?>
Capabilities: [7c] MSI-X: Enable- Count=3 Masked-
Kernel modules: nvidiafb
00: de 10 35 22 02 03 ff 06 a1 00 00 03 00 00 00 00
10: 00 00 00 fa 0c 00 00 d0 00 00 00 00 04 00 00 f8
20: 00 00 00 00 00 00 00 00 00 00 00 00 de 10 e0 14
30: 00 00 00 00 d0 00 00 00 00 00 00 00 ff 00 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 01 00 00 00 01 00 00 00 ce d6 23 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 05 00 80 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 11 00 02 00
80: 00 00 01 00 00 00 02 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 09 7c 1b 56 46 00 16 35 33 35 2e 31 30 34 2e 30
e0: 36 72 35 33 37 5f 31 33 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Persistence Mode                      : Enabled

Looks like a vGPU setup, so you need to install the GRID (vGPU guest) driver in the VM, not the regular data center driver.

It is a vGPU setup…
The driver is auto-selected and installed by the gpu-operator…
I can only control the version.

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-vgpu.html#about-installing-the-operator-and-nvidia-vgpu
Did you set the correct version with "-grid" appended?
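
One quick way to see what the operator is currently configured with is to read the driver spec off the ClusterPolicy resource (a sketch; the chart creates a ClusterPolicy named cluster-policy, and the field names below assume its standard layout):

kubectl get clusterpolicy cluster-policy -o jsonpath='{.spec.driver.repository}/{.spec.driver.image}:{.spec.driver.version}{"\n"}'
# the tag of the running driver container additionally carries an OS suffix, e.g. -ubuntu20.04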

Going to reinstall the operator with the following parameters:

helm install --wait gpu-operator ./gpu-operator-v23.6.1.tgz -n gpu-operator \
  --set driver.imagePullPolicy=Always --set driver.version=535.104.12-grid \
  --set node-feature-discovery.worker.serviceAccount.name=node-feature-discovery
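
Then I'll watch the driver pod roll out with something like this (assuming the usual app=nvidia-driver-daemonset label on the driver pods; the pod name will change after the reinstall):

kubectl -n gpu-operator get pods -l app=nvidia-driver-daemonset -w
kubectl -n gpu-operator describe pod <driver-pod-name>    # the Events section shows image pulls and failures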

I'll let you know later…

In fact, none of our documentation mentions the -grid option…

No such driver found in the NVIDIA image repo:

Normal Pulling 8s (x4 over 101s) kubelet Pulling image "nvcr.io/nvidia/driver:535.104.12-grid-ubuntu20.04"
Warning Failed 7s (x4 over 99s) kubelet Failed to pull image "nvcr.io/nvidia/driver:535.104.12-grid-ubuntu20.04": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/driver:535.104.12-grid-ubuntu20.04": failed to resolve reference "nvcr.io/nvidia/driver:535.104.12-grid-ubuntu20.04": nvcr.io/nvidia/driver:535.104.12-grid-ubuntu20.04: not found
Warning Failed 7s (x4 over 99s) kubelet Error: ErrImagePull

The GRID driver is not available in public repos; it can only be downloaded from the NVIDIA Licensing Portal (vGPU customer portal), so you have to build and host the driver image yourself (a rough sketch follows the quoted excerpt below).
From the link:

  • Specify the Linux guest vGPU driver version that you downloaded from the NVIDIA Licensing Portal and append -grid:

export VGPU_DRIVER_VERSION=525.60.13-grid

The Operator automatically selects the compatible guest driver version from the drivers bundled with the driver image. If you disable the version check by specifying --build-arg DISABLE_VGPU_VERSION_CHECK=true when you build the driver image, then the VGPU_DRIVER_VERSION value is used as default.
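
For completeness, the flow described on that page is roughly the following; registry and file names are placeholders, and the exact set of build args (CUDA version etc.) depends on the operator release, so treat it as a sketch and follow the linked page for the authoritative steps:

# 1. Download the Linux guest vGPU (GRID) driver .run file from the NVIDIA Licensing Portal.
# 2. Build a private driver image that contains it:
git clone https://gitlab.com/nvidia/container-images/driver
cd driver/ubuntu20.04
cp <download-dir>/NVIDIA-Linux-x86_64-*-grid.run drivers/

export PRIVATE_REGISTRY=registry.example.com/nvidia   # placeholder
export VGPU_DRIVER_VERSION=535.104.12-grid
export OS_TAG=ubuntu20.04

docker build \
  --build-arg DRIVER_TYPE=vgpu \
  --build-arg DRIVER_VERSION=${VGPU_DRIVER_VERSION} \
  -t ${PRIVATE_REGISTRY}/driver:${VGPU_DRIVER_VERSION}-${OS_TAG} .
docker push ${PRIVATE_REGISTRY}/driver:${VGPU_DRIVER_VERSION}-${OS_TAG}

# 3. Point the operator at the private image instead of nvcr.io/nvidia/driver:
helm install --wait gpu-operator ./gpu-operator-v23.6.1.tgz -n gpu-operator \
  --set driver.repository=${PRIVATE_REGISTRY} \
  --set driver.version=${VGPU_DRIVER_VERSION}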