We are testing GPUs on Kubernetes running on vSphere for AI workloads.
We have NVIDIA A40 GPUs with Display enabled in vSphere 8.0u1c with TKGS.
Providing the vGPU to the TKC cluster seems to work.
Problem: my worker nodes (VMs) fail to load the NVIDIA vGPU kernel module.
VC: 8.0 u1c
TKG: v1.25.7---vmware.3-fips.1-tkg.1 / ubuntu
NVAIE: 535.104.12
GPU operator: v23.6.1, also tested with v23.3.2
GPU: NVIDIA A40 with Display enabled
VM vGPU profile: nvidia_a40-48q
[adm@esxi03-dc7:~] nvidia-smi
Fri Oct 20 15:28:33 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.06             Driver Version: 535.104.06   CUDA Version: N/A      |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                     On  | 00000000:B1:00.0 Off |                  Off |
|  0%   25C    P8              30W / 300W |  48512MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A40                     On  | 00000000:CA:00.0 Off |                  Off |
|  0%   26C    P8              32W / 300W |  48512MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   2106384    C+G   …-a40-primary-r7np9-7cc99b745f-f6snh       48512MiB |
|    1   N/A  N/A   2106381    C+G   …-a40-primary-r7np9-7cc99b745f-pzh59       48512MiB |
+---------------------------------------------------------------------------------------+
[adm@esxi03-dc7:~] nvidia-smi -q |more
==============NVSMI LOG==============
Timestamp : Fri Oct 20 15:28:36 2023
Driver Version : 535.104.06
CUDA Version : Not Found
vGPU Driver Capability
    Heterogenous Multi-vGPU : Supported
Attached GPUs : 2
GPU 00000000:B1:00.0
    Product Name : NVIDIA A40
    Product Brand : NVIDIA
    Product Architecture : Ampere
    Display Mode : Enabled
    Display Active : Disabled
GPU Operator driver pod logs:
k logs -f nvidia-driver-daemonset-t8swg -n gpu-operator
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-535.104.12
Verifying archive integrity… OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.104.12…
Kernel module compilation complete.
Kernel module load error: No such device
Kernel messages:
[154358.907198] nvidia-nvlink: Unregistered Nvlink Core, major device number 240
[154773.303613] nvidia-nvlink: Nvlink Core is being initialized, major device number 240
[154773.305991] nvidia 0000:02:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[154773.306416] NVRM: The NVIDIA GPU 0000:02:00.0 (PCI ID: 10de:2235)
NVRM: installed in this system is not supported by the
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
NVRM: NVIDIA 535.104.12 driver release.
NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
NVRM: in this release's README, available on the operating system
NVRM: specific graphics driver download page at www.nvidia.com.
[154773.306621] nvidia: probe of 0000:02:00.0 failed with error -1
[154773.306640] NVRM: The NVIDIA probe routine failed for 1 device(s).
[154773.306640] NVRM: None of the NVIDIA devices were initialized.
root@gpu-tkc01-nodepool-a40-primary-lfsnn-7c89bf7f9-7mjgk:~# lspci -s 0000:02:00.0 -v -xxx
02:00.0 VGA compatible controller: NVIDIA Corporation Device 2235 (rev a1) (prog-if 00 [VGA controller])
        DeviceName: pciPassthru0
        Subsystem: NVIDIA Corporation Device 14e0
        Physical Slot: 32
        Flags: fast Back2Back, 66MHz, user-definable features, ?? devsel
        Memory at fa000000 (32-bit, non-prefetchable) [size=256K]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at f8000000 (64-bit, non-prefetchable) [size=32M]
        Capabilities: [d0] Vendor Specific Information: Len=1b <?>
        Capabilities: [7c] MSI-X: Enable- Count=3 Masked-
        Kernel modules: nvidiafb
00: de 10 35 22 02 03 ff 06 a1 00 00 03 00 00 00 00
10: 00 00 00 fa 0c 00 00 d0 00 00 00 00 04 00 00 f8
20: 00 00 00 00 00 00 00 00 00 00 00 00 de 10 e0 14
30: 00 00 00 00 d0 00 00 00 00 00 00 00 ff 00 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 01 00 00 00 01 00 00 00 ce d6 23 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 05 00 80 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 11 00 02 00
80: 00 00 01 00 00 00 02 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 09 7c 1b 56 46 00 16 35 33 35 2e 31 30 34 2e 30
e0: 36 72 35 33 37 5f 31 33 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Persistence Mode : Enabled
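
For anyone cross-checking the lspci -xxx dump above, here is a minimal Python sketch that decodes the bytes by hand. It reads the vendor and device IDs from offset 0x00 (little-endian, as PCI config space stores them) and pulls printable ASCII out of the vendor-specific capability at offset 0xd0. The helper names (to_bytes, pci_ids, printable_runs) are made up for illustration, and the meaning of the embedded string is not documented anywhere I know of. I am only noting that it happens to contain the same 535.104.06 version that nvidia-smi reports on the ESXi host.

```python
# Bytes copied verbatim from the lspci -xxx output above.
HEADER = "de 10 35 22 02 03 ff 06 a1 00 00 03 00 00 00 00"        # offset 0x00
CAP_D0 = ("09 7c 1b 56 46 00 16 35 33 35 2e 31 30 34 2e 30 "      # offset 0xd0
          "36 72 35 33 37 5f 31 33 00 00 00 00 00 00 00 00")      # offset 0xe0


def to_bytes(dump):
    """Turn a space-separated hex dump line into raw bytes."""
    return bytes.fromhex("".join(dump.split()))


def pci_ids(header):
    """Vendor ID is bytes 0-1, device ID is bytes 2-3, both little-endian."""
    vendor = int.from_bytes(header[0:2], "little")
    device = int.from_bytes(header[2:4], "little")
    return vendor, device


def printable_runs(data, min_len=4):
    """Return runs of printable ASCII at least min_len characters long."""
    runs, current = [], []
    for b in data:
        if 0x20 <= b <= 0x7E:
            current.append(chr(b))
        else:
            if len(current) >= min_len:
                runs.append("".join(current))
            current = []
    if len(current) >= min_len:
        runs.append("".join(current))
    return runs


vendor, device = pci_ids(to_bytes(HEADER))
print(f"PCI ID: {vendor:04x}:{device:04x}")

cap = to_bytes(CAP_D0)
cap_id, next_ptr, cap_len = cap[0], cap[1], cap[2]
print(f"capability id=0x{cap_id:02x} next=0x{next_ptr:02x} len=0x{cap_len:02x}")

# Only look inside the capability itself (cap_len bytes starting at 0xd0).
print(printable_runs(cap[:cap_len]))
```

The decoded ID is 10de:2235, the same one the NVRM message rejects, and the capability header (id 0x09, next 0x7c, len 0x1b) agrees with what lspci prints for [d0] and [7c]. The string in the capability is 535.104.06r537_13, while the driver the GPU operator tries to load in the guest is 535.104.12.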