Dell R740 with Tesla M10: nvidia-smi Failed to initialize NVML: Unknown Error

Server: Dell R740 - 384GB RAM, Tesla M10
OS: ESXi 6.5U3
nVidia Driver: 460.73.02
GPU: nVidia Tesla M10

In the BIOS,
Memory Mapped I/O Above 4GB: Enabled
Memory Mapped I/O Base: 512GB
https://www.dell.com/support/kbdoc/en-ca/000144038/dell-poweredge-14g-esxi-returns-failed-to-initialize-nvml-unknown-error-with-nvidia-gpu

# esxcli system maintenanceMode set --enable true
# esxcli software vib install -d /path-to-zip/NVIDIA-bootbank-offline-bundle.zip
# esxcli system maintenanceMode set --enable false
# reboot
# nvidia-smi
Failed to initialize NVML: Unknown Error

https://kb.vmware.com/s/article/2064775
# esxcli hardware pci list -c 0x0300 -m 0xf
0000:40:00.0
Address: 0000:40:00.0
Segment: 0x0000
Bus: 0x40
Slot: 0x00
Function: 0x0
VMkernel Name: vmgfx3
Vendor Name: NVIDIA Corporation
Device Name: NVIDIATesla M10
Configured Owner: Unknown
Current Owner: VMkernel
Vendor ID: 0x10de
Device ID: 0x13bd
SubVendor ID: 0x10de
SubDevice ID: 0x1160
Device Class: 0x0300
Device Class Name: VGA compatible controller
Programming Interface: 0x00
Revision ID: 0xa2
Interrupt Line: 0xff
IRQ: 255
Interrupt Vector: 0x00
PCI Pin: 0x00
Spawned Bus: 0x00
Flags: 0x3201
Module ID: -1
Module Name: None
Chassis: 0
Physical Slot: 4294967295
Slot Description: PCIe Slot 1; relative bdf 04:00.0
Passthru Capable: true
Parent Device: PCI 0:60:17:0
Dependent Device: PCI 0:64:0:0
Reset Method: Bridge reset
FPT Sharable: true

As shown above, Module Name: None is not correct. With the driver loaded it should be Module Name: nvidia
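To check the Module Name field without reading the whole listing, the output can be piped through a small filter. This is only a minimal sketch: nvidia_module_names is a hypothetical helper name, and it assumes the field layout shown in the listing above (Address, then Vendor ID, then Module Name) and that awk is available in the ESXi shell.

```shell
# Hypothetical helper: reads `esxcli hardware pci list` output on stdin and
# prints "<address> <module name>" for every NVIDIA device (Vendor ID 0x10de).
nvidia_module_names() {
    awk -F': ' '
        { sub(/^[ \t]+/, "") }                  # drop leading indentation, if any
        $1 == "Address"     { addr = $2; vendor = "" }   # a new device starts here
        $1 == "Vendor ID"   { vendor = $2 }
        $1 == "Module Name" { if (vendor == "0x10de") print addr, $2 }
    '
}
```

It would be used as: esxcli hardware pci list -c 0x0300 -m 0xf | nvidia_module_names. For the listing above it prints 0000:40:00.0 None.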

  1. Does anyone know what could be causing the Failed to initialize NVML: Unknown Error?

  2. On my ESXi server, does xorg need to be started? I get this error whether or not xorg is running.

dmesg

2021-06-17T09:03:16.173Z cpu61:66459)MemSched: 14642: uw.66459 (1649) extraMin/extraFromParent: 9688/9688, vmkdevmgr (791) childEmin/eMinLimit: 1255/10752
2021-06-17T09:03:16.174Z cpu61:66459)MemSched: 14635: Admission failure in path: vmkdevmgr/vmkdevmgr.66459/uw.66459
2021-06-17T09:03:16.174Z cpu61:66459)MemSched: 14642: uw.66459 (1649) extraMin/extraFromParent: 9688/9688, vmkdevmgr (791) childEmin/eMinLimit: 1255/10752
2021-06-17T09:03:16.240Z cpu47:71670)ALERT: NVIDIA: module load failed during VIB install/upgrade.
2021-06-17T09:03:16.268Z cpu45:71671)NVIDIA: Starting vGPU Services.
2021-06-17T09:03:16.301Z cpu60:71674)NVIDIA: Starting Xorg service.

As shown in the dmesg output (the ALERT line), the nVidia module failed to load.
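To pull just the relevant lines out of a long dmesg, a simple filter is enough. This is a sketch, not a definitive diagnostic: nvidia_load_errors is a hypothetical helper name, and the two patterns are taken only from the log lines quoted above, so other failure messages would not be caught.

```shell
# Hypothetical helper: reads dmesg output on stdin and keeps only the
# NVIDIA ALERT lines and the MemSched admission failures seen in this log.
nvidia_load_errors() {
    grep -E 'ALERT: NVIDIA:|MemSched: .*Admission failure'
}
```

It would be used as: dmesg | nvidia_load_errors. Whether the admission failure and the module load failure are related is not something the log itself confirms; the filter only shows them side by side.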

Ok. I solved it.

The ESXi 6.5U3 image I was using is the latest from Dell, but it is still from Dec 2019. I downloaded the latest patch updates (Feb 2021) for ESXi 6.5U3 from VMware and patched the server. Now the nVidia drivers work properly.

From a quick Google search, one of the patches appears to fix the maximum size allowed for VIBs.
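To confirm which build the host is on before and after patching, the build number can be read from esxcli system version get. This is a minimal sketch: esxi_build_number is a hypothetical helper name, and it assumes the command prints a line of the form "Build: Releasebuild-<number>".

```shell
# Hypothetical helper: reads `esxcli system version get` output on stdin and
# prints the numeric build, assuming a "Build: Releasebuild-<number>" line.
esxi_build_number() {
    sed -n 's/^[[:space:]]*Build:[[:space:]]*Releasebuild-\([0-9][0-9]*\).*$/\1/p'
}
```

It would be used as: esxcli system version get | esxi_build_number. The number can then be compared against the build listed in the VMware release notes for the patch that was applied.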