P40 with Dell 740xd: nvidia-smi Failed to initialize NVML: Unknown Error

Using the Dell vSphere installer (VMware-VMvisor-Installer-6.5.0.update01-6765664.x86_64-DellEMC_Customized-A02), the NVIDIA Grid VIB installed fine (NVIDIA-VMware_ESXi_6.5_Host_Driver_384.73-1OEM.650.0.0.4598673).

However, nvidia-smi returns:

Failed to initialize NVML: Unknown Error

Is there an incompatibility between the two versions that I have installed?

Thanks,

-Ryan

Please try the default bits from VMware. I don’t think our VIB is tested with the Dell installer. In addition, please check dmesg for any other errors that may indicate a BIOS settings problem.

Regards

Simon

Thanks Simon.

The problem with the default installer from VMware was that it did not recognize the 10Gb network ports on the server. I’ll try to find another workaround for that.

In the meantime, dmesg does show some issues:

2017-11-04T20:02:39.735Z cpu30:67099)Starting service nvidia-init
2017-11-04T20:02:39.736Z cpu30:67099)Activating Jumpstart plugin nvidia-init.
2017-11-04T20:02:39.751Z cpu0:68125)ALERT: NVIDIA: module load failed during VIB install/upgrade.
2017-11-04T20:02:39.756Z cpu4:68126)NVIDIA: Starting vGPU Services.
2017-11-04T20:02:39.766Z cpu37:68129)NVIDIA: Starting Xorg service.
2017-11-04T20:02:40.872Z cpu12:68209)ALERT: NVIDIA: Xorg service start failed.
2017-11-04T20:02:40.876Z cpu34:68210)NVIDIA: Starting the DCGM node engine.
2017-11-04T20:02:41.959Z cpu26:67491)Config: 706: "VMOverheadGrowthLimit" = 4294967295, Old Value: -1, (Status: 0x0)
2017-11-04T20:02:42.961Z cpu20:67099)Jumpstart plugin nvidia-init activated.
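
For anyone sifting through a longer log, here is a minimal sketch of pulling out just the NVIDIA ALERT lines. It assumes the "timestamp cpuN:world)message" layout seen in the excerpt above; the function name nvidia_alerts and the regular expression are illustrative, not part of any NVIDIA or VMware tool.

```python
import re

# Assumed line layout, taken from the excerpt above:
#   <timestamp> cpu<N>:<world id>)<message>
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+cpu(?P<cpu>\d+):(?P<world>\d+)\)(?P<msg>.*)$")


def nvidia_alerts(lines):
    """Return (timestamp, message) pairs for ALERT lines that mention NVIDIA."""
    alerts = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        msg = match.group("msg")
        if msg.startswith("ALERT:") and "NVIDIA" in msg:
            alerts.append((match.group("ts"), msg[len("ALERT:"):].strip()))
    return alerts


# The dmesg lines quoted in this thread.
SAMPLE = """\
2017-11-04T20:02:39.735Z cpu30:67099)Starting service nvidia-init
2017-11-04T20:02:39.736Z cpu30:67099)Activating Jumpstart plugin nvidia-init.
2017-11-04T20:02:39.751Z cpu0:68125)ALERT: NVIDIA: module load failed during VIB install/upgrade.
2017-11-04T20:02:39.756Z cpu4:68126)NVIDIA: Starting vGPU Services.
2017-11-04T20:02:39.766Z cpu37:68129)NVIDIA: Starting Xorg service.
2017-11-04T20:02:40.872Z cpu12:68209)ALERT: NVIDIA: Xorg service start failed.
2017-11-04T20:02:40.876Z cpu34:68210)NVIDIA: Starting the DCGM node engine.
2017-11-04T20:02:42.961Z cpu20:67099)Jumpstart plugin nvidia-init activated.
"""

if __name__ == "__main__":
    for ts, msg in nvidia_alerts(SAMPLE.splitlines()):
        print(ts, msg)
```

On the excerpt above this leaves two entries: the module load failure and the Xorg start failure, in that order, which suggests the Xorg failure is a consequence of the module not loading.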

I will look into these issues.

Thanks,

-Ryan

According to this doc: VMware Knowledge Base, the Module Name needs to be "nvidia", but I show it as "None", which might explain why Xorg will not start.

I’m also not sure if the fact that

>esxcli hardware pci list -c 0x0300 -m 0xf

returns the embedded VGA controller as well as the NVIDIA controller is an issue or not…
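
A rough sketch of checking the "Module Name" field programmatically, in case it helps someone with several hosts. It assumes the esxcli output is a list of blocks, each starting with an unindented PCI address followed by indented "Key: Value" lines, and that a "Vendor Name" field is present. The sample text below is made up for illustration (placeholder vendor and device names), not captured from my server, and the helper names are hypothetical.

```python
def parse_pci_list(text):
    """Parse esxcli-style output into a list of dicts, one per device.

    Assumed layout: an unindented address line starts a device block,
    followed by indented "Key: Value" lines.
    """
    devices = []
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            continue  # blank separator between blocks
        if not raw[0].isspace():
            current = {"Address": raw.strip()}
            devices.append(current)
        elif current is not None and ":" in raw:
            # Split on the first colon only, since values may contain colons.
            key, _, value = raw.strip().partition(":")
            current[key.strip()] = value.strip()
    return devices


def nvidia_without_module(devices):
    """Return addresses of NVIDIA devices whose Module Name is not 'nvidia'."""
    return [
        d["Address"]
        for d in devices
        if "nvidia" in d.get("Vendor Name", "").lower()
        and d.get("Module Name", "None") != "nvidia"
    ]


# Hypothetical sample: addresses and names are placeholders, not real output.
SAMPLE = """\
0000:03:00.0
   Address: 0000:03:00.0
   Vendor Name: Example Embedded VGA Vendor
   Device Name: Example embedded VGA controller
   Module Name: None

0000:3b:00.0
   Address: 0000:3b:00.0
   Vendor Name: NVIDIA Corporation
   Device Name: Example NVIDIA GPU
   Module Name: None
"""

if __name__ == "__main__":
    for address in nvidia_without_module(parse_pci_list(SAMPLE)):
        print("NVIDIA device without the nvidia module:", address)
```

The embedded VGA controller is ignored by the check because only devices whose vendor name contains "NVIDIA" are considered.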

Hi Ryan,

Did you get any further? I have the same issue…

Paul

Found it.
Just if someone else has the same problem:
http://topics-cdn.dell.com/pdf/vmware-esxi-6.5.x_release%20notes_en-us.pdf
page 8

Description:
When the system BIOS has "Memory Mapped I/O Base" set to 56 TB and the server has GPU cards such as the NVIDIA M60 configured as PCIe pass-through devices, virtual machines fail to power on.

Applies to:
ESXi 6.5.x and Dell EMC’s 14th generation PowerEdge servers

Solution:
To resolve this, set the MMIO to 12 TB. To set MMIO, in System BIOS Settings > Integrated Devices, you have to set "Memory Mapped I/O Base" to 12 TB.
For more information, refer to VMware Knowledge Base article 2142307