Tesla T4 + vSphere 7.0 U1c = nvidia-smi can't communicate

Hypervisor is vSphere 7.0 U1c.
The T4 is not in passthrough mode.
I followed the instructions here and installed the vGPU 11.2 driver.

Any ideas on what else I can check?

nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

dmesg | grep NVIDIA
2020-12-22T07:54:30.378Z cpu24:2102461)ALERT: NVIDIA: module load failed during VIB install/upgrade.
2020-12-22T07:54:30.390Z cpu25:2102464)NVIDIA: Starting vGPU Services.
2020-12-22T07:54:30.405Z cpu41:2102467)NVIDIA: Starting Xorg service.
2020-12-22T07:54:33.096Z cpu45:2104601)NVIDIA: Starting the DCGM node engine.
2020-12-22T08:36:17.100Z cpu42:2112615)NVIDIA: Stopping the DCGM node engine.
2020-12-22T08:36:17.280Z cpu4:2112625)NVIDIA: Unloading nvidia module during vib remove.
2020-12-22T08:42:27.473Z cpu30:2113597)NVIDIA: Unloading nvidia module during vib install/upgrade.
2020-12-22T08:42:28.106Z cpu37:2113606)ALERT: NVIDIA: module load failed during VIB install/upgrade.
2020-12-22T08:42:28.123Z cpu1:2113607)NVIDIA: Starting vGPU Services.
2020-12-22T08:42:28.140Z cpu36:2113610)NVIDIA: Starting Xorg service.
2020-12-22T08:42:31.344Z cpu8:2115707)NVIDIA: Starting the DCGM node engine.
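A few ESXi shell commands can confirm whether the nvidia vmkernel module actually loaded. This is a rough sketch assuming standard ESXi 7.x tooling; the module name `nvidia` and the vmkernel log path are assumptions:

```shell
# Check whether the nvidia vmkernel module is registered and enabled
esxcli system module list | grep -i nvidia

# List currently loaded vmkernel modules
vmkload_mod -l | grep -i nvidia

# Look for NVRM errors that explain the load failure (log path assumed)
grep -i NVRM /var/log/vmkernel.log | tail -n 20

# Try loading the module by hand to surface the error message
esxcli system module load --module=nvidia
```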

Hi

Have you configured the BIOS correctly? And what happens if you uninstall / reinstall the driver?

Regards

MG

The BIOS is configured correctly to the best of my knowledge: SR-IOV and Above 4G Decoding are enabled.
After uninstalling and reinstalling the driver, I still get the same error.

I even tried installing an “older” driver and got the same error message.
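For reference, a clean uninstall/reinstall on ESXi usually follows roughly this sequence (the VIB name matches the installed host driver; the datastore path is a placeholder):

```shell
# Enter maintenance mode before changing the driver
esxcli system maintenanceMode set --enable true

# Remove the existing host driver VIB
esxcli software vib remove -n NVIDIA-VMware_ESXi_7.0_Host_Driver

# Reboot, then install the fresh VIB (placeholder path)
esxcli software vib install -v /vmfs/volumes/datastore1/NVIDIA-VMware_ESXi_7.0_Host_Driver_450.89-1OEM.700.0.0.15525992.vib

# Reboot again, then exit maintenance mode
esxcli system maintenanceMode set --enable false
```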

Even though the VIB install reports success, I believe this is the problem:
2020-12-22T07:54:30.378Z cpu24:2102461)ALERT: NVIDIA: module load failed during VIB install/upgrade.

[:~] esxcli software vib list
Name                                Version                       Vendor  Acceptance Level  Install Date
----------------------------------  ----------------------------  ------  ----------------  ------------
NVIDIA-VMware_ESXi_7.0_Host_Driver  450.89-1OEM.700.0.0.15525992  NVIDIA  VMwareAccepted    2020-12-23

Sorry, I forgot to list the hardware.
Server: SuperMicro SYS-1028U-TNRTP+
GPU: Tesla T4
o/s: vSphere 7.0 u1C

Hi

What about the MMIO settings in the BIOS?

Just checking … Where did you download the driver from?

Regards

MG

I have tried different MMIO settings, but at this point I’m just guessing; I don’t know what the “correct” settings should be.

The driver was downloaded from the NVIDIA licensing portal (nvid.nvidia.com).

Hi

No need to guess, Google is your friend :-)

Try these:

FAQ Entry | Online Support | Support - Super Micro Computer, Inc.

Incorrect BIOS settings on a server when used with a hypervisor can cause MMIO address issues that result in GRID GPUs failing to be recognized. | NVIDIA (custhelp.com)

Make sure the BIOS and all firmware are fully up to date, reset the BIOS to factory default and start again to ensure there are no rogue settings in there.

Regards

MG
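Before hunting for BIOS combinations, it can help to confirm how the host actually sees the card. A minimal sketch using standard ESXi commands (the grep context sizes are arbitrary):

```shell
# Confirm the exact ESXi build
vmware -v

# Dump the GPU's PCI entry, including address and passthru state
esxcli hardware pci list | grep -i -B 2 -A 28 nvidia
```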


I’m still getting errors even after a BIOS reset.

I need to find the right “combo” of BIOS settings.

Can you make some sense of this? (see attached)

Hi

RedHat? What happened to VMware?

Regards

MG

This was just a test, I’m still using VMware.

They are all related to the driver failing to install.

Definitely a BIOS setting issue!
I need to find the right “combo” for the SYS-1028U.

I moved the T4 into another SuperMicro box (SYS-2029BT-HNR), enabled SR-IOV, and installed the latest VIB; it all works.
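For completeness, the checks confirming a working install on the SYS-2029 would look something like this:

```shell
# Driver VIB present?
esxcli software vib list | grep -i nvidia

# vmkernel module loaded?
vmkload_mod -l | grep -i nvidia

# Driver responding?
nvidia-smi
```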

Here are the SYS-2029 BIOS settings.