femi
December 22, 2020, 5:14pm
1
The hypervisor is vSphere 7.0 U1c.
The T4 is not in passthrough mode.
I followed the instructions here and installed the vGPU 11.2 driver.
Any ideas on what else I can check?
nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
dmesg | grep NVIDIA
2020-12-22T07:54:30.378Z cpu24:2102461)ALERT: NVIDIA: module load failed during VIB install/upgrade.
2020-12-22T07:54:30.390Z cpu25:2102464)NVIDIA: Starting vGPU Services.
2020-12-22T07:54:30.405Z cpu41:2102467)NVIDIA: Starting Xorg service.
2020-12-22T07:54:33.096Z cpu45:2104601)NVIDIA: Starting the DCGM node engine.
2020-12-22T08:36:17.100Z cpu42:2112615)NVIDIA: Stopping the DCGM node engine.
2020-12-22T08:36:17.280Z cpu4:2112625)NVIDIA: Unloading nvidia module during vib remove.
2020-12-22T08:42:27.473Z cpu30:2113597)NVIDIA: Unloading nvidia module during vib install/upgrade.
2020-12-22T08:42:28.106Z cpu37:2113606)ALERT: NVIDIA: module load failed during VIB install/upgrade.
2020-12-22T08:42:28.123Z cpu1:2113607)NVIDIA: Starting vGPU Services.
2020-12-22T08:42:28.140Z cpu36:2113610)NVIDIA: Starting Xorg service.
2020-12-22T08:42:31.344Z cpu8:2115707)NVIDIA: Starting the DCGM node engine.
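To see whether the module load fails on every install attempt, the kernel log can be filtered for the alert line alone. The following is a minimal sketch that assumes the log has been saved to a file; `count_load_failures` is a hypothetical helper name, and the sample lines are copied from the output above.

```shell
# count_load_failures is a hypothetical helper, not an ESXi command.
# It prints how many NVIDIA "module load failed" alerts a saved log contains.
count_load_failures() {
    grep -c 'ALERT: NVIDIA: module load failed' "$1"
}

# Sample lines copied from the dmesg output in this post.
log=$(mktemp)
cat > "$log" <<'EOF'
2020-12-22T07:54:30.378Z cpu24:2102461)ALERT: NVIDIA: module load failed during VIB install/upgrade.
2020-12-22T07:54:30.390Z cpu25:2102464)NVIDIA: Starting vGPU Services.
2020-12-22T08:42:28.106Z cpu37:2113606)ALERT: NVIDIA: module load failed during VIB install/upgrade.
EOF

failures=$(count_load_failures "$log")
echo "module load failures: $failures"
rm -f "$log"
```

On the host, something like `dmesg > /tmp/dmesg.txt` followed by `count_load_failures /tmp/dmesg.txt` should give the same kind of count, assuming `grep` behaves the same way in the ESXi shell.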
MrGRID
December 27, 2020, 10:17am
2
Hi
Have you configured the BIOS correctly? And what happens if you uninstall / reinstall the driver?
Regards
MG
femi
December 27, 2020, 7:19pm
3
The BIOS is configured correctly to the best of my knowledge: SR-IOV and 4G decoding are enabled.
After uninstalling and reinstalling the driver, I still get the same error.
femi
December 27, 2020, 7:33pm
4
I even tried installing an “older” driver, and I get the same error message.
Even though the VIB install reports success, I believe this is the problem:
2020-12-22T07:54:30.378Z cpu24:2102461)ALERT: NVIDIA: module load failed during VIB install/upgrade.
femi
December 27, 2020, 7:55pm
5
[:~] esxcli software vib list
Name                                Version                       Vendor  Acceptance Level  Install Date
NVIDIA-VMware_ESXi_7.0_Host_Driver  450.89-1OEM.700.0.0.15525992  NVIDIA  VMwareAccepted    2020-12-23
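The installed host driver version can be pulled out of that listing with a short filter. This is only a sketch: `nvidia_vib_version` is a hypothetical helper name, and it assumes the NVIDIA row keeps the whitespace-separated column layout shown in this post.

```shell
# nvidia_vib_version is a hypothetical helper: it reads "esxcli software vib list"
# output on stdin and prints the Version column of any row whose name starts with NVIDIA.
nvidia_vib_version() {
    awk '$1 ~ /^NVIDIA/ { print $2 }'
}

# Sample row copied from the listing in this post.
sample='NVIDIA-VMware_ESXi_7.0_Host_Driver 450.89-1OEM.700.0.0.15525992 NVIDIA VMwareAccepted 2020-12-23'

version=$(printf '%s\n' "$sample" | nvidia_vib_version)
branch=${version%%-*}   # drop everything from the first "-" to leave the driver branch

echo "installed host driver VIB: $version (driver $branch)"
```

On the host the helper would be fed directly, as in `esxcli software vib list | nvidia_vib_version`. A listed VIB only shows that the package is installed; whether the kernel module actually loaded can be checked separately with `vmkload_mod -l | grep nvidia`.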
femi
December 27, 2020, 8:32pm
6
Sorry, I forgot to list the hardware.
Server: Supermicro SYS-1028U-TNRTP+
GPU: Tesla T4
OS: vSphere 7.0 U1c
MrGRID
December 28, 2020, 10:32am
7
Hi
What about the MMIO settings in the BIOS?
Just checking … Where did you download the driver from?
Regards
MG
femi
December 28, 2020, 3:50pm
8
I have tried different MMIO settings, but at this point I’m just guessing; I don’t know what the “correct” settings should be.
The driver was downloaded from the nvid.nvidia portal.
MrGRID
December 29, 2020, 10:54am
10
femi
January 1, 2021, 3:01am
11
I’m still getting errors even after a BIOS reset.
I need to find the right “combo” of BIOS settings.
Can you make some sense of this? (see attached)
MrGRID
January 1, 2021, 9:51am
12
Hi
RedHat? What happened to VMware?
Regards
MG
femi
January 1, 2021, 5:55pm
13
This was just a test, I’m still using VMware.
femi
January 1, 2021, 10:19pm
14
Those errors are all related to the driver not being installed.
femi
January 2, 2021, 8:43am
15
Definitely a BIOS setting issue!
I need to find the right “combo” for the SYS-1028U.
I moved the T4 into another Supermicro box (SYS-2029BT-HNR), enabled SR-IOV, and installed the latest VIB, and it all works.
Here are the SYS-2029 BIOS settings.