ESXi 6.7 + Tesla V100 + 430.27 not working

Hello,

We have ESXi 6.7 installed on our server. Now I want to pass the Tesla V100 through to one VM.

I installed the latest Host Driver for ESXi:
NVIDIA-VMware_ESXi_6.7_Host_Driver-430.27-1OEM.670.0.0.8169922.x86_64.vib and rebooted the machine.

But when I run nvidia-smi, I get an error:

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Can anybody help?

Hi

If you’re using Passthrough you don’t need to install the .vib in the ESXi Host.

Remove the GPU from running in Passthrough, and use a vGPU Profile instead. Then run nvidia-smi again.
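After that, a quick sanity check from the ESXi host shell could look like this (a sketch; output varies by driver release):

```shell
# Confirm the nvidia kernel module is loaded on the host
vmkload_mod -l | grep nvidia

# If the module is loaded, nvidia-smi should now report the GPU
nvidia-smi
```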

Regards

MG

Hello, thank you for your reply.

Maybe "passthrough" was not the correct word. I want to "attach" the vGPU to more than one VM, as the document

430.27-430.30-431.02-grid-software-quick-start-guide.pdf
Chapter 3. INSTALLING AND CONFIGURING NVIDIA VGPU MANAGER AND THE GUEST DRIVER describes.

I registered on the NVIDIA licensing portal and downloaded the package NVIDIA-GRID-vSphere-6.7-430.27-430.30-431.02.zip for ESXi 6.7.

The installation with "esxcli software vib install -v NVIDIA-VMware_ESXi_6.7_Host_Driver-430.27-1OEM.670.0.0.8169922.x86_64.vib" was also successful.
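One way to double-check that the VIB really registered after the install and reboot (a sketch; the exact package name varies by release):

```shell
# List installed VIBs and filter for the NVIDIA host driver
esxcli software vib list | grep -i nvidia
```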

But nvidia-smi throws the error:

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

It would be helpful to know which server hardware you are using. If it is Dell, you may need to modify your BIOS MMIO settings.
You should also run "dmesg" on the host to get more information about what the issue might be.
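For example, one way to filter the relevant kernel log messages (the exact filter is just a suggestion):

```shell
# Search the kernel log for NVIDIA driver load messages (case-insensitive)
dmesg | grep -iE 'nvidia|nvrm'
```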

The machine is a Dell PowerEdge R740xd.

dmesg | grep NVIDIA shows:

2019-07-15T11:46:44.718Z cpu94:2101167)ALERT: NVIDIA: module load failed during VIB install/upgrade.
2019-07-15T11:46:44.722Z cpu109:2101168)NVIDIA: Starting vGPU Services.
2019-07-15T11:46:44.728Z cpu0:2101171)NVIDIA: Starting Xorg service.
2019-07-15T11:46:45.225Z cpu39:2101248)NVIDIA: Starting the DCGM node engine.

It looks like the VIB installation was not successful: the module load failed during the install.

I cannot find information about "BIOS restrict MMIO". Is it what this NVIDIA support page describes?
https://nvidia.custhelp.com/app/answers/detail/a_id/4119/~/incorrect-bios-settings-on-a-server-when-used-with-a-hypervisor-can-cause-mmio
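For completeness: if you stay with plain passthrough instead of vGPU, GPUs with large BARs such as the V100 typically also need 64-bit MMIO enabled in the VM's advanced configuration (a sketch based on VMware's published guidance; the size value below is an example and depends on how many GPUs you pass through):

```
pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "64"
```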

OK, I found the solution. Passthrough was enabled in ESXi. I disabled it and can now see information about my GPU with nvidia-smi.

Okay, it still does not work completely.

I disabled the passthrough setting on the ESXi host under PCI Devices.
In the vSphere Web Client I changed the Host Graphics and Graphics Devices settings to Shared Direct.
Then I restarted the ESXi host.
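The same Host Graphics change can also be made from the host CLI (a sketch; in esxcli terms, the UI's "Shared Direct" corresponds to the SharedPassthru default type):

```shell
# Show the current host graphics default type
esxcli graphics host get

# Set the default to Shared Direct (vGPU); takes effect after a host restart
esxcli graphics host set --default-type SharedPassthru

# List graphics devices and their active type
esxcli graphics device list
```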

Now nvidia-smi works:

[root@bigdata:~] nvidia-smi 
Fri Jul 19 10:21:25 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.27       Driver Version: 430.27       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   37C    P0    28W / 250W |     39MiB / 32767MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

vmkload_mod works:

[root@bigdata:~] vmkload_mod -l | grep nvidia
nvidia                         13    17840

and dmesg has no errors:

[root@bigdata:~] dmesg | grep nvidia
2019-07-18T16:52:41.547Z cpu0:2097152)VisorFSTar: 1856: nvidia_v.v00 for 0x48fd082 bytes
2019-07-18T16:52:48.072Z cpu37:2098396)Loading module nvidia ...
2019-07-18T16:52:48.098Z cpu37:2098396)Elf: 2101: module nvidia has license NVIDIA
2019-07-18T16:52:48.471Z cpu37:2098396)nvidia-nvlink core initialized
2019-07-18T16:52:48.471Z cpu37:2098396)Device: 192: Registered driver 'nvidia' from 21
2019-07-18T16:52:48.472Z cpu37:2098396)Mod: 4962: Initialization of nvidia succeeded with module ID 21.
2019-07-18T16:52:48.472Z cpu37:2098396)nvidia loaded successfully.
2019-07-18T16:52:48.477Z cpu27:2098226)Device: 327: Found driver nvidia for device 0x47bd4309e61c8101
2019-07-18T16:52:55.943Z cpu78:2098417)NVRM: nvidia_associate vmgfx0
2019-07-18T16:53:25.704Z cpu60:2100286)Starting service nvidia-init
2019-07-18T16:53:25.704Z cpu60:2100286)Activating Jumpstart plugin nvidia-init.
2019-07-18T16:53:35.787Z cpu109:2100286)Jumpstart plugin nvidia-init activated.

But I still can't attach the graphics card. The menu option for adding PCI devices is greyed out:
https://www.directupload.net/file/d/5518/ejz8rkxv_png.htm

Any ideas?

Is ECC memory disabled? Is the correct license (Enterprise Plus) present on vSphere?
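For the first point, the ECC state can be queried and, if necessary, changed per GPU with nvidia-smi (a sketch; an ECC mode change only takes effect after a GPU reset or host reboot):

```shell
# Query the current and pending ECC mode
nvidia-smi -q -d ECC

# Disable ECC on GPU 0 if your vGPU configuration requires it
nvidia-smi -i 0 -e 0
```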

It was the license; we only have the Standard license in vSphere. I will use passthrough with one VM until we have the right licenses.