Hi all,
I’ve got a configuration problem in our laboratory involving the NVIDIA A100 on ESXi 7.0 U1.
I’ve followed the VMware blog post (https://blogs.vmware.com/apps/2020/09/vsphere-7-0-u1-with-multi-instance-gpus-mig-on-the-nvidia-a100-for-machine-learning-applications-part-1-introduction.html) and also the NVIDIA MIG deployment guide.
On the vSphere 7.0 U1 host I’ve tried both MIG-backed and time-sliced vGPU profiles. Neither of the two works correctly.
After enabling the MIG feature on the GPU, creating a GPU Instance and a Compute Instance, and applying the vGPU profile to a virtual machine, VMware reports this error during power-on:

could not initialize plugin /usr/lib64/vmware/plugin/libnvidia-vgx.so for vGPU grid_a100-10c

Running nvidia-smi on the VMware host after the error shows the following message:

unable to determine the device handle for gpu 0000:3b:00:0 gpu is lost. reboot the system to recover this gpu

We must reboot the host to make the GPU “operational” again.
I’ve also done the same test creating only the GPU Instance and not the Compute Instance.
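For reference, the MIG steps described above (enable MIG, create a GPU Instance, optionally create a Compute Instance) map roughly onto the nvidia-smi calls below. This is only a sketch under assumptions: the GPU index 0 and the profile ID 14 are placeholders rather than the values used on this host, the helper name build_mig_commands is made up for illustration, and the script only prints the command lines instead of running them.

```python
def build_mig_commands(gpu_index, gi_profile_id, with_compute_instance=True):
    """Return nvidia-smi command lines for a basic MIG setup (dry run only)."""
    gpu = str(gpu_index)

    # Create one GPU Instance from the given profile ID.
    # -C additionally creates the default Compute Instance inside it.
    create_gi = ["nvidia-smi", "mig", "-i", gpu, "-cgi", str(gi_profile_id)]
    if with_compute_instance:
        create_gi.append("-C")

    return [
        # Enable MIG mode on the selected GPU.
        ["nvidia-smi", "-i", gpu, "-mig", "1"],
        # List the GPU Instance profiles the card offers (to pick a profile ID).
        ["nvidia-smi", "mig", "-i", gpu, "-lgip"],
        create_gi,
        # List the GPU Instances that now exist, to confirm the result.
        ["nvidia-smi", "mig", "-i", gpu, "-lgi"],
    ]


if __name__ == "__main__":
    # Placeholder values: GPU 0 and profile ID 14 are assumptions.
    for command in build_mig_commands(gpu_index=0, gi_profile_id=14):
        print(" ".join(command))
```

Passing with_compute_instance=False gives the variant of the test where only the GPU Instance is created.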
The same applies to time-sliced vGPU profiles (disabling the MIG feature and using “normal” vGPU). I’ve tried the latest vGPU driver releases, 11.2 and 12.0, with the same result.
The VMware version is ESXi 7.0 U1 build 17325551.
The SR-IOV feature is enabled in the BIOS. The host is a Dell R740 server.
I’ve tried 2 different cards but the result is the same. I’ve also tried passthrough, but the card always goes into an unrecoverable state.
Can anyone help?