Insufficient resources. One or more devices (pciPassthru0) required by VM xxx are not available on host yyy

Hello

We have four identical ESXi hosts (Lenovo SR650) with one Tesla P40 each, and four Windows Server 2016 VMs with the profile “grid_p40-4q”. They start up on separate hosts without a problem, but migrating a VM to a host that already runs a GPU VM is blocked with “Insufficient resources. One or more devices (pciPassthru0) required by VM xxx are not available on host yyy”.

Initially the VMs had different profiles (-4q and -8q, for example). After reading in the thread “Unable to start VMs with VGPU” that this is not supported, I changed them all to the same profile, but it’s still not working.

The P40 has 22.5 (or 24?) GB of video memory, right? So two (or even all four) VMs with the grid_p40-4q profile (allocating 4 GB of memory each, right?) should work?
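
The capacity question above comes down to integer division. A minimal sketch, assuming the profile's frame buffer is the only limiting factor and that a -4q profile reserves 4096 MiB (the helper name `max_vgpus` is hypothetical; the NVIDIA vGPU documentation remains the authority on the real per-profile maximum):

```shell
# Hypothetical helper: how many vGPUs of one profile fit into a GPU's
# frame buffer, by plain integer division. Ignores any driver overhead.
max_vgpus() {
    total_mib=$1      # total frame buffer in MiB
    profile_mib=$2    # frame buffer per vGPU in MiB (assumed value)
    echo $(( total_mib / profile_mib ))
}

# 23039 MiB is the total that nvidia-smi reports on this host (see the
# output further down); 24576 MiB is the nominal 24 GB.
max_vgpus 23039 4096
max_vgpus 24576 4096
```

Both results are well above two, so by this naive count a second -4q VM per host should fit.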

Both vCenter and the hosts are on the absolute latest 7.0.2 releases and build numbers.

Where’s the problem?

[root@yyy:~] esxcli software vib list | grep -i nvid
NVIDIA-VMware_ESXi_7.0.2_Driver  470.63-1OEM.702.0.0.17630552          NVIDIA   VMwareAccepted    2021-10-06

[root@yyy:~] nvidia-smi
Thu Oct  7 13:04:02 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63       Driver Version: 470.63       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P40           On   | 00000000:2F:00.0 Off |                    0 |
| N/A   23C    P8    19W / 250W |   3856MiB / 23039MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2130493    C+G   XXX                              3806MiB |
+-----------------------------------------------------------------------------+
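
To read the free frame buffer off the table above without doing it by eye, the memory column can be extracted with standard tools. This is only a sketch that parses saved nvidia-smi text; the helper name `gpu_mem` is hypothetical and the pattern assumes the table layout shown here:

```shell
# Hypothetical helper: read nvidia-smi table output on stdin and print
# "used total free" in MiB for the first GPU memory line found.
gpu_mem() {
    sed -n 's/.*| *\([0-9][0-9]*\)MiB \/ *\([0-9][0-9]*\)MiB.*/\1 \2/p' |
    head -n 1 |
    while read -r used total; do
        echo "$used $total $(( total - used ))"
    done
}

# The memory line copied from the output above.
line='| N/A   23C    P8    19W / 250W |   3856MiB / 23039MiB |      0%      Default |'
printf '%s\n' "$line" | gpu_mem
```

For the line above this gives 19183 MiB free, which on memory alone would leave room for further 4 GB profiles.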

Could you please disable ECC first and test again?
nvidia-smi -e 0 should do the trick…

Regards Simon

Nope. Disabled ECC on two of the four hosts and rebooted them, still doesn’t work.

It’s also independent of vMotion: just powering on a second GPU-VM on the host with freshly disabled ECC (while one host was unavailable because of the reboot) failed with the same error message.

vMotion of a powered-on GPU-VM to a host that has zero GPU-VMs works though.

I disabled ECC on the hosts, but I see the same nvidia-smi command is also available within the VM. Where do you mean to use it…?

And your VMware placement policy? Most likely it is set to performance instead of density.

Nope, in case you mean this vSphere setting:

Edit Host Graphics Settings
Default graphics type: Shared Direct
Shared passthrough GPU assignment policy: Group VMs on GPU until full (GPU consolidation)

And it should not even make a difference in this case…?

Hmm, running out of ideas. I would recommend opening a support ticket with NVES so they can check the nvidia-bug-report log.

Regards Simon

Opened a case a few days ago and had a one-hour Webex session today. Everything looks configured correctly (except the “disable ECC” thing maybe), and versions are up to date and identical across hosts and VMs, yet it does not work at all. Only one VM per GPU…