Insufficient resources. One or more devices (pciPassthru0) required by VM xxx are not available on host yyy

johannes.froemter · October 7, 2021, 1:44pm

Hello

We have four identical (Lenovo SR650) ESXi hosts with one Tesla P40 each, and four Win2016 server VMs with profile “grid_p40-4q”. They start up on separate hosts without a problem, but migrating one of them to a host that already runs a GPU VM is blocked with “Insufficient resources. One or more devices (pciPassthru0) required by VM xxx are not available on host yyy”.

Initially the VMs had different profiles (-4q and -8q for example). Since I read in the thread “Unable to start VMs with VGPU” that this is not supported, I changed them to the identical profile, but it’s still not working.

The P40 has 22.5 (or 24?) GB of video memory, right? So two (or even all four) VMs with grid_p40-4q profile (allocating 4 GB of memory each, right?) must work?

Both vCenter and the hosts are on the absolute latest 7.0.2 releases and build numbers.

Where’s the problem?

[root@yyy:~] esxcli software vib list | grep -i nvid
NVIDIA-VMware_ESXi_7.0.2_Driver  470.63-1OEM.702.0.0.17630552          NVIDIA   VMwareAccepted    2021-10-06

[root@yyy:~] nvidia-smi
Thu Oct  7 13:04:02 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63       Driver Version: 470.63       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P40           On   | 00000000:2F:00.0 Off |                    0 |
| N/A   23C    P8    19W / 250W |   3856MiB / 23039MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2130493    C+G   XXX                              3806MiB |
+-----------------------------------------------------------------------------+
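
As a rough sanity check on the memory question above, the following is a minimal sketch that divides the total frame buffer reported in the Memory-Usage field by the profile size. The function name `vgpu_capacity` is hypothetical, and the assumption that a -4q profile reserves exactly 4096 MiB is taken from the post, not verified. The limit the driver actually enforces comes from NVIDIA's vGPU type table and may differ from plain division.

```python
import re


def vgpu_capacity(smi_memory_field, profile_mib):
    """Estimate how many frame buffers of profile_mib fit in the GPU total.

    smi_memory_field is the "used / total" text from nvidia-smi,
    for example "3856MiB / 23039MiB".
    """
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", smi_memory_field)
    if match is None:
        raise ValueError("no 'used / total' MiB pair found")
    total_mib = int(match.group(2))
    # Integer division: partial frame buffers do not count.
    return total_mib // profile_mib


# Memory-Usage field copied from the nvidia-smi output above,
# with an assumed 4 GiB (4096 MiB) per grid_p40-4q VM.
print(vgpu_capacity("3856MiB / 23039MiB", 4 * 1024))
```

By this estimate several 4q VMs fit on one card, so raw video memory alone would not explain why a second VM is refused.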

sschaber · October 7, 2021, 5:23pm

Could you please disable ECC first and test again?
nvidia-smi -e 0 should do the trick…

Regards Simon

johannes.froemter · October 7, 2021, 6:07pm

Nope. Disabled ECC on two of the four hosts and rebooted them, still doesn’t work.

It’s also independent of vMotion: just powering on a second GPU VM on the host with freshly disabled ECC (while one host was unavailable because of the reboot) failed with the same error message.

vMotion of a powered-on GPU-VM to a host that has zero GPU-VMs works though.

johannes.froemter · October 7, 2021, 6:11pm

I disabled ECC on the hosts, but I see the same nvidia-smi command is also available within the VM. Where do you mean to use it…?

sschaber · October 7, 2021, 6:15pm

And your VMware placement policy? Most likely it is set to performance instead of density.

johannes.froemter · October 7, 2021, 6:21pm

Nope, in case you mean this vSphere setting:

Edit Host Graphics Settings
Default graphics type: Shared Direct
Shared passthrough GPU assignment policy: Group VMs on GPU until full (GPU consolidation)

And it should not even make a difference in this case…?
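
To illustrate why the assignment policy should not matter with a single GPU per host, here is a toy model of the two policies. It is not vSphere's actual placement code: the function `pick_gpu`, the policy names `"consolidate"` and `"spread"`, and the idea of counting free vGPU slots per GPU are all assumptions made for this sketch.

```python
from typing import Sequence


def pick_gpu(free_slots: Sequence[int], policy: str) -> int:
    """Return the index of the GPU for the next vGPU VM, or -1 if none has room.

    free_slots[i] is the number of remaining vGPU slots on GPU i
    (a hypothetical simplification of the host's state).
    """
    candidates = [i for i, free in enumerate(free_slots) if free > 0]
    if not candidates:
        return -1
    if policy == "consolidate":
        # Group VMs on a GPU until full: prefer the fullest GPU with room.
        return min(candidates, key=lambda i: free_slots[i])
    if policy == "spread":
        # Spread VMs across GPUs: prefer the emptiest GPU.
        return max(candidates, key=lambda i: free_slots[i])
    raise ValueError(f"unknown policy: {policy}")


# With one GPU per host, both policies choose the same (only) GPU.
print(pick_gpu([4], "consolidate"), pick_gpu([4], "spread"))
# With two GPUs the policies diverge.
print(pick_gpu([5, 2], "consolidate"), pick_gpu([5, 2], "spread"))
```

In this model the policy only changes the outcome when a host has more than one GPU with free capacity, which is not the case for hosts with a single P40.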

sschaber · October 7, 2021, 6:34pm

Hmm, I’m running out of ideas. I’d recommend opening a support ticket with NVES so they can check the nvidia-bugreport.

Regards Simon

johannes.froemter · October 14, 2021, 3:35pm

Opened a case a few days ago and had a one-hour Webex session today. Everything looks configured correctly (except the “disable ECC” thing maybe), versions are up to date and identical across hosts and VMs, yet it does not work at all. Only one VM per GPU…

johannes.froemter · November 2, 2021, 5:42pm

We had cases with both NVIDIA and VMware, without results - but it just works now! What has changed? vCenter was upgraded to build 18778458, and that seemingly solved the issue!

Or - maybe the setting vgpu.hotmigrate.enabled = true requires a restart of vCenter?? The documentation does NOT state this… (see: Virtual GPU Software User Guide :: NVIDIA Virtual GPU Software Documentation)

sschaber · November 3, 2021, 4:55am

Hi, thanks for the feedback. Indeed, vCenter needs to be on the latest version to fix this. Our engineering team is still working with VMware on this.
