Can't power on another vGPU-enabled VM

Hello,

I sometimes get the following error if I power on another VM (10 VMs per host are running):

Could not initialize plugin /usr/lib64/vmware/plugin/libnvidia-vpx.so for vGPU " passthrough device ‘pciPassthru0’ vGPU ‘grid_m60-2q’ disallowed by vmkernel: Out of memory"

The hosts have enough memory and vGPU resources left to power on the VM.
Support says it’s a known issue:

http://docs.nvidia.com/grid/5.0/grid-vgpu-release-notes-vmware-vsphere/index.html#bug-200060499-vGPU-enabled-VMs-fail-too-much-memory

But why can I sometimes power on another VM (e.g. the eleventh) and sometimes not? I think it’s a different issue.
Does anyone have the same problem?

Best Regards

Hi,

How much system memory do the affected VMs have?

Regards

Simon

I have the same problem. In my environment, with the profile I’m using, I should be able to have 96 VMs running. Sometimes, with only 85 VMs provisioned, I still get the error you mention. After hammering Power On, the VM will eventually power on. Sometimes I even have to delete another VM before I can power on my parent image.

Extremely frustrating.

5 VMs with 128 GB system memory
5 VMs with 16 GB system memory

In total 720 GB.

The hosts have 960 GB system memory.
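
As a quick sanity check of those numbers, here is a minimal Python sketch. The variable and function names are made up for illustration. It counts only the configured guest memory, not per-VM overhead or whatever the vmkernel itself needs, so the real headroom is presumably smaller than what it prints.

```python
# Configured guest memory per group of VMs, in GB (figures from this post).
vm_groups = [(5, 128), (5, 16)]  # (number of VMs, GB per VM)
host_memory_gb = 960             # physical memory per host, in GB


def total_guest_memory_gb(groups):
    """Sum the configured memory over all VM groups."""
    return sum(count * gb for count, gb in groups)


allocated_gb = total_guest_memory_gb(vm_groups)
# Memory not assigned to guests; this ignores virtualization overhead
# and any memory the hypervisor needs for itself (assumption).
headroom_gb = host_memory_gb - allocated_gb

print(f"allocated: {allocated_gb} GB, headroom: {headroom_gb} GB")
```

So on paper roughly a quarter of the host memory is still unassigned when the error shows up.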

So did you file a ticket with Nvidia and VMware?

Hello,

Yes, after the issue came back I opened cases with Nvidia and VMware.

Nvidia Case: 00007591
The Nvidia support engineer didn’t find any errors on the Nvidia side.

VMware says it’s a known Nvidia issue.
https://docs.nvidia.com/grid/latest/grid-vgpu-release-notes-vmware-vsphere/index.html
Nvidia REF: 200060499

Best Regards
Georg

Hi Georg,

Yes, it seems you hit that issue, but I disagree that this is an NV issue. From my understanding, you need to “fully reserve memory” for the vGPU-enabled VMs on ESX. This works until there is no longer enough system memory available for the hypervisor (VMkernel). There seems to be no rule for how much memory needs to remain available for the hypervisor, so only trial and error, reducing the system memory allocated to the VMs, seems to help. I’ll try to get some advice on what we can do here, or on whether this is something that needs to be addressed by VMware (which I believe), as I’ve never heard of the same issue occurring on other hypervisors.

regards

Simon

Hello Simon,

Any news from VMware?

Best Regards
Georg