K2 vGPU help

Hi, we are running 3 K2 cards in a Dell T630 with K260Q profiles on VMware 6.0.0. We are only getting 6 desktops working when we should be able to get 12. When we try to power on the 7th, it fails with "The amount of graphics resource available in the parent resource pool is insufficient for the operation." Has anyone seen this or have any suggestions?
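
For reference, the expected figure of 12 can be reproduced with a short calculation. This is only a sketch: it assumes that each GRID K2 board carries 2 physical GPUs and that each physical GPU hosts at most 2 K260Q vGPUs, neither of which is stated in the post. The function and constant names are made up for illustration.

```python
# Assumptions (not stated in the post): a GRID K2 board has 2 physical GPUs,
# and one physical GPU can host at most 2 K260Q vGPUs.
GPUS_PER_K2_BOARD = 2
K260Q_PER_GPU = 2


def max_vgpu_vms(boards: int,
                 gpus_per_board: int = GPUS_PER_K2_BOARD,
                 vgpus_per_gpu: int = K260Q_PER_GPU) -> int:
    """Upper bound on vGPU VMs that can be powered on at once on one host."""
    return boards * gpus_per_board * vgpus_per_gpu


expected = max_vgpu_vms(boards=3)   # 3 K2 cards, as in the post
running = 6                         # desktops that currently power on
print(f"expected capacity: {expected}, running: {running}, "
      f"unused slots: {expected - running}")
```

Under those assumptions the host should still have room for six more K260Q desktops, which is why the 7th power-on failure looks wrong.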

Let’s begin with the output of “nvidia-smi” on the ESXi host to get your version info, card visibility and vGPU usage.

Tue May 17 12:49:51 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.70     Driver Version: 352.70         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K2             On   | 0000:06:00.0     Off |                  Off |
| N/A   31C    P8    28W / 117W |   1872MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GRID K2             On   | 0000:07:00.0     Off |                  Off |
| N/A   35C    P0    48W / 117W |   1876MiB /  4095MiB |     16%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GRID K2             On   | 0000:85:00.0     Off |                  Off |
| N/A   32C    P8    30W / 117W |   1873MiB /  4095MiB |     51%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GRID K2             On   | 0000:86:00.0     Off |                  Off |
| N/A   31C    P8    28W / 117W |   1872MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GRID K2             On   | 0000:89:00.0     Off |                  Off |
| N/A   32C    P8    31W / 117W |   1873MiB /  4095MiB |     45%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GRID K2             On   | 0000:8A:00.0     Off |                  Off |
| N/A   30C    P8    27W / 117W |   1872MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0   6794573  C+G   Cad-25                                       1856MiB |
|    1   6859646  C+G   Cad-23                                       1856MiB |
|    2   7348433  C+G   Cad-32                                       1856MiB |
|    3   6798732  C+G   Cad-1                                        1856MiB |
|    4   6904581  C+G   Cad-46                                       1856MiB |
|    5   6793658  C+G   Cad-6                                        1856MiB |
+-----------------------------------------------------------------------------+
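
A quick way to read the process table is to count vGPU VMs per physical GPU. The sketch below does that for the six rows above. It assumes that process names contain no spaces and that a GPU can take at most 2 K260Q vGPUs; the limit and all helper names are assumptions, not something nvidia-smi reports.

```python
import re
from collections import Counter

# Process rows copied from the nvidia-smi output in this thread.
PROCESS_ROWS = """\
| 0 6794573 C+G Cad-25 1856MiB |
| 1 6859646 C+G Cad-23 1856MiB |
| 2 7348433 C+G Cad-32 1856MiB |
| 3 6798732 C+G Cad-1 1856MiB |
| 4 6904581 C+G Cad-46 1856MiB |
| 5 6793658 C+G Cad-6 1856MiB |
"""

# Columns: GPU index, PID, type, process name (no spaces assumed), memory in MiB.
ROW = re.compile(r"^\|\s*(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\d+)MiB\s*\|$")

K260Q_PER_GPU = 2  # assumed per-GPU limit for the K260Q profile


def vms_per_gpu(text: str) -> Counter:
    """Count process rows per GPU index; lines that do not match are skipped."""
    counts = Counter()
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            counts[int(match.group(1))] += 1
    return counts


counts = vms_per_gpu(PROCESS_ROWS)
for gpu in sorted(counts):
    free = K260Q_PER_GPU - counts[gpu]
    print(f"GPU {gpu}: {counts[gpu]} vGPU VM(s), {free} free slot(s)")
```

With these rows every GPU holds exactly one VM, so under the assumed limit each GPU would still have one free K260Q slot.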

This looks good. All cards are visible and the vGPUs are evenly distributed (none in vDGA).
But:
Have you tried upgrading ESXi (https://communities.vmware.com/thread/525118?start=0&tstart=0)?
Have you tried installing a newer GRID release, 352.83/354.80 or 362.13/361.40?

There is a similar issue detailed in the troubleshooting guide for VMware:

vGPU-enabled VMs fail to start, nvidia-smi fails when VMs are configured with too high a proportion of the server’s memory.
Description: If vGPU-enabled VMs are assigned too high a proportion of the server’s total memory, one or more of the VMs may fail to start with the error “The available Memory resources in the parent resource pool are insufficient for the operation”, and nvidia-smi run in the host shell returns this error:
-sh: can't fork
For example, on a server configured with 256G of memory, these errors may occur if vGPU-enabled VMs are assigned more than 243G of memory.
Workaround: Reduce the total amount of system memory assigned to the VMs.
Status: Closed
Ref. #200060499

This error has also been seen when the wrong VIB is used; see: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2126577

Can you check that all the hosts in the pool have GPUs and are sufficiently licensed for vGPU? The VM could be trying to start on a host with no vGPU available.

We have 384GB of memory and only 6 current VMs with 8GB of RAM reserved on each, so I don’t believe it is RAM-related. We have a large number of other VMs but no other reservations.
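
To put numbers on that, here is a rough comparison against the example in the troubleshooting guide. It is a sketch only: it treats the guide's 243G-of-256G example as a ratio that scales to other hosts, which the guide does not claim, and it counts only the reserved memory of the vGPU VMs. The variable names are made up.

```python
# Figures from the troubleshooting guide's example.
GUIDE_TOTAL_GB = 256
GUIDE_LIMIT_GB = 243
guide_ratio = GUIDE_LIMIT_GB / GUIDE_TOTAL_GB

# Figures from this host, as reported in the thread.
host_total_gb = 384
vgpu_vms = 6
reserved_per_vm_gb = 8

reserved_gb = vgpu_vms * reserved_per_vm_gb
host_ratio = reserved_gb / host_total_gb

print(f"guide example ratio: {guide_ratio:.1%}")
print(f"this host: {reserved_gb} GB of {host_total_gb} GB = {host_ratio:.1%}")

# Assumption: the guide's example ratio is used as a rough yardstick only.
if host_ratio < guide_ratio:
    print("well below the guide's example, so memory looks unlikely here")
else:
    print("at or above the guide's example, so memory is worth a closer look")
```

Even with all 12 desktops at 8GB each, the reserved total would be 96GB of 384GB, still far from the ratio in the guide's example.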

We are in the process of updating ESXi, the VIB and the drivers. One host is done, three more to go.

I’ve seen that KB, but if we had the wrong VIB I’m guessing none of the VMs would work. Here’s the output of the command:
NVIDIA-vGPU-kepler-VMware_ESXi_6.0_Host_Driver 352.70-1OEM.600.0.0.2494585 NVIDIA VMwareAccepted 2016-02-11
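
One check that can be made from that line is whether the VIB's driver version matches what nvidia-smi reports. The sketch below assumes the five whitespace-separated fields are name, version, vendor, acceptance level and install date, and that the driver version is the part of the version string before the first hyphen. The function and key names are hypothetical.

```python
# VIB line and nvidia-smi driver version as posted in this thread.
VIB_LINE = (
    "NVIDIA-vGPU-kepler-VMware_ESXi_6.0_Host_Driver "
    "352.70-1OEM.600.0.0.2494585 NVIDIA VMwareAccepted 2016-02-11"
)
NVIDIA_SMI_DRIVER = "352.70"


def parse_vib_line(line: str) -> dict:
    """Split one VIB listing line into labelled fields (field order is assumed)."""
    name, version, vendor, acceptance, installed = line.split()
    return {
        "name": name,
        "version": version,
        # Assumption: the driver version precedes the first hyphen.
        "driver": version.split("-", 1)[0],
        "vendor": vendor,
        "acceptance": acceptance,
        "installed": installed,
    }


vib = parse_vib_line(VIB_LINE)
matches = vib["driver"] == NVIDIA_SMI_DRIVER
print(f"VIB driver {vib['driver']}, nvidia-smi driver {NVIDIA_SMI_DRIVER}, "
      f"match: {matches}")
```

A match only shows that the host driver and the installed VIB agree with each other; it does not rule out the VIB being the wrong package for the card or ESXi build.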