VMware HA fails with vGPU configured on VM

I am testing a vGPU configuration and automated recovery after a host failure with HA. When an ESXi host dies, its running VMs fail to start automatically on another vGPU-enabled host.

The error given by HA is:
"Insufficient resources to fail over this virtual machine. vSphere HA will retry the fail over when enough resources are available. Reason: Unable to find healthy compatible hosts for the VM"

When all hosts have recovered and the VM is powered back on, it shows an error in the event log:

"Hardware GPU resources are not available. The virtual machine will use software rendering."

Which in fact is a total lie. gpuvm shows that it is using the hardware GPU, nvidia-smi shows it, and dxdiag in the guest VM shows it too.

Any ideas?

[root@smview1:/vmfs/volumes/9d2dede8-7f3698ef/Training-0004] grep -i 'mks|' vmware.log
2016-06-09T15:41:56.342Z| mks| I120: VTHREAD start thread 2 "mks" pid 39369
2016-06-09T15:41:56.342Z| mks| I120: MKS thread is alive
2016-06-09T15:41:56.343Z| mks| I120: MKS-RenderMain: RenderMain: PowerOn allowed 0 1 1 1 0
2016-06-09T15:41:56.343Z| mks| I120: MKS-RenderMain: Collecting RenderOps caps...
2016-06-09T15:41:56.344Z| mks| W110: GLWindow: Unable to reserve host GPU resources
2016-06-09T15:41:56.351Z| mks| I120: MKS-SWP: plugin started - llvmpipe (LLVM 3.3, 256 bits)
2016-06-09T15:41:56.351Z| mks| I120: Started Shim3D
2016-06-09T15:41:56.352Z| mks| I120: Stopped Shim3D
2016-06-09T15:41:56.352Z| mks| I120: MKS-SWP: plugin stopped
2016-06-09T15:41:56.352Z| mks| I120: MKS-RenderMain: Starting MKSBasicOps
2016-06-09T15:41:56.352Z| mks| I120: KHBKL: Unable to parse keystring at: ''
2016-06-09T15:41:56.352Z| mks| I120: MKS-RemoteMgr: Set default display name: Training-0004
2016-06-09T15:41:56.352Z| mks| I120: MKS-RemoteMgr: Loading VNC Configuration from VM config file
[root@smview1:/vmfs/volumes/9d2dede8-7f3698ef/Training-0004]
[root@smview1:/vmfs/volumes/9d2dede8-7f3698ef/Training-0004]
[root@smview1:/vmfs/volumes/9d2dede8-7f3698ef/Training-0004] grep -i 'mks' vmware.log
2016-06-09T15:41:55.633Z| vmx| I120: MKSXlib: Initialized thread-safe Xlib
2016-06-09T15:41:55.701Z| vmx| I120: DICT mks.enable3d = "TRUE"
2016-06-09T15:41:55.701Z| vmx| I120: DICT mks.use3dRenderer = "automatic"
2016-06-09T15:41:56.342Z| vmx| I120: MKS PowerOn
2016-06-09T15:41:56.342Z| mks| I120: VTHREAD start thread 2 "mks" pid 39369
2016-06-09T15:41:56.342Z| mks| I120: MKS thread is alive
2016-06-09T15:41:56.343Z| mks| I120: MKS-RenderMain: RenderMain: PowerOn allowed 0 1 1 1 0
2016-06-09T15:41:56.343Z| mks| I120: MKS-RenderMain: Collecting RenderOps caps...
2016-06-09T15:41:56.344Z| mks| W110: GLWindow: Unable to reserve host GPU resources
2016-06-09T15:41:56.351Z| mks| I120: MKS-SWP: plugin started - llvmpipe (LLVM 3.3, 256 bits)
2016-06-09T15:41:56.351Z| mks| I120: Started Shim3D
2016-06-09T15:41:56.352Z| mks| I120: Stopped Shim3D
2016-06-09T15:41:56.352Z| mks| I120: MKS-SWP: plugin stopped
2016-06-09T15:41:56.352Z| mks| I120: MKS-RenderMain: Starting MKSBasicOps
2016-06-09T15:41:56.352Z| mks| I120: KHBKL: Unable to parse keystring at: ''
2016-06-09T15:41:56.352Z| mks| I120: MKS-RemoteMgr: Set default display name: Training-0004
2016-06-09T15:41:56.352Z| mks| I120: MKS-RemoteMgr: Loading VNC Configuration from VM config file
2016-06-09T15:41:56.353Z| vmx| I120: [msg.mks.noGPUResourceFallback] Hardware GPU resources are not available. The virtual machine will use software rendering.
2016-06-09T15:41:56.354Z| vmx| I120: Vigor_MessageRevoke: message 'msg.mks.noGPUResourceFallback' (seq 24717) is revoked
2016-06-09T15:41:56.500Z| vmx| I120: OvhdMem OvhdUser_VmxMks : 33 33 - | 2 2 -
2016-06-09T15:41:56.500Z| vmx| I120: OvhdMem OvhdUser_VmxMks3d : 180224 180224 - | 0 0 -
2016-06-09T15:41:56.500Z| vmx| I120: OvhdMem OvhdUser_VmxMksGLRenderer : 12288 12288 - | 0 0 -
2016-06-09T15:41:56.500Z| vmx| I120: OvhdMem OvhdUser_VmxMksGLTransient : 65536 65536 - | 0 0 -
2016-06-09T15:41:56.500Z| vmx| I120: OvhdMem OvhdUser_VmxMksLLVM : 8192 8192 - | 0 0 -
2016-06-09T15:41:56.500Z| vmx| I120: OvhdMem OvhdUser_VmxMksScreenTemp : 36866 36866 - | 0 0 -
2016-06-09T15:41:56.500Z| vmx| I120: OvhdMem OvhdUser_VmxMksVnc : 19362 19362 - | 0 0 -
2016-06-09T15:41:56.500Z| vmx| I120: OvhdMem OvhdUser_VmxMksScreen : 32769 32769 - | 0 0 -
2016-06-09T15:41:56.500Z| vmx| I120: OvhdMem OvhdUser_VmxMksSVGAVO : 4096 4096 - | 0 0 -
2016-06-09T15:41:56.500Z| vmx| I120: OvhdMem OvhdUser_VmxThreadMks : 512 512 - | 512 512 -
[root@smview1:/vmfs/volumes/9d2dede8-7f3698ef/Training-0004] grep -i 'wddm' vmware.log
2016-06-09T15:42:29.264Z| vcpu-3| I120: Guest: vm3d: SVGA WDDM Full Display driver, Version: 8.15.01.0045, Build Number: 3471414
2016-06-09T15:42:29.264Z| vcpu-3| I120: Guest: vm3d: WDDM OS version: 6.1, build number: 7601, service pack version: 1.0, platform Id: 2, product type: 1, suite mask: 0x110
2016-06-09T15:42:29.270Z| vcpu-3| I120: Guest: vm3d: WDDM Guest backed surface is enabled.
2016-06-09T15:42:29.272Z| vcpu-3| I120: Guest: vm3d: WDDM 3D is enabled.
2016-06-09T15:42:29.272Z| vcpu-3| I120: Guest: vm3d: WDDM DX10 context is disabled.
2016-06-09T15:42:29.272Z| vcpu-3| I120: Guest: vm3d: WDDM GL3 is disabled.
2016-06-09T15:42:29.272Z| vcpu-3| I120: Guest: vm3d: WDDM DX cap is disabled.
2016-06-09T15:42:29.272Z| vcpu-3| I120: Guest: vm3d: WDDM Guest backed primary in aperture is disabled.
2016-06-09T15:42:29.273Z| vcpu-3| I120: Guest: vm3d: WDDM GDI HW Acceleration is enabled.
2016-06-09T15:42:29.273Z| vcpu-3| I120: Guest: vm3d: WDDM GDI HW Acceleration Patch is enabled.
2016-06-09T15:42:29.273Z| vcpu-3| I120: Guest: vm3d: WDDM primary bounding box mem 16384KB.
2016-06-09T15:42:29.273Z| vcpu-3| I120: Guest: vm3d: WDDM VRAM 49152KB.
2016-06-09T15:42:29.276Z| vcpu-3| I120: Guest: vm3d: WDDM using 152KB memory for OTable.
2016-06-09T15:42:29.276Z| vcpu-3| I120: Guest: vm3d: WDDM GMR memory segment 262144KB.
2016-06-09T15:42:29.276Z| vcpu-3| I120: Guest: vm3d: WDDM Aperture memory 524288KB.
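
For anyone who wants to check their own vmware.log for the same fallback without eyeballing grep output, here is a rough Python sketch. The function name `find_gpu_fallback` is made up for illustration, and the marker strings are simply copied from the log lines above, so other ESXi builds may word them differently.

```python
import re

# Marker strings copied from the vmware.log excerpts in this post.
# Assumption: other ESXi builds may phrase these messages differently.
FALLBACK_MARKERS = (
    "Unable to reserve host GPU resources",
    "[msg.mks.noGPUResourceFallback]",
)

# Matches the "timestamp| thread| level: message" layout of the lines above.
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\|\s*(?P<thread>[^|]+)\|\s*(?P<level>\w+):\s*(?P<msg>.*)$"
)


def find_gpu_fallback(lines):
    """Return (timestamp, thread, message) for each line reporting a GPU fallback."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a log line in the expected layout
        if any(marker in m.group("msg") for marker in FALLBACK_MARKERS):
            hits.append((m.group("ts"), m.group("thread").strip(), m.group("msg")))
    return hits
```

To run it against a real log, something like `find_gpu_fallback(open("vmware.log", errors="replace"))` should do. The bracketed form of the second marker is deliberate, so the later "is revoked" line is not counted as a second report.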

[root@smview3:~] nvidia-smi
Thu Jun 9 15:57:40 2016
+------------------------------------------------------+
| NVIDIA-SMI 361.45     Driver Version: 361.45.09      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K2             On   | 0000:84:00.0     Off |                  Off |
| N/A   34C    P8    29W / 117W |    846MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GRID K2             On   | 0000:85:00.0     Off |                  Off |
| N/A   31C    P8    28W / 117W |    426MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GRID K2             On   | 0000:8A:00.0     Off |                  Off |
| N/A   24C    P8    28W / 117W |    426MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GRID K2             On   | 0000:8B:00.0     Off |                  Off |
| N/A   36C    P8    28W / 117W |    426MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     37053  C+G   Training-0001                                  416MiB |
|    0     37054  C+G   VDI-0047                                       416MiB |
|    1     37055  C+G   Training-0003                                  416MiB |
|    2     37056  C+G   Training-0002                                  416MiB |
|    3     39359  C+G   Training-0004                                  416MiB |
+-----------------------------------------------------------------------------+
[root@smview3:~] gpuvm
Xserver unix:0, PCI ID 0:132:0:0, vGPU: 0x11b0:0x109d, GPU maximum memory 4184024KB
pid 37053, VM "Training-0001", reserved 425984KB of GPU memory.
pid 37054, VM "VDI-0047", reserved 425984KB of GPU memory.
GPU memory left 3332056KB.
Xserver unix:1, PCI ID 0:133:0:0, vGPU: 0x11b0:0x109d, GPU maximum memory 4184024KB
pid 37055, VM "Training-0003", reserved 425984KB of GPU memory.
GPU memory left 3758040KB.
Xserver unix:2, PCI ID 0:138:0:0, vGPU: 0x11b0:0x109d, GPU maximum memory 4184024KB
pid 37056, VM "Training-0002", reserved 425984KB of GPU memory.
GPU memory left 3758040KB.
Xserver unix:3, PCI ID 0:139:0:0, vGPU: 0x11b0:0x109d, GPU maximum memory 4184024KB
pid 39359, VM "Training-0004", reserved 425984KB of GPU memory.
GPU memory left 3758040KB.
[root@smview3:~]
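
If you want to cross-check the event log message against what the host reports, a small Python sketch like the one below can turn gpuvm output into a per-VM lookup. The function name `parse_gpuvm` and the regular expressions are my own assumptions, based only on the output format pasted above.

```python
import re

# Patterns derived from the gpuvm output shown in this post (an assumption,
# not a documented format).
GPU_RE = re.compile(
    r"^Xserver (?P<xserver>\S+), PCI ID (?P<pci>[\d:]+), "
    r".*GPU maximum memory (?P<max_kb>\d+)KB"
)
VM_RE = re.compile(
    r'^pid (?P<pid>\d+), VM "(?P<name>[^"]+)", reserved (?P<kb>\d+)KB of GPU memory'
)


def parse_gpuvm(text):
    """Map each VM name to (PCI ID of its GPU, reserved KB)."""
    placements = {}
    current_pci = None
    for raw in text.splitlines():
        line = raw.strip()
        gpu = GPU_RE.match(line)
        if gpu:
            current_pci = gpu.group("pci")  # later VM lines belong to this GPU
            continue
        vm = VM_RE.match(line)
        if vm and current_pci is not None:
            placements[vm.group("name")] = (current_pci, int(vm.group("kb")))
    return placements
```

With the output above, `parse_gpuvm(...)["Training-0004"]` gives PCI ID 0:139:0:0 with 425984 KB reserved. Decimal 139 is hex 8B, which lines up with bus 0000:8B:00.0 (GPU 3) in the nvidia-smi listing.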

I don’t believe HA is supported with vGPU, only with vSGA. http://nvidia.custhelp.com/app/answers/detail/a_id/4146/kw/vmotion has some links to VMware documentation.

HA is a hypervisor feature, so its support and any problems with it would be something to call VMware about.

Well, it turns out that if you use vGPU-enabled VMs with VMware, you are not protected by HA. HA will not automatically recover a vGPU-enabled VM. At least as of vSphere 6.0 U2, this is not supported.

I have written a PowerCLI script that automates recovery from a host failure by powering the VMs back on a surviving host.

In testing, I pulled the power cord on a vGPU host running 80 desktop VMs. The script automatically re-registered them on the surviving vGPU-enabled hosts, and all 80 VMs were powered on and at the login screen within 15 minutes.

The script has a lot of built-in logic to handle different failure scenarios, and an advanced option that uses plink to clean up a failed VM's vmx config to speed up the boot process.
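
To give a feel for the kind of placement decision such a recovery has to make, here is a minimal sketch in Python. It is not the PowerCLI script itself: the function `plan_restart`, the idea of counting free vGPU slots per host, and the "most free slots first" rule are all assumptions made for illustration.

```python
def plan_restart(failed_vms, free_slots):
    """Spread VMs from a failed host across surviving hosts.

    failed_vms: list of VM names that were running on the dead host.
    free_slots: dict of surviving host name -> free vGPU slots (hypothetical input).
    Returns (plan, unplaced): plan maps host -> list of VM names to power on there.
    """
    remaining = dict(free_slots)  # copy so the caller's dict is not modified
    plan = {host: [] for host in remaining}
    unplaced = []
    for vm in failed_vms:
        # Pick the host with the most free slots to keep the load even.
        host = max(remaining, key=remaining.get, default=None)
        if host is None or remaining[host] <= 0:
            unplaced.append(vm)  # no vGPU capacity left anywhere
            continue
        plan[host].append(vm)
        remaining[host] -= 1
    return plan, unplaced
```

Anything that ends up in `unplaced` would have to wait until capacity frees up, much like HA's own "insufficient resources" retry.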

Check out more on the CodePlex site. Happy to answer any questions via the Discussions link on CodePlex.

https://vgpumon.codeplex.com/

Matt