We have been running vSGA for a couple of years on NVIDIA GRID K1s. We upgraded to Horizon View 6.2 and have been testing vGPU profiles. Initial testing went very well, but once I scaled my pool out, several VMs failed customization. No error appeared in Horizon; they simply remained in a customization status.
If I forced a power reset and then told Windows recovery to boot normally, the VM would usually continue customization and finish. The best part is that because the VM never finishes booting, I can’t VNC into it: the console is disabled while the vGPU K100 profile is attached to the VM. It is very odd behavior, and I have an open ticket with VMware.
As I dug through the logs, I found this interesting item in the vmware.log for the VM:
2016-04-27T01:44:09.329Z| mks| W110: GLWindow: Unable to reserve host GPU resources
2016-04-27T01:44:09.339Z| vmx| I120: [msg.mks.noGPUResourceFallback] Hardware GPU resources are not available. The virtual machine will use software rendering.
Looking at the workaround, where I power-reset the VM and it eventually works, it appears that some VMs fail to be assigned GPU resources on the K1s during power-on. I haven’t found anything online referring to this issue. I’ll keep this post updated.
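If you want to check your own pool for the same symptom, something along these lines from the ESXi shell should surface affected VMs (a rough sketch; the datastore name and VM folder layout are assumptions, so adjust the path for your environment):

  # List every VM whose current log shows the silent fallback to software rendering
  grep -l "Unable to reserve host GPU resources" /vmfs/volumes/<datastore>/*/vmware.log

Any VM that matches was dropped to software rendering at power-on, even though Horizon reports nothing.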
System Environment
Can you check the vBIOS of the K1 cards installed and ensure they’re at the latest version?
You may need to request this update from SuperMicro.
Also, why use K100?
The K120Q is a better choice: more graphics memory and exactly the same density, since each GPU only supports a maximum of 8 vGPU sessions (so that’s 32 on a K1).
Thanks, Jason. I contacted SuperMicro, but they are not aware of any "authorized" BIOS updates for the NVIDIA GRID cards. I also looked online but couldn’t find a vBIOS version history for the GRID cards.
Running the nvidia-smi command, the cards report the following:
VBIOS Version : 80.07.BE.00.04
MultiGPU Board : Yes
Board ID : 0x8300
GPU Part Number : 900-52401-0020-000
Inforom Version
Image Version : 2401.0502.00.02
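(For reference, the fields above come from the full device query; something like the following pulls just the vBIOS line per GPU. Treat the exact pattern as a sketch, since field names can shift slightly between driver releases.)

  nvidia-smi -q | grep -i "vbios version"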
Do you know where I could find that info? Also, regarding the K100 choice: I agree. We only wanted to test the K100 and K120Q separately to understand the performance gains for the applications being used. I plan to go with the K120Q for production since we get the same user density.
Thanks for the quick response!
You’re on the latest vBIOS, so no update is required.
I would avoid the K100 profile; it’s only there for legacy support, and I’d recommend that all new projects and deployments not use it.
Out of interest, how many VMs do you have in the pool you’re creating, and how many K1s are available in those hosts?
Thanks for checking on the vBIOS.
I will test out the K120Q then and report back. Regarding the pool size, it was planned to be 55 VMs, with the target host having two K1s (at 32 K120Q sessions per K1, two cards give 64 seats, comfortably above 55). A second host with two K1s would be a standby in case of host failure in the cluster (I know vMotion isn’t supported).
What are the pool settings?
We are good to go; the K100 was the issue. Once I switched over to the K120Q and tested re-provisioning, all VMs came up normally. We had users on the new vGPU profile today without issues.
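For anyone retracing this, here is roughly how I confirmed the profile change had taken effect on the host side (a sketch, assuming the standard pciPassthru0.vgpu VMX key and default datastore paths; your key index and folder layout may differ):

  # Show which vGPU profile each desktop VM requests in its VMX
  grep -i "vgpu" /vmfs/volumes/<datastore>/*/*.vmx
  # Expected after the change: pciPassthru0.vgpu = "grid_k120q"

  # Then run plain nvidia-smi on the host to see the vGPU VMs active on each K1 GPU
  nvidia-smi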
I think NVIDIA should drop the K100 from their deployment documentation, as it definitely impacted us. I realize the K100 and K120Q have the same user density, but a good POC means you step up in complexity, and I would have avoided the K100 if it had been marked as legacy.
Thanks Jason for being very responsive and informative! That was the exact info I needed to help root cause the issue.
Good to know it’s resolved!
I’ll raise the point about dropping the K100. There are some reasons for it to persist, but we can always ask…