K2 vGPU on ESXi 6.0 and Horizon View 6.2, driver 354.97, Win 10 x64 - issues with random system lockups

We have deployed a Horizon View setup on a Dell R730 with a GRID K2. Our Win 10 Ent x64 pool is experiencing a few issues that make the user experience unreliable.

Summary of environment

ESXi 6.0
Horizon View 6.2
Dell PowerEdge R730
NVIDIA GRID K2, running the 354.97 driver on both the host (VIB) and the guest vGPU
Client pool running Windows 10 Enterprise x64, build 1511
Wyse P45 Zero Client running PCoIP (Tera2 chip, Teradici firmware 5.2)
Dual-monitor setup, 1920x1200

Issue 1.

Full-screen video within a web browser (any web browser) freezes within seconds of going full screen; the only way to restore it is to escape out of full screen. The workaround is to disable hardware acceleration within the browser, but this defeats the purpose of investing in the GRID infrastructure.

Issue 2.

At least once a day, at random, the Win 10 VDI client experiences a major failure of the graphics driver. It manifests first as a lock-up of the screen, then the audio fails, and finally the session ends and cannot be reinitialized through the Zero Client menu. The workaround is to initiate a restart of the VM from the Zero Client menu, which reboots the entire machine and loses all unsaved work. The OS itself does not appear to have crashed, since the restart, once initiated, is a clean reboot of the OS rather than a hard reset of the virtual hardware.

Can anyone please help us address this issue or point us to a known stable build of the GRID drivers?

Hi Mohb60,

I don’t know of any known issue like this, and the drivers should be stable. The best thing you can do is raise a support ticket. Because this is a GRID 1.0 product (K2/K1 boards), you need to do this via the OEM who supplied the board (Dell in this case), and they in turn can escalate it to NVIDIA engineering if it’s a driver issue.

With GRID 2.0, SUMS support is available, so there’s a process to raise tickets directly with NVIDIA, but under the older hardware sales model I’m afraid you do need to go via the OEM who sold you the card.

Best wishes,
Rachel

Hi Rachel,

Thanks for your feedback. Anyway, I have some good news: we managed to find a workaround for issue #1 in our environment. From what I can tell, the video freezing during full-screen playback in a browser was due to the image quality tolerance settings. Counter-intuitively, setting the lower threshold of the image quality tolerance too low on the PCoIP client produces the issue. Raising the minimum image quality from 80% to Perceptually Lossless seems to be the best setting for our setup and no longer results in the screen locking up during full-screen playback. I hope this helps others with a similar setup who are facing the same issue.
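For anyone who would rather enforce the same floor from the agent side instead of per client on the Zero Client OSD, below is a minimal sketch of setting the PCoIP session-variable registry value with Python. The registry path and value name are assumptions on my part (verify them against your Teradici GPO/ADM template before relying on this); the client-side setting described above is what we actually changed.

```python
# Sketch, not a tested tool: raise the PCoIP minimum image quality floor on the
# agent via the registry-backed session variables. The key path and value name
# below are assumptions based on the Teradici PCoIP session-variable GPO --
# check your own ADM/ADMX template before using them.
import winreg

PCOIP_KEY = r"SOFTWARE\Policies\Teradici\PCoIP\pcoip_admin_defaults"  # assumed path

def set_min_image_quality(value=100):
    """Set pcoip.minimum_image_quality (100 is roughly a perceptually lossless floor)."""
    with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, PCOIP_KEY, 0,
                            winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, "pcoip.minimum_image_quality", 0,
                          winreg.REG_DWORD, value)

if __name__ == "__main__":
    set_min_image_quality(100)
    print("Minimum image quality floor raised; reconnect the PCoIP session to apply.")
```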

We are still looking into issue #2.

This issue still persists in our environment. We have taken the following steps and concluded that the newer NVIDIA drivers are unstable.

Our environment is the same as what is listed above.

To investigate, our initial step was to look at the logs to determine why VMs were randomly crashing and rebooting. The logs show very little: the Windows logs show that Windows recovered from an unclean shutdown after a failure of the VM’s OS, and the VM logs show an error with approximately the same timestamp, correlating the failure with what is seen in the Windows logs.
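In case it is useful to anyone doing the same exercise, this is roughly how we lined the two sides up: pull the error-ish lines out of a copy of the VM’s vmware.log and compare their timestamps by hand against the unexpected-shutdown events exported from the guest (e.g. Event ID 6008 / Kernel-Power 41). The log path and the keywords are assumptions to adapt to your own datastore layout and log contents.

```python
# Sketch of the correlation step: print timestamped candidate lines from a
# vmware.log copied off the datastore so they can be matched against the
# Windows unexpected-shutdown events. Path and keywords are examples only.
import re

VMWARE_LOG = r"C:\temp\vmware.log"                    # example path to a copied log
KEYWORDS = ("error", "nvidia", "vgpu", "vmiop")       # assumed markers worth eyeballing

timestamp = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")

with open(VMWARE_LOG, errors="replace") as fh:
    for line in fh:
        if any(k in line.lower() for k in KEYWORDS):
            m = timestamp.match(line)
            stamp = m.group(1) if m else "??"
            print(f"{stamp}  {line.rstrip()}")
```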

This instability was introduced into our environment after upgrading both the host VIB and the guest vGPU drivers from 348.27 to the newer 354.97. We confirmed this by rolling back the VIB on one of our hosts and creating a new, identical vGPU pool with the 348.27 drivers installed. The new pool has been stable.
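For reference, the rollback on the host boils down to a VIB remove/install cycle. The sketch below drives it over SSH from an admin box and assumes key-based SSH to the host and that the host is already in maintenance mode; the VIB name and the path to the 348.27 package are placeholders, so confirm the real name with `esxcli software vib list` first.

```python
# Rough sketch of the host-side VIB rollback, driven over SSH. VIB name and
# package path are placeholders -- check `esxcli software vib list` on your host.
import subprocess

HOST = "root@esxi-host-01"                                    # example host
OLD_VIB = "/vmfs/volumes/datastore1/NVIDIA-348.27.vib"        # example path on a datastore
VIB_NAME = "NVIDIA-vGPU-VMware_ESXi_6.0_Host_Driver"          # confirm with `vib list`

def esxcli(args):
    """Run an esxcli command on the host over SSH and return its output."""
    return subprocess.run(["ssh", HOST, "esxcli"] + args,
                          check=True, capture_output=True, text=True).stdout

print(esxcli(["software", "vib", "list"]))                    # confirm the current driver
esxcli(["software", "vib", "remove", "-n", VIB_NAME])         # remove the 354.97 VIB
esxcli(["software", "vib", "install", "-v", OLD_VIB])         # install the 348.27 VIB
# Reboot the host afterwards, then recreate or repoint the test pool at it.
```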

Can anyone from NVIDIA give us a reason for the instability in the newer drivers?

Thanks for the feedback, very helpful - I’ll see what is known internally.
Rachel

I am seeing the same scenario as your issue #2 at one of my customers.

Our problems began after upgrading from vGPU-346.68-348.27 to vGPU-361.45.09-362.56. Since then we have tried vGPU-367.43-369.17, but no luck.

VMware has closed the case and pointed at NVIDIA, and NVIDIA is referring us to HP, who sold us the GRID K2. So far the only suggestion from HP has been to upgrade the ProLiant server BIOS. Sigh!

What a big disappointment. I have several customers who could benefit from this technology, but I am not touching it again until this is solved.

Oh, I forgot the link: https://gridforums.nvidia.com/default/topic/974/how-to-get-support-problem-with-win10-k2-vgpu-and-view7/#3417

I have an update from my end. Working with VMware, I have come to the conclusion that issue #2 is tied to a particular scenario: a Win 10 1511 VDI running the newer NVIDIA drivers with an emulated SATA CD-ROM and SATA controller present on the guest VM. As an experiment, I removed the SATA CD-ROM and the SATA controller from the guest VM and ran pools with both the older and the newer drivers; both have proven to be stable.

@Oletho, try removing your SATA CD-ROM and SATA controller; perhaps this will work for you. I have been told that ESXi 6.0.0 Update 2 addresses the issue with the CD-ROM emulation crashing the VM, but I have not had a chance to confirm this in my environment.
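If you need to make the same device change on a batch of test VMs rather than by hand in the vSphere client (which is what we actually did), here is a minimal pyVmomi sketch of the reconfigure call. The vCenter address, credentials and VM name are placeholders, and the VM should be powered off first; treat it as an illustration, not a tested tool.

```python
# Minimal pyVmomi sketch: remove the emulated SATA CD-ROM and the SATA (AHCI)
# controller from a powered-off guest. vCenter, credentials and VM name are
# placeholders -- an illustration of the device-change spec, nothing more.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def find_vm(content, name):
    """Simple linear search of the inventory for a VM with the given name."""
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    return next((vm for vm in view.view if vm.name == name), None)

ctx = ssl._create_unverified_context()        # lab vCenter with a self-signed cert
si = SmartConnect(host="vcenter.example.local",
                  user="administrator@vsphere.local",
                  pwd="********", sslContext=ctx)
try:
    vm = find_vm(si.RetrieveContent(), "win10-vgpu-01")
    changes = []
    for dev in vm.config.hardware.device:
        # Drop the SATA CD-ROM and the AHCI (SATA) controller it hangs off.
        if isinstance(dev, (vim.vm.device.VirtualCdrom,
                            vim.vm.device.VirtualAHCIController)):
            spec = vim.vm.device.VirtualDeviceSpec()
            spec.operation = vim.vm.device.VirtualDeviceSpec.Operation.remove
            spec.device = dev
            changes.append(spec)
    if changes:
        task = vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=changes))
        print("Reconfigure task submitted:", task.info.key)
finally:
    Disconnect(si)
```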

I just saw your reply and will set up a test of the CD-ROM solution with the customer right away. Thanks a lot.

We are already running Update 2 with the latest patches.