XenDesktop vGPU PoC Application Issues

LewisBerrie · February 19, 2015, 5:41pm

I have just set up a Proof of Concept VDI for a customer, with the aim of utilising NVidia GRID vGPU, but I have had major application compatibility issues :(
My setup is

HP ProLiant DL380 Gen 9, dual 10 Core CPU, 128GB RAM, 4x300GB 15k SAS (about 550GB local storage)
NVidia Grid K2
Citrix XenServer 6.5
Citrix XenDesktop 7.6 (recommended patches applied to server components, and VDA)
NVidia vGPU Drivers for XenServer 6.5 - Windows Display Driver (341.08) and GRID vGPU Manager (340.57)

I have created one base desktop image configured for vGPU, and created a Machine Catalog. I then modified the base image for passthrough GPU, and created another Machine Catalog.
This gave me side-by-side comparison of vGPU vs. vDGA, with the vGPU configured with GRID K240Q profiles and the vDGA getting one of the GPUs on the card passed through.
With the vDGA machine, basically all of the software worked, which is all fine.
However, with the vGPU machine nearly anything that required OpenGL crashed in the NVOGLV64.DLL :(
The list of applications that don’t work is

3DEqualiser4
Adobe After Effects CC 2014
Adobe Photoshop CC 2014 (it ran, but with no hardware acceleration)
Adobe Premiere Pro CC 2014
Autodesk AutoCAD 2015
Autodesk AutoCAD Architecture 2015
Autodesk Maya 2015
Hiero
Mari
MODO
Nuke
Silhouette
SolidWorks 2010
Toon Boom Harmony
Toon Boom Storyboard

I know that 3D acceleration is possible, as the Unigine Heaven benchmark works in both profiles (vGPU and vDGA) and in all rendering modes.
I really need some help to understand if

I have an issue on my setup
There is an issue in the NVidia VM driver
The applications just won't work until re-written to support vGPU environment

Most of the crashes take the following form

Faulting application name: AEGPUSniffer.exe, version: 0.0.0.0, time stamp: 0x53e05513
Faulting module name: nvoglv64.DLL, version: 9.18.13.4108, time stamp: 0x5452245c
Exception code: 0xc000001d
Fault offset: 0x0000000000d5fb10
Faulting process id: 0x1878
Faulting application start time: 0x01d049daf4c88fde
Faulting application path: C:\Program Files\Adobe\Adobe After Effects CC 2014\Support Files\AEGPUSniffer.exe
Faulting module path: C:\Windows\SYSTEM32\nvoglv64.DLL

Some of the applications created crash dump files, and analysing those showed the 0xc000001d exception (invalid opcode) was caused by an AVX instruction (I think). My only thought is that the memory the instruction pointed to wasn’t correctly 16-byte aligned, but confirming that would require more debugging than I have access to.
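For anyone triaging similar records, here is a minimal Python sketch that pulls the key fields out of the event-log text above and names the exception code. The lookup table is a small hand-picked subset of NTSTATUS values, and `parse_crash_record` and `filetime_to_utc` are hypothetical helper names, not part of any Windows or NVIDIA tool. The code 0xc000001d maps to STATUS_ILLEGAL_INSTRUCTION, which suggests the CPU refused the opcode itself; a misaligned operand would more commonly be reported as an access violation, though that is a general observation and not a diagnosis of this crash.

```python
import re
from datetime import datetime, timedelta, timezone

# A few well-known NTSTATUS exception codes (hand-picked subset, not exhaustive).
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC0000094: "STATUS_INTEGER_DIVIDE_BY_ZERO",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}

# Fields copied from the crash record quoted in the post.
CRASH_RECORD = """\
Faulting application name: AEGPUSniffer.exe, version: 0.0.0.0, time stamp: 0x53e05513
Faulting module name: nvoglv64.DLL, version: 9.18.13.4108, time stamp: 0x5452245c
Exception code: 0xc000001d
Fault offset: 0x0000000000d5fb10
Faulting application start time: 0x01d049daf4c88fde
"""


def filetime_to_utc(filetime):
    """Convert a Windows FILETIME (100 ns ticks since 1601-01-01 UTC) to a datetime."""
    epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=filetime // 10)


def parse_crash_record(text):
    """Extract the faulting module, exception name and start time from a record."""
    def field(label):
        match = re.search(rf"^{re.escape(label)}: (.+)$", text, re.MULTILINE)
        return match.group(1) if match else None

    code = int(field("Exception code"), 16)
    return {
        "module": field("Faulting module name").split(",")[0],
        "code": code,
        "exception": NTSTATUS_NAMES.get(code, "unknown"),
        "started": filetime_to_utc(int(field("Faulting application start time"), 16)),
    }


record = parse_crash_record(CRASH_RECORD)
print(record["module"], hex(record["code"]), record["exception"], record["started"])
```

Grouping records by `module` and `exception` across several applications is a quick way to confirm that they all fail in the same place.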

Any help/pointers would be greatly appreciated, otherwise vGPU is pretty much of no use to this customer :(

JasonSouthernNV · February 23, 2015, 12:14pm

What CPUs are you using? I’m going to make a guess at Haswell-based?

LewisBerrie · February 23, 2015, 9:00pm

Thanks for the reply Jason.
The CPUs are Intel Xeon E5-2650 v3 @ 2.30GHz, so yes they are Haswell-EP.
Is there a known issue with these processors with vGPU?

jrosenkvist · February 24, 2015, 10:46am

I have experienced the same issue, with the following setup:

NVidia Grid K1
Citrix XenServer 6.5
Citrix XenApp 6.0 (Windows 2008 R2)
NVidia vGPU Drivers for XenServer 6.5 - Windows Display Driver (341.08) and GRID vGPU Manager (340.57)

I managed to replace the Windows Display Driver with 347.52-quadro-tesla-grid-winserv2008-2008r2-2012-64bit-international-whql.exe. After that, I no longer get fault errors in nvoglv64.dll.

JasonSouthernNV · February 24, 2015, 11:37am

Similar issues have been reported since XenServer 6.5 was released, though that may be more coincidental with customers buying Haswell systems.

We have an updated driver package that should be released this week that has incorporated a workaround to address this issue.

Check our drivers download page later today for a new vGPU package for XenServer 6.5. Once you’ve downloaded and tested it, let us know if it resolves the issue.

LewisBerrie · February 24, 2015, 8:11pm

Jason, you’ve made me a very happy man :)

I will do.
Thanks a lot for the update.

LewisBerrie · February 24, 2015, 11:24pm

Preliminary Update

I have updated the XenServer driver and the Windows driver in the vGPU profile base image [NVIDIA GRID VGPU SOFTWARE RELEASE 340.78/341.44 WHQL], and initial testing has been 100% positive :)
The ones I quickly tested (it’s quite late here) are

Adobe Photoshop CC 2014
Autodesk AutoCAD 2015
Autodesk Maya 2015
Nuke
SolidWorks 2010

All ran with 3D acceleration. So looking very promising!
I will do some more thorough testing in a couple of days, when I visit the customer’s site.

Many thanks again for the information, and the heads-up on the new driver release.

LewisBerrie · February 25, 2015, 8:25am

Full Update

I dialled back in and tested all the remaining applications on my "crash" list

3DEqualiser4
Adobe After Effects CC 2014
Adobe Premiere Pro CC 2014
Autodesk AutoCAD Architecture 2015
Hiero
Mari
MODO
Silhouette
Toon Boom Harmony
Toon Boom Storyboard

and all of them ran without crashing :)
So my issue has certainly been resolved by the driver update.

On an aside: currently CUDA/OpenCL is not supported with vGPU mode. Is this a technical issue (hardware limitation, etc.), or driver limitation? There are a few applications that my customer is testing that do use CUDA/OpenCL for raytracing, etc., and while CPU is always a fallback, it would have been interesting to benchmark/compare CPU vs. vGPU to see what performance gains could be had.
Is there a roadmap to add support for CUDA/OpenCL to vGPU, or is there not enough perceived demand for it and just concentrating on OpenGL/DirectX visuals (rather than compute)?

JasonSouthernNV · February 26, 2015, 11:07am

Excellent, thanks for letting us know it’s resolved.

Onto the CUDA question.

First, it’s important to understand that vGPU shares resources based on a scheduler. It doesn’t allocate blocks of CUDA cores; instead each VM is allocated a "slice" of the clock schedule. This allows us to increase a VM’s clock time if the GPU is not fully utilised, giving users a bump in performance when other users aren’t using the GPU fully.

Now, when using CUDA you would essentially be sending code directly to the GPU and it will run until completion. If this exceeds the user’s scheduled time, it just keeps running and locks out the GPU resources for other users. Today there’s no mechanism to suspend or pre-empt completion of the code, so it’s not a good situation for multiple users sharing resources!

This is the reason why CUDA support is not available for vGPU today, only for passthrough.
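To make the lock-out concrete, here is a toy Python model of the behaviour described above. It is only a sketch: the real scheduler's algorithm is not described in this thread, so simple round-robin, the 10 ms slice and the job lengths are all assumptions, and `run_round_robin` is a hypothetical name. Preemptible work is cut off at each slice boundary, while non-preemptible work (standing in for a CUDA kernel) holds the GPU until it finishes.

```python
from collections import deque


def run_round_robin(jobs, slice_ms):
    """Toy round-robin scheduler.

    jobs: list of (vm_name, work_ms, preemptible) tuples.
    Returns {vm_name: (first_start_ms, finish_ms)}.
    """
    queue = deque(jobs)
    clock = 0
    first_start = {}
    finish = {}
    while queue:
        name, remaining, preemptible = queue.popleft()
        first_start.setdefault(name, clock)
        # Preemptible work stops at the slice boundary; non-preemptible work
        # keeps the GPU until it completes.
        ran = min(remaining, slice_ms) if preemptible else remaining
        clock += ran
        remaining -= ran
        if remaining > 0:
            queue.append((name, remaining, preemptible))
        else:
            finish[name] = clock
    return {name: (first_start[name], finish[name]) for name in first_start}


# Three VMs doing short, preemptible graphics work: everyone starts quickly.
shared = run_round_robin(
    [("vm1", 20, True), ("vm2", 20, True), ("vm3", 20, True)], slice_ms=10
)

# vm1 submits a 5-second non-preemptible compute job: the others wait for it.
blocked = run_round_robin(
    [("vm1", 5000, False), ("vm2", 20, True), ("vm3", 20, True)], slice_ms=10
)

print("graphics only:", shared)
print("with long compute job:", blocked)
```

In the first run vm2 gets the GPU after 10 ms; in the second it waits the full 5000 ms, which is the situation the scheduler has no way to prevent without pre-emption.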

Is it being developed for the future? Absolutely, we’re keen to ensure that vGPU can offer identical capabilities to a passthrough GPU including use for GPGPU, and it is a roadmap item, though I can’t share timelines at present.

LewisBerrie · February 27, 2015, 7:14am

Hi Jason,

Thank you for the explanation on CUDA/vGPU.
Yes, I can see that a pre-emptive scheduler would be required to handle the correct allocation of resource between vGPUs in the case of CUDA. I had read up on the "time-slicing" of the compute cores to each vGPU (after I noticed that the vGPU was reporting all cores to the VM, not a subset of them, unlike the RAM allocation), but didn’t know how it was actually achieved.
I am guessing it is some kind of round-robin queuing in the dom0 driver? As in, it accepts graphics "operations" from each VM and then goes round executing "operations" from each VM queue in sequence that has a pending operation? Or is it strictly timed via the dom0? If so, what is the size of the time-slices used? Pure curiosity on my part, so if "secret sauce" I understand if you don’t want to say :) Interesting to know the effect that the VM "sees" as any time-scheduling will cause a "pulsing" in activity: most of the time nothing, then pulses of "full power" on the GPU. Must be fun ensuring that this doesn’t cause any adaptive timing issues in the VM :)

JasonSouthernNV · February 27, 2015, 2:52pm

The scheduler is actually in the GPU hardware, Dom0 isn’t aware of it because it happens at the hardware level. It works in exactly the same way on vSphere as it does on XenServer, no hypervisor involvement in the GPU virtualisation at all.

When a VM boots and the vGPU profile is attached to the physical GPU it’s effectively given a minimum guaranteed slice of time, but if more is available it can be utilised. All done in hardware so it’s really fast.
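The "minimum guaranteed slice, more if available" idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the hardware's actual algorithm: the equal split of a scheduling period, the 40 ms period and the demand figures are all invented for illustration, and `allocate_gpu_time` is a hypothetical name.

```python
def allocate_gpu_time(demand_ms, period_ms):
    """Split one scheduling period between VMs.

    demand_ms: {vm_name: GPU time wanted this period, in ms}.
    Each VM is guaranteed period_ms / number_of_vms; time that idle VMs
    leave unused is shared among VMs that still want more.
    """
    guaranteed = period_ms / len(demand_ms)
    grant = {vm: min(want, guaranteed) for vm, want in demand_ms.items()}
    spare = period_ms - sum(grant.values())
    hungry = {vm for vm, want in demand_ms.items() if want > grant[vm]}
    while spare > 1e-9 and hungry:
        share = spare / len(hungry)
        for vm in list(hungry):
            extra = min(share, demand_ms[vm] - grant[vm])
            grant[vm] += extra
            spare -= extra
            if demand_ms[vm] - grant[vm] <= 1e-9:
                hungry.discard(vm)
    return grant


# Four VMs on one physical GPU, 40 ms period (assumed figures).
busy = allocate_gpu_time({"vm1": 100, "vm2": 100, "vm3": 100, "vm4": 100}, 40)
quiet = allocate_gpu_time({"vm1": 100, "vm2": 5, "vm3": 0, "vm4": 0}, 40)

print("all busy:", busy)
print("mostly idle:", quiet)
```

When every VM is busy each one gets its guaranteed 10 ms; when the others are mostly idle, vm1 picks up the unused time and runs for 35 ms of the period.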

There’s a lot of clever behaviour in the cards and driver that is there to smooth things out for the application, and we have the Frame Rate Limiter which prevents users experiencing wild swings in FPS when they’re the only user on a physical GPU.
