XenDesktop / GRID environment blue-screening on DirectX Driver

Hi All,

I posted this on another, older thread on the GRID forums so I am posting a fresh topic here too.

I have XenDesktop 7.1 running on a number of XenServers with K1 and K2 cards.

The virtual desktops are bluescreening at random times and referencing the driver, dxgmms1.sys. I understand that this is a DirectX driver. Most of the time, I actually find the virtual desktops started in Windows Recovery Mode. I have been able to catch it bluescreened though. Windows event logs do not show anything leading up to the issue. I believe these machines are just bluescreening while idle as no users have reported getting kicked out during a session.

The virtual desktops are bluescreening across various hosts and GPU profiles. There are bluescreens on both K1 and K2 passthroughs and on vGPU profiles K120Q and K140Q.

All the virtual desktops are 64-bit Windows 7 and there are no DirectX updates showing as needed from Windows Update.

If I run DXdiag, it shows that I have DirectX 11. Microsoft notes that this tool reads "DirectX 11" even if the systems have DirectX 11.1 or 11.2. I also ran the 64bit DxDiag and it said that there were no problems found.

I updated XenServer and the GRID Manager a couple of weeks ago but the issue occurred before then. Here are my current drivers below. If you go to this site http://www.nvidia.com/download/driverResults.aspx/77752/en-us as of 9/22/2014, I am using all the latest drivers from release "331.59.01/332.83"

From Windows Device Manager: 9.18.13.3283

All XenServers have: NVIDIA-vgx-xenserver-6.2-331.59.01.i386.rpm

XenServer:
Build Date 2014-06-26
Build Number: 70446c
Version: 6.2

Has anyone else experienced this or have any ideas on next steps?

Thanks for your help!!

Richard
CrashDump.zip (29.7 KB)

Common server platform? Common BIOS level?

I cannot get the Unigine Heaven benchmark to run on a Windows VM under XenServer 6.2 with either DirectX 9 or 11 (it runs fine with OpenGL) using a K1 or a K2. Is this a known limitation?

Hi Tobias,

It runs fine in DirectX, though you may have issues trying to run it full screen due to the limitations of HDX 3D Pro.

I just took a screen grab so you can see it in action inside a VM.

Screen Shot 2014-09-23 at 17.37.00 by j_southern

Richard,

In addition to Luke’s query, what applications do you have installed? Something must be making a call on DirectX and it would be good to have an idea of what it is…

Also, can you recreate with another profile, or even in passthrough?

Cheers,

Thanks Luke and Jason,

The specs of the SuperMicro are below. There are various Autodesk applications installed. These are persistent/“Dedicated” desktops, so the users are installing their own apps, but any graphics-related apps will be Autodesk apps. I will work on tracking which apps are installed on the crashing machines to see if there is any commonality. Perhaps a user is leaving an app open and it eventually hangs. As I mentioned, I don’t believe the desktops are bluescreening while they are being used. I also noted that this started before the latest NVIDIA driver update. I want to clarify one thing, though: before the update I noticed that some machines were stuck in Windows Recovery Mode, but only recently did I catch some of the machines in a blue-screened state. So I can’t say definitively that dxgmms1.sys was at fault prior to the NVIDIA driver update, but it is likely.

This is occurring on K2 passthroughs and vGPU profiles K120Q and K140Q.

SuperMicro SuperServer 1027GR-72R2 aka X9DRG-HF+II
CPU: 2 Ivy Bridge 10C E5-2690V2 3.0G 25M 8GT/s QPI
Memory: 16 x 16GB DDR3-1600 1.35V 2Rx4 ECC REG RoHs

The BIOS is the latest, 3.0A on all the servers.

Thanks for your help!

Richard

Update:

I restarted one of the failed VM’s and found it in a crashed state shortly thereafter. When I checked the console, it had blue screened and it referenced nvlddmkm.sys. I checked with the user and he had not logged onto this machine today. Therefore, I am sure that this machine crashed without any apps having been actively in use. Also, this is the first time that I have seen the nvlddmkm.sys driver referenced. Previously the dxgmms1.sys was referenced in the bluescreen.

Update 2:

I rebooted another VM that was stuck in Windows Recovery Mode (after a previous crash). When I rebooted this one, it bluescreened upon boot, also referencing the nvlddmkm.sys driver.

Hi, Jason:
No, I was not running the Heaven benchmarking app in full screen mode at all. For both DirectX 9 and 11 I get an error something like:
D3D11 Render: D3D11Render(): Unknown GPU
and 10 or so additional lines for various other functions it says are not supported and then it fails. Works just fine for OpenGL with any resolution I throw at it.

NVIDIA driver version is 331.59.01, running on a Win 2012 R2 XenApp 7.5 server on XenServer 6.2 (patches including XS62ESP1 and XSESP10002-0009), nvidia-smi version 340.66. I am using GPU passthrough (it didn’t make any difference with any of the vGPU settings). Might this be a XenApp 7.5 support limitation?

Hi Tobias,

XenApp may be the culprit, as the DirectX support is routed through the Citrix driver, so it’s possible that’s causing the issue.

I’ll be back in my lab on Friday, so if I get a chance I’ll have a look at whether I can get Unigine running in XenApp. My screen grab up top was from XenDesktop.

Richard,

Do you have a memory dump available?

Also, where are you located: APAC, Europe or North America?

@Jason,
Thanks much! That feedback regarding XenApp would be much appreciated. I don’t think I ever got it to work with a Win 7 or 8.1 desktop, either, but I’ll be interested what you find out!

Hi Jason,

I just uploaded two crash dumps to my original post from a running machine that referenced the dxgmms1.sys driver. I will also fetch some crash dumps from a machine that has bluescreened on the nvlddmkm.sys driver.

I am in Northern California.

Thank you!

Richard
CrashDump.zip (29.7 KB)
Crash nvlddmkm.zip (24.6 KB)

Hi Jason,

I have uploaded both crash dumps now into my post directly above. CrashDump.zip is from a crash from a machine that had been running and it referenced dxgmms1.sys. The "Crash nvlddmkm.zip" is from a machine that bluescreened on boot and referenced the nvlddmkm.sys driver.

I have also received the below two errors from XenServer when trying to start some of these VMs.

xenopsd internal error: Device.Ioemu_failed(“vgpu exited unexpectedly”)
xenopsd internal error: Failure(“Couldn’t lock GPU with device ID 0000:05:00.0”)
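For anyone collecting these errors across several hosts, here is a minimal Python sketch that pulls the PCI address out of the "lock GPU" message so the affected physical GPU can be identified. The function name `gpu_pci_addresses` is hypothetical, and the message format is assumed to match the two lines quoted above; other XenServer builds may word it differently.

```python
import re

# PCI address in domain:bus:device.function form, e.g. 0000:05:00.0
PCI_ADDRESS = re.compile(
    r"\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b"
)


def gpu_pci_addresses(log_lines):
    """Return the distinct PCI addresses named in 'lock GPU' errors,
    in the order they were first seen."""
    seen = []
    for line in log_lines:
        # Match on a short substring so straight and curly quotes both work.
        if "lock GPU" not in line:
            continue
        for address in PCI_ADDRESS.findall(line):
            if address not in seen:
                seen.append(address)
    return seen


errors = [
    'xenopsd internal error: Device.Ioemu_failed("vgpu exited unexpectedly")',
    'xenopsd internal error: Failure("Couldn\'t lock GPU with device ID 0000:05:00.0")',
]
print(gpu_pci_addresses(errors))
```

The resulting address can be compared against the `lspci` output on the host to see which card and physical GPU the VM was trying to use.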

Thanks for your help!

Richard

Hi Richard,

I’ve just seen that you’re working with one of my colleagues in the US and have this raised as a bug, so once there’s a resolution we’ll post it back here for the community.

If anyone else has this issue though, report in here and watch this thread for an update.

Hi Richard and all,
Just to update everybody on this issue: it turns out that Memory Ballooning was enabled on these servers. vGPU today does not support Memory Ballooning. Here is an article on the subject: http://www.citrix.com/content/dam/citrix/en_us/documents/products-solutions/citrix-xenserver-dynamic-memory-control-quick-start-guide.pdf

The reason is that if you overprovision system memory, graphics performance will take a huge hit when the VMM is paging system memory on behalf of the guests.

Once the servers had Memory Ballooning disabled, the VMs seem to be stable now. We hope to support this feature at some point in the future.
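As a rough illustration of what "disabled" means in practice, the Python sketch below checks whether a VM's memory settings leave any room for ballooning and builds the `xe vm-memory-limits-set` command that would pin the dynamic range to the static maximum. It assumes the four memory values have already been read (for example with `xe vm-param-get`) as integers in bytes. The helper names and the UUID are hypothetical, and the command is only constructed here, not executed.

```python
def ballooning_possible(mem):
    """True if the dynamic range allows the VM to be ballooned
    below its static maximum (i.e. DMC is effectively on)."""
    return not (
        mem["memory-dynamic-min"]
        == mem["memory-dynamic-max"]
        == mem["memory-static-max"]
    )


def pin_memory_command(vm_uuid, mem):
    """Build the xe command that sets dynamic-min and dynamic-max
    equal to static-max, leaving static-min unchanged."""
    static_max = mem["memory-static-max"]
    return [
        "xe", "vm-memory-limits-set",
        f"uuid={vm_uuid}",
        f"static-min={mem['memory-static-min']}",
        f"dynamic-min={static_max}",
        f"dynamic-max={static_max}",
        f"static-max={static_max}",
    ]


GIB = 1024 ** 3
# Example values only: a VM allowed to balloon between 4 GiB and 8 GiB.
example = {
    "memory-static-min": 2 * GIB,
    "memory-dynamic-min": 4 * GIB,
    "memory-dynamic-max": 8 * GIB,
    "memory-static-max": 8 * GIB,
}

if ballooning_possible(example):
    print(" ".join(pin_memory_command("<vm-uuid>", example)))
```

Printing the command rather than running it keeps the sketch safe to try anywhere; review it before running it on a host.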

I am having the same issue. I had one VM using DMC and disabled it, but it still would not work. We have been using the K120Q profile for testing and are now moving into production. After the 8th VM, we noticed we could not boot up additional VMs, as if they cannot start using the next GPU. I get the same BSOD from dxgmms1.sys. My workaround has been to use the K100 profile.
We are using the 331.59.01 driver on XS 6.2SP1 and XD 7.5. I have the GPU set to Maximum Density and have all vGPU profiles allowed.
I noticed a new driver and vGPU manager are available, but I do not know whether they resolve this issue. Is this a possible bug, or do I need updated firmware? The K1 card was purchased through Dell in 4/13.
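The "after the 8th VM" observation lines up with simple placement arithmetic, sketched below in Python. The per-GPU limits and GPU counts are assumptions taken from NVIDIA's published GRID vGPU profile tables as I recall them, so please verify them against the current documentation. "Maximum Density" is modelled as filling one physical GPU before moving to the next. This only shows where the ninth VM would land; it does not explain why it fails to start.

```python
# Assumed figures; check against NVIDIA's GRID vGPU documentation.
PHYSICAL_GPUS_PER_BOARD = {"K1": 4, "K2": 2}
MAX_VGPUS_PER_PHYSICAL_GPU = {"K100": 8, "K120Q": 8, "K140Q": 4}


def physical_gpu_index(vm_number, profile):
    """Zero-based physical GPU that the Nth VM (1-based) lands on
    when each GPU is filled completely before the next is used."""
    per_gpu = MAX_VGPUS_PER_PHYSICAL_GPU[profile]
    return (vm_number - 1) // per_gpu


def board_capacity(board, profile):
    """Maximum number of vGPUs of one profile on a whole board."""
    return PHYSICAL_GPUS_PER_BOARD[board] * MAX_VGPUS_PER_PHYSICAL_GPU[profile]


for vm in (8, 9):
    print(f"VM {vm} on K120Q -> physical GPU {physical_gpu_index(vm, 'K120Q')}")
print("K1 board capacity for K120Q:", board_capacity("K1", "K120Q"))
```

Under these assumptions the ninth K120Q VM is the first one that needs a second physical GPU, which matches the point at which the failures started.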

Hi Chris,

Can you update both the in-VM driver and the vGPU manager on XenServer to the latest version and then retest?

The firmware changed in Nov 2013 to support multiple displays and all cards after this date will have shipped preloaded.

If you did purchase your cards before this release, Dell have the firmware update posted on their site here:

http://www.dell.com/support/home/us/en/19/Drivers/DriversDetails?driverId=1YCT8

You will need physical access to the machine to boot it into Linux to carry out the firmware update.

Thanks for the reply. I will upgrade tonight or tomorrow. BTW, the preferred order for HDX 3D Pro and vGPU is 1) NVIDIA driver, 2) XenTools, 3) VDA. Upgrading the NVIDIA driver does not require me to remove everything else, correct?

Hi folks,

Luke, Steve, Jason, thanks for your persistence and sticking through this to help my colleague Richard get to a resolution. As there are some pretty good nuggets of information here that are not yet common knowledge in the industry, I’ve taken the time to write a blog post on this topic. For those that are interested, it can be found here:
http://blog.itvce.com/2015/01/02/xenserver-dynamic-memory-and-nvidia-grid-vgpu-dont-do-it/

Thanks again for all your hard work and support! You guys rock,
@youngtech