Dell 730 with Single Tesla - Receiving ESXi Purple Screen

Hello,
We are currently encountering ESXi purple screens when using the M60 0Q profile to run multiple video streams across multiple Windows 7 VMs (6 streams per desktop). Each desktop (Windows VM) is running the GRID 2.0 Windows driver (a separate NVIDIA post states that this is the correct driver to use). The purple screen occurs after 12 streams have been running for about 5 minutes. We have also seen it occur sooner (within minutes) when running 18 streams across three desktops.

Thanks,
MHL

Hello MHL,

Can you please explain what application you are using, what a stream is in this context, and what your build looks like?

Thanks,
Erik

Erik Bohnhorst | GRID Performance Architect
NVIDIA Corporation

Hello,
Yes, we are using Genetec Security Center 5.3 to generate the video streams. We receive the purple screen when receiving 12 streams across two thin clients (6 streams per Windows 7 client).

Here is some additional data from the dump files:

2016-03-16T22:07:54.568Z cpu0:36485)WARNING: NMI: 911: NMI received; attempting to diagnose…
NVRM: GPU at 0000:84:00.0 has fallen off the bus.
NVRM: GPU is on Board 0323015037010.
2016-03-16T22:07:54.568Z cpu0:36485)World: 9729: PRDA 0x418040000000 ss 0x4018 ds 0x4018 es 0x4018 fs 0x0 gs 0x0
2016-03-16T22:07:54.568Z cpu0:36485)World: 9731: TR 0x4000 GDT 0xfffffffffc60a000 (0xffff) IDT 0xfffffffffc608000 (0xffff)
2016-03-16T22:07:54.568Z cpu0:36485)World: 9732: CR0 0x8005003b CR3 0x20feb42000 CR4 0x42660
NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
NVRM: GPU at 0000:85:00.0 has fallen off the bus.
NVRM: GPU is on Board 0323015037010.
NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
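
For anyone triaging a similar dump, here is a minimal sketch of how the affected GPUs could be pulled out of log text like the excerpt above. It assumes the "has fallen off the bus" message keeps exactly the wording shown here; the function name `gpus_fallen_off_bus` and the `sample` string are illustrative, not part of any NVIDIA or VMware tool.

```python
import re

# Pattern based only on the NVRM message wording seen in the excerpt above
# (assumption: other driver versions may phrase it differently).
FALLEN_OFF_BUS = re.compile(
    r"NVRM: GPU at "
    r"([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F])"
    r" has fallen off the bus"
)


def gpus_fallen_off_bus(log_text):
    """Return PCI addresses reported as fallen off the bus, in first-seen order."""
    seen = []
    for match in FALLEN_OFF_BUS.finditer(log_text):
        addr = match.group(1)
        if addr not in seen:
            seen.append(addr)
    return seen


# Lines copied from the dump excerpt in this thread.
sample = """\
NVRM: GPU at 0000:84:00.0 has fallen off the bus.
NVRM: GPU is on Board 0323015037010.
NVRM: GPU at 0000:85:00.0 has fallen off the bus.
NVRM: GPU is on Board 0323015037010.
"""

print(gpus_fallen_off_bus(sample))
```

In this excerpt both addresses are reported on the same board number, which is consistent with both GPUs of the single M60 dropping out together.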

We initially encountered the purple screen using GRID 2.1 (NVIDIA-vGPU-VMware_ESXi_6.0_Host_Driver_352.70-1OEM.600.0.0.2494585.vib and the 354.56_grid_win8_win7_64bit_international.exe Windows drivers). We then tried the GRID 2.0 Windows drivers (354.13_grid_win8_win7_64bit_international.exe) with the same VIB, and encountered the same purple screen again.

We encountered another purple screen with 14 streams across 2 thin clients (screenshot attached). Should we update to GRID 2.2? Thanks, Marty
M60 PSOD3 - 2 desktops and 14 streams.jpg

As per Erik’s request, can you describe your build, what VDI stack you’re using, VM config etc.

Why did you select the 0Q profile? Does the same issue occur with a 1Q or larger profile?

What does 14 streams across 2 thin clients mean?

14 video streams being decoded in 2 VMs with 0Q profiles, with each VM then being accessed remotely from a thin client?
14 discrete VMs with 0Q profiles being delivered to 2 thin clients?
Something else?

  1. As per Erik’s request, can you describe your build, what VDI stack you’re using, VM config etc.
    a. I assume you are referring to the ESXi build? If so, ESXi 6.0.0, 3380124. Each VM has 8 GB of memory with 6 vCPUs (1 socket / 6 cores). The M60 has been added to each VM as a shared PCI device via the VMware vSphere 6.0 web client. I’m not sure what you mean by “VDI stack”.

  2. Why did you select the 0Q profile? Does the same issue occur with a 1Q or larger profile?
    a. We selected 0Q because it (along with 0B) allows the greatest number of vGPUs for the entire board (32). We are trying to determine the maximum number of video streams that can be reached across the maximum number of VMs. No, we have not tried other profiles.

  3. What does 14 streams across 2 thin clients mean?
    a. It means each thin client (running Windows 7 Enterprise) is attached to 7 surveillance cameras, each streaming its own surveillance video via Genetec Security Center version 5.3.

Just checking…any updates available yet to my responses above? Thanks. Marty

MHL,

These forums are not staffed full time and are not a formal route to support, so responses take time. For GRID 2.0 you have the support portal that you gained access to when you purchased the licenses. This forum is maintained by engineers, developers and architects across NVIDIA who answer questions if and when they have time. If you have urgent support needs, you should follow the support route.

To follow up on your responses.

You haven’t told us what remoting protocol you are using to access the VM. Is it Horizon View, RDP, RemoteFX or something else?

You’ve selected the smallest profile, with 512MB of frame buffer, and are attempting to run 6 streams per VM. It’s likely you have inadequate resources allocated to the VM. My first recommendations are:

  1. Reduce the number of streams
  2. Increase the available frame buffer.

512MB is really only aimed at users of core desktop applications such as MS Office, web browsers and the like. It’s not intended for users of applications that demand more from the GPU.

Looking at the system requirements for the application, they suggest at least a K620 with 2GB graphics memory, which suggests that the application requires more graphics memory / frame buffer than GPU resource.

"Minimum of 2 GB of video RAM recommended."

I’d suggest, based on this and your experiences above, increasing to the 2Q profile and repeating the test.
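
To make the trade-off concrete, here is a rough sketch of the arithmetic. The 512MB size for 0Q, the limit of 32 vGPUs per board and the 2GB application minimum all come from this thread; the 1Q and 2Q sizes are assumptions based on the usual Q-profile naming, and dividing frame buffer evenly between streams is a simplification, not how the application actually allocates memory.

```python
# Frame buffer per vGPU profile, in MB.
# 0Q = 512 MB is stated in this thread; 1Q and 2Q are assumed to be
# 1 GB and 2 GB respectively (assumption from the profile naming).
PROFILE_FB_MB = {"0Q": 512, "1Q": 1024, "2Q": 2048}

# Thread: 32 vGPUs of the 0Q profile fit on one board.
BOARD_FB_MB = 32 * PROFILE_FB_MB["0Q"]

# Thread: "Minimum of 2 GB of video RAM recommended."
APP_MIN_FB_MB = 2048


def summarize(profile, streams_per_vm):
    """Rough per-profile numbers; an even split per stream is a simplification."""
    fb = PROFILE_FB_MB[profile]
    return {
        "profile": profile,
        "max_vgpus_per_board": BOARD_FB_MB // fb,
        "fb_per_stream_mb": round(fb / streams_per_vm, 1),
        "meets_app_minimum": fb >= APP_MIN_FB_MB,
    }


for profile in PROFILE_FB_MB:
    print(summarize(profile, streams_per_vm=6))
```

Under these assumptions, 0Q leaves roughly 85MB per stream at 6 streams per VM and falls well short of the application's stated minimum, while 2Q meets the minimum at the cost of fewer vGPUs per board.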

Thanks. We will go down the support portal route.

Did you test with the application’s recommended configuration of 2GB frame buffer?

It is also worth checking the known issues in the KB that could cause a PSOD (purple screen of death) when these occur: http://nvidia.custhelp.com/app/answers/list/st/5/kw/grid%20psod/page/1