HPE Apollo XL190r Gen10 + ESXi 6.7 U3 + 2x Tesla V100 = Virtual Machines Crashing

Hi,

For the last 5 days I have been fighting an issue where VMs crash or become unstable after a random length of time.

Sample of the (many) errors in the vmware.log of any of the VMs using vGPUs (V100D-2Q profile):

2019-12-07T14:14:01.622Z| vcpu-2| W115: Memory regions (0xfc000000, 0xfcfff000) and (0xfc810000, 0xfc81f000) overlap (0x54f0024000 0x5520026000).vcs = 0xfff, vpcuId = 0xffffffff

2019-12-07T14:14:01.627Z| vcpu-0| W115: Memory regions (0xfc000000, 0xfcfff000) and (0xfc810000, 0xfc81f000) overlap (0x54f0024000 0x5520026000).vcs = 0xfff, vpcuId = 0xffffffff

When the screen is rendering it is choppy/laggy and eventually freezes; the VM then has to be stopped with esxcli vm process kill on the host.
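In case it helps, this is roughly how I have been recovering the hung VMs from the host shell (the world ID is a placeholder taken from the list command):

# List running VM worlds and note the World ID of the hung VM
esxcli vm process list

# Ask the VMX process to stop; escalate to --type=hard or force only if soft fails
esxcli vm process kill --type=soft --world-id=<World ID>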

VM Specs:

EFI Boot
4x vCPU
16GB vRAM - Reserved
Paravirtualized SCSI Adapter
Shared PCI Device vGPU

These VMs are to serve desktops using Horizon 7.11. Both Windows 10 1903 and Ubuntu 18.04 LTS generate the same errors in their respective vmware.log files.

Ubuntu seems to crash more often than Windows, I think due to the video RAM usage on the vGPU.

GRID drivers tried (with matching guest drivers):
10.1
10.0
9.2

ESXi versions tried:
6.7 U3
6.7 U2
6.7 U1

We are fully licensed for Quadro vDWS and vPC.

HPE RBSU (BIOS) profile: Virtualization - Max Performance (SR-IOV and VT-d enabled)

I cannot provide a full vmware.log as this is a dark site.

Looking for any assistance/hints to solve this.

Thanks in Advance

Hi

What changed in the environment 5 days prior to this? Or is it a new deployment that’s only 5 days old?

Are these issues there all the time or only on a specific workload? Also, what kind of workload?

Have you monitored the GPU locally inside the OS to see how much resource is being used when the issue occurs?

Regards

MG

Hi MG,

This is a brand new install on vendor-certified equipment. We ran into this problem while spinning up some initial Horizon desktops; some of them became unresponsive. The workload is just rendering the desktop using a vGPU. I have watched the output of nvidia-smi vgpu -u while this happens and nothing unusual shows up in the data, yet from within the VM the desktop is very sluggish, as if it is dropping a lot of frames. All other resources show low utilization, and there is only a single VM on the host.
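For reference, this is roughly how I have been watching it on the host (the 5-second interval is just what I happened to use):

# Per-vGPU utilization (SM, memory, encoder, decoder), refreshed every 5 seconds
nvidia-smi vgpu -u -l 5

# Detailed per-vGPU query, including framebuffer usage
nvidia-smi vgpu -q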

At one point I was able to get a VM running without errors in the vmware.log and without frame rate issues. That was straight after converting the VM to EFI boot, but I was not able to repeat it; it seemed to be a fluke. Restarting the same box for another run on the same host ended up with the same errors.

The errors are indicative of memory remapping issues between the GPU and the virtual machine. Today I will be looking at the vmkernel.log and mapping out where all the devices are registered, both on the host and from within the virtual machine, focusing on the vGPU.
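Roughly what I plan to look at (the log path is the ESXi default; the PCI address is a placeholder from the lspci listing):

# On the host: pull NVIDIA-related entries out of the vmkernel log
grep -i nvrm /var/log/vmkernel.log
grep -i nvidia /var/log/vmkernel.log

# Inside the Ubuntu guest: find the vGPU and dump its BARs to compare against
# the overlapping regions reported in vmware.log
lspci -nn | grep -i nvidia
lspci -v -s <bus:device.function of the NVIDIA device>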

Hi

I know you’ve mentioned it’s ok, but double check the license is being applied from your license server. Low FPS / sluggish performance is indicative of a license not being applied.

You mention the workload is just rendering the desktop; do you mean you’re not running any apps and the VM is just slow whilst using the OS? If yes, again, that initially sounds like a license issue.

If the licensing is working and being applied correctly …

Have you monitored the vGPU from inside the VM to see what it’s doing? Is it running out of Framebuffer? (How much is it using?)
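If it helps, inside a Linux guest something along these lines will show both; on Windows the license state is under Manage License in the NVIDIA Control Panel:

# License state of the vGPU software (look for "License Status")
nvidia-smi -q | grep -i -A 2 license

# Framebuffer used vs. total, sampled every 5 seconds
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5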

Once the VM has been created, you should not be changing the firmware mode, as BIOS and EFI work in different ways and switching between them can cause issues. EFI is what you should be using; with the VMware Windows 10 template it’s the default, so there’s no need to change it.
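A quick way to confirm what firmware a VM is actually configured with (datastore and VM name are placeholders; no firmware line means legacy BIOS):

# On the host: check the firmware type recorded in the VM's .vmx file
grep -i firmware /vmfs/volumes/<datastore>/<vm name>/<vm name>.vmx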

Try a manual clean-build Windows 10 VM from scratch:

1. Leave the VM template options at their defaults; just give it enough vCPUs, RAM and storage, a VMXNET3 network adapter, and add a vGPU.
2. Install VMware Tools, making sure to do a Custom install and unselect the vSGA graphics driver so it’s not installed.
3. Get it domain joined and enable RDP, then install the vGPU driver.
4. Install the Horizon Agent, but without the Instant Clone or Linked Clone options. Once that’s installed, install the Direct Connect Agent.
5. Connect directly to the VM and see if the issue still remains.

Regards

MG

Hey Squishy,

I encountered a similar issue where I had to use esxcli to kill a frozen GPU-enabled VM on 6.7 U3 with a Tesla T4 GPU.

One thing I’d check is that your NVIDIA ECC settings are the same across all of your hosts. We had all our hosts set to disable ECC except for one. That caused issues when a VM migrated there: a different host would hold a lock on the vmx, which then caused issues with managing the VM in vCenter.
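Roughly how we checked and aligned it afterwards (an ECC change only takes effect after the GPUs are reset, i.e. a host reboot with no vGPU VMs running):

# On each host: current ECC mode per physical GPU
nvidia-smi --query-gpu=index,name,ecc.mode.current --format=csv

# Disable ECC on all GPUs (use -e 1 to enable instead), then reboot the host
nvidia-smi -e 0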

I also noted a migration-related ECC mismatch error in the vmware.log of the VM.

I’m still waiting to see whether the VMs that were impacted will crash again. I am not entirely sure this ECC-related migration issue was the cause.

Just wanted to follow up: even after verifying all the proper settings, the VMs are still crashing. It’s suspected to relate to an in-house developed 3D application that this particular department uses. However, there’s nothing really obvious in the logs other than the overlap entries in the OP. A case has been opened with NVIDIA Enterprise Support for further troubleshooting. Will report back if a solution is found.

@squishy,

I am curious how that XL190r can accommodate two double-width V100 GPUs, assuming your configuration has the 32GB V100s, of course. From the spec sheets I have looked at, there is only room for one double-width GPU card in these, based on the backplane slots.