Hi,
Maybe someone has experienced a similar issue. We use:
- HP ProLiant DL380 Gen9 servers, latest firmware (881936_001_spp-2017.07.2-SPP2017072.2017_0922.6)
- Tesla M60, latest drivers (NVIDIA-vGPU-xenserver-7.1-384.73.x86_64.rpm)
- Windows 7 Enterprise 64-bit, latest driver (385.41_grid_win8_win7_server2012R2_server2008R2_64bit_international.exe)
- XenDesktop (Win7) and XenApp (Windows Server 2012 R2), both 7.13
- XenServer 7.1, latest updates applied
- GRID M60-0B profiles (512 MB)
Since we updated to the latest driver (NVIDIA-GRID-XenServer-7.1-384.73-385.41) we see various VMs simply freezing while people are working on them, and the Win7 OS crashes. We also see the following error in Citrix XenCenter when the Delivery Controller tries to boot new VMs: "An emulator required to run this VM failed to start." The same applies to the XenApp servers: they freeze, the VMs hang, and finally they crash.
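For anyone who wants to cross-check on their own hosts, something along these lines should show the corresponding host-side messages; the paths are just the XenServer 7.1 defaults (XAPI logs to /var/log/xensource.log, the NVIDIA vGPU manager logs to /var/log/messages), so adjust them if your environment differs:

# XAPI log on the XenServer host - shows the failed VM start and the emulator error
grep -i "emulator" /var/log/xensource.log | tail -n 20
# NVIDIA vGPU manager messages (nvidia-vgpu-mgr logs to syslog on XenServer)
grep -i "vgpu" /var/log/messages | tail -n 50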
In the console of the XenServer, nvidia-smi shows that one card is at 100% vGPU utilization.
Mon Oct  2 11:54:15 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.73                 Driver Version: 384.73                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:86:00.0 Off |                  Off |
| N/A   45C    P8    25W / 150W |   3066MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           On   | 00000000:87:00.0 Off |                  Off |
| N/A   48C    P0    58W / 150W |     18MiB /  8191MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
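To narrow down which VM is driving that 100% on the second card, the per-vGPU view of nvidia-smi is useful; a rough sketch (assuming the vgpu subcommand of the 384.73 host driver behaves as documented) would be:

# List the vGPUs on each physical GPU together with the VM they belong to
nvidia-smi vgpu
# Report per-vGPU engine utilization (samples continuously, stop with Ctrl-C)
nvidia-smi vgpu -u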
One could think that this is some kind of memory exhaustion, but it happens out of the blue when neither memory nor GPU is fully under load. Here is the state just before it happens:
timestamp, name, pci.bus_id, driver_version, pstate, pcie.link.gen.max, pcie.link.gen.current, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
02.10.2017 09:03, Tesla M60, 00000000:87:00.0, 384.73, P0, 3, 3, 40, 1%, 0%, 8191 MiB, 3093 MiB, 5098 MiB
02.10.2017 09:03, Tesla M60, 00000000:87:00.0, 384.73, P0, 3, 3, 40, 3%, 0%, 8191 MiB, 3093 MiB, 5098 MiB
02.10.2017 09:03, Tesla M60, 00000000:87:00.0, 384.73, P0, 3, 3, 41, 16%, 1%, 8191 MiB, 3093 MiB, 5098 MiB
02.10.2017 09:03, Tesla M60, 00000000:87:00.0, 384.73, P0, 3, 3, 41, 100%, 0%, 8191 MiB, 3093 MiB, 5098 MiB
02.10.2017 09:04, Tesla M60, 00000000:87:00.0, 384.73, P0, 3, 3, 43, 100%, 0%, 8191 MiB, 3093 MiB, 5098 MiB
02.10.2017 09:04, Tesla M60, 00000000:87:00.0, 384.73, P0, 3, 3, 43, 100%, 0%, 8191 MiB, 3093 MiB, 5098 MiB
02.10.2017 09:04, Tesla M60, 00000000:87:00.0, 384.73, P0, 3, 3, 44, 100%, 0%, 8191 MiB, 3093 MiB, 5098 MiB
As you can see, the load was not particularly high just before this happened.
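For reference, the columns in that log correspond one-to-one to nvidia-smi --query-gpu properties, so a poll along these lines (the 5-second interval is arbitrary) produces this kind of output:

nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 5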
The sad thing about this is that users lose their work when the VMs crash. On top of that, the VMs cannot be started again; the only thing that resolves the issue temporarily is rebooting the XenServer. Even that does not really help, since the problem quickly reoccurs. We had to remove all GPUs from our VMs…
Citrix claims this issue is not their problem; everything points to NVIDIA at the moment.
We first saw this issue on 27.09.2017.