Tesla M60 Freeze, 100% Load Issue

MAGO · October 2, 2017, 12:25pm

Hi,

Maybe someone experienced a similar issue. We use:

HP Proliant DL380 Gen9 Servers latest firmware (881936_001_spp-2017.07.2-SPP2017072.2017_0922.6)
Tesla M60 latest drivers (NVIDIA-vGPU-xenserver-7.1-384.73.x86_64.rpm)
Windows 7 Enterprise 64Bit, latest driver (385.41_grid_win8_win7_server2012R2_server2008R2_64bit_international.exe)
XenDesktop (Win7) and XenApp (Windows Server 2012 R2), 7.13
XenServer 7.1, latest updates applied
GRID M60-0B profiles 512MB

Since we updated to the latest driver NVIDIA-GRID-XenServer-7.1-384.73-385.41, we see various VMs just freezing while people are working on them. The Win7 OS crashes. We also see the following error in Citrix XenCenter when the Delivery Controller tries to boot new VMs: “An emulator required to run this VM failed to start”. The same applies to the XenApp servers: they freeze and hang, and finally crash.

In the console of the XenServer, nvidia-smi shows that one card is at 100% vGPU utilization.

One could think that this is some kind of memory exhaustion, but it happens out of the blue when memory and GPU are not under full load. Here is the state just before it happens:

| timestamp | name | pci.bus_id | driver_version | pstate | pcie.link.gen.max | pcie.link.gen.current | temperature.gpu | utilization.gpu [%] | utilization.memory [%] | memory.total [MiB] | memory.free [MiB] | memory.used [MiB] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 02.10.2017 09:03 | Tesla M60 | 00000000:87:00.0 | 384.73 | P0 | 3 | 3 | 40 | 1% | 0% | 8191 MiB | 3093 MiB | 5098 MiB |
| 02.10.2017 09:03 | Tesla M60 | 00000000:87:00.0 | 384.73 | P0 | 3 | 3 | 40 | 3% | 0% | 8191 MiB | 3093 MiB | 5098 MiB |
| 02.10.2017 09:03 | Tesla M60 | 00000000:87:00.0 | 384.73 | P0 | 3 | 3 | 41 | 16% | 1% | 8191 MiB | 3093 MiB | 5098 MiB |
| 02.10.2017 09:03 | Tesla M60 | 00000000:87:00.0 | 384.73 | P0 | 3 | 3 | 41 | 100% | 0% | 8191 MiB | 3093 MiB | 5098 MiB |
| 02.10.2017 09:04 | Tesla M60 | 00000000:87:00.0 | 384.73 | P0 | 3 | 3 | 43 | 100% | 0% | 8191 MiB | 3093 MiB | 5098 MiB |
| 02.10.2017 09:04 | Tesla M60 | 00000000:87:00.0 | 384.73 | P0 | 3 | 3 | 43 | 100% | 0% | 8191 MiB | 3093 MiB | 5098 MiB |
| 02.10.2017 09:04 | Tesla M60 | 00000000:87:00.0 | 384.73 | P0 | 3 | 3 | 44 | 100% | 0% | 8191 MiB | 3093 MiB | 5098 MiB |

As you can see, the load was low right before this happened.
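For anyone who wants to catch this pattern automatically, here is a minimal Python sketch that scans a series of utilization samples and reports where a sustained run at 100% begins. The sample values are the `utilization.gpu` readings from the log above; the function names, the threshold, and the run length of three samples are assumptions chosen for illustration, not anything nvidia-smi provides.

```python
def parse_percent(text):
    """Turn a reading such as '16%' into the integer 16."""
    return int(text.strip().rstrip("%").strip())


def first_sustained_saturation(utilization, threshold=100, min_run=3):
    """Return the index where a run of at least `min_run` consecutive
    samples at or above `threshold` starts, or None if there is none."""
    run_start = None
    for i, value in enumerate(utilization):
        if value >= threshold:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_run:
                return run_start
        else:
            run_start = None  # run broken, start over
    return None


# utilization.gpu readings copied from the log above, in order
readings = ["1%", "3%", "16%", "100%", "100%", "100%", "100%"]
samples = [parse_percent(r) for r in readings]

start = first_sustained_saturation(samples)
if start is not None:
    print(f"sustained 100% load starts at sample {start}, "
          f"previous reading was {samples[start - 1]}%")
```

With the readings above, the run starts at the fourth sample, directly after a 16% reading, which matches the observation that the GPU was nearly idle beforehand.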

The sad thing about this is that users lose their work when the VMs crash. On top of that, the VMs cannot start again; the only thing that resolves the issue temporarily is rebooting the XenServer. Sadly, this does not really help, since the issue quickly happens again. We had to remove all our GPUs from the VMs,…

Citrix claims this issue is not their problem. Everything points to Nvidia at the moment.

We first saw this issue on 27.09.2017.

MAGO · October 2, 2017, 12:42pm

Maybe as an addition: we don’t use HDX 3D Pro, we use standard VDA deployments for our XenDesktop environment.

BJones · October 2, 2017, 5:56pm

Hi

Have you tried a different vGPU profile size? Maybe the 1B profile?

I take it you have SUMS. Have you raised it with NVIDIA?

Failing both of the above, can you not roll back to the previous driver that was working, to give you stability whilst you troubleshoot on a Dev platform?

Regards

MAGO · October 2, 2017, 8:48pm

Thanks for your reply. Yes, we have SUMS, and yes, we’ve raised the issue with NVIDIA (no solution so far). Rolling back could be an option; we simply removed the GPUs for now, since we needed a quick solution. The 1GB profile could be tried as well, but then I can run only 64 users, so I would need more M60s for that. Since we use Win7 in standard VDA mode, we thought the 512MB profile would be just right. Our test environment does not have any M60 in it at the moment; those cards are quite expensive :-).

BJones · October 3, 2017, 7:46am

PM Sent …

BJones · October 3, 2017, 10:08am

Something else to consider after my PM … You may be better off investigating M10s rather than M60s. These are cheaper than M60s but have twice the framebuffer and twice the number of GPUs, so you would be able to give your users 1GB whilst maintaining current VM / server density. The M10 offers less performance than the M60, but if you’re only allocating 512MB, then these are clearly not high performance users. Also, if you’re only allocating 512MB, then you’re not even using NVENC, as this is only available on 1GB profiles and higher.
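To make the density argument concrete, here is a rough Python sketch that estimates vGPUs per board from framebuffer alone. The board figures are assumptions taken from the published specs as I understand them (M60: 2 GPUs with a nominal 8 GB each; M10: 4 GPUs with a nominal 8 GB each), and the function and dictionary names are made up for illustration. It uses the nominal 8192 MiB rather than the 8191 MiB that nvidia-smi reports earlier in the thread. Treat the result as an upper bound; the supported limits are the ones listed in NVIDIA’s vGPU documentation.

```python
def vgpus_per_board(gpus_per_board, fb_per_gpu_mib, profile_mib):
    """Upper bound on vGPUs per board, considering framebuffer only."""
    return gpus_per_board * (fb_per_gpu_mib // profile_mib)


# Assumed nominal specs: (physical GPUs per board, framebuffer per GPU in MiB)
BOARDS = {
    "M60": (2, 8192),
    "M10": (4, 8192),
}

for board, (gpus, fb_mib) in BOARDS.items():
    for profile_mib in (512, 1024):
        count = vgpus_per_board(gpus, fb_mib, profile_mib)
        print(f"{board} with {profile_mib} MiB profile: {count} vGPUs per board")
```

Under these assumptions an M10 with 1GB profiles gives the same number of vGPUs per board as an M60 with 512MB profiles, which is the point about keeping the current density.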

If you want better density per physical server, then you might want to look at a XenApp model (again using the M10). Obviously depending on applications being used, security requirements etc etc.

Have a look at an M10 on your dev platform and see what you think … Use my PM as guidance for locating one for testing …

Regards

MAGO · October 4, 2017, 6:30am

Yes, got your point. For the test environment we will go for the M10 I think. Since it is not the same card we might not have the same issue.

We will assign the 1GB profile to some test users now, just to see if we can reproduce the issue. We also use XenApp to push apps to the XenDesktop, but XenApp alone will not work for our users, I’m afraid - we need to keep both XenDesktop and XenApp.

You are right, our users are not high-end users in that sense. With the Nvidia cards we can flatten CPU usage in general, and users definitely have a better GUI experience. We also use Bloomberg, Thomson Reuters etc., which benefit from the cards as well,…

What is a bit disappointing is that such a severe issue is happening in the first place and Nvidia support is a bit limited so far,…

Best regards

BJones · October 4, 2017, 8:06am

When did you raise the call with NVIDIA Support (Date / Time)?

Have you had any response back yet? I take it you have a case number?

Feel free to PM me that if you like …

Regards

MAGO · October 6, 2017, 7:39am

09/27/2017, 03:37 AM
ticket ID: 170927-000048

Guess what, no solution yet. I must say I am very, very disappointed by Nvidia’s support.

Regards

BJones · October 6, 2017, 8:15am

I’ll ask someone to take a look and see what’s happening with the ticket …

Regards

MAGO · November 7, 2017, 10:01am

Here is a short update. Nvidia did not solve the issue; support is very, very limited, meaning nonexistent.

Well, we know by now that this issue also happens with the 1 GB profile.

The good news so far is that testing showed that if we disable Windows 7 Aero by disabling the service Desktop Window Manager Session Manager (service name: UxSms), the issue does not occur at all.
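For reference, a minimal Python sketch of how that service could be stopped and disabled from a script, using the standard Windows `sc` command. This is an assumption about one way to do it, not necessarily how it was done here; the helper names are hypothetical. It defaults to a dry run that only prints the commands, and actually executing them requires an elevated (administrator) session on the Windows 7 guest.

```python
import subprocess


def build_disable_commands(service_name):
    """Build the sc.exe commands that stop a service and disable its startup.
    Note that sc expects 'start=' and its value as separate arguments."""
    return [
        ["sc", "stop", service_name],
        ["sc", "config", service_name, "start=", "disabled"],
    ]


def disable_service(service_name, dry_run=True):
    """Print the commands (dry run) or execute them on a Windows machine."""
    for command in build_disable_commands(service_name):
        if dry_run:
            print(" ".join(command))
        else:
            # check=False: 'sc stop' returns an error if the service
            # is already stopped, which is fine for this purpose
            subprocess.run(command, check=False)


# Dry run only: shows what would be executed for the Aero session manager
disable_service("UxSms")
```

Calling `disable_service("UxSms", dry_run=False)` would run the commands for real; keep in mind that this turns off Aero desktop composition for every user on that machine.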

Best regards

dhtax · August 3, 2018, 10:35am

i can’t fix.

ingiacucre · May 20, 2019, 6:42am

I want to fix ?