Dell R730 with Tesla M60 on XenServer 7.0 reboots unexpectedly when a few VMs with vGPU are started

I’ve run a GPU enabled XenServer 7.0 with 768GB RAM without any issues (Cisco UCS). Although it does list Dell R720 & R730, unless you are experiencing that issue, I wouldn’t bother with it on any other Hosts.

Just out of interest, what CPU do you have installed? Can you give its full name? (e.g. E5-2670 v4)

With newer Hypervisors you shouldn’t need to do this, but on one of your Hosts, can you try disabling “Memory Mapped I/O above 4GB” in the BIOS and see what happens? If it won’t boot afterwards or throws any errors, just set it back to how it was.

Power Supplies are fine, that’s what I expected.

Can you just confirm how you have the M60 powered up for me? Are you able to take a clear photo? (Feel free to PM me that if you’d rather not post it on here). It’s fairly straightforward, but mistakes can happen, I’ve been there. (JS if you read this, not a word! ;-) )

As your servers are pretty much unusable at the moment, if the BIOS change mentioned above doesn’t do anything, do you have time to remove 1 from the Resource Pool and start again? Reset R730 BIOS to factory default, clean install of XenServer (don’t add it back into the Resource Pool, keep it stand-alone), license XenServer (Enterprise or above), don’t use the memory workaround you mentioned above, fully update XenServer, install latest GRID drivers, build a clean Windows VM (from an .iso, not a pre-built template) with all Windows updates and install GRID drivers, don’t bother with Apps or running it through MCS, just see if the issue remains.

It’s strange you have this same issue on all of the R730s. It’s possible that 1x R730 or M60 may have had an issue if this were only 1 Host, but with it being on all of them, this is far less likely. There is obviously a common issue between them somewhere, something has been connected, installed or configured incorrectly.

How did you get on with Passthrough?

I did some tests with the Master VM and GPU Passthrough - no problems so far. So it looks like it’s limited to vGPU.

CPU E5-2667

Will check the Memory Mapped Setting later.

You would like pictures of the M60 power cabling? The separate cable is plugged into the riser card. The two ends (6-pin and 8-pin) are connected to the two 8-pin connectors of the card. Then one 8-pin into the card.

Ok, just checking CPU TDP as it sounds like you’ve swapped bits around between servers (R7910 > R730 at least M60 wise) so no idea what systems you’ve bought together and what other components you may have moved between chassis.

Cabling sounds right. But there are different cables for different generation GPUs and Models and you’ve already mentioned you had issues with the cables you had been sent … If you’re confident that it’s right, we can forget about that.

Good, if it’s just vGPU, it’s a software issue. This is why I asked in my first 2 posts to try Passthrough, as it removes the Hypervisor driver from the loop and you know where to look for the issue.

If you’ve already removed the driver from the Hypervisor > rebooted > installed the new driver > rebooted again, and made sure that the correctly paired driver is in your Master Image (that bit is important because it’s the difference between the problem being in your VM and following you between hosts, or a problem of some sort with all 3 Hosts), then I’d do as suggested above and start again with 1 of your XenServers. Complete fresh start. Reset BIOS etc. (as above) … Doesn’t take long, XenServer is a quick and easy install.

No - we didn’t swap things between the servers ;)
Will take a picture later and attach it.
Master VM is using the driver that comes with 367.64.
First check now is Memory Mapped…

Something to check in your Master Image … In “Device Manager”, enable “Show Hidden Devices”. Go through and remove every Ghost device that is listed (every one of them! Regardless of what it is, if it’s a Ghost, then remove it). Then try a vGPU again.

After you’ve either changed vGPU profile or changed between Passthrough and vGPU, you should go back and remove the Ghost profile for each GPU that is no longer in use.

I didn’t suggest this before because I’ve never heard of it crashing a Host, so it may not be relevant, but it’s worth checking and should be done anyway just to keep the Master Image clean.

Sorry for not replying for such a long time (was on holiday etc.)
We are in an escalation with Dell/Citrix/Nvidia - but no solution has been found so far. I was told that other people have the same problems with the Dell R730 - but I have never heard of the same problems with other hardware vendors. A few interesting notes:
No problems with XenServer 6.5
No problems when we use a Xeon v3 CPU (instead of v4)

Hi jhmeier,

Judging by your name, you are German ;-)

I have the same problem at several customers and, indirectly, have a support case open with FSC and Dell.

In my opinion it comes down to the same problem as described here:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2146388

FSC has already released a BIOS update. A BIOS update from Dell should surely follow.

hey,

We are having the exact same issues on XenServer 7.0 servers with GRID M60 cards. The changes in this article help, and we are still testing: https://support.citrix.com/article/CTX220674

We had the exact same issue in our lab environment. We put an M60 in an R720 server.
It was running OK for a few months, but last week we had a host crash with exactly the same errors reported:

A bus fatal error was detected on a component at bus 64 device 2 function 0.
A bus fatal error was detected on a component at slot 4.
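
For anyone trying to match these messages to a physical device, here is a minimal Python sketch that pulls the numbers out of the two lines quoted above. It assumes the log prints bus/device/function in decimal (so bus 64 would show up as 40 in hex-based tools such as lspci); that is an assumption about the log format, not something confirmed in this thread, and the function name `parse_bus_fatal` is made up for illustration.

```python
import re

# Patterns for the two message shapes quoted in the post above.
# Other firmware versions may word these messages differently.
BUS_RE = re.compile(r"bus (\d+) device (\d+) function (\d+)")
SLOT_RE = re.compile(r"slot (\d+)")


def parse_bus_fatal(line):
    """Return a dict describing the component named in one log line, or None."""
    m = BUS_RE.search(line)
    if m:
        bus, dev, fn = (int(x) for x in m.groups())
        # Assumption: the log uses decimal numbers, lspci-style output uses hex.
        return {
            "kind": "bdf",
            "bus": bus,
            "device": dev,
            "function": fn,
            "lspci": f"{bus:02x}:{dev:02x}.{fn:x}",
        }
    m = SLOT_RE.search(line)
    if m:
        return {"kind": "slot", "slot": int(m.group(1))}
    return None


if __name__ == "__main__":
    for text in (
        "A bus fatal error was detected on a component at bus 64 device 2 function 0.",
        "A bus fatal error was detected on a component at slot 4.",
    ):
        print(parse_bus_fatal(text))
```

If the decimal assumption holds, the first line corresponds to the hex-style address 40:02.0, which can then be compared against the host's PCI device list to see whether it is the GPU or the slot it sits in.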

Is there any update on this case?

Hi RKossen,

The outcome of the case was the workaround (WAR) posted above in the CTX220674 article. Your issue seems to be different, as I doubt you’re already using Intel v4 CPUs with a Dell R720. In addition, the Tesla M60 is not supported at all on this hardware.

Regards

Simon

Yes, the “no-pml” setting fixes the problem. There is also a private hotfix available to fix it (I can’t confirm that because I didn’t have the time to test it). Currently it’s fine to disable the PML feature - it’s used for live migrations (which currently are not possible with vGPU VMs).
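
To double-check that the workaround is actually active after a reboot, one option is to look at the Xen boot command line. Below is a minimal Python sketch that checks a command-line string for the setting. It assumes the option is written as `ept=no-pml` (the thread itself only says “no-pml”; the `ept=` prefix is the form used for EPT sub-options in Xen's command-line documentation) and that you have already read the command line into a string. The function name `pml_disabled` is hypothetical, and only the plain `pml` / `no-pml` spellings are handled.

```python
def pml_disabled(xen_cmdline: str) -> bool:
    """Return True if an ept= option on the command line turns PML off.

    Simplified: only the sub-options "no-pml" and "pml" are recognised.
    If the option appears more than once, the last occurrence wins.
    """
    disabled = False
    for token in xen_cmdline.split():
        if not token.startswith("ept="):
            continue
        for opt in token[len("ept="):].split(","):
            if opt == "no-pml":
                disabled = True
            elif opt == "pml":
                disabled = False
    return disabled


if __name__ == "__main__":
    # Example strings are illustrative, not copied from a real host.
    print(pml_disabled("dom0_mem=4096M ept=no-pml console=vga"))
    print(pml_disabled("dom0_mem=4096M console=vga"))
```

This only tells you what is on the command line; it says nothing about whether the private hotfix mentioned above is installed.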

I know it is not officially supported, but it worked for several months and suddenly we got a BSOD.
So I thought our issue was maybe related to this topic (crash codes are the same).