Dell R730 with Tesla M60 on XenServer 7.0 unexpectedly reboots when a few VMs with vGPU are started

Hello.
We have three new Dell R730 servers with Tesla M60 cards. They run XenServer 7.0 with all patches up to 21 applied. All three have the following problem:
As soon as a few VMs start, the hosts simply reboot without showing any information. The following is logged in the event log:
A bus fatal error was detected on a component at slot 6.
A fatal error was detected on a component at bus 0 device 2 function 0.
The M60 is installed in Slot 6.
The power cable has already been replaced (the original one was not correct). The same happens if we move the card to slot 4.
There is no XenServer Crashdump.
The fix from the linked article (the link now only reads "This site is undergoing maintenance") did not help.

Any hints where to search?
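
One way to follow up on the event-log entry is to translate "bus 0 device 2 function 0" into the address form that lspci uses and then inspect that device from dom0. The snippet below is only a minimal sketch: the helper name bdf is made up for illustration, and it assumes the event log prints the numbers in decimal.

```shell
# Hypothetical helper: format bus/device/function numbers (assumed to be
# decimal, as printed in the event log) as the bus:device.function address
# that lspci expects.
bdf() {
    printf '%02x:%02x.%x\n' "$1" "$2" "$3"
}

# Example (run in dom0): show verbose details for the device the event log
# points at.
#   lspci -vv -s "$(bdf 0 2 0)"
```

For the entry above, bdf 0 2 0 yields 00:02.0. The device at that address may be a bridge or port upstream of the card rather than the GPU itself.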

New one on me…

Do you have SUMS support or is this pre-sales POC?

Best wishes,
Rachel

Depends :)
We have some systems already licensed, but for this one we need to test which licenses are necessary, and for that we need a working environment :)

Details I’d like added:

  • The VDA / XD versions
  • The NVIDIA driver versions
  • The BIOS version

With all new M60s I’d recommend checking that the mode switch has been applied correctly: Having problems with new M6/M60 like VMs fail to power on, NVRM BAR1 error, ECC is enabled, or nvidia-smi fails | NVIDIA
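
A quick way to see which mode a board is in, without extra tooling, is to look at the PCI class code that lspci reports in dom0. The sketch below assumes the usual mapping for M6/M60 boards (class 0300, VGA controller, in graphics mode; class 0302, 3D controller, in compute mode). The function name is made up for illustration.

```shell
# Map a PCI class code (as printed by `lspci -n`) to the M60 mode it suggests.
# Assumption: 0300 (VGA controller) means graphics mode and
# 0302 (3D controller) means compute mode.
gpu_mode_from_class() {
    case "$1" in
        0300) echo graphics ;;
        0302) echo compute ;;
        *)    echo unknown ;;
    esac
}

# Example (run in dom0): list all NVIDIA functions (vendor ID 10de) together
# with the mode their class code suggests. `lspci -n` prints lines such as
# "05:00.0 0300: 10de:...", so the trailing colon is stripped from the class.
#   lspci -n -d 10de: | while read -r addr class rest; do
#       echo "$addr $(gpu_mode_from_class "${class%:}")"
#   done
```

Both GPUs of an M60 should report the same mode. The gpumodeswitch tool from NVIDIA remains the authoritative way to query and change it.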

VDA 7.11
XD 7.11

Drivers:
XenServer:
NVIDIA vGPU (version 361.45.09)
NVIDIA vGPU (version 367.43)
VM:
369.17_grid_win8_win7_server2012R2_server2008R2_64bit_international

Bios:
2.2.5

In compute mode the VMs didn’t start; they only start in graphics mode :)

Hi jhmeier

Can you please let us know why you have 2 different host drivers and only 1 VM driver listed above? The drivers are released in pairs (Host / VM). If you are using multiple drivers, I would have expected to see them listed in pairs.

361.45.09 is from the GRID 3.1 package, and should only be paired with 362.56. As you can see by version comparison, it’s quite a way behind the current release.

The latest drivers for Xen are 367.64 paired only with 369.71, available from here: https://nvidia.flexnetoperations.com/control/nvda/login
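
The pairing rule can be written down as a small lookup, which is handy when checking several hosts and images. This is only a sketch: the function name is made up, and the table contains just the two pairs named in this thread, so any other host version is reported as unknown rather than guessed.

```shell
# Hypothetical lookup: print the guest (VM) driver version that belongs to a
# given host driver version. Only the pairs mentioned in this thread are
# listed; extend the table from the release notes of your GRID package.
expected_guest_driver() {
    case "$1" in
        361.45.09) echo 362.56 ;;   # GRID 3.1 pair
        367.64)    echo 369.71 ;;   # latest pair for Xen at the time of writing
        *)         return 1 ;;      # unknown host version
    esac
}

# Example:
#   expected_guest_driver 367.64 || echo "no known pair for this host driver"
```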

Does the problem occur when you start only a single VM, or only when multiple VMs are started? Are you able to start any VMs at all with a vGPU assigned, or do none power on successfully?

When you run nvidia-smi on the Xen Hosts, what are the results?

When you created your Master Image, the VM obviously had a vGPU assigned for you to install the NVIDIA drivers, did you experience any issues then?

Are you running just XenDesktop or XenApp as well and which Operating Systems are you using?

Does it do it with Passthrough as well?

What is your provisioning method? MCS or PVS?

Regards

Ben

Hi,
we have two host drivers because we started with the old version and updated to the new one. As far as I know it’s not possible to remove one version from XenCenter (except with a full reinstallation).
Thus we have 367.43 with 369.17 in use.

As far as I can see, it only happens once a few VMs have been started.

Nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.43                 Driver Version: 367.43                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 0000:05:00.0     Off |                  Off |
| N/A   35C    P8    25W / 150W |     14MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           On   | 0000:06:00.0     Off |                  Off |
| N/A   31C    P8    23W / 150W |     14MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

No problems with the master, but the master image was created on other hosts (which do not have the problem).
We are using both XA and XD, but in this case it’s XD 7.11 with Windows 7.

MCS

I just crashed a server with only the master VM running and no other VMs, so it also happens with just one VM.

To remove the NVIDIA driver from XenServer -

  • Query the NVIDIA driver: rpm -qa | grep -i nvidia

Let’s assume it comes back with: NVIDIA-vgx-xenserver-7.0-361.45.09 (Adjust to your version if it is different)

  • Remove NVIDIA driver: rpm -ev NVIDIA-vgx-xenserver-7.0-361.45.09

I typically put a reboot in here after removal.

  • Copy new NVIDIA driver to Xen host using WinSCP and install: rpm -iv /change-to-your-path/NVIDIA-vgx-xenserver-7.0. . . .rpm (Adjust to your driver version)

Or you can use the GUI method of mounting using a .iso

Reboot after install completes.
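
The removal and install steps above can be collected into a few small shell helpers. This is a sketch rather than an official procedure: the function names and the DRY_RUN switch are made up for illustration, and the package name and RPM path must come from your own host. With DRY_RUN=1 the rpm commands are only printed, which makes it easy to review them first.

```shell
# Run a command, or just print it when DRY_RUN=1 (hypothetical safety switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: list the installed NVIDIA host driver packages (read-only).
installed_nvidia_pkgs() {
    rpm -qa | grep -i nvidia
}

# Step 2: remove one package, using the exact name printed by step 1.
remove_nvidia_pkg() {
    run rpm -ev "$1"
}

# Step 3: install the new host driver from an RPM already copied to the host.
install_nvidia_rpm() {
    run rpm -iv "$1"
}
```

Run as root in dom0, with a reboot after the removal and another after the install, as described above. For example, DRY_RUN=1 remove_nvidia_pkg NVIDIA-vgx-xenserver-7.0-361.45.09 prints the rpm command without executing it.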

=====

When you say "crashed a server", do you mean the R730 rebooted?

I take it all 3 XenServer hosts are correctly licensed? Which vGPU profiles are you using?

Have you made any changes to the R730 BIOS?

Can you try a Passthrough profile for me and let me know what happens?

Regards

Ben

Yes, the R730 rebooted.
Yes, all licensed. We use M60-0B; I am preparing a test with M60-1B, but deployment takes some time.
No. I found a hint that there should be a Dell document with BIOS settings for GRID, but I can’t find it.

Would the M60-1B test also be OK?

Ok, thanks for the additional info.

All you’re doing is increasing the framebuffer from 512MB to 1GB. The reason I asked for a Passthrough test is that Passthrough does not use the driver in the hypervisor, whereas any other profile does.

I don’t think increasing the framebuffer will stop this issue.

Let me do some investigation …

Regards

Ben

Hi

Can you please review this and let me know what you think: Error | NVIDIA

May well be worth an update to your current BIOS version… You can also help validate that by trying a Passthrough profile.

There are also some other suggestions at the bottom of that page.

Regards

Ben

Thanks for the hint. I already checked that; BIOS etc. is all up to date.
The major difference from most of the hints is that our VMs do start, whereas in most scenarios the VMs don’t boot.

I just tried to remove one of the old NVIDIA supplemental packs:
error: package NVIDIA-vgx-xenserver-7.0-361.45.09 is not installed
I guess they are removed during the upgrade, but not completely, so the old version is still visible in XenCenter.

Ok

There are 3x Dell R730 servers. All identical specs, all identical firmware and all completely up to date. The BIOS is the same version across all 3 hosts (2.2.5) and configured in the same way (factory default).

(Dell website is reporting that there is a newer BIOS version available (2.3.4). I’m not saying it will fix the issue as the VMs boot, only that a newer version is available)

http://www.dell.com/support/home/uk/en/ukbsdt1/Drivers/DriversDetails?driverId=0FR48&fileId=3586834680&osCode=CXS07&productCode=poweredge-r730&languageCode=en&categoryId=BI

XS 7.0 is fully patched and licensed across all R730s. NVIDIA drivers are the same (or should be) across all XS hosts. Using the information provided further up, you can now run the same GRID drivers across all XS hosts. The VM driver is correctly paired with the host driver. Running nvidia-smi on each host shows no errors.

Can you run "rpm -qa | grep -i nvidia" on all 3 XS hosts and post the results?

As you’re using MCS, you have a storage connection for each GPU profile you wish to use and the correct storage connection is being used when you create your XD Catalog. However, even before you get that far, when you create a VM and assign a vGPU, the entire host crashes with either a single VM or multiple VMs?

Based on that, the only step left I can think of, is to try a Passthrough GPU and see if that makes any difference …

You might want to post on the Citrix Forums to see if anyone there can offer some advice: GPU Technologies - Discussions

Regards

Ben

The BIOS was configured in the same way on all three. We changed one host to UEFI boot, but it didn’t make a difference.
I just installed the new BIOS (not available through Lifecycle Controller or ftp.dell.com): no difference.
XS 7 is fully patched and the NVIDIA driver is the same.
nvidia-smi shows no errors.

rpm -qa | grep -i nvidia
All three show: NVIDIA-vGPU-xenserver-7.0-367.43.x86_64
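
Collecting that query from several hosts and checking that the answers match can be scripted. The sketch below assumes SSH access to dom0 as root and uses made-up host names (xs1, xs2, xs3). The helper all_same is also hypothetical and only compares the lines it is given.

```shell
# Succeed only if stdin contains at least one non-empty line and all
# non-empty lines are identical.
all_same() {
    [ "$(grep . | sort -u | grep -c .)" = "1" ]
}

# Example (host names are placeholders): query every host and compare.
#   for h in xs1 xs2 xs3; do
#       ssh "root@$h" 'rpm -qa | grep -i nvidia'
#   done | all_same && echo "hosts match" || echo "hosts differ"
```

A host with two NVIDIA packages installed would contribute two different lines and therefore show up as a mismatch, which is the case worth looking at here.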

At the moment I can reproduce the problem with one VM that was created manually, with no MCS involved. I will check Passthrough later and give feedback.

With only 512 GB RAM the error messages are gone; instead the server just freezes. We had the same problem with a Dell 7910 when the M60 was connected with the wrong power cable, but the power cable in the R730 is correct (according to Dell). Is there a method to check whether the M60 receives enough power?
(The servers mostly freeze when the VMs are rebooted for the second time.)

With only 512GB RAM? What did you have installed previously? … If you had 1TB or more RAM installed, this is not a compatible server configuration.

Which power supplies do you have in the R730s?

When you ordered the R730s, you should also have ordered "GPU Enablement Kits" for each server (from Dell). This gives you the correct power cables and low-profile heat sinks. Did you order those? …

576 GB had been installed. The workaround /opt/xensource/libexec/xen-cmdline --set-dom0 iommu=dom0-passthrough for systems with more than 512 GB did not help.

Redundant 1100 W

Yes, the GPU installation kit was part of the order. (It was for the other servers as well, but Dell delivered the wrong ones. Furthermore, in the R730 not all power plugs had been connected and the cables were in the air flow…)