GTX 1080Ti keeps crashing while under CUDA load and "disappears" from the system until reboot

Hi,

We have a machine that we use for CUDA applications with the following specs:

  • AMD Ryzen 7 1700
  • Asus PRIME X370-PRO (with latest BIOS)
  • 64 GB RAM
  • 2x GTX 1080Ti

It runs Ubuntu 18.04 with a recent kernel (4.15.0-43-generic) and recent NVIDIA drivers (410.79-0ubuntu1).

When running CUDA applications, one of the two GPUs systematically crashes after a while with the following message:

NVRM: GPU at PCI:0000:0a:00: GPU-1d86bbcc-1cbf-2a15-6b63-92edc7614e46
NVRM: GPU Board Serial Number: 
NVRM: Xid (PCI:0000:0a:00): 62, 0a8a(2aa4) 00000000 00000000
NVRM: Xid (PCI:0000:0a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 0): Out Of Range Address
NVRM: Xid (PCI:0000:0a:00): 13, Graphics SM Global Exception on (GPC 2, TPC 0): Physical Multiple Warp Errors
NVRM: Xid (PCI:0000:0a:00): 13, Graphics Exception: ESR 0x514648=0x100000e 0x514650=0x4 0x514644=0xd3eff2 0x51464c=0x17f
NVRM: Xid (PCI:0000:0a:00): 43, Ch 00000008, engmask 00000101

After that, nvidia-smi only reports the second GPU.
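For anyone triaging a similar log, here is a minimal Python sketch that pulls the Xid codes out of the kernel messages and groups them by PCI address, so it is easy to see which device raised which errors and in what order. The function name `xids_by_device` and the regular expression are my own illustration, not part of any NVIDIA tool, and the pattern only assumes the line format shown in the messages above.

```python
import re
from collections import defaultdict

# Matches lines such as "NVRM: Xid (PCI:0000:0a:00): 62, 0a8a(2aa4) ..."
# (format assumed from the log excerpt above; other driver versions may differ).
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def xids_by_device(log_text):
    """Return {pci_address: [xid codes in order of appearance]}."""
    found = defaultdict(list)
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            found[match.group(1)].append(int(match.group(2)))
    return dict(found)


# Three of the lines from the excerpt above, used as sample input.
sample = """\
NVRM: Xid (PCI:0000:0a:00): 62, 0a8a(2aa4) 00000000 00000000
NVRM: Xid (PCI:0000:0a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 0): Out Of Range Address
NVRM: Xid (PCI:0000:0a:00): 43, Ch 00000008, engmask 00000101
"""

print(xids_by_device(sample))
```

In practice the input would be the output of `dmesg` or the kernel log saved to a file; the meaning of each code should be looked up in NVIDIA's Xid documentation rather than inferred from the number.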

We already tried swapping the GPUs between their PCIe slots, but the same GPU keeps crashing. We therefore suspect a hardware problem and a faulty GPU, but we have no idea how to troubleshoot it further.
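Since the PCI address changes when cards are swapped between slots, the GPU UUID is the more reliable way to confirm that it is the same physical card dropping out. The sketch below assumes the CSV output of `nvidia-smi --query-gpu=index,pci.bus_id,uuid --format=csv,noheader` has been captured as text. The helper `missing_gpus` is hypothetical, and the second UUID and the sample output line are placeholders, since only the failing card's UUID appears in our log.

```python
def missing_gpus(expected_uuids, smi_csv):
    """Return the expected UUIDs that are absent from nvidia-smi's CSV output.

    smi_csv is assumed to hold one GPU per line, with the UUID as the
    last comma-separated field.
    """
    present = {
        line.split(",")[-1].strip()
        for line in smi_csv.splitlines()
        if line.strip()
    }
    return sorted(set(expected_uuids) - present)


# UUID of the failing card, taken from the NVRM message in the log.
FAILING = "GPU-1d86bbcc-1cbf-2a15-6b63-92edc7614e46"
# Placeholder UUID for the second card (not a real value).
OTHER = "GPU-00000000-0000-0000-0000-000000000000"

# Hypothetical output after the crash: only the second card is listed.
after_crash = "0, 00000000:0B:00.0, " + OTHER + "\n"

print(missing_gpus([FAILING, OTHER], after_crash))
```

Running the same check after each slot swap shows whether the missing UUID follows the card or stays with the slot.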

nvidia-bug-report.log.gz (324 KB)

The log ends with this:

[1003129.556327] NVRM: Xid (PCI:0000:0a:00): 32, Channel ID 00000000 intr 80040000
[1003129.567986] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
[1003129.568052] NVRM: rm_init_adapter failed for device bearing minor number 0

That doesn’t look good. You could test it as the only card in another system, but I doubt it will work; the card is probably broken.
https://devtalk.nvidia.com/default/topic/1026107/linux/-solved-xid-62-fixeable-/post/5222318/#5222318