We have a machine that we use for CUDA applications with the following specs:
- AMD Ryzen 7 1700
- Asus PRIME X370-PRO (with latest BIOS)
- 64 GB RAM
- 2x GTX 1080Ti
It is running Ubuntu 18.04 with a recent kernel (4.15.0-43-generic) and NVIDIA drivers (410.79-0ubuntu1).
When running CUDA applications, one of the two GPUs systematically crashes after a while with the following message:
NVRM: GPU at PCI:0000:0a:00: GPU-1d86bbcc-1cbf-2a15-6b63-92edc7614e46
NVRM: GPU Board Serial Number:
NVRM: Xid (PCI:0000:0a:00): 62, 0a8a(2aa4) 00000000 00000000
NVRM: Xid (PCI:0000:0a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 0): Out Of Range Address
NVRM: Xid (PCI:0000:0a:00): 13, Graphics SM Global Exception on (GPC 2, TPC 0): Physical Multiple Warp Errors
NVRM: Xid (PCI:0000:0a:00): 13, Graphics Exception: ESR 0x514648=0x100000e 0x514650=0x4 0x514644=0xd3eff2 0x51464c=0x17f
NVRM: Xid (PCI:0000:0a:00): 43, Ch 00000008, engmask 00000101
After that, nvidia-smi only reports the second GPU.
We already tried swapping the GPUs between their PCIe slots, but the same GPU keeps crashing. We therefore suspect a hardware problem with that GPU, but have no idea how to troubleshoot it further.
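To make it easier to compare crashes across runs, the Xid codes can be pulled out of the kernel log and grouped by PCI address. The following is only a minimal sketch: the function name `xids_by_device` and the regular expression are our own assumptions, based solely on the line format seen in the log above, and other driver versions may format these lines differently.

```python
import re
from collections import defaultdict

# Matches driver lines of the form "NVRM: Xid (PCI:0000:0a:00): 62, ..."
# (pattern is an assumption derived from the log lines in this post).
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def xids_by_device(log_text):
    """Return {pci_address: [xid codes in order of appearance]}."""
    found = defaultdict(list)
    for pci, code in XID_RE.findall(log_text):
        found[pci].append(int(code))
    return dict(found)


if __name__ == "__main__":
    # Two of the lines from the log above, used as sample input.
    sample = (
        "NVRM: Xid (PCI:0000:0a:00): 62, 0a8a(2aa4) 00000000 00000000\n"
        "NVRM: Xid (PCI:0000:0a:00): 43, Ch 00000008, engmask 00000101\n"
    )
    print(xids_by_device(sample))
```

The text passed in could be saved `dmesg` output. The resulting codes can then be looked up in NVIDIA's Xid error documentation.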
nvidia-bug-report.log.gz (324 KB)