GTX 1080Ti keeps crashing while under CUDA load and "disappears" from the system until reboot

anon83339707 · January 16, 2019, 8:48am

Hi,

We have a machine that we use for CUDA applications with the following specs:

AMD Ryzen 7 1700
Asus PRIME X370-PRO (with latest BIOS)
64 GB RAM
2x GTX 1080Ti

It runs Ubuntu 18.04 with a recent kernel (4.15.0-43-generic) and NVIDIA driver (410.79-0ubuntu1).

When running CUDA applications, one of the two GPUs consistently crashes after a while with the following messages:

NVRM: GPU at PCI:0000:0a:00: GPU-1d86bbcc-1cbf-2a15-6b63-92edc7614e46
NVRM: GPU Board Serial Number: 
NVRM: Xid (PCI:0000:0a:00): 62, 0a8a(2aa4) 00000000 00000000
NVRM: Xid (PCI:0000:0a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 0): Out Of Range Address
NVRM: Xid (PCI:0000:0a:00): 13, Graphics SM Global Exception on (GPC 2, TPC 0): Physical Multiple Warp Errors
NVRM: Xid (PCI:0000:0a:00): 13, Graphics Exception: ESR 0x514648=0x100000e 0x514650=0x4 0x514644=0xd3eff2 0x51464c=0x17f
NVRM: Xid (PCI:0000:0a:00): 43, Ch 00000008, engmask 00000101

After that, nvidia-smi only reports the second GPU.
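
For anyone wanting to catch the moment a card drops out, here is a minimal Python sketch that lists the GPUs `nvidia-smi` can still see. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output. The helper names (`parse_gpu_list`, `visible_gpus`, `missing_gpus`) are made up for illustration and are not part of any NVIDIA tool.

```python
import subprocess


def parse_gpu_list(csv_text):
    """Parse 'index, pci.bus_id, name' CSV lines into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",", 2)]
        if len(parts) != 3:
            continue  # skip anything that is not a three-field GPU row
        index, bus_id, name = parts
        gpus.append({"index": int(index), "bus_id": bus_id, "name": name})
    return gpus


def visible_gpus():
    """Ask nvidia-smi which GPUs are currently visible to the driver."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,pci.bus_id,name",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_list(result.stdout)


def missing_gpus(expected_count, gpus):
    """Return how many of the expected GPUs are no longer reported."""
    return max(0, expected_count - len(gpus))


# Example use on a two-GPU machine (hypothetical):
#   gpus = visible_gpus()
#   if missing_gpus(2, gpus):
#       print("A GPU has dropped out; still visible:", gpus)
```

Running something like this periodically and logging the result with a timestamp would at least show when the card disappears relative to the CUDA workload.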

We already tried swapping the GPUs between their PCIe slots, but the same GPU keeps crashing. We therefore suspect a hardware fault in that GPU, but have no idea how to troubleshoot it further.

nvidia-bug-report.log.gz (324 KB)

generix · January 16, 2019, 11:05am

The log ends with this:

[1003129.556327] NVRM: Xid (PCI:0000:0a:00): 32, Channel ID 00000000 intr 80040000
[1003129.567986] NVRM: RmInitAdapter failed! (0x26:0xffff:1127)
[1003129.568052] NVRM: rm_init_adapter failed for device bearing minor number 0

Doesn’t look good. Maybe test it as a single card in another system, but I doubt it will work; the card is probably broken.
https://devtalk.nvidia.com/default/topic/1026107/linux/-solved-xid-62-fixeable-/post/5222318/#5222318
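
To get an overview of which Xid codes a card is throwing, a small Python sketch like the following can tally them from kernel log text (for example the output of `dmesg`). The regex is based only on the Xid lines quoted in this thread, so other driver versions may format them differently. The short descriptions are paraphrased from NVIDIA's Xid documentation and should be checked against it. `count_xids` and `XID_NAMES` are illustrative names, not an existing API.

```python
import re
from collections import Counter

# Matches "NVRM: Xid (PCI:0000:0a:00): 62, ..." with or without a dmesg timestamp.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")

# Short descriptions paraphrased from NVIDIA's Xid documentation (verify there).
XID_NAMES = {
    13: "Graphics Engine Exception",
    32: "Invalid or corrupted push buffer stream",
    43: "GPU stopped processing",
    62: "Internal micro-controller halt",
}


def count_xids(log_text):
    """Return a Counter keyed by (pci_address, xid_code)."""
    counts = Counter()
    for match in XID_RE.finditer(log_text):
        counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# A few of the lines quoted in this thread, used as sample input.
SAMPLE = """\
NVRM: Xid (PCI:0000:0a:00): 62, 0a8a(2aa4) 00000000 00000000
NVRM: Xid (PCI:0000:0a:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 0): Out Of Range Address
NVRM: Xid (PCI:0000:0a:00): 43, Ch 00000008, engmask 00000101
[1003129.556327] NVRM: Xid (PCI:0000:0a:00): 32, Channel ID 00000000 intr 80040000
"""

for (pci, code), n in sorted(count_xids(SAMPLE).items()):
    name = XID_NAMES.get(code, "see NVIDIA Xid docs")
    print(f"{pci}  Xid {code:>3}  x{n}  {name}")
```

A tally like this does not prove a hardware fault on its own, but if the same codes follow the same card across slots and systems, that supports the "probably broken" diagnosis.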
