We regularly receive this error:
/var/log/kern.log:
Jun 15 16:45:28 hal2 kernel: [ 1.752052] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
Jun 15 16:45:28 hal2 kernel: [ 1.801781] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 440.64 Fri Feb 21 01:17:26 UTC 2020
Jun 15 16:45:28 hal2 kernel: [ 1.811054] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 440.64 Fri Feb 21 00:43:19 UTC 2020
Jun 15 16:45:28 hal2 kernel: [ 1.812681] [drm] [nvidia-drm] [GPU ID 0x00006800] Loading driver
Jun 15 16:45:28 hal2 kernel: [ 1.813031] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:68:00.0 on minor 0
Jun 15 16:45:28 hal2 kernel: [ 5.157959] nvidia-uvm: Loaded the UVM driver, major device number 511.
Jun 15 16:48:42 hal2 kernel: [ 208.455485] NVRM: GPU at PCI:0000:68:00: GPU-2a9d6094-368e-e43d-6843-cea9465affed
Jun 15 16:48:42 hal2 kernel: [ 208.455486] NVRM: GPU Board Serial Number:
Jun 15 16:48:42 hal2 kernel: [ 208.455487] NVRM: Xid (PCI:0000:68:00): 61, pid=2324, 0cec(3098) 00000000 00000000
Jun 15 16:48:59 hal2 kernel: [ 224.736826] NVRM: GPU 0000:68:00.0: RmInitAdapter failed! (0x24:0x65:1185)
Jun 15 16:48:59 hal2 kernel: [ 224.736854] NVRM: GPU 0000:68:00.0: rm_init_adapter failed, device minor number 0
after a few minutes of execution on a box with the following configuration:
Intel i9-9940X
ASUS WS X299 SAGE/10G (BIOS rev. 2002)
Asus GeForce RTX 2080 Ti 11GB Turbo Edition
Corsair Vengeance LPX 128GB DDR4 2666
HP EX950 M.2 2TB SSD
CORSAIR AX1600i PSU
Ubuntu 20.04
Linux 5.4.0-37-generic
Driver: nvidia-driver-440 (440.82+really.440.64-0ubuntu6) https://packages.ubuntu.com/focal/nvidia-driver-440
Secure Boot is disabled, and we have already tried the pcie_port_pm=off kernel parameter.
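For reference, we added the parameter the standard way through GRUB, roughly as below (this is a sketch assuming the default Ubuntu /etc/default/grub layout; the "quiet splash" defaults may differ on your system):

# /etc/default/grub (excerpt)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_port_pm=off"

# regenerate the GRUB configuration and reboot
sudo update-grub
sudo reboot

# after reboot, confirm the parameter is active
cat /proc/cmdline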
I am attaching reports taken before and after the crash (hal2_before… and hal2_after…).
We have built and configured an identical box (hal1) on which the problem does not show up (the report for that box is attached as well). We tried swapping the GPUs, the PSU, and the PSU cables between the two boxes, but the problem persists, and only on hal2.
Occasionally the error takes longer to appear (about 15 minutes while watching the output of nvidia-smi, or about 10 minutes of training with TensorFlow). After the error, some or all of the GPUs are no longer visible to nvidia-smi.
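In case it helps with reproduction, this is roughly how we watch the GPU while waiting for the error (a simple polling sketch; the log file paths are just examples):

# poll the GPU every 5 seconds and append to a CSV until the Xid shows up
nvidia-smi --query-gpu=timestamp,pci.bus_id,temperature.gpu,power.draw,utilization.gpu \
           --format=csv -l 5 >> /tmp/hal2_gpu_poll.csv &

# follow the kernel log in parallel to catch the Xid / RmInitAdapter messages
sudo dmesg -wT | grep -i --line-buffered 'NVRM\|Xid' >> /tmp/hal2_kern_watch.log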