Weird PCI-E errors: NVRM: Xid (PCI:0000:07:00): 61, pid=0, 0d02(31c4) 00000000 00000000

birdie · September 6, 2020, 9:41am

After a couple of suspend/resume cycles I received this error:

[110415.697822] nvidia 0000:07:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x16 link at 0000:00:03.1 (capable of 126.016 Gb/s with 8.0 GT/s PCIe x16 link)
[110415.703344] NVRM: Xid (PCI:0000:07:00): 61, pid=0, 0d02(31c4) 00000000 00000000
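
For reference, the two bandwidth figures in the first log line can be reproduced with a little arithmetic. This is only a sketch: it assumes the kernel takes the per-lane rate in Mb/s, applies the line-encoding overhead (8b/10b at 2.5 and 5.0 GT/s, 128b/130b at 8.0 GT/s and above) with integer division, and then multiplies by the lane count. The function name is hypothetical.

```python
def pcie_bandwidth_mbps(mt_per_s: int, lanes: int) -> int:
    """Usable PCIe bandwidth in Mb/s for a link at `mt_per_s` MT/s per lane.

    Assumption: the encoding overhead is applied per lane with integer
    (truncating) division, which is what makes 8.0 GT/s x16 come out as
    126.016 Gb/s rather than the exact 126.031 Gb/s.
    """
    if mt_per_s <= 5000:
        # 2.5 and 5.0 GT/s links use 8b/10b encoding
        per_lane = mt_per_s * 8 // 10
    else:
        # 8.0 GT/s and faster links use 128b/130b encoding
        per_lane = mt_per_s * 128 // 130
    return per_lane * lanes


# The link as reported after resume, and what the slot is capable of
print(f"{pcie_bandwidth_mbps(2500, 16) / 1000:.3f} Gb/s")
print(f"{pcie_bandwidth_mbps(8000, 16) / 1000:.3f} Gb/s")
```

In other words, the message says the link at 0000:00:03.1 was running at the 2.5 GT/s rate instead of 8.0 GT/s when the line was logged.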

At this point I’m unable to query via nvidia-smi:

power.draw, driver_version, pcie.link.gen.current, pcie.link.width.current, memory.used - they all return either errors or zeros.
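
A minimal sketch of how these fields can be queried in one go, using nvidia-smi's `--query-gpu` and `--format=csv,noheader` options. The helper names are hypothetical, and the rule for what counts as a "bad" value (a bracketed placeholder, an empty field, or a bare zero) is my assumption based on the symptoms described above, not documented nvidia-smi behaviour.

```python
import subprocess

FIELDS = [
    "power.draw",
    "driver_version",
    "pcie.link.gen.current",
    "pcie.link.width.current",
    "memory.used",
]


def build_query_command(fields):
    """Build the nvidia-smi command line for the given query fields."""
    return ["nvidia-smi", f"--query-gpu={','.join(fields)}", "--format=csv,noheader"]


def parse_row(line, fields):
    """Split one CSV output line into a {field: value} dict."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(fields, values))


def suspicious(row):
    """Return the fields whose value looks like an error or a zero.

    Assumption: errors show up as bracketed placeholders or empty fields,
    and a leading "0" is treated as suspect for every field listed here.
    """
    bad = []
    for name, value in row.items():
        if not value or value.startswith("[") or value.split()[0] in {"0", "0.00"}:
            bad.append(name)
    return bad


def query_gpus(fields=FIELDS):
    """Run nvidia-smi and return one parsed row per GPU (needs the driver installed)."""
    result = subprocess.run(
        build_query_command(fields), capture_output=True, text=True, check=False
    )
    return [parse_row(l, fields) for l in result.stdout.splitlines() if l.strip()]
```

Calling `query_gpus()` and passing each row to `suspicious()` gives a quick way to tell whether the card has dropped into this state.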

Running kernel 5.7.18, NVIDIA 450.57, GTX 1660 Ti (desktop version, the only card in the system) with a motherboard based on the AMD X570 chipset.

To fix the issue I tried to restart the X server, but it would no longer start. I then tried to rmmod and modprobe all four NVIDIA kernel modules and got multiple errors:

dmesg:
Sep 06 09:45:34 zen kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Sep 06 09:45:34 zen kernel: caller os_map_kernel_space.part.0+0x69/0x80 [nvidia] mapping multiple BARs
Sep 06 09:45:38 zen kernel: NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x24:0x65:1224)
Sep 06 09:45:38 zen kernel: NVRM: GPU 0000:07:00.0: rm_init_adapter failed, device minor number 0
# after rmmod/modprobe:
Sep 06 09:47:36 zen kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  450.57  Sun Jul  5 09:42:25 UTC 2020
Sep 06 09:47:40 zen kernel: NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x24:0x65:1224)
Sep 06 09:47:40 zen kernel: NVRM: GPU 0000:07:00.0: rm_init_adapter failed, device minor number 0
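
For anyone who wants to watch for this in their own logs, here is a rough sketch that pulls the Xid number and the RmInitAdapter failure code out of kernel log lines. The regular expressions are derived only from the lines quoted in this post, so treat the exact line format as an assumption; the function name is hypothetical.

```python
import re

# Matches e.g. "NVRM: Xid (PCI:0000:07:00): 61, pid=0, 0d02(31c4) 00000000 00000000"
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<xid>\d+), pid=(?P<pid>[^,]+)"
)

# Matches e.g. "NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x24:0x65:1224)"
INIT_RE = re.compile(
    r"NVRM: GPU (?P<bus>[0-9a-fA-F:.]+): RmInitAdapter failed! \((?P<code>[^)]+)\)"
)


def scan_nvrm(lines):
    """Return (kind, bus, detail) tuples for every Xid or RmInitAdapter line."""
    events = []
    for line in lines:
        m = XID_RE.search(line)
        if m:
            events.append(("xid", m.group("bus"), int(m.group("xid"))))
            continue
        m = INIT_RE.search(line)
        if m:
            events.append(("init_failed", m.group("bus"), m.group("code")))
    return events
```

Feeding it the output of `dmesg` or `journalctl -k`, split into lines, yields one tuple per matching message, which is enough to see how often the error recurs after each resume.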

Xorg.log:
[111248.549] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA GPU at PCI:7:0:0.  Please
[111248.549] (EE) NVIDIA(GPU-0):     check your system's kernel log for additional error
[111248.549] (EE) NVIDIA(GPU-0):     messages and refer to Chapter 8: Common Problems in the
[111248.549] (EE) NVIDIA(GPU-0):     README for additional information.
[111248.549] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!
[111248.549] (EE) NVIDIA(0): Failing initialization of X screen
[111248.549] (II) UnloadModule: "nvidia"
[111248.549] (II) UnloadSubModule: "glxserver_nvidia"
[111248.549] (II) Unloading glxserver_nvidia
[111248.549] (II) UnloadSubModule: "wfb"
[111248.549] (II) UnloadSubModule: "fb"
[111248.549] (EE) Screen(s) found, but none have a usable configuration.
[111248.549] (EE)
Fatal server error:
[111248.549] (EE) no screens found(EE)
[111248.549] (EE)
Please consult the Fedora Project support
         at http://wiki.x.org
 for help.
[111248.549] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
[111248.549] (EE)
[111248.551] (EE) Server terminated with error (1). Closing log file.

My motherboard is ASUS TUF Gaming X570-Plus (Wi-Fi) which means it’s PCI-E 4.0.
My GPU is GTX 1660 Ti.

Once the error occurs, nvidia-smi starts malfunctioning:

nvidia-smi
Mon Sep  7 02:07:50 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  Off  | 00000000:07:00.0  On |                  N/A |
|ERR!   50C    P5   ERR! / 130W |    837MiB /  5941MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3744      G   /usr/libexec/Xorg                 306MiB |
|    0   N/A  N/A      9377      G   ...AAAAAAAAA= --shared-files       60MiB |
|    0   N/A  N/A     44983      G   firefox                           468MiB |
+-----------------------------------------------------------------------------+

At this point I cannot set the fan speed or check the GPU temperature, but the system keeps working as if everything were OK. No errors are logged to the X.org log file.

I’m running Fedora 32 with Linux 5.8.7 (vanilla). I haven’t changed anything in my system for the past year - it’s been rock solid so far, except for the past two days.

birdie · September 24, 2020, 2:31pm

Driver 450.66 has solved this issue for me. Hooray!
