Weird PCI-E errors: NVRM: Xid (PCI:0000:07:00): 61, pid=0, 0d02(31c4) 00000000 00000000

After a couple of suspend/resume cycles I received this error:

[110415.697822] nvidia 0000:07:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x16 link at 0000:00:03.1 (capable of 126.016 Gb/s with 8.0 GT/s PCIe x16 link)
[110415.703344] NVRM: Xid (PCI:0000:07:00): 61, pid=0, 0d02(31c4) 00000000 00000000
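For anyone who wants to watch for this in a script, here is a minimal sketch that pulls the fields out of an NVRM Xid line. The regular expression is only modelled on the single line quoted above, so treat the field names (`pci`, `xid`, `pid`, `payload`) as my own labels and expect other driver versions to format the line differently.

```python
import re

# Pattern modelled on the one Xid line quoted in this post (an assumption,
# not an official format description).
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+), pid=(?P<pid>\d+), (?P<payload>.*)"
)

def parse_xid(line):
    """Return a dict with the Xid fields, or None if the line does not match."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return {
        "pci": m.group("pci"),
        "xid": int(m.group("xid")),
        "pid": int(m.group("pid")),
        "payload": m.group("payload").split(),
    }

sample = ("[110415.703344] NVRM: Xid (PCI:0000:07:00): 61, pid=0, "
          "0d02(31c4) 00000000 00000000")
print(parse_xid(sample))
```

Feeding it the output of `dmesg` line by line and keeping the non-None results gives a quick list of Xid events with their numbers.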

At this point I’m unable to query the following fields via nvidia-smi:

power.draw, driver_version, pcie.link.gen.current, pcie.link.width.current, memory.used. They all return either errors or zeros.
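The query in question can be scripted roughly as below. This is a sketch, not a tested monitoring tool: it assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` and `--format=csv,noheader` options; the helper names (`build_query`, `parse_row`, `query_gpu`) are made up for illustration.

```python
import subprocess

# The fields named in this post.
FIELDS = [
    "power.draw",
    "driver_version",
    "pcie.link.gen.current",
    "pcie.link.width.current",
    "memory.used",
]

def build_query(fields=FIELDS):
    """Build the nvidia-smi command line for the given fields."""
    return ["nvidia-smi", "--query-gpu=" + ",".join(fields),
            "--format=csv,noheader"]

def parse_row(row, fields=FIELDS):
    """Split one CSV row into a field -> value mapping (values stay strings)."""
    values = [v.strip() for v in row.split(",")]
    return dict(zip(fields, values))

def query_gpu():
    """Run the query and return one dict per GPU (requires nvidia-smi)."""
    out = subprocess.run(build_query(), capture_output=True, text=True,
                         check=False)
    return [parse_row(line) for line in out.stdout.splitlines() if line.strip()]
```

Calling `query_gpu()` on a healthy system should return readable values for each field; in the broken state described here the same call would be expected to show the errors or zeros instead.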

Running kernel 5.7.18, NVIDIA 450.57, GTX 1660 Ti (desktop version, the only card in the system) with a motherboard based on the AMD X570 chipset.

To fix the issue I tried to restart the X server, but it would no longer start. I then tried to rmmod and modprobe all four NVIDIA kernel modules and got multiple errors:

dmesg:
Sep 06 09:45:34 zen kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Sep 06 09:45:34 zen kernel: caller os_map_kernel_space.part.0+0x69/0x80 [nvidia] mapping multiple BARs
Sep 06 09:45:38 zen kernel: NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x24:0x65:1224)
Sep 06 09:45:38 zen kernel: NVRM: GPU 0000:07:00.0: rm_init_adapter failed, device minor number 0
# after rmmod/modprobe:
Sep 06 09:47:36 zen kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  450.57  Sun Jul  5 09:42:25 UTC 2020
Sep 06 09:47:40 zen kernel: NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x24:0x65:1224)
Sep 06 09:47:40 zen kernel: NVRM: GPU 0000:07:00.0: rm_init_adapter failed, device minor number 0
Xorg.log:
[111248.549] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA GPU at PCI:7:0:0.  Please
[111248.549] (EE) NVIDIA(GPU-0):     check your system's kernel log for additional error
[111248.549] (EE) NVIDIA(GPU-0):     messages and refer to Chapter 8: Common Problems in the
[111248.549] (EE) NVIDIA(GPU-0):     README for additional information.
[111248.549] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!
[111248.549] (EE) NVIDIA(0): Failing initialization of X screen
[111248.549] (II) UnloadModule: "nvidia"
[111248.549] (II) UnloadSubModule: "glxserver_nvidia"
[111248.549] (II) Unloading glxserver_nvidia
[111248.549] (II) UnloadSubModule: "wfb"
[111248.549] (II) UnloadSubModule: "fb"
[111248.549] (EE) Screen(s) found, but none have a usable configuration.
[111248.549] (EE)
Fatal server error:
[111248.549] (EE) no screens found(EE)
[111248.549] (EE)
Please consult the Fedora Project support
         at http://wiki.x.org
 for help.
[111248.549] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
[111248.549] (EE)
[111248.551] (EE) Server terminated with error (1). Closing log file.
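For reference, the rmmod/modprobe cycle mentioned above can be sketched like this. The post only says "all four NVIDIA drivers"; I am assuming these are the usual nvidia, nvidia_modeset, nvidia_drm and nvidia_uvm modules. The function names are hypothetical, and the default is a dry run that only prints the commands.

```python
import subprocess

# Assumed module set. Dependants are unloaded first: nvidia_drm depends on
# nvidia_modeset, and nvidia_modeset and nvidia_uvm both depend on nvidia.
UNLOAD_ORDER = ["nvidia_drm", "nvidia_modeset", "nvidia_uvm", "nvidia"]

def reload_commands():
    """Return the rmmod commands followed by modprobe commands in reverse order."""
    unload = [["rmmod", m] for m in UNLOAD_ORDER]
    load = [["modprobe", m] for m in reversed(UNLOAD_ORDER)]
    return unload + load

def reload_modules(dry_run=True):
    """Print the commands, or run them when dry_run is False (needs root)."""
    for cmd in reload_commands():
        if dry_run:
            print(" ".join(cmd))
        else:
            # Fails if anything (e.g. a running X server) still uses the module.
            subprocess.run(cmd, check=True)

reload_modules(dry_run=True)
```

As the logs above show, in this case the reload itself went through but RmInitAdapter still failed afterwards, so a module reload alone was not enough to recover the GPU.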

My motherboard is an ASUS TUF Gaming X570-Plus (Wi-Fi), which means it supports PCI-E 4.0.
My GPU is GTX 1660 Ti.
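The bandwidth figures in the kernel message quoted earlier (32.000 Gb/s at 2.5 GT/s x16, 126.016 Gb/s at 8.0 GT/s x16) can be reproduced with a little arithmetic. The sketch below assumes the standard line encodings (8b/10b up to 5 GT/s, 128b/130b from 8 GT/s) and integer rounding per lane, which seems to match what the kernel prints; the table and function name are mine.

```python
# Per-lane raw rate in MT/s and line-encoding ratio (payload bits / total bits).
ENCODING = {
    "2.5": (2500, 8, 10),       # PCIe 1.x, 8b/10b
    "5.0": (5000, 8, 10),       # PCIe 2.0, 8b/10b
    "8.0": (8000, 128, 130),    # PCIe 3.0, 128b/130b
    "16.0": (16000, 128, 130),  # PCIe 4.0, 128b/130b
}

def link_bandwidth_mbps(speed_gts, lanes):
    """Usable link bandwidth in Mb/s, truncating per lane to whole Mb/s."""
    rate, num, den = ENCODING[speed_gts]
    per_lane = rate * num // den
    return per_lane * lanes

print(link_bandwidth_mbps("2.5", 16) / 1000)  # 32.0 Gb/s, the degraded link
print(link_bandwidth_mbps("8.0", 16) / 1000)  # 126.016 Gb/s, what the card can do
```

So after resume the link was running at the lowest PCIe speed, roughly a quarter of what the card is capable of according to the same log line.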

Once the error occurs, nvidia-smi starts malfunctioning:

nvidia-smi
Mon Sep  7 02:07:50 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  Off  | 00000000:07:00.0  On |                  N/A |
|ERR!   50C    P5   ERR! / 130W |    837MiB /  5941MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3744      G   /usr/libexec/Xorg                 306MiB |
|    0   N/A  N/A      9377      G   ...AAAAAAAAA= --shared-files       60MiB |
|    0   N/A  N/A     44983      G   firefox                           468MiB |
+-----------------------------------------------------------------------------+

At this point I cannot set the fan speed or check the GPU temperature, but the system keeps on working as if everything is OK. No errors are logged to the X.org log file.

I’m running Fedora 32 with Linux 5.8.7 (vanilla). I haven’t changed anything in my system for the past year - it’s been rock solid so far, except for the past two days.

Driver version 450.66 has solved this issue for me. Hooray!