After a couple of suspend/resume cycles, I received this error:
[110415.697822] nvidia 0000:07:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x16 link at 0000:00:03.1 (capable of 126.016 Gb/s with 8.0 GT/s PCIe x16 link)
[110415.703344] NVRM: Xid (PCI:0000:07:00): 61, pid=0, 0d02(31c4) 00000000 00000000
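For reference, the two bandwidth figures in the first log line can be reproduced with simple arithmetic. This is a minimal sketch, assuming the usual line-encoding overheads (8b/10b at 2.5 GT/s, 128b/130b at 8.0 GT/s) and per-lane rounding down to whole Mb/s. The function name is made up for illustration.

```python
def pcie_bandwidth_mbps(gt_per_s: float, lanes: int) -> int:
    """Usable link bandwidth in Mb/s after line-encoding overhead (assumed model)."""
    raw_mbps = int(gt_per_s * 1000)      # 1 GT/s carries 1000 Mb/s of raw bits per lane
    if gt_per_s >= 8.0:
        per_lane = raw_mbps * 128 // 130  # 128b/130b encoding (8.0 GT/s and up)
    else:
        per_lane = raw_mbps * 8 // 10     # 8b/10b encoding (2.5 and 5.0 GT/s)
    return per_lane * lanes


print(pcie_bandwidth_mbps(2.5, 16))  # 32000  -> the "32.000 Gb/s available" figure
print(pcie_bandwidth_mbps(8.0, 16))  # 126016 -> the "126.016 Gb/s" capability figure
```

In other words, the log says the x16 link was running at the lowest PCIe speed (2.5 GT/s) when the message was printed, roughly a quarter of what an 8.0 GT/s link would provide.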
At this point I can no longer query any of the following via nvidia-smi: power.draw, driver_version, pcie.link.gen.current, pcie.link.width.current, memory.used. They all return either errors or zeros.
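For anyone trying to reproduce this, the fields listed above can be requested in one call with nvidia-smi's --query-gpu option. The following is only a sketch: the helper names (build_query, parse_row, query_gpu) are hypothetical, and it makes no assumption about what a broken GPU prints in each field; it simply returns the raw strings.

```python
import csv
import subprocess

# The fields named in the post above.
FIELDS = (
    "power.draw",
    "driver_version",
    "pcie.link.gen.current",
    "pcie.link.width.current",
    "memory.used",
)


def build_query(fields=FIELDS):
    """Build the nvidia-smi command line for the given query fields."""
    return ["nvidia-smi", "--query-gpu=" + ",".join(fields), "--format=csv,noheader"]


def parse_row(line, fields=FIELDS):
    """Map one CSV output line (one GPU) to a {field: raw string} dict."""
    values = [v.strip() for v in next(csv.reader([line]))]
    return dict(zip(fields, values))


def query_gpu(fields=FIELDS):
    """Run nvidia-smi and return (exit code, list of per-GPU dicts)."""
    result = subprocess.run(build_query(fields), capture_output=True, text=True)
    rows = [parse_row(l, fields) for l in result.stdout.splitlines() if l.strip()]
    return result.returncode, rows
```

Calling query_gpu() before and after a suspend/resume cycle and comparing the two results should make it easy to see which fields stop returning sane values.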
Running kernel 5.7.18, NVIDIA 450.57, GTX 1660 Ti (desktop version, the only card in the system) with a motherboard based on the AMD X570 chipset.
To fix the issue I tried restarting the X server, but it would no longer start. I then tried to rmmod and modprobe all four NVIDIA drivers and got multiple errors:
dmesg:
Sep 06 09:45:34 zen kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Sep 06 09:45:34 zen kernel: caller os_map_kernel_space.part.0+0x69/0x80 [nvidia] mapping multiple BARs
Sep 06 09:45:38 zen kernel: NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x24:0x65:1224)
Sep 06 09:45:38 zen kernel: NVRM: GPU 0000:07:00.0: rm_init_adapter failed, device minor number 0
# after rmmod/modprobe:
Sep 06 09:47:36 zen kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 450.57 Sun Jul 5 09:42:25 UTC 2020
Sep 06 09:47:40 zen kernel: NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x24:0x65:1224)
Sep 06 09:47:40 zen kernel: NVRM: GPU 0000:07:00.0: rm_init_adapter failed, device minor number 0
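The rmmod/modprobe sequence mentioned above can be sketched roughly as follows. The post does not name the four modules, so the list here (nvidia_drm, nvidia_modeset, nvidia_uvm, nvidia) is an assumption based on what the driver normally installs, and the function names are hypothetical. Unloading goes from the most dependent module down to the core one, and loading goes in the reverse order.

```python
import subprocess

# Assumed module names and unload order (dependents first, core module last).
UNLOAD_ORDER = ("nvidia_drm", "nvidia_modeset", "nvidia_uvm", "nvidia")


def reload_commands(unload_order=UNLOAD_ORDER):
    """Return the rmmod commands followed by the modprobe commands."""
    unload = [["rmmod", name] for name in unload_order]
    load = [["modprobe", name] for name in reversed(unload_order)]
    return unload + load


def run_all(commands):
    """Run each command in turn (needs root); stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Dry run: print the commands instead of executing them.
for cmd in reload_commands():
    print(" ".join(cmd))
```

The X server and anything else holding the GPU has to be stopped first, otherwise rmmod will refuse to unload modules that are still in use.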
Xorg.log:
[111248.549] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA GPU at PCI:7:0:0. Please
[111248.549] (EE) NVIDIA(GPU-0): check your system's kernel log for additional error
[111248.549] (EE) NVIDIA(GPU-0): messages and refer to Chapter 8: Common Problems in the
[111248.549] (EE) NVIDIA(GPU-0): README for additional information.
[111248.549] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!
[111248.549] (EE) NVIDIA(0): Failing initialization of X screen
[111248.549] (II) UnloadModule: "nvidia"
[111248.549] (II) UnloadSubModule: "glxserver_nvidia"
[111248.549] (II) Unloading glxserver_nvidia
[111248.549] (II) UnloadSubModule: "wfb"
[111248.549] (II) UnloadSubModule: "fb"
[111248.549] (EE) Screen(s) found, but none have a usable configuration.
[111248.549] (EE)
Fatal server error:
[111248.549] (EE) no screens found(EE)
[111248.549] (EE)
Please consult the Fedora Project support
at http://wiki.x.org
for help.
[111248.549] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
[111248.549] (EE)
[111248.551] (EE) Server terminated with error (1). Closing log file.
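To pick the relevant NVRM entries out of a longer kernel log, something like the following could be used. It is a sketch only: the regular expressions are written against the exact line shapes quoted above and may not match other driver versions, and the function name is hypothetical.

```python
import re

# Patterns modelled on the log lines quoted in this post (an assumption, not a spec).
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),")
INIT_RE = re.compile(
    r"NVRM: GPU (?P<pci>[0-9a-fA-F:.]+): RmInitAdapter failed! \((?P<code>[^)]+)\)"
)


def scan_kernel_log(lines):
    """Return a list of (kind, pci_address, detail) tuples for NVRM failures."""
    events = []
    for line in lines:
        m = XID_RE.search(line)
        if m:
            events.append(("xid", m["pci"], int(m["xid"])))
            continue
        m = INIT_RE.search(line)
        if m:
            events.append(("init_failed", m["pci"], m["code"]))
    return events
```

Feeding it the output of dmesg (or journalctl -k) line by line gives a short timeline of Xid events and failed adapter initializations, which is easier to attach to a bug report than the full log.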
My motherboard is ASUS TUF Gaming X570-Plus (Wi-Fi) which means it’s PCI-E 4.0.
My GPU is GTX 1660 Ti.
Once the error occurs, nvidia-smi starts malfunctioning:
nvidia-smi
Mon Sep 7 02:07:50 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  Off  | 00000000:07:00.0  On |                  N/A |
|ERR!   50C    P5   ERR! / 130W |    837MiB /  5941MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3744      G   /usr/libexec/Xorg                 306MiB |
|    0   N/A  N/A      9377      G   ...AAAAAAAAA= --shared-files       60MiB |
|    0   N/A  N/A     44983      G   firefox                           468MiB |
+-----------------------------------------------------------------------------+
At this point I cannot set fan speed or check GPU temperature but the system keeps on working as if everything is OK. There are no errors logged to an X.org log file.
I’m running Fedora 32 with Linux 5.8.7 (vanilla). I haven’t changed anything in my system for the past year - it’s been rock solid so far, except for the past two days.