I just exchanged my ‘GeForce GT 720’ for a KFA2 ‘GeForce GTX 1650 SUPER’ in my five-year-old AMD PC, but the new GPU is only visible in lspci. nvidia-smi reports ‘No devices were found’, and dmesg shows these messages:
[ 8595.773293] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 8595.773873] caller _nv000705rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 8603.948675] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0xffff:1290)
[ 8603.948812] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 8604.007154] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 8604.007769] caller _nv000705rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 8612.181198] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0xffff:1290)
[ 8612.181335] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
For details, see the attached nvidia-bug-report.log.gz (533.2 KB).
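To make the failures easier to spot, this is roughly how I filtered the kernel log (just a sketch; the grep pattern is mine, matching the messages quoted above, and `sample` stands in for real `dmesg` output):

```shell
# Pattern covering the NVIDIA init failures and the resource sanity checks.
pattern='NVRM|resource sanity check|mapping multiple BARs'

# Sample lines copied from the kernel log above; on the live machine
# you would pipe `dmesg` through the same grep instead.
sample='[ 8603.948675] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0xffff:1290)
[ 8603.948812] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 8604.007154] resource sanity check: requesting [mem 0x000c0000-0x000fffff]'

# Count how many log lines match; all three sample lines do.
printf '%s\n' "$sample" | grep -cE "$pattern"
# On the live machine: dmesg | grep -E "$pattern"
```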
Please upgrade your BIOS to the latest version first; it’s 6 years old. If that still doesn’t work, check the PCIe generation settings in the BIOS and try lowering or raising them. Also, please try reseating the card in its slot.
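To check which BIOS version is currently installed before flashing, something like this should work (a sketch; the sysfs DMI paths are standard on x86 Linux and need no root, unlike `dmidecode`, but they are not taken from your logs):

```shell
# Print BIOS vendor, version and release date from the DMI sysfs entries.
# Files may be absent on systems without DMI/SMBIOS, hence the fallback.
for f in bios_vendor bios_version bios_date; do
    p="/sys/class/dmi/id/$f"
    if [ -r "$p" ]; then
        printf '%s: %s\n' "$f" "$(cat "$p")"
    else
        printf '%s: (not available)\n' "$f"
    fi
done
```

Compare the reported version and date against the latest download on the board vendor’s support page.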
wow! It is visible …
[root@otto ~]# nvidia-smi
Mon Dec 28 22:36:28 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 450.66       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 165...  Off  | 00000000:01:00.0 Off |                  N/A |
| 40%   28C    P8     8W / 100W |      1MiB /  3911MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@otto ~]# uname -a
Linux otto 5.8.7-200.fc32.x86_64 #1 SMP Mon Sep 7 15:26:10 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
This is my ‘old’ system disk from September 2020, based on Fedora 32, from before I switched to CentOS 8.3.
What is your conclusion? What changed in the Linux kernel to make this work?
I switched back to my CentOS 8 disk, installed the latest kernel-ml from ELRepo … and it worked.
Can you please track down, or at least guess, the root cause of this behaviour?
Then Red Hat could backport the relevant changes between kernel 4.18 and 5.8, to keep their distribution viable as a platform for AI workloads with NVIDIA GPUs.
[root@otto ~]# uname -a
Linux otto 5.10.3-1.el8.elrepo.x86_64 #1 SMP Wed Dec 23 13:25:00 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
[root@otto ~]# nvidia-smi
Tue Dec 29 10:48:32 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 165...  Off  | 00000000:01:00.0 Off |                  N/A |
(…)
I hardly see switching to an out-of-distro kernel as a solution; at best it is a temporary workaround, I’d say. I have a similar forum posting about RHEL and was kindly guided here.
Did you ever find the real root cause and a fix for this? It sounds to me as if the kernel gained improved protection against over-allocating PCI address space, and therefore won’t let the driver proceed. Could that be the reason?
I agree, switching to a 5.10.3 ELRepo kernel is a temporary workaround, not a solution.
Unfortunately, finding the real root cause is beyond my capabilities, and I have not yet found the time to open a bug report against RHEL/CentOS 8.3. But this week I will try the combination of kernel 4.18.0-240.1.1 and NVIDIA driver 460.32.03.
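To document each kernel/driver combination I test, I plan to record the pair like this (just a sketch; `modinfo -F version nvidia` reads the module’s version field and only works once the driver package is installed):

```shell
# Record the exact kernel / NVIDIA-driver pair under test, e.g. for a bug report.
printf 'kernel: %s\n' "$(uname -r)"

# Query the nvidia module's version field if the module is available;
# fall back to a note when it is not installed on this system.
if command -v modinfo >/dev/null 2>&1 && modinfo nvidia >/dev/null 2>&1; then
    printf 'driver: %s\n' "$(modinfo -F version nvidia)"
else
    printf 'driver: nvidia module not installed\n'
fi
```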
The “resource sanity check” message is just a symptom, and a very common one. From observation, it is always displayed when the nvidia driver resets the GPU.
Here the driver cannot communicate with the GPU (RmInitAdapter failed) and resets it (the resource message), in a loop.