nvidia-smi: ‘No devices were found’ on CentOS 8.3 with driver 460.27.04

I just replaced my ‘GeForce GT 720’ with a KFA2 ‘GeForce GTX 1650 SUPER’ in my 5-year-old AMD PC, but the new GPU is only visible in lspci. nvidia-smi reports ‘No devices were found’, and dmesg shows these messages:

[ 8595.773293] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 8595.773873] caller _nv000705rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 8603.948675] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0xffff:1290)
[ 8603.948812] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 8604.007154] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 8604.007769] caller _nv000705rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 8612.181198] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0xffff:1290)
[ 8612.181335] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
For details, see the attached nvidia-bug-report.log.gz (533.2 KB).

Please update your BIOS to the latest version first; the current one is 6 years old. If that doesn’t help, check the BIOS for PCIe gen settings and try lowering or raising them. Also, please try reseating the card in its slot.

Thanks for the quick reply.

  • Unfortunately, BIOS F2 from 11/25/2014 is the latest version for the GA-78LMT-USB3.
  • The BIOS does not offer any PCIe gen settings.
  • Reseating the card did not help either.

Any more ideas for getting the ‘GeForce GTX 1650 SUPER’ enabled for CUDA on this PCIe Gen2 motherboard?
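
When the BIOS exposes no PCIe gen option, the link the kernel actually negotiated can still be inspected from Linux. The following is only a minimal sketch: it assumes the running kernel exposes the `current_link_speed`, `max_link_speed`, `current_link_width` and `max_link_width` attributes in the device's sysfs directory, and it uses the bus address 0000:01:00.0 from the dmesg output above. The helper name `read_link_info` is made up for illustration. It only reports the link state; it cannot change it.

```python
from pathlib import Path


def read_link_info(device_dir):
    """Read PCIe link attributes from a sysfs device directory, where present."""
    names = (
        "current_link_speed",
        "max_link_speed",
        "current_link_width",
        "max_link_width",
    )
    info = {}
    for name in names:
        attr = Path(device_dir) / name
        # An attribute may be missing on some kernels or devices; report None then.
        info[name] = attr.read_text().strip() if attr.is_file() else None
    return info


if __name__ == "__main__":
    # Assumption: the GPU sits at 0000:01:00.0, as in the dmesg lines of this thread.
    link = read_link_info("/sys/bus/pci/devices/0000:01:00.0")
    for key, value in link.items():
        print(f"{key}: {value}")
```

If the current speed is lower than the maximum, the card and slot negotiated a slower link than the slot supports, which would be worth mentioning in a bug report.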

If you have a spare drive, would it be possible to try a different Linux distribution?

wow! It is visible …
[root@otto ~]# nvidia-smi
Mon Dec 28 22:36:28 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 450.66       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 165…    Off  | 00000000:01:00.0 Off |                  N/A |
| 40%   28C    P8     8W / 100W |      1MiB /  3911MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@otto ~]# uname -a
Linux otto 5.8.7-200.fc32.x86_64 #1 SMP Mon Sep 7 15:26:10 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

This is my ‘old’ system disk from Sept. 2020, based on Fedora 32, from before I switched to CentOS 8.3.
What is your conclusion? Which newer parts of the Linux kernel make this work?

I switched back to my CentOS 8 disk and installed the latest kernel-ml from ELRepo … and it worked.
Can you please track down, or at least guess, the root cause of this behaviour, so that Red Hat can backport the relevant changes between kernel 4.18 and 5.8? That would enable their distribution as a platform for AI workloads with NVIDIA GPUs.

[root@otto ~]# uname -a
Linux otto 5.10.3-1.el8.elrepo.x86_64 #1 SMP Wed Dec 23 13:25:00 EST 2020 x86_64 x86_64 x86_64 GNU/Linux

[root@otto ~]# nvidia-smi
Tue Dec 29 10:48:32 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 165…    Off  | 00000000:01:00.0 Off |                  N/A |
(…)


I hardly see switching to an out-of-distro kernel as a solution; it is a temporary workaround at best. I have a similar forum post about RHEL and was kindly pointed here.

Did you ever find the real root cause and a fix for this? It sounds to me as if the kernel had tightened its protection against over-allocating PCI address space and therefore does not let the driver proceed. Could that be the reason?

I agree, switching to the 5.10.3 ELRepo kernel is a temporary workaround, not a solution.

Unfortunately, finding the real root cause is beyond my capabilities. I also have not yet found the time to open a bug report against RHEL/CentOS 8.3. But this week I will try the combination of kernel 4.18.0-240.1.1 and NVIDIA driver 460.32.03.

The “resource sanity check” message is just a symptom, and a very common one. From observation, it is always displayed when the nvidia driver resets the GPU.
The driver can’t communicate with the GPU (RmInitAdapter failed) and resets it (resource message), in a loop.
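
The loop can be made visible by looking at the timestamps in the dmesg excerpt from the first post. The sketch below is only an illustration, not a diagnostic tool: it assumes the usual `[ seconds.micros] message` dmesg line format, and the helper names `event_times` and `intervals` are made up here. It collects the timestamps of both message types and prints the gap between repeats.

```python
import re

# Matches kernel log lines of the form "[ 8603.948675] message".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def event_times(lines, needle):
    """Return the timestamps (in seconds) of log lines whose message contains needle."""
    times = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and needle in match.group(2):
            times.append(float(match.group(1)))
    return times


def intervals(times):
    """Return the gaps between consecutive timestamps, rounded to milliseconds."""
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]


# The dmesg lines quoted in the first post of this thread.
SAMPLE = """\
[ 8595.773293] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 8595.773873] caller _nv000705rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 8603.948675] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0xffff:1290)
[ 8603.948812] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 8604.007154] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 8604.007769] caller _nv000705rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 8612.181198] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0xffff:1290)
[ 8612.181335] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
""".splitlines()

failures = event_times(SAMPLE, "RmInitAdapter failed")
resets = event_times(SAMPLE, "resource sanity check")

print("gaps between RmInitAdapter failures:", intervals(failures))
print("gaps between resource sanity checks:", intervals(resets))
```

On the excerpt above, both gaps come out at roughly 8.2 seconds, which fits the picture of one repeating init-fail-reset cycle rather than two unrelated problems.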


Just to follow up here, Red Hat released an errata for this issue: RHSA-2021:0558