NVRM: RmInitAdapter failed! - ETH mining server

Hello all,

This issue has me completely stumped. I have scoured the internet and tried everything that seemed to work for people with a similar issue, but to no avail.

In short, 2 of the 6 graphics cards installed in my ETH mining server cannot be detected by nvidia-smi. One of them was working just fine on the same PCIe port until I installed a new card on a different port; now the new card is fine, but the old one is having problems. The 6th card and its port have never been confirmed functional.

lspci shows all cards:

01:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
04:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
07:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
08:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
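
For anyone checking the same thing, lspci can also show which kernel driver is bound to each card; filtering on NVIDIA's vendor ID (10de) keeps the output short:

lspci -nnk -d 10de:

The -k flag adds a "Kernel driver in use:" line per device, so it is easy to see whether the nvidia module attached to all six cards or only some of them.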

But dmesg shows a failure of RmInitAdapter from NVRM:

[ 75.298825] NVRM: GPU 0000:07:00.0: RmInitAdapter failed! (0x23:0xffff:624)
[ 75.298873] NVRM: GPU 0000:07:00.0: rm_init_adapter failed, device minor number 4
[ 75.415957] NVRM: GPU 0000:08:00.0: RmInitAdapter failed! (0x23:0xffff:624)
[ 75.416004] NVRM: GPU 0000:08:00.0: rm_init_adapter failed, device minor number 5
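
For completeness, the rest of the driver-side picture can be pulled from the kernel log in one go, e.g.:

sudo dmesg | grep -iE 'NVRM|Xid'

Xid entries (if any) would point at runtime GPU faults, whereas RmInitAdapter failures like the ones above happen while the driver is still trying to bring the card up.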

Output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...   On  | 00000000:01:00.0 Off |                  N/A |
| 59%   74C    P2   101W / 120W |   4352MiB /  6076MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 106...   On  | 00000000:02:00.0 Off |                  N/A |
| 50%   74C    P2    94W / 120W |   4345MiB /  6078MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 106...   On  | 00000000:04:00.0 Off |                  N/A |
| 39%   72C    P2    92W / 120W |   4345MiB /  6078MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 106...   On  | 00000000:05:00.0 Off |                  N/A |
| 43%   72C    P2    94W / 120W |   4345MiB /  6078MiB |     97%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1037      G   /usr/lib/xorg/Xorg                  8MiB |
|    0   N/A  N/A      1116      G   /usr/bin/gnome-shell                1MiB |
|    0   N/A  N/A      1446      C   ethminer                         4337MiB |
|    1   N/A  N/A      1037      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1446      C   ethminer                         4337MiB |
|    2   N/A  N/A      1037      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      1446      C   ethminer                         4337MiB |
|    3   N/A  N/A      1037      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      1446      C   ethminer                         4337MiB |
+-----------------------------------------------------------------------------+
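
The same listing is available in plain CSV if the table is hard to read, e.g.:

nvidia-smi --query-gpu=index,pci.bus_id,name,memory.total,utilization.gpu --format=csv

Either way, only four GPUs are enumerated; the two cards at 07:00.0 and 08:00.0 never get past driver initialization, so nvidia-smi cannot see them at all.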

I have already tried running nvidia-persistenced on boot without success. I am almost certain it is not a hardware issue, at least for the card that was working on the same PCIe port. I doubt it is a kernel configuration issue (I haven’t touched my kernel). Does anyone have any ideas?
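
For context, enabling the persistence daemon on Ubuntu typically amounts to something like the following, assuming the driver package installed the nvidia-persistenced systemd unit (nvidia-smi -pm 1 is the older per-GPU persistence-mode toggle):

sudo systemctl enable --now nvidia-persistenced
sudo nvidia-smi -pm 1

Persistence mode only keeps an already-initialized GPU initialized, so it was a long shot for cards that fail RmInitAdapter in the first place.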

I am running an almost fresh install of Ubuntu Server 20.04.2 LTS. Here is my bug report: nvidia-bug-report.log.gz (881.0 KB)

Thanks in advance!

Broken risers, wrong PCIe gen?

Highly unlikely. Let me try to paint a picture:

Let | be a PCIe riser, and let g| be a riser with a 1060 installed. This was my configuration initially, with a GPU installed directly on my motherboard:

           g|  g|  |  g|  |  |

This configuration worked perfectly. Now let g|x be a GPU having issues with RmInitAdapter:

         g|  g|  g|  g|x  g|x  |
Device:   1   2   3   4    5  
(device 0 is on board)

I didn’t touch the configuration for device 4. This suggests to me that, at least for that card, the issue is not hardware related, since everything was working before I installed what is now device 3.

This is called crosstalk from bad risers.

Interesting, I have been looking into this and I have come across some discussions talking about how cheap risers can have problems with cross-talk in multi-GPU configurations. I’ll order a couple new ones and see what happens.

OK, so I was messing around with the build, and strangely enough I can only get 4 cards to work at a time no matter what configuration I use. Every riser is fully functional as long as it is one of only 4 connected, but as soon as more than 4 are connected, everything in the higher-numbered PCIe slots drops off.

Could this error be caused by a limitation of my hardware? I have the ASRock Z270 Killer SLI motherboard. I have read that other miners have gotten this board to work with at least 6 cards, but I don’t know the details. Is there some sort of software or BIOS limitation in place that I am unaware of? This seems too reproducible to be cross-talk alone.

Yes, you are correct. I’ve taken a deeper look and:

[    0.226100] pci 0000:07:00.0: BAR 1: no space for [mem size 0x10000000 64bit pref]
[    0.226103] pci 0000:07:00.0: BAR 1: trying firmware assignment [mem 0x20000000-0x2fffffff 64bit pref]
[    0.226105] pci 0000:07:00.0: BAR 1: [mem 0x20000000-0x2fffffff 64bit pref] conflicts with System RAM [mem 0x00100000-0x59316017]
[    0.226108] pci 0000:07:00.0: BAR 1: failed to assign [mem size 0x10000000 64bit pref]
[    0.226110] pci 0000:07:00.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
[    0.226112] pci 0000:07:00.0: BAR 3: trying firmware assignment [mem 0x30000000-0x31ffffff 64bit pref]
[    0.226115] pci 0000:07:00.0: BAR 3: [mem 0x30000000-0x31ffffff 64bit pref] conflicts with System RAM [mem 0x00100000-0x59316017]
[    0.226117] pci 0000:07:00.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
[    0.226119] pci 0000:00:1c.4: PCI bridge to [bus 07]
[    0.226121] pci 0000:00:1c.4:   bridge window [io  0xa000-0xafff]
[    0.226125] pci 0000:00:1c.4:   bridge window [mem 0xd6000000-0xd70fffff]
[    0.226130] pci 0000:08:00.0: BAR 1: no space for [mem size 0x10000000 64bit pref]
[    0.226132] pci 0000:08:00.0: BAR 1: failed to assign [mem size 0x10000000 64bit pref]
[    0.226135] pci 0000:08:00.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
[    0.226137] pci 0000:08:00.0: BAR 3: trying firmware assignment [mem 0x10000000-0x11ffffff 64bit pref]
[    0.226139] pci 0000:08:00.0: BAR 3: [mem 0x10000000-0x11ffffff 64bit pref] conflicts with System RAM [mem 0x00100000-0x59316017]
[    0.226141] pci 0000:08:00.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]

Please check your BIOS for an option called “Above 4G decoding” or “large/64bit BARs” and enable it. Normally, the NVIDIA driver would put out a clearer error message in that case.
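
After enabling it and rebooting, you can verify that the kernel was able to place the large BARs, e.g.:

sudo dmesg | grep -i 'BAR'
nvidia-smi -L

With above-4G decoding enabled, the 256MB BAR 1 windows should get assigned above the 4GB boundary instead of failing, and nvidia-smi -L should then list all six cards.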

That did the trick! Yeah, that was not a very clear error message at all for something so simple. You’re the best, generix, thanks a million!