RmInitAdapter failed! since kernel > 6.4

Hi guys,

I’m having an issue with my nvidia drivers under kernel >=6.5.
Tested and confirmed with kernel 6.6 and 6.7.

It’s not nvidia-drivers version dependent: tested with the 535.x and 550.x series.
I don’t have any issue with the 6.4 kernel.

Here is some info:
01:00.0 VGA compatible controller: NVIDIA Corporation GA106M [GeForce RTX 3060 Mobile / Max-Q] (rev a1)

[ 1227.783072] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x110100, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
[ 1227.783075] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x110100, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
[ 1227.783078] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x110100, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
[ 1227.783081] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x110100, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
[ 1227.783084] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x110100, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
[ 1227.783087] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x110100, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
[ 1227.783090] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x110100, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
[ 1227.783093] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x110100, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
[ 1227.783096] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x110100, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE

[ 1227.783105] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x110100, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
[ 1227.783108] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x110100, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
[ 1227.783111] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x110100, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
[ 1227.783115] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x110100, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
[ 1227.783120] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x110100, regvalue: 0xbadf5720, error code: Unknown SYS_PRI_ERROR_CODE
[ 1227.783124] NVRM: gpuWaitForGfwBootComplete_TU102: GSP failed to halt after GFW completion
[ 1227.783126] NVRM: kgspWaitForGfwBootOk_TU102: failed to wait for GFW boot complete: 0x65 VBIOS version 94.06.34.00.2F
[ 1227.783127] NVRM: kgspWaitForGfwBootOk_TU102: (the GPU may be in a bad state and may need to be reset)
[ 1227.783130] NVRM: nvCheckOkFailedNoLog: Check failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from kgspWaitForGfwBootOk_HAL(pGpu, pKernelGsp) @ kernel_gsp.c:3053
[ 1227.783133] NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
[ 1227.784356] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x65:1784)
[ 1227.784858] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 1227.785670] Loading firmware: nvidia/550.54.14/gsp_ga10x.bin
[ 1227.791066] NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.

cat /proc/driver/nvidia/gpus/0000:01:00.0/information

Model: NVIDIA GeForce RTX 3060 Laptop GPU
IRQ: 165
GPU UUID: GPU-???-???-???-???-???
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:01:00.0
Device Minor: 0
GPU Firmware: N/A
GPU Excluded: No

nvidia-smi

No devices were found

Any help would be greatly appreciated
Thanks

Looks like you’re using the nvidia-open driver. Please switch to the non-open driver and retest.

Hello,
Maybe I picked the wrong one; I had tried the open driver.

Please note it’s a Lenovo 15IAH7H, with an integrated Intel GPU plus a discrete RTX 3060. I’ve just updated the VBIOS.

Tried again with the non-open driver on the 6.7 kernel series:

[ 224.564558] nvidia: loading out-of-tree module taints kernel.
[ 224.564566] nvidia: module license 'NVIDIA' taints kernel.
[ 224.564567] Disabling lock debugging due to kernel taint
[ 224.564569] nvidia: module license taints kernel.
[ 224.571671] nvidia-nvlink: Nvlink Core is being initialized, major device number 510

[ 224.572436] nvidia 0000:01:00.0: enabling device (0000 -> 0003)
[ 224.572498] nvidia 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[ 224.615457] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.54.14 Thu Feb 22 01:44:30 UTC 2024
[ 224.644823] ACPI Warning: \_SB.NPCF._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20230628/nsarguments-61)
[ 224.644866] ACPI Warning: \_SB.PC00.PEG1.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20230628/nsarguments-61)
[ 225.045986] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x72:1556)
[ 225.046011] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 225.174257] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 225.179101] nvidia-uvm: Loaded the UVM driver, major device number 508.

cat /proc/driver/nvidia/gpus/0000:01:00.0/information

Model: NVIDIA GeForce RTX 3060 Laptop GPU
IRQ: 165
GPU UUID: GPU-11838d21-e7d7-da9e-f200-668931907842
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:01:00.0
Device Minor: 0
GPU Excluded: No

Here is the same output from kernel 6.4.15:

Model: NVIDIA GeForce RTX 3060 Laptop GPU
IRQ: 165
GPU UUID: GPU-11838d21-e7d7-da9e-f200-668931907842
Video BIOS: 94.06.34.00.2f
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:01:00.0
Device Minor: 0
GPU Excluded: No

Hello guys,
Updated to 550.67, same issue.

[ 264.486288] nvidia-nvlink: Nvlink Core is being initialized, major device number 510

[ 264.487090] nvidia 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=none
[ 267.105733] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.67 Tue Mar 12 23:54:15 UTC 2024
[ 267.540572] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x72:1556)
[ 267.540602] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 267.687936] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 267.698679] nvidia-uvm: Loaded the UVM driver, major device number 508.

nvidia-smi
No devices were found

Any help, please?

Hello again,
I can confirm the regression. I was able to narrow it down: the issue appeared between 6.4.16 (the last working version, and the latest 6.4 release on kernel.org) and 6.5.1.

Here is my kernel config (Linux/x86 6.4.16-gentoo) if needed, on Pastebin.com.

Could a dev please look into this issue? I can run tests for you if needed.

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Hello,
Please find attached the nvidia-bug-report log.
Best regards,
nvidia-bug-report.log.gz (9.8 MB)

This should not affect the nvidia GPU, but your i915 firmware is missing.

As for the nvidia GPU, something seems to be wrong with resizable BAR (rBAR). lspci reports:

Region 1: Memory at 6000000000 (64-bit, prefetchable) [size=256M]
[...]
BAR 1: current size: 8GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB

256MB != 8GB
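
The same comparison can be automated. Below is a minimal Python sketch that takes the text of `lspci -vv -s 01:00.0` (run as root so the capabilities are listed) and flags any BAR whose mapped Region size differs from the "current size" in the Resizable BAR capability. The function name `find_bar_mismatches` is hypothetical, and the sketch assumes the output covers a single device and uses the line formats quoted in this thread.

```python
import re

# Binary multipliers for the unit letters lspci prints (K, M, G, T).
_UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}


def _to_bytes(number: str, unit: str) -> int:
    """Convert e.g. ("256", "M") to a byte count."""
    return int(number) * _UNITS[unit]


def find_bar_mismatches(lspci_text: str) -> list[tuple[int, int, int]]:
    """Return (bar_index, region_bytes, rebar_bytes) for each BAR whose
    mapped region size differs from the Resizable BAR 'current size'.

    Assumes lspci_text describes one device only.
    """
    # Lines like: Region 1: Memory at 6000000000 (64-bit, prefetchable) [size=256M]
    regions = {
        int(idx): _to_bytes(num, unit)
        for idx, num, unit in re.findall(
            r"Region (\d+): Memory at .*?\[size=(\d+)([KMGT])\]", lspci_text
        )
    }
    # Lines like: BAR 1: current size: 8GB, supported: 64MB 128MB ...
    rebar = {
        int(idx): _to_bytes(num, unit)
        for idx, num, unit in re.findall(
            r"BAR (\d+): current size: (\d+)([KMGT])B", lspci_text
        )
    }
    return [
        (i, regions[i], rebar[i])
        for i in sorted(regions.keys() & rebar.keys())
        if regions[i] != rebar[i]
    ]
```

On the failing kernel this would report BAR 1 (256 MiB mapped against 8 GiB in the capability); an empty list means the two views agree.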

Hello,
Thanks for your reply.

Yes, you are right. lspci on the 6.4 kernel reports the correct size:

01:00.0 VGA compatible controller: NVIDIA Corporation GA106M [GeForce RTX 3060 Mobile / Max-Q] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Lenovo GA106M [GeForce RTX 3060 Mobile / Max-Q]
        Physical Slot: 1
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 16
        IOMMU group: 19
        Region 0: Memory at 5f000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 6000000000 (64-bit, prefetchable) [size=8G]
        Region 3: Memory at 6200000000 (64-bit, prefetchable) [size=32M]
        Region 5: I/O ports at 4000 [size=128]
        Expansion ROM at 60000000 [virtual] [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <16us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (downgraded), Width x8 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [b4] Vendor Specific Information: Len=14 <?>
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [250 v1] Latency Tolerance Reporting
                Max snoop latency: 34326183936ns
                Max no snoop latency: 34326183936ns
        Capabilities: [258 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [420 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [bb0 v1] Physical Resizable BAR
                BAR 0: current size: 16MB, supported: 16MB
                BAR 1: current size: 8GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB
                BAR 3: current size: 32MB, supported: 32MB
        Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [d00 v1] Lane Margining at the Receiver <?>
        Capabilities: [e00 v1] Data Link Feature <?>
        Kernel driver in use: nvidia
        Kernel modules: nvidia_drm, nvidia

Please feel free to ask me if you need any other info.

Please provide a dmesg output from the 6.4 kernel.

dmesg from 6.4 kernel
dmesg6_4.txt (75.6 KB)

On kernel 6.8, a link failure is detected (for whatever reason) and the link is retrained on the upstream bridge:

[    0.521517] pci 0000:00:01.0: broken device, retraining non-functional downstream link at 2.5GT/s

I suspect this breaks resource allocation for the resizable BAR, so the nvidia GPU ends up non-functional.
The retraining code was added to drivers/pci/quirks.c around kernel 6.5.

You could change bool pcie_failed_link_retrain(struct pci_dev *dev) to just return true and recompile the kernel to check if my suspicion is correct. Otherwise, I guess you’ll have to bisect the kernel.
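
Before patching, it may help to confirm that the retrain quirk fired on the same boot where the driver failed. Here is a minimal Python sketch that scans dmesg text for the two messages quoted in this thread. The name `summarize_dmesg` is hypothetical, and the patterns are taken only from the log lines above, so other kernel or driver versions may word them differently.

```python
import re

# Message printed by the link-retrain quirk, as quoted in this thread.
RETRAIN_RE = re.compile(
    r"pci (\S+): broken device, retraining non-functional downstream link"
)
# Failure line printed by the NVIDIA kernel module, as quoted in this thread.
RM_FAIL_RE = re.compile(r"NVRM: GPU (\S+): RmInitAdapter failed!")


def summarize_dmesg(dmesg_text: str) -> dict[str, list[str]]:
    """Collect the PCI addresses of retrained bridges and of GPUs whose
    RmInitAdapter call failed, in order of appearance."""
    return {
        "retrained_bridges": RETRAIN_RE.findall(dmesg_text),
        "failed_gpus": RM_FAIL_RE.findall(dmesg_text),
    }
```

If both lists are non-empty on 6.5+ and both are empty on 6.4, that supports the suspicion, though it does not prove causation; the patched-kernel test or a bisect remains the real check.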

Hello,
Nice catch, mate. I patched it as you suggested, and it’s working again on the 6.8 kernel, with a few warnings.

[  238.398533] nvidia: loading out-of-tree module taints kernel.
[  238.398539] nvidia: module license 'NVIDIA' taints kernel.
[  238.398539] Disabling lock debugging due to kernel taint
[  238.398541] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[  238.398541] nvidia: module license taints kernel.
[  238.405806] nvidia-nvlink: Nvlink Core is being initialized, major device number 510

[  238.406263] nvidia 0000:01:00.0: enabling device (0000 -> 0003)
[  238.406308] nvidia 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[  238.456202] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  550.67  Tue Mar 12 23:54:15 UTC 2024
[  238.488299] ACPI Warning: \_SB.NPCF._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20230628/nsarguments-61)
[  238.488331] ACPI Warning: \_SB.PC00.PEG1.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20230628/nsarguments-61)
[  239.420288] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[  239.427230] nvidia-uvm: Loaded the UVM driver, major device number 508.

Now we need to find a less ugly fix. Any ideas?

But it is still not working correctly. I tried to start a game (CS2) to test the GPU, and it freezes as soon as it starts.

Please create a new nvidia-bug-report.log.

For your intel igpu, you’ll need to add the firmware to the kernel image when running without an initrd. Setting
CONFIG_EXTRA_FIRMWARE
to
i915/adlp_dmc_ver2_16.bin i915/adlp_guc_70.1.1.bin
and
CONFIG_EXTRA_FIRMWARE_DIR
to
/lib/firmware
should be necessary.
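
Since a kernel build with CONFIG_EXTRA_FIRMWARE needs each listed file to be present under CONFIG_EXTRA_FIRMWARE_DIR, a quick check before rebuilding can save a failed build. This is a minimal Python sketch; the helper name `missing_firmware` is hypothetical, and it only looks for the exact file names given, so a compressed copy (for example with an added suffix) counts as missing.

```python
from pathlib import Path


def missing_firmware(firmware_dir: str, blobs: str) -> list[str]:
    """Return the entries of a space-separated CONFIG_EXTRA_FIRMWARE value
    that do not exist as regular files under firmware_dir."""
    root = Path(firmware_dir)
    return [name for name in blobs.split() if not (root / name).is_file()]
```

For the values above, the call would be `missing_firmware("/lib/firmware", "i915/adlp_dmc_ver2_16.bin i915/adlp_guc_70.1.1.bin")`; an empty list means both blobs were found.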

I added the Intel GPU firmware, but I got a LOT of issues (screen blinking, black screen, etc.).
After some work I finally gave up on the xf86-intel driver and switched to the modesetting driver.
That fixed nearly all the issues with the Intel firmware.

Here are the nvidia-bug-reports WITH quirks.c patched:
nvidia-bug-report_6.5.log.gz (2.0 MB)

nvidia-bug-report_6.8.log.gz (1.6 MB)

In this case I can use nvidia-drivers without issue on 6.5, but the game freezes on 6.8. I don’t know whether that issue is related.

The “Intel” driver is broken and shouldn’t be used on modern hardware; modesetting is the correct driver. You have now set an Intel-only config, so the nvidia driver isn’t used at all. Please delete /etc/X11/xorg.conf.d/20-modesetting.conf
After a reboot, the nvidia GPU should be in offload mode. Run something on it by prepending
__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia
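
When launching from a script rather than a shell, the same two variables can be set on the child process only. A minimal Python sketch follows; the helper names `offload_env` and `run_offloaded` are hypothetical, and the variable names and values are exactly the ones given above.

```python
import os
import subprocess

# The two variables from the post above; nothing else is assumed.
PRIME_OFFLOAD_ENV = {
    "__NV_PRIME_RENDER_OFFLOAD": "1",
    "__GLX_VENDOR_LIBRARY_NAME": "nvidia",
}


def offload_env(base=None):
    """Return a copy of the given environment (default: the current one)
    with the PRIME render offload variables added."""
    env = dict(os.environ if base is None else base)
    env.update(PRIME_OFFLOAD_ENV)
    return env


def run_offloaded(command):
    """Run command (a list of arguments) with the offload variables set
    for that child process only."""
    return subprocess.run(command, env=offload_env(), check=False)
```

For example, `run_offloaded(["glxinfo"])` mirrors prefixing the variables on the shell command line, without changing the environment of the calling process.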

I can’t try removing 20-modesetting.conf at the moment, but I think you concluded that because the nvidia module had not been recompiled before X started. I recompiled it and loaded it with modprobe after X had started. I can confirm that I did use the nvidia GPU: my CS2 game was running on it, as confirmed with nvidia-smi.

I attached glxinfo output as an example.
__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo
glxinfo.txt.gz (12.7 KB)