Ubuntu 21.10 with GeForce 1650: nvidia-drm fails to allocate NvKmsKapiDevice, and fails to register device

This computer is a Dell XPS 15 7590 with an OLED screen running Ubuntu-Mate 21.10; it has an Intel UHD 630 as its on-chip graphics card, with a GeForce GTX 1650 Mobile / Max-Q in addition.
At present, all drivers are the auto-installed ones, and the active NVIDIA driver is version 470.

uname -a

Linux psyche 5.13.0-20-generic #20-Ubuntu SMP Fri Oct 15 14:21:35 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Between the GRUB menu and my login screen, the following errors occur:

[drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice 
[drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device

Oddly, booting proceeds normally from there, though a little reading through the bug report shows this message repeated:

NVRM: GPU 0000:01:00.0: RmInitAdapter failed! 

All functionality not dependent on the NVIDIA GPU is still present, and I have decent functionality for everything but gaming and, of course, scientific applications of the GPU.

Output of lshw -c video:

  *-display                 
       description: 3D controller
       product: TU117M [GeForce GTX 1650 Mobile / Max-Q]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:16 memory:ec000000-ecffffff memory:c0000000-cfffffff memory:d0000000-d1ffffff ioport:3000(size=128) memory:ed000000-ed07ffff
  *-display
       description: VGA compatible controller
       product: CoffeeLake-H GT2 [UHD Graphics 630]
       vendor: Intel Corporation
       physical id: 2
       bus info: pci@0000:00:02.0
       version: 02
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress msi pm vga_controller bus_master cap_list
       configuration: driver=i915 latency=0

Output of lspci | grep NVIDIA:

01:00.0 3D controller: NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile / Max-Q] (rev a1)

I note that I am able to run nvidia-settings both from the command line and through the GUI, but all I am able to do is choose the profile (Intel, NVIDIA performance mode, NVIDIA on-demand), with no other options. Notably, the profile is set to NVIDIA performance mode!

nvidia-smi returns No devices were found, unsurprisingly.

It escapes me, somewhat, as to what’s going on here; the computer is aware of the GPU on a physical level, it would seem, and is running the appropriate driver… but still can’t actually find the device.

Note: this doesn’t seem like a hardware issue, since on Windows (dual boot), everything runs without a hitch.
Simple fixes, like manually installing the latest driver or disabling nouveau, do nothing.

nvidia-bug-report.log.gz (185.6 KB)

I don’t really know what’s going on, since Windows is not affected. Also, there is no general incompatibility known for your notebook model, so it should work. The only thing that comes to mind is resetting the mainboard, which requires detaching the battery (and pressing and holding the power button for 20 seconds to discharge).
Sidenote: please delete xorg.conf, because once the driver is working, you’ll run into a black screen due to it.

The only thing about the model I know that could be causing problems is that the screen is an OLED; even in 20.04 LTS, the screen had trouble, esp. with changing the brightness.
Two questions, so that I understand your propositions:
1: What about the error suggests a problem in the mainboard?
2: What is problematic in xorg.conf? The entry that corresponds to the NVIDIA GPU seems fairly non-troublesome:

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    SubSection     "Display"
        Depth       24
    EndSubSection

From answers to some similar problems, I note that my config doesn’t seem to identify the PCI of the device, but this is the autogen config from running nvidia-xconfig.

Your help is very appreciated, Generix. Thanks!

  1. The RmInitAdapter failed message points to a low-level bus problem; from experience, I know that resetting the mainboard sometimes helps with inexplicable errors.
  2. Your display is connected to the Intel iGPU, but the xorg.conf sets up an NVIDIA-only config. So if the driver worked, you would get no output on the internal screen.
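
To illustrate point 2, here is a minimal Python sketch that lists the `Driver` entries of all `Section "Device"` blocks in an xorg.conf and flags the case where every one of them is `nvidia`. The function names `device_drivers` and `is_nvidia_only` are made up for this example, and the sample text is just the Device section quoted earlier in the thread. It only detects the NVIDIA-only situation; it does not say what a working hybrid configuration should contain.

```python
import re

def device_drivers(xorg_conf_text):
    """Return the Driver value of every Section "Device" block, in order."""
    drivers = []
    in_device = False
    for raw in xorg_conf_text.splitlines():
        line = raw.strip()
        if re.match(r'Section\s+"Device"', line, re.IGNORECASE):
            in_device = True
        elif re.match(r'EndSection\b', line, re.IGNORECASE):
            in_device = False
        elif in_device:
            m = re.match(r'Driver\s+"([^"]+)"', line, re.IGNORECASE)
            if m:
                drivers.append(m.group(1))
    return drivers

def is_nvidia_only(xorg_conf_text):
    """True if at least one Device section exists and all of them use nvidia."""
    drivers = device_drivers(xorg_conf_text)
    return bool(drivers) and all(d == "nvidia" for d in drivers)

# Device section as quoted earlier in this thread.
SAMPLE_CONF = '''
Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
EndSection
'''

if __name__ == "__main__":
    print(device_drivers(SAMPLE_CONF), is_nvidia_only(SAMPLE_CONF))
```

On a real system the text would come from reading the xorg.conf file itself (its location is not stated in this thread, so the sketch does not hard-code a path).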

Did the NVIDIA GPU work with 20.04? If so, this might also be a kernel issue.

Thanks for the explanation!
The screen issues (brightness and tearing) kept me from ever even bothering to see if it was working in 20.04. I wouldn’t say it’s ever fully worked on any Linux distro I’ve tried; there was a brief moment when I could use it for scientific computing on elementary OS 5, which is built off of Ubuntu 18.04 LTS, but I’ve never had both CUDA enablement and reasonable graphics, no matter the kernel.

Digging a bit into the matter taught me that brightness control for OLED displays only works in kernels 5.12 and up.
You could try upgrading to a 5.14 kernel; you will need four packages:
linux-headers-XX
linux-headers-XX-generic
linux-image-unsigned-XX-generic
linux-modules-XX-generic
from https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.14.15/
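
As a quick way to check whether a given kernel is new enough, here is a small Python sketch that compares a kernel release string against the 5.12 threshold mentioned above. The threshold is taken from this post rather than from kernel documentation, and the names `kernel_version` and `meets_minimum` are invented for the example.

```python
import platform
import re

# Minimum kernel for OLED brightness control, as stated in this thread.
MIN_OLED_BRIGHTNESS_KERNEL = (5, 12)

def kernel_version(release):
    """Extract (major, minor) from a release string such as '5.13.0-20-generic'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))

def meets_minimum(release, minimum=MIN_OLED_BRIGHTNESS_KERNEL):
    """True if the release is at or above the given (major, minor) version."""
    return kernel_version(release) >= minimum

if __name__ == "__main__":
    # platform.release() returns the same release string that `uname -r` prints.
    current = platform.release()
    print(current, meets_minimum(current))
```

With the release string from the first post, `meets_minimum("5.13.0-20-generic")` evaluates to `True`, so that kernel is already above the threshold; moving to 5.14 is a further experiment rather than a requirement for brightness control.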

Hello!
Same problem for me: Dell XPS 9700 with RTX 2060 Max-Q / Intel iGPU
OS: Ubuntu 21.10 with kernel 5.13.0-20-generic
NVIDIA driver:
nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 495.44 Fri Oct 22 06:05:22 UTC 2021

Looking at the dmesg logs:

[    4.020528] nvidia-gpu 0000:01:00.3: i2c timeout error e0000000
[    4.020533] ucsi_ccg 0-0008: i2c_transfer failed -110
[    4.020536] ucsi_ccg 0-0008: ucsi_ccg_init failed - -110
[    4.020541] ucsi_ccg: probe of 0-0008 failed with error -110
....
[    4.291675] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0xffff:1433)
[    4.291709] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    4.291783] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[    4.291894] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device

I’m not a kernel expert, but early in the boot the logs show:
[ 0.632595] pci 0000:01:00.0: can't claim BAR 6 [mem 0xfff80000-0xffffffff pref]: no compatible bridge window

This is the full log for that device:

$ sudo dmesg | grep 0000:01:00.0
[    0.525579] pci 0000:01:00.0: [10de:1f12] type 00 class 0x030000
[    0.525601] pci 0000:01:00.0: reg 0x10: [mem 0x72000000-0x72ffffff]
[    0.525619] pci 0000:01:00.0: reg 0x14: [mem 0x60000000-0x6fffffff 64bit pref]
[    0.525638] pci 0000:01:00.0: reg 0x1c: [mem 0x70000000-0x71ffffff 64bit pref]
[    0.525650] pci 0000:01:00.0: reg 0x24: [io  0x3000-0x307f]
[    0.525662] pci 0000:01:00.0: reg 0x30: [mem 0xfff80000-0xffffffff pref]
[    0.525746] pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
[    0.525803] pci 0000:01:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:00:01.0 (capable of 126.016 Gb/s with 8.0 GT/s PCIe x16 link)
[    0.576515] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    0.576515] pci 0000:01:00.0: vgaarb: bridge control possible
[    0.632595] pci 0000:01:00.0: can't claim BAR 6 [mem 0xfff80000-0xffffffff pref]: no compatible bridge window
[    0.633184] pci 0000:01:00.0: BAR 6: assigned [mem 0x73080000-0x730fffff pref]
[    0.634251] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0
[    0.634331] pci 0000:01:00.2: D0 power state depends on 0000:01:00.0
[    0.634646] pci 0000:01:00.3: D0 power state depends on 0000:01:00.0
[    0.636372] pci 0000:01:00.0: Adding to iommu group 1
[    3.608761] nvidia 0000:01:00.0: enabling device (0002 -> 0003)
[    3.608958] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    4.291675] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0xffff:1433)
[    4.291709] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    7.606617] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0xffff:1433)
[    7.606683] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    7.717250] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0xffff:1433)
[    7.717300] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    7.824112] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0xffff:1433)
[    7.824149] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    7.955013] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0xffff:1433)
[    7.959563] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
$ nvidia-smi 
No devices were found
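
For anyone comparing their own logs against the ones above, here is a minimal Python sketch that tallies the relevant failure lines in saved dmesg output and collects the codes printed after `RmInitAdapter failed!`. The function name `summarize_nvidia_errors` is made up for this example, and the patterns are simply copied from the log lines quoted in this thread, not from any NVIDIA documentation, so other driver versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns copied from the log lines quoted in this thread.
PATTERNS = {
    "rm_init_adapter": re.compile(
        r"NVRM: GPU (\S+): RmInitAdapter failed!(?: \(([^)]*)\))?"
    ),
    "kapi_alloc": re.compile(r"Failed to allocate NvKmsKapiDevice"),
    "drm_register": re.compile(r"\[nvidia-drm\].*Failed to register device"),
}

def summarize_nvidia_errors(dmesg_text):
    """Return (counts per error kind, sorted list of RmInitAdapter codes)."""
    counts = Counter()
    codes = set()
    for line in dmesg_text.splitlines():
        for name, pattern in PATTERNS.items():
            m = pattern.search(line)
            if m:
                counts[name] += 1
                # Only the RmInitAdapter pattern has a code group.
                if name == "rm_init_adapter" and m.group(2):
                    codes.add(m.group(2))
    return dict(counts), sorted(codes)

# Sample input: four lines quoted earlier in this post.
SAMPLE = """\
[    4.291675] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0xffff:1433)
[    4.291709] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    4.291783] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[    4.291894] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
"""

if __name__ == "__main__":
    counts, codes = summarize_nvidia_errors(SAMPLE)
    print(counts, codes)
```

To use it on a live system, the output of `sudo dmesg` could be saved to a file and its contents passed in as `dmesg_text`.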

Could someone help us?

Dell told me to apply this firmware:
https://www.dell.com/support/home/en-eu/drivers/driversdetails?driverid=3v8m8&driverid=3v8m8&lwp=rt

Unfortunately, I can’t apply this firmware from Linux/Ubuntu, and I don’t know whether it would solve the issue…

The “BAR 6” issue is a red herring; it’s a very common BIOS bug with no ill effects.
Looks like the VBIOS update is fixing some critical bug:
https://www.dell.com/community/XPS/RTX-2060-keeps-disappearing-from-device-manager/td-p/8034428
So it’s worth a shot.
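
On the BAR 6 point: in the log above, the "can't claim BAR 6" line is followed shortly after by a "BAR 6: assigned" line for the same device. Here is a rough Python sketch that checks for that pairing in saved dmesg text. The function name `bar_claim_status` is invented for this example and the patterns are modelled only on the two lines quoted in this thread. A `True` result just means the kernel reassigned the BAR; it says nothing about whether the GPU itself is healthy.

```python
import re

# Patterns modelled on the two BAR 6 lines quoted in this thread.
CLAIM_FAIL = re.compile(r"pci (\S+): can't claim BAR (\d+) ")
ASSIGNED = re.compile(r"pci (\S+): BAR (\d+): assigned ")

def bar_claim_status(dmesg_text):
    """Map (device, BAR number) -> True if a failed claim was later assigned."""
    status = {}
    for line in dmesg_text.splitlines():
        m = CLAIM_FAIL.search(line)
        if m:
            status[(m.group(1), int(m.group(2)))] = False
            continue
        m = ASSIGNED.search(line)
        if m:
            key = (m.group(1), int(m.group(2)))
            # Only track assignments that follow a failed claim.
            if key in status:
                status[key] = True
    return status

# The two relevant lines from the log posted above.
SAMPLE = """\
[    0.632595] pci 0000:01:00.0: can't claim BAR 6 [mem 0xfff80000-0xffffffff pref]: no compatible bridge window
[    0.633184] pci 0000:01:00.0: BAR 6: assigned [mem 0x73080000-0x730fffff pref]
"""

if __name__ == "__main__":
    print(bar_claim_status(SAMPLE))
```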

Two updates:
I was wrong about it working on Windows. I can’t get any valuable info out of Windows, unsurprisingly, but there’s a failed detection going on there too that is just being obscured from me.
I tried the method of taking out the battery for a mainboard reset. It seemed to do… something, since on the first startup, I saw a flurry of information. The device turned off due to low battery before reaching login (I guess I hadn’t checked that). On plugging in and restarting…
Same error. No changes.
Happy to provide any new information or try to dig through Windows for it, now that I know the premise of my post was flawed.
As always, Generix, much is owed you.

Doesn’t sound good. You should check the BIOS settings for the graphics used; maybe the NVIDIA GPU got disabled during battery removal.
If not, please check Windows’ Device Manager. If the NVIDIA GPU driver reports code 43, it’s simply broken.

It does indeed report Code 43 :(
Lucky for me the NVIDIA warranty outlasts the Dell warranty.

After many calls to the Dell support team and re-installing Windows, the latest drivers, firmware, etc., Device Manager still shows error 43.
My laptop is under warranty and a tech will replace the motherboard + GPU.
Dell is very quiet about this issue… but I think a lot of people will be affected.

About the firmware update:
Maybe it was too late for the firmware patch to have an effect. I don’t understand exactly what this hardware issue is, but there are a lot of complaints about this error 43. Is it NVIDIA’s fault? The motherboard assembler’s?

Hoping this is the end of the nightmare for me and maybe this post will help other people.