Debian 12, 525.147.05 drivers, no files under /dev

I have been facing problems with getting my GPU running for quite some time now. Previously, everything was working well, but I suspect an update broke something.

I am on the 525.147.05 drivers installed from the Debian repositories using the nvidia-driver package. At first, running nvidia-smi reported No devices found, and dmesg showed a Failed to allocate NvKmsKapiDevice error along with an RmInitAdapter failure.
Recently, however, I get the following:

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running

Looking at the status of nvidia-persistenced, it says:

Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 115 has read and write permis…

Checking /dev, it indeed doesn’t have those files.
I have tried older drivers, tried 535 from the backports repository, and tried a clean reinstall of the 525 drivers, all to no avail.
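For anyone following along, the /dev check described above can be scripted roughly as below. This is only a sketch: the function name count_nvidia_nodes and its optional directory argument are made up for illustration and are not part of any NVIDIA tool.

```shell
# Hypothetical helper: count NVIDIA device nodes (nvidia0, nvidiactl, ...)
# in a directory, /dev by default.
count_nvidia_nodes() {
    dev_dir="${1:-/dev}"
    count=0
    for node in "$dev_dir"/nvidia*; do
        # An unmatched glob stays literal, so make sure the path exists.
        [ -e "$node" ] && count=$((count + 1))
    done
    echo "$count"
}
```

Running count_nvidia_nodes with no argument looks at /dev; a result of 0 matches what nvidia-persistenced is complaining about.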

I am coming to this forum as sort of my last hope, since no one who has been kind enough to help me so far has been able to pinpoint what’s going on.

Here are the logs from nvidia-bug-report.sh
nvidia-bug-report.log.gz (240.7 KB)

It seems that the NVIDIA module is not loaded, which makes me suspect that it didn’t build. Have you checked for the presence of the module(s) under /lib/modules/$YOUR_KERNEL_VERSION/updates/dkms?

If they’re not there, the most common cause is that your kernel headers are not installed.
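The two checks suggested here (DKMS-built modules and kernel headers) could be combined in a small shell function like the one below. Treat it as a sketch under assumptions: check_nvidia_dkms is a made-up name, the optional root arguments exist only so it can be pointed at a test directory, and the headers path assumes the usual Debian layout under /usr/src.

```shell
# Hypothetical helper: report whether DKMS-built NVIDIA modules and the
# matching kernel headers are present for a given kernel version.
# Arguments (all optional): kernel version, modules root, headers root.
check_nvidia_dkms() {
    kver="${1:-$(uname -r)}"
    mod_dir="${2:-/lib/modules}/$kver/updates/dkms"
    hdr_dir="${3:-/usr/src}/linux-headers-$kver"   # assumed Debian layout
    status=0

    # Match both plain and compressed modules (.ko, .ko.xz, ...).
    if ls "$mod_dir"/nvidia*.ko* >/dev/null 2>&1; then
        echo "modules: found in $mod_dir"
    else
        echo "modules: none in $mod_dir"
        status=1
    fi

    if [ -d "$hdr_dir" ]; then
        echo "headers: found at $hdr_dir"
    else
        echo "headers: missing ($hdr_dir)"
        status=1
    fi
    return "$status"
}
```

Called without arguments it checks the running kernel. A non-zero return only says that one of the two is missing; it cannot tell whether a module that is present actually loads.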

So, I do have the modules under the mentioned location. The following are present:

  1. nvidia-current-drm.ko
  2. nvidia-current-modeset.ko
  3. nvidia-current-peermem.ko
  4. nvidia-current-uvm.ko
  5. nvidia-current.ko

Also, the Linux headers are installed; I checked with ls -l /usr/src/linux-headers-$(uname -r)

Please disable secure boot in the BIOS.

I have disabled secure boot, which has changed the error I am now getting.
When running nvidia-smi, it says No devices found.
Running sudo dmesg | grep -i nvidia shows the following:

[    0.030464] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.5.0-0.deb12.4-amd64 root=UUID=bd7a24c0-9057-49f1-ab49-48506df0c89d ro rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1 quiet splash
[    5.057323] nvidia: loading out-of-tree module taints kernel.
[    5.057330] nvidia: module license 'NVIDIA' taints kernel.
[    5.057333] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    5.057334] nvidia: module license taints kernel.
[    5.238540] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[    5.239182] nvidia 0000:01:00.0: enabling device (0006 -> 0007)
[    5.239268] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    5.312024] audit: type=1400 audit(1708531242.821:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=676 comm="apparmor_parser"
[    5.312027] audit: type=1400 audit(1708531242.821:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=676 comm="apparmor_parser"
[    6.004662] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.147.05  Wed Oct 25 20:21:31 UTC 2023
[    6.641537] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    6.673948] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[    6.674023] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device

Here is the bug report I generated after disabling secure boot.
nvidia-bug-report.log.gz (271.6 KB)

[    6.673695] ACPI BIOS Error (bug): AE_AML_BUFFER_LIMIT, Field [TMPB] at bit offset/length 1572864/32768 exceeds size of target Buffer (262144 bits) (20230331/dsopcode-198)
[    6.673699] ACPI Error: Aborting method \_SB.PCI0.PEG0.PEGP._ROM due to previous error (AE_AML_BUFFER_LIMIT) (20230331/psparse-529)
[    6.673716] NVRM: GPU 0000:01:00.0: Failed to copy vbios to system memory.
[    6.673809] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x30:0xffff:974)
[    6.673853] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

With that message, the gpu is most likely broken. You might want to install Windows to double-check.
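The lines quoted above are easy to miss in a long dmesg. A filter along these lines pulls them out; it is only a sketch, nvidia_init_errors is a made-up name, and the pattern list covers just the messages that appear in this thread, not every way the driver can fail.

```shell
# Hypothetical helper: filter kernel log text on stdin down to the
# failure lines quoted in this thread.
nvidia_init_errors() {
    grep -E 'AE_AML_BUFFER_LIMIT|Failed to copy vbios|RmInitAdapter failed|rm_init_adapter failed|Failed to allocate NvKmsKapiDevice'
}
```

Typical use would be sudo dmesg | nvidia_init_errors. No output means none of these particular messages were logged.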

Alright, I’ll check and post the findings here this weekend

I guess you should re-enable secure boot to disable the driver meanwhile.

Sorry, I totally forgot to update this thread!
So over the last week I installed Windows and found out that my GPU might really be dead. Device Manager shows that the Nvidia GPU did not start because of errors it reported. Reinstalling the latest drivers from Nvidia’s website did not help either.

The exact code reported was code 43, with error code shown as 0000002B. Not sure if those numbers mean much to anyone here, but yeah, that’s what I found.

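Code 43 means Windows stopped the device because it reported problems. Together with the vbios errors above, the gpu is most likely dead.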

Well then, thank you for your time. I’ll see what steps I can take from here :(

Since it’s an issue with the vbios, there’s a very slim chance this can be fixed with a reflash. To check you would need to use nouveau to debug:
https://forums.developer.nvidia.com/t/nvidia-geforce-gtx-1650-no-external-monitor-not-detected-in-xrandr-opensuse-leap-15-3/215296/19

I tried my best to follow the instructions, and here’s what I got in dmesg:
dmesg.txt (91.3 KB)

[    2.161249] nouveau 0000:01:00.0: bios: trying PRAMIN...
[    2.161253] nouveau 0000:01:00.0: bios: ... not enabled
[    2.161254] nouveau 0000:01:00.0: bios: trying PROM...
[    2.162340] nouveau 0000:01:00.0: bios: 00000000: ROM signature (0000) unknown
[    2.162341] nouveau 0000:01:00.0: bios: image 0 invalid
[    2.162343] nouveau 0000:01:00.0: bios: scored 0
[    2.162344] nouveau 0000:01:00.0: bios: trying ACPI...
[    2.162616] nouveau 0000:01:00.0: bios: 00000000: ROM signature (0000) unknown
[    2.162618] nouveau 0000:01:00.0: bios: image 0 invalid
[    2.162618] nouveau 0000:01:00.0: bios: scored 0
[    2.162619] nouveau 0000:01:00.0: bios: trying ACPI...
[    2.162865] nouveau 0000:01:00.0: bios: 00000000: ROM signature (0000) unknown
[    2.162866] nouveau 0000:01:00.0: bios: image 0 invalid
[    2.162867] nouveau 0000:01:00.0: bios: scored 0
[    2.162867] nouveau 0000:01:00.0: bios: trying PCIROM...
[    2.162878] nouveau 0000:01:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0x0000
[    2.162883] nouveau 0000:01:00.0: bios: trying PLATFORM...
[    2.162884] nouveau 0000:01:00.0: bios: unable to locate usable image

nouveau reads nothing but zeros, so I guess the rom is dead.
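The "expecting 0xaa55, got 0x0000" line in the log is a check on the first two bytes of the ROM image: a valid PCI expansion ROM starts with the bytes 0x55 0xAA. The same check can be run by hand on any vbios image saved to disk. The sketch below assumes such a file exists; has_pci_rom_signature is a made-up name.

```shell
# Hypothetical helper: succeed if the file given as $1 starts with the
# PCI expansion ROM signature bytes 0x55 0xAA (0xaa55 read little-endian).
has_pci_rom_signature() {
    # Dump the first two bytes as hex and strip od's spacing.
    sig=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "55aa" ]
}
```

For example, has_pci_rom_signature vbios.rom && echo "signature ok", where vbios.rom is a placeholder file name. An image that is all zeros, like the one nouveau is reading here, fails the check.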

Is that something a flash would be able to fix as you suggested before? I’m not at all familiar with GPU architecture unfortunately

If the flash rom is intact and only the vbios image stored on it is corrupt, a reflash might be able to fix it. In your case, the flash rom itself seems to be broken, so there’s nothing to flash.
You could try to reflash it anyway using nvflash, or try loading the vbios from a file:
https://blog.umito.nl/2014/04/13/getting-a-romless-mxm-card-to-work-on-ubuntu-in-a-laptop-with-no-bios-support-for-it.html
This should be the correct vbios file for your laptop:
https://www.techpowerup.com/vgabios/222797/222797

Tried running nvflash, it’s complaining about no EEPROM being found or supported. Do I need the nouveau module installed and loaded during this process?

So reflashing won’t work, the flash rom is gone.
You can only try to load it from disk on every boot.

That doesn’t sound like a good solution, what do you think?

Of course not, it’s a fiddly makeshift to keep a notebook with an nvidia gpu alive. Two alternatives:

  1. keep using the notebook without the nvidia gpu.
  2. buy a new one.