525.85.12 driver fails to access mobile 3080 Ti

Recently Debian pushed the 525.85.12 driver into the Testing release, and after that happened, I lost access to my NVIDIA GPU (the embedded Intel one still works).

There isn’t much information about it in journalctl. The key lines seem to be:

$ sudo journalctl -b0 -p debug -u nvidia-persistenced
Feb 26 17:48:31 aw systemd[1]: Starting nvidia-persistenced.service - NVIDIA Persistence Daemon...
Feb 26 17:48:31 aw nvidia-persistenced[681]: Started (681)
Feb 26 17:48:36 aw nvidia-persistenced[681]: device 0000:01:00.0 - failed to open.
Feb 26 17:48:38 aw systemd[1]: Started nvidia-persistenced.service - NVIDIA Persistence Daemon.

After the module loads, the driver attempts to load firmware on every attempt to access the GPU. For example, whenever I try to query NVIDIA-related info (e.g. via glxinfo):

$ __GLX_VENDOR_LIBRARY_NAME=nvidia __NV_PRIME_RENDER_OFFLOAD=1 __VK_LAYER_NV_optimus=NVIDIA_only glxinfo
name of display: :0
X Error of failed request:  BadValue (integer parameter out of range for operation)
  Major opcode of failed request:  152 (GLX)
  Minor opcode of failed request:  24 (X_GLXCreateNewContext)
  Value in failed request:  0x0
  Serial number of failed request:  50
  Current serial number in output stream:  51

The following entries are logged:

Feb 26 22:15:10 aw kernel: nvidia 0000:01:00.0: firmware: direct-loading firmware nvidia/525.85.12/gsp_tu10x.bin
Feb 26 22:15:12 aw kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x0:1835)
Feb 26 22:15:12 aw kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Feb 26 22:15:12 aw kernel: nvidia 0000:01:00.0: firmware: direct-loading firmware nvidia/525.85.12/gsp_tu10x.bin
Feb 26 22:15:14 aw kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x0:1835)
Feb 26 22:15:14 aw kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Feb 26 22:15:14 aw kernel: nvidia 0000:01:00.0: firmware: direct-loading firmware nvidia/525.85.12/gsp_tu10x.bin
Feb 26 22:15:17 aw kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x0:1835)
Feb 26 22:15:17 aw kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Feb 26 22:15:17 aw kernel: nvidia 0000:01:00.0: firmware: direct-loading firmware nvidia/525.85.12/gsp_tu10x.bin
Feb 26 22:15:19 aw kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x0:1835)
Feb 26 22:15:19 aw kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

It takes about 10 seconds for glxinfo to conclude that the info cannot be fetched. Likewise, it takes about 10 seconds to load GDM, and to log into an Xorg session from GDM.

Since there is not much info pointing at where the issue lies, can anyone suggest where someone who is not a graphics or systems developer could look?

I tried to find something relevant on Google, but everything I found so far was unrelated to the open-source kernel module, and always had some extra info logged before the failure to open the device.

More info on my hardware setup is available in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1032003

nvidia-bug-report.log.gz (142.0 KB)

Added the bug report log, since I couldn’t attach it while the post was on premoderation.

That’s the wrong driver: you’re using the “open kernel modules”.

I am unsure what that means. Is 525.85.12’s userspace part usable only with the non-open kernel module? Or is the open kernel module supposed to co-exist with the non-open one?

Debian’s maintainers specified nvidia-driver’s package dependency this way:

nvidia-kernel-dkms (= 525.85.12-1) | nvidia-kernel-525.85.12 | nvidia-open-kernel-525.85.12 | nvidia-open-kernel-525.85.12, nvidia-support

Which implies that either the open-source or the non-open-source module will do. For reasons unknown to me, the upgrade path switched from the non-open module (which was the only option available in the previous version) to the open one.

edit: switching to the non-open kernel module fixed the issue. However, it’d be nice to understand what the issue with the open one is.

By default, the open kernel modules only work on compute hardware (the former Tesla line). They can also be enabled on all Turing and newer GPUs by setting a module option, but they are not feature-complete there, so they shouldn’t be used, especially on mobile GPUs:
https://github.com/NVIDIA/open-gpu-kernel-modules
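For reference, the module option being referred to is presumably `NVreg_OpenRmEnableUnsupportedGpus` (documented in the README of the repository linked above for 525-era drivers); a minimal sketch of a modprobe.d fragment that opts a GeForce/Turing+ GPU into the open modules would look like:

```
# /etc/modprobe.d/nvidia-open.conf (hypothetical filename)
# Opt the open kernel modules in on GPUs outside the default (datacenter)
# support list. NVreg_OpenRmEnableUnsupportedGpus is the option named in the
# open-gpu-kernel-modules README for this driver generation; verify it against
# the README of your installed version before using.
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1
```

Given the reply above, though, doing this on a mobile GPU is exactly what is being advised against.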

