How to deal with nvidia-modprobe when switching between nvidia/nouveau

I realise this isn’t a support forum for nouveau, but my question specifically relates to the behaviour of nvidia-modprobe.

I’ve not provided logs or a bug report here as I’m more generally interested in how to tweak the observed behaviour of the NVIDIA userspace driver components (shared libraries).

I’m running Arch Linux and I frequently switch between driver combinations to test various development scenarios.

I have drivers installed for nvidia (545.29.06) + nouveau (mesa) + amdgpu.
I switch between these using kernel boot options to blacklist modules as necessary. This has worked well for me on Fedora 38/39 and I’m now broadening my development/test-bed surface.
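For reference, the sort of boot entry I use looks roughly like this (the exact module list may vary with driver version):

# kernel command line when testing nouveau/amdgpu only
modprobe.blacklist=nvidia,nvidia_modeset,nvidia_uvm,nvidia_drm

# and conversely, when testing the nvidia driver
modprobe.blacklist=nouveau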

I’ve found that when blacklisting the nvidia drivers and running a simple diagnostic (vulkaninfo, eglinfo, gbminfo), the relevant ICD attempts to poll the nvidia driver to see what’s available.

That’s all normal and fine; it does the same for all other ICDs I have installed, and if the driver isn’t running then it’s by definition not available.

Except for nvidia :)
When I run one of these (vulkaninfo in particular), the ICD wants to call out to libGLX_nvidia.so.0, which then automagically tries to load the NVIDIA drivers on demand by way of nvidia-modprobe.

That behaviour is documented as follows:

If the user-space NVIDIA driver component cannot load the kernel module or create the device files itself, it will attempt to invoke the setuid root nvidia-modprobe utility, which will perform these operations on behalf of the non-privileged driver.

While that sounds like a great idea, in this scenario I’m not using nvidia drivers. I blacklisted them at boot time and am actively using nouveau drivers. Obviously, it fails (correctly) to load the nvidia drivers.

vulkaninfo (and friends) report:

ERROR: [Loader Message] Code 0 : loader_scanned_icd_add: Could not get 'vkCreateInstance' via 'vk_icdGetInstanceProcAddr' for ICD libGLX_nvidia.so.0

Now, that’s probably not what I’d expect (though I can live with it). I have Radeon ICDs installed, and no such issue is triggered when I’m not using the AMD drivers.

However… in the case of vulkaninfo it appears to query the driver many times consecutively; this causes nvidia-modprobe to be invoked 32 times, and it takes a solid 42 seconds (and some amount of load) for nvidia-modprobe to repeatedly fail to load drivers that are never going to load.

My question is… (finally)
How can I best adjust this behaviour in this arrangement?

Option 1)
I can temporarily move the nvidia ICDs out of the way. This works and has the desired effect, but it isn’t a robust solution since I’m switching between drivers frequently.
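By “out of the way” I mean something like the following (paths assume the usual manifest locations):

# hide the Vulkan ICD manifest from the loader
mv /usr/share/vulkan/icd.d/nvidia_icd.json{,.bak}
# and the GLVND EGL vendor file, for the eglinfo/gbminfo cases
mv /usr/share/glvnd/egl_vendor.d/10_nvidia.json{,.bak}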

Option 2)
I can move nvidia-modprobe out of the way. The ICD loader still throws an error, but it does so quickly and without anything trying to spin up the nvidia drivers. Unfortunately this also prevents the ICD loader from working correctly even when the nvidia drivers are loaded (when I want them to be).
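Concretely:

mv /usr/bin/nvidia-modprobe /usr/bin/nvidia-modprobe.disabled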

So it seems there is a hard dependency on nvidia-modprobe.

Is there some other way I can:
a) tell the nvidia libs (libGLX_nvidia.so.0 and friends) not to attempt calling out to nvidia-modprobe, or
b) tell nvidia-modprobe to do nothing if the nvidia modules aren’t loaded already (sketched below)?
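For (b), the kind of thing I had in mind is a wrapper along these lines, a rough sketch only (the wrapper itself needn’t be setuid, since the renamed real binary would keep its setuid bit):

#!/usr/bin/env bash
# hypothetical wrapper installed as /usr/bin/nvidia-modprobe (untested sketch);
# assumes the real setuid binary has been renamed to nvidia-modprobe.real.
# Forward only when the nvidia module is already resident; otherwise fail
# fast instead of repeatedly trying to load blacklisted modules.
if grep -q '^nvidia ' /proc/modules; then
    exec /usr/bin/nvidia-modprobe.real "$@"
fi
exit 1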

I did take a look at setting NVreg_ModifyDeviceFiles=0, but I’m not sure it applies to this situation (and it did not appear to help as a kernel boot option).
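For the record, the boot-option form I tried, plus the modprobe.d equivalent:

# on the kernel command line:
nvidia.NVreg_ModifyDeviceFiles=0
# or persistently, in /etc/modprobe.d/nvidia.conf:
options nvidia NVreg_ModifyDeviceFiles=0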

It seems there must already be a mechanism for something along these lines, since I do not encounter this behaviour on Fedora Linux (on the same machine) and, more interestingly, they do not appear to distribute nvidia-modprobe as part of their packages. It isn’t clear to me exactly how this is achieved, but I suspect it has something to do with their nvidia distribution utilising akmods. Certainly I don’t encounter this issue on Fedora when the nvidia drivers are blacklisted.

Any guidance or correction in my understanding would be much appreciated, thank you.

nvidia-modprobe should only be called if the nvidia modules are not loaded and a user process needs them. So it should not have any effect if it’s missing.

That is what I had hoped would be the case, but after moving nvidia-modprobe out of the way and booting with the nvidia drivers, I find that the nvidia ICD still fails to query for nvidia support, just as if the drivers weren’t loaded.

My own application code also fails to locate any supporting Vulkan device (or EGL, GBM) until I put it back.

Please check that nvidia, nvidia-modeset, nvidia-drm are really loaded when calling vulkaninfo as a user. The driver doesn’t need nvidia-modprobe unless Arch does something different. I have no issues when removing it from the path.

Hmm. My apologies, it seems there is a little more to this.

If I boot up with nvidia-modprobe in place and subsequently move it out of the way, then there is no issue and the ICD behaves as it should with the nvidia device present.

However it does not survive a reboot… as if the nvidia drivers are only partially loaded (?)

Having rebooted, I’ve found that I can no longer log in to Wayland (it falls back to X11; this seems to happen from time to time and may not be related).
vulkaninfo again has trouble with the ICD:

ERROR: [Loader Message] Code 0 : loader_scanned_icd_add: Could not get 'vkCreateInstance' via 'vk_icdGetInstanceProcAddr' for ICD libGLX_nvidia.so.0

lsmod reports the following:

nvidia_drm             118784  1
nvidia_modeset        1858152  1 nvidia_drm
nvidia_uvm            3502080  0
nvidia               62390272  2 nvidia_uvm,nvidia_modeset
video                   77842  2 asus_wmi,nvidia_modeset

The X11 session reports a bunch of

NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the system's kernel log for additional error messages ... .. ...

However, prior to X11 starting up I can see the kernel modules being loaded, and udev rules being processed.

Apologies for the lack of logs, I’m transcribing this from one machine to another (the nvidia forums are not letting me log in from the machine where this is occurring).

Once I can log in properly from there I will provide all logs and a report.

There may be a timing issue at play here…
The ICD starts working after a little while, just not immediately upon logging in.

More interestingly, the very act of running nvidia-bug-report.sh appears to kick the drivers into life. I’m going to explore that some more when it isn’t 2am :)

Bug report attached.

nvidia-bug-report.log.gz (502.2 KB)

btw, thank you for confirming the expected behaviour of nvidia-modprobe

I’ve dug into this a little more. Apologies for the deep dive.

There is in fact no timing issue; the ICD loader simply fails for libGLX_nvidia.so.0 on a fresh boot when nvidia-modprobe is not present.

ERROR: [Loader Message] Code 0 : loader_scanned_icd_add: Could not get 'vkCreateInstance' via 'vk_icdGetInstanceProcAddr' for ICD libGLX_nvidia.so.0

However

I found that running nvidia-bug-report.sh would at some point call out to
/sbin/nvidia-debugdump -D

at which point vulkaninfo reports the device successfully, though it still logs:

ERROR while creating surface for extension VK_KHR_xcb_surface : /vulkan-sdk/1.3.268.0/source/Vulkan-Tools/vulkaninfo/vulkaninfo.h:237:vkGetPhysicalDeviceSurfacePresentModesKHR failed with ERROR_UNKNOWN
ERROR while creating surface for extension VK_KHR_xlib_surface : /vulkan-sdk/1.3.268.0/source/Vulkan-Tools/vulkaninfo/vulkaninfo.h:237:vkGetPhysicalDeviceSurfacePresentModesKHR failed with ERROR_UNKNOWN

The bug reporter then calls vulkaninfo itself… it’s important to note that this occurs as the root user (as per the usage semantics of nvidia-debugdump).

It’s this execution as root that finally clears the ERROR_UNKNOWN (for all users)… that must be triggering the drivers to connect the last pieces together.

Note that at no time post-startup do I observe any log entries in the journal, or dmesg regarding this.
I have udev logging set to debug, as well as gdm.
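For anyone reproducing this, udev debug logging can be enabled either transiently or persistently:

# transient, on the kernel command line:
udev.log_level=debug
# persistent, in /etc/udev/udev.conf:
udev_log=debug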

So something is happening during nvidia-debugdump/vulkaninfo when run as root that isn’t happening on startup (or indeed, when starting up X11)…

Exploring nvidia-debugdump further… it’s enough to simply perform the following for the device to show up in vulkaninfo for a non-privileged user:

root]# nvidia-debugdump --list
Found 1 NVIDIA devices
	Device ID:              0
	Device name:            NVIDIA GeForce GTX 970   (*PrimaryCard)
	GPU internal ID:        GPU-0f820b91-4b52-39da-79f9-31a36d336ebb

But ultimately the root user needs to call vulkaninfo (or otherwise query the driver) for any non-privileged user to be able to see the device.

Rewinding a bit… looking at my system startup I see the following udev rules failing:

nvidia: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 255'' failed with exit code 1.

nvidia: Process '/usr/bin/bash -c 'for i in $(cat /proc/driver/nvidia/gpus/*/information | grep Minor | cut -d \  -f 4); do /usr/bin/mknod -Z -m 666 /dev/nvidia${i} c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) ${i}; done'' failed with exit code 1.

This is emitted a few times (nvidia-frontend doesn’t occur in /proc/devices), but the driver otherwise starts up fine:

systemd-udevd[277]: nvidia_drm: Device ready for processing (SEQNUM=4674, ACTION=add)
Nov 28 02:52:25 archon systemd-udevd[277]: nvidia_drm: sd-device-monitor(manager): Passed 156 byte to netlink monitor.
Nov 28 02:52:25 archon (udev-worker)[291]: Inserted module 'nvidia_drm'
Nov 28 02:52:25 archon (udev-worker)[291]: Module 'nvidia' is already loaded
Nov 28 02:52:25 archon (udev-worker)[287]: nvidia_drm: Processing device (SEQNUM=4674, ACTION=add)
Nov 28 02:52:25 archon (udev-worker)[287]: nvidia_drm: Device processed (SEQNUM=4674, ACTION=add)
Nov 28 02:52:25 archon (udev-worker)[287]: nvidia_drm: sd-device-monitor(worker): Passed 156 byte to netlink monitor.
Nov 28 02:52:25 archon kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0

Documentation suggests nvidia-frontend is only used when multiple NVIDIA kernel modules are in play, so that seems above board.
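Still, for what it’s worth: since /proc/devices here appears to list the character device as plain ‘nvidia’, I’d expect a variant of that rule keyed on the exact field to succeed (untested sketch, not the packaged rule):

# untested rework of the failing mknod command; matching the second field
# exactly avoids also picking up nvidia-uvm and similar entries
/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(awk '$2 == "nvidia" {print $1}' /proc/devices) 255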

As for why GDM is starting up in X11 mode rather than Wayland… it’s failing to obtain an EGL display, probably related to the fact that the drivers aren’t fully initialised:

gnome-shell[658]: Running GNOME Shell (using mutter 45.1) as a Wayland display server
gnome-shell[658]: Failed to make thread 'KMS thread' realtime scheduled: GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: Name "org.freedesktop.RealtimeKit1" does not exist
Nov 28 02:52:26 archon gnome-shell[658]: Device '/dev/dri/card0' prefers shadow buffer
gnome-shell[658]: Added device '/dev/dri/card0' (nvidia-drm) using atomic mode setting.
gnome-shell[658]: Failed to initialize accelerated iGPU/dGPU framebuffer sharing: No EGL display
gnome-shell[658]: Created gbm renderer for '/dev/dri/card0'
gnome-shell[658]: Boot VGA GPU /dev/dri/card0 selected as primary
org.gnome.Shell.desktop[658]: Failed to setup: The GPU /dev/dri/card0 chosen as primary is not supported by EGL.

So… the next question becomes…
How is this intended to work on a system where nvidia-modprobe is not deployed? At a guess, something needs to talk to libGLX_nvidia as root at some stage during the boot process. Perhaps another udev rule doing something special?
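If nothing official exists, I could imagine papering over it with a one-shot unit that performs the root-side poke I stumbled on above. A hypothetical sketch (unit name and conditions are mine):

# /etc/systemd/system/nvidia-init-kick.service  (hypothetical workaround)
[Unit]
Description=Query the NVIDIA driver once as root so user sessions see the GPU
ConditionPathExists=/proc/driver/nvidia

[Service]
Type=oneshot
ExecStart=/sbin/nvidia-debugdump --list

[Install]
WantedBy=multi-user.target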

The log looks like the nvidia (drm, modeset) modules are only pulled in by GDM starting, which runs as user ‘gdm’ and so needs nvidia-modprobe. Please embed the modules into the initrd and add them to modules-load.d (or whatever Arch uses).

Thanks @generix
I’ll have more of a play around with that. I gave rebuilding the initramfs with those modules a quick go and it didn’t appear to change GDM behaviour; it’s entirely possible I did this incorrectly (I’m less familiar with mkinitcpio, and with initramfs in general).
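For reference, what I attempted (so someone can tell me if I botched it):

# /etc/mkinitcpio.conf (excerpt)
MODULES=(nvidia nvidia_modeset nvidia_uvm nvidia_drm)

# /etc/modules-load.d/nvidia.conf
nvidia
nvidia_modeset
nvidia_uvm
nvidia_drm

# then regenerate the initramfs
mkinitcpio -P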

What I can tell you is… both nvidia_drm and nvidia_modeset do appear to be loaded before the GDM service starts.

I’ve verified this by:

  • Disabling the gdm service
  • Booting to a console
  • lsmod shows both nvidia_drm and nvidia_modeset as loaded

Behaviour remains the same with GDM and vulkaninfo.
Journal logs also confirm these modules are being loaded ahead of GDM.

Well… it seems I now have the same issue on Fedora 39 with the latest 545.29.06 drivers, which do now seem to include nvidia-modprobe.

On both Arch and Fedora, moving nvidia-modprobe out of the way solves the case where the nvidia drivers are blacklisted… but when I do choose to boot with the nvidia drivers, they are not fully initialised at boot time (I can force them to initialise by manually running the renamed nvidia-modprobe, running nvidia-debugdump, or running vulkaninfo as root).
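To spell out the workaround: any one of these, run once as root after boot, brings the device up for every user thereafter:

# run as root; each has the same effect on my systems
/usr/bin/nvidia-modprobe.disabled     # i.e. the renamed binary
/sbin/nvidia-debugdump --list
vulkaninfo > /dev/null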

Worth noting that on Fedora, the nvidia drivers are installed as akmods.