How to deal with nvidia-modprobe when switching between nvidia/nouveau

tim.kane · November 27, 2023, 5:49am

I’ve dug into this a little more. Apologies for the deep dive.

There is in fact no timing issue. The ICD loader simply fails for libGLX_nvidia.so.0 on a fresh boot when nvidia-modprobe is not present:

ERROR: [Loader Message] Code 0 : loader_scanned_icd_add: Could not get 'vkCreateInstance' via 'vk_icdGetInstanceProcAddr' for ICD libGLX_nvidia.so.0

However, I found that running nvidia_bug_report.sh would at some point call out to
/sbin/nvidia-debugdump -D

at which point vulkaninfo reports the device successfully, though it still logs:

ERROR while creating surface for extension VK_KHR_xcb_surface : /vulkan-sdk/1.3.268.0/source/Vulkan-Tools/vulkaninfo/vulkaninfo.h:237:vkGetPhysicalDeviceSurfacePresentModesKHR failed with ERROR_UNKNOWN
ERROR while creating surface for extension VK_KHR_xlib_surface : /vulkan-sdk/1.3.268.0/source/Vulkan-Tools/vulkaninfo/vulkaninfo.h:237:vkGetPhysicalDeviceSurfacePresentModesKHR failed with ERROR_UNKNOWN

The bug reporter then calls vulkaninfo itself… It’s important to note that this happens as the root user (as per the usage semantics of nvidia-debugdump).

It’s this execution as root that finally clears the ERROR_UNKNOWN (for all users)… That must be triggering the drivers to connect the last pieces together.

Note that at no time post-startup do I observe any log entries in the journal or dmesg regarding this.
I have udev logging set to debug, as well as gdm.

So something is happening during nvidia-debugdump/vulkaninfo when run as root, that isn’t happening on startup (or indeed, from starting up X11)…

Exploring nvidia-debugdump further… It’s enough to simply perform the following for the device to show up in vulkaninfo for a non-privileged user.

root]# nvidia-debugdump --list
Found 1 NVIDIA devices
	Device ID:              0
	Device name:            NVIDIA GeForce GTX 970   (*PrimaryCard)
	GPU internal ID:        GPU-0f820b91-4b52-39da-79f9-31a36d336ebb

But ultimately the root user needs to call vulkaninfo (or otherwise query the driver) for any non-privileged user to be able to see the device.
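
One way to test the guess that the root-run query is what creates the device nodes (an assumption on my part, nothing in the logs confirms it) is to check the nodes as a normal user before and after running nvidia-debugdump as root. A minimal sketch, assuming the usual single-GPU node names nvidiactl and nvidia0; the function name and the directory argument are made up for illustration:

```shell
#!/usr/bin/env bash
# List expected NVIDIA device nodes that are missing or not read/write
# for the current user. Returns non-zero if any problem is found.
# $1: directory to inspect, defaults to /dev (hypothetical parameter,
#     only there so the check can be tried on a scratch directory)
# remaining args: node names, defaults to nvidiactl and nvidia0
#     (assumed names for a single-GPU system)
missing_nvidia_nodes() {
    local dir="${1:-/dev}"
    shift || true
    local nodes=("$@")
    [ "${#nodes[@]}" -gt 0 ] || nodes=(nvidiactl nvidia0)
    local n status=0
    for n in "${nodes[@]}"; do
        if [ ! -e "$dir/$n" ]; then
            echo "missing: $dir/$n"
            status=1
        elif [ ! -r "$dir/$n" ] || [ ! -w "$dir/$n" ]; then
            echo "no access: $dir/$n"
            status=1
        fi
    done
    return "$status"
}
```

Calling missing_nvidia_nodes as the non-privileged user right after boot, and again after the root-run nvidia-debugdump --list, should show whether anything under /dev changed between the two.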

Rewinding a bit… looking at my system startup I see the following udev rules failing:

nvidia: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 255
'' failed with exit code 1.

nvidia: Process '/usr/bin/bash -c 'for i in $(cat /proc/driver/nvidia/gpus/*/information | grep Minor | cut -d \  -f 4); do /usr/bin/mknod -Z -m 666 /dev/nvidia${i} c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) ${i}; done'' failed with exit code 1.

This is emitted a few times (nvidia-frontend doesn’t appear in /proc/devices), but the driver otherwise starts up fine:

systemd-udevd[277]: nvidia_drm: Device ready for processing (SEQNUM=4674, ACTION=add)
Nov 28 02:52:25 archon systemd-udevd[277]: nvidia_drm: sd-device-monitor(manager): Passed 156 byte to netlink monitor.
Nov 28 02:52:25 archon (udev-worker)[291]: Inserted module 'nvidia_drm'
Nov 28 02:52:25 archon (udev-worker)[291]: Module 'nvidia' is already loaded
Nov 28 02:52:25 archon (udev-worker)[287]: nvidia_drm: Processing device (SEQNUM=4674, ACTION=add)
Nov 28 02:52:25 archon (udev-worker)[287]: nvidia_drm: Device processed (SEQNUM=4674, ACTION=add)
Nov 28 02:52:25 archon (udev-worker)[287]: nvidia_drm: sd-device-monitor(worker): Passed 156 byte to netlink monitor.
Nov 28 02:52:25 archon kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0

Documentation suggests nvidia-frontend is only used when multiple NVIDIA kernel modules are in play, so that seems above board.
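
For what it’s worth, the udev rules above exit with code 1 because the $(grep nvidia-frontend /proc/devices | cut …) substitution comes back empty, which leaves mknod without a major number. Below is a rough sketch of a lookup that reports the absence explicitly instead. The function names and the optional file argument are hypothetical, and the fallback to a device registered as plain "nvidia" is my assumption rather than something confirmed here:

```shell
#!/usr/bin/env bash
# Print the character-device major number registered under a given name.
# $1: device name to look up (e.g. nvidia-frontend)
# $2: file to read, defaults to /proc/devices (hypothetical parameter,
#     added so the lookup can be tried against a sample file)
char_major() {
    local name="$1" file="${2:-/proc/devices}"
    # Compare the name exactly against field 2, and stop at the
    # "Block devices:" section so only character devices are considered.
    awk -v n="$name" '
        /^Block devices:/ { exit }
        $2 == n { print $1; found = 1; exit }
        END { exit !found }
    ' "$file"
}

# Try the candidate names in order and print the first major found.
# ASSUMPTION: the device may be registered as "nvidia" when
# nvidia-frontend is not in use.
nvidia_major() {
    local file="${1:-/proc/devices}" name
    for name in nvidia-frontend nvidia; do
        char_major "$name" "$file" && return 0
    done
    echo "no NVIDIA character device registered in $file" >&2
    return 1
}
```

With something like this, a rule could skip the mknod call entirely when nvidia_major fails, rather than handing mknod an empty argument.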

As for why GDM is starting up in X11 mode rather than Wayland… it’s failing to obtain an EGL display, probably because the drivers aren’t fully initialised.

gnome-shell[658]: Running GNOME Shell (using mutter 45.1) as a Wayland display server
gnome-shell[658]: Failed to make thread 'KMS thread' realtime scheduled: GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: Name "org.freedesktop.RealtimeKit1" does not exist
Nov 28 02:52:26 archon gnome-shell[658]: Device '/dev/dri/card0' prefers shadow buffer
gnome-shell[658]: Added device '/dev/dri/card0' (nvidia-drm) using atomic mode setting.
gnome-shell[658]: Failed to initialize accelerated iGPU/dGPU framebuffer sharing: No EGL display
gnome-shell[658]: Created gbm renderer for '/dev/dri/card0'
gnome-shell[658]: Boot VGA GPU /dev/dri/card0 selected as primary
org.gnome.Shell.desktop[658]: Failed to setup: The GPU /dev/dri/card0 chosen as primary is not supported by EGL.

So… the next question becomes…
How is this intended to work on a system where nvidia-modprobe is not deployed? At a guess, something needs to talk to ‘libGLX_nvidia’ as root at some stage during the boot process. Perhaps another udev rule to do something special?
