I’ve dug into this a little more. Apologies for the deep dive.
There is in fact no timing issue: the ICD loader simply fails for libGLX_nvidia.so.0 on a fresh boot when nvidia-modprobe is not present.
ERROR: [Loader Message] Code 0 : loader_scanned_icd_add: Could not get 'vkCreateInstance' via 'vk_icdGetInstanceProcAddr' for ICD libGLX_nvidia.so.0
However, I found that running nvidia_bug_report.sh would at some point call out to
/sbin/nvidia-debugdump -D
at which point vulkaninfo reports the device successfully, though it still logs:
ERROR while creating surface for extension VK_KHR_xcb_surface : /vulkan-sdk/1.3.268.0/source/Vulkan-Tools/vulkaninfo/vulkaninfo.h:237:vkGetPhysicalDeviceSurfacePresentModesKHR failed with ERROR_UNKNOWN
ERROR while creating surface for extension VK_KHR_xlib_surface : /vulkan-sdk/1.3.268.0/source/Vulkan-Tools/vulkaninfo/vulkaninfo.h:237:vkGetPhysicalDeviceSurfacePresentModesKHR failed with ERROR_UNKNOWN
The bug reporter then calls vulkaninfo itself… it’s important to note that this occurs as the root user (as per the usage semantics of nvidia-debugdump).
It’s this execution as root that finally clears the ERROR_UNKNOWN (for all users)… That must be triggering the drivers to connect the last pieces together.
Note that at no time post-startup do I observe any log entries in the journal or dmesg regarding this.
I have udev logging set to debug, as well as gdm.
So something is happening during nvidia-debugdump/vulkaninfo when run as root, that isn’t happening on startup (or indeed, from starting up X11)…
Exploring nvidia-debugdump further… It’s enough to simply perform the following for the device to show up in vulkaninfo for a non-privileged user.
root]# nvidia-debugdump --list
Found 1 NVIDIA devices
Device ID: 0
Device name: NVIDIA GeForce GTX 970 (*PrimaryCard)
GPU internal ID: GPU-0f820b91-4b52-39da-79f9-31a36d336ebb
But ultimately the root user needs to call vulkaninfo (or otherwise query the driver) for any non-privileged user to be able to see the device.
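To narrow down what the root invocation actually changes, it may help to compare the state of the device nodes before and after it. One possibility (an assumption on my part, nothing in the logs confirms it) is that the root call ends up creating the /dev/nvidia* nodes. A minimal sketch of such a check follows. The helper name check_nodes is made up for illustration, and /dev/nvidia0 assumes the card is minor 0.

```shell
#!/usr/bin/env bash
# Report whether each given path exists as a character device and is
# readable and writable by the current user.
# Usage: check_nodes PATH...
# Returns 0 only if every path passes.
check_nodes() {
    local rc=0 node
    for node in "$@"; do
        if [ ! -c "$node" ]; then
            echo "MISSING   $node"
            rc=1
        elif [ ! -r "$node" ] || [ ! -w "$node" ]; then
            echo "NO ACCESS $node"
            rc=1
        else
            echo "OK        $node"
        fi
    done
    return "$rc"
}

# Run as the non-privileged user, once on a fresh boot and once after
# root has run `nvidia-debugdump --list`, then compare the two outputs.
# /dev/nvidia0 is an assumption (GPU at minor 0).
check_nodes /dev/nvidiactl /dev/nvidia0 || true
```

If the nodes are reported as MISSING on a fresh boot and OK afterwards, that would point at device node creation rather than anything inside the driver itself.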
Rewinding a bit… looking at my system startup I see the following udev rules failing:
nvidia: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \ -f 1) 255
'' failed with exit code 1.
nvidia: Process '/usr/bin/bash -c 'for i in $(cat /proc/driver/nvidia/gpus/*/information | grep Minor | cut -d \ -f 4); do /usr/bin/mknod -Z -m 666 /dev/nvidia${i} c $(grep nvidia-frontend /proc/devices | cut -d \ -f 1) ${i}; done'' failed with exit code 1.
This is emitted a few times (nvidia-frontend doesn’t occur in /proc/devices), but the driver otherwise starts up fine:
systemd-udevd[277]: nvidia_drm: Device ready for processing (SEQNUM=4674, ACTION=add)
Nov 28 02:52:25 archon systemd-udevd[277]: nvidia_drm: sd-device-monitor(manager): Passed 156 byte to netlink monitor.
Nov 28 02:52:25 archon (udev-worker)[291]: Inserted module 'nvidia_drm'
Nov 28 02:52:25 archon (udev-worker)[291]: Module 'nvidia' is already loaded
Nov 28 02:52:25 archon (udev-worker)[287]: nvidia_drm: Processing device (SEQNUM=4674, ACTION=add)
Nov 28 02:52:25 archon (udev-worker)[287]: nvidia_drm: Device processed (SEQNUM=4674, ACTION=add)
Nov 28 02:52:25 archon (udev-worker)[287]: nvidia_drm: sd-device-monitor(worker): Passed 156 byte to netlink monitor.
Nov 28 02:52:25 archon kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
Documentation suggests nvidia-frontend is only used when multiple NVIDIA kernel modules are in play, so that seems above board.
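For what it’s worth, the failing udev rules above break in a predictable way: when the grep for nvidia-frontend matches nothing, the unquoted $(...) expands to nothing and mknod is called with one argument too few, hence exit code 1. Here is a rough sketch of the same lookup that checks for that case first. The function name char_major is hypothetical, and it only reads /proc/devices (or a file in the same format), it creates nothing.

```shell
#!/usr/bin/env bash
# Print the character-device major number registered under a driver name.
# Usage: char_major NAME [FILE]   (FILE defaults to /proc/devices)
# Exits non-zero if NAME is not listed under "Character devices:".
char_major() {
    local name="$1" file="${2:-/proc/devices}"
    awk -v name="$name" '
        /^Character devices:/ { in_char = 1; next }
        /^Block devices:/     { in_char = 0 }
        in_char && $2 == name { print $1; found = 1; exit }
        END { exit (found ? 0 : 1) }
    ' "$file"
}

# Same lookup as the udev rule, but stop before mknod if there is no match.
if major="$(char_major nvidia-frontend)"; then
    echo "nvidia-frontend major: $major"
else
    echo "nvidia-frontend not in /proc/devices; mknod would be short one argument" >&2
fi
```

Unlike the grep in the rule, this matches the name exactly and only within the character devices section, so an entry such as nvidia-uvm would not be picked up by accident.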
As for why GDM is starting up in X11 mode rather than Wayland… it’s failing to obtain an EGL display. That is probably related to the fact that the drivers aren’t fully initialised.
gnome-shell[658]: Running GNOME Shell (using mutter 45.1) as a Wayland display server
gnome-shell[658]: Failed to make thread 'KMS thread' realtime scheduled: GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: Name "org.freedesktop.RealtimeKit1" does not exist
Nov 28 02:52:26 archon gnome-shell[658]: Device '/dev/dri/card0' prefers shadow buffer
gnome-shell[658]: Added device '/dev/dri/card0' (nvidia-drm) using atomic mode setting.
gnome-shell[658]: Failed to initialize accelerated iGPU/dGPU framebuffer sharing: No EGL display
gnome-shell[658]: Created gbm renderer for '/dev/dri/card0'
gnome-shell[658]: Boot VGA GPU /dev/dri/card0 selected as primary
org.gnome.Shell.desktop[658]: Failed to setup: The GPU /dev/dri/card0 chosen as primary is not supported by EGL.
So… the next question becomes…
How is this intended to work on a system where nvidia-modprobe is not deployed? At a guess, it seems that something needs to talk to ‘libGLX_nvidia’ as root at some stage during the boot process. Perhaps another udev rule to do something special?