Hi,
I recently found out that many Wayland compositors have a feature where, if the compositor receives a udev event about GPU removal and that GPU is not the compositor’s “primary” GPU (the one that renders the compositor), it can “detach” itself from the GPU so that it no longer uses it. This feature was created for eGPUs, to properly handle the cable being unplugged, but I decided to try it for my VFIO setup, so that I can pass the GPU through to a virtual machine without closing my Wayland session.
I tested it on EndeavourOS with KDE Plasma and an RTX 4060, with the Intel UHD Graphics 730 iGPU as the “primary” GPU. It worked perfectly with the nouveau driver. With Nvidia’s driver, after I sent the udev signal via udevadm, the compositor stopped displaying on the monitor plugged into the GPU, but lsmod still showed 2 references to the nvidia_drm module and even more to the nvidia module. This makes it impossible to unload the driver and bind the GPU to the vfio-pci driver for passthrough. I can confirm that it is the compositor that is still using the GPU, because sudo lsof /dev/nvidia0 and sudo lsof /dev/dri/card0 (the Nvidia GPU) both report a few instances of kwin_wayland and nothing else.
The issue is not specific to one compositor, since I also tried Hyprland (before it dropped wlroots) and got the same results. It is also not distro-related, because I got the same results on Fedora. I also tried both OpenRM and the proprietary Nvidia driver, again with the same results.
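For anyone who wants to repeat the check described above without lsmod and lsof, here is a minimal sketch that reads the module reference count from sysfs and scans /proc for processes that still hold a device node open. The function names and the SYSFS_ROOT and PROC_ROOT overrides are my own additions for illustration (they only exist so the functions can be tried against a fake directory tree); it also assumes the kernel exposes /sys/module/<name>/refcnt, which requires module unloading support.

```shell
#!/bin/sh
# Sketch only: helper names and the *_ROOT overrides are hypothetical.

# Print the reference count of a loaded kernel module, or "not-loaded".
module_refcnt() {
    mod="$1"
    root="${SYSFS_ROOT:-/sys}"
    f="$root/module/$mod/refcnt"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "not-loaded"
    fi
}

# Print "PID COMMAND" for every process with the given device node open.
# Needs root to see file descriptors of other users' processes.
holders_of() {
    dev="$1"
    proc="${PROC_ROOT:-/proc}"
    for fd in "$proc"/[0-9]*/fd/*; do
        # Each fd entry is a symlink to the opened path.
        [ "$(readlink "$fd" 2>/dev/null)" = "$dev" ] || continue
        piddir="${fd%/fd/*}"
        echo "${piddir##*/} $(cat "$piddir/comm" 2>/dev/null)"
    done | sort -u
}
```

After sourcing this as root, module_refcnt nvidia_drm and holders_of /dev/nvidia0 should correspond to what lsmod and lsof report; in the failing case described above I would expect a non-zero count and only kwin_wayland entries.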
To reproduce:
- Run any Wayland compositor with the Nvidia GPU as a non-primary GPU (I used the KWIN_DRM_DEVICES environment variable on KWin, or WLR_DRM_DEVICES on wlroots, to set it up that way)
- Use sudo su -c 'echo remove > /sys/bus/pci/devices/<PCI address of the card>/drm/cardX/uevent' to send the udev signal
- lsmod should still return 2 references to the nvidia_drm module, and sudo lsof /dev/nvidiaX should return a few instances of your compositor’s process
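If you are not sure which cardX and PCI address to fill in for the steps above, the following sketch looks them up through sysfs by kernel driver name. The function names and the SYSFS_ROOT override are hypothetical, and it assumes the GPU's PCI device is bound to a driver called nvidia (adjust the name for nouveau or anything else).

```shell
#!/bin/sh
# Sketch only: helper names and SYSFS_ROOT are hypothetical.

# Print "cardN <pci-address>" for every DRM card whose device uses driver $1.
cards_for_driver() {
    want="$1"
    root="${SYSFS_ROOT:-/sys}"
    for card in "$root"/class/drm/card[0-9]*; do
        # Skip connector entries such as card0-HDMI-A-1.
        case "${card##*/}" in *-*) continue ;; esac
        [ -e "$card/device/driver" ] || continue
        drv=$(basename "$(readlink -f "$card/device/driver")")
        pci=$(basename "$(readlink -f "$card/device")")
        if [ "$drv" = "$want" ]; then
            echo "${card##*/} $pci"
        fi
    done
}

# Build the uevent path used in the reproduction step.
uevent_path() {
    card="$1"
    pci="$2"
    root="${SYSFS_ROOT:-/sys}"
    echo "$root/bus/pci/devices/$pci/drm/$card/uevent"
}
```

For example, cards_for_driver nvidia prints the card name and PCI address, and passing those two values to uevent_path gives the file to write remove into (as root).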
Also, this did work for me once, more than half a year ago, on a fresh install of the Fedora 40 (or 39) KDE spin, but it broke after I fully updated the system, so the bug is probably a regression.
I can’t attach logs from every installation that I tried this on, but here’s the log from my current setup, which I use as my daily driver:
nvidia-bug-report.log.gz (709.2 KB)