I have a pair of NVIDIA GPUs in my desktop for PCI passthrough virtual machines using VFIO on Ubuntu 20.04, kernel 5.15.0-43. A 3090 Ti FE for the host, and a 750 Ti for the guest.
At some point after the 470 series drivers (490+, including the current 515.65.01), it has become increasingly difficult to hot-unbind the card from the
nvidia kernel modules. I’ve encountered two specific bits of the NVIDIA drivers which hold on to the card when they maybe shouldn’t:
If nvidia-drm is loaded with the modeset=1 parameter, it will hold on to all GPUs until it is unloaded. GPUs that are bound later to nvidia are not affected, as it seems that it’s nvidia-drm-drv.c which is grabbing the GPUs, and it’s only called when the nvidia-drm driver is loaded. Attempting to unbind the card while nvidia-drm is holding on to it results in this:
NVRM: Attempting to remove device 0000:05:00.0 with non-zero usage count!
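For anyone trying to reproduce this, a minimal Python sketch of how the state could be checked before attempting an unbind. It only reads standard sysfs entries; the function names and the sysfs_root parameter are my own for illustration, and I am assuming the modeset parameter is exposed under /sys/module/nvidia_drm/parameters (it is typically readable by root only).

```python
from pathlib import Path


def drm_modeset_enabled(sysfs_root="/sys"):
    """Return True/False for nvidia-drm modeset, or None if the module is not loaded.

    Assumption: the parameter shows up as "Y"/"N" (or "1"/"0") in sysfs.
    Reading it usually requires root.
    """
    param = Path(sysfs_root) / "module" / "nvidia_drm" / "parameters" / "modeset"
    if not param.exists():
        return None
    return param.read_text().strip() in ("Y", "1")


def bound_driver(pci_addr, sysfs_root="/sys"):
    """Return the name of the driver a PCI device is bound to, or None if unbound."""
    link = Path(sysfs_root) / "bus" / "pci" / "devices" / pci_addr / "driver"
    if not link.exists():
        return None
    # The "driver" entry is a symlink to .../bus/pci/drivers/<name>.
    return link.resolve().name


if __name__ == "__main__":
    print("nvidia-drm modeset:", drm_modeset_enabled())
    print("driver for 0000:05:00.0:", bound_driver("0000:05:00.0"))
```

If modeset reports enabled and the card was already bound to nvidia when nvidia-drm loaded, that matches the case where the unbind fails with the message above.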
Something in the userspace drivers
On older drivers (<= 470) using X11, I was able to isolate the guest GPU’s /dev/nvidiaX device from userspace applications with an X11 config containing Option "AutoAddGPU" "false" in the ServerLayout section, and the NVIDIA-specific Option "ProbeAllGpus" "false" under the
This no longer seems to work, as the userspace components always seem to pick up the guest GPU (visible from Xorg, Firefox, etc. holding the /dev/nvidiaX device open) despite my X11 config only specifying the host GPU. Another shortfall is that there doesn’t seem to be a way to achieve the same isolation in Wayland.
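To show how the open handles can be spotted, here is a rough Python sketch that walks /proc and lists processes holding a device node open. The find_holders name and its parameters are made up for this example; it is roughly what lsof or fuser would report, and it needs root to see other users' processes.

```python
import os
from pathlib import Path


def find_holders(device_prefix="/dev/nvidia", proc_root="/proc"):
    """Map (pid, command name) -> set of matching device paths held open.

    Note: the default prefix also matches /dev/nvidiactl, /dev/nvidia-uvm and
    /dev/nvidia-modeset; pass e.g. "/dev/nvidia1" to narrow it to one GPU node.
    """
    holders = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # only numeric directories are processes
        try:
            fds = list((entry / "fd").iterdir())
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(fd)
            except OSError:
                continue
            if target.startswith(device_prefix):
                try:
                    name = (entry / "comm").read_text().strip()
                except OSError:
                    name = "?"
                holders.setdefault((int(entry.name), name), set()).add(target)
    return holders


if __name__ == "__main__":
    for (pid, name), paths in sorted(find_holders().items()):
        print(pid, name, ", ".join(sorted(paths)))
```

Anything listed for the guest card's node has to close it (or be stopped) before the unbind can go through.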
Basically, what I’m looking for is a way to hot-unbind a card as needed, and to easily isolate it from userspace programs to facilitate the unbinding. I need to keep the guest card bound to the nvidia driver while my VMs are not running, to ensure proper power management and to check the status with nvidia-smi, so leaving it permanently unbound isn’t a solution; plus it’s nice to be able to run headless compute or encoding on it while it’s not attached to a VM.
I don’t really need nvidia-drm modesetting on the guest card, so a way to make that module ignore cards passed via a parameter would work for me, but getting true hotplug working would be ideal.
My current workaround is to late-bind the guest card to nvidia and simply not give it a /dev/nvidiaX node, but this isn’t great.
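For reference, the bind/unbind part of this can be sketched with the generic PCI sysfs interface (driver_override plus drivers_probe). This is not my exact script: the rebind helper and its sysfs_root parameter are illustrative, it must run as root, and it does not cover withholding the /dev/nvidiaX node.

```python
from pathlib import Path


def rebind(pci_addr, new_driver, sysfs_root="/sys"):
    """Move a PCI device to new_driver using driver_override and drivers_probe."""
    pci = Path(sysfs_root) / "bus" / "pci"
    dev = pci / "devices" / pci_addr

    # 1. Detach from the current driver, if any. On a real system this write
    #    can block or fail while something still holds the device.
    if (dev / "driver").exists():
        (dev / "driver" / "unbind").write_text(pci_addr)

    # 2. Tell the PCI core which driver is allowed to claim the device next.
    (dev / "driver_override").write_text(new_driver)

    # 3. Ask the PCI core to probe the device again so new_driver picks it up.
    (pci / "drivers_probe").write_text(pci_addr)


if __name__ == "__main__":
    # Hand the guest card to VFIO before starting a VM ...
    rebind("0000:05:00.0", "vfio-pci")
    # ... and give it back to nvidia afterwards:
    # rebind("0000:05:00.0", "nvidia")
```

The first step is exactly where things go wrong on the newer drivers: while nvidia-drm or a userspace process still holds the card, the unbind does not complete cleanly.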