Hi,
I have a pair of NVIDIA GPUs in my desktop for PCI passthrough virtual machines using VFIO on Ubuntu 20.04, kernel 5.15.0-43: a 3090 Ti FE for the host and a 750 Ti for the guest.
At some point after the 470 series drivers (490+, including the current 515.65.01), it has become increasingly difficult to hot-unbind the card from the nvidia kernel modules. I’ve encountered two specific bits of the NVIDIA drivers which hold on to the card when they maybe shouldn’t:
- nvidia-drm
When nvidia-drm is loaded with the modeset=1 parameter, it will hold on to all GPUs until it is unloaded. GPUs that are bound to nvidia later are not affected: it seems to be the nv_drm_probe_devices function in nvidia-drm-drv.c that grabs the GPUs, and it’s only called when the nvidia-drm driver is loaded. Attempting to unload nvidia while nvidia-drm is holding on to the card results in this:
NVRM: Attempting to remove device 0000:05:00.0 with non-zero usage count!
- Something in the userspace drivers
On older drivers (<= 470) using X11, I was able to isolate the guest GPU’s /dev/nvidiaX device from userspace applications with an X11 config containing Option "AutoAddGPU" "false" in the ServerLayout section, and the NVIDIA-specific Option "ProbeAllGpus" "false" under the Device section.
This no longer seems to work, as the userspace components always seem to pick up the guest GPU (visible as Xorg, Firefox, etc. holding the /dev/nvidiaX device open) despite my X11 config only specifying the host GPU. Another shortfall is that there doesn’t seem to be a way to achieve the same isolation in Wayland?
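For anyone wanting to check the same thing on their own system, here is a minimal Python sketch of how the open /dev/nvidiaX handles can be listed by walking the /proc/<pid>/fd symlinks. The function name nvidia_holders and the proc_root parameter are made up for illustration; it only matches the per-GPU nodes (/dev/nvidia0, /dev/nvidia1, ...) and ignores shared nodes such as /dev/nvidiactl, which is an assumption about what matters for unbinding. It needs to run as root to see other users' processes.

```python
import os
import re

# Per-GPU device nodes only; shared nodes like /dev/nvidiactl are skipped
# on purpose (assumption: only the per-GPU node pins a specific card).
NVIDIA_NODE = re.compile(r"^/dev/nvidia\d+$")


def nvidia_holders(proc_root="/proc"):
    """Map each open /dev/nvidiaN node to the sorted PIDs holding it open."""
    holders = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were looking
            if NVIDIA_NODE.match(target):
                holders.setdefault(target, set()).add(int(entry))
    return {node: sorted(pids) for node, pids in holders.items()}


if __name__ == "__main__":
    for node, pids in sorted(nvidia_holders().items()):
        print(node, pids)
```

If the guest GPU's node shows up here with Xorg or a browser PID next to it, that process would have to close the device before the card can be released.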
Basically what I’m looking for is a way to hot-unbind a card as needed, and to easily isolate it from userspace programs to facilitate the unbinding. I need to keep the guest card bound to the nvidia driver while my VMs are not running to ensure proper power management and to check the status with nvidia-smi, so leaving it permanently unbound isn’t a solution; plus it’s nice to be able to run headless compute or encoding on it while it’s not attached to a VM.
I don’t really need nvidia-drm modesetting on the guest card, so a way to make that module ignore cards passed via a parameter would work for me, but getting true hotplug working would be ideal.
My current workaround is to late-bind the guest card to nvidia and simply not give it a /dev/nvidiaX node, but this isn’t great.
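To make the unbind and late-bind steps concrete, here is a rough Python sketch using the generic PCI sysfs files (the driver's unbind file, plus driver_override and drivers_probe). The helper names current_driver, unbind and bind_to, and the sysfs_root parameter, are hypothetical and only there for illustration. This is not necessarily how my own setup does it, and it does not cover withholding the /dev/nvidiaX node. It has to run as root on a real system.

```python
import os


def current_driver(pci_addr, sysfs_root="/sys"):
    """Return the name of the driver bound to pci_addr, or None if unbound."""
    link = os.path.join(sysfs_root, "bus/pci/devices", pci_addr, "driver")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.readlink(link))


def unbind(pci_addr, sysfs_root="/sys"):
    """Ask the currently bound driver to release pci_addr.

    Returns the name of the driver that was asked, or None if the device
    was already unbound.
    """
    driver = current_driver(pci_addr, sysfs_root)
    if driver is None:
        return None
    path = os.path.join(sysfs_root, "bus/pci/drivers", driver, "unbind")
    with open(path, "w") as f:
        f.write(pci_addr)
    return driver


def bind_to(pci_addr, driver, sysfs_root="/sys"):
    """Late-bind pci_addr to the named driver via driver_override."""
    dev = os.path.join(sysfs_root, "bus/pci/devices", pci_addr)
    # Restrict which driver is allowed to claim this device...
    with open(os.path.join(dev, "driver_override"), "w") as f:
        f.write(driver)
    # ...then ask the PCI core to probe drivers for it again.
    with open(os.path.join(sysfs_root, "bus/pci/drivers_probe"), "w") as f:
        f.write(pci_addr)
```

Handing the guest card to a VM would then look something like unbind("0000:05:00.0") followed by bind_to("0000:05:00.0", "vfio-pci"), and bind_to("0000:05:00.0", "nvidia") to take it back afterwards. The unbind write is the step that fails when something is still holding the card.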