Can't rebind GPU with 'driverctl' if system booted with GPU attached to nvidia driver


Please consider this an issue to look into. I didn’t find an issue tracker to post this to, but if there’s a better place, just let me know.

I’ve found a workaround (shown here) and am not expecting a quick/easy answer. HOWEVER if there is a step I am missing that would allow the issue to be ignored, I welcome the reply.

Came across this issue while working on moving from Windows to Debian Linux. I’m setting up my system to use VFIO for a Win VM.

Notes:

  • Issue related to binding the driver, not related to any problems with graphics on desktop or games.
  • I don’t have X installed in any form at this time
  • I don’t currently have any functional VMs, this is a step I wanted to get through prior to setting those up

Given the process I’m about to outline works fine when using nouveau and vfio-pci, I’m guessing there’s something with the nvidia-driver package causing this behavior.


Relevant virtualization package install line:

Note: System is Debian 11 Bullseye

apt -y install qemu-kvm qemu-utils libvirt-daemon-system libvirt-clients virt-manager ovmf driverctl

At this point, with no ‘nvidia-driver’ package installed, I can successfully rebind the GPU between ‘nouveau’ and ‘vfio-pci’ to my heart’s content via:

# 3080 TI ... first = GPU, second = HD Audio
driverctl set-override 0000:0b:00.0 vfio-pci
driverctl set-override 0000:0b:00.1 vfio-pci
driverctl set-override 0000:0b:00.0 nouveau
driverctl set-override 0000:0b:00.1 snd_hda_intel

Likewise I can dynamically switch back to the defaults (nouveau & hda audio) like this:

driverctl unset-override 0000:0b:00.0
driverctl unset-override 0000:0b:00.1
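
To confirm which driver each function actually ended up on after a set-override or unset-override, the sysfs `driver` symlink can be read directly. The following is a minimal sketch; the helper name `bound_driver` and the `SYSFS_PCI` variable are made up for illustration, and the addresses are the ones used in this post.

```shell
#!/bin/sh
# Print the kernel driver currently bound to a PCI device, or "none".
# SYSFS_PCI is a hypothetical override for testing; it defaults to the
# real sysfs location.
bound_driver() {
    dev="$1"
    link="${SYSFS_PCI:-/sys/bus/pci/devices}/$dev/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/vfio-pci
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Addresses from this post; adjust for your own system.
for dev in 0000:0b:00.0 0000:0b:00.1; do
    printf '%s -> %s\n' "$dev" "$(bound_driver "$dev")"
done
```

`lspci -nnk -s 0b:00` reports the same information in its "Kernel driver in use" line, and `driverctl list-overrides` lists the overrides that are currently set.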

Where the issue happens:

If I have no override set (either I never ran the commands above, or I ran ‘unset-override’ to return to the stock config) and I do this:

apt -y install nvidia-driver firmware-misc-nonfree

and reboot, the system will come up with the GPU bound to the nvidia driver, like I’m sure most bare metal GPU users would expect. And this is where it gets ugly.

At this point if I try to:

driverctl set-override 0000:0b:00.0 vfio-pci

or

driverctl set-override 0000:0b:00.0 nouveau

the ‘driverctl’ command hangs hard. It can’t be killed; I have to open a new shell and issue a reboot to clear it out.

BUT

Workaround:

If I boot with an override (vfio-pci or nouveau), I -can- successfully:

driverctl --nosave set-override 0000:0b:00.0 nvidia

IMPORTANT: add ‘--nosave’ so the override isn’t persisted; otherwise, on a reboot, the system binds the GPU to the nvidia driver again.

And at this point I can rebind between nvidia / vfio-pci / nouveau / etc multiple times without issue.

Conclusion:

I can bind/unbind the nvidia driver dynamically so long as the system doesn’t attach to it at boot.

Why? I don’t know.

It’s possible this is an issue specific to driverctl but it doesn’t currently feel like it.
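
One way to narrow this down is to check whether anything still holds the nvidia modules when the unbind hangs. The sketch below assumes the hang is related to the modules being in use, which is a guess rather than something established here; `module_refcount` is a made-up helper name.

```shell
#!/bin/sh
# Print the use count of a kernel module, or "unloaded" if it is not listed.
# The second argument is a hypothetical override for testing; it defaults
# to /proc/modules, whose third column is the module's use count.
module_refcount() {
    modfile="${2:-/proc/modules}"
    [ -r "$modfile" ] || { echo "unknown"; return 0; }
    awk -v m="$1" '
        $1 == m { print $3; found = 1 }
        END     { if (!found) print "unloaded" }
    ' "$modfile"
}

# The module names the proprietary driver normally loads.
for mod in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
    printf '%-16s %s\n' "$mod" "$(module_refcount "$mod")"
done
```

A non-zero count on nvidia_drm or nvidia_modeset would suggest the console or a daemon still has the GPU open; `lsof /dev/nvidia*` may show which process, if any.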


FWIW, I have a thread on Level1Techs where I’ll probably be more active in trying to post anything relevant that other users might want if they see this in a search.

driverctl flow:

To disconnect from nvidia driver and connect to vfio override:

prime-select intel
reboot
driverctl set-override 0000:xx:xx.x vfio-pci

To connect back to the nvidia driver:

rm /etc/driverctl.d/0000:xx:xx.x
reboot
prime-select nvidia

I have had zero luck getting the unset command to work, so I have resorted to the above, with zero issues.

Returned here as the issue has come up again for me.

@vob … that doesn’t help in this situation. I’m not using a laptop, and my CPU (AMD 5950X) doesn’t have an iGPU. AFAIK prime-select is only going to be present in a hybrid iGPU+dGPU setup.

At the moment I’m looking for a method to simply tell the nvidia driver not to bind to a secondary GPU at all while still binding to the primary.
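
One commonly used approach, sketched here under assumptions, is to have vfio-pci claim the secondary card by vendor:device ID before the nvidia module loads. This only helps when the two cards have different IDs, since `ids=` matches every device with that ID. The helper name `vfio_modprobe_conf` and the IDs passed to it are illustrative only and not taken from this thread; read the real ones from `lspci -nn`.

```shell
#!/bin/sh
# Emit a modprobe.d fragment that lets vfio-pci claim the given
# vendor:device IDs and asks modprobe to load vfio-pci before nvidia.
vfio_modprobe_conf() {
    # Join all arguments with commas, e.g. "a b" -> "a,b"
    ids=$(IFS=,; echo "$*")
    printf 'options vfio-pci ids=%s\n' "$ids"
    printf 'softdep nvidia pre: vfio-pci\n'
}

# Example IDs (assumed, for illustration): GPU function and its audio function.
vfio_modprobe_conf 10de:2208 10de:1aef
```

The output would go into a file under /etc/modprobe.d/ (the file name is up to you), and on Debian the initramfs probably needs regenerating with `update-initramfs -u` before a reboot picks it up. If both cards share the same IDs, a saved per-address driverctl override is likely the closer fit.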

Just a follow up for the record to clear things up if anyone happens on this thread.

To unload the nvidia driver, make sure the modeset option of the nvidia-drm module is set to 0 (modeset=0).
https://forums.developer.nvidia.com/t/ubuntu-18-04-3-blank-screen-at-startup-with-430-drivers-and-gtx-960/107501/2?u=generix
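
A small sketch for checking the current setting, assuming the parameter is exposed at the usual sysfs path for module parameters; `nvidia_modeset_state` is a made-up helper name.

```shell
#!/bin/sh
# Report whether nvidia-drm kernel modesetting is on.
# The optional argument is a hypothetical override for testing; by default
# the module parameter is read from sysfs.
nvidia_modeset_state() {
    param="${1:-/sys/module/nvidia_drm/parameters/modeset}"
    if [ -r "$param" ]; then
        case "$(cat "$param")" in
            Y|1) echo "enabled" ;;
            *)   echo "disabled" ;;
        esac
    else
        # No parameter file: the nvidia_drm module is not loaded.
        echo "not loaded"
    fi
}

echo "nvidia-drm modeset: $(nvidia_modeset_state)"
```

If it reports "enabled", the usual way to turn it off is a line such as `options nvidia-drm modeset=0` in a file under /etc/modprobe.d/, followed by regenerating the initramfs and rebooting.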