VFIO VGA arbitration lock

Hi,

I recently tried to use the NVIDIA driver for my primary GeForce GT
520 while using the VFIO (Virtual Function I/O) framework to pass my
secondary GPU through to a KVM virtual machine. It failed because the
‘nvidia’ kernel module locks the legacy I/O and memory regions, which
leaves ‘qemu’ stuck in a deadlock. The same setup works fine with the
open-source ‘nouveau’ driver. Attached is a fix for the binary driver
version 319.23, although I don’t know if it is the proper way. After some
testing (Xorg, DRI, VDPAU, switching between X and text console) it looks
stable. This patch was also posted to the ‘qemu-devel’ and ‘kvm’ mailing
lists.

http://lists.gnu.org/archive/html/qemu-devel/2013-05/msg04300.html

Basically the vga_tryget code doesn’t make much sense at all because the
driver never does a vga_put, so the vga_tryget ends up holding the VGA
arbitration lock. This makes multi-GPU systems with different GPU vendors
mostly non-working.

diff -Nur NVIDIA-Linux-x86_64-319.23/kernel/nv-linux.h NVIDIA-Linux-x86_64-319.23-vfio-vgaarb-fix/kernel/nv-linux.h
--- NVIDIA-Linux-x86_64-319.23/kernel/nv-linux.h 2013-05-17 04:00:02.000000000 +0200
+++ NVIDIA-Linux-x86_64-319.23-vfio-vgaarb-fix/kernel/nv-linux.h 2013-05-29 18:09:42.382925622 +0200
@@ -151,9 +151,9 @@
 #error "struct file_operations compile test likely failed!"
 #endif
 
-#if defined(CONFIG_VGA_ARB)
-#include <linux/vgaarb.h>
-#endif
+//#if defined(CONFIG_VGA_ARB)
+//#include <linux/vgaarb.h>
+//#endif
 
 #if defined(NV_VM_INSERT_PAGE_PRESENT)
 #include <linux/pagemap.h>
diff -Nur NVIDIA-Linux-x86_64-319.23/kernel/nv.c NVIDIA-Linux-x86_64-319.23-vfio-vgaarb-fix/kernel/nv.c
--- NVIDIA-Linux-x86_64-319.23/kernel/nv.c 2013-05-17 04:00:02.000000000 +0200
+++ NVIDIA-Linux-x86_64-319.23-vfio-vgaarb-fix/kernel/nv.c 2013-05-29 18:10:01.494277314 +0200
@@ -2914,12 +2914,12 @@
 
     pci_set_master(dev);
 
-#if defined(CONFIG_VGA_ARB)
-#if defined(VGA_DEFAULT_DEVICE)
-    vga_tryget(VGA_DEFAULT_DEVICE, VGA_RSRC_LEGACY_MASK);
-#endif
-    vga_set_legacy_decoding(dev, VGA_RSRC_NONE);
-#endif
+//#if defined(CONFIG_VGA_ARB)
+//#if defined(VGA_DEFAULT_DEVICE)
+//    vga_tryget(VGA_DEFAULT_DEVICE, VGA_RSRC_LEGACY_MASK);
+//#endif
+//    vga_set_legacy_decoding(dev, VGA_RSRC_NONE);
+//#endif
 
     if (NV_IS_GVI_DEVICE(nv))
     {

Kind regards
Maik

Hi, I’ve been using this patch for a few months now.

I’m also interested in whether this is the right way to do it, and would like to have an official response.

Ideally I want to be able to run an unpatched driver, so this should be solved somehow.

Hey guys, thanks for reporting this issue. Please provide step-by-step reproduction steps so I can reproduce it in house. Also attach an nvidia bug report.

Thanks for your response, it seems as if I didn’t get a notification for this or something.

I run into problems when using vfio to pass another GPU to a guest and this patch fixed those problems.

The best step-by-step reproduction is probably this forum post which I followed to set this up.

Basically you need a kernel newer than 3.9 with specific options enabled and a very recent version of qemu and seabios. Qemu is then called with some parameters to specify the device to pass through and stuff:

/usr/bin/qemu-system-x86_64 -M q35 -enable-kvm -vga none -nographic -bios /usr/share/qemu/bios.bin -cpu host -device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1 -device vfio-pci,host=06:00.0,x-vga=on,addr=0.0,multifunction=on,bus=root.1 -device vfio-pci,host=06:00.1,addr=0.1,bus=root.1

The nvidia-bug-report.log is (hopefully) attached.

nvidia-bug-report.log.gz (268 KB)

If you need any further help please tell me. I’m also highly interested in seeing this issue fixed officially.

I too have been using this patch for some months already with a GT430 (host) and an elderly 8500GT (guest) in a KVM VFIO VGA passthrough setup. Works just fine, no side effects whatsoever.

Actually, because this setup works so well, I’m thinking of buying a more powerful NVIDIA card for the Windows virtual machine. Seeing this “feature” officially supported (or that bug fixed, whatever you may call it) would confirm that this is a good investment.

Just to bump up. There is a very long discussion in the Arch Linux forum about VFIO primary GPU passthrough already and several people are using this patch.

https://bbs.archlinux.org/viewtopic.php?id=162768

It becomes more critical to fix this as a wider audience is using it. Moreover, for now NVIDIA is the only binary driver that works on the host node when using VFIO primary/secondary GPU passthrough with it.

mbroemme, could you please provide an nvidia bug report from your system?

I’ve tried to use the NVIDIA driver for my primary GeForce GT
520 and use VFIO (Virtual Function I/O) framework to passthrough my
secondary GPU to KVM virtual machine.

  • What is the make and model of the secondary GPU you pass through to the KVM virtual machine?
  • When does the issue occur? As soon as you run the command and launch the VM?
  • What error did you observe in dmesg or the logs?

Filed bug internal 1463972 to track this issue.

We followed the procedure from the https://bbs.archlinux.org/viewtopic.php?id=162768 link.

While launching the VM with AMD GPU passthrough we are seeing the error message below:

qemu-system-x86_64: -device vfio-pci,host=06:00.0,x-vga=on,addr=0.0,multifunction=on,bus=root.1: vfio: error no iommu_group for device
qemu-system-x86_64: -device vfio-pci,host=06:00.0,x-vga=on,addr=0.0,multifunction=on,bus=root.1: Device initialization failed.
qemu-system-x86_64: -device vfio-pci,host=06:00.0,x-vga=on,addr=0.0,multifunction=on,bus=root.1: Device 'vfio-pci' could not be initialized

The command we used:

/usr/bin/qemu-system-x86_64 -M q35 -enable-kvm -vga none -nographic -bios /usr/share/qemu/bios.bin -cpu host -device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1 -device vfio-pci,host=06:00.0,x-vga=on,addr=0.0,multifunction=on,bus=root.1 -device vfio-pci,host=06:00.1,addr=0.1,bus=root.1

Just a suggestion, as I see there’s been no activity on this for 2 months; a lot of people have the error message you have, depending on their exact hardware setup. I would recommend posting on the https://bbs.archlinux.org/viewtopic.php?id=162768 thread to help resolve it in order to move forward. There is a lot of activity on that thread and the instructions are updated and kept current on page 1. There is a hardware compatibility list posted on it as well, later in the thread in the 50-60 page area I believe.

I was getting this error when I had forgotten to boot with intel_iommu=on in my kernel parameters.

So…any updates/news/announcements regarding this development?

As @Kinslayer already mentioned, when booting a system with Intel VT-d capabilities, the intel_iommu=on kernel parameter is required to enable PCI device assignment (passthrough) to KVM.

On AMD systems, the kernel parameter is amd_iommu=on.

After you’ve got those preliminary steps completed, as indicated by a successful system-boot, proceed with your previously-outlined steps as per https://bbs.archlinux.org/viewtopic.php?id=162768 link.

I’ve been using the patch for at least two years and so far everything’s alright.

I wonder if you might consider fixing this for the mainline driver?