Ubuntu 18.04 Host with KVM Hypervisor : Quadro RTX 4000 GPU Card is not accessible on Windows Server 2016 Guest VM

Hello Team,

Observing an issue while accessing an NVIDIA Quadro RTX 4000 GPU in a Windows Server 2016 guest VM on a KVM-hypervisor-based virtualization host.

Though the GPU card is visible under the Display adapters section of Device Manager and the corresponding driver is installed, Windows stops the device with error code 43.

Rechecked the host configuration, such as enabling the IOMMU flag and the vfio-pci binding.

Setup Details:

  1. Dell PowerEdge server with VT-d enabled
  2. Ubuntu 18.04 OS with libvirt, QEMU and KVM installed
  3. Windows Server 2016 as the guest VM
  4. GPU card passed through via the PCI passthrough method
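Before passing the card through, the host's IOMMU groups can be listed as a sanity check to confirm the GPU and its companion functions sit in their own group. A minimal sketch (the `0000:d8:00.x` addresses correspond to the lspci output further down):

```shell
#!/bin/sh
# List every PCI device together with its IOMMU group, so the GPU's
# functions (0000:d8:00.0-3) can be checked for isolation.
for path in /sys/kernel/iommu_groups/*/devices/*; do
    group=${path%/devices/*}   # strip the trailing "/devices/<addr>"
    group=${group##*/}         # keep only the group number
    dev=${path##*/}            # PCI address, e.g. 0000:d8:00.0
    printf 'IOMMU group %s: %s\n' "$group" "$dev"
done
```

If any unrelated device shares the GPU's group, it must be passed through as well or the group split (e.g. via the ACS override patch, with the usual isolation caveats).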

Host Configuration:

  1. dmesg outputs

    root@moving-deer:~# dmesg | grep -e DMAR -e IOMMU
    [ 0.000000] ACPI: DMAR 0x000000006F6C2000 0001E0 (v01 DELL PE_SC3 00000001 DELL 00000001)
    [ 0.000000] DMAR: IOMMU enabled
    [ 1.478298] DMAR: Intel® Virtualization Technology for Directed I/O

    [ 35.159689] vfio-pci 0000:d8:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
    [ 235.260035] vfio-pci 0000:d8:00.0: enabling device (0000 -> 0003)
    [ 235.368328] vfio_ecap_init: 0000:d8:00.0 hiding ecap 0x1e@0x258
    [ 235.368353] vfio_ecap_init: 0000:d8:00.0 hiding ecap 0x19@0x900

  2. lspci outputs

    root@moving-deer:~# lspci -nnk | grep -i d8:00 -A 3
    d8:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1eb1] (rev a1)
    Subsystem: NVIDIA Corporation Device [10de:12a0]
    Kernel driver in use: vfio-pci
    Kernel modules: nvidiafb
    d8:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f8] (rev a1)
    Subsystem: NVIDIA Corporation Device [10de:12a0]
    Kernel driver in use: vfio-pci
    d8:00.2 USB controller [0c03]: NVIDIA Corporation Device [10de:1ad8] (rev a1)
    Subsystem: NVIDIA Corporation Device [10de:12a0]
    Kernel driver in use: vfio-pci
    d8:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device [10de:1ad9] (rev a1)
    Subsystem: NVIDIA Corporation Device [10de:12a0]
    Kernel driver in use: vfio-pci
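To double-check which kernel driver actually owns each of the four functions (a quick sketch that reads the same information lspci shows above, straight from sysfs):

```shell
#!/bin/sh
# Print the driver bound to each function of the GPU at 0000:d8:00.
# All four should report vfio-pci before the guest is started.
for fn in 0 1 2 3; do
    link=/sys/bus/pci/devices/0000:d8:00.$fn/driver
    if [ -e "$link" ]; then
        printf '0000:d8:00.%s -> %s\n' "$fn" "$(basename "$(readlink "$link")")"
    else
        printf '0000:d8:00.%s -> (no driver bound)\n' "$fn"
    fi
done
```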

This is on purpose; you’ll have to hide the hypervisor from the guest.

Yes, that has been done already.

<kvm>
  <hidden state='on'/>
</kvm>

You might also need

<ioapic driver='kvm'/>

and

  <hyperv>
    ...
    <vendor_id state='on' value='someid'/>
    ...
  </hyperv>
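For reference, with all of the above combined, the features section of the domain XML would look roughly like this (a sketch only; 'someid' is an arbitrary placeholder, and the hyperv sub-elements in your domain may differ):

```xml
<features>
  <acpi/>
  <apic/>
  <hyperv>
    <relaxed state='on'/>
    <vapic state='on'/>
    <spinlocks state='on' retries='8191'/>
    <vendor_id state='on' value='someid'/>
  </hyperv>
  <kvm>
    <hidden state='on'/>
  </kvm>
  <ioapic driver='kvm'/>
</features>
```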

The vendor_id part is there already.

Only ioapic was missing. I added it and checked again; still the same issue.

<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0xd8' slot='0x00' function='0x0'/>
  </source>
  <rom file='/usr/share/kvm/vbios.bin'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0' multifunction='on'/>
</hostdev>

Tried with the rom file option in the hostdev section as well.

Did you add any extra KVM config during setup?

Added a few options in the grub file.

root~# cat /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on kvm.ignore_msrs=1 vfio-pci.ids=10de:1eb1,10de:10f8,10de:1ad8,10de:1ad9"

root~# cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:1eb1,10de:10f8,10de:1ad8,10de:1ad9 disable_vga=1
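Note that edits to /etc/default/grub and /etc/modprobe.d/ only take effect after regenerating grub and the initramfs and rebooting. A quick way to verify the options actually reached the kernel (a sketch, assuming a stock Ubuntu 18.04 setup):

```shell
#!/bin/sh
# Regenerate the boot configuration so the new parameters are picked up.
update-grub
update-initramfs -u
# After a reboot, confirm the flags made it onto the kernel command line:
grep -o 'intel_iommu=on' /proc/cmdline
grep -o 'vfio-pci.ids=[^ ]*' /proc/cmdline
```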

And the kvm-spice emulator is used for the guest VM domain:
/usr/bin/kvm-spice

Any suggestions or configurations to try out?

I missed mentioning the model of the PowerEdge server: it is a Dell PowerEdge R740.

The only thing that comes to my mind is to make sure you also passed through the subdevices (the audio, USB, and serial bus functions), not just the main GPU function.
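For completeness, passing all four functions means one hostdev entry per host function, keeping them on the same guest slot with matching function numbers, along these lines (a sketch; guest slot 0x07 is taken from the hostdev snippet earlier in the thread):

```xml
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0xd8' slot='0x00' function='0x0'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0' multifunction='on'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0xd8' slot='0x00' function='0x1'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x1'/>
</hostdev>
<!-- ...and likewise for function='0x2' (USB) and function='0x3' (serial bus) -->
```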

Thanks for your quick response.

Actually, I tried passing through all the PCI functions that come under the main GPU.

As an alternative, I tried accessing the RTX 4000 GPU directly on bare metal (Dell PowerEdge R740) by installing the same Windows Server 2016 OS.

So the observation is the same on both the virtualization and bare-metal configurations.

It is also confirmed that the PCIe 8-pin connector is powering the GPU.

So the device also doesn’t work with Windows Server 2016 on bare metal? Seems broken, then.

Yes, the issue is the same on both configurations, but on two different Dell PowerEdge R740 servers.

  1. One server installed with Ubuntu 18.04 virtualization, with a Windows Server 2016 guest VM to access the GPU
  2. Another server installed with Windows Server 2016 directly to access the GPU

I guess you’ll need to have it replaced by your vendor if it is still under warranty.
If you want detailed info, install the driver in the Ubuntu host OS, run nvidia-bug-report.sh as root, and attach the resulting nvidia-bug-report.log.gz to your post.