Hello Team,
I am observing an issue while accessing an NVIDIA RTX 4000 GPU card from a Windows Server 2016 guest on a KVM hypervisor-based virtualization host.
Although the GPU card is visible under the Display Adapters section of Device Manager and the corresponding driver is installed, Windows stops the device with error code 43.
I have rechecked the host configuration, including enabling the IOMMU flag and the vfio-pci binding; a verification snippet follows below.
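For reference, a quick way to double-check on the host that the GPU and its sibling functions sit in their own IOMMU group (a standard sysfs walk; adjust the PCI address if your topology differs):
# List each IOMMU group and the devices it contains (run as root on the host).
# All four functions of the card at 0000:d8:00.* should show up here.
for g in /sys/kernel/iommu_groups/*; do
  echo "IOMMU group ${g##*/}:"
  for d in "$g"/devices/*; do
    echo -n "  "
    lspci -nns "${d##*/}"
  done
done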
Setup Details:
Dell PowerEdge server with VT-d enabled
Ubuntu 18.04 with libvirt, QEMU, and KVM installed
Windows Server 2016 as Guest VM
GPU card passed to the guest via PCI passthrough
Host Configuration:
dmesg output:
root@moving-deer:~# dmesg | grep -e DMAR -e IOMMU
[ 0.000000] ACPI: DMAR 0x000000006F6C2000 0001E0 (v01 DELL PE_SC3 00000001 DELL 00000001)
[ 0.000000] DMAR: IOMMU enabled
[ 1.478298] DMAR: Intel(R) Virtualization Technology for Directed I/O
[ 35.159689] vfio-pci 0000:d8:00.0: vgaarb: changed VGA decodes:
olddecodes=io+mem,decodes=io+mem:owns=none
[ 235.260035] vfio-pci 0000:d8:00.0: enabling device (0000 -> 0003)
[ 235.368328] vfio_ecap_init: 0000:d8:00.0 hiding ecap 0x1e@0x258
[ 235.368353] vfio_ecap_init: 0000:d8:00.0 hiding ecap 0x19@0x900
lspci output:
root@moving-deer:~# lspci -nnk | grep -i d8:00 -A 3
d8:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1eb1] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:12a0]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb
d8:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f8] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:12a0]
Kernel driver in use: vfio-pci
d8:00.2 USB controller [0c03]: NVIDIA Corporation Device [10de:1ad8] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:12a0]
Kernel driver in use: vfio-pci
d8:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device [10de:1ad9] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:12a0]
Kernel driver in use: vfio-pci
This is on purpose; you'll have to hide the hypervisor.
Yes, that has been done already:
<kvm>
<hidden state='on'/>
</kvm>
generix (February 8, 2021, 10:02am):
You might also need
<ioapic driver='kvm'/>
and
<hyperv>
...
<vendor_id state='on' value='someid'/>
...
</hyperv>
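For completeness, a minimal sketch of how those pieces fit together inside the <features> section of the domain XML; the vendor_id value is an arbitrary placeholder (any non-empty string up to 12 characters), and the relaxed/vapic/spinlocks entries are the usual Hyper-V enlightenments rather than anything specific to this setup:
<!-- Sketch of the relevant <features> section; 'whatever' is a placeholder id. -->
<features>
  <acpi/>
  <apic/>
  <hyperv>
    <relaxed state='on'/>
    <vapic state='on'/>
    <spinlocks state='on' retries='8191'/>
    <vendor_id state='on' value='whatever'/>
  </hyperv>
  <kvm>
    <hidden state='on'/>
  </kvm>
  <ioapic driver='kvm'/>
</features>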
The vendor ID part is there already.
Only the ioapic entry was missing. I added it and checked; still the same issue.
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0xd8' slot='0x00' function='0x0'/>
</source>
<rom file='/usr/share/kvm/vbios.bin'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0' multifunction='on'/>
</hostdev>
I also tried the rom file option in the hostdev section, as shown above.
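If that vbios.bin was not dumped from this exact card, that alone can cause trouble. For reference, a common way to dump it from the host via sysfs (the card must not be actively claimed for the read to work reliably, so it may need to be temporarily unbound from vfio-pci first):
# Dump the card's vBIOS from sysfs (run as root on the host).
cd /sys/bus/pci/devices/0000:d8:00.0
echo 1 > rom                         # enable reading the ROM
cat rom > /usr/share/kvm/vbios.bin   # copy it out
echo 0 > rom                         # disable it again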
Did you add any extra kvm config during setup?
I added a few options in the grub file:
root~# cat /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on kvm.ignore_msrs=1 vfio-pci.ids=10de:1eb1,10de:10f8,10de:1ad8,10de:1ad9"
root~# cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:1eb1,10de:10f8,10de:1ad8,10de:1ad9 disable_vga=1
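For reference, on Ubuntu those two files only take effect after regenerating the boot config and initramfs and rebooting; a quick way to apply and verify, using standard Ubuntu tooling:
# Regenerate the grub config and initramfs, then reboot so the
# kernel parameters and vfio-pci options take effect.
update-grub
update-initramfs -u
reboot
# After reboot, confirm vfio-pci owns all four functions:
lspci -nnk -s d8:00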
Also, the kvm-spice emulator is used for the guest VM domain:
/usr/bin/kvm-spice
Any suggestions or configurations to try out?
I also missed mentioning the PowerEdge server model: it is a Dell PowerEdge R740.
generix (February 12, 2021, 1:16pm):
The only thing that comes to mind is to make sure you also passed through the subdevices, not just the main GPU function.
Thanks for your quick response.
Actually, I tried passing through all the PCI functions that come under the main GPU, along the lines of the sketch below.
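For reference, a sketch of what that looks like in the domain XML, with all four host functions mapped onto one multifunction guest slot (the guest-side bus/slot values here are illustrative):
<!-- All four functions of the card on one multifunction guest slot. -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0xd8' slot='0x00' function='0x0'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0' multifunction='on'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0xd8' slot='0x00' function='0x1'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x1'/>
</hostdev>
<!-- ...and likewise for function 0x2 (USB) and function 0x3 (serial bus). -->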
As an alternative, I tried accessing the RTX 4000 GPU directly on bare metal (Dell PowerEdge R740) by installing the same Windows Server 2016 OS.
The observation is the same on both the virtualized and bare-metal configurations.
I have also confirmed that the PCIe 8-pin connector is powering the GPU.
generix (February 12, 2021, 1:25pm):
So the device also doesn’t work with win 2016 bare-metal? Seems broken, then.
Yes, the issue is the same on both configurations, tested on two different Dell PowerEdge R740 servers:
One server is installed with Ubuntu 18.04 virtualization plus a Windows Server 2016 guest VM to access the GPU.
The other server is installed with Windows Server 2016 directly to access the GPU.
generix (February 12, 2021, 1:33pm):
I guess you’ll need to have it replaced by your vendor if still under warranty.
If you want detailed info, install the driver in the Ubuntu host OS, run nvidia-bug-report.sh as root, and attach the resulting nvidia-bug-report.log.gz to your post.
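For reference, roughly what that entails on the Ubuntu host; note that the card would first need to be released from vfio-pci so the NVIDIA driver can actually attach (that unbinding step is an assumption about this particular setup):
# Run as root on the Ubuntu host, after installing the NVIDIA driver
# and releasing the card from vfio-pci (e.g. remove the vfio-pci.ids
# entries from grub/modprobe and reboot so the driver can bind).
nvidia-bug-report.sh
# This writes nvidia-bug-report.log.gz to the current directory.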