RTX 3060 PCI passthrough to guest under KVM (QEMU)

Longliu · October 17, 2022, 8:38am

I’m trying to pass an RTX 3060 through to an instance (virtual machine) under KVM (QEMU).
The host is Ubuntu 20.04. The VM is Windows 10.
The host kernel boot parameters are:
intel_iommu=on iommu=pt vfio-pci.ids=10de:2487,10de:228b vfio-pci.disable_idle_d3=1
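These are kernel boot parameters, so whether they actually took effect can be checked against /proc/cmdline on the running host. A minimal sketch, assuming a POSIX shell; the function name `missing_params` is made up for illustration and is not part of any tool:

```shell
# Print every expected parameter that does NOT appear in the given command line.
# Matching is literal, so "vfio-pci.x" and "vfio_pci.x" count as different strings here.
missing_params() {
    cmdline=$1
    shift
    for p in "$@"; do
        case " $cmdline " in
            *" $p "*) ;;                   # present, nothing to report
            *) printf '%s\n' "$p" ;;       # absent from the command line
        esac
    done
}
```

On the host it could be called as `missing_params "$(cat /proc/cmdline)" intel_iommu=on iommu=pt vfio-pci.ids=10de:2487,10de:228b vfio-pci.disable_idle_d3=1`; no output means all four parameters are present.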

The relevant part of the QEMU command line is:
-device vfio-pci,host=0000:18:00.0,id=hostdev0,bus=pci.0,addr=0x9
-device vfio-pci,host=0000:18:00.1,id=hostdev1,bus=pci.0,addr=0xa

At first it worked very well, and the VM could use the RTX 3060 normally.
On the host, the RTX 3060 looked like this:

# lspci | grep NVIDIA
18:00.0 VGA compatible controller: NVIDIA Corporation Device 2487 (rev a1)
18:00.1 Audio device: NVIDIA Corporation Device 228b (rev a1)

# lspci -s 18:00.0 -vv
18:00.0 VGA compatible controller: NVIDIA Corporation Device 2487 (rev a1) (prog-if 00 [VGA controller])
		Subsystem: NVIDIA Corporation Device 1530
		Physical Slot: 6
		Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
		Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
		Latency: 0, Cache Line Size: 32 bytes
		Interrupt: pin A routed to IRQ 11
……
……

# lspci -t
……
+-[0000:17]-+-00.0
|           +-00.1
|           +-00.2
|           +-00.4
|           \-02.0-[18]--+-00.0
|                        \-00.1
……

Then, after I restarted the VM many times, the RTX 3060 was lost.
At that point, I checked the RTX 3060 on the host:

# lspci | grep NVIDIA
18:00.0 VGA compatible controller: NVIDIA Corporation Device 2487 (rev ff)
18:00.1 Audio device: NVIDIA Corporation Device 228b (rev ff)

# lspci -s 18:00.0 -vv
18:00.0 VGA compatible controller: NVIDIA Corporation Device 2487 (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau
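A revision of ff together with "Unknown header type 7f" is what lspci shows when reads of the device's configuration space come back as all ones, i.e. the function no longer answers on the bus. A small sketch for spotting such functions in lspci output; `find_dead_functions` is an illustrative name, and the filter on "NVIDIA" is an assumption that fits this host:

```shell
# Read `lspci` text on stdin and print the address of every NVIDIA function
# whose revision reads back as ff. Works on saved output as well as live output.
find_dead_functions() {
    awk '/NVIDIA/ && /\(rev ff\)/ { print $1 }'
}
```

Usage would be `lspci | find_dead_functions`; an empty result means no NVIDIA function currently reports rev ff.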

# dmesg | grep vfio
[ 1523.197552] vfio-pci 0000:51:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
[ 1523.197564] vfio-pci 0000:51:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[ 1523.197568] vfio-pci 0000:51:00.0: vfio_ecap_init: hiding ecap 0x26@0xc1c
[ 1523.197569] vfio-pci 0000:51:00.0: vfio_ecap_init: hiding ecap 0x27@0xd00
[ 1523.197570] vfio-pci 0000:51:00.0: vfio_ecap_init: hiding ecap 0x25@0xe00
[ 1523.198897] vfio-pci 0000:51:00.1: vfio_ecap_init: hiding ecap 0x25@0x160
[ 1524.421723] vfio-pci 0000:51:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 1524.421771] vfio-pci 0000:51:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 1525.165440] vfio-pci 0000:51:00.0: timed out waiting for pending transaction; performing function level reset anyway
[ 1526.413440] vfio-pci 0000:51:00.0: not ready 1023ms after FLR; waiting
[ 1527.469443] vfio-pci 0000:51:00.0: not ready 2047ms after FLR; waiting
[ 1529.581440] vfio-pci 0000:51:00.0: not ready 4095ms after FLR; waiting
[ 1533.933439] vfio-pci 0000:51:00.0: not ready 8191ms after FLR; waiting
[ 1542.381440] vfio-pci 0000:51:00.0: not ready 16383ms after FLR; waiting
[ 1559.789426] vfio-pci 0000:51:00.0: not ready 32767ms after FLR; waiting
[ 1567.229554] vfio-pci 0000:18:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
[ 1567.229567] vfio-pci 0000:18:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[ 1567.229572] vfio-pci 0000:18:00.0: vfio_ecap_init: hiding ecap 0x26@0xc1c
[ 1567.229573] vfio-pci 0000:18:00.0: vfio_ecap_init: hiding ecap 0x27@0xd00
[ 1567.229574] vfio-pci 0000:18:00.0: vfio_ecap_init: hiding ecap 0x25@0xe00
[ 1567.231029] vfio-pci 0000:18:00.1: vfio_ecap_init: hiding ecap 0x25@0x160
[ 1594.605404] vfio-pci 0000:51:00.0: not ready 65535ms after FLR; giving up
[ 1595.006376] vfio-pci 0000:51:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 1595.014162] vfio-pci 0000:51:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 1821.133200] vfio-pci 0000:51:00.0: timed out waiting for pending transaction; performing function level reset anyway
[ 1822.381198] vfio-pci 0000:51:00.0: not ready 1023ms after FLR; waiting
[ 1823.437196] vfio-pci 0000:51:00.0: not ready 2047ms after FLR; waiting
[ 1825.517195] vfio-pci 0000:51:00.0: not ready 4095ms after FLR; waiting
[ 1829.869194] vfio-pci 0000:51:00.0: not ready 8191ms after FLR; waiting
[ 1838.317188] vfio-pci 0000:51:00.0: not ready 16383ms after FLR; waiting
[ 1856.749168] vfio-pci 0000:51:00.0: not ready 32767ms after FLR; waiting
[ 1891.565131] vfio-pci 0000:51:00.0: not ready 65535ms after FLR; giving up
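In the log above, each failed reset ends with a "not ready 65535ms after FLR; giving up" line. To see quickly which devices are affected and how often, the log can be summarised. This is only a sketch that assumes the message format shown above; `flr_failures` is a hypothetical helper name:

```shell
# Read kernel log text on stdin and print "<device> <count>" for every
# vfio-pci device whose function level reset ended in "giving up".
flr_failures() {
    awk '/after FLR; giving up/ {
             for (i = 1; i < NF; i++)
                 if ($i == "vfio-pci") {
                     dev = $(i + 1)
                     sub(/:$/, "", dev)    # drop the colon after the address
                     n[dev]++
                 }
         }
         END { for (d in n) print d, n[d] }' | sort
}
```

It could be run as `dmesg | flr_failures`. On the excerpt above it would list only 0000:51:00.0, since no give-up line for 0000:18:00.0 appears in that excerpt.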

I tried to remove and rescan the PCI device manually, but after that the device was gone completely.

# file /sys/devices/pci0000:17/0000:17:02.0/0000:18:00.0
/sys/devices/pci0000:17/0000:17:02.0/0000:18:00.0: directory
# echo 1 > /sys/devices/pci0000:17/0000:17:02.0/0000:18:00.0/remove
# echo 1 > /sys/devices/pci0000:17/0000:17:02.0/0000:18:00.1/remove
# echo 1 > /sys/devices/pci0000:17/0000:17:02.0/rescan
# file /sys/devices/pci0000:17/0000:17:02.0/0000:18:00.0
/sys/devices/pci0000:17/0000:17:02.0/0000:18:00.0: cannot open `/sys/devices/pci0000:17/0000:17:02.0/0000:18:00.0' (No such file or directory)
# lspci | grep NVIDIA
(no output)

The only way to recover is to reboot the host.
Any ideas are appreciated.

Longliu · October 17, 2022, 8:42am

I have two 3060 cards in my host, at 0000:18:00.0 and 0000:51:00.0. The kernel log lines for 0000:51:00.0 show the same problem as on 0000:18:00.0.

generix · October 20, 2022, 7:09am

The lspci output on the host says the GPUs are turned off. Might be a power issue.

Longliu · October 20, 2022, 9:50am

Thanks for the reply.

“Might be a power issue.”
How can I confirm it?

The problem only occurs after the VM has been powered on, powered off, or rebooted many times. As long as the VM’s state doesn’t change, everything stays normal.

generix · October 20, 2022, 10:03am

IDK, it’s a difficult setup to get any info from. Only the Windows eventlog might contain something but the nvidia windows driver doesn’t have good logging.

Longliu · October 20, 2022, 10:21am

When the VM is CentOS, the problem also occurs. Is there any way to work around it?

generix · October 20, 2022, 11:48am

Then please use the CentOS VM, run nvidia-bug-report.sh as root after the issue has occurred, and attach the resulting nvidia-bug-report.log.gz file to your post.

Longliu · October 20, 2022, 12:34pm

I’ll try it. Thanks a lot.

Longliu · October 25, 2022, 3:25am

Sorry for the late reply. Here is my log, see attachment.
Host: Ubuntu 20.04
VM: CentOS 8
I collected the log on the host after the RTX 3060 was lost.
nvidia-bug-report.log (312.9 KB)

Longliu · October 26, 2022, 4:04am

Hi, could you give me a hand again and have a look at my log?

Also, a soft reboot of the host returns everything to normal, so I wonder whether there is a way to restore the RTX 3060 after the issue has occurred. I tried a rescan, but it doesn’t work. Now I’m looking for a way to power-reset the RTX 3060 PCIe device.

generix · October 27, 2022, 8:51pm

That looks like a log from the host, please attach one from the centos vm.

Longliu · October 31, 2022, 4:12am

nvidia-bug-report.log (89.2 KB)
The attachment is the CentOS VM log. Thanks a lot.

generix · October 31, 2022, 2:58pm

Nothing useful in the logs. As I said, a setup like this is not really debuggable.
To rule out power issues, does it reliably work if you physically remove one of the gpus?

Longliu · November 1, 2022, 1:54am

I don’t understand. I have two 3060 cards, and right now only one of them is lost. Which one should be removed? Should I just remove it, or hot-plug it?
I guess you want me to hot-plug the bad one and see whether it returns to normal. Is that right?

Longliu · November 9, 2022, 8:36am

Hi, can I try to turn the GPU card off and on using bbswitch? I ran into a problem when loading bbswitch:

[ 3737.179360] bbswitch: version 0.8
[ 3737.179374] bbswitch: Found discrete VGA device 0000:06:00.0: \_SB_.PC00.RP06.VB00.D031
[ 3737.179376] bbswitch: Found discrete VGA device 0000:18:00.0: \_SB_.PC01.BR1A.H000
[ 3737.179379] bbswitch: Found discrete VGA device 0000:51:00.0: \_SB_.PC02.BR2C.H000
[ 3737.179414] bbswitch: failed to evaluate \_SB_.PC02.BR2C.H000._DSM {0xF8,0xD8,0x86,0xA4,0xDA,0x0B,0x1B,0x47,0xA7,0x2B,0x60,0x42,0xA6,0xB5,0xBE,0xE0}0x100 0x0 {0x00,0x00,0x00,0x00}: AE_NOT_FOUND
[ 3737.179417] bbswitch: failed to evaluate \_SB_.PC02.BR2C.H000._DSM {0xA0,0xA0,0x95,0x9D,0x60,0x00,0x48,0x4D,0xB3,0x4D,0x7E,0x5F,0xEA,0x12,0x9F,0xD4}0x102 0x0 {0x00,0x00,0x00,0x00}: AE_NOT_FOUND
[ 3737.179417] bbswitch: No suitable _DSM call found.

generix · November 9, 2022, 2:17pm

bbswitch is for notebooks only.

Longliu · November 10, 2022, 9:51am

I’m using Ubuntu 22.04 on the host and can check the card’s status via /sys/bus/pci/devices/0000:18:00.0/power_state.
I disabled D3hot by setting vfio_pci.disable_idle_d3=1 in GRUB, so normally the GPU card always stays in D0.

root@POD209-CLU01-H012:/sys/bus/pci/devices/0000:18:00.0# cat power_state
D0
root@POD209-CLU01-H012:/sys/bus/pci/devices/0000:18:00.0#
root@POD209-CLU01-H012:/sys/bus/pci/devices/0000:18:00.0# setpci -s 18:00.0 60.l     # 0x60 is the offset of the Power Management capability
48036801
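The dword read at 0x60 packs the capability ID (bits 7:0), the next-capability pointer (bits 15:8) and the 16-bit Power Management Capabilities register (bits 31:16). The current power state itself is not in this dword but in the low two bits of the PMCSR register in the following dword, which would be offset 0x64 on this card. A rough decoding sketch under those assumptions; both function names are illustrative only:

```shell
# Decode the first dword of a PCI Power Management capability (hex, no 0x prefix).
decode_pm_cap() {
    val=$((0x$1))
    printf 'cap_id=0x%02x next=0x%02x pm_version=%d d1=%d d2=%d\n' \
        $((val & 0xff)) $(((val >> 8) & 0xff)) \
        $(((val >> 16) & 0x7)) $(((val >> 25) & 1)) $(((val >> 26) & 1))
}

# Decode the PowerState field (bits 1:0) of a PMCSR dword (hex, no 0x prefix).
# A value of all ones is treated as "the device did not answer the read".
decode_pmcsr() {
    case $1 in
        ffffffff|FFFFFFFF) echo "no response"; return ;;
    esac
    case $((0x$1 & 3)) in
        0) echo D0 ;;
        1) echo D1 ;;
        2) echo D2 ;;
        3) echo D3hot ;;
    esac
}
```

For the value above, `decode_pm_cap 48036801` reports capability ID 0x01 (Power Management), next pointer 0x68, PM version 3, and no D1 or D2 support. The PMCSR could then be read with `setpci -s 18:00.0 64.l` and passed to `decode_pmcsr`.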

When the problem occurs, power_state changes to D3cold.

root@POD209-CLU01-H045:/sys/bus/pci/devices/0000:18:00.0# cat power_state
D3cold

Does this mean that main power was removed unexpectedly? Any idea why?
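To catch the moment the state changes, the power_state file of both cards can be read in one go, for example before and after each VM restart. A minimal sketch, assuming the sysfs attribute shown above is available on the host kernel; `report_power_state` and the `SYSFS_PCI_ROOT` override are hypothetical names used here only to keep the example easy to try:

```shell
# Print "<device> <state>" for each PCI address given, or "<device> missing"
# when the device directory or its power_state file is not readable.
report_power_state() {
    root=${SYSFS_PCI_ROOT:-/sys/bus/pci/devices}
    for dev in "$@"; do
        f="$root/$dev/power_state"
        if [ -r "$f" ]; then
            printf '%s %s\n' "$dev" "$(cat "$f")"
        else
            printf '%s missing\n' "$dev"
        fi
    done
}
```

On this host the call would be `report_power_state 0000:18:00.0 0000:51:00.0`. A card that is healthy should print D0 given the disable_idle_d3 setting, and a lost card would print D3cold, or missing after a remove and rescan.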

generix · November 12, 2022, 5:38pm

Absolutely no idea how that happens.

tao.xu1 · July 24, 2023, 9:14am

Hi guys, is there any progress?
