nvidia-bug-report.log.gz (52.0 KB)
Environment:
- Host: Ubuntu 22.04, 5.19.0-18-generic (patched, explained below)
- Guest: Debian Buster, Linux 6.1
Equipment:
- CPU: Intel i9 13900K
- Motherboard: Asrock Z790 Taichi
- eGPU Dock: Razer Core X
- eGPU: RTX 3060 Lite Hash Rate
What I am trying to do:
1. Make the eGPU hotpluggable.
2. Bind the eGPU to VFIO-PCI driver.
3. Passthrough the eGPU to a QEMU Linux VM.
4. Check the eGPU functionality with nvidia-smi.
What I have done:
1. Make the eGPU hotpluggable.
- This wasn't possible on vanilla Ubuntu 20.04 or 22.04: the host crashes on hot-unplug.
- After adding the following host kernel parameters, the eGPU is hotpluggable:
quiet splash pci=assign-busses,hpbussize=0x40,realloc,hpmmiosize=128M,hpmmioprefsize=4G intel_iommu=on
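For anyone reproducing this, a quick way to confirm that the pci= options actually reached the running kernel is to compare them against /proc/cmdline. The snippet below is only a minimal sketch: the helper names (parse_pci_opts, missing_pci_opts) are made up for illustration, and the expected option set is simply copied from the parameters quoted in this post.

```python
from pathlib import Path

# Options copied from the kernel command line quoted in this post.
EXPECTED_PCI_OPTS = {
    "assign-busses",
    "hpbussize=0x40",
    "realloc",
    "hpmmiosize=128M",
    "hpmmioprefsize=4G",
}


def parse_pci_opts(cmdline):
    """Collect the comma-separated values of every pci= parameter."""
    opts = set()
    for token in cmdline.split():
        if token.startswith("pci="):
            opts.update(o for o in token[len("pci="):].split(",") if o)
    return opts


def missing_pci_opts(cmdline):
    """Return the expected pci= options that are absent from cmdline."""
    return EXPECTED_PCI_OPTS - parse_pci_opts(cmdline)


if __name__ == "__main__":
    path = Path("/proc/cmdline")
    if path.exists():
        missing = missing_pci_opts(path.read_text())
        print("missing pci= options:", sorted(missing) if missing else "none")
```

This only checks that the options are present on the command line; it says nothing about whether the kernel honored them during bus enumeration.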
2. Bind the eGPU to VFIO-PCI driver.
- No problems here.
3. Passthrough the eGPU to a QEMU Linux VM.
- eGPU passthrough did not work. When QEMU is run, the following message is printed:
[11681.815785] vfio-pci 0000:08:00.0: can't enable device: BAR 5 [io 0x0000-0x007f] not claimed
My full dmesg log of the hotplug event is attached as dmesg.log. In dmesg, you can see that BAR 5 of the GPU is assigned [io 0x6000-0x607f].
FYI, this is my PCI bridge status printed with lspci -vt:
+-1a.0-[03-86]----00.0-[04-86]--+-00.0-[05]----00.0 Intel Corporation Thunderbolt 4 NHI [Maple Ridge 4C 2020]
| +-01.0-[06-45]----00.0-[07-45]--+-01.0-[08]--+-00.0 NVIDIA Corporation GA106 [GeForce RTX 3060 Lite Hash Rate]
| | | \-00.1 NVIDIA Corporation GA106 High Definition Audio Controller
| | \-04.0-[09-45]----00.0-[0a-0d]--+-00.0-[0b]----00.0 ASMedia Technology Inc. ASM1142 USB 3.1 Host Controller
| | +-01.0-[0c]----00.0 ASMedia Technology Inc. ASM1142 USB 3.1 Host Controller
| | \-02.0-[0d]----00.0 ASMedia Technology Inc. ASM1142 USB 3.1 Host Controller
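To cross-check what the host kernel currently holds for each BAR (for example, whether BAR 5 shows the 0x6000-0x607f range from dmesg or the 0x0000-0x007f range from the vfio-pci error), the sysfs resource file of the device can be read. This is a minimal sketch, not something from my setup scripts: parse_bars is a hypothetical helper, the flag constants are the IORESOURCE_IO and IORESOURCE_MEM bits from include/linux/ioport.h, and it assumes the first six lines of the file correspond to BAR 0 through BAR 5.

```python
from pathlib import Path

IORESOURCE_IO = 0x00000100   # I/O port region
IORESOURCE_MEM = 0x00000200  # memory-mapped region


def parse_bars(text):
    """Parse the first six 'start end flags' lines of a sysfs PCI resource file.

    Returns a list of (bar_index, kind, start, end) tuples.
    """
    bars = []
    for index, line in enumerate(text.splitlines()[:6]):
        start, end, flags = (int(field, 16) for field in line.split())
        if flags & IORESOURCE_IO:
            kind = "io"
        elif flags & IORESOURCE_MEM:
            kind = "mem"
        else:
            kind = "unused"
        bars.append((index, kind, start, end))
    return bars


if __name__ == "__main__":
    # 0000:08:00.0 is the GPU address from the dmesg line in this post.
    path = Path("/sys/bus/pci/devices/0000:08:00.0/resource")
    if path.exists():
        for index, kind, start, end in parse_bars(path.read_text()):
            print(f"BAR {index}: {kind:6s} {start:#x}-{end:#x}")
```

Running it before and after starting QEMU would show whether the BAR 5 range changes between the hotplug event and the passthrough attempt.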
To mitigate this, I assumed that BAR 5 functionality is not critical (the eGPU works fine on my host; I've run benchmark tests on it), and I patched the line at drivers/pci/setup-res.c#n490 to skip the BAR 5 sanity check on passthrough.
With the patch, passthrough works: I can see the device inside the VM with lspci. I installed the out-of-tree open-gpu-kernel-module and the closed-source part of the NVIDIA driver in the VM, and I can confirm that the GPU is bound to the nvidia module in the VM.
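The driver binding can also be read from sysfs by following the driver symlink of the device, both on the host (where vfio-pci is expected) and in the guest (where nvidia is expected). The sketch below uses a hypothetical helper, bound_driver; the sysfs_root parameter exists only so the function can be exercised against a fake tree, and the guest address in the usage line is an assumption, since the address inside the VM depends on the QEMU topology.

```python
import os


def bound_driver(bdf, sysfs_root="/sys"):
    """Return the name of the driver bound to a PCI device, or None if unbound."""
    link = os.path.join(sysfs_root, "bus", "pci", "devices", bdf, "driver")
    if not os.path.islink(link):
        return None
    # The symlink points at .../bus/pci/drivers/<driver name>.
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    # Host address taken from the dmesg line in this post; adjust inside the guest.
    print(bound_driver("0000:08:00.0"))
```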
4. Check the eGPU functionality with nvidia-smi.
- However, nvidia-smi does not detect any devices, as you can also see in the attached log.
Is there any possibility that the RmInitAdapter failure is related to BAR 5?
Or is it something else? I can't find much about "BAR 5 not claimed" errors on the web, so any thoughts would help greatly.