Title: Blackwell (RTX Pro 5000) KVM Passthrough: GSP Timeout causes unexpected WPR2 already up and cannot be recovered via SBR/D3cold without Host Reboot
Environment:
- GPU: NVIDIA RTX Pro 5000 (Blackwell Architecture, PCI ID: 10de:2bb3)
- Host OS: Linux (KVM/QEMU Hypervisor)
- Guest OS: Ubuntu 24.04 LTS
- Driver Version: 580.105.08 (Open Kernel Module / MIT-GPL Flavor)
Description: When passing through the RTX Pro 5000 (Blackwell) to an Ubuntu VM via VFIO, the GSP firmware occasionally hits a heartbeat timeout during initialization or driver reload. Once this happens, the GPU enters an unrecoverable “bad state” where the driver fails to probe with the following errors:
dmesg |grep -iE "xid|gsp|nvrm"
[ 8.726693] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 580.105.08 Release Build (dvs-builder@U22-I3-B10-02-5) Wed Oct 29 22:29:53 UTC 2025
[ 69.779330] NVRM: Xid (PCI:0000:01:00): 62, 32311d90 0002a258 00000000 205f2a72 205f2e00 205f2d46 205f412e 205f45a6
[ 73.781043] NVRM: nvAssertOkFailedNoLog: Assertion failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from RPC_HDR->rpc_result @ kernel_gsp.c:4999
[ 73.781059] NVRM: nvAssertOkFailedNoLog: Assertion failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from kgspWaitForRmInitDone(pGpu, pKernelGsp) @ kernel_gsp_gh100980
[ 73.781106] NVRM: _kgspBootGspRm: unexpected WPR2 already up, cannot proceed with booting GSP
[ 73.781108] NVRM: _kgspBootGspRm: (the GPU is likely in a bad state and may need to be reset)
[ 73.781177] NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
[ 73.782282] NVRM: iovaspaceDestruct_IMPL: 1 left-over mappings in IOVAS 0x100
[ 73.782300] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x40:2015)
[ 73.783709] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 3726.717643] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 580.105.08 Wed Oct 29 23:15:11 UTC 2025
[ 3778.266333] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 580.105.08 Wed Oct 29 23:15:11 UTC 2025
[ 3879.682006] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:2bb3)
NVRM: installed in this system requires use of the NVIDIA open kernel modules.
[ 3879.682080] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:884)
[ 3879.684032] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
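To tell quickly whether a guest has hit this exact signature, the log above can be reduced to a one-word verdict. This is only a rough sketch: the function name classify_gsp_log and the verdict labels are my own inventions, and the patterns are simply the message strings copied from the dmesg output above.

```shell
# Hypothetical helper (not part of any NVIDIA tooling): read kernel log text
# on stdin and print a one-word summary of the failure signature.
classify_gsp_log() {
    local log
    log=$(cat)
    if printf '%s\n' "$log" | grep -q 'unexpected WPR2 already up'; then
        # The driver refused to boot GSP because WPR2 was still set.
        echo "wpr2-stuck"
    elif printf '%s\n' "$log" | grep -qE 'Xid \(PCI:[0-9a-fA-F:.]+\): 62,'; then
        # Xid 62 was reported, but the WPR2 message has not appeared (yet).
        echo "xid62-only"
    elif printf '%s\n' "$log" | grep -q 'RmInitAdapter failed'; then
        # Adapter init failed for some other reason.
        echo "init-failed"
    else
        echo "clean"
    fi
}
```

Inside the guest it would be used as `sudo dmesg | classify_gsp_log`.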
The Core Issue (Hardware/Reset Lock): Attempts to recover the GPU from the Host without rebooting the entire physical machine all failed:
- Triggering a Secondary Bus Reset (SBR) via setpci -s <bridge> BRIDGE_CONTROL.w=0040 completely breaks the PCIe link. The GPU drops off the bus entirely (probe failed with error -1 / fallen off the bus), and the PCIe link status downgrades or stays frozen with MSI: Enable-.
- Forcing a D3cold power state change on the root port also fails to flush the WPR2 region or unfreeze the MSI capability.
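One thing worth ruling out on the SBR attempt: writing 0040 to BRIDGE_CONTROL sets the Secondary Bus Reset bit (bit 6) and overwrites the other bits in that register, and the link stays in reset until the bit is cleared again. Below is a minimal sketch of a set-then-clear pulse using setpci's value:mask form. The function name sbr_pulse, the DRY_RUN switch, and the sleep durations are my own assumptions, and I am not claiming this clears the WPR2 state. It only removes a still-asserted reset bit as a possible reason for the GPU falling off the bus.

```shell
# Hypothetical helper: pulse Secondary Bus Reset on the bridge above the GPU.
# Usage: sbr_pulse <bridge BDF>
# By default it only prints the commands; set DRY_RUN=0 to really run them (as root).
sbr_pulse() {
    local bridge=$1
    local run=echo
    [ "${DRY_RUN:-1}" = "0" ] && run=
    # Set only bit 6 (Secondary Bus Reset); the :0040 mask leaves other bits untouched.
    $run setpci -s "$bridge" BRIDGE_CONTROL.w=0040:0040
    # Hold reset briefly (duration is an assumption, not a spec-derived value).
    $run sleep 0.1
    # Clear bit 6 again so the link can retrain.
    $run setpci -s "$bridge" BRIDGE_CONTROL.w=0000:0040
    # Give the device time to come back before touching its config space.
    $run sleep 1
}
```

The kernel also exposes /sys/bus/pci/devices/<BDF>/reset, which saves and restores config space around the reset, so that path may be a safer thing to compare against than raw setpci writes.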
Currently, the ONLY way to clear the unexpected WPR2 already up condition and flip MSI: Enable- back to Enable+ is a full hard reboot of the physical Host machine, which is highly disruptive for virtualization environments.
nvidia-bug-report-close-version.log.gz (103.3 KB)
nvidia-bug-report-open-version.log.gz (112.7 KB)
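For anyone comparing before/after states, the MSI: Enable flag mentioned above can be pulled out of lspci -vv output with a small filter. This is a sketch only; msi_state is a made-up name, and it relies on the usual "MSI: Enable+" / "MSI: Enable-" text in the capability line (the MSI-X line is deliberately not matched).

```shell
# Hypothetical helper: read `lspci -vv` text on stdin and report the MSI
# enable flag of the first MSI capability line found.
msi_state() {
    local flag
    # Capture the +/- that follows "MSI: Enable"; "MSI-X: Enable" does not match.
    flag=$(sed -n 's/.*MSI: Enable\([+-]\).*/\1/p' | head -n 1)
    case "$flag" in
        +) echo "enabled" ;;
        -) echo "disabled" ;;
        *) echo "unknown" ;;
    esac
}
```

On the host this would be run as `sudo lspci -vv -s 01:00.0 | msi_state`. Note that MSI is normally switched on by whichever driver is bound to the device, so "disabled" is also what a healthy but idle device shows; the flag is only meaningful while a driver is actually using the GPU.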
One additional detail: if I first install Windows (Windows 10/11) in the VM with the passed-through GPU and then reinstall that same VM with Linux (Ubuntu), the issue reproduces 100% of the time.
Therefore, I believe the root cause is related to WPR2 (Write Protected Region 2). During driver removal, or when an abnormal condition occurs, WPR2 is not cleaned up properly, and the newly installed driver does not have permission to clean it up afterward.