Nvidia-smi No devices were found Cannot initialize GSP firmware RM

Title: Blackwell (RTX Pro 5000) KVM Passthrough: GSP Timeout causes unexpected WPR2 already up and cannot be recovered via SBR/D3cold without Host Reboot

Environment:

  • GPU: NVIDIA RTX Pro 5000 (Blackwell Architecture, PCI ID: 10de:2bb3)

  • Host OS: Linux (KVM/QEMU Hypervisor)

  • Guest OS: Ubuntu 24.04 LTS

  • Driver Version: 580.105.08 (Open Kernel Module / MIT-GPL Flavor)

Description: When passing through the RTX Pro 5000 (Blackwell) to an Ubuntu VM via VFIO, the GSP firmware occasionally hits a heartbeat timeout during initialization or driver reload. Once this happens, the GPU enters an unrecoverable “bad state” where the driver fails to probe with the following errors:


dmesg |grep -iE "xid|gsp|nvrm"
[    8.726693] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  580.105.08  Release Build  (dvs-builder@U22-I3-B10-02-5)  Wed Oct 29 22:29:53 UTC 2025
[   69.779330] NVRM: Xid (PCI:0000:01:00): 62, 32311d90 0002a258 00000000 205f2a72 205f2e00 205f2d46 205f412e 205f45a6
[   73.781043] NVRM: nvAssertOkFailedNoLog: Assertion failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from RPC_HDR->rpc_result @ kernel_gsp.c:4999
[   73.781059] NVRM: nvAssertOkFailedNoLog: Assertion failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from kgspWaitForRmInitDone(pGpu, pKernelGsp) @ kernel_gsp_gh100980
[   73.781106] NVRM: _kgspBootGspRm: unexpected WPR2 already up, cannot proceed with booting GSP
[   73.781108] NVRM: _kgspBootGspRm: (the GPU is likely in a bad state and may need to be reset)
[   73.781177] NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
[   73.782282] NVRM: iovaspaceDestruct_IMPL: 1 left-over mappings in IOVAS 0x100
[   73.782300] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x40:2015)
[   73.783709] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 3726.717643] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  580.105.08  Wed Oct 29 23:15:11 UTC 2025
[ 3778.266333] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  580.105.08  Wed Oct 29 23:15:11 UTC 2025
[ 3879.682006] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:2bb3)
               NVRM: installed in this system requires use of the NVIDIA open kernel modules.
[ 3879.682080] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:884)
[ 3879.684032] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
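The log above contains several distinct failure signatures. As a rough aid for triage, the sketch below reads a kernel log on stdin and reports which of them it sees first in priority order. The function name and the short labels it prints are made up for illustration; the matched strings are copied from the log lines above.

```shell
# Classify an NVRM failure from kernel log text on stdin.
# The labels printed here are illustrative, not NVIDIA terminology.
classify_gsp_failure() {
    # Read the whole log so it can be matched against several patterns.
    log=$(cat)
    case "$log" in
        *"unexpected WPR2 already up"*)
            echo "wpr2-stuck" ;;
        *"Cannot initialize GSP firmware RM"*)
            echo "gsp-init-failed" ;;
        *"requires use of the NVIDIA open kernel modules"*)
            echo "needs-open-modules" ;;
        *)
            echo "no-known-signature" ;;
    esac
}

# Usage inside the guest:
#   sudo dmesg | classify_gsp_failure
```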


The Core Issue (Hardware/Reset Lock): Attempts to recover the GPU from the Host without rebooting the entire physical machine all failed:

  1. Triggering a Secondary Bus Reset (SBR) via setpci -s <bridge> BRIDGE_CONTROL.w=0040 completely breaks the PCIe link. The GPU drops off the bus entirely (probe failed with error -1 / fallen off the bus), and the PCIe Link status downgrades or stays frozen with MSI: Enable-.

  2. Forcing a D3cold power state change on the root port also fails to clear the WPR2 region or re-enable the MSI capability.
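For reference, writing BRIDGE_CONTROL.w=0040 replaces the whole 16-bit register and leaves the reset bit asserted until something clears it. A minimal sketch of a masked pulse is shown below, using setpci's value:mask syntax so that only bit 6 (Secondary Bus Reset) changes, and then releasing it. The function names, the DRY_RUN switch, and the hold and settle times are assumptions for illustration, and this has not been tested on the hardware described here, so it may well fail the same way. The GPU should not be in use by a VM or bound to a driver while doing this.

```shell
# Run a command, or only print it when DRY_RUN is 1 (the default).
_sbr_run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Pulse the Secondary Bus Reset bit on the bridge given as $1.
# The bridge address is whatever sits directly upstream of the GPU.
sbr_pulse() {
    bridge=$1
    # Assert reset: set bit 6 only, leave the other control bits untouched.
    _sbr_run setpci -s "$bridge" BRIDGE_CONTROL.w=0040:0040
    # Hold time is an assumption, chosen to be generous.
    _sbr_run sleep 0.1
    # Release reset: clear bit 6 only.
    _sbr_run setpci -s "$bridge" BRIDGE_CONTROL.w=0000:0040
    # Give the link time to retrain before touching the device.
    _sbr_run sleep 1
}

# Usage (prints the commands; set DRY_RUN=0 and run as root to execute):
#   sbr_pulse <bridge>
```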

Currently, the ONLY way to clear the unexpected WPR2 already up condition and flip MSI: Enable- back to Enable+ is a full hard reboot of the physical Host machine, which is highly disruptive for virtualization environments.

nvidia-bug-report-close-version.log.gz (103.3 KB)

nvidia-bug-report-open-version.log.gz (112.7 KB)

One additional detail: when I first install a Windows system (Windows 10/11) on the VM with the passed-through GPU, and then reinstall that same VM with a Linux system (Ubuntu), this issue can be reproduced 100% of the time.

Therefore, I believe the root cause is still related to WPR2 (Write Protected Region 2). During driver removal or when an abnormal condition occurs, WPR2 is not cleaned up properly, and the driver installed afterward does not have permission to clean it up.

Latest testing progress:

Scenario A (Forced Power Off)

If I terminate the Windows VM using virsh destroy, the GPU immediately enters a locked/stuck state. Any Linux VM started afterward will fail 100% of the time with errors such as:

  • unexpected WPR2 already up
  • GSP initialization timeout
  • MSI/PCIe bus disconnect/reset issues

The only way to recover is to reboot the host machine completely. My motherboard does not support powering off the GPU slot through PCI bridge slot power control.


Scenario B (Graceful Shutdown)

If I use virsh shutdown and allow Windows to shut down normally inside the guest OS, so that the Windows NVIDIA driver can unload cleanly, then a Linux VM started afterward works correctly.

In this case:

  • nvidia-smi works normally
  • No host reboot is required
  • GPU passthrough remains stable
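Since only the graceful path leaves the GPU usable, a host-side wrapper that requests a shutdown and waits, without ever falling back to virsh destroy, might look like the sketch below. The function name, the default timeout, and the POLL_INTERVAL variable are assumptions for illustration; virsh shutdown and virsh domstate are the standard libvirt commands.

```shell
# Ask a guest to shut down cleanly and wait for it to stop.
# $1 = libvirt domain name, $2 = timeout in seconds (default 120).
# Prints "stopped" or "timeout"; it never calls `virsh destroy`.
graceful_stop() {
    dom=$1
    timeout=${2:-120}
    interval=${POLL_INTERVAL:-2}
    # Send the ACPI shutdown request; give up if libvirt rejects it.
    virsh shutdown "$dom" >/dev/null || return 1
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        state=$(virsh domstate "$dom")
        if [ "$state" = "shut off" ]; then
            echo "stopped"
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    # Leave the decision about a forced stop to the operator.
    echo "timeout"
    return 2
}

# Usage on the host:
#   graceful_stop <domain> 300
```

On a timeout the guest is left running on purpose, because a forced stop is exactly what triggers Scenario A.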

This behavior is highly consistent with another failure pattern we observed under heavy GPU workloads:

  • the GPU crashes unexpectedly
  • the guest exits abnormally
  • GSP/WPR2 enters a locked state or leaves residual firmware state behind

As a result, our investigation is now focused on the high-load crash/reset path.

We also noticed reports on GitHub suggesting that disabling ASPM may reduce the probability of this issue.
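To check what the host is currently doing, the active ASPM policy is the bracketed entry in /sys/module/pcie_aspm/parameters/policy. The helper below is a small sketch for reading it; the function name is made up for illustration. ASPM can be turned off entirely with the pcie_aspm=off kernel parameter on the host. Whether this actually helps here is unconfirmed.

```shell
# Print the active ASPM policy, i.e. the entry shown in [brackets].
# $1 optionally overrides the sysfs path (useful for testing).
aspm_active_policy() {
    file=${1:-/sys/module/pcie_aspm/parameters/policy}
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$file"
}

# Usage on the host:
#   aspm_active_policy
```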

For non-Blackwell architectures, another possible mitigation is to disable GSP firmware mode and fall back to traditional CPU-side RM management instead of GSP-managed mode.
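As a sketch of that fallback: the proprietary kernel module accepts NVreg_EnableGpuFirmware=0 as a module option. It does not apply to the open kernel modules, which depend on GSP, and the log above shows this Blackwell card refusing the proprietary module, so it is not an option for the GPU in this report. The function name and the file name in the usage note are hypothetical.

```shell
# Emit a modprobe.d line asking the proprietary NVIDIA module
# to run without GSP firmware. Not applicable to the open modules.
gsp_off_conf() {
    printf 'options nvidia NVreg_EnableGpuFirmware=0\n'
}

# Usage inside the guest, as root (file name is a hypothetical choice):
#   gsp_off_conf > /etc/modprobe.d/nvidia-gsp-off.conf
# Then reload the nvidia module or reboot the guest.
```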