GSP-RM firmware bug: heartbeat stops after GC6 exit, leads to fatal GPU loss (Xid 79) and broken display recovery

Driver version: 595.58.03 (open kernel modules)
GPU: NVIDIA RTX PRO 1000 Blackwell (GB207GLM) – also confirmed on RTX 3060/3080/4050/4060 Mobile by other users
System: Dell Pro Max 16 Premium (MA16250), hybrid graphics (Intel iGPU + NVIDIA dGPU), S0ix enabled
Kernel: 6.19.10+deb13-amd64
GitHub issue: NVIDIA/open-gpu-kernel-modules, Issue #1064, "GSP heartbeat stuck at 0 since boot with S0ix power management on RTX PRO 1000 Blackwell laptop"

Summary

The GSP-RM firmware stops writing the heartbeat to NV_PGSP_MAILBOX(0) after the first runtime PM GC6 exit. The heartbeat counter stays at 0 for the rest of the session, and every subsequent GPU wake logs _kgspRpcRecvPoll: GSP RM heartbeat timed out.
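To make the symptom concrete, here is a minimal sketch of a heartbeat watchdog of the kind described above. It is not the driver's code: the class name, the field names, and the timeout value are all assumptions for illustration. It only shows why a counter that never moves off 0 ends up reported as a timeout.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Hypothetical host-side watchdog; not taken from the NVIDIA driver."""
    timeout_ns: int
    last_value: int = 0       # last heartbeat value seen in the mailbox
    last_change_ns: int = 0   # timestamp of the last observed change

    def poll(self, value: int, now_ns: int) -> bool:
        """Return True if the heartbeat is considered timed out."""
        if value != self.last_value:
            # The counter advanced, so the firmware is alive.
            self.last_value = value
            self.last_change_ns = now_ns
            return False
        # No change: timed out once the stall exceeds the threshold.
        return now_ns - self.last_change_ns > self.timeout_ns


# A mailbox that always reads 0 never registers a change,
# so the watchdog fires as soon as the threshold has elapsed.
stuck = HeartbeatMonitor(timeout_ns=1_000)
timed_out = stuck.poll(value=0, now_ns=2_000)
```

With a register that genuinely reads zero, a check like this fires on every poll once the threshold has passed, which is consistent with seeing the timeout message on each wake.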

I’ve traced this to the GC6 exit bootstrap path in the open kernel driver source (595.58.03). On GC6 exit, gpuExitSuspend() sets bootMode = KGSP_BOOT_MODE_GC6_EXIT. The bootstrap in kgspBootstrap_TU102() then skips the init RPCs and the status queue init (those only run for NORMAL boot mode), runs Booter Load, and waits for GSP_INIT_DONE. GSP-RM is supposed to restore its full context, including the heartbeat timer, but it doesn’t. The host-side _kgspHeartbeatInit() is purely local (it sets timeout thresholds and sends no RPC to GSP), so the host has no way to tell GSP-RM to restart the heartbeat. The comment at kernel_gsp.c:5018 says “GSP starts sending heartbeat after rminit”, but the GC6 exit path doesn’t go through rminit.
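The branching described above can be modelled in a few lines. This is a simplified sketch, not a transcription of kgspBootstrap_TU102(): the step labels are hypothetical strings, and only the two boot modes mentioned in this report are modelled.

```python
from enum import Enum, auto


class BootMode(Enum):
    NORMAL = auto()
    GC6_EXIT = auto()


def bootstrap_steps(mode: BootMode) -> list[str]:
    """Return the host-side steps for a boot mode (labels are illustrative)."""
    steps = []
    if mode is BootMode.NORMAL:
        # Per the report, these only run for a normal boot.
        steps += ["init_rpcs", "status_queue_init"]
    # Both paths run Booter Load and then wait for GSP_INIT_DONE.
    steps += ["booter_load", "wait_gsp_init_done"]
    return steps


normal = bootstrap_steps(BootMode.NORMAL)
gc6_exit = bootstrap_steps(BootMode.GC6_EXIT)
# Steps that a normal boot performs but a GC6 exit skips.
skipped_on_gc6_exit = [s for s in normal if s not in gc6_exit]
```

In this model nothing on the GC6 exit path asks the firmware to restart the heartbeat, so whether it resumes depends entirely on GSP-RM restoring its own context.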

I verified the host-side code is correct: PTIMER doesn’t reset during GC6 (diff values grow monotonically at 1:1 with wall clock), sysTimerOffsetNs is accurate, and the register genuinely reads zero.
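For anyone who wants to repeat the timer check, the idea can be sketched as follows. The sampling itself is not shown; the function assumes you already have (wall clock ns, PTIMER ns) pairs from your own instrumentation, and the function name and the 1% tolerance are assumptions.

```python
def ptimer_tracks_wall_clock(samples, tolerance=0.01):
    """Check that PTIMER advances monotonically at ~1:1 with wall clock.

    samples: list of (wall_ns, ptimer_ns) pairs in capture order.
    tolerance: allowed relative deviation of the PTIMER/wall-clock slope.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    for (w0, p0), (w1, p1) in zip(samples, samples[1:]):
        d_wall, d_ptimer = w1 - w0, p1 - p0
        # A timer reset across GC6 would show up as a non-positive delta.
        if d_wall <= 0 or d_ptimer <= 0:
            return False
        # The slope should stay close to 1 ns of PTIMER per ns of wall clock.
        if abs(d_ptimer / d_wall - 1.0) > tolerance:
            return False
    return True
```

A run of samples that passes this check across several GC6 cycles rules out a PTIMER reset as the cause of the timeout.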

Impact

This is not just log noise. The stuck heartbeat has two real consequences:

  1. Fatal GPU loss (observed April 8): After 4 successful GC6 cycles with heartbeat=0, the 5th GC6 exit failed. The BIOS ACPI GC6 exit method (\_SB.PC00.RP12.PXSX.NC6O) timed out with AE_AML_LOOP_TIMEOUT. The GPU fell off the PCI bus entirely (Xid 79), FSP failed to boot (Xid 143), and a GPU reset was required but impossible (Xid 154). All registers returned 0xffffffff. Recovery required a full reboot.

  2. Display recovery blocked (observed April 9): When a dock DP-MST hub drops a video stream, the standard DPMS off/on workaround (which forces the driver to cycle the display pipeline) produces zero modeset activity when GSP heartbeat is stuck. The driver silently ignores the DPMS state change. Only suspend/resume (which resets GSP state) or physical dock reconnect can recover.
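If you want to check your own kernel log for the same Xid sequence, a small scanner along these lines may help. It is a sketch: the regular expression assumes the usual "NVRM: Xid (PCI:...): <code>, ..." line shape, the helper name is made up, and the descriptions are only the ones given in this report.

```python
import re

# Descriptions as stated in this report, not an official Xid table.
XIDS_OF_INTEREST = {
    79: "GPU fell off the PCI bus",
    143: "FSP failed to boot",
    154: "GPU reset required",
}

# Assumed line shape: "NVRM: Xid (PCI:0000:01:00): 79, pid=..., ..."
XID_RE = re.compile(r"NVRM: Xid \(([^)]*)\): (\d+)")


def extract_xids(log_lines):
    """Return the Xid codes found in the given kernel log lines, in order."""
    codes = []
    for line in log_lines:
        match = XID_RE.search(line)
        if match:
            codes.append(int(match.group(2)))
    return codes


def fatal_sequence_seen(log_lines):
    """True if every Xid listed in XIDS_OF_INTEREST appears in the log."""
    return set(XIDS_OF_INTEREST) <= set(extract_xids(log_lines))
```

Feed it the lines from journalctl -k or dmesg. Seeing all three codes together matches the failure in item 1, though each Xid can also have unrelated causes.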

Confirmed on multiple GPUs

Four other users on the GitHub issue report the same heartbeat symptoms:

  • RTX 3080 Mobile (Razer Blade 14)
  • RTX 4050 Mobile (Acer Nitro V15)
  • RTX 4060 Mobile (ASUS TUF F17)
  • RTX 3060 Mobile (ASUS TUF A15)

All are on driver 595.45.04 or 595.58.03, and all run in hybrid graphics mode.

Additional observation

During GC6 exit bootstrap, GSP-RM sends RPC event PFM_REQ_HNDLR_STATE_SYNC_CALLBACK (0x101a) while the host is in POLL_BOOTUP context. This event isn’t in the allowlist at kernel_gsp.c:1435, so it hits NV_ASSERT(0) and is dropped without being handled. This is possibly a separate issue, but it happens on the same code path.
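The allowlist behaviour can be illustrated with a small model. Only the 0x101a event ID comes from this report. The two allowlisted IDs below are placeholders invented for the example and do not correspond to real driver events, and the function is not the driver's dispatch code.

```python
# Event ID taken from the report.
PFM_REQ_HNDLR_STATE_SYNC_CALLBACK = 0x101A

# Placeholder IDs, made up for illustration only.
HYPOTHETICAL_BOOTUP_ALLOWLIST = frozenset({0x1001, 0x1002})


def filter_bootup_events(events, allowlist=HYPOTHETICAL_BOOTUP_ALLOWLIST):
    """Split events received during bootup polling into handled and dropped."""
    handled, dropped = [], []
    for event in events:
        if event in allowlist:
            handled.append(event)
        else:
            # In the driver this is where the assert would trigger.
            dropped.append(event)
    return handled, dropped


handled, dropped = filter_bootup_events(
    [0x1001, PFM_REQ_HNDLR_STATE_SYNC_CALLBACK]
)
```

If the state sync event carries information the host needs after GC6 exit, dropping it here would lose it, which is why it may be worth tracking separately.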

Mitigation

Setting the module parameter NVreg_DynamicPowerManagement=0x00 disables all GPU runtime PM (no GC6 transitions), which avoids triggering the bug at the cost of higher idle power.
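The parameter is normally set with a modprobe options line such as "options nvidia NVreg_DynamicPowerManagement=0x00". The helper below is a sketch for sanity-checking such a line. The function names are made up, and it assumes unquoted numeric values (a non-numeric value raises ValueError).

```python
def parse_nvreg_options(line):
    """Parse 'options nvidia NVreg_X=val ...' into a {name: int} dict.

    Returns an empty dict for lines that are not nvidia options lines.
    """
    parts = line.split()
    if len(parts) < 2 or parts[0] != "options" or parts[1] != "nvidia":
        return {}
    options = {}
    for token in parts[2:]:
        key, sep, value = token.partition("=")
        if sep and key.startswith("NVreg_"):
            # Base 0 accepts both decimal ("0") and hex ("0x00") forms.
            options[key] = int(value, 0)
    return options


def runtime_pm_disabled(line):
    """True if the line sets NVreg_DynamicPowerManagement to 0."""
    return parse_nvreg_options(line).get("NVreg_DynamicPowerManagement") == 0


mitigated = runtime_pm_disabled(
    "options nvidia NVreg_DynamicPowerManagement=0x00"
)
```

After changing the option, the module has to be reloaded (or the machine rebooted, possibly after regenerating the initramfs, depending on the distribution) before it takes effect.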

Attachments

Multiple nvidia-bug-report.log.gz files are attached to the GitHub issue, captured during live failures.