RTX 4060 Laptop GPU stuck powered on and unusable after a GSP error

I’m reporting this on the 575.51.02 BETA closed drivers, but I’ve had the same problem with the stable and open drivers as well.

The issue manifests as the GPU becoming completely unresponsive to applications (e.g. vkcube hangs for a few seconds before finally falling back to the iGPU) and remaining stuck in a powered-on state until reboot. It appears to be linked specifically to power management and doesn’t happen if the GPU is simply kept powered on for a long time, e.g. during gaming.
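For anyone wanting to confirm the same state on their machine, a minimal way to check is to read the runtime PM status from sysfs and the driver’s power file in procfs (just a sketch, assuming the dGPU sits at 0000:01:00.0 as in the log below):

# Kernel's runtime PM view of the dGPU; here it stays "active" even with no clients running
cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
# Whether runtime PM is allowed at all ("auto") or forced on ("on")
cat /sys/bus/pci/devices/0000:01:00.0/power/control
# The NVIDIA driver's own view of runtime D3 / video memory state
cat /proc/driver/nvidia/gpus/0000:01:00.0/power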

Bug report in a failure state: nvidia-bug-report.log.gz (1.5 MB)

For the record, here’s the primary error in dmesg:

[11283.704484] NVRM: GPU at PCI:0000:01:00: GPU-557b610e-2bc1-f6f2-15c8-b57ca6fbce38
[11283.704488] NVRM: Xid (PCI:0000:01:00): 119, pid=226, name=kworker/15:1, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080205b 0x4).
[11283.704508] NVRM: GPU0 GSP RPC buffer contains function 76 (GSP_RM_CONTROL) and data 0x000000002080205b 0x0000000000000004.
[11283.704510] NVRM: GPU0 RPC history (CPU -> GSP):
[11283.704512] NVRM:     entry function                   data0              data1              ts_start           ts_end             duration actively_polling
[11283.704513] NVRM:      0    76   GSP_RM_CONTROL        0x000000002080205b 0x0000000000000004 0x0006345db66ef4a4 0x0000000000000000          y
[11283.704516] NVRM:     -1    47   UNLOADING_GUEST_DRIVE 0x0000000000000000 0x0000000000000000 0x0006345d71097548 0x0006345d710c794d 197637us  
[11283.704519] NVRM:     -2    10   FREE                  0x00000000c1e00309 0x0000000000000000 0x0006345d71097320 0x0006345d7109750a    490us  
[11283.704521] NVRM:     -3    10   FREE                  0x000000000000000b 0x0000000000000000 0x0006345d710970d1 0x0006345d7109731e    589us  
[11283.704523] NVRM:     -4    10   FREE                  0x000000000000000c 0x0000000000000000 0x0006345d71096edc 0x0006345d7109703b    351us  
[11283.704525] NVRM:     -5    10   FREE                  0x0000000000000006 0x0000000000000000 0x0006345d71096d19 0x0006345d71096ed3    442us  
[11283.704527] NVRM:     -6    10   FREE                  0x000000000000000a 0x0000000000000000 0x0006345d71096815 0x0006345d71096d11   1276us  
[11283.704529] NVRM:     -7    10   FREE                  0x0000000000000002 0x0000000000000000 0x0006345d71095995 0x0006345d71096677   3298us  
[11283.704530] NVRM: GPU0 RPC event history (CPU <- GSP):
[11283.704532] NVRM:     entry function                   data0              data1              ts_start           ts_end             duration during_incomplete_rpc
[11283.704533] NVRM:      0    4108 UCODE_LIBOS_PRINT     0x0000000000000000 0x0000000000000000 0x0006345d7109f3e3 0x0006345d7109f3e4      1us  
[11283.704536] NVRM:     -1    4128 GSP_POST_NOCAT_RECORD 0x0000000000000002 0x0000000000000028 0x0006345d7109bf34 0x0006345d7109bf36      2us  
[11283.704538] NVRM:     -2    4111 PERF_BRIDGELESS_INFO_ 0x0000000000000000 0x0000000000000000 0x0006345d7109bdd6 0x0006345d7109bdd6           
[11283.704540] NVRM:     -3    4099 POST_EVENT            0x0000000000000021 0x0000000000000100 0x0006345d707668f7 0x0006345d7076690b     20us  
[11283.704542] NVRM:     -4    4099 POST_EVENT            0x0000000000000021 0x0000000000000020 0x0006345d706e796b 0x0006345d706e7982     23us  
[11283.704544] NVRM:     -5    4099 POST_EVENT            0x0000000000000021 0x0000000000000001 0x0006345d70108aee 0x0006345d70108afc     14us  
[11283.704546] NVRM:     -6    4099 POST_EVENT            0x0000000000000021 0x0000000000000008 0x0006345d70023cd2 0x0006345d70023cf4     34us  
[11283.704548] NVRM:     -7    4099 POST_EVENT            0x0000000000000021 0x0000000000000001 0x0006345d6fc89aa1 0x0006345d6fc89aad     12us  
[11283.704551] CPU: 15 UID: 0 PID: 226 Comm: kworker/15:1 Tainted: P           OE      6.14.5-2-cachyos #1 5b3816ec247e07a05355dcf4d86a93cbe78a5deb
[11283.704555] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[11283.704556] Hardware name: ASUSTeK COMPUTER INC. ASUS TUF Gaming A15 FA507NV_FA507NV/FA507NV, BIOS FA507NV.316 11/04/2024
[11283.704557] Workqueue: kacpi_notify acpi_os_execute_deferred
[11283.704561] Sched_ext: lavd (enabled+all), task: runnable_at=-1ms
[11283.704562] Call Trace:
[11283.704564]  <TASK>
[11283.704567]  dump_stack_lvl+0x71/0x90
[11283.704570]  _nv013767rm+0x5dd/0x720 [nvidia 10bb795edfb6216e2cc9508fb909d6a143ed626b]
[11283.704794]  _nv013677rm+0xe2/0x880 [nvidia 10bb795edfb6216e2cc9508fb909d6a143ed626b]
[11283.704969]  _nv053503rm+0x594/0x770 [nvidia 10bb795edfb6216e2cc9508fb909d6a143ed626b]
[11283.705142]  _nv057098rm+0x9e/0x150 [nvidia 10bb795edfb6216e2cc9508fb909d6a143ed626b]
[11283.705338]  _nv052720rm+0x1a9/0x1b0 [nvidia 10bb795edfb6216e2cc9508fb909d6a143ed626b]
[11283.705512]  _nv054901rm+0x3f5/0x500 [nvidia 10bb795edfb6216e2cc9508fb909d6a143ed626b]
[11283.705693]  _nv015682rm+0x469/0x680 [nvidia 10bb795edfb6216e2cc9508fb909d6a143ed626b]
[11283.705878]  _nv052860rm+0x29/0x30 [nvidia 10bb795edfb6216e2cc9508fb909d6a143ed626b]
[11283.706069]  ? _nv054904rm+0x60/0x60 [nvidia 10bb795edfb6216e2cc9508fb909d6a143ed626b]
[11283.706251]  _nv000809rm+0x58/0x70 [nvidia 10bb795edfb6216e2cc9508fb909d6a143ed626b]
[11283.706434]  _nv000808rm+0x21b/0x220 [nvidia 10bb795edfb6216e2cc9508fb909d6a143ed626b]
[11283.706637]  _nv000760rm+0x1c0/0x320 [nvidia 10bb795edfb6216e2cc9508fb909d6a143ed626b]
[11283.706837]  rm_transition_dynamic_power+0xd7/0x13f [nvidia 10bb795edfb6216e2cc9508fb909d6a143ed626b]
[11283.707031]  nv_pmops_runtime_resume+0x76/0xf0 [nvidia 10bb795edfb6216e2cc9508fb909d6a143ed626b]
[11283.707213]  ? __pfx_pci_pm_runtime_resume.llvm.5339016576846103269+0x10/0x10
[11283.707216]  __rpm_callback+0x93/0x350
[11283.707220]  ? __pfx_pci_pm_runtime_resume.llvm.5339016576846103269+0x10/0x10
[11283.707222]  rpm_resume+0x4e4/0x860
[11283.707225]  __pm_runtime_resume+0x5c/0x80
[11283.707227]  pci_device_shutdown.llvm.5339016576846103269+0x23/0x70
[11283.707230]  nv_indicate_not_idle+0x2f/0x40 [nvidia 10bb795edfb6216e2cc9508fb909d6a143ed626b]
[11283.707412]  _nv048601rm+0xf4/0x240 [nvidia 10bb795edfb6216e2cc9508fb909d6a143ed626b]
[11283.707593]  rm_power_source_change_event+0xc0/0x184 [nvidia 10bb795edfb6216e2cc9508fb909d6a143ed626b]
[11283.707778]  nv_acpi_powersource_hotplug_event+0x63/0x90 [nvidia 10bb795edfb6216e2cc9508fb909d6a143ed626b]
[11283.707959]  acpi_ev_notify_dispatch+0x56/0x70
[11283.707962]  acpi_os_execute_deferred+0x1c/0x30
[11283.707965]  process_scheduled_works+0x250/0x590
[11283.707968]  worker_thread+0xf8/0x2c0
[11283.707970]  ? __pfx_worker_thread+0x10/0x10
[11283.707972]  kthread+0x26d/0x290
[11283.707975]  ? __pfx_kthread+0x10/0x10
[11283.707977]  ret_from_fork.cold+0xc/0x19
[11283.707979]  ? __pfx_kthread+0x10/0x10
[11283.707980]  ret_from_fork_asm+0x1a/0x30
[11283.707985]  </TASK>
[11289.708467] NVRM: Xid (PCI:0000:01:00): 119, pid=226, name=kworker/15:1, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a81 0x4).
[11295.709473] NVRM: Xid (PCI:0000:01:00): 119, pid=226, name=kworker/15:1, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080205b 0x4).
[11301.713432] NVRM: Rate limiting GSP RPC error prints for GPU at PCI:0000:01:00 (printing 1 of every 30).  The GPU likely needs to be reset.
[11337.723272] NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
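Since the last Xid marks the GPU as needing a reset, one thing that might be worth trying before a full reboot is removing and rescanning the device on the PCI bus (only a sketch; it assumes nothing still holds the device open, and it may well fail the same way):

# Remove the wedged dGPU from the PCI bus, then rescan to rediscover and re-probe it
echo 1 | sudo tee /sys/bus/pci/devices/0000:01:00.0/remove
echo 1 | sudo tee /sys/bus/pci/rescan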

Curiously, it seems I can’t trigger this any faster with a synthetic test that repeatedly powers the GPU on and off, like

while true; do vkcube --c 10; sleep 25; done

I guess it’s something more complex than a race condition during GPU shutdown.
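In case it helps anyone else reproduce this, here is a variant of the loop above that waits for the kernel to actually report the device as runtime-suspended instead of sleeping a fixed 25 seconds (a sketch, assuming the dGPU is at 0000:01:00.0 and its power/control is set to "auto"):

GPU=/sys/bus/pci/devices/0000:01:00.0
while true; do
    vkcube --c 10
    # wait until the dGPU has actually runtime-suspended before waking it again
    until [ "$(cat $GPU/power/runtime_status)" = "suspended" ]; do sleep 1; done
    sleep 5
done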


I’m having the same issue with my laptop (RTX 4090) and the 575.51.02 drivers. Originally, it seemed to just prevent my laptop from being able to sleep. Right now it’s hung at login with the Xorg process using up 100% CPU.

So far it doesn’t seem to happen on my desktop, but I’ll verify that later.

Downgrading to 570.144 fixes the issue and is an acceptable workaround for this particular machine for now.

Confirmed: no issues on my desktop with an RTX 4090 on NVIDIA driver 575.51.02.

Maybe this is Optimus-related? My laptop has Advanced Optimus, but Linux isn’t able to make use of it; I’m mentioning it in case that’s at all relevant to debugging this issue.
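For what it’s worth, a quick way to see whether PRIME render offload and runtime D3 are actually in play (a sketch; glxinfo comes from mesa-utils/glx-utils, and the dGPU’s PCI address will differ between machines):

# Default renderer (should normally be the iGPU on an Optimus laptop)
glxinfo -B | grep "OpenGL renderer"
# Same query, forced onto the NVIDIA GPU via PRIME render offload
__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo -B | grep "OpenGL renderer"
# Driver's runtime D3 (RTD3) status for the dGPU
cat /proc/driver/nvidia/gpus/0000:01:00.0/power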

I’m pretty confident this is related in some way to the process of powering the GPU on/off; however, doing just that is apparently insufficient to trigger the issue. Curious that you’ve managed to hit effectively the same bug but with GSP disabled as well.
