Periodic stutters and "NVRM: RmCheckForGcxSupportOnCurrentState" kernel warnings on Ubuntu 22.04 RTX 4070

I’ve been seeing occasional visible “stutters” on my Ubuntu 22.04 system. By stutter, I mean that the screen does not update at all for ~500 ms: a playing video freezes, and the mouse cursor stops moving.

At the same time, I get a ton of kernel messages such as:

Apr 24 14:59:16 banana kernel: NVRM: RmCheckForGcxSupportOnCurrentState: NVRM, Failed to get GCx pre-requisite, status=0xffff
Apr 24 14:59:21 banana kernel: NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 3d63e51c037800 >= 3d63e51c037800
Apr 24 14:59:21 banana kernel: NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
Apr 24 14:59:21 banana kernel: NVRM: RmCheckForGcxSupportOnCurrentState: NVRM, Failed to get GCx pre-requisite, status=0xffff
Apr 24 14:59:27 banana kernel: NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 3d63e663d6cf00 >= 3d63e663d6cf00
Apr 24 14:59:27 banana kernel: NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
Apr 24 14:59:27 banana kernel: NVRM: RmCheckForGcxSupportOnCurrentState: NVRM, Failed to get GCx pre-requisite, status=0xffff
Apr 24 14:59:32 banana kernel: NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 3d63e7abaa2600 >= 3d63e7abaa2600
Apr 24 14:59:32 banana kernel: NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
Apr 24 14:59:32 banana kernel: NVRM: RmCheckForGcxSupportOnCurrentState: NVRM, Failed to get GCx pre-requisite, status=0xffff
Apr 24 14:59:38 banana kernel: NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 3d63e8f37d7d00 >= 3d63e8f37d7d00
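
To check how regularly the GCx warning recurs, the syslog timestamps can be diffed. Below is a minimal Python sketch, assuming the log lines keep the syslog prefix shown above. The function name `gcx_intervals` and the default year are my own assumptions (the syslog prefix carries no year), not part of any NVIDIA tool.

```python
from datetime import datetime

GCX_MARKER = "RmCheckForGcxSupportOnCurrentState"

def gcx_intervals(lines, year=2024):
    """Return the seconds between consecutive GCx warnings in syslog-style lines."""
    stamps = []
    for line in lines:
        if GCX_MARKER not in line:
            continue
        # The syslog prefix is "Mon DD HH:MM:SS" with no year, so one is assumed.
        prefix = " ".join(line.split()[:3])
        stamps.append(datetime.strptime(f"{year} {prefix}", "%Y %b %d %H:%M:%S"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

# Three GCx warnings and one unrelated line, copied from the log above.
sample = [
    "Apr 24 14:59:16 banana kernel: NVRM: RmCheckForGcxSupportOnCurrentState: NVRM, Failed to get GCx pre-requisite, status=0xffff",
    "Apr 24 14:59:21 banana kernel: NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!",
    "Apr 24 14:59:21 banana kernel: NVRM: RmCheckForGcxSupportOnCurrentState: NVRM, Failed to get GCx pre-requisite, status=0xffff",
    "Apr 24 14:59:27 banana kernel: NVRM: RmCheckForGcxSupportOnCurrentState: NVRM, Failed to get GCx pre-requisite, status=0xffff",
]

print(gcx_intervals(sample))
```

On the excerpt above the warnings land 5 to 6 seconds apart, which suggests a periodic retry rather than random noise.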

I also found this while going through the logs:

Apr 24 13:23:29 banana kernel: NVRM: _kgspLogXid119: ********************************* GSP Timeout **********************************
Apr 24 13:23:29 banana kernel: NVRM: _kgspLogXid119: Note: Please also check logs above.
Apr 24 13:23:29 banana kernel: NVRM: nvAssertFailedNoLog: Assertion failed: expectedFunc == pHistoryEntry->function @ kernel_gsp.c:1744
Apr 24 13:23:29 banana kernel: NVRM: GPU at PCI:0000:01:00: GPU-02187fd8-22a1-3f71-cd52-22af54f42481
Apr 24 13:23:29 banana kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=1671935, name=kworker/1:3, Timeout after 1149s of waiting for RPC response from GPU0 GSP! Expected function 4097 (GSP_INIT_DONE) (0x0 0x0).
Apr 24 13:23:29 banana kernel: NVRM: GPU0 GSP RPC buffer contains function 4108 (UCODE_LIBOS_PRINT) and data 0x0000000000000000 0x0000000000000000.
Apr 24 13:23:29 banana kernel: NVRM: GPU0 RPC history (CPU -> GSP):
Apr 24 13:23:29 banana kernel: NVRM:     entry function                   data0              data1              ts_start           ts_end             duration actively_polling
Apr 24 13:23:29 banana kernel: NVRM:      0    47   UNLOADING_GUEST_DRIVE 0x0000000000000000 0x0000000000000000 0x000616daa95805e8 0x000616daa95d9f17    366 s y
Apr 24 13:23:29 banana kernel: NVRM:     -1    10   FREE                  0x00000000caf010bb 0x0000000000000000 0x000616daa958044a 0x000616daa95805e5    411us  
Apr 24 13:23:29 banana kernel: NVRM:     -2    76   GSP_RM_CONTROL        0x0000000020800ac3 0x0000000000000028 0x000616daa9580260 0x000616daa9580447    487us  
Apr 24 13:23:29 banana kernel: NVRM:     -3    4    ALLOC_MEMORY          0x0000000000000000 0x0000000000000000 0x000616daa957ff81 0x000616daa958025d    732us  
Apr 24 13:23:29 banana kernel: NVRM:     -4    10   FREE                  0x00000000caf010ba 0x0000000000000000 0x000616daa957fd61 0x000616daa957ff79    536us  
Apr 24 13:23:29 banana kernel: NVRM:     -5    76   GSP_RM_CONTROL        0x0000000020800ac3 0x0000000000000028 0x000616daa957fb78 0x000616daa957fd5f    487us  
Apr 24 13:23:29 banana kernel: NVRM:     -6    4    ALLOC_MEMORY          0x0000000000000000 0x0000000000000000 0x000616daa957f982 0x000616daa957fb75    499us  
Apr 24 13:23:29 banana kernel: NVRM:     -7    10   FREE                  0x00000000caf010b9 0x0000000000000000 0x000616daa957f7d9 0x000616daa957f97b    418us  
Apr 24 13:23:29 banana kernel: NVRM: GPU0 RPC event history (CPU <- GSP):
Apr 24 13:23:29 banana kernel: NVRM:     entry function                   data0              data1              ts_start           ts_end             duration during_incomplete_rpc
Apr 24 13:23:29 banana kernel: NVRM:      0    4108 UCODE_LIBOS_PRINT     0x0000000000000000 0x0000000000000000 0x000616daed814621 0x000616daed814622      1us  
Apr 24 13:23:29 banana kernel: NVRM:     -1    4108 UCODE_LIBOS_PRINT     0x0000000000000000 0x0000000000000000 0x000616daed8144ef 0x000616daed8144f0      1us  
Apr 24 13:23:29 banana kernel: NVRM:     -2    4128 GSP_POST_NOCAT_RECORD 0x0000000000000002 0x0000000000000027 0x000616daed812e49 0x000616daed812e4b      2us  
Apr 24 13:23:29 banana kernel: NVRM:     -3    4098 GSP_RUN_CPU_SEQUENCER 0x0000000000000628 0x0000000000003fe2 0x000616daed808c11 0x000616daed809d6e   4445us  
Apr 24 13:23:29 banana kernel: NVRM:     -4    4108 UCODE_LIBOS_PRINT     0x0000000000000000 0x0000000000000000 0x000616daa958d7c0 0x000616daa958d7c1      1us  
Apr 24 13:23:29 banana kernel: NVRM:     -5    4128 GSP_POST_NOCAT_RECORD 0x0000000000000002 0x0000000000000028 0x000616daa9585c1e 0x000616daa9585c20      2us  
Apr 24 13:23:29 banana kernel: NVRM:     -6    4111 PERF_BRIDGELESS_INFO_ 0x0000000000000000 0x0000000000000000 0x000616daa9585a33 0x000616daa9585a33           
Apr 24 13:23:29 banana kernel: NVRM:     -7    4128 GSP_POST_NOCAT_RECORD 0x0000000000000002 0x0000000000000001 0x000616daa853253d 0x000616daa8532544      7us  
Apr 24 13:23:29 banana kernel: NVRM: _kgspLogXid119: ********************************************************************************
Apr 24 13:23:29 banana kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from rpcRecvPoll(pGpu, pRpc, NV_VGPU_MSG_EVENT_GSP_INIT_DONE) @ kernel_gsp.c:4074
Apr 24 13:23:29 banana kernel: NVRM: gpuPowerManagementResume: State load at resume for riscv/gsp failed: 0x65
Apr 24 13:23:35 banana kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=1671935, name=kworker/1:3, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080205b 0x4).
Apr 24 13:23:35 banana kernel: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76!
Apr 24 13:23:35 banana kernel: NVRM: subdeviceCtrlCmdPerfSetPowerstate_KERNEL: NV2080_CTRL_CMD_PERF_SET_POWERSTATE RPC failed
Apr 24 13:23:46 banana kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=854, name=nv_queue, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a7d7 0x2).
Apr 24 13:23:46 banana kernel: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76!
Apr 24 13:23:46 banana kernel: NVRM: RmCheckForGcxSupportOnCurrentState: NVRM, Failed to get GCx pre-requisite, status=0x65
Apr 24 13:23:57 banana kernel: NVRM: Rate limiting GSP RPC error prints for GPU at PCI:0000:01:00 (printing 1 of every 30).  The GPU likely needs to be reset.
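
Since the Xid lines are the part most worth quoting in a bug report, here is a small Python sketch that pulls them out of a log. It assumes the `NVRM: Xid (PCI:...): <number>, pid=..., name=..., <message>` layout seen in the lines above; the regular expression and the `parse_xid` helper are illustrative and may not cover the formats of other driver versions.

```python
import re

# Matches the Xid layout seen in this log; other driver versions may differ.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>\d+), name=(?P<name>[^,]+), (?P<message>.*)"
)

def parse_xid(line):
    """Return a dict describing the Xid event in `line`, or None if there is none."""
    match = XID_RE.search(line)
    if match is None:
        return None
    event = match.groupdict()
    event["xid"] = int(event["xid"])
    event["pid"] = int(event["pid"])
    return event

# One Xid line copied from the log above.
sample = (
    "Apr 24 13:23:35 banana kernel: NVRM: Xid (PCI:0000:01:00): 119, "
    "pid=1671935, name=kworker/1:3, Timeout after 6s of waiting for RPC "
    "response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080205b 0x4)."
)

event = parse_xid(sample)
print(event["xid"], event["pci"], event["name"])
```

Running it over the full journal and grouping by `xid` and `pci` gives a quick summary of which GPU is timing out and how often.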

I sent nvidia-bug-report.log.gz to linux-bugs@nvidia.com.

Please switch to the closed driver and check if this also occurs with it.

Will do. By the way, the instructions in the CUDA Installation Guide for Linux have the package names mixed up: apt-get remove --purge nvidia-kernel-open-550 should be apt-get remove --purge nvidia-kernel-550-open.

We have observed a similar stack trace with the closed driver:
[ 8711.495322] ------------[ cut here ]------------
[ 8711.501692] WARNING: CPU: 68 PID: 591334 at /tmp/selfgz500774/NVIDIA-Linux-x86_64-550.54.15/kernel/nvidia/nv.c:4576 nv_suspend_devices+0x14f/0x180 [nvidia]
[ 8711.519032] Modules linked in: nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) xt_conntrack vxlan ip6_udp_tunnel udp_tunnel nfnetlink_cttimeout act_gact cls_flower sch_ingress loop nf_tables iptable_mangle iptable_raw nf_conntrack_netlink ip_set_bitmap_port ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipportnet dummy xt_comment iptable_nat sch_htb ip_set nfnetlink iptable_filter ip_tables openvswitch nf_conncount nf_nat tcm_loop target_core_pscsi target_core_file target_core_iblock overlay kvm_24451e9(OE) redpoll(OE) target_core_user target_core_mod uio ipmi_ssif bonding(OE) isofs cdrom ib_ipoib(OE) vfio_pci vfio_virqfd vfio_iommu_type1 vfio irqbypass kpatch_11752651(OK) avirt(OE) intel_rapl_msr intel_rapl_common intel_pmt_telemetry intel_pmt_crashlog iTCO_wdt intel_pmt_class iTCO_vendor_support i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl intel_uncore pcspkr joydev isst_if_mmio isst_if_mbox_pci
[ 8711.519071] ses isst_if_common idxd idxd_bus intel_pmt enclosure mousedev mei_me i2c_i801 i2c_smbus mei i2c_ismt acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad nfsd auth_rpcgss ip_vs_sh nfs_acl lockd ip_vs_wrr ip_vs_rr grace ip_vs dm_mod nf_conntrack sunrpc nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_cm(OE) mlx5_ib(OE) ast sd_mod sg crc32c_intel drm_vram_helper nvme mpt3sas drm_ttm_helper i2c_algo_bit raid_class scsi_transport_sas nvme_core drm_kms_helper t10_pi syscopyarea sysfillrect sysimgblt ttm fb_sys_fops ahci libahci drm libata i2c_core wmi virtio_net net_failover failover mlx5_core(OE) mlxfw(OE) tls pci_hyperv_intf auxiliary(OE) psample mlxdevm(OE) ib_uverbs(OE) ib_umad(OE) ib_core(OE) ib_ucm(OE) mlx_compat(OE) [last unloaded: ecc]
[ 8711.709820] CPU: 68 PID: 591334 Comm: bash Kdump: loaded Tainted: P S OE K 5.10.134-13.al8.x86_64 #1
[ 8711.722025] Hardware name: Not Filled Not Filled/ArcherCityM, BIOS 05.11.07 03/28/2024
[ 8711.732114] RIP: 0010:nv_suspend_devices+0x14f/0x180 [nvidia]
[ 8711.739561] Code: 5d ff ff ff 48 8b 9b 80 06 00 00 48 85 db 74 9c 48 8b bb d0 02 00 00 ba 01 00 00 00 89 ee e8 f8 fc ff ff 41 89 c4 85 c0 74 da <0f> 0b 48 c7 c7 00 c5 dd c4 41 bd 01 00 00 00 e8 ed 95 45 ec e9 06
[ 8711.762526] RSP: 0018:ffffc0ef73563e50 EFLAGS: 00010206
[ 8711.769387] RAX: 0000000000000056 RBX: ffffa038e8d8f000 RCX: 0000000080020001
[ 8711.778397] RDX: ffffa038e8d8f660 RSI: 0000000000000286 RDI: ffffa038e8d8f658
[ 8711.787382] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffffc1cf4001
[ 8711.796379] R10: ffff9f399b47b000 R11: 0000000000000001 R12: 0000000000000056
[ 8711.805363] R13: 0000000000000000 R14: ffffc0ef73563f18 R15: 0000000000000000
[ 8711.814337] FS: 00007f483ff22740(0000) GS:ffffa133feb00000(0000) knlGS:0000000000000000
[ 8711.824375] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8711.831787] CR2: 000000c00017f000 CR3: 0000010772802006 CR4: 0000000002772ee0
[ 8711.840765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8711.849711] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 8711.858668] PKRU: 55555554
[ 8711.862643] Call Trace:
[ 8711.866469] nv_set_system_power_state+0x84/0x180 [nvidia]
[ 8711.873708] nv_procfs_write_suspend+0xd4/0x140 [nvidia]
[ 8711.880578] proc_reg_write+0x4e/0x90
[ 8711.885602] vfs_write+0xc2/0x260
[ 8711.890222] ksys_write+0x4f/0xd0
[ 8711.894784] do_syscall_64+0x30/0x40
[ 8711.899645] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 8711.906207] RIP: 0033:0x7f4840058be7
[ 8711.911093] Code: 0a 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 8711.933813] RSP: 002b:00007ffd6b195718 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 8711.943182] RAX: ffffffffffffffda RBX: 000000000000000a RCX: 00007f4840058be7
[ 8711.952039] RDX: 000000000000000a RSI: 0000563b49068c20 RDI: 0000000000000001
[ 8711.960904] RBP: 0000563b49068c20 R08: 000000000000000a R09: 00007f48400be0c0
[ 8711.969764] R10: 00007f48400bdfc0 R11: 0000000000000246 R12: 000000000000000a
[ 8711.978591] R13: 00007f48400fb520 R14: 000000000000000a R15: 00007f48400fb720
[ 8711.987412] ---[ end trace 2752b88ecdc60f81 ]---
[ 8717.993682] NVRM: GPU at PCI:0000:0f:00: GPU-28dae5f7-705c-216f-e4ca-b0c4e4012e16
[ 8718.002945] NVRM: Xid (PCI:0000:0f:00): 119, pid=591334, name=bash, Timeout after 6s of waiting for RPC response from GPU4 GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801117 0x1).
[ 8718.022802] NVRM: GPU4 GSP RPC buffer contains function 76 (GSP_RM_CONTROL) and data 0x0000000020801117 0x0000000000000001.
[ 8718.036098] NVRM: GPU4 RPC history (CPU -> GSP):
[ 8718.042047] NVRM: entry function data0 data1 ts_start ts_end duration actively_polling
[ 8718.059203] NVRM: 0 76 GSP_RM_CONTROL 0x0000000020801117 0x0000000000000001 0x000618504a604c10 0x0000000000000000 y
[ 8718.074888] NVRM: -1 47 UNLOADING_GUEST_DRIVE 0x0000000000000000 0x0000000000000000 0x000618504a5752d3 0x000618504a58b06d 89 s
[ 8718.090530] NVRM: -2 10 FREE 0x00000000caf00003 0x0000000000000000 0x000618504a5740a5 0x000618504a57414b 166us
[ 8718.106149] NVRM: -3 76 GSP_RM_CONTROL 0x0000000020800ac2 0x0000000000000020 0x000618504a573ff2 0x000618504a5740a3 177us
[ 8718.121801] NVRM: -4 4 ALLOC_MEMORY 0x0000000000000000 0x0000000000000000 0x000618504a573e80 0x000618504a573fef 367us
[ 8718.137448] NVRM: -5 76 GSP_RM_CONTROL 0x0000000020802a06 0x0000000000000004 0x000618504a570f4e 0x000618504a5711d2 644us
[ 8718.153053] NVRM: -6 76 GSP_RM_CONTROL 0x0000000020802a05 0x0000000000000098 0x000618504a570b7c 0x000618504a570f4c 976us
[ 8718.168907] NVRM: -7 76 GSP_RM_CONTROL 0x0000000020802a09 0x0000000000000084 0x000618504a570b0e 0x000618504a570b75 103us
[ 8718.184737] NVRM: GPU4 RPC event history (CPU <- GSP):
[ 8718.191347] NVRM: entry function data0 data1 ts_start ts_end duration during_incomplete_rpc
[ 8718.209106] NVRM: 0 4124 GSP_LOCKDOWN_NOTICE 0x0000000000000000 0x0000000000000000 0x000618504a58499e 0x000618504a58499e
[ 8718.224896] NVRM: -1 4124 GSP_LOCKDOWN_NOTICE 0x0000000000000001 0x0000000000000000 0x000618504a584724 0x000618504a584725 1us
[ 8718.240763] NVRM: -2 4108 UCODE_LIBOS_PRINT 0x0000000000000000 0x0000000000000000 0x000618504a5829b8 0x000618504a5829b9 1us
[ 8718.256681] NVRM: -3 4124 GSP_LOCKDOWN_NOTICE 0x0000000000000000 0x0000000000000000 0x000618504a1b3619 0x000618504a1b3619
[ 8718.272656] NVRM: -4 4124 GSP_LOCKDOWN_NOTICE 0x0000000000000001 0x0000000000000000 0x000618504a1afb1f 0x000618504a1afb1f
[ 8718.288714] NVRM: -5 4124 GSP_LOCKDOWN_NOTICE 0x0000000000000000 0x0000000000000000 0x000618504a1afad5 0x000618504a1afad5
[ 8718.304857] NVRM: -6 4124 GSP_LOCKDOWN_NOTICE 0x0000000000000001 0x0000000000000000 0x000618504a1afa58 0x000618504a1afa58
[ 8718.321073] NVRM: -7 4108 UCODE_LIBOS_PRINT 0x0000000000000000 0x0000000000000000 0x000618504a19d9e4 0x000618504a19d9e4
[ 8718.337193] CPU: 82 PID: 591334 Comm: bash Kdump: loaded Tainted: P S W OE K 5.10.134-13.al8.x86_64 #1
[ 8718.349415] Hardware name: Not Filled Not Filled/ArcherCityM, BIOS 05.11.07 03/28/2024
[ 8718.359312] Call Trace:
[ 8718.363062] dump_stack+0x57/0x6e
[ 8718.368128] _nv012445rm+0x437/0x4b0 [nvidia]
[ 8718.374273] ? _nv012367rm+0x77/0x330 [nvidia]
[ 8718.380433] ? _nv045783rm+0x4b4/0x6e0 [nvidia]
[ 8718.386603] ? _nv045366rm+0xf5/0x210 [nvidia]
[ 8718.392721] ? _nv045077rm+0xd0/0x1b0 [nvidia]
[ 8718.398804] ? _nv047026rm+0x377/0x410 [nvidia]
[ 8718.404928] ? _nv014151rm+0x3f1/0x690 [nvidia]
[ 8718.411041] ? _nv045216rm+0x29/0x30 [nvidia]
[ 8718.416958] ? rm_restart_user_channels+0x73/0xf0 [nvidia]
[ 8718.424005] ? report_bug+0x9e/0xc0
[ 8718.428912] ? nv_restore_user_channels+0x180/0x1d0 [nvidia]
[ 8718.436251] ? nv_suspend_devices+0xbe/0x180 [nvidia]
[ 8718.442881] ? nv_set_system_power_state+0x84/0x180 [nvidia]
[ 8718.450200] ? nv_procfs_write_suspend+0xd4/0x140 [nvidia]
[ 8718.457163] ? proc_reg_write+0x4e/0x90
[ 8718.462300] ? vfs_write+0xc2/0x260
[ 8718.467021] ? ksys_write+0x4f/0xd0
[ 8718.471744] ? do_syscall_64+0x30/0x40
[ 8718.476731] ? entry_SYSCALL_64_after_hwframe+0x61/0xc6

I haven’t experienced the problem since switching to the closed driver.

Then you should create a bug report on GitHub for the open driver.

Done