Hello. When using the NVIDIA open-source driver, version 560.35.03, on Ubuntu 18.04, an RTX 3060 GPU dropped off the bus (Xid 79). The system log from the failure is as follows:
May 29 16:38:33 root-PC kernel: [ 8516.915898] NVRM: GPU at PCI:0000:01:00: GPU-154c347a-1b02-a5c4-a983-321c822c643f
May 29 16:38:33 root-PC kernel: [ 8516.915901] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
May 29 16:38:33 root-PC kernel: [ 8516.915912] NVRM: GPU 0000:01:00.0: ...GPU has fallen off the bus. and now pmc_boot_0 = 0xffffffff
May 29 16:38:33 root-PC kernel: [ 8516.916036] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
May 29 16:38:33 root-PC kernel: [ 8516.916038] NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
May 29 16:38:33 root-PC kernel: [ 8516.916080] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
May 29 16:38:33 root-PC kernel: [ 8516.916081] NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
May 29 16:38:33 root-PC kernel: [ 8516.916085] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
May 29 16:38:33 root-PC kernel: [ 8516.916089] NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
May 29 16:38:33 root-PC kernel: [ 8516.916124] NVRM: RmLogGpuCrash: RmLogGpuCrash: failed to save GPU crash data
May 29 16:38:33 root-PC kernel: [ 8516.916128] NVRM: _kgspLogRpcSanityCheckFailure: GPU0 sanity check failed 0xf waiting for RPC response from GSP. Expected function 76 (GSP_RM_CONTROL) (0x2080a0d1 0x658).
May 29 16:38:33 root-PC kernel: [ 8516.916130] NVRM: GPU0 GSP RPC buffer contains function 78 (DUMP_PROTOBUF_COMPONENT) and data 0x0000000000000000 0x0000000000000000.
May 29 16:38:33 root-PC kernel: [ 8516.916131] NVRM: GPU0 RPC history (CPU -> GSP):
May 29 16:38:33 root-PC kernel: [ 8516.916132] NVRM: entry function data0 data1 ts_start ts_end duration actively_polling
May 29 16:38:33 root-PC kernel: [ 8516.916134] NVRM: 0 76 GSP_RM_CONTROL 0x000000002080a0d1 0x0000000000000658 0x00063642391890b3 0x0000000000000000 y
May 29 16:38:33 root-PC kernel: [ 8516.916136] NVRM: -1 76 GSP_RM_CONTROL 0x000000002080a0d1 0x0000000000000658 0x00063642391275da 0x0006364239128615 4155us
May 29 16:38:33 root-PC kernel: [ 8516.916137] NVRM: -2 76 GSP_RM_CONTROL 0x000000002080a097 0x0000000000000490 0x00063642390ae706 0x00063642390af1ca 2756us
May 29 16:38:33 root-PC kernel: [ 8516.916138] NVRM: -3 76 GSP_RM_CONTROL 0x000000002080a0d1 0x0000000000000658 0x00063642390578ae 0x0006364239057c0a 860us
May 29 16:38:33 root-PC kernel: [ 8516.916142] NVRM: -4 76 GSP_RM_CONTROL 0x000000002080a0d1 0x0000000000000658 0x00063642390319f3 0x0006364239031dac 953us
May 29 16:38:33 root-PC kernel: [ 8516.916144] NVRM: -5 76 GSP_RM_CONTROL 0x000000002080a097 0x0000000000000490 0x0006364238fba237 0x0006364238fbb31c 4325us
May 29 16:38:33 root-PC kernel: [ 8516.916145] NVRM: -6 76 GSP_RM_CONTROL 0x000000002080a0d1 0x0000000000000658 0x0006364238f3bb1a 0x0006364238f3c0e8 1486us
May 29 16:38:33 root-PC kernel: [ 8516.916147] NVRM: -7 76 GSP_RM_CONTROL 0x000000002080a0d1 0x0000000000000658 0x0006364238f25e54 0x0006364238f2641e 1482us
May 29 16:38:33 root-PC kernel: [ 8516.916147] NVRM: GPU0 RPC event history (CPU <- GSP):
May 29 16:38:33 root-PC kernel: [ 8516.916148] NVRM: entry function data0 data1 ts_start ts_end duration during_incomplete_rpc
May 29 16:38:33 root-PC kernel: [ 8516.916150] NVRM: 0 4108 UCODE_LIBOS_PRINT 0x0000000000000000 0x0000000000000000 0x000636403decf468 0x000636403decf468
May 29 16:38:33 root-PC kernel: [ 8516.916152] NVRM: -1 4108 UCODE_LIBOS_PRINT 0x0000000000000000 0x0000000000000000 0x000636403decf2fa 0x000636403decf2fb 1us
May 29 16:38:33 root-PC kernel: [ 8516.916153] NVRM: -2 4128 GSP_POST_NOCAT_RECORD 0x0000000000000002 0x0000000000000027 0x000636403deccbbf 0x000636403deccbc0 1us
May 29 16:38:33 root-PC kernel: [ 8516.916154] NVRM: -3 4098 GSP_RUN_CPU_SEQUENCER 0x000000000000061c 0x0000000000003fe2 0x000636403dec34bc 0x000636403dec4645 4489us
May 29 16:38:33 root-PC kernel: [ 8516.916156] NVRM: -4 4128 GSP_POST_NOCAT_RECORD 0x0000000000000005 0x00000000018fb82e 0x000636403de9f17c 0x000636403de9f17f 3us
May 29 16:38:33 root-PC kernel: [ 8516.916160] NVRM: -5 4128 GSP_POST_NOCAT_RECORD 0x0000000000000005 0x00000000016ecc1c 0x000636403de617c2 0x000636403de617c3 1us
May 29 16:38:33 root-PC kernel: [ 8516.916162] NVRM: -6 4128 GSP_POST_NOCAT_RECORD 0x0000000000000005 0x00000000016ecbee 0x000636403de613ce 0x000636403de613d1 3us
May 29 16:38:33 root-PC kernel: [ 8516.916164] CPU: 4 PID: 4465 Comm: [vkps] Update Kdump: loaded Tainted: G OE 5.3.18+ #22
May 29 16:38:33 root-PC kernel: [ 8516.916165] Hardware name: Advantech EBC-GF68/EBC-GF68, BIOS GF68000Q060X019 10/11/2024
May 29 16:38:33 root-PC kernel: [ 8516.916166] Call Trace:
May 29 16:38:33 root-PC kernel: [ 8516.916172] dump_stack+0x6d/0x95
May 29 16:38:33 root-PC kernel: [ 8516.916301] os_dump_stack+0xe/0x10 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.916434] _kgspRpcRecvPoll+0x32a/0x5f0 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.916532] _issueRpcAndWait+0x71/0x360 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.916625] rpcRmApiControl_GSP+0x757/0x9e0 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.916721] RmGssLegacyRpcCmd+0x190/0x360 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.916778] ? os_acquire_spinlock+0x12/0x30 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.916874] ? RmDeprecatedVidHeapControl+0x80/0x80 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.916970] _nv04ControlWithSecInfo+0x47/0xa0 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.917077] ? rmapiControlWithSecInfoTls+0xf0/0xf0 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.917182] ? _rmAllocForDeprecatedApi+0x30/0x30 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.917277] ? _rmControlForDeprecatedApi+0x30/0x30 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.917372] ? _rmFreeForDeprecatedApi+0x20/0x20 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.917467] ? RmCopyUserForDeprecatedApi+0xe0/0xe0 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.917561] ? _rmMapMemoryForDeprecatedApi+0x30/0x30 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.917659] ? _rmAllocMemForDeprecatedApi+0x10/0x10 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.917746] RmIoctl+0x64a/0xd60 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.917801] ? os_get_current_tick+0x2c/0x50 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.917857] ? os_acquire_spinlock+0x12/0x30 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.917947] rm_ioctl+0x66/0x4f0 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.917950] ? get_futex_key+0x2ff/0x3c0
May 29 16:38:33 root-PC kernel: [ 8516.918004] nvidia_unlocked_ioctl+0x633/0x930 [nvidia]
May 29 16:38:33 root-PC kernel: [ 8516.918007] ? __switch_to_asm+0x40/0x70
May 29 16:38:33 root-PC kernel: [ 8516.918011] ? __switch_to_asm+0x34/0x70
May 29 16:38:33 root-PC kernel: [ 8516.918013] do_vfs_ioctl+0xa9/0x640
May 29 16:38:33 root-PC kernel: [ 8516.918015] ? _copy_from_user+0x3e/0x60
May 29 16:38:33 root-PC kernel: [ 8516.918016] ksys_ioctl+0x75/0x80
May 29 16:38:33 root-PC kernel: [ 8516.918017] __x64_sys_ioctl+0x1a/0x20
May 29 16:38:33 root-PC kernel: [ 8516.918019] do_syscall_64+0x5a/0x130
May 29 16:38:33 root-PC kernel: [ 8516.918021] entry_SYSCALL_64_after_hwframe+0x44/0xa9
May 29 16:38:33 root-PC kernel: [ 8516.918022] RIP: 0033:0x7fffac6cf347
May 29 16:38:33 root-PC kernel: [ 8516.918023] Code: b3 66 90 48 8b 05 41 4b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 11 4b 2d 00 f7 d8 64 89 01 48
May 29 16:38:33 root-PC kernel: [ 8516.918024] RSP: 002b:00007fff2f414628 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
May 29 16:38:33 root-PC kernel: [ 8516.918025] RAX: ffffffffffffffda RBX: 00007fff2f4147e0 RCX: 00007fffac6cf347
May 29 16:38:33 root-PC kernel: [ 8516.918029] RDX: 00007fff2f4147e0 RSI: 00000000c020462a RDI: 0000000000000022
May 29 16:38:33 root-PC kernel: [ 8516.918030] RBP: 00000000c020462a R08: 00007fff2f4147e0 R09: 00007fff2f4147fc
May 29 16:38:33 root-PC kernel: [ 8516.918031] R10: 00007fff2f415830 R11: 0000000000000246 R12: 0000000000000022
May 29 16:38:33 root-PC kernel: [ 8516.918031] R13: 00007fff2f4147fc R14: 0000000068381d09 R15: 00007fff2f414630
May 29 16:38:34 root-PC kernel: [ 8517.047700] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_client.c:843
May 29 16:38:34 root-PC kernel: [ 8517.047707] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:257
May 29 16:38:34 root-PC kernel: [ 8517.047710] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:1287
May 29 16:38:35 root-PC kernel: [ 8518.722522] irq 16: nobody cared (try booting with the "irqpoll" option)
May 29 16:38:35 root-PC kernel: [ 8518.729869] CPU: 10 PID: 0 Comm: swapper/10 Kdump: loaded Tainted: G OE 5.3.18+ #22
May 29 16:38:35 root-PC kernel: [ 8518.729870] Hardware name: Advantech EBC-GF68/EBC-GF68, BIOS GF68000Q060X019 10/11/2024
May 29 16:38:35 root-PC kernel: [ 8518.729870] Call Trace:
May 29 16:38:35 root-PC kernel: [ 8518.729871] <IRQ>
May 29 16:38:35 root-PC kernel: [ 8518.729876] dump_stack+0x6d/0x95
May 29 16:38:35 root-PC kernel: [ 8518.729878] __report_bad_irq+0x35/0xc0
May 29 16:38:35 root-PC kernel: [ 8518.729879] note_interrupt+0x24b/0x2a0
May 29 16:38:35 root-PC kernel: [ 8518.729880] handle_irq_event_percpu+0x54/0x80
May 29 16:38:35 root-PC kernel: [ 8518.729881] handle_irq_event+0x3b/0x60
May 29 16:38:35 root-PC kernel: [ 8518.729882] handle_fasteoi_irq+0x7c/0x130
May 29 16:38:35 root-PC kernel: [ 8518.729883] handle_irq+0x20/0x30
May 29 16:38:35 root-PC kernel: [ 8518.729885] do_IRQ+0x50/0xe0
May 29 16:38:35 root-PC kernel: [ 8518.729886] common_interrupt+0xf/0xf
May 29 16:38:35 root-PC kernel: [ 8518.729887] </IRQ>
May 29 16:38:35 root-PC kernel: [ 8518.729889] RIP: 0010:cpuidle_enter_state+0xa9/0x440
May 29 16:38:35 root-PC kernel: [ 8518.729890] Code: 3d 5c a4 3e 70 e8 47 c5 4a ff 49 89 c7 0f 1f 44 00 00 31 ff e8 78 d0 4a ff 80 7d d3 00 0f 85 e6 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 ed 0f 89 ff 01 00 00 41 c7 44 24 10 00 00 00 00 48 83 c4 18
May 29 16:38:35 root-PC kernel: [ 8518.729891] RSP: 0018:ffffa810c0133e48 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffde
May 29 16:38:35 root-PC kernel: [ 8518.729892] RAX: ffff8b080baaa6c0 RBX: ffffffff90dc11e0 RCX: 000000000000001f
May 29 16:38:35 root-PC kernel: [ 8518.729893] RDX: 000007bf6b6de6df RSI: 000000002819abac RDI: 0000000000000000
May 29 16:38:35 root-PC kernel: [ 8518.729893] RBP: ffffa810c0133e88 R08: 0000000000000002 R09: 0000000000029f40
May 29 16:38:35 root-PC kernel: [ 8518.729894] R10: ffffa810c0133e18 R11: 0000000000000006 R12: ffffc810bfc80300
May 29 16:38:35 root-PC kernel: [ 8518.729894] R13: 0000000000000001 R14: ffffffff90dc1258 R15: 000007bf6b6de6df
May 29 16:38:35 root-PC kernel: [ 8518.729896] ? cpuidle_enter_state+0x98/0x440
May 29 16:38:35 root-PC kernel: [ 8518.729897] ? menu_select+0x370/0x600
May 29 16:38:35 root-PC kernel: [ 8518.729898] cpuidle_enter+0x2e/0x40
May 29 16:38:35 root-PC kernel: [ 8518.729900] call_cpuidle+0x23/0x40
May 29 16:38:35 root-PC kernel: [ 8518.729901] do_idle+0x1f6/0x270
May 29 16:38:35 root-PC kernel: [ 8518.729903] cpu_startup_entry+0x1d/0x20
May 29 16:38:35 root-PC kernel: [ 8518.729905] start_secondary+0x167/0x1c0
May 29 16:38:35 root-PC kernel: [ 8518.729906] secondary_startup_64+0xa4/0xb0
May 29 16:38:35 root-PC kernel: [ 8518.729907] handlers:
May 29 16:38:35 root-PC kernel: [ 8518.732397] [<0000000026e0890e>] i801_isr
May 29 16:38:35 root-PC kernel: [ 8518.736809] Disabling IRQ #16
May 29 16:38:38 root-PC kernel: [ 8521.816202] device wlan0 entered promiscuous mode
May 29 16:38:38 root-PC kernel: [ 8521.847135] device wlan0 left promiscuous mode
May 29 16:38:43 root-PC kernel: [ 8526.888758] device wlan0 entered promiscuous mode
May 29 16:38:43 root-PC kernel: [ 8526.915206] device wlan0 left promiscuous mode
May 29 16:38:46 root-PC kernel: [ 8529.831182] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_client.c:843
May 29 16:38:46 root-PC kernel: [ 8529.831196] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:257
May 29 16:38:46 root-PC kernel: [ 8529.831199] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:1287
May 29 16:38:46 root-PC kernel: [ 8529.831216] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_client.c:843
May 29 16:38:46 root-PC kernel: [ 8529.831222] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:257
May 29 16:38:46 root-PC kernel: [ 8529.831225] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:1287
May 29 16:38:46 root-PC kernel: [ 8529.831240] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_client.c:843
May 29 16:38:46 root-PC kernel: [ 8529.831245] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:257
May 29 16:38:46 root-PC kernel: [ 8529.831248] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:1287
May 29 16:38:46 root-PC kernel: [ 8529.831262] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_client.c:843
May 29 16:38:46 root-PC kernel: [ 8529.831268] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:257
May 29 16:38:46 root-PC kernel: [ 8529.831270] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:1287
May 29 16:38:46 root-PC kernel: [ 8529.831278] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_client.c:843
May 29 16:38:46 root-PC kernel: [ 8529.831282] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:257
May 29 16:38:46 root-PC kernel: [ 8529.831285] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:1287
May 29 16:38:46 root-PC kernel: [ 8529.831293] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_client.c:843
May 29 16:38:46 root-PC kernel: [ 8529.831302] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:257
May 29 16:38:46 root-PC kernel: [ 8529.831304] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:1287
May 29 16:38:46 root-PC kernel: [ 8529.831312] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_client.c:843
May 29 16:38:46 root-PC kernel: [ 8529.831316] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:257
May 29 16:38:46 root-PC kernel: [ 8529.831319] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:1287
May 29 16:38:46 root-PC kernel: [ 8529.831326] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_client.c:843
May 29 16:38:46 root-PC kernel: [ 8529.831330] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:257
May 29 16:38:46 root-PC kernel: [ 8529.831333] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:1287
May 29 16:38:46 root-PC kernel: [ 8529.832762] NVRM: nvAssertOkFailedNoLog: Assertion failed: Current device is not valid [NV_ERR_INVALID_DEVICE] (0x00000026) returned from rmDeviceGpuLocksAcquire(pGpu, GPUS_LOCK_FLAGS_NONE, RM_LOCK_MODULES_MEM) @ video_mem.c:542
This problem never occurred with a GTX 1660 Super graphics card, but it has happened many times with the RTX 3060. How can this problem be solved?
nvidia-bug-report.log.gz (161.6 KB)
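Since this has happened many times, it may help to pull every Xid event out of the saved kernel logs to see how often the failure occurs and with which code. Below is a minimal sketch that assumes the log lines follow the same `NVRM: Xid (PCI:...): <code>, <message>` layout shown above. The names `XID_RE` and `find_xid_events` are hypothetical helpers made up for this illustration and are not part of any NVIDIA tool.

```python
import re

# Assumed line layout, taken from the log above:
#   NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+), (.*)$")


def find_xid_events(lines):
    """Return (pci_address, xid_code, message) tuples for every Xid line found."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2)), match.group(3).strip()))
    return events


# Two lines copied from the log above: one Xid report and one unrelated line.
sample = [
    "May 29 16:38:33 root-PC kernel: [ 8516.915901] NVRM: Xid (PCI:0000:01:00): 79, "
    "pid='<unknown>', name=<unknown>, GPU has fallen off the bus.",
    "May 29 16:38:35 root-PC kernel: [ 8518.736809] Disabling IRQ #16",
]

for pci, code, message in find_xid_events(sample):
    print(pci, code, message)
```

To run it over a real log, the lines of a saved `dmesg` or syslog capture could be passed in place of `sample`.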