Custom board using Nano: power-off error

Hi, we’re testing our custom board with a Jetson Nano module and found that power-off occasionally fails: the CPU shutdown fails and the system then restarts. The debug log is as follows.
In 400 power-off tests we saw 21 failures (about 5%), and the output log is different each time.

[2025-04-04 13:44:06]  [  138.536086] CPU0: SError detected, daif=1c0, spsr=0x40000045, mpidr=80000000, esr=bf000002
[2025-04-04 13:44:28]  [  144.314293] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
[2025-04-04 13:44:33]  [  144.321504] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.9.299-tegra #1
[2025-04-04 13:44:33]  [  144.328015] Hardware name: NVIDIA Jetson Nano Developer Kit (DT)
[2025-04-04 13:44:33]  [  144.334006] Call trace:
[2025-04-04 13:44:33]  [  144.336453] [<ffffff800808ba30>] dump_backtrace+0x0/0x198
[2025-04-04 13:44:33]  [  144.341840] [<ffffff800808bff4>] show_stack+0x24/0x30
[2025-04-04 13:44:33]  [  144.346883] [<ffffff8008f86f74>] dump_stack+0xa0/0xc4
[2025-04-04 13:44:33]  [  144.351923] [<ffffff8008f83f94>] panic+0x128/0x2a4
[2025-04-04 13:44:33]  [  144.356706] [<ffffff80081824d4>] watchdog_check_hardlockup_other_cpu+0x11c/0x120
[2025-04-04 13:44:33]  [  144.364084] [<ffffff8008181648>] watchdog_timer_fn+0x98/0x2c0
[2025-04-04 13:44:33]  [  144.369818] [<ffffff8008139238>] __hrtimer_run_queues+0xd8/0x360
[2025-04-04 13:44:33]  [  144.375810] [<ffffff8008139b88>] hrtimer_interrupt+0xa8/0x1e0
[2025-04-04 13:44:33]  [  144.381544] [<ffffff8008c0d8e8>] tegra210_timer_isr+0x38/0x48
[2025-04-04 13:44:33]  [  144.387277] [<ffffff8008121bd8>] __handle_irq_event_percpu+0x68/0x288
[2025-04-04 13:44:33]  [  144.393702] [<ffffff8008121e20>] handle_irq_event_percpu+0x28/0x60
[2025-04-04 13:44:33]  [  144.399866] [<ffffff8008121ea8>] handle_irq_event+0x50/0x80
[2025-04-04 13:44:33]  [  144.405424] [<ffffff8008125d84>] handle_fasteoi_irq+0xd4/0x1c0
[2025-04-04 13:44:33]  [  144.411242] [<ffffff8008120b6c>] generic_handle_irq+0x34/0x50
[2025-04-04 13:44:33]  [  144.416972] [<ffffff8008121278>] __handle_domain_irq+0x68/0xc0
[2025-04-04 13:44:33]  [  144.422790] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[2025-04-04 13:44:33]  [  144.428174] [<ffffff8008082c28>] el1_irq+0xe8/0x194
[2025-04-04 13:44:33]  [  144.433042] [<ffffff8008bb1ec0>] cpuidle_enter_state+0xb8/0x380
[2025-04-04 13:44:33]  [  144.438946] [<ffffff8008bb21fc>] cpuidle_enter+0x34/0x48
[2025-04-04 13:44:33]  [  144.444243] [<ffffff8008111534>] call_cpuidle+0x44/0x70
[2025-04-04 13:44:33]  [  144.449454] [<ffffff80081118b0>] cpu_startup_entry+0x1b0/0x200
[2025-04-04 13:44:33]  [  144.455275] [<ffffff8008091cc8>] secondary_start_kernel+0x190/0x1f8
[2025-04-04 13:44:33]  [  144.461525] [<0000000084f951a8>] 0x84f951a8
[2025-04-04 13:44:33]  [  144.465697] SMP: stopping secondary CPUs
[2025-04-04 13:44:33]  [  145.527320] SMP: failed to stop secondary CPUs 0-3
[2025-04-04 13:44:35]  [  145.532102] Kernel Offset: disabled
[2025-04-04 13:44:35]  [  145.535579] Memory Limit: none
[2025-04-04 13:44:35]  [  145.547734] Rebooting in 5 seconds..
[2025-04-04 13:44:35]  [  150.551585] SMP: stopping secondary CPUs
[2025-04-04 13:44:40]  [  151.613159] SMP: failed to stop secondary CPUs 0-3
[2025-04-04 15:38:54]  [   46.757080] CPU0: SError detected, daif=1c0, spsr=0x40000045, mpidr=80000000, esr=bf000002
[2025-04-04 15:39:15]  [   46.765344] INFO: rcu_preempt self-detected stall on CPU
[2025-04-04 15:39:15]  [   46.770654] 0-...: (1 GPs behind) idle=06f/140000000000002/0 softirq=7588/7589 fqs=23 
[2025-04-04 15:39:15]  [   46.778637]  (t=5265 jiffies g=1524 c=1523 q=207)
[2025-04-04 15:39:15]  [   46.783424] rcu_preempt kthread starved for 5219 jiffies! g1524 c1523 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[2025-04-04 15:39:15]  [   67.810919] CPU1: SError detected, daif=1c0, spsr=0x40000045, mpidr=80000001, esr=bf000002
[2025-04-04 15:39:36]  [   88.864611] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 39s! [pool:4875]
[2025-04-04 15:39:57]  [   88.864614] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 52s! [dbus-daemon:4726]
[2025-04-04 15:39:57]  [   88.864628] Modules linked in: fuse zram userspace_alert nvgpu ip_tables x_tables
[2025-04-04 15:39:57]  [   88.864629] 
[2025-04-04 15:39:57]  [   88.864633] CPU: 2 PID: 4726 Comm: dbus-daemon Not tainted 4.9.299-tegra #1
[2025-04-04 15:39:57]  [   88.864635] Hardware name: NVIDIA Jetson Nano Developer Kit (DT)
[2025-04-04 15:39:57]  [   88.864637] task: ffffffc0f6687000 task.stack: ffffffc0f341c000
[2025-04-04 15:39:57]  [   88.864644] PC is at ptep_set_access_flags+0xb0/0x138
[2025-04-04 15:39:57]  [   88.864648] LR is at do_wp_page+0x180/0x8d8
[2025-04-04 15:39:57]  [   88.864651] pc : [<ffffff8008214bd0>] lr : [<ffffff8008200500>] pstate: 20400145
[2025-04-04 15:39:57]  [   88.864652] sp : ffffffc0f341fc80
[2025-04-04 15:39:57]  [   88.864656] x29: ffffffc0f341fc80 x28: ffffffc0f6687000 
[2025-04-04 15:39:57]  [   88.864660] x27: ffffffc0f3b8d568 x26: 0000000000000002 
[2025-04-04 15:39:57]  [   88.864663] x25: ffffffc0e114d4d0 x24: 0000000000000040 
[2025-04-04 15:39:57]  [   88.864666] x23: ffffffc0e114d4d0 x22: 00e8000120de9fd3 
[2025-04-04 15:39:57]  [   88.864669] x21: ffffffc0aa704110 x20: 0843000005581022 
[2025-04-04 15:39:57]  [   88.864672] x19: 0000000000000041 x18: 0000007fa0bc9a70 
[2025-04-04 15:39:57]  [   88.864675] x17: 0000000000000000 x16: 0000000000000000 
[2025-04-04 15:39:57]  [   88.864679] x15: 01c591431b18e008 
[2025-04-04 15:39:57]  [   88.864680] CPU3: SError detected, daif=1c0, spsr=0x40000045, mpidr=80000003, esr=bf000002
[2025-04-04 15:39:57]  [   88.864681] x14: fffffffffffffec8 
[2025-04-04 15:39:57]  [   88.864684] x13: 726f273d30677261 x12: 2c27737542442f70 
[2025-04-04 15:39:57]  [   88.864687] x11: 0027726567616e61 x10: 4d6e6f6973736553 
[2025-04-04 15:39:57]  [   88.864690] x9 : 000000000000001c x8 : 00000000000000d4 
[2025-04-04 15:39:57]  [   88.864693] x7 : 0000000000000001 x6 : 0000000000100073 
[2025-04-04 15:39:57]  [   88.864696] x5 : 00e0000120de9fd3 x4 : 0000000000000001 
[2025-04-04 15:39:57]  [   88.864698] x3 : 00e8000120de9fd3 x2 : 00e8000120de9f53 
[2025-04-04 15:39:57]  [   88.864701] x1 : 00e8000120de9fd3 x0 : 0000000000000001 
[2025-04-04 15:39:57]  [   88.864702] 
[2025-04-04 15:39:57]  [   88.864705] Kernel panic - not syncing: softlockup: hung tasks
[2025-04-04 15:39:57]  [   88.864709] CPU: 2 PID: 4726 Comm: dbus-daemon Tainted: G             L  4.9.299-tegra #1
[2025-04-04 15:39:57]  [   88.864710] Hardware name: NVIDIA Jetson Nano Developer Kit (DT)
[2025-04-04 15:39:57]  [   88.864711] Call trace:
[2025-04-04 15:39:57]  [   88.864716] [<ffffff800808ba30>] dump_backtrace+0x0/0x198
[2025-04-04 15:39:57]  [   88.864720] [<ffffff800808bff4>] show_stack+0x24/0x30
[2025-04-04 15:39:57]  [   88.864724] [<ffffff8008f86f74>] dump_stack+0xa0/0xc4
[2025-04-04 15:39:57]  [   88.864728] [<ffffff8008f83f94>] panic+0x128/0x2a4
[2025-04-04 15:39:57]  [   88.864733] [<ffffff8008181870>] watchdog_unpark_threads+0x0/0x98
[2025-04-04 15:39:57]  [   88.864736] [<ffffff8008139238>] __hrtimer_run_queues+0xd8/0x360
[2025-04-04 15:39:57]  [   88.864739] [<ffffff8008139b88>] hrtimer_interrupt+0xa8/0x1e0
[2025-04-04 15:39:57]  [   88.864742] [<ffffff8008c0d8e8>] tegra210_timer_isr+0x38/0x48
[2025-04-04 15:39:57]  [   88.864746] [<ffffff8008121bd8>] __handle_irq_event_percpu+0x68/0x288
[2025-04-04 15:39:57]  [   88.864748] [<ffffff8008121e20>] handle_irq_event_percpu+0x28/0x60
[2025-04-04 15:39:57]  [   88.864751] [<ffffff8008121ea8>] handle_irq_event+0x50/0x80
[2025-04-04 15:39:57]  [   88.864753] [<ffffff8008125d84>] handle_fasteoi_irq+0xd4/0x1c0
[2025-04-04 15:39:57]  [   88.864756] [<ffffff8008120b6c>] generic_handle_irq+0x34/0x50
[2025-04-04 15:39:57]  [   88.864758] [<ffffff8008121278>] __handle_domain_irq+0x68/0xc0
[2025-04-04 15:39:57]  [   88.864761] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[2025-04-04 15:39:57]  [   88.864763] [<ffffff8008082c28>] el1_irq+0xe8/0x194
[2025-04-04 15:39:57]  [   88.864766] [<ffffff8008214bd0>] ptep_set_access_flags+0xb0/0x138
[2025-04-04 15:39:57]  [   88.864769] [<ffffff8008200500>] do_wp_page+0x180/0x8d8
[2025-04-04 15:39:57]  [   88.864772] [<ffffff8008204110>] handle_mm_fault+0x570/0x5f0
[2025-04-04 15:39:58]  [   88.864775] [<ffffff80080a247c>] do_page_fault+0x254/0x480
[2025-04-04 15:39:58]  [   88.864777] [<ffffff8008080954>] do_mem_abort+0x54/0xb0
[2025-04-04 15:39:58]  [   88.864779] [<ffffff8008083408>] el0_da+0x20/0x24
[2025-04-04 15:39:58]  [   88.864783] SMP: stopping secondary CPUs
[2025-04-04 15:39:58]  [   89.165755] Modules linked in: fuse zram userspace_alert nvgpu ip_tables x_tables
[2025-04-04 15:39:58]  [   89.173271] 
[2025-04-04 15:39:58]  [   89.174755] CPU: 0 PID: 4875 Comm: pool Tainted: G             L  4.9.299-tegra #1
[2025-04-04 15:39:58]  [   89.182304] Hardware name: NVIDIA Jetson Nano Developer Kit (DT)
[2025-04-04 15:39:58]  [   89.188295] task: ffffffc0f9370e00 task.stack: ffffffc0df824000
[2025-04-04 15:39:58]  [   89.194202] PC is at ptep_set_access_flags+0xb0/0x138
[2025-04-04 15:39:58]  [   89.199239] LR is at do_wp_page+0x180/0x8d8
[2025-04-04 15:39:58]  [   89.203410] pc : [<ffffff8008214bd0>] lr : [<ffffff8008200500>] pstate: 20400145
[2025-04-04 15:39:58]  [   89.210786] sp : ffffffc0df827c80
[2025-04-04 15:39:58]  [   89.214088] x29: ffffffc0df827c80 x28: ffffffc0f9370e00 
[2025-04-04 15:39:58]  [   89.219400] x27: ffffffc0f3b8cae8 x26: 0000000000000002 
[2025-04-04 15:39:58]  [   89.224713] x25: ffffffc0e1380a50 x24: 0000000000000040 
[2025-04-04 15:39:58]  [   89.230025] x23: ffffffc0e1380a50 x22: 00e8000158968fd3 
[2025-04-04 15:39:58]  [   89.235336] x21: ffffffc0e22eeba8 x20: 09000000055aed75 
[2025-04-04 15:39:58]  [   89.240645] x19: 0000000000000041 x18: 0000007f7b67aa70 
[2025-04-04 15:39:58]  [   89.245956] x17: 0000000000000000 x16: 0000000000000000 
[2025-04-04 15:39:58]  [   89.251267] x15: 00003bb98294f37b x14: 0023d22696913400 
[2025-04-04 15:39:58]  [   89.256576] x13: 0000000067ef8c08 x12: 0000000000000018 
[2025-04-04 15:39:58]  [   89.261887] x11: 0000000024e22f42 x10: 0000000000000016 
[2025-04-04 15:39:58]  [   89.267196] x9 : 003b9aca00000000 x8 : 0000000000000062 
[2025-04-04 15:39:58]  [   89.272507] x7 : 0000000000002538 x6 : 0000000000100073 
[2025-04-04 15:39:58]  [   89.277817] x5 : 00e0000158968fd3 x4 : 0000000000000001 
[2025-04-04 15:39:58]  [   89.283127] x3 : 00e8000158968fd3 x2 : 00e8000158968f53 
[2025-04-04 15:39:58]  [   89.288437] x1 : 00e8000158968fd3 x0 : 0000000000000001 
[2025-04-04 15:39:58]  [   89.293748] 
[2025-04-04 15:39:58]  [   89.925268] SMP: failed to stop secondary CPUs 0-2
[2025-04-04 15:39:58]  [   89.930053] Kernel Offset: disabled
[2025-04-04 15:39:58]  [   89.933532] Memory Limit: none
[2025-04-04 15:39:58]  [   89.944246] Rebooting in 5 seconds..
[2025-04-04 15:39:58]  [   94.948097] SMP: stopping secondary CPUs
[2025-04-04 15:40:03]  [   96.011545] SMP: failed to stop secondary CPUs 0-2
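
As a reading aid for the `SError detected ... esr=bf000002` lines above, here is a minimal sketch that splits the value into the architectural ESR_ELx fields (EC in bits [31:26], IL in bit 25, ISS in bits [24:0]). The function name and the one-entry lookup table are illustrative assumptions, not part of any NVIDIA tool. The decode only confirms that the exception class is an SError with an implementation-defined syndrome; it does not by itself identify the faulting device.

```python
# Decode an AArch64 ESR_ELx value using the architectural field layout:
# EC = bits [31:26], IL = bit 25, ISS = bits [24:0].
# EC_NAMES is deliberately tiny: it only names the class seen in this log.
EC_NAMES = {0x2F: "SError interrupt"}


def decode_esr(esr: int) -> dict:
    """Split an ESR value into EC, IL, ISS and the SError IDS bit."""
    ec = (esr >> 26) & 0x3F
    il = (esr >> 25) & 0x1
    iss = esr & 0x1FFFFFF
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "other"),
        "il": il,
        "iss": iss,
        # For an SError, ISS bit 24 (IDS) set means the remaining
        # syndrome bits are implementation defined.
        "ids": (iss >> 24) & 0x1,
    }


fields = decode_esr(0xBF000002)  # value taken from the log above
for name, value in fields.items():
    print(name, hex(value) if isinstance(value, int) else value)
```

For `0xbf000002` this gives EC = 0x2f (SError interrupt) with the IDS bit set, so the low syndrome bits are specific to the CPU implementation rather than defined by the architecture.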

Is it possible to reproduce this on NV devkit?

We did not test this on the devkit.
However, we cannot reproduce it when testing a single board; it only reproduces in a multi-board test environment.

Hi, here is our full debug log. I found a lot of nvgpu ERR prints in both the normal boot and the failed power-off. Is this related to our issue?

Normal boot and power-off:

[   25.231173] nvgpu: 57000000.gpu      gr_gk20a_handle_sm_exception:5710 [ERR]  sm hww global 0x00000004 warp 0x004b0001
[2025-04-03 17:27:51]  [   25.241899] nvgpu: 57000000.gpu                      gk20a_gr_isr:6231 [ERR]  set gr exception notifier
[2025-04-03 17:27:51]  [   25.251355] nvgpu: 57000000.gpu   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 13 for ch 507
[2025-04-03 17:27:51]  [   25.262571] nvgpu: 57000000.gpu gk20a_fifo_handle_mmu_fault_locked:1722 [ERR]  fake mmu fault on engine 0, engine subid 1 (hub), client 30 (a falcon), addr 0xbc7b97d000, type 14 (dual ptes), access_type 0x00000000,inst_ptr 0x5abe9f000
[2025-04-03 17:27:51]  [  149.522602] IRQ10 no longer affine to CPU1
[2025-04-03 17:29:55]  [  149.526958] CPU1: shutdown
[2025-04-03 17:29:55]  [  149.566565] IRQ11 no longer affine to CPU2
[2025-04-03 17:29:55]  [  149.570890] CPU2: shutdown
[2025-04-03 17:29:55]  [  149.611031] IRQ12 no longer affine to CPU3
[2025-04-03 17:29:55]  [  149.615752] CPU3: shutdown
[2025-04-03 17:29:55]  [  149.635521] reboot: Power down

Failed power-off:

[2025-04-03 17:33:19]  [   25.677982] nvgpu: 57000000.gpu      gr_gk20a_handle_sm_exception:5710 [ERR]  sm hww global 0x00000004 warp 0x001b0001
[2025-04-03 17:33:24]  [   25.688693] nvgpu: 57000000.gpu                      gk20a_gr_isr:6231 [ERR]  set gr exception notifier
[2025-04-03 17:33:24]  [   25.698094] nvgpu: 57000000.gpu   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 13 for ch 507
[2025-04-03 17:33:24]  [   25.709036] nvgpu: 57000000.gpu gk20a_fifo_handle_mmu_fault_locked:1722 [ERR]  fake mmu fault on engine 0, engine subid 0 (gpc), client 23 (t1 5), addr 0xdfa884000, type 5 (priv viol), access_type 0x00000001,inst_ptr 0x8674f000
[2025-04-03 17:33:24]  [   30.966285] nvgpu: 57000000.gpu     gk20a_channel_timeout_handler:1573 [ERR]  Job on channel 507 timed out
[2025-04-03 17:33:29]  [   30.976370] nvgpu: 57000000.gpu   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 8 for ch 507
[2025-04-03 17:33:29]  [   30.987766] nvgpu: 57000000.gpu gk20a_fifo_handle_mmu_fault_locked:1722 [ERR]  fake mmu fault on engine 0, engine subid 0 (gpc), client 23 (t1 5), addr 0xdfa884000, type 5 (priv viol), access_type 0x00000001,inst_ptr 0x8674f000
[2025-04-03 17:33:29]  [   34.696456] nvgpu: 57000000.gpu   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 8 for ch 507
[2025-04-03 17:33:33]  [   34.706631] nvgpu: 57000000.gpu     gk20a_fifo_handle_sched_error:2553 [ERR]  fifo sched ctxsw timeout error: engine=0, tsg=4, ms=3100
[2025-04-03 17:33:33]  [   34.719644] nvgpu: 57000000.gpu gk20a_fifo_handle_mmu_fault_locked:1722 [ERR]  fake mmu fault on engine 0, engine subid 0 (gpc), client 23 (t1 5), addr 0xdfa884000, type 5 (priv viol), access_type 0x00000001,inst_ptr 0x8674f000
[2025-04-03 17:33:33]  [   34.739794] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:129  [ERR]  gr_fecs_os_r : 0
[2025-04-03 17:33:33]  [   34.748408] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:131  [ERR]  gr_fecs_cpuctl_r : 0x40
[2025-04-03 17:33:33]  [   34.757625] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:133  [ERR]  gr_fecs_idlestate_r : 0x1
[2025-04-03 17:33:33]  [   34.767012] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:135  [ERR]  gr_fecs_mailbox0_r : 0x3
[2025-04-03 17:33:33]  [   34.776315] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:137  [ERR]  gr_fecs_mailbox1_r : 0x0
[2025-04-03 17:33:33]  [   34.785612] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:139  [ERR]  gr_fecs_irqstat_r : 0x0
[2025-04-03 17:33:33]  [   34.794828] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:141  [ERR]  gr_fecs_irqmode_r : 0x4
[2025-04-03 17:33:33]  [   34.804039] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:143  [ERR]  gr_fecs_irqmask_r : 0x8704
[2025-04-03 17:33:33]  [   34.813515] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:145  [ERR]  gr_fecs_irqdest_r : 0x0
[2025-04-03 17:33:33]  [   34.822731] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:147  [ERR]  gr_fecs_debug1_r : 0x40
[2025-04-03 17:33:33]  [   34.831947] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:149  [ERR]  gr_fecs_debuginfo_r : 0x0
[2025-04-03 17:33:33]  [   34.841333] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:151  [ERR]  gr_fecs_ctxsw_status_1_r : 0x340
[2025-04-03 17:33:33]  [   34.851333] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(0) : 0x4
[2025-04-03 17:33:33]  [   34.861326] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(1) : 0x0
[2025-04-03 17:33:33]  [   34.871323] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(2) : 0x50009
[2025-04-03 17:33:33]  [   34.881664] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(3) : 0x20
[2025-04-03 17:33:33]  [   34.891744] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(4) : 0x1ffda0
[2025-04-03 17:33:33]  [   34.902169] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(5) : 0x0
[2025-04-03 17:33:33]  [   34.912164] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(6) : 0x0
[2025-04-03 17:33:33]  [   34.922157] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(7) : 0x0
[2025-04-03 17:33:33]  [   34.932151] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(8) : 0x0
[2025-04-03 17:33:33]  [   34.942143] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(9) : 0x0
[2025-04-03 17:33:33]  [   34.952138] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(10) : 0x0
[2025-04-03 17:33:33]  [   34.962216] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(11) : 0x3
[2025-04-03 17:33:33]  [   34.972297] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(12) : 0x0
[2025-04-03 17:33:33]  [   34.982381] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(13) : 0x0
[2025-04-03 17:33:33]  [   34.992464] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(14) : 0x0
[2025-04-03 17:33:33]  [   35.002544] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(15) : 0x0
[2025-04-03 17:33:33]  [   35.012627] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:159  [ERR]  gr_fecs_engctl_r : 0x0
[2025-04-03 17:33:33]  [   35.021755] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:161  [ERR]  gr_fecs_curctx_r : 0x0
[2025-04-03 17:33:33]  [   35.030885] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:163  [ERR]  gr_fecs_nxtctx_r : 0x0
[2025-04-03 17:33:33]  [   35.040012] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:169  [ERR]  FECS_FALCON_REG_IMB : 0xbadfbadf
[2025-04-03 17:33:33]  [   35.050007] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:175  [ERR]  FECS_FALCON_REG_DMB : 0xbadfbadf
[2025-04-03 17:33:33]  [   35.060000] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:181  [ERR]  FECS_FALCON_REG_CSW : 0xbadfbadf
[2025-04-03 17:33:33]  [   35.069997] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:187  [ERR]  FECS_FALCON_REG_CTX : 0xbadfbadf
[2025-04-03 17:33:33]  [   35.079990] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:193  [ERR]  FECS_FALCON_REG_EXCI : 0xbadfbadf
[2025-04-03 17:33:33]  [   35.090073] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
[2025-04-03 17:33:33]  [   35.099984] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
[2025-04-03 17:33:33]  [   35.109905] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
[2025-04-03 17:33:33]  [   35.119813] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
[2025-04-03 17:33:33]  [   35.129722] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
[2025-04-03 17:33:33]  [   35.139629] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
[2025-04-03 17:33:33]  [   35.149539] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
[2025-04-03 17:33:33]  [   35.159450] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
[2025-04-03 17:33:33]  [   35.169367] nvgpu: 57000000.gpu gk20a_fifo_handle_mmu_fault_locked:1727 [ERR]  gr_status_r : 0x32008a1
[2025-04-03 17:33:33]  [   35.179887] nvgpu: 57000000.gpu                    fifo_error_isr:2627 [ERR]  channel reset initiated from fifo_error_isr; intr=0x00000100
[2025-04-03 17:33:33]  [   56.336179] CPU1: SError detected, daif=1c0, spsr=0x40000045, mpidr=80000001, esr=bf000002
[2025-04-03 17:33:54]  [   77.389915] CPU2: SError detected, daif=1c0, spsr=0x40000045, mpidr=80000002, esr=bf000002
[2025-04-03 17:34:15]  [  119.497375] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 39s! [kworker/u8:0:6]
[2025-04-03 17:34:58]  [  119.505606] Kernel panic - not syncing: softlockup: hung tasks
[2025-04-03 17:34:58]  [  119.511427] CPU: 2 PID: 6 Comm: kworker/u8:0 Tainted: G             L  4.9.299-tegra #1
[2025-04-03 17:34:58]  [  119.519412] Hardware name: NVIDIA Jetson Nano Developer Kit (DT)
[2025-04-03 17:34:58]  [  119.525408] Workqueue: devfreq_wq devfreq_monitor
[2025-04-03 17:34:58]  [  119.530105] Call trace:
[2025-04-03 17:34:58]  [  119.532545] [<ffffff800808ba30>] dump_backtrace+0x0/0x198
[2025-04-03 17:34:58]  [  119.537931] [<ffffff800808bff4>] show_stack+0x24/0x30
[2025-04-03 17:34:58]  [  119.542971] [<ffffff8008f86f74>] dump_stack+0xa0/0xc4
[2025-04-03 17:34:58]  [  119.548010] [<ffffff8008f83f94>] panic+0x128/0x2a4
[2025-04-03 17:34:58]  [  119.552790] [<ffffff8008181870>] watchdog_unpark_threads+0x0/0x98
[2025-04-03 17:34:58]  [  119.558869] [<ffffff8008139238>] __hrtimer_run_queues+0xd8/0x360
[2025-04-03 17:34:58]  [  119.564861] [<ffffff8008139b88>] hrtimer_interrupt+0xa8/0x1e0
[2025-04-03 17:34:58]  [  119.570594] [<ffffff8008c0d8e8>] tegra210_timer_isr+0x38/0x48
[2025-04-03 17:34:58]  [  119.576325] [<ffffff8008121bd8>] __handle_irq_event_percpu+0x68/0x288
[2025-04-03 17:34:58]  [  119.582748] [<ffffff8008121e20>] handle_irq_event_percpu+0x28/0x60
[2025-04-03 17:34:58]  [  119.588913] [<ffffff8008121ea8>] handle_irq_event+0x50/0x80
[2025-04-03 17:34:58]  [  119.594471] [<ffffff8008125d84>] handle_fasteoi_irq+0xd4/0x1c0
[2025-04-03 17:34:58]  [  119.600289] [<ffffff8008120b6c>] generic_handle_irq+0x34/0x50
[2025-04-03 17:34:58]  [  119.606019] [<ffffff8008121278>] __handle_domain_irq+0x68/0xc0
[2025-04-03 17:34:58]  [  119.611837] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[2025-04-03 17:34:58]  [  119.617220] [<ffffff8008082c28>] el1_irq+0xe8/0x194
[2025-04-03 17:34:58]  [  119.622339] [<ffffff8000fda8b8>] __nvgpu_readl+0x38/0xc8 [nvgpu]
[2025-04-03 17:34:58]  [  119.628577] [<ffffff8000fc4704>] nvgpu_mc_boot_0+0x34/0x78 [nvgpu]
[2025-04-03 17:34:58]  [  119.634999] [<ffffff800101aca4>] __nvgpu_check_gpu_state+0x2c/0xa8 [nvgpu]
[2025-04-03 17:34:58]  [  119.642104] [<ffffff8000fda994>] nvgpu_readl+0x4c/0x60 [nvgpu]
[2025-04-03 17:34:58]  [  119.648181] [<ffffff800103a908>] gk20a_pmu_read_idle_counter+0x30/0x40 [nvgpu]
[2025-04-03 17:34:58]  [  119.655640] [<ffffff800100d034>] nvgpu_pmu_busy_cycles_norm+0x6c/0x160 [nvgpu]
[2025-04-03 17:34:58]  [  119.663096] [<ffffff8000fef878>] gk20a_scale_get_dev_status+0xa8/0xf8 [nvgpu]
[2025-04-03 17:34:58]  [  119.670215] [<ffffff8008cd4bac>] nvhost_pod_estimate_freq+0x94/0x830
[2025-04-03 17:34:58]  [  119.676552] [<ffffff8008cd1bac>] update_devfreq+0x44/0x218
[2025-04-03 17:34:58]  [  119.682022] [<ffffff8008cd1db4>] devfreq_monitor+0x34/0x90
[2025-04-03 17:34:58]  [  119.687495] [<ffffff80080d4154>] process_one_work+0x1e4/0x4b0
[2025-04-03 17:34:58]  [  119.693226] [<ffffff80080d4470>] worker_thread+0x50/0x4c8
[2025-04-03 17:34:58]  [  119.698609] [<ffffff80080db154>] kthread+0xec/0xf0
[2025-04-03 17:34:58]  [  119.703386] [<ffffff80080838a0>] ret_from_fork+0x10/0x30
[2025-04-03 17:34:58]  [  119.708685] SMP: stopping secondary CPUs
[2025-04-03 17:34:58]  [  120.770269] SMP: failed to stop secondary CPUs 0-3
[2025-04-03 17:34:59]  [  120.775050] Kernel Offset: disabled
[2025-04-03 17:34:59]  [  120.778528] Memory Limit: none
[2025-04-03 17:34:59]  [  120.790224] Rebooting in 5 seconds..
[2025-04-03 17:34:59]  [  125.794076] SMP: stopping secondary CPUs
[2025-04-03 17:35:04]  [  126.855222] SMP: failed to stop secondary CPUs 0-3

powererr.txt (32.6 MB)
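
For a capture this large, a small script can help triage each power cycle instead of reading the log by hand. The following is only a sketch: the signature strings are copied from the logs in this thread, while the function names and the idea of passing one cycle's lines at a time are assumptions (splitting the capture into cycles is left out).

```python
import re

# Signatures copied from the kernel logs quoted in this thread.
SIGNATURES = {
    "serror": re.compile(r"CPU\d+: SError detected"),
    "panic": re.compile(r"Kernel panic - not syncing"),
    "clean_poweroff": re.compile(r"reboot: Power down"),
}


def classify_cycle(lines):
    """Return 'fail' on panic/SError, 'ok' on a clean power down, else 'unknown'."""
    hits = {
        name
        for line in lines
        for name, pattern in SIGNATURES.items()
        if pattern.search(line)
    }
    if "panic" in hits or "serror" in hits:
        return "fail"
    if "clean_poweroff" in hits:
        return "ok"
    return "unknown"


def failure_rate(failures, total):
    """Fraction of failed cycles, e.g. 21 failures in 400 runs."""
    return failures / total


# Two short excerpts from the logs above, used as sample input.
sample_ok = [
    "[  149.615752] CPU3: shutdown",
    "[  149.635521] reboot: Power down",
]
sample_fail = [
    "[   56.336179] CPU1: SError detected, daif=1c0, spsr=0x40000045, mpidr=80000001, esr=bf000002",
    "[  119.505606] Kernel panic - not syncing: softlockup: hung tasks",
]

print(classify_cycle(sample_ok), classify_cycle(sample_fail))
print(f"{failure_rate(21, 400):.2%}")
```

Counting the nvgpu `[ERR]` lines per cycle in the same way would show whether the GPU errors appear only in failing cycles or in every boot.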

Hi, I tested multiple boards and found that, most of the time, only one of them hits the error. After we swapped the module on that board, the error no longer appeared as it had with the original module, so we think the module is faulty.

We tested this single module on our custom board; the L4T version is 32.7.3. We found that it has about a 1-in-10 chance of hitting the error without any operation; the log is attached below.
After we reflashed this module, the same error message appears on every boot.
It seems that the nvgpu errors come up, and eventually the kernel panics and reboots automatically.
errlog.txt (277.8 KB)

Please RMA the module.

Okay, I’ll do it.
Thanks.