Hi, in one of our projects, a Jetson TX2 running L4T 32.5 hits the problem below after a few days of continuously running a CUDA-based graphics application.
[27005.019288] nvgpu: 17000000.gp10b gk20a_channel_timeout_handler:1573 [ERR] Job on channel 502 timed out
[27005.030424] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 502
[27005.147316] nvgpu: 17000000.gp10b gk20a_channel_timeout_handler:1573 [ERR] Job on channel 503 timed out
[27005.158616] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 503
[27011.235220] nvgpu: 17000000.gp10b gk20a_channel_timeout_handler:1573 [ERR] Job on channel 505 timed out
[27011.246574] nvgpu: 17000000.gp10b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 505
[27014.874996] INFO: rcu_preempt detected stalls on CPUs/tasks:
[27014.880693] 0-…: (4 GPs behind) idle=72b/140000000000002/0 softirq=685700/685711 fqs=2530
[27014.889212] (detected by 4, t=5255 jiffies, g=505571, c=505570, q=1271)
[27024.350999] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
[27024.358234] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.9.201-tegra #100
[27024.364934] Hardware name: quill (DT)
[27024.368599] Call trace:
[27024.371062] dump_backtrace+0x0/0x198
[27024.376468] show_stack+0x24/0x30
[27024.381525] dump_stack+0xa0/0xc8
[27024.386580] panic+0x12c/0x2a8
[27024.391378] watchdog_check_hardlockup_other_cpu+0x11c/0x120
[27024.398773] watchdog_timer_fn+0x98/0x2c0
[27024.404521] __hrtimer_run_queues+0xd8/0x360
[27024.410527] hrtimer_interrupt+0xa8/0x1e0
[27024.416277] tegra186_timer_isr+0x34/0x48
[27024.422025] __handle_irq_event_percpu+0x68/0x288
[27024.428463] handle_irq_event_percpu+0x28/0x60
[27024.434640] handle_irq_event+0x50/0x80
[27024.440213] handle_fasteoi_irq+0xd4/0x1c0
[27024.446044] generic_handle_irq+0x34/0x50
[27024.451789] __handle_domain_irq+0x68/0xc0
[27024.457619] gic_handle_irq+0x5c/0xb0
[27024.463016] el1_irq+0xe8/0x194
[27024.467895] cpuidle_enter_state+0xb8/0x380
[27024.473813] cpuidle_enter+0x34/0x48
[27024.479125] call_cpuidle+0x44/0x70
[27024.484348] cpu_startup_entry+0x1b0/0x200
[27024.490182] secondary_start_kernel+0x190/0x1f8
[27024.496445] [<000000008122b1a4>] 0x8122b1a4
[27024.500632] SMP: stopping secondary CPUs
[27025.707803] SMP: failed to stop secondary CPUs 0,5
[27025.712594] Kernel Offset: disabled
[27025.716084] Memory Limit: none
[27025.719142] trusty-log panic notifier - trusty version Built: 14:49:57 Jan 15 2021
[27025.753433] Rebooting in 5 seconds…
[27030.758140] SMP: stopping secondary CPUs
[27031.965311] SMP: failed to stop secondary CPUs 0,5
[0000.175] I> Welcome to MB2(TBoot-BPMP)(version: 01.00.160913-t186-M-00.00-mobile-03715cad)
[0000.184] I> Boot-device: eMMC
[0000.191] I> sdmmc bdev is already initialized
[0000.196] I> pmic: reset reason (nverc) : 0x0
[0000.229] I> Found 19 partitions in SDMMC_BOOT (instance 3)
[0000.249] I> Found 34 partitions in SDMMC_USER (instance 3)
[0000.255] I> A/B: bin_type (16) slot 1
[0000.258] I> Loading partition bpmp-fw_b at 0xd7800000
[0000.263] I> Reading two headers - addr:0xd7800000 blocks:1
[0000.269] I> Addr: 0xd7800000, start-block: 44098752, num_blocks: 1
[0000.294] I> Binary(16) of size 534416 is loaded @ 0xd7800000
[0000.299] I> A/B: bin_type (17) slot 1
[0000.303] I> Loading partition bpmp-fw-dtb_b at 0xd79f0000
[0000.308] I> Reading two headers - addr:0xd79f0000 blocks:1
[0000.314] I> Addr: 0xd79f0000, start-block: 44102008, num_blocks: 1
[0000.340] I> Binary(17) of size 604720 is loaded @ 0xd796c400
What could be causing the GPU channel timeouts and the subsequent hard lockup, and how should we handle it?