AGX Xavier keeps rebooting by itself: BUG: soft lockup

My AGX Xavier keeps rebooting by itself sporadically. It happens with the network connected or disabled, and with or without apps running. Two serial console logs are attached below:

  1. Self-reboot with the network on: soft_lock_self_reboot.log (1.8 MB)
  2. Self-reboot with the network off (disabled) and nothing running: self_reboot_network_off_running_nothing.log (98.9 KB)

I keep seeing the following kernel panic message whenever it reboots:
[ 4425.175019] INFO: rcu_preempt self-detected stall on CPU
[ 4425.175220] 0-…: (2 GPs behind) idle=d9f/140000000000001/0 softirq=73877/73877 fqs=2187
[ 4425.175376] (t=5250 jiffies g=38235 c=38234 q=1964)
[ 4425.187000] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 4425.187171] 0-…: (2 GPs behind) idle=d9f/140000000000001/0 softirq=73877/73877 fqs=2237
[ 4425.187321] (detected by 5, t=5253 jiffies, g=3011, c=3010, q=0)
[ 4451.906464] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [ksoftirqd/0:3]
[ 4451.906821] Kernel panic - not syncing: softlockup: hung tasks
[ 4451.906927] CPU: 0 PID: 3 Comm: ksoftirqd/0 Tainted: G L 4.9.140-tegra #1
[ 4451.907055] Hardware name: Jetson-AGX (DT)
[ 4451.907127] Call trace:
[ 4451.907177] [] dump_backtrace+0x0/0x198
[ 4451.907266] [] show_stack+0x24/0x30
[ 4451.907351] [] dump_stack+0x98/0xc0
[ 4451.907437] [] panic+0x11c/0x298
[ 4451.907520] [] watchdog_unpark_threads+0x0/0x98
[ 4451.907621] [] __hrtimer_run_queues+0xd8/0x360
[ 4451.907716] [] hrtimer_interrupt+0xa8/0x1e0
[ 4451.907813] [] arch_timer_handler_phys+0x38/0x58
[ 4451.908038] [] handle_percpu_devid_irq+0x90/0x2b0
[ 4451.908534] [] generic_handle_irq+0x34/0x50
[ 4451.908959] [] __handle_domain_irq+0x68/0xc0
[ 4451.909419] [] gic_handle_irq+0x5c/0xb0
[ 4451.911318] [] el1_irq+0xe8/0x194
[ 4451.915964] [] ksize+0x0/0xf0
[ 4451.920513] [] __tcp_send_ack.part.7+0x44/0x140
[ 4451.926895] [] tcp_send_ack+0x34/0x40
[ 4451.931711] [] __tcp_ack_snd_check+0x54/0xb0
[ 4451.937574] [] tcp_rcv_established+0x284/0x7b8
[ 4451.943612] [] tcp_v4_do_rcv+0x108/0x248
[ 4451.949124] [] tcp_v4_rcv+0xaac/0xc00
[ 4451.954547] [] ip_local_deliver_finish+0x80/0x278
[ 4451.960845] [] ip_local_deliver+0x54/0xf0
[ 4451.966620] [] ip_rcv_finish+0x1d8/0x3a0
[ 4451.971958] [] ip_rcv+0x270/0x3a8
[ 4451.976688] [] __netif_receive_skb_core+0x3b8/0xad8
[ 4451.983073] [] __netif_receive_skb+0x28/0x78
[ 4451.988935] [] netif_receive_skb_internal+0x2c/0xb0
[ 4451.995582] [] napi_gro_receive+0x15c/0x188
[ 4452.001362] [] eqos_napi_poll_rx+0x358/0x430
[ 4452.007222] [] net_rx_action+0xf4/0x358
[ 4452.012734] [] __do_softirq+0x13c/0x3b0
[ 4452.018247] [] run_ksoftirqd+0x48/0x58
[ 4452.023761] [] smpboot_thread_fn+0x160/0x248
[ 4452.029795] [] kthread+0xec/0xf0
[ 4452.034350] [] ret_from_fork+0x10/0x30
[ 4452.040124] SMP: stopping secondary CPUs
[ 4452.043980] Kernel Offset: disabled
[ 4452.047643] Memory Limit: none
[ 4452.050707] trusty-log panic notifier - trusty version Built: 21:16:17 Jun 25 2020
[ 4452.066548] Rebooting in 5 seconds…

What is causing these reboots? How can I fix my AGX Xavier, or should I return the unit for a replacement? Please advise. Thanks.

Within a few minutes it rebooted again. This time the reboot was preceded by:

[ 507.090244] nvgpu: 17000000.gv11b gk20a_fifo_handle_pbdma_intr_0:2722 [ERR] semaphore acquire timeout!
[ 507.090453] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 24 for ch 509

Then came the soft lockup error:

[ 574.367546] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 44s! [ksoftirqd/0:3]
[ 574.367866] Kernel panic - not syncing: softlockup: hung tasks

Please find the serial console log attached: self_reboot_watchdog_thresh_20.log (101.7 KB)
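
For anyone comparing the attached logs, here is a minimal sketch of how the soft lockup events could be pulled out of a serial console capture. It only assumes the line format shown in the excerpts above. The function name `find_soft_lockups` and the `sample` list are hypothetical and used for illustration only; this is not an NVIDIA tool.

```python
import re

# Matches lines in the format quoted above, e.g.
# [ 574.367546] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 44s! [ksoftirqd/0:3]
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NMI watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft lockup line found in an iterable of log lines."""
    events = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            events.append(
                {
                    "timestamp": float(match.group("ts")),  # seconds since boot
                    "cpu": int(match.group("cpu")),
                    "stuck_seconds": int(match.group("secs")),
                    "task": match.group("task"),
                    "pid": int(match.group("pid")),
                }
            )
    return events


# Two lines copied from the excerpts in this post, plus one unrelated line.
sample = [
    "[ 4451.906464] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [ksoftirqd/0:3]",
    "[ 4451.906821] Kernel panic - not syncing: softlockup: hung tasks",
    "[ 574.367546] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 44s! [ksoftirqd/0:3]",
]

for event in find_soft_lockups(sample):
    print(event)

# To scan a saved capture instead (file name taken from the attachment above):
# with open("self_reboot_watchdog_thresh_20.log", errors="replace") as log:
#     print(find_soft_lockups(log))
```

In both excerpts the stuck task is ksoftirqd/0 on CPU 0, so a summary like this makes it easy to check whether the other captures show the same CPU and task.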

Please stop filing new topics.

You have already filed 4 topics for the same issue. We can use the Ethernet one to track it.