Xavier NX reboot loop

Hi,
We are having trouble with Xavier NX devices getting stuck in a reboot loop. So far all devices have recovered after multiple reboots, but that can take up to 30 minutes. From the UART output we found that the reboot is caused by a kernel panic. The log of this crash is included below:

[    3.854495] systemd-journald[2162]: File /var/log/journal/5a8660aba01c4ab3ac72ad16008d18ed/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
[    3.864407] using random self ethernet address
[    3.864549] using random host ethernet address
[    3.959066] wm8960: no symbol version for module_layout
[    3.959230] wm8960: loading out-of-tree module taints kernel.
[    4.254661] mwifiex_pcie: try set_consistent_dma_mask(32)
[    4.255022] mwifiex_pcie: PCI memory map Virt0: ffffff8012500000 PCI memory map Virt2: ffffff8013e00000
[    4.543625] random: crng init done
[    4.543759] random: 7 urandom warning(s) missed due to ratelimiting
[    8.507257] podgov: can't create debugfs directory
[    8.507425] Kernel panic - not syncing: nvhost_scale_emc_debug_init
[    8.507565] CPU: 5 PID: 4298 Comm: gst-plugin-scan Tainted: G           O    4.9.140-tegra #1
[    8.507732] Hardware name: NVIDIA Jetson Xavier NX Developer Kit (DT)
[    8.507859] Call trace:
[    8.507948] [<ffffff800808bdb8>] dump_backtrace+0x0/0x198
[    8.508062] [<ffffff800808c37c>] show_stack+0x24/0x30
[    8.508171] [<ffffff800845c7a0>] dump_stack+0x98/0xc0
[    8.508280] [<ffffff80081c1438>] panic+0x11c/0x298
[    8.508389] [<ffffff8008cbdc40>] nvhost_scale_emc_debug_init.isra.12+0x128/0x1a0
[    8.508537] [<ffffff8008cbdfec>] nvhost_pod_event_handler+0x334/0x400
[    8.508663] [<ffffff8008cbaf14>] devfreq_add_device+0x284/0x408
[    8.508781] [<ffffff8008cbb0fc>] devm_devfreq_add_device+0x64/0xc0
[    8.509324] [<ffffff8000fdbec8>] gk20a_scale_init+0xf0/0x190 [nvgpu]
[    8.509792] [<ffffff8000fd50e8>] gk20a_pm_finalize_poweron+0x370/0x400 [nvgpu]
[    8.510296] [<ffffff8000fd5330>] gk20a_busy+0x1b8/0x4f0 [nvgpu]
[    8.510974] [<ffffff800878c91c>] pm_generic_runtime_resume+0x3c/0x58
[    8.517707] [<ffffff800878ec64>] __rpm_callback+0x74/0xa0
[    8.523046] [<ffffff800878ecc4>] rpm_callback+0x34/0x98
[    8.528136] [<ffffff8008790160>] rpm_resume+0x470/0x710
[    8.533204] [<ffffff800879044c>] __pm_runtime_resume+0x4c/0x70
[    8.539058] [<ffffff8000fd524c>] gk20a_busy+0xd4/0x4f0 [nvgpu]
[    8.545095] [<ffffff8000fb6f74>] gk20a_ctrl_dev_open+0x8c/0x168 [nvgpu]
[    8.551508] [<ffffff8008262314>] chrdev_open+0x94/0x198
[    8.557006] [<ffffff8008258de0>] do_dentry_open+0x1b8/0x318
[    8.562692] [<ffffff800825a388>] vfs_open+0x58/0x88
[    8.567376] IPv6: ADDRCONF(NETDEV_UP): mlan0: link is not ready
[    8.567758] IPv6: ADDRCONF(NETDEV_UP): mlan0: link is not ready
[    8.571989] IPv6: ADDRCONF(NETDEV_UP): mlan1: link is not ready
[    8.572183] IPv6: ADDRCONF(NETDEV_UP): mlan1: link is not ready
[    8.591219] [<ffffff800826d644>] do_last+0x454/0xe60
[    8.596205] [<ffffff800826e0e0>] path_openat+0x90/0x378
[    8.601454] [<ffffff800826f650>] do_filp_open+0x70/0xe8
[    8.606528] [<ffffff800825a84c>] do_sys_open+0x174/0x258
[    8.611866] [<ffffff800825a9b4>] SyS_openat+0x3c/0x50
[    8.617116] [<ffffff8008083900>] el0_svc_naked+0x34/0x38
[    8.622201] SMP: stopping secondary CPUs
[    8.626391] Kernel Offset: disabled
[    8.629621] Memory Limit: none
[    8.633030] trusty-log panic notifier - trusty version Built: 12:18:19 Oct 16 2020
[    8.648388] Rebooting in 5 seconds..

Our initial analysis points towards a kernel panic when the GPU code is trying to set the EMC clock speed. We are running L4T 32.4.4. We would like some help finding the root cause and a solution to this problem.
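When comparing UART captures from several affected units, it can help to reduce each capture to the panic message and the ordered list of backtrace frames. Below is a minimal sketch of that, not an NVIDIA tool. The function name `summarize_panic` and the regular expressions are assumptions based only on the 4.9.140-tegra log format shown above. Unrelated lines that land in the middle of the trace (such as the `IPv6: ADDRCONF` messages) are skipped.

```python
import re

# Timestamped kernel log line, e.g. "[    8.507425] message"
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")
# Backtrace frame as printed above: "[<addr>] symbol+0xoff/0xsize [module]"
FRAME_RE = re.compile(
    r"^\[<([0-9a-f]+)>\]\s+(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?$"
)
PANIC_RE = re.compile(r"^Kernel panic - not syncing: (.*)$")


def summarize_panic(log_text):
    """Return (panic_message, [(symbol, module), ...]) for one UART capture.

    Hypothetical helper: frames without a "[module]" suffix are labeled
    "kernel". Lines that are not backtrace frames are ignored, so messages
    interleaved into the trace do not break the parse.
    """
    panic = None
    frames = []
    in_trace = False
    for raw in log_text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue
        msg = m.group(2).strip()
        p = PANIC_RE.match(msg)
        if p:
            panic = p.group(1)
            continue
        if msg == "Call trace:":
            in_trace = True
            continue
        if in_trace:
            f = FRAME_RE.match(msg)
            if f:
                frames.append((f.group(2), f.group(3) or "kernel"))
    return panic, frames


# A few lines copied from the capture above.
SAMPLE = """
[    8.507257] podgov: can't create debugfs directory
[    8.507425] Kernel panic - not syncing: nvhost_scale_emc_debug_init
[    8.507859] Call trace:
[    8.508280] [<ffffff80081c1438>] panic+0x11c/0x298
[    8.508389] [<ffffff8008cbdc40>] nvhost_scale_emc_debug_init.isra.12+0x128/0x1a0
[    8.509324] [<ffffff8000fdbec8>] gk20a_scale_init+0xf0/0x190 [nvgpu]
[    8.567376] IPv6: ADDRCONF(NETDEV_UP): mlan0: link is not ready
[    8.591219] [<ffffff800826d644>] do_last+0x454/0xe60
"""

panic, frames = summarize_panic(SAMPLE)
print("panic:", panic)
for symbol, module in frames:
    print(f"  {symbol} ({module})")
```

Running this over each capture makes it easy to check whether every affected unit panics in `nvhost_scale_emc_debug_init` right after the `podgov: can't create debugfs directory` message, or whether the traces differ between units.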

Hi pepijn.vanheiningen,

Are you using the devkit or custom board for Xavier NX?

Does your board hit the kernel panic before any modification?

Have you also tried with the latest R32.7.4 release?

We are using a custom board. So far we’ve been able to see that the issue follows the Tegra module. If we move the Tegra module from an affected system to a healthy system, the problem moves to the healthy system. Conversely, if we move the Tegra module from a healthy system to the affected system, the problem disappears.

Today we will test what happens if we use an R35.3.1-based image. Moving to R32.7.4 would be quite some work for us, so hopefully the R35.3.1 result will give you enough information.

We currently have a fallout of around 40% due to this problem, so it is of critical priority for us.

Hi CPeppenster,

Are you working with pepijn.vanheiningen?
Please also confirm whether you can reproduce the issue on the devkit.
(i.e. move the Xavier NX module to the devkit board to check)

For R35, we would suggest verifying with the latest R35.5.0.
