Hi,
We are having trouble with Xavier NX devices getting stuck in a reboot loop. So far every device has recovered after multiple reboots, but that can take up to 30 minutes. Over UART we found that the reboots are caused by a kernel panic; the log of the crash is included below:
[ 3.854495] systemd-journald[2162]: File /var/log/journal/5a8660aba01c4ab3ac72ad16008d18ed/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
[ 3.864407] using random self ethernet address
[ 3.864549] using random host ethernet address
[ 3.959066] wm8960: no symbol version for module_layout
[ 3.959230] wm8960: loading out-of-tree module taints kernel.
[ 4.254661] mwifiex_pcie: try set_consistent_dma_mask(32)
[ 4.255022] mwifiex_pcie: PCI memory map Virt0: ffffff8012500000 PCI memory map Virt2: ffffff8013e00000
[ 4.543625] random: crng init done
[ 4.543759] random: 7 urandom warning(s) missed due to ratelimiting
[ 8.507257] podgov: can't create debugfs directory
[ 8.507425] Kernel panic - not syncing: nvhost_scale_emc_debug_init
[ 8.507565] CPU: 5 PID: 4298 Comm: gst-plugin-scan Tainted: G O 4.9.140-tegra #1
[ 8.507732] Hardware name: NVIDIA Jetson Xavier NX Developer Kit (DT)
[ 8.507859] Call trace:
[ 8.507948] [<ffffff800808bdb8>] dump_backtrace+0x0/0x198
[ 8.508062] [<ffffff800808c37c>] show_stack+0x24/0x30
[ 8.508171] [<ffffff800845c7a0>] dump_stack+0x98/0xc0
[ 8.508280] [<ffffff80081c1438>] panic+0x11c/0x298
[ 8.508389] [<ffffff8008cbdc40>] nvhost_scale_emc_debug_init.isra.12+0x128/0x1a0
[ 8.508537] [<ffffff8008cbdfec>] nvhost_pod_event_handler+0x334/0x400
[ 8.508663] [<ffffff8008cbaf14>] devfreq_add_device+0x284/0x408
[ 8.508781] [<ffffff8008cbb0fc>] devm_devfreq_add_device+0x64/0xc0
[ 8.509324] [<ffffff8000fdbec8>] gk20a_scale_init+0xf0/0x190 [nvgpu]
[ 8.509792] [<ffffff8000fd50e8>] gk20a_pm_finalize_poweron+0x370/0x400 [nvgpu]
[ 8.510296] [<ffffff8000fd5330>] gk20a_busy+0x1b8/0x4f0 [nvgpu]
[ 8.510974] [<ffffff800878c91c>] pm_generic_runtime_resume+0x3c/0x58
[ 8.517707] [<ffffff800878ec64>] __rpm_callback+0x74/0xa0
[ 8.523046] [<ffffff800878ecc4>] rpm_callback+0x34/0x98
[ 8.528136] [<ffffff8008790160>] rpm_resume+0x470/0x710
[ 8.533204] [<ffffff800879044c>] __pm_runtime_resume+0x4c/0x70
[ 8.539058] [<ffffff8000fd524c>] gk20a_busy+0xd4/0x4f0 [nvgpu]
[ 8.545095] [<ffffff8000fb6f74>] gk20a_ctrl_dev_open+0x8c/0x168 [nvgpu]
[ 8.551508] [<ffffff8008262314>] chrdev_open+0x94/0x198
[ 8.557006] [<ffffff8008258de0>] do_dentry_open+0x1b8/0x318
[ 8.562692] [<ffffff800825a388>] vfs_open+0x58/0x88
[ 8.567376] IPv6: ADDRCONF(NETDEV_UP): mlan0: link is not ready
[ 8.567758] IPv6: ADDRCONF(NETDEV_UP): mlan0: link is not ready
[ 8.571989] IPv6: ADDRCONF(NETDEV_UP): mlan1: link is not ready
[ 8.572183] IPv6: ADDRCONF(NETDEV_UP): mlan1: link is not ready
[ 8.591219] [<ffffff800826d644>] do_last+0x454/0xe60
[ 8.596205] [<ffffff800826e0e0>] path_openat+0x90/0x378
[ 8.601454] [<ffffff800826f650>] do_filp_open+0x70/0xe8
[ 8.606528] [<ffffff800825a84c>] do_sys_open+0x174/0x258
[ 8.611866] [<ffffff800825a9b4>] SyS_openat+0x3c/0x50
[ 8.617116] [<ffffff8008083900>] el0_svc_naked+0x34/0x38
[ 8.622201] SMP: stopping secondary CPUs
[ 8.626391] Kernel Offset: disabled
[ 8.629621] Memory Limit: none
[ 8.633030] trusty-log panic notifier - trusty version Built: 12:18:19 Oct 16 2020
[ 8.648388] Rebooting in 5 seconds..
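Our initial analysis points to a kernel panic in the GPU (nvgpu) frequency-scaling code that handles the EMC clock: the panic is raised in nvhost_scale_emc_debug_init, immediately after the message "podgov: can't create debugfs directory". We are running L4T 32.4.4. We would appreciate help finding the root cause and a fix for this problem.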
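In case it helps with triage, here is a minimal sketch of how the panic reason and the call-trace symbols can be pulled out of a UART capture like the one above, skipping unrelated lines (such as the IPv6 messages) that get interleaved with the trace. The function name summarize_panic and the regular expressions are our own illustration, written against the log format shown in this post; they are not part of any NVIDIA tooling.

```python
import re

# Matches call-trace frames such as:
#   [ 8.509324] [<ffffff8000fdbec8>] gk20a_scale_init+0xf0/0x190 [nvgpu]
FRAME_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<mod>\w+)\])?"
)
PANIC_RE = re.compile(r"Kernel panic - not syncing: (?P<reason>.*)")


def summarize_panic(log_text):
    """Return (panic_reason, [(symbol, module_or_None), ...]) for a kernel log.

    Frames are only collected after the panic line has been seen, and any
    line that is not a call-trace frame is ignored.
    """
    reason = None
    frames = []
    for line in log_text.splitlines():
        if reason is None:
            match = PANIC_RE.search(line)
            if match:
                reason = match.group("reason").strip()
            continue
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("sym"), match.group("mod")))
    return reason, frames


if __name__ == "__main__":
    # A few lines copied from the log in this post.
    sample = "\n".join([
        "[ 8.507257] podgov: can't create debugfs directory",
        "[ 8.507425] Kernel panic - not syncing: nvhost_scale_emc_debug_init",
        "[ 8.508389] [<ffffff8008cbdc40>] nvhost_scale_emc_debug_init.isra.12+0x128/0x1a0",
        "[ 8.567376] IPv6: ADDRCONF(NETDEV_UP): mlan0: link is not ready",
        "[ 8.509324] [<ffffff8000fdbec8>] gk20a_scale_init+0xf0/0x190 [nvgpu]",
    ])
    reason, frames = summarize_panic(sample)
    print("panic:", reason)
    for symbol, module in frames:
        print(f"  {symbol} ({module or 'built-in'})")
```

Running this over captures from several affected devices should make it easy to check whether they all fail at the same frame.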