TX2 kernel reboots abnormally

Hi,
I am seeing a very strange issue: the TX2 sometimes reboots abnormally.
When this happens, the kernel prints the following:

nvidia-desktop login: [ 28.189851] kernel BUG at ./include/linux/pagemap.h:147!
[ 28.195160] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 28.200637] Modules linked in: bnep fuse option usb_wwan can_gw can_bcm can_raw can mttcan can_dev qmi_wwan_q cdc_wdm zram overlay cdc_acm bcmdhd ftdi_sio cfg80211 spidev binfmt_misc userspace_alert nvgpu bluedroid_pm ip_tables x_tables
[ 28.221820] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.9.201-tegra #6
[ 28.228508] Hardware name: quill (DT)
[ 28.232164] task: ffffffc1ecb31c00 task.stack: ffffffc1ecb54000
[ 28.238077] PC is at find_get_entries+0x21c/0x268
[ 28.242773] LR is at find_get_entries+0x1e8/0x268
[ 28.247466] pc : [] lr : [] pstate: 00400045
[ 28.254845] sp : ffffffc1ecb57150
[ 28.258153] x29: ffffffc1ecb57150 x28: 0000000000000400
[ 28.263473] x27: fffffffffffffe00 x26: ffffffc1e1580ef0
[ 28.268793] x25: 0000000000000001 x24: 0000000000000000
[ 28.274112] x23: ffffffc1e1580f00 x22: ffffffc1ecb57260
[ 28.279432] x21: ffffffc1ecb572e0 x20: ffffffc1e1580ef8
[ 28.284752] x19: 000000000000000e x18: 0000007f58b839f5
[ 28.290072] x17: 0000000000000001 x16: 0000000000000000
[ 28.295392] x15: 000000000000029d x14: 00106594508a21fe
[ 28.300712] x13: 0000000000000040 x12: 0000000000000228
[ 28.306029] x11: 0000000000000000 x10: 0000000000000000
[ 28.311349] x9 : 0000000000000000 x8 : 0000000000000000
[ 28.316667] x7 : 0000000000000025 x6 : ffffffbf077f4e00
[ 28.321988] x5 : ffffffbf077f4e00 x4 : ffffffc1ecb57260
[ 28.327306] x3 : ffffffbf077f4e00 x2 : 0000000000000001
[ 28.332624] x1 : 0000000000000100 x0 : ffffffbf077f0e60
[ 28.337943]
[ 28.339431] Process ksoftirqd/0 (pid: 3, stack limit = 0xffffffc1ecb54000)
[ 28.346290] Call trace:
[ 28.348735] [] find_get_entries+0x21c/0x268
[ 28.354472] [] pagevec_lookup_entries+0x48/0x68
[ 28.360553] [] invalidate_mapping_pages+0x80/0x208
[ 28.366895] [] inode_lru_isolate+0x208/0x260
[ 28.372716] [] __list_lru_walk_one.isra.2+0x94/0x190
[ 28.379230] [] list_lru_walk_one+0x58/0x70
[ 28.384876] [] prune_icache_sb+0x48/0x68
[ 28.390351] [] super_cache_scan+0x104/0x178
[ 28.396085] [] shrink_slab.part.17+0x21c/0x4a8
[ 28.402080] [] shrink_slab+0x78/0x90
[ 28.407208] [] shrink_node+0x11c/0x2f8
[ 28.412508] [] do_try_to_free_pages+0xc8/0x330
[ 28.418501] [] try_to_free_pages+0xdc/0x268
[ 28.424234] [] __alloc_pages_nodemask+0x524/0xcc0
[ 28.430488] [] allocate_slab+0xa8/0x4e8
[ 28.435873] [] new_slab+0x48/0x88
[ 28.440739] [] ___slab_alloc.constprop.34+0x2bc/0x4a0
[ 28.447338] [] __slab_alloc.isra.27.constprop.33+0x48/0x60
[ 28.454370] [] kmem_cache_alloc_trace+0x290/0x2c8
[ 28.460635] [] add_msg_controller_list+0x44/0x1a8 [mttcan]
[ 28.467676] [] ttcan_read_rx_fifo0+0x9c/0x1a8 [mttcan]
[ 28.474370] [] 0xffffff800112ab54
[ 28.479240] [] net_rx_action+0xf4/0x358
[ 28.484629] [] __do_softirq+0x13c/0x3b0
[ 28.490018] [w:20, r:19]
[ 28.490018] [] run_ksoftirqd+0x48/0x58
[ 28.497844] [] smpboot_thread_fn+0x160/0x248
[ 28.503663] [] kthread+0xec/0xf0
[ 28.508443] [] ret_from_fork+0x10/0x30
[ 28.513748] ---[ end trace e5eadf1af9e15672 ]---
[ 28.524004] Kernel panic - not syncing: Fatal exception in interrupt
[ 28.530346] SMP: stopping secondary CPUs
[ 28.534267] Kernel Offset: disabled
[ 28.537745] Memory Limit: none
[ 28.540792] trusty-log panic notifier - trusty version Built: 08:40:58 Feb 19 2021
[ 28.553859] Rebooting in 5 seconds…

We use the 32.5.1 BSP and the JetPack 4.5.1 filesystem.
Does anyone know this issue, and can you help me?
Thank you very much!

Is this a custom board or devkit?

Is there any more UART log from before the kernel panic?

Hi WayneWWW,

This is a custom board. The UART log is attached below:
cu.SLAB_USBtoUART_202205101009.log (97.9 KB)

The attachment contains the logs of three normal boots after power-up and one abnormal reboot after power-up. Please check!

B.R.

First, learn to dump the full log: remove "quiet" from /boot/extlinux/extlinux.conf so that the full kernel log is printed.
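As a rough illustration of that edit, assuming the board still uses an APPEND line like the one commonly found in the stock L4T extlinux.conf (the root device and console arguments here are assumptions and may differ on a custom board):

```
# /boot/extlinux/extlinux.conf (illustrative excerpt, other lines unchanged)
# before:
#   APPEND ${cbootargs} quiet root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4 console=ttyS0,115200n8 console=tty0
# after ("quiet" removed so kernel messages reach the console):
APPEND ${cbootargs} root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4 console=ttyS0,115200n8 console=tty0
```

Only the "quiet" token should be dropped; keep a backup of the file before editing it.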

Also, why do the first few boots show no errors? Did you trigger them yourself, or did they happen on their own?

Hi WayneWWW,

The first and second logs are normal boots after a power-up that I triggered myself, so they show no errors.
The third log, which contains the crash information, is from the abnormal reboot that happened while the system was running.

B.R.

Ok, then could you dump the full log?

Hi WayneWWW,

Do you mean the full log from an abnormal reboot of the TX2, or is the full log of a normal boot also acceptable?
The abnormal reboot happens very rarely, so we will need some time to reproduce it.

B.R.

Haven’t you noticed that your kernel log is not very long?

Yes, the log is captured from the UART, so it’s very slow.

Also, we took the full log of one normal boot from kern.log:
normal_kern.log (107.8 KB)

We will remove the quiet option and capture the full log of the abnormal reboot from the UART if we manage to reproduce the problem.

Also, try to remove all peripherals and see if you can still reproduce the issue.

If you cannot, add them back one by one and see which one will cause the issue.

Hi WayneWWW,

We have not been able to reproduce the problem; it is rather difficult!
But we have found the root cause: your mttcan driver has a bug!
The call chain in the mttcan interrupt handling is:
mttcan_poll_ir
|--> ttcan_read_rx_fifo0
     |--> add_msg_controller_list
          |--> kzalloc()

kzalloc() is called with GFP_KERNEL here, but GFP_KERNEL must not be used in interrupt context! In some cases the allocation will try to sleep inside the interrupt path, which is very dangerous!
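As a minimal illustration of the allocation pattern at issue (not the actual mttcan source, which is not quoted in this thread), the sketch below allocates with GFP_ATOMIC and handles failure by dropping the frame. The node type, the function name add_msg_sketch, and the simulate_alloc_failure hook are hypothetical, and kzalloc(), gfp_t and GFP_ATOMIC are userspace stand-ins so the snippet builds outside the kernel tree; in a real driver they come from <linux/slab.h>.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-ins so this sketch builds outside the kernel tree.
 * In a real driver these come from <linux/slab.h>. */
typedef unsigned int gfp_t;
#define GFP_ATOMIC 0x1u            /* placeholder value, not the kernel's */

static int simulate_alloc_failure; /* hypothetical hook to model memory pressure */

static void *kzalloc(size_t size, gfp_t flags)
{
    (void)flags;                   /* the stand-in ignores the flags */
    return simulate_alloc_failure ? NULL : calloc(1, size);
}

/* Hypothetical list node; the real mttcan list entry is not shown here. */
struct rx_msg_node {
    struct rx_msg_node *next;
    unsigned int can_id;
};

/* Queue one received frame. This models code reached from the RX poll
 * path (softirq context), so the allocation must not sleep: request
 * GFP_ATOMIC and be prepared for the allocation to fail. */
static int add_msg_sketch(struct rx_msg_node **head, unsigned int can_id)
{
    struct rx_msg_node *node = kzalloc(sizeof(*node), GFP_ATOMIC);

    if (!node)
        return -ENOMEM;            /* drop the frame instead of blocking */

    node->can_id = can_id;
    node->next = *head;            /* push onto the front of the list */
    *head = node;
    return 0;
}
```

The trade-off is that a GFP_ATOMIC allocation can fail under memory pressure, so the caller has to tolerate a dropped frame rather than wait for memory to be reclaimed.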
Please check!
B.R.

I will forward this issue to our internal team to investigate. Thanks for bringing it up.

Hi yxz1295324,
May I know which release branch you are using? It should use GFP_ATOMIC.

We have already updated this to GFP_ATOMIC, so you must be using an old kernel. The new JetPack release uses GFP_ATOMIC. Please let us know whether upgrading to the new release solves your issue.

Hi,
We use BSP 32.5.1 and JetPack 4.5.1.
B.R.

Then please try upgrading to a newer kernel.

Hi,
From which version does the kernel use GFP_ATOMIC here?
B.R.
