Kernel panic after stability test

Jetpack: 5.0.2GA
Hardware A: devkit(module + P3737 carrier board)
Hardware B: module from devkit + our carrier board

I ran a system stability test on Hardware B. On the 7th day of the test, I found that the system was unresponsive, the fan had stopped, and the Orin was very hot. After a power cycle the Orin no longer booted normally, and the serial console output got stuck.
To narrow down the problem, I tested on Hardware A and used sdkmanager to reflash the Orin with the official 5.0.2 GA release. The problem persists: the board still fails to boot after a power cycle, and repeated power cycles produce roughly the following three kinds of startup logs.
Please help confirm how to further locate the cause of the problem, thank you!

boot1.log (60.2 KB)
boot2.log (62.4 KB)
boot3.log (28.9 KB)

  1. boot1.log: Kernel panic

[ 24.156230] Please wait for auto system configuration setup to complete…
[ 34.257024] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 34.264933] CPU: 0 PID: 1217 Comm: systemd Tainted: G O 5.10.104-tegra #1
[ 34.273263] Hardware name: /, BIOS 1.0-d7fb19b 08/10/2022
[ 34.278904] Call trace:
[ 34.281431] dump_backtrace+0x0/0x1d0
[ 34.285198] show_stack+0x30/0x40
[ 34.288620] dump_stack+0xd8/0x138
[ 34.292120] panic+0x17c/0x384
[ 34.295270] do_exit+0xaa8/0xab0
[ 34.298596] do_group_exit+0x4c/0xb0
[ 34.302277] get_signal+0x104/0x830
[ 34.305868] do_notify_resume+0x248/0xa00
[ 34.309992] work_pending+0xc/0x384
[ 34.313593] SMP: stopping secondary CPUs
[ 34.317632] Kernel Offset: 0x54d4f41f0000 from 0xffff800010000000
[ 34.323901] PHYS_OFFSET: 0xffffc8ba80000000
[ 34.328206] CPU features: 0x0040006,4a80aa38
[ 34.332604] Memory Limit: none
[ 34.335761] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
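
A note on reading this panic: the exitcode printed by "Attempted to kill init!" uses the same layout as a wait status, so the low 7 bits carry the number of the signal that terminated init. The sketch below decodes it in Python. The function name decode_init_exitcode is made up for illustration and is not part of any tool in this thread. The result only says how init died, not why.

```python
import signal


def decode_init_exitcode(exitcode: int) -> dict:
    """Split a wait-style exit code into its fields."""
    sig = exitcode & 0x7F             # low 7 bits: terminating signal (0 = normal exit)
    core = bool(exitcode & 0x80)      # bit 7: core-dump flag
    status = (exitcode >> 8) & 0xFF   # bits 8-15: exit status when no signal is set
    name = signal.Signals(sig).name if sig else None
    return {
        "signal": sig,
        "signal_name": name,
        "core_dump": core,
        "exit_status": status,
    }


# Value taken from the boot1.log panic line above.
info = decode_init_exitcode(0x0000000B)
print(info)  # signal 11, which is SIGSEGV on Linux
```

So in boot1.log, systemd (PID 1) was killed by a segmentation fault shortly after the first-boot configuration message.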

  2. boot2.log: Kernel panic

[ 22.278601] Unable to handle kernel paging request at virtual address 00ff0009008f0030
[ 22.286798] Mem abort info:
[ 22.289679] ESR = 0x86000004
[ 22.292840] EC = 0x21: IABT (current EL), IL = 32 bits
[ 22.298329] SET = 0, FnV = 0
[ 22.301490] EA = 0, S1PTW = 0
[ 22.304743] [00ff0009008f0030] address between user and kernel address ranges
[ 22.312114] Internal error: Oops: 86000004 [#1] PREEMPT SMP
[ 22.317856] Modules linked in: nvidia_modeset(O) bnep snd_soc_tegra186_asrc snd_soc_tegra210_iqc snd_soc_tegra210_ope snd_soc_tegra186_dspk snd_soc_tegra186_arad snd_soc_tegra210_mvc snd_soc_tegra210_dmic snd_soc_tegra210_afc aes_ce_blk snd_soc_tegra210_adx crypto_simd snd_soc_tegra210_amx cryptd snd_soc_tegra210_admaif snd_soc_tegra210_i2s rtk_btusb snd_soc_tegra210_mixer snd_soc_tegra210_sfc snd_soc_tegra_pcm aes_ce_cipher btusb ghash_ce snd_soc_tegra210_adsp rtl8822ce btrtl sha2_ce sha256_arm64 btbcm loop snd_soc_tegra_machine_driver sha1_ce btintel snd_soc_tegra_utils ofpart snd_soc_spdif_tx snd_soc_simple_card_utils cmdlinepart snd_hda_codec_hdmi nvadsp qspi_mtd nct1008 mtd ucsi_ccg snd_soc_tegra210_ahub snd_hda_tegra typec_ucsi userspace_alert tegra_bpmp_thermal typec snd_hda_codec tegra210_adma cfg80211 snd_hda_core spi_tegra114 nvidia(O) ina3221 pwm_fan nvgpu nvmap ip_tables x_tables
[ 22.400275] CPU: 0 PID: 12 Comm: ksoftirqd/0 Tainted: G O 5.10.104-tegra #1
[ 22.408786] Hardware name: /, BIOS 1.0-d7fb19b 08/10/2022
[ 22.414433] pstate: 60c00009 (nZCv daif +PAN +UAO -TCO BTYPE=--)
[ 22.420618] pc : 0xff0009008f0030
[ 22.424039] lr : rcu_core+0x274/0x980
[ 22.427812] sp : ffff8000100ebc90
[ 22.431228] x29: ffff8000100ebc90 x28: ffffc8093afe7000
[ 22.436708] x27: ffff54dfc0160e80 x26: ffff54dfc0160e80
[ 22.442186] x25: ffff8000100ebd20 x24: ffff54e6ec4bcab0
[ 22.447659] x23: ffffc8093b36ae40 x22: ffff54dfc0160e80
[ 22.453131] x21: ffffc8093b2d8998 x20: ffffc8093afe7000
[ 22.458601] x19: ffff54e6ec4bca40 x18: 0000000000000000
[ 22.464072] x17: 0000000000000000 x16: ffffc809396c1990
[ 22.469551] x15: 00000000000004a1 x14: 0000000000000499
[ 22.475022] x13: 0000000000000001 x12: 0000000000000500
[ 22.480497] x11: 0000000000000000 x10: 0000000000000a80
[ 22.485968] x9 : ffff8000100ebd40 x8 : ffff54dfc0160e80
[ 22.491449] x7 : ffff000000000000 x6 : ffffffffc3ffffff
[ 22.496921] x5 : 0000000000000001 x4 : ffffff537fde7a20
[ 22.502390] x3 : 0000000080490025 x2 : ffff54e000801608
[ 22.507868] x1 : 00ff0009008f0030 x0 : ffff54e000801608
[ 22.513350] Call trace:
[ 22.515870] 0xff0009008f0030
[ 22.518935] rcu_core_si+0x18/0x20
[ 22.522440] __do_softirq+0x140/0x3e8
[ 22.526216] run_ksoftirqd+0x50/0x60
[ 22.529905] smpboot_thread_fn+0x1c4/0x280
[ 22.534117] kthread+0x148/0x170
[ 22.537448] ret_from_fork+0x10/0x24
[ 22.541141] Code: bad PC value
[ 22.544300] ---[ end trace b9933b027d5c66bd ]---
[ 22.549055] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[ 22.556142] SMP: stopping secondary CPUs
[ 22.560340] Kernel Offset: 0x4809295d0000 from 0xffff800010000000
[ 22.566623] PHYS_OFFSET: 0xffffab2140000000
[ 22.570926] CPU features: 0x0040006,4a80aa38
[ 22.575327] Memory Limit: none
[ 22.578466] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
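
To cross-check the "Mem abort info" lines in this log, here is a rough Python sketch that unpacks the ESR value and classifies the faulting address. The bit positions follow the Arm ESR_ELx layout for instruction aborts. The helper names (decode_esr, classify_address) are hypothetical, and the address check assumes a 48-bit virtual address space, which the log does not state explicitly.

```python
EC_NAMES = {
    0x20: "IABT (lower EL)",
    0x21: "IABT (current EL)",
    0x24: "DABT (lower EL)",
    0x25: "DABT (current EL)",
}

# Only the translation-fault codes are listed here; other codes exist.
FSC_NAMES = {
    0x04: "translation fault, level 0",
    0x05: "translation fault, level 1",
    0x06: "translation fault, level 2",
    0x07: "translation fault, level 3",
}


def decode_esr(esr: int) -> dict:
    """Unpack the ESR fields that the kernel prints for an abort."""
    ec = (esr >> 26) & 0x3F
    fsc = esr & 0x3F
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "other"),
        "il": (esr >> 25) & 0x1,      # 1 means a 32-bit instruction
        "set": (esr >> 11) & 0x3,
        "fnv": (esr >> 10) & 0x1,
        "ea": (esr >> 9) & 0x1,
        "s1ptw": (esr >> 7) & 0x1,
        "fsc": fsc,
        "fsc_name": FSC_NAMES.get(fsc, "other"),
    }


def classify_address(addr: int, va_bits: int = 48) -> str:
    """Classify a 64-bit virtual address (va_bits=48 is an assumption)."""
    top = addr >> va_bits
    if top == 0:
        return "user range"
    if top == (1 << (64 - va_bits)) - 1:
        return "kernel range"
    return "between user and kernel ranges"


esr_fields = decode_esr(0x86000004)            # ESR from boot2.log
pc_class = classify_address(0x00FF0009008F0030)  # faulting pc from boot2.log
print(esr_fields)
print(pc_class)
```

The decoded fields agree with the log: an instruction abort at the current exception level, taken while rcu_core was calling through a pointer (x1 and pc hold the same value) that is not a valid kernel address.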

  3. boot3.log: Stuck in UEFI

** WARNING: Test Key is used. **

L4TLauncher: Attempting GRUB Boot
L4TLauncher: Attempting Direct Boot
EFI stub: Booting Linux Kernel…
EFI stub: Using DTB from configuration table
EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path
EFI stub: Exiting boot services and installing virtual address map…

Try reflashing the module first and see whether it flashes successfully.

Hi Wayne,
The module can be flashed; I tried twice. The logs above were captured after reflashing the module.

Does the flash process itself still complete without errors?

The sdkmanager returned “SUMMARY: Flash Jetson AGX Orin - flash: Install completed successfully.”

sdkm_download.log (3.7 KB)
sdkm-2022-11-02-11-37-07.log (605.0 KB)
reflash_log_serial.log (26.9 KB)

Hi,

Is this still on the NV devkit, or did you put the module on your custom board to boot?

On the NV devkit.

Are there any peripherals connected to the board?

Please test with only the power cable and the micro USB cable (for the UART log) connected.

The J40 USB Type-C cable, DP cable, power cable, and micro USB cable are connected.

The UART log below was captured with only the power cable and the micro USB cable (for the UART log) connected. After being powered on, the module rebooted by itself 3 times.
picocom_20221102-133702.log (329.4 KB)

Hi,

Could you delete the rootfs installed by your JetPack and let it download again?

Judging from this error, I am not sure whether the problem comes from your rootfs or whether something is wrong with the eMMC.

[ 30.334764] EXT4-fs error (device mmcblk0p1): ext4_lookup:1706: inode #2760216: comm oem-config: iget: checksum invalid
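
With several UART captures showing different symptoms, it can help to tally the failure signatures per log before comparing boots. The following is only a sketch: the pattern list is assembled from messages quoted in this thread, the name scan_log is hypothetical, and the sample text stands in for a real log file.

```python
import re

# Signatures taken from the log excerpts quoted in this thread.
SIGNATURES = {
    "kernel_panic": re.compile(r"Kernel panic - not syncing"),
    "oops": re.compile(r"Internal error: Oops"),
    "paging_fault": re.compile(r"Unable to handle kernel paging request"),
    "ext4_error": re.compile(r"EXT4-fs error"),
}


def scan_log(text: str) -> dict:
    """Count how many lines of a log match each failure signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Stand-in for the contents of a captured UART log.
sample = "\n".join([
    "[ 30.334764] EXT4-fs error (device mmcblk0p1): ext4_lookup:1706: "
    "inode #2760216: comm oem-config: iget: checksum invalid",
    "[ 34.257024] Kernel panic - not syncing: Attempted to kill init! "
    "exitcode=0x0000000b",
    "[ 34.313593] SMP: stopping secondary CPUs",
])

result = scan_log(sample)
print(result)
```

For a real capture, the text could be read with open(path, errors="replace").read(), since serial logs sometimes contain bytes that are not valid UTF-8.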

I deleted target_hw_image/JetPack_5.0.2_Linux_JETSON_AGX_ORIN_TARGETS/Linux_for_Tegra/rootfs and then reflashed the module.

picocom_1.log (69.0 KB)

Hi,

Do you have another Orin module to validate the software BSP? Your earlier description of how the board failed sounds like hardware damage.

The Orin module I tested is an a00 module. I have several a04 modules. Could an a04 module be used to validate the software BSP?

Hi,

Actually, I would suggest you just test with an a04 module. If you plan to go to production, your product will use a04 modules as well.

The a04 module works fine after I used sdkmanager to reflash it (on the P3737 carrier board) with the same software BSP.
What should I do with the a00 module? Thanks!
