Jetson TX2 kernel crashes after running for a while

Hi everyone. Recently, some of our TX2 boards stopped working after running for a while. The first time we saw this problem, we could not get any information through the debug UART, the network, or the HDMI interface, so we had no useful log to debug with. We then checked the logs under the directory “/var/log”, but we still did not find any errors.

This problem has been bothering us for a while. I would really appreciate any suggestions on how to solve it.

One of the syslog files: syslog.1 (142.8 KB)
The file “syslog.1” can be opened directly as a plain text file.

Yesterday we captured a log after the kernel on one TX2 board crashed. I hope it helps with your analysis.
The log is as follows:
[0000.050] C> I2C command failed
[0000.053] C> block index = (4) and rail_id = (1)
[0000.058] C> Addr: Reg = [0xe8:0x07]: 336166925
[0000.062] C> I2C command failed
[0000.065] C> block index = (5) and rail_id = (1)
[0000.070] C> Addr: Reg = [0xe8:0x07]: 336166925
[0000.172] I> Welcome to MB2(TBoot-BPMP)(version: 01.00.160913-t186-M-00.00-mobile-1b8ce67c)
[0000.180] I> bit @ 0xd480000
[0000.183] I> Boot-device: eMMC
[0000.187] I> sdmmc bdev is already initialized
[0000.192] I> pmic: reset reason (nverc) : 0x0
[0000.198] I> Found 16 partitions in SDMMC_BOOT (instance 3)
[0000.205] I> Found 30 partitions in SDMMC_USER (instance 3)
[0000.211] I> A/B: bin_type (16) slot 0
[0000.214] I> Loading partition bpmp-fw at 0xd7800000
[0000.219] I> Reading two headers - addr:0xd7800000 blocks:1
[0000.225] I> Addr: 0xd7800000, start-block: 29417480, num_blocks: 1
[0000.240] I> Binary(16) of size 533504 is loaded @ 0xd7800000
[0000.245] I> A/B: bin_type (17) slot 0
[0000.249] I> Loading partition bpmp-fw-dtb at 0xd79f0000
[0000.254] I> Reading two headers - addr:0xd79f0000 blocks:1
[0000.260] I> Addr: 0xd79f0000, start-block: 29419896, num_blocks: 1
[0000.273] I> Binary(17) of size 314896 is loaded @ 0xd79b3000
[0000.393] I> Loading SCE-FW …
[0000.396] I> A/B: bin_type (12) slot 0
[0000.400] I> Loading partition sce-fw at 0xd7300000
[0000.405] I> Reading two headers - addr:0xd7300000 blocks:1
[0000.410] I> Addr: 0xd7300000, start-block: 29421896, num_blocks: 1
[0000.421] I> Binary(12) of size 125168 is loaded @ 0xd7300000
[0000.426] I> Init SCE
[0000.429] I> Loading APE-FW …
[0000.432] I> A/B: bin_type (11) slot 0
[0000.435] I> Loading partition adsp-fw at 0xd7400000
[0000.440] I> Reading two headers - addr:0xd7400000 blocks:1
[0000.446] I> Addr: 0xd7400000, start-block: 29401096, num_blocks: 1
[0000.456] I> Binary(11) of size 107808 is loaded @ 0xd7400000
[0000.462] I> Copy BTCM section
[0000.465] I> A/B: bin_type (13) slot 0
[0000.468] I> Loading partition cpu-bootloader at 0x96000000
[0000.474] I> Reading two headers - addr:0x96000000 blocks:1
[0000.479] I> Addr: 0x96000000, start-block: 29380616, num_blocks: 1
[0000.491] I> Binary(13) of size 277776 is loaded @ 0x96000000
[0000.497] I> A/B: bin_type (20) slot 0
[0000.501] I> Loading partition bootloader-dtb at 0x8520f400
[0000.506] I> Reading two headers - addr:0x8520f400 blocks:1
[0000.512] I> Addr: 0x8520f400, start-block: 29382664, num_blocks: 1
[0000.525] I> Binary(20) of size 240448 is loaded @ 0x8520f400
[0000.531] I> A/B: bin_type (14) slot 0
[0000.535] I> Loading partition secure-os at 0x8530f600
[0000.540] I> Reading two headers - addr:0x8530f600 blocks:1
[0000.545] I> Addr: 0x8530f600, start-block: 29384712, num_blocks: 1
[0000.558] I> Binary(14) of size 312752 is loaded @ 0x8530f600
[0000.565] I> TOS boot-params @ 0x85000000
[0000.569] I> TOS params prepared
[0000.572] I> Loading EKS …
[0000.575] I> A/B: bin_type (15) slot 0
[0000.579] I> Loading partition eks at 0x8590f800
[0000.583] I> Reading two headers - addr:0x8590f800 blocks:1
[0000.589] I> Addr: 0x8590f800, start-block: 29397000, num_blocks: 1
[0000.597] I> Binary(15) of size 1040 is loaded @ 0x8590f800
[0000.602] I> EKB detected (length: 0x400) @ 0x8590f800
[0000.607] I> Copied encrypted keys
[0000.611] I> boot profiler @ 0x175844000
[0000.615] I> boot profiler for TOS @ 0x175844000
[0000.620] I> Unhalting SCE
[0000.622] I> Primary Memory Start:80000000 Size:70000000
[0000.628] I> Extended Memory Start:f0110000 Size:856f0000
[0000.634] I> MB2(TBoot-BPMP) done

Unhandled Exception in EL3.
x30 = 0x0000000000000000
x0 = 0x0000000000000000
x1 = 0x0000000000000000
x2 = 0x0000000000000000
x3 = 0x0000000000000000
x4 = 0x0000000000000000
x5 = 0x0000000000000000
x6 = 0x0000000000000000
x7 = 0x0000000000000000
x8 = 0x0000000000000000
x9 = 0x0000000000000000
x10 = 0x0000000000000000
x11 = 0x0000000000000000
x12 = 0x0000000000000000
x13 = 0x0000000000000000
x14 = 0x0000000000000000
x15 = 0x0000000000000000
x16 = 0x0000000000000000
x17 = 0x0000000000000000
x18 = 0x0000000000000000
x19 = 0x0000000000000000
x20 = 0x0000000000000000
x21 = 0x0000000000000000
x22 = 0x0000000000000000
x23 = 0x0000000000000000
x24 = 0x0000000000000000
x25 = 0x0000000000000000
x26 = 0x0000000000000000
x27 = 0x0000000000000000
x28 = 0x0000000000000000
x29 = 0x0000000000000000
scr_el3 = 0x0000000000000000
sctlr_el3 = 0x0000000000000000
cptr_el3 = 0x0000000000000000
tcr_el3 = 0x0000000000000000
daif = 0x0000000000000000
mair_el3 = 0x0000000000000000
spsr_el3 = 0x0000000000000000
elr_el3 = 0x0000000000000000
ttbr0_el3 = 0x0000000000000000
esr_el3 = 0x0000000000000000
far_el3 = 0x0000000000000000
spsr_el1 = 0x0000000000000000
elr_el1 = 0x0000000000000000
spsr_abt = 0x0000000000000000
spsr_und = 0x0000000000000000
spsr_irq = 0x0000000000000000
spsr_fiq = 0x0000000000000000
sctlr_el1 = 0x0000000000000000
actlr_el1 = 0x0000000000000000
cpacr_el1 = 0x0000000000000000
csselr_el1 = 0x0000000000000000
sp_el1 = 0x0000000000000000
esr_el1 = 0x0000000000000000
ttbr0_el1 = 0x0000000000000000
ttbr1_el1 = 0x0000000000000000
mair_el1 = 0x0000000000000000
amair_el1 = 0x0000000000000000
tcr_el1 = 0x0000000000000000
tpidr_el1 = 0x0000000000000000
tpidr_el0 = 0x0000000000000000
tpidrro_el0 = 0x0000000000000000
dacr32_el2 = 0x0000000000000000
ifsr32_el2 = 0x0000000000000000
par_el1 = 0x0000000000000000
mpidr_el1 = 0x0000000000000000
afsr0_el1 = 0x0000000000000000
afsr1_el1 = 0x0000000000000000
contextidr_el1 = 0x0000000000000000
vbar_el1 = 0x0000000000000000
cntp_ctl_el0 = 0x0000000000000000
cntp_cval_el0 = 0x0000000000000000
cntv_ctl_el0 = 0x0000000000000000
cntv_cval_el0 = 0x0000000000000000
cntkctl_el1 = 0x0000000000000000
fpexc32_el2 = 0x0000000000000000
sp_el0 = 0x0000000000000000
isr_el1 = 0x0000000000000000
cpuectlr_el1 = 0x0000000000000000
cpumerrsr_el1 = 0x0000000000000000
l2merrsr_el1 = 0x0000000000000000
gicc_hppir = 0x0000000000000000
gicc_ahppir = 0x0000000000000000
gicc_ctlr = 0x0000000000000000
gicd_ispendr regs (Offsets 0x200 - 0x278)
Offset: value
0000000000000200: 0x0000000000000000
0000000000000204: 0x0000000000000000
0000000000000208: 0x0000000000000000
000000000000020c: 0x0000000000000000
0000000000000210: 0x0000000000000000
0000000000000214: 0x0000000000000000
0000000000000218: 0x0000000000000000
000000000000021c: 0x0000000000000000
0000000000000220: 0x0000000000000000
0000000000000224: 0x0000000000000000
0000000000000228: 0x0000000000000000
000000000000022c: 0x0000000000000000
0000000000000230: 0x0000000000000000
0000000000000234: 0x0000000000000000
0000000000000238: 0x0000000000000000
000000000000023c: 0x0000000000000000
0000000000000240: 0x0000000000000000
0000000000000244: 0x0000000000000000
0000000000000248: 0x0000000000000000
000000000000024c: 0x0000000000000000
0000000000000250: 0x0000000000000000
0000000000000254: 0x0000000000000000
0000000000000258: 0x0000000000000000
000000000000025c: 0x0000000000000000
0000000000000260: 0x0000000000000000
0000000000000264: 0x0000000000000000
0000000000000268: 0x0000000000000000
000000000000026c: 0x0000000000000000
0000000000000270: 0x0000000000000000
0000000000000274: 0x0000000000000000
0000000000000278: 0x0000000000000000
000000000000027c: 0x0000000000000000
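Every register in the exception dump above reads zero, which is easy to miss when scanning by eye. As a rough sketch (not an NVIDIA tool), the following Python parses a saved UART capture, extracts the `pmic: reset reason (nverc)` value printed by MB2, and lists any register whose value is not zero. The function name `summarize_dump` and the regular expressions are illustrative assumptions based only on the line formats seen in this log. What an all-zero dump actually means is not established here.

```python
import re

# Matches dump lines such as "x30 = 0x0000000000000000"
# or "0000000000000200: 0x0000000000000000" (formats taken from the log above).
REG_LINE = re.compile(r"^\s*([0-9A-Za-z_]+)\s*(?:=|:)\s*0x([0-9a-fA-F]+)\s*$")

# Matches the MB2 line "pmic: reset reason (nverc) : 0x0".
RESET_LINE = re.compile(r"pmic: reset reason \(nverc\) : (0x[0-9a-fA-F]+)")


def summarize_dump(log_text):
    """Return the reset reason, the register count, and any nonzero registers."""
    registers = {}
    reset_reason = None
    for line in log_text.splitlines():
        reset_match = RESET_LINE.search(line)
        if reset_match:
            reset_reason = int(reset_match.group(1), 16)
            continue
        reg_match = REG_LINE.match(line)
        if reg_match:
            registers[reg_match.group(1)] = int(reg_match.group(2), 16)
    nonzero = {name: value for name, value in registers.items() if value != 0}
    return {
        "reset_reason": reset_reason,
        "registers": len(registers),
        "nonzero": nonzero,
    }
```

Running this over several captures from different boards would make it easy to compare the reset reason and to spot a dump that does contain real register state.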

The log you shared does not even get as far as the kernel. Have you tried reflashing the board?

I don’t think we should care about the syslog for now; we should only check why it cannot enter the kernel.

First, thanks for your reply.
I haven’t reflashed yet. After the board ran for a while, the kernel crash happened again. We checked all the voltages that we thought should be checked (3.3 V, 5 V, 1.8 V), and they were OK. The board goes back to normal if we press the reset button.

We only captured that log by chance when the problem happened. To find the cause, we connected the debug UART and recorded all of its output. At the beginning everything seemed OK and the board could enter the kernel. But when the problem happened again, we found that the board had rebooted automatically and did not even enter the kernel, just as the log shows.

From the UART log, I could not find out why it rebooted automatically. So I rebooted the board manually and checked the syslog, but I could not find any useful information that indicated why it rebooted.

Are there any suggestions for finding the cause?
Thanks again!

Could you try to monitor the UART log when the error happens? Also, what is your board? A custom one?
Syslog cannot capture an abrupt reboot; you can only monitor it through the UART.
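Since the failure only shows up after a long run, it can help to record the UART output with host-side timestamps so the moment of the reboot can be matched against other events. The sketch below is one minimal way to do that with only the Python standard library. The function name `log_with_timestamps` is hypothetical, and the serial device is assumed to be opened and configured separately (for example with a terminal program or `stty`) to match the board's console settings.

```python
import io
import time


def log_with_timestamps(stream, sink, clock=time.time):
    """Copy lines from a byte stream to a text sink, prefixing host wall-clock time.

    stream: any object with readline() returning bytes (a serial device opened
            in binary mode, or io.BytesIO for testing).
    sink:   any text file-like object with write() and flush().
    Returns the number of lines copied.
    """
    count = 0
    # readline() returns b"" at end of stream, which stops the loop.
    for raw in iter(stream.readline, b""):
        text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        sink.write(f"{stamp} {text}\n")
        sink.flush()  # flush every line so nothing is lost if the host stops
        count += 1
    return count
```

A possible use on a Linux host, assuming the USB serial adapter appears as `/dev/ttyUSB0` (an assumption, check your own system): open the device with `open("/dev/ttyUSB0", "rb", buffering=0)`, open `uart.log` in append mode, and pass both to `log_with_timestamps`.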

Hello, the board is our own custom board. The log I shared before was from the UART. When the error happened, we only got that error log and found that the board rebooted automatically. After talking with my colleagues, I think the system reboot only happens when our main application stops working, because that triggers the watchdog to reboot the system.

BTW, even though we monitor the UART log when the error happens, we only got a UART log once. On some of our custom boards, we could not get anything from the UART.

Could you share a log that contains the watchdog messages?

Do you mean in the UART log or the syslog? I don’t even know where to find it. Could you please help me find the watchdog log?

The UART should dump it before the board reboots.

We clearly didn’t get any information about the watchdog from the previous UART log. We will keep monitoring the UART log and hope to get some new useful information.

One important thing I need to emphasize is that sometimes we cannot get any log at all when the error happens.

Besides, I still hope to find another way to identify what causes the kernel crash. Maybe it is caused by the watchdog.

Thanks.

If this issue is due to software, then there should be some log from the UART when the reboot happens.

If there is no log, then it is probably a hardware-side issue.

OK, thanks. I will keep you informed if we get any new logs or new ideas.

BTW, do you have any suggestions for checking the hardware?

Can any log be provided for further analysis? Or is it a HW issue?

I can’t debug this, but wanted to point out something from the original error: The error is in exception level 3 (EL3), which is a privileged secure mode. Have you experimented with security fuses or anything related to signing? Perhaps the error is data driven by something related to security.

Most of the time, we cannot get any log when the error happens.

Thanks. Actually, I don’t know what security fuses are, so I don’t know how to do that. But I was wondering why the watchdog didn’t work.

@kayccc @WayneWWW @linuxdev
I was wondering why the watchdog didn’t work. We still have not solved the problem.
Could you please tell me the difference between PMIC-WDT and Tegra-WDT?
We use the default values for the WDT.

Could you monitor your UART log until the error happens, and share the log from that moment with us?

The only UART log that we got is the one shared in my first post. That log was from the UART. After that, we did not get anything when the error happened.

The error log in the first post does not even boot into the kernel. Does your device enter the kernel after that? Or does it just crash?

Does “did not get anything” mean there is no cboot/U-Boot/kernel log at all from the UART? Or just that you don’t see any error in the UART log?
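To answer that kind of question from a saved capture, one option is to check which boot stages left any trace in the log. The sketch below is only an illustration: the MB2 and EL3 marker strings are copied from the log in the first post, while the U-Boot and kernel markers ("U-Boot" and "Linux version") are assumptions based on the usual banner text and may need adjusting for a given release. No cboot marker is included because none appears in this thread.

```python
# (stage name, substring expected in the UART log when that stage runs)
STAGE_MARKERS = [
    ("MB2 start", "Welcome to MB2"),                 # from the log in this thread
    ("MB2 done", "MB2(TBoot-BPMP) done"),            # from the log in this thread
    ("EL3 exception", "Unhandled Exception in EL3"), # from the log in this thread
    ("U-Boot", "U-Boot"),                            # assumed banner text
    ("kernel", "Linux version"),                     # assumed kernel banner text
]


def stages_seen(log_text):
    """Return the names of the boot stages whose marker appears in log_text."""
    return [name for name, marker in STAGE_MARKERS if marker in log_text]
```

An empty result means the capture contains none of these markers, which is different from a capture that shows a normal boot with no error messages.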