Hi
I want to run some perception models on a platform with kernel version 5.10.104-rt63-tegra, but the kernel crashes and restarts, and I can’t see any error printouts from the debug serial port. Is there any other way to debug this?
Does it happen on the non-RT stock kernel?
Does syslog give anything?
No, it runs normally on the non-RT stock kernel.
syslog gives nothing. It seems like the CPU freezes and triggers the WDT on the PMIC.
Do you mean some deep learning stuff when you say perception model?
Maybe it’s caused by some scheduling priorities in the RT kernel settings.
It’s hard to debug without any log, and I’d say if there is no strong reason to use the RT kernel, then just use the non-RT one.
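If scheduling priorities are the suspect, one thing that can be checked from user space is which policy and real-time priority each kernel IRQ thread is running with. The snippet below is only a rough sketch: the helper names `thread_sched_info` and `list_threads` are made up for illustration, it assumes a Linux system with /proc mounted, and it may need root to read other processes' parameters.

```python
import os

# Human-readable names for the scheduling policies Python exposes on Linux.
POLICY_NAMES = {
    os.SCHED_OTHER: "SCHED_OTHER",
    os.SCHED_FIFO: "SCHED_FIFO",
    os.SCHED_RR: "SCHED_RR",
    os.SCHED_BATCH: "SCHED_BATCH",
    os.SCHED_IDLE: "SCHED_IDLE",
}


def thread_sched_info(pid):
    """Return (comm, policy name, RT priority) for a PID, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/comm") as f:
            comm = f.read().strip()
        policy = os.sched_getscheduler(pid)
        prio = os.sched_getparam(pid).sched_priority
    except (FileNotFoundError, ProcessLookupError, PermissionError):
        return None
    # Unknown values (e.g. policies not exposed by the os module) are shown as numbers.
    return comm, POLICY_NAMES.get(policy, str(policy)), prio


def list_threads(prefix="irq/"):
    """Collect scheduling info for every task whose name starts with `prefix`."""
    rows = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        info = thread_sched_info(int(entry))
        if info is not None and info[0].startswith(prefix):
            rows.append((int(entry),) + info)
    return sorted(rows)


if __name__ == "__main__":
    for pid, comm, policy, prio in list_threads():
        print(f"{pid:>7}  {comm:<20} {policy:<12} prio={prio}")
```

Comparing this output with the priorities of the application's own threads can show whether a user-space real-time thread is able to starve an IRQ thread, which would fit a CPU freeze followed by a watchdog reset.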
We have a strong need for the RT version. Are there any other debugging methods that can help us locate issues?
Most of the time you rely on serial console log/dmesg/syslog, and if none of them gives useful information, then there is really little we can do.
Because deep learning applications require a lot of I/O resources, my guess is that the process is being context-switched in the middle of a data transfer, causing a memory issue it cannot recover from.
I have found that the file system freezes when the problem occurs, and the storage directory being accessed becomes unreachable.
After a long period of stress testing, the system became unresponsive and the following was output on the serial port:
[2023-12-19 16:26:03.186] [20347.474925] Unable to handle kernel paging request at virtual address 0003000900000ff8
[2023-12-19 16:26:03.186] [20347.474935] Mem abort info:
[2023-12-19 16:26:03.202] [20347.474936] ESR = 0x96000004
[2023-12-19 16:26:03.202] [20347.474938] EC = 0x25: DABT (current EL), IL = 32 bits
[2023-12-19 16:26:03.202] [20347.474940] SET = 0, FnV = 0
[2023-12-19 16:26:03.203] [20347.474942] EA = 0, S1PTW = 0
[2023-12-19 16:26:03.218] [20347.474943] Data abort info:
[2023-12-19 16:26:03.218] [20347.474944] ISV = 0, ISS = 0x00000004
[2023-12-19 16:26:03.218] [20347.474946] CM = 0, WnR = 0
[2023-12-19 16:26:03.218] [20347.474947] [0003000900000ff8] address between user and kernel address ranges
[2023-12-19 16:26:03.234] [20347.474951] Internal error: Oops: 96000004 [#1] PREEMPT_RT SMP
[2023-12-19 16:26:03.234] [20347.474955] Modules linked in: macvlan(E) hx_sgov08b03c(E) nv_isx031(E) hx_fsync(E) hx_ar0820(E) max96712(E) mttcan(E) can_dev(E) can_raw(E) can(E) zram(]
[2023-12-19 16:26:03.266] [20347.474986] CPU: 3 PID: 394 Comm: irq/308-nvme0q4 Tainted: G W E 5.10.104-rt63-tegra #1
[2023-12-19 16:26:03.282] [20347.474989] Hardware name: Unknown Jetson AGX Orin/Jetson AGX Orin, BIOS 2.1-32413640 01/24/2023
[2023-12-19 16:26:03.282] [20347.474991] pstate: 20c00009 (nzCv daif +PAN +UAO -TCO BTYPE=--)
[2023-12-19 16:26:03.298] [20347.474993] pc : nvme_free_prps.isra.0+0x68/0xa0
[2023-12-19 16:26:03.298] [20347.475001] lr : nvme_free_prps.isra.0+0x2c/0xa0
[2023-12-19 16:26:03.299] [20347.475002] sp : ffff80001646bc80
[2023-12-19 16:26:03.299] [20347.475003] x29: ffff80001646bc80 x28: 0000000000000238
[2023-12-19 16:26:03.314] [20347.475005] x27: ffff8000112e25b0 x26: 000000000000a238
[2023-12-19 16:26:03.314] [20347.475007] x25: ffff8000112e1000 x24: 00000000000015b0
[2023-12-19 16:26:03.315] [20347.475008] x23: ffff7c86c7a87278 x22: ffff7c86c7ff5700
[2023-12-19 16:26:03.330] [20347.475009] x21: ffff7c86c7ff5810 x20: 0000000000000001
[2023-12-19 16:26:03.331] [20347.475011] x19: ffff7c86c7ff5700 x18: 0000000000000000
[2023-12-19 16:26:03.331] [20347.475012] x17: 0000000000000000 x16: 0000000000000000
[2023-12-19 16:26:03.346] [20347.475013] x15: 0000fffdb4bc0df8 x14: 0000000000000023
[2023-12-19 16:26:03.346] [20347.475015] x13: 0000000000000061 x12: 0000000000000024
[2023-12-19 16:26:03.347] [20347.475016] x11: 071c71c71c71c71c x10: 0000000000000a80
[2023-12-19 16:26:03.362] [20347.475018] x9 : 00000000000fe980 x8 : ffff7c86c7e6a7e0
[2023-12-19 16:26:03.362] [20347.475019] x7 : ffffdc5c8c1760f0 x6 : 000000001d7b3803
[2023-12-19 16:26:03.363] [20347.475021] x5 : ffff7c8bdb7e5bc0 x4 : ffff7c86c7e69d00
[2023-12-19 16:26:03.378] [20347.475022] x3 : 0000000000000a80 x2 : 00000000fd0fd000
[2023-12-19 16:26:03.378] [20347.475023] x1 : 0003000900000000 x0 : ffff7c86c7cc4500
[2023-12-19 16:26:03.379] [20347.475025] Call trace:
[2023-12-19 16:26:03.394] [20347.475026] nvme_free_prps.isra.0+0x68/0xa0
[2023-12-19 16:26:03.394] [20347.475028] nvme_unmap_data+0x10c/0x120
[2023-12-19 16:26:03.394] [20347.475029] nvme_pci_complete_rq+0x48/0xb0
[2023-12-19 16:26:03.395] [20347.475031] nvme_irq+0x16c/0x360
[2023-12-19 16:26:03.410] [20347.475032] irq_forced_thread_fn+0x44/0xc0
[2023-12-19 16:26:03.410] [20347.475036] irq_thread+0x158/0x260
[2023-12-19 16:26:03.410] [20347.475038] kthread+0x180/0x1b0
[2023-12-19 16:26:03.411] [20347.475042] ret_from_fork+0x10/0x24
[2023-12-19 16:26:03.411] [20347.475047] Code: 8b34cc63 11000694 f94002e0 f8636821 (f947fc33)
[2023-12-19 16:26:03.426] [20347.475051] ---[ end trace 0000000000000006 ]---
[2023-12-19 16:26:03.426] [20348.480084] Kernel panic - not syncing:
[2023-12-19 16:26:03.427] [20348.480086] Oops: Fatal exception in interrupt
[2023-12-19 16:26:03.442] [20348.926169] tegra-xusb 3610000.xhci: 2-3 isn't suspended: 0x0c001243
[2023-12-19 16:26:03.442] [20348.926177] tegra-xusb 3610000.xhci: not all ports suspended: -16
[2023-12-19 16:26:03.442] [20348.926200] tegra-xusb 3610000.xhci: entering ELPG failed
[2023-12-19 16:26:03.458] [20349.544672] hxar0820 7-0017: ar0820_stop_streaming: err=0, video_link=0x62
[2023-12-19 16:26:03.458] [20349.544862] hxar0820 1-0013: ar0820_stop_streaming: err=0, video_link=0x62
[2023-12-19 16:26:03.474] [20349.545073] hxar0820 1-0012: ar0820_stop_streaming: err=0, video_link=0x62
[2023-12-19 16:26:03.474] [20349.545121] max96712 7-0076: max96712_read_reg_Dser: addr = 0x13e, val = 0x0
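The oops above can be partly decoded by hand. The sketch below pulls apart the ESR value and the faulting instruction word (the value in parentheses on the `Code:` line), using the standard AArch64 field layouts. The function names are made up for illustration, and the instruction decoder only handles the one encoding seen here (64-bit `LDR` with an unsigned immediate offset).

```python
def decode_esr(esr):
    """Split an AArch64 ESR_ELx value into the fields printed in the oops."""
    iss = esr & 0x1FFFFFF
    return {
        "ec": (esr >> 26) & 0x3F,   # exception class, 0x25 = data abort at current EL
        "il": (esr >> 25) & 0x1,    # 1 = 32-bit instruction
        "iss": iss,
        "wnr": (iss >> 6) & 0x1,    # 0 = the faulting access was a read
        "dfsc": iss & 0x3F,         # 0x04 = translation fault, level 0
    }


def decode_ldr64_uimm(insn):
    """Decode 'LDR Xt, [Xn, #imm]' (64-bit, unsigned offset) into (rt, rn, offset)."""
    if (insn & 0xFFC00000) != 0xF9400000:
        raise ValueError("not a 64-bit LDR with unsigned immediate offset")
    imm12 = (insn >> 10) & 0xFFF
    rn = (insn >> 5) & 0x1F
    rt = insn & 0x1F
    return rt, rn, imm12 * 8        # the immediate is scaled by the 8-byte access size


if __name__ == "__main__":
    esr = decode_esr(0x96000004)
    print({k: hex(v) for k, v in esr.items()})

    rt, rn, offset = decode_ldr64_uimm(0xF947FC33)
    print(f"ldr x{rt}, [x{rn}, #{offset:#x}]")

    x1 = 0x0003000900000000         # value of x1 from the register dump
    print(f"fault address = {x1 + offset:#018x}")
```

The instruction decodes to `ldr x19, [x1, #0xff8]`, and x1 + 0xff8 equals the reported fault address 0003000900000ff8. Offset 0xff8 is the last 8-byte slot of a 4 KiB page, which would be consistent with `nvme_free_prps` reading the final entry of a PRP list page through a pointer (x1) that is not a valid kernel address. That points toward a corrupted or stale PRP list pointer in the NVMe completion path, but it is an inference from the register dump, not something confirmed in this thread.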