Xavier kernel crashes randomly on Jetpack 5.1.1

SanjayD · July 14, 2024, 10:33pm

A rare issue that happens roughly once every 2 to 3 days: the Xavier’s kernel panics, apparently due to CAN activity, judging by the crash log.

[ 4509.644650] 	**************************************
[ 4509.644800] kernel BUG at drivers/soc/tegra/cbb/tegra194-cbb.c:2057!
[ 4509.644932] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 4509.645043] Modules linked in: xt_conntrack(E) nf_conntrack_netlink(E) nfnetlink(E) xt_addrtype(E) br_netfilter(E) overlay(E) xt_MASQUERADE(E) xt_mark(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) iptable_filter(E) aes_ce_ccm(E) mttcan(E) can_dev(E) can_raw(E) can(E) ramoops(E) reed_solomon(E) micrel(E) bnep(E) iwlmvm(E) mac80211(E) binfmt_misc(E) nvgpu(E) iwlwifi(E) aes_ce_blk(E) crypto_simd(E) cryptd(E) aes_ce_cipher(E) ghash_ce(E) pwm_fan(E) sha2_ce(E) sha256_arm64(E) sha1_ce(E) max77620_thermal(E) cfg80211(E) btusb(E) nct1008(E) uvcvideo(E) btrtl(E) ina3221(E) btbcm(E) videobuf2_vmalloc(E) btintel(E) tegra_bpmp_thermal(E) userspace_alert(E) spi_tegra114(E) cdc_acm(E) nvmap(E) 8021q(E) garp(E) mrp(E) ip_tables(E) x_tables(E) [last unloaded: mtd]
[ 4509.978976] CPU: 0 PID: 10564 Comm: ingenia_mcu_nod Tainted: G            E     5.10.104-pmx+ #1
[ 4509.979156] Hardware name: Unknown Jetson-AGX/Jetson-AGX, BIOS 3.1-32827747 03/19/2023
[ 4509.979319] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)
[ 4509.979457] pc : tegra194_cbb_err_isr+0x19c/0x1b0
[ 4509.979562] lr : tegra194_cbb_err_isr+0x11c/0x1b0
[ 4509.979654] sp : ffff800010003920
[ 4509.979727] x29: ffff800010003920 x28: 0000000000000001 
[ 4509.979838] x27: 0000000000000080 x26: ffffa9abde710a80 
[ 4509.980102] x25: ffffa9abdf086e38 x24: 0000000000000001 
[ 4509.980492] x23: ffffa9abde9f7000 x22: ffffa9abdee8edb0 
[ 4510.309725] x21: 000000000000000f x20: 0000000000000005 
[ 4510.309855] x19: ffffa9abdee8eda0 x18: 0000000000000010 
[ 4510.309985] x17: 0000000000000000 x16: ffffa9abdcfb3210 
[ 4510.310120] x15: ffff602e5aadb0f0 x14: ffffffffffffffff 
[ 4510.310236] x13: ffff800090003517 x12: ffff80001000351f 
[ 4510.310347] x11: 0000000000000038 x10: 0101010101010101 
[ 4510.310458] x9 : ffff800010003830 x8 : 2a2a2a2a2a2a2a2a 
[ 4510.310569] x7 : ffffa9abdd6a5de0 x6 : c0000000ffffefff 
[ 4510.310678] x5 : ffff6035bfe24958 x4 : ffffa9abded17a28 
[ 4510.310796] x3 : 0000000000000001 x2 : ffffa9abdd14e170 
[ 4510.310912] x1 : ffff602e5aadab80 x0 : 0000000000010100 
[ 4510.311028] Call trace:
[ 4510.311090]  tegra194_cbb_err_isr+0x19c/0x1b0
[ 4510.311181]  __handle_irq_event_percpu+0x68/0x2a0
[ 4510.311275]  handle_irq_event_percpu+0x40/0xa0
[ 4510.311362]  handle_irq_event+0x50/0xf0
[ 4510.311521]  handle_fasteoi_irq+0xc0/0x170
[ 4510.640631]  generic_handle_irq+0x40/0x60
[ 4510.640758]  __handle_domain_irq+0x70/0xd0
[ 4510.640857]  efi_header_end+0xb0/0xf0
[ 4510.640942]  el1_irq+0xd0/0x180
[ 4510.641036]  ttcan_read_txevt_ram+0x54/0x90 [mttcan]
[ 4510.641144]  ttcan_read_txevt_fifo+0x90/0x150 [mttcan]
[ 4510.641265]  mttcan_poll_ir+0x624/0xcc0 [mttcan]
[ 4510.641361]  net_rx_action+0x124/0x440
[ 4510.641440]  __do_softirq+0x140/0x3e8
[ 4510.641516]  irq_exit+0xc0/0xe0
[ 4510.641581]  __handle_domain_irq+0x74/0xd0
[ 4510.641685]  efi_header_end+0xb0/0xf0
[ 4510.641759]  el0_irq_naked+0x4c/0x54
[ 4510.641839] Code: a9446bf9 a94573fb a8c67bfd d65f03c0 (d4210000) 
[ 4510.642004] ---[ end trace 9e3ff8224358e511 ]---

The crash is easier to reproduce when I restart the Ethernet interface using

sudo ifconfig eth0 down; sudo ifconfig eth0 up

but it’s possible these are two different issues.
I will try Jetpack 5.1.2, since this post mentions that the crash caused by the Ethernet interface wasn’t reproducible on 5.1.2.
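
For anyone who wants to turn the interface restart above into a repeatable test, here is a minimal sketch in Python. It only wraps the same ifconfig down/up pair from this post in a loop. The function names, the cycle count, and the settle delay are my own assumptions for illustration, not something from NVIDIA or ConnectTech.

```python
import subprocess
import time


def bounce_commands(iface):
    """Return the down/up command pair from the post for one interface."""
    return [["ifconfig", iface, "down"], ["ifconfig", iface, "up"]]


def bounce(iface, cycles, settle_s=5.0, run=subprocess.run):
    """Toggle the interface `cycles` times and return the completed cycle count.

    `settle_s` is an assumed pause after each command so the link can settle.
    `run` is injectable so the loop can be exercised without touching a real NIC.
    """
    done = 0
    for _ in range(cycles):
        for cmd in bounce_commands(iface):
            run(cmd, check=True)  # raises CalledProcessError if ifconfig fails
            time.sleep(settle_s)
        done += 1
    return done
```

Run as root on the target, for example bounce("eth0", cycles=20), while watching the serial console or dmesg for the oops. If the board panics, the loop simply stops with the machine, so the last cycle number has to be logged elsewhere if it matters.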

Since my original issue isn’t easily reproducible, I can’t be certain that 5.1.2 will resolve it. Can you please help confirm if these are both the same issue?

KevinFFF · July 15, 2024, 2:23am

Hi SanjayD,

Are you using the devkit or a custom board for AGX Xavier?
Could you also verify with the latest Jetpack 5.1.3 (R35.5.0)?

SanjayD · July 15, 2024, 3:59am

@KevinFFF I am using a ConnectTech Rogue AGX101. The latest available BSP is 5.1.2, so I will try that instead.

I need a fix for this urgently, and updating the BSP usually involves a lot of risk and testing time for us, so this is not ideal.

KevinFFF · July 15, 2024, 9:40am

We would need clear steps to reproduce the issue on the devkit so that we can look into it and debug it.
If you have the devkit, please also try to reproduce on it to check if the issue is specific to the custom carrier board.

linuxdev · July 15, 2024, 4:49pm

Just curious, does this use a CAN device? I see this began in a user space program, and then a software IRQ was issued, which went into kernel space (not into a hardware device driver; it is a software IRQ). I see an RX event for the CAN, but it is not a hardware IRQ, which seems slightly out of place for a software IRQ. If you do have any CAN device, then you might describe it… it is a network-type device, and you mentioned that you think eth0 up/down might sometimes trigger an issue. Maybe there is something odd going on between the CAN network and other hardware related to networking.

SanjayD · July 15, 2024, 5:09pm

Yes, I have a few CAN devices (motor controllers and other microcontrollers) on the network that are sending data periodically to the Xavier. The bandwidth used is ~8kbps.

I haven’t seen this issue on Jetpack 4 or 5.0.2 in the past 3-4 years of testing with the exact same network configuration, so it seems like it might be due to something specific to 5.1.1.

linuxdev · July 15, 2024, 5:15pm

Do any of the CAN user space programs try to bind affinity to other CPUs? Or is the system just using whatever CPU core the default scheduler picks?

If you have multiple crash dumps, you might want to check whether this sequence is in all of them, or whether they instead mention code from other software (if it sticks to CAN, that is a big clue; if other network-based programs also show up, it is still a big clue, but not a smoking gun as to which):

[ 4510.640942]  el1_irq+0xd0/0x180
[ 4510.641036]  ttcan_read_txevt_ram+0x54/0x90 [mttcan]
[ 4510.641144]  ttcan_read_txevt_fifo+0x90/0x150 [mttcan]
[ 4510.641265]  mttcan_poll_ir+0x624/0xcc0 [mttcan]
[ 4510.641361]  net_rx_action+0x124/0x440
[ 4510.641440]  __do_softirq+0x140/0x3e8
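
If several dumps do turn up, comparing them by hand gets tedious. Below is a rough Python sketch that counts how many call-trace frames in a saved log belong to each kernel module, using the [module] tag at the end of each frame. The function name and the "vmlinux" label for untagged frames are my own choices, and the regular expression only assumes the frame format seen in the log in this thread.

```python
import re
from collections import Counter

# One call-trace frame, e.g. "[ 4510.641036]  ttcan_read_txevt_ram+0x54/0x90 [mttcan]"
FRAME = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(\S+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?\s*$"
)


def module_frames(log_text):
    """Count call-trace frames per module; untagged frames count as 'vmlinux'."""
    counts = Counter()
    in_trace = False
    for line in log_text.splitlines():
        if "Call trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME.match(line)
        if match is None:
            # First non-frame line (e.g. "Code: ...") ends the trace.
            in_trace = False
            continue
        counts[match.group(2) or "vmlinux"] += 1
    return counts
```

Running module_frames() over each saved dump and comparing the counters shows quickly whether mttcan appears in every trace or whether other drivers show up as well.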

I mention affinity because this would change debugging and widen what might be at issue. If you do not tweak scheduling or priorities, then it is whatever default scheduling does.
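
If you are not sure whether a program sets affinity, you can ask the kernel directly. This is a small Python sketch using os.sched_getaffinity, which is available on Linux. The function name and the returned dictionary layout are hypothetical, and "restricted" only means the process is allowed on fewer CPUs than os.cpu_count() reports, which can also happen because of cgroup cpusets rather than explicit pinning.

```python
import os


def affinity_report(pid=0):
    """Describe the CPU affinity mask of `pid` (0 means the calling process)."""
    cpus = sorted(os.sched_getaffinity(pid))
    total = os.cpu_count() or len(cpus)  # cpu_count() can return None
    return {
        "pid": pid,
        "cpus": cpus,
        # True when the process may run on fewer CPUs than the system reports.
        "restricted": len(cpus) < total,
    }
```

Call it with the PID of the CAN program (for example the PID shown in the "CPU: 0 PID: ..." line of an oops, while that process is still alive) and check whether "restricted" is True.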

Being able to reproduce this would also help.

SanjayD · July 25, 2024, 1:28am

The userspace programs that use CAN do not use CPU affinity.

I haven’t been able to get further traces from the field for debugging, and since this needed to be fixed urgently I simply upgraded to Jetpack 5.1.2, and so far it doesn’t seem reproducible in the field or by restarting the Ethernet interface (this had a >50% chance of triggering the crash on JP 5.1.1). Thanks for looking through the crash logs!

SanjayD · August 11, 2024, 9:36pm

Unfortunately, this issue might not have been entirely solved by the upgrade to 5.1.2 (the latest version supported by our custom carrier board vendor, ConnectTech). We are now seeing kernel panics at a much lower rate (once every ~100 hours).

Created a separate post for help with debugging this on a live device out in the field.

system · August 25, 2024, 9:36pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
