Xavier kernel crashes randomly on Jetpack 5.1.1

This is a rare issue that seems to happen once every 2 or 3 days. Based on the crash log, the Xavier’s kernel panics, apparently because of CAN activity.

[ 4509.644650] 	**************************************
[ 4509.644800] kernel BUG at drivers/soc/tegra/cbb/tegra194-cbb.c:2057!
[ 4509.644932] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 4509.645043] Modules linked in: xt_conntrack(E) nf_conntrack_netlink(E) nfnetlink(E) xt_addrtype(E) br_netfilter(E) overlay(E) xt_MASQUERADE(E) xt_mark(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) iptable_filter(E) aes_ce_ccm(E) mttcan(E) can_dev(E) can_raw(E) can(E) ramoops(E) reed_solomon(E) micrel(E) bnep(E) iwlmvm(E) mac80211(E) binfmt_misc(E) nvgpu(E) iwlwifi(E) aes_ce_blk(E) crypto_simd(E) cryptd(E) aes_ce_cipher(E) ghash_ce(E) pwm_fan(E) sha2_ce(E) sha256_arm64(E) sha1_ce(E) max77620_thermal(E) cfg80211(E) btusb(E) nct1008(E) uvcvideo(E) btrtl(E) ina3221(E) btbcm(E) videobuf2_vmalloc(E) btintel(E) tegra_bpmp_thermal(E) userspace_alert(E) spi_tegra114(E) cdc_acm(E) nvmap(E) 8021q(E) garp(E) mrp(E) ip_tables(E) x_tables(E) [last unloaded: mtd]
[ 4509.978976] CPU: 0 PID: 10564 Comm: ingenia_mcu_nod Tainted: G            E     5.10.104-pmx+ #1
[ 4509.979156] Hardware name: Unknown Jetson-AGX/Jetson-AGX, BIOS 3.1-32827747 03/19/2023
[ 4509.979319] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)
[ 4509.979457] pc : tegra194_cbb_err_isr+0x19c/0x1b0
[ 4509.979562] lr : tegra194_cbb_err_isr+0x11c/0x1b0
[ 4509.979654] sp : ffff800010003920
[ 4509.979727] x29: ffff800010003920 x28: 0000000000000001 
[ 4509.979838] x27: 0000000000000080 x26: ffffa9abde710a80 
[ 4509.980102] x25: ffffa9abdf086e38 x24: 0000000000000001 
[ 4509.980492] x23: ffffa9abde9f7000 x22: ffffa9abdee8edb0 
[ 4510.309725] x21: 000000000000000f x20: 0000000000000005 
[ 4510.309855] x19: ffffa9abdee8eda0 x18: 0000000000000010 
[ 4510.309985] x17: 0000000000000000 x16: ffffa9abdcfb3210 
[ 4510.310120] x15: ffff602e5aadb0f0 x14: ffffffffffffffff 
[ 4510.310236] x13: ffff800090003517 x12: ffff80001000351f 
[ 4510.310347] x11: 0000000000000038 x10: 0101010101010101 
[ 4510.310458] x9 : ffff800010003830 x8 : 2a2a2a2a2a2a2a2a 
[ 4510.310569] x7 : ffffa9abdd6a5de0 x6 : c0000000ffffefff 
[ 4510.310678] x5 : ffff6035bfe24958 x4 : ffffa9abded17a28 
[ 4510.310796] x3 : 0000000000000001 x2 : ffffa9abdd14e170 
[ 4510.310912] x1 : ffff602e5aadab80 x0 : 0000000000010100 
[ 4510.311028] Call trace:
[ 4510.311090]  tegra194_cbb_err_isr+0x19c/0x1b0
[ 4510.311181]  __handle_irq_event_percpu+0x68/0x2a0
[ 4510.311275]  handle_irq_event_percpu+0x40/0xa0
[ 4510.311362]  handle_irq_event+0x50/0xf0
[ 4510.311521]  handle_fasteoi_irq+0xc0/0x170
[ 4510.640631]  generic_handle_irq+0x40/0x60
[ 4510.640758]  __handle_domain_irq+0x70/0xd0
[ 4510.640857]  efi_header_end+0xb0/0xf0
[ 4510.640942]  el1_irq+0xd0/0x180
[ 4510.641036]  ttcan_read_txevt_ram+0x54/0x90 [mttcan]
[ 4510.641144]  ttcan_read_txevt_fifo+0x90/0x150 [mttcan]
[ 4510.641265]  mttcan_poll_ir+0x624/0xcc0 [mttcan]
[ 4510.641361]  net_rx_action+0x124/0x440
[ 4510.641440]  __do_softirq+0x140/0x3e8
[ 4510.641516]  irq_exit+0xc0/0xe0
[ 4510.641581]  __handle_domain_irq+0x74/0xd0
[ 4510.641685]  efi_header_end+0xb0/0xf0
[ 4510.641759]  el0_irq_naked+0x4c/0x54
[ 4510.641839] Code: a9446bf9 a94573fb a8c67bfd d65f03c0 (d4210000) 
[ 4510.642004] ---[ end trace 9e3ff8224358e511 ]---

The crash is easier to reproduce when I restart the Ethernet interface using
sudo ifconfig eth0 down; sudo ifconfig eth0 up
but it’s possible these are two different issues.
I will try Jetpack 5.1.2, since this post mentions that the crash caused by the Ethernet interface wasn’t reproducible on 5.1.2.
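For anyone trying to reproduce this, the interface restart above could be scripted as a stress loop. This is only a sketch: the function name `restart_interface`, the cycle count, and the delay are assumptions of mine, not something from the original setup, and the `run` parameter exists only so the loop can be exercised without touching a real interface.

```python
import subprocess
import time


def restart_interface(iface="eth0", cycles=10, delay_s=5.0, run=subprocess.run):
    """Bring `iface` down and up `cycles` times; return the completed cycle count.

    `cycles` and `delay_s` are illustrative defaults, not measured values.
    """
    completed = 0
    for _ in range(cycles):
        # Same two commands as in the post. Unlike the `;` in the shell
        # one-liner, check=True stops the loop if either command fails.
        run(["ifconfig", iface, "down"], check=True)
        run(["ifconfig", iface, "up"], check=True)
        completed += 1
        time.sleep(delay_s)  # let the link settle before the next cycle
    return completed
```

Calling `restart_interface("eth0", cycles=50)` as root while watching the serial console would show whether the panic tracks the restart count. If the board panics, the loop simply never returns.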

Since my original issue isn’t easily reproducible, I can’t be certain that 5.1.2 will resolve it. Can you please help confirm if these are both the same issue?

Hi SanjayD,

Are you using the devkit or custom board for AGX Xavier?
Could you also verify with the latest Jetpack 5.1.3(R35.5.0)?

@KevinFFF I am using a ConnectTech Rogue AGX101. The latest available BSP is 5.1.2, so I will try that instead.

I need a fix for this urgently, and updating the BSP usually involves a lot of risk and testing time for us, so this is not ideal.

We would need clear steps to reproduce the issue on the devkit so that we can look into it and debug it.
If you have the devkit, please also try to reproduce on it to check if the issue is specific to the custom carrier board.

Just curious, does this use a CAN device? I see this began in a user space program, and then a software IRQ was issued, which went into kernel space (not a hardware device driver; it is a software IRQ). I see an RX event for the CAN, but it is not a hardware IRQ, which seems slightly out of place for a software IRQ. If you do have any CAN device, you might describe it. CAN is a network-type device, and you mentioned that you think eth0 up/down might sometimes trigger an issue. Maybe there is something odd going on between the CAN network and other hardware related to networking.

Yes, I have a few CAN devices (motor controllers and other microcontrollers) on the network that periodically send data to the Xavier. The bandwidth used is ~8 kbps.

I haven’t seen this issue on Jetpack 4 or 5.0.2 in the past 3-4 years of testing with the exact same network configuration, so it seems like it might be due to something specific to 5.1.1.

Do any of the CAN user space programs try to bind affinity to other CPUs? Or is the system just using whatever CPU core the default scheduling performs?
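One way to check the affinity question on a running system is to read the `Cpus_allowed_list` field from `/proc/<pid>/status` and compare it with the CPUs that are online. The sketch below assumes a Linux `/proc` filesystem; the helper names `parse_cpu_list` and `allowed_cpus` are made up for illustration.

```python
def parse_cpu_list(text):
    """Turn a kernel CPU list such as '0-3,6' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def allowed_cpus(pid):
    """Read the CPUs a process may run on from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("Cpus_allowed_list:"):
                return parse_cpu_list(line.split(":", 1)[1])
    return set()
```

If `allowed_cpus(pid)` for the CAN program is smaller than the set parsed from `/sys/devices/system/cpu/online`, something has restricted its affinity. Note that this reads the main thread only; other threads have their own entries under `/proc/<pid>/task/`.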

If you have multiple crash dumps, you might want to check whether this sequence appears in all of them, or whether frames from other software show up instead (if it sticks to CAN, that is a big clue; if other network-based programs also show up, that is still a big clue, but not a smoking gun as to which):

[ 4510.640942]  el1_irq+0xd0/0x180
[ 4510.641036]  ttcan_read_txevt_ram+0x54/0x90 [mttcan]
[ 4510.641144]  ttcan_read_txevt_fifo+0x90/0x150 [mttcan]
[ 4510.641265]  mttcan_poll_ir+0x624/0xcc0 [mttcan]
[ 4510.641361]  net_rx_action+0x124/0x440
[ 4510.641440]  __do_softirq+0x140/0x3e8
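Comparing several dumps by eye is tedious, so a small script could list which loadable modules appear in each call trace. This is a rough sketch that assumes the dmesg layout shown in this thread (a `Call trace:` line followed by `symbol+0xoff/0xsize [module]` frames); the name `trace_modules` is hypothetical.

```python
import re

# Matches a frame such as:
# "[ 4510.641036]  ttcan_read_txevt_ram+0x54/0x90 [mttcan]"
FRAME = re.compile(
    r"\]\s+(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def trace_modules(log_text):
    """Return the modules named in call-trace frames, in first-seen order."""
    modules = []
    in_trace = False
    for line in log_text.splitlines():
        if "Call trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME.search(line)
        if match is None:
            in_trace = False  # the first non-frame line ends the trace
            continue
        module = match.group("module")
        if module and module not in modules:
            modules.append(module)
    return modules
```

Running this over each saved dump and comparing the results shows quickly whether `mttcan` is present every time or whether other drivers turn up. Frames from built-in code carry no module tag and are ignored.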

I mention affinity because this would change debugging and widen what might be at issue. If you do not tweak scheduling or priorities, then it is whatever default scheduling does.

Being able to reproduce this would also help.

The userspace programs that use CAN do not use CPU affinity.

I haven’t been able to get further traces from the field for debugging, and since this needed to be fixed urgently I simply upgraded to Jetpack 5.1.2, and so far it doesn’t seem reproducible in the field or by restarting the Ethernet interface (this had a >50% chance of triggering the crash on JP 5.1.1). Thanks for looking through the crash logs!

Unfortunately, this issue might not have been entirely solved by the upgrade to 5.1.2 (the latest version supported by our custom carrier board vendor, ConnectTech). I am still seeing kernel panics, though at a much lower rate now (once every ~100 hours).

I created a separate post for help with debugging this on a live device out in the field.
