Hi, we have a custom board based on Jetson AGX Orin, running JetPack with a rootfs based on JetPack 35.3.1. From time to time the Orin reboots, most probably triggered by the watchdog: several times we have seen the kernel message "watchdog: BUG: soft lockup - CPU#2 stuck for 24s! [migration/2:23]" on the serial console right before the reboot. The reboot is usually preceded by a burst of errors from cbb-fabric, which indicate that CPU reads from the memory regions mapped to the PCIe controller registers time out:
exceptions from cbb
[ 162.966318] CPU:4, Error: cbb-fabric@0x13a00000, irq=22
[ 162.966470] **************************************
[ 162.966614] CPU:4, Error:cbb-fabric, Errmon:2
[ 162.966735] Error Code : TIMEOUT_ERR
[ 162.966843] Overflow : Multiple TIMEOUT_ERR
[ 162.966983]
[ 162.967025] Error Code : TIMEOUT_ERR
[ 162.967134] MASTER_ID : CCPLEX
[ 162.967225] Address : 0x3e000114
[ 162.967329] Cache : 0x0 -- Device Non-Bufferable
[ 162.967478] Protection : 0x2 -- Unprivileged, Non-Secure, Data Access
[ 162.967667] Access_Type : Read
[ 162.967765] Access_ID : 0x12
[ 162.967766] Fabric : cbb-fabric
[ 162.967959] Slave_Id : 0x13
[ 162.968047] Burst_length : 0x0
[ 162.968132] Burst_type : 0x1
[ 162.968223] Beat_size : 0x2
[ 162.968310] VQC : 0x0
[ 162.968629] GRPSEC : 0x7e
[ 162.969084] FALCONSEC : 0x0
[ 162.969542] **************************************
[ 162.970284] ------------[ cut here ]------------
[ 162.970296] WARNING: CPU: 4 PID: 758 at drivers/soc/tegra/cbb/tegra234-cbb.c:577 tegra234_cbb_isr+0x134/0x180
[ 162.970297] Modules linked in: nf_conntrack_netlink nfnetlink br_netfilter nvgpu overlay iwlmvm xt_nat xt_MASQUERADE mttcan xt_addrtype can_raw can iptable_nat nf_nat ipt_REJECT nf_reject_ipv4 xt_state xt_conntrack nf_conntrack mac80211 nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_tcpudp iptable_filter isx031 iwlwifi cfg80211 ub953 snd_soc_tegra186_asrc snd_soc_tegra210_ope option snd_soc_tegra186_dspk snd_soc_tegra210_iqc qmi_wwan snd_soc_tegra186_arad snd_soc_tegra210_mvc ghash_ce usb_wwan snd_soc_tegra210_afc snd_soc_tegra210_dmic cdc_wdm sha2_ce snd_soc_tegra210_adsp usbserial snd_soc_tegra210_amx snd_soc_tegra210_adx snd_soc_tegra210_admaif sha256_arm64 snd_soc_tegra_machine_driver sha1_ce snd_soc_tegra210_i2s snd_soc_tegra210_mixer snd_soc_tegra_pcm snd_soc_tegra210_sfc snd_soc_tegra_utils snd_soc_simple_card_utils snd_soc_spdif_tx gpio_cam_syncer nvadsp userspace_alert ub960 can_dev max20087_regulator nct1008 snd_soc_tegra210_ahub snd_soc_rt5640 nvmap tegra210_adma tegra_bpmp_thermal
[ 162.970369] snd_soc_rl6231 ina3221 vcan fuse ip_tables x_tables mpt3sas raid_class scsi_transport_sas des_generic libdes ccm ahci libahci libata aes_ce_blk crypto_simd cryptd aes_ce_cipher [last unloaded: can]
[ 162.970393] CPU: 4 PID: 758 Comm: kworker/4:2 Not tainted 5.10.104-l4t-35.3.1-107-g2e51a0b #1
[ 162.970394] Hardware name: Unknown Jetson AGX Orin/Jetson AGX Orin, BIOS 3.1-32827747 03/19/2023
[ 162.970403] Workqueue: usb_hub_wq hub_event
[ 162.970406] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)
[ 162.970408] pc : tegra234_cbb_isr+0x134/0x180
[ 162.970409] lr : tegra234_cbb_isr+0x10c/0x180
[ 162.970410] sp : ffff800010023bb0
[ 162.970411] x29: ffff800010023bb0 x28: ffff7c86c3fd8f00
[ 162.970413] x27: 0000000000000001 x26: 0000000000000080
[ 162.970415] x25: ffffdb5a8d1d9158 x24: ffffdb5a8de82c30
[ 162.970417] x23: ffffdb5a8d581000 x22: 0000000000000016
[ 162.970419] x21: ffffdb5a8dc4d810 x20: 0000000000000002
[ 162.970420] x19: ffffdb5a8dc4d800 x18: 0000000000000010
[ 162.970422] x17: 0000000000000000 x16: 0000000000000000
[ 162.970424] x15: ffff7c86c3fd9470 x14: ffffffffffffffff
[ 162.970425] x13: ffff8000900236d7 x12: ffff8000100236df
[ 162.970427] x11: 0000000000000040 x10: ffffdb5a8db18ae8
[ 162.970429] x9 : ffffdb5a8c23a8bc x8 : 0000000000000001
[ 162.970430] x7 : 0000000000017fe8 x6 : c0000000ffffefff
[ 162.970432] x5 : ffff7c8deebc1988 x4 : 0000000000000000
[ 162.970434] x3 : 0000000000000000 x2 : 0000000000000000
[ 162.970435] x1 : ffff7c86c3fd8f00 x0 : 0000000100010101
[ 162.970438] Call trace:
[ 162.970440] tegra234_cbb_isr+0x134/0x180
[ 162.970444] __handle_irq_event_percpu+0x68/0x2a0
[ 162.970445] handle_irq_event_percpu+0x3c/0xa0
[ 162.970447] handle_irq_event+0x50/0xf0
[ 162.970449] handle_fasteoi_irq+0xc0/0x180
[ 162.970451] generic_handle_irq+0x38/0x50
[ 162.970452] __handle_domain_irq+0x6c/0xd0
[ 162.970454] gic_handle_irq+0x60/0x12c
[ 162.970455] el1_irq+0xd0/0x180
[ 162.970456] __do_softirq+0xb4/0x3f0
[ 162.970460] irq_exit+0xc8/0xf0
[ 162.970461] __handle_domain_irq+0x70/0xd0
[ 162.970462] gic_handle_irq+0x60/0x12c
[ 162.970463] el1_irq+0xd0/0x180
[ 162.970467] _raw_spin_unlock_irqrestore+0x20/0x60
[ 162.970469] usb_disable_usb2_hardware_lpm+0x40/0xe0
[ 162.970470] usb_disable_device+0x11c/0x200
[ 162.970472] usb_disconnect+0xc4/0x2f0
[ 162.970474] hub_event+0x45c/0x1730
[ 162.970477] process_one_work+0x1c4/0x4a0
[ 162.970479] worker_thread+0x54/0x430
[ 162.970480] kthread+0x148/0x170
[ 162.970482] ret_from_fork+0x10/0x38
[ 162.970483] ---[ end trace d90d86e5d9064279 ]---
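Since the full log contains many such reports, a small script could help tally them by faulting address and timestamp, so they can be lined up against Wi-Fi events. The following is only a minimal sketch: it assumes the log has been saved as plain text in the same format as the excerpt above, and the function names `parse_cbb_errors` and `count_by_address` are our own, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Start of one CBB report, e.g.
# "[ 162.966318] CPU:4, Error: cbb-fabric@0x13a00000, irq=22"
HEADER = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+CPU:\d+, Error:\s*\S+@0x[0-9a-fA-F]+, irq=\d+"
)
# Selected "key : value" fields inside a report, e.g. "Address : 0x3e000114"
FIELD = re.compile(
    r"\[\s*\d+\.\d+\]\s+(Error Code|MASTER_ID|Address|Access_Type)\s*:\s*(.+?)\s*$"
)


def parse_cbb_errors(lines):
    """Return one dict per CBB report found in an iterable of log lines."""
    errors = []
    current = None
    for line in lines:
        header = HEADER.search(line)
        if header:
            # A new report begins; remember its kernel timestamp in seconds.
            current = {"time": float(header.group(1))}
            errors.append(current)
            continue
        field = FIELD.search(line)
        if field and current is not None:
            # Keep the first value seen for each field within a report.
            current.setdefault(field.group(1), field.group(2))
    return errors


def count_by_address(errors):
    """Count how many reports hit each faulting address."""
    return Counter(e.get("Address", "unknown") for e in errors)
```

For example, `count_by_address(parse_cbb_errors(open("dmesg.txt")))` would show whether all timeouts hit the same address range (the file name `dmesg.txt` is a placeholder). The `time` values can then be compared with timestamps of association/disassociation messages from the Wi-Fi driver in the same log.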
The PCIe controller in question is connected to one of the Intel AX210 Wi-Fi modules installed on our custom board. Moreover, the issue seems to correlate with activity on these Wi-Fi modules (we suspect either association/disassociation with an AP or high TX rates).
We still cannot reliably reproduce the issue, so we cannot tell whether our board shows the same problem on vanilla JetPack. How can we narrow down the pool of suspects and at least decide which part of the hardware or software is the root of the problem?
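As a first narrowing step, it may be worth confirming which device actually owns the faulting address (0x3e000114 in the report above) by looking it up in /proc/iomem on the board. The sketch below assumes the usual "start-end : name" line format of that file; `find_region` is a hypothetical helper name. Note that /proc/iomem shows real addresses only when read as root.

```python
import re

# One /proc/iomem entry: optional indentation, "start-end : name" in hex.
IOMEM_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def find_region(iomem_text, address):
    """Return (start, end, name) of the narrowest region containing address,
    or None if no listed region contains it."""
    best = None
    for line in iomem_text.splitlines():
        match = IOMEM_LINE.match(line)
        if not match:
            continue
        start = int(match.group(1), 16)
        end = int(match.group(2), 16)
        if start <= address <= end:
            # Prefer the smallest enclosing range (the most specific owner).
            if best is None or (end - start) < (best[1] - best[0]):
                best = (start, end, match.group(3).strip())
    return best


# Example use on the target (run as root):
#   with open("/proc/iomem") as f:
#       print(find_region(f.read(), 0x3E000114))
```

If the owner turns out to be the PCIe controller that hosts the AX210, that would support the suspicion above; if it is a different device, the Wi-Fi correlation may be indirect.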
full kernel messages before reboot (344.9 KB)