Debugging BCCPLEXWDT reset source

I am developing software on top of the AGX Orin dev kit. The system locked up and reset with the BCCPLEXWDT reset code, which I observed in the PMC log statements (sudo dmesg | grep PMC).

In case it’s relevant, I am using VPI/CUDA in my system and MIPI cameras via nvargus.

How can I better debug the BCCPLEXWDT issue? I am assuming it is some kind of CPU watchdog. What other information can be gathered to narrow down what is causing this?

Hi,
Are you using the latest Jetpack 5.1.1? If you are on a previous release, we suggest upgrading to the latest release and giving it a try.

For more information, please share the UART log for reference. You can capture the log by referring to:
Jetson/General debug - eLinux.org

Hi DaneLLL,

Thanks for the reply. I am not using Jetpack 5.1.1 yet. I can update when I find the opportunity. In the meantime, is there reason to believe this is something that will be resolved in Jetpack 5.1.1? Is there something in the changelog?

Thanks.

Hi,
It is uncertain whether the issue still occurs on Jetpack 5.1.1, but it would be great if you could use the latest release. Certain bugs reported in previous releases are fixed in later releases.

If you have test code that can be run to reproduce the issue, please share it with us and we can give it a try.

I was able to reproduce the issue on Jetpack 5.1.1. I do not have a reproduction case yet as it happens “randomly” while the system is under load.

Is there any information you can direct me to on the BCCPLEXWDT reset reason, and on what to collect to diagnose the issue?

Sorry for the late response. Have you managed to get the issue resolved, or do you still need support? Thanks.

For further checking, we would need a method to reproduce it on the developer kit. We would need your help to try it on the developer kit and see if the issue is also present there.

I still need support. As I mentioned in the first post, I am able to reproduce this on the Jetson AGX Orin Developer Kit. I do not know how to reproduce it because it seems to happen at random.

Does this reset reason essentially mean that the Linux kernel was completely hung? Or does it have some other kind of meaning?

I managed to reproduce one such BCCPLEXWDT reset while logging the serial console and captured the following kernel panic and stack trace.

[  608.008374] kernel BUG at mm/vmalloc.c:2065!
[  608.008512] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[  608.008676] Modules linked in: nvidia_modeset(OE) xt_mark veth xt_tcpudp nf_conntrack_netlink nfnetlink br_netfilter binfmt_misc ip6table_nat overlay lzo_rle lzo_compress zram ip6table_filter ip6_tables xt_state xt_conntrack iptable_filter xt_MASQUERADE xt_nat xt_multiport xt_addrtype iptable_nat nf_nat ramoops reed_solomon nf_conntrack loop nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c nvgpu iwlmvm mac80211 iwlwifi aes_ce_blk crypto_simd cryptd aes_ce_cipher ghash_ce sha2_ce cfg80211 sha256_arm64 sha1_ce ftdi_sio pwm_fan btusb usbserial userspace_alert btrtl nct1008 ina3221 tegra_bpmp_thermal btbcm gs_usb btintel spi_tegra114 nvidia(OE) nvmap mttcan can_dev can_raw can ip_tables x_tables [last unloaded: mtd]
[  608.010429] CPU: 0 PID: 456 Comm: irq/304-iwlwifi Tainted: G           OE     5.10.104-tegra #84
[  608.011739] Hardware name: Unknown Jetson AGX Orin/Jetson AGX Orin, BIOS 2.1-32413640 01/24/2023
[  608.013049] pstate: 00c00009 (nzcv daif +PAN +UAO -TCO BTYPE=--)
[  608.013940] pc : __get_vm_area_node.isra.0+0x160/0x180
[  608.017777] lr : __get_vm_area_node.isra.0+0x40/0x180
[  608.023024] sp : ffff8000121e3340
[  608.026437] x29: ffff8000121e3340 x28: 0000000000000001 
[  608.031951] x27: 0000000000000000 x26: ffffb6f7274f7230 
[  608.037463] x25: fffffdffbfff0000 x24: ffff800010000000 
[  608.042975] x23: 0000000000000cc0 x22: 0000000000000001 
[  608.048488] x21: 0000000000000010 x20: 0000000000001000 
[  608.053999] x19: ffffb6f7274f7230 x18: 0000000000000000 
[  608.059512] x17: 0000000000000000 x16: ffffb6f72795c6e0 
[  608.064937] x15: ffff800032611000 x14: b3534e6082e4c0a7 
[  608.070449] x13: fc0ec6741b598c8a x12: 0000000000000001 
[  608.075877] x11: 0000000000000000 x10: 0000000000000001 
[  608.081387] x9 : 0000000007fff7f1 x8 : 00000000000001ff 
[  608.086813] x7 : 0000000000000a20 x6 : ffffb6f7274f7230 
[  608.092149] x5 : 0000000000000cc0 x4 : fffffdffbfff0000 
[  608.097574] x3 : ffff800010000000 x2 : ffffb6f72757e170 
[  608.102913] x1 : 0000000000000001 x0 : 0000000000000402 
[  608.108251] Call trace:
[  608.110702]  __get_vm_area_node.isra.0+0x160/0x180
[  608.115514]  vmap+0x98/0x110
[  608.118404]  dma_common_pages_remap+0x40/0x70
[  608.122690]  iommu_dma_alloc_remap+0x2d0/0x420
[  608.126975]  iommu_dma_alloc+0x274/0x310
[  608.130738]  dma_alloc_attrs+0xe8/0xf0
[  608.134678]  tegra_se_ccm_ctr+0x3dc/0x510
[  608.138701]  tegra_se_aes_ccm_decrypt+0x70/0xd0
[  608.143252]  crypto_aead_decrypt+0x48/0x70
[  608.147299]  aead_decrypt+0x138/0x190 [mac80211]
[  608.151840]  ieee80211_crypto_ccmp_decrypt+0x330/0x360 [mac80211]
[  608.157965]  ieee80211_rx_handlers+0xdac/0x2240 [mac80211]
[  608.163299]  ieee80211_prepare_and_rx_handle+0x580/0x1050 [mac80211]
[  608.169688]  ieee80211_rx_list+0x524/0x9a0 [mac80211]
[  608.174761]  ieee80211_rx_napi+0x60/0xe0 [mac80211]
[  608.179576]  iwl_mvm_rx_rx_mpdu+0x4c4/0xb40 [iwlmvm]
[  608.184382]  iwl_mvm_rx+0x70/0xb0 [iwlmvm]
[  608.188498]  iwl_pcie_rx_handle+0x664/0xa80 [iwlwifi]
[  608.193745]  iwl_pcie_irq_handler+0x614/0xc50 [iwlwifi]
[  608.198992]  irq_thread_fn+0x34/0xa0
[  608.202488]  irq_thread+0x158/0x250
[  608.205990]  kthread+0x148/0x170
[  608.209228]  ret_from_fork+0x10/0x24
[  608.212816] Code: a94363f7 a9446bf9 a8c57bfd d65f03c0 (d4210000) 
[  608.219033] ---[ end trace a9fdd3e4c4a00b2c ]---
[  608.228242] Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt
[  608.231193] SMP: stopping secondary CPUs
[  608.234957] Kernel Offset: 0x36f7173c0000 from 0xffff800010000000
[  608.241075] PHYS_OFFSET: 0xffffb4a9c0000000
[  608.245193] CPU features: 0x0040006,4a80aa38
[  608.249653] Memory Limit: none
[  608.257422] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt ]--

Looking at the assertion here: kernel BUG at mm/vmalloc.c:2065!

I can see that a code path is hit from an interrupt context that should not have been hit from an interrupt context.

From the call stack, I can see that it is stemming from tegra platform-specific code tegra_se_aes_ccm_decrypt.

I do see a similar issue here: Possible bug in CryptoAPI due to tegra_xxx functions

Can someone from Nvidia confirm if there is a known workaround or fix for this bug? Thanks.

Looking more closely, I see that iommu_dma_alloc_remap is not designed to be called from IRQ context; however, the call stack shows that it clearly is.

I see it is called within this conditional

	if (IS_ENABLED(CONFIG_DMA_REMAP) && gfpflags_allow_blocking(gfp) &&
	    !(attrs & DMA_ATTR_FORCE_CONTIGUOUS)) {
		return iommu_dma_alloc_remap(dev, size, handle, gfp,
				dma_pgprot(dev, PAGE_KERNEL, attrs), attrs);
	}

In tegra-se-nvhost.c I see this call inside tegra_se_ccm_ctr

	dst_buf = dma_alloc_coherent(se_dev->dev, cryptlen+1, &dst_buf_dma_addr,
		GFP_KERNEL);

The last argument GFP_KERNEL is what makes gfpflags_allow_blocking(gfp) return true.
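To make the flag logic concrete, here is a minimal userspace sketch in C. It is not kernel code: the MODEL_ names and the takes_remap_path helper are made up for illustration, the bit values are assumed to match include/linux/gfp.h in a 5.10 tree and should be checked against your own sources, and CONFIG_DMA_REMAP is assumed to be enabled with DMA_ATTR_FORCE_CONTIGUOUS not set.

```c
#include <stdbool.h>

/* Userspace model only. Bit values are ASSUMED to mirror
 * include/linux/gfp.h in a 5.10 kernel; verify against your tree. */
typedef unsigned int gfp_t;

#define MODEL_GFP_HIGH_BIT            0x20u
#define MODEL_GFP_IO_BIT              0x40u
#define MODEL_GFP_FS_BIT              0x80u
#define MODEL_GFP_ATOMIC_BIT          0x200u
#define MODEL_GFP_DIRECT_RECLAIM_BIT  0x400u
#define MODEL_GFP_KSWAPD_RECLAIM_BIT  0x800u

/* GFP_KERNEL may enter direct reclaim, i.e. the caller may sleep. */
#define MODEL_GFP_KERNEL (MODEL_GFP_DIRECT_RECLAIM_BIT | \
                          MODEL_GFP_KSWAPD_RECLAIM_BIT | \
                          MODEL_GFP_IO_BIT | MODEL_GFP_FS_BIT)

/* GFP_ATOMIC never sets the direct-reclaim bit. */
#define MODEL_GFP_ATOMIC (MODEL_GFP_HIGH_BIT | MODEL_GFP_ATOMIC_BIT | \
                          MODEL_GFP_KSWAPD_RECLAIM_BIT)

/* Model of gfpflags_allow_blocking(): true when direct reclaim is allowed. */
static bool model_gfpflags_allow_blocking(gfp_t gfp)
{
	return (gfp & MODEL_GFP_DIRECT_RECLAIM_BIT) != 0;
}

/* Hypothetical helper: models the conditional quoted above with
 * CONFIG_DMA_REMAP assumed on and DMA_ATTR_FORCE_CONTIGUOUS assumed clear,
 * so only the gfp check decides whether the remap path is taken. */
static bool takes_remap_path(gfp_t gfp)
{
	return model_gfpflags_allow_blocking(gfp);
}
```

Under these assumptions MODEL_GFP_KERNEL works out to 0xcc0, which is consistent with the value seen in x23 and x5 in the register dump above. takes_remap_path returns true for MODEL_GFP_KERNEL and false for MODEL_GFP_ATOMIC, so an atomic allocation would not reach the vmap call that trips the BUG.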

Looking at gfp.h

 * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
 * watermark is applied to allow access to "atomic reserves".
 * The current implementation doesn't support NMI and few other strict
 * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT.
 *
 * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
 * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.

To me, this suggests that GFP_ATOMIC should be used in interrupt contexts.

Can someone at Nvidia look at the implementation of tegra_se_ccm_ctr in tegra-se-nvhost.c and confirm whether or not the call to dma_alloc_coherent would be problematic from an interrupt context and is the root of this kernel panic?

Hi,
For information, do you hit the issue while using the WiFi function? The prints look network-related, and we would like to know in which use case you hit the issue.

We observed this problem on Jetpack 5.0.2 and 5.1, on both the Orin Dev-Kit and a custom carrier, and with different WiFi cards (stock RTL8822CE and Intel 8265). Unfortunately the problem is not deterministic and occurs at random times.
This is the first time we were able to catch the fault on the console, so we can’t tell whether it is always related to the network. In this case the fault occurred when the WiFi driver called the crypto API for CCMP decryption on the receive path, and the crash happened inside tegra_se_aes_ccm_decrypt() and tegra_se_ccm_ctr().
We’ve found evidence that the tegra-se crypto module had similar issues in the past.

Hi,
We have the WiFi cards and can set them up for a try. Please share the steps so that we can set up the developer kit, try to replicate the issue, and debug it further.

Just a note on the Possible bug in CryptoAPI due to tegra_xxx functions

The first suggested patch does not work: the panic no longer happens in tegra_se_send_sha_data, but it moves to other functions that use the sg APIs (tested on BSP R32.7.1).

The only solution for me was to change the priority of the algorithm (second suggested patch).

It may not be worth checking patches from years ago. AGX Orin has already moved to rel-35, while those old posts are about rel-32, which uses kernel 4.9.

Please treat it as a new issue and share the steps to reproduce with us.

We are using the following Intel card as a WiFi client via wpa_supplicant

0007:01:00.0 Network controller: Intel Corporation Wireless 8265 / 8275 (rev 78)

We are running wpa_supplicant on our wlan interface

/sbin/wpa_supplicant -c/etc/wpa_supplicant/wpa_supplicant.conf -Dnl80211 -iwlan0

Our wpa_supplicant.conf just has a couple network blocks that look like this

network={
ssid="<omitted>"
priority=1
freq_list=2412 2417 2422 2427 2432 2437 2442 2447 2452 2457 2462
psk=<omitted>
}

We are not doing anything special to reproduce this issue. We’ve managed to reproduce it after leaving a system idle for 12-24 hours. Sometimes it happens after the system has been running for less than an hour. We don’t have any clear way to reproduce it.

Thanks for the reply.
We’ll set up devices and see if we can catch the same error.

We believe that triggering the buggy code path depends to some extent on the access point configuration, specifically on CCMP being the cipher chosen during key negotiation.

wpa_supplicant[509]: wlan0: WPA: Key negotiation completed with xx:xx:xx:xx:xx:xx [PTK=CCMP GTK=CCMP]

A week ago, when I reported this issue, we patched the kernel, replacing GFP_KERNEL with in_interrupt() ? GFP_ATOMIC : GFP_KERNEL in the various DMA function calls inside tegra-se-nvhost.c.
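A minimal sketch of that substitution, written as userspace C so it compiles outside the kernel. In the kernel, in_interrupt() comes from the kernel headers; here it is replaced by a stub (model_in_interrupt) driven by a test variable. The helper name se_dma_gfp, the MODEL_ constants, and their numeric values are assumptions for illustration and are not copied from the actual patch.

```c
#include <stdbool.h>

/* Userspace model only; values are ASSUMED to mirror a 5.10 gfp.h. */
typedef unsigned int gfp_t;
#define MODEL_GFP_KERNEL 0xcc0u  /* may sleep (direct reclaim allowed) */
#define MODEL_GFP_ATOMIC 0xa20u  /* must not sleep */

/* Stub standing in for the kernel's in_interrupt(); tests flip this flag. */
static bool model_in_interrupt_state;

static bool model_in_interrupt(void)
{
	return model_in_interrupt_state;
}

/* Hypothetical helper: pick allocation flags based on the calling context,
 * i.e. the expression in_interrupt() ? GFP_ATOMIC : GFP_KERNEL. */
static gfp_t se_dma_gfp(void)
{
	return model_in_interrupt() ? MODEL_GFP_ATOMIC : MODEL_GFP_KERNEL;
}
```

In the driver, the result of such an expression would be passed as the last argument to dma_alloc_coherent in place of the hard-coded GFP_KERNEL. This only illustrates the pattern; it has not been reviewed as a fix.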

We have not had a crash since, so it may have resolved the issue. We can submit a kernel patch if Nvidia can review it and confirm it is an appropriate fix. I’d like some confirmation on this.

Hi,

We’ve run the system for more than 60 hours and have not seen any kernel panic.
It would be great if you can share the patch with us, and we’ll have our developer review it.
Thanks.

Attached is the patch that we applied. Note that we changed every instance where GFP_KERNEL was used as an argument to the dma_xxx APIs. That was likely overkill, but we are not sure which code paths can be reached from interrupt context and which cannot.

kernel-tegra-se.patch (16.8 KB)
