Orin tegra_uart_tx_dma_complete kernel panic on JetPack 5.1.2

Hi, NVIDIA team.

I encountered a system crash that appears to be caused by UART DMA.

We are using JetPack 5.1.2 with the RT patch applied:

./kernel-5.10/scripts/rt-patch.sh apply-patches

We are using UART2 (/dev/ttyTHS4):

#sudo stty -F /dev/ttyTHS4
speed 460800 baud; line = 0;
intr = <undef>; quit = <undef>; erase = <undef>; kill = <undef>; eof = <undef>; start = <undef>; stop = <undef>; susp = <undef>; rprnt = <undef>; werase = <undef>; lnext = <undef>; discard = <undef>; min = 0; time = 0;
-icrnl -imaxbel
-opost -onlcr
-isig -icanon -iexten -echo -echoe -echok -echoctl -echoke

When reading data, the system hangs:

hexdump /dev/ttyTHS4

Kernel panic log:

orin-master login: [  179.857847] tegra-gpcdma 2600000.gpcdma: DMA pause timed out
[  179.862160] tegra-gpcdma 2600000.gpcdma: DMA pause timed out
[  179.867058] tegra-gpcdma 2600000.gpcdma: slave id already in use
[  179.867061] serial-tegra 3140000.serial: Not able to get desc for Tx
[  179.872078] tegra-gpcdma 2600000.gpcdma: slave id already in use
[  179.872080] serial-tegra 3140000.serial: Not able to get desc for Tx
[  179.877056] tegra-gpcdma 2600000.gpcdma: slave id already in use
[  179.877058] serial-tegra 3140000.serial: Not able to get desc for Tx
[  180.257859] tegra-gpcdma 2600000.gpcdma: DMA pause timed out
[  180.277153] tegra-gpcdma 2600000.gpcdma: DMA pause timed out
[  180.457878] tegra-gpcdma 2600000.gpcdma: DMA pause timed out
[  180.457996] tegra-gpcdma 2600000.gpcdma: DMA pause timed out
[  180.458115] tegra-gpcdma 2600000.gpcdma: DMA pause timed out
[  180.458234] tegra-gpcdma 2600000.gpcdma: DMA pause timed out
[  180.462153] tegra-gpcdma 2600000.gpcdma: DMA pause timed out
[  180.467089] tegra-gpcdma 2600000.gpcdma: slave id already in use
[  180.467093] serial-tegra 3140000.serial: Not able to get desc for Tx
[  180.682151] tegra-gpcdma 2600000.gpcdma: DMA pause timed out
[  180.697148] tegra-gpcdma 2600000.gpcdma: DMA pause timed out
[  184.250476] tegra-gpcdma 2600000.gpcdma: DMA pause timed out
[  184.250501] tegra-gpcdma 2600000.gpcdma: slave id already in use
[  184.250504] serial-tegra 3140000.serial: Not able to get desc for Tx
[  184.250520] tegra-gpcdma 2600000.gpcdma: slave id already in use
[  184.250521] serial-tegra 3140000.serial: Not able to get desc for Tx
[  184.251488] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000004
[  184.251494] Mem abort info:
[  184.251495]   ESR = 0x96000004
[  184.251496]   EC = 0x25: DABT (current EL), IL = 32 bits
[  184.251498]   SET = 0, FnV = 0
[  184.251499]   EA = 0, S1PTW = 0
[  184.251500] Data abort info:
[  184.251500]   ISV = 0, ISS = 0x00000004
[  184.251501]   CM = 0, WnR = 0
[  184.251502] user pgtable: 4k pages, 48-bit VAs, pgdp=000000019e16a000
[  184.251504] [0000000000000004] pgd=0000000000000000, p4d=0000000000000000
[  184.251508] Internal error: Oops: 96000004 [#1] PREEMPT_RT SMP
[  184.251511] Modules linked in: fuse xt_conntrack spidev xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_addrtype iptable_filter br_netfilter lzo_rle lzo_compress mttcan can_dev can_raw zram can overlay ramoops reed_solomon loop snd_soc_tegra186_dspk snd_soc_tegra210_ope snd_soc_tegra186_asrc snd_soc_tegra186_arad snd_soc_tegra210_iqc snd_soc_tegra210_mvc snd_soc_tegra210_afc snd_soc_tegra210_dmic snd_soc_tegra210_adx snd_soc_tegra210_amx snd_soc_tegra210_i2s snd_soc_tegra210_mixer snd_soc_tegra210_admaif snd_soc_tegra210_sfc snd_soc_tegra_pcm aes_ce_blk crypto_simd cryptd aes_ce_cipher ghash_ce sha2_ce sha256_arm64 sha1_ce snd_soc_spdif_tx snd_soc_tegra_machine_driver binfmt_misc nct1008 i2c_nvvrs11 snd_soc_tegra210_adsp cam_cdi_tsc userspace_alert nv_hawk_owl snd_soc_tegra_utils tegra_bpmp_thermal max96712 snd_soc_simple_card_utils snd_soc_tegra210_ahub nvadsp tegra210_adma snd_hda_codec_hdmi snd_soc_rt5640
[  184.251565]  snd_soc_rl6231 snd_hda_tegra snd_hda_codec snd_hda_core spi_tegra114 ina3221 pwm_fan nvgpu nvmap ip_tables x_tables [last unloaded: mtd]
[  184.251576] CPU: 0 PID: 3815 Comm: irq/116-gpcdma. Not tainted 5.10.120-rt70-tegra #1
[  184.251579] Hardware name: Unknown Jetson AGX Orin Developer Kit/Jetson AGX Orin Developer Kit, BIOS 4.1-33958178 08/01/2023
[  184.251581] pstate: 20c00009 (nzCv daif +PAN +UAO -TCO BTYPE=--)
[  184.251583] pc : tegra_uart_tx_dma_complete+0x5c/0xe0
[  184.251593] lr : tegra_uart_tx_dma_complete+0x4c/0xe0
[  184.251595] sp : ffff800011acbb60
[  184.251596] x29: ffff800011acbb60 x28: ffffdf1e50ef8000
[  184.251598] x27: 0000000000000007 x26: ffff800011acbc18
[  184.251599] x25: ffff5f6993f2bb00 x24: dead000000000100
[  184.251601] x23: dead000000000122 x22: ffff5f6993f2bb00
[  184.251602] x21: ffff5f69843b69a0 x20: 00000000000000b0
[  184.251604] x19: ffff5f69873ed480 x18: ffff5f69e202e794
[  184.251605] x17: 0000000000000000 x16: 00000000000000f5
[  184.251606] x15: ffff5f69e202e884 x14: 0000000000000d5b
[  184.251608] x13: 0000000000004bcd x12: 0000000000000024
[  184.251609] x11: 071c71c71c71c71c x10: 0000000000000ab0
[  184.251610] x9 : ffff800011acbce0 x8 : ffff5f6993f2c610
[  184.251612] x7 : 000000000000e7c7 x6 : 00000000215f1c29
[  184.251613] x5 : 00ffffffffffffff x4 : 0000000000000000
[  184.251615] x3 : 0000000000000000 x2 : 0000000000000000
[  184.251616] x1 : 00000000000002cd x0 : ffff5f69873ed480
[  184.251618] Call trace:
[  184.251620]  tegra_uart_tx_dma_complete+0x5c/0xe0
[  184.251622]  vchan_complete+0x1fc/0x230
[  184.251626]  tasklet_action_common.isra.0+0x10c/0x150
[  184.251629]  tasklet_action+0x30/0x40
[  184.251631]  __do_softirq+0x120/0x3b4
[  184.251633]  __local_bh_enable_ip+0xdc/0x140
[  184.251635]  irq_forced_thread_fn+0x88/0xc0
[  184.251638]  irq_thread+0x188/0x280
[  184.251639]  kthread+0x180/0x1b0
[  184.251643]  ret_from_fork+0x10/0x24
[  184.251647] Code: b94043e3 f9414a62 aa1303e0 b942aa74 (b9400441)
[  184.251650] ---[ end trace 0000000000000002 ]---
[  184.257823] tegra-gpcdma 2600000.gpcdma: DMA pause timed out
[  184.257941] tegra-gpcdma 2600000.gpcdma: DMA pause timed out
[  184.257953] tegra-gpcdma 2600000.gpcdma: slave id already in use
[  184.257955] serial-tegra 3140000.serial: Not able to get desc for Tx
[  184.257970] tegra-gpcdma 2600000.gpcdma: slave id already in use
[  184.257971] serial-tegra 3140000.serial: Not able to get desc for Tx
[  184.262064] tegra-gpcdma 2600000.gpcdma: slave id already in use
[  184.262067] serial-tegra 3140000.serial: Not able to get desc for Tx
[  184.267006] tegra-gpcdma 2600000.gpcdma: slave id already in use
[  184.267008] serial-tegra 3140000.serial: Not able to get desc for Tx
[  184.267014] tegra-gpcdma 2600000.gpcdma: slave id already in use
[  184.267014] serial-tegra 3140000.serial: Not able to get desc for Tx
[  184.657263] Kernel panic - not syncing:

The full log is attached below.
Please help me find out how to solve this problem.
debug.log (95.7 KB)
Thank you.
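
For anyone reading the oops above, the ESR value and the faulting instruction can be decoded by hand. The sketch below is only an illustration: it assumes the standard ARMv8-A ESR_ELx field layout for data aborts and the A64 encoding of LDR (immediate, unsigned offset). The function names are made up for this example, and nothing here is taken from the Tegra driver source.

```python
def decode_esr(esr: int) -> dict:
    """Split an ESR_ELx value into the fields the oops prints."""
    iss = esr & 0x1FFFFFF              # instruction specific syndrome, bits [24:0]
    return {
        "ec": (esr >> 26) & 0x3F,      # exception class, bits [31:26]
        "il": (esr >> 25) & 0x1,       # 1 = 32-bit instruction
        "iss": iss,
        "wnr": (iss >> 6) & 0x1,       # 0 = read access, 1 = write access
        "dfsc": iss & 0x3F,            # data fault status code
    }


def decode_ldr_unsigned_offset(insn: int) -> dict:
    """Decode a 32-bit or 64-bit A64 LDR (immediate, unsigned offset)."""
    size = (insn >> 30) & 0x3          # 2 = Wt (32-bit), 3 = Xt (64-bit)
    is_ldr = ((insn >> 24) & 0x3F) == 0x39 and ((insn >> 22) & 0x3) == 0x1
    if not is_ldr or size < 2:
        raise ValueError("not a 32/64-bit LDR (immediate, unsigned offset)")
    return {
        "width": 8 << size,            # register width in bits
        "rt": insn & 0x1F,             # destination register number
        "rn": (insn >> 5) & 0x1F,      # base register number
        "offset": ((insn >> 10) & 0xFFF) << size,  # imm12 scaled by access size
    }


# Values copied from the oops: "ESR = 0x96000004" and the "Code:" line.
print(decode_esr(0x96000004))                  # ec is 0x25, iss is 0x4, wnr is 0
print(decode_ldr_unsigned_offset(0xB9400441))  # ldr w1, [x2, #4]
print(decode_ldr_unsigned_offset(0xF9414A62))  # ldr x2, [x19, #0x290]
```

With x2 = 0 in the register dump, `ldr w1, [x2, #4]` reads address 0x4, which matches the reported fault address. The earlier word `f9414a62` loads x2 from offset 0x290 of the structure x19 points to, so the NULL value appears to come from a pointer field in that structure. Which field that is cannot be told from the log alone.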

Hi,

A few questions here:

  1. Would it also happen on the stock kernel?
  2. What kind of UART device is being used? Would it affect the behavior?
  3. Is it a DevKit or a custom carrier board?

Hi,

1. It has never happened on the stock kernel.

2. It is a serial port interface on a Xilinx MPSoC, which outputs positioning information for calculation.
BTW, the serial port output data volume is relatively large. Could it be that the DMA cannot process it in time?

3. A custom carrier board.
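
On the data-volume question, the upper bound on the UART byte rate is easy to estimate. The sketch below assumes 8N1 framing (1 start bit, 8 data bits, no parity, 1 stop bit), which the stty output above does not state explicitly; the function name is only illustrative.

```python
def max_bytes_per_second(baud: int, data_bits: int = 8,
                         parity_bits: int = 0, stop_bits: int = 1) -> float:
    """Upper bound on payload bytes per second for an asynchronous serial line."""
    bits_per_frame = 1 + data_bits + parity_bits + stop_bits  # leading 1 = start bit
    return baud / bits_per_frame


rate = max_bytes_per_second(460800)
print(f"{rate:.0f} bytes/s")  # 46080 bytes/s with 8N1 framing
```

This only gives the ceiling for a saturated link at 460800 baud. It says nothing about whether the DMA path keeps up, but it is a useful number to compare against the sender's actual output rate.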

Then can you try other UART devices to see if it’s caused by the amount of data?

Would it be reproducible on the DevKit?

I can test it and update the results later.

But as I understand it, this error looks like a scheduling issue. Can you analyze the logs I provided? What else can I do to help solve this problem?

We haven’t tested this yet. I don’t think it’s a board issue; it looks like a system issue.

Hello @DaveYYY,
Is there any update on this issue?
Thanks.

Hi,

Sorry for the late reply.
I thought we would proceed after your reply on this:

Have you tested with other UART devices?
Is the issue reproducible regardless of the amount of data?

Hi @DaveYYY,

I used a PC to send 127 bytes of data through a USB serial port, but the issue did not recur.

BTW,
When using minicom to read data, the “serial-tegra 3140000.serial: Not able to get desc for Tx” message does not appear. The issue did not recur after running for 20 hours.

At the project site, the problem occurs randomly.

These are the tty attributes when using minicom:

nvidia@orin-master:~$ sudo stty -F /dev/ttyTHS4
speed 460800 baud; line = 0;
min = 1; time = 5;
ignbrk -brkint -icrnl -imaxbel
-opost -onlcr
-isig -icanon -iexten -echo -echoe -echok -echoctl -echoke

When using cat to read data, the system crashes after running for 2 minutes.
These are the tty attributes when using cat:

nvidia@orin-master:~$ sudo stty -F /dev/ttyTHS4
speed 460800 baud; line = 0;
-brkint -imaxbel

Why is there a TX error message when using cat to read data?

Thank you for your help.

BTW,
If minicom is used to open the serial port node first and cat is then used to read data, there is no TX error.
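
One possible explanation for the Tx messages, offered as an assumption and not a confirmed diagnosis: the stty output in the cat case lists only `-brkint -imaxbel`, so `echo`, `icanon` and `opost` are still at their enabled defaults, and with `echo` on the line discipline writes received bytes back out on TX. minicom clears these flags, and the settings appear to persist on the port, which would fit the observation that cat is quiet after minicom has been used. The sketch below uses Python's standard `termios` module to clear the same flags before reading. The function names are illustrative, and the baud rate is deliberately left untouched.

```python
import os
import termios


def disable_echo_and_cooking(fd: int) -> None:
    """Clear echo, canonical mode and output post-processing on a tty."""
    attrs = termios.tcgetattr(fd)  # [iflag, oflag, cflag, lflag, ispeed, ospeed, cc]
    attrs[0] &= ~(termios.ICRNL | termios.IXON)   # no CR->NL mapping, no XON/XOFF
    attrs[1] &= ~termios.OPOST                    # no output post-processing
    attrs[3] &= ~(termios.ECHO | termios.ECHOE | termios.ECHOK |
                  termios.ICANON | termios.ISIG | termios.IEXTEN)
    attrs[6][termios.VMIN] = 1                    # read() returns after >= 1 byte
    attrs[6][termios.VTIME] = 0                   # no inter-byte timeout
    termios.tcsetattr(fd, termios.TCSANOW, attrs)


def echo_enabled(fd: int) -> bool:
    """Return True if the tty would echo received input back to the sender."""
    return bool(termios.tcgetattr(fd)[3] & termios.ECHO)


def read_raw(path: str, nbytes: int = 4096) -> bytes:
    """Open a serial device, switch off echo and cooking, and read once."""
    fd = os.open(path, os.O_RDWR | os.O_NOCTTY)
    try:
        disable_echo_and_cooking(fd)
        return os.read(fd, nbytes)
    finally:
        os.close(fd)


# Hypothetical usage on the port from this thread (needs access to the device):
#   data = read_raw("/dev/ttyTHS4")
```

Note that there is still a short window between opening the device and applying the new settings during which the old flags are in effect, so configuring the port once up front and then starting the reader may be the safer order.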

Well, I don’t think you should use cat to read UART…
Please just use other tools that are designed to be used with serial data.

OK.
I am currently testing with minicom, and I will reply again when the problem recurs. At the project site, the problem occurs randomly.
Thank you!

Hi @DaveYYY,
When I am not using cat, there are no TX error messages.
However, there is still an occasional reboot, and the debug log shows no abnormalities at the time of the reboot.

orin-master login: nvidia
Password:
Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.10.120-rt70-tegra aarch64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

This system has been minimized by removing packages and content that are
not required on a system that users do not log into.

To restore this content, you can run the 'unminimize' command.

Expanded Security Maintenance for Applications is not enabled.

18 updates can be applied immediately.
To see these additional updates run: apt list --upgradable

60 additional security updates can be applied with ESM Apps.
Learn more about enabling ESM Apps service at https://ubuntu.com/esm


The list of available updates is more than a week old.
To check for new updates run: sudo apt update
Last login: Wed Oct 25 11:54:13 CST 2023 from 10.27.87.242 on pts/0
nvidia@orin-master:~$
nvidia@orin-master:~$
nvidia@orin-master:~$
nvidia@orin-master:~$ start
-bash: start: command not found
nvidia@orin-master:~$ ^@ÿâ
[0000.062] I> MB1 (version: 1.2.0.0-t234-54845784-562369e5)
[0000.067] I> t234-A01-0-Silicon (0x12347) Prod
[0000.071] I> Boot-mode : Coldboot
[0000.075] I> Entry timestamp: 0x00000000
[0000.078] I> last_boot_error: 0x0
[0000.082] I> BR-BCT: preprod_dev_sign: 0
[0000.085] I> rst_source: 0x2, rst_level: 0x1
[0000.089] I> Task: SE error check
[0000.093] I> Task: Bootchain select WAR set
[0000.097] I> Task: Enable SLCG
[0000.099] I> Task: CRC check
[0000.102] I> Skip FUSE records CRC check as records_integrity fuse is not burned
[0000.109] I> Task: Initialize MB2 params
[0000.114] I> MB2-params @ 0x40060000
[0000.117] I> Task: Crypto init
[0000.120] I> Task: Perform MB1 KAT tests
[0000.124] I> Task: NVRNG health check
[0000.127] I> NVRNG: Health check success
[0000.131] I> Task: MSS Bandwidth limiter settings for iGPU clients
[0000.137] I> Task: Enabling and initialization of Bandwidth limiter
[0000.143] I> No request to configure MBWT settings for any PC!
[0000.149] I> Task: Secure debug controls
[0000.153] I> Task: strap war set
[0000.156] I> Task: Initialize SOC Therm
[0000.160] I> Task: Program NV master stream id
[0000.164] I> Task: Verify boot mode
[0000.170] I> Task: Alias fuses
[0000.173] W> FUSE_ALIAS: Fuse alias on production fused part is not supported.

The Orin power supply did not change either.
Do you have any information that could help me?

Thanks!
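
When the kernel log is clean before a reset, the bootloader output that follows is often the only clue left. The MB1 lines above already carry `Boot-mode`, `last_boot_error`, `rst_source` and `rst_level`; the sketch below merely pulls those fields out of a captured console log so that several reboots can be compared side by side. The regular expressions are written against the exact lines shown in this thread and the function name is illustrative. What the values mean (for example `rst_source: 0x2`) is not stated anywhere in this thread and would have to come from NVIDIA.

```python
import re

# Patterns match the MB1 lines exactly as they appear in the capture above.
_PATTERNS = {
    "boot_mode": re.compile(r"Boot-mode\s*:\s*(\w+)"),
    "last_boot_error": re.compile(r"last_boot_error:\s*(0x[0-9a-fA-F]+)"),
    "rst_source": re.compile(r"rst_source:\s*(0x[0-9a-fA-F]+)"),
    "rst_level": re.compile(r"rst_level:\s*(0x[0-9a-fA-F]+)"),
}


def extract_reset_info(console_log: str) -> dict:
    """Return the first occurrence of each reset-related MB1 field as a string."""
    info = {}
    for line in console_log.splitlines():
        for key, pattern in _PATTERNS.items():
            match = pattern.search(line)
            if match and key not in info:
                info[key] = match.group(1)
    return info


sample = """\
[0000.071] I> Boot-mode : Coldboot
[0000.078] I> last_boot_error: 0x0
[0000.085] I> rst_source: 0x2, rst_level: 0x1
"""
print(extract_reset_info(sample))
```

Only the first boot in a capture is reported, so a log containing several reboots would need to be split at each `MB1 (version: ...)` line first.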
