Serial-Tegra DMA Driver Bug - memory leak

We are observing what looks like the same issue as the "Serial-Tegra DMA Driver Bug" report: the kernel consistently crashes around 113 min (+/- 1 min) after boot. Our kernel already has the change from the other report.
This is the kernel dump:

[ 6775.783301] serial-tegra 3100000.serial: RxData DMA copy to tty layer failed
[ 6776.288718] serial-tegra 3100000.serial: RxData DMA copy to tty layer failed
[ 6776.800880] serial-tegra 3100000.serial: RxData DMA copy to tty layer failed
[ 6777.312518] serial-tegra 3100000.serial: RxData DMA copy to tty layer failed
[ 6777.824497] serial-tegra 3100000.serial: RxData DMA copy to tty layer failed
[ 6778.336482] serial-tegra 3100000.serial: RxData DMA copy to tty layer failed
[ 6778.848275] serial-tegra 3100000.serial: RxData PIO to tty layer failed
[ 6778.848524] tegra-gpcdma 2600000.gpcdma: slave id already in use
[ 6778.848724] serial-tegra 3100000.serial: Not able to get desc for Rx
[ 6778.848943] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000004
[ 6778.848946] Mem abort info:
[ 6778.848948]   ESR = 0x96000004
[ 6778.848952]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 6778.848954]   SET = 0, FnV = 0
[ 6778.848956]   EA = 0, S1PTW = 0
[ 6778.848958] Data abort info:
[ 6778.848960]   ISV = 0, ISS = 0x00000004
[ 6778.848962]   CM = 0, WnR = 0
[ 6778.848967] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000141731000
[ 6778.848969] [0000000000000004] pgd=0000000000000000, p4d=0000000000000000
[ 6778.848980] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 6778.848984] Modules linked in: overlay hid_logitech_hidpp input_leds 8021q garp mrp snd_soc_tegra186_asrc snd_soc_tegra186_arad snd_soc_tegra210_ope snd_soc_tegra186_dspk snd_soc_tegra210_iqc snd_soc_tegra210_mvc snd_soc_tegra210_afc snd_soc_tegra210_amx snd_soc_tegra210_adx snd_soc_tegra210_dmic snd_soc_tegra210_admaif snd_soc_tegra210_mixer snd_soc_tegra210_i2s snd_soc_tegra210_sfc snd_soc_tegra_pcm aes_ce_blk crypto_simd qmi_wwan cryptd hid_logitech_dj aes_ce_cipher ghash_ce cdc_wdm sha2_ce sha256_arm64 cdc_acm qcserial sha1_ce usb_wwan snd_soc_tegra_machine_driver snd_soc_tegra210_adsp snd_soc_spdif_tx ftdi_sio userspace_alert usbserial snd_soc_tegra_utils snd_soc_simple_card_utils snd_hda_codec_hdmi mttcan brcmfmac(O) inv_icm42600_i2c can_dev inv_icm42600 nvadsp kfifo_buf at24 opt3004 snd_soc_tegra210_ahub tegra_bpmp_thermal snd_hda_tegra tegra210_adma snd_hda_codec cfg80211(O) snd_hda_core r8168 compat(O) brcmutil(O) pwm_fan nvidia_drm(O) nvidia_modeset(O) nvidia(O) nvgpu nvmap
[ 6778.849118]  ina3221 fuse spi_tegra114 tpm_tis_spi tpm_tis_core tpm rng_core
[ 6778.849138] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O      5.10.120-l4t-r35.4.ga+g76678311c10b #1
[ 6778.849141] Hardware name: Unknown NVIDIA Orin NX Developer Kit/NVIDIA Orin NX Developer Kit, BIOS v35.4.1 08/04/2023
[ 6778.849146] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)
[ 6778.849167] pc : tegra_uart_rx_buffer_push.constprop.0+0x38/0x190
[ 6778.849171] lr : tegra_uart_rx_buffer_push.constprop.0+0x30/0x190
[ 6778.849173] sp : ffff800010003610
[ 6778.849175] x29: ffff800010003610 x28: ffffd9194b102680
[ 6778.849180] x27: 0000000000000002 x26: 0000000000000005
[ 6778.849184] x25: ffff5c6a01dd4000 x24: ffffffffffffffca
[ 6778.849188] x23: 0000000000000001 x22: ffff5c6a01dd4000
[ 6778.849192] x21: ffff5c6a466af000 x20: 0000000000000fe0
[ 6778.849197] x19: ffff5c6a04dec880 x18: ffffffffffffffff
[ 6778.849201] x17: 0000000000000000 x16: 0000000000000000
[ 6778.849205] x15: ffff8000900038f7 x14: 0000000000000006
[ 6778.849210] x13: ffff8000100038ff x12: ffffd9194b0f6000
[ 6778.849214] x11: 0000000000000040 x10: ffffd9194b187b18
[ 6778.849219] x9 : ffffd9194b187b10 x8 : ffff5c6a00401940
[ 6778.849224] x7 : ffff000000000000 x6 : 0000000000000001
[ 6778.849228] x5 : 0000000000000001 x4 : ffff5c6d6e77c200
[ 6778.849232] x3 : 0000000000000000 x2 : ffffd91949672640
[ 6778.849236] x1 : 0000000000000000 x0 : ffff5c6a466af000
[ 6778.849241] Call trace:
[ 6778.849246]  tegra_uart_rx_buffer_push.constprop.0+0x38/0x190
[ 6778.849251]  tegra_uart_terminate_rx_dma.part.0+0x80/0xc4
[ 6778.849255]  tegra_uart_isr+0x41c/0x4bc
[ 6778.849264]  __handle_irq_event_percpu+0x68/0x2a0
[ 6778.849268]  handle_irq_event+0x70/0x150
[ 6778.849271]  handle_fasteoi_irq+0xac/0x1f4
[ 6778.849277]  __handle_domain_irq+0x88/0xf0
[ 6778.849283]  gic_handle_irq+0xd0/0x150
[ 6778.849286]  el1_irq+0xd0/0x180
[ 6778.849290]  console_unlock+0x3bc/0x584
[ 6778.849294]  vprintk_emit+0x124/0x290
[ 6778.849300]  dev_vprintk_emit+0x140/0x178
[ 6778.849303]  dev_printk_emit+0x84/0xb4
[ 6778.849306]  __dev_printk+0x60/0x88
[ 6778.849308]  _dev_err+0x74/0xa0
[ 6778.849312]  tegra_uart_start_rx_dma.part.0.isra.0+0x110/0x120
[ 6778.849316]  tegra_uart_rx_error_handle_timer+0xc8/0xd0
[ 6778.849323]  call_timer_fn+0x3c/0x200
[ 6778.849328]  __run_timers.part.0+0x21c/0x300
[ 6778.849332]  run_timer_softirq+0x44/0x80
[ 6778.849335]  __do_softirq+0x128/0x3f4
[ 6778.849341]  irq_exit+0xe0/0x100
[ 6778.849345]  __handle_domain_irq+0x8c/0xf0
[ 6778.849348]  gic_handle_irq+0xd0/0x150
[ 6778.849351]  el1_irq+0xd0/0x180
[ 6778.849358]  cpuidle_enter_state+0xbc/0x404
[ 6778.849362]  cpuidle_enter+0x40/0x54
[ 6778.849368]  do_idle+0x220/0x2b0
[ 6778.849373]  cpu_startup_entry+0x30/0x80
[ 6778.849378]  rest_init+0xdc/0xe8
[ 6778.849384]  arch_call_rest_init+0x18/0x20
[ 6778.849388]  start_kernel+0x504/0x53c
[ 6778.849395] Code: aa1603e0 97ff2919 f9413a61 aa0003f5 (b9400420)
[ 6778.849407] ---[ end trace bd7373de45b7021b ]---
[ 6778.849411] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[ 6778.849415] SMP: stopping secondary CPUs
[ 6778.849608] Kernel Offset: 0x5919394c0000 from 0xffff800010000000
[ 6778.849609] PHYS_OFFSET: 0xffffa39700000000
[ 6778.849613] CPU features: 0x08040006,4a80aa38
[ 6778.849615] Memory Limit: none
[ 6779.217899] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
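
For anyone reading the dump, here is a small sketch that decodes the ESR value by hand. The bit positions are the standard AArch64 ESR_ELx layout; the helper name `decode_esr` is made up for this post and is not a kernel function. The result agrees with the kernel's own decode above (EC = 0x25, WnR = 0, i.e. a data abort on a read). Together with the fault address 0x4, x1 being 0 in the register dump, and the faulting instruction word `b9400420` (which appears to decode as `ldr w0, [x1, #4]`), this looks like a read of a field at offset 4 through a NULL pointer. That is only my reading of the dump, not a confirmed root cause.

```python
# Decode the ESR value printed in the oops above.
ESR = 0x96000004


def decode_esr(esr: int) -> dict:
    """Split an AArch64 ESR value into the fields the kernel prints."""
    ec = (esr >> 26) & 0x3F       # exception class, bits [31:26]
    il = (esr >> 25) & 0x1        # instruction length bit, bit 25
    iss = esr & 0x1FFFFFF         # instruction-specific syndrome, bits [24:0]
    return {
        "ec": ec,
        "il": il,
        "iss": iss,
        "wnr": (iss >> 6) & 0x1,  # data aborts only: 0 = read, 1 = write
        "dfsc": iss & 0x3F,       # data fault status code
    }


fields = decode_esr(ESR)
print({name: hex(value) for name, value in fields.items()})
```

A DFSC of 0x04 is a level 0 translation fault, which fits the `pgd=0000000000000000` line in the dump.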

This change is already in the kernel:

$ git show d5b90d6b9365
commit d5b90d6b9365250adb73b2fe5b52a5228df3b1d9
Author: Akhil R <akhilrajeev@nvidia.com>
Date:   Wed Dec 14 20:14:29 2022 +0530

    dmaengine: tegra: Fix memory leak in terminate_all()
    
    Terminate vdesc when terminating an ongoing transfer.
    This will ensure that the vdesc is present in the desc_terminated list
    The descriptor will will be freed later in desc_free_list().
    
    This fixes the memory leaks which happens whern terminating an ongoing
    transfer.
    
    Bug 3787456
    
    Signed-off-by: Akhil R <akhilrajeev@nvidia.com>
    Change-Id: I44d5c7fedead91b5498ca422124d66da9e383a80
    Reviewed-on: https://git-master.nvidia.com/r/c/linux-5.10/+/2827943
    (cherry picked from commit 25783210ab30d29c567853c0e14bf998dbb8433f)
    Reviewed-on: https://git-master.nvidia.com/r/c/linux-5.10/+/2934196
    Reviewed-by: Bibek Basu <bbasu@nvidia.com>
    GVS: Gerrit_Virtual_Submit <buildbot_gerritrpt@nvidia.com>
    Tested-by: Bibek Basu <bbasu@nvidia.com>

diff --git a/drivers/dma/tegra-gpc-dma.c b/drivers/dma/tegra-gpc-dma.c
index 6f9ebd987801..b99f21054fa2 100644
--- a/drivers/dma/tegra-gpc-dma.c
+++ b/drivers/dma/tegra-gpc-dma.c
@@ -696,6 +696,7 @@ static int tegra_dma_terminate_all(struct dma_chan *dc)
                        return err;
                }
 
+               vchan_terminate_vdesc(&tdc->dma_desc->vd);
                tegra_dma_disable(tdc);
                tdc->dma_desc = NULL;
        }

We are using Yocto as our build system, and here is the exact branch

FYI: our system has a lot of UART devices.

What’s the BSP version?

cat /etc/nv_tegra_release

Hi harvey_zhang,

Is it on the devkit or a custom board?

Please also share the detailed steps to reproduce this kernel panic.

cat /etc/nv_tegra_release
# R35 (release), REVISION: 4.1, GCID: 33958178, BOARD: t186ref, EABI: aarch64, DATE: Tue Aug  1 19:57:35 UTC 2023

We are using an Orin NX 16GB SOM on our custom board. I don’t do anything special, just boot it up and let it sit there. We do have a lot of peripherals using UART, and some of them are constantly sending status/logs to the NX (like the GPS and the MCU).

Since the MCU can talk to the Orin NX over UART, I put a while loop in the MCU firmware to keep spamming the UART, and the very same crash happens much faster (about 10 minutes after boot in the log below).

[  608.354334] serial-tegra 3100000.serial: RxData DMA copy to tty layer failed
[  608.894157] serial-tegra 3100000.serial: RxData DMA copy to tty layer failed
[  609.404309] serial-tegra 3100000.serial: RxData DMA copy to tty layer failed
[  609.914130] serial-tegra 3100000.serial: RxData DMA copy to tty layer failed
[  610.424292] serial-tegra 3100000.serial: RxData DMA copy to tty layer failed
[  610.934285] serial-tegra 3100000.serial: RxData DMA copy to tty layer failed
[  611.444494] serial-tegra 3100000.serial: RxData DMA copy to tty layer failed
[  611.956112] serial-tegra 3100000.serial: RxData PIO to tty layer failed
[  611.956372] tegra-gpcdma 2600000.gpcdma: slave id already in use
[  611.956565] serial-tegra 3100000.serial: Not able to get desc for Rx
[  611.969262] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000004
[  611.969518] Mem abort info:
[  611.969597]   ESR = 0x96000004
[  611.969690]   EC = 0x25: DABT (current EL), IL = 32 bits
[  611.969840]   SET = 0, FnV = 0
[  611.969927]   EA = 0, S1PTW = 0
[  611.970013] Data abort info:
[  611.970094]   ISV = 0, ISS = 0x00000004
[  611.970204]   CM = 0, WnR = 0
[  611.970295] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000138542000
[  611.970471] [0000000000000004] pgd=0000000000000000, p4d=0000000000000000
[  611.970670] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[  611.970826] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_addrtype iptable_filter ip_tables x_tables br_netfilter overlay 8021q garp mrp aes_ce_blk qcserial ftdi_sio crypto_simd usb_wwan snd_soc_tegra186_asrc snd_soc_tegra186_dspk snd_soc_tegra210_ope snd_soc_tegra186_arad snd_soc_tegra210_iqc snd_soc_tegra210_mvc snd_soc_tegra210_afc snd_soc_tegra210_dmic snd_soc_tegra210_adx snd_soc_tegra210_amx snd_soc_tegra210_admaif snd_soc_tegra210_i2s snd_soc_tegra210_mixer snd_soc_tegra210_sfc snd_soc_tegra_pcm cryptd aes_ce_cipher ghash_ce sha2_ce qmi_wwan cdc_acm sha256_arm64 input_leds cdc_wdm sha1_ce usbserial brcmfmac(O) cfg80211(O) mttcan snd_soc_tegra_machine_driver can_dev snd_soc_spdif_tx userspace_alert inv_icm42600_i2c opt3004 inv_icm42600 compat(O) brcmutil(O) kfifo_buf snd_soc_tegra210_adsp at24 snd_soc_tegra210_ahub snd_soc_tegra_utils snd_hda_codec_hdmi snd_soc_simple_card_utils
[  611.970985]  tegra_bpmp_thermal nvadsp tegra210_adma snd_hda_tegra r8168 snd_hda_codec snd_hda_core pwm_fan nvidia_drm(O) nvidia_modeset(O) nvidia(O) nvgpu nvmap ina3221 fuse spi_tegra114 tpm_tis_spi tpm_tis_core tpm rng_core
[  612.046181] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O      5.10.120-l4t-r35.4.ga+g76678311c10b #1
[  612.056152] Hardware name: Unknown NVIDIA Orin NX Developer Kit/NVIDIA Orin NX Developer Kit, BIOS v35.4.1 08/04/2023
[  612.066916] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)
[  612.072971] pc : tegra_uart_rx_buffer_push.constprop.0+0x38/0x190
[  612.079253] lr : tegra_uart_rx_buffer_push.constprop.0+0x30/0x190
[  612.085552] sp : ffff800010003de0
[  612.088964] x29: ffff800010003de0 x28: ffffc7477a4f2680 
[  612.094475] x27: 0000000000000002 x26: 0000000000000005 
[  612.099987] x25: ffff730041ddf000 x24: ffffffffffffffca 
[  612.105500] x23: 0000000000000001 x22: ffff730041ddf000 
[  612.111012] x21: ffff730086acf000 x20: 0000000000000f50 
[  612.116438] x19: ffff730044dc3080 x18: 0000000000000000 
[  612.121951] x17: 0000000000000000 x16: 0000000000000000 
[  612.127376] x15: 0000000000000000 x14: ffffc7477a4f2680 
[  612.132888] x13: ffffabbc3456c000 x12: 000000003464d91d 
[  612.138312] x11: 0000000000000040 x10: ffffc7477a577b18 
[  612.143826] x9 : ffffc7477a577b10 x8 : ffff730040403010 
[  612.149164] x7 : ffff000000000000 x6 : 0000000000000001 
[  612.154500] x5 : 0000000000000001 x4 : ffff7303ae77a200 
[  612.159927] x3 : 0000000000000000 x2 : ffffc74778a62640 
[  612.165263] x1 : 0000000000000000 x0 : ffff730086acf000 
[  612.170601] Call trace:
[  612.173054]  tegra_uart_rx_buffer_push.constprop.0+0x38/0x190
[  612.178654]  tegra_uart_terminate_rx_dma.part.0+0x80/0xc4
[  612.183990]  tegra_uart_isr+0x41c/0x4bc
[  612.187758]  __handle_irq_event_percpu+0x68/0x2a0
[  612.192479]  handle_irq_event+0x70/0x150
[  612.196240]  handle_fasteoi_irq+0xac/0x1f4
[  612.200265]  __handle_domain_irq+0x88/0xf0
[  612.204292]  gic_handle_irq+0xd0/0x150
[  612.208226]  el1_irq+0xd0/0x180
[  612.211208]  cpuidle_enter_state+0xbc/0x404
[  612.215227]  cpuidle_enter+0x40/0x54
[  612.218906]  do_idle+0x220/0x2b0
[  612.222141]  cpu_startup_entry+0x2c/0x80
[  612.225906]  rest_init+0xdc/0xe8
[  612.229318]  arch_call_rest_init+0x18/0x20
[  612.233340]  start_kernel+0x504/0x53c
[  612.236843] Code: aa1603e0 97ff2919 f9413a61 aa0003f5 (b9400420) 
[  612.242974] ---[ end trace 20677f5e5e9a1a45 ]---
[  612.247689] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[  612.254605] SMP: stopping secondary CPUs
[  612.258813] Kernel Offset: 0x4747688b0000 from 0xffff800010000000
[  612.264837] PHYS_OFFSET: 0xffff8d00c0000000
[  612.268951] CPU features: 0x08040006,4a80aa38
[  612.273325] Memory Limit: none
[  612.276305] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---

@ShaneCCC @KevinFFF do you have any update on this?
It’s fairly easy to reproduce:

  • have a peripheral keep spamming the UART
  • have no receiver on the Linux side (userspace)
    → it will crash eventually.

Granted, if there is an application handling the UART then it’s all good. But the kernel shouldn’t crash regardless.
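
To make the "application handling the UART" case concrete, here is a minimal sketch of a userspace receiver that just reads and discards whatever arrives. It is a stopgap for comparison testing, not a fix for the driver. The function names `drain` and `drain_port` are made up for this post, the device path is whatever tty node maps to 3100000.serial on your board (not stated in this thread), and the baud rate is left at whatever the port is already configured for.

```python
import os
import tty


def drain(fd: int, chunk: int = 4096) -> int:
    """Read and discard data from fd until EOF; return the number of bytes seen."""
    total = 0
    while True:
        data = os.read(fd, chunk)
        if not data:  # an empty read means EOF / hangup
            return total
        total += len(data)


def drain_port(path: str) -> int:
    """Open a serial device read-only, switch it to raw mode, and drain it."""
    fd = os.open(path, os.O_RDONLY | os.O_NOCTTY)
    try:
        tty.setraw(fd)  # no line buffering or echo, bytes are delivered as they arrive
        return drain(fd)
    finally:
        os.close(fd)
```

Calling `drain_port()` with the path of the affected port blocks and keeps the receive side serviced for as long as it runs. Running it on one unit and leaving another without a reader should show whether the crash really depends on nobody consuming the data.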

Could you also verify with the latest R35.5.0 on the devkit?

Do you enable HW flow control in your case?

I don’t think so. I will find some time to reproduce it on the devkit.