We are observing what looks like the same issue as "Serial-Tegra DMA Driver Bug":
the kernel crashes consistently at around 113 min (+/- 1 min). Our kernel already has the change from the other report.
This is the kernel dump:
[ 6775.783301] serial-tegra 3100000.serial: RxData DMA copy to tty layer failed
[ 6776.288718] serial-tegra 3100000.serial: RxData DMA copy to tty layer failed
[ 6776.800880] serial-tegra 3100000.serial: RxData DMA copy to tty layer failed
[ 6777.312518] serial-tegra 3100000.serial: RxData DMA copy to tty layer failed
[ 6777.824497] serial-tegra 3100000.serial: RxData DMA copy to tty layer failed
[ 6778.336482] serial-tegra 3100000.serial: RxData DMA copy to tty layer failed
[ 6778.848275] serial-tegra 3100000.serial: RxData PIO to tty layer failed
[ 6778.848524] tegra-gpcdma 2600000.gpcdma: slave id already in use
[ 6778.848724] serial-tegra 3100000.serial: Not able to get desc for Rx
[ 6778.848943] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000004
[ 6778.848946] Mem abort info:
[ 6778.848948] ESR = 0x96000004
[ 6778.848952] EC = 0x25: DABT (current EL), IL = 32 bits
[ 6778.848954] SET = 0, FnV = 0
[ 6778.848956] EA = 0, S1PTW = 0
[ 6778.848958] Data abort info:
[ 6778.848960] ISV = 0, ISS = 0x00000004
[ 6778.848962] CM = 0, WnR = 0
[ 6778.848967] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000141731000
[ 6778.848969] [0000000000000004] pgd=0000000000000000, p4d=0000000000000000
[ 6778.848980] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 6778.848984] Modules linked in: overlay hid_logitech_hidpp input_leds 8021q garp mrp snd_soc_tegra186_asrc snd_soc_tegra186_arad snd_soc_tegra210_ope snd_soc_tegra186_dspk snd_soc_tegra210_iqc snd_soc_tegra210_mvc snd_soc_tegra210_afc snd_soc_tegra210_amx snd_soc_tegra210_adx snd_soc_tegra210_dmic snd_soc_tegra210_admaif snd_soc_tegra210_mixer snd_soc_tegra210_i2s snd_soc_tegra210_sfc snd_soc_tegra_pcm aes_ce_blk crypto_simd qmi_wwan cryptd hid_logitech_dj aes_ce_cipher ghash_ce cdc_wdm sha2_ce sha256_arm64 cdc_acm qcserial sha1_ce usb_wwan snd_soc_tegra_machine_driver snd_soc_tegra210_adsp snd_soc_spdif_tx ftdi_sio userspace_alert usbserial snd_soc_tegra_utils snd_soc_simple_card_utils snd_hda_codec_hdmi mttcan brcmfmac(O) inv_icm42600_i2c can_dev inv_icm42600 nvadsp kfifo_buf at24 opt3004 snd_soc_tegra210_ahub tegra_bpmp_thermal snd_hda_tegra tegra210_adma snd_hda_codec cfg80211(O) snd_hda_core r8168 compat(O) brcmutil(O) pwm_fan nvidia_drm(O) nvidia_modeset(O) nvidia(O) nvgpu nvmap
[ 6778.849118] ina3221 fuse spi_tegra114 tpm_tis_spi tpm_tis_core tpm rng_core
[ 6778.849138] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 5.10.120-l4t-r35.4.ga+g76678311c10b #1
[ 6778.849141] Hardware name: Unknown NVIDIA Orin NX Developer Kit/NVIDIA Orin NX Developer Kit, BIOS v35.4.1 08/04/2023
[ 6778.849146] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)
[ 6778.849167] pc : tegra_uart_rx_buffer_push.constprop.0+0x38/0x190
[ 6778.849171] lr : tegra_uart_rx_buffer_push.constprop.0+0x30/0x190
[ 6778.849173] sp : ffff800010003610
[ 6778.849175] x29: ffff800010003610 x28: ffffd9194b102680
[ 6778.849180] x27: 0000000000000002 x26: 0000000000000005
[ 6778.849184] x25: ffff5c6a01dd4000 x24: ffffffffffffffca
[ 6778.849188] x23: 0000000000000001 x22: ffff5c6a01dd4000
[ 6778.849192] x21: ffff5c6a466af000 x20: 0000000000000fe0
[ 6778.849197] x19: ffff5c6a04dec880 x18: ffffffffffffffff
[ 6778.849201] x17: 0000000000000000 x16: 0000000000000000
[ 6778.849205] x15: ffff8000900038f7 x14: 0000000000000006
[ 6778.849210] x13: ffff8000100038ff x12: ffffd9194b0f6000
[ 6778.849214] x11: 0000000000000040 x10: ffffd9194b187b18
[ 6778.849219] x9 : ffffd9194b187b10 x8 : ffff5c6a00401940
[ 6778.849224] x7 : ffff000000000000 x6 : 0000000000000001
[ 6778.849228] x5 : 0000000000000001 x4 : ffff5c6d6e77c200
[ 6778.849232] x3 : 0000000000000000 x2 : ffffd91949672640
[ 6778.849236] x1 : 0000000000000000 x0 : ffff5c6a466af000
[ 6778.849241] Call trace:
[ 6778.849246] tegra_uart_rx_buffer_push.constprop.0+0x38/0x190
[ 6778.849251] tegra_uart_terminate_rx_dma.part.0+0x80/0xc4
[ 6778.849255] tegra_uart_isr+0x41c/0x4bc
[ 6778.849264] __handle_irq_event_percpu+0x68/0x2a0
[ 6778.849268] handle_irq_event+0x70/0x150
[ 6778.849271] handle_fasteoi_irq+0xac/0x1f4
[ 6778.849277] __handle_domain_irq+0x88/0xf0
[ 6778.849283] gic_handle_irq+0xd0/0x150
[ 6778.849286] el1_irq+0xd0/0x180
[ 6778.849290] console_unlock+0x3bc/0x584
[ 6778.849294] vprintk_emit+0x124/0x290
[ 6778.849300] dev_vprintk_emit+0x140/0x178
[ 6778.849303] dev_printk_emit+0x84/0xb4
[ 6778.849306] __dev_printk+0x60/0x88
[ 6778.849308] _dev_err+0x74/0xa0
[ 6778.849312] tegra_uart_start_rx_dma.part.0.isra.0+0x110/0x120
[ 6778.849316] tegra_uart_rx_error_handle_timer+0xc8/0xd0
[ 6778.849323] call_timer_fn+0x3c/0x200
[ 6778.849328] __run_timers.part.0+0x21c/0x300
[ 6778.849332] run_timer_softirq+0x44/0x80
[ 6778.849335] __do_softirq+0x128/0x3f4
[ 6778.849341] irq_exit+0xe0/0x100
[ 6778.849345] __handle_domain_irq+0x8c/0xf0
[ 6778.849348] gic_handle_irq+0xd0/0x150
[ 6778.849351] el1_irq+0xd0/0x180
[ 6778.849358] cpuidle_enter_state+0xbc/0x404
[ 6778.849362] cpuidle_enter+0x40/0x54
[ 6778.849368] do_idle+0x220/0x2b0
[ 6778.849373] cpu_startup_entry+0x30/0x80
[ 6778.849378] rest_init+0xdc/0xe8
[ 6778.849384] arch_call_rest_init+0x18/0x20
[ 6778.849388] start_kernel+0x504/0x53c
[ 6778.849395] Code: aa1603e0 97ff2919 f9413a61 aa0003f5 (b9400420)
[ 6778.849407] ---[ end trace bd7373de45b7021b ]---
[ 6778.849411] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[ 6778.849415] SMP: stopping secondary CPUs
[ 6778.849608] Kernel Offset: 0x5919394c0000 from 0xffff800010000000
[ 6778.849609] PHYS_OFFSET: 0xffffa39700000000
[ 6778.849613] CPU features: 0x08040006,4a80aa38
[ 6778.849615] Memory Limit: none
[ 6779.217899] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
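To cross-check the fault address against the "Code:" line, here is a rough sketch that decodes the last two instruction words from the dump. It assumes both words are AArch64 "LDR (immediate, unsigned offset)" encodings and handles only that form; the function name is our own and not part of any tool.

```python
def decode_ldr_unsigned_offset(word: int) -> str:
    """Decode one AArch64 'LDR (immediate, unsigned offset)' instruction word."""
    # Bits 29..24 are 0b111001 for integer load/store with an unsigned offset.
    if (word >> 24) & 0x3F != 0b111001:
        raise ValueError(f"{word:#010x}: not a load/store unsigned-offset encoding")
    size = (word >> 30) & 0x3   # 2 = 32-bit access, 3 = 64-bit access
    opc = (word >> 22) & 0x3    # 0b01 = LDR
    # This sketch only covers 32-bit and 64-bit LDR.
    if opc != 0b01 or size < 2:
        raise ValueError(f"{word:#010x}: not a 32/64-bit LDR")
    imm12 = (word >> 10) & 0xFFF
    rn = (word >> 5) & 0x1F     # base register
    rt = word & 0x1F            # destination register
    offset = imm12 << size      # immediate is scaled by the access size
    prefix = "x" if size == 3 else "w"
    base = "sp" if rn == 31 else f"x{rn}"
    dest = f"{prefix}zr" if rt == 31 else f"{prefix}{rt}"
    return f"ldr {dest}, [{base}, #{offset}]"


# Last two words of the "Code:" line; the parenthesised word is the faulting one.
print(decode_ldr_unsigned_offset(0xF9413A61))  # ldr x1, [x19, #624]
print(decode_ldr_unsigned_offset(0xB9400420))  # ldr w0, [x1, #4]
```

With x1 = 0 in the register dump, the faulting load reads address 0 + 4, which matches the reported virtual address 0000000000000004. This would be consistent with a NULL pointer having been loaded from offset 624 of the structure in x19, possibly the Rx descriptor that "Not able to get desc for Rx" refers to, but we have not verified this against the driver source.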
This change is already in the kernel:
$ git show d5b90d6b9365
commit d5b90d6b9365250adb73b2fe5b52a5228df3b1d9
Author: Akhil R <akhilrajeev@nvidia.com>
Date: Wed Dec 14 20:14:29 2022 +0530
dmaengine: tegra: Fix memory leak in terminate_all()
Terminate vdesc when terminating an ongoing transfer.
This will ensure that the vdesc is present in the desc_terminated list
The descriptor will will be freed later in desc_free_list().
This fixes the memory leaks which happens whern terminating an ongoing
transfer.
Bug 3787456
Signed-off-by: Akhil R <akhilrajeev@nvidia.com>
Change-Id: I44d5c7fedead91b5498ca422124d66da9e383a80
Reviewed-on: https://git-master.nvidia.com/r/c/linux-5.10/+/2827943
(cherry picked from commit 25783210ab30d29c567853c0e14bf998dbb8433f)
Reviewed-on: https://git-master.nvidia.com/r/c/linux-5.10/+/2934196
Reviewed-by: Bibek Basu <bbasu@nvidia.com>
GVS: Gerrit_Virtual_Submit <buildbot_gerritrpt@nvidia.com>
Tested-by: Bibek Basu <bbasu@nvidia.com>
diff --git a/drivers/dma/tegra-gpc-dma.c b/drivers/dma/tegra-gpc-dma.c
index 6f9ebd987801..b99f21054fa2 100644
--- a/drivers/dma/tegra-gpc-dma.c
+++ b/drivers/dma/tegra-gpc-dma.c
@@ -696,6 +696,7 @@ static int tegra_dma_terminate_all(struct dma_chan *dc)
 			return err;
 		}
+		vchan_terminate_vdesc(&tdc->dma_desc->vd);
 		tegra_dma_disable(tdc);
 		tdc->dma_desc = NULL;
 	}
We are using Yocto as our build system, and here is the exact branch
FYI: our system has a lot of UART devices.