Serial-Tegra DMA Driver Bug


We’ve got a device hooked up to AGX Orin 32GB units over /dev/ttyTHS0. We’ve been experiencing the same issue as Agx xavier serial-tegra dma error - #10 by blueredfield , but I think I’ve found an easy way to reproduce it and a probable explanation of what’s happening inside the driver. The exact symptom for us is that we can read from the serial port for a while, and then we get this error in dmesg:

[67602.543323] image-server-cl: page allocation failure: order:0, mode:0x40800(GFP_NOWAIT|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
[67602.543708] CPU: 0 PID: 35454 Comm: image-server-cl Tainted: G           OE     5.10.104-tegra #1
[67602.543710] Hardware name: Unknown Jetson AGX Orin/Jetson AGX Orin, BIOS 3.1-32827747 03/19/2023
[67602.543712] Call trace:
[67602.543723]  dump_backtrace+0x0/0x1d0
[67602.543727]  show_stack+0x30/0x40
[67602.543734]  dump_stack+0xd8/0x138
[67602.543739]  warn_alloc+0x110/0x180
[67602.543741]  __alloc_pages_slowpath.constprop.0+0xb7c/0xba0
[67602.543743]  __alloc_pages_nodemask+0x2a0/0x320
[67602.543746]  allocate_slab+0x2b4/0x520
[67602.543749]  ___slab_alloc.constprop.0+0x1dc/0x760
[67602.543752]  __slab_alloc.isra.0.constprop.0+0x50/0x90
[67602.543754]  __kmalloc+0x434/0x460
[67602.543760]  tegra_dma_prep_slave_sg+0x130/0x340
[67602.543764]  tegra_uart_start_rx_dma+0xbc/0x140
[67602.543767]  tegra_uart_isr+0x300/0x490
[67602.543771]  __handle_irq_event_percpu+0x68/0x2a0
[67602.543773]  handle_irq_event_percpu+0x40/0xa0
[67602.543776]  handle_irq_event+0x50/0xf0
[67602.543778]  handle_fasteoi_irq+0xc0/0x170
[67602.543781]  generic_handle_irq+0x40/0x60
[67602.543783]  __handle_domain_irq+0x70/0xd0
[67602.543785]  gic_handle_irq+0x68/0x134
[67602.543787]  el0_irq_naked+0x4c/0x54
[67602.543788] Mem-Info:
[67602.543794] active_anon:6688 inactive_anon:799025 isolated_anon:0
                active_file:587536 inactive_file:5211383 isolated_file:0
                unevictable:11052 dirty:477130 writeback:0
                slab_reclaimable:192359 slab_unreclaimable:334860
                mapped:90468 shmem:18719 pagetables:3749 bounce:0
                free:74772 free_pcp:3551 free_cma:0                                                                                                                                 
[67602.543798] Node 0 active_anon:26752kB inactive_anon:3196100kB active_file:2350144kB inactive_file:20845532kB unevictable:44208kB isolated(anon):0kB isolated(file):0kB mapped:361872kB dirty:1908520kB writeback:0kB shmem:74876kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2385920kB writeback_tmp:0kB kernel_stack:9328kB all_unreclaimable? no
[67602.543803] DMA free:117624kB min:2640kB low:4472kB high:6304kB reserved_highatomic:0KB active_anon:0kB inactive_anon:26996kB active_file:4608kB inactive_file:1651076kB unevictable:0kB writepending:7036kB present:2097152kB managed:1834880kB mlocked:0kB pagetables:12kB bounce:0kB free_pcp:248kB local_pcp:0kB free_cma:0kB

Once we get this error, the serial port remains open but neither returns any more characters nor reports any errors to our application. We have to heuristically detect that the port is broken and re-open it, losing any data that arrived in the meantime.

Our application only uses about 3-4GB of RAM, but writes a ton of data to a USB-attached NVMe drive. The result is that Linux uses the majority of the Orin’s RAM as buffer/cache, as seen here in top:

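For reference, the same buffer/cache pressure can be read straight from the kernel, without a screenshot:

```shell
# MemFree is what's immediately allocatable without reclaim;
# Buffers/Cached is RAM the kernel is using for the page cache.
grep -E '^(MemFree|Buffers|Cached):' /proc/meminfo

# free(1) summarizes the same counters in human-readable units
free -h
```

On our units, nearly all "available" memory shows up under buff/cache rather than free.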
Because of the huge amount of buffer/cache, there’s a ton of RAM "available" (24GB), but that RAM isn’t available for immediate allocation; if an application or driver wants some memory, those cache pages have to be flushed first. Inside the serial-tegra driver, there’s a request for a DMA descriptor to read data into:

static int tegra_uart_start_rx_dma(struct tegra_uart_port *tup)
{
        unsigned int count = TEGRA_UART_RX_DMA_BUFFER_SIZE;

        if (tup->rx_dma_active)
                return 0;

        tup->rx_dma_desc = dmaengine_prep_slave_single(tup->rx_dma_chan,
                                tup->rx_dma_buf_phys, count, DMA_DEV_TO_MEM,
                                DMA_PREP_INTERRUPT);
        if (!tup->rx_dma_desc) {
                dev_err(tup->uport.dev, "Not able to get desc for Rx\n");
                return -EIO;
        }
... snip ...
Inside the Tegra DMA driver (tegra-gpc-dma.c), there’s code that implements the internals of dmaengine_prep_slave_single(). Part of this code involves allocating memory to hold the DMA descriptor:

static struct dma_async_tx_descriptor *
tegra_dma_prep_slave_sg(struct dma_chan *dc, struct scatterlist *sgl,
                        unsigned int sg_len, enum dma_transfer_direction direction,
                        unsigned long flags, void *context)
... snip ...
        dma_desc = kzalloc(struct_size(dma_desc, sg_req, sg_len), GFP_NOWAIT);
        if (!dma_desc)
                return NULL;

From what I can tell, the issue arises because of the GFP_NOWAIT flag: when all of the pages available to the kernel are being used as buffers, they’re reclaimable but not immediately free, so kzalloc(..., GFP_NOWAIT) returns NULL here. After that happens, the serial-tegra driver appears to get wedged and stops functioning.

This behaviour is reproducible on Jetpack 5.1.1 and L4T 35.3.1.
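As a possible stopgap while waiting on a driver fix (my assumption, not a verified cure), raising the kernel's free-memory watermark keeps more pages immediately allocatable, which should make GFP_NOWAIT failures under heavy page-cache pressure less likely:

```shell
# Inspect the current watermark (in kB)
cat /proc/sys/vm/min_free_kbytes

# Raise it, e.g. to 128 MB (needs root; example value, tune to taste;
# does not persist across reboots unless set via sysctl.conf)
# echo 131072 | sudo tee /proc/sys/vm/min_free_kbytes
```

This only reduces the window for the allocation failure; the driver still wedges if it ever hits it.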

I am going to check it and will update you later.


@ShaneCCC any updates?

Please apply the change below for it.

diff --git a/drivers/dma/tegra-gpc-dma.c b/drivers/dma/tegra-gpc-dma.c
index 9c086c54..abf34b0 100644
--- a/drivers/dma/tegra-gpc-dma.c
+++ b/drivers/dma/tegra-gpc-dma.c
@@ -715,6 +715,7 @@
 			return err;
+		vchan_terminate_vdesc(&tdc->dma_desc->vd);
 		tdc->dma_desc = NULL;

@ShaneCCC should that be line 699 in the publicly-released 35.2.1 and 35.3.1 sources?

Yes, it should be at line 699, in tegra_dma_terminate_all().


@ShaneCCC you rock! I’m going to keep testing this here and put together a less hacky kernel build that we can deploy out to our units, but initial smoke testing suggests that this fixed it!
