Re-opening Xavier NX R35.6.0 Kernel Oops

Hi NVIDIA team,

I have updates on Xavier NX R35.6.0 Kernel Oops - #40, but this thread is now locked.

Apologies – we have been too busy to follow up on this thread until now.

I have replicated the same test, but booted and running from the eMMC instead of the NVMe. Unfortunately, the issue persists even when booted from eMMC.

For reference, I unpacked R35.6.0 JetPack/rootfs, did the usual applyBinaries.sh and then flashed with:

sudo ./flash.sh jetson-xavier-nx-devkit-emmc mmcblk0p1

I set up slub_debug as before, and then ran the serial and disk stressors from before.

After running for half an hour, I got a slub_debug error, indicating that an out-of-bounds write is still occurring.

tegra-ubuntu login: [   29.004673] nvidia: loading out-of-tree module taints kernel.
[ 6852.803761] =============================================================================
[ 6852.804068] BUG kmalloc-256 (Tainted: G           O     ): Poison overwritten
[ 6852.804240] -----------------------------------------------------------------------------
[ 6852.804240] 
[ 6852.804628] Disabling lock debugging due to kernel taint
[ 6852.804824] INFO: 0x000000000b94b4ca-0x00000000b733ee34 @offset=7192. First byte 0x70 instead of 0x6b
[ 6852.805039] INFO: Slab 0x0000000078d9659f objects=21 used=20 fp=0x00000000b561ef46 flags=0x8000000000010200
[ 6852.805306] INFO: Object 0x00000000bbcf4a83 @offset=7168 fp=0x00000000c8edbf1b
[ 6852.805306] 
[ 6852.805510] Redzone  00000000a3c4e2b2: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[ 6852.805727] Redzone  00000000289ce4e0: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[ 6852.805938] Redzone  0000000058608c50: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[ 6852.806164] Redzone  00000000b4848ccd: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[ 6852.809459] Redzone  00000000268f6e60: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[ 6852.818770] Redzone  00000000d3666ed2: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[ 6852.828536] Redzone  00000000754b38a2: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[ 6852.838073] Redzone  000000003aa3a467: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[ 6852.847352] Redzone  000000002e962086: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[ 6852.857147] Redzone  000000002c256668: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[ 6852.866442] Redzone  00000000bdf983a7: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[ 6852.876223] Redzone  00000000c2d83118: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[ 6852.885786] Redzone  00000000ec33a6b4: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[ 6852.895298] Redzone  00000000c2f9c951: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[ 6852.904579] Redzone  00000000681a10f2: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[ 6852.914372] Redzone  00000000f5982f73: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[ 6852.923652] Object   00000000bbcf4a83: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 6852.933189] Object   0000000000f25900: 6b 6b 6b 6b 6b 6b 6b 6b 70 b8 79 c8 ea 2d ff ff  kkkkkkkkp.y..-..
[ 6852.942984] Object   000000005ff73fc1: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 6852.952522] Object   00000000671e45da: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 6852.962081] Object   0000000001272bf0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 6852.971598] Object   00000000d7cfa56c: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 6852.980877] Object   000000008b5f73c1: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 6852.990414] Object   00000000677f5130: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 6852.999955] Object   00000000c008b8c6: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 6853.009489] Object   000000003daf912f: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 6853.019027] Object   00000000e98482ab: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 6853.028821] Object   00000000aeb24c22: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 6853.038101] Object   0000000040642f12: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 6853.047642] Object   000000009b9802b6: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 6853.057177] Object   00000000c3141fbc: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 6853.066972] Object   00000000970a474b: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.
[ 6853.076531] Redzone  00000000b59eaaa6: bb bb bb bb bb bb bb bb                          ........
[ 6853.085002] Padding  000000008ef7e1da: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[ 6853.094797] Padding  00000000a9eb646e: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[ 6853.104333] Padding  000000007cef9957: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[ 6853.113639] Padding  000000004a6db3c3: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[ 6853.123410] Padding  00000000516d1778: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[ 6853.132950] Padding  000000007d8162e8: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[ 6853.142485] Padding  00000000c4b59430: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[ 6853.151852] Padding  00000000555e03fa: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[ 6853.161647] Padding  00000000119413b0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[ 6853.171183] Padding  00000000135465d9: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[ 6853.180721] Padding  000000002fe798a7: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[ 6853.190278] Padding  000000008935831c: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[ 6853.199797] Padding  0000000070010d18: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[ 6853.209336] Padding  0000000099989353: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[ 6853.218876] Padding  000000006484380d: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[ 6853.228763] FIX kmalloc-256: Restoring 0x000000000b94b4ca-0x00000000b733ee34=0x6b
[ 6853.228763] 
[ 6853.237421] FIX kmalloc-256: Marking all objects used
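
For readers decoding the report above, here is a minimal sketch of the arithmetic, using only values copied from the log. The poison constants are the standard ones from include/linux/poison.h. Reading the overwritten bytes as a single little-endian 64-bit word is an assumption about what was written, not something the log states.

```python
# Values copied from the slub_debug report above.
POISON_FREE = 0x6B  # fill byte for freed objects (include/linux/poison.h)
POISON_END = 0xA5   # marks the last byte of a poisoned object

object_offset = 7168      # "INFO: Object ... @offset=7168"
corruption_offset = 7192  # "INFO: ... @offset=7192. First byte 0x70 instead of 0x6b"

# Second "Object" row of the dump (bytes 16..31 of the object).
row_base = 16
row = bytes.fromhex("6b 6b 6b 6b 6b 6b 6b 6b 70 b8 79 c8 ea 2d ff ff")

# Where the stray write starts, relative to the start of the object.
offset_in_object = corruption_offset - object_offset

# Cross-check against the hex dump: first byte in the row that is not poison.
first_bad = next(i for i, b in enumerate(row) if b != POISON_FREE)
assert row_base + first_bad == offset_in_object

# Assumption: treat the overwritten bytes as one little-endian 64-bit word
# (arm64 Linux is little-endian).
overwritten = row[first_bad:]
value = int.from_bytes(overwritten, "little")

print(f"write starts {offset_in_object} bytes into the object")
print(f"{len(overwritten)} bytes overwritten: 0x{value:016x}")
```

The write lands 24 bytes into a freed kmalloc-256 object and is 8 bytes long. The upper 16 bits of the decoded word are all set, which is at least consistent with a kernel virtual address being stored into freed memory, though that interpretation is a guess.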

Interestingly, running the same test with R35.5.0 did not show the fault, indicating that it must have been introduced between R35.5.0 and R35.6.0.

Could you swap the kernel image between rel-35.6 and rel-35.5 (in either direction) and see whether the issue can still be reproduced? This could clarify whether the issue is located in the kernel. As in our last test, we cannot reproduce this issue on our side, even with your setup.

Hi WayneWWW,

So I have been working to bisect this issue.

For my stress tests, I need to run the OS for a reasonable amount of time. Unfortunately, the R35.5.0 kernel tends to hit a kernel oops in kernel/kernel-5.10/block/bio.c on line 960 before the serial port stressor can reach the serial port bug. This was fixed in the move from R35.5.0 to R35.6.0:

diff -Nrau r3550/Linux_for_Tegra/source/public/kernel/kernel-5.10/block/bio.c r3560/Linux_for_Tegra/source/public/kernel/kernel-5.10/block/bio.c
--- r3550/Linux_for_Tegra/source/public/kernel/kernel-5.10/block/bio.c	2024-02-20 04:18:11.000000000 +0000
+++ r3560/Linux_for_Tegra/source/public/kernel/kernel-5.10/block/bio.c	2024-08-28 09:44:12.000000000 +0100
@@ -776,7 +776,7 @@
 
 	if ((addr1 | mask) != (addr2 | mask))
 		return false;
-	if (bv->bv_len + len > queue_max_segment_size(q))
+	if (len > queue_max_segment_size(q) - bv->bv_len)
 		return false;
 	return __bio_try_merge_page(bio, page, len, offset, same_page);
 }
@@ -960,7 +960,7 @@
 		return;
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		if (mark_dirty && !PageCompound(bvec->bv_page))
+		if (mark_dirty)
 			set_page_dirty_lock(bvec->bv_page);
 		put_page(bvec->bv_page);
 	}
@@ -1332,8 +1332,7 @@
 	struct bvec_iter_all iter_all;
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		if (!PageCompound(bvec->bv_page))
-			set_page_dirty_lock(bvec->bv_page);
+		set_page_dirty_lock(bvec->bv_page);
 	}
 }
 
@@ -1381,7 +1380,7 @@
 	struct bvec_iter_all iter_all;
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		if (!PageDirty(bvec->bv_page) && !PageCompound(bvec->bv_page))
+		if (!PageDirty(bvec->bv_page))
 			goto defer;
 	}
 

Thus I can’t really test an R35.5.0 kernel well because it is too unstable.
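
As an aside on the first hunk of the bio.c diff above (the segment-size check, not the line 960 change), the rewritten comparison looks like a guard against unsigned wraparound. The sketch below emulates 32-bit unsigned arithmetic in Python to show the difference. The function names and the input values are hypothetical and chosen only to trigger the wrap; whether this hunk relates to the oops is not established here.

```python
U32_MASK = 0xFFFFFFFF  # emulate a 32-bit unsigned int

def old_rejects(bv_len, length, max_seg):
    """R35.5.0 form: bv_len + len is truncated to 32 bits before comparing."""
    return ((bv_len + length) & U32_MASK) > max_seg

def new_rejects(bv_len, length, max_seg):
    """R35.6.0 form: subtracting cannot wrap as long as bv_len <= max_seg."""
    return length > ((max_seg - bv_len) & U32_MASK)

# Hypothetical values: a segment already close to a very large limit.
max_seg = 0xFFFFFFFF
bv_len = 0xFFFFF000
length = 0x2000

print("old check rejects merge:", old_rejects(bv_len, length, max_seg))
print("new check rejects merge:", new_rejects(bv_len, length, max_seg))

# For ordinary sizes the two forms agree.
print("agree on small sizes:",
      old_rejects(4096, 4096, 65536) == new_rejects(4096, 4096, 65536))
```

With these inputs the sum wraps to a small number, so the old form lets the merge through while the new form rejects it.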

So, my second approach was to take the stock R35.6.0 kernel source and revert only the changes made to the serial-tegra.c driver between R35.5.0 and R35.6.0. The complete patch is included below. It should be reverse-applied to the R35.6.0 kernel.

--- r3550/Linux_for_Tegra/source/public/kernel/kernel-5.10/drivers/tty/serial/serial-tegra.c
+++ r3560/Linux_for_Tegra/source/public/kernel/kernel-5.10/drivers/tty/serial/serial-tegra.c
@@ -4,7 +4,7 @@
  *
  * High-speed serial driver for NVIDIA Tegra SoCs
  *
- * Copyright (c) 2012-2023, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2012-2024, NVIDIA CORPORATION.  All rights reserved.
  *
  * Author: Laxman Dewangan <ldewangan@nvidia.com>
  */
@@ -662,7 +662,6 @@
 	count = tup->tx_bytes_requested - state.residue;
 	async_tx_ack(tup->tx_dma_desc);
 	uart_xmit_advance(&tup->uport, count);
-	dmaengine_terminate_all(tup->tx_dma_chan);
 	tup->tx_in_progress = 0;
 }
 
@@ -882,10 +881,21 @@
 	struct dma_tx_state state;
 	enum dma_status status;
 	struct dma_async_tx_descriptor *prev_rx_dma_desc;
+	unsigned long ier;
 	int rx_level = 0;
 	int ret = 0;
 
 	spin_lock_irqsave(&u->lock, flags);
+
+	/* Deactivate flow control to stop the sender. */
+	if (tup->rts_active && tup->is_hw_flow_enabled)
+		set_rts(tup, false);
+
+	/* Disable RX interrupts. */
+	ier = tup->ier_shadow;
+	ier &= ~(UART_IER_RLSI | UART_IER_RTOIE | TEGRA_UART_IER_EORD);
+	tup->ier_shadow = ier;
+	tegra_uart_write(tup, ier, UART_IER);
 
 	status = dmaengine_tx_status(tup->rx_dma_chan, tup->rx_cookie, &state);
 
@@ -893,11 +903,8 @@
 		dev_dbg(tup->uport.dev, "RX DMA is in progress\n");
 		goto done;
 	}
+
 	prev_rx_dma_desc = tup->rx_dma_desc;
-
-	/* Deactivate flow control to stop sender */
-	if (tup->rts_active && tup->is_hw_flow_enabled)
-		set_rts(tup, false);
 
 	tup->rx_dma_active = false;
 	ret = tegra_uart_rx_buffer_push(tup, 0);
@@ -920,6 +927,7 @@
 	tegra_uart_start_rx_dma(tup);
 	async_tx_ack(prev_rx_dma_desc);
 
+done:
 	/* Activate flow control to start transfer */
 	if (tup->enable_rx_buffer_throttle) {
 		if ((rx_level <= 70) && tup->rts_active)
@@ -927,7 +935,12 @@
 	} else if (tup->rts_active && tup->is_hw_flow_enabled)
 		set_rts(tup, true);
 
-done:
+	/* Enable RX interrupts. */
+	ier = tup->ier_shadow;
+	ier |= (UART_IER_RLSI | UART_IER_RTOIE | TEGRA_UART_IER_EORD);
+	tup->ier_shadow = ier;
+	tegra_uart_write(tup, ier, UART_IER);
+
 	spin_unlock_irqrestore(&u->lock, flags);
 }
 
@@ -945,26 +958,29 @@
 	dmaengine_tx_status(tup->rx_dma_chan, tup->rx_cookie, &state);
 	dmaengine_terminate_all(tup->rx_dma_chan);
 
+	ret = tegra_uart_rx_buffer_push(tup, state.residue);
 	tup->rx_dma_active = false;
-
-	/* Return error if tty buffer is full. */
-	ret = tegra_uart_rx_buffer_push(tup, state.residue);
-	if (ret) {
+	async_tx_ack(tup->rx_dma_desc);
+
+	if (ret)
 		tup->rx_in_progress = 0;
-		async_tx_ack(tup->rx_dma_desc);
-		return ret;
-	}
-
-	return 0;
-}
-
-static void tegra_uart_handle_rx_dma(struct tegra_uart_port *tup)
-{
+
+	return ret;
+}
+
+static int tegra_uart_handle_rx_dma(struct tegra_uart_port *tup)
+{
+	unsigned long ier;
 	int ret = 0;
 
 	/* Deactivate flow control to stop sender */
-	if (tup->rts_active  && tup->is_hw_flow_enabled)
+	if (tup->rts_active && tup->is_hw_flow_enabled)
 		set_rts(tup, false);
+
+	ier = tup->ier_shadow;
+	ier &= ~(UART_IER_RLSI | UART_IER_RTOIE | TEGRA_UART_IER_EORD);
+	tup->ier_shadow = ier;
+	tegra_uart_write(tup, ier, UART_IER);
 
 	/*
 	 * If tty buffer is full then keep RTS disabled, DMA and RTS
@@ -972,10 +988,19 @@
 	 */
 	ret = tegra_uart_terminate_rx_dma(tup);
 	if (ret)
-		return;
+		return ret;
+
+	tegra_uart_start_rx_dma(tup);
 
 	if (tup->rts_active  && tup->is_hw_flow_enabled)
 		set_rts(tup, true);
+
+	ier = tup->ier_shadow;
+	ier |= (UART_IER_RLSI | UART_IER_RTOIE | TEGRA_UART_IER_EORD);
+	tup->ier_shadow = ier;
+	tegra_uart_write(tup, ier, UART_IER);
+
+	return 0;
 }
 
 static int tegra_uart_start_rx_dma(struct tegra_uart_port *tup)
@@ -1029,41 +1054,12 @@
 	struct tegra_uart_port *tup = data;
 	struct uart_port *u = &tup->uport;
 	unsigned long iir;
-	unsigned long ier;
-	bool is_rx_start = false;
-	bool is_rx_int = false;
 	unsigned long flags;
-	struct tty_port *port = &tup->uport.state->port;
-	int rx_level = 0;
 
 	spin_lock_irqsave(&u->lock, flags);
 	while (1) {
 		iir = tegra_uart_read(tup, UART_IIR);
 		if (iir & UART_IIR_NO_INT) {
-			if (!tup->use_rx_pio && is_rx_int) {
-				tegra_uart_handle_rx_dma(tup);
-				if (tup->rx_in_progress) {
-					ier = tup->ier_shadow;
-					ier |= (UART_IER_RLSI | UART_IER_RTOIE |
-						TEGRA_UART_IER_EORD | UART_IER_RDI);
-					tup->ier_shadow = ier;
-					tegra_uart_write(tup, ier, UART_IER);
-				}
-			} else if (is_rx_start) {
-				if (tup->enable_rx_buffer_throttle) {
-					rx_level = tty_buffer_get_level(port);
-					if (rx_level > 70)
-						mod_timer(&tup->timer,
-						jiffies + tup->timer_timeout_jiffies);
-				}
-				tegra_uart_start_rx_dma(tup);
-
-				if (tup->enable_rx_buffer_throttle) {
-					if ((rx_level <= 70) && tup->rts_active)
-						set_rts(tup, true);
-				} else if (tup->rts_active && tup->is_hw_flow_enabled)
-						set_rts(tup, true);
-			}
 			spin_unlock_irqrestore(&u->lock, flags);
 			return IRQ_HANDLED;
 		}
@@ -1081,26 +1077,13 @@
 
 		case 4: /* End of data */
 		case 6: /* Rx timeout */
-			if (!tup->use_rx_pio) {
-				is_rx_int = tup->rx_in_progress;
-				/* Disable Rx interrupts */
-				ier = tup->ier_shadow;
-				ier &= ~(UART_IER_RDI | UART_IER_RLSI |
-					UART_IER_RTOIE | TEGRA_UART_IER_EORD);
-				tup->ier_shadow = ier;
-				tegra_uart_write(tup, ier, UART_IER);
+			if (!tup->use_rx_pio && tup->rx_dma_active) {
+				tegra_uart_handle_rx_dma(tup);
 				break;
 			}
 			fallthrough;
 		case 2: /* Receive */
-			if (!tup->use_rx_pio) {
-				is_rx_start = tup->rx_in_progress;
-				tup->ier_shadow  &= ~UART_IER_RDI;
-				tegra_uart_write(tup, tup->ier_shadow,
-						 UART_IER);
-			} else {
-				do_handle_rx_pio(tup);
-			}
+			do_handle_rx_pio(tup);
 			break;
 
 		case 3: /* Receive error */
@@ -1230,7 +1213,11 @@
 	tup->ier_shadow = 0;
 	tup->current_baud = 0;
 
-	clk_prepare_enable(tup->uart_clk);
+	ret = clk_prepare_enable(tup->uart_clk);
+	if (ret) {
+		dev_err(tup->uport.dev, "could not enable clk\n");
+		return ret;
+	}
 
 	/* Reset the UART controller to clear all previous status.*/
 	reset_control_assert(tup->rst);
@@ -1331,8 +1318,12 @@
 	 * If using DMA mode, enable EORD interrupt to notify about RX
 	 * completion.
 	 */
-	if (!tup->use_rx_pio)
+	if (!tup->use_rx_pio) {
+		tup->ier_shadow &= ~UART_IER_RDI;
 		tup->ier_shadow |= TEGRA_UART_IER_EORD;
+
+		tegra_uart_start_rx_dma(tup);
+	}
 
 	tegra_uart_write(tup, tup->ier_shadow, UART_IER);
 	return 0;

Applying this patch and then running the previously described stressors with slub_debug enabled resulted in no kernel oops and no slub_debug messages printed to the serial console after 72 hours.

This to me is a smoking gun that there is a bug in the R35.6.0 serial port driver implementation. For now we may be able to produce a workable OS image by just deploying with the reverted serial port driver.

Hi bgillatt,

I’ve gone through your original Topic 309218 again to check your status.

You hit the following 3 issues:

  1. slub_debug warning message
  2. kernel oops
  3. the following errors from the serial driver:
[   50.583238] tegra-gpcdma 2600000.dma: DMA pause timed out
[   52.081031] tegra-gpcdma 2600000.dma: slave id already in use
[   52.081213] serial-tegra 3110000.serial: Not able to get desc for Tx

Please let us know which one you would like to fix first.
Please also share the detailed steps to reproduce it on the devkit after running sudo ./flash.sh jetson-xavier-nx-devkit-emmc mmcblk0p1 to flash the Xavier NX devkit, since we could not reproduce it on the devkit with R35.6.0 before.

It seems the legacy UART driver is working in your case. Do you still need to debug and use the Tegra High Speed UART driver?

Hi KevinFFF,

I believe the 3 items you have listed there are three symptoms of the same bug. I am not requesting that the symptoms are fixed, but that the root cause is found and fixed.

To me, the evidence points towards an out-of-bounds write, either directly from the CPU or from a misconfigured DMA transfer. Because this write could land anywhere in the kernel address space, the severity of the fault depends on which piece of kernel memory happens to be clobbered. If it is something fairly harmless, the slub_debug tool will catch it. If it is somewhere more sensitive, we see a kernel oops. But the point is that I still believe the same bug is causing all of these problems.

We attempted to use the legacy serial port, but this will not work in our application. Our application needs to transfer data at 115200 bps, and at this rate, with the 36-byte hardware receive FIFO on the Jetson Xavier SoC, the kernel does not always service the FIFO before it overflows, resulting in lost data. We spent a long time trying to track down what is preventing the kernel from servicing interrupts in a timely manner, but there were quite a few sources, so this approach was abandoned as unworkable.
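
To put a number on that deadline, here is a rough back-of-the-envelope calculation. It takes the line rate as 115200 bits per second (the baud_rate used by the stressor script in this thread) and assumes 8N1 framing, i.e. 10 bits on the wire per byte. The framing is an assumption, and the FIFO depth is the figure quoted above.

```python
baud = 115200       # bits per second on the wire
bits_per_byte = 10  # assumption: 8N1 framing (1 start + 8 data + 1 stop bit)
fifo_bytes = 36     # hardware receive FIFO depth quoted above

bytes_per_second = baud / bits_per_byte
fill_time_ms = fifo_bytes / bytes_per_second * 1000

print(f"{bytes_per_second:.0f} bytes/s")
print(f"FIFO fills in about {fill_time_ms:.3f} ms of continuous reception")
```

Under these assumptions, an interrupt-driven driver has roughly 3 ms from the first byte arriving to drain the FIFO before data is lost, which is why any source of multi-millisecond interrupt latency matters here.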

So, this is why I returned to debugging and triaging the DMA serial port driver in R35.6.0. As described above, I managed to get a working DMA serial port driver by reverting the serial-tegra.c driver to the version in the R35.5.0 kernel. Note that I did this patch because we have to have a DMA serial port, we cannot use a software serial port. While this patching will get our application working, I presume you do not want to have a broken serial port driver in your software, hence why I am reporting this here.

Two people have independently replicated the test internally in our company, as has an external third-party contractor, using the steps in the previous post. For reference though, I will duplicate them here.

Log in to the instrument via SSH.

Install tools on running devkit:

sudo apt-get install stress-ng nano python3-serial

Setup slub_debug on running devkit:

sudo nano /boot/extlinux/extlinux.conf

… append to the command line:

slub_debug=FZP

Create a serial stressor on the devkit:

nano serial_stressor.py

… and enter the following script:

#!/usr/bin/python3

import serial
import os
import sys

# Serial port parameters
serial_port = sys.argv[1]
baud_rate = 115200

while True:
    print("Loop")
    s = serial.Serial(serial_port, baud_rate, exclusive=True, timeout=0.1)
    print("Serial port connected to {}".format(serial_port), flush=True)

    read_data = s.read(128).decode('utf-8', errors='backslashreplace')
    print("Read done", flush=True)
    s.write("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA".encode())
    print("Write done", flush=True)
    s.close()
    print("Close done", flush=True)

Set power mode to high power to boot all 6 CPUs:

sudo nvpmodel -m 8

On three parallel SSH sessions, start serial stressors and disk stressors:

sudo -E python3 serial_stressor.py /dev/ttyTHS0
sudo -E python3 serial_stressor.py /dev/ttyTHS1
stress-ng -d 6 --aggressive

Wait for between 15 mins and 1 hour for one of the symptoms to show up. This could be a slub_debug message or a kernel oops.
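
When leaving the test running unattended, it can help to filter the kernel log for the known symptoms. The following is a minimal sketch of such a filter. The slub_debug, tegra-gpcdma and serial-tegra patterns are taken from log lines quoted in this thread; the generic "Oops" pattern is an assumption about how the kernel oops shows up in the log, and the labels are made up for illustration.

```python
import re

# (label, pattern) pairs; labels are illustrative only.
SYMPTOMS = [
    ("slub_debug report", re.compile(r"BUG kmalloc-\d+ ")),
    ("gpcdma error", re.compile(
        r"tegra-gpcdma .*(DMA pause timed out|slave id already in use)")),
    ("serial-tegra error", re.compile(
        r"serial-tegra .*Not able to get desc for Tx")),
    # Assumption: an oops is reported with the word "Oops" in the log line.
    ("kernel oops", re.compile(r"\bOops\b")),
]

def classify(line):
    """Return the label of the first matching symptom, or None."""
    for label, pattern in SYMPTOMS:
        if pattern.search(line):
            return label
    return None

# Sample lines copied from this thread.
sample = [
    "[ 6852.804068] BUG kmalloc-256 (Tainted: G           O     ): Poison overwritten",
    "[   50.583238] tegra-gpcdma 2600000.dma: DMA pause timed out",
    "[   52.081213] serial-tegra 3110000.serial: Not able to get desc for Tx",
    "[   29.004673] nvidia: loading out-of-tree module taints kernel.",
]

for line in sample:
    print(classify(line), "|", line)
```

In practice the same classify() function could be applied to each line read from the serial console capture or from dmesg output.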

I’ve been running the test since yesterday, for more than 20 hours, without hitting the issue.
I didn’t change the power mode on my setup and I will perform the test again.

Could you share the result of cat /proc/cmdline on your board?

Can you reproduce the issue if you don’t modify the power mode?
(i.e. do you hit the issue with the default power mode 5, shown below?)

< POWER_MODEL ID=5 NAME=MODE_10W_DESKTOP >

Hi KevinFFF,

Sure, the command line from

cat /proc/cmdline

is:

root=PARTUUID=84a337cd-d599-4a60-9c26-2cf352f36c2d rw rootwait rootfstype=ext4 console=ttyTCU0,115200n8 console=tty0 fbcon=map:0 net.ifnames=0 video=efifb:off nospectre_bhb slub_debug=FZP 
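
As a quick sanity check that the debug option actually reached the kernel, the sketch below pulls the slub_debug flags out of a command line string and names them. The flag meanings (F, Z, P, U, T) follow the kernel's SLUB documentation; the helper name is made up for illustration.

```python
# Flag letters as described in the kernel's SLUB documentation.
SLUB_FLAGS = {
    "F": "sanity checks",
    "Z": "red zoning",
    "P": "poisoning",
    "U": "user tracking",
    "T": "trace",
}

def slub_debug_flags(cmdline):
    """Return the slub_debug flag letters, or None if the option is absent.

    A bare "slub_debug" (no "=") yields an empty string.
    """
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == "slub_debug":
            # Flags may be followed by ",<slab names>"; keep only the letters.
            return value.split(",")[0]
    return None

# Command line copied from the post above.
cmdline = (
    "root=PARTUUID=84a337cd-d599-4a60-9c26-2cf352f36c2d rw rootwait "
    "rootfstype=ext4 console=ttyTCU0,115200n8 console=tty0 fbcon=map:0 "
    "net.ifnames=0 video=efifb:off nospectre_bhb slub_debug=FZP "
)

flags = slub_debug_flags(cmdline)
print("slub_debug flags:", flags)
for letter in flags or "":
    print(f"  {letter}: {SLUB_FLAGS.get(letter, 'unknown flag')}")
```

On the board itself, the string would come from reading /proc/cmdline rather than being pasted in.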

I will retry with the default power mode when I get some time to look at this.

Hi bgillatt,

I can reproduce this Kernel Oops issue locally with the steps you shared.
Please give us some time to clarify and debug the issue.

You can also try removing slub_debug=FZP to check whether that helps in your case.

Hi KevinFFF,

Thank you for taking the time to replicate the fault. I look forward to hearing about anything you find.

We already deploy without slub_debug enabled. But because of the random OOB write, our system randomly crashes with a kernel oops, which I believe can be attributed to the same root cause. slub_debug just makes the issue more visible for debugging.

Hi bgillatt,

Could you apply the following patch to the serial driver and check whether it helps?

diff --git a/drivers/tty/serial/serial-tegra.c b/drivers/tty/serial/serial-tegra.c
index b52a20019f0e..0deb6b8f72a1 100644
--- a/drivers/tty/serial/serial-tegra.c
+++ b/drivers/tty/serial/serial-tegra.c
@@ -565,15 +565,21 @@ static void tegra_uart_tx_dma_complete(void *args)
 	unsigned long flags;
 	unsigned int count;
 
+	spin_lock_irqsave(&tup->uport.lock, flags);
+	if (tup->tx_in_progress != TEGRA_UART_TX_DMA) {
+		dev_dbg(tup->uport.dev, "TX DMA is not in progress\n");
+		goto done;
+	}
+
 	dmaengine_tx_status(tup->tx_dma_chan, tup->tx_cookie, &state);
 	count = tup->tx_bytes_requested - state.residue;
-	async_tx_ack(tup->tx_dma_desc);
-	spin_lock_irqsave(&tup->uport.lock, flags);
 	uart_xmit_advance(&tup->uport, count);
 	tup->tx_in_progress = 0;
 	if (uart_circ_chars_pending(xmit) < WAKEUP_CHARS)
 		uart_write_wakeup(&tup->uport);
 	tegra_uart_start_next_tx(tup);
+
+done:
 	spin_unlock_irqrestore(&tup->uport.lock, flags);
 }
 
@@ -667,7 +673,6 @@ static void tegra_uart_stop_tx(struct uart_port *u)
 	dmaengine_tx_status(tup->tx_dma_chan, tup->tx_cookie, &state);
 	dmaengine_terminate_all(tup->tx_dma_chan);
 	count = tup->tx_bytes_requested - state.residue;
-	async_tx_ack(tup->tx_dma_desc);
 	uart_xmit_advance(&tup->uport, count);
 	tup->tx_in_progress = 0;
 }
@@ -861,7 +866,6 @@ static int tegra_uart_rx_buffer_push(struct tegra_uart_port *tup,
 	unsigned int count;
 	int ret;
 
-	async_tx_ack(tup->rx_dma_desc);
 	count = tup->rx_bytes_requested - residue;
 
 	/* If we are here, DMA is stopped */
@@ -887,7 +891,6 @@ static void tegra_uart_rx_dma_complete(void *args)
 	unsigned long flags;
 	struct dma_tx_state state;
 	enum dma_status status;
-	struct dma_async_tx_descriptor *prev_rx_dma_desc;
 	unsigned long ier;
 	int rx_level = 0;
 	int ret = 0;
@@ -903,6 +906,10 @@ static void tegra_uart_rx_dma_complete(void *args)
 	ier &= ~(UART_IER_RLSI | UART_IER_RTOIE | TEGRA_UART_IER_EORD);
 	tup->ier_shadow = ier;
 	tegra_uart_write(tup, ier, UART_IER);
+	if (tup->rx_dma_active == false) {
+		dev_dbg(tup->uport.dev, "RX DMA is not active\n");
+		goto done;
+	}
 
 	status = dmaengine_tx_status(tup->rx_dma_chan, tup->rx_cookie, &state);
 
@@ -911,8 +918,6 @@ static void tegra_uart_rx_dma_complete(void *args)
 		goto done;
 	}
 
-	prev_rx_dma_desc = tup->rx_dma_desc;
-
 	tup->rx_dma_active = false;
 	ret = tegra_uart_rx_buffer_push(tup, 0);
 	if (ret) {
@@ -920,7 +925,6 @@ static void tegra_uart_rx_dma_complete(void *args)
 		 * If we are here, then tty buffer is full. Keep RTS and DMA
 		 * disabled. They are enabled later by error handler.
 		 */
-		async_tx_ack(prev_rx_dma_desc);
 		goto done;
 	}
 
@@ -932,7 +936,6 @@ static void tegra_uart_rx_dma_complete(void *args)
 	}
 
 	tegra_uart_start_rx_dma(tup);
-	async_tx_ack(prev_rx_dma_desc);
 
 done:
 	/* Activate flow control to start transfer */
@@ -967,7 +970,6 @@ static int tegra_uart_terminate_rx_dma(struct tegra_uart_port *tup)
 
 	ret = tegra_uart_rx_buffer_push(tup, state.residue);
 	tup->rx_dma_active = false;
-	async_tx_ack(tup->rx_dma_desc);
 
 	if (ret)
 		tup->rx_in_progress = 0;
-- 
2.43.2

I’ve verified that it works on the Xavier NX eMMC devkit, without hitting the kernel oops issue, using the steps you shared.

Thank you KevinFFF,

I will test this tomorrow and get back to you with the results.

Hi KevinFFF,

Preliminary testing is showing that this patch is working well. We have several weeks of stress testing and soak testing which might show up something, but so far I’ve not seen any kernel oops or data loss with the patch supplied above.

Thank you for your assistance on this tricky issue.

Hi KevinFFF,

The patch as provided has now passed our extended system testing and soak testing. Thank you for your assistance in resolving this issue.
