ttyTHS0 input overruns and DMA use-after-free, UEFI 4.1, kernel 6.6.6

@jonathanh Are you still working on UEFI and/or serial?
Remember this thread… Feedback on Experimental UEFI Firmware ?

I decided to upgrade my NX devkit to L4T 34.4.1 to get UEFI 4.1 but I still need to run a stock Fedora kernel which is now 6.6.6. Since I also still have a GPS on the UART pins on the 40-pin header (serial@3100000) I need the UEFI to not output to it so I’m doing exactly what I did in the earlier thread… disabling serial@3100000 in the dtb flashed to bootloader-dtb and re-enabling it in the kernel dtb with compatible = "nvidia,tegra194-hsuart";. It’s working fine except every few minutes I get…

[Fri Dec 15 16:59:53 2023] ttyTHS ttyTHS0: 1 input overrun(s)
[Fri Dec 15 17:02:45 2023] ttyTHS ttyTHS0: 2 input overrun(s)
[Fri Dec 15 17:04:11 2023] ttyTHS ttyTHS0: 2 input overrun(s)

This was not occurring with UEFI 1.1.2 and kernel 5.13.13.

So I decided to enable dma by adding …

dmas = <&gpcdma 8>, <&gpcdma 8>;
dma-names = "rx", "tx";

The gpcdma ‘8’ came from the TRM for uarta.
This, however, causes the following whenever the port is opened…

[   70.769605] arm-smmu 12000000.iommu: Unhandled context fault: fsr=0x402, iova=0xaf031080, fsynr=0x1d0011, cbfrsynra=0x20, cb=1
[   70.781058] ==================================================================
[   70.787672] BUG: KFENCE: use-after-free read in tegra_uart_rx_buffer_push+0x3c/0x158
[   70.797291] Use-after-free read at 0x00000000debeda51 (in kfence-#133):
[   70.807179] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.6.6-100.fc38.aarch64 #1
[   70.814442] Hardware name: Unknown NVIDIA Jetson Xavier NX Developer Kit/NVIDIA Jetson Xavier NX Developer Kit, BIOS 4.1-33958178 08/01/2023
[   70.827217] ==================================================================
[   70.834382] tegra-mc 2c00000.memory-controller: axisw: secure write @0x00000003ffffff00: VPR violation ((null))
[   70.844752] tegra-mc 2c00000.memory-controller: axisr: secure read @0x000000ffffffff00: EMEM address decode error (EMEM decode error)
[   71.019843] irq 91: nobody cared (try booting with the "irqpoll" option)
[   71.020459] handlers:
[   71.020551] [<00000000ade5dac1>] nvidia_smmu_global_fault
[   71.020698] Disabling IRQ #91
[   71.020902] tegra-mc 2c00000.memory-controller: axisr: secure read @0x000000ffffffff00: EMEM address decode error (EMEM decode error)
[   71.758197] arm-smmu 12000000.iommu: Unhandled context fault: fsr=0x402, iova=0xaf031000, fsynr=0x1d0011, cbfrsynra=0x20, cb=1
[   71.760636] tegra-mc 2c00000.memory-controller: axisw: secure write @0x00000003ffffff00: VPR violation ((null))
[   71.762243] tegra-mc 2c00000.memory-controller: axisw: secure write @0x00000003ffffff00: VPR violation ((null))

Turning on dma was just a test so I’m not worried about it not working but the overruns are a bit concerning since the port speed is only 115200 and the GPS is only outputting 3 NMEA sentences at the top of every second.

Any ideas?

Hi gtj,

Is that a typo?
Do you mean L4T R35.4.1 (Jetpack 5.1.2)?

Please share the full dmesg and device tree for further check.

Yeah it was a typo. 35.4.1 is correct.

nx-ntp-dmesg.txt (69.2 KB)
nx-ntp-dtbo-dts.txt (677 Bytes)
nx-ntp-full-dts.txt (152.2 KB)

The base dtb over which I’m applying the dtbo is tegra194-p3509-0000+p3668-0000.dtb from the kernel 6.6.7.

I’m also seeing the following messages just after things settle down…

tegra186-emc 2c60000.external-memory-controller: sync_state() pending due to 3510000.hda
tegra-mc 2c00000.memory-controller: sync_state() pending due to 3510000.hda
tegra186-emc 2c60000.external-memory-controller: sync_state() pending due to 17000000.gpu
tegra-mc 2c00000.memory-controller: sync_state() pending due to 17000000.gpu
tegra186-emc 2c60000.external-memory-controller: sync_state() pending due to 15380000.nvjpg
tegra-mc 2c00000.memory-controller: sync_state() pending due to 15380000.nvjpg
tegra186-emc 2c60000.external-memory-controller: sync_state() pending due to 154c0000.nvenc
tegra-mc 2c00000.memory-controller: sync_state() pending due to 154c0000.nvenc
tegra186-emc 2c60000.external-memory-controller: sync_state() pending due to 15a80000.nvenc
tegra-mc 2c00000.memory-controller: sync_state() pending due to 15a80000.nvenc
tegra186-emc 2c60000.external-memory-controller: sync_state() pending due to 2900800.ahub
tegra-mc 2c00000.memory-controller: sync_state() pending due to 2900800.ahub

There doesn’t seem to be any ill effect but the console is in text mode so I’m not stressing the GPU at all. I can do that later today.

Are you using the 3rd-party kernel on the Xavier NX devkit rather than the kernel from L4T?

These seem to be warning messages rather than errors.
Could you get the expected data from your GPS through the serial interface?

Yes. My understanding from when the UEFI was originally released was that the goal was to be able to run kernels directly from a distribution and not to have to run a specific L4T kernel. Hence the effort you guys put in to push many of the tegra related kernel patches to the mainstream kernel.

Every time one of those messages occurs, there’s a corruption in the serial data.
Here’s an example showing the first $PUBX message being truncated and the following $GPRMC starting without a newline in between.

$GPZDA,152734.00,03,01,2024,00,00*66
$PUBX,04,152734.00,030124,314853.99,2295,18,97113,544.534$GPRMC,152735.00,A,3947.22427,N,10505.99212,W,0.000,,030124,,,D,V*16
$GPZDA,152735.00,03,01,2024,00,00*67
$PUBX,04,152735.00,030124,314854.99,2295,18,97658,544.632,08*2F
$GPRMC,152736.00,A,3947.22427,N,10505.99212,W,0.000,,030124,,,D,V*15
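
Corruption like this can be caught automatically by checking each sentence's NMEA checksum (the XOR of every character between "$" and "*", compared with the two hex digits after "*"). What follows is a minimal sketch, not part of the original setup: nmea_ok is a made-up helper name, and it treats any sentence with a stray "$" or "*" in its body as corrupt.

```shell
#!/bin/sh
# nmea_ok SENTENCE: succeeds only if SENTENCE looks like $<body>*HH and
# HH equals the XOR of all bytes in <body>. Hypothetical helper for illustration.
nmea_ok() {
    nm_line=$(printf '%s' "$1" | tr -d '\r')   # drop a trailing CR if present
    case $nm_line in
        '$'*'*'??) ;;                           # $ ... * plus two characters
        *) return 1 ;;
    esac
    nm_body=${nm_line#?}                        # strip the leading $
    nm_want=${nm_body##*\*}                     # text after the last *
    nm_body=${nm_body%\**}                      # text before the last *
    case $nm_body in *'$'*|*'*'*) return 1 ;; esac   # two sentences run together
    case $nm_want in *[!0-9A-Fa-f]*) return 1 ;; esac
    nm_sum=0
    for nm_code in $(printf '%s' "$nm_body" | od -An -v -tu1); do
        nm_sum=$((nm_sum ^ nm_code))
    done
    [ "$nm_sum" -eq "$((0x$nm_want))" ]
}

# Example: report bad sentences from a captured log on stdin.
#   while IFS= read -r l; do nmea_ok "$l" || printf 'BAD: %s\n' "$l"; done < capture.txt
```

Run against the capture above, the $GPZDA line with *66 passes, while the merged $PUBX/$GPRMC line is rejected because a second "$" appears before the checksum.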

May I know how you connect the UART to the GPS module?
Could you share the block diagram of your connection?

The warning means the UART’s receive buffer is full, so incoming data might be lost.
Have you tried to increase buffer size or enable HW flow control to check if they could help?

It’s a simple two wire connection.

UART0_TXD (pin 8 on 40 pin header)  -> U-BLOX RXD
UART0_RXD (pin 10 on 40 pin header) -> U-BLOX TXD

Yep, I know that. :)

Increase which buffer?

HW flow control??? For a 115200 baud rate that only receives about 200 bytes per second? That should not even be close to needing flow control and never has before.
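
A back-of-the-envelope check of those numbers, as a rough sketch only. It assumes 8N1 framing (10 bits on the wire per byte, which is not stated in this thread) and uses the 32-byte fifosize that serial-tegra.c sets in tegra_uart_probe(). The variable names are made up for the illustration.

```shell
#!/bin/sh
baud=115200
bits_per_byte=10          # assumption: 8N1 (start + 8 data + stop)
fifo_bytes=32             # u->fifosize in tegra_uart_probe()
gps_bytes_per_sec=200     # approximate GPS output quoted above

# Maximum byte rate the line can carry.
line_rate=$((baud / bits_per_byte))
# Time for a back-to-back burst to fill the FIFO if nothing drains it.
fifo_fill_us=$((fifo_bytes * 1000000 / line_rate))
# Average line utilisation in parts per thousand.
util_permille=$((gps_bytes_per_sec * 1000 / line_rate))

echo "line rate:      $line_rate bytes/s"
echo "FIFO fill time: $fifo_fill_us us"
echo "utilisation:    $util_permille per mille"
```

Under those assumptions the line is idle more than 98% of the time, but a sentence arrives as a back-to-back burst, so the FIFO still has to be drained within roughly 2.8 ms of the first byte. That points at service latency rather than throughput.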

I’ve had this running fine for a few years. The issue only started happening after upgrading. I guess I could revert serial-tegra.c back to when I know it was working (I think it was kernel 5.13), rebuild the kernel and see if that fixes it. Then I’d have to do a git-bisect and recompile the kernel every time. I suppose I could build the driver as a module though. That’d make it easier to test.

Anyway, I was hoping you guys could look at the recent changes you’ve made and see if any could be causing the issue. Here’s a list, just FYI…

5abd01145d0c serial: tegra: handle clk prepare error in tegra_uart_hw_init()
f9061d3b7899 serial: tegra: Use devm_platform_get_and_ioremap_resource()
ad4484afe7de serial: tegra: Don't print error on probe deferral
29e5c442e553 tty: Explicitly include correct DT includes
fd2b55f86b8b serial: drivers: switch ch and flag to u8
38f28cfe9d08 serial: tegra: Add missing clk_disable_unprepare() in tegra_uart_hw_init()
109a951a9f1f serial: tegra: Read DMA status before terminating
b7e2647671a2 serial: tegra: Use uart_xmit_advance()
754f68044c7d serial: tegra: Use uart_xmit_advance(), fixes icount.tx accounting
cac8f7194111 serial: tegra: Remove custom frame size calculation
bec5b814d46c serial: Make ->set_termios() old ktermios const
eb01611056cf drivers: tty: serial: Add missing of_node_put() in serial-tegra.c
d93e612d13ba serial: tegra: fix typos in comments
988c5bbea59f tty: serial: make use of UART_LCR_WLEN() + tty_get_char_size()
b40de7469ef1 serial: tegra: Change lower tolerance baud rate limit for tegra20 and tegra30
a6a65f9ee093 serial: tegra: Use of_device_get_match_data
cc9ca4d95846 serial: tegra: Only print FIFO error message when an error occurs

Could you check if disabling getty service helps for your case?

$ sudo systemctl stop nvgetty.service
$ sudo systemctl disable nvgetty.service

It’s always been disabled otherwise ntpd wouldn’t be able to open /dev/ttyTHS0.

Just to eliminate application issues, I opened the port with cat as well as socat and just dumped the data to /dev/null but still got the overrun messages.

Please refer to the following topic to increase buffer size for UART.
Modify the uart buffer size - Jetson & Embedded Systems / Jetson AGX Orin - NVIDIA Developer Forums

Could you share the detailed reproduction steps so we can verify it locally?

That post mentions TEGRA_UART_FIFO_SIZE and although it’s defined in serial-tegra.c, it’s not actually used anywhere. Down in tegra_uart_probe(), fifosize is forced to 32…

       u = &tup->uport;
       u->dev = &pdev->dev;
       u->ops = &tegra_uart_ops;
       u->type = PORT_TEGRA;
       u->fifosize = 32;
       tup->cdata = cdata;

But I did set …

#define TEGRA_UART_FIFO_SIZE                   256
...
       u->fifosize = TEGRA_UART_FIFO_SIZE;

and recompiled but it didn’t help at all.

Could you share the detailed reproduction steps so we can verify it locally?

You need to start with an NX that’s been flashed with a bootloader_dtb that’s had “/serial@3100000” disabled…

fdtput kernel/dtb/tegra194-p3668-0000-p3509-0000.dtb --type s "/serial@3100000" "status" "disabled"

Then flash the NX as usual. This prevents the UEFI from reading from or writing to ttyTHS0.

When the NX is available again, install the latest stock Linux kernel. How you do it is up to you. Find the stock tegra194-p3509-0000+p3668-0000.dtb for that kernel and change the “compatible” parameter for “serial@3100000” from “nvidia,tegra194-uart” to “nvidia,tegra194-hsuart” and add a “serial0” alias for it. Just enabling the OF node will NOT make the serial port available again. You have to change the compatible from -uart to -hsuart. I don’t know why.

fdtput <path_to_dtb> --type s "/bus@0/serial@3100000" "compatible" "nvidia,tegra194-hsuart"
fdtput <path_to_dtb> --type s "/aliases" "serial0" "/bus@0/serial@3100000"

Note that in stock kernels, most devices reside under “/bus@0” whereas in your L4T modified kernel, they’re at the root level.

Copy the updated dtb to the boot directory on the NX.
Update your bootloader to pass the new dtb to the kernel. Fedora uses BLS as the default so for me it’s just a matter of adding
devicetree /new-dtb.dtb to the end of the /boot/loader/entries files.
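
The BLS edit can be sketched as a small shell function. This is only an illustration: add_devicetree is a made-up helper name, it assumes the entries are *.conf files, and it takes the entries directory as an argument so it can be tried on a copy before touching /boot/loader/entries.

```shell
#!/bin/sh
# add_devicetree ENTRIES_DIR DTB_PATH
# Hypothetical helper (not a Fedora tool): appends a "devicetree" line to
# every BLS entry in ENTRIES_DIR that does not already have one.
add_devicetree() {
    entries_dir=$1
    dtb=$2
    for f in "$entries_dir"/*.conf; do
        [ -f "$f" ] || continue
        # Leave entries that already name a devicetree alone.
        if grep -q '^devicetree[[:space:]]' "$f"; then
            continue
        fi
        # Make sure the file ends with a newline before appending.
        if [ -s "$f" ] && [ -n "$(tail -c 1 "$f")" ]; then
            echo >> "$f"
        fi
        printf 'devicetree %s\n' "$dtb" >> "$f"
    done
}

# Example (as root, after backing up the entries):
#   add_devicetree /boot/loader/entries /new-dtb.dtb
```

Skipping entries that already carry a devicetree line keeps the function safe to run more than once.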

Also make sure the kernel command line does NOT start a console on ttyTHS0 or ttyS0 and of course, make sure there’s no getty running on it either.
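
A quick way to check the command line for a leftover serial console is sketched below. has_serial_console is a made-up helper name, and it only looks for the two device names mentioned above.

```shell
#!/bin/sh
# has_serial_console "CMDLINE"
# Succeeds if CMDLINE contains console=ttyTHS0 or console=ttyS0, with or
# without options such as ",115200". Hypothetical helper for illustration.
has_serial_console() {
    case " $1 " in
        *" console=ttyTHS0 "*|*" console=ttyTHS0,"*) return 0 ;;
        *" console=ttyS0 "*|*" console=ttyS0,"*)     return 0 ;;
    esac
    return 1
}

# Example on the running system:
#   if has_serial_console "$(cat /proc/cmdline)"; then
#       echo "a serial console is still configured"
#   fi
```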

Connect a host machine serial port (hardware, USB, whatever) in a cross-over configuration to pins 6, 8 and 10 on the NX 40-pin header.

Host GND -> NX GND (pin 6)
Host RXD -> NX TXD (pin 8)
Host TXD -> NX RXD (pin 10)

Assuming you have a serial converter on ttyUSB1…

stty -F /dev/ttyUSB1 115200 raw
cat /dev/ttyUSB1

Reboot the NX. You may see shutdown messages from the serial port but once the NX resets, you should not see any UEFI or kernel messages coming across the serial port.
Kill the cat.

On the NX…
Open a terminal window or open an ssh session from the host and do a dmesg -weTPL to see the kernel messages.
Open another terminal window or another ssh session from the host and run…

stty -F /dev/ttyTHS0 115200 raw
cat /dev/ttyTHS0

On the Host…

stty -F /dev/ttyUSB1 115200 raw
echo "this is a test" > /dev/ttyUSB1

You should see “this is a test” on the NX. If not, something else is wrong and you’ll have to troubleshoot.

Back on the NX.
Kill the cat and run

cat /dev/ttyTHS0 > /dev/null

Back on the Host, find a text file that has a few K of data in it and start sending some data…

while true ; do cat <filename> > /dev/ttyUSB1 ; done
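
If you want to see which lines get lost rather than just that something was lost, the test file can carry sequence numbers. The following is a sketch under that assumption. gen_lines and find_gaps are made-up helper names, and a line whose number is damaged in transit is simply skipped, so it shows up as a gap.

```shell
#!/bin/sh
# gen_lines COUNT: print COUNT numbered lines (about 50 bytes each).
gen_lines() {
    awk -v n="$1" 'BEGIN {
        for (i = 1; i <= n; i++)
            printf "%06d the quick brown fox jumps over the lazy dog\n", i
    }'
}

# find_gaps: read received lines on stdin and report missing numbers.
# A restart at 1 is treated as the file being sent again, not as a gap.
find_gaps() {
    awk '$1 ~ /^[0-9]+$/ {
        n = $1 + 0
        if (prev && n != prev + 1 && n != 1)
            printf "gap after %d (next seen %d)\n", prev, n
        prev = n
    }'
}

# Example:
#   host:  gen_lines 100 > seq.txt   (then send seq.txt in the loop)
#   NX:    cat /dev/ttyTHS0 | find_gaps
```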

You should start seeing the overrun messages in the dmesg window…

[13495.623506] ttyTHS ttyTHS0: 1 input overrun(s)
[13497.887454] ttyTHS ttyTHS0: 12 input overrun(s)
[13591.507142] ttyTHS ttyTHS0: 40 input overrun(s)
[13593.647107] ttyTHS ttyTHS0: 22 input overrun(s)
[13595.410008] ttyTHS ttyTHS0: 12 input overrun(s)
[13597.536825] ttyTHS ttyTHS0: 10 input overrun(s)
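
To compare runs (different kernels, FIFO settings and so on) it helps to total these messages instead of eyeballing them. This is a minimal sketch, and sum_overruns is a made-up helper name. It only relies on the "N input overrun(s)" wording shown in the log lines above.

```shell
#!/bin/sh
# sum_overruns: read kernel log lines on stdin and print how many overrun
# messages were seen and the total number of overruns they report.
sum_overruns() {
    awk '
        /input overrun/ {
            for (i = 2; i < NF; i++)
                if ($i == "input" && $(i + 1) ~ /^overrun/) {
                    total += $(i - 1)   # the count sits just before "input"
                    events++
                }
        }
        END { printf "%d events, %d overruns\n", events, total }
    '
}

# Example:
#   dmesg | sum_overruns
```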
