Xavier NX R35.6.0 Kernel Oops

Hi NVIDIA team,

We have been trying to integrate the R35.6.0 JetPack release into our product, which is based on the Jetson Xavier NX, but we are hitting occasional kernel oopses. For example (from the serial console):

[13275.636315] soctherm: OC ALARM 0x00000001  
[41816.879799] kernel BUG at mm/slub.c:4118!  
[41816.879959] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP  
[41816.880155] Modules linked in: fuse cfg80211 tcp_diag inet_diag veth nfnetlink_acct xt_mark xt_MASQUERADE nf_conntrack_netlink br_netfilter overlay ramoops reed_solomon realtek nfnetlink smsc ip6table_nat iptable_nat nf_nat smsc95xx loop snd_soc_tegra210_iqc snd_soc_tegra186_asrc snd_soc_tegra210_op  
e snd_soc_tegra186_arad snd_soc_tegra186_dspk snd_soc_tegra210_mvc snd_soc_tegra210_afc snd_soc_tegra210_dmic snd_soc_tegra210_adx snd_soc_tegra210_amx snd_soc_tegra210_i2s snd_soc_tegra210_mixer snd_soc_tegra210_admaif snd_soc_tegra_pcm snd_soc_tegra210_sfc aes_ce_blk crypto_simd cryptd aes_ce_cipher  
binfmt_misc ghash_ce sha2_ce sha256_arm64 sha1_ce snd_soc_spdif_tx snd_soc_tegra_machine_driver snd_soc_tegra210_adsp snd_soc_tegra_utils snd_soc_simple_card_utils nvadsp userspace_alert tegra_bpmp_thermal snd_soc_tegra210_ahub max77620_thermal tegra210_adma snd_hda_codec_hdmi nv_imx219 snd_hda_tegra s  
nd_hda_codec snd_hda_core spi_tegra114 nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt  
[41816.880458]  nf_log_ipv4 ina3221 nf_log_common pwm_fan ipt_REJECT nf_reject_ipv4 xt_LOG nvgpu(E) xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip6table_filter ip6_tables nvmap iptable_filter ip_tables x_tables [last unloaded: leds_gpio]  
[41816.912120] CPU: 1 PID: 2663 Comm: P[cgroups] Tainted: G            E     5.10.216-tegra #1  
[41816.920246] Hardware name: NVIDIA NVIDIA Jetson Xavier NX Developer Kit/Jetson, BIOS 202210.5-78a917ec-dirty 09/25/2024  
[41816.931538] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--)  
[41816.937581] pc : kfree+0x41c/0x4a0  
[41816.940987] lr : cgroup_file_release+0x6c/0xc0  
[41816.945185] sp : ffff800029d73c90  
[41816.948856] x29: ffff800029d73c90 x28: ffff394da1005880    
[41816.954114] x27: 0000000000000000 x26: 0000000000000000    
[41816.959882] x25: 0000000000000000 x24: 0000000000000000    
[41816.965148] x23: ffffa3e81e374000 x22: ffff394da1005880    
[41816.970905] x21: ffff394fa8142600 x20: 0000000000000000    
[41816.976161] x19: fffffee53e805080 x18: 0000000000000000    
[41816.981757] x17: 0000000000000000 x16: 0000000000000000    
[41816.987010] x15: 0000000000000000 x14: 0000000000000000    
[41816.992781] x13: 0000000000000000 x12: 0000000000000000    
[41816.998034] x11: 0000000000000000 x10: 0000000000000000    
[41817.003629] x9 : 0000000000000000 x8 : 0000000000000002    
[41817.009142] x7 : 07ffffffffffffff x6 : ffff394da11d7d00    
[41817.014420] x5 : ffff394da846e9a0 x4 : ffffa3e81e380b28    
[41817.019825] x3 : 0000000000000000 x2 : ffffa3e81c492880    
[41817.025158] x1 : fffffee53e8050c8 x0 : fffffee53e8050c8    
[41817.030773] Call trace:  
[41817.033215]  kfree+0x41c/0x4a0  
[41817.036363]  cgroup_file_release+0x6c/0xc0  
[41817.040149]  kernfs_fop_release+0xa0/0xc0  
[41817.044407]  __fput+0x80/0x260  
[41817.047382]  ____fput+0x24/0x30  
[41817.050534]  task_work_run+0x88/0xe0  
[41817.054036]  do_notify_resume+0x24c/0x990  
[41817.057975]  work_pending+0xc/0x738  
[41817.061738] Code: f9400660 3707fae0 a9046bf9 f9002bfb (d4210000)    
[41817.067868] ---[ end trace 4690051af44342d3 ]---  
[41817.088105] Kernel panic - not syncing: Oops - BUG: Fatal exception  
[41817.088310] SMP: stopping secondary CPUs  
[41817.088444] Kernel Offset: 0x23e80c2c0000 from 0xffff800010000000  
[41817.088664] PHYS_OFFSET: 0xffffc6b380000000  
[41817.092182] CPU features: 0x48240002,03802a30  
[41817.096725] Memory Limit: none  
[41817.111103] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]---  
^@<FF><E2>  
[0000.025] W> RATCHET: MB1 binary ratchet value 4 is larger than ratchet level 2 from HW fuses.  
[0000.033] I> MB1 (prd-version: 2.6.0.0-t194-41334769-cab45716)  
[0000.038] I> Boot-mode: Coldboot  
[0000.041] I> Platform: Silicon  
[0000.044] I> Chip revision : A02P  
[0000.047] I> Bootrom patch version : 15 (correctly patched)

We are running a stock Linux kernel and nearly stock bootloader (we change the boot logo when we build it):

<username>@<machine-name>:~$ uname -a
Linux ep-1026-xavier 5.10.216-tegra #1 SMP PREEMPT Fri Sep 13 08:55:39 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
<username>@<machine-name>:~$ sudo nvbootctrl dump-slots-info
Current version: 35.6.0
Capsule update status: 1
Current bootloader slot: B
Active bootloader slot: B
num_slots: 2
slot: 0,             status: normal
slot: 1,             status: normal

Our workload consists of Docker containers interacting with the serial ports, USB, HDMI & NVMe drive.
Is this something you have seen before with this release?

No, we have not seen this issue before. Please try to find a way to reproduce it on the NVIDIA devkit.

Hi NVIDIA team,

I have spent some time trying to minimise a test case using the developer kit and stock software. I believe I have recreated the fault in a way that you should also be able to reproduce.

I believe this is due to a programming fault in the serial-tegra driver, which has seen a number of changes over the last few JetPack releases. I suspect that the driver is performing an out-of-bounds write.

To start with, I set up the devkit with the stock bootloader, kernel and rootfs as follows.
Unpack the Jetson Linux release:

tar -xf Jetson_Linux_R35.6.0_aarch64.tbz2

Then unpack the sample rootfs:

cd rootfs
tar -xf Tegra_Linux_Sample-Root-Filesystem_R35.6.0_aarch64.tbz2

Apply binaries:

cd ..
sudo ./apply_binaries.sh

This is then programmed to the dev kit via the USB connection:

sudo ROOTFS_AB=1 ADDITIONAL_DTB_OVERLAY_OPT="BootOrderNvme.dtbo" BOARDID=3668 BOARDSKU=0003 \
    ./tools/kernel_flash/l4t_initrd_flash.sh --no-flash --external-device nvme0n1 \
    -c flash_l4t_nvme_rootfs_ab_local.xml -S 20GiB -p "-C nv-auto-config" jetson-xavier-nx-devkit nvme0n1

sudo ROOTFS_AB=1 ROOTFS_RETRY_COUNT_MAX=3 ADDITIONAL_DTB_OVERLAY_OPT="BootOrderNvme.dtbo" ./tools/kernel_flash/l4t_initrd_flash.sh \
  --erase-all --external-device nvme0n1 -c flash_l4t_nvme_rootfs_ab_local.xml -S 20GiB \
  --showlogs --flash-only jetson-xavier-nx-devkit external

Note that we use a customised filesystem layout with a few minor tweaks. I have included it below for reference, but I doubt this is the source of the issue.

<?xml version="1.0"?>

<!-- Nvidia Tegra Partition Layout Version 1.0.0 -->

<!--
This file has been created as an amalgamation of the flash_l4t_nvme_rootfs_ab.xml
file of JetPack 5.1.1 and JetPack 5.1.3.

The deployed machines were built with the file from JetPack 5.1.1.

NVIDIA changed this in JetPack 5.1.3 in a few ways:
 - The APP/APP_b partitions were put at the end of this file, just before the secondary
   GPT entries. This broke the UDA partition: it was no longer possible to set
   the allocation_attribute to 0x808 which enables the "auto-resize" functionality.
   This has been reverted; these partitions are now again at the start of the
   file.
 - The sizes of the various partitions were changed. This has been reverted back
   to the original sizes.
 - The external device size has been redefined from 122159104 sectors (62 GB) to
   a variable NUM_SECTORS. This has been kept. Previously the file had to be
   patched to support our selection of 20 GiB rootfs partitions, but this no
   longer needs to be done.
 - A few other miscellaneous changes were introduced in 5.1.3. These have been
   kept.

 -->
<partition_layout version="01.00.0000">
    <device type="external" instance="0" sector_size="512" num_sectors="NUM_SECTORS">
        <partition name="master_boot_record" type="protective_master_boot_record">
            <allocation_policy> sequential </allocation_policy>
            <filesystem_type> basic </filesystem_type>
            <size> 512 </size>
            <file_system_attribute> 0 </file_system_attribute>
            <allocation_attribute> 8 </allocation_attribute>
            <percent_reserved> 0 </percent_reserved>
            <description> **Required.** Contains protective MBR. </description>
        </partition>
        <partition name="primary_gpt" type="primary_gpt">
            <allocation_policy> sequential </allocation_policy>
            <filesystem_type> basic </filesystem_type>
            <size> 19968 </size>
            <file_system_attribute> 0 </file_system_attribute>
            <allocation_attribute> 8 </allocation_attribute>
            <percent_reserved> 0 </percent_reserved>
            <description> **Required.** Contains primary GPT of the `external` device. All
              partitions defined after this entry are configured in the kernel, and are
              accessible by standard partition tools such as gdisk and parted. </description>
        </partition>
        <partition name="APP" id="1" type="data">
            <allocation_policy> sequential </allocation_policy>
            <filesystem_type> basic </filesystem_type>
            <size> APPSIZE </size>
            <file_system_attribute> 0 </file_system_attribute>
            <allocation_attribute> 8 </allocation_attribute>
            <align_boundary> 4096 </align_boundary>
            <percent_reserved> 0 </percent_reserved>
            <filename> APPFILE </filename>
            <unique_guid> APPUUID </unique_guid>
            <description> **Required.** Contains the rootfs. This partition must be assigned
              the "1" for id as it is physically put to the end of the device, so that it
              can be accessed as the fixed known special device, for example, `/dev/mmcblk1p1`
              or `/dev/nvme0n1p1`. </description>
        </partition>
        <partition name="APP_b" id="2" type="data">
            <allocation_policy> sequential </allocation_policy>
            <filesystem_type> basic </filesystem_type>
            <size> APPSIZE </size>
            <file_system_attribute> 0 </file_system_attribute>
            <allocation_attribute> 8 </allocation_attribute>
            <align_boundary> 4096 </align_boundary>
            <percent_reserved> 0 </percent_reserved>
            <filename> APPFILE_b </filename>
            <unique_guid> APPUUID_b </unique_guid>
            <description> **Required.** Contains the rootfs. This partition must be assigned
              the "2" for id as it is physically put to the end of the device, so that it
              can be accessed as the fixed known special device, for example, `/dev/mmcblk1p2`
              or `/dev/nvme0n1p2`. </description>
        </partition>
        <partition name="kernel" id="3" type="kernel">
            <allocation_policy> sequential </allocation_policy>
            <filesystem_type> basic </filesystem_type>
            <size> 67108864 </size>
            <file_system_attribute> 0 </file_system_attribute>
            <allocation_attribute> 8 </allocation_attribute>
            <percent_reserved> 0 </percent_reserved>
            <filename> LNXFILE </filename>
            <description> **Required.** Chain A; contains boot.img (kernel, initrd, etc)
              which is loaded in when cpu-bootloader fails to launch the kernel
              from the rootfs at `/boot`. </description>
        </partition>
        <partition name="kernel-dtb" type="kernel_dtb">
            <allocation_policy> sequential </allocation_policy>
            <filesystem_type> basic </filesystem_type>
            <size> 458752 </size>
            <file_system_attribute> 0 </file_system_attribute>
            <allocation_attribute> 8 </allocation_attribute>
            <percent_reserved> 0 </percent_reserved>
            <filename> DTB_FILE </filename>
            <description> **Required.** Chain A; contains kernel device tree blob. </description>
        </partition>
        <partition name="reserved_for_chain_A_user" type="data">
            <allocation_policy> sequential </allocation_policy>
            <filesystem_type> basic </filesystem_type>
            <size> 33554432 </size>
            <file_system_attribute> 0 </file_system_attribute>
            <allocation_attribute> 8 </allocation_attribute>
            <percent_reserved> 0 </percent_reserved>
            <description> **Required.** Reserved space for chain A on user device. </description>
        </partition>
        <partition name="kernel_b" type="kernel">
            <allocation_policy> sequential </allocation_policy>
            <filesystem_type> basic </filesystem_type>
            <size> 67108864 </size>
            <file_system_attribute> 0 </file_system_attribute>
            <allocation_attribute> 8 </allocation_attribute>
            <percent_reserved> 0 </percent_reserved>
            <filename> LNXFILE_b </filename>
            <description> **Required.** Chain B; contains boot.img (kernel, initrd, etc)
              which is loaded in when cpu-bootloader fails to launch the kernel
              from the rootfs at `/boot`. </description>
        </partition>
        <partition name="kernel-dtb_b" type="kernel_dtb">
            <allocation_policy> sequential </allocation_policy>
            <filesystem_type> basic </filesystem_type>
            <size> 458752 </size>
            <file_system_attribute> 0 </file_system_attribute>
            <allocation_attribute> 8 </allocation_attribute>
            <percent_reserved> 0 </percent_reserved>
            <filename> DTB_FILE </filename>
            <description> **Required.** Chain B; contains kernel device tree blob. </description>
        </partition>
        <partition name="reserved_for_chain_B_user" type="data">
            <allocation_policy> sequential </allocation_policy>
            <filesystem_type> basic </filesystem_type>
            <size> 33554432 </size>
            <file_system_attribute> 0 </file_system_attribute>
            <allocation_attribute> 8 </allocation_attribute>
            <percent_reserved> 0 </percent_reserved>
            <description> **Required.** Reserved space for chain B on user device. </description>
        </partition>
        <partition name="RECNAME" type="kernel">
            <allocation_policy> sequential </allocation_policy>
            <filesystem_type> basic </filesystem_type>
            <size> RECSIZE </size>
            <file_system_attribute> 0 </file_system_attribute>
            <allocation_attribute> 8 </allocation_attribute>
            <percent_reserved> 0 </percent_reserved>
            <filename> RECFILE </filename>
            <description> **Required.** Contains recovery image. </description>
        </partition>
        <partition name="RECDTB-NAME" type="kernel_dtb">
            <allocation_policy> sequential </allocation_policy>
            <filesystem_type> basic </filesystem_type>
            <size> 524288 </size>
            <file_system_attribute> 0 </file_system_attribute>
            <allocation_attribute> 8 </allocation_attribute>
            <percent_reserved> 0 </percent_reserved>
            <filename> RECDTB-FILE </filename>
            <description> **Required.** Contains recovery DTB image. </description>
        </partition>
        <partition name="RECROOTFS" type="data">
            <allocation_policy> sequential </allocation_policy>
            <filesystem_type> basic </filesystem_type>
            <size> RECROOTFSSIZE </size>
            <file_system_attribute> 0 </file_system_attribute>
            <allocation_attribute> 8 </allocation_attribute>
            <percent_reserved> 0 </percent_reserved>
            <description> **Optional.** Reserved for future use by the recovery filesystem;
              removable. </description>
        </partition>
        <partition name="esp" type="data">
            <allocation_policy> sequential </allocation_policy>
            <filesystem_type> basic </filesystem_type>
            <size> 67108864 </size>
            <file_system_attribute> 0 </file_system_attribute>
            <allocation_attribute> 8 </allocation_attribute>
            <percent_reserved> 0 </percent_reserved>
            <filename> ESP_FILE </filename>
            <partition_type_guid> C12A7328-F81F-11D2-BA4B-00A0C93EC93B </partition_type_guid>
            <description> **Required.** EFI system partition with L4T Launcher. </description>
        </partition>
        <partition name="RECNAME_alt" type="kernel">
            <allocation_policy> sequential </allocation_policy>
            <filesystem_type> basic </filesystem_type>
            <size> RECSIZE </size>
            <file_system_attribute> 0 </file_system_attribute>
            <allocation_attribute> 8 </allocation_attribute>
            <percent_reserved> 0 </percent_reserved>
            <description> **Required.** For fail-safe recovery update. </description>
        </partition>
        <partition name="RECDTB-NAME_alt" type="kernel_dtb">
            <allocation_policy> sequential </allocation_policy>
            <filesystem_type> basic </filesystem_type>
            <size> 524288 </size>
            <file_system_attribute> 0 </file_system_attribute>
            <allocation_attribute> 8 </allocation_attribute>
            <percent_reserved> 0 </percent_reserved>
            <description> **Required.** For fail-safe recovery DTB update. </description>
        </partition>
        <partition name="esp_alt" type="data">
            <allocation_policy> sequential </allocation_policy>
            <filesystem_type> basic </filesystem_type>
            <size> 67108864 </size>
            <file_system_attribute> 0 </file_system_attribute>
            <allocation_attribute> 8 </allocation_attribute>
            <percent_reserved> 0 </percent_reserved>
            <description> **Required.** EFI system partition for fail-safe ESP update. </description>
        </partition>
        <partition name="UDA" type="data">
            <allocation_policy> sequential </allocation_policy>
            <filesystem_type> basic </filesystem_type>
            <size> 18432 </size>
            <file_system_attribute> 0 </file_system_attribute>
            <allocation_attribute> 0x808 </allocation_attribute>
            <percent_reserved> 0 </percent_reserved>
            <align_boundary> 4096 </align_boundary>
            <description> **Required.** This partition may be mounted and used to store user
              data. </description>
        </partition>
        <partition name="secondary_gpt" type="secondary_gpt">
            <allocation_policy> sequential </allocation_policy>
            <filesystem_type> basic </filesystem_type>
            <size> 0xFFFFFFFFFFFFFFFF </size>
            <file_system_attribute> 0 </file_system_attribute>
            <allocation_attribute> 8 </allocation_attribute>
            <percent_reserved> 0 </percent_reserved>
            <description> **Required.** Contains secondary GPT of the `external`
              device. </description>
        </partition>
    </device>
</partition_layout>

The OS is booted on the devkit, and the configuration wizard is followed over the USB serial port console as normal.

To demonstrate the fault, it is useful to enable slub_debug in the kernel. Occasionally, the fault described here does not show up as a full-blown kernel oops, but slub_debug still detects the invalid write. To enable it, add "slub_debug=FZP" to the end of the kernel command line in /boot/extlinux/extlinux.conf and then reboot.
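
The edit to the kernel command line can also be scripted. The following is only a sketch: it assumes the command line lives on lines beginning with the APPEND keyword, and the sample entry in it is hypothetical rather than copied from a devkit. It works on text only, so the real file would still have to be read and written back as root.

```python
def add_kernel_arg(conf_text: str, arg: str = "slub_debug=FZP") -> str:
    """Append `arg` to each APPEND line that does not already carry it."""
    patched = []
    for line in conf_text.splitlines():
        words = line.split()
        # Only touch active APPEND lines; commented-out lines start with '#'.
        if words and words[0].upper() == "APPEND" and arg not in words:
            line = f"{line.rstrip()} {arg}"
        patched.append(line)
    return "\n".join(patched) + "\n"


# Hypothetical sample entry; the real file on the devkit will differ.
sample = (
    "LABEL primary\n"
    "      LINUX /boot/Image\n"
    "      APPEND ${cbootargs} root=/dev/nvme0n1p1 rw\n"
)
print(add_kernel_arg(sample), end="")
```

Running the function twice leaves the text unchanged, so it is safe to apply to a file that already has the option.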

Ideally, I would also run KASAN, but this does not appear to work with the NVIDIA kernel: the system hits an oops before the kernel has finished initialising.

I also set the power model of the platform to high-power, with all 6 cores enabled. This seems to make the fault occur sooner as well:

sudo nvpmodel -m 8

Now, to trigger the fault, I start two different kinds of processes:

  • One is a fairly standard disk I/O stressor. By itself this does not cause the fault, but the increased kernel activity makes the corruption escalate into a kernel oops sooner.
  • The second is a Python script that continuously opens, reads, writes and then closes the serial port. This is what causes the fault.

Start the disk stressor:

sudo apt-get install stress-ng
stress-ng -d 6 --aggressive

Python script to stress serial port:

#!/usr/bin/python3
# Usage: serial_stressor.py <serial-device>, for example /dev/ttyTHS0

import sys

import serial  # pyserial (python3-serial package)


# Serial port parameters
serial_port = sys.argv[1]
baud_rate = 115200

# Repeatedly open, read, write and close the port.
while True:
    print("Loop")
    s = serial.Serial(serial_port, baud_rate, exclusive=True, timeout=0.1)
    print("Serial port connected to {}".format(serial_port), flush=True)

    # Read up to 128 bytes; returns after the 0.1 s timeout if nothing arrives.
    read_data = s.read(128).decode('utf-8', errors='backslashreplace')
    print("Read done", flush=True)
    s.write("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA".encode())
    print("Write done", flush=True)
    s.close()
    print("Close done", flush=True)

Install the pyserial package:

sudo apt-get install python3-serial

Run the serial port stressors:

sudo python3 serial_stressor.py /dev/ttyTHS0
sudo python3 serial_stressor.py /dev/ttyTHS1

After some time (between 1 and 20 minutes), the fault shows up as either a slub_debug warning message or a kernel oops.
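
Because the wait can be long, it may help to watch the kernel log automatically. The following is a rough sketch, not a complete tool: the two patterns are taken only from the log excerpts quoted in this thread, and the `classify` helper is a name made up for illustration.

```python
import re

# Patterns based on the log excerpts in this thread; other kernels may differ.
SLUB_RE = re.compile(r"BUG (\S+) \(.*\): (.+?)\s*$")
OOPS_RE = re.compile(r"Internal error: Oops|kernel BUG at|Unable to handle kernel")


def classify(line: str):
    """Return ("slub_debug", detail), ("oops", line) or None for one log line."""
    match = SLUB_RE.search(line)
    if match:
        return ("slub_debug", f"{match.group(1)}: {match.group(2)}")
    if OOPS_RE.search(line):
        return ("oops", line.strip())
    return None


demo = [
    "[13275.636315] soctherm: OC ALARM 0x00000001",
    "[ 2657.761796] BUG kmalloc-256 (Tainted: G            E    ): Poison overwritten  ",
    "[41816.879799] kernel BUG at mm/slub.c:4118!  ",
]
for entry in demo:
    print(classify(entry))
```

In practice the same function could be applied to each line read from the serial console capture or from the kernel log.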

An example of the slub_debug output:

[ 2657.761796] BUG kmalloc-256 (Tainted: G            E    ): Poison overwritten  
[ 2657.762033] -----------------------------------------------------------------------------  
[ 2657.762033]    
[ 2657.762316] Disabling lock debugging due to kernel taint  
[ 2657.762526] INFO: 0x00000000f112dca3-0x00000000f5fa89fa @offset=13340. First byte 0xf instead of 0x6b  
[ 2657.762726] INFO: Slab 0x00000000a175c326 objects=21 used=21 fp=0x0000000000000000 flags=0x8000000000010200  
[ 2657.762926] INFO: Object 0x00000000bcedf435 @offset=13312 fp=0x00000000262e20cd  
[ 2657.762926]    
[ 2657.763131] Redzone  0000000013271211: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................  
[ 2657.763333] Redzone  00000000d5c6c138: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................  
[ 2657.763527] Redzone  0000000049e76646: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................  
[ 2657.763825] Redzone  0000000089d9f38c: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................  
[ 2657.767246] Redzone  0000000061e465ae: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................  
[ 2657.776528] Redzone  00000000c3b534ec: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................  
[ 2657.786322] Redzone  000000008afae9a6: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................  
[ 2657.795964] Redzone  00000000f4bcf38e: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................  
[ 2657.805229] Redzone  000000008a9de8c6: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................  
[ 2657.815021] Redzone  00000000f56f6abf: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................  
[ 2657.824302] Redzone  00000000f2f56c13: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................  
[ 2657.833840] Redzone  000000006e7d99de: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................  
[ 2657.843377] Redzone  00000000dae0326d: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................  
[ 2657.852914] Redzone  00000000573c3421: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................  
[ 2657.862452] Redzone  00000000990f3660: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................  
[ 2657.872010] Redzone  0000000078b556f2: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................  
[ 2657.881784] Object   00000000bcedf435: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[ 2657.891320] Object   0000000006b426e1: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 0f 7c ff ff  kkkkkkkkkkkk.|..  
[ 2657.900607] Object   00000000eff94206: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[ 2657.910396] Object   00000000e79a47d6: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[ 2657.919676] Object   00000000df28da82: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[ 2657.929220] Object   00000000d945906a: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[ 2657.939013] Object   00000000c6c7b1da: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[ 2657.948550] Object   000000001e160536: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[ 2657.958089] Object   00000000dc95f001: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[ 2657.967363] Object   00000000932553b9: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[ 2657.976901] Object   00000000a3f6087c: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[ 2657.986696] Object   0000000015a03f25: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[ 2657.996234] Object   00000000dcbecf39: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[ 2658.005771] Object   00000000479df2ec: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[ 2658.015050] Object   000000002a8fc7fa: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[ 2658.024588] Object   00000000004774a6: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.  
[ 2658.034383] Redzone  00000000dd95fef4: bb bb bb bb bb bb bb bb                          ........  
[ 2658.042876] Padding  00000000b76b04dd: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ  
[ 2658.052414] Padding  000000002d9fb3bf: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ  
[ 2658.061950] Padding  00000000383dc28d: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ  
[ 2658.071744] Padding  000000002771bc7b: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ  
[ 2658.081305] Padding  00000000326559a5: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ  
[ 2658.090820] Padding  00000000a65dabde: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ  
[ 2658.100358] Padding  000000009595bd2c: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ  
[ 2658.109749] Padding  0000000071bd4663: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ  
[ 2658.119521] Padding  00000000fa314f48: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ  
[ 2658.129058] Padding  00000000b9f40ee3: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ  
[ 2658.138596] Padding  000000000ba9b3df: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ  
[ 2658.148133] Padding  000000001c868305: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ  
[ 2658.157420] Padding  00000000cdef50e7: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ  
[ 2658.167208] Padding  00000000cd738ec0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ  
[ 2658.176768] Padding  000000006894afd5: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ  
[ 2658.186412] FIX kmalloc-256: Restoring 0x00000000f112dca3-0x00000000f5fa89fa=0x6b  
[ 2658.186412]    
[ 2658.195296] FIX kmalloc-256: Marking all objects used
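
To locate the overwritten bytes without reading the dump by eye, the "Object" rows can be parsed and compared against the poison pattern. This is a sketch under assumptions: it relies on the kernel's usual poison values (0x6b for a freed object, 0xa5 in its final byte) and on the row layout shown above, and the function name is made up for illustration.

```python
import re

POISON_FREE = 0x6B  # fill byte for a freed object
POISON_END = 0xA5   # marker in the final byte of a freed object

# One "Object" row: hashed address, then up to 16 hex bytes.
ROW_RE = re.compile(r"Object\s+[0-9a-f]+:((?:\s[0-9a-f]{2}){1,16})")


def find_overwrites(log_lines):
    """Return (offset, value) for every byte that is not the expected poison."""
    data = []
    for line in log_lines:
        match = ROW_RE.search(line)
        if match:
            data.extend(int(b, 16) for b in match.group(1).split())
    hits = []
    for offset, value in enumerate(data):
        expected = POISON_END if offset == len(data) - 1 else POISON_FREE
        if value != expected:
            hits.append((offset, value))
    return hits


# First, second and last Object rows from the dump above (shortened example).
rows = [
    "[ 2657.881784] Object   00000000bcedf435: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk",
    "[ 2657.891320] Object   0000000006b426e1: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 0f 7c ff ff  kkkkkkkkkkkk.|..",
    "[ 2658.024588] Object   00000000004774a6: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.",
]
for offset, value in find_overwrites(rows):
    print(f"offset {offset}: 0x{value:02x}")
```

For the dump above, the first overwritten byte is at offset 28 into the object, which agrees with the INFO lines (13340 - 13312 = 28). Four bytes, 0f 7c ff ff, were written into an object that slub_debug considered free.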

An example of a kernel oops:

[ 3060.264768] Unable to handle kernel paging request at virtual address 006b6b6b40c1335c  
[ 3060.265079] Mem abort info:                                                
[ 3060.265173]   ESR = 0x96000004  
[ 3060.265285]   EC = 0x25: DABT (current EL), IL = 32 bits                                                                                              
[ 3060.265498]   SET = 0, FnV = 0
[ 3060.265640]   EA = 0, S1PTW = 0  
[ 3060.265756] Data abort info:  
[ 3060.265861]   ISV = 0, ISS = 0x00000004  
[ 3060.265960]   CM = 0, WnR = 0  
[ 3060.266033] [006b6b6b40c1335c] address between user and kernel address ranges  
[ 3060.266195] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP  
[ 3060.266329] Modules linked in: overlay ramoops reed_solomon realtek bnep snd_soc_tegra210_iqc snd_soc_tegra186_asrc snd_soc_tegra210_ope snd_soc_te  
gra186_dspk loop snd_soc_tegra186_arad snd_soc_tegra210_mvc snd_soc_tegra210_adx snd_soc_tegra210_dmic snd_soc_tegra210_afc snd_soc_tegra210_amx snd_s  
oc_tegra210_i2s snd_soc_tegra210_mixer snd_soc_tegra210_admaif snd_soc_tegra210_sfc snd_soc_tegra_pcm aes_ce_blk crypto_simd cryptd aes_ce_cipher ghas  
h_ce binfmt_misc sha2_ce sha256_arm64 sha1_ce snd_soc_tegra_machine_driver leds_gpio snd_soc_spdif_tx max77620_thermal tegra_bpmp_thermal input_leds r  
tk_btusb btusb btrtl btbcm btintel snd_soc_tegra210_adsp snd_soc_tegra_utils rtl8822ce snd_soc_simple_card_utils nvadsp tegra210_adma snd_soc_tegra210  
_ahub userspace_alert snd_hda_codec_hdmi cfg80211 snd_hda_tegra snd_hda_codec snd_hda_core nv_imx219 spi_tegra114 ina3221 pwm_fan nvgpu(E) nvmap ip_ta  
bles x_tables [last unloaded: mtd]  
[ 3060.303308] CPU: 4 PID: 8440 Comm: kworker/u12:2 Tainted: G    B       E     5.10.216-tegra #1  
[ 3060.311700] Hardware name: NVIDIA NVIDIA Jetson Xavier NX Developer Kit/Jetson, BIOS 6.0-37391689 08/28/2024  
[ 3060.321955] Workqueue: writeback wb_workfn (flush-259:0)  
[ 3060.327192] pstate: 40c00009 (nZcv daif +PAN +UAO -TCO BTYPE=--)  
[ 3060.333489] pc : __read_extent_tree_block+0x50/0x2b0  
[ 3060.338475] lr : __read_extent_tree_block+0x44/0x2b0  
[ 3060.343198] sp : ffff8000222836a0  
[ 3060.346894] x29: ffff8000222836a0 x28: 0000000002b4b5a2    
[ 3060.352126] x27: ffffa3679af74df4 x26: 0000000000000649    
[ 3060.357893] x25: ffffa3679be247b8 x24: 0000000000000000    
[ 3060.363153] x23: ffff7c0e8c325900 x22: ffff7c0f40c13480    
[ 3060.368919] x21: 6b6b6b6b40c13358 x20: 0000000000000000    
[ 3060.374174] x19: ffff800022283834 x18: 0000000000001000    
[ 3060.379944] x17: 0000000000000000 x16: 0000000000000068    
[ 3060.385457] x15: 00000000000001d4 x14: 000000000000028e    
[ 3060.390624] x13: ffff7c0e8cbed7a0 x12: 0000000000000015    
[ 3060.396219] x11: fffffff03a161b08 x10: ffff7c0e8c325900    
[ 3060.401476] x9 : 0000000000000030 x8 : ffff7c0e8c325930    
[ 3060.406987] x7 : 000000000001f5a2 x6 : 000000000001f5a5    
[ 3060.412501] x5 : 0000000000000000 x4 : 0000000000000000    
[ 3060.418184] x3 : 6b6b6b6b40c13358 x2 : 0000000000008c48    
[ 3060.423522] x1 : 0000000000000649 x0 : ffff7c0e97dd6800    
[ 3060.428856] Call trace:  
[ 3060.431343]  __read_extent_tree_block+0x50/0x2b0  
[ 3060.435606]  ext4_ext_search_right+0x364/0x410  
[ 3060.440176]  ext4_ext_map_blocks+0x1b0/0xef0  
[ 3060.444355]  ext4_map_blocks+0x18c/0x5b0  
[ 3060.448397]  ext4_writepages+0x814/0xdd0  
[ 3060.451879]  do_writepages+0x5c/0x110  
[ 3060.455811]  __writeback_single_inode+0x4c/0x510  
[ 3060.460387]  writeback_sb_inodes+0x20c/0x4b0  
[ 3060.464390]  wb_writeback+0xf8/0x410  
[ 3060.468146]  wb_workfn+0x104/0x630  
[ 3060.471593]  process_one_work+0x1c4/0x4c0  
[ 3060.475586]  worker_thread+0x54/0x450  
[ 3060.479086]  kthread+0x148/0x170  
[ 3060.482260]  ret_from_fork+0x10/0x20  
[ 3060.486002] Code: 97f139f3 f94016c0 f264029f 52918902 (b94006a1)    
[ 3060.491872] ---[ end trace 344ffafb1ddb57d6 ]---  
[ 3060.513101] Kernel panic - not syncing: Oops: Fatal exception  
[ 3060.513312] SMP: stopping secondary CPUs  
[ 3060.513466] Kernel Offset: 0x23678ab90000 from 0xffff800010000000  
[ 3060.513643] PHYS_OFFSET: 0xffff83f280000000  
[ 3060.516456] CPU features: 0x48240002,03802a30  
[ 3060.520651] Memory Limit: none  
[ 3060.539175] ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---
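
The faulting address and registers x21 and x3 in this oops contain runs of 0x6b, the same byte slub_debug uses to poison freed objects. That would be consistent with the ext4 code loading a pointer from corrupted or freed memory, although this is an interpretation rather than something the log proves. A small sketch for spotting such registers in a dump follows; the function name and the threshold of four consecutive poison bytes are assumptions made for illustration.

```python
import re

# Matches "x21: 6b6b6b6b40c13358" and "x9 : 0000000000000030" style entries.
REG_RE = re.compile(r"\b(x\d{1,2}|sp)\s*:\s*([0-9a-f]{16})\b")


def poisoned_registers(log_lines, poison=0x6B, min_run=4):
    """Return {register: value} where value holds >= min_run poison bytes in a row."""
    needle = bytes([poison]) * min_run
    found = {}
    for line in log_lines:
        for name, value in REG_RE.findall(line):
            if needle in bytes.fromhex(value):
                found[name] = value
    return found


dump = [
    "[ 3060.368919] x21: 6b6b6b6b40c13358 x20: 0000000000000000    ",
    "[ 3060.418184] x3 : 6b6b6b6b40c13358 x2 : 0000000000008c48    ",
    "[ 3060.423522] x1 : 0000000000000649 x0 : ffff7c0e97dd6800    ",
]
print(poisoned_registers(dump))
```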

In addition to this fault, there may be a second fault in the driver. This one is much easier to trigger but may be unrelated to the fault described above. To reproduce it, switch to a root shell, redirect random data to /dev/ttyTHS1, kill the process with Ctrl-C, and then start it again:

sudo su
cat /dev/urandom > /dev/ttyTHS1
Ctrl-C
cat /dev/urandom > /dev/ttyTHS1

This will result in a number of warning messages being printed to the kernel console:

[   50.582961] tegra-gpcdma 2600000.dma: DMA pause timed out
[   50.583238] tegra-gpcdma 2600000.dma: DMA pause timed out
[   52.081031] tegra-gpcdma 2600000.dma: slave id already in use
[   52.081213] serial-tegra 3110000.serial: Not able to get desc for Tx
[   52.081377] tegra-gpcdma 2600000.dma: slave id already in use
[   52.081521] serial-tegra 3110000.serial: Not able to get desc for Tx
[   52.081679] tegra-gpcdma 2600000.dma: slave id already in use
[   52.081816] serial-tegra 3110000.serial: Not able to get desc for Tx
[   52.081972] tegra-gpcdma 2600000.dma: slave id already in use
[   52.082147] serial-tegra 3110000.serial: Not able to get desc for Tx
[   52.082285] tegra-gpcdma 2600000.dma: slave id already in use
[   52.082402] serial-tegra 3110000.serial: Not able to get desc for Tx

At this point the /dev/ttyTHS1 serial port becomes inoperable. Interestingly, /dev/ttyTHS0 is not affected by this fault.
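
To compare how often these warnings appear between runs, the kernel log can be tallied per device. The following is only a minimal Python sketch and not part of the original test setup: the function name `tally_uart_dma_warnings` is hypothetical, and the message strings are copied from the log excerpt above.

```python
import re
from collections import Counter

# Message fragments copied from the kernel log excerpt above.
WARNINGS = (
    "DMA pause timed out",
    "slave id already in use",
    "Not able to get desc for Tx",
)

# Matches e.g. "[   50.582961] tegra-gpcdma 2600000.dma: DMA pause timed out"
LINE = re.compile(r"^\[\s*\d+\.\d+\]\s+(\S+)\s+(\S+):\s+(.*?)\s*$")


def tally_uart_dma_warnings(log_text):
    """Count the known warning messages, keyed by (device, message)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE.match(line)
        if match and match.group(3) in WARNINGS:
            counts[(match.group(2), match.group(3))] += 1
    return counts
```

Feeding it the captured output of `dmesg` after each Ctrl-C/restart cycle would show whether the count grows only for `3110000.serial` or for other UARTs as well.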

This is a critical fault in the NVIDIA software stack for us and is currently preventing us from releasing our software to customers. Our goal with this release was to move on from R35.3.1, where we encountered a kernel memory leak caused by the serial port drivers, which required our customers to reboot their machines periodically to prevent the system from becoming unstable.

Obviously, neither R35.3.1 nor R35.6.0 is really suitable for our product if the serial ports do not work reliably.

For our application, we do not need full VT100 emulation, flow control, or anything other than “bytes-in-to-bytes-out” functionality.

To that end, do you have any serial port drivers that are simpler and more reliable that we can use in the kernel instead of the complex serial-tegra drivers?

One avenue we will be exploring will be to try the DMA-less serial port drivers. However, I am concerned that the amount of traffic on the serial port will cause an overwhelming amount of interrupt load on the system, where we are also trying to do realtime control.
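
As a rough way to size that interrupt load, the worst-case receive interrupt rate can be estimated from the baud rate. This is a back-of-the-envelope sketch under stated assumptions: 8N1 framing (10 bits on the wire per byte), a fully saturated line, and one interrupt per FIFO trigger. The trigger levels below are hypothetical examples rather than values read from the Tegra driver, and receive-timeout and transmit interrupts are ignored.

```python
def uart_irq_rate(baud, fifo_trigger_bytes, bits_per_frame=10):
    """Approximate RX interrupts per second on a saturated line."""
    bytes_per_second = baud / bits_per_frame
    return bytes_per_second / fifo_trigger_bytes


# Hypothetical baud rates and FIFO trigger levels, for illustration only.
for baud in (115200, 921600):
    for trigger in (1, 8):
        rate = uart_irq_rate(baud, trigger)
        print(f"{baud} baud, trigger every {trigger} byte(s): {rate:.0f} IRQ/s")
```

Comparing the result against the realtime control loop's period gives a first indication of whether the DMA-less driver is viable before measuring on hardware.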

Is this issue reproducible with sdkmanager image?

I am just curious why those configuration changes are needed. Can it be reproduced with just stress and the Python script?

I use the official NVIDIA downloadable rootfs/kernel packages because I do not have a computer that sdkmanager works on.

I assume that the downloads of these files from the NVIDIA website match the files that sdkmanager uses to build an image for flashing?

Which configuration are you referring to?

One suggestion as a test case: The UARTs are enabled in the device tree. The “compatible” line lists any drivers in a comma-delimited list. I would suggest testing the UART with the following combinations:

  • Both drivers available, only use the Tegra High Speed (/dev/ttyTHS#) interface.
  • Both drivers available, only use the compatibility interface (/dev/ttyS#).
  • Remove the compatibility driver from the device tree, and run only the THS driver.
  • Remove the THS driver from the device tree, and run only the compatibility driver.

Reboot between tests, but since it is basically the one with the Ctrl-c, it might not take long. You can duplicate your boot entries in extlinux.conf and simply add copies which name a different device tree.

The purpose is not to work around this, but to narrow down on what can or cannot trigger this. I do suspect that not using DMA would be a high IRQ load, but it wouldn’t hurt to test if this is an interaction rather than a specific driver. I am also wondering if using loopback mode on the serial UART might help since I don’t know if you have anything consuming the data during your test (it could be a send issue, or it could be a receive issue, or it could be neither; you would want to see if consuming data changes results rather than just random send; or vice-versa, stop consuming data and see what happens with send if no flow control is used).

Okay, I’ve been digging into the difference between DMA-controlled and compatibility drivers today.

Vanilla OS
First, I flashed a fresh OS, according to the instructions in my post above.

This results in the following serial ports being presented by the kernel. (Just including here for reference):

crw-rw---- 1 root dialout   4, 64 Nov 21  2023 /dev/ttyS0  
crw-rw---- 1 root dialout   4, 65 Nov 21  2023 /dev/ttyS1  
crw-rw---- 1 root dialout   4, 66 Nov 21  2023 /dev/ttyS2  
crw-rw---- 1 root dialout   4, 67 Nov 21  2023 /dev/ttyS3  
crw--w---- 1 root tty     236,  0 Oct 23 09:37 /dev/ttyTCU0  
crw--w---- 1 root tty     237,  0 Oct 23 09:37 /dev/ttyTHS0  
crw-rw---- 1 root dialout 237,  1 Nov 21  2023 /dev/ttyTHS1  
crw-rw---- 1 root dialout 237,  4 Nov 21  2023 /dev/ttyTHS4

and the following is printed in the kernel log during boot:

[    5.197993] serial-tegra 3100000.serial: Adding to iommu group 2
[    5.204155] 3100000.serial: ttyTHS0 at MMIO 0x3100000 (irq = 31, base_baud = 0) is a TEGRA_UART
[    5.213508] serial-tegra 3110000.serial: Adding to iommu group 2
[    5.219178] 3110000.serial: ttyTHS1 at MMIO 0x3110000 (irq = 32, base_baud = 0) is a TEGRA_UART
[    5.228115] serial-tegra 3140000.serial: Adding to iommu group 2
[    5.233666] 3140000.serial: ttyTHS4 at MMIO 0x3140000 (irq = 33, base_baud = 0) is a TEGRA_UART

I believe my previous tests cover the “Both drivers available, only use the Tegra High Speed (/dev/ttyTHS#) interface.” test scenario.

At this point I tried to open the /dev/ttyS* devices with the serial port stressor script. This would cover your suggestion “Both drivers available, only use the compatibility interface (/dev/ttyS#).”. However, these ports always report “Input/output error” when they are opened. Thus, I don’t think it is possible to run this test as described. Perhaps the compatibility interface cannot be accessed while the Tegra High Speed driver has control over the hardware?

With this vanilla OS, with no DTB changes, the results of trying to open and test the various ports are summarised below:

Port            Status
/dev/ttyS0      Input/Output Error
/dev/ttyS1      Input/Output Error
/dev/ttyS2      Input/Output Error
/dev/ttyS3      Input/Output Error
/dev/ttyS4      No such file or directory
/dev/ttyTHS0    OK
/dev/ttyTHS1    OK
/dev/ttyTHS2    No such file or directory
/dev/ttyTHS3    No such file or directory
/dev/ttyTHS4    OK
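
A table like this can be regenerated after each device tree change with a small probe. The sketch below is an assumption about how such a check could be done, not the stressor script itself: the name `probe_port` is hypothetical, and it treats a port as “OK” only if it opens and answers a `tcgetattr` call. The error text comes from the operating system, so capitalisation may differ slightly from the table.

```python
import os
import termios


def probe_port(path):
    """Return "OK" if the device opens and answers tcgetattr, else the error text."""
    try:
        fd = os.open(path, os.O_RDWR | os.O_NOCTTY | os.O_NONBLOCK)
    except OSError as exc:
        return os.strerror(exc.errno)
    try:
        termios.tcgetattr(fd)
    except termios.error as exc:
        # termios.error carries (errno, message) in its args.
        return os.strerror(exc.args[0])
    finally:
        os.close(fd)
    return "OK"


if __name__ == "__main__":
    ports = [f"/dev/ttyS{n}" for n in range(5)]
    ports += [f"/dev/ttyTHS{n}" for n in range(5)]
    for port in ports:
        print(f"{port:<16}{probe_port(port)}")
```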

Modified DTB to Remove High Speed Driver
Next I decompiled the existing DTB file that is pointed to by extlinux.conf into DTS:

cat /boot/extlinux/extlinux.conf
...
        FDT /boot/dtb/kernel_tegra194-p3668-0000-p3509-0000.dtb
...
dtc -I dtb -O dts -o extracted.dts /boot/dtb/kernel_tegra194-p3668-0000-p3509-0000.dtb

Then I modified the extracted device tree to replace the tegra186-hsuart driver with tegra20-uart and remove the DMA parameters:

--- extracted.dts	2024-10-23 10:03:40.185551581 +0100
+++ modified.dts	2024-10-23 10:30:05.940228747 +0100
@@ -1034,7 +1034,7 @@
 	};
 
 	serial@3100000 {
-		compatible = "nvidia,tegra186-hsuart";
+		compatible = "nvidia,tegra20-uart";
 		iommus = <0x02 0x20>;
 		interconnects = <0x03 0x16>;
 		interconnect-names = "dma-mem";
@@ -1043,8 +1043,6 @@
 		reg-shift = <0x02>;
 		interrupts = <0x00 0x70 0x04>;
 		nvidia,memory-clients = <0x0e>;
-		dmas = <0x1b 0x08 0x1b 0x08>;
-		dma-names = "rx\0tx";
 		clocks = <0x04 0x9b 0x04 0x66>;
 		clock-names = "serial\0parent";
 		assigned-clocks = <0x04 0x9b>;
@@ -1057,15 +1055,13 @@
 	};
 
 	serial@3110000 {
-		compatible = "nvidia,tegra186-hsuart";
+		compatible = "nvidia,tegra20-uart";
 		iommus = <0x02 0x20>;
 		dma-coherent;
 		reg = <0x00 0x3110000 0x00 0x10000>;
 		reg-shift = <0x02>;
 		interrupts = <0x00 0x71 0x04>;
 		nvidia,memory-clients = <0x0e>;
-		dmas = <0x1b 0x09 0x1b 0x09>;
-		dma-names = "rx\0tx";
 		clocks = <0x04 0x9c 0x04 0x66>;
 		clock-names = "serial\0parent";
 		assigned-clocks = <0x04 0x9c>;
@@ -1078,15 +1074,13 @@
 	};
 
 	serial@c280000 {
-		compatible = "nvidia,tegra186-hsuart";
+		compatible = "nvidia,tegra20-uart";
 		iommus = <0x02 0x20>;
 		dma-coherent;
 		reg = <0x00 0xc280000 0x00 0x10000>;
 		reg-shift = <0x02>;
 		interrupts = <0x00 0x72 0x04>;
 		nvidia,memory-clients = <0x0e>;
-		dmas = <0x1b 0x03 0x1b 0x03>;
-		dma-names = "rx\0tx";
 		clocks = <0x04 0x9d 0x04 0x66>;
 		clock-names = "serial\0parent";
 		assigned-clocks = <0x04 0x9d>;
@@ -1099,15 +1093,13 @@
 	};
 
 	serial@3130000 {
-		compatible = "nvidia,tegra186-hsuart";
+		compatible = "nvidia,tegra20-uart";
 		iommus = <0x02 0x20>;
 		dma-coherent;
 		reg = <0x00 0x3130000 0x00 0x10000>;
 		reg-shift = <0x02>;
 		interrupts = <0x00 0x73 0x04>;
 		nvidia,memory-clients = <0x0e>;
-		dmas = <0x1b 0x13 0x1b 0x13>;
-		dma-names = "rx\0tx";
 		clocks = <0x04 0x9e 0x04 0x66>;
 		clock-names = "serial\0parent";
 		assigned-clocks = <0x04 0x9e>;
@@ -1120,15 +1112,13 @@
 	};
 
 	serial@3140000 {
-		compatible = "nvidia,tegra186-hsuart";
+		compatible = "nvidia,tegra20-uart";
 		iommus = <0x02 0x20>;
 		dma-coherent;
 		reg = <0x00 0x3140000 0x00 0x10000>;
 		reg-shift = <0x02>;
 		interrupts = <0x00 0x74 0x04>;
 		nvidia,memory-clients = <0x0e>;
-		dmas = <0x1b 0x14 0x1b 0x14>;
-		dma-names = "rx\0tx";
 		clocks = <0x04 0x9f 0x04 0x66>;
 		clock-names = "serial\0parent";
 		assigned-clocks = <0x04 0x9f>;
@@ -1141,15 +1131,13 @@
 	};
 
 	serial@3150000 {
-		compatible = "nvidia,tegra186-hsuart";
+		compatible = "nvidia,tegra20-uart";
 		iommus = <0x02 0x20>;
 		dma-coherent;
 		reg = <0x00 0x3150000 0x00 0x10000>;
 		reg-shift = <0x02>;
 		interrupts = <0x00 0x75 0x04>;
 		nvidia,memory-clients = <0x0e>;
-		dmas = <0x1b 0x0c 0x1b 0x0c>;
-		dma-names = "rx\0tx";
 		clocks = <0x04 0xa0 0x04 0x66>;
 		clock-names = "serial\0parent";
 		assigned-clocks = <0x04 0xa0>;
@@ -1162,15 +1150,13 @@
 	};
 
 	serial@c290000 {
-		compatible = "nvidia,tegra186-hsuart";
+		compatible = "nvidia,tegra20-uart";
 		iommus = <0x02 0x20>;
 		dma-coherent;
 		reg = <0x00 0xc290000 0x00 0x10000>;
 		reg-shift = <0x02>;
 		interrupts = <0x00 0x76 0x04>;
 		nvidia,memory-clients = <0x0e>;
-		dmas = <0x1b 0x02 0x1b 0x02>;
-		dma-names = "rx\0tx";
 		clocks = <0x04 0xa1 0x04 0x66>;
 		clock-names = "serial\0parent";
 		assigned-clocks = <0x04 0xa1>;
@@ -1183,15 +1169,13 @@
 	};
 
 	serial@3170000 {
-		compatible = "nvidia,tegra186-hsuart";
+		compatible = "nvidia,tegra20-uart";
 		iommus = <0x02 0x20>;
 		dma-coherent;
 		reg = <0x00 0x3170000 0x00 0x10000>;
 		reg-shift = <0x02>;
 		interrupts = <0x00 0xcf 0x04>;
 		nvidia,memory-clients = <0x0e>;
-		dmas = <0x1b 0x0d 0x1b 0x0d>;
-		dma-names = "rx\0tx";
 		clocks = <0x04 0xbe 0x04 0x66>;
 		clock-names = "serial\0parent";
 		assigned-clocks = <0x04 0xbe>;
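
For repeatability, the manual edit shown in the diff above could be scripted. This is a minimal sketch that assumes the decompiled DTS has the same shape as in the diff; the function names are hypothetical, and the script was not used to produce the diff. It only touches `serial@...` nodes whose `compatible` is `nvidia,tegra186-hsuart`, switching them to `nvidia,tegra20-uart` and dropping their `dmas` and `dma-names` properties.

```python
import re

OLD = 'compatible = "nvidia,tegra186-hsuart";'
NEW = 'compatible = "nvidia,tegra20-uart";'
NODE_START = re.compile(r"^\s*serial@[0-9a-f]+\s*\{")
DMA_PROP = re.compile(r"^\s*(dmas|dma-names)\s*=")


def _rewrite_node(lines):
    """Switch one serial node to the legacy driver if it uses the hsuart one."""
    if not any(OLD in line for line in lines):
        return lines  # some other UART type: leave the node untouched
    kept = [line for line in lines if not DMA_PROP.match(line)]
    return [line.replace(OLD, NEW) for line in kept]


def switch_to_legacy_uart(dts_text):
    """Return dts_text with every hsuart serial node rewritten as in the diff."""
    out, node, depth = [], [], 0
    for line in dts_text.splitlines(keepends=True):
        if depth == 0 and not NODE_START.match(line):
            out.append(line)
            continue
        node.append(line)
        # Assumption: braces never appear inside property strings.
        depth += line.count("{") - line.count("}")
        if depth == 0:
            out.extend(_rewrite_node(node))
            node = []
    return "".join(out + node)  # an unterminated node is passed through unchanged
```

Reading `extracted.dts`, passing its contents through `switch_to_legacy_uart`, and writing the result to `modified.dts` should give a file that can be checked against the diff with `diff -u`.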

I then recompiled this modified DTS:

sudo dtc -I dts -O dtb -o /boot/dtb/modified.dtb modified.dts

And pointed extlinux.conf to this new DTB:

cat /boot/extlinux/extlinux.conf
...
        FDT /boot/dtb/modified.dtb
...

The system was then rebooted.

This results in just the compatibility interfaces being available:

crw-rw---- 1 root dialout   4, 64 Oct 23 10:35 /dev/ttyS0  
crw-rw---- 1 root dialout   4, 65 Oct 23 10:35 /dev/ttyS1  
crw-rw---- 1 root dialout   4, 66 Oct 23 10:35 /dev/ttyS2  
crw-rw---- 1 root dialout   4, 67 Oct 23 10:35 /dev/ttyS3  
crw--w---- 1 root tty     236,  0 Oct 23 10:36 /dev/ttyTCU0

and the following kernel log:

[    5.120071] tegra-uart 3100000.serial: Adding to iommu group 2
[    5.126026] tegradc 15210000.display: parse_dp_settings: No dp-lt-settings node
[    5.133264] 3100000.serial: ttyS0 at MMIO 0x3100000 (irq = 31, base_baud = 115124) is a Tegra
[    5.137432] tegradc 15210000.display: DT parsed successfully
[    5.144744] tegra-uart 3110000.serial: Adding to iommu group 2
[    5.150697] tegradc 15210000.display: dc.1 probe not in device tree order, deferring
[    5.205231] 3110000.serial: ttyS1 at MMIO 0x3110000 (irq = 32, base_baud = 115124) is a Tegra
[    5.213823] tegra-uart 3140000.serial: Adding to iommu group 2
[    5.219778] 3140000.serial: ttyS2 at MMIO 0x3140000 (irq = 33, base_baud = 115124) is a Tegra

Again, checking which ports can be opened showed that the /dev/ttyS* devices are now working:

Port            Status
/dev/ttyS0      OK
/dev/ttyS1      OK
/dev/ttyS2      OK
/dev/ttyS3      Input/Output Error
/dev/ttyS4      No such file or directory
/dev/ttyTHS0    No such file or directory
/dev/ttyTHS1    No such file or directory
/dev/ttyTHS2    No such file or directory
/dev/ttyTHS3    No such file or directory
/dev/ttyTHS4    No such file or directory

This shows the DTB applied correctly and we are getting the compatibility driver controlling all of the serial ports.

I then ran the serial port stressor on /dev/ttyS0 and /dev/ttyS1, while also running the disk stress tool. This replicates the test where I saw the kernel oops and slub_debug faults. This I think exercises “Remove the THS driver from the device tree, and run only the compatibility driver.”.

These stressors ran for 2.5 hours without fault. In addition, the test where I cat /dev/urandom to the ports also ran repeatedly without faults.

The lack of faults with the compatibility driver adds more evidence that there is a fault in the high speed, DMA driver.

Regarding your test scenario, “Remove the compatibility driver from the device tree, and run only the THS driver.”, I don’t think the original DTB has both drivers enabled at the same time in the device tree, so I think this scenario is the same as the vanilla shipped DTB.

In terms of hardware setup for these tests–this is on the 8GiB Jetson Developer Kit, with an SD card. A USB-serial converter is attached to the UART TXD and UART RXD pins of J14 to allow me to capture the bootloader/kernel logs. There are no other electrical connections to the other serial ports.

Just making notes as I read that, please forgive the randomness…


Can you verify again which pins and/or connector you are using for your serial port, even if it is on a custom carrier board? If it is on the 40-pin header, then even custom carrier boards tend to use the same header, although options might change.


Much of what follows in several paragraphs is somewhat repetitive. The reason for that is the issue you mentioned where the ttyS# ports had an error message instead of functioning.

The label of “ttyS#” versus “ttyTHS#” does not necessarily use the same “#” for one port. When these ports are enumerated the numbering is in the order that the drivers bound to that UART. If the THS driver binds UARTs in the same order that the legacy driver does, then and only then would all numbers match. It is possible that a number such as ttyS1 (a contrived example, not a specific case) is the same as ttyTHS2. I just want to make sure that numbering is not an indication of the port.

Trivia: You might find clues to the specific physical device somewhere under “/sys”; example:

find /sys -name 'ttyTHS*'
find /sys -name 'ttyS*'
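
As an alternative to reading the `find` output by eye, the hardware behind a tty name can be resolved programmatically. This is a sketch under an assumption that has not been verified on this L4T release: that sysfs exposes the usual `/sys/class/tty/<name>/device` symlink pointing at the parent platform device. The function name `tty_hardware_name` is hypothetical.

```python
import os


def tty_hardware_name(tty, sysfs_root="/sys/class/tty"):
    """Return the device behind a tty, e.g. "3110000.serial", or None."""
    link = os.path.join(sysfs_root, tty, "device")
    if not os.path.exists(link):
        return None  # no such tty, or a tty without a backing device
    return os.path.basename(os.path.realpath(link))


if __name__ == "__main__":
    for name in ("ttyTHS0", "ttyTHS1", "ttyTHS4", "ttyS0", "ttyS1"):
        print(name, "->", tty_hardware_name(name))
```

If two boots with different device trees report the same hardware name for the port under test, the tests are addressing the same UART even if the tty name changed.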

The same hardware can be matched between one ttyS# and one ttyTHS# (a specialized UART might have neither a ttyS nor a ttyTHS name and might have only some other unexpected tty name; as mentioned, though, only one will be found if the “compatible” of the device tree does not provide two UART drivers). The tty name cannot be trusted to identify the hardware UART port. Every time you boot with a changed device tree you might try the loopback test method with something like gtkterm (“sudo apt-get install gtkterm”; or your favorite serial console program) just to verify the serial port numbering did not change from the device tree change. So after wiring TX to RX (and optionally CTS to RTS when visible and when flow control is enabled), one could do something like this and watch for echo (without the program itself doing local echo):

# Does it echo with both of these?
gtkterm -b 8 -t 1 -s 115200 -p /dev/ttyTHS2
gtkterm -b 8 -t 1 -s 115200 -p /dev/ttyS2

It is hard to overemphasize how important a loopback test success is prior to running any of your other tests. If basic function or naming is invalid, then all following results are “contaminated” by this failure. A successful loopback echo guarantees your next steps are valid tests.

I’m just adding some typical device names in the above example, but the idea is to loopback confirm echo on both a ttyS and a ttyTHS prior to using them in any other test. Without this you don’t know if the correct device is used since changing the tree can change the device number. If all of the ttyS fail despite naming both a ttyTHS and legacy driver in the device tree, then we have another problem which probably needs to be solved before continuing…we’d need to know why one of the drivers fails (it is ok to reboot between ttyS and ttyTHS tests if all of one fails, but reboot should not be needed; that’s a quirk we’d need to know about).
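
The same loopback check can be done without a GUI terminal. The following is a minimal sketch using only the Python standard library; `open_raw` and `loopback_ok` are hypothetical names, the payload and timeout are arbitrary, and it assumes TX is wired to RX on the port under test.

```python
import os
import select
import termios
import time
import tty


def open_raw(path, speed=termios.B115200):
    """Open a serial device in raw mode at the given termios speed constant."""
    # O_NONBLOCK avoids blocking in open() while waiting for carrier detect.
    fd = os.open(path, os.O_RDWR | os.O_NOCTTY | os.O_NONBLOCK)
    tty.setraw(fd)  # no echo, no line editing, no character translation
    attrs = termios.tcgetattr(fd)
    attrs[4] = attrs[5] = speed  # input speed and output speed
    termios.tcsetattr(fd, termios.TCSANOW, attrs)
    return fd


def loopback_ok(write_fd, read_fd, payload=b"loopback-test\n", timeout=1.0):
    """Write payload and report whether the same bytes come back in time."""
    os.write(write_fd, payload)
    received = b""
    deadline = time.monotonic() + timeout
    while len(received) < len(payload):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        ready, _, _ = select.select([read_fd], [], [], remaining)
        if ready:
            chunk = os.read(read_fd, len(payload) - len(received))
            if not chunk:
                break  # end of file: nothing more will arrive
            received += chunk
    return received == payload
```

For a real port the same descriptor is used for both directions, for example `fd = open_raw("/dev/ttyTHS1")` followed by `loopback_ok(fd, fd)`; a `False` result before any stress test means the later results for that port should not be trusted.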

Also, we don’t want to remove the THS or legacy driver from all UARTs, we only want to test on the one port. I don’t know the consequences of changing this for other ports (maybe none, but this for example could break some logging when transitioning from boot phase to the Linux kernel taking over).

We can ignore all ports which are in “ls -l /dev/ttyTHS* /dev/ttyS*” group “tty”. The only ports which can possibly matter to us are the “dialout” group.

Note: In the log output you will see a pairing of the serial port hardware, e.g., the “3100000.serial”, and the device (the next line) “3100000.serial: ttyTHS0”. If you find out just once which address is attached to the ttyTHS UART of interest, then you can write it down and use it from then on. In any case where you can get a port to work as a ttyTHS port (especially in loopback mode), can you check the physical address (the legacy driver probably won’t give a hexadecimal hardware address)? Then perform this regular expression search of dmesg (each time you boot with a change to the device tree) to verify the THS physical address:
dmesg | egrep -i '[0-9]+[.]serial'
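
The pairing can also be extracted into a table so two boots can be compared directly. This is a sketch that assumes the “at MMIO” log line format shown in the boot logs earlier in this thread; the name `uart_bindings` is hypothetical.

```python
import re

# Matches e.g. "3110000.serial: ttyTHS1 at MMIO 0x3110000 (irq = 32, ..."
BINDING = re.compile(r"([0-9a-f]+\.serial): (tty\w+) at MMIO (0x[0-9a-f]+)")


def uart_bindings(dmesg_text):
    """Map each hardware name (e.g. "3110000.serial") to its tty name."""
    return {m.group(1): m.group(2) for m in BINDING.finditer(dmesg_text)}
```

Running it on the `dmesg` output of each boot and comparing the two dictionaries shows whether a given hardware address kept the same port number after a device tree change.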


I think your device tree edits are correct, but someone might want to verify this. A second way to verify this is if removing the hsuart results in fewer “/dev/ttyTHS#” files. If the driver loads, it will create a “/dev” file; if not, then the “/dev” file won’t exist.

Incidentally, if you wish to verify the device tree, then the “/proc/device-tree/” is a reflection of the running system’s device tree. For example, you might find something like this (but this is from a different Jetson model):

cd /proc/device-tree
egrep -H -a '*' `find . -name compatible | egrep serial`
./serial@3130000/compatible:nvidia,tegra194-hsuart
./serial@3170000/compatible:nvidia,tegra194-hsuart
./serial@c270000/compatible:arm,sbsa-uart
./serial@3100000/compatible:nvidia,tegra194-hsuart
./serial@3140000/compatible:nvidia,tegra194-hsuart
./serial@c280000/compatible:nvidia,tegra194-hsuart
./serial@3110000/compatible:nvidia,tegra194-hsuart
./serial@3150000/compatible:nvidia,tegra194-hsuart
./serial@31d0000/compatible:arm,sbsa-uart

(note: Using “find” from that location can give different results from using find somewhere else and naming “/proc/device-tree”; you are crossing filesystem types when doing this, and find can exclude some results when crossing filesystems)
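
The same check can be scripted so it can be repeated after every device tree change. A minimal sketch with a hypothetical function name; it relies on the `compatible` property being stored as a list of NUL-terminated strings, which is why the file is read as bytes and split on NUL.

```python
import os


def serial_compatibles(root="/proc/device-tree"):
    """Map each serial@... node under root to its list of compatible strings."""
    found = {}
    for dirpath, _dirs, files in os.walk(root):
        node = os.path.basename(dirpath)
        if node.startswith("serial@") and "compatible" in files:
            with open(os.path.join(dirpath, "compatible"), "rb") as handle:
                raw = handle.read()
            # The property is a list of NUL-terminated strings.
            found[node] = [s.decode() for s in raw.split(b"\0") if s]
    return found


if __name__ == "__main__":
    for node, drivers in sorted(serial_compatibles().items()):
        print(node, drivers)
```

A node that lists two entries would indicate that more than one driver can bind to that UART.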

Very Important: We do not want to run the test on any tty which does not have group “dialout”. We also want to run tests prior to the stress test using loopback. It sounds like you did this correctly, but can you verify loopback prior to any stress test in any given configuration? If so, then there does seem to be an interaction between the THS and legacy drivers. However, I also want to verify that if only the THS or if only the legacy driver is used during a given boot, does the error still occur?

The reason for the reboot test is that both drivers inherit the state of the UART hardware. We want to know if the failure is related to a UART state, in which case a fresh reboot won’t show the error; if this is not related to a UART state, then we know that it is a somewhat more direct driver interaction. This again would narrow where to look for the issue.

TIP: One reason for so much talk of loopback is that a UART will seldom disagree with its own settings (it can happen though). Your external UART is useful, but adds something to the test we can filter out with loopback.

Hi,

Yes, well aware that devices may come up in a different order. That is why I posted snippets of the dmesg log showing that they did not come up in a different order.

Regarding this sentence:

The same hardware can be matched between one ttyS# and one ttyTHS# (a specialized UART might have neither a ttyS nor a ttyTHS name and might have only some other unexpected tty name; as mentioned, though, only one will be found if the “compatible” of the device tree does not provide two UART drivers).

Sure, but I think I demonstrated with the dmesg logs that the serial ports had been switched from only working with the /dev/ttyTHS* device to the /dev/ttyS* device.

As mentioned in my second message, this test has been set up with the official NVIDIA Jetson Xavier NX 8 GiB dev kit. This only has the kernel console available on Pins 3/4 of J14, and UART1 available on pins 8/10 of J12.

WayneWWW specifically requested that a minimised test case be created on the Jetson Xavier NX development kit, and that is what was done.

The kernel console works fine. UART1 was verified with loopback and a logic analyser.

No, I definitely want to remove the DMA driver from all of the UART ports to successfully check the fault has been removed: if I left even one DMA driver active and then I saw a kernel fault, how would I know if it is because of the one active DMA driver or something else?

Even if I was not actively stressing the serial ports with a DMA driver, there is no confirmation yet that the out of bounds write was not caused by e.g. the probing code of the DMA driver. Hence, for an accurate test I want no DMA drivers active.

I’m not sure what you’re getting at with the permissions of the /dev/tty* devices? The serial port stressors I explained in my second post are run as root, so the permissions on those devices should not matter.

Yes – dmesg logs for both DMA and compatible driver show the mapping between device and hardware address. see my second message.

Yes – device tree is correctly loading the compatible driver, as confirmed by looking at the output of dmesg I listed earlier. The /proc/device-tree VFS has nodes that match my DTB, so that shows that the correct DTB is loaded. But I think that was obvious anyway because the compatibility serial port drivers were loaded.

During a single boot I only accessed /dev/ttyTHS* OR /dev/ttyS*, but never both.

Can I suggest that NVIDIA actually tries to duplicate this issue on hardware at your end? I get the feeling that this is dragging. There is an easily reproducible test case, on your hardware, with your software stack.

To me it is clear as day that the DMA drivers have a fault, and that once the system has been switched to using the compatibility drivers that fault disappears.

Hi,

Sorry to interrupt the discussion. The configuration I mentioned previously is the partition layout change. I assume this is not related to the issue if I want to reproduce it with the NV devkit, right?

Could you also share the hardware connection to reproduce this if I want to use your script to reproduce locally?

As mentioned in my second message, this test has been set up with the official NVIDIA Jetson Xavier NX 8 GiB dev kit. This only has the kernel console available on Pins 3/4 of J14, and UART1 available on pins 8/10 of J12.

This one?

So the current situation is if you don’t use hsuart but switch to nvidia,tegra20-uart, then this issue might not happen?

It is possible that the “/dev” has files there from “mknod”, in which case they’d appear to persist (which is kind of “very old school”; someone might do this if udev is not expected to have run in time).

Yes, you are correct. However, I think you would also want to test with the driver still running against other hardware. We’re testing all combinations, but a combination with “no driver at all” and “no driver on that UART” both seem to be useful. Mostly I’m worried about a driver having initialized something on a particular UART, or else a driver expecting something on a given UART. I had not taken into account the possibility of the driver being in the kernel, but not bound to that UART being an issue, but it certainly could be an issue.

If permissions show group “tty” instead of “dialout” it implies you have multiple programs accessing the same UART simultaneously. Most of what I suggested above is about multiple kernel space drivers making a UART available simultaneously, but there is also the user space part of this. What happens if both serial console and your testing must access the UART simultaneously? I’m just making sure that no port being tested is from group tty because that guarantees serial console is bound to that port (collision of data input/output rather than collision of drivers). There is a large incidence of UART questions and reports on the forums when trying to use the console UART for other purposes, so I guess you could say it is more or less a reflex to point out serial console issues.

I agree you are at the point where this can be tested. I don’t personally have the particular Xavier NX to test with, but everything else has been ruled out. If your install is 100% from the default flash, then this can be reproduced. Check your L4T release via “cat /etc/nv_tegra_release”, and if all software came from that release via here:
https://developer.nvidia.com/linux-tegra
…then I think @WayneWWW can have someone test this out. It is probably a pain to do so since so much is in this thread, but if you can summarize the exact details of what L4T release is used and the minimal method for triggering this, it could be taken to whoever debugs the drivers. The only thing I’ve done is to rule out more common causes. We know the test is for pins 8/10 of J12, and console is running on pins 3/4 of J14, so the issue has no relation to a console trying to run on your pins at the same time.

Hi both,

Re: WayneWWW

I doubt the partition layout should change the kernel behaviour, but I included it there so that the test can be reproduced exactly the same on your end.

The hardware connection that allowed me to create the minimised test case is:

  • Jetson Xavier NX Dev Kit 8 GiB with SD card inserted and NVMe drive mounted.
  • Micro USB cable connected to the Dev Kit USB port.
  • 12V power supply connected to the Dev Kit barrel jack.

Just to see the kernel console log, I also attached a serial ↔ USB adaptor on UART TX/UART RX pins of J14. The bug still shows up if I remove this, but it is useful to leave this connected in order to see any kernel oops / slub_debug messages.

There is nothing else connected to this dev kit.

So the current situation is if you don’t use hsuart but switch to nvidia,tegra20-uart, then this issue might not happen?

Correct – I confirmed that switching from the hsuart driver to the uart driver prevented the bug from occurring.

Re: linuxdev

I think you would also want to test with the driver still running against other hardware.

What other hardware are you referring to here? The bug has been demonstrated on the NVIDIA Jetson Xavier NX Dev Kit 8 GiB. That is the only hardware required to replicate this bug.

I had not taken into account the possibility of the driver being in the kernel, but not bound to that UART being an issue, but it certainly could be an issue.

I don’t believe that is happening here – my tests posted a couple of days ago ran the exact same rootfs / kernel, but with JUST the DTB bindings modified between the two test scenarios. If the UARTs are bound to the DMA driver the fault occurs, if the UARTs are bound to the compatibility driver the fault does not occur. So – the DMA driver is always built into the kernel, but only causes a fault when it is activated via the DTB binding.

If permissions show group “tty” instead of “dialout” it implies you have multiple programs accessing the same UART simultaneously.

I’m sorry, I’m a bit lost here. There isn’t a lot of other software running here accessing serial ports. I flashed a vanilla NVIDIA rootfs/kernel/DTB and only ran the commands that were shown in my second post. As far as I can tell, there should be nothing else accessing the serial ports with that configuration, unless the NVIDIA vanilla rootfs runs any additional programs without user interaction. I’m not sure how this relates to the permissions on the serial port devices though. Reference for tty and dialout groups can be found here: SystemGroups - Debian Wiki

All software is from here: Jetson Linux | NVIDIA Developer, except the Python serial port stressor script and partition layout XML file that I included in my second message.

if you can summarize the exact details of what L4T release is used and the minimal method for triggering this, it could be taken to whoever debugs the drivers.

I believe that is what I have covered in my second message. If there is any more information that is required, please detail what you need and I will provide it.

I’m just suggesting that these cases exist and could be tested:

  • Both legacy and THS drivers run on a specific port.
  • Both legacy and THS drivers run on some ports, but only THS on a specific port.
  • Both legacy and THS drivers run on some ports, but only legacy on a specific port.
  • Only legacy is loaded anywhere.
  • Only THS is loaded anywhere.

I’m differentiating between kernel drivers interacting with each other due to existing in the kernel, versus interacting due to trying to access the same specific UART. There are other UARTs beyond the one you are working with. It is probably easier to debug if it is known that both drivers can exist without the problem showing up, and thus the issue would be related to how both access a given UART, versus knowing the interaction does not require touching the same UART (which is a more general failure). This does seem to be the THS driver, but debugging is aided by knowing all circumstances.

Hi,

As the UART on J14 is the combined UART, which by default carries a lot of output from different processors, we don’t recommend using the UART on this header.

Is this issue possible to reproduce by using the 40 pin header?

WayneWWW

The combined UART output on J14 is what your rootfs comes configured with in your vanilla rootfs. That is why I left it in that state.

The serial stressors that trigger the fault point to the /dev/ttyTHS0 and /dev/ttyTHS1 serial ports (these are different to the combined UART).

Once you have replicated the fault on your end with my notes from my second message, perhaps you could try such modifications as disabling the combined UART and seeing if it still occurs?

Actually, the combined UART cannot be disabled. That is why I asked whether we can use another serial port to reproduce this.

Sorry in advance, I am not quite sure about all the discussion here. Which pins do you connect when you stress /dev/ttyTHS0 and ttyTHS1?

I do not connect anything to any of the UARTs that are targeted by the serial port stressor (/dev/ttyTHS0 and /dev/ttyTHS1).

The combined UART on J14, marked UART TX/UART RX on the NVIDIA Xavier NX dev kit, is connected to a serial to USB converter just so that I can see the kernel logs. The kernel fault reproduces fine if this is not electrically connected to anything, but it is useful to see the kernel oops / slub_debug messages that are printed when the fault occurs.

OK, I see. So there is no hardware connection for these interfaces. You just open them and keep running the stress test?

Yes that is correct.