Serial port failure: Not able to get desc for Rx

Hello,

I am having a fatal error issue with software accessing serial port devices attach to an AGX. I am currently using serial devices connected to an AGX Xavier on a ConnectTech Rogue carrier. Both of these serial devices appear to work correctly and I am able to stream data into different applications.

I have recently started experiencing these serial device software applications going into an error state due to no received data arriving. Both devices seem to go into an error state close in time. The applications will start up correctly, and have run for long periods with no issue, sometimes I never see an issue during the hours long tests I do. When they do have an issue they can go into the error state at any time; I have roughly correlated it to when the memory being used by the other processing I launch start taking a lot of the RAM. But I don’t have anything to say for sure, but it feels that way from watching things start up.
I have been able to observe in the kernel logs that right when the application software begins reporting data arrival errors, the kernel outputs a message about SLUB issues and an Rx desc cannot be provided. Once the application software gets into this state it can only recover by killing the application and restarting it.

Memory Usage: kernel

An additional item I observed is that when running either of the serial port applications the kmalloc-256 allocation just seems to grow unbounded. For example I measured a kmallow-256 allocation of around 120kbytes/s sustained over many hours. I only see that when running software that access the serial ports. This memory appears to leak and is not freed until a full AGX reboot occurs. As an example, running less than one hour

sudo slabtop -s a

Active / Total Objects (% used)    : 3073102 / 3112035 (98.7%)
 Active / Total Slabs (% used)      : 94706 / 94706 (100.0%)
 Active / Total Caches (% used)     : 94 / 130 (72.3%)
 Active / Total Size (% used)       : 764493.08K / 774870.92K (98.7%)
 Minimum / Average / Maximum Object : 0.01K / 0.25K / 8.83K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
2531488 2531488 100%    0.25K  79109       32    632872K kmalloc-256
 64512  64226  99%    0.01K    126      512       504K zs_handle
 61440  61440 100%    0.12K   1920       32      7680K kmalloc-128
 52696  52103  98%    0.57K   1882       28     30112K radix_tree_node
 46720  46690  99%    0.12K   1460       32      5840K kernfs_node_cache
 56575  41779  73%    0.05K    775       73      3100K lsm_inode_cache
 50967  39394  77%    0.19K   2427       21      9708K dentry
 33885  33174  97%    0.58K   1255       27     20080K inode_cache
 25630  25630 100%    0.18K   1165       22      4660K vm_area_struct
 18432  17478  94%    0.06K    288       64      1152K vmap_area
 16256  15869  97%    0.06K    254       64      1016K anon_vma_chain
 15470  14832  95%    0.05K    182       85       728K ftrace_event_field
 11220  11220 100%    0.04K    110      102       440K ext4_extent_status

Memory Usage: user space

All software applications, in addition to the serial port application, take up about 80% to 85% of the total AGX 32GB of RAM. Not ideal but still has some headroom.

free -h
              total        used        free      shared  buff/cache   available
Mem:           30Gi        23Gi       416Mi       3.0Mi       6.0Gi       5.9Gi
Swap:          15Gi       270Mi        14Gi

It feels like this issue occurs as a result of some memory not being allocated, but need some assistance on this.

I found another post here where the author experienced something similar, but that topic was closed due to inactivity.
https://forums.developer.nvidia.com/t/serial-port-not-worked-kernel-report-serial-tegra-3140000-serial-not-able-to-get-desc-for-rx/242638/16

I am not too familiar with the lower device tree etc, and could use a set of eyes to help understand this log file. Can someone please review the below kernel messages and provide feedback? This is a problem that is stopping upcoming testing and needs to be resolved.

Kernel Messages

sudo journalctl -f

Aug 02 18:35:01 CRON[12222]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 02 18:35:01 CRON[12223]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 02 18:35:01 CRON[12222]: pam_unix(cron:session): session closed for user root
Aug 02 18:35:58 sudo[12568]:     user : TTY=pts/6 ; PWD=/home/nvidia ; USER=root ; COMMAND=/usr/bin/jetson_clocks
Aug 02 18:35:58 sudo[12568]: pam_unix(sudo:session): session opened for user root by (uid=0)
Aug 02 18:35:58 kernel: nvgpu: 17000000.gv11b             railgate_enable_store:323  [INFO]  railgate is disabled.
Aug 02 18:35:58 sudo[12568]: pam_unix(sudo:session): session closed for user root
Aug 02 18:36:04 sudo[12662]:     user : TTY=pts/6 ; PWD=/home/nvidia ; USER=root ; COMMAND=/usr/bin/journalctl -f
Aug 02 18:36:04 sudo[12662]: pam_unix(sudo:session): session opened for user root by (uid=0)
Aug 02 18:37:34 kernel: warn_alloc: 2 callbacks suppressed
Aug 02 18:37:34 kernel: swapper/0: page allocation failure: order:0, mode:0x40800(GFP_NOWAIT|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
Aug 02 18:37:34 kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           OE     5.10.104-tegra #1
Aug 02 18:37:34 kernel: Hardware name: Unknown Jetson-AGX/Jetson-AGX, BIOS 1.0-d7fb19b 08/10/2022
Aug 02 18:37:34 kernel: Call trace:
Aug 02 18:37:34 kernel:  dump_backtrace+0x0/0x1d0
Aug 02 18:37:34 kernel:  show_stack+0x2c/0x40
Aug 02 18:37:34 kernel:  dump_stack+0xd8/0x138
Aug 02 18:37:34 kernel:  warn_alloc+0x108/0x180
Aug 02 18:37:34 kernel:  __alloc_pages_slowpath.constprop.0+0xb74/0xba0
Aug 02 18:37:34 kernel:  __alloc_pages_nodemask+0x29c/0x320
Aug 02 18:37:34 kernel:  allocate_slab+0x2ac/0x510
Aug 02 18:37:34 kernel:  ___slab_alloc.constprop.0+0x1d4/0x750
Aug 02 18:37:34 kernel:  __slab_alloc.isra.0.constprop.0+0x4c/0x90
Aug 02 18:37:34 kernel:  __kmalloc+0x420/0x450
Aug 02 18:37:34 kernel:  tegra_dma_prep_slave_sg+0x12c/0x340
Aug 02 18:37:34 kernel:  tegra_uart_start_rx_dma+0xb0/0x130
Aug 02 18:37:34 kernel:  tegra_uart_isr+0x300/0x490
Aug 02 18:37:34 kernel:  __handle_irq_event_percpu+0x60/0x2a0
Aug 02 18:37:34 kernel:  handle_irq_event_percpu+0x3c/0xa0
Aug 02 18:37:34 kernel:  handle_irq_event+0x4c/0xf0
Aug 02 18:37:34 kernel:  handle_fasteoi_irq+0xbc/0x170
Aug 02 18:37:34 kernel:  generic_handle_irq+0x3c/0x60
Aug 02 18:37:34 kernel:  __handle_domain_irq+0x6c/0xc0
Aug 02 18:37:34 kernel:  efi_header_end+0xa8/0xf0
Aug 02 18:37:34 kernel:  el1_irq+0xd0/0x180
Aug 02 18:37:34 kernel:  cpuidle_enter_state+0xb4/0x400
Aug 02 18:37:34 kernel:  cpuidle_enter+0x3c/0x50
Aug 02 18:37:34 kernel:  call_cpuidle+0x40/0x70
Aug 02 18:37:34 kernel:  do_idle+0x1fc/0x260
Aug 02 18:37:34 kernel:  cpu_startup_entry+0x28/0x70
Aug 02 18:37:34 kernel:  rest_init+0xd8/0xe4
Aug 02 18:37:34 kernel:  arch_call_rest_init+0x14/0x1c
Aug 02 18:37:34 kernel:  start_kernel+0x50c/0x540
Aug 02 18:37:34 kernel: Mem-Info:
Aug 02 18:37:34 kernel: active_anon:10891 inactive_anon:5897358 isolated_anon:0
                         active_file:1202415 inactive_file:450591 isolated_file:0
                         unevictable:4595 dirty:5 writeback:0
                         slab_reclaimable:20690 slab_unreclaimable:73489
                         mapped:46435 shmem:1424 pagetables:14969 bounce:0
                         free:45355 free_pcp:2181 free_cma:4043
Aug 02 18:37:34 kernel: Node 0 active_anon:43564kB inactive_anon:23589432kB active_file:4809660kB inactive_file:1802364kB unevictable:18380kB isolated(anon):0kB isolated(file):0kB mapped:185740kB dirty:20kB writeback:0kB shmem:5696kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1396736kB writeback_tmp:0kB kernel_stack:10432kB all_unreclaimable? no
Aug 02 18:37:34 kernel: DMA free:122352kB min:1304kB low:2224kB high:3144kB reserved_highatomic:0KB active_anon:20kB inactive_anon:598828kB active_file:146188kB inactive_file:8kB unevictable:0kB writepending:0kB present:1182400kB managed:920128kB mlocked:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:16172kB
Aug 02 18:37:34 kernel: lowmem_reserve[]: 0 0 30102 30102
Aug 02 18:37:34 kernel: Normal free:59068kB min:58084kB low:88908kB high:119732kB reserved_highatomic:2048KB active_anon:43544kB inactive_anon:22990604kB active_file:4662968kB inactive_file:1801480kB unevictable:18380kB writepending:20kB present:31457280kB managed:30837268kB mlocked:0kB pagetables:59876kB bounce:0kB free_pcp:8724kB local_pcp:652kB free_cma:0kB
Aug 02 18:37:34 kernel: lowmem_reserve[]: 0 0 0 0
Aug 02 18:37:34 kernel: DMA: 2*4kB (MC) 23*8kB (UMC) 39*16kB (UM) 38*32kB (UMEC) 20*64kB (UMC) 14*128kB (UMEC) 6*256kB (UMC) 4*512kB (MEC) 5*1024kB (UMEC) 1*2048kB (E) 26*4096kB (MC) = 122352kB
Aug 02 18:37:34 kernel: Normal: 1164*4kB (UMEH) 1333*8kB (UMEH) 660*16kB (UMEH) 899*32kB (UMEH) 22*64kB (UEH) 8*128kB (UMH) 6*256kB (UH) 1*512kB (U) 0*1024kB 0*2048kB 0*4096kB = 59128kB
Aug 02 18:37:34 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 02 18:37:34 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
Aug 02 18:37:34 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 02 18:37:34 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
Aug 02 18:37:34 kernel: 1659307 total pagecache pages
Aug 02 18:37:34 kernel: 214 pages in swap cache
Aug 02 18:37:34 kernel: Swap cache stats: add 75039, delete 74827, find 2525/6315
Aug 02 18:37:34 kernel: Free swap  = 15592992kB
Aug 02 18:37:34 kernel: Total swap = 15878688kB
Aug 02 18:37:34 kernel: 8159920 pages RAM
Aug 02 18:37:34 kernel: 0 pages HighMem/MovableOnly
Aug 02 18:37:34 kernel: 220571 pages reserved
Aug 02 18:37:34 kernel: 16384 pages cma reserved
Aug 02 18:37:34 kernel: 0 pages hwpoisoned
Aug 02 18:37:34 kernel: SLUB: Unable to allocate memory on node -1, gfp=0x900(GFP_NOWAIT|__GFP_ZERO)
Aug 02 18:37:34 kernel:   cache: kmalloc-256, object size: 256, buffer size: 256, default order: 1, min order: 0
Aug 02 18:37:34 kernel:   node 0: slabs: 27144, objs: 868608, free: 0
Aug 02 18:37:34 kernel: serial-tegra 3100000.serial: Not able to get desc for Rx

Configuration

AGX Xavier AGX 32G
Jetpack 5.0.2
Connect Tech Rogue carrier card
Using serial port /dev/ttyTHS0 and /dev/ttyTHS1

Update 20230803

After more reading I found a post that seems to be a close match to this issue. The proposed fix is in the kernel and attemps to close the DMA descriptor when terminating all, which seems to add up to my issue.
Here is the link to that post

I have also observed some additional debug information in dmsg that further point toward a dma issue

sudo dmesg
[18650.269109] serial-tegra 3110000.serial: Not able to get desc for Rx
[18650.269402] SLUB: Unable to allocate memory on node -1, gfp=0x900(GFP_NOWAIT|__GFP_ZERO)
[18650.269643]   cache: kmalloc-256, object size: 256, buffer size: 256, default order: 1, min order: 0
[18650.269849]   node 0: slabs: 180569, objs: 5778208, free: 0
[18650.269986] serial-tegra 3110000.serial: Not able to get desc for Tx
[18650.304791] tegra-gpcdma 2600000.dma: slave id already in use
[18650.305034] serial-tegra 3110000.serial: Not able to get desc for Tx
[18650.405176] tegra-gpcdma 2600000.dma: slave id already in use
[18650.405378] serial-tegra 3110000.serial: Not able to get desc for Tx
[18650.505123] tegra-gpcdma 2600000.dma: slave id already in use
[18650.505352] serial-tegra 3110000.serial: Not able to get desc for Tx
[18650.605411] tegra-gpcdma 2600000.dma: slave id already in use
[18650.605657] serial-tegra 3110000.serial: Not able to get desc for Tx



thank you

Are you using the device tree from ConnectTech? If you’ve flashed it with the default NVIDIA software, and if that carrier board is not an exact match in layout to the dev kit, then it is expected that part of the hardware is not going to function. If their carrier board is an exact match in layout to the dev kit, then they will state this on their web site, or else they will provide some sort of modified board support package (mostly a different device tree).

Hello,

Thank you for the reply.

Yes I am using the proper Connect Tech BSP, and the serial ports do indeed function. If it was not clear from the previous post, the system works OK until it doesn’t. The failures occur seemly at random time, though I’m sure the answers are in the logs I supplied but I’m not savvy enough at that low level to pick it out.

I’m currently rebuilding the kernel with the patch described in the link above.

What permissions do you see on those ports? What do you see from:
ls -l /dev/ttyTHS* /dev/ttyS*

Note that if the port is group dialout, then it is free for use. If that port is group tty, then it is part of the serial console and should not be used for other purposes without disabling the serial console (which is usually a bad idea if you don’t have a good reason).

Each ttyTHS port uses the same hardware as some of the ttyS ports, but the device special file name changes depending on which driver supports the hardware. The ttyTHS uses DMA, the ttyS is a legacy driver. You would not want to use the same port with two drivers at the same time, so another question is if you are using any ttyS port?

Hello,

Thank you for the reply, those are all good areas to look at. I agree and configure the serial ports to be in the dialout group, and also remove the mode manager, when I initially configure the AGX.

Configure serial ports

Remove the package modem manager

sudo apt remove modemmanager

Configure serial port ttyTHS0 to not be set up for a login console (getty)

The default group for /dev/ttyTHS0 is tty and not dialout. This serial port is not going to be used as a console login, it is going to be used to interface to electronics so the group needs to be changed to dialout.

sudo systemctl disable nvgetty.service

The group assignment of the serial ports before disabling the nvgetty service

crw-rw---- 1 root tty     238, 0 Apr 15 18:25 /dev/ttyTHS0
crw-rw---- 1 root dialout 238, 1 Apr 15 18:25 /dev/ttyTHS1

The group assignment of the serial ports after disabling the nvgetty service

crw-rw---- 1 root dialout 238, 0 Apr 15 18:25 /dev/ttyTHS0
crw-rw---- 1 root dialout 238, 1 Apr 15 18:25 /dev/ttyTHS1
crw-rw---- 1 root dialout   4, 64 Mar 27 13:54 /dev/ttyS0
crw-rw---- 1 root dialout   4, 65 Mar 27 13:54 /dev/ttyS1

I believe this is a more subtle issue that is closer to the kernel and memory usage. The logs pasted above indicate dma and/or some memory issues. The application does work and the ports are accessible under nominal operation.

@bruce4243 that looks exactly the same as the issue I was having in this thread that you linked: Serial-Tegra DMA Driver Bug - #4 by tony-persea

I can confirm that the patch they provided has completely solved the issue for us. Rebuilding the kernel was relatively straightforward, although packaging it into debs so that a DKMS driver we use would work was a pain in the butt. Hopefully this patch makes it into the next mainline JetPack release.

@ShaneCCC any additional thoughts?

1 Like

@bruce4243 one more thought. If you’re not at the point where you’re ready to build and deploy your own kernel build, we found that detecting that you’ve lost your serial data feed and re-opening the port would usually get the data flowing again for us. That wasn’t an acceptable solution for our use case, but it might be for yours?

Hello,

@tony-persea I applied the kernel patch on Friday and tested over the weekend. I can report that the kernel memory allocation, observed via slabtop, became much more controlled. The kmalloc-256 went to almost zero and the rest of the slabs were stable as well.

For the testing a was able to perform so far I did not observe any software process crashing. It looks like this patch correctly releases the DMA descriptors and addresses the issue. This seems like a substantial problem given that serial port based devices are frequently used on devices like the AGX. It is puzzling why NVIDIA has not applied this patch in Jetpack 5.1.1.

I will ask the same question to @ShaneCCC, any idea why this patch is not being applied? Is there a downside to it?

I’ll continue testing and report back if I see any bad behavior, hopefully this resolves this issue.

thanks

Hello,

@tony-persea
@ShaneCCC

I wanted to follow up on this post. This kernel patch appears to still be holding, so for now I will mark it complete.

The NVIDIA family of embedded computes are used widely in autonomous and perception systems. These types of systems often rely on serial devices, so a kernel bug that leaks memory due to serial port usage is serious. It would be good to get a reply from folks at NVIDIA on this.

Sorry to lose this.
The new release r35.4.1 have include this fixed.

Thanks

@ShaneCCC
Thank you for the follow up. Do you know if we can access a more complete change log from Nvidia for this new release? The usual release notes are high level, which is typically enough. For some of us lower level integration folks it would help to get more details on the changes, for example this patch.
Is this possible to get eyes on?

@tony-persea
Thank you for your original post.

thanks

Looks like check the release note is the only way for it. Otherwise need to download the source to confirm it.

Thanks

Thanks @ShaneCCC , I’ll consider this issue closed

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.