Hello,
I am having a fatal error issue with software accessing serial port devices attach to an AGX. I am currently using serial devices connected to an AGX Xavier on a ConnectTech Rogue carrier. Both of these serial devices appear to work correctly and I am able to stream data into different applications.
I have recently started experiencing these serial device software applications going into an error state due to no received data arriving. Both devices seem to go into an error state close in time. The applications will start up correctly, and have run for long periods with no issue, sometimes I never see an issue during the hours long tests I do. When they do have an issue they can go into the error state at any time; I have roughly correlated it to when the memory being used by the other processing I launch start taking a lot of the RAM. But I don’t have anything to say for sure, but it feels that way from watching things start up.
I have been able to observe in the kernel logs that right when the application software begins reporting data arrival errors, the kernel outputs a message about SLUB issues and an Rx desc cannot be provided. Once the application software gets into this state it can only recover by killing the application and restarting it.
Memory Usage: kernel
An additional item I observed is that when running either of the serial port applications the kmalloc-256 allocation just seems to grow unbounded. For example I measured a kmallow-256 allocation of around 120kbytes/s sustained over many hours. I only see that when running software that access the serial ports. This memory appears to leak and is not freed until a full AGX reboot occurs. As an example, running less than one hour
sudo slabtop -s a
Active / Total Objects (% used) : 3073102 / 3112035 (98.7%)
Active / Total Slabs (% used) : 94706 / 94706 (100.0%)
Active / Total Caches (% used) : 94 / 130 (72.3%)
Active / Total Size (% used) : 764493.08K / 774870.92K (98.7%)
Minimum / Average / Maximum Object : 0.01K / 0.25K / 8.83K
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
2531488 2531488 100% 0.25K 79109 32 632872K kmalloc-256
64512 64226 99% 0.01K 126 512 504K zs_handle
61440 61440 100% 0.12K 1920 32 7680K kmalloc-128
52696 52103 98% 0.57K 1882 28 30112K radix_tree_node
46720 46690 99% 0.12K 1460 32 5840K kernfs_node_cache
56575 41779 73% 0.05K 775 73 3100K lsm_inode_cache
50967 39394 77% 0.19K 2427 21 9708K dentry
33885 33174 97% 0.58K 1255 27 20080K inode_cache
25630 25630 100% 0.18K 1165 22 4660K vm_area_struct
18432 17478 94% 0.06K 288 64 1152K vmap_area
16256 15869 97% 0.06K 254 64 1016K anon_vma_chain
15470 14832 95% 0.05K 182 85 728K ftrace_event_field
11220 11220 100% 0.04K 110 102 440K ext4_extent_status
Memory Usage: user space
All software applications, in addition to the serial port application, take up about 80% to 85% of the total AGX 32GB of RAM. Not ideal but still has some headroom.
free -h
total used free shared buff/cache available
Mem: 30Gi 23Gi 416Mi 3.0Mi 6.0Gi 5.9Gi
Swap: 15Gi 270Mi 14Gi
It feels like this issue occurs as a result of some memory not being allocated, but need some assistance on this.
I found another post here where the author experienced something similar, but that topic was closed due to inactivity.
https://forums.developer.nvidia.com/t/serial-port-not-worked-kernel-report-serial-tegra-3140000-serial-not-able-to-get-desc-for-rx/242638/16
I am not too familiar with the lower device tree etc, and could use a set of eyes to help understand this log file. Can someone please review the below kernel messages and provide feedback? This is a problem that is stopping upcoming testing and needs to be resolved.
Kernel Messages
sudo journalctl -f
Aug 02 18:35:01 CRON[12222]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 02 18:35:01 CRON[12223]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 02 18:35:01 CRON[12222]: pam_unix(cron:session): session closed for user root
Aug 02 18:35:58 sudo[12568]: user : TTY=pts/6 ; PWD=/home/nvidia ; USER=root ; COMMAND=/usr/bin/jetson_clocks
Aug 02 18:35:58 sudo[12568]: pam_unix(sudo:session): session opened for user root by (uid=0)
Aug 02 18:35:58 kernel: nvgpu: 17000000.gv11b railgate_enable_store:323 [INFO] railgate is disabled.
Aug 02 18:35:58 sudo[12568]: pam_unix(sudo:session): session closed for user root
Aug 02 18:36:04 sudo[12662]: user : TTY=pts/6 ; PWD=/home/nvidia ; USER=root ; COMMAND=/usr/bin/journalctl -f
Aug 02 18:36:04 sudo[12662]: pam_unix(sudo:session): session opened for user root by (uid=0)
Aug 02 18:37:34 kernel: warn_alloc: 2 callbacks suppressed
Aug 02 18:37:34 kernel: swapper/0: page allocation failure: order:0, mode:0x40800(GFP_NOWAIT|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
Aug 02 18:37:34 kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE 5.10.104-tegra #1
Aug 02 18:37:34 kernel: Hardware name: Unknown Jetson-AGX/Jetson-AGX, BIOS 1.0-d7fb19b 08/10/2022
Aug 02 18:37:34 kernel: Call trace:
Aug 02 18:37:34 kernel: dump_backtrace+0x0/0x1d0
Aug 02 18:37:34 kernel: show_stack+0x2c/0x40
Aug 02 18:37:34 kernel: dump_stack+0xd8/0x138
Aug 02 18:37:34 kernel: warn_alloc+0x108/0x180
Aug 02 18:37:34 kernel: __alloc_pages_slowpath.constprop.0+0xb74/0xba0
Aug 02 18:37:34 kernel: __alloc_pages_nodemask+0x29c/0x320
Aug 02 18:37:34 kernel: allocate_slab+0x2ac/0x510
Aug 02 18:37:34 kernel: ___slab_alloc.constprop.0+0x1d4/0x750
Aug 02 18:37:34 kernel: __slab_alloc.isra.0.constprop.0+0x4c/0x90
Aug 02 18:37:34 kernel: __kmalloc+0x420/0x450
Aug 02 18:37:34 kernel: tegra_dma_prep_slave_sg+0x12c/0x340
Aug 02 18:37:34 kernel: tegra_uart_start_rx_dma+0xb0/0x130
Aug 02 18:37:34 kernel: tegra_uart_isr+0x300/0x490
Aug 02 18:37:34 kernel: __handle_irq_event_percpu+0x60/0x2a0
Aug 02 18:37:34 kernel: handle_irq_event_percpu+0x3c/0xa0
Aug 02 18:37:34 kernel: handle_irq_event+0x4c/0xf0
Aug 02 18:37:34 kernel: handle_fasteoi_irq+0xbc/0x170
Aug 02 18:37:34 kernel: generic_handle_irq+0x3c/0x60
Aug 02 18:37:34 kernel: __handle_domain_irq+0x6c/0xc0
Aug 02 18:37:34 kernel: efi_header_end+0xa8/0xf0
Aug 02 18:37:34 kernel: el1_irq+0xd0/0x180
Aug 02 18:37:34 kernel: cpuidle_enter_state+0xb4/0x400
Aug 02 18:37:34 kernel: cpuidle_enter+0x3c/0x50
Aug 02 18:37:34 kernel: call_cpuidle+0x40/0x70
Aug 02 18:37:34 kernel: do_idle+0x1fc/0x260
Aug 02 18:37:34 kernel: cpu_startup_entry+0x28/0x70
Aug 02 18:37:34 kernel: rest_init+0xd8/0xe4
Aug 02 18:37:34 kernel: arch_call_rest_init+0x14/0x1c
Aug 02 18:37:34 kernel: start_kernel+0x50c/0x540
Aug 02 18:37:34 kernel: Mem-Info:
Aug 02 18:37:34 kernel: active_anon:10891 inactive_anon:5897358 isolated_anon:0
active_file:1202415 inactive_file:450591 isolated_file:0
unevictable:4595 dirty:5 writeback:0
slab_reclaimable:20690 slab_unreclaimable:73489
mapped:46435 shmem:1424 pagetables:14969 bounce:0
free:45355 free_pcp:2181 free_cma:4043
Aug 02 18:37:34 kernel: Node 0 active_anon:43564kB inactive_anon:23589432kB active_file:4809660kB inactive_file:1802364kB unevictable:18380kB isolated(anon):0kB isolated(file):0kB mapped:185740kB dirty:20kB writeback:0kB shmem:5696kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1396736kB writeback_tmp:0kB kernel_stack:10432kB all_unreclaimable? no
Aug 02 18:37:34 kernel: DMA free:122352kB min:1304kB low:2224kB high:3144kB reserved_highatomic:0KB active_anon:20kB inactive_anon:598828kB active_file:146188kB inactive_file:8kB unevictable:0kB writepending:0kB present:1182400kB managed:920128kB mlocked:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:16172kB
Aug 02 18:37:34 kernel: lowmem_reserve[]: 0 0 30102 30102
Aug 02 18:37:34 kernel: Normal free:59068kB min:58084kB low:88908kB high:119732kB reserved_highatomic:2048KB active_anon:43544kB inactive_anon:22990604kB active_file:4662968kB inactive_file:1801480kB unevictable:18380kB writepending:20kB present:31457280kB managed:30837268kB mlocked:0kB pagetables:59876kB bounce:0kB free_pcp:8724kB local_pcp:652kB free_cma:0kB
Aug 02 18:37:34 kernel: lowmem_reserve[]: 0 0 0 0
Aug 02 18:37:34 kernel: DMA: 2*4kB (MC) 23*8kB (UMC) 39*16kB (UM) 38*32kB (UMEC) 20*64kB (UMC) 14*128kB (UMEC) 6*256kB (UMC) 4*512kB (MEC) 5*1024kB (UMEC) 1*2048kB (E) 26*4096kB (MC) = 122352kB
Aug 02 18:37:34 kernel: Normal: 1164*4kB (UMEH) 1333*8kB (UMEH) 660*16kB (UMEH) 899*32kB (UMEH) 22*64kB (UEH) 8*128kB (UMH) 6*256kB (UH) 1*512kB (U) 0*1024kB 0*2048kB 0*4096kB = 59128kB
Aug 02 18:37:34 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 02 18:37:34 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
Aug 02 18:37:34 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 02 18:37:34 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
Aug 02 18:37:34 kernel: 1659307 total pagecache pages
Aug 02 18:37:34 kernel: 214 pages in swap cache
Aug 02 18:37:34 kernel: Swap cache stats: add 75039, delete 74827, find 2525/6315
Aug 02 18:37:34 kernel: Free swap = 15592992kB
Aug 02 18:37:34 kernel: Total swap = 15878688kB
Aug 02 18:37:34 kernel: 8159920 pages RAM
Aug 02 18:37:34 kernel: 0 pages HighMem/MovableOnly
Aug 02 18:37:34 kernel: 220571 pages reserved
Aug 02 18:37:34 kernel: 16384 pages cma reserved
Aug 02 18:37:34 kernel: 0 pages hwpoisoned
Aug 02 18:37:34 kernel: SLUB: Unable to allocate memory on node -1, gfp=0x900(GFP_NOWAIT|__GFP_ZERO)
Aug 02 18:37:34 kernel: cache: kmalloc-256, object size: 256, buffer size: 256, default order: 1, min order: 0
Aug 02 18:37:34 kernel: node 0: slabs: 27144, objs: 868608, free: 0
Aug 02 18:37:34 kernel: serial-tegra 3100000.serial: Not able to get desc for Rx
Configuration
AGX Xavier AGX 32G
Jetpack 5.0.2
Connect Tech Rogue carrier card
Using serial port /dev/ttyTHS0 and /dev/ttyTHS1
Update 20230803
After more reading I found a post that seems to be a close match to this issue. The proposed fix is in the kernel and attemps to close the DMA descriptor when terminating all, which seems to add up to my issue.
Here is the link to that post
I have also observed some additional debug information in dmsg that further point toward a dma issue
sudo dmesg
[18650.269109] serial-tegra 3110000.serial: Not able to get desc for Rx
[18650.269402] SLUB: Unable to allocate memory on node -1, gfp=0x900(GFP_NOWAIT|__GFP_ZERO)
[18650.269643] cache: kmalloc-256, object size: 256, buffer size: 256, default order: 1, min order: 0
[18650.269849] node 0: slabs: 180569, objs: 5778208, free: 0
[18650.269986] serial-tegra 3110000.serial: Not able to get desc for Tx
[18650.304791] tegra-gpcdma 2600000.dma: slave id already in use
[18650.305034] serial-tegra 3110000.serial: Not able to get desc for Tx
[18650.405176] tegra-gpcdma 2600000.dma: slave id already in use
[18650.405378] serial-tegra 3110000.serial: Not able to get desc for Tx
[18650.505123] tegra-gpcdma 2600000.dma: slave id already in use
[18650.505352] serial-tegra 3110000.serial: Not able to get desc for Tx
[18650.605411] tegra-gpcdma 2600000.dma: slave id already in use
[18650.605657] serial-tegra 3110000.serial: Not able to get desc for Tx
thank you