We use Seeed reComputer J4011 (Orin NX 16GB modules) in production and have noticed a concerning slab memory leak on every module in our fleet. It grows by around 20-25 MB a day, and the only way we've found to free this memory is to restart the device.
See the chart below from Grafana, which tracks SUnreclaim (scraped from /proc/meminfo) over a 15-day period.
This issue reproduces on JP 6.1 & JP 6.2.1.
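
For reference, the charted value is just the SUnreclaim field from /proc/meminfo. A minimal sketch of the sampling (our real exporter differs; this is only to show what's being graphed):

def sunreclaim_kb():
    # Parse the SUnreclaim line out of /proc/meminfo; the value is reported in kB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("SUnreclaim:"):
                return int(line.split()[1])
    raise RuntimeError("SUnreclaim not found in /proc/meminfo")

if __name__ == "__main__":
    print(sunreclaim_kb())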
slabtop shows these objects as active (this output is from a different Jetson unit than the chart above).
   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
1355296 1334600  98%    0.12K  42353       32    169412K kmalloc-128
1045472 1045079  99%    0.12K  32671       32    130684K kernfs_node_cache
I've been able to reproduce this by repeatedly running the following Python script in our Docker image. After cancelling the script and waiting 10 minutes, SUnreclaim in /proc/meminfo and the kernfs_node_cache/kmalloc-128 counts in slabtop remain inflated relative to a baseline measurement and never come back down (a sketch of how I script the before/after check follows the repro script).
import pynvml as nvml

while True:
    try:
        nvml.nvmlInit()
        h = nvml.nvmlDeviceGetHandleByIndex(0)
        print(nvml.nvmlDeviceGetName(h), flush=True)
    except KeyboardInterrupt:
        break
    finally:
        try:
            nvml.nvmlShutdown()
        except Exception:
            pass
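
To script the before/after comparison rather than eyeballing slabtop, I read the active-object counts straight out of /proc/slabinfo (needs root; the cache names here are taken from the slabtop output above):

def slab_active_objs(names=("kernfs_node_cache", "kmalloc-128")):
    counts = {}
    with open("/proc/slabinfo") as f:
        for line in f:
            fields = line.split()
            # data rows: name active_objs num_objs objsize objperslab pagesperslab ...
            if fields and fields[0] in names:
                counts[fields[0]] = int(fields[1])
    return counts

print(slab_active_objs())  # run once for a baseline, and again ~10 minutes after stopping the loop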
Our services run very similar logic as part of a regular heartbeat. With bpftrace, I can see the following stack trace for an allocation that never appears to get a corresponding free, called from our application. I'm not an eBPF expert, so take this with a grain of salt (a rough sketch of the probe follows the stack trace). I also see the same trace triggered by nvidia-containe, vulkan-info, and a few others.
__traceiter_kmem_cache_alloc+104
__traceiter_kmem_cache_alloc+104
kmem_cache_alloc+780
__kernfs_new_node+116
kernfs_new_node+100
kernfs_create_link+80
sysfs_do_create_link_sd+116
sysfs_create_link+84
sysfs_slab_alias+164
__kmem_cache_alias+144
kmem_cache_create_usercopy+200
kmem_cache_create+84
nvgpu_kmem_cache_create+240
nvgpu_buddy_allocator_init+392
nvgpu_allocator_init+300
nvgpu_vm_do_init+1652
nvgpu_vm_init+152
gk20a_as_alloc_share+548
gk20a_ctrl_dev_ioctl+3164
__arm64_sys_ioctl+172
invoke_syscall+88
el0_svc_common.constprop.0+9
do_el0_svc+112
el0_svc+36
el0t_64_sync_handler+156
el0t_64_sync+400
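
The probe was roughly along these lines; the map name, the 30-second window, and the Python wrapper are only illustrative, and matching allocations against frees was done separately with a similar probe on the free tracepoint:

import subprocess

BPFTRACE_PROG = r"""
tracepoint:kmem:kmem_cache_alloc { @alloc_stacks[comm, kstack] = count(); }
interval:s:30 { exit(); }
"""

# Runs bpftrace for ~30 seconds, then prints aggregated kernel stacks for slab
# allocations per process; tracepoint:kmem:kmem_cache_free can be probed the
# same way to look at the free side.
subprocess.run(["bpftrace", "-e", BPFTRACE_PROG], check=True)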
Is this a known issue? Is there any workaround?
We have a few ideas for mitigating this on the application side (one sketch is below). Is restarting the device really the only way to free this memory?
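
For example, one idea, assuming the growth really is tied to repeated nvmlInit()/nvmlShutdown() cycles (which is only our working theory), is to keep NVML initialized for the lifetime of the service instead of per heartbeat; heartbeat() below is a hypothetical stand-in for our real check:

import atexit
import pynvml as nvml

# Initialize NVML once for the lifetime of the process and reuse the handle,
# instead of an init/shutdown cycle on every heartbeat.
nvml.nvmlInit()
atexit.register(nvml.nvmlShutdown)

_handle = nvml.nvmlDeviceGetHandleByIndex(0)

def heartbeat():
    # hypothetical stand-in for the real heartbeat payload
    return nvml.nvmlDeviceGetName(_handle)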


