Kmalloc-128 & kernfs_node_cache leak on Orin NX 16GB (JP 6.1)

We use SEEED reComputer J4011 units (Orin NX 16GB modules) in production and have noticed a concerning slab memory leak on every module in our fleet. It leaks at around 20-25MB a day, and the only way to free this memory is to restart the device.
See the chart below from Grafana, which tracks SUnreclaim by scraping /proc/meminfo over a 15-day period.
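For reference, a minimal sketch of the kind of poller that feeds this chart (our actual exporter differs; the one-minute interval and plain-text output here are just placeholders):

import time

def read_sunreclaim_kb():
    # Parse the SUnreclaim field (reported in kB) out of /proc/meminfo.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("SUnreclaim:"):
                return int(line.split()[1])
    raise RuntimeError("SUnreclaim not found in /proc/meminfo")

while True:
    # In production this value is scraped into Grafana; printing is enough here.
    print(int(time.time()), read_sunreclaim_kb(), flush=True)
    time.sleep(60)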


This issue reproduces on JP 6.1 & JP 6.2.1.
slabtop shows these objects as active (this is a different Jetson unit from the one above).

OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
55296 1334600  98%    0.12K  42353       32    169412K kmalloc-128
1045472 1045079  99%    0.12K  32671       32    130684K kernfs_node_cache

I’ve been able to reproduce this by running the following Python script repeatedly in our Docker image. After cancelling the script and waiting 10 minutes, SUnreclaim in /proc/meminfo and the kernfs_node_cache/kmalloc-128 counts in slabtop remain inflated from a baseline measurement and never reduce back down.

import pynvml as nvml
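
# Repeatedly initialize and shut down NVML; each init/shutdown cycle
# appears to leak kernel slab memory (kernfs_node_cache / kmalloc-128).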

while True:
    try:
        nvml.nvmlInit()
        h = nvml.nvmlDeviceGetHandleByIndex(0)
        print(nvml.nvmlDeviceGetName(h), flush=True)
    except KeyboardInterrupt:
        break
    finally:
        try:
            nvml.nvmlShutdown()
        except Exception:
            pass

Our services run very similar logic as part of a regular heartbeat. With bpftrace, I can see the following stack trace for an allocation made on behalf of our application that never seems to get a corresponding free. I’m not an eBPF expert, so take the following with a grain of salt. I also see the same trace triggered by nvidia-container, vulkaninfo and a few others.

    __traceiter_kmem_cache_alloc+104
    __traceiter_kmem_cache_alloc+104 
    kmem_cache_alloc+780 
    __kernfs_new_node+116
    kernfs_new_node+100 
    kernfs_create_link+80 
    sysfs_do_create_link_sd+116 
    sysfs_create_link+84 
    sysfs_slab_alias+164 
    __kmem_cache_alias+144
    kmem_cache_create_usercopy+200
    kmem_cache_create+84
    nvgpu_kmem_cache_create+240 
    nvgpu_buddy_allocator_init+392 
    nvgpu_allocator_init+300 
    nvgpu_vm_do_init+1652 
    nvgpu_vm_init+152 
    gk20a_as_alloc_share+548 
    gk20a_ctrl_dev_ioctl+3164 
    __arm64_sys_ioctl+172 
    invoke_syscall+88 
    el0_svc_common.constprop.0+9 
    do_el0_svc+112 
    el0_svc+36 
    el0t_64_sync_handler+156 
    el0t_64_sync+400

Is this a known issue? Is there any workaround?
We have a few ideas on how to mitigate this on the application side. Is restarting the only way to free this memory?

*** Please note that this reply is generated by LLM automatically ***

The issue you’re experiencing with the Jetson Orin NX 16GB modules is related to memory leaks, specifically slab memory leaks. This type of leak occurs when memory is allocated but not properly freed, leading to a gradual increase in memory usage over time.

The symptoms you’ve described, such as the memory leak rate of around 20-25MB per day and the need to restart the device to free the memory, are consistent with a slab memory leak.

The fact that you’ve been able to reproduce the issue by running a Python script repeatedly in your Docker image suggests that the leak is related to the application’s memory allocation and deallocation patterns.

The stack trace you provided from bpftrace shows that the allocation is being made by the nvidia-container process, which is part of the NVIDIA container runtime. This suggests that the leak may be related to the container runtime’s memory management.

There are a few possible explanations for this issue:

  1. NVIDIA container runtime bug: It’s possible that there’s a bug in the NVIDIA container runtime that’s causing the memory leak. You may want to check the NVIDIA forums or GitHub issues to see if anyone else has reported similar issues.
  2. Application memory allocation patterns: The fact that your application’s services run similar logic as part of a regular heartbeat suggests that the memory allocation patterns may be contributing to the leak. You may want to review your application’s memory allocation and deallocation patterns to ensure that they’re properly freeing memory.
  3. System configuration: It’s possible that the system configuration is contributing to the memory leak. You may want to check the system’s memory settings, such as the slab memory size, to ensure that they’re properly configured.

To mitigate the issue, you may want to consider the following workarounds:

  1. Restart the device periodically: As you’ve already discovered, restarting the device can free up the slab memory. You may want to consider implementing a periodic restart schedule to mitigate the issue.
  2. Implement memory monitoring and alerting: You can implement memory monitoring and alerting tools to detect when the slab memory usage is approaching a critical threshold. This can help you identify the issue before it becomes severe.
  3. Optimize application memory allocation patterns: Reviewing and optimizing your application’s memory allocation patterns may help reduce the memory leak.
  4. Update the NVIDIA container runtime: If you’re running an older version of the NVIDIA container runtime, you may want to update to the latest version to see if the issue is resolved.

In comparison to other Jetson modules, the Jetson Orin NX 16GB module is a relatively new and powerful module. However, the memory leak issue you’re experiencing is not unique to this module, and similar issues have been reported on other Jetson modules.

For example, the Jetson Orin Nano 8GB module has been reported to have similar memory leak issues, although the root cause may be different.

In summary, the memory leak issue you’re experiencing on the Jetson Orin NX 16GB module is likely related to the NVIDIA container runtime or application memory allocation patterns. Implementing workarounds such as periodic restarts, memory monitoring, and optimizing application memory allocation patterns may help mitigate the issue.

*** Please give the thumbs up if you get answers you like. Or provide feedback to help us improve the answer. ***

Hi,

We are not sure which API pynvml uses.
However, there is a known memory leak related to CUDA IPC.
The fix can be found on the L4T r36.4.4 page:

Additional Files
Fix memory leak observed when importing an external memory handle through IPC.

Please replace the CUDA driver to see if this fixes your issue.

Thanks.

Hi,
This fix did not resolve the issue.
We have also observed the same leak, with the same trace, caused by nvidia-container and vulkaninfo.

Thanks,
Connor

Hi,

We will need to reproduce this issue locally to gather more information.
Could you share how to reproduce this?

We tried the pynvml Python code but failed to capture the leak before/after running the script:

before

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            29Gi       1.6Gi        27Gi        29Mi       1.3Gi        28Gi
Swap:           14Gi          0B        14Gi

after

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            29Gi       1.6Gi        27Gi        29Mi       1.3Gi        28Gi
Swap:           14Gi          0B        14Gi

Thanks.

Hi,
The leak is slow (~0.3 kB/s, i.e. 20-25MB a day). The ‘free -h’ output above is not granular enough to capture it unless you run the script for a considerable length of time.

Please check the following before and after running the script for at least 10 minutes to validate.

cat /proc/meminfo
Note the value of SUnreclaim.

cat /proc/slabinfo
Note the number of active objects for kernfs_node_cache & kmalloc-128.

Please also note that these values do not reduce back down after ending the script.
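
If it helps, here is a minimal sketch of the kind of before/after snapshot we take (it assumes the standard /proc/slabinfo column layout and typically needs to run as root):

def sunreclaim_kb():
    # SUnreclaim is reported in kB in /proc/meminfo.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("SUnreclaim:"):
                return int(line.split()[1])

def slab_active_objects(cache_name):
    # /proc/slabinfo rows start with: <name> <active_objs> <num_objs> ...
    with open("/proc/slabinfo") as f:
        for line in f:
            fields = line.split()
            if fields and fields[0] == cache_name:
                return int(fields[1])

# Run once before and once after the reproducer, then compare the numbers.
print("SUnreclaim (kB):", sunreclaim_kb())
for cache in ("kernfs_node_cache", "kmalloc-128"):
    print(cache, "active objects:", slab_active_objects(cache))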

Thanks.

Hi,
Can we get an update on this? This is extremely concerning for us, as we sell security-focused, high-uptime products and we can’t just restart devices ad hoc without disrupting our customers.

Thanks.

Hi,

We can reproduce this issue locally and are now checking with our internal team.
Will keep you updated.

Thanks.

Has the problem been resolved? I used the JP 6.2.1 image and applied the patch package, but after re-flashing the device the problem still persists. What is the progress on your solution? This bug seriously affects normal operation when running large models.

Hi,

Could you share more information about your issue?

The driver fixes a known CUDA IPC memory leak.
This topic reports a memory leak in the kernel, which our internal team is checking.

If you are facing a leak in another scenario, please file a separate topic for your issue.

Thanks.

Hi, has there been any progress on resolving this issue?

Thanks

Hi,

Our internal team needs more time for this issue.
We will get back to you with any progress.

Thanks.

Hi,

In the meantime, could you try the change below to see if it can help?

Thanks.

Hi,
We tried this patch previously and it did not remedy the leak.

Hi,

Thanks for testing.
We will keep you updated on any feedback from our internal team.

Thanks.

Hi,

Thanks for your patience.

We found that the memory leak disappears after moving the initialization outside the loop, as below:

import pynvml as nvml

nvml.nvmlInit()
try:
    while True:
        try:
            h = nvml.nvmlDeviceGetHandleByIndex(0)
            print(nvml.nvmlDeviceGetName(h), flush=True)
        except KeyboardInterrupt:
            break
finally:
    nvml.nvmlShutdown()

Initialization and shutdown should only be done once, while queries can be performed multiple times in the loop.
Thanks.

Hi,

This does not remedy the leak in any meaningful way and is not a sufficient reply after 2 months of no progress.

This problem is not isolated to NVML; the above script was attached as a minimal reproducer to allow a swifter resolution of this issue. In the original post, I state that I also see the same trace triggered by nvidia-container, vulkaninfo and a few others.

In practice, we see a very slow leak that accumulates over the course of months. Our services no longer make use of NVML, and we still see this leak caused by the previously mentioned services.
Additionally, calling init and shutdown within a process’s lifetime is valid behaviour. I suspect that running the same script with no internal loop, driven instead by an external bash loop (better simulating real-world use), would trigger the same leak; see the sketch below.
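
For illustration, a minimal sketch of that single-shot variant (hypothetical, not the exact logic our services run), intended to be invoked repeatedly from an external shell loop:

import pynvml as nvml

# One init/query/shutdown cycle per process invocation, mimicking short-lived
# processes such as heartbeat checks, container hooks or vulkaninfo runs.
nvml.nvmlInit()
try:
    h = nvml.nvmlDeviceGetHandleByIndex(0)
    print(nvml.nvmlDeviceGetName(h), flush=True)
finally:
    nvml.nvmlShutdown()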

This is clearly a leak in the GPU driver as it occurs for many different services in the same way. Over the course of months, processes will start and stop and this leak will build until we are forced to restart the board.

Kind regards,
Connor