Memory Leak on Jetson Orin when calculating ComputeCache with videotestsrc

• Hardware Platform (Jetson / GPU): Jetson AGX/NX Orin
• DeepStream Version: 6.3
• JetPack Version (valid for Jetson only) : 5.1.3
• TensorRT Version: 8.5.2-1+cuda11.4
• Issue Type( questions, new requirements, bugs): questions/bugs

Hello, I have recently encountered unusually high memory usage for fairly trivial DeepStream programs on Jetson Orin, compared to Jetson Xavier platforms.

I have noticed that the memory usage is about 10x higher and that the pipeline takes up to 2 minutes to start.
After some investigation, I traced this to the ~/.nv/ComputeCache/ directory and the JIT compilation of CUDA kernels.

This is the testing pipeline:

gst-launch-1.0 videotestsrc is-live=1 pattern=black num-buffers=128 ! videoconvert !  'video/x-raw,format=NV12,width=608,height=608,framerate=25/1' ! queue !  nvvideoconvert ! 'video/x-raw(memory:NVMM),width=608,height=608,framerate=25/1,format=NV12' ! fakesink sync=0

For the measurements below, assume the kernels were already compiled and present in the cache:

Mode                   Memory usage    Startup time
CUDA_CACHE_DISABLE=0   160 MB (0.5%)   1 s
CUDA_CACHE_DISABLE=1   1440 MB (4.5%)  1 m 10 s
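For reference, the memory figures above can be sampled while the pipeline runs. This is only a sketch, assuming Linux's /proc interface; `sleep 2` stands in for the actual gst-launch-1.0 pipeline shown above:

```shell
# Hedged sketch: sample the peak resident set size (VmHWM) of the pipeline.
# "sleep 2" is a stand-in for the gst-launch-1.0 command shown above.
CUDA_CACHE_DISABLE=1 sleep 2 &
pid=$!
sleep 1                          # give the process time to start up
grep VmHWM "/proc/$pid/status"   # peak resident memory of the process so far
wait "$pid"
```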

This suggests that:

  • if the binaries are cached, they are loaded on start with no memory overhead;
  • if the binaries are compiled on start, the compilation data remains in memory until the program terminates.
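To reproduce the slow, high-memory case at will, it should be enough to clear the JIT cache before starting the pipeline. A minimal sketch, assuming the default cache path (CUDA_CACHE_PATH overrides it if set):

```shell
# Remove the compiled-kernel cache so the next pipeline start has to
# JIT-compile everything again (honoring CUDA_CACHE_PATH if it is set).
rm -rf "${CUDA_CACHE_PATH:-$HOME/.nv/ComputeCache}"
```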

This is how the contents of the cache ~/.nv/ComputeCache/ look:

   68,0 MiB [##########] /f
  840,0 KiB [          ] /9
  360,0 KiB [          ] /5
    4,0 KiB [          ]  index  

Here we see that the binary file causing this is huge (68 MiB), and after looking into it, we find thousands of kernels for color conversions:

strings 7c9d8fbee1f027 | grep 'nv\.constant2\._' | wc -l
>> 4224

A short example:

.nv.constant2._Z27YUV444_to_NV12_709_ER_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z24YUV444_to_NV12_709_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z23YUV444_to_NV12_ER_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z20YUV444_to_NV12_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z29YUV444_to_YUV420_709_ER_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z26YUV444_to_YUV420_709_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z25YUV444_to_YUV420_ER_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z22YUV444_to_YUV420_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z30YUV444_to_B32F_G32F_R32F_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z30YUV444_to_R32F_G32F_B32F_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z24YUV444_to_B8_G8_R8_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z24YUV444_to_R8_G8_B8_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z19YUV444_to_BGR_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z19YUV444_to_RGB_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z20YUV444_to_ABGR_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z20YUV444_to_ARGB_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z20YUV444_to_BGRA_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z20YUV444_to_RGBA_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
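The part after the `.nv.constant2.` prefix is a standard Itanium-mangled C++ symbol, so the full kernel signatures can be recovered with `c++filt`. A sketch using one of the names above:

```shell
# Strip the ".nv.constant2." prefix and demangle the remaining C++ symbol.
printf '%s\n' '.nv.constant2._Z19YUV444_to_RGB_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii' |
  sed 's/^\.nv\.constant2\.//' |
  c++filt
# -> YUV444_to_RGB_cutex(unsigned long long, unsigned long long, ...)
```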

This does not happen when reading the video from a file; I have only encountered it when using videotestsrc.
Using caps that explicitly specify the NV12 format had no effect.

My questions are:

  • Is this expected? Shouldn’t the memory be freed after JIT compilation finishes? In other words, shouldn’t the allocated memory end up the same as if the binaries had been loaded from the cache?
  • Can this be limited, so that it does not compile all the kernels that will never be needed?
  • Why does this not happen when using filesrc?

Thank you for your assistance,
Simon

Can you explain further how to reproduce it?

I used the following command to test and found no difference.

I also observed the memory (18 MB) and video memory (200 MB) at the same time, and there was no difference.

time CUDA_CACHE_DISABLE=1 gst-launch-1.0 videotestsrc is-live=1 pattern=black num-buffers=128 ! videoconvert !  'video/x-raw,format=NV12,width=608,height=608,framerate=25/1' ! queue !  nvvideoconvert ! 'video/x-raw(memory:NVMM),width=608,height=608,framerate=25/1,format=NV12' ! nv3dsink

real	0m6.456s
user 0m0.766s
sys	0m0.514s
time CUDA_CACHE_DISABLE=0 gst-launch-1.0 videotestsrc is-live=1 pattern=black num-buffers=128 ! videoconvert !  'video/x-raw,format=NV12,width=608,height=608,framerate=25/1' ! queue !  nvvideoconvert ! 'video/x-raw(memory:NVMM),width=608,height=608,framerate=25/1,format=NV12' ! nv3dsink

real	0m6.450s
user 0m0.646s
sys	0m0.498s

I’m using an AGX Orin 64G, and the JetPack version is 6.1. Can you try upgrading? I’m not sure if this is a bug in the legacy JetPack.

Hello, thank you for checking this on the new JetPack 6.1. I did suspect this could be connected to using the legacy version 5.1.3.

There are no additional steps: I installed DeepStream 6.3 plus nvbufsurftransform, in the version matching JetPack 5.1.3, on my Jetson Orin.

However, your input has been very helpful: I now know this behavior is not observed on the new JetPack 6.1. I will post an update as soon as I have tried it on the new JetPack.