• Hardware Platform (Jetson / GPU): Jetson AGX/NX Orin
• DeepStream Version: 6.3
• JetPack Version (valid for Jetson only): 5.1.3
• TensorRT Version: 8.5.2-1+cuda11.4
• Issue Type (questions, new requirements, bugs): questions/bugs
Hello, I have recently encountered high memory usage for fairly trivial DeepStream programs on Jetson Orin compared to Jetson Xavier platforms.
The memory usage is about 10x higher, and the pipeline takes up to 2 minutes to start.
After some work, I correlated this to the .nv/ComputeCache/ directory and the JIT compilation of kernels.
This is the test pipeline:
gst-launch-1.0 videotestsrc is-live=1 pattern=black num-buffers=128 ! videoconvert ! 'video/x-raw,format=NV12,width=608,height=608,framerate=25/1' ! queue ! nvvideoconvert ! 'video/x-raw(memory:NVMM),width=608,height=608,framerate=25/1,format=NV12' ! fakesink sync=0
For the measurements below, the kernels had already been compiled and were present in the cache:
| MODE | Memory Usage | Time |
|---|---|---|
| CUDA_CACHE_DISABLE=0 | 160 MB (0.5%) | 1 s |
| CUDA_CACHE_DISABLE=1 | 1440 MB (4.5%) | 1 m 10 s |
This suggests that:
- if the binaries are cached, they are loaded at start with no memory overhead;
- if the binaries are compiled at start, the compilation data remains in memory until the program terminates.
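For reference, the numbers in the table were gathered roughly as sketched below (a hypothetical helper of mine, not part of DeepStream; it assumes Linux, where `ru_maxrss` is reported in KiB, and takes the pipeline command from above as its argument):

```python
import os
import resource
import subprocess
import time

def run_and_measure(cmd, cache_disable):
    """Run cmd with CUDA_CACHE_DISABLE set; return (wall seconds, peak child RSS in KiB)."""
    env = dict(os.environ, CUDA_CACHE_DISABLE=str(cache_disable))
    t0 = time.monotonic()
    subprocess.run(cmd, env=env, check=True)
    elapsed = time.monotonic() - t0
    # ru_maxrss covers all waited-for children; on Linux it is in KiB.
    peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_kib
```

Called e.g. as `run_and_measure(["gst-launch-1.0", "videotestsrc", ...], cache_disable=1)` for the second row of the table.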
This is what the contents of the cache (~/.nv/ComputeCache/) look like:
68,0 MiB [##########] /f
840,0 KiB [ ] /9
360,0 KiB [ ] /5
4,0 KiB [ ] index
Here we see that the binary file causing this is huge (68 MiB), and after looking into it, we see thousands of kernels for color conversion:
strings 7c9d8fbee1f027 | grep 'nv\.constant2\._' | wc -l
>> 4224
A short example:
.nv.constant2._Z27YUV444_to_NV12_709_ER_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z24YUV444_to_NV12_709_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z23YUV444_to_NV12_ER_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z20YUV444_to_NV12_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z29YUV444_to_YUV420_709_ER_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z26YUV444_to_YUV420_709_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z25YUV444_to_YUV420_ER_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z22YUV444_to_YUV420_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z30YUV444_to_B32F_G32F_R32F_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z30YUV444_to_R32F_G32F_B32F_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z24YUV444_to_B8_G8_R8_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z24YUV444_to_R8_G8_B8_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z19YUV444_to_BGR_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z19YUV444_to_RGB_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z20YUV444_to_ABGR_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z20YUV444_to_ARGB_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z20YUV444_to_BGRA_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
.nv.constant2._Z20YUV444_to_RGBA_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii
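For readability, the kernel names can be recovered from the mangled symbols with a small sketch of mine (it only extracts the length-prefixed identifier from the Itanium mangling; `c++filt` would give the full signatures):

```python
import re

def kernel_name(sym: str) -> str:
    """Extract the function name from a '.nv.constant2._Z<len><name>...' symbol."""
    # Itanium mangling: _Z is followed by the name's length, then the name itself.
    m = re.match(r"\.nv\.constant2\._Z(\d+)", sym)
    if not m:
        return sym
    n = int(m.group(1))
    return sym[m.end():m.end() + n]

print(kernel_name(
    ".nv.constant2._Z27YUV444_to_NV12_709_ER_cutexyyyPvS_S_S_S_S_iiiiiiiiiiiii"
))  # → YUV444_to_NV12_709_ER_cutex
```

This makes it easy to see that the cache entry covers every source/destination format combination, not just the NV12 path the pipeline actually uses.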
This does not happen when reading the video from a file; I have only encountered it when using videotestsrc. Using caps that explicitly specify the NV12 format had no effect.
My questions are:
- Is this expected? Shouldn't the memory be freed after JIT compilation finishes? In other words, shouldn't the allocated memory end up the same as when the binaries are loaded from the cache?
- Can this be limited, so that kernels which will never be needed are not compiled?
- Why does this not happen when using filesrc?
Thank you for your assistance,
Simon