Constant DeepStream / TensorRT memory usage independent of engine. How to improve?

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) Jetson TX2 / T4
• DeepStream Version 5.0.1 / 6.0
• JetPack Version (valid for Jetson only) 4.5.1
• TensorRT Version 7 / 8
• NVIDIA GPU Driver Version (valid for GPU only) jetson / 510.54
• Issue Type( questions, new requirements, bugs) Memory Management.
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)

  • Start with a robust engine file (e.g. PeopleNet) and take note of the memory usage (~1300MB in my tests)
  • Try a smaller engine (e.g. DetectNet_v2 ResNet18 / smaller input - from TAO) and take note (~1150MB)
  • Try a smaller engine (e.g. DetectNet_v2 ResNet10 / smaller input - from TAO) and take note (~1100MB)

Keep trying smaller resolutions and different/shallower networks (MobileNet, YOLO, …): the usage gets stuck close to 1100MB; the lowest we registered was 1080MB. The one exception is the resnet10.caffemodel engine that comes with the container, which consumes only ~960MB.

• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description) deepstream-app, any source file; just update the model engine and disable most of the things that are not needed to run nvinfer.

===

I found here [ How to reduce the memory-usage during the infer process? ] that TensorRT uses 1100MB out of the box. The behavior was similar in the containers for Jetson and for dGPU, for both DeepStream 5 and 6.

How can I optimize the memory usage even further? At least enough to close the gap a little with the resnet10.caffemodel engine.

Thanks

===
The forum gives me an error when uploading the files, so I will paste them below.

source_modified.txt
################################################################################

# Copyright (c) 2018-2021, NVIDIA CORPORATION. All rights reserved.

#

# Permission is hereby granted, free of charge, to any person obtaining a

# copy of this software and associated documentation files (the "Software"),

# to deal in the Software without restriction, including without limitation

# the rights to use, copy, modify, merge, publish, distribute, sublicense,

# and/or sell copies of the Software, and to permit persons to whom the

# Software is furnished to do so, subject to the following conditions:

#

# The above copyright notice and this permission notice shall be included in

# all copies or substantial portions of the Software.

#

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR

# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,

# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL

# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER

# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING

# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER

# DEALINGS IN THE SOFTWARE.

################################################################################

[application]

enable-perf-measurement=1

perf-measurement-interval-sec=5

#gie-kitti-output-dir=streamscl

[tiled-display]

enable=0

rows=2

columns=2

width=1280

height=720

gpu-id=0

#(0): nvbuf-mem-default - Default memory allocated, specific to particular platform

#(1): nvbuf-mem-cuda-pinned - Allocate Pinned/Host cuda memory, applicable for Tesla

#(2): nvbuf-mem-cuda-device - Allocate Device cuda memory, applicable for Tesla

#(3): nvbuf-mem-cuda-unified - Allocate Unified cuda memory, applicable for Tesla

#(4): nvbuf-mem-surface-array - Allocate Surface Array memory, applicable for Jetson

nvbuf-memory-type=0

[source0]

enable=1

#Type - 1=CameraV4L2 2=URI 3=MultiURI 4=RTSP

type=3

uri=file://../../streams/sample_1080p_h264.mp4

num-sources=2

#drop-frame-interval=2

gpu-id=0

# (0): memtype_device - Memory type Device

# (1): memtype_pinned - Memory type Host Pinned

# (2): memtype_unified - Memory type Unified

cudadec-memtype=0

[sink0]

enable=0

#Type - 1=FakeSink 2=EglSink 3=File

type=2

sync=1

source-id=0

gpu-id=0

nvbuf-memory-type=0

[sink1]

enable=1

#Type - 1=FakeSink 2=EglSink 3=File 4=RTSPStreaming

type=3

#1=mp4 2=mkv

container=1

#1=h264 2=h265

codec=1

#encoder type 0=Hardware 1=Software

enc-type=0

sync=0

#iframeinterval=10

bitrate=2000000

#H264 Profile - 0=Baseline 2=Main 4=High

#H265 Profile - 0=Main 1=Main10

profile=0

output-file=out.mp4

source-id=0

[sink2]

enable=0

#Type - 1=FakeSink 2=EglSink 3=File 4=RTSPStreaming

type=4

#1=h264 2=h265

codec=1

#encoder type 0=Hardware 1=Software

enc-type=0

sync=0

#iframeinterval=10

bitrate=400000

#H264 Profile - 0=Baseline 2=Main 4=High

#H265 Profile - 0=Main 1=Main10

profile=0

# set below properties in case of RTSPStreaming

rtsp-port=8554

udp-port=5400

[osd]

enable=1

gpu-id=0

border-width=1

text-size=15

text-color=1;1;1;1;

text-bg-color=0.3;0.3;0.3;1

font=Serif

show-clock=0

clock-x-offset=800

clock-y-offset=820

clock-text-size=12

clock-color=1;0;0;0

nvbuf-memory-type=0

[streammux]

gpu-id=0

##Boolean property to inform muxer that sources are live

live-source=0

buffer-pool-size=2

batch-size=2

##time out in usec, to wait after the first buffer is available

##to push the batch even if the complete batch is not formed

batched-push-timeout=40000

## Set muxer output width and height

width=1920

height=1080

##Enable to maintain aspect ratio wrt source, and allow black borders, works

##along with width, height properties

enable-padding=0

nvbuf-memory-type=0

## If set to TRUE, system timestamp will be attached as ntp timestamp

## If set to FALSE, ntp timestamp from rtspsrc, if available, will be attached

# attach-sys-ts-as-ntp=1

# config-file property is mandatory for any gie section.

# Other properties are optional and if set will override the properties set in

# the infer config file.

[primary-gie]

enable=1

gpu-id=0

model-engine-file=../../models/Primary_Detector/resnet10.caffemodel_b4_gpu0_int8.engine

batch-size=2

#Required by the app for OSD, not a plugin property

bbox-border-color0=1;0;0;1

bbox-border-color1=0;1;1;1

bbox-border-color2=0;0;1;1

bbox-border-color3=0;1;0;1

interval=0

gie-unique-id=1

nvbuf-memory-type=0

config-file=config_infer_primary.txt

[tracker]

enable=0

# For NvDCF and DeepSORT tracker, tracker-width and tracker-height must be a multiple of 32, respectively

tracker-width=640

tracker-height=384

ll-lib-file=/opt/nvidia/deepstream/deepstream-6.0/lib/libnvds_nvmultiobjecttracker.so

# ll-config-file required to set different tracker types

# ll-config-file=config_tracker_IOU.yml

ll-config-file=config_tracker_NvDCF_perf.yml

# ll-config-file=config_tracker_NvDCF_accuracy.yml

# ll-config-file=config_tracker_DeepSORT.yml

gpu-id=0

enable-batch-process=1

enable-past-frame=1

display-tracking-id=1

[secondary-gie0]

enable=0

model-engine-file=../../models/Secondary_VehicleTypes/resnet18.caffemodel_b16_gpu0_int8.engine

gpu-id=0

batch-size=16

gie-unique-id=4

operate-on-gie-id=1

operate-on-class-ids=0;

config-file=config_infer_secondary_vehicletypes.txt

[secondary-gie1]

enable=0

model-engine-file=../../models/Secondary_CarColor/resnet18.caffemodel_b16_gpu0_int8.engine

batch-size=16

gpu-id=0

gie-unique-id=5

operate-on-gie-id=1

operate-on-class-ids=0;

config-file=config_infer_secondary_carcolor.txt

[secondary-gie2]

enable=0

model-engine-file=../../models/Secondary_CarMake/resnet18.caffemodel_b16_gpu0_int8.engine

batch-size=16

gpu-id=0

gie-unique-id=6

operate-on-gie-id=1

operate-on-class-ids=0;

config-file=config_infer_secondary_carmake.txt

[tests]

file-loop=0

Hi @NilsAI,
Yes, as the link you pointed out explains, it is expected that TensorRT (with cuDNN & cuBLAS) consumes this much memory.

You can modify the nvinfer source code to disable the cuDNN or cuBLAS tactics for TensorRT to save some memory - Developer Guide :: NVIDIA Deep Learning TensorRT Documentation

To check whether your model can run with cuDNN or cuBLAS disabled, you can use trtexec with the "--tacticSources" option to disable cuDNN or cuBLAS and try your model:

# /usr/src/tensorrt/bin/trtexec  --help
...
  --tacticSources=tactics     Specify the tactics to be used by adding (+) or removing (-) tactics from the default
                              tactic sources (default = all available tactics).
                              Note: Currently only cuDNN, cuBLAS and cuBLAS-LT are listed as optional tactics.
                              Tactic Sources: tactics ::= [","tactic]
                                              tactic  ::= (+|-)lib
                                              lib     ::= "CUBLAS"|"CUBLAS_LT"|"CUDNN"
                              For example, to disable cudnn and enable cublas: --tacticSources=-CUDNN,+CUBLAS
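
For example, to build an engine with those libraries removed from the tactic sources, the command would look roughly like this (model.onnx and the output name below are only placeholders for your own files):

    # Build an FP16 engine with the cuDNN, cuBLAS and cuBLAS-LT tactics disabled
    $ /usr/src/tensorrt/bin/trtexec --onnx=model.onnx --fp16 \
        --tacticSources=-CUDNN,-CUBLAS,-CUBLAS_LT \
        --saveEngine=model_no_cudnn_cublas.engine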

Hi @mchi,

Thanks for the docs.

I managed to run the engines using trtexec --loadEngine, and there is not much difference between them in memory.

resnet10.caffemodel uses about 877MB
PeopleNet pruned v2.3 uses about 969MB

And pretty much everything else is between those two.
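
For reference, the measurements were taken roughly like this (the engine file name is just a placeholder), with memory watched from a second terminal:

    # Load a prebuilt engine and run inference for a while
    $ /usr/src/tensorrt/bin/trtexec --loadEngine=detectnet_v2_resnet18.engine --iterations=200
    # in a second terminal on the Jetson:
    $ sudo tegrastats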

But I could not figure out from the documentation how to use the tacticSources option with a .tlt model, or whether TAO supports it.

I saw that trtexec supports ONNX when building an engine with tacticSources, and that it does not support .etlt, but I could not find .tlt mentioned anywhere.

I am training the models on top of the pretrained models from NGC, so their weights are in .tlt format.

I will try to make the necessary changes to the builder in nvinfer, but I would appreciate it if you could point me in the right direction for testing the .tlt model with trtexec.
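
For anyone following along, the change I have in mind is roughly the following sketch against the TensorRT 8 C++ API (the function name is mine, and placing it in nvinfer's engine-build path, e.g. nvdsinfer_model_builder.cpp, is my assumption):

    #include "NvInfer.h"

    // Drop the cuDNN / cuBLAS tactics from the builder config before the engine is built.
    static void disableCudnnCublasTactics(nvinfer1::IBuilderConfig& config)
    {
        // Start from the currently enabled tactic sources and clear the library bits.
        uint32_t sources = config.getTacticSources();
        sources &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUDNN));
        sources &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUBLAS));
        sources &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUBLAS_LT));
        config.setTacticSources(static_cast<nvinfer1::TacticSources>(sources));
    }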

Thank you very much.

I see that my post was moved to the TAO forum; here is the requested information:

• Hardware (T4/V100/Xavier/Nano/etc) Jetson TX2 / T4
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Detectnet_v2
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) TAO Toolkit 3.0-21.11
• Training spec file(If have, please share here) Same from NGC - detectnet_v2 updated to work with resnet10, 18, 34 and different resolutions
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

The issue is that all models I load in nvinfer use about 1100MB, never less, and it was suggested that I investigate the usage of cuBLAS and cuDNN and whether they can be disabled.

Yes, I did some experiments to check the GPU memory consumed by cuDNN/cuBLAS with different tactic-source combinations on Xavier.

For PeopleNet v1.0 or v2.3, cuDNN consumes about 88MB and cuBLAS consumes about 68MB.

Also, please double-check your steps when measuring the memory usage.
Below are mine.

Steps:

  1. Download the v2.3 and v1.0 versions of PeopleNet
    $ wget https://api.ngc.nvidia.com/v2/models/nvidia/tao/peoplenet/versions/pruned_v2.3/files/resnet34_peoplenet_pruned.etlt -O /opt/nvidia/deepstream/deepstream-5.1/samples/models/tlt_pretrained_models/peoplenet/resnet34_peoplenet_pruned_v2.3.etlt

$ wget https://api.ngc.nvidia.com/v2/models/nvidia/tao/peoplenet/versions/pruned_v1.0/files/resnet34_peoplenet_pruned.etlt -O /opt/nvidia/deepstream/deepstream-5.1/samples/models/tlt_pretrained_models/peoplenet/resnet34_peoplenet_pruned_v1.0.etlt

  2. Run deepstream-app for the first time to generate the TensorRT engine
    $ cd /opt/nvidia/deepstream/deepstream-5.1/samples/configs/tlt_pretrained_models
    $ deepstream-app -c deepstream_app_source1_peoplenet.txt

  3. Configure the engine in deepstream_app_source1_peoplenet.txt and config_infer_primary_peoplenet.txt in order to run the TensorRT engine directly.

  4. Run inference again
    $ deepstream-app -c deepstream_app_source1_peoplenet.txt

  5. While the app is running, check "NvMapMemUsed" by running the following command
    $ cat /proc/meminfo
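
For convenience, something like this (assuming the standard Jetson /proc layout) polls just that counter once per second while deepstream-app runs:

    $ watch -n 1 'grep NvMapMemUsed /proc/meminfo'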

Please check
For v1.0

NvMapMemUsed: ?? KB —> (Before running)
NvMapMemUsed: ?? KB —> (Running)

Run v2.3
NvMapMemUsed: ?? KB —> (Before running)
NvMapMemUsed: ?? KB —> (Running)

We cleared the caches with sync && sudo sysctl vm.drop_caches=3 before every test. For the previous tests we were using tegrastats/nvtop (Jetson/T4).

I followed the steps you shared:

Config file

Everything disabled except: Source0 (file), streammux, primary gie, osd, and Sink0 (fake); batch=1.

Tests on Jetson TX2

For v1.0

NvMapMemUsed: 26188 KB —> (Before running)
NvMapMemUsed: 304872 KB —> (Running)

Run v2.3
NvMapMemUsed: 26188 KB —> (Before running)
NvMapMemUsed: 297128 KB —> (Running)

===

A model based on a pruned ResNet18 with 304x544 input uses about 254888 KB.

Is there any way to control cuBLAS or cuDNN from TAO? I tried playing around with the workspace size, but it made no difference.

The result looks good. It consumes 249MB.

For TAO models (.etlt / .tlt format), it is not possible for users to generate a TensorRT engine with the tacticSources (cuBLAS or cuDNN) specified.
So, if you want to save more memory, one feasible way is to generate an int8 engine.
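
On a device that does support int8, a hypothetical tao-converter invocation for a DetectNet_v2 model would look roughly like this (the key, dimensions, output names, calibration cache and file names below are placeholders for your own values):

    # Sketch: build an int8 engine from an .etlt model on an int8-capable device
    $ tao-converter -k <your_encryption_key> \
                    -d 3,544,960 \
                    -o output_cov/Sigmoid,output_bbox/BiasAdd \
                    -t int8 \
                    -c calibration.bin \
                    -m 4 \
                    -e resnet18_detector_int8.engine \
                    resnet18_detector.etlt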

Yeah, int8 would be great, but the docs say the TX2 doesn't support int8; is there any workaround?

The other problem, which is why I went to the DeepStream forum first, is that the pipeline with Source0 (file), streammux, primary gie, osd, tracker=IOU and Sink0 (fake), batch=1, consumes close to 1000MB (tegrastats) with that 249MB model, while the model uses >700MB when loaded with trtexec (tegrastats) - maybe I am doing something very wrong here.

We identified a few things that improve memory usage, such as the difference between the input video resolution and the streammux resolution, but the gain is quite small. The same goes for using newer versions of TensorRT.

Do you have any suggestions on how to reduce memory with this pipeline, or when loading the model with nvinfer?
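
For context, the pipeline we measure is roughly equivalent to the following gst-launch sketch (file and config paths are placeholders, tracker omitted here for brevity):

    $ gst-launch-1.0 filesrc location=sample_1080p_h264.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! \
        m.sink_0 nvstreammux name=m batch-size=1 width=1920 height=1080 ! \
        nvinfer config-file-path=config_infer_primary.txt ! \
        nvvideoconvert ! nvdsosd ! fakesink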

The TX2 does not support int8. If possible, use another device, for example an NX or Xavier.

For GPU memory usage optimization in the pipeline you mentioned, please search/ask for help in the DeepStream forum. It is related to DeepStream.

Additionally, you can try to run inference without DeepStream:

  • use a standalone inference script
  • use Triton Inference Server

Yeah, I asked on the DeepStream forum; they moved the post here when they could not help me.

Thanks for helping me with TAO, Morganh.

Can you move my post back there, please?

I suggest you create a new topic. Please focus on the pipeline with Source0 (file), streammux, primary gie, osd, tracker=IOU and Sink0 (fake), batch=1, and try to run another model instead of a TAO model.

If you confirm that the pipeline consumes too much memory on the TX2, you can ask the DeepStream team to check whether it can be optimized.
For TAO models, the current optimization is to use an int8 engine.

Moreover, the result above shows that you ran it with DeepStream, right? And the model consumes about 249MB.
So the GPU memory looks reasonable for the total pipeline.

"try to run other model instead of tao model"

That is when I created the post on the DeepStream forum: we had tried MobileNet and YOLO, then switched to TAO to see if we could get improvements. There were some gains in FPS and accuracy, but we got stuck at this barrier of 1000MB / 700MB (trtexec) and cannot go much lower on the TX2 (there is a huge difference in the accuracy of the model running on a dGPU vs the TX2).

===

I will try the tests you suggested and open a new thread if needed.

I will also suggest at work that we upgrade to the newest JetPack version, as it fixes a lot of memory issues, and, if possible, that we use a Jetson that supports int8 if we want to meet the memory requirements.

Thank you very much Morganh!
