Memory usage when loading unet for inference on jetson nano

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) Nano
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Unet based on resnet
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) docker_tag: v3.0-py3

• DeepStream Version 5.1
• JetPack Version (valid for Jetson only) 4.5.1
• TensorRT Version 7.1.3.0-1+cuda10.2

Hi, I am trying to run inference with a custom unet model on my Jetson Nano. I trained it with tlt, then exported it and created an engine file on the device. The engine file is around 37 MB. I use the deepstream-segmentation example from deepstream-python-apps to run a deepstream pipeline with this model. When the engine is loaded, memory usage goes from 1.5 GB idle to 3.8 GB, so the system is almost freezing. This happens before the actual inference takes place, during the model-loading stage.

Now when I try dstest_segmentation_config_industrial.txt instead, memory consumption only goes up to 2.7 GB from the 1.5 GB idle. I checked the .engine file for this config and it is 25 MB. So I have two questions:

  1. Why would a model which weighs 38 MB almost cause OOM while a 25 MB one does not?
  2. Why is the memory consumed in the range of GBs while the model size is around 20-30 MB?
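
For reference, RAM during the loading stage can be watched roughly like this (a sketch assuming the stock tegrastats and free utilities that ship with JetPack; the config and stream names are the ones from my command further down):

# terminal 1: sample system memory once per second
sudo tegrastats --interval 1000
# or, more coarsely
free -m -s 1

# terminal 2: start the pipeline and watch RAM during the model-loading stage
python3 deepstream_segmentation.py conf.txt /opt/nvidia/deepstream/deepstream/samples/streams/sample_qHD.h264 output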

How did you run inference with the 37 MB trt engine (from tlt)? And what is the spec? Is it dstest_segmentation_config_industrial.txt? Can you share the exact spec?

When you mentioned 25MB trt engine, could you share the spec file too?

How did you run inference with the 37 MB trt engine (from tlt)?

python3 deepstream_segmentation.py conf.txt /opt/nvidia/deepstream/deepstream/samples/streams/sample_qHD.h264 output

what is the spec

conf_.txt (3.5 KB)

When you mentioned 25MB trt engine, could you share the spec file too?

This is the engine which gets generated when I run python3 deepstream_segmentation.py dstest_segmentation_config_semantic.txt /opt/nvidia/deepstream/deepstream/samples/streams/sample_qHD.h264 output

For deploying unet tlt model in deepstream, please follow
https://docs.nvidia.com/tlt/tlt-user-guide/text/semantic_segmentation/unet.html#deploying-to-deepstream
or
https://github.com/NVIDIA-AI-IOT/deepstream_tlt_apps/blob/master/configs/unet_tlt/pgie_unet_tlt_config.txt

And use https://github.com/NVIDIA-AI-IOT/deepstream_tao_apps (sample apps to demonstrate how to deploy models trained with TAO on DeepStream) to run inference.

Excuse me, but I fail to see how your answer relates to my question. Concerning

For deploying unet tlt model in deepstream, please follow
https://docs.nvidia.com/tlt/tlt-user-guide/text/semantic_segmentation/unet.html#deploying-to-deepstream
or
https://github.com/NVIDIA-AI-IOT/deepstream_tlt_apps/blob/master/configs/unet_tlt/pgie_unet_tlt_config.txt

, I followed the procedure to prepare an engine for my custom model from the start.

Concerning

And use https://github.com/NVIDIA-AI-IOT/deepstream_tao_apps (sample apps to demonstrate how to deploy models trained with TAO on DeepStream) to run inference.

I have just tried that, just in case, and the situation is identical to running via deepstream-python-apps. I.e. when I run inference with the pre-trained model from Nvidia (the 25 MB one, dstest_segmentation_config_semantic.txt from the previous message), only 2.7 GB of RAM is used. When running with the 37 MB one (conf_.txt), memory usage goes to 3.7 GB during the model-loading stage. The engine is 12 MB larger, the model runs at the same resolution and is not even RGB (so 512x512x1 instead of 512x512x3), yet it almost causes OOM during loading.

I am very frustrated because I do not understand where the memory goes, why I cannot reach the performance level that Nvidia demonstrates in its Unet benchmarks, and why a 12 MB increase in engine size would cause a 1.2 GB increase in RAM usage. I can provide the engines, the config files, the commands, anything you would need to have a closer look at the problem.
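
For what it's worth, I could also load each engine standalone with the trtexec binary that ships with TensorRT, while tegrastats runs in a second terminal, to see whether the memory is taken by TensorRT itself or by the rest of the DeepStream pipeline. A sketch (the path assumes a standard JetPack install, and the engine file names are just placeholders for the 25 MB and 37 MB engines):

# load only the engine and run a few timing iterations, no DeepStream involved
/usr/src/tensorrt/bin/trtexec --loadEngine=unet_resnet18_semantic.engine
/usr/src/tensorrt/bin/trtexec --loadEngine=rn10-unet-512-gray.engine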

Thanks for the detailed info. Several questions here.

  1. Please tell me which benchmarks you cannot reach. Please share more detail about it, and I will check whether I can reproduce it.

  2. About the inference method and config file you shared above: there are several parameters mismatching between https://github.com/NVIDIA-AI-IOT/deepstream_python_apps/blob/master/apps/deepstream-segmentation/dstest_segmentation_config_semantic.txt and https://docs.nvidia.com/tlt/tlt-user-guide/text/semantic_segmentation/unet.html#deepstream-configuration-file. So, please follow the official tlt user guide to deploy the tlt unet model.

  3. For the sharp increase in memory, I need to check if I can reproduce it.

  1. By the benchmark I cannot reach, I mean that Nvidia provides a unet model (https://github.com/NVIDIA-AI-IOT/deepstream_python_apps/blob/master/apps/deepstream-segmentation/dstest_segmentation_config_semantic.txt) which runs just fine on 512x512x3 images and is apparently based on resnet18. I use my own model based on resnet10; it runs on 512x512x1 images and the model file is only 12 MB larger than the former one. But this custom model fails to run faster than 1 fps and takes up almost all available memory when loading. I came here for help in understanding why this happens and how I could optimise my custom model to get performance similar to Nvidia's.

  2. I have checked the configs and I only see differences in lines that concern model-specific parameters like input color and resolution, weight paths, etc. Am I missing something?

  3. Should I upload the configs and engines?

For 1, the unet model in deepstream_python_apps should not be a tlt model; it is not related to tlt, so the comparison does not seem valid. TLT provides an official purpose-built unet model in NGC, see https://ngc.nvidia.com/catalog/models/nvidia:tlt_peoplesemsegnet/files?version=deployable_v1.0. If you train a tlt model, please follow the tlt user guide to deploy the model you have trained.
For 2, please pay attention to output-blob-names, offsets, etc. as well; an illustrative list of the model-specific keys is sketched below.
For 3, if possible, you can share the etlt model, engine files, configs, and how you generated the trt engine.
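
Roughly, the model-specific keys in the nvinfer config that have to track your model are the following. The values below are only illustrative and are not copied from the files in this topic; the model/engine file names are placeholders, and the actual output blob name depends on how the model was exported.

[property]
## 0=RGB, 1=BGR, 2=GRAY
model-color-format=2
## C;H;W of the exported model
infer-dims=1;512;512
net-scale-factor=0.007843
offsets=127.5
## 0=FP32, 1=INT8, 2=FP16
network-mode=2
## 2 = segmentation
network-type=2
num-detected-classes=2
## check the exported model for the actual blob name
output-blob-names=softmax_1
segmentation-threshold=0.0
tlt-model-key=nvidia_tlt
tlt-encoded-model=rn10-unet-512-gray.etlt
model-engine-file=rn10-unet-512-gray_b1_gpu0_fp16.engine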

  1. OK, I just thought that since it is the same architecture (unet based on a small resnet), it should be comparable to some extent.

  2. That's what I meant by model-specific params. However, the offset value and layer names shouldn't influence the model size or speed, should they? Anyway, I made sure that I ran each model with its proper config and did it from the same app, and got the results I described.

  3. On to the details of the experiment:
    Nvidia network

  • got the code from this app
  • created an engine by running the app for the first time
python3 deepstream_segmentation.py dstest_segmentation_config_semantic.txt /opt/nvidia/deepstream/deepstream/samples/streams/sample_qHD.h264 output
  • took the engine (size 25MB) and dstest_segmentation_config_semantic.txt files
  • used them in ds-tlt (after building the app) like this
./apps/tlt_segmentation/ds-tlt-segmentation -c dstest_segmentation_config_semantic.txt -i /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264

Result: 2.7 GB RAM total used

Custom from TLT

  • exported my model; the .etlt file is here
  • on the Nano, converted it to an .engine (size 38MB) like this
./tlt-converter -k nvidia_tlt -p input_1,1x1x512x512,1x1x512x512,1x1x512x512 -t fp16 rn10-unet-512-gray.etlt
  • ran ds-tlt-segmentation with my config like this
./apps/tlt_segmentation/ds-tlt-segmentation -c dstest_segmentation_config_semantic_our.txt -i /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264

Result: 3.9 GB RAM total usage, freezing.
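
For completeness, the same conversion with an explicit engine path and a capped builder workspace would look roughly like this (flags as listed by tlt-converter -h; the -e file name is just a placeholder, the -w value is only an example, and -w bounds the builder workspace, so it does not necessarily change the runtime numbers above):

./tlt-converter -k nvidia_tlt \
    -p input_1,1x1x512x512,1x1x512x512,1x1x512x512 \
    -t fp16 \
    -w 1073741824 \
    -e rn10-unet-512-gray_b1_gpu0_fp16.engine \
    rn10-unet-512-gray.etlt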

Sorry for the late reply. Actually I cannot reproduce the high-memory issue. I ran on a Jetson Nano board which was flashed/installed via JetPack 4.5.1.
Below are my steps when running inference with the officially released unet model.
It can run inference well against the sample_720p.jpg file.

Step:

$ git clone https://github.com/NVIDIA-AI-IOT/deepstream_tlt_apps.git
$ cd deepstream_tlt_apps
$ wget https://nvidia.box.com/shared/static/i1cer4s3ox4v8svbfkuj5js8yqm3yazo.zip -O models.zip
$ unzip models.zip
$ wget https://developer.nvidia.com/cuda102-trt71-jp45 && unzip cuda102-trt71-jp45 && chmod +x cuda10.2_trt7.1_jp4.5/tlt-converter
$ ./cuda10.2_trt7.1_jp4.5/tlt-converter -k tlt_encode -p input_1,1x3x608x960,1x3x608x960,1x3x608x960 -t fp16 models/unet/unet_resnet18.etlt -e models/unet/unet_resnet18.etlt_b1_gpu0_fp16.engine
$ ll -sh models/unet/unet_resnet18.etlt_b1_gpu0_fp16.engine
73M -rw-rw-r-- 1 nvidia nvidia 73M Jul 15 18:25 models/unet/unet_resnet18.etlt_b1_gpu0_fp16.engine

$ export CUDA_VER=10.2
$ make
$ ./apps/tlt_segmentation/ds-tlt-segmentation -c configs/unet_tlt/pgie_unet_tlt_config.txt -i /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.jpg

For sample_720p.h264, it stops at “NVMEDIA_ENC: bBlitMode is set to TRUE”, but the memory usage is not high.

$ ./apps/tlt_segmentation/ds-tlt-segmentation -c configs/unet_tlt/pgie_unet_tlt_config.txt -i /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264
===== NVMEDIA: NVENC =====
NvMMLiteBlockCreate : Block : BlockType = 4
H264: Profile = 66, Level = 0
NVMEDIA_ENC: bBlitMode is set to TRUE

I also tried deepstream_python_apps.

$ cd /opt/nvidia/deepstream/deepstream/sources
$ git clone https://github.com/NVIDIA-AI-IOT/deepstream_python_apps.git
$ cd deepstream_python_apps/apps/deepstream-segmentation/

$ python3 deepstream_segmentation.py dstest_segmentation_config_semantic.txt /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.jpg output

The output folder contains the inference result.
