Memory usage when loading unet for inference on jetson nano

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) Nano
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Unet based on resnet
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) docker_tag: v3.0-py3

• DeepStream Version 5.1
• JetPack Version (valid for Jetson only) 4.5.1
• TensorRT Version 7.1.3.0-1+cuda10.2

Hi, I am trying to run inference with a custom unet model on my Jetson Nano. I trained it with tlt, then exported it and created an engine file on the device. The engine file is around 37 MB. I use the deepstream-segmentation example from deepstream-python-apps to run a deepstream pipeline with this model. When the engine is loaded, memory usage goes from 1.5 GB idle to 3.8 GB, so the system is almost freezing. This happens before the actual inference takes place, during the model-loading stage.

Now when I try dstest_segmentation_config_industrial.txt instead, memory consumption only goes up to 2.7 GB from the 1.5 GB idle. I checked the .engine file for this config and it is 25 MB. So I have two questions:

  1. Why would a model which weighs 38 MB almost cause OOM while a 25 MB one does not?
  2. Why is the memory consumed in the range of GBs while the model size is around 20-30 MB?
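
For reference, RAM during the loading stage can be watched roughly like this (a sketch assuming the stock tegrastats and free utilities that ship with JetPack; the config and stream names are the ones from my command further down):

# terminal 1: sample system memory once per second
sudo tegrastats --interval 1000
# or, more coarsely
free -m -s 1

# terminal 2: start the pipeline and watch RAM during the model-loading stage
python3 deepstream_segmentation.py conf.txt /opt/nvidia/deepstream/deepstream/samples/streams/sample_qHD.h264 output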

How did you run inference with the 37 MB trt engine (from tlt)? And what is the spec? Is it dstest_segmentation_config_industrial.txt? Can you share the exact spec?

When you mentioned 25MB trt engine, could you share the spec file too?

How did you run inference with the 37 MB trt engine (from tlt)?

python3 deepstream_segmentation.py conf.txt /opt/nvidia/deepstream/deepstream/samples/streams/sample_qHD.h264 output

what is the spec

conf_.txt (3.5 KB)

When you mentioned 25MB trt engine, could you share the spec file too?

This is the engine which gets generated when I run python3 deepstream_segmentation.py dstest_segmentation_config_semantic.txt /opt/nvidia/deepstream/deepstream/samples/streams/sample_qHD.h264 output

For deploying unet tlt model in deepstream, please follow
https://docs.nvidia.com/tlt/tlt-user-guide/text/semantic_segmentation/unet.html#deploying-to-deepstream
or
https://github.com/NVIDIA-AI-IOT/deepstream_tlt_apps/blob/master/configs/unet_tlt/pgie_unet_tlt_config.txt

And use https://github.com/NVIDIA-AI-IOT/deepstream_tao_apps (sample apps to demonstrate how to deploy models trained with TAO on DeepStream) to run inference.

Excuse me, but I fail to see how your answer relates to my question. Concerning

For deploying unet tlt model in deepstream, please follow
https://docs.nvidia.com/tlt/tlt-user-guide/text/semantic_segmentation/unet.html#deploying-to-deepstream
or
https://github.com/NVIDIA-AI-IOT/deepstream_tlt_apps/blob/master/configs/unet_tlt/pgie_unet_tlt_config.txt

, I followed the procedure to prepare an engine for my custom model from the start.

Concerning

And use https://github.com/NVIDIA-AI-IOT/deepstream_tao_apps (sample apps to demonstrate how to deploy models trained with TAO on DeepStream) to run inference.

I have just tried that, just in case, and the situation is identical to running via deepstream-python-apps. I.e. when I run inference with the pre-trained model from Nvidia (the 25 MB one, dstest_segmentation_config_semantic.txt from the previous message), only 2.7 GB of RAM is used. When running with the 37 MB one (conf_.txt), memory usage goes to 3.7 GB during the model-loading stage. The engine is 12 MB larger, the model runs at the same resolution and is not even RGB (so 512x512x1 instead of 512x512x3), yet it almost causes OOM during loading.

I am very frustrated because I do not understand where the memory goes, why I cannot reach the performance level that Nvidia demonstrates in its Unet benchmarks, and why a 12 MB increase in engine size would cause a 1.2 GB increase in RAM usage. I can provide the engines, the config files, the commands, anything you would need to have a closer look at the problem.
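
For what it's worth, I could also load each engine standalone with the trtexec binary that ships with TensorRT, while tegrastats runs in a second terminal, to see whether the memory is taken by TensorRT itself or by the rest of the DeepStream pipeline. A sketch (the path assumes a standard JetPack install, and the engine file names are just placeholders for the 25 MB and 37 MB engines):

# load only the engine and run a few timing iterations, no DeepStream involved
/usr/src/tensorrt/bin/trtexec --loadEngine=unet_resnet18_semantic.engine
/usr/src/tensorrt/bin/trtexec --loadEngine=rn10-unet-512-gray.engine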

Thanks for the detailed info. Several questions here.

  1. Please tell me which benchmarks you cannot reach. Please share more detail about it, and I will check whether I can reproduce it.

  2. About the inference method and config file you shared above: there are several parameters mismatching between https://github.com/NVIDIA-AI-IOT/deepstream_python_apps/blob/master/apps/deepstream-segmentation/dstest_segmentation_config_semantic.txt and https://docs.nvidia.com/tlt/tlt-user-guide/text/semantic_segmentation/unet.html#deepstream-configuration-file. So, please follow the official tlt user guide to deploy the tlt unet model.

  3. For the sharp increase in memory, I need to check if I can reproduce it.

  1. By the benchmark I cannot reach, I mean that Nvidia provides a unet model (https://github.com/NVIDIA-AI-IOT/deepstream_python_apps/blob/master/apps/deepstream-segmentation/dstest_segmentation_config_semantic.txt) which runs just fine on 512x512x3 images and is apparently based on resnet18. I use my own model based on resnet10; it runs on 512x512x1 images and the model file is only 12 MB larger than the former one. But this custom model fails to run faster than 1 fps and takes up almost all available memory when loading. I came here for help in understanding why this happens and how I could optimise my custom model to get performance similar to Nvidia's.

  2. I have checked the configs and I only see differences in lines that concern model-specific parameters like input color and resolution, weight paths, etc. Am I missing something?

  3. Should I upload the configs and engines?

For 1, the unet model in deepstream_python_apps should not be a tlt model; it is not related to tlt, so the comparison does not seem valid. TLT provides an official purpose-built unet model in NGC, see https://ngc.nvidia.com/catalog/models/nvidia:tlt_peoplesemsegnet/files?version=deployable_v1.0. If you train a tlt model, please follow the tlt user guide to deploy the model you have trained.
For 2, please pay attention to output-blob-names, offsets, etc. as well; an illustrative list of the model-specific keys is sketched below.
For 3, if possible, you can share the etlt model, engine files, configs, and how you generated the trt engine.
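
Roughly, the model-specific keys in the nvinfer config that have to track your model are the following. The values below are only illustrative and are not copied from the files in this topic; the model/engine file names are placeholders, and the actual output blob name depends on how the model was exported.

[property]
## 0=RGB, 1=BGR, 2=GRAY
model-color-format=2
## C;H;W of the exported model
infer-dims=1;512;512
net-scale-factor=0.007843
offsets=127.5
## 0=FP32, 1=INT8, 2=FP16
network-mode=2
## 2 = segmentation
network-type=2
num-detected-classes=2
## check the exported model for the actual blob name
output-blob-names=softmax_1
segmentation-threshold=0.0
tlt-model-key=nvidia_tlt
tlt-encoded-model=rn10-unet-512-gray.etlt
model-engine-file=rn10-unet-512-gray_b1_gpu0_fp16.engine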

  1. OK, I just thought that since it is the same architecture (unet based on a small resnet), it should be comparable to some extent.

  2. That's what I meant by model-specific params. However, the offset value and layer names shouldn't influence the model size or speed, should they? Anyway, I made sure that I ran each model with its proper config and did it from the same app, and got the results I described.

  3. On to the details of the experiment:
    Nvidia network

  • got the code from this app
  • created an engine by running the app for the first time
python3 deepstream_segmentation.py dstest_segmentation_config_semantic.txt /opt/nvidia/deepstream/deepstream/samples/streams/sample_qHD.h264 output
  • took the engine (size 25MB) and dstest_segmentation_config_semantic.txt files
  • used them in ds-tlt (after building the app) like this
./apps/tlt_segmentation/ds-tlt-segmentation -c dstest_segmentation_config_semantic.txt -i /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264

Result: 2.7 GB RAM total used

Custom from TLT

  • exported my model; the .etlt file is here
  • on the Nano, converted it to an .engine (size 38MB) like this
./tlt-converter -k nvidia_tlt -p input_1,1x1x512x512,1x1x512x512,1x1x512x512 -t fp16 rn10-unet-512-gray.etlt
  • ran ds-tlt-segmentation with my config like this
./apps/tlt_segmentation/ds-tlt-segmentation -c dstest_segmentation_config_semantic_our.txt -i /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264

Result: 3.9 GB RAM total usage, freezing.
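
For completeness, the same conversion with an explicit engine path and a capped builder workspace would look roughly like this (flags as listed by tlt-converter -h; the -e file name is just a placeholder, the -w value is only an example, and -w bounds the builder workspace, so it does not necessarily change the runtime numbers above):

./tlt-converter -k nvidia_tlt \
    -p input_1,1x1x512x512,1x1x512x512,1x1x512x512 \
    -t fp16 \
    -w 1073741824 \
    -e rn10-unet-512-gray_b1_gpu0_fp16.engine \
    rn10-unet-512-gray.etlt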

Sorry for the late reply. Actually I cannot reproduce the high-memory issue. I ran on a Jetson Nano board which was flashed/installed via JetPack 4.5.1.
Below are my steps when running inference with the officially released unet model.
It can run inference well against the sample_720p.jpg file.

Step:

$ git clone https://github.com/NVIDIA-AI-IOT/deepstream_tlt_apps.git
$ cd deepstream_tlt_apps
$ wget https://nvidia.box.com/shared/static/i1cer4s3ox4v8svbfkuj5js8yqm3yazo.zip -O models.zip
$ unzip models.zip
$ wget https://developer.nvidia.com/cuda102-trt71-jp45 && unzip cuda102-trt71-jp45 && chmod +x cuda10.2_trt7.1_jp4.5/tlt-converter
$ ./cuda10.2_trt7.1_jp4.5/tlt-converter -k tlt_encode -p input_1,1x3x608x960,1x3x608x960,1x3x608x960 -t fp16 models/unet/unet_resnet18.etlt -e models/unet/unet_resnet18.etlt_b1_gpu0_fp16.engine
$ ll -sh models/unet/unet_resnet18.etlt_b1_gpu0_fp16.engine
73M -rw-rw-r-- 1 nvidia nvidia 73M Jul 15 18:25 models/unet/unet_resnet18.etlt_b1_gpu0_fp16.engine

$ export CUDA_VER=10.2
$ make
$ ./apps/tlt_segmentation/ds-tlt-segmentation -c configs/unet_tlt/pgie_unet_tlt_config.txt -i /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.jpg

For sample_720p.h264, it stops at “NVMEDIA_ENC: bBlitMode is set to TRUE”, but the memory usage is not high.

$ ./apps/tlt_segmentation/ds-tlt-segmentation -c configs/unet_tlt/pgie_unet_tlt_config.txt -i /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264
===== NVMEDIA: NVENC =====
NvMMLiteBlockCreate : Block : BlockType = 4
H264: Profile = 66, Level = 0
NVMEDIA_ENC: bBlitMode is set to TRUE

I also tried deepstream_python_apps.

$ cd /opt/nvidia/deepstream/deepstream/sources
$ git clone https://github.com/NVIDIA-AI-IOT/deepstream_python_apps.git
$ cd deepstream_python_apps/apps/deepstream-segmentation/

$ python3 deepstream_segmentation.py dstest_segmentation_config_semantic.txt /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.jpg output

The output folder contains the inference result.
