Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc) Nano
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Unet based on resnet
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) docker_tag: v3.0-py3
• DeepStream Version 5.1
• JetPack Version (valid for Jetson only) 4.5.1
• TensorRT Version 7.1.3.0-1+cuda10.2
Hi, I am trying to run inference with a custom UNet model on my Jetson Nano. I trained it with TLT, then exported it and created an engine file on the device; the engine file is around 37 MB. I use the deepstream-segmentation example from deepstream_python_apps to run a DeepStream pipeline with this model. When the engine is loaded, memory usage climbs from 1.5 GB (idle) to 3.8 GB, so the system almost freezes. This happens during the model-loading stage, before any inference takes place. When I instead try dstest_segmentation_config_industrial.txt, memory consumption only rises from the 1.5 GB idle to 2.7 GB; I checked the .engine file for that config and it is 25 MB. So I have two questions:
Why would a model that weighs 37 MB almost cause an OOM while a 25 MB one does not?
Why is memory consumption in the range of gigabytes when the model itself is only 20-30 MB?
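As a back-of-envelope sanity check on the second question (a sketch for illustration, not from the original thread): the engine file mostly holds weights, while the runtime footprint is typically dominated by the CUDA/cuDNN/TensorRT libraries loaded into the process (often several hundred MB on Jetson) plus workspace and activation buffers. A single FP32 input tensor, for instance, is tiny:

```shell
# One FP32 512x512x3 tensor: 512*512*3 elements * 4 bytes each
awk 'BEGIN { printf "%.1f MB\n", 512*512*3*4/1024/1024 }'
# prints "3.0 MB"
```

So the gigabyte-scale usage is unlikely to come from the weights or any single tensor; the fixed library/context overhead plus the builder/runtime workspace is usually the larger share.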
How did you run inference with the 37 MB TRT engine (from TLT)? And what is the spec — is it dstest_segmentation_config_industrial.txt? Can you share the exact spec?
When you mentioned 25MB trt engine, could you share the spec file too?
This is the engine that gets generated when I run:
python3 deepstream_segmentation.py dstest_segmentation_config_semantic.txt /opt/nvidia/deepstream/deepstream/samples/streams/sample_qHD.h264 output
I have just tried that, and the situation is identical to running via deepstream_python_apps. That is, when I run inference with the pre-trained model from NVIDIA (the 25 MB one, dstest_segmentation_config_semantic.txt from my previous message), only 2.7 GB of RAM is used. When running with the 37 MB one (conf_.txt), memory usage goes up to 3.7 GB during the model-loading stage. The engine is only 12 MB larger, the model runs at the same resolution, and its input is not even RGB (512x512x1 instead of 512x512x3), yet it almost causes an OOM during loading. I am very frustrated because I do not understand where the memory goes, why I cannot reach the performance level NVIDIA demonstrates in its UNet benchmarks, and why a 12 MB increase in engine size would cause a 1.2 GB increase in RAM usage. I can provide the engines, the config files, the commands — anything you would need to take a closer look at the problem.
By the benchmark I cannot reach, I mean that NVIDIA provides a UNet model (deepstream_python_apps/dstest_segmentation_config_semantic.txt at master · NVIDIA-AI-IOT/deepstream_python_apps · GitHub) which runs just fine on 512x512x3 images and is apparently based on ResNet-18. My own model is based on ResNet-10, runs on 512x512x1 images, and its model file is only 12 MB larger than the former one. Yet this custom model fails to run faster than 1 fps and takes up almost all available memory during loading. I came here in search of help in understanding why this happens and how I could optimise my custom model to get performance similar to NVIDIA's.
I have checked the configs, and the only differences I see are in lines that concern model-specific parameters such as input color format, resolution, and weight paths. Am I missing something?
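One way to narrow down where the jump happens (a generic Linux sketch, not a command from the thread) is to sample memory while the pipeline initializes in another terminal and note when it drops sharply. On Jetson, 'sudo tegrastats' is an alternative that also reports GPU memory.

```shell
# Hypothetical helper: sample available memory once per second
# while the DeepStream pipeline loads in another terminal.
for i in 1 2 3 4 5; do
    printf '%s available: %s MB\n' "$(date +%T)" \
        "$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)"
    sleep 1
done
```

Correlating the timestamps with the pipeline's log output shows whether the allocation happens during engine deserialization or later, when buffers are allocated.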
For 1, the UNet model in deepstream_python_apps is not a TLT model; it is not related to TLT, so the comparison does not seem valid. TLT provides an official purpose-built UNet model in NGC, see https://ngc.nvidia.com/catalog/models/nvidia:tlt_peoplesemsegnet/files?version=deployable_v1.0. If you train a TLT model, please follow the TLT user guide to deploy the model you have trained.
For 2, please also pay attention to output-blob-names, offsets, etc.
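For illustration, the keys in question sit in the [property] group of a gst-nvinfer segmentation config. The values below are hypothetical placeholders, not the thread's actual spec; for a single-channel model, model-color-format would be 2 (GRAY) and offsets would carry one value per input channel:

```
[property]
# placeholder values for illustration only
net-scale-factor=0.00784313725490196
# 0=RGB, 1=BGR, 2=GRAY
model-color-format=2
# one value per input channel
offsets=127.5
infer-dims=1;512;512
# must match the exported model's output layer name
output-blob-names=softmax_1
# 2 = segmentation
network-type=2
segmentation-threshold=0.0
```

A mismatch in output-blob-names or channel count between the spec and the exported model is a common cause of silent misconfiguration here.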
For 3, if possible, please share the etlt model, engine files, and configs, and describe how you generated the TRT engine.
OK, I just thought that since it is the same architecture (UNet based on a small ResNet), it should be comparable to some extent.
That's what I meant by model-specific params. However, the offset value and layer names shouldn't influence model size or speed, should they? In any case, I made sure that I ran each model with its proper config and from the same app, and got the results I described.
Sorry for the late reply. I actually cannot reproduce the high-memory issue. I ran on a Jetson Nano board flashed/installed via JetPack 4.5.1.
Below are my steps for running inference with the officially released UNet model.
It runs inference fine against the 720p.jpg file.
For 720p.h264, it stops at “NVMEDIA_ENC: bBlitMode is set to TRUE”, but memory usage is not high.
$ ./apps/tlt_segmentation/ds-tlt-segmentation -c configs/unet_tlt/pgie_unet_tlt_config.txt -i /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264
===== NVMEDIA: NVENC =====
NvMMLiteBlockCreate : Block : BlockType = 4
H264: Profile = 66, Level = 0
NVMEDIA_ENC: bBlitMode is set to TRUE
I also tried deepstream_python_apps.
$ cd /opt/nvidia/deepstream/deepstream/sources
$ git clone https://github.com/NVIDIA-AI-IOT/deepstream_python_apps.git
$ cd deepstream_python_apps/apps/deepstream-segmentation/