Python App Custom Model on the Jetson Nano

Hi,

I’m using TLT in order to train a custom detection model (based on detectNet) and deploy it on the Jetson Nano. So far, I managed to train the model using the notebook (created the etlt file) and converted it to an engine file on the Jetson using tlt-converter.

Looking at the Python examples (feeding the engine files directly into DeepStream), I see that I need to provide nvinfer with several files, such as a .caffemodel and .prototxt, in addition to the engine file. How do I generate these files?

For some cases, I would like to use TRT directly, so I converted the model (the .etlt file) to a .trt file. How can I use this file outside of DeepStream in Python? (The TensorRT Python API documentation doesn’t specify what to do with a .trt file.)

While using tlt-converter I’m getting the “some tactics do not have …” warning. I know it can be solved by using the -w flag, but I’m not sure what a good value for it is. Any advice will help!

In addition, I would like to know whether a generated TRT model depends on the exact GPU model or just on the architecture. For example, will a TRT model generated using a 1070 GPU work on a 1080 GPU? (It will obviously not work on a 2080 GPU.)

Thanks
Yuval

  1. See https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/index.html#intg_detectnetv2_model; only the .etlt model, the NGC key, and the label file are needed.

tlt-encoded-model=xxx.etlt
tlt-model-key=yourkey

Note: if you have already generated the TRT engine, the above two lines are not needed. Just set a new line as below.

model-engine-file=xxx.engine

  2. For how to use the TRT engine file outside of DeepStream in Python, please refer to How to use tlt trained model on Jetson Nano - #3 by Morganh (see also the minimal sketch after this list).

  3. For the -w flag on the Nano board, refer to Accelerating Peoplnet with tlt for jetson nano - #13 by Morganh

  4. It depends on the TRT version and the GPU architecture. If the TRT version is the same, a TRT model generated using a 1070 GPU is expected to work on a 1080 GPU.
    More info in https://developer.nvidia.com/cuda-gpus#compute and Support Matrix :: NVIDIA Deep Learning TensorRT Documentation
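
For reference, here is a minimal sketch of loading a serialized engine with the TensorRT Python API plus pycuda, outside of DeepStream. The engine file name, the assumption that the first binding is the input, and the dummy input data are illustrative only; real preprocessing and DetectNet_v2 post-processing are omitted.

import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

ENGINE_PATH = "resnet18_detector.engine"  # hypothetical file name

# Deserialize the engine file produced by tlt-converter.
logger = trt.Logger(trt.Logger.WARNING)
with open(ENGINE_PATH, "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate pinned host buffers and device buffers for every binding.
bindings, host_bufs, dev_bufs = [], [], []
for binding in engine:
    size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    host = cuda.pagelocked_empty(size, dtype)
    dev = cuda.mem_alloc(host.nbytes)
    host_bufs.append(host)
    dev_bufs.append(dev)
    bindings.append(int(dev))

# Assuming the first binding is the input: fill it with (dummy) preprocessed data.
host_bufs[0][:] = np.random.rand(host_bufs[0].size).astype(host_bufs[0].dtype)

stream = cuda.Stream()
cuda.memcpy_htod_async(dev_bufs[0], host_bufs[0], stream)
context.execute_async(batch_size=1, bindings=bindings, stream_handle=stream.handle)
for host, dev in zip(host_bufs[1:], dev_bufs[1:]):
    cuda.memcpy_dtoh_async(host, dev, stream)
stream.synchronize()
# host_bufs[1:] now hold the raw output tensors, which still need DetectNet_v2
# post-processing (coverage thresholding and bbox decoding).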

Thanks for the detailed response!

I’m experimenting with my trained model and trying to improve the runtime performance, which is roughly 0.5 FPS and very jittery.

My setup and configuration are as follows:

  1. DetectNet with ResNet18 backbone.
  2. I pruned the model with pth=0.01; the ratio between the pruned and unpruned model size is 0.05, without compromising accuracy.
  3. I exported the model using the following command (not using INT8, as it is not supported on the Jetson Nano):

!tlt-export detectnet_v2 \
    -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/resnet18_detector_pruned.tlt \
    -o $USER_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector.etlt \
    -k $KEY \
    --max_workspace_size 3073741824 \
    --verbose

On the Jetson, I couldn’t resolve the “some tactics do not have …” warning even after increasing the workspace size using -w and decreasing the -m flag. What else can I try? How significant is this? As far as I can tell, adding “-t fp16” is the cause (although I would like to keep the network at fp16).

I modified the python-app example #1 to match my current network but couldn’t test the performance of different batch sizes. So far I have managed to run only a batch of 1. Any advice on how to make it work? (I changed the batch variable both in the config file on the Jetson and in the Python script.)
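
Roughly, these are the places where I believe the batch size has to agree, sketched with placeholder names and values (not my exact app code): the streammux element, the nvinfer element/config, and the maximum batch size the engine was built with (the -m flag of tlt-converter).

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

BATCH_SIZE = 2  # hypothetical; must not exceed the -m value given to tlt-converter

# The muxer forms the batches that nvinfer consumes.
streammux = Gst.ElementFactory.make("nvstreammux", "stream-muxer")
streammux.set_property("batch-size", BATCH_SIZE)
streammux.set_property("width", 1280)
streammux.set_property("height", 720)
streammux.set_property("batched-push-timeout", 4000000)

# nvinfer reads batch-size from its config file; the property below overrides it.
pgie = Gst.ElementFactory.make("nvinfer", "primary-inference")
pgie.set_property("config-file-path", "detectnet_pgie_config.txt")  # hypothetical path
pgie.set_property("batch-size", BATCH_SIZE)

If any of these values disagree, or the engine was converted with a smaller maximum batch size, the pipeline typically fails at runtime, which may be what I am hitting with batch sizes larger than 1.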

In addition, I’m testing the performance of DetectNet with a ResNet10 backbone and will update with the performance I’m getting.

Thanks
Yuval

Please do not worry about “some tactics do not have …”; it is not a harmful log.
As long as you can get an .etlt model file after you run “tlt-export”, it is OK.

What is “python-app example #1”?

The Python example is found at the following link:

In addition, I’m having trouble deploying DetectNet with a ResNet10 backbone to the Jetson. I’m following the same procedure as with ResNet18, but changed 18 to 10 in all the necessary places (in the code and in the spec files). I managed to get the .etlt file but am getting an error while converting it on the Nano: UFFParser: Unsupported number of graph 0. I’ve read that it is related to the key, but honestly I can’t find anything wrong with it (it is identical on both platforms). How can I debug this?

Thanks
Yuval

Please refer to TLT Converter UffParser: Unsupported number of graph 0 - #4 by Morganh

OK, I did have a small typo in the file naming (the conversion works now).

I’m left with the following issues:

  1. Should I expect a performance improvement from using fp16 on the Nano? Should this option be configured only when using tlt-converter, or in earlier stages as well (the export stage)?
  2. How do I correctly apply batch mode to the Python example in the following link? So far I have managed to use it with batch=1, but it crashes with larger batch sizes.
    deepstream_python_apps/apps/deepstream-test1 at master · NVIDIA-AI-IOT/deepstream_python_apps · GitHub
  3. Are there any more steps I can take in order to maximize the performance?

Thanks
Yuval

  1. I do not understand your comment about “performance improvement”. The tool tlt-converter is just used to generate the TRT engine.
  2. The TLT user guide only verifies GitHub - NVIDIA-AI-IOT/deepstream_tao_apps: Sample apps to demonstrate how to deploy models trained with TAO on DeepStream. I am not sure about the status of your mentioned link https://github.com/NVIDIA-AI-IOT/deepstream_python_apps/tree/master/apps/deepstream-test1
  3. For performance on the Nano, please make sure:

$ nvpmodel -m 0

$ jetson_clocks

  1. I meant run-time execution; I would expect that a model converted to fp16 will be faster than an fp32 one. As far as I understand, the -t flag in the converter should affect the run-time execution of the model. Is that right?

  2. I’ll try to run the model with the original DS app. Aren’t the Python examples a valid example of using DS?

  3. I believe I already tried setting jetson_clocks, but I’ll double-check.

As I’m following the KITTI tutorial at this stage, is there any benchmark for run-time performance on the Nano? What FPS should I expect?

Thanks
Yuval

  1. The “-t” flag just means “engine datatype”. If you set “-t fp16”, then an fp16 TRT engine is generated. For inference time, please use trtexec to test. Reference: Measurement model speed. (A rough Python timing sketch is also shown after this list.)
  2. The Python examples should be a valid example of using DS. But for a TLT model (an .etlt model or its output TRT engine), I am not sure of its status inside the DS Python examples.
  3. For the KITTI tutorial, there is no published FPS benchmark. But you can find FPS numbers in https://ngc.nvidia.com/catalog/models/nvidia:tlt_peoplenet, https://ngc.nvidia.com/catalog/models/nvidia:tlt_facedetectir, etc. See Integrating TAO Models into DeepStream — TAO Toolkit 3.22.05 documentation
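
If you prefer to stay in Python, a rough timing sketch like the one below can serve as a cross-check of the trtexec numbers. It is an illustration only, not the recommended tool, and it assumes the context, bindings, and stream objects from the engine-loading sketch earlier in the thread.

import time

def time_engine(context, bindings, stream, batch_size=1, warmup=10, iters=100):
    # Warm-up iterations so clocks and caches settle before measuring.
    for _ in range(warmup):
        context.execute_async(batch_size=batch_size, bindings=bindings,
                              stream_handle=stream.handle)
    stream.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        context.execute_async(batch_size=batch_size, bindings=bindings,
                              stream_handle=stream.handle)
    stream.synchronize()
    elapsed = time.perf_counter() - start

    print("mean latency: %.2f ms, approx FPS: %.1f"
          % (1000.0 * elapsed / iters, iters * batch_size / elapsed))

On the Nano, remember to run this after nvpmodel -m 0 and jetson_clocks so the numbers are comparable.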