Setup information
• Hardware Platform (Jetson / GPU) : Jetson Orin Nano
• DeepStream Version : DeepStream 6.3
• JetPack Version (valid for Jetson only) : Jetpack 5.1.3
• TensorRT Version : TensorRT 8.5.2.2
• NVIDIA GPU Driver Version (valid for GPU only) : N/A
• Issue Type( questions, new requirements, bugs) : Bugs
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing) : See below for configurations.
• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description) : N/A
Issue:
Our fine-tuned TAO ClassificationTF2 TLT model (EfficientNet-B0 backbone) gives high inference accuracy, but the accuracy drops after converting the model to a TensorRT engine and running inference in DeepStream as an SGIE.
Evaluation Method:
Results were compared on the same video.
This is how we compared the TLT and TensorRT models:
- We used the same PGIE (PeopleNet) and tracker to perform detection and tracking.
- We cropped the objects from the video frames based on the bounding boxes in the KITTI tracker output files (see the sketch after this list).
- We ran tao model classification_tf2 inference and evaluated the results of the TLT model.
- We ran inference on the same video in DeepStream and with trtexec, and manually compared the results.
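For reference, the cropping step can be scripted roughly as below. A minimal sketch, assuming deepstream-app's per-frame KITTI dump files carry the bounding box in columns 5–8 (left, top, right, bottom) and that the sorted file order matches the decoded frame order; all paths are placeholders.

```python
import glob
import os

import cv2

video = cv2.VideoCapture("test_video.mp4")      # placeholder path
kitti_dir = "tracker_output_folder"             # kitti-track-output-dir from deepstream-app
out_dir = "crops"
os.makedirs(out_dir, exist_ok=True)

# One KITTI .txt per frame; assume sorted order matches the decoded frame order.
for frame_idx, txt in enumerate(sorted(glob.glob(os.path.join(kitti_dir, "*.txt")))):
    ok, frame = video.read()
    if not ok:
        break
    with open(txt) as f:
        for obj_idx, line in enumerate(f):
            fields = line.split()
            # Columns 5-8 of the KITTI format: bbox left, top, right, bottom (pixels).
            left, top, right, bottom = (int(float(v)) for v in fields[4:8])
            crop = frame[max(top, 0):bottom, max(left, 0):right]
            if crop.size:
                cv2.imwrite(os.path.join(out_dir, f"{frame_idx:06d}_{obj_idx}.jpg"), crop)
```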
Accuracy Comparison:
| Object | Ground Truth Class | TLT Accuracy | TensorRT Accuracy |
|---|---|---|---|
| Object_1 | Class_1 | 96.01% | 41.46% |
| Object_2 | Class_1 | 97.60% | 9.36% |
| Object_3 | Class_1 | 100% | 18.00% |
| Object_4 | Class_2 | 100% | 100% |
| Object_5 | Class_2 | 100% | 100% |
Notes:
- The TensorRT accuracy was obtained by running TRT inference with trtexec. We manually inspected the DeepStream output video, and the trtexec results appear to align with the DeepStream inference overlay output.
- The accuracy drop seems to affect only Class_1, not Class_2.
TAO ClassificationTF2 Configuration
dataset:
  train_dataset_path: "/workspace/tao-experiments/data/train"
  val_dataset_path: "/workspace/tao-experiments/data/val"
  preprocess_mode: 'torch'
  num_classes: 2
  augmentation:
    enable_color_augmentation: True
train:
  checkpoint: '/workspace/tao-experiments/pretrained_classification_tf2_vefficientnet_b0'
  batch_size_per_gpu: 32
  num_epochs: 100
  optim_config:
    optimizer: 'sgd'
    lr_config:
      scheduler: 'cosine'
      learning_rate: 0.0005
      soft_start: 0.05
  reg_config:
    type: 'L2'
    scope: ['conv2d', 'dense']
    weight_decay: 0.00005
  results_dir: '/workspace/tao-experiments/results/train'
model:
  backbone: 'efficientnet-b0'
  input_width: 128
  input_height: 128
  input_channels: 3
  dropout: 0.12
evaluate:
  dataset_path: "/workspace/tao-experiments/data/test"
  checkpoint: "/workspace/tao-experiments/results/train/efficientnet-b0_100.tlt"
  top_k: 1
  batch_size: 16
  n_workers: 8
  results_dir: '/workspace/tao-experiments/results/val'
inference:
  image_dir: "/workspace/tao-experiments/data/test_images"
  checkpoint: "/workspace/tao-experiments/results/train/efficientnet-b0_100.tlt"
  results_dir: '/workspace/tao-experiments/results/inference'
  classmap: "/workspace/tao-experiments/results/train/classmap.json"
export:
  checkpoint: "/workspace/tao-experiments/results/train/efficientnet-b0_100.tlt"
  onnx_file: '/workspace/tao-experiments/results/export/efficientnet-b0.onnx'
  results_dir: '/workspace/tao-experiments/results/export'
How I converted the model from TLT to a TensorRT engine:
- Convert from TLT to ONNX using tao model classification_tf2 export. This step was not performed on the Jetson device; we used an NVIDIA GeForce RTX 4090 GPU for model training and exporting.
- Convert from ONNX to TensorRT. This was done on the Jetson Orin Nano. We tried two methods: (i) deploy the model to DeepStream directly and let DeepStream handle the TRT conversion implicitly; and (ii) use trtexec to build the TensorRT engine (an equivalent Python sketch is shown after this list). However, both methods gave the same (bad) inference results.
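For completeness, this is roughly what method (ii) does, expressed with the TensorRT Python API instead of the trtexec binary. This is only a sketch with placeholder paths, run on the Jetson so the engine matches its GPU; if the exported ONNX has a dynamic batch dimension, an optimization profile has to be added as hinted in the comments.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("efficientnet-b0.onnx", "rb") as f:        # ONNX from tao ... export
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)  # 4 GiB workspace
# FP32 is the default; FP16 would be config.set_flag(trt.BuilderFlag.FP16).

# If the input has a dynamic batch dimension, add an optimization profile, e.g.:
# profile = builder.create_optimization_profile()
# profile.set_shape(network.get_input(0).name, (1, 3, 128, 128),
#                   (1, 3, 128, 128), (1, 3, 128, 128))   # adjust to the exported layout
# config.add_optimization_profile(profile)

engine = builder.build_serialized_network(network, config)
with open("efficientnet-b0.onnx_b1_gpu0_fp32.engine", "wb") as f:
    f.write(engine)
```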
DeepStream App Configuration Files:
[application]
enable-perf-measurement=1
perf-measurement-interval-sec=5
#kitti-track-output-dir=tracker_output_folder
[tiled-display]
enable=1
rows=1
columns=1
width=1920
height=1080
gpu-id=0
[source0]
enable=1
type=2
num-sources=1
uri=file:///path/to/test/video/file.mp4
gpu-id=0
[streammux]
gpu-id=0
batch-size=1
batched-push-timeout=33333
width=1920
height=1080
[sink0]
enable=1
type=3
container=1
codec=1
enc-type=1
sync=0
bitrate=3000000
profile=0
output-file=/path/to/inference/overlay/video.mp4
source-id=0
[osd]
enable=1
gpu-id=0
border-width=3
text-size=15
text-color=1;1;1;1;
text-bg-color=0.3;0.3;0.3;1
font=Arial
[primary-gie]
enable=1
plugin-type=0
gpu-id=0
batch-size=1
bbox-border-color0=1;0;0;1
bbox-border-color1=0;1;1;1
bbox-border-color2=0;0;1;1
bbox-border-color3=0;1;0;1
gie-unique-id=1
config-file=/opt/nvidia/deepstream/deepstream-6.3/samples/configs/tao_pretrained_models/nvinfer/config_infer_primary_peoplenet.txt
[secondary-gie0]
enable=1
plugin-type=0
gpu-id=0
batch-size=1
gie-unique-id=3
operate-on-gie-id=1
operate-on-class-ids=0
config-file=/path/to/config_infer_secondary_classificationtf2.txt
[tracker]
enable=1
tracker-width=480
tracker-height=288
ll-lib-file=/opt/nvidia/deepstream/deepstream/lib/libnvds_nvmultiobjecttracker.so
ll-config-file=/opt/nvidia/deepstream/deepstream/samples/configs/deepstream-app/config_tracker_NvDCF_perf.yml
gpu-id=0
display-tracking-id=1
[tests]
file-loop=0
config_infer_secondary_classificationtf2.txt (we followed this guide):
[property]
gpu-id=0
# preprocessing_mode == 'torch'
net-scale-factor=0.017507
offsets=123.675;116.280;103.53
model-color-format=0
# model config
onnx-file=/path/to/efficientnet-b0.onnx
model-engine-file=/path/to/efficientnet-b0.onnx_b1_gpu0_fp32.engine
labelfile-path=/path/to/labels.txt
classifier-threshold=0.5
operate-on-class-ids=0
batch-size=1
network-mode=0
network-type=1
process-mode=2
secondary-reinfer-interval=0
gie-unique-id=3
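As a side note, the net-scale-factor and offsets above can be cross-checked against TAO's 'torch' preprocess_mode. A minimal check sketch, assuming Gst-nvinfer computes y = net-scale-factor * (x - offset) per channel (as described in the plugin documentation) and that 'torch' mode uses the usual torchvision ImageNet statistics:

```python
import numpy as np

# Values from config_infer_secondary_classificationtf2.txt above.
net_scale_factor = 0.017507
offsets = np.array([123.675, 116.280, 103.53])        # per-channel (R, G, B)

# ImageNet statistics assumed for preprocess_mode 'torch' (torchvision convention).
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

x = np.array([200.0, 128.0, 64.0])                     # an arbitrary RGB pixel

print(net_scale_factor * (x - offsets))                # what Gst-nvinfer computes
print((x / 255.0 - mean) / std)                        # what 'torch' mode computes

# offsets == 255 * mean, and 0.017507 ~= 1 / (255 * 0.224); since net-scale-factor
# is a single scalar, only the channel with std == 0.224 matches exactly.
```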
We would like to know: (i) what causes the degradation in model accuracy, and (ii) how we can minimize this performance gap between the TLT model and the TensorRT engine.
In TAO, there is a tao-deploy docker for users to generate a TensorRT engine and also run evaluation or inference. To narrow this down, please use the tao-deploy docker to check the result.
You can use the TAO launcher (i.e., tao deploy xxx) or run docker run against nvcr.io/nvidia/tao/tao-toolkit:5.3.0-deploy directly.
See the end of tao_tutorials/notebooks/tao_launcher_starter_kit/classification_tf2/tao_voc/classification.ipynb at main · NVIDIA/tao_tutorials · GitHub or Classification (TF2) with TAO Deploy - NVIDIA Docs.
Thank you for the fast response.
I have tried TAO Deploy as suggested. I used classification_tf2 gen_trt_engine to generate the TensorRT engine file, and classification_tf2 inference to run inference on the test images.
However, the results are quite strange: ALL images are classified as the same class (“non-staff”) with the same confidence (0.938832).
Here are the first few rows of result.csv:
Accuracy:
| Class | TLT Accuracy | TensorRT Accuracy |
|---|---|---|
| non-staff | 97.79% | 100% |
| staff | 97.57% | 0% |
Configuration Files
gen_trt_engine config file:
gen_trt_engine:
  onnx_file: '/path/to/efficientnet-b0.onnx'
  trt_engine: '/path/to/efficientnet-b0_batch64.engine'
  results_dir: '/path/to/gen_trt_results'
  tensorrt:
    max_workspace_size: 4
    max_batch_size: 64
    data_type: "fp32"
Inference config file:
dataset:
  augmentation:
    enable_color_augmentation: true
  num_classes: 2
  preprocess_mode: torch
inference:
  trt_engine: /path/to/efficientnet-b0_batch64.engine
  classmap: /path/to/classmap.json
  image_dir: /path/to/test/images
  results_dir: /path/to/inference-results
model:
  backbone: efficientnet-b0
  input_channels: 3
  input_height: 128
  input_width: 128
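To double-check what gen_trt_engine actually produced (binding names, shapes, data types), the engine can be inspected with the TensorRT Python API. A minimal sketch, using the TensorRT 8.x binding-level API and a placeholder path:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("/path/to/efficientnet-b0_batch64.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

# Print each binding: name, shape, dtype, and whether it is an input or an output.
for i in range(engine.num_bindings):
    print(engine.get_binding_name(i),
          engine.get_binding_shape(i),
          engine.get_binding_dtype(i),
          "input" if engine.binding_is_input(i) else "output")
```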
I have tried the default notebook and default specs (e.g., enable_center_crop: True) as suggested. Here are the results:
TAO TLT Model Evaluation Results
tao model classification_tf2 evaluate
Evaluation Loss: 0.1649947166442871
Evaluation Top 1 accuracy: 0.9825870394706726
Confusion Matrix:
[[589 17]
[ 4 596]]
Classification Report:
precision recall f1-score support
non-staff 0.99 0.97 0.98 606
staff 0.97 0.99 0.98 600
accuracy 0.98 1206
macro avg 0.98 0.98 0.98 1206
weighted avg 0.98 0.98 0.98 1206
TAO Deploy TensorRT Engine Evaluation Results
tao deploy classification_tf2 evaluate
Top 1 scores: 0.495
Confusion Matrix:
[[ 0 606]
[ 0 594]]
Classification Report:
precision recall f1-score support
non-staff 0.00 0.00 0.00 606
staff 0.49 1.00 0.66 594
accuracy 0.49 1200
macro avg 0.25 0.50 0.33 1200
weighted avg 0.25 0.49 0.33 1200
From your results, there is still an accuracy drop when running “tao deploy” evaluation against the TensorRT engine. I will check further. BTW, did you ever run with the default dataset?
> From your results, there is still an accuracy drop when running “tao deploy” evaluation against the TensorRT engine. I will check further.

Thanks! I am happy to share the TLT model and TensorRT engine with you if that helps.

> BTW, did you ever run with the default dataset?

No, I did not.
I have specified the same interpolation method in the spec YAML file for both tao model classification_tf2 evaluate and tao deploy classification_tf2 evaluate:
model:
  ...
  resize_interpolation_method: bilinear
It did not help.
OK, thanks for the info. I will check further and update you when I have more. Thanks.
I can reproduce the accuracy drop. Will check further.
Thank you.
May I know when we will get an update? Our production relies on this.
If it cannot be resolved soon, what alternatives could we use? Will reverting to an older version of the TAO toolkit resolve the accuracy drop issue? Is there another solution provided by NVIDIA that we can use instead?
Got it. We are still working on that. Once there is a fix or workaround, I will update you.
Sorry for the inconvenience.
Hi, I found the root cause. Please use the steps below to change the code.
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy /bin/bash
Then, inside the docker:
# mv /usr/local/lib/python3.10/dist-packages/nvidia_tao_deploy/inferencer/preprocess_input.py /usr/local/lib/python3.10/dist-packages/nvidia_tao_deploy/inferencer/preprocess_input.py.bak
# vim /usr/local/lib/python3.10/dist-packages/nvidia_tao_deploy/inferencer/preprocess_input.py
(Copy the content from tao_deploy/nvidia_tao_deploy/inferencer/preprocess_input.py at main · NVIDIA/tao_deploy · GitHub.)
Modify tao_deploy/nvidia_tao_deploy/inferencer/preprocess_input.py at main · NVIDIA/tao_deploy · GitHub to
override_mean = True
Then run evaluation to confirm there is no gap now.
# classification_tf2 evaluate -e xxx.yaml
You can also run docker commit to generate a new tao-deploy docker.
Thank you for the fast action.
I was able to close the gap by following the method you suggested. However, this approach only addresses the gap for TAO Deploy; I am interested in using this model in DeepStream.
Could you please advise how I can apply this fix so that there is no performance gap when using the TRT engine in DeepStream?
Please set:
net-scale-factor=0.0175070028011204
offsets=2.165178571428571;2.035714285714286;1.8125
According to tao_tensorflow2_backend/nvidia_tao_tf2/cv/classification/utils/preprocess_input.py at main · NVIDIA/tao_tensorflow2_backend · GitHub: (x/255 - mean)/std = x/(255*std) - mean/std.
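For what it's worth, the suggested constants appear to follow from the torchvision ImageNet statistics used by preprocess_mode 'torch' (an assumption on my part), with the G-channel std (0.224) used as the single scalar:

```python
mean = [0.485, 0.456, 0.406]   # ImageNet mean (torchvision convention)
std_g = 0.224                  # G-channel std used as the single scalar

print(1.0 / (255 * std_g))          # -> 0.0175070028011204... (net-scale-factor)
print([m / std_g for m in mean])    # -> [2.1651..., 2.0357..., 1.8125] (offsets)
```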
I specified the values that you suggested, but it didn’t make a difference. The resulting inference overlay video looks similar to the previous one.
Please delete the engine and generate it again, just to make sure you are running with a new engine instead of an existing one.
I deleted the old engine, regenerated a new engine, and specified the net-scale-factor and offsets you suggested. It didn’t make a difference.
When you run tao model classification_tf2 inference, you are using the cropped images, right?
Currently, your pipeline is PeopleNet → classification.
To narrow this down, could you use the cropped images and run classification directly? In other words, we just want the classifier to work as a primary TRT engine.
A similar topic is discussed in Issue with image classification tutorial and testing with deepstream-app - #21 by Morganh.
In addition, two modifications can be considered:
1) Generate an .avi file from the cropped images:
$ gst-launch-1.0 multifilesrc location="/tmp/%d.jpg" caps="image/jpeg,framerate=30/1" ! jpegdec ! x264enc ! avimux ! filesink location="out.avi"
Refer to Issue with image classification tutorial and testing with deepstream-app - #24 by Morganh.
2) Set scaling-filter=5. Refer to Issue with image classification tutorial and testing with deepstream-app - #32 by Morganh.
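Not the DeepStream primary-mode test suggested above, but a complementary way to run the classifier directly on the cropped images outside DeepStream is to feed the exported ONNX through onnxruntime with the 'torch' preprocessing. A minimal sketch, assuming onnxruntime is installed and that a plain bilinear resize to 128x128 approximates the training-time preprocessing (the real TAO pipeline may also center-crop):

```python
import cv2
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("efficientnet-b0.onnx")
inp = sess.get_inputs()[0]
nchw = (inp.shape[1] == 3)                     # detect NCHW vs NHWC from the model

img = cv2.imread("crops/000000_0.jpg")         # a cropped object image
img = cv2.resize(img, (128, 128), interpolation=cv2.INTER_LINEAR)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32)

# preprocess_mode 'torch': scale to 0..1, subtract ImageNet mean, divide by std.
img = (img / 255.0 - [0.485, 0.456, 0.406]) / [0.229, 0.224, 0.225]
img = img.astype(np.float32)

x = img.transpose(2, 0, 1)[None] if nchw else img[None]
probs = sess.run(None, {inp.name: x})[0]
print(probs, probs.argmax(axis=-1))
```

If the ONNX and the engine disagree on these crops, the gap is in engine building or the DeepStream preprocessing; if they agree, the gap is upstream (scaling, cropping, or color conversion in the pipeline).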