Fine-tuned TAO ClassificationTF2 Accuracy Drop after Compiling to TensorRT

Setup information
• Hardware Platform (Jetson / GPU) : Jetson Orin Nano
• DeepStream Version : DeepStream 6.3
• JetPack Version (valid for Jetson only) : Jetpack 5.1.3
• TensorRT Version : TensorRT 8.5.2.2
• NVIDIA GPU Driver Version (valid for GPU only) : N/A
• Issue Type( questions, new requirements, bugs) : Bugs
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing) : See below for configurations.
• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description) : N/A


Issue:
Our fine-tuned TAO ClassificationTF2 TLT model (EfficientNet-B0 backbone) gives high inference accuracy, but the accuracy drops significantly after converting the model to a TensorRT engine and running inference in DeepStream as an SGIE.

Evaluation Method:
Results were compared on the same video.
This is how we compared the TLT and TensorRT models:

  1. We used the same PGIE (PeopleNet) and tracker to perform detection and tracking.
  2. We cropped the objects from the video frames based on the bounding boxes in the KITTI tracker output files (a cropping sketch follows this list).
  3. We ran tao model classification_tf2 inference on the crops and evaluated the results of the TLT model.
  4. We ran inference on the same video in DeepStream and with trtexec, and manually compared the results.
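For reference, here is a minimal sketch of the cropping step. Assumptions: frames have already been extracted from the video and saved with the same base name as the corresponding KITTI file, and the KITTI columns follow the standard layout where fields 5-8 are left, top, right, bottom. This is illustrative only, not the exact script we used.

# crop_from_kitti.py - illustrative sketch of step 2
import os
from PIL import Image

frames_dir = "frames"                  # extracted frames, e.g. 000123.jpg (assumed naming)
kitti_dir = "tracker_output_folder"    # KITTI dumps from kitti-track-output-dir
out_dir = "crops"
os.makedirs(out_dir, exist_ok=True)

for kitti_name in sorted(os.listdir(kitti_dir)):
    stem = os.path.splitext(kitti_name)[0]
    frame_path = os.path.join(frames_dir, stem + ".jpg")
    if not os.path.isfile(frame_path):
        continue
    frame = Image.open(frame_path)
    with open(os.path.join(kitti_dir, kitti_name)) as f:
        for idx, line in enumerate(f):
            fields = line.split()
            if len(fields) < 8:
                continue
            # standard KITTI layout: label, truncated, occluded, alpha, left, top, right, bottom, ...
            left, top, right, bottom = (int(float(v)) for v in fields[4:8])
            crop = frame.crop((left, top, right, bottom))
            crop.save(os.path.join(out_dir, f"{stem}_{idx}.jpg"))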

Accuracy Comparison:

Object   | Ground Truth Class | TLT Accuracy | TensorRT Accuracy
---------|--------------------|--------------|------------------
Object_1 | Class_1            | 96.01%       | 41.46%
Object_2 | Class_1            | 97.60%       | 9.36%
Object_3 | Class_1            | 100%         | 18.00%
Object_4 | Class_2            | 100%         | 100%
Object_5 | Class_2            | 100%         | 100%

Notes:

  • The TensorRT accuracy was obtained by running TRT inference with trtexec. We manually inspected the DeepStream output video, and the trtexec results appear to align with the DeepStream inference overlay output.
  • The accuracy drop seems to affect only Class_1, not Class_2.

TAO ClassificationTF2 Configuration

dataset:
  train_dataset_path: "/workspace/tao-experiments/data/train"
  val_dataset_path: "/workspace/tao-experiments/data/val"
  preprocess_mode: 'torch'
  num_classes: 2
  augmentation:
    enable_color_augmentation: True
train:
  checkpoint: '/workspace/tao-experiments/pretrained_classification_tf2_vefficientnet_b0'
  batch_size_per_gpu: 32
  num_epochs: 100
  optim_config:
    optimizer: 'sgd'
  lr_config:
    scheduler: 'cosine'
    learning_rate: 0.0005
    soft_start: 0.05
  reg_config:
    type: 'L2'
    scope: ['conv2d', 'dense']
    weight_decay: 0.00005
  results_dir: '/workspace/tao-experiments/results/train'
model:
  backbone: 'efficientnet-b0'
  input_width: 128
  input_height: 128
  input_channels: 3
  dropout: 0.12
evaluate:
  dataset_path: "/workspace/tao-experiments/data/test"
  checkpoint: "/workspace/tao-experiments/results/train/efficientnet-b0_100.tlt"
  top_k: 1
  batch_size: 16
  n_workers: 8
  results_dir: '/workspace/tao-experiments/results/val'
inference:
  image_dir: "/workspace/tao-experiments/data/test_images"
  checkpoint: "/workspace/tao-experiments/results/train/efficientnet-b0_100.tlt"
  results_dir: '/workspace/tao-experiments/results/inference'
  classmap: "/workspace/tao-experiments/results/train/classmap.json"
export:
  checkpoint: "/workspace/tao-experiments/results/train/efficientnet-b0_100.tlt"
  onnx_file: '/workspace/tao-experiments/results/export/efficientnet-b0.onnx'
  results_dir: '/workspace/tao-experiments/results/export'
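For context, this spec drives the standard TAO launcher actions; the invocations were along these lines (the spec path is a placeholder and additional arguments may differ):

$ tao model classification_tf2 train -e /workspace/tao-experiments/specs/spec.yaml
$ tao model classification_tf2 export -e /workspace/tao-experiments/specs/spec.yaml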

How we converted the model from TLT to a TensorRT engine:

  1. Convert from TLT to ONNX using tao model classification_tf2 export. This step was not performed on the Jetson device; we used an NVIDIA GeForce RTX 4090 GPU for model training and export.
  2. Convert from ONNX to TensorRT on the Jetson Orin Nano device. We tried two methods: (i) deploy the exported ONNX model to DeepStream directly and let DeepStream build the TRT engine implicitly; and (ii) compile the TensorRT engine with trtexec (sketched below). Both methods give the same (poor) inference results.
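A minimal trtexec invocation for method (ii) looks roughly like the following (paths are placeholders; this is a sketch, not the exact command line we ran):

$ /usr/src/tensorrt/bin/trtexec --onnx=/path/to/efficientnet-b0.onnx --saveEngine=/path/to/efficientnet-b0.onnx_b1_gpu0_fp32.engine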

DeepStream App Configuration Files:

[application]
enable-perf-measurement=1
perf-measurement-interval-sec=5
#kitti-track-output-dir=tracker_output_folder

[tiled-display]
enable=1
rows=1
columns=1
width=1920
height=1080
gpu-id=0

[source0]
enable=1
type=2
num-sources=1
uri=file:///path/to/test/video/file.mp4
gpu-id=0

[streammux]
gpu-id=0
batch-size=1
batched-push-timeout=33333
width=1920
height=1080

[sink0]
enable=1
type=3
container=1
codec=1
enc-type=1
sync=0
bitrate=3000000
profile=0
output-file=/path/to/inference/overlay/video.mp4
source-id=0

[osd]
enable=1
gpu-id=0
border-width=3
text-size=15
text-color=1;1;1;1;
text-bg-color=0.3;0.3;0.3;1
font=Arial

[primary-gie]
enable=1
plugin-type=0
gpu-id=0
batch-size=1
bbox-border-color0=1;0;0;1
bbox-border-color1=0;1;1;1
bbox-border-color2=0;0;1;1
bbox-border-color3=0;1;0;1
gie-unique-id=1
config-file=/opt/nvidia/deepstream/deepstream-6.3/samples/configs/tao_pretrained_models/nvinfer/config_infer_primary_peoplenet.txt

[secondary-gie0]
enable=1
plugin-type=0
gpu-id=0
batch-size=1
gie-unique-id=3
operate-on-gie-id=1
operate-on-class-ids=0
config-file=/path/to/config_infer_secondary_classificationtf2.txt

[tracker]
enable=1
tracker-width=480
tracker-height=288
ll-lib-file=/opt/nvidia/deepstream/deepstream/lib/libnvds_nvmultiobjecttracker.so
ll-config-file=/opt/nvidia/deepstream/deepstream/samples/configs/deepstream-app/config_tracker_NvDCF_perf.yml
gpu-id=0
display-tracking-id=1

[tests]
file-loop=0
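For reference, this application config is run with the stock deepstream-app binary (the config file name is a placeholder):

$ deepstream-app -c deepstream_app_config.txt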

config_infer_secondary_classificationtf2.txt
(we followed this guide)

[property]
gpu-id=0
# preprocessing_mode == 'torch'
net-scale-factor=0.017507
offsets=123.675;116.280;103.53
model-color-format=0

# model config
onnx-file=/path/to/efficientnet-b0.onnx
model-engine-file=/path/to/efficientnet-b0.onnx_b1_gpu0_fp32.engine
labelfile-path=/path/to/labels.txt
classifier-threshold=0.5
operate-on-class-ids=0
batch-size=1

network-mode=0
network-type=1
process-mode=2

secondary-reinfer-interval=0
gie-unique-id=3

We would like to know: (i) what causes the degradation in model accuracy, and (ii) how we can minimize this performance gap between the TLT model and the TensorRT engine.

In TAO, there is a tao-deploy docker that lets users generate a TensorRT engine and also run evaluation or inference against it. To narrow down the issue, please use the tao-deploy docker to check the result.
You can use the TAO launcher (i.e., tao deploy xxx) or run against the docker image (nvcr.io/nvidia/tao/tao-toolkit:5.3.0-deploy) directly.
See the end of tao_tutorials/notebooks/tao_launcher_starter_kit/classification_tf2/tao_voc/classification.ipynb at main · NVIDIA/tao_tutorials · GitHub or Classification (TF2) with TAO Deploy - NVIDIA Docs.
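For example, a minimal flow inside the deploy docker might look like this (mount paths and spec names are placeholders; a sketch rather than an exact recipe):

$ docker run --runtime=nvidia -it --rm -v /local/experiments:/workspace nvcr.io/nvidia/tao/tao-toolkit:5.3.0-deploy /bin/bash
# classification_tf2 gen_trt_engine -e /workspace/specs/gen_trt_engine.yaml
# classification_tf2 evaluate -e /workspace/specs/evaluate.yaml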

Thank you for the fast response.
I have tried TAO deploy as suggested. I used classification_tf2 gen_trt_engine to generate the TensorRT engine file, and used classification_tf2 inference to run inference on the test images.

However, the results are quite strange: ALL images are classified as the same class ("non-staff") with the same confidence (0.938832).
Here are the first few rows of result.csv:
[screenshot of result.csv omitted: every row is predicted as non-staff with confidence 0.938832]

Accuracy:

Class     | TLT Accuracy | TensorRT Accuracy
----------|--------------|------------------
non-staff | 97.79%       | 100%
staff     | 97.57%       | 0%

Configuration Files

gen_trt_engine config file:

gen_trt_engine:
  onnx_file: '/path/to/efficientnet-b0.onnx'
  trt_engine: '/path/to/efficientnet-b0_batch64.engine'
  results_dir: '/path/to/gen_trt_results'
  tensorrt:
    max_workspace_size: 4
    max_batch_size: 64
    data_type: "fp32"

Inference config file:

dataset:
  augmentation:
    enable_color_augmentation: true
  num_classes: 2
  preprocess_mode: torch
inference:
  trt_engine: /path/to/efficientnet-b0_batch64.engine
  classmap: /path/to/classmap.json
  image_dir: /path/to/test/images
  results_dir: /path/to/inference-results
model:
  backbone: efficientnet-b0
  input_channels: 3
  input_height: 128
  input_width: 128

So, your result shows that tao deploy inference or tao deploy evaluation also gets incorrect results.
To narrow down, could you run the default notebook to see if it works?
tao_tutorials/notebooks/tao_launcher_starter_kit/classification_tf2/tao_voc/classification.ipynb at main · NVIDIA/tao_tutorials · GitHub.

Please refer to the default spec file: tao_tutorials/notebooks/tao_launcher_starter_kit/classification_tf2/tao_voc/specs/spec.yaml at main · NVIDIA/tao_tutorials · GitHub. For example, it sets enable_center_crop: True.

I have tried the default notebook and default specs (e.g., enable_center_crop: True) as suggested. Here are the results:

TAO TLT Model Evaluation Results
tao model classification_tf2 evaluate

Evaluation Loss: 0.1649947166442871
Evaluation Top 1 accuracy: 0.9825870394706726

Confusion Matrix:

[[589  17]
 [  4 596]]

Classification Report:

              precision    recall  f1-score   support

   non-staff       0.99      0.97      0.98       606
       staff       0.97      0.99      0.98       600

    accuracy                           0.98      1206
   macro avg       0.98      0.98      0.98      1206
weighted avg       0.98      0.98      0.98      1206

TAO Deploy TensorRT Engine Evaluation Results
tao deploy classification_tf2 evaluate

Top 1 scores: 0.495

Confusion Matrix:

[[  0 606]
 [  0 594]]

Classification Report:

              precision    recall  f1-score   support

   non-staff       0.00      0.00      0.00       606
       staff       0.49      1.00      0.66       594

    accuracy                           0.49      1200
   macro avg       0.25      0.50      0.33      1200
weighted avg       0.25      0.49      0.33      1200

From your result, there is still an accuracy drop when running "tao deploy" evaluation against the TensorRT engine. Will check further. BTW, did you ever run with the default dataset?

Comparing the tao-tf2 branch (tao_tensorflow2_backend/nvidia_tao_tf2/cv/classification/inferencer/keras_inferencer.py at main · NVIDIA/tao_tensorflow2_backend · GitHub) with the tao-deploy branch (tao_deploy/nvidia_tao_deploy/cv/classification_tf1/dataloader.py at main · NVIDIA/tao_deploy · GitHub), could you please set the same interpolation_method explicitly in the spec YAML file and retry?

From your result, there is still an accuracy drop when running "tao deploy" evaluation against the TensorRT engine. Will check further.

Thanks! I am happy to share the TLT model and TensorRT engine with you if that helps.

BTW, did you ever run with the default dataset?

No, I did not.

I have specified the same interpolation method in the spec YAML file for both tao model classification_tf2 evaluate and tao deploy classification_tf2 evaluate:

model:
  ...
  resize_interpolation_method: bilinear

It does not help.

OK, thanks for the info. I will check further and update you when I have more. Thanks.

I can reproduce the accuracy drop. Will check further.

Thank you.
May I know when we will get an update? Our production relies on this.
If it cannot be resolved soon, what alternatives could we use? Will reverting to an older version of the TAO toolkit resolve the accuracy drop issue? Is there another solution provided by NVIDIA that we can use instead?

Got it. We are still working on it. Once there is a fix or workaround, I will update you.
Sorry for the inconvenience.

Hi, I found the root cause. Please change the code as follows.

$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy /bin/bash

Then inside the docker,
# mv /usr/local/lib/python3.10/dist-packages/nvidia_tao_deploy/inferencer/preprocess_input.py /usr/local/lib/python3.10/dist-packages/nvidia_tao_deploy/inferencer/preprocess_input.py.bak
# vim /usr/local/lib/python3.10/dist-packages/nvidia_tao_deploy/inferencer/preprocess_input.py
(Copy the content from tao_deploy/nvidia_tao_deploy/inferencer/preprocess_input.py at main · NVIDIA/tao_deploy · GitHub)

Modify tao_deploy/nvidia_tao_deploy/inferencer/preprocess_input.py at main · NVIDIA/tao_deploy · GitHub
to

override_mean = True

Then run evaluation to confirm there is no gap now.

# classification_tf2 evaluate -e xxx.yaml

You can also run docker commit to generate a new tao-deploy docker.
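For example (the container ID and target tag are placeholders):

$ docker commit <container-id> nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy-patched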

Thank you for the fast action.
I was able to close the gap by following the method you suggested. However, this approach only addresses the gap for TAO deploy. I am interested in utilizing this model in DeepStream.
Could you please advise on how I can apply this fix to ensure there is no performance gap when using the TRT engine in DeepStream?

Please set:

net-scale-factor=0.0175070028011204
offsets=2.165178571428571;2.035714285714286;1.8125

According to tao_tensorflow2_backend/nvidia_tao_tf2/cv/classification/utils/preprocess_input.py at main · NVIDIA/tao_tensorflow2_backend · GitHub, the torch-mode preprocessing is (x/255 - mean)/std = x/(255*std) - mean/std.
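These values can be reproduced from the torch-mode ImageNet statistics. A minimal check, assuming mean = (0.485, 0.456, 0.406) and a single shared std of 0.224 (nvinfer's net-scale-factor is a scalar, so one std value has to be chosen):

# reproduce the suggested nvinfer preprocessing parameters
mean = [0.485, 0.456, 0.406]
std = 0.224

net_scale_factor = 1.0 / (255.0 * std)   # -> 0.0175070028011204...
offsets = [m / std for m in mean]        # -> approx. [2.165178571, 2.035714286, 1.8125]

print(f"net-scale-factor={net_scale_factor}")
print("offsets=" + ";".join(str(o) for o in offsets))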

I specified the values that you suggested, but it didn’t make a difference. The resulting inference overlay video looks similar to the previous one.

Please delete the engine and generate it again, just to make sure you are running with a newly built engine rather than the existing one.

I deleted the old engine, regenerated a new engine, and specified the net-scale-factor and offsets you suggested. It didn’t make a difference.

When you run tao model classification_tf2 inference, you are using the cropped images, right?

Currently, your pipeline is PeopleNet → classification.
To narrow down, could you use the cropped images and run the classification directly? In other words, make it work as the primary TRT engine (see the config sketch at the end of this post).
A similar topic is shared in
Issue with image classification tutorial and testing with deepstream-app - #21 by Morganh.

In addition, two modifications can be considered:
1) Generate an .avi file from the cropped images:
$ gst-launch-1.0 multifilesrc location="/tmp/%d.jpg" caps="image/jpeg,framerate=30/1" ! jpegdec ! x264enc ! avimux ! filesink location="out.avi"
Refer to Issue with image classification tutorial and testing with deepstream-app - #24 by Morganh.
2) Set scaling-filter=5. Refer to Issue with image classification tutorial and testing with deepstream-app - #32 by Morganh.
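A rough sketch of the nvinfer property changes implied by the above, assuming the classifier engine is run as PGIE directly on the cropped-image video (treat this as a starting point, not a verified config):

[property]
# run the classifier as the primary GIE on the cropped-frame video instead of as SGIE
process-mode=1
# classifier network
network-type=1
# per suggestion (2)
scaling-filter=5

In the deepstream-app config, point [primary-gie] at this file, set [source0] uri=file:///path/to/out.avi, and drop the operate-on-gie-id / operate-on-class-ids settings.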