Inference with TensorRT engine file gives different results compared with the trained HDF5 model

• Hardware - P4
• Network Type - Classification
• TLT Version - nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
• Default training configuration for TF1 classification with resnet18/resnet50

I performed inference on test images using both the trained model (in HDF5 format) and the TensorRT engine built from the exported ONNX model. The predictions differed significantly, even to the extent of assigning different classes to the same images.

Retraining with nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5 and testing with the .tlt model and the engine built from the exported .etlt file gives much better results, representative of the model's mAP.

I wasn't able to save the model or export it in .tlt/.etlt format with TAO 5.0.0. Could that solve this issue? Is it supported?

Is there a way to fix this in that version of TAO?


You are following tao_tutorials/notebooks/tao_launcher_starter_kit/classification_tf1/tao_voc/classification.ipynb at main · NVIDIA/tao_tutorials · GitHub to run inference, right? What is the result when you run inference on the hdf5 file with tao model classification_tf1 inference xxx?
And what is the result when you run tao deploy classification_tf1 inference xxx?
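
For context, the two runs look roughly like this (paths are placeholders and other flags are omitted; the exact arguments are in the notebook and in the --help output of each sub-task):

    # hdf5 inference via the TF backend
    tao model classification_tf1 inference -e /path/to/spec.cfg -m /path/to/model.hdf5 ...
    # TensorRT engine inference via tao deploy
    tao deploy classification_tf1 inference -e /path/to/spec.cfg -m /path/to/model.engine ...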

Since TAO 5.0, the code is open source. The previous .etlt file was actually an encrypted onnx file, and the previous .tlt file was an encrypted hdf5 file.
In TAO 5.0 or later, training saves the model to an hdf5 file and export produces an onnx file.
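
You can verify this directly: the exported file should load with the standard onnx package, no decryption needed. A quick sanity check (a sketch; the path is illustrative):

    import onnx

    # load the exported model and confirm it is plain, valid ONNX
    m = onnx.load("export/model.onnx")
    onnx.checker.check_model(m)
    print("inputs :", [i.name for i in m.graph.input])
    print("outputs:", [o.name for o in m.graph.output])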

I am following similar steps to the tutorial, yes.

Running tao model classification_tf1 inference xxx outputs good results, but with tao deploy classification_tf1 inference xxx I see a drop in prediction accuracy after creating the engine file. The same holds for tao deploy classification_tf1 evaluate xxx on unseen images: the precision/recall scores are much worse than with the hdf5 model.

By contrast, when training with the same dataset and the same configuration on TAO 4.0.0, the inference and evaluation results for the .tlt model and the engine file are significantly better.

OK, did you keep the export log? I am afraid it is due to the onnx_route.
See tao_tensorflow1_backend/nvidia_tao_tf1/cv/common/export/keras_exporter.py at main · NVIDIA/tao_tensorflow1_backend · GitHub
and tao_tensorflow1_backend/nvidia_tao_tf1/cv/common/export/keras_exporter.py at main · NVIDIA/tao_tensorflow1_backend · GitHub.

Could you try modifying tao_tensorflow1_backend/nvidia_tao_tf1/cv/common/export/keras_exporter.py at main · NVIDIA/tao_tensorflow1_backend · GitHub to use tf2onnx and retry?
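
The change amounts to forcing the export route to tf2onnx. A sketch of the edit (match it against the actual code in your copy of keras_exporter.py; search for onnx_route):

    # in keras_exporter.py -- force tf2onnx instead of the default route
    self.onnx_route = "tf2onnx"
    logger.info("Setting the onnx export route to {}".format(self.onnx_route))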

I am building the engine directly in the DeepStream (v6.3) pipeline with the gst-nvinfer element, with all the required configuration, after exporting the model to onnx as described in the tutorial. Can I change the onnx_route in this use case? How could I do it?
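
For reference, the preprocessing-related keys of my gst-nvinfer config look roughly like this (file names are placeholders; the offsets mirror the caffe-mode BGR means from my training spec):

    [property]
    onnx-file=model.onnx
    labelfile-path=labels.txt
    batch-size=1
    # 1 = classifier
    network-type=1
    # 0=FP32, 1=INT8, 2=FP16
    network-mode=0
    # "caffe" preprocess_mode: BGR input, per-channel mean subtraction, no scaling
    model-color-format=1
    net-scale-factor=1.0
    offsets=103.9;116.8;123.7
    # output node name from the export log
    output-blob-names=predictions/Softmax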

Yes, you can modify the code inside the docker.
Steps:

  1. Log in to the docker:
    docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash
  2. Modify the code:
    vim /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/export/keras_exporter.py
  3. Run the export again to generate a new onnx file and monitor the log to confirm it is using tf2onnx. Note that the export must be run inside the docker:
    $ classification_tf1 export xxx
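
If you prefer a non-interactive edit, a one-liner inside the container works as well. This assumes the default route string is "keras2onnx"; verify with the grep first:

    grep -n "onnx_route" /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/export/keras_exporter.py
    # assuming the default is "keras2onnx" (check the grep output above)
    sed -i 's/keras2onnx/tf2onnx/g' /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/export/keras_exporter.py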

I tried the export with 'tf2onnx' as the onnx_route. I also had to update the /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/makenet/export/classification_exporter.py file. Here is the error from the classification_tf1 export xxx command in the docker container.

2024-07-04 13:08:34,944 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.common.export.keras_exporter 119: Setting the onnx export route to tf2onnx
2024-07-04 13:08:34,944 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.makenet.export.classification_exporter 90: Setting the onnx export rote to tf2onnx
2024-07-04 13:08:35,040 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.common.export.keras_exporter 429: Using input nodes: ['input_1']
2024-07-04 13:08:35,040 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.common.export.keras_exporter 430: Using output nodes: ['predictions/Softmax']
Loaded model
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/makenet/scripts/export.py", line 70, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/makenet/scripts/export.py", line 66, in main
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/makenet/scripts/export.py", line 50, in main
    run_export(Exporter, args=args, backend=backend)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/export/app.py", line 289, in run_export
    exporter.export(output_file,
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/export/keras_exporter.py", line 440, in export
    self.save_exported_file(model, output_file_name)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/makenet/export/classification_exporter.py", line 228, in save_exported_file
    input_tensor_names, out_tensor_names, _ = keras_to_pb(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/core/export/_uff.py", line 241, in keras_to_pb
    freeze_graph.freeze_graph(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/tools/freeze_graph.py", line 346, in freeze_graph
    return freeze_graph_with_def_protos(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/tools/freeze_graph.py", line 228, in freeze_graph_with_def_protos
    output_graph_def = graph_util.convert_variables_to_constants(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/util/deprecation.py", line 330, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/graph_util_impl.py", line 277, in convert_variables_to_constants
    inference_graph = extract_sub_graph(input_graph_def, output_node_names)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/util/deprecation.py", line 330, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/graph_util_impl.py", line 197, in extract_sub_graph
    _assert_nodes_are_present(name_to_node, dest_nodes)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/graph_util_impl.py", line 152, in _assert_nodes_are_present
    assert d in name_to_node, "%s is not in graph" % d
AssertionError: predictions/Softmax is not in graph
Execution status: FAIL

Will check further…
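
In the meantime, you can check which Softmax node actually exists in the graph, since the exporter asserts on the hard-coded name predictions/Softmax. A sketch, assuming the hdf5 loads with plain tf.keras inside the same container (TAO models with custom layers may need custom_objects passed to load_model):

    import tensorflow as tf
    from tensorflow.keras import backend as K

    # illustrative path; compile=False skips optimizer state
    model = tf.keras.models.load_model("model.hdf5", compile=False)
    # list all Softmax ops so the real output node name can be compared
    # against the exporter's hard-coded 'predictions/Softmax'
    graph = K.get_session().graph
    for op in graph.get_operations():
        if op.type == "Softmax":
            print(op.name)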

Furthermore, could you please run tao model classification_tf1 evaluate xxx against the hdf5 model and tao deploy classification_tf1 evaluate xxx against the TensorRT engine? Please share both evaluation results as well. Thanks.

Also, please share the training spec file.

The following is the result of tao model classification_tf1 evaluate xxx with the hdf5 model:

                 precision    recall  f1-score   support

      class-1       0.97      0.99      0.98       285
      class-2       1.00      1.00      1.00        76
      class-3       1.00      1.00      1.00         9
      class-4       0.89      0.68      0.77        25

    micro avg       0.97      0.97      0.97       395
    macro avg       0.77      0.73      0.75       395
 weighted avg       0.97      0.97      0.97       395

Here is the result of tao deploy classification_tf1 evaluate xxx for the engine file:


                 precision    recall  f1-score   support

      class-1       0.88      0.87      0.87       285
      class-2       0.65      0.79      0.71        76
      class-3       1.00      0.67      0.80         9
      class-4       0.53      0.36      0.43        25

     accuracy                           0.81       395
    macro avg       0.61      0.54      0.56       395
 weighted avg       0.81      0.81      0.81       395

The following is the training config file:

model_config {
  # Model Architecture can be chosen from:
  # ['resnet', 'vgg', 'googlenet', 'alexnet']
  arch: "resnet"
  # for resnet --> n_layers can be [10, 18, 50]
  # for vgg --> n_layers can be [16, 19]
  n_layers: 50
  use_batch_norm: True
  use_bias: False
  all_projections: False
  use_pooling: True
  retain_head: True
  resize_interpolation_method: BICUBIC
  # if you want to use the pretrained model,
  # image size should be "3,224,224"
  # otherwise, it can be "3, X, Y", where X,Y >= 16
  input_image_size: "3,224,224"
}
train_config {
  train_dataset_path: "folder/train"
  val_dataset_path: "folder/val"
  pretrained_model_path: "pretrained_object_detection_vresnet50/resnet_50.hdf5"
  # Only ['sgd', 'adam'] are supported for optimizer
  optimizer {
      sgd {
      lr: 0.01
      decay: 0.0
      momentum: 0.9
      nesterov: False
      }
  }
  batch_size_per_gpu: 25
  n_epochs: 40
  # Number of CPU cores for loading data
  n_workers: 4
  # regularizer
  reg_config {
      # regularizer type can be "L1", "L2" or "None".
      type: "L2"
      # if the type is not "None",
      # scope can be either "Conv2D" or "Dense" or both.
      scope: "Conv2D,Dense"
      # 0 < weight decay < 1
      weight_decay: 0.000015
  }
  # learning_rate
  lr_config {
      cosine {
      learning_rate: 0.04
      soft_start: 0.0
      }
  }
  enable_random_crop: False
  enable_center_crop: False
  enable_color_augmentation: True
  mixup_alpha: 0.2
  label_smoothing: 0.1
  preprocess_mode: "caffe"
  image_mean {
    key: 'b'
    value: 103.9
  }
  image_mean {
    key: 'g'
    value: 116.8
  }
  image_mean {
    key: 'r'
    value: 123.7
  }
}
eval_config {
  eval_dataset_path: "folder/test"
  model_path: "trained_weights_hdf5/resnet_040.hdf5"
  top_k: 5
  batch_size: 25
  n_workers: 4
  enable_center_crop: False
}

OK. Is it possible to share the hdf5 model? I am going to reproduce the issue and check what the culprit is. It would be better if you could share some test images as well. You can send the model and images via a private message. Thanks a lot.