Inference with TensorRT engine file gives different results compared with the trained HDF5 model

• Hardware - P4
• Network Type - Classification
• TLT Version - nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
• Default training configuration for TF1 classification with resnet18/resnet50

I performed inference on test images using both the trained model (in HDF5 format) and the TensorRT engine built from the exported ONNX model. The predictions differed significantly, even to the extent of assigning different classes to the same images.

Retraining with nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5 and testing with the .tlt model and the engine built from the exported .etlt file gives much better results, representative of the model's mAP.

I wasn't able to save the model or export it in .tlt/.etlt format with TAO 5.0.0. Could that solve this issue? Is it supported?

Is there a way to fix this in that version of TAO?


You are following tao_tutorials/notebooks/tao_launcher_starter_kit/classification_tf1/tao_voc/classification.ipynb at main · NVIDIA/tao_tutorials · GitHub to run inference, right? What is the result when you run inference on the hdf5 file with tao model classification_tf1 inference xxx?
And what is the result when you run tao deploy classification_tf1 inference xxx?
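
For context, the two runs look roughly like this (paths are placeholders and other flags are omitted; the exact arguments are in the notebook and in the --help output of each sub-task):

    # hdf5 inference via the TF backend
    tao model classification_tf1 inference -e /path/to/spec.cfg -m /path/to/model.hdf5 ...
    # TensorRT engine inference via tao deploy
    tao deploy classification_tf1 inference -e /path/to/spec.cfg -m /path/to/model.engine ...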

Since TAO 5.0, the code is open source. The previous .etlt file was actually an encrypted onnx file, and the previous .tlt file was an encrypted hdf5 file.
In TAO 5.0 or later, training saves the model to an hdf5 file and export produces an onnx file.
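
You can verify this directly: the exported file should load with the standard onnx package, no decryption needed. A quick sanity check (a sketch; the path is illustrative):

    import onnx

    # load the exported model and confirm it is plain, valid ONNX
    m = onnx.load("export/model.onnx")
    onnx.checker.check_model(m)
    print("inputs :", [i.name for i in m.graph.input])
    print("outputs:", [o.name for o in m.graph.output])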

I am following similar steps to the tutorial, yes.

Running tao model classification_tf1 inference xxx outputs good results, but with tao deploy classification_tf1 inference xxx I see a drop in prediction accuracy after creating the engine file. The same holds for tao deploy classification_tf1 evaluate xxx on unseen images: the precision/recall scores are much worse than with the hdf5 model.

By contrast, when training with the same dataset and the same configuration on TAO 4.0.0, the inference and evaluation results for the .tlt model and the engine file are significantly better.

OK, did you keep the export log? I am afraid it is due to the onnx_route.
See tao_tensorflow1_backend/nvidia_tao_tf1/cv/common/export/keras_exporter.py at main · NVIDIA/tao_tensorflow1_backend · GitHub
and tao_tensorflow1_backend/nvidia_tao_tf1/cv/common/export/keras_exporter.py at main · NVIDIA/tao_tensorflow1_backend · GitHub.

Could you try modifying tao_tensorflow1_backend/nvidia_tao_tf1/cv/common/export/keras_exporter.py at main · NVIDIA/tao_tensorflow1_backend · GitHub to use tf2onnx and retry?
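
The change amounts to forcing the export route to tf2onnx. A sketch of the edit (match it against the actual code in your copy of keras_exporter.py; search for onnx_route):

    # in keras_exporter.py -- force tf2onnx instead of the default route
    self.onnx_route = "tf2onnx"
    logger.info("Setting the onnx export route to {}".format(self.onnx_route))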

I am building the engine directly in the DeepStream (v6.3) pipeline with the gst-nvinfer element, with all the required configuration, after exporting the model to onnx as described in the tutorial. Can I change the onnx_route in this use case? How could I do it?
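
For reference, the preprocessing-related keys of my gst-nvinfer config look roughly like this (file names are placeholders; the offsets mirror the caffe-mode BGR means from my training spec):

    [property]
    onnx-file=model.onnx
    labelfile-path=labels.txt
    batch-size=1
    # 1 = classifier
    network-type=1
    # 0=FP32, 1=INT8, 2=FP16
    network-mode=0
    # "caffe" preprocess_mode: BGR input, per-channel mean subtraction, no scaling
    model-color-format=1
    net-scale-factor=1.0
    offsets=103.9;116.8;123.7
    # output node name from the export log
    output-blob-names=predictions/Softmax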

Yes, you can modify the code inside the docker.
Steps:

  1. Log in to the docker:
    docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash
  2. Modify the code:
    vim /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/export/keras_exporter.py
  3. Run the export again to generate a new onnx file and monitor the log to confirm it is using tf2onnx. Note that the export must be run inside the docker:
    $ classification_tf1 export xxx
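
If you prefer a non-interactive edit, a one-liner inside the container works as well. This assumes the default route string is "keras2onnx"; verify with the grep first:

    grep -n "onnx_route" /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/export/keras_exporter.py
    # assuming the default is "keras2onnx" (check the grep output above)
    sed -i 's/keras2onnx/tf2onnx/g' /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/export/keras_exporter.py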

I tried the export with 'tf2onnx' as the onnx_route. I also had to update the /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/makenet/export/classification_exporter.py file. Here is the error from the classification_tf1 export xxx command in the docker container.

2024-07-04 13:08:34,944 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.common.export.keras_exporter 119: Setting the onnx export route to tf2onnx
2024-07-04 13:08:34,944 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.makenet.export.classification_exporter 90: Setting the onnx export rote to tf2onnx
2024-07-04 13:08:35,040 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.common.export.keras_exporter 429: Using input nodes: ['input_1']
2024-07-04 13:08:35,040 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.common.export.keras_exporter 430: Using output nodes: ['predictions/Softmax']
Loaded model
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/makenet/scripts/export.py", line 70, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/makenet/scripts/export.py", line 66, in main
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/makenet/scripts/export.py", line 50, in main
    run_export(Exporter, args=args, backend=backend)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/export/app.py", line 289, in run_export
    exporter.export(output_file,
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/export/keras_exporter.py", line 440, in export
    self.save_exported_file(model, output_file_name)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/makenet/export/classification_exporter.py", line 228, in save_exported_file
    input_tensor_names, out_tensor_names, _ = keras_to_pb(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/core/export/_uff.py", line 241, in keras_to_pb
    freeze_graph.freeze_graph(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/tools/freeze_graph.py", line 346, in freeze_graph
    return freeze_graph_with_def_protos(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/tools/freeze_graph.py", line 228, in freeze_graph_with_def_protos
    output_graph_def = graph_util.convert_variables_to_constants(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/util/deprecation.py", line 330, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/graph_util_impl.py", line 277, in convert_variables_to_constants
    inference_graph = extract_sub_graph(input_graph_def, output_node_names)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/util/deprecation.py", line 330, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/graph_util_impl.py", line 197, in extract_sub_graph
    _assert_nodes_are_present(name_to_node, dest_nodes)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/graph_util_impl.py", line 152, in _assert_nodes_are_present
    assert d in name_to_node, "%s is not in graph" % d
AssertionError: predictions/Softmax is not in graph
Execution status: FAIL

Will check further…
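
In the meantime, you can check which Softmax node actually exists in the graph, since the exporter asserts on the hard-coded name predictions/Softmax. A sketch, assuming the hdf5 loads with plain tf.keras inside the same container (TAO models with custom layers may need custom_objects passed to load_model):

    import tensorflow as tf
    from tensorflow.keras import backend as K

    # illustrative path; compile=False skips optimizer state
    model = tf.keras.models.load_model("model.hdf5", compile=False)
    # list all Softmax ops so the real output node name can be compared
    # against the exporter's hard-coded 'predictions/Softmax'
    graph = K.get_session().graph
    for op in graph.get_operations():
        if op.type == "Softmax":
            print(op.name)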

Furthermore, could you please run tao model classification_tf1 evaluate xxx against the hdf5 model and tao deploy classification_tf1 evaluate xxx against the TensorRT engine? Please share both evaluation results as well. Thanks.

Also, please share the training spec file.

The following is the result of tao model classification_tf1 evaluate xxx with the hdf5 model:

                 precision    recall  f1-score   support

      class-1       0.97      0.99      0.98       285
      class-2       1.00      1.00      1.00        76
      class-3       1.00      1.00      1.00         9
      class-4       0.89      0.68      0.77        25

    micro avg       0.97      0.97      0.97       395
    macro avg       0.77      0.73      0.75       395
 weighted avg       0.97      0.97      0.97       395

Here is the result of tao deploy classification_tf1 evaluate xxx for the engine file:


                 precision    recall  f1-score   support

      class-1       0.88      0.87      0.87       285
      class-2       0.65      0.79      0.71        76
      class-3       1.00      0.67      0.80         9
      class-4       0.53      0.36      0.43        25

     accuracy                           0.81       395
    macro avg       0.61      0.54      0.56       395
 weighted avg       0.81      0.81      0.81       395

The following is the training config file:

model_config {
  # Model Architecture can be chosen from:
  # ['resnet', 'vgg', 'googlenet', 'alexnet']
  arch: "resnet"
  # for resnet --> n_layers can be [10, 18, 50]
  # for vgg --> n_layers can be [16, 19]
  n_layers: 50
  use_batch_norm: True
  use_bias: False
  all_projections: False
  use_pooling: True
  retain_head: True
  resize_interpolation_method: BICUBIC
  # if you want to use the pretrained model,
  # image size should be "3,224,224"
  # otherwise, it can be "3, X, Y", where X,Y >= 16
  input_image_size: "3,224,224"
}
train_config {
  train_dataset_path: "folder/train"
  val_dataset_path: "folder/val"
  pretrained_model_path: "pretrained_object_detection_vresnet50/resnet_50.hdf5"
  # Only ['sgd', 'adam'] are supported for optimizer
  optimizer {
      sgd {
      lr: 0.01
      decay: 0.0
      momentum: 0.9
      nesterov: False
      }
  }
  batch_size_per_gpu: 25
  n_epochs: 40
  # Number of CPU cores for loading data
  n_workers: 4
  # regularizer
  reg_config {
      # regularizer type can be "L1", "L2" or "None".
      type: "L2"
      # if the type is not "None",
      # scope can be either "Conv2D" or "Dense" or both.
      scope: "Conv2D,Dense"
      # 0 < weight decay < 1
      weight_decay: 0.000015
  }
  # learning_rate
  lr_config {
      cosine {
      learning_rate: 0.04
      soft_start: 0.0
      }
  }
  enable_random_crop: False
  enable_center_crop: False
  enable_color_augmentation: True
  mixup_alpha: 0.2
  label_smoothing: 0.1
  preprocess_mode: "caffe"
  image_mean {
    key: 'b'
    value: 103.9
  }
  image_mean {
    key: 'g'
    value: 116.8
  }
  image_mean {
    key: 'r'
    value: 123.7
  }
}
eval_config {
  eval_dataset_path: "folder/test"
  model_path: "trained_weights_hdf5/resnet_040.hdf5"
  top_k: 5
  batch_size: 25
  n_workers: 4
  enable_center_crop: False
}

OK. Is it possible to share the hdf5 model? I am going to reproduce the issue and check what the culprit is. It would be better if you could share some test images as well. You can send the model and images via a private message. Thanks a lot.