Inference with TensorRT is different from inference with HDF5

I’ve trained a model using TAO, obtained the HDF5 model, and exported it to ONNX.
From the ONNX model, I built two TensorRT engines (FP32 and FP16).

I’ve run inference using the HDF5 model and the two TensorRT engines, but I am not getting the same results (for instance, both TensorRT engines predict 2569 labels on the same sample of 1000 images, while HDF5 inference gives me 605 labels).

This is how I exported to ONNX:

!tao model faster_rcnn export --gpu_index $GPU_INDEX \
                        -m $USER_EXPERIMENT_DIR/frcnn_resnet_50.epoch_108.hdf5 \
                        -o $USER_EXPERIMENT_DIR/frcnn_resnet_50_epoch_108.onnx \
                        -e $SPECS_DIR/specs_birds_inference.txt

This is how I deployed to TRT FP32:

!tao deploy faster_rcnn gen_trt_engine --gpu_index $GPU_INDEX \
                        -m $USER_EXPERIMENT_DIR/frcnn_resnet_50_epoch_108.onnx \
                        -e $SPECS_DIR/specs_birds_inference.txt \
                        --data_type fp32 \
                        --batch_size 1 \
                        --max_batch_size 1 \
                        --engine_file $USER_EXPERIMENT_DIR/birds_trt.epoch_108_fp32_bs1.engine \
                        --results_dir $USER_EXPERIMENT_DIR

This is how I ran inference with TRT FP32:

!tao deploy faster_rcnn inference  --gpu_index $GPU_INDEX \
                                   -e $SPECS_DIR/specs_birds_inference.txt \
                                   -m $USER_EXPERIMENT_DIR/birds_trt.epoch_108_fp32_bs1.engine \
                                   -i /workspace/tao-experiments-birds/data/sample_for_deployment/image_2 \
                                   -r /workspace/tao-experiments-birds/data/sample_for_deployment/inf_fp32
  • Same logic for FP16.

Attached the specs file.
specs_birds_inference.txt (3.7 KB)

Is there something wrong in the commands that would explain these different results between the three inferences?

Could you check the output result folders to see what the difference is?
Could you use tao deploy faster_rcnn evaluate to check the results as well?

For each experiment, two folders are generated: the images with bboxes plotted and the KITTI labels.

Here is an example of the difference:
With HDF5:

bird 0 0 0 7.85 232.71 23.94 248.48 0 0 0 0 0 0 0 0.6687426

With TRT FP32:

bird 0.00 0 0.00 13.397 525.418 52.533 566.986 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.761
bird 0.00 0 0.00 402.527 63.345 417.133 81.669 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.730
bird 0.00 0 0.00 394.411 47.057 418.000 96.407 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.711
bird 0.00 0 0.00 351.630 511.766 396.456 554.119 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.694
bird 0.00 0 0.00 383.578 63.915 413.887 89.282 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.575
bird 0.00 0 0.00 6.950 507.647 28.483 554.079 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.557
bird 0.00 0 0.00 4.657 525.446 15.417 543.087 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.380
bird 0.00 0 0.00 0.000 476.593 58.159 561.449 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.342
bird 0.00 0 0.00 366.280 516.060 382.072 546.478 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.338
bird 0.00 0 0.00 10.513 525.226 38.347 542.867 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.300

The bounding boxes are not even close!!
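To put a number on the mismatch instead of eyeballing the label files, here is a small stdlib-only sketch (my own, assuming the usual 16-column KITTI layout, with the bbox in columns 5–8 and the confidence score last) that computes the IoU between the HDF5 box and the top TRT box quoted above:

```python
def parse_kitti(line):
    """Pull (class, bbox, score) out of one KITTI label line.
    Columns: type, truncated, occluded, alpha, bbox (x1 y1 x2 y2),
    dimensions (3), location (3), rotation_y, confidence score."""
    p = line.split()
    return p[0], tuple(map(float, p[4:8])), float(p[15])

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

hdf5 = parse_kitti("bird 0 0 0 7.85 232.71 23.94 248.48 0 0 0 0 0 0 0 0.6687426")
trt  = parse_kitti("bird 0.00 0 0.00 13.397 525.418 52.533 566.986 "
                   "0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.761")
print(iou(hdf5[1], trt[1]))   # 0.0 -- the two boxes do not overlap at all
```

Running this over whole label folders (matching files by name) would show whether the two engines disagree everywhere or only on some images.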

Here is the output of evaluate run on the HDF5 model directly:

100%|█████████████████████████████████████████| 228/228 [01:14<00:00,  3.06it/s]
==========================================================================================
Class               AP                  precision           recall              RPN_recall          
------------------------------------------------------------------------------------------
bird                0.5940              0.1795              0.6782              0.7025              
------------------------------------------------------------------------------------------
mAP@0.5 = 0.5940              

while for the evaluation with tao deploy faster_rcnn evaluate, I get a JSON file with the results inside. It shows:

{"AP_bird": 0.25194354937444696}

The difference in AP is reflected in how the bounding boxes look on the images. The main question, however, is why converting the model to an FP32 TensorRT engine would cause such a difference in average precision.

Best

Did you ever run tao_tutorials/notebooks/tao_launcher_starter_kit/faster_rcnn/faster_rcnn.ipynb at main · NVIDIA/tao_tutorials · GitHub and encounter a similar issue?
Also, could you check whether the ONNX file's opset version is 12?
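If the `onnx` package is available, `onnx.load("model.onnx").opset_import` answers this directly. Failing that, here is a dependency-free sketch (my own, not a TAO utility) that walks the top-level protobuf fields of the serialized `ModelProto`: field 8 is the repeated `opset_import`, and inside each entry field 1 is the domain string and field 2 the version.

```python
def _read_varint(buf, pos):
    """Decode a protobuf varint starting at buf[pos]; return (value, new_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def _skip_field(buf, pos, wire_type):
    """Advance past a field body we are not interested in."""
    if wire_type == 0:                       # varint
        return _read_varint(buf, pos)[1]
    if wire_type == 1:                       # fixed64
        return pos + 8
    if wire_type == 2:                       # length-delimited
        length, pos = _read_varint(buf, pos)
        return pos + length
    if wire_type == 5:                       # fixed32
        return pos + 4
    raise ValueError(f"unexpected wire type {wire_type}")

def onnx_opset_imports(data):
    """Return [(domain, version), ...] from serialized ModelProto bytes."""
    opsets, pos = [], 0
    while pos < len(data):
        key, pos = _read_varint(data, pos)
        field, wt = key >> 3, key & 7
        if field == 8 and wt == 2:           # ModelProto.opset_import
            length, pos = _read_varint(data, pos)
            sub, pos = data[pos:pos + length], pos + length
            domain, version, spos = "", 0, 0
            while spos < len(sub):
                skey, spos = _read_varint(sub, spos)
                sfield, swt = skey >> 3, skey & 7
                if sfield == 1 and swt == 2:      # OperatorSetIdProto.domain
                    slen, spos = _read_varint(sub, spos)
                    domain = sub[spos:spos + slen].decode()
                    spos += slen
                elif sfield == 2 and swt == 0:    # OperatorSetIdProto.version
                    version, spos = _read_varint(sub, spos)
                else:
                    spos = _skip_field(sub, spos, swt)
            opsets.append((domain, version))
        else:
            pos = _skip_field(data, pos, wt)
    return opsets

# Usage sketch (path as in the export command above):
# with open("frcnn_resnet_50_epoch_108.onnx", "rb") as f:
#     print(onnx_opset_imports(f.read()))   # expect something like [('', 12)]
```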

I never reached this step with the faster_rcnn notebook.

I deployed to TensorRT FP32 from ONNX with opset 12 by setting the target opset option to 12, then reran inference. I will post the inference results soon.

In the meantime, I am posting the output log of the conversion from ONNX to TensorRT FP32.

2024-03-20 15:32:05,979 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-03-20 15:32:06,061 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.2.0-deploy
2024-03-20 15:32:06,113 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2024-03-20 15:32:32,870 [TAO Toolkit] [INFO] matplotlib.font_manager: generated new fontManager
Loading uff directly from the package source code
Loading uff directly from the package source code
2024-03-20 15:32:34,091 [TAO Toolkit] [INFO] nvidia_tao_deploy.cv.common.logging.status_logging 198: Log file already exists at /workspace/tao-experiments-birds/faster_rcnn_20231218_BestExperiment1_ac42/status.json
2024-03-20 15:32:34,091 [TAO Toolkit] [INFO] root 174: Starting faster_rcnn gen_trt_engine.
[03/20/2024-15:32:34] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 32, GPU 204 (MiB)
[03/20/2024-15:33:19] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +889, GPU +174, now: CPU 997, GPU 378 (MiB)
2024-03-20 15:33:19,264 [TAO Toolkit] [INFO] nvidia_tao_deploy.cv.faster_rcnn.engine_builder 137: Parsing ONNX model
2024-03-20 15:33:19,341 [TAO Toolkit] [INFO] nvidia_tao_deploy.cv.faster_rcnn.engine_builder 96: ONNX model inputs: 
2024-03-20 15:33:19,342 [TAO Toolkit] [INFO] nvidia_tao_deploy.cv.faster_rcnn.engine_builder 97: Input 0: input_image.
2024-03-20 15:33:19,342 [TAO Toolkit] [INFO] nvidia_tao_deploy.cv.faster_rcnn.engine_builder 98: [0, 3, 613, 418].
[03/20/2024-15:33:19] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:372: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[03/20/2024-15:33:19] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
[03/20/2024-15:33:19] [TRT] [I] No importer registered for op: ProposalDynamic. Attempting to import as plugin.
[03/20/2024-15:33:19] [TRT] [I] Searching for plugin: ProposalDynamic, plugin_version: 1, plugin_namespace: 
[03/20/2024-15:33:19] [TRT] [F] Validation failed: libNamespace == nullptr
/workspace/trt_oss_src/TensorRT/plugin/proposalPlugin/proposalPlugin.cpp:528

[03/20/2024-15:33:19] [TRT] [E] std::exception
[03/20/2024-15:33:19] [TRT] [I] Successfully created plugin: ProposalDynamic
[03/20/2024-15:33:19] [TRT] [F] Validation failed: libNamespace == nullptr
/workspace/trt_oss_src/TensorRT/plugin/proposalPlugin/proposalPlugin.cpp:528

[03/20/2024-15:33:19] [TRT] [E] std::exception
[03/20/2024-15:33:19] [TRT] [I] No importer registered for op: CropAndResizeDynamic. Attempting to import as plugin.
[03/20/2024-15:33:19] [TRT] [I] Searching for plugin: CropAndResizeDynamic, plugin_version: 1, plugin_namespace: 
[03/20/2024-15:33:19] [TRT] [I] Successfully created plugin: CropAndResizeDynamic
[03/20/2024-15:33:19] [TRT] [I] No importer registered for op: NMSDynamic_TRT. Attempting to import as plugin.
[03/20/2024-15:33:19] [TRT] [I] Searching for plugin: NMSDynamic_TRT, plugin_version: 1, plugin_namespace: 
[03/20/2024-15:33:19] [TRT] [W] parsers/onnx/builtin_op_importers.cpp:5219: Attribute isBatchAgnostic not found in plugin node! Ensure that the plugin creator has a default value defined or the engine may fail to build.
[03/20/2024-15:33:19] [TRT] [I] Successfully created plugin: NMSDynamic_TRT
2024-03-20 15:33:19,754 [TAO Toolkit] [INFO] nvidia_tao_deploy.cv.faster_rcnn.engine_builder 154: Network Description
2024-03-20 15:33:19,754 [TAO Toolkit] [INFO] nvidia_tao_deploy.cv.faster_rcnn.engine_builder 156: Input 'input_image' with shape (-1, 3, 613, 418) and dtype DataType.FLOAT
2024-03-20 15:33:19,754 [TAO Toolkit] [INFO] nvidia_tao_deploy.cv.faster_rcnn.engine_builder 158: Output 'nms_out' with shape (-1, 1, 100, 7) and dtype DataType.FLOAT
2024-03-20 15:33:19,755 [TAO Toolkit] [INFO] nvidia_tao_deploy.cv.faster_rcnn.engine_builder 158: Output 'nms_out_1' with shape (-1, 1, 1, 1) and dtype DataType.FLOAT
2024-03-20 15:33:19,755 [TAO Toolkit] [INFO] nvidia_tao_deploy.cv.faster_rcnn.engine_builder 160: dynamic batch size handling
2024-03-20 15:33:19,755 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 150: TensorRT engine build configurations:
2024-03-20 15:33:19,755 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 163:  
2024-03-20 15:33:19,755 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 179:   BuilderFlag.TF32
2024-03-20 15:33:19,755 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 195:  
2024-03-20 15:33:19,755 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 197:   Note: max representabile value is 2,147,483,648 bytes or 2GB.
2024-03-20 15:33:19,755 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 199:   MemoryPoolType.WORKSPACE = 2147483648 bytes
2024-03-20 15:33:19,755 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 201:   MemoryPoolType.DLA_MANAGED_SRAM = 0 bytes
2024-03-20 15:33:19,755 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 203:   MemoryPoolType.DLA_LOCAL_DRAM = 1073741824 bytes
2024-03-20 15:33:19,755 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 205:   MemoryPoolType.DLA_GLOBAL_DRAM = 536870912 bytes
2024-03-20 15:33:19,755 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 207:  
2024-03-20 15:33:19,755 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 209:   PreviewFeature.FASTER_DYNAMIC_SHAPES_0805
2024-03-20 15:33:19,755 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 211:   PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805
2024-03-20 15:33:19,755 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 215:   Tactic Sources = 31
[03/20/2024-15:33:19] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[03/20/2024-15:33:19] [TRT] [F] Validation failed: libNamespace == nullptr
/workspace/trt_oss_src/TensorRT/plugin/proposalPlugin/proposalPlugin.cpp:528

[03/20/2024-15:33:19] [TRT] [E] std::exception
[03/20/2024-15:33:22] [TRT] [I] Graph optimization time: 2.31754 seconds.
[03/20/2024-15:33:22] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +6, GPU +10, now: CPU 1258, GPU 388 (MiB)
[03/20/2024-15:33:22] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 1260, GPU 398 (MiB)
[03/20/2024-15:33:22] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[03/20/2024-15:33:22] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[03/20/2024-15:33:22] [TRT] [F] Validation failed: libNamespace == nullptr
/workspace/trt_oss_src/TensorRT/plugin/proposalPlugin/proposalPlugin.cpp:528

[03/20/2024-15:33:22] [TRT] [E] std::exception
[03/20/2024-15:36:12] [TRT] [I] Detected 1 inputs and 2 output network tensors.
[03/20/2024-15:36:12] [TRT] [F] Validation failed: libNamespace == nullptr
/workspace/trt_oss_src/TensorRT/plugin/proposalPlugin/proposalPlugin.cpp:528

[03/20/2024-15:36:12] [TRT] [E] std::exception
[03/20/2024-15:36:12] [TRT] [I] Total Host Persistent Memory: 282912
[03/20/2024-15:36:12] [TRT] [I] Total Device Persistent Memory: 1062912
[03/20/2024-15:36:12] [TRT] [I] Total Scratch Memory: 5332224
[03/20/2024-15:36:12] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 139 MiB, GPU 354 MiB
[03/20/2024-15:36:12] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 169 steps to complete.
[03/20/2024-15:36:12] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 7.63265ms to assign 8 blocks to 169 nodes requiring 391385088 bytes.
[03/20/2024-15:36:12] [TRT] [I] Total Activation Memory: 391384064
[03/20/2024-15:36:12] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1492, GPU 604 (MiB)
[03/20/2024-15:36:12] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 1492, GPU 614 (MiB)
[03/20/2024-15:36:12] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +89, GPU +191, now: CPU 89, GPU 191 (MiB)
Export finished successfully.

What I am particularly worried about is this warning:

[03/20/2024-15:33:22] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.

Does this mean that the engine is not actually running in FP32?

Nope.
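For context (my reading of the warning, not something stated in the log): TF32 is a reduced-precision math mode that TensorRT enables by default for matmuls and convolutions on Ampere-class GPUs. The warning only says your GPU lacks TF32 tensor cores, so the builder falls back to ordinary FP32 math, which is more precise, not less. A rough stdlib sketch of what TF32 does to a value (TF32 keeps only 10 of FP32's 23 mantissa bits; truncation is used here for simplicity, while real tensor cores round to nearest):

```python
import struct

def to_tf32(x):
    """Crude model of TF32: keep the FP32 sign and 8-bit exponent but only
    the top 10 of the 23 mantissa bits (by zeroing the low 13 bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & ~0x1FFF))[0]

print(to_tf32(1.5))                     # 1.5 -- exactly representable, unchanged
print(abs(to_tf32(0.1) - 0.1) < 1e-4)  # True -- only low mantissa bits are lost
```

So a TF32-to-FP32 fallback cannot explain the large discrepancy you are seeing.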

I suggest changing the inference threshold in the spec file to check further for FP32.
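One quick way to see how sensitive the label counts are to that threshold is a stdlib sweep over KITTI label lines (a sketch of mine, assuming the score is the last of 16 columns; the bbox fields below are dummies):

```python
def count_detections(label_lines, threshold):
    """Count KITTI detections whose confidence (last column) >= threshold."""
    total = 0
    for line in label_lines:
        parts = line.split()
        if len(parts) >= 16 and float(parts[-1]) >= threshold:
            total += 1
    return total

# Scores from the FP32 TRT detections quoted earlier in the thread:
trt_scores = [0.761, 0.730, 0.711, 0.694, 0.575, 0.557, 0.380, 0.342, 0.338, 0.300]
trt_lines = [f"bird 0.00 0 0.00 0 0 1 1 0 0 0 0 0 0 0 {s}" for s in trt_scores]

for thr in (0.3, 0.5, 0.7):
    print(thr, count_detections(trt_lines, thr))
# At 0.3 all 10 detections survive, at 0.5 only 6, at 0.7 only 3 -- so
# make sure the HDF5 and TRT runs really use the same cutoff.
```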

If possible, you can share .hdf5 file, onnx file and several test images for reproducing.

UPDATE: checking the inference results after forcing a target opset of 12, I still see the same discrepancy.

==> I'll let you know what happens when I lower the inference threshold.
==> I will post the two models and a few images soon.

The same discrepancy: 250 labels predicted with HDF5 and 1000 labels predicted with FP32 and FP16.

Which versions of the tao tf1 and tao deploy dockers are you using?
You can run:
! tao info --verbose

Configuration of the TAO Toolkit Instance
task_group: ['model', 'dataset', 'deploy']
format_version: 3.0
toolkit_version: 5.2.0.1
published_date: 01/16/2024

Please find the HDF5, the ONNX, and some images [here](https://drive.google.com/file/d/1AVh1MRTElCDZIVg20HVXhJx7LjjwPHl1/view?usp=sharing).

Just to confirm, the model detects the birds class, right? In the test images, the birds cannot easily be spotted by the human eye. What do you think?

Yes only one class, birds.

Although it’s difficult, they are still detectable. I got a good mAP on images inferred with HDF5. I also tested on unseen images (outside the val and test sets) and it works quite well.

If the TensorRT engine is just an optimized mirror of the HDF5 model, why would it give different results? Are HDF5 and TensorRT conceptually different in terms of structure?

To debug the difference, we may need to take the ONNX inference as the baseline, then check whether something is wrong on the TensorRT side.
Refer to TensorRT/tools/Polygraphy/examples/cli/debug/02_reducing_failing_onnx_models at main · NVIDIA/TensorRT · GitHub.

Could you use old version of tao deploy model to check?
For example, nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy

Yes, but there is still a discrepancy (although less severe than with 5.2.0):

With TensorRT:

bird 0.00 0 0.00 0.000 522.048 54.933 550.091 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.761
bird 0.00 0 0.00 395.695 63.285 418.000 90.448 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.730
bird 0.00 0 0.00 340.674 521.610 393.587 546.734 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.694

With HDF5:

bird 0 0 0 7.85 232.71 23.94 248.48 0 0 0 0 0 0 0 0.6687425

I will update you on the debugging soon.