Two TensorRT engines generated from the same ONNX model show different average inference times

Description

I have an ONNX model (see the attached model.zip below).

I generated two TRT engines for this model using two different methods: one with trtexec, and the second with my own Python script adapted from the TensorRT SDK sample “onnx_resnet50.py”.

Let’s call the engines:

  1. TRT_256.trt.engine

  2. TRT_256_Own.trt.engine

When running inference on both generated engines with trtexec, I get different timing results.

Environment

TensorRT Version: 8.6.1.6
GPU Type: RTX 4090 mobile
Nvidia Driver Version: 546.24
CUDA Version: 12.3, V12.3.107
CUDNN Version: 8.9.7
Operating System + Version: Ubuntu 22.04.3 LTS (GNU/Linux 5.15.133.1-microsoft-standard-WSL2 x86_64)
Python Version (if applicable): 3.10.12
TensorFlow Version (if applicable): NA
PyTorch Version (if applicable): 2.2.1+cu121
Baremetal or Container (if container which image + tag): Container - nvcr.io/nvidia/tensorrt:24.01-py3

Relevant Files

model.zip (60.3 KB)

Engines.zip (188.2 KB)

Exportprofiles.zip (1.1 KB)

Engines_Layers_Info.zip (2.8 KB)

Steps To Reproduce

TRT engine creation using trtexec:

  • Create TRT_256:
    trtexec --onnx=./localRegistrationTm_256.onnx --fp16 --saveEngine=TRT_256.trt.engine --verbose

TRT engines execution:

  • TRT_256:
    trtexec --loadEngine=./TRT_256.trt.engine --warmUp=3000 --iterations=3000 --verbose --exportProfile=TRT_256_profile.txt

  • TRT_256_Own:
    trtexec --loadEngine=./TRT_256_Own.trt.engine --warmUp=3000 --iterations=3000 --verbose --exportProfile=TRT_256_Own_profile.txt

The attached export profiles show that only one Conv layer's average time increased significantly with TRT_256_Own.trt.engine; all other layers are roughly the same.
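For reference, the two exported profiles can be diffed layer by layer with a small script. This is a sketch that assumes each profile (as written by trtexec's --exportProfile) is a JSON list of per-layer records carrying "name" and "averageMs" fields; entries without a "name" (such as a leading count record) are skipped.

```python
import json

def layer_deltas(profile_a, profile_b):
    """Pair layers by name and return (name, avg_a, avg_b) tuples,
    sorted so the layer with the largest timing difference comes first.

    Assumes each profile is a list of dicts with 'name' and 'averageMs'
    keys, as in trtexec's --exportProfile output.
    """
    times_a = {e["name"]: e["averageMs"] for e in profile_a if "name" in e}
    times_b = {e["name"]: e["averageMs"] for e in profile_b if "name" in e}
    deltas = [(name, times_a[name], times_b[name])
              for name in times_a.keys() & times_b.keys()]
    # Largest absolute difference first, so the outlier Conv layer tops the list.
    deltas.sort(key=lambda t: abs(t[1] - t[2]), reverse=True)
    return deltas

# Tiny synthetic example; for real use, json.load the two exported profiles
# (e.g. TRT_256_profile.txt and TRT_256_Own_profile.txt) and pass them in.
sample_a = [{"count": 2}, {"name": "Conv_0", "averageMs": 0.12}]
sample_b = [{"count": 2}, {"name": "Conv_0", "averageMs": 0.45}]
print(layer_deltas(sample_a, sample_b))  # prints [('Conv_0', 0.12, 0.45)]
```

Sorting by absolute delta makes the single slow Conv layer immediately visible without reading through the whole profile.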

Please help me analyze the root cause of this difference.

Regards,

Hi @orong13,
Are the differences drastic?
If you are using the same engine with the same input, TensorRT should be deterministic.
However, I don’t think engine building is supposed to be deterministic, as tactics are chosen based on observed runtimes. If you output your build log at info level, you should be able to compare tactic selection between the two engines. Since different tactics/kernels can change the order of operations, you would expect floating-point differences.

Thank you very much.
The issue is no longer relevant; I found the root cause.