Converting an ONNX model to a TensorRT engine takes days

Description

Hi, I have an ONNX model and currently get 30 FPS of inference on an RTX 4060 Mobile. I am trying to gain some performance by using TensorRT. This is a summary of my ONNX model:

nodes = 2049, initializers = 444, inputs = 2, outputs = 1
top ops: [('Constant', 592), ('Unsqueeze', 227), ('Add', 163), ('Transpose', 133), ('MatMul', 132), ('Concat', 113), ('Shape', 107), ('Gather', 103), ('Reshape', 95), ('Mul', 94), ('Div', 53), ('LayerNormalization', 48), ('Conv', 45), ('Erf', 26), ('Cast', 26), ('Slice', 20), ('Softmax', 14), ('BatchNormalization', 9), ('ReduceMean', 8), ('Relu', 7)]

Here is a link to the ONNX file. Now, this is where the problems start. Using this command:

trtexec --onnx=asymformer_160.onnx ^
--saveEngine=test.engine ^
--fp16 --noTF32 ^
--minShapes=img:1x3x160x160,dep:1x1x160x160 ^
--optShapes=img:1x3x160x160,dep:1x1x160x160 ^
--maxShapes=img:1x3x160x160,dep:1x1x160x160 ^
--precisionConstraints=prefer ^
--memPoolSize=workspace:2048 ^
--tacticSources=+CUBLAS,+CUBLAS_LT

I could not get it to complete, and gave up after 10 hours of waiting.

TensorRT Version: TensorRT-10.13.2.6.Windows.win10.cuda-12.9
GPU Type: RTX4060
Nvidia Driver Version: 576.88
CUDA Version: 12.8
CUDNN Version: cudnn-windows-x86_64-8.9.7.29_cuda12-archive
Operating System + Version: Windows 10
Python Version (if applicable): 3.11.9
TensorFlow Version (if applicable): -
PyTorch Version (if applicable): pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu128

Steps To Reproduce

  1. Download the ONNX file I shared.
  2. Run the trtexec command.

I activated the verbose flag to see if I could catch any errors. Seemingly everything works, but the build steps are just very slow. I don't know whether this is normal, since I am not familiar with TensorRT.

As a follow-up: I can generate an engine with the same setup using Polygraphy, but I cannot get it done with trtexec.

Hey,
I think your 10+ hour conversion is caused by the 132 MatMul operations with asymmetric shapes, which TensorRT struggles to optimize.
Have you tried:

  • Pre-optimizing the transformer attention matrices
  • Handling the multi-scale stereo matching operation
  • Reducing TensorRT's optimization search space

best wishes