Why is the TensorRT model slower?

❓ Question

Why is the TensorRT model slower? I tried TensorRT on an MHA (multi-head attention) model, but found it is even slower than the JIT-scripted model.

What you have already tried

I tested the original model, the JIT-scripted model, the JIT model after optimization, and the TensorRT model, and found that the TensorRT model is not as fast as I expected. The model is a simple MHA module, modified from fairseq so that it can pass compilation.

import time
import tmp_attn
import torch
import tensorrt
import torch_tensorrt as torch_trt

def timer(m, i):
    st = time.time()
    for _ in range(10000):
        m(i, i, i)
    ed = time.time()
    return ed - st

t1 = torch.randn(64, 1, 1280, device="cuda:0")
model = tmp_attn.MultiheadAttention(1280, 8).to("cuda:0")
model2 = torch.jit.script(model)
model3 = torch.jit.optimize_for_inference(model2)
model4 = torch_trt.compile(model, inputs=[t1, t1, t1]).to("cuda:0")

print("Original Model", timer(model, t1))
print("Jit Script Model", timer(model2, t1))
print("Jit Script Model after optimization", timer(model3, t1))
print("TensorRT Model", timer(model4, t1))

I ran each model 10000 times and recorded the elapsed time.
The output is:
Original Model 5.6981117725372314
Jit Script Model 4.5694739818573
Jit Script Model after optimization 3.3332810401916504
TensorRT Model 4.772718667984009
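One caveat with these timings: CUDA kernel launches are asynchronous, so wrapping the loop in time.time() without a torch.cuda.synchronize() call, and without warmup iterations, can misattribute time between the variants (the first TensorRT calls in particular may include engine setup cost). A minimal sketch of a fairer timer follows; the sync callable, the function name timed, and the iteration counts are my own choices, not from the original script:

```python
import time

def timed(fn, iters=1000, warmup=50, sync=None):
    """Average per-call latency of fn(), with warmup and optional device sync."""
    for _ in range(warmup):
        fn()                # exclude one-time costs (JIT compilation, engine init)
    if sync is not None:
        sync()              # drain any queued asynchronous work before timing
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()              # wait for all queued kernels to actually finish
    return (time.perf_counter() - start) / iters
```

With the models above you would call something like timed(lambda: model4(t1, t1, t1), sync=torch.cuda.synchronize), so every variant is measured after its kernels complete rather than after its launches return.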


Environment

Build information about Torch-TensorRT can be found by turning on debug messages.

  • PyTorch Version (e.g., 1.0): 1.11.0
  • CPU Architecture: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
  • OS (e.g., Linux): Linux, CentOS7
  • How you installed PyTorch (conda, pip, libtorch, source): conda
  • Build command you used (if compiling from source): /
  • Are you using local sources or building from archives: No
  • Python version: 3.7
  • CUDA version: 11.7
  • GPU models and configuration:
  • TensorRT version:
  • Torch_tensorrt version: 1.1.0

Additional context

The code of MHA is here.



Please share the model, script, profiler output, and performance numbers if not shared already, so that we can help you better.

Alternatively, you can try running your model with the trtexec command.
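If you export the module to ONNX first (e.g. via torch.onnx.export), a typical trtexec invocation looks like the following; the file name, input names, and shapes here are illustrative assumptions, not taken from this report:

```shell
# Benchmark an exported ONNX model with TensorRT's bundled profiler.
# mha.onnx and the input names query/key/value are assumed for illustration.
trtexec --onnx=mha.onnx \
        --shapes=query:64x1x1280,key:64x1x1280,value:64x1x1280 \
        --fp16 \
        --verbose
```

trtexec reports per-inference latency and throughput for the built engine, which gives a TensorRT-only baseline independent of the Python timing loop.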

While measuring model performance, make sure you consider both the latency and the throughput of the network inference, excluding the data pre- and post-processing overhead.
Please refer to the below links for more details:



I have already provided the model and script above. Please check them out. Thank you.


We recommend trying the latest TensorRT 8.4 GA release and letting us know if you still face this issue.

Thank you.