Why is the TensorRT model slower?

I tried TensorRT on an MHA (multi-head attention) model, but it turned out to be even slower than the JIT-scripted model.

I benchmarked four variants: the original model, the JIT-scripted model, the JIT-scripted model after torch.jit.optimize_for_inference, and the TensorRT model. The TensorRT model was not as fast as I expected: it beat the original model but was slower than both JIT variants. The model is a simple MHA module, modified from fairseq so that it compiles.

import time
import tmp_attn
import torch
import tensorrt
import torch_tensorrt as torch_trt

def timer(m, i):
    # Total wall-clock seconds for 10000 forward calls, passing i as all three inputs.
    st = time.time()
    for _ in range(10000):
        m(i, i, i)
    ed = time.time()
    return ed - st

# Random input on the GPU, reused for every call.
t1 = torch.randn(64, 1, 1280, device="cuda:0")
# Eager model, then the three compiled variants under test.
model = tmp_attn.MultiheadAttention(1280, 8).to("cuda:0")
model2 = torch.jit.script(model)
model3 = torch.jit.optimize_for_inference(model2)
model4 = torch_trt.compile(model, inputs=[t1, t1, t1]).to("cuda:0")

print("Original Model", timer(model, t1))
print("Jit Script Model", timer(model2, t1))
print("Jit Script Model after optimization", timer(model3, t1))
print("TensorRT Model", timer(model4, t1))
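
A caveat on the `timer` function above: CUDA kernels launch asynchronously, so a loop timed with `time.time()` and no synchronization may partly measure launch overhead instead of GPU execution time, and the first calls can include one-time setup cost. Below is a minimal sketch of a stricter timer with warm-up iterations and an explicit synchronization hook. The names `benchmark`, `warmup`, `iters`, and `sync` are illustrative and not part of the original script.

```python
import time


def benchmark(fn, *args, warmup=50, iters=1000, sync=None):
    """Return the mean wall-clock seconds per call of fn(*args).

    sync is an optional zero-argument callable that blocks until all
    queued device work has finished (for example torch.cuda.synchronize).
    """
    # Warm-up calls are excluded from the measurement.
    for _ in range(warmup):
        fn(*args)
    if sync is not None:
        sync()  # make sure warm-up work is done before starting the clock

    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if sync is not None:
        sync()  # wait for the timed work to finish before stopping the clock
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    # CPU-only demonstration so the sketch runs without a GPU.
    print(benchmark(sum, [1, 2, 3], warmup=5, iters=100))
```

With the script above, a call could look like `benchmark(model4, t1, t1, t1, sync=torch.cuda.synchronize)`, ideally inside a `torch.no_grad()` block so that autograd bookkeeping is not part of the measurement.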

I ran each model 10000 times and recorded the total elapsed time in seconds.
The output is:
Original Model 5.6981117725372314
Jit Script Model 4.5694739818573
Jit Script Model after optimization 3.3332810401916504
TensorRT Model 4.772718667984009
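
To make the comparison easier to read, the totals above can be converted to per-call latency and to a speedup relative to the original model. The sketch below does only that arithmetic on the reported numbers. The helper name `summarize` is illustrative.

```python
def summarize(totals, calls, baseline_key):
    """Map each name to (milliseconds per call, speedup over the baseline)."""
    base = totals[baseline_key]
    return {
        name: (secs / calls * 1e3, base / secs)
        for name, secs in totals.items()
    }


# Total seconds for 10000 calls, copied from the output above.
totals = {
    "Original Model": 5.6981117725372314,
    "Jit Script Model": 4.5694739818573,
    "Jit Script Model after optimization": 3.3332810401916504,
    "TensorRT Model": 4.772718667984009,
}

for name, (ms, speedup) in summarize(totals, 10000, "Original Model").items():
    print(f"{name}: {ms:.3f} ms/call, {speedup:.2f}x vs original")
```

By this measure the TensorRT model is about 1.19x faster than the original model, while the optimized JIT model is about 1.71x faster.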


  • PyTorch Version (e.g., 1.0): 1.11.0
  • CPU Architecture: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
  • OS (e.g., Linux): Linux, CentOS7
  • How you installed PyTorch (conda, pip, libtorch, source): conda
  • Build command you used (if compiling from source): /
  • Are you using local sources or building from archives: No
  • Python version: 3.7
  • CUDA version: 11.7
  • GPU models and configuration:
  • TensorRT version:
  • Torch_tensorrt version: 1.1.0

The code of MHA is here.



The model and the benchmark script are provided above. Please take a look. Thank you.


