Why is the TensorRT model slower? I tried TensorRT on an MHA (multi-head attention) model, but found it is even slower than the JIT-scripted model.
I tested the original model, the JIT-scripted model, the JIT model after optimization, and the TensorRT model, and found that the TensorRT model is not as fast as I expected. The model here is a simple MHA module, modified from fairseq so that it passes compilation.
```python
import time

import tmp_attn
import torch
import tensorrt
import torch_tensorrt as torch_trt


def timer(m, i):
    st = time.time()
    for _ in range(10000):
        m(i, i, i)
    ed = time.time()
    return ed - st


t1 = torch.randn(64, 1, 1280, device="cuda:0")
model = tmp_attn.MultiheadAttention(1280, 8).to("cuda:0")
model2 = torch.jit.script(model)
model3 = torch.jit.optimize_for_inference(model2)
model4 = torch_trt.compile(model, inputs=[t1, t1, t1]).to("cuda:0")

print("Original Model", timer(model, t1))
print("Jit Script Model", timer(model2, t1))
print("Jit Script Model after optimization", timer(model3, t1))
print("TensorRT Model", timer(model4, t1))
```
I ran each model 10000 times and recorded the elapsed time.
The output is:
```
Original Model 5.6981117725372314
Jit Script Model 4.5694739818573
Jit Script Model after optimization 3.3332810401916504
TensorRT Model 4.772718667984009
```
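One caveat about these numbers: CUDA kernel launches are asynchronous, so a plain `time.time()` loop that never synchronizes may mostly measure Python-side launch overhead rather than actual GPU compute. A minimal sketch of a synchronized timer with a warm-up phase (the helper name and iteration counts are my own, not from the original script):

```python
import time

import torch


def timer_synced(m, i, n_warmup=100, n_iters=10000):
    # Warm-up iterations let the backend pick kernels and fill caches,
    # so the timed loop measures steady-state throughput.
    for _ in range(n_warmup):
        m(i, i, i)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make sure warm-up kernels finished

    st = time.time()
    for _ in range(n_iters):
        m(i, i, i)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for all queued GPU kernels
    return time.time() - st
```

With synchronization in place, the relative ordering of the four models can change, since the unsynchronized loop rewards whichever backend has the cheapest launch path.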
- PyTorch Version (e.g., 1.0): 1.11.0
- CPU Architecture: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
- OS (e.g., Linux): Linux, CentOS7
- How you installed PyTorch (conda, pip, libtorch, source): conda
- Build command you used (if compiling from source): /
- Are you using local sources or building from archives: No
- Python version: 3.7
- CUDA version: 11.7
- GPU models and configuration:
- TensorRT version: 184.108.40.206
- Torch_tensorrt version: 1.1.0
The code of the MHA module is attached here.