Model inference with TensorRT is slower than regular PyTorch

Description

Hello everyone

I am working on converting a PyTorch object-tracking model to TensorRT for faster inference.

With a batch size of 1 the TensorRT engine is about 2x faster, but as soon as the batch size grows, it becomes SLOWER than PyTorch.

Batch size 1 inference time:
PyTorch - 40 ms
TensorRT - 20 ms

Batch size 8 inference time:
PyTorch - 50 ms
TensorRT - 85 ms

As the batch size increases, PyTorch inference time barely changes, but the TensorRT engine's inference time increases significantly.

I have profiled the TensorRT engine with batch size 1 and batch size 4 in Nsight Systems:

1 batch inference test: implicit1_admin.nsys-rep - Google Drive
4 batch inference test: implicit4_admin.nsys-rep - Google Drive

Can anybody help me figure out why this speed regression is happening?

This is the code that I use to build the engine:

import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import onnx
import tensorrt as trt
import torch

# Constants
ONNX_MODEL_PATH = 'new_full_explicit_batch4.onnx'
TENSORRT_ENGINE_PATH = 'new_full_explicit_batch4.engine'
MIN_BATCH_SIZE = 1
MAX_BATCH_SIZE = 16

# Set up the logger
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Create a TensorRT builder, runtime, and network
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()
parser = trt.OnnxParser(network, TRT_LOGGER)
parser.set_flag(trt.OnnxParserFlag.NATIVE_INSTANCENORM)

# Parse the ONNX model file
with open(ONNX_MODEL_PATH, 'rb') as model:
    if not parser.parse(model.read()):
        print('ERROR: Failed to parse the ONNX file.')
        for error in range(parser.num_errors):
            print(parser.get_error(error))
        exit(1)

# Define optimization profile for dynamic batch size
profile = builder.create_optimization_profile()
profile.set_shape('im_patches', (MIN_BATCH_SIZE, 3, 288, 288), (4, 3, 288, 288), (MAX_BATCH_SIZE, 3, 288, 288))
profile.set_shape('train_feat', (MIN_BATCH_SIZE, 256, 18, 18), (4, 256, 18, 18), (MAX_BATCH_SIZE, 256, 18, 18))
profile.set_shape('target_labels', (1, MIN_BATCH_SIZE, 18, 18), (1, 4, 18, 18), (1, MAX_BATCH_SIZE, 18, 18))
profile.set_shape('train_ltrb', (MIN_BATCH_SIZE, 4, 18, 18), (4, 4, 18, 18), (MAX_BATCH_SIZE, 4, 18, 18))
config.add_optimization_profile(profile)

# Build the engine
builder.max_batch_size = MAX_BATCH_SIZE
# config.max_workspace_size = 1 << 30  # 1GB of workspace size
engine = builder.build_engine(network, config)

# Save the engine
with open(TENSORRT_ENGINE_PATH, 'wb') as f:
    f.write(engine.serialize())
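
Side note: builder.max_batch_size has no effect on explicit-batch networks, and build_engine is deprecated in TensorRT 8.x, so the non-deprecated build path I could switch to would look roughly like this (a sketch only, reusing the same builder, network, and config objects from above):

# Minimal sketch of the TensorRT 8.x recommended build path.
# Assumes `builder`, `network`, `config`, and TENSORRT_ENGINE_PATH from the code above.
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GB workspace

serialized_engine = builder.build_serialized_network(network, config)
if serialized_engine is None:
    raise RuntimeError('Engine build failed')

# IHostMemory supports the buffer protocol, so it can be written directly.
with open(TENSORRT_ENGINE_PATH, 'wb') as f:
    f.write(serialized_engine)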

and I use Polygraphy for inference:


with open("new_full_explicit_batch16.engine", "rb") as f:
    engine_data = f.read()
runtime1 = trt.Runtime(trt.Logger(trt.Logger.WARNING))
engine1 = runtime1.deserialize_cuda_engine(engine_data)

trt_engine1 = TrtRunner(engine1)
trt_engine1.activate()

input_data = {
    "im_patches": test_x_stack.cpu(),
    "train_feat": train_feat_stack.cpu(),
    "target_labels": target_labels_stack.cpu(),
    "train_ltrb": train_ltrb_stack.cpu(),
}
rez = trt_engine1.infer(input_data)
scores_raw = rez["scores_raw"].to("cuda")
bbox_preds = rez["bbox_preds"].to("cuda")
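
For cross-checking my own measurements, Polygraphy's runner also records how long the last infer() call took; if I'm reading the API right (treat this as an assumption), something like this should print it:

# Hedged sketch: cross-check external timing with Polygraphy's own measurement.
# last_inference_time() should return the duration (in seconds) of the most recent infer().
rez = trt_engine1.infer(input_data)
print(f"TrtRunner-reported inference time: {trt_engine1.last_inference_time() * 1000:.2f} ms")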

Perhaps I need to build the engine differently?
Or maybe running inference through Polygraphy isn't a good idea?
Or maybe the PyTorch code itself is the problem, and what works for PyTorch doesn't convert well to ONNX?

If anybody has any idea, please let me know.

Link to the ONNX model: new_full_explicit_batch4.onnx - Google Drive

Thank you

P.S. The TensorRT version is the latest stable release. I've tried converting the model with both dynamic and static shapes; in both cases, inference with multiple batches slows down significantly.

Environment

TensorRT Version: 8.6.1
GPU Type: GTX 1650 Ti
Nvidia Driver Version: 546.01
CUDA Version: 12.1
CUDNN Version: 8.9.7
Operating System + Version: Windows 10
Python Version (if applicable): 3.10.13
PyTorch Version (if applicable): 2.1.2+cu121

Solved: I was simply using the wrong timing method. I still think that running with larger batches should be faster per sample, but perhaps the limited scaling is because the model uses transformers.
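
For anyone hitting the same thing, a minimal sketch of the kind of timing I should have used from the start (warm-up runs plus explicit synchronization, reusing the input_data dictionary from above):

import time
import torch

# Warm-up runs so lazy initialization and first-call overheads are excluded.
for _ in range(5):
    trt_engine1.infer(input_data)

torch.cuda.synchronize()            # make sure all prior GPU work has finished
start = time.perf_counter()
for _ in range(20):
    trt_engine1.infer(input_data)
torch.cuda.synchronize()            # wait for the timed GPU work to finish
elapsed_ms = (time.perf_counter() - start) / 20 * 1000
print(f"Average inference time: {elapsed_ms:.2f} ms")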