No speedup on batch size larger than 1

I’ve set up tensorRT to work with my yolov3 model, where I’m running inference on each frame of a video stream. With a single video stream, processing one frame at a time, the tensorRT version of the model gets a solid speedup over the regular model (going from 43 fps to 57 fps). However, when I process larger batches, e.g. 5 different videos with one frame from each batched together into a batch size of 5, I see no speedup from tensorRT.

I’m trying to understand why I see a speedup with a batch size of 1 but not with a batch size of 5. Any ideas why this might be happening, or what I can look into to improve batch performance? I’m running in float32, but would still expect the tensorRT model to be faster at larger batch sizes.
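For context, the multi-stream case just stacks one preprocessed frame per video into a single input tensor, along these lines (the 416×416 input size and the random frames are illustrative stand-ins, not from my actual pipeline):

```python
import numpy as np

# One preprocessed frame per video stream, each CHW float32
# (416x416 is an assumption; substitute your model's input size)
frames = [np.random.rand(3, 416, 416).astype(np.float32) for _ in range(5)]

# Stack along a new leading axis -> a single (5, 3, 416, 416) batch
batch = np.stack(frames, axis=0)
print(batch.shape)  # (5, 3, 416, 416)
```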

Here is an outline of my steps for creating and running the tensorRT engine:

  1. Export the yolo model to onnx using torch.onnx.export with the dynamic_axes parameter
  2. Convert onnx to tensorRT engine
    • parse onnx model
    • create a single optimization profile for a specific batch size: profile.set_shape(input_name, min=(batch_size, *shape), opt=(batch_size, *shape), max=(batch_size, *shape))
    • build engine
  3. Load the tensorRT engine + context
    • select the right tensorRT engine based on input batch size to inference function
    • Set the optimization profile (this must be done before setting the binding shape): context.active_optimization_profile = 0
    • Set the binding shape: context.set_binding_shape(0, (BATCH_SIZE, 3, IMAGE_SIZE, IMAGE_SIZE))

Not sure if there’s anything else I should be doing, but these steps seem fine for handling inference with larger batch sizes. I’m running the latest TensorRT 7 release with the EXPLICIT_BATCH flag set (this seems to be required), but with a dynamic shape for the batch dimension.

Is there anything I’m missing or worth trying to determine why this is happening?

Hi @prathikn,
Please share your model and script, along with the following system information, so that we can help you better.

  • Linux distro and version
  • GPU type
  • Nvidia driver version
  • CUDA version
  • CUDNN version
  • Python version [if using python]
  • Tensorflow and PyTorch version
  • TensorRT version


I’m using a standard yolov3-spp model for predicting a single class, which you can see here:

For generating the tensorRT engine, here is the script I use:

def create_optimization_profiles(builder, inputs, batch_size): 
    # Creates a tensorRT optimization profile for each network input,
    # pinning min/opt/max to the same batch size
    profiles = []
    for inp in inputs:
        profile = builder.create_optimization_profile()
        shape = inp.shape[1:]
        profile.set_shape(inp.name, min=(batch_size, *shape), opt=(batch_size, *shape), max=(batch_size, *shape))
        profiles.append(profile)

    return profiles

def build_engine(onnx_file_path, engine_file_path, batch_size, verbose=True):
    logger = trt.Logger(trt.Logger.VERBOSE) if verbose else trt.Logger()
    builder = trt.Builder(logger)
    config = builder.create_builder_config()

    # Specifies that the network should have an explicit batch size (required in tensorRT 7.0.0+)
    explicit_batch = [1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)]
    network = builder.create_network(*explicit_batch)
    parser = trt.OnnxParser(network, logger)

    # Define standard settings for the tensorRT builder environment. Since a
    # builder config is passed to build_engine, workspace size and precision
    # flags belong on the config (the deprecated builder.max_workspace_size
    # and builder.fp16_mode attributes are ignored when building with a config)
    config.max_workspace_size = 1 << 30
    config.set_flag(trt.BuilderFlag.FP16)
    # config.set_flag(trt.BuilderFlag.STRICT_TYPES)
    builder.max_batch_size = batch_size

    # Parse onnx model
    with open(onnx_file_path, 'rb') as onnx_model:
        if not parser.parse(onnx_model.read()):
            print("ERROR: Failed to parse onnx model.")
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            return None

    # Add optimization profiles
    inputs = [network.get_input(i) for i in range(network.num_inputs)]
    opt_profiles = create_optimization_profiles(builder, inputs, batch_size)
    for profile in opt_profiles:
        config.add_optimization_profile(profile)

    # Explicitly mark the output layer so the engine knows where to expect final outputs
    if network.num_outputs == 0:
        last_layer = network.get_layer(network.num_layers - 1)
        network.mark_output(last_layer.get_output(0))

    print('Building tensorRT engine...')
    engine = builder.build_engine(network, config)
    print('Successfully built engine')

    with open(engine_file_path, 'wb') as f:
        f.write(engine.serialize())

    return engine
Here is system information:

  • Ubuntu 18.04.4 LTS, x86-64, Linux 4.15.0-101-generic
  • GPU: GeForce RTX 2080 Ti
  • CUDA version: 10.2
  • CUDNN version: 7.6.5
  • Python version: 3.7
  • Pytorch: 1.5
  • TensorRT:

Once the engine is created, I simply load it, set the optimization profile and binding shape as described above, and run it using context.execute_async. Is there anything else I should be doing here?
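For reference, my inference path looks roughly like the sketch below (this is a simplified illustration, assuming a single input binding at index 0 and an engine whose profile covers the batch size; buffer reuse and error handling are omitted):

```python
import numpy as np

def run_inference(engine, context, batch):
    """Run one numpy batch (N, 3, H, W) through a tensorRT 7 engine
    built with a dynamic batch dimension. Sketch only."""
    # Imported lazily so the sketch can be read without a GPU present
    import tensorrt as trt
    import pycuda.driver as cuda
    import pycuda.autoinit  # creates a CUDA context

    # With explicit batch, the profile must be selected *before*
    # the binding shape is set
    context.active_optimization_profile = 0
    context.set_binding_shape(0, batch.shape)  # e.g. (5, 3, 416, 416)

    # Allocate device buffers sized from the now-resolved binding shapes
    stream = cuda.Stream()
    bindings, outputs = [], []
    for i in range(engine.num_bindings):
        shape = context.get_binding_shape(i)
        dtype = trt.nptype(engine.get_binding_dtype(i))
        device_mem = cuda.mem_alloc(int(np.prod(shape)) * np.dtype(dtype).itemsize)
        bindings.append(int(device_mem))
        if engine.binding_is_input(i):
            cuda.memcpy_htod_async(device_mem, np.ascontiguousarray(batch, dtype=dtype), stream)
        else:
            host_out = np.empty(shape, dtype=dtype)
            outputs.append((host_out, device_mem))

    # Note: explicit-batch engines should use execute_async_v2, which has
    # no batch_size argument (execute_async is for implicit-batch engines)
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    for host_out, device_mem in outputs:
        cuda.memcpy_dtoh_async(host_out, device_mem, stream)
    stream.synchronize()
    return [host_out for host_out, _ in outputs]
```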

Hi @AakankshaS, any update here on ideas/what to look into?

Hi @prathikn,
Sincere apologies for the delayed response.
Are you still facing the issue?
I tried working with yolov3 and could not reproduce the issue.
Could you please share your model?