No speedup on batch size larger than 1

prathikn · July 5, 2020, 10:36pm

I’ve set up TensorRT for my yolov3 model, where I’m running inference on each frame of a video stream. When I run with a single video stream and process each frame one at a time, the TensorRT version of the model gets a solid speedup over the regular model (going from 43 fps to 57 fps). However, when I process frames in larger batches, for example 5 different videos with 1 frame from each video batched together into a batch size of 5, I don’t see any speedup with TensorRT.

I’m trying to understand why I see a speedup with a batch size of 1 but not with a batch size of 5. Any ideas why this might be happening, or what I can look into to improve batch performance? I’m running with float 32 but would still expect a speedup for larger batch sizes with the TensorRT model.
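
When comparing batch sizes, it can help to put both configurations on the same scale: aggregate frames per second, meaning batch size divided by the time taken for one batched inference call. A minimal sketch of that arithmetic follows. The function names are hypothetical, the batch-of-5 timing is made up for illustration, and only the 43 and 57 fps figures come from the post.

```python
def aggregate_fps(batch_size, seconds_per_batch):
    """Frames processed per second across all streams in the batch."""
    return batch_size / seconds_per_batch


def speedup(baseline_fps, optimized_fps):
    """Ratio of optimized throughput to baseline throughput."""
    return optimized_fps / baseline_fps


# Batch size 1 figures quoted in the post: 43 fps (regular) vs 57 fps (TensorRT).
batch1_speedup = speedup(43.0, 57.0)
print(f"batch-1 speedup: {batch1_speedup:.2f}x")

# Hypothetical timing, for illustration only: if one batch of 5 frames
# took 0.125 s, the aggregate throughput would be 40 frames per second.
print(aggregate_fps(5, 0.125))
```

Measuring both the regular and the TensorRT model this way at batch size 5 gives a ratio that is directly comparable to the batch size 1 ratio.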

Here is an outline of my steps for creating and running the tensorRT engine:

- Export yolo model to onnx using torch.onnx.export with the dynamic batches param
- Convert onnx to tensorRT engine
  - parse onnx model
  - create a single optimization profile for a specific batch size: profile.set_shape(inp.name, min=(batch_size, *shape), opt=(batch_size, *shape), max=(batch_size, *shape))
  - build engine
- Load the tensorRT engine + context
  - select the right tensorRT engine based on input batch size to inference function
  - Set the binding shape: context.set_binding_shape(0, (BATCH_SIZE, 3, IMAGE_SIZE))
  - Set the optimization profile: context.active_optimization_profile = 0
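
The run-time steps above can be sketched as a small helper. This is an illustration under assumptions, not the code used in the post: `prepare_context` and its parameters are hypothetical names, the input is assumed to be binding 0 with a (channels, height, width) layout, the 416x416 image size is a placeholder, and the profile is selected before the binding shape so that the shape is checked against that profile. `context` is expected to behave like a TensorRT 7 `IExecutionContext`; a stand-in object is used here so the snippet runs without a GPU.

```python
def prepare_context(context, batch_size, chw, profile_index=0, input_binding=0):
    """Select a profile, set the input shape, and verify the context is ready.

    `context` is assumed to behave like a TensorRT 7 IExecutionContext.
    """
    # Select the profile first so the shape is validated against its range.
    context.active_optimization_profile = profile_index
    shape = (batch_size, *chw)
    if not context.set_binding_shape(input_binding, shape):
        raise ValueError(f"shape {shape} rejected for profile {profile_index}")
    if not context.all_binding_shapes_specified:
        raise RuntimeError("some dynamic input shapes are still unspecified")
    return shape


class FakeContext:
    """Stand-in used only to make this sketch runnable without TensorRT."""

    def __init__(self, max_batch):
        self.max_batch = max_batch
        self.active_optimization_profile = None
        self.shape = None

    def set_binding_shape(self, binding, shape):
        # Mimic a profile whose maximum batch size is max_batch.
        if shape[0] > self.max_batch:
            return False
        self.shape = tuple(shape)
        return True

    @property
    def all_binding_shapes_specified(self):
        return self.shape is not None


ctx = FakeContext(max_batch=5)
print(prepare_context(ctx, batch_size=5, chw=(3, 416, 416)))
```

Checking the return value of `set_binding_shape` makes a mismatch between the requested batch size and the engine's profile fail loudly instead of silently.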

Not sure if there’s anything else I should be doing but these steps seem to be fine for handling inference with larger batch sizes. I’m running this on the latest TensorRT 7 version with EXPLICIT_BATCH parameter set (seems like this is required) but I do have a dynamic shape for the batch size.

Is there anything I’m missing or worth trying to determine why this is happening?

AakankshaS · July 6, 2020, 6:34am

Hi @prathikn,
Please share your model and script along with the below set of system information, so that we can help you better.

o Linux distro and version
o GPU type
o Nvidia driver version
o CUDA version
o CUDNN version
o Python version [if using python]
o Tensorflow and PyTorch version
o TensorRT version

Thanks!

prathikn · July 6, 2020, 6:13pm

I’m using a standard yolov3-spp model for predicting a single class, which you can see here: yolov3/models.py at c7f8dfcb8734d604482992b13d10420ea5eb3fd3 · ultralytics/yolov3 · GitHub

For generating the tensorRT engine, here is the script I use:

import tensorrt as trt

def create_optimization_profiles(builder, inputs, batch_size):
    # Creates TensorRT optimization profiles for a given batch size
    profiles = []
    for inp in inputs:
        profile = builder.create_optimization_profile()
        shape = inp.shape[1:]
        profile.set_shape(inp.name, min=(batch_size, *shape), opt=(batch_size, *shape), max=(batch_size, *shape))
        profiles.append(profile)

    return profiles

def build_engine(onnx_file_path, engine_file_path, batch_size, verbose=True):
    logger = trt.Logger(trt.Logger.VERBOSE) if verbose else trt.Logger()
    builder = trt.Builder(logger)
    config = builder.create_builder_config()

    # Specifies that network should have an explicit batch size (required in tensorRT 7.0.0+)
    explicit_batch = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(explicit_batch)
    parser = trt.OnnxParser(network, logger)

    # Define standard settings for tensorRT builder environment
    builder.max_workspace_size = 1 << 30
    builder.max_batch_size = batch_size
    builder.fp16_mode = True
    # builder.strict_type_constraints = True

    # Parse onnx model
    with open(onnx_file_path, 'rb') as onnx_model:
        if not parser.parse(onnx_model.read()):
            print("ERROR: Failed to parse onnx model.")
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            return
    
    # Add optimization profiles
    inputs = [network.get_input(i) for i in range(network.num_inputs)]
    opt_profiles = create_optimization_profiles(builder, inputs, batch_size)
    for profile in opt_profiles:
        config.add_optimization_profile(profile)

    # Explicitly set the output layer so engine knows where to expect final outputs
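    # (the ONNX parser normally marks outputs already; only fall back to the last layer if none are marked)
    if network.num_outputs == 0:
        last_layer = network.get_layer(network.num_layers - 1)
        network.mark_output(last_layer.get_output(0))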

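    print('Building tensorRT engine...')
    engine = builder.build_engine(network, config)
    # build_engine returns None when the build fails
    if engine is None:
        print("ERROR: Failed to build engine.")
        return
    print('Successfully built engine')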

    with open(engine_file_path, 'wb') as f:
        f.write(engine.serialize())
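
In TensorRT 7, `build_engine(network, config)` takes a builder config object, and options such as workspace size and precision flags can be set on that `config` directly. The sketch below shows that variant. It is a suggestion of something that may be worth checking, not a confirmed cause of the behaviour described in this thread. `configure_build` is a hypothetical helper, and the `tensorrt` usage is shown only in comments so the snippet runs without the library installed.

```python
def configure_build(config, workspace_bytes=1 << 30, precision_flags=()):
    """Apply build settings to an IBuilderConfig-like object.

    Returns the config so the call can be chained.
    """
    config.max_workspace_size = workspace_bytes
    for flag in precision_flags:
        config.set_flag(flag)
    return config


# With TensorRT 7 installed, usage would look roughly like:
#   import tensorrt as trt
#   config = builder.create_builder_config()
#   configure_build(config, precision_flags=[trt.BuilderFlag.FP16])
#   engine = builder.build_engine(network, config)


class FakeConfig:
    """Stand-in that records settings; only for running this sketch."""

    def __init__(self):
        self.max_workspace_size = 0
        self.flags = set()

    def set_flag(self, flag):
        self.flags.add(flag)


cfg = configure_build(FakeConfig(), precision_flags=["FP16"])
print(cfg.max_workspace_size, cfg.flags)
```

Leaving `precision_flags` empty keeps the build in float 32, which matches the precision mentioned in the original question.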

Here is system information:

Ubuntu 18.04.4 LTS, x86-64, Linux 4.15.0-101-generic
GPU: GeForce RTX 2080 Ti
CUDA version: 10.2
CUDNN version: 7.6.5
Python version: 3.7
Pytorch: 1.5
TensorRT: 7.0.0.11

Once this engine is created I simply load the engine, set the optimization profile and binding as described above and run the engine using context.execute_async. Is there anything else I should be doing here?
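
Because `context.execute_async` only enqueues work on a CUDA stream, throughput measurements are usually taken after a few warm-up iterations and with a stream synchronization before the timer is stopped. A minimal sketch of such a harness follows. `measure_fps`, `run_once`, and `synchronize` are hypothetical names, and the sleep-based workload at the end is a placeholder for the real inference call.

```python
import time


def measure_fps(run_once, synchronize, batch_size, warmup=10, iters=50):
    """Return aggregate frames per second for a batched inference callable."""
    for _ in range(warmup):
        run_once()
    synchronize()  # make sure warm-up work has finished before timing

    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    synchronize()  # wait for all enqueued work before stopping the clock
    elapsed = time.perf_counter() - start

    return batch_size * iters / elapsed


# Placeholder workload so the sketch runs anywhere. In a real setup, run_once
# would wrap the engine's asynchronous execute call and synchronize would be
# the CUDA stream's synchronize method.
fps = measure_fps(lambda: time.sleep(0.001), lambda: None, batch_size=5, iters=20)
print(f"{fps:.0f} frames per second")
```

Running the same harness for the regular model and the TensorRT engine at each batch size keeps the comparison consistent.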

prathikn · July 9, 2020, 2:41am

Hi @AakankshaS, any update here on ideas/what to look into?

AakankshaS · July 31, 2020, 11:04am

Hi @prathikn,
Sincere apologies for the delayed response.
Are you still facing the issue?
I tried working with yolov3 and could not reproduce the issue.
Could you please share your model?

Thanks!
