Inference time scales linearly with batch size when using a TensorRT model

Description

Inference time is linearly proportional to batch size when using a TensorRT engine for ScaledYOLOv4 object detection.
When I increase the batch size, the inference time increases linearly.

Environment

TensorRT Version: 7.2.2 and 7.0.0 (checked on both)
GPU Type: Tesla T4
Nvidia Driver Version: 455
CUDA Version: 11.1 (with TensorRT 7.2.2) and 10.2 (with TensorRT 7.0.0)
CUDNN Version: 8 (with TensorRT 7.2.2) and 7 (with TensorRT 7.0.0)
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.6.9
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):
nvcr.io/nvidia/tensorrt:20.12-py3 (TensorRT 7.2.2)
nvcr.io/nvidia/tensorrt:20.03-py3 (TensorRT 7.0.0)

FOR BATCH SIZE - 1:

Inference take: 48.5283 ms.
Inference take: 48.518 ms.
Inference take: 40.1897 ms.
Inference take: 40.0713 ms.
Inference take: 38.54 ms.
Inference take: 38.7829 ms.
Inference take: 38.6083 ms.
Inference take: 38.6635 ms.
Inference take: 38.1827 ms.
Inference take: 38.1016 ms

FOR BATCH SIZE - 2:

Inference take: 76.3045 ms.
Inference take: 74.9346 ms.
Inference take: 73.3341 ms.
Inference take: 73.9554 ms.
Inference take: 73.4185 ms.
Inference take: 75.4546 ms.
Inference take: 77.7809 ms.
Inference take: 78.3289 ms.
Inference take: 79.5533 ms.
Inference take: 79.0556 ms.
Inference take: 79.2939 ms.
Inference take: 77.214 ms.

FOR BATCH SIZE - 4:

Inference take: 158.327 ms.
Inference take: 157.001 ms.
Inference take: 157.107 ms.
Inference take: 154.237 ms.
Inference take: 155.899 ms.
Inference take: 157.408 ms.
Inference take: 155.758 ms.
Inference take: 155.906 ms.

I expected inference time not to scale proportionally with batch size. Can anything be done to improve the inference time when batching?
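
For context, here is a minimal sketch of the kind of synchronous timing loop that produces numbers like the above, using the TensorRT Python API with pycuda (the engine path, binding shapes, and iteration counts are illustrative placeholders rather than my exact script, and it assumes an explicit-batch engine built with fixed input shapes):

import time

import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(engine_path):
    # Deserialize a previously built TensorRT engine from disk.
    with open(engine_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

engine = load_engine("scaled_yolov4_bs4.engine")  # placeholder path
context = engine.create_execution_context()
stream = cuda.Stream()

# Allocate one pinned-host/device buffer pair per binding (input and outputs).
host_bufs, dev_bufs, bindings = [], [], []
for name in engine:
    dtype = trt.nptype(engine.get_binding_dtype(name))
    host = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(name)), dtype)
    dev = cuda.mem_alloc(host.nbytes)
    host_bufs.append(host)
    dev_bufs.append(dev)
    bindings.append(int(dev))

def infer_once():
    # H2D copies, enqueue inference, D2H copies, then wait for the stream.
    for host, dev, name in zip(host_bufs, dev_bufs, engine):
        if engine.binding_is_input(name):
            cuda.memcpy_htod_async(dev, host, stream)
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    for host, dev, name in zip(host_bufs, dev_bufs, engine):
        if not engine.binding_is_input(name):
            cuda.memcpy_dtoh_async(host, dev, stream)
    stream.synchronize()

for _ in range(5):       # warm-up runs
    infer_once()
for _ in range(10):      # timed runs
    start = time.time()
    infer_once()
    print("Inference take: %.4f ms." % ((time.time() - start) * 1000.0))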
Thanks in advance.

Hi, could you please share your model and script so that we can help you better?

Alternatively, you can try running your model with the trtexec command.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
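
For example, a command along these lines builds an engine from the ONNX model at a fixed batch size and reports timings (the file names and the input tensor name/shape are placeholders for your model; if the ONNX has a dynamic batch dimension, the --minShapes/--optShapes/--maxShapes options can be used instead):

trtexec --onnx=scaled_yolov4.onnx --explicitBatch --shapes=input:4x3x640x640 --saveEngine=scaled_yolov4_bs4.engine --verbose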

Thanks!

ScaledYOLOv4 model
I am using this model for object detection.

Model to ONNX conversion script

I followed this repository to convert my ONNX model to a TensorRT engine.
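
For reference, a typical explicit-batch ONNX-to-TensorRT build with the Python API looks roughly like the sketch below; this is a generic example, not the exact script from that repository, and it assumes the batch size is already fixed in the exported ONNX (paths and workspace size are placeholders):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

def build_engine(onnx_path, workspace_bytes=1 << 30):
    # Parse the ONNX graph and build a TensorRT engine from it.
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(EXPLICIT_BATCH)
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            return None
    config = builder.create_builder_config()
    config.max_workspace_size = workspace_bytes
    return builder.build_engine(network, config)

engine = build_engine("scaled_yolov4.onnx")
with open("scaled_yolov4_bs4.engine", "wb") as f:
    f.write(engine.serialize())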

I will try running my model with trtexec. But at the end of the day, I need to export my model as a Python library, so I can't use trtexec (it hinders my end goal).
Thanks!

Hi @bschandu67,

Could you please check the GPU utilization?
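
If it helps, GPU utilization and memory use during a run can be watched with a command like the following (polling once per second):

nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1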

Thank you.

GPU utilization was 83% (volatile) with 1.7 GB of memory used for the batch size 1 model, and 100% (volatile) with 2.7 GB of memory used for the batch size 4 model.
When running the bare-metal TensorRT engine for inference from Python, volatile GPU utilization was 46% for batch size 1 and 100% for batch size 4.

Hi @bschandu67,

Thanks for providing the GPU utilization; it looks fine. Could you please also provide the verbose engine build log and the per-layer inference performance?

build: trtexec --verbose .....
inference: trtexec --dumpProfile --separateProfileRun
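
For instance, with an already-built engine the per-layer timings can be collected with something like this (the engine file name is a placeholder):

trtexec --loadEngine=scaled_yolov4_bs4.engine --dumpProfile --separateProfileRun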

Thank you.

Hi @bschandu67,

Have you been able to reduce the runtime with higher batch sizes?

Thanks!

Yes, but not by a huge margin.
With batching, we were able to shave roughly 4 ms off per frame.

Right. I realized from this thread that when inference already uses the GPU's full capacity at a batch size of 1, increasing the batch size doesn't help much.

Was that also the case for your problem?

Thanks a lot!