Batch Inference using BatchSize=8 takes nearly as long as 8 individual runs of BatchSize=1


I am experimenting with object recognition. I have been using an existing project that builds a network with the TensorRT APIs and then uses it to run inference. The code that does the inference is here.

I have had success using the sample code with BATCH_SIZE set to the default (1). On an RTX3070 (compute capability 8.6, 5888 CUDA cores) I have consistently been getting 17ms inference time, and on a Quadro K620 (compute capability 5.0, 384 CUDA cores) I am getting around 199ms. I am happy with this performance.

I did some reading about TensorRT and according to the documentation “A batch of inputs identical in shape and size can be computed on different layers of the neural network in parallel”, which suggests that a batch might take a similar time as a single inference if inference in the GPU is done in parallel.

I tested this by setting BATCH_SIZE = 8. I regenerated the engine file and ran again. But in this case inference was taking 100ms for the RTX3070 and 1447ms for the K620 (the code outputs inferencing time in ms). I had Task Manager open and for the selected GPU it showed no more than about 6% utilization during the actual inferencing. I suppose I was expecting to see significant usage of the GPU if there is parallelizing during inferencing.

Is this result expected? The K620 takes nearly the same time to do one run at BATCH_SIZE=8 as it takes to do 8 separate BATCH_SIZE=1 inference runs (1447ms versus 8 × 199ms = 1592ms), while the RTX3070 did only slightly better, taking about 6 times as long as a single inference.

Is there anything that I could do to improve this? The project moderator indicated that his results were similar with changing batch size.


TensorRT Version:
GPU Type: RTX 3070 and Quadro K620
Nvidia Driver Version:
CUDA Version: cuda_11.1.1_456.81
CUDNN Version: cudnn-11.2-windows-x64-v8.1.1.33
Operating System + Version: Windows 10 Pro build 21H1
Environment: Visual Studio 2017, C++ project
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
    Here is the guide I followed to build the project in Windows.
  • Exact steps/commands to run your repro
    First run with BATCH_SIZE=1 to build the engine and get a baseline for inference time (using command line parameters as shown below)
    Set BATCH_SIZE=8, recompile, and run with command line parameters "-s test1.wts test1.engine l" to build the engine file
    Then run again to do inference with command line parameters "-d test1.engine ./SAMPLES", where SAMPLES is a folder of jpg images to run inference on
    I can provide VS project if required.
  • Full traceback of errors encountered


We recommend trying the latest TensorRT 8.0 version. Please let us know if you still face this issue.

Hi @spolisetty

Thanks. I did a clean install of Windows 10, and installed CUDA 11.3.1, CUDNN 8.2.1 and TensorRT
I tried running on an M2000M GPU and with BATCH_SIZE = 1, I was getting 102ms per inference, but with BATCH_SIZE = 8 I was getting 745ms.

So the results are essentially the same, i.e. a batch of 8 still takes approximately 7.3 times as long as a single inference (745ms / 102ms). I am seeing essentially no benefit from increasing the batch size.
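
The ratios quoted in this thread can be checked with a short calculation. The sketch below is only arithmetic on the timings reported above; the names `MEASUREMENTS_MS`, `batch_scaling` and `summarize` are made up for illustration and are not part of the project or of TensorRT. It assumes that perfect batching would process 8 images in the time of 1, so a throughput speedup of 8.0 is the ideal and 1.0 means no benefit at all.

```python
# Timings reported in this thread, in milliseconds:
# (time for one BATCH_SIZE=1 run, time for one BATCH_SIZE=8 run).
MEASUREMENTS_MS = {
    "RTX 3070": (17.0, 100.0),
    "Quadro K620": (199.0, 1447.0),
    "M2000M": (102.0, 745.0),
}
BATCH_SIZE = 8


def batch_scaling(single_ms, batch_ms, batch_size):
    """Return (slowdown ratio, per-image ms inside the batch, throughput speedup)."""
    ratio = batch_ms / single_ms          # how many single runs the batch costs
    per_image_ms = batch_ms / batch_size  # effective latency per image in the batch
    speedup = single_ms / per_image_ms    # 1.0 = no benefit, batch_size = ideal
    return ratio, per_image_ms, speedup


def summarize(measurements=MEASUREMENTS_MS, batch_size=BATCH_SIZE):
    """Apply batch_scaling to every GPU in the measurements table."""
    return {
        gpu: batch_scaling(single_ms, batch_ms, batch_size)
        for gpu, (single_ms, batch_ms) in measurements.items()
    }


if __name__ == "__main__":
    for gpu, (ratio, per_image_ms, speedup) in summarize().items():
        print(
            f"{gpu}: batch of {BATCH_SIZE} costs {ratio:.2f}x a single run, "
            f"{per_image_ms:.1f} ms/image, {speedup:.2f}x throughput"
        )
```

On these numbers the RTX 3070 gains roughly a third in throughput from batching, while the K620 and M2000M gain only about 10%, which matches the observation that batching gives little benefit here.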


Could you please share a minimal repro script and the ONNX model so that we can try it on our end and assist you better?