TensorRT 5.0.2 batch size problem: inference time increases proportionally with larger batch size?

Hi,

I am new to TensorRT and started working on a project optimizing our detection model. However, when I set max_batch_size > 1, inference time increases proportionally. This happens both with the python2 onnx_to_tensorrt.py approach and when running trtexec. I first noticed the problem with my own detection model, then switched to the sample model under the TensorRT 5.0.6 release at /samples/python/yolov3_onnx, and it shows the same behavior (logs below).

Basically:
batch size = 1, FP16: inference time ≈ 10 ms
batch size = 4, FP16: inference time ≈ 39 ms
batch size = 1, FP32: inference time ≈ 34.4 ms
batch size = 4, FP32: inference time ≈ 129 ms

The environment I am using: Ubuntu 16.04,
Driver Version: 418.39, CUDA Version: 10.1
Using a Tesla T4, as shown:
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:3B:00.0 Off | 0 |
| N/A 68C P0 71W / 70W | 2881MiB / 15079MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:5E:00.0 Off | 0 |
| N/A 30C P8 9W / 70W | 10MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

Cuda compilation tools, release 7.5, V7.5.17
TensorRT-5.0.2.6.Ubuntu-16.04.4.x86_64-gnu.cuda-10.0.cudnn7.3.tar.gz

Can somebody explain?

Thanks.

xxxx:~/wendy-tensorrt/TensorRT-5.0.2.6.1/samples/python/yolov3_onnx$ ll *log
-rw-rw-r-- 1 aifi aifi 1646 Mar 27 20:35 yolov3cuda0bat1fp16.log
-rw-rw-r-- 1 aifi aifi 1640 Mar 27 20:27 yolov3cuda0bat1fp32.log
-rw-rw-r-- 1 aifi aifi 1645 Mar 27 20:30 yolov3cuda0bat4fp16.log
-rw-rw-r-- 1 aifi aifi 1639 Mar 27 20:27 yolov3cuda0bat4fp32.log

xxxx:~/wendy-tensorrt/TensorRT-5.0.2.6.1/samples/python/yolov3_onnx$ tail *bat1fp16.log
Average over 10 runs is 17.3652 ms (host walltime is 17.4882 ms, 99% percentile time is 18.1637).
Average over 10 runs is 13.7891 ms (host walltime is 13.8339 ms, 99% percentile time is 14.0738).
Average over 10 runs is 10.3425 ms (host walltime is 10.3858 ms, 99% percentile time is 10.3828).
Average over 10 runs is 10.3546 ms (host walltime is 10.397 ms, 99% percentile time is 10.496).
Average over 10 runs is 10.3128 ms (host walltime is 10.3576 ms, 99% percentile time is 10.3372).
Average over 10 runs is 10.3787 ms (host walltime is 10.4212 ms, 99% percentile time is 10.4978).
Average over 10 runs is 10.337 ms (host walltime is 10.3795 ms, 99% percentile time is 10.3498).
Average over 10 runs is 10.3523 ms (host walltime is 10.3958 ms, 99% percentile time is 10.4998).
Average over 10 runs is 10.3498 ms (host walltime is 10.3923 ms, 99% percentile time is 10.3673).
Average over 10 runs is 10.3173 ms (host walltime is 10.3602 ms, 99% percentile time is 10.4877).

xxxxx:~/wendy-tensorrt/TensorRT-5.0.2.6.1/samples/python/yolov3_onnx$ tail *bat4fp16.log
Average over 10 runs is 45.7616 ms (host walltime is 45.8084 ms, 99% percentile time is 60.2297).
Average over 10 runs is 38.8411 ms (host walltime is 38.8866 ms, 99% percentile time is 39.2171).
Average over 10 runs is 39.1089 ms (host walltime is 39.1546 ms, 99% percentile time is 39.8079).
Average over 10 runs is 39.1751 ms (host walltime is 39.2205 ms, 99% percentile time is 39.6014).
Average over 10 runs is 39.1758 ms (host walltime is 39.221 ms, 99% percentile time is 39.9894).
Average over 10 runs is 39.0968 ms (host walltime is 39.1409 ms, 99% percentile time is 39.594).
Average over 10 runs is 38.8867 ms (host walltime is 38.9312 ms, 99% percentile time is 39.467).
Average over 10 runs is 39.0935 ms (host walltime is 39.1381 ms, 99% percentile time is 39.9408).
Average over 10 runs is 39.272 ms (host walltime is 39.3178 ms, 99% percentile time is 40.1793).
Average over 10 runs is 39.0277 ms (host walltime is 39.0727 ms, 99% percentile time is 39.4325).

xxxxx:~/wendy-tensorrt/TensorRT-5.0.2.6.1/samples/python/yolov3_onnx$ tail *bat1fp32.log
Average over 10 runs is 44.0139 ms (host walltime is 44.0628 ms, 99% percentile time is 67.2938).
Average over 10 runs is 34.338 ms (host walltime is 34.3821 ms, 99% percentile time is 35.2482).
Average over 10 runs is 34.3672 ms (host walltime is 34.4122 ms, 99% percentile time is 35.2721).
Average over 10 runs is 34.5807 ms (host walltime is 34.6255 ms, 99% percentile time is 35.3508).
Average over 10 runs is 34.2389 ms (host walltime is 34.286 ms, 99% percentile time is 34.777).
Average over 10 runs is 34.3533 ms (host walltime is 34.3982 ms, 99% percentile time is 34.7647).
Average over 10 runs is 34.4273 ms (host walltime is 34.4717 ms, 99% percentile time is 35.0437).
Average over 10 runs is 34.348 ms (host walltime is 34.3925 ms, 99% percentile time is 35.1601).
Average over 10 runs is 34.4369 ms (host walltime is 34.4814 ms, 99% percentile time is 35.0536).
Average over 10 runs is 34.3577 ms (host walltime is 34.4018 ms, 99% percentile time is 34.7955).

xxxxxx:~/wendy-tensorrt/TensorRT-5.0.2.6.1/samples/python/yolov3_onnx$ tail *bat4fp32.log
Average over 10 runs is 137.07 ms (host walltime is 137.14 ms, 99% percentile time is 215.524).
Average over 10 runs is 128.768 ms (host walltime is 128.812 ms, 99% percentile time is 129.53).
Average over 10 runs is 128.943 ms (host walltime is 128.987 ms, 99% percentile time is 129.602).
Average over 10 runs is 129.165 ms (host walltime is 129.208 ms, 99% percentile time is 129.451).
Average over 10 runs is 129.211 ms (host walltime is 129.255 ms, 99% percentile time is 129.747).
Average over 10 runs is 129.358 ms (host walltime is 129.4 ms, 99% percentile time is 130.052).
Average over 10 runs is 129.455 ms (host walltime is 129.498 ms, 99% percentile time is 130.278).
Average over 10 runs is 129.661 ms (host walltime is 129.704 ms, 99% percentile time is 130.023).
Average over 10 runs is 129.802 ms (host walltime is 129.848 ms, 99% percentile time is 130.595).
Average over 10 runs is 129.841 ms (host walltime is 129.887 ms, 99% percentile time is 130.664).

Hi,

Batch size indicates the number of inputs.
For an input tensor of shape (N, C, H, W), the batch size is the value of N.

Taking images as an example:
batch size = 1 -> one image is inferred per run.
batch size = 2 -> two images are inferred per run.

Since the computational work is proportional to N, the execution time will increase as N becomes bigger.
In general, the execution time will follow:

T(N=1) < T(N=k) < k*T(N=1)
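Plugging the FP16 averages from the logs in this thread into that inequality shows the reported numbers are consistent with near-linear scaling (a batch-4 run costs slightly less than four batch-1 runs, so per-image latency drops a little). A small sketch, using approximate values read off the logs above:

```python
# Sanity-check the scaling reported in this thread using the FP16 log averages.
def per_image_ms(batch_latency_ms, batch_size):
    """Latency attributed to each image in the batch."""
    return batch_latency_ms / batch_size

fp16_b1 = 10.34  # ms per run at batch size 1 (from the bat1fp16 log)
fp16_b4 = 39.10  # ms per run at batch size 4 (from the bat4fp16 log)

scaling = fp16_b4 / fp16_b1  # how much slower one batch-4 run is
print(f"batch-4 run is {scaling:.2f}x a batch-1 run")
print(f"per-image latency: {per_image_ms(fp16_b1, 1):.2f} ms (N=1) vs "
      f"{per_image_ms(fp16_b4, 4):.2f} ms (N=4)")

# Check the expected bound: T(N=1) < T(N=4) < 4 * T(N=1)
assert fp16_b1 < fp16_b4 < 4 * fp16_b1
```

So the batch-4 run is about 3.78x a batch-1 run, i.e. within the T(N=1) < T(N=k) < k*T(N=1) bound, just much closer to the linear end than the documentation's "almost identical" phrasing suggests.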

Thanks.

Hi,

I am encountering the same issue: latency increases proportionally with batch size. TRT 5.1.5.0, C++ API, network converted from UFF.
However, the SDK documentation implies that increasing the batch size should not have a large impact on latency. The documentation states: "Often the time taken to compute results for batch size N=1 is almost identical to batch sizes up to N=16 or N=32." (https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html)

Is the documentation wrong or am I missing something?

@yinghe2000
Did you solve this problem? I am facing the same issue.

@AastaLLL, do you mean that when processing a batch of images with pure TensorRT, the Jetson doesn't run the whole batch at the same time, i.e., the images in a batch are processed one by one? If so, how does the TensorRT-TensorFlow integrated library manage to process a batch of images at the same time?

Hi LoveNvidia,

Please open a new topic for your issue. Thanks