The larger the batch size, the better when building an engine?

Hello,
I have a question about building an engine with the TensorRT SDK. Is it better to set a larger batch size when my GPU memory allows it? In my situation, I receive real-time video data over the internet and then run inference with the TensorRT SDK, and I get different processing speeds with different batch sizes (set when building the engine file). So what is the principle for choosing the batch size value?

I have a 2080 Ti GPU. The GPU memory used by my application (based on TensorRT) is about 3~4 GB, and the GPU utilization is about 80%. I wonder whether I can improve the application by setting a larger batch_size, which would also make fuller use of my GPU memory.

Note: for this question, there is no need to consider the latency of the data coming from the internet in my situation.

Thanks for your replies!

Hi @zhouzhi9,
Batch size indicates the number of inputs processed together in one inference.
For an input tensor of shape (N, C, H, W), the batch size is the value of N.

Taking images as an example:
Batch size equal to 1 → one image is inferenced per run.
Batch size equal to 2 → two images are inferenced per run.
Since the computational work is proportional to N, the execution time per run will increase as N becomes bigger.
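
As a rough illustration (the 3×608×608 YOLO-style input resolution below is just an assumed example), the batch size is simply the leading dimension of the input buffer, so memory and, roughly, compute per run grow linearly with N:

```python
import numpy as np

# Assumed example resolution for a YOLO-style detector: (C, H, W) = (3, 608, 608)
C, H, W = 3, 608, 608

batch_1 = np.zeros((1, C, H, W), dtype=np.float32)  # N = 1: one image per run
batch_2 = np.zeros((2, C, H, W), dtype=np.float32)  # N = 2: two images per run

# Input memory (and, roughly, the compute) scales linearly with N.
print(batch_1.nbytes / 2**20, "MiB for N = 1")
print(batch_2.nbytes / 2**20, "MiB for N = 2")
```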

Even so, a larger batch size may improve inference throughput.
But the optimal batch size will vary depending on which DL model you are using and which hardware you are working on.
For example, the optimal batch size for YoloV3 and YoloV4 may be around 8~16 for TRT standalone.

You can refer to the link below.
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#batching
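
For reference, here is a minimal sketch of building an engine whose optimization profile covers several batch sizes. It assumes the TensorRT 7.x/8.x Python API, an ONNX model file named model.onnx, and an input tensor named "input" with shape (N, 3, 608, 608); adjust these to your own network.

```python
import tensorrt as trt

# Sketch: TensorRT 7.x/8.x Python API, explicit-batch ONNX workflow.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:          # assumed model file name
    if not parser.parse(f.read()):
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30          # 1 GiB workspace (older-API field)

# One engine can cover a range of batch sizes; here it is tuned for N = 16.
profile = builder.create_optimization_profile()
profile.set_shape("input",                   # assumed input tensor name
                  min=(1, 3, 608, 608),
                  opt=(16, 3, 608, 608),
                  max=(32, 3, 608, 608))
config.add_optimization_profile(profile)

engine = builder.build_engine(network, config)   # deprecated in newer releases
with open("model.engine", "wb") as f:
    f.write(engine.serialize())
```

At runtime you then set the actual binding shape on the execution context (for example with set_binding_shape in these releases) to pick the batch size N used for each inference.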

Thanks!


OK, thanks for your reply.
We tested YoloV3/V4 on a 2080 Ti GPU; the speed improved noticeably from batch size 1 to 16, but changing from 16 to 32, 64, or 128 made very little difference.

Hi,

The optimal batch size will vary depending on which DL model you are using and on your GPU and compute power.
Most likely this is a memory / compute bandwidth issue, or it could simply be that there are more optimal CUDA kernels for batch sizes that are multiples of 8. So you can expect a significant increase when going from 1 -> 8, since 1 is not a multiple of 8. And 16/32/64 are all also multiples of 8, so they are already fast (the same or similar CUDA kernels are selected), and you might not expect much more improvement beyond 8. If you want to see where throughput flattens out on your own model, you can run a simple sweep like the sketch below.
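
This is only a generic timing skeleton: the dummy_infer stub is a placeholder for your real TensorRT execution-context call, and the input resolution is an assumed example.

```python
import time
import numpy as np

def benchmark(infer_fn, batch_size, iters=100, c=3, h=608, w=608):
    """Measure throughput (images/sec) of infer_fn at a given batch size."""
    batch = np.random.rand(batch_size, c, h, w).astype(np.float32)
    infer_fn(batch)                           # warm-up run
    start = time.perf_counter()
    for _ in range(iters):
        infer_fn(batch)
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed

# Placeholder: replace with your real TensorRT inference call.
def dummy_infer(batch):
    return batch.sum()

for n in (1, 8, 16, 32, 64):
    print(f"batch {n:3d}: {benchmark(dummy_infer, n):8.1f} images/sec")
```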

Thanks!