Is a larger batch size always better when building an engine?

I have a question about building an engine with the TensorRT SDK. If I have enough GPU memory, is it better to set a larger batch size? In my case, I receive real-time video data over the internet and run inference with the TensorRT SDK, and I get different processing speeds with different values of batch_size (set when building the engine file). So what is the guiding principle for choosing the batch size?

I have a 2080 Ti GPU. My TensorRT-based application uses about 3 to 4 GB of GPU memory, and GPU utilization is about 80%. I wonder whether setting a larger batch_size would improve the application. It would also let me take full advantage of my GPU memory.

Note: in my situation, the latency of the data arriving from the internet can be ignored for this question.

Thanks for your replies!

Hi @zhouzhi9,
Batch size is the number of inputs processed in one inference call.
For an input tensor of shape (N, C, H, W), the batch size sets the value of N.

Take images as an example:
Batch size 1 → inference runs on one image at a time.
Batch size 2 → inference runs on two images at a time.
Since the computational work is proportional to N, the execution time per batch increases as N grows.
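
To see how a longer per-batch time can still mean more images per second, here is a minimal sketch of a toy cost model. It assumes each batch costs a fixed overhead plus work proportional to N. The function names and the millisecond values are hypothetical and chosen only for illustration. They are not TensorRT measurements.

```python
def batch_time_ms(n, overhead_ms=2.0, per_image_ms=1.0):
    """Toy model: fixed per-batch overhead plus work proportional to N.

    overhead_ms and per_image_ms are made-up illustrative values.
    """
    return overhead_ms + per_image_ms * n


def throughput_ips(n, overhead_ms=2.0, per_image_ms=1.0):
    """Images per second for batch size n under the toy model."""
    return 1000.0 * n / batch_time_ms(n, overhead_ms, per_image_ms)


if __name__ == "__main__":
    for n in (1, 2, 8, 16, 32):
        print(n, batch_time_ms(n), round(throughput_ips(n), 1))
```

Under this model the time per batch always grows with N, while the images-per-second figure rises and then flattens toward 1000 / per_image_ms as the fixed overhead is spread over more images.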

That said, a larger batch size may improve overall inference speed (images processed per second).
But the optimal batch size varies with the DL model and the hardware you are using.
For example, the optimal batch size for YoloV3 and YoloV4 may be around 8 to 16 for standalone TRT.
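
Since the best value depends on the model and the hardware, the practical approach is to measure it. Below is a rough sketch of a batch-size sweep in plain Python. `run_batch` is a hypothetical callable that you would supply yourself to run one inference with your engine at batch size n. `fake_run_batch` is only a stand-in so the sketch runs without a GPU. Neither is part of the TensorRT API.

```python
import time


def measure_throughput(run_batch, batch_sizes, iters=20, warmup=3):
    """Return {batch_size: images_per_second} for a caller-supplied run_batch(n)."""
    results = {}
    for n in batch_sizes:
        for _ in range(warmup):  # untimed runs so one-time setup cost is excluded
            run_batch(n)
        start = time.perf_counter()
        for _ in range(iters):
            run_batch(n)
        elapsed = time.perf_counter() - start
        results[n] = n * iters / elapsed
    return results


def fake_run_batch(n):
    # Hypothetical stand-in for real inference: it only sleeps to simulate work.
    time.sleep(0.0005 + 0.0001 * n)


if __name__ == "__main__":
    for n, ips in measure_throughput(fake_run_batch, [1, 8, 16]).items():
        print(f"batch={n:3d}  {ips:8.1f} images/s")
```

With a real engine, `run_batch` should only return once the GPU work has finished (for example after a stream synchronize), because CUDA launches are asynchronous and the timing would otherwise be too optimistic.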

You can refer to the below link.



OK, thanks for your reply.
We tested YoloV3/V4 on a 2080 Ti GPU: the speed improved noticeably going from batch size 1 to 16, but changed very little going from 16 to 32, 64, or 128.
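
When throughput flattens out like this, one reasonable rule of thumb is to pick the smallest batch size that is already close to the best result, since larger batches cost more memory and per-batch latency for little gain. A small sketch follows. The function name, the 5% tolerance, and the numbers in `example` are placeholders for illustration only. They are not measurements from this thread.

```python
def smallest_near_optimal(throughput, tolerance=0.05):
    """Smallest batch size whose throughput is within `tolerance` of the best."""
    best = max(throughput.values())
    for n in sorted(throughput):
        if throughput[n] >= (1.0 - tolerance) * best:
            return n


# Placeholder images/s values, NOT real benchmark results.
example = {1: 100.0, 8: 310.0, 16: 400.0, 32: 405.0, 64: 408.0, 128: 410.0}

if __name__ == "__main__":
    print(smallest_near_optimal(example))
```

With these placeholder values the function picks 16, because 32, 64, and 128 are each less than 5% faster.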


The optimal batch size varies with the DL model you are using, your GPU, and its compute power.
Most likely this is a memory or compute bandwidth limit, or it could simply be that there are more optimal CUDA kernels for cases such as the batch size being a multiple of 8. So you could expect a significant increase going from 1 to 8, since 1 is not a multiple of 8. And 16, 32, and 64 are also multiples of 8, so they are already fast (the same or a similar CUDA kernel is selected), and you should not expect much more improvement over 8.
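
If the multiple-of-8 explanation holds for your model, one thing to try is rounding the number of frames per batch up to the next multiple of 8 and filling the remainder with dummy frames. The helper below is a hypothetical sketch of that arithmetic only. Whether the padding actually pays off depends on the model, precision, and GPU, so it should be benchmarked rather than assumed.

```python
def round_up_to_multiple(n, multiple=8):
    """Round a batch size up to the next multiple (8 by default)."""
    if n <= 0 or multiple <= 0:
        raise ValueError("n and multiple must be positive")
    return ((n + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    for frames in (1, 5, 8, 13):
        padded = round_up_to_multiple(frames)
        print(frames, "->", padded, "with", padded - frames, "dummy frames")
```

Keep in mind that dummy frames still consume compute, so padding 1 frame up to 8 only helps if the faster kernel more than covers the extra work.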