TRT inference on batches is not giving any performance benefit

nssreenivasalu · August 4, 2020, 1:21pm

Hi,
I am trying to run the NVIDIA sample code with a custom Faster RCNN model. When I increase the batch size, the inference time goes up almost linearly:
Batch size 1: 430 ms
Batch size 4: 1723 ms

The NVIDIA documentation states the following:
“Often the time taken to compute results for batch size N=1 is almost identical to batch sizes up to N=16”

I need help figuring out what I am missing.

Thanks in advance.

AastaLLL · August 5, 2020, 3:53am

Hi,

Faster RCNN is a large model, so the workspace size may have a noticeable impact on performance.
Inference with a larger batch size also requires more memory.

Have you tried adjusting the workspace size?

/usr/src/tensorrt/bin/trtexec --workspace=N

Could you first check whether the performance is limited by the available memory?
Thanks.

nssreenivasalu · August 5, 2020, 7:53am

No, I have not used the trtexec command.
But I do configure the workspace with the setMaxWorkspaceSize function in the C++ code. I even increased the size to 1 GB, but it made no difference.
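
For comparing the two ways of setting the workspace mentioned here, a small sketch of the unit conversion may help. It assumes that setMaxWorkspaceSize takes a size in bytes and that trtexec reads the N in --workspace=N as megabytes; the second point is an assumption worth confirming with trtexec --help for the installed version. The helper name trtexec_workspace_flag is hypothetical and only for illustration.

```python
def trtexec_workspace_flag(num_bytes):
    """Build a trtexec workspace flag from a size in bytes.

    Assumption: trtexec interprets --workspace=N as N megabytes (MiB),
    while setMaxWorkspaceSize in the C++ API takes bytes.
    """
    mib = num_bytes // (1 << 20)  # bytes -> MiB, rounded down
    return f"--workspace={mib}"


# 1 GB as passed to setMaxWorkspaceSize in bytes (1 << 30)
print(trtexec_workspace_flag(1 << 30))
```

Under that assumption, the 1 GB setting from the C++ code corresponds to a value of 1024 on the trtexec command line.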

I used the following script :
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/sampleFasterRCNN

I am also facing a similar issue with YOLOv4: the timings for batch sizes 2 and 4 increase linearly.

Thanks.

AastaLLL · August 6, 2020, 5:58am

Hi,

Thanks for testing.

We are going to try to reproduce this in our environment.
We will update you with the results later.

nssreenivasalu · August 6, 2020, 6:09am

Thanks,
Waiting for your timing results.

Regards

AastaLLL · August 7, 2020, 4:00am

Hi,

We can reproduce this issue in our environment.
We are checking with our internal team for more information.

We will update you once we get any feedback.
Thanks.

nssreenivasalu · August 7, 2020, 6:16am

Thanks for confirming @AastaLLL .

Eagerly waiting for your reply.

Regards

chowdhsk · August 11, 2020, 5:17am

@AastaLLL, any guidance on this issue? Why is this not in line with the TensorRT documentation, which says: "Larger batch sizes are almost always more efficient on the GPU. Extremely large batches, such as N > 2^16, can sometimes require extended index computation and so should be avoided if possible. Often the time taken to compute results for batch size N=1 is almost identical to batch sizes up to N=16 or N=32. In this case, increasing the batch size from N=1 to N=32 would dramatically improve total throughput with only a small effect on latency." (NVIDIA Deep Learning TensorRT Documentation)

We have also observed the same issue on a Tesla P100 GPU and an NVIDIA GeForce 2080 Super GPU.

Machine 1: Tesla P100 GPU with 64 GB CPU RAM

Batch size    Inference time
1             27
2             49
4             91

Machine 2: GeForce RTX 2070 Super with 32 GB CPU RAM

Batch size    Inference time
1             16
2             28
4             48
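
To make "almost linearly" concrete, the timings reported in this thread can be turned into a per-sample time and a per-sample speedup relative to batch size 1. This is only a sketch over the numbers quoted in the posts; the function names are illustrative, and the unit of the Machine 1 and Machine 2 timings is not stated in the post.

```python
# Timings copied from the posts in this thread: {batch size: time per batch}.
# The TX2 numbers are in ms; no unit is given for Machine 1 and Machine 2.
TIMINGS = {
    "Jetson TX2 (first post)": {1: 430, 4: 1723},
    "Machine 1 (Tesla P100)": {1: 27, 2: 49, 4: 91},
    "Machine 2": {1: 16, 2: 28, 4: 48},
}


def per_sample_time(batch_time, batch_size):
    """Time spent per input when a whole batch takes batch_time."""
    return batch_time / batch_size


def batching_gain(timings):
    """Per-sample speedup of each batch size relative to batch size 1.

    1.0 means no benefit from batching (purely linear scaling);
    a value equal to the batch size would mean batching is free.
    """
    base = per_sample_time(timings[1], 1)
    return {bs: base / per_sample_time(t, bs) for bs, t in timings.items()}


for name, timings in TIMINGS.items():
    gains = {bs: round(g, 2) for bs, g in batching_gain(timings).items()}
    print(name, gains)
```

A gain close to 1.0 at batch size 4 means the batch costs about as much as four separate runs, which is the behaviour reported on the TX2; the two desktop machines show a modest gain instead.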

See also the GitHub issue I raised: "Batch inference timings increasing almost linearly" (Issue #739, NVIDIA/TensorRT).

AastaLLL · August 12, 2020, 4:55am

Hi,

Thanks for your patience. We got a reply from our internal team.

In general, BS == 1 and BS == 4 will choose similar algorithms.
For example, we found that most layers choose cuDNN, which means the performance depends on how fast the device can run the kernels.
The TX2 has only 2 SMs, and BS == 1 already uses the available resources, so it is very likely that the inference time scales linearly from BS == 1 to BS == 4.
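
The reasoning above can be illustrated with a toy model. It is a deliberate simplification, not how TensorRT or cuDNN schedule work: a kernel is treated as a number of equal thread blocks per sample, the device runs a fixed number of blocks at once, and the run time is counted in "waves" of blocks. All numbers below are hypothetical.

```python
import math


def relative_time(batch_size, blocks_per_sample, concurrent_blocks):
    """Toy model: number of waves needed to run every block of a batch.

    blocks_per_sample and concurrent_blocks are hypothetical parameters,
    not values measured on any real device or network.
    """
    total_blocks = batch_size * blocks_per_sample
    return math.ceil(total_blocks / concurrent_blocks)


# Small device, large kernel: batch size 1 already needs many waves,
# so four times the work takes four times as long.
print(relative_time(1, blocks_per_sample=64, concurrent_blocks=4))
print(relative_time(4, blocks_per_sample=64, concurrent_blocks=4))

# Large device, small kernel: the device is mostly idle at batch size 1,
# so a larger batch still fits into the same single wave.
print(relative_time(1, blocks_per_sample=8, concurrent_blocks=160))
print(relative_time(16, blocks_per_sample=8, concurrent_blocks=160))
```

In this model, batching only helps while a single sample leaves part of the device idle. Once batch size 1 fills the device, as described for the TX2, the time grows in proportion to the batch size.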

Please understand that we cannot guarantee that a larger batch will bring a throughput benefit on every network and platform. Increasing the batch size to increase throughput is a possible improvement and worth a try.

It looks like the recommendation in our documentation may be confusing.
We will check how to state it more precisely.

Thanks.

chowdhsk · August 12, 2020, 11:52am

Thank you @AastaLLL, but then why does it also not work on the GPU systems I mentioned above? I know this thread is in the Jetson TX2 category, but I have also raised the issue on GitHub.

AastaLLL · August 13, 2020, 4:33am

Hi,

You can enable --verbose when testing.
If the chosen algorithm is cuDNN, the performance will depend on how fast the device can run the kernel.

Thanks.
