TRT inference on batches is not giving any performance benefit

Hi,
I am trying to run NVIDIA sample code for a custom Faster RCNN model. When I increase batch size, the time taken for inference execution goes up almost linearly.
Batch size: 1, time: 430 ms
Batch size: 4, time: 1723 ms

The NVIDIA documentation states the following:
“Often the time taken to compute results for batch size N=1 is almost identical to batch sizes up to N=16”

I need help figuring out what I am missing.

Thanks in advance.

Hi,

Faster RCNN is a large model, so the workspace size can have a noticeable impact on performance.
Inference with a larger batch size also requires more memory.

Have you tried adjusting the workspace size?

/usr/src/tensorrt/bin/trtexec --workspace=N

Could you first check whether performance is limited by the available memory?
Thanks.

No, I have not used the trtexec command.
But I do configure the workspace using the setMaxWorkspaceSize function in my C++ code. I even increased the size to 1 GB, but it made no difference.

I used the following script:

I am also facing a similar issue with YOLOv4: the timings for batch sizes 2 and 4 increase linearly.

Thanks.

Hi,

Thanks for testing.

We are going to try to reproduce this in our environment.
We will share our findings later.

Thanks.
Waiting for your timing results.

Regards

Hi,

This issue can be reproduced in our environment.
We are checking with our internal team for more information.

We will update you once we get feedback.
Thanks.

Thanks for confirming @AastaLLL .

Eagerly waiting for your reply.

Regards

@AastaLLL, any guidance on this issue? Why is this not in line with the TensorRT documentation, which says: "Larger batch sizes are almost always more efficient on the GPU. Extremely large batches, such as N > 2^16, can sometimes require extended index computation and so should be avoided if possible. Often the time taken to compute results for batch size N=1 is almost identical to batch sizes up to N=16 or N=32. In this case, increasing the batch size from N=1 to N=32 would dramatically improve total throughput with only a small effect on latency." https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#batching

We have also observed the same issue on a Tesla P100 GPU and an NVIDIA GeForce 2080 Super GPU.

Machine 1: Tesla P100 GPU with 64 GB CPU RAM

Batch size    Inference time
1             27
2             49
4             91

Machine 2: GeForce RTX 2070 Super with 32 GB CPU RAM

Batch size    Inference time
1             16
2             28
4             48
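
To make the comparison with the documentation concrete, the timings reported so far in this thread can be turned into ratios relative to batch size 1. This is only a small arithmetic sketch: the function name `scaling` and the dictionary labels are made up for illustration, and since the unit of the two desktop tables is not stated, only unit-free ratios are computed.

```python
# Timings as reported in this thread: {batch_size: inference_time}.
# The first post's figures are in ms; the unit of the two desktop tables is
# not stated, which does not matter here because only ratios are computed.
timings = {
    "First post (Faster RCNN)": {1: 430, 4: 1723},
    "Tesla P100": {1: 27, 2: 49, 4: 91},
    "GeForce RTX 2070 Super": {1: 16, 2: 28, 4: 48},
}


def scaling(times):
    """Return {batch: (total time vs batch 1, per-image time vs batch 1)}."""
    base = times[1]
    return {n: (t / base, (t / n) / base) for n, t in times.items()}


for name, times in timings.items():
    for n, (total, per_image) in scaling(times).items():
        print(f"{name:26s} N={n}: {total:.2f}x total, {per_image:.2f}x per image")
```

With these numbers, the first post's run is almost exactly linear (about 4.01x the time at N=4), while the two desktop GPUs do gain something per image (roughly 0.84x and 0.75x at N=4), though far less than the "almost identical" wording in the documentation suggests.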

See also the GitHub issue I raised: https://github.com/NVIDIA/TensorRT/issues/739

Hi,

Thanks for your patience. We got a reply from our internal team.

In general, BS=1 and BS=4 will choose similar algorithms.
For example, we found that most layers choose cuDNN, which means the performance depends on how fast the device can run the kernels.
The TX2 has only 2 SMs, and BS=1 already uses the available resources, so it is very likely that inference time scales linearly from BS=1 to BS=4.
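
The saturation argument can be illustrated with a deliberately simplified toy model. It is only a sketch under stated assumptions: it assumes a kernel runs in "waves" of thread blocks, and the function `batch_time`, its parameters, and all the numbers are hypothetical, not measurements or actual cuDNN behavior.

```python
import math


def batch_time(batch_size, blocks_per_image, concurrent_blocks, wave_time_ms=1.0):
    """Toy model: a kernel executes in waves of thread blocks.

    concurrent_blocks is how many blocks the whole GPU can run at once
    (roughly SM count times resident blocks per SM). All values here are
    illustrative assumptions, not measurements.
    """
    total_blocks = batch_size * blocks_per_image
    waves = math.ceil(total_blocks / concurrent_blocks)
    return waves * wave_time_ms


# Hypothetical layer that needs 64 thread blocks per image.
small_gpu = [batch_time(n, 64, concurrent_blocks=16) for n in (1, 2, 4)]
large_gpu = [batch_time(n, 64, concurrent_blocks=512) for n in (1, 2, 4)]

print(small_gpu)  # device is already full at N=1, so time grows linearly
print(large_gpu)  # device has spare capacity, so time stays flat
```

In this model the small device needs 4, 8, and 16 waves for N=1, 2, and 4, while the large device finishes each case in a single wave. A larger batch only comes "for free" while the device still has idle capacity at N=1.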

Please understand that we cannot guarantee that a larger batch will bring a throughput benefit on every network and platform. Increasing the batch size is a possible way to improve throughput and is worth a try.

It looks like the recommendation in our documentation may cause some confusion.
We will look into making the statement more precise.

Thanks.

Thank you @AastaLLL, but then why does batching also not help on the desktop GPU systems I mentioned above? I know this thread is in the Jetson TX2 category, but I have also raised the issue on GitHub.

Hi,

You can enable --verbose when testing.
If the chosen algorithm is cuDNN, the performance will depend on how fast the device can run the kernel.

Thanks.