Optimization using inference batch size

Dear Team,

We have a question about optimization/performance improvement when using a larger inference batch size.

Environment:
Jetson AGX Xavier
GStreamer: 1.14.5
JetPack: 4.6
CUDA Version: cuda_10.2_r440
Operating System + Version: Ubuntu 18.04.6 LTS
TensorRT Version: 8.0.1-1+cuda10.2
Python Version: 3.6.9

We are trying to speed up inference for a TensorRT engine converted from an ONNX model.
The model is based on the MobileNetV1 architecture.
We ran experiments with different batch sizes and observed the results below.
We measured the inference execution time and found that a performance improvement is only
achieved for batch sizes above 1024. Is this expected?
Please help us analyze the behavior in the table below in terms of throughput (QPS), latency,
and inference time (s).

| Batch size | Throughput | Latency | Total TRT engine inference time (s) | Per-image TRT engine inference time (s) |
|---|---|---|---|---|
| 1 | 3786.82 qps | 0.262451 ms | 0.001962185 | 0.001962185 |
| 4 | 1671.8 qps | 0.599274 ms | 1.157729626 | 0.289432407 |
| 8 | 1254.14 qps | 0.800781 ms | 1.18355608 | 0.14794451 |
| 16 | 810.952 qps | 1.23877 ms | 1.15067625 | 0.071917266 |
| 32 | 494.04 qps | 2.03979 ms | 1.163462639 | 0.036358207 |
| 64 | 280.011 qps | 3.60278 ms | 1.182549238 | 0.018477332 |
| 128 | 152.859 qps | 6.59839 ms | 1.182056665 | 0.009234818 |
| 256 | 79.2021 qps | 12.7529 ms | 1.196300983 | 0.004673051 |
| 512 | 40.6503 qps | 24.9807 ms | 1.210321426 | 0.002363909 |
| 1024 | 19.2564 qps | 52.2966 ms | 1.353152514 | 0.001321438 |
| 2048 | 9.14893 qps | 110.009 ms | 1.396543026 | 0.00068190 |
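
For reference, here is a minimal sketch of the kind of timing we are describing, assuming the TensorRT Python API with an explicit-batch engine that has a dynamic batch dimension and PyCUDA for device buffers. The engine path, input shape, binding index, and batch size are illustrative placeholders, and only the engine execution itself is timed:

```python
# Minimal timing sketch (illustrative, not our exact script): measure
# per-batch and per-image execution time of a TensorRT engine built from
# the ONNX model with an explicit batch dimension. Engine path, input
# shape, and batch size are placeholders.
import time

import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

ENGINE_PATH = "mobilenet_v1.trt"   # illustrative path
BATCH_SIZE = 256                   # batch size under test
INPUT_SHAPE = (3, 224, 224)        # illustrative CHW input for MobileNetV1
RUNS = 20                          # timed iterations after warm-up

logger = trt.Logger(trt.Logger.WARNING)
with open(ENGINE_PATH, "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
# Resolve the dynamic batch dimension on the input binding (index 0 here).
context.set_binding_shape(0, (BATCH_SIZE,) + INPUT_SHAPE)

# Allocate a device buffer per binding; input data is omitted because we
# only time the engine execution itself.
dev_bufs = []
for i in range(engine.num_bindings):
    shape = context.get_binding_shape(i)
    dtype = np.dtype(trt.nptype(engine.get_binding_dtype(i)))
    dev_bufs.append(cuda.mem_alloc(trt.volume(shape) * dtype.itemsize))
bindings = [int(d) for d in dev_bufs]

stream = cuda.Stream()
for _ in range(10):                          # warm-up
    context.execute_async_v2(bindings, stream.handle)
stream.synchronize()

start = time.perf_counter()
for _ in range(RUNS):
    context.execute_async_v2(bindings, stream.handle)
stream.synchronize()
per_batch = (time.perf_counter() - start) / RUNS

print(f"batch {BATCH_SIZE}: {per_batch:.6f} s per batch, "
      f"{per_batch / BATCH_SIZE:.6f} s per image")
```

The per-image value is simply the measured per-batch time divided by the batch size, which matches how the last column of the table relates to the fourth.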

Hi @user9377 ,

I took the liberty of moving your topic to the Jetson AGX-specific forum to give it more visibility.

Please feel free to change it back if you think this is incorrect. You might also consider looking for information in the TensorRT-specific categories here on the forums, for example Deep Learning (Training & Inference) - NVIDIA Developer Forums and its TensorRT sub-category.

Thanks!