Dear Team,
We have a question about optimizing/improving inference performance by using a larger inference batch size.
Environment:
Jetson AGX Xavier
GStreamer: 1.14.5
Jetpack: 4.6
CUDA Version: cuda_10.2_r440
Operating System + Version: Ubuntu 18.04.6 LTS
TensorRT Version: 8.0.1-1+cuda10.2
Python Version: 3.6.9
We are trying to speed up inference for our TensorRT engine, which was converted from an ONNX model.
The model is based on the MobileNetV1 architecture.
We ran experiments with different batch sizes and observed the results below.
We measured inference execution time and found that the per-image inference time drops below the
batch-1 baseline only for batch sizes of 1024 and above. Is this expected?
Please help us analyze the behavior in the table below in terms of throughput (QPS), latency,
and inference time (sec).
Batch size | Throughput | Latency | Total TRT engine inference time (sec) | Per-image TRT engine inference time (sec) |
---|---|---|---|---|
1 | 3786.82 qps | 0.262451 ms | 0.001962185 | 0.001962185 |
4 | 1671.8 qps | 0.599274 ms | 1.157729626 | 0.289432407 |
8 | 1254.14 qps | 0.800781 ms | 1.18355608 | 0.14794451 |
16 | 810.952 qps | 1.23877 ms | 1.15067625 | 0.071917266 |
32 | 494.04 qps | 2.03979 ms | 1.163462639 | 0.036358207 |
64 | 280.011 qps | 3.60278 ms | 1.182549238 | 0.018477332 |
128 | 152.859 qps | 6.59839 ms | 1.182056665 | 0.009234818 |
256 | 79.2021 qps | 12.7529 ms | 1.196300983 | 0.004673051 |
512 | 40.6503 qps | 24.9807 ms | 1.210321426 | 0.002363909 |
1024 | 19.2564 qps | 52.2966 ms | 1.353152514 | 0.001321438 |
2048 | 9.14893 qps | 110.009 ms | 1.396543026 | 0.00068190 |
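For reference, the last column can be reproduced from the other measurements: per-image time is the total engine time divided by the batch size, and the effective image rate is the reported batch throughput multiplied by the batch size. A minimal sketch (values copied from the table above; the interpretation that one "query" = one batch is an assumption based on how the qps numbers scale):

```python
# Derive per-image inference time and effective images/sec from the
# measurements in the table. Assumes "throughput" is batches per second
# (one query = one batch), which matches how the qps values scale here.

measurements = [
    # (batch_size, throughput_qps, total_engine_time_s)
    (1,    3786.82, 0.001962185),
    (512,  40.6503, 1.210321426),
    (1024, 19.2564, 1.353152514),
    (2048, 9.14893, 1.396543026),
]

for batch, qps, total_s in measurements:
    per_image_s = total_s / batch      # matches the table's last column
    images_per_s = qps * batch         # effective per-image throughput
    print(f"batch={batch:5d}  per-image={per_image_s:.9f} s  "
          f"effective={images_per_s:9.1f} images/s")
```

This shows the expected amortization: per-batch throughput (qps) falls as the batch grows, while per-image time and effective images/sec improve, because fixed per-launch overhead is spread over more images.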