Lower FPS for engine file with higher batch size vs engine file with lower batch size

• Hardware Platform: NVIDIA RTX A5000
• TensorRT Version: 8.5.1-1+cuda11.8
• NVIDIA GPU Driver Version (valid for GPU only): 535.183.01

I have a model trained with the TAO Toolkit (3.22.05) that is built into two engine files, one with batch size 1 and one with batch size 30. When I test them using the trtexec tool, I get lower FPS for the engine with the higher batch size. Help me understand this odd behaviour.

Following are the commands and logs, respectively.

trtexec --loadEngine=/opt/paralaxiom/vast/vast-platform/nvast/ds_vast_pipeline/d26_apr1924_yolov4_resnet18_epoch_045.etlt_b1_gpu0_int8.engine --batch=1

b1.log (13.4 KB)

trtexec --loadEngine=/opt/paralaxiom/vast/vast-platform/nvast/ds_vast_pipeline/d26_apr1924_yolov4_resnet18_epoch_045.etlt_b30_gpu0_int8.engine --batch=30

b30.log (13.2 KB)

How did you build these two engine files?

The engine files are generated from a DeepStream pipeline. Following are the nvinfer configs for batch size 30 and 1, respectively:
pgie_d26_apr1924_yolov4_resnet18_epoch_045_drop8_b30.txt (4.3 KB)
pgie_d26_apr1924_apm_fframe_yolov4_resnet18_epoch_045_drop8_b1.txt (4.1 KB)

Since you are using 22.05, the exported file is still an .etlt file. Could you use tao_toolkit_recipes/tao_forum_faq/FAQ.md at main · NVIDIA-AI-IOT/tao_toolkit_recipes · GitHub to change it to an .onnx file, and then run trtexec to check again? Refer to https://docs.nvidia.com/tao/tao-toolkit/text/trtexec_integration/trtexec_yolo_v4.html

It makes more sense to test the engine file created in DeepStream rather than building an engine file with trtexec and testing that, since on deployment we let DeepStream build the engine files from the .etlt model files.

This is just to narrow down the issue.

I generated the engine file for batch size 30 with the following command:

trtexec --onnx=/onnx-path --calib=/cal-bin-path --int8 --saveEngine=/save-engine-path --maxShapes=Input:30x3x1056x1888 --workspace=2048

Surprisingly, the FPS I got for this engine file is around 6, compared to around 130 FPS for the engine file generated in DeepStream. Attaching the log:
onnx.engine_b30.log (48.9 KB)

From the log, some layers are falling back to FP32. How did you generate cal.bin?

cal.bin is generated from TAO. It is the same calibration file used in the DeepStream pipeline to generate the engine file that gives me ~130 FPS.

Could you share the onnx file and cal.bin?

Hi, I cannot reproduce the issue with your onnx file and cal.bin, because 30 × the bs1 GPU Compute Time is larger than the bs30 GPU Compute Time.

$ trtexec --onnx=d26_apr1924_yolov4_resnet18_epoch_045.onnx --minShapes=Input:1x3x1056x1888 --optShapes=Input:1x3x1056x1888  --maxShapes=Input:1x3x1056x1888 --saveEngine=fp32_bs1.engine --workspace=20480
$ trtexec --onnx=d26_apr1924_yolov4_resnet18_epoch_045.onnx --minShapes=Input:30x3x1056x1888 --optShapes=Input:30x3x1056x1888  --maxShapes=Input:30x3x1056x1888 --saveEngine=fp32_bs30.engine --workspace=20480
$ trtexec --onnx=d26_apr1924_yolov4_resnet18_epoch_045.onnx --minShapes=Input:1x3x1056x1888 --optShapes=Input:1x3x1056x1888  --maxShapes=Input:1x3x1056x1888 --saveEngine=fp16_bs1.engine --fp16 --workspace=20480
$ trtexec --onnx=d26_apr1924_yolov4_resnet18_epoch_045.onnx --minShapes=Input:30x3x1056x1888 --optShapes=Input:30x3x1056x1888  --maxShapes=Input:30x3x1056x1888 --saveEngine=fp16_bs30.engine --fp16 --workspace=20480
$ trtexec --onnx=d26_apr1924_yolov4_resnet18_epoch_045.onnx --minShapes=Input:1x3x1056x1888 --optShapes=Input:1x3x1056x1888  --maxShapes=Input:1x3x1056x1888 --saveEngine=int8_bs1.engine --int8 --calib=d26_apr1924_yolov4_resnet18_epoch_045.bin --workspace=20480
$ trtexec --onnx=d26_apr1924_yolov4_resnet18_epoch_045.onnx --minShapes=Input:30x3x1056x1888 --optShapes=Input:30x3x1056x1888  --maxShapes=Input:30x3x1056x1888 --saveEngine=int8_bs30.engine --int8 --calib=d26_apr1924_yolov4_resnet18_epoch_045.bin --workspace=20480
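
As a side note, the six builds above could be scripted; here is a minimal Python sketch (a hypothetical wrapper, not part of the thread, assuming trtexec is on PATH and the ONNX and calibration files are in the working directory) that simply replays the flags shown in the commands above:

# Hypothetical helper: rebuilds the six engines (fp32/fp16/int8 x bs1/bs30)
# using the same trtexec flags as the commands above.
import subprocess

ONNX = "d26_apr1924_yolov4_resnet18_epoch_045.onnx"
CALIB = "d26_apr1924_yolov4_resnet18_epoch_045.bin"

for precision in ("fp32", "fp16", "int8"):
    for bs in (1, 30):
        shape = f"Input:{bs}x3x1056x1888"
        cmd = [
            "trtexec",
            f"--onnx={ONNX}",
            f"--minShapes={shape}",
            f"--optShapes={shape}",
            f"--maxShapes={shape}",
            f"--saveEngine={precision}_bs{bs}.engine",
            "--workspace=20480",
        ]
        if precision == "fp16":
            cmd.append("--fp16")
        elif precision == "int8":
            cmd += ["--int8", f"--calib={CALIB}"]
        # Capture the full build/benchmark log per engine for later comparison.
        with open(f"{precision}_bs{bs}.log", "w") as log:
            subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)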

fp32_bs1:
[08/15/2024-05:04:09] [I] Throughput: 68.9179 qps
[08/15/2024-05:04:09] [I] Latency: min = 15.5417 ms, max = 16.423 ms, mean = 15.8086 ms, median = 15.7469 ms, percentile(99%) = 16.3875 ms
[08/15/2024-05:04:09] [I] End-to-End Host Latency: min = 28.3966 ms, max = 29.0903 ms, mean = 28.7187 ms, median = 28.7378 ms, percentile(99%) = 29.0209 ms
[08/15/2024-05:04:09] [I] Enqueue Time: min = 1.17188 ms, max = 1.43268 ms, mean = 1.24283 ms, median = 1.2316 ms, percentile(99%) = 1.42371 ms
[08/15/2024-05:04:09] [I] H2D Latency: min = 1.25488 ms, max = 2.12689 ms, mean = 1.35963 ms, median = 1.28418 ms, percentile(99%) = 1.89673 ms
[08/15/2024-05:04:09] [I] GPU Compute Time: min = 14.2705 ms, max = 14.6831 ms, mean = 14.4403 ms, median = 14.4507 ms, percentile(99%) = 14.6278 ms
[08/15/2024-05:04:09] [I] D2H Latency: min = 0.00732422 ms, max = 0.0109863 ms, mean = 0.00869192 ms, median = 0.00854492 ms, percentile(99%) = 0.0107422 ms
[08/15/2024-05:04:09] [I] Total Host Walltime: 3.04711 s
[08/15/2024-05:04:09] [I] Total GPU Compute Time: 3.03247 s

fp32_bs30:
[08/15/2024-05:39:28] [I] Throughput: 2.44966 qps
[08/15/2024-05:39:28] [I] Latency: min = 396.667 ms, max = 402.534 ms, mean = 399.33 ms, median = 399.164 ms, percentile(99%) = 402.534 ms
[08/15/2024-05:39:28] [I] End-to-End Host Latency: min = 738.49 ms, max = 747.976 ms, mean = 742.091 ms, median = 742.102 ms, percentile(99%) = 747.976 ms
[08/15/2024-05:39:28] [I] Enqueue Time: min = 1.05322 ms, max = 1.22046 ms, mean = 1.18385 ms, median = 1.19608 ms, percentile(99%) = 1.22046 ms
[08/15/2024-05:39:28] [I] H2D Latency: min = 27.3975 ms, max = 28.3652 ms, mean = 28.0859 ms, median = 28.1329 ms, percentile(99%) = 28.3652 ms
[08/15/2024-05:39:28] [I] GPU Compute Time: min = 368.506 ms, max = 374.238 ms, mean = 371.228 ms, median = 371.267 ms, percentile(99%) = 374.238 ms
[08/15/2024-05:39:28] [I] D2H Latency: min = 0.0136719 ms, max = 0.0184326 ms, mean = 0.016394 ms, median = 0.0162354 ms, percentile(99%) = 0.0184326 ms
[08/15/2024-05:39:28] [I] Total Host Walltime: 4.0822 s
[08/15/2024-05:39:28] [I] Total GPU Compute Time: 3.71228 s

fp16_bs1:
[08/15/2024-06:03:34] [I] Throughput: 143.516 qps
[08/15/2024-06:03:34] [I] Latency: min = 7.94495 ms, max = 8.11273 ms, mean = 8.02031 ms, median = 8.02319 ms, percentile(99%) = 8.10742 ms
[08/15/2024-06:03:34] [I] End-to-End Host Latency: min = 13.614 ms, max = 13.9358 ms, mean = 13.788 ms, median = 13.8126 ms, percentile(99%) = 13.9253 ms
[08/15/2024-06:03:34] [I] Enqueue Time: min = 1.01782 ms, max = 1.28809 ms, mean = 1.03474 ms, median = 1.02502 ms, percentile(99%) = 1.21899 ms
[08/15/2024-06:03:34] [I] H2D Latency: min = 1.02673 ms, max = 1.10205 ms, mean = 1.06123 ms, median = 1.06006 ms, percentile(99%) = 1.09723 ms
[08/15/2024-06:03:34] [I] GPU Compute Time: min = 6.86386 ms, max = 7.02368 ms, mean = 6.95004 ms, median = 6.96167 ms, percentile(99%) = 7.01746 ms
[08/15/2024-06:03:34] [I] D2H Latency: min = 0.00708008 ms, max = 0.0106201 ms, mean = 0.00903714 ms, median = 0.0090332 ms, percentile(99%) = 0.0104675 ms
[08/15/2024-06:03:34] [I] Total Host Walltime: 3.02406 s
[08/15/2024-06:03:34] [I] Total GPU Compute Time: 3.01632 s

fp16_bs30:
[08/15/2024-06:48:33] [I] Throughput: 5.4144 qps
[08/15/2024-06:48:33] [I] Latency: min = 203.022 ms, max = 205.748 ms, mean = 204.117 ms, median = 204.081 ms, percentile(99%) = 205.748 ms
[08/15/2024-06:48:33] [I] End-to-End Host Latency: min = 349.855 ms, max = 353.964 ms, mean = 351.773 ms, median = 351.872 ms, percentile(99%) = 353.964 ms
[08/15/2024-06:48:33] [I] Enqueue Time: min = 1.13431 ms, max = 1.25342 ms, mean = 1.22334 ms, median = 1.224 ms, percentile(99%) = 1.25342 ms
[08/15/2024-06:48:33] [I] H2D Latency: min = 27.4159 ms, max = 28.335 ms, mean = 28.1219 ms, median = 28.1774 ms, percentile(99%) = 28.335 ms
[08/15/2024-06:48:33] [I] GPU Compute Time: min = 174.941 ms, max = 177.789 ms, mean = 175.977 ms, median = 175.845 ms, percentile(99%) = 177.789 ms
[08/15/2024-06:48:33] [I] D2H Latency: min = 0.0136719 ms, max = 0.0204163 ms, mean = 0.0183945 ms, median = 0.0184937 ms, percentile(99%) = 0.0204163 ms
[08/15/2024-06:48:33] [I] Total Host Walltime: 3.69385 s
[08/15/2024-06:48:33] [I] Total GPU Compute Time: 3.51954 s

int8_bs1:
[08/15/2024-08:16:28] [I] Throughput: 182.769 qps
[08/15/2024-08:16:28] [I] Latency: min = 6.36328 ms, max = 7.59448 ms, mean = 6.40866 ms, median = 6.38892 ms, percentile(99%) = 6.59973 ms
[08/15/2024-08:16:28] [I] End-to-End Host Latency: min = 10.1303 ms, max = 11.1544 ms, mean = 10.812 ms, median = 10.8091 ms, percentile(99%) = 11.0181 ms
[08/15/2024-08:16:28] [I] Enqueue Time: min = 0.553101 ms, max = 1.22742 ms, mean = 1.02463 ms, median = 1.01935 ms, percentile(99%) = 1.10059 ms
[08/15/2024-08:16:28] [I] H2D Latency: min = 0.916016 ms, max = 2.12402 ms, mean = 0.935331 ms, median = 0.918274 ms, percentile(99%) = 1.08249 ms
[08/15/2024-08:16:28] [I] GPU Compute Time: min = 5.43848 ms, max = 5.76923 ms, mean = 5.45895 ms, median = 5.45587 ms, percentile(99%) = 5.61865 ms
[08/15/2024-08:16:28] [I] D2H Latency: min = 0.00756836 ms, max = 0.0231934 ms, mean = 0.0143764 ms, median = 0.0142822 ms, percentile(99%) = 0.0178223 ms
[08/15/2024-08:16:28] [I] Total Host Walltime: 3.01473 s
[08/15/2024-08:16:28] [I] Total GPU Compute Time: 3.00788 s

int8_bs30:
[08/15/2024-07:43:25] [I] Throughput: 7.78976 qps
[08/15/2024-07:43:25] [I] Latency: min = 150.749 ms, max = 152.283 ms, mean = 151.657 ms, median = 151.681 ms, percentile(99%) = 152.283 ms
[08/15/2024-07:43:25] [I] End-to-End Host Latency: min = 245.879 ms, max = 248.016 ms, mean = 247.156 ms, median = 247.399 ms, percentile(99%) = 248.016 ms
[08/15/2024-07:43:25] [I] Enqueue Time: min = 0.884354 ms, max = 1.23975 ms, mean = 1.19224 ms, median = 1.20685 ms, percentile(99%) = 1.23975 ms
[08/15/2024-07:43:25] [I] H2D Latency: min = 27.8704 ms, max = 28.1263 ms, mean = 27.9991 ms, median = 28.0181 ms, percentile(99%) = 28.1263 ms
[08/15/2024-07:43:25] [I] GPU Compute Time: min = 122.676 ms, max = 124.241 ms, mean = 123.632 ms, median = 123.682 ms, percentile(99%) = 124.241 ms
[08/15/2024-07:43:25] [I] D2H Latency: min = 0.0141602 ms, max = 0.0311279 ms, mean = 0.0262346 ms, median = 0.0266724 ms, percentile(99%) = 0.0311279 ms
[08/15/2024-07:43:25] [I] Total Host Walltime: 3.33771 s
[08/15/2024-07:43:25] [I] Total GPU Compute Time: 3.21443 s

I ran this on an A40 machine.

I calculated the FPS based on this post: Low FPS for pruned tao toolkit models on deepstream - #19 by Fiona.Chen
Looking at your answer, I am guessing the times (GPU Compute, Enqueue, etc.) output by trtexec are for a whole batch and not a single frame.

My result shows that bs30 has a higher FPS than bs1, which means I cannot reproduce the issue you mentioned.

In my result, taking int8 as an example:
bs30 FPS is 30 * 1000 / 123.632 = 242.7
bs1 FPS is 1 * 1000 / 5.45895 = 183.2

You can also use qps to compare:
bs30 FPS is 30 * 7.78976 = 233.7
bs1 FPS is 182.769

Both ways show that bs30 has the higher FPS.
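
For completeness, here is the same arithmetic as a small Python sketch (a worked example only; the numbers are taken from the int8 summaries above):

# FPS from the mean GPU Compute Time (ms) of one enqueued batch:
# FPS = batch_size * 1000 / mean_gpu_compute_ms
def fps_from_compute_time(batch_size, mean_gpu_compute_ms):
    return batch_size * 1000.0 / mean_gpu_compute_ms

# FPS from the reported throughput (qps = batches, i.e. queries, per second):
# FPS = batch_size * qps
def fps_from_qps(batch_size, qps):
    return batch_size * qps

print(fps_from_compute_time(30, 123.632))  # ~242.7
print(fps_from_compute_time(1, 5.45895))   # ~183.2
print(fps_from_qps(30, 7.78976))           # ~233.7
print(fps_from_qps(1, 182.769))            # 182.769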

Based on your calculations, the log I attached in the topic question (Lower FPS for engine file with higher batch size vs engine file with lower batch size) for the batch size 30 engine file generated by DeepStream (b30.log) would give me an FPS of 30 * 1000 / (1.00687 + 6.37316 + 0.00874254) = 4060.21.
Why is there this much difference between the FPS of the engine file generated with trtexec and the one generated by DeepStream?
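
As an aside, here is a minimal sketch of how that number can be reproduced from a trtexec log (a hypothetical helper, not part of the thread; it assumes the standard "mean = X ms" summary lines shown above):

import re

# Pull the mean H2D Latency, GPU Compute Time and D2H Latency (ms) out of a
# trtexec summary log and compute per-frame FPS the same way as above:
# FPS = batch_size * 1000 / (mean_h2d + mean_gpu + mean_d2h)
def batch_fps_from_log(log_path, batch_size):
    pattern = re.compile(
        r"(H2D Latency|GPU Compute Time|D2H Latency):.*?mean = ([\d.]+) ms")
    means = {}
    with open(log_path) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                means[m.group(1)] = float(m.group(2))
    total_ms = sum(means[k] for k in ("H2D Latency", "GPU Compute Time", "D2H Latency"))
    return batch_size * 1000.0 / total_ms

# e.g. batch_fps_from_log("b30.log", 30) gives ~4060 with the means quoted above.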

What is the fps shown on deepstream log?

Around 4060. You can see the calculation in Lower FPS for engine file with higher batch size vs engine file with lower batch size - #15 by adithya.ajith

From the log, you are running trtexec with the command below and getting around 4060 FPS, right?
/usr/src/tensorrt/bin/trtexec --loadEngine=/opt/paralaxiom/vast/vast-platform/nvast/ds_vast_pipeline/d26_apr1924_yolov4_resnet18_epoch_045.etlt_b30_gpu0_int8.engine --batch=30

Yes, that’s right.

So, what about the other result that shows the “much difference”?
Did you ever share that command and the log?

I have shared the command and log in the topic question!

The FPS for the bs30 engine file generated by trtexec after I converted the model to ONNX is 242.7 (reference: Lower FPS for engine file with higher batch size vs engine file with lower batch size - #14 by Morganh).
The engine file in the topic question (reference: Lower FPS for engine file with higher batch size vs engine file with lower batch size) has a mean GPU Compute Time of 6.37316 ms (the command and logs are attached to the topic question for your reference). This translates to 4060.21 FPS, the calculation being 30 * 1000 / (1.00687 + 6.37316 + 0.00874254) = 4060.21, where 1.00687 ms is the H2D Latency and 0.00874254 ms is the D2H Latency.

What I mean by “much difference” in FPS is the gap between these two values, 4060.21 and 242.7. Both come from the same model; the only difference is that the engine files were created by DeepStream and trtexec respectively. Why are the values this far apart?