Hi there,
I’m heavily using your trtexec tool to measure inference throughput on an Orin system.
I noticed that if I set --batch=N, the reported inference throughput increases by a factor of N, even for N=100 or 1000, while the host walltime and GPU compute time barely change. That doesn’t make sense to me.
Below I attach two logs: batch=1 and batch=1000.
The throughput increases from 154.354 qps to 155814 qps, i.e., roughly 1000 times.
That’s too good to be true. Please advise. Thanks.
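As a rough sanity check, I estimated from the two logs below how many enqueues each run actually performed. This assumes trtexec computes the reported qps as enqueues × batch / walltime; that is my guess, I have not verified it against the trtexec source:

# Back-of-the-envelope check on the two logs below, assuming (my guess,
# not verified) that trtexec reports qps = (enqueues * batch) / walltime.
runs = {
    "batch=1":    {"qps": 154.354, "walltime_s": 3.01903, "batch": 1},
    "batch=1000": {"qps": 155814,  "walltime_s": 3.01641, "batch": 1000},
}
for name, r in runs.items():
    enqueues = r["qps"] * r["walltime_s"] / r["batch"]
    print(f"{name}: ~{enqueues:.0f} enqueues in ~3 s")
# prints ~466 for batch=1 and ~470 for batch=1000

If that assumption is right, both runs perform roughly the same number of enqueues (~466 vs ~470) in the same ~3 s window, so the GPU seems to be doing the same amount of work and only the reported number is being multiplied by the batch value. Is that what is happening?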
==== trtexec version v8401 ==================
trtexec --version
&&&& RUNNING TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --version
======== resnet50 352x672, batch=1 throughput: 154.354 qps ============
I ran this with your NVIDIA ResNet-50 model with a 352x672 input:
[08/31/2022-16:12:18] [I] === Performance summary ===
[08/31/2022-16:12:18] [I] Throughput: 154.354 qps
[08/31/2022-16:12:18] [I] Latency: min = 6.71631 ms, max = 6.80017 ms, mean = 6.73104 ms, median = 6.72803 ms, percentile(99%) = 6.78564 ms
[08/31/2022-16:12:18] [I] Enqueue Time: min = 6.24182 ms, max = 6.53992 ms, mean = 6.38183 ms, median = 6.38116 ms, percentile(99%) = 6.49634 ms
[08/31/2022-16:12:18] [I] H2D Latency: min = 0.257324 ms, max = 0.302002 ms, mean = 0.260259 ms, median = 0.26001 ms, percentile(99%) = 0.268799 ms
[08/31/2022-16:12:18] [I] GPU Compute Time: min = 6.44946 ms, max = 6.53284 ms, mean = 6.46433 ms, median = 6.46167 ms, percentile(99%) = 6.51953 ms
[08/31/2022-16:12:18] [I] D2H Latency: min = 0.00463867 ms, max = 0.00817871 ms, mean = 0.00644009 ms, median = 0.00614929 ms, percentile(99%) = 0.00793457 ms
[08/31/2022-16:12:18] [I] Total Host Walltime: 3.01903 s
[08/31/2022-16:12:18] [I] Total GPU Compute Time: 3.01238 s
[08/31/2022-16:12:18] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[08/31/2022-16:12:18] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[08/31/2022-16:12:18] [I] Explanations of the performance metrics are printed in the verbose logs.
[08/31/2022-16:12:18] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=nvidia_resnet50_352x672.engine --int8 --useDLACore=0 --allowGPUFallback --useSpinWait --batch=1
======== resnet50 352x672, batch=1000 throughput: 155814 qps ============
[08/31/2022-16:13:01] [I] === Performance summary ===
[08/31/2022-16:13:01] [I] Throughput: 155814 qps
[08/31/2022-16:13:01] [I] Latency: min = 6.58948 ms, max = 6.97827 ms, mean = 6.61973 ms, median = 6.61157 ms, percentile(99%) = 6.77606 ms
[08/31/2022-16:13:01] [I] Enqueue Time: min = 5.75806 ms, max = 6.9129 ms, mean = 6.34814 ms, median = 6.36295 ms, percentile(99%) = 6.80569 ms
[08/31/2022-16:13:01] [I] H2D Latency: min = 0.203369 ms, max = 0.34375 ms, mean = 0.210522 ms, median = 0.209198 ms, percentile(99%) = 0.2612 ms
[08/31/2022-16:13:01] [I] GPU Compute Time: min = 6.38049 ms, max = 6.76807 ms, mean = 6.40335 ms, median = 6.39771 ms, percentile(99%) = 6.49202 ms
[08/31/2022-16:13:01] [I] D2H Latency: min = 0.00390625 ms, max = 0.00805664 ms, mean = 0.0058536 ms, median = 0.00561523 ms, percentile(99%) = 0.00793457 ms
[08/31/2022-16:13:01] [I] Total Host Walltime: 3.01641 s
[08/31/2022-16:13:01] [I] Total GPU Compute Time: 3.00957 s
[08/31/2022-16:13:01] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[08/31/2022-16:13:01] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[08/31/2022-16:13:01] [I] Explanations of the performance metrics are printed in the verbose logs.
[08/31/2022-16:13:01] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=nvidia_resnet50_352x672.engine --int8 --useDLACore=0 --allowGPUFallback --useSpinWait --batch=1000