trtexec throughput scales linearly with batch size, even at batch=1000?

Hi there,

I’m heavily using your trtexec tool to measure the throughput of an Orin system.

I noticed that if I set --batch=N, the reported inference throughput increases N times, even when N=100 or 1000, yet the host wall time and GPU compute time barely change. That doesn’t make sense.

Below I attach two logs: batch=1 and batch=1000.
The throughput increases from 154.354 qps to 155814 qps, i.e. roughly 1000 times.

That’s too good to be true. Please advise. Thanks.

==== trtexec version v8401 ==================
trtexec --version
&&&& RUNNING TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --version

======== resnet50 352x672, batch=1 throughput: 154.354 qps ============
I ran with your NVIDIA ResNet-50 model with 352x672 input:

[08/31/2022-16:12:18] [I] === Performance summary ===
[08/31/2022-16:12:18] [I] Throughput: 154.354 qps
[08/31/2022-16:12:18] [I] Latency: min = 6.71631 ms, max = 6.80017 ms, mean = 6.73104 ms, median = 6.72803 ms, percentile(99%) = 6.78564 ms
[08/31/2022-16:12:18] [I] Enqueue Time: min = 6.24182 ms, max = 6.53992 ms, mean = 6.38183 ms, median = 6.38116 ms, percentile(99%) = 6.49634 ms
[08/31/2022-16:12:18] [I] H2D Latency: min = 0.257324 ms, max = 0.302002 ms, mean = 0.260259 ms, median = 0.26001 ms, percentile(99%) = 0.268799 ms
[08/31/2022-16:12:18] [I] GPU Compute Time: min = 6.44946 ms, max = 6.53284 ms, mean = 6.46433 ms, median = 6.46167 ms, percentile(99%) = 6.51953 ms
[08/31/2022-16:12:18] [I] D2H Latency: min = 0.00463867 ms, max = 0.00817871 ms, mean = 0.00644009 ms, median = 0.00614929 ms, percentile(99%) = 0.00793457 ms
[08/31/2022-16:12:18] [I] Total Host Walltime: 3.01903 s
[08/31/2022-16:12:18] [I] Total GPU Compute Time: 3.01238 s
[08/31/2022-16:12:18] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[08/31/2022-16:12:18] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[08/31/2022-16:12:18] [I] Explanations of the performance metrics are printed in the verbose logs.
[08/31/2022-16:12:18] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=nvidia_resnet50_352x672.engine --int8 --useDLACore=0 --allowGPUFallback --useSpinWait --batch=1

======== resnet50 352x672, batch=1000 throughput: 155814 qps ============
[08/31/2022-16:13:01] [I] === Performance summary ===
[08/31/2022-16:13:01] [I] Throughput: 155814 qps
[08/31/2022-16:13:01] [I] Latency: min = 6.58948 ms, max = 6.97827 ms, mean = 6.61973 ms, median = 6.61157 ms, percentile(99%) = 6.77606 ms
[08/31/2022-16:13:01] [I] Enqueue Time: min = 5.75806 ms, max = 6.9129 ms, mean = 6.34814 ms, median = 6.36295 ms, percentile(99%) = 6.80569 ms
[08/31/2022-16:13:01] [I] H2D Latency: min = 0.203369 ms, max = 0.34375 ms, mean = 0.210522 ms, median = 0.209198 ms, percentile(99%) = 0.2612 ms
[08/31/2022-16:13:01] [I] GPU Compute Time: min = 6.38049 ms, max = 6.76807 ms, mean = 6.40335 ms, median = 6.39771 ms, percentile(99%) = 6.49202 ms
[08/31/2022-16:13:01] [I] D2H Latency: min = 0.00390625 ms, max = 0.00805664 ms, mean = 0.0058536 ms, median = 0.00561523 ms, percentile(99%) = 0.00793457 ms
[08/31/2022-16:13:01] [I] Total Host Walltime: 3.01641 s
[08/31/2022-16:13:01] [I] Total GPU Compute Time: 3.00957 s
[08/31/2022-16:13:01] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[08/31/2022-16:13:01] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[08/31/2022-16:13:01] [I] Explanations of the performance metrics are printed in the verbose logs.
[08/31/2022-16:13:01] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=nvidia_resnet50_352x672.engine --int8 --useDLACore=0 --allowGPUFallback --useSpinWait --batch=1000

Hi,

If there are enough resources (both memory and compute), TensorRT can run the batched tasks in parallel.

Maybe you can give a more complicated model a try.
If the resources are fully occupied, part of the inference has to wait, which slows down the performance.

Thanks.

I tried my big model first and saw the problem. Then I tried your company’s model and saw the same issue, so it is not related to model size. I also tried a batch size of 10k, and the throughput became 10k times bigger. It doesn’t make sense.

Thanks for the update.

We are checking this internally and will share more information with you.

Do you observe the same behavior in GPU mode?
Thanks.

Hi,

We can reproduce this issue internally.

Please note that the maximum batch size is decided at build time.
When running inference with a serialized engine, the real batch size doesn’t change.

So you will get a similar execution time no matter what --batch value you pass.
However, the qps value is calculated from the inference time and the --batch value, so it becomes incorrect. (In your logs, 155814 qps / 1000 ≈ 156 engine executions per second, which matches the ~154 qps you measured at batch=1.)

To benchmark a larger batch, please build the engine with the desired batch size (or maxBatch) first, and then run that engine instead.
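
For example, something like the following should work if you rebuild from an ONNX model that has a dynamic batch dimension. This is only a rough sketch: the file names, the input tensor name "input", and the 3x352x672 shape are placeholders for your model, and the DLA options are omitted for simplicity.

trtexec --onnx=model_dynamic_batch.onnx --minShapes=input:1x3x352x672 --optShapes=input:32x3x352x672 --maxShapes=input:32x3x352x672 --int8 --saveEngine=model_batch32.engine

trtexec --loadEngine=model_batch32.engine --shapes=input:32x3x352x672 --int8 --useSpinWait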

Thanks.

Thanks. The ONNX model has input 1x3x224x224, and the trtexec command with an ONNX model doesn’t accept parameters such as --batch or --maxBatch.

I tried to modify the model to a dynamic batch size with the Python script in this link: Changing Batch SIze · Issue #2182 · onnx/onnx · GitHub. I do see the model input changed to batch size N, but the output is not batch N though.

Anyway, I tried the new batched model and ran it with trtexec. Now the performance is only half that of the original model. I also tried the engine file with the option --shapes=data:32x3x224x224, and the throughput is still only half.

I also tried the script at this link: Changing Batch SIze · Issue #2182 · onnx/onnx · GitHub. This one does produce an N-sized output. My original throughput was 1270 qps. With batch=2, it becomes 656 qps (so frames per second is 656x2=1312, slightly better than 1270).

With batch=32, it becomes 36 qps. That surprised me (so frames per second is 36x32=1152, worse than 1270).
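
For reference, what these batch-changing scripts do is roughly the following. This is just a minimal sketch assuming the onnx Python package; the file names are placeholders, and setting the batch dimension on the outputs as well is what makes the output N-sized:

import onnx

model = onnx.load("resnet50_1x3x224x224.onnx")

# Make the batch dimension symbolic ("N") on every graph input and output,
# so both sides of the model accept a dynamic batch size.
for tensor in list(model.graph.input) + list(model.graph.output):
    tensor.type.tensor_type.shape.dim[0].dim_param = "N"

onnx.checker.check_model(model)
onnx.save(model, "resnet50_Nx3x224x224.onnx")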

What’s the best way to run the original model with batch size 2 or more?

Hi,

Sorry for the late reply.

It depends on the model complexity and the GPU/memory resources.
For example, if the GPU resources are not enough to run batch=N concurrently, part of the batch will have to wait for resources, which causes some delay.

Thanks.