The "GPU Compute Time" doesn't change, when setting different batch size

Description

Hi team,

I am trying to use trtexec to run inference on an ONNX model with a dynamic shape. When I set the input shape to 8x3x224x224, the "GPU Compute Time" is nearly the same as that of the original ONNX model, whose input shape is 1x3x224x224. Does this make sense? Could you comment on this phenomenon?

Here are my commands and results (a short Python sketch for double-checking the executed shape follows the two runs):

(1) /usr/src/tensorrt/bin/trtexec --avgRuns=100 --duration=60 --onnx=swin_dynamic_dim.onnx --explicitBatch --minShapes=input:1x3x224x224 --optShapes=input:8x3x224x224 --maxShapes=input:32x3x224x224 --shapes=input:8x3x224x224 --saveEngine=./swin_dynamic_dim_opt8.engine
[07/04/2022-15:24:54] [I] === Performance summary ===
[07/04/2022-15:24:54] [I] Throughput: 237.656 qps
[07/04/2022-15:24:54] [I] Latency: min = 4.15625 ms, max = 5.3999 ms, mean = 4.19379 ms, median = 4.19141 ms, percentile(99%) = 4.21484 ms
[07/04/2022-15:24:54] [I] Enqueue Time: min = 4.13281 ms, max = 5.56323 ms, mean = 4.17808 ms, median = 4.17578 ms, percentile(99%) = 4.19922 ms
[07/04/2022-15:24:54] [I] H2D Latency: min = 0.0195312 ms, max = 0.0314941 ms, mean = 0.0226546 ms, median = 0.0234375 ms, percentile(99%) = 0.0235596 ms
[07/04/2022-15:24:54] [I] GPU Compute Time: min = 4.125 ms, max = 5.37109 ms, mean = 4.16483 ms, median = 4.1626 ms, percentile(99%) = 4.18652 ms
[07/04/2022-15:24:54] [I] D2H Latency: min = 0.00195312 ms, max = 0.0185547 ms, mean = 0.00622008 ms, median = 0.00585938 ms, percentile(99%) = 0.00976562 ms
[07/04/2022-15:24:54] [I] Total Host Walltime: 60.0069 s
[07/04/2022-15:24:54] [I] Total GPU Compute Time: 59.3946 s
[07/04/2022-15:24:54] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[07/04/2022-15:24:54] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[07/04/2022-15:24:54] [I] Explanations of the performance metrics are printed in the verbose logs.

(2) /usr/src/tensorrt/bin/trtexec --avgRuns=100 --duration=60 --onnx=swin.onnx --int8

[07/04/2022-15:20:18] [I] === Performance summary ===
[07/04/2022-15:20:18] [I] Throughput: 236.731 qps
[07/04/2022-15:20:18] [I] Latency: min = 4.15585 ms, max = 8.79932 ms, mean = 4.20709 ms, median = 4.18945 ms, percentile(99%) = 4.67383 ms
[07/04/2022-15:20:18] [I] Enqueue Time: min = 4.13693 ms, max = 9.19727 ms, mean = 4.19367 ms, median = 4.17578 ms, percentile(99%) = 4.62451 ms
[07/04/2022-15:20:18] [I] H2D Latency: min = 0.0195312 ms, max = 0.0400391 ms, mean = 0.0227486 ms, median = 0.0234375 ms, percentile(99%) = 0.0273438 ms
[07/04/2022-15:20:18] [I] GPU Compute Time: min = 4.12579 ms, max = 8.76465 ms, mean = 4.1779 ms, median = 4.16016 ms, percentile(99%) = 4.64746 ms
[07/04/2022-15:20:18] [I] D2H Latency: min = 0.00195312 ms, max = 0.034668 ms, mean = 0.00644053 ms, median = 0.00634766 ms, percentile(99%) = 0.0117188 ms
[07/04/2022-15:20:18] [I] Total Host Walltime: 60.0049 s
[07/04/2022-15:20:18] [I] Total GPU Compute Time: 59.347 s
[07/04/2022-15:20:18] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[07/04/2022-15:20:18] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[07/04/2022-15:20:18] [W] * GPU compute time is unstable, with coefficient of variance = 4.55863%.
[07/04/2022-15:20:18] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/04/2022-15:20:18] [I] Explanations of the performance metrics are printed in the verbose logs.
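
For what it's worth, below is a minimal Python sketch (TensorRT + PyCUDA) of how the shape that actually executes can be double-checked after deserializing the engine saved by command (1). The binding name "input" and the engine file name come from that command; everything else (dummy data, buffer handling) is purely illustrative and may need adjusting for the real model.

# Minimal sketch: verify the dynamic-shape engine really executes at batch 8.
# Assumptions: binding name "input" and the engine file from command (1) above.
import numpy as np
import pycuda.autoinit          # creates a default CUDA context
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("swin_dynamic_dim_opt8.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
input_idx = engine.get_binding_index("input")
context.set_binding_shape(input_idx, (8, 3, 224, 224))   # the batch size to test
assert context.all_binding_shapes_specified

# Allocate one device buffer per binding at the resolved (batch-8) shapes.
device_bufs, bindings = [], []
for i in range(engine.num_bindings):
    shape = tuple(context.get_binding_shape(i))
    dtype = trt.nptype(engine.get_binding_dtype(i))
    buf = cuda.mem_alloc(int(np.prod(shape)) * np.dtype(dtype).itemsize)
    device_bufs.append(buf)
    bindings.append(int(buf))
    if not engine.binding_is_input(i):
        out_idx, out_host = i, np.empty(shape, dtype=dtype)

# Run one synchronous inference on a dummy batch and check the output batch dim.
batch = np.random.rand(8, 3, 224, 224).astype(np.float32)
cuda.memcpy_htod(device_bufs[input_idx], np.ascontiguousarray(batch))
context.execute_v2(bindings)
cuda.memcpy_dtoh(out_host, device_bufs[out_idx])
print("executed input shape:", context.get_binding_shape(input_idx))
print("output shape        :", out_host.shape)   # leading dimension should be 8

If the printed output batch dimension is 8, the engine is genuinely executing at batch size 8, and the timings above do compare batch 8 against batch 1.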

Could you give me some advice? Thanks a lot.

Environment

TensorRT Version: 8.4.0
GPU Type:
Nvidia Driver Version:
CUDA Version: 11.4
CUDNN Version: 8.3.2.49
Operating System + Version: Ubuntu 20.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Hi,
Please refer to the links below related to custom plugin implementation and samples:

While the IPluginV2 and IPluginV2Ext interfaces are still supported for backward compatibility with TensorRT 5.1 and 6.0.x respectively, we recommend that you write new plugins or refactor existing ones to target the IPluginV2DynamicExt or IPluginV2IOExt interfaces instead.

Thanks!

Hi team,

I read those two links, but I don't think they help with my question. I am simply confused by the "GPU Compute Time" when running inference with different batch sizes. With batch_size = 1, the mean GPU compute time is 4.1779 ms; with batch_size = 8, the mean GPU compute time is 4.16483 ms. These two values are very close. I just want to know whether this is reasonable, and if not, what I should try.
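
For reference, my own rough per-image arithmetic from the two summaries above (assuming the batch-8 engine really processes 8 images per execution): 4.16483 ms / 8 ≈ 0.52 ms of GPU compute per image at batch_size = 8, versus 4.1779 ms per image at batch_size = 1.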
Thanks

Hi,

Please refer to the following similar post, which may help you.

If you still face this issue, please share with us the ONNX model for better debugging.

Thank you.