TensorRT is slower than PyTorch

PyTorch inference is 100x faster than TensorRT for this model.

# my model code
import torch.nn as nn

# dim_embedding must be divisible by 32 for nn.GroupNorm(32, dim_embedding)
dblock = nn.Sequential(
    nn.Conv1d(dim_embedding, dim_embedding, 8, 4, 2, bias=False),
    nn.GroupNorm(32, dim_embedding),
    nn.ReLU(inplace=True),
    nn.Conv1d(dim_embedding, dim_embedding, 8, 4, 2, bias=False),
    nn.GroupNorm(32, dim_embedding),
    nn.ReLU(inplace=True),
    nn.Conv1d(dim_embedding, dim_embedding, 8, 4, 2, bias=False),
    nn.GroupNorm(32, dim_embedding),
    nn.ReLU(inplace=True),
)
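For reference, this is roughly how I measure the PyTorch side — a minimal timing sketch, assuming an example `dim_embedding = 256` and input length 1024 (the real values are not shown in the post). Note that on GPU, `torch.cuda.synchronize()` must be called before reading the clock, otherwise the asynchronous kernel launches make PyTorch look faster than it is.

```python
import time
import torch
import torch.nn as nn

dim_embedding = 256  # example value; must be divisible by 32 for GroupNorm(32, ...)

dblock = nn.Sequential(
    nn.Conv1d(dim_embedding, dim_embedding, 8, 4, 2, bias=False),
    nn.GroupNorm(32, dim_embedding),
    nn.ReLU(inplace=True),
    nn.Conv1d(dim_embedding, dim_embedding, 8, 4, 2, bias=False),
    nn.GroupNorm(32, dim_embedding),
    nn.ReLU(inplace=True),
    nn.Conv1d(dim_embedding, dim_embedding, 8, 4, 2, bias=False),
    nn.GroupNorm(32, dim_embedding),
    nn.ReLU(inplace=True),
).eval()

x = torch.randn(1, dim_embedding, 1024)  # example input length

with torch.no_grad():
    for _ in range(10):  # warm-up iterations
        dblock(x)
    n = 100
    t0 = time.perf_counter()
    for _ in range(n):
        dblock(x)
    # On GPU, insert torch.cuda.synchronize() here before reading the clock;
    # CUDA kernels are launched asynchronously.
    elapsed_ms = (time.perf_counter() - t0) / n * 1e3

print(f"mean PyTorch latency: {elapsed_ms:.3f} ms")
```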

I use volksdep to load the engine.
Below is the output after running 'trtexec --loadEngine=my engine file':

[09/14/2021-16:26:29] [I] === Performance summary ===
[09/14/2021-16:26:29] [I] Throughput: 2409.12 qps
[09/14/2021-16:26:29] [I] Latency: min = 0.373535 ms, max = 1.46655 ms, mean = 0.383044 ms, median = 0.379639 ms, percentile(99%) = 0.421509 ms
[09/14/2021-16:26:29] [I] End-to-End Host Latency: min = 0.381104 ms, max = 1.4812 ms, mean = 0.395188 ms, median = 0.392273 ms, percentile(99%) = 0.434494 ms
[09/14/2021-16:26:29] [I] Enqueue Time: min = 0.364685 ms, max = 1.46265 ms, mean = 0.37836 ms, median = 0.375671 ms, percentile(99%) = 0.416306 ms
[09/14/2021-16:26:29] [I] H2D Latency: min = 0.00878906 ms, max = 0.0286865 ms, mean = 0.0103995 ms, median = 0.010376 ms, percentile(99%) = 0.0112305 ms
[09/14/2021-16:26:29] [I] GPU Compute Time: min = 0.357361 ms, max = 1.45093 ms, mean = 0.366774 ms, median = 0.363525 ms, percentile(99%) = 0.40448 ms
[09/14/2021-16:26:29] [I] D2H Latency: min = 0.00488281 ms, max = 0.041687 ms, mean = 0.00587121 ms, median = 0.00592041 ms, percentile(99%) = 0.0065918 ms
[09/14/2021-16:26:29] [I] Total Host Walltime: 3.00067 s
[09/14/2021-16:26:29] [I] Total GPU Compute Time: 2.65141 s
[09/14/2021-16:26:29] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[09/14/2021-16:26:29] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[09/14/2021-16:26:29] [I] Explanations of the performance metrics are printed in the verbose logs.

Environment

TensorRT Version: 8.0.1.6
GPU Type: V100-32G
Nvidia Driver Version: 450.119.04
CUDA Version: 10.2
CUDNN Version: 8.0.2
Operating System + Version: Ubuntu 16.04
Python Version (if applicable): 3.7.9
PyTorch Version (if applicable): 1.6.0

Update: after replacing nn.GroupNorm(32, dim_embedding) with nn.BatchNorm1d(dim_embedding), TensorRT inference is faster than PyTorch.
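The swapped block looks like this (a sketch, again assuming an example `dim_embedding = 256`). One caveat: this is not a drop-in replacement numerically — BatchNorm1d normalizes per channel using running batch statistics, while GroupNorm normalizes each sample over channel groups — so accuracy has to be re-validated, not just latency.

```python
import torch
import torch.nn as nn

dim_embedding = 256  # example value

# Same architecture with GroupNorm swapped for BatchNorm1d.
# BatchNorm1d uses running statistics over the batch; GroupNorm normalizes
# each sample over channel groups, so the two are not equivalent.
dblock_bn = nn.Sequential(
    nn.Conv1d(dim_embedding, dim_embedding, 8, 4, 2, bias=False),
    nn.BatchNorm1d(dim_embedding),
    nn.ReLU(inplace=True),
    nn.Conv1d(dim_embedding, dim_embedding, 8, 4, 2, bias=False),
    nn.BatchNorm1d(dim_embedding),
    nn.ReLU(inplace=True),
    nn.Conv1d(dim_embedding, dim_embedding, 8, 4, 2, bias=False),
    nn.BatchNorm1d(dim_embedding),
    nn.ReLU(inplace=True),
).eval()

with torch.no_grad():
    y = dblock_bn(torch.randn(1, dim_embedding, 1024))
print(y.shape)  # torch.Size([1, 256, 16])
```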

Hi @616403121,

Please refer to the following doc to check the layers supported by TensorRT:
https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#supported-ops

Thank you.