Huge speed difference between engines built from scratch and engines built from onnx

frederikschoeller · June 26, 2021, 8:10am

Description

I have a yolov5 model which I would like to deploy.
I found that if I convert my model from ONNX to TensorRT, trtexec reports an inference speed of 25 fps.
But if I build the model layer by layer using INetworkDefinition, the inference speed triples.
Why is the TensorRT model so much faster when the network is built explicitly instead of converted from ONNX?
Both cases use int8 quantization.

Thanks!

Environment

TensorRT Version: 7.1.3
GPU Type: Jetson Xavier AGX
CUDA Version: 10.2.89
CUDNN Version: 8.0
Operating System + Version: Jetpack 4.5.1

AastaLLL · June 28, 2021, 3:08am

Hi,

If you build the engine from ONNX, trtexec uses onnx2trt to parse the operations.
Would you mind dumping the layers used by trtexec and sharing them with us first:

$ /usr/src/tensorrt/bin/trtexec --dumpProfile ...

Please also share the layers you used with the TensorRT API for comparison.

Thanks.

frederikschoeller · June 28, 2021, 10:04am

Hi!
I ran $ /usr/src/tensorrt/bin/trtexec --dumpProfile ... for both models. The output is attached.
yolov5s6_from_onnx.txt (65.1 KB)
yolov5s6_from_scratch.txt (27.9 KB)

Thanks!

AastaLLL · July 1, 2021, 3:48am

Hi,

Thanks for sharing the log.

Based on your experiment, it seems that the corresponding convolution in the ONNX version tends to be slower.

May I know which data format you use, NCHW or NHWC?
If the two versions use different data formats, could you try to align them to NCHW first?

ONNX

[06/28/2021-11:59:06] [I] Conv_41 + Relu_42      175.68             1.44      5.5
[06/28/2021-11:59:06] [I] Conv_43 + Relu_44      118.46             0.97      3.7
[06/28/2021-11:59:06] [I] Conv_52 + Relu_53 || Conv_45 + Relu_46       47.78             0.39      1.5
[06/28/2021-11:59:06] [I] Conv_47 + Relu_48 input reformatter 0       46.29             0.38      1.5
[06/28/2021-11:59:06] [I] Conv_47 + Relu_48       30.25             0.25      0.9
[06/28/2021-11:59:06] [I] Conv_49 + Relu_50       60.85             0.50      1.9

TensorRT API

[I] (Unnamed Layer* 5) [Convolution] + (Unnamed Layer* 7) [Activation] input reformatter 0       48.02             0.20      1.6
[I] (Unnamed Layer* 5) [Convolution] + (Unnamed Layer* 7) [Activation]      177.49             0.73      5.7
[I] (Unnamed Layer* 8) [Convolution] + (Unnamed Layer* 10) [Activation]       98.51             0.40      3.2
[I] (Unnamed Layer* 14) [Convolution] + (Unnamed Layer* 16) [Activation] || (Unnamed Layer* 11) [Convolution] + (Unnamed Layer* 13) [Activation]       44.75             0.18      1.4
[I] (Unnamed Layer* 17) [Convolution] + (Unnamed Layer* 19) [Activation]       39.47             0.16      1.3
[I] (Unnamed Layer* 20) [Convolution] + (Unnamed Layer* 22) [Activation]       77.51             0.32      2.5
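
To compare the two logs more systematically, the per-layer rows can be parsed and the time spent in reformatter layers summed. The following is a minimal sketch, not part of trtexec. It assumes the three numeric columns are total time (ms), average time (ms), and share of total time (%), and the names `ROW`, `parse_profile`, and `reformat_time` are made up for this illustration. The sample rows are the ONNX rows quoted above.

```python
import re

# Assumed column order: layer name, total time (ms), average time (ms), share (%).
ROW = re.compile(
    r"^(?:\[[^\]]*\]\s*)*"            # optional "[timestamp] [I]" prefixes
    r"(?P<name>.+?)\s+"               # layer name (may contain spaces)
    r"(?P<total>\d+(?:\.\d+)?)\s+"
    r"(?P<avg>\d+(?:\.\d+)?)\s+"
    r"(?P<pct>\d+(?:\.\d+)?)\s*$"
)


def parse_profile(lines):
    """Return (name, total_ms, avg_ms, pct) for every line that looks like a layer row."""
    rows = []
    for line in lines:
        m = ROW.match(line.strip())
        if m:
            rows.append((m["name"], float(m["total"]), float(m["avg"]), float(m["pct"])))
    return rows


def reformat_time(rows):
    """Sum the total time of layers whose name mentions a reformatter."""
    return sum(total for name, total, _, _ in rows if "reformatter" in name)


onnx_log = [
    "[06/28/2021-11:59:06] [I] Conv_41 + Relu_42      175.68             1.44      5.5",
    "[06/28/2021-11:59:06] [I] Conv_43 + Relu_44      118.46             0.97      3.7",
    "[06/28/2021-11:59:06] [I] Conv_52 + Relu_53 || Conv_45 + Relu_46       47.78             0.39      1.5",
    "[06/28/2021-11:59:06] [I] Conv_47 + Relu_48 input reformatter 0       46.29             0.38      1.5",
    "[06/28/2021-11:59:06] [I] Conv_47 + Relu_48       30.25             0.25      0.9",
    "[06/28/2021-11:59:06] [I] Conv_49 + Relu_50       60.85             0.50      1.9",
]

rows = parse_profile(onnx_log)
print(f"{len(rows)} layers, {reformat_time(rows):.2f} ms in reformatters")
```

Running the same functions over both attached profile files would show how much of the gap comes from reformatting rather than from the convolutions themselves.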

Thanks.

AastaLLL · July 1, 2021, 3:51am

Hi,

Also, could you share the output from nvprof for both cases with us?

For example:

$ sudo /usr/local/cuda-10.2/bin/nvprof /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx

Thanks.

frederikschoeller · July 1, 2021, 7:56am

When I run the model, I use NCHW inputs.
But could there be a mechanism that permutes the dimensions internally?
Is there any way to check the actual format used by the convolution?

Thanks

frederikschoeller · July 1, 2021, 7:57am

Profiling the two models yields the following outputs:
profile_onnx.txt (133.1 KB)
profile_from_scratch.txt (85 KB)

Thanks.

AastaLLL · July 2, 2021, 8:37am

Hi,

Based on the profiling data, there are a lot of reformat kernels, which are used for data type or format transforms.

Ideally, the input/output formats are compatible across layers.
But sometimes a reformat layer is added to support various use cases.
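
As a conceptual illustration of what a format transform does, the sketch below permutes a small tensor from NCHW to NHWC using plain nested lists. It only shows the idea of a layout change. The kernels and internal formats TensorRT actually picks are not shown in this thread, and the function name `nchw_to_nhwc` is made up for the example.

```python
def nchw_to_nhwc(x):
    """Permute a nested-list tensor from [N][C][H][W] to [N][H][W][C]."""
    n, c = len(x), len(x[0])
    h, w = len(x[0][0]), len(x[0][0][0])
    return [
        [[[x[i][k][r][s] for k in range(c)] for s in range(w)] for r in range(h)]
        for i in range(n)
    ]


# One image, two channels, 2x2 pixels.
nchw = [[
    [[0, 1], [2, 3]],   # channel 0
    [[4, 5], [6, 7]],   # channel 1
]]
nhwc = nchw_to_nhwc(nchw)
print(nhwc)
```

Every element is read and written once, so each such layer costs a full pass over the tensor. That is why many reformat kernels between convolutions can add up in the profile.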

Would you mind sharing the ONNX file with us so we can take a closer look?
Thanks.

frederikschoeller · July 3, 2021, 7:25am

Interesting. Here is the model:

Thanks a lot!

AastaLLL · July 20, 2021, 6:06am

Hi,

May I know how you implement the slice operation at the beginning of the network?

Also, would you mind sharing libmyplugins.so with us so we can reproduce the scratch version as well?

Thanks.

frederikschoeller · July 20, 2021, 6:48pm

The first operation of the network is a restructuring of the channels of the image followed by a convolution:

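import torch
import torch.nn as nn

# Conv is the YOLOv5 convolution block defined alongside Focus in models/common.py.
class Focus(nn.Module):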
    # Focus wh information into c-space
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super().__init__()
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)
        # self.contract = Contract(gain=2)

    def forward(self, x):  # x(b,c,w,h) -> y(b,4c,w/2,h/2)
        return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))
        # return self.conv(self.contract(x))
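
To make the slicing pattern concrete, here is a small pure-Python sketch of the rearrangement that `forward` performs before the convolution. It omits the batch dimension and uses nested lists instead of torch tensors, and the name `focus_slices` is hypothetical. It only mirrors the four strided slices and the channel-wise concatenation.

```python
def focus_slices(x):
    """Rearrange a [C][H][W] nested list (H and W even) into [4C][H/2][W/2].

    Mirrors torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1)
    for a single image.
    """
    def sub(channel, row_start, col_start):
        # Take every second row and every second column from the given offsets.
        return [row[col_start::2] for row in channel[row_start::2]]

    out = []
    # Offsets are (start in second-to-last dim, start in last dim), in the order used by Focus.
    for row_start, col_start in ((0, 0), (1, 0), (0, 1), (1, 1)):
        out.extend(sub(channel, row_start, col_start) for channel in x)
    return out


image = [[
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]]
focused = focus_slices(image)
print(focused)
```

One input channel of size 4x4 becomes four channels of size 2x2, one per slice offset.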

libmyplugins.so is generated using this repository: wang-xinyu/tensorrtx (implementation of popular deep learning networks with the TensorRT network definition API)

libmyplugins.so (236.9 KB)

Thanks!

AastaLLL · August 3, 2021, 3:56am

Hi,

Thanks for sharing the library.
We are going to reproduce this issue and discuss it with our internal team.
Will get back to you later.

Thanks.
