Huge speed difference between engines built from scratch and engines built from onnx

Description

I have a yolov5 model which I would like to deploy.
I found that if I convert my model from onnx to TensorRT, trtexec indicates an inference speed of 25 fps.
But if I build the model layer for layer using INetworkDefinition, the inference speed triples.
How come the TensorRT model is so much faster when explicitly building the model instead of converting from onnx?
Both cases use int8 quantization.

Thanks!

Environment

TensorRT Version: 7.1.3
GPU Type: Jetson Xavier AGX
CUDA Version: 10.2.89
CUDNN Version: 8.0
Operating System + Version: Jetpack 4.5.1

Hi,

If you build the engine from ONNX, trtexec uses onnx2trt to parse the operations.
Would you mind dumping the layers used by trtexec and sharing them with us first:

$ /usr/src/tensorrt/bin/trtexec --dumpProfile ...

Please also share the layers you used with the TensorRT API for comparison.

Thanks.

Hi!
I ran $ /usr/src/tensorrt/bin/trtexec --dumpProfile ... for both models. The output is attached.
yolov5s6_from_onnx.txt (65.1 KB)
yolov5s6_from_scratch.txt (27.9 KB)

Thanks!

Hi,

Thanks for sharing the log.

Based on your experiment, it seems that the corresponding convolution in the ONNX version tends to be slower.

May I know which data format you use: NCHW or NHWC?
If different data formats are used, could you try to align them to NCHW first?
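To make the two layouts concrete, here is a minimal NumPy sketch of converting a tensor between NCHW and NHWC. The tensor shape and variable names are made up for illustration, and this only shows what the layouts mean on the host side; it does not show how TensorRT chooses formats internally.

```python
import numpy as np

# Hypothetical batch: 1 image, 3 channels, 4 rows, 6 columns, stored as NCHW.
x_nchw = np.arange(1 * 3 * 4 * 6, dtype=np.float32).reshape(1, 3, 4, 6)

# NCHW -> NHWC: move the channel axis to the end and make the copy contiguous.
x_nhwc = np.ascontiguousarray(x_nchw.transpose(0, 2, 3, 1))

# NHWC -> NCHW: move the channel axis back in front of height and width.
x_back = np.ascontiguousarray(x_nhwc.transpose(0, 3, 1, 2))

print(x_nchw.shape, x_nhwc.shape)
```

The same element is addressed as x_nchw[n, c, h, w] in one layout and x_nhwc[n, h, w, c] in the other, so feeding one layout where the other is expected needs an extra copy like the ones above.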

ONNX
[06/28/2021-11:59:06] [I] Conv_41 + Relu_42      175.68             1.44      5.5
[06/28/2021-11:59:06] [I] Conv_43 + Relu_44      118.46             0.97      3.7
[06/28/2021-11:59:06] [I] Conv_52 + Relu_53 || Conv_45 + Relu_46       47.78             0.39      1.5
[06/28/2021-11:59:06] [I] Conv_47 + Relu_48 input reformatter 0       46.29             0.38      1.5
[06/28/2021-11:59:06] [I] Conv_47 + Relu_48       30.25             0.25      0.9
[06/28/2021-11:59:06] [I] Conv_49 + Relu_50       60.85             0.50      1.9

TensorRT API
[I] (Unnamed Layer* 5) [Convolution] + (Unnamed Layer* 7) [Activation] input reformatter 0       48.02             0.20      1.6
[I] (Unnamed Layer* 5) [Convolution] + (Unnamed Layer* 7) [Activation]      177.49             0.73      5.7
[I] (Unnamed Layer* 8) [Convolution] + (Unnamed Layer* 10) [Activation]       98.51             0.40      3.2
[I] (Unnamed Layer* 14) [Convolution] + (Unnamed Layer* 16) [Activation] || (Unnamed Layer* 11) [Convolution] + (Unnamed Layer* 13) [Activation]       44.75             0.18      1.4
[I] (Unnamed Layer* 17) [Convolution] + (Unnamed Layer* 19) [Activation]       39.47             0.16      1.3
[I] (Unnamed Layer* 20) [Convolution] + (Unnamed Layer* 22) [Activation]       77.51             0.32      2.5
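For comparing two such dumps, a small parser can help. The sketch below assumes each row ends with three numbers in the order total time (ms), average time (ms), and share of total time (%), which is how the rows above appear to be laid out. The names parse_profile and reformat_share are hypothetical helpers, not part of trtexec.

```python
import re

# Assumed row layout: optional "[...]" log prefixes, the layer name, then
# total time (ms), average time (ms) and time share (%) at the end of the line.
ROW = re.compile(
    r"^(?:\[[^\]]*\]\s*)+(?P<name>.+?)\s+"
    r"(?P<total>\d+(?:\.\d+)?)\s+(?P<avg>\d+(?:\.\d+)?)\s+(?P<pct>\d+(?:\.\d+)?)\s*$"
)


def parse_profile(lines):
    """Return (layer name, total ms, average ms, percent) for each matching row."""
    rows = []
    for line in lines:
        m = ROW.match(line.strip())
        if m:
            rows.append(
                (m["name"].strip(), float(m["total"]), float(m["avg"]), float(m["pct"]))
            )
    return rows


def reformat_share(rows):
    """Fraction of the summed total time spent in rows named as reformatters."""
    total = sum(r[1] for r in rows)
    reformat = sum(r[1] for r in rows if "reformatter" in r[0])
    return reformat / total if total else 0.0


# Three rows copied from the ONNX dump above.
sample = [
    "[06/28/2021-11:59:06] [I] Conv_52 + Relu_53 || Conv_45 + Relu_46       47.78             0.39      1.5",
    "[06/28/2021-11:59:06] [I] Conv_47 + Relu_48 input reformatter 0       46.29             0.38      1.5",
    "[06/28/2021-11:59:06] [I] Conv_47 + Relu_48       30.25             0.25      0.9",
]

rows = parse_profile(sample)
for name, total_ms, avg_ms, pct in rows:
    print(f"{avg_ms:6.2f} ms  {name}")
print(f"reformatter share of these rows: {reformat_share(rows):.1%}")
```

Running it over both full attachments would give a per-layer average time and a reformatter share for each engine, which is easier to compare than the raw text.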

Thanks.

Hi,

Also, could you share the nvprof output for both cases with us?

For example:

$ sudo /usr/local/cuda-10.2/bin/nvprof /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx

Thanks.

When I run the model, I use NCHW inputs.
Could there be a mechanism that permutes the dimensions internally, though?
Is there any way to check the actual format used by the convolution?

Thanks

Profiling the two models yields the following outputs:
profile_onnx.txt (133.1 KB)
profile_from_scratch.txt (85 KB)

Thanks.

Hi,

Based on the profiling data, there are lots of reformat kernels, which are used for data type or format transformation.

Ideally, the inputs and outputs are compatible across layers.
But sometimes a reformat layer is added to support the various use cases.

Would you mind sharing the ONNX file with us so we can take a closer look?
Thanks.

Interesting. Here is the model:

Thanks a lot!

Hi,

May I know how you implement the slice operation at the beginning of the network?

Also, would you mind sharing libmyplugins.so with us so we can reproduce the scratch version as well?

Thanks.

The first operation of the network is a restructuring of the channels of the image followed by a convolution:

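import torch
import torch.nn as nn

# Conv is YOLOv5's convolution block, defined elsewhere in the model code.
class Focus(nn.Module):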
    # Focus wh information into c-space
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super().__init__()
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)
        # self.contract = Contract(gain=2)

    def forward(self, x):  # x(b,c,w,h) -> y(b,4c,w/2,h/2)
        return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))
        # return self.conv(self.contract(x))
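To illustrate what the slicing in forward does before the convolution, here is a minimal NumPy sketch (NumPy instead of PyTorch only so that it runs without the model code). focus_slices mirrors the four strided slices and the concatenation above. focus_reshape is a hypothetical alternative that produces the same result with one reshape and one transpose. Neither is claimed to be how the ONNX export or the tensorrtx version implements this step.

```python
import numpy as np


def focus_slices(x):
    """Same slicing and channel concatenation as Focus.forward: (b,c,h,w) -> (b,4c,h/2,w/2)."""
    return np.concatenate(
        [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]],
        axis=1,
    )


def focus_reshape(x):
    """Hypothetical equivalent: split h and w into (half, parity) pairs and regroup."""
    b, c, h, w = x.shape
    x = x.reshape(b, c, h // 2, 2, w // 2, 2)   # axes: b, c, i, di, j, dj
    x = x.transpose(0, 5, 3, 1, 2, 4)           # axes: b, dj, di, c, i, j
    return x.reshape(b, 4 * c, h // 2, w // 2)


# Small made-up input: 1 image, 3 channels, 4x4 pixels.
x = np.arange(1 * 3 * 4 * 4, dtype=np.float32).reshape(1, 3, 4, 4)
y = focus_slices(x)
print(x.shape, "->", y.shape)
print(np.array_equal(y, focus_reshape(x)))
```

The four slices pick the pixels at even/odd row and column positions, so the whole step is a pure rearrangement of the input with no arithmetic.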

libmyplugins.so is generated using the wang-xinyu/tensorrtx repository on GitHub (implementation of popular deep learning networks with the TensorRT network definition API).

libmyplugins.so (236.9 KB)

Thanks!