No speedup with TensorRT model compared with PyTorch

Description

After converting my .pth model to ONNX and then to TensorRT, the result shows no speedup; it is even slower…

Environment

TensorRT Version: 8.4.0
GPU Type: Tesla T4
Nvidia Driver Version: 460.106.00
CUDA Version: 10.2
CUDNN Version: 8.1.1
Operating System + Version: Ubuntu 18.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

We are on a corporate intranet, so unfortunately I cannot upload any files…

Steps To Reproduce

I convert my .pth model to ONNX with Python and then convert it to a TensorRT engine with trtexec. The conversion command is shown below:
./trtexec --tacticSources=-cublasLt,+cublas --verbose --onnx=./model.onnx --explicitBatch --saveEngine=./model.engine --workspace=1000
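For reference, the PyTorch-to-ONNX step looks roughly like this (a minimal sketch only; the input shape, opset version, and the assumption that the full model object was saved are placeholders, not details from the actual model):

import torch

# Load the trained model (assumes the full model object was saved, not just a state_dict)
model = torch.load("./model.pth", map_location="cuda")
model.eval()

# Dummy input with a placeholder shape; replace with the model's real input shape
dummy_input = torch.randn(1, 3, 224, 224, device="cuda")

torch.onnx.export(
    model,
    dummy_input,
    "./model.onnx",
    opset_version=13,
    input_names=["input"],
    output_names=["output"],
)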
But when I test the TensorRT engine, there is no speedup. I then profiled the engine with trtexec using the command below:
./trtexec --loadEngine=./model.engine --batch=1 --dumpProfile --profilingVerbosity=detailed --dumpLayerInfo
The result shows one node named “{ForeignNode[(Unnamed Layer* 1000) [LoopOutput][length][Constant]…Concat_326]}” at the end of the model that costs about 85% of the time, even though that node should just perform a ‘concat’ operation. Figures shown below:

[Figure: trtexec layer profiling output]

And the Netron graph of the ONNX model is shown below:

[Figure: Netron graph of the ONNX model]
I have also tried removing the concat layer from the .pth model, but the time-consuming node still exists and simply moves to the preceding node.

Hi,

We request that you share the model, script, profiler output, and performance output, if not already shared, so that we can help you better.

Alternatively, you can try running your model with the trtexec command.

While measuring the model’s performance, make sure you consider the latency and throughput of the network inference only, excluding the data pre- and post-processing overhead (a minimal timing sketch follows the links below).
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#measure-performance

https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#model-accuracy
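As an illustration only (not the exact measurement setup from this thread), timing just the network inference in PyTorch can be done with CUDA events after a warm-up, so data loading and post-processing are excluded; the model path and input shape below are placeholders:

import torch

model = torch.load("./model.pth", map_location="cuda").eval()  # placeholder model
x = torch.randn(1, 3, 224, 224, device="cuda")  # placeholder input shape

# Warm up so one-time initialization does not pollute the measurement
with torch.no_grad():
    for _ in range(10):
        model(x)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
with torch.no_grad():
    start.record()
    for _ in range(100):
        model(x)
    end.record()
torch.cuda.synchronize()  # wait for the GPU before reading the timers
print(f"mean inference latency: {start.elapsed_time(end) / 100:.3f} ms")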

Thanks!

Hello, thanks for your quick reply.
As I said, I have already tried running my model with the trtexec command, and the command is:
“./trtexec --loadEngine=./model.engine --batch=1 --dumpProfile --profilingVerbosity=detailed --dumpLayerInfo”
To my knowledge, this already performs simple inference with random input, without any data pre- or post-processing.
As the figure in my question shows, there are six “concat” operations, and the fifth is the most time-consuming. I wonder whether this is because the stream needs to synchronize, because of a data copy from device to host, or for some other reason?

Hi,

It looks like there is a misunderstanding of the log. You thought that the most time-consuming layer is a concat, but in fact it is a Myelin layer that corresponds to the entire graph.
Without the ONNX model, it would be hard for us to provide any useful suggestions or debug this.
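In the meantime, one thing that may help is rebuilding the engine with detailed profiling verbosity, so that per-layer information inside the fused node is recorded (a suggestion assuming the trtexec flags available in TensorRT 8.4), and then rerunning your --dumpProfile command on the new engine:
./trtexec --onnx=./model.onnx --saveEngine=./model.engine --profilingVerbosity=detailed --workspace=1000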

If possible, could you please share a minimal ONNX model that reproduces the issue with us via DM?

Thank you.

Hi, thank you for your reply.
I also raised this problem on GitHub and have solved it there.
It was a false alarm: the verbose output did not print all the layers, and the most time-consuming “layer” actually contained a bunch of layers.
Thanks for your advice again!
