How to Speed Up Deep Learning Inference Using TensorRT

Originally published at: https://developer.nvidia.com/blog/speed-up-inference-tensorrt/

Looking for more? Check out the hands-on DLI training course: Optimization and Deployment of TensorFlow Models with TensorRT. The new version of this post, Speeding Up Deep Learning Inference Using TensorRT, has been updated to start from a PyTorch model instead of the ONNX model, upgrade the sample application to use TensorRT 7, and replace…

Thanks for this extremely informative post. I am attempting to replicate some of the inference throughput numbers reported here: https://developer.nvidia.co...

This example targets a ResNet-50, but the performance, even with FP16 mode enabled, does not seem to match the latencies/throughputs reported there. How should I be building the model differently?

Got errors while running make inside TensorRT-introduction/.
Solved them by adding <math.h> and <numeric> includes in sampleOnnx*.cpp.
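
For anyone hitting the same build errors, the fix amounts to two extra standard headers near the top of each sampleOnnx*.cpp; a minimal sketch (exactly which functions they cover depends on your toolchain):

#include <math.h>   // C math functions the compiler complained about
#include <numeric>  // std::accumulate and related algorithms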

Please keep in mind that this blog post sample is oriented towards new users. It does not include all possible optimizations. You might achieve better results with our existing benchmark tool "trtexec". It's possible to optimize the code further, e.g. by using CUDA Graphs, but I think such optimizations are beyond the scope of this post.

Reference: https://docs.nvidia.com/dee...
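
For readers curious what the CUDA Graphs route could look like, here is a rough, non-authoritative sketch: capture one inference launch into a graph and replay it to cut per-iteration launch overhead. It assumes CUDA 10.1+ graph-capture APIs and a TensorRT release that tolerates enqueue inside stream capture (check the release notes for your version); names like context, bindings, batchSize, and ITERATIONS are placeholders for the sample's objects.

// Hedged sketch only, not the sample's code.
cudaStream_t stream;
cudaStreamCreate(&stream);

// Warm-up launch outside capture so TensorRT allocates its resources first.
context->enqueue(batchSize, bindings, stream, nullptr);
cudaStreamSynchronize(stream);

// Capture a single launch into a graph.
cudaGraph_t graph;
cudaGraphExec_t graphExec;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
context->enqueue(batchSize, bindings, stream, nullptr);
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);  // CUDA 10.x signature

// Replay: each iteration is now a single cudaGraphLaunch call.
for (int i = 0; i < ITERATIONS; ++i)
    cudaGraphLaunch(graphExec, stream);
cudaStreamSynchronize(stream);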

I think I ended up figuring this out. Initially, I got better results by increasing the workspace size (to around 14 GB), which increased the engine build time and let the builder consider more tactics. Looking at what trtexec does, I stopped measuring the overhead of copying real inputs to the GPU (comment out the cudaMemcpy lines in simpleOnnx), and that got me closer to the reported number. This seems to be what is meant by results on a "synthetic" dataset.
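
For reference, a minimal sketch of the workspace change described above, assuming the TensorRT 5-era builder API (IBuilder::setMaxWorkspaceSize; newer releases move this to IBuilderConfig::setMaxWorkspaceSize):

// Wherever the engine is built: give the builder more scratch memory so it
// can evaluate additional tactics. 14ULL << 30 is ~14 GB; size it to your GPU.
builder->setMaxWorkspaceSize(14ULL << 30);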

However, it turns out the example is actually running a different version of ResNet-50! The example asks you to download ResNet-50 v2, but TensorRT seems to be much better optimized for earlier versions of ResNet-50. I tested on release 1.1 at https://github.com/onnx/mod.... Curiously, workspace size seems to have no impact on this older version at all, so decreasing it to 1 GB produces the same benchmark numbers.

@disqus_mD6AGHAfPt, is there a guide to these further optimizations? I would like to get maximum throughput in a real, non-synthetic use case, but the memcpy for batch size 41 seems to add almost 3 ms to the latency, which has a pretty heavy impact on throughput. How can I bring this down further?

@ankmathur96 Did you try overlapping copies with compute? Here is an inspirational presentation by Stephen Jones:

http://on-demand.gputechcon...
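
To make the overlap idea concrete, here is a hedged double-buffering sketch (not the sample's code): pinned host buffers, cudaMemcpyAsync, two streams, and one execution context per stream, since a single IExecutionContext must not be enqueued concurrently. All names (deviceInput, hostInput, contexts, bindings, numBatches, ...) are placeholders.

// Pinned (cudaMallocHost) host buffers are assumed; with pageable memory
// the copies may not actually overlap with compute.
cudaStream_t streams[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&streams[i]);

for (int b = 0; b < numBatches; ++b)
{
    int s = b % 2;  // alternate streams/device buffers
    cudaMemcpyAsync(deviceInput[s], hostInput[b], inputBytes,
                    cudaMemcpyHostToDevice, streams[s]);
    // Inference on stream s waits for its own copy, but can overlap with
    // the copy and compute issued on the other stream.
    contexts[s]->enqueue(batchSize, bindings[s], streams[s], nullptr);
    cudaMemcpyAsync(hostOutput[b], deviceOutput[s], outputBytes,
                    cudaMemcpyDeviceToHost, streams[s]);
}
for (int i = 0; i < 2; ++i)
    cudaStreamSynchronize(streams[i]);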

Got the same error. The script might need to be fixed.

I was confused about why you have

// Read input tensor from ONNX file
if (readTensor(inputFiles, inputTensor) != inputTensor.size())
{
    cout << "Couldn't read input Tensor" << endl;
    return 1;
}

until I realized that the input files you are using have only one sample in each .pb file.

Hi,

Thanks very much for sharing the example. However, I got an error when I attempted to compile the code with TensorRT 5:

"simpleOnnx_1.cpp:54:16: error: ‘IParser’ is not a member of ‘nvonnxparser’ "

Is this caused by a different version of TensorRT? Could you tell me how to fix it?

Thanks again.

Where can I get the CMakeLists.txt files to debug this code?

I followed the guide, and when I run make in TensorRT-introduction, it shows:

g++ -std=c++11 -Wall -I/usr/local/cuda/include -c -o ioHelper.o ioHelper.cpp
In file included from ioHelper.cpp:33:0:
/usr/local/include/onnx/onnx_pb.h:52:26: fatal error: onnx/onnx.pb.h: No such file or directory
compilation terminated.
<builtin>: recipe for target 'ioHelper.o' failed
make: *** [ioHelper.o] Error 1

Do you have any suggestions? I only find onnx_ml.pb.h; this may be a problem with onnx.

Please check the onnx installation log for errors. This might be a problem with the protobuf compiler; the missing file onnx.pb.h is generated by protoc.

The sample comes with a Makefile, which can be tweaked to add debugging flags. Could you please elaborate on the issue?

This is likely due to using an older TensorRT version. This sample should work with TensorRT 5.0 or newer.
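
For what it's worth, with TensorRT 5.0+ the symbol comes from the ONNX parser header shipped with TensorRT; a minimal usage sketch (network and gLogger are placeholders for the sample's objects):

#include <NvOnnxParser.h>

// nvonnxparser::IParser and createParser() are declared in NvOnnxParser.h.
nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, gLogger);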

Thanks for the reply. I solved it by adding -DONNX_ML to the Makefile:
CXXFLAGS=-std=c++11 -Wall -I$(CUDA_INSTALL_DIR)/include -DONNX_ML

Hello,
I have a model in MXNet which I exported to ONNX and then imported from ONNX into TensorRT.

I’m using onnx-tensorrt (https://github.com/onnx/onn...) to run inference.

I’ve got an output after using

trt_outputs = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
I also get an output when I do a forward pass in MXNet (in that output I find the bbox values for the face).

Question: How can I convert TensorRT’s inference output to match MXNet’s inference output so I can classify the faces with the bboxes?

Or maybe I’m not looking in the right place, and I should ignore MXNet’s output and interpret ONNX’s output instead? (I also verified that ONNX gives the same output.)

This comment doesn't tie in directly with the topic of the post. You might try asking your question in the Deep Learning section of the NVIDIA Developer Talk Forum.

Hello,
Is it possible to batch inputs with the Python API?

Yes.