How to Speed Up Deep Learning Inference Using TensorRT

Originally published at:

Looking for more? Check out the hands-on DLI training course: Optimization and Deployment of TensorFlow Models with TensorRT. The new version of this post, Speeding Up Deep Learning Inference Using TensorRT, has been updated to start from a PyTorch model instead of the ONNX model, upgrade the sample application to use TensorRT 7, and replace…

Thanks for this extremely informative post. I am attempting to replicate some of the inference throughput numbers attained here:

This example targets ResNet-50, but the performance, even with FP16 mode enabled, does not seem to match the latencies/throughputs reported there. How should I be building the model differently?

Got errors while running make inside TensorRT-introduction/
Solved them by adding #include <math.h> and #include <numeric> to sampleOnnx*.cpp

Please keep in mind that this blog post sample is oriented towards new users and does not include all possible optimizations. You might achieve better results with our existing benchmark tool, trtexec. It's possible to optimize the code further, e.g. by using CUDA Graphs, but I think such optimizations are beyond the scope of this post.


I think I ended up figuring this out. Initially, I got better results by increasing the workspace size (to around 14 GB), which seemed to increase the compile time and generate more tactic options. Looking at what trtexec does, I stopped measuring the overhead of copying real inputs to the GPU (by commenting out the cudaMemcpy lines in simpleOnnx), and that got me closer to the correct number. This seems to be what is meant by results on a "synthetic" dataset.

However, we realized that the example is actually running a different version of ResNet-50! The example asks you to download ResNet-50v2, but TensorRT seems to be much better optimized for earlier versions of ResNet-50; I tested on release 1.1. Curiously, workspace size seems to have no impact on this older version of ResNet-50 at all, so decreasing it to 1 GB produces the same benchmark.

@disqus_mD6AGHAfPt:disqus, is there a guide to these further optimizations? I would like to get the maximum throughput in a real, non-synthetic use case, but the memcpy for batch size 41 seems to add almost 3 ms to the latency, which has a pretty heavy impact on throughput. How can I bring this down even further?

@ankmathur96:disqus Did you try to overlap copy with compute operations? Here is an inspirational presentation by Stephen Jones:


Got the same error. The script might need to be fixed.

I was confused about why you have

// Read input tensor from ONNX file
if (readTensor(inputFiles, inputTensor) != inputTensor.size())
{
    cout << "Couldn't read input Tensor" << endl;
    return 1;
}

until I realized that the input files you are using have only one sample in each .pb file.


Thanks very much for sharing the example. But I got an error when I attempted to compile the code with TensorRT 5:

"simpleOnnx_1.cpp:54:16: error: ‘IParser’ is not a member of ‘nvonnxparser’ "

Is this caused by a different version of TensorRT? Could you tell me how to fix it?

Thanks again.

Where can I get the CMakeLists.txt files to debug this code?

I followed the guide, and when I run make in TensorRT-introduction, it shows:

g++ -std=c++11 -Wall -I/usr/local/cuda/include -c -o ioHelper.o ioHelper.cpp
In file included from ioHelper.cpp:33:0:
/usr/local/include/onnx/onnx_pb.h:52:26: fatal error: onnx/onnx.pb.h: No such file or directory
compilation terminated.
<builtin>: recipe for target 'ioHelper.o' failed
make: *** [ioHelper.o] Error 1

Do you have any suggestions? I can only find onnx_ml.pb.h; this may be a problem with onnx.

Please check the onnx installation log for errors. This might be a problem with the protobuf compiler; the missing file onnx.pb.h is generated by protoc.

The sample comes with a Makefile, which can be tweaked for debugging flags. Could you please elaborate on the issue?

This is likely due to using an older TensorRT version. This sample should work with TensorRT 5.0 or newer.

Thanks for the reply. I solved it by adding -DONNX_ML to the Makefile:
CXXFLAGS=-std=c++11 -Wall -I$(CUDA_INSTALL_DIR)/include -DONNX_ML

I have a model in MXNet which I exported to ONNX, then imported from ONNX into TensorRT.

I’m using onnx-tensorrt in order to run the inference.

I got an output after using:

trt_outputs = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
I also got an output when I do a forward pass in MXNet (in that output I find the bbox values for the face).

Question: How can I convert the TensorRT inference output to match the MXNet inference output, so I can classify the faces with the bboxes?

Or maybe I’m not looking in the right place, and I should ignore MXNet’s output, interpret the ONNX output, and use that instead? (I also verified that ONNX produces the same output.)

This comment doesn't tie in directly with the topic of the post. You might try asking your question in the Deep Learning section of the NVIDIA Developer Talk Forum.

Is it possible to batch inputs in Python API?