TensorRT 3: Faster TensorFlow Inference and Volta Support

Originally published at: https://developer.nvidia.com/blog/tensorrt-3-faster-tensorflow-inference/

NVIDIA TensorRT™ is a high-performance deep learning inference optimizer and runtime that delivers low-latency, high-throughput inference for deep learning applications. NVIDIA released TensorRT last year with the goal of accelerating deep learning inference for production deployment. Figure 1. TensorRT optimizes trained neural network models to produce a deployment-ready runtime inference engine. In this post we’ll introduce…

Amazing article. I am very interested in trying optimization 1 during training as well. Could I use it in that setting on a CentOS machine?
On the downloads page, only Ubuntu packages are available.

Hi Nitin, thanks! TensorRT is a deployment-only library, so you can't take advantage of these optimizations for training with TensorRT. Currently only Ubuntu packages are tested and officially supported. TensorRT is also available for the Jetson TX1 and TX2 embedded platforms.

Great, thank you.
Is there any way, or any future plan, to pull out and use parts of this module (like optimization 1) on their own? I am currently facing heavy slowdowns on GPUs and cannot use their full potential, and I feel layer fusion could substantially speed up training. Something like:

my_raw_tf.graph > tensor_rt3.layer_fusion > op_tf.graph
op_tf.graph.fit(X, Y)

Hi Nitin, I can't comment on future plans. Performance improvements depend on various factors including the specific graph structure and opportunities for fusion. TensorRT is able to deliver overall better performance through a combination of all the optimizations discussed in the post.

Try using the NVIDIA-optimized framework containers from the NGC container registry. You can download and run them locally or on the latest V100 GPUs on AWS:
https://www.nvidia.com/en-u...

Does TensorRT provide a custom layer C++ API for TensorFlow, so that a TensorFlow UFF file can be used for inference on Drive PX2?

Is there an example of this with the C++ API?

What about use on AWS p2.xlarge or other home-built NVIDIA/CUDA-based systems?

Useful information in this post, can't wait to try.

Hi,
First of all, this is a very interesting and useful article.
I was wondering whether I can call the inference method (infer) from within a kernel, that is, from code running on the GPU.
If not, is there any future plan to support or provide this kind of device API?
It would be great to have it!

Thanks

Is it possible to share the code inside the load_and_preprocess_images() method?

See this problem:
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.4/site-packages/tensorflow/python/framework/importer.py", line 489, in import_graph_def
graph._c_graph, serialized, options) # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape must be rank 2 but is rank 4 for 'import/dense_p7/MatMul' (op: 'MatMul') with input shapes: [1,256,1,1], [256,1]

We can successfully complete the TensorRT subgraph conversion, but we hit the problem above during the inference phase. My model is a ResNet-50-based TensorFlow model. Can anyone help me solve this problem? Thanks!
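
One possible reading of the traceback above, offered as a sketch and not a confirmed diagnosis: MatMul requires rank-2 inputs, but the activation reaching 'import/dense_p7/MatMul' still has shape [1,256,1,1], which suggests a flatten or reshape step before the dense layer is missing from the converted graph. The pure-Python snippet below only does the shape arithmetic. The helper names flatten_to_rank2 and matmul_output_shape are hypothetical and are not part of TensorFlow or TensorRT.

```python
from functools import reduce
from operator import mul


def flatten_to_rank2(shape):
    """Keep the batch axis and collapse all remaining axes into one."""
    batch, rest = shape[0], shape[1:]
    return [batch, reduce(mul, rest, 1)]


def matmul_output_shape(a_shape, b_shape):
    """Return the result shape of a rank-2 matrix product.

    Raises ValueError when an input is not rank 2 or when the
    inner dimensions do not match.
    """
    for shape in (a_shape, b_shape):
        if len(shape) != 2:
            raise ValueError(
                "Shape must be rank 2 but is rank %d" % len(shape))
    if a_shape[1] != b_shape[0]:
        raise ValueError(
            "inner dimensions differ: %d vs %d" % (a_shape[1], b_shape[0]))
    return [a_shape[0], b_shape[1]]


activation = [1, 256, 1, 1]  # first input shape from the traceback
weights = [256, 1]           # second input shape from the traceback

# Without flattening, the rank check fails as in the reported error.
try:
    matmul_output_shape(activation, weights)
except ValueError as err:
    print("unflattened:", err)

# After collapsing the trailing axes, the product is well defined.
flat = flatten_to_rank2(activation)
print("flattened:", flat, "->", matmul_output_shape(flat, weights))
```

If this reading is right, reshaping the tensor to [1,256] in the TensorFlow graph before the dense layer would satisfy the rank check. Whether that matches the actual cause in this model is an assumption.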

Using TensorFlow backend.
2018-09-05 18:27:17.202041: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: FMA
2018-09-05 18:27:17.336380: W tensorflow/stream_executor/cuda/cuda_driver.cc:513] A non-primary context 0x60fa250 for device 0 exists before initializing the StreamExecutor. The primary context is now 0x60cc960. We haven't verified StreamExecutor works with that.
2018-09-05 18:27:17.337269: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 7.70GiB
2018-09-05 18:27:17.337304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-09-05 18:27:17.991676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-05 18:27:17.991732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-09-05 18:27:17.991747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2018-09-05 18:27:17.991999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7408 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
model_data/yolo.h5 model, anchors, and classes loaded.
Using output node dense_2/Softmax
Converting to UFF graph
Traceback (most recent call last):
File "demo.py", line 193, in <module>
main(YOLO())
File "demo.py", line 43, in main
uff_model = uff.from_tensorflow_frozen_model("mars-small128.pb", ["dense_2/Softmax"])
File "/usr/lib/python3.5/dist-packages/uff/converters/tensorflow/conversion_helpers.py", line 149, in from_tensorflow_frozen_model
return from_tensorflow(graphdef, output_nodes, preprocessor, **kwargs)
File "/usr/lib/python3.5/dist-packages/uff/converters/tensorflow/conversion_helpers.py", line 120, in from_tensorflow
name="main")
File "/usr/lib/python3.5/dist-packages/uff/converters/tensorflow/converter.py", line 76, in convert_tf2uff_graph
uff_graph, input_replacements)
File "/usr/lib/python3.5/dist-packages/uff/converters/tensorflow/converter.py", line 53, in convert_tf2uff_node
raise UffException(str(name) + " was not found in the graph. Please use the -l option to list nodes in the graph.")
NameError: name 'UffException' is not defined

You might try the devtalk forums: https://devtalk.nvidia.com/...

Hi, do you have a usage guide for TensorRT Lite?

@wade.wang – This might help: https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt_302/tensorrt-api/topics/topics/pkg_ref/lite.html.

@jwitsoe Yes, it is helpful. Thank you!