TRT issue with Graph Creation - TRTEngineOP

Hi,

I’m running TensorRT on 1070 and Jetson TX2 GPUs. On 1070 I have a couple of questions:

(1) Using TensorFlow / TRT via Python doesn’t seem to produce optimized graphs with the ‘TRTEngineOp’ node that I could see in the corresponding graphs generated on the TX2. On the 1070, a graph generated by ‘trt.create_inference_graph’ was almost identical to the TF frozen graph from which it was generated, except for a few ‘TransposeNHWCToNCHW’ nodes thrown in. This was on Python 3.7 (from Anaconda, if that matters). However, I had another TF installation for Python 2.7 that did generate the optimized graph with the ‘TRTEngineOp’ node.

I noticed that internally the TF framework was loading a library ‘_trt_engine_op.so’. The copy of this library in the Python 3.7 TF installation I was using was not linked against libnvinfer.so.5 on my system. However, the TF installation for Python 2.7, which I have on the same computer, was linked against it, and on the TX2 it was also linked.

And, using the following code produces TRT availability as False on Python 3.7, and True on Python 2.7 on the same computer with 1070:

import tensorflow.contrib.tensorrt as trt
from tensorflow.contrib.tensorrt.wrap_conversion import is_tensorrt_enabled

print(is_tensorrt_enabled())

which is consistent with the fact that _trt_engine_op.so for Python 3.7 is not linked with libnvinfer while for Python 2.7 it is, in case that linkage really is what makes the difference.
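To make the linkage check above repeatable, here is a small pure-Python sketch that looks for a libnvinfer entry in `ldd` output. The helper names and the sample output are hypothetical; in practice you would point `check_shared_object` at the `_trt_engine_op.so` inside your TensorFlow installation:

```python
import subprocess

def links_libnvinfer(ldd_output):
    """Return True if any line of `ldd` output references libnvinfer."""
    return any('libnvinfer' in line for line in ldd_output.splitlines())

def check_shared_object(path):
    """Run ldd on a shared object and report whether libnvinfer is linked in."""
    out = subprocess.run(['ldd', path], capture_output=True, text=True).stdout
    return links_libnvinfer(out)

# Hypothetical ldd output for illustration only:
sample = """\
        linux-vdso.so.1 (0x00007ffd4a5f2000)
        libnvinfer.so.5 => /usr/lib/x86_64-linux-gnu/libnvinfer.so.5 (0x00007f1a2c000000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1a2b000000)
"""
print(links_libnvinfer(sample))  # True for a TRT-enabled build
```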

However, the unsettling thing for me was that there was no indication from the TF or TRT frameworks regarding this difference. There is a 1.2x - 2x speedup of a graph produced with the TRTEngineOp node over one produced without it (the latter being almost the same as the TF frozen graph except for the data transpose nodes mentioned earlier).

I got TensorFlow (with GPU) on both Py 3.7 and Py 2.7 using pip (IIRC). Py 3.7 was from Anaconda, if that matters; Py 2.7 was what came with Ubuntu 18.04.

Please advise on how to get a TensorFlow for Py 3.7 that produces the optimized graph (presumably with the TRTEngineOp node).

(2) I read on these forums that FP16 is not optimized on the 1070 (https://devtalk.nvidia.com/default/topic/1023708/gpu-accelerated-libraries/fp16-support-on-gtx-1060-and-1080/post/5208194/#5208194) and that INT8 is not optimized on the TX2. On the TX2 I do see that INT8 is slower than FP32 and FP16, which is consistent with the advice on this forum. However, on the 1070 I still see INT8 being the slower mode, instead of FP16. The speedup on the 1070 for FP32 or FP16 vs INT8 is about 120x - 150x in limited experimentation. That is not consistent with the NVIDIA quote in the link. Please advise.

In both (1) and (2) Resnet V2 50 was used for experimentation on 1070 (and also on TX2).

Here is the system configuration:

1070: Ubuntu 18.04 4.18.0-21-generic, Python 3.7.1, C++ 7.4.0, NVCC V10.0.166, TRT 5.1.5, TF 1.13.1, CUDA 10.0, CUDNN 7.5.0, NV DRV 418.43

TX2: Ubuntu 18.04 4.9.140-tegra, Python 3.6.7, C++ 7.4.0, NVCC V10.0.166, TRT 5.0.6, TF 1.13.1, CUDA 10.0, CUDNN 7.3.1, NV DRV Jetpack 4.2

Any help will be much appreciated. Thanks a lot.

Best regards,

Q./

Follow up to my post above. See here for an example where an unoptimized TensorRT graph runs slower than the TensorFlow model from which it was derived:

https://imgur.com/a/AvXNrUn

Follow up #2: I grabbed the NVIDIA-supplied nvcr.io/nvidia/tensorrt:19.06-py3 Docker image and installed tensorflow-gpu using pip.

And, got the following versions:

Python: 3.5.2
TF: 1.14
TRT: 5.1.5.0

However, TensorRT still produces an unoptimized graph (without the TRTEngineOp node), and it still runs slower than the TensorFlow graph from which it was produced. Speedup ~0.9x on the MNIST image classification example reported in the earlier message.

And, if I try to run a graph that does have a TRTEngineOp node, then it throws an error:

tensorflow.python.framework.errors_impl.NotFoundError: Op type not registered 'TRTEngineOp' in binary running on ec2a2c899fad. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.

I did verify that TensorFlow was loading the CUDA libraries. In TF 1.14 it appears to load them dynamically; in 1.13 these libraries were linked into the TensorFlow library image.

2019-07-17 01:41:43.257082: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-07-17 01:41:43.257093: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-07-17 01:41:43.257103: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-07-17 01:41:43.257112: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-07-17 01:41:43.257121: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-07-17 01:41:43.257130: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-07-17 01:41:43.257139: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7

And, TRT nvinfer appears to be installed:

ls -l /usr/lib/x86_64-linux-gnu/libnvinfer*
lrwxrwxrwx 1 root root        19 Apr 27 05:00 /usr/lib/x86_64-linux-gnu/libnvinfer.so -> libnvinfer.so.5.1.5
lrwxrwxrwx 1 root root        19 Apr 27 05:00 /usr/lib/x86_64-linux-gnu/libnvinfer.so.5 -> libnvinfer.so.5.1.5
-rw-r--r-- 1 root root 145294696 Apr 27 04:59 /usr/lib/x86_64-linux-gnu/libnvinfer.so.5.1.5
lrwxrwxrwx 1 root root        26 Apr 27 05:00 /usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so -> libnvinfer_plugin.so.5.1.5
lrwxrwxrwx 1 root root        26 Apr 27 05:00 /usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.5 -> libnvinfer_plugin.so.5.1.5
-rw-r--r-- 1 root root   3552536 Apr 27 04:59 /usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.5.1.5

Now, TRT graphs produced by this arrangement will still run, as noted above, though slower than even plain TensorFlow. So, apparently, one loses the advantage of TRT.

So, NVIDIA folks, what am I doing wrong? How do I get optimized TRT graphs that I got earlier with Py 2.7 and TF 1.13 as noted in the first message?

Any help will be really appreciated.

Best regards,

Q./

Follow up #3: Ok, the nvcr.io/nvidia/tensorflow:19.06-py3 Docker image does seem to create optimized TRT graphs. It comes with TensorFlow 1.13. The nvinfer libraries are installed, but apparently the TRT Python packages are not; one should be able to install them by hand in the container.

However, this whole arrangement doesn’t appear very satisfying to me. Apparently, TensorFlow compiled by NVIDIA has (optimized) TensorRT support turned on, while Google seems to consider TRT support in TF an optional thing. Things would have been fine if the state of affairs had been left there, so that a warning / error would be generated when trying to use TRT graphs in a TF without that support. However, my current understanding is that this is not the case: unoptimized TRT support is what is available in the tensorflow-gpu packages I have installed using pip (for Python 3.7). I’m not sure whether that is because the Bazel TRT config options were not set at compile time to use TRT, nvinfer, etc., or whether it is due to some other reason. I can try compiling TF and test that hypothesis if I can find the time.
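If the compile-time hypothesis is right, building TF 1.x from source with TRT enabled goes through the ./configure script. A rough sketch of the relevant environment variables follows; the variable names are from the TF 1.13/1.14 configure.py, and the paths are examples for this particular system, not universal defaults:

```shell
# Non-interactive answers for the TF 1.x ./configure script.
# Paths below are examples for this system; adjust to your install.
export TF_NEED_CUDA=1
export CUDA_TOOLKIT_PATH=/usr/local/cuda-10.0
export CUDNN_INSTALL_PATH=/usr/local/cuda-10.0
export TF_NEED_TENSORRT=1
export TENSORRT_INSTALL_PATH=/usr/lib/x86_64-linux-gnu

./configure
bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
```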

One doesn’t notice this scenario unless one inspects the graphs generated. TRT graphs generated by the unoptimized support in TF run even slower than the TF models they were derived from; that is what I’m seeing so far. And everything happens transparently, without any warning of sorts.

And, tensorflow-gpu installed by pip seems to bring in that unoptimized TF package.

I’m using an NVIDIA GTX 1080 GPU, CUDA 10.0, tensorflow-gpu 1.14.0, cuDNN 7.4.2, TensorRT 5.1.5.0.
I have two issues:

  1. I’m using https://github.com/ardianumam/Tensorflow-TensorRT for trying out TRT.
    I believe that even after using TRT, both graphs, i.e. the optimized and the original TF graph, are the same. There is no improvement in performance. (I do not see any TRT engine nodes using TensorBoard.)

  2. a. Is there a way to verify that the graph generated is actually a TRT graph?
    b. Does TRT provide any logs? I have used nvprof but I do not find TRT processes.
    c. I had read somewhere that the NVIDIA 1080 does not support FP16. I have used it and I do not get errors; the issue is that with any precision mode the results are the same.

Thank you in advance!

Kamatrohan13, to your questions:

(1) Yes, there definitely appears to be a case of certain TensorFlow packages producing more optimized TRT graphs than others. In fact, even with the same optimized TRT graph produced by TF-TRT, I have noticed a difference in execution time between Python 2.7/TF 1.13.1/TRT 5.1.5 and Python 3.5.2/TF 1.13.1/TRT 5.1.5 (via the nvcr.io/nvidia/tensorflow:19.06-py3 Docker image).

On the 1070 GPU, FP32 and FP16 (ResNet V2 50) behaved similarly. But on INT8 (ResNet V2 50) the Python 2.7 TF-TRT arrangement was ~3.8x faster, on the very same generated TRT graph. And apparently INT8 is still the slowest, whereas the NVIDIA link I posted earlier reported that FP16 should actually have been the slowest. To add further to the mystery, the platformHasFastFp16 function in the NVIDIA TRT C/C++ API returned false for the 1070, indicating that fast FP16 is not supported (https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/c_api/classnvinfer1_1_1_i_builder.html#a1d18b948852cb088d22a87cff70a2b2f). Though I was still able to run FP16 TRT plans in the C/C++ API as well, and they ran fine. TBH, that might mean only that ‘fast’ support is unavailable, and that FP16 can still run, just slowly. I haven’t benchmarked that aspect of the C/C++ TRT API yet; need to do.

(2) Inspecting the graphs is one way to figure out whether they are optimized or not. One can also get a hint by looking at the nvinfer dynamic libraries either linked in or loaded at runtime by the TensorFlow library. Empirically it appears to me that if libnvinfer is not linked in / loaded at runtime, then the graphs produced have so far been unoptimized. This might have to do with the Bazel TRT setting at TF compile time, perhaps.

(3) Yes, as indicated in (1) above, the behavior around FP16 appears inconsistent with what NVIDIA folks say on these forums for the 1070, and perhaps the 1080 also. And the TRT C/C++ API seems to say that FP16 is not supported, via the function call mentioned above. Though both setups seem to run FP16 fine, apparently.

Sincerely,

Q./

Hey, thanks for raising this issue that you experienced.

In order to use TF-TRT, Tensorflow must be compiled with TRT support enabled. The version of TensorFlow in the NGC containers has this feature enabled.

The difference you noticed between the Python versions is because your Python 2.7 is using a different TF installation than your Python 3.7. One of these TF installations was built with TRT support and the other wasn’t; it doesn’t have anything to do with the Python version itself.

There should definitely be some sort of warning when users try to convert using TF-TRT on a TensorFlow that was not built with TRT support. I think this was something we had in the past, but it got lost as changes were made.

The main philosophy of TF-TRT is that you will always get a usable TF graph after conversion, which may or may not be optimized with TRT depending on the model (only certain ops are compatible with TRT). You are correct in that the only way to check that you are actually using TRT is by looking for the presence of TRTEngineOps in the output graph.

You can do this by looking at the TensorFlow log output. The TensorRTOptimizer step will reduce the number of nodes if TRT is being used:

2019-07-19 21:06:14.216544: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:848] TensorRT node TRTEngineOp_0 added for segment 0 consisting of 1239 nodes succeeded.
2019-07-19 21:06:14.352823: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:716] Optimization results for grappler item: tf_graph
2019-07-19 21:06:14.352871: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 1244 nodes (-205), 1426 edges (-205), time = 1717.09094ms.
2019-07-19 21:06:14.352878: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   layout: Graph size after: 1244 nodes (0), 1426 edges (0), time = 362.308ms.
2019-07-19 21:06:14.352883: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   constant folding: Graph size after: 1244 nodes (0), 1426 edges (0), time = 588.36ms.
2019-07-19 21:06:14.352888: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:718]   TensorRTOptimizer: Graph size after: 6 nodes (-1238), 5 edges (-1421), time = 86346.9297ms.
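As a rough programmatic check, the grappler lines above can be scanned for the node count reported after the TensorRTOptimizer pass. This is a sketch that assumes the log format shown here; the helper name is made up:

```python
import re

def trt_optimizer_node_count(log_text):
    """Extract the node count reported after the TensorRTOptimizer pass.

    Returns None if no TensorRTOptimizer line is found (e.g. a TF built
    without TRT support never runs that pass)."""
    m = re.search(r'TensorRTOptimizer: Graph size after: (\d+) nodes', log_text)
    return int(m.group(1)) if m else None

log = "TensorRTOptimizer: Graph size after: 6 nodes (-1238), 5 edges (-1421)"
print(trt_optimizer_node_count(log))  # 6
```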

The other changes to the graph that you noticed such as the TransposeNHWCToNCHW nodes are due to the other optimizers (layout, constant folding) which will still run even if the TensorRTOptimizer isn’t enabled.

You can also use this python snippet to count the number of TRTEngineOp easily:

frozen_graph = trt.create_inference_graph(...)
print('TRT node count:', len([1 for n in frozen_graph.node if str(n.op) == 'TRTEngineOp']))

I will try to add a warning for when trying to use TF-TRT but TRT support is not enabled.

Thank you, I tried the NGC container, it worked! Got 7x performance. Just one issue: I’m trying my own code now. I’m not that familiar with TRT and I’m still trying to get a hold of it. Can you please let me know the meaning of max_batch_size? I went through the documentation, which says: “This parameter is the maximum batch size that specifies the batch size for which TensorRT will optimize. At runtime, a smaller batch size may be chosen. At runtime, larger batch size is not supported.”
I didn’t understand what this means.
I’m using MNIST data:

mnist = tf.keras.datasets.fashion_mnist
(training_images, training_labels), (test_images, test_labels) = mnist.load_data()
training_images = training_images.reshape(60000, 28, 28, 1)
training_images = training_images / 255.0
test_images = test_images.reshape(10000, 28, 28, 1)
test_images = test_images / 255.0

and the TRT parameters are:

trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,          # frozen model
    outputs=your_outputs,
    max_batch_size=3,                      # specify your max batch size
    max_workspace_size_bytes=4 * (10**9),  # specify the max workspace
    precision_mode="FP16",
    # minimum_segment_size=1,
    # is_dynamic_op=True,
    # use_calibration=True,
)

I get this error :
Engine buffer is full. buffer limit=1, current entries=3, requested batch=10000
Failed to get engine batch, running native segment for TRTEngineOp_0

AFTER CONVERSION FROM TF TO TRT:
numb. of all_nodes in frozen graph: 31
numb. of trt_engine_nodes in TensorRT graph: 2
numb. of all_nodes in TensorRT graph: 17

I think it’s an issue with the batch size. What would be the parameters to select an appropriate batch size?
Thank You for your prompt replies

Thanks TMorris@NVIDIA,

I appreciate your response. I agree it would be helpful if you could add a message to TF stating that it generates unoptimized TRT graphs when it doesn’t have the full support to do so.

However, I would also humbly suggest that NVIDIA make the version of TF compiled with optimized TRT support (say, the one in the nvcr.io/nvidia/tensorflow:19.06-py3 Docker image) publicly available, if it is not already. In the TX2’s case NVIDIA already does make that available.

Also, with the limited testing I have done, there appears to be little point in running unoptimized TRT with TF. It runs about 10% - 20% slower than the original TF model it was derived from, so there is no reason to increase the execution time just to be able to run TRT. IMHO, that could be due to the NHWC->NCHW transformation, and not just the transpose node itself: I noticed that the later computations (say conv2D) were also being done in NCHW, which could possibly be slower than if done in NHWC.
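For reference, the NHWC->NCHW transformation discussed above is just an axis permutation. A pure-Python sketch on nested lists (toy sizes, hypothetical helper name, no TF required) makes the index mapping explicit:

```python
def nhwc_to_nchw(batch):
    """Permute a nested-list tensor from [N][H][W][C] to [N][C][H][W]."""
    n = len(batch)
    h = len(batch[0])
    w = len(batch[0][0])
    c = len(batch[0][0][0])
    return [[[[batch[i][y][x][ch] for x in range(w)]
              for y in range(h)]
             for ch in range(c)]
            for i in range(n)]

# Toy example: 1 image, 2x2 spatial, 3 channels.
x = [[[[1, 2, 3], [4, 5, 6]],
      [[7, 8, 9], [10, 11, 12]]]]
y = nhwc_to_nchw(x)
print(y[0][0])  # channel-0 plane: [[1, 4], [7, 10]]
```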

Thanks again for your input.

Best regards,

Q./

Hey @qiqbal01, the containers are already publicly available at https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow
“Unoptimized TRT” (using TF-TRT with a TF not built with TRT support) is not intended to be used. If your TF was compiled with TRT support and you are still not getting any TRTEngineOps, then there may be some problem with your model.

@kamatrohan13 Max batch size should be set to the highest batch size that you will use when you execute the model. From the log you sent, you tried to use a batch size of 10000 (requested batch=10000). I think that might’ve been a mistake since that is also the size of your dataset. You should break the dataset into smaller batches such as 128 items per batch. In that case, you would use 128 as the max_batch_size parameter.
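A minimal sketch of that batching advice in pure Python (the helper name is made up): split the 10000-item test set into chunks no larger than max_batch_size, and feed each chunk to the model separately:

```python
def iter_batches(items, max_batch_size=128):
    """Yield successive chunks of at most max_batch_size items."""
    for start in range(0, len(items), max_batch_size):
        yield items[start:start + max_batch_size]

data = list(range(10000))  # stand-in for the 10000 test images
batches = list(iter_batches(data, max_batch_size=128))
print(len(batches))      # 79 batches
print(len(batches[-1]))  # 16 items in the final partial batch
```

Every batch here has at most 128 items, so a TRT engine built with max_batch_size=128 can serve all of them; the final smaller batch is allowed, but a batch of 10000 would not be.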

Thank you @tmorris, just one thing. Comparing the inference time of the native TF model against the TRT-optimized model, I realized that although TRT is faster, the speedup varies from 10% to 47% across runs of the code. The numbers are generated by taking an average of 50 runs each time the code is run. Is that normal? Thank you.
Also, with INT8 calibration, my native model is about 3x faster than TRT.

2019-07-25 18:21:12.759781: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:192] Starting Calib Conversion
2019-07-25 18:21:13.349759: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:198] Construction of static int8 engine is not implemented yet!. Dynamic engine will be constructed
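On the run-to-run variance question above, a small framework-agnostic stdlib helper (hypothetical name, toy workload standing in for a session run) that reports both the mean and the spread over repeated timed runs can help distinguish noise from a real difference:

```python
import time
import statistics

def time_runs(fn, runs=50):
    """Time fn() repeatedly and return (mean, stdev) in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

# Toy workload standing in for an inference call.
mean, stdev = time_runs(lambda: sum(range(10000)), runs=50)
print('mean %.6fs, stdev %.6fs' % (mean, stdev))
```

If the standard deviation is a large fraction of the mean, a 10% - 47% swing between whole-script runs is unsurprising; GPU clocks, warm-up, and background load all contribute.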

@TMorris, right, the containers are publicly available; that is how I tested with them. Maybe I was not clear. What I meant was NVIDIA making TensorFlow compiled properly with optimized TRT support available as a Python package (not as a Docker image), because, as I mentioned earlier, using pip to install TensorFlow brings in the unoptimized version, at least for Python 3.5+ for me. And I would suspect many out there would just use pip to install TF and get the unoptimized support, as I did.

Hi @tmorris, I am using TF-TRT. My TensorFlow version is 1.14.1 and my TensorRT version is 5.1.5. I got a TensorRT-optimized SavedModel successfully, and I see the TRTEngineOp nodes when printing all nodes in the optimized graph.

However, there is no performance gain as I expected. Also, how can I visualize the optimized graph with TensorBoard?
