I’m running TensorRT on a GTX 1070 and a Jetson TX2. I have a couple of questions, mainly about the 1070:
(1) Using TensorFlow / TRT via Python on the 1070 does not seem to produce optimized graphs containing the ‘TRTEngineOP’ node that I see in the corresponding graphs generated on the TX2. On the 1070, a graph generated by ‘trt.create_inference_graph’ was almost identical to the TF frozen graph it was generated from, except for a few ‘TransposeNHWCToNCHW’ nodes thrown in. This was on Python 3.7 (from Anaconda, if that matters). However, I have another TF installation for Python 2.7 on the same machine that does generate the optimized graph with the ‘TRTEngineOP’ node.
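For reference, this is roughly how I generate the TRT graph on both machines (a minimal sketch using the TF 1.13 contrib API; the frozen-graph path and the output node name ‘logits’ are placeholders for my actual ResNet V2 50 setup):

import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

# Load the frozen ResNet V2 50 graph (path is a placeholder).
with tf.gfile.GFile('frozen_model.pb', 'rb') as f:
    frozen_graph = tf.GraphDef()
    frozen_graph.ParseFromString(f.read())

# Convert; I have tried 'FP32', 'FP16', and 'INT8' precision modes.
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=['logits'],               # placeholder output node name
    max_batch_size=1,
    max_workspace_size_bytes=1 << 30,
    precision_mode='FP32')

with tf.gfile.GFile('trt_model.pb', 'wb') as f:
    f.write(trt_graph.SerializeToString())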
I noticed that internally the TF framework loads a library ‘_trt_engine_op.so’. On the Python 3.7 TF installation I was using, this library was not linked against libnvinfer.so.5 on my system, whereas on the Python 2.7 TF installation on the same computer it was. On the TX2 it was also linked.
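In case it helps reproduce the check, this is roughly how I inspected the linkage (a sketch; it assumes ldd is available on the PATH and just walks the installed TF package looking for the library):

import os
import subprocess
import tensorflow as tf

# Find _trt_engine_op.so inside the installed TF package and run ldd on it
# to see whether libnvinfer shows up among its dependencies.
tf_dir = os.path.dirname(tf.__file__)
for root, _, files in os.walk(tf_dir):
    for name in files:
        if name == '_trt_engine_op.so':
            path = os.path.join(root, name)
            print(path)
            print(subprocess.check_output(['ldd', path]).decode())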
Also, the following code reports TRT availability as False on Python 3.7 and True on Python 2.7 on the same computer with the 1070:
import tensorflow.contrib.tensorrt as trt
from tensorflow.contrib.tensorrt.wrap_conversion import is_tensorrt_enabled
print(is_tensorrt_enabled())
which is consistent with the fact that _trt_engine_op.so is not linked against libnvinfer on Py 3.7 but is on Py 2.7, in case that is really what makes the difference.
However, the unsettling thing for me is that there was no indication from the TF or TRT frameworks about this difference. There is a 1.2x - 2x speedup between the graph produced with the TRTEngineOP node and the one without (which, as mentioned, is almost the same as the TF frozen graph except for the transpose nodes).
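For those speedup numbers, I time both graphs with a simple loop like the one below (a sketch; the input/output tensor names are placeholders for my ResNet V2 50 graph):

import time
import numpy as np
import tensorflow as tf

def benchmark(graph_def, runs=100):
    # Import the graph, run one warm-up pass, then average the per-run time.
    with tf.Graph().as_default() as g:
        tf.import_graph_def(graph_def, name='')
        inp = g.get_tensor_by_name('input:0')    # placeholder name
        out = g.get_tensor_by_name('logits:0')   # placeholder name
        data = np.random.random_sample([1, 224, 224, 3]).astype(np.float32)
        with tf.Session(graph=g) as sess:
            sess.run(out, feed_dict={inp: data})  # warm-up
            start = time.perf_counter()
            for _ in range(runs):
                sess.run(out, feed_dict={inp: data})
            return (time.perf_counter() - start) / runs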
I installed TensorFlow (with GPU) for both Py 3.7 and Py 2.7 using pip (IIRC). Py 3.7 was from Anaconda, if that matters; Py 2.7 is the one that came with Ubuntu 18.04.
Please advise on how to get a TensorFlow build for Py 3.7 that produces the optimized graph (presumably with the TRTEngineOP node).
(2) I read on these forums that FP16 is not optimized on the 1070 (https://devtalk.nvidia.com/default/topic/1023708/gpu-accelerated-libraries/fp16-support-on-gtx-1060-and-1080/post/5208194/#5208194) and that INT8 is not optimized on the TX2. On the TX2 I do see that INT8 is slower than FP32 and FP16, which is consistent with the advice on that thread. However, on the 1070 I also see INT8 being slower, rather than FP16. In my limited experiments on the 1070, FP32 and FP16 are about 120x - 150x faster than INT8, which is not consistent with the NVIDIA quote in the link. Please advise.
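For completeness, this is roughly how I produce the INT8 graph with the TF 1.13 contrib API (a sketch; tensor names are placeholders, and random data stands in here only to illustrate the calibration step):

import numpy as np
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

# Load the frozen ResNet V2 50 graph (path is a placeholder).
with tf.gfile.GFile('frozen_model.pb', 'rb') as f:
    frozen_graph = tf.GraphDef()
    frozen_graph.ParseFromString(f.read())

# Step 1: build a calibration graph.
calib_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=['logits'],               # placeholder output node name
    max_batch_size=1,
    max_workspace_size_bytes=1 << 30,
    precision_mode='INT8')

# Step 2: run representative batches through the calibration graph.
with tf.Graph().as_default() as g:
    tf.import_graph_def(calib_graph, name='')
    inp = g.get_tensor_by_name('input:0')     # placeholder name
    out = g.get_tensor_by_name('logits:0')    # placeholder name
    with tf.Session(graph=g) as sess:
        for _ in range(50):
            data = np.random.random_sample([1, 224, 224, 3]).astype(np.float32)
            sess.run(out, feed_dict={inp: data})

# Step 3: convert the calibrated graph into the final inference graph.
int8_graph = trt.calib_graph_to_infer_graph(calib_graph)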
In both (1) and (2), ResNet V2 50 was used for the experiments on the 1070 (and also on the TX2).
Here is the system configuration:
1070: Ubuntu 18.04 4.18.0-21-generic, Python 3.7.1, C++ 7.4.0, NVCC V10.0.166, TRT 5.1.5, TF 1.13.1, CUDA 10.0, CUDNN 7.5.0, NV DRV 418.43
TX2: Ubuntu 18.04 4.9.140-tegra, Python 3.6.7, C++ 7.4.0, NVCC V10.0.166, TRT 5.0.6, TF 1.13.1, CUDA 10.0, CUDNN 7.3.1, NV DRV Jetpack 4.2
Any help will be much appreciated. Thanks a lot.