Conversion with no speed improvement, TRT-TF

minasyanvaagn · April 11, 2019, 1:33pm

My machine info:
Ubuntu 16.04
GeForce GTX 1080
Nvidia Driver Version: 410.78
CUDA Version: 10.0
cuDNN version: 7.5.0
TensorFlow version 1.13.0 from the official Nvidia Docker Hub image nvcr.io/nvidia/tensorflow:18.09-py3.
TensorRT version: from the TensorFlow package (5.0.2)
Problem description:
I have managed to convert my model to TRT-TF using the create_inference_graph method from the tensorflow.contrib.tensorrt module. However, I do not get a statistically significant increase in speed, if any, whichever precision (FP16 or FP32) I use, and I am trying to work out what the reasons could be. (The console output for the optimization is below.)

Questions:

Could it be because the tensor shapes are unknown, so the optimization cannot be applied?
Is it OK that my converted .pb file is exactly twice as large as my original frozen, non-optimized .pb file?
Can the small number of supported TensorRT operations be the reason?
I am going to try the conversion on an Nvidia RTX 2080 Ti; should I expect any benefit compared with my current Nvidia GTX 1080?

Nvidia GTX 1080 console output for the optimization:

INFO:tensorflow:Running against TensorRT version 5.0.2
2019-04-11 13:01:33.811488: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-11 13:01:33.811947: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 1
2019-04-11 13:01:33.812059: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-04-11 13:01:33.839792: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 4200000000 Hz
2019-04-11 13:01:33.841161: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x1e5050c0 executing computations on platform Host. Devices:
2019-04-11 13:01:33.841243: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): <undefined>, <undefined>
2019-04-11 13:01:33.845844: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x1e52c8e0 executing computations on platform CUDA. Devices:
2019-04-11 13:01:33.845926: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2019-04-11 13:01:33.846698: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 7.21GiB
2019-04-11 13:01:33.846780: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-11 13:01:34.155614: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-11 13:01:34.155652: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-04-11 13:01:34.155661: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-04-11 13:01:34.155794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6926 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-04-11 13:01:34.772860: I tensorflow/contrib/tensorrt/segment/segment.cc:461] There are 7 ops of 4 different types in the graph that are not converted to TensorRT: ConcatV2, Transpose, Placeholder, NoOp, (For more information see https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#support-ops).
2019-04-11 13:01:34.789697: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:928] Number of TensorRT candidate segments: 2
2019-04-11 13:01:34.797911: W tensorflow/contrib/tensorrt/convert/convert_nodes.cc:3728] Validation failed for TensorRTInputPH_0 and input slot 0: Input tensor with shape [?,?,?,3] has an unknown non-batch dimension at dim 1
2019-04-11 13:01:34.797949: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:1036] TensorRT node TRTEngineOp_0 added for segment 0 consisting of 4 nodes failed: Invalid argument: Validation failed for TensorRTInputPH_0 and input slot 0: Input tensor with shape [?,?,?,3] has an unknown non-batch dimension at dim 1. Fallback to TF...
2019-04-11 13:01:34.798309: W tensorflow/contrib/tensorrt/convert/convert_nodes.cc:3728] Validation failed for TensorRTInputPH_0 and input slot 0: Input tensor with shape [?,3,?,?] has an unknown non-batch dimension at dim 2
2019-04-11 13:01:34.798325: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:1036] TensorRT node TRTEngineOp_1 added for segment 1 consisting of 461 nodes failed: Invalid argument: Validation failed for TensorRTInputPH_0 and input slot 0: Input tensor with shape [?,3,?,?] has an unknown non-batch dimension at dim 2. Fallback to TF...
2019-04-11 13:01:34.810444: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] Optimization results for grappler item: tf_graph
2019-04-11 13:01:34.810478: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583]   constant folding: Graph size after: 468 nodes (0), 478 edges (0), time = 61.211ms.
2019-04-11 13:01:34.810485: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583]   layout: Graph size after: 478 nodes (10), 484 edges (6), time = 25.671ms.
2019-04-11 13:01:34.810490: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583]   constant folding: Graph size after: 473 nodes (-5), 484 edges (0), time = 40.989ms.
2019-04-11 13:01:34.810495: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583]   TensorRTOptimizer: Graph size after: 473 nodes (0), 484 edges (0), time = 261.698ms.
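
The two "Fallback to TF..." warnings in this output are the lines worth looking for: they say that a candidate segment was rejected and left as plain TensorFlow ops. A minimal sketch for pulling them out of a saved log follows, assuming the warning keeps the wording shown above; the function name find_fallbacks and the sample text are illustrative, not part of TensorFlow.

```python
import re

# Matches the warning printed when a candidate segment is rejected.
# Assumes the message wording seen in the log above; other versions may differ.
FALLBACK_RE = re.compile(
    r"TensorRT node (TRTEngineOp_\d+) added for segment (\d+) "
    r"consisting of (\d+) nodes failed: (.*?)\. Fallback to TF"
)


def find_fallbacks(log_text):
    """Return (engine_name, segment_index, node_count, reason) per failed segment."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)), m.group(4))
        for m in FALLBACK_RE.finditer(log_text)
    ]


# Shortened copy of one warning from the log above, used as sample input.
sample = (
    "W tensorflow/contrib/tensorrt/convert/convert_graph.cc:1036] "
    "TensorRT node TRTEngineOp_1 added for segment 1 consisting of 461 nodes "
    "failed: Invalid argument: Validation failed for TensorRTInputPH_0 and "
    "input slot 0: Input tensor with shape [?,3,?,?] has an unknown non-batch "
    "dimension at dim 2. Fallback to TF..."
)

for name, segment, nodes, reason in find_fallbacks(sample):
    print(name, segment, nodes, reason)
```

If every candidate segment shows up in this list, no TensorRT engine was built and the converted graph runs the same TensorFlow ops as before.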

minasyanvaagn · April 23, 2019, 3:39pm

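The problem was the undetermined (dynamic) shape of the input placeholder, e.g. [?,?,?,3]. With that shape, both candidate segments failed validation and fell back to TF, as the warnings in the console output above show. Changing the placeholder shape to [?,W,H,C] solved the problem.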
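
For anyone who wants to check a model before converting it, the rule from the warnings can be sketched in a few lines of plain Python: dimension 0 (batch) may be unknown, every other dimension must be known. The helper name below is hypothetical and the 224 values are placeholders for illustration only; in TF 1.x the shape list could be taken from something like placeholder.shape.as_list(), where unknown dimensions come back as None.

```python
def first_unknown_non_batch_dim(shape):
    """Return the index of the first unknown non-batch dimension, or None.

    `shape` is a list in which None marks an unknown dimension.
    Index 0 is treated as the batch dimension and is allowed to be unknown.
    """
    for index, dim in enumerate(shape[1:], start=1):
        if dim is None:
            return index
    return None


# Shapes from the log above, written with None in place of "?".
print(first_unknown_non_batch_dim([None, None, None, 3]))
print(first_unknown_non_batch_dim([None, 3, None, None]))

# A fixed-size input (224 is an arbitrary example value) passes the check.
print(first_unknown_non_batch_dim([None, 224, 224, 3]))
```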
