TF-TRT speed-up not reproducible on a custom-trained SSD Inception model from the TF model zoo.

Hi,

I have an observation on TF-TRT support for TensorFlow models. I can reproduce the 1.5x-2x TensorRT speed-up on TF model-zoo models, as well as on custom retrained ssd_inception_v2 models at the 300x300 resolution used in the model zoo, but I cannot reproduce that speed-up for ssd_inception_v2 models retrained at a higher resolution of 1920x1080. The TF frozen graph and the TensorRT-optimized graph run at roughly the same FPS.

Are there any aspects of the high-resolution model that might affect kernel fusion or the creation of TRTEngineOps for TF subgraphs (the weight matrices are much larger than in the 300x300 model)? Either way, I would still expect some amount of speed-up, since I see the same TRTEngineOps being created as in the 300x300 conversion. Given that the two models share the same architecture, differ only in training resolution, and end up with the same TRTEngineOps after conversion, what could be the cause of this discrepancy?

I use FP16 precision. An example of the logs printed during TRT graph conversion (identical for both 300x300 and 1920x1080):

[tensorflow/contrib/tensorrt/convert/convert_graph.cc:913] Number of TensorRT candidate segments: 4
[tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node TRTEngineOp_0 added for segment 0 consisting of 945 nodes succeeded.
[tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node Postprocessor/TRTEngineOp_1 added for segment 1 consisting of 3 nodes succeeded.
[tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node TRTEngineOp_2 added for segment 2 consisting of 5 nodes succeeded.
[tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node TRTEngineOp_3 added for segment 3 consisting of 4 nodes succeeded.
[tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:265] Returning from TensorRTOptimizer
[tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] Optimization results for grappler item: tf_graph
[tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] constant folding: Graph size after: 2014 nodes (-1393), 2809 edges (-1542), time = 331.582ms.
[tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] layout: Graph size after: 2051 nodes (37), 2847 edges (38), time = 99.096ms.
[tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] constant folding: Graph size after: 2041 nodes (-10), 2847 edges (0), time = 173.335ms.
[tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] TensorRTOptimizer: Graph size after: 1088 nodes (-953), 1702 edges (-1145), time = 483.096ms.

Hello,

We are triaging this and will keep you updated.

Thank you. For reference, I am using the same approach as described here: https://github.com/tensorflow/tensorrt/tree/master/tftrt/examples/object_detection

Is the 1920x1080 input size propagated throughout the network?
In other words, do you know whether the input sizes of the TRTEngineOp_i nodes differ from the lower-resolution case?
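One quick way to check (assuming the model was trained with the TF Object Detection API, as in the model zoo) is the `image_resizer` block in the model's pipeline.config: with a `fixed_shape_resizer`, that resolution is what actually enters the graph, regardless of the source imagery. A sketch of what the high-resolution configuration would look like:

```
model {
  ssd {
    image_resizer {
      fixed_shape_resizer {
        height: 1080
        width: 1920
      }
    }
    # ... rest of the ssd config ...
  }
}
```

If this block still says 300x300, the network is not actually running at 1920x1080 internally.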

You can try verbose logging to see if any useful information appears: https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#verbose
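Concretely, verbose conversion logging is enabled through an environment variable before running the conversion script. The module names below follow the linked TF-TRT user guide; the exact list may vary by TF version, and the script name is a placeholder:

```shell
# Turn on verbose logging for the TF-TRT conversion passes
# (module names per the TF-TRT user guide; adjust for your TF version).
export TF_CPP_VMODULE=segment=2,convert_graph=2,convert_nodes=2,trt_engine_op=2

# Then run the conversion/inference as usual, e.g.:
#   python my_conversion_script.py   # hypothetical script name
```

The resulting logs show, among other things, the input shapes each TRTEngineOp is built with, which should answer the question above.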

Could you post the performance numbers you get from TF and TF-TRT in both the lower- and higher-resolution tests?
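For comparable numbers, it helps to measure both graphs the same way: discard warm-up iterations (the first TRT runs can include engine build time) and average over many runs. A minimal, framework-agnostic sketch, where `infer` stands in for whatever executes one forward pass (e.g. a `sess.run` on the frozen or TRT-optimized graph):

```python
import time

def benchmark_fps(infer, num_warmup=20, num_iters=100):
    """Time an inference callable and return frames per second.

    `infer` is a placeholder for one forward pass, e.g. a lambda
    wrapping sess.run(...) on either graph.
    """
    for _ in range(num_warmup):
        infer()                     # warm-up: engine build, autotuning, caches
    start = time.time()
    for _ in range(num_iters):
        infer()
    elapsed = time.time() - start
    return num_iters / elapsed

# Usage with a stand-in inference function:
fps = benchmark_fps(lambda: time.sleep(0.001))
print("%.1f FPS" % fps)
```

Running this once per graph and resolution gives the four numbers asked for above.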

When MatMul sizes get very large, it's possible that TensorRT optimizations such as fusion become less effective, because native TF, which uses cuDNN for such sizes, is already fast too.