TF-TRT speed-up not reproducible on custom-trained SSD Inception model from TF model zoo.

Hi,

I have an observation about TF-TRT support for TensorFlow models. I can reproduce the 1.5x-2x TensorRT speed-up on TF model-zoo models, as well as on custom retrained ssd_inception_v2 models at 300x300 resolution (the resolution used in the model zoo), but I cannot reproduce the same 1.5x-2x speed-up for ssd_inception_v2 models retrained at the higher 1920x1080 resolution. The TF frozen graph and the TensorRT-optimized graph run at roughly the same FPS.

Are there any aspects that might affect kernel fusion or the creation of TRTEngineOps for TF subgraphs in the high-resolution model (since the weight matrices are much larger than in the 300x300 models)? Even so, I would expect some amount of speed-up either way, as I do see the same TRTEngineOps being created as in the 300x300 conversion. Given that both models share the same architecture, were trained on imagery of different resolutions, and end up with the same converted TRTEngineOps, what could be the cause of this discrepancy?
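
A quick way to compare the two conversions is to count the TRTEngineOp nodes directly in each converted GraphDef. A minimal sketch, assuming TF 1.x; the file name is illustrative:

import tensorflow as tf

# Illustrative path; point this at each converted frozen graph in turn.
GRAPH_PATH = 'ssd_inception_v2_trt_fp16.pb'

graph_def = tf.GraphDef()
with tf.gfile.GFile(GRAPH_PATH, 'rb') as f:
    graph_def.ParseFromString(f.read())

# List the subgraphs that were replaced by TensorRT engines.
trt_ops = [n for n in graph_def.node if n.op == 'TRTEngineOp']
print('TRTEngineOp nodes: %d' % len(trt_ops))
for n in trt_ops:
    print('  %s (inputs: %d)' % (n.name, len(n.input)))

On both the 300x300 and the 1920x1080 graphs, this should list the same engines that the conversion log below reports.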

I use FP16 precision. An example of the logs printed during TRT graph conversion (the same for both the 300x300 and 1920x1080 models) is:

[tensorflow/contrib/tensorrt/convert/convert_graph.cc:913] Number of TensorRT candidate segments: 4
[tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node TRTEngineOp_0 added for segment 0 consisting of 945 nodes succeeded.
[tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node Postprocessor/TRTEngineOp_1 added for segment 1 consisting of 3 nodes succeeded.
[tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node TRTEngineOp_2 added for segment 2 consisting of 5 nodes succeeded.
[tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node TRTEngineOp_3 added for segment 3 consisting of 4 nodes succeeded.
[tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:265] Returning from TensorRTOptimizer
[tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] Optimization results for grappler item: tf_graph
[tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] constant folding: Graph size after: 2014 nodes (-1393), 2809 edges (-1542), time = 331.582ms.
[tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] layout: Graph size after: 2051 nodes (37), 2847 edges (38), time = 99.096ms.
[tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] constant folding: Graph size after: 2041 nodes (-10), 2847 edges (0), time = 173.335ms.
[tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] TensorRTOptimizer: Graph size after: 1088 nodes (-953), 1702 edges (-1145), time = 483.096ms.
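
For completeness, logs like the above come from a conversion through the TF 1.x contrib API, roughly along the following lines (a minimal sketch; the graph path, output node names, and workspace size are illustrative):

import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

# Load the frozen TF graph (illustrative path).
frozen_graph = tf.GraphDef()
with tf.gfile.GFile('frozen_inference_graph.pb', 'rb') as f:
    frozen_graph.ParseFromString(f.read())

# Replace TensorRT-compatible subgraphs with TRTEngineOp nodes.
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=['detection_boxes', 'detection_scores',
             'detection_classes', 'num_detections'],
    max_batch_size=1,
    max_workspace_size_bytes=1 << 30,   # ~1 GB TRT workspace
    precision_mode='FP16',
    minimum_segment_size=3)

with tf.gfile.GFile('frozen_inference_graph_trt_fp16.pb', 'wb') as f:
    f.write(trt_graph.SerializeToString())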

Hi, I was wondering if there is any update or ongoing investigation on this?

In general, TRT optimizations (like kernel fusion) are good at eliminating latency bottlenecks. In the limit of large input sizes, the latencies between kernels become negligible relative to the kernels themselves, so the native TF implementation is already able to saturate the GPU efficiently, leaving little room for improvement.
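
As a rough illustration with made-up numbers (actual launch overheads and kernel times depend on the GPU, driver, and model):

# Hypothetical figures, for illustration only.
launch_overhead_us = 10.0     # fixed CPU-side cost to launch one kernel

for name, kernel_us in [('300x300 (short kernel)', 50.0),
                        ('1920x1080 (long kernel)', 1000.0)]:
    overhead = launch_overhead_us / (launch_overhead_us + kernel_us)
    print('%s: launch overhead ~%.0f%% of per-op time' % (name, 100 * overhead))

# With the short kernel, launches eat roughly a sixth of the runtime, so fusing
# ops away with TRT helps a lot; with the long kernel they are ~1%, so the
# unfused TF graph already keeps the GPU busy almost the whole time.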

Hi, can you provide more details on this reasoning? By the TF implementation saturating the GPU, do you mean in terms of VRAM or CUDA cores?

GPU launch latencies are a common cause of poor GPU efficiency in small workloads. The time to launch each GPU operation is roughly constant regardless of how long the op executes on the GPU. For very short kernels, a large percentage of the application's runtime is therefore taken up by launches. Increasing the execution time per kernel (by increasing image size or batch size, for example) reduces the fraction of runtime spent on launches and leads to greater efficiency.
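
One way to see this empirically (a minimal sketch, assuming a TF 1.x frozen graph; the graph path is illustrative and the tensor names follow the TF object detection API convention) is to time the graph end to end and dump per-op timings with RunMetadata, where launch overhead shows up as gaps between short kernels:

import time
import numpy as np
import tensorflow as tf
from tensorflow.python.client import timeline

# Illustrative names; replace with the real graph path and tensor names.
GRAPH_PATH = 'frozen_inference_graph_trt_fp16.pb'
INPUT_TENSOR = 'image_tensor:0'
OUTPUT_TENSORS = ['detection_boxes:0', 'detection_scores:0', 'num_detections:0']

graph_def = tf.GraphDef()
with tf.gfile.GFile(GRAPH_PATH, 'rb') as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')

image = np.random.randint(0, 255, size=(1, 1080, 1920, 3), dtype=np.uint8)

with tf.Session(graph=graph) as sess:
    # Warm-up run (includes TRT engine build / autotuning).
    sess.run(OUTPUT_TENSORS, {INPUT_TENSOR: image})

    start = time.time()
    for _ in range(50):
        sess.run(OUTPUT_TENSORS, {INPUT_TENSOR: image})
    print('avg latency: %.1f ms' % ((time.time() - start) / 50 * 1e3))

    # Per-op trace: open timeline_1080p.json in chrome://tracing and compare
    # kernel durations against the gaps between them.
    opts = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    meta = tf.RunMetadata()
    sess.run(OUTPUT_TENSORS, {INPUT_TENSOR: image}, options=opts, run_metadata=meta)
    with open('timeline_1080p.json', 'w') as f:
        f.write(timeline.Timeline(meta.step_stats).generate_chrome_trace_format())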